Introduction

Road accidents are never a happy issue to discuss. They not only have severe consequences for those involved, they also affect the lives of many others, such as friends and family. With more vehicles on the road than ever before, it is important to understand accidents in greater detail, and possibly ‘predict’ their locations and consequences. Government agencies in the UK have been collecting data about reported accidents since 2005. The data includes generic and specific details about the vehicles, the drivers, the number of passengers and the number of casualties.

With data available since 2005, one could develop a model to predict accidents. Because the database only records reported accidents, we know for sure that these accidents have ‘happened’. We use this data to predict the location of an accident, in terms of latitude and longitude, and the expected number of casualties.

Research goals

The purpose of this research is to explore the UK road accident data for patterns in where and when accidents occur, and to build predictive models for accident hot spots (latitude and longitude) and for the number of casualties per accident.

Assumptions, getting and cleaning data

The data sets are quite large. The ‘Accidents0514’ file has over 1.6 million rows and 32 columns, and it is the smallest of the three files. As a simplification, we limit our analysis to accidents from the year 2014 alone.

We download the data from the source and perform a set of pre-processing operations to prepare it for exploratory analysis and predictive modelling.

After downloading, all the data sets were unzipped and read in.
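The report does not reproduce its code in full, so the following is only a minimal sketch of the import step. The URL, archive names and CSV file names are illustrative assumptions (it is also not shown whether the combined 2005–2014 files or the 2014 extracts were read); only the `casu14` object name appears later in the report.

```r
# Sketch of the download / unzip / read-in step (file names and URL are assumptions)
base  <- "http://data.dft.gov.uk/road-accidents-safety-data"   # hypothetical location
files <- c("DfTRoadSafety_Accidents_2014.zip",
           "DfTRoadSafety_Vehicles_2014.zip",
           "DfTRoadSafety_Casualties_2014.zip")
for (f in files) {
  if (!file.exists(f)) download.file(file.path(base, f), destfile = f)
  unzip(f)
}

acc14  <- read.csv("DfTRoadSafety_Accidents_2014.csv",  stringsAsFactors = FALSE)
veh14  <- read.csv("DfTRoadSafety_Vehicles_2014.csv",   stringsAsFactors = FALSE)
casu14 <- read.csv("DfTRoadSafety_Casualties_2014.csv", stringsAsFactors = FALSE)

dim(acc14); dim(veh14); dim(casu14)
```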

## [1] 146322     32
## [1] 268527     22
## [1] 194477     15

Let's look at a snapshot of the accidents in the UK from 2005 to 2014. Given the scale of the data, one might expect the plots to be ‘crowded’.

## [1] 0

We see clear spikes around the longitude of 0 and latitude of 51.5. Not surprisingly, these are the coordinates of London. We also see spikes around the Manchester and Birmingham areas.

A subtler trend is that the number of accidents in the London area appears to have been increasing since 2005, as shown below. It may also be noted that the number of accidents elsewhere has remained the same or decreased.

These are the overall trends. The focus of all subsequent steps is the year 2014. Once we subset the 2014 data, it is important to collate the three data frames into one that can be explored for patterns and analyzed.

Before we do this, it is important to understand the problem at hand. The objective is to predict the number of casualties and potential hot spots for accidents. Looking at the features of the above three data frames, we can observe the following:

We ASSUME that, for the purpose of predicting the number of casualties, the specific details of individual casualties do not contribute to the prediction. So we ignore the casu14 data in all subsequent analysis.

Now we have the immediate task of combining the two data frames into one in a meaningful manner. Since both hold data about accidents in a particular year and share the ‘Accident_Index’ key, we can combine them using a SQL-style join.

We also assume that if ‘y’ vehicles are involved in an accident (irrespective of whose fault it is) resulting in ‘x’ casualties, the casualty count attributed to each vehicle is ‘x/y’. This is a fair assumption given that the task is to predict the number of casualties in an accident, not who caused it. In other words, all vehicles involved in an accident are assumed to contribute equally to the casualty count.
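A minimal sketch of this merge-and-split step is below. The key and count column names (Accident_Index, Number_of_Casualties, Number_of_Vehicles) follow the standard Stats19 layout; the report's own code may differ, for instance by using the sqldf package for the SQL join rather than merge().

```r
# Join the 2014 vehicle and accident tables on the shared key (inner join),
# then attribute casualties equally to each vehicle involved.
acc_veh <- merge(veh14, acc14, by = "Accident_Index")

# x casualties spread over y vehicles gives x / y per vehicle row
acc_veh$CasualtiesPerAccident <-
  acc_veh$Number_of_Casualties / acc_veh$Number_of_Vehicles
```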

In this data, a few variables are duplicates or, more precisely, surrogates of other variables. These need to be removed. These variables are:

There are also variables that are unlikely to contribute to the accident or to the number of casualties, such as:

So we remove all these variables. Additionally, we perform some ‘housekeeping’ operations, such as converting the time of the accident, the day of the week, the week of the month and the month of the year into categorical variables.

From the data, it may be inferred that -1 is the NA marker for this data set. As part of the ‘housekeeping’ operations, we handle these NAs. To limit data loss, we first check which columns have more than 10% NAs and remove those columns. We then proceed to remove rows that have any remaining NAs.
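A sketch of this NA handling is below, assuming the combined data frame is called acc_veh; the 10% threshold is as described above.

```r
# Recode the -1 placeholder as NA across the (still integer-coded) columns
acc_veh[acc_veh == -1] <- NA

# Drop columns with more than 10% missing values, then drop incomplete rows
na_frac <- colMeans(is.na(acc_veh))
acc_veh <- acc_veh[, na_frac <= 0.10]
acc_veh <- acc_veh[complete.cases(acc_veh), ]

sum(is.na(acc_veh))   # should now be 0
dim(acc_veh)
str(acc_veh)
```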

## [1] 0
## [1] 266872     38
## 'data.frame':    266872 obs. of  38 variables:
##  $ Vehicle_Type                           : int  8 19 9 1 9 3 9 9 3 9 ...
##  $ Towing_and_Articulation                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Vehicle_Manoeuvre                      : int  18 15 2 14 9 13 4 6 18 2 ...
##  $ Vehicle_Location.Restricted_Lane       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Junction_Location                      : int  0 0 1 1 5 8 1 8 8 0 ...
##  $ Skidding_and_Overturning               : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Hit_Object_in_Carriageway              : int  0 0 0 4 0 0 0 0 0 0 ...
##  $ Vehicle_Leaving_Carriageway            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Hit_Object_off_Carriageway             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ X1st_Point_of_Impact                   : int  4 1 3 4 3 1 1 3 1 3 ...
##  $ Was_Vehicle_Left_Hand_Drive.           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Journey_Purpose_of_Driver              : int  1 6 6 6 6 6 6 6 6 6 ...
##  $ Sex_of_Driver                          : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ Accident_Severity                      : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ X1st_Road_Class                        : int  3 3 3 3 3 3 5 3 3 3 ...
##  $ Road_Type                              : int  6 6 6 6 6 6 6 6 6 2 ...
##  $ Speed_limit                            : int  30 30 30 30 30 30 30 30 30 30 ...
##  $ Junction_Detail                        : int  0 0 5 5 3 3 3 7 7 0 ...
##  $ Pedestrian_Crossing.Human_Control      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pedestrian_Crossing.Physical_Facilities: int  0 0 5 5 0 0 1 8 8 0 ...
##  $ Light_Conditions                       : int  1 1 7 7 1 1 4 1 1 1 ...
##  $ Weather_Conditions                     : int  2 2 1 1 1 1 1 1 1 1 ...
##  $ Road_Surface_Conditions                : int  2 2 1 1 1 1 1 1 1 1 ...
##  $ Special_Conditions_at_Site             : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Carriageway_Hazards                    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Urban_or_Rural_Area                    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Police_Force                           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Number_of_Vehicles                     : int  2 2 2 2 2 2 1 2 2 3 ...
##  $ Local_Authority_.District.             : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ X1st_Road_Number                       : int  315 315 3218 3218 308 308 0 4 4 3220 ...
##  $ X2nd_Road_Number                       : int  0 0 3220 3220 0 0 0 4 4 0 ...
##  $ Time                                   : Factor w/ 7 levels "Evening","Evening-Rush",..: 7 7 3 3 4 4 2 5 5 7 ...
##  $ Month                                  : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WeekOfMonth                            : Factor w/ 5 levels "1","2","3","4",..: 2 2 3 3 3 3 3 2 2 3 ...
##  $ Day                                    : Factor w/ 7 levels "Friday","Monday",..: 5 5 2 2 6 6 7 5 5 1 ...
##  $ CasualtiesPerAccident                  : num  0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.333 ...
##  $ Longitude                              : num  -0.206 -0.206 -0.19 -0.19 -0.174 ...
##  $ Latitude                               : num  51.5 51.5 51.5 51.5 51.5 ...

Exploratory Analysis

Let's begin with a few high-level summaries of the casualty data.
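The two output blocks below look like a frequency table and a five-number summary of the per-vehicle casualty figure. A sketch that would produce summaries of this shape (object and column names are assumed from the str() listing above):

```r
# Frequency table and summary of casualties per accident (per vehicle row)
table(round(acc_veh$CasualtiesPerAccident, 3))
summary(acc_veh$CasualtiesPerAccident)
```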

## 
##  0.053  0.111  0.125  0.143  0.167    0.2  0.222  0.238   0.25  0.273 
##     19      9     96    119    354   1039      9     21   4423     11 
##  0.286    0.3  0.308  0.333  0.375    0.4  0.429  0.444    0.5  0.556 
##     98     20     13  18907     48    675     91      9 138034      9 
##  0.571    0.6  0.667    0.7  0.714   0.75    0.8  0.833  0.857  0.889 
##     14    465   8985     10     28   1484    225     66      7      9 
##      1    1.2   1.25    1.3  1.333  1.375    1.4  1.429    1.5    1.6 
##  72331     50    276     10   1350      8     20      7   7904     30 
##  1.667   1.75    1.8      2    2.2   2.25  2.333    2.5  2.667   2.75 
##    593     72     10   6292      5     28     72    930     21      4 
##      3    3.2   3.25  3.333    3.4    3.5  3.667    3.8      4    4.5 
##    985      5      4      9      5     84      3      5    277     22 
##  4.667      5    5.5      6    6.5      7    7.5      8    8.5      9 
##      3    102      8     17      2      5      2      5      2      3 
##     10   10.5   20.5     27   43.5   46.5     54 
##      2      4      2      1      2      2      1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0530  0.5000  0.5000  0.7246  1.0000 54.0000

The summary shows that the 75th percentile is 1, meaning that for roughly three-quarters of the records the casualty count per accident is at most 1.

Let's look at the locations of these accidents to identify any areas with a high concentration of accidents.
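The maps themselves are not reproduced here; the following is a sketch of one way to draw such a density map with ggplot2 (the aesthetic choices are ours, not necessarily those used in the report).

```r
library(ggplot2)

# 2-D density of accident locations; warmer regions indicate hot spots
ggplot(acc_veh, aes(x = Longitude, y = Latitude)) +
  stat_density2d(aes(fill = ..level..), geom = "polygon", alpha = 0.6) +
  scale_fill_gradient(low = "yellow", high = "red") +
  coord_fixed() +
  labs(title = "Accident density, UK 2014", fill = "Density")
```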

This shows striking clusters of accident hot spots. It can easily be corroborated that these hot spots correspond to major cities. The hottest accident zone appears to be the London area, followed by the Birmingham, Manchester, Sheffield and Leeds, and Newcastle areas. It will be interesting to see whether the predictive models give these places as the predicted hot spots.

The density plots also reveal that the westernmost and northernmost parts of the UK do not seem to have many accidents compared with the rest of the UK, especially the south-east.

Let's explore the data in greater detail. To make the exploratory plots meaningful, let's consider the following factors, which are likely to contribute to an accident and hence to the casualties per accident:

It has to be kept in mind that the meanings of these coded categories are not given in the data set, and no attempt is made to infer what the categories could be; they are taken at face value.

Let's begin by looking at the ‘Day’ of the week.
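The three numeric blocks that follow each factor below appear to be the mean, the median and the standard deviation of casualties per accident by level. A sketch of that summary for the day of the week (the same pattern applies to the other factors considered later):

```r
# Mean, median and standard deviation of casualties per accident by day of week
round(tapply(acc_veh$CasualtiesPerAccident, acc_veh$Day, mean),   2)
round(tapply(acc_veh$CasualtiesPerAccident, acc_veh$Day, median), 2)
round(tapply(acc_veh$CasualtiesPerAccident, acc_veh$Day, sd),     2)
```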

##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##      0.71      0.71      0.78      0.80      0.70      0.70      0.70
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##       0.5       0.5       0.5       0.5       0.5       0.5       0.5
##    Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
##      0.42      0.54      0.48      0.52      0.41      0.58      0.40

There is some effect of the day of the week on the number of casualties. Part of that trend is due to outliers, which obscure the actual scale of the differences among the ‘not-so-extreme’ values; this can be inferred from the differences in the mean values. So the day of the week likely has some effect on the casualties, but not a large one.

Let's turn to the week of the month.

##    1    2    3    4    5 
## 0.72 0.72 0.72 0.73 0.72
##   1   2   3   4   5 
## 0.5 0.5 0.5 0.5 0.5
##    1    2    3    4    5 
## 0.49 0.45 0.55 0.44 0.44

We can very clearly see that the fifth week of every month has far fewer accidents than the other weeks, but this is simply because few months have a ‘full’ fifth week. Apart from this, there are no notable trends.

On to month…

##    1    2    3    4    5    6    7    8    9   10   11   12 
## 0.73 0.73 0.72 0.73 0.72 0.72 0.72 0.74 0.71 0.71 0.72 0.74
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 0.43 0.42 0.43 0.44 0.46 0.59 0.44 0.58 0.43 0.60 0.46 0.43

Again, similar results, with no strong observable trends. There seems to be high variability in the casualties in June, August and October, as can be seen in the boxplot. So the month appears to have a limited effect on the outcomes.

Let's now look at the effect of the time at which these accidents occurred. We categorize the times into the customary rush/lean (akin to peak/off-peak) periods in the morning, evening, and so on.
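A sketch of this binning is below. The exact cut points used in the report are not shown, so the hour ranges here are assumptions; the level names match those appearing in the output that follows.

```r
# Bin the raw "HH:MM" accident time into rush/lean periods of the day.
# Assumes 'Time' still holds the original HH:MM string at this point.
hour <- as.integer(substr(acc_veh$Time, 1, 2))
acc_veh$Time <- cut(hour,
                    breaks = c(-1, 4, 7, 10, 14, 16, 19, 23),   # assumed boundaries
                    labels = c("Mid-Night", "Morning", "Morning-Rush", "Noon",
                               "Evening", "Evening-Rush", "Night"))
```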

##      Evening Evening-Rush    Mid-Night      Morning Morning-Rush 
##         0.75         0.70         0.83         0.72         0.67 
##        Night         Noon 
##         0.81         0.73
##      Evening Evening-Rush    Mid-Night      Morning Morning-Rush 
##          0.5          0.5          0.5          0.5          0.5 
##        Night         Noon 
##          0.5          0.5
##      Evening Evening-Rush    Mid-Night      Morning Morning-Rush 
##         0.45         0.40         0.56         0.42         0.62 
##        Night         Noon 
##         0.53         0.43

Now we see some trends. There are observably more accidents during the ‘daytime’ periods of the morning rush, noon and the evening rush. So the time of day appears to be a useful predictor of accidents.

Let's look at the effect of vehicle type.

##    1    2    3    4    5    8    9   10   11   16   17   18   19   20   21 
## 0.54 0.63 0.63 0.66 0.68 0.80 0.75 0.89 0.98 0.54 0.64 0.93 0.69 0.67 0.66 
##   22   23   90   97   98 
## 0.64 0.72 0.70 0.69 0.62
##   1   2   3   4   5   8   9  10  11  16  17  18  19  20  21  22  23  90 
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0.5 
##  97  98 
## 0.5 0.5
##    1    2    3    4    5    8    9   10   11   16   17   18   19   20   21 
## 0.18 0.26 0.28 0.32 0.33 0.45 0.46 0.83 1.40 0.21 0.35 0.34 0.39 0.38 0.49 
##   22   23   90   97   98 
## 0.27 0.47 0.40 0.32 0.31

Let's put the first plot on a more insightful scale.

Vehicle type is clearly an important predictor, which becomes even more apparent when plotted on a more concise scale.

Next, let's explore the vehicle manoeuvre that led to the accident.

##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 0.83 0.53 0.64 0.68 0.70 0.67 0.68 0.65 0.70 0.68 0.64 0.65 0.63 0.68 0.60 
##   16   17   18 
## 0.94 0.95 0.75
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
## 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 0.5
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 0.36 0.30 0.37 0.39 0.34 0.37 0.36 0.35 0.40 0.39 0.38 0.38 0.37 0.36 0.33 
##   16   17   18 
## 0.90 0.88 0.47

Clearly, some manoeuvres are more likely to result in casualties than others. It can also be argued that the vehicle manoeuvre is a predictor of latitude and longitude, as one might associate certain manoeuvres with certain road conditions, such as turning on a curvy stretch of road.

Let's conclude the exploratory section by investigating the effect of the ambient and surrounding conditions, namely the light and weather conditions and whether the location is rural or urban, together with the sex of the driver, on casualties.
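The two tables below give the mean casualties per accident by driver sex, and by light condition within driver sex. A sketch of aggregations that would produce output of this shape (column names assumed from the str() listing earlier):

```r
# Mean casualties per accident by sex of driver
aggregate(acc_veh$CasualtiesPerAccident,
          by = list(driverSex = acc_veh$Sex_of_Driver), FUN = mean)

# Mean casualties per accident by light condition and sex of driver
aggregate(acc_veh$CasualtiesPerAccident,
          by = list(light = acc_veh$Light_Conditions,
                    driverSex = acc_veh$Sex_of_Driver), FUN = mean)
```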

##   driverSex         x
## 1         1 0.7276312
## 2         2 0.7376330
## 3         3 0.6305622

##    light driverSex         x
## 1      1         1 0.7083315
## 2      4         1 0.7490012
## 3      5         1 0.7970448
## 4      6         1 0.9407551
## 5      7         1 0.6983057
## 6      1         2 0.7271157
## 7      4         2 0.7404526
## 8      5         2 0.7653305
## 9      6         2 0.9453716
## 10     7         2 0.7041431
## 11     1         3 0.6208016
## 12     4         3 0.6535934
## 13     5         3 0.6063437
## 14     6         3 0.6400392
## 15     7         3 0.6555103

Light condition #6 seems to affect male and female drivers the same way, in that both sexes show a higher casualty count under that condition.

Clearly the weather and light conditions contribute to the casualties; so does the sex of the driver, but to a lesser extent.

As a final plot, let's visualize the correlations among all the variables.
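A sketch of the correlation computation and plot is below; the corrplot package is one option for visualizing the matrix, though the report may have used a different plotting approach.

```r
library(corrplot)

# Correlate the numeric columns and summarize the pairwise (off-diagonal) values
num_cols <- sapply(acc_veh, is.numeric)
cor_mat  <- cor(acc_veh[, num_cols])

summary(cor_mat[upper.tri(cor_mat)])
corrplot(cor_mat, method = "circle", tl.cex = 0.6)
```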

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -0.4366000 -0.0231000  0.0001786  0.0076190  0.0284400  0.9804000
## [1] 1

This plot gives good insight into the factors that affect the casualties. The strongest correlation with casualties is with Number of Vehicles, at a value close to -0.3. This correlation makes sense, as the number of vehicles involved would certainly influence the number of casualties. What is counter-intuitive is the sign of the correlation: it seems to suggest that the fewer the vehicles, the higher the casualties. This might indicate that certain (possibly large) vehicle types are ‘prone’ to getting into accidents ‘by themselves’, resulting in high casualty counts.

We can also note that longitude is correlated with ‘Police Force’ and ‘Local Authority District’.

Predictive Model

Since we are dealing with a large data set, it is worth setting up parallel processing to save time. We do this using the ‘doParallel’ library.
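A sketch of the parallel back-end setup; the number of cores to register is machine-dependent (four were registered in this run).

```r
library(doParallel)

# Register a parallel back-end for caret's resampling loops
cl <- makeCluster(detectCores())
registerDoParallel(cl)
print(paste("Number of registered cores is", getDoParWorkers()))
```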

## [1] "Number of registered cores is 4"

We split the data into training and test sets, using a 70/30 training/test split.
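A sketch of the split using caret's createDataPartition (the seed and object names are assumptions):

```r
library(caret)
set.seed(12345)   # arbitrary seed for reproducibility

# 70/30 split stratified on the casualty outcome
inTrain  <- createDataPartition(acc_veh$CasualtiesPerAccident, p = 0.7, list = FALSE)
training <- acc_veh[inTrain, ]
testing  <- acc_veh[-inTrain, ]
```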

We build a stochastic gradient boosting (GBM) model for the accident data.
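Below is a sketch of a caret/gbm fit for the casualty outcome that is consistent with the resampling summary that follows (default bootstrap resampling and default gbm tuning grid). Excluding latitude and longitude from the predictor set is an assumption inferred from the ‘35 predictor’ line in the output.

```r
# Predictors: everything except the outcome and the two coordinate columns
predictors <- setdiff(names(training),
                      c("CasualtiesPerAccident", "Latitude", "Longitude"))

fitCas <- train(x = training[, predictors],
                y = training$CasualtiesPerAccident,
                method = "gbm")
fitCas

# Test-set RMSE for the casualty model
predCas <- predict(fitCas, newdata = testing[, predictors])
round(sqrt(mean((predCas - testing$CasualtiesPerAccident)^2)), 2)
```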

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        0.2079             nan     0.1000    0.0072
##      2        0.2020             nan     0.1000    0.0059
##      3        0.1973             nan     0.1000    0.0048
##      4        0.1934             nan     0.1000    0.0039
##      5        0.1902             nan     0.1000    0.0032
##      6        0.1875             nan     0.1000    0.0026
##      7        0.1853             nan     0.1000    0.0022
##      8        0.1835             nan     0.1000    0.0018
##      9        0.1819             nan     0.1000    0.0016
##     10        0.1807             nan     0.1000    0.0012
##     20        0.1743             nan     0.1000    0.0004
##     40        0.1696             nan     0.1000    0.0001
##     60        0.1675             nan     0.1000    0.0001
##     80        0.1662             nan     0.1000    0.0001
##    100        0.1655             nan     0.1000    0.0000
##    120        0.1644             nan     0.1000    0.0000
##    140        0.1639             nan     0.1000    0.0000
##    150        0.1633             nan     0.1000    0.0000
## Stochastic Gradient Boosting 
## 
## 186812 samples
##     35 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   RMSE SD   
##   1                   50      0.4148981  0.1988442  0.02360560
##   1                  100      0.4107483  0.2118668  0.02381988
##   1                  150      0.4089926  0.2172836  0.02390877
##   2                   50      0.4103671  0.2140572  0.02383323
##   2                  100      0.4068020  0.2254588  0.02401245
##   2                  150      0.4054773  0.2297384  0.02403900
##   3                   50      0.4078799  0.2226987  0.02390043
##   3                  100      0.4046594  0.2330524  0.02408430
##   3                  150      0.4037384  0.2359248  0.02389452
##   Rsquared SD
##   0.01806151 
##   0.01919894 
##   0.01965441 
##   0.01932151 
##   0.02029407 
##   0.02039820 
##   0.01980134 
##   0.02069083 
##   0.02000347 
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.47

Similar models were fitted for latitude and longitude as well.

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6593             nan     0.1000    0.2682
##      2        1.4363             nan     0.1000    0.2228
##      3        1.2594             nan     0.1000    0.1771
##      4        1.1118             nan     0.1000    0.1476
##      5        0.9923             nan     0.1000    0.1200
##      6        0.8933             nan     0.1000    0.0989
##      7        0.8165             nan     0.1000    0.0774
##      8        0.7508             nan     0.1000    0.0660
##      9        0.6971             nan     0.1000    0.0535
##     10        0.6589             nan     0.1000    0.0381
##     20        0.4225             nan     0.1000    0.0094
##     40        0.2735             nan     0.1000    0.0071
##     60        0.1971             nan     0.1000    0.0015
##     80        0.1440             nan     0.1000    0.0021
##    100        0.1164             nan     0.1000    0.0006
##    120        0.0979             nan     0.1000    0.0007
##    140        0.0863             nan     0.1000    0.0007
##    150        0.0797             nan     0.1000    0.0007
## Stochastic Gradient Boosting 
## 
## 186812 samples
##     35 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   RMSE SD    
##   1                   50      0.8081185  0.7187798  0.002148632
##   1                  100      0.6863972  0.7832400  0.002062606
##   1                  150      0.6314684  0.8072319  0.001931428
##   2                   50      0.5737028  0.8451793  0.002964524
##   2                  100      0.4421717  0.9052640  0.003035276
##   2                  150      0.3595747  0.9368570  0.003742017
##   3                   50      0.4761200  0.8919498  0.004231754
##   3                  100      0.3401443  0.9435872  0.002307612
##   3                  150      0.2842114  0.9597367  0.002087082
##   Rsquared SD 
##   0.0019040074
##   0.0015607715
##   0.0013080224
##   0.0017634120
##   0.0014980194
##   0.0014387903
##   0.0026789789
##   0.0007870951
##   0.0006196809
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.9800938
## [1] 0.28
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6606             nan     0.1000    0.3021
##      2        1.4115             nan     0.1000    0.2471
##      3        1.2128             nan     0.1000    0.1983
##      4        1.0484             nan     0.1000    0.1647
##      5        0.9110             nan     0.1000    0.1384
##      6        0.7996             nan     0.1000    0.1109
##      7        0.6898             nan     0.1000    0.1097
##      8        0.6190             nan     0.1000    0.0706
##      9        0.5412             nan     0.1000    0.0777
##     10        0.4781             nan     0.1000    0.0630
##     20        0.2199             nan     0.1000    0.0137
##     40        0.1086             nan     0.1000    0.0026
##     60        0.0748             nan     0.1000    0.0007
##     80        0.0571             nan     0.1000    0.0005
##    100        0.0452             nan     0.1000    0.0007
##    120        0.0384             nan     0.1000    0.0002
##    140        0.0335             nan     0.1000    0.0001
##    150        0.0319             nan     0.1000    0.0001
## Stochastic Gradient Boosting 
## 
## 186812 samples
##     35 predictor
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  RMSE       Rsquared   RMSE SD    
##   1                   50      0.6915950  0.8562537  0.002699728
##   1                  100      0.4862036  0.9119855  0.002474972
##   1                  150      0.4145690  0.9217630  0.002160972
##   2                   50      0.3694534  0.9408833  0.003757698
##   2                  100      0.2704362  0.9641916  0.003695063
##   2                  150      0.2247927  0.9750049  0.002794169
##   3                   50      0.2976922  0.9583498  0.002668562
##   3                  100      0.2127352  0.9779870  0.002685592
##   3                  150      0.1790564  0.9840712  0.002694375
##   Rsquared SD 
##   0.0038390426
##   0.0011016636
##   0.0004799897
##   0.0014909819
##   0.0010122493
##   0.0006408788
##   0.0007504481
##   0.0005613151
##   0.0004254851
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.9921451
## [1] 0.18

Diagnostics, Results and Visualization

Let's look at the model, predictions and results in a bit more detail:

Let's visualize the results and explore the errors qualitatively.

The predictions for casualties align well with our observations from the exploratory plots.

The predictions for longitude also align well with our observation from the exploratory section that the police force and local authority district are correlated with longitude.

Let's analyze qualitatively how close the training and test predictions are to the actual values.
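A sketch of one way to make this comparison, overlaying the densities of the actual and predicted casualty counts on the test set (the plot choice is ours, not necessarily the one used in the report):

```r
library(ggplot2)

# Predicted vs actual casualties per accident on the held-out test set
testing$predCas <- predict(fitCas, newdata = testing[, predictors])

ggplot(testing) +
  geom_density(aes(x = CasualtiesPerAccident, colour = "actual")) +
  geom_density(aes(x = predCas,               colour = "predicted")) +
  labs(x = "Casualties per accident", colour = "")
```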

## [1] 0.01

It can be seen that the fit is not particularly great. The model does not seem to pick up the extreme values at all, possibly because of the very low incidence of extreme values (> 3 casualties per accident), i.e. around 0.1% of the training data.

Similar plots can be made for latitude and longitude.

## [1] 80060     3

These plots reveal that the model is ‘reasonably’ successful in predicting the hot-spots.

Results

The research goal was to analyze the UK accident data set and predict accident hot spots and the number of casualties. The analyses were restricted to the year 2014. The three data sets were cleaned and merged into a single data set. Exploratory analyses were performed, and it was observed that: