Road accidents are never a happy issue to discuss. They not only have severe consequences for those directly involved, they also affect the lives of many others, such as friends and family. With more vehicles on the road than ever before, it's important to understand accidents in greater detail, and possibly 'predict' their locations and consequences. Government agencies in the UK have been collecting data about reported accidents since the year 2005. The data includes generic and specific details about the vehicles, drivers, number of passengers and number of casualties.
With data available since 2005, one could develop a model to predict accidents. The database records only reported accidents, so we know for certain that these accidents 'happened'. We use this data to predict the location of an accident, in terms of latitude and longitude, and also the expected number of casualties.
The purpose of this research is to:
Identify and quantify associations (if any) between the number of casualties and other variables in the data set.
Explore whether it is possible to predict accident hotspots based on the data.
The data sets are quite large. The ‘Accidents0514’ file has over 1.6 million rows and 32 columns. This is the smallest of the 3 files. As a simplification, we limit our analysis to the accidents of the year 2014 alone.
We download the data from the source and perform a set of pre-processing operations to prepare the data for exploratory analysis and predictive modelling.
After downloading all the data sets, they were unzipped and read in.
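The download-and-read step could be sketched as below. The URL and file names are assumptions based on the UK Department for Transport open-data release; the actual source used in this report is not stated.

```r
# Hypothetical source URL; adjust to the actual DfT release being used.
url <- "http://data.dft.gov.uk/road-accidents-safety-data/Stats19_Data_2005-2014.zip"
download.file(url, destfile = "Stats19_Data_2005-2014.zip")
unzip("Stats19_Data_2005-2014.zip")

# Read the three files into data frames.
acc14  <- read.csv("Accidents0514.csv",  stringsAsFactors = FALSE)
veh14  <- read.csv("Vehicles0514.csv",   stringsAsFactors = FALSE)
casu14 <- read.csv("Casualties0514.csv", stringsAsFactors = FALSE)

dim(acc14); dim(veh14); dim(casu14)
```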
## [1] 146322 32
## [1] 268527 22
## [1] 194477 15
Let's look at a snapshot of the accidents in the UK from 2005 to 2014. Given the scale of the data, one might expect the plots to be 'crowded'.
## [1] 0
We see clear spikes around the longitude of 0 and latitude of 51.5. Not surprisingly, these are the coordinates of London. We also see spikes around the Manchester and Birmingham areas.
A slightly more subtle trend is that the number of accidents in the London area appears to have been increasing since 2005, as shown below. It may also be noted that the number of accidents elsewhere has remained the same or decreased.
These are the overall trends. All subsequent steps focus on the year 2014. Once we subset the 2014 data, it's important to collate the 3 data frames into one that can be explored for patterns and analyzed.
Before we do this, it's important to understand the problem at hand. The objective is to predict the number of casualties and the potential hot spots for accidents. Looking at the features of the 3 data frames above, we can observe the following:
'Accidents0514' contains the 'overall', or generic, factors that are typically not within the driver's control: date, time, weather, light conditions, longitude and latitude of the accident (it is fair to assume the location was not under the driver's control, since a driver does not set out to have an accident at a particular spot), number of vehicles, number of casualties, etc.
‘Vehicles0514’ contains vehicle and driver specific details like type of vehicle, was it being towed, engine capacity, age and sex of driver, etc.
‘Casualties0514’ contains the details of victims of accident.
We ASSUME that, for the purpose of predicting the number of casualties of an accident, the specific details of the individual victims do not contribute to the prediction. So we ignore the casu14 data in all subsequent analysis.
Now we have the immediate task of combining the 2 data frames into one in a meaningful manner. Since both hold data about accidents in a particular year and share a matching 'Accident_Index', we can join them using a SQL join.
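A minimal sketch of this join, assuming the two 2014 data frames are named acc14 and veh14 and share the key Accident_Index; the sqldf package expresses it as SQL, and base merge() is equivalent:

```r
library(sqldf)

# Inner join on the shared accident identifier.
acc.veh14 <- sqldf("SELECT * FROM acc14 JOIN veh14 USING (Accident_Index)")

# Equivalent base-R form:
# acc.veh14 <- merge(acc14, veh14, by = "Accident_Index")
```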
We also assume that if 'y' vehicles are involved in an accident (irrespective of whose fault it is) resulting in 'x' casualties, the casualty count attributed to each vehicle is 'x/y'. This is a fair assumption given that the task is to predict the number of casualties in an accident, not who caused it; all vehicles involved are taken to contribute equally to the casualty count.
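This equal-split assumption is one line of code on the joined data; the data-frame and column names are assumptions:

```r
# Each of the y vehicles in an accident with x casualties is credited x / y.
acc.veh14$CasualtiesPerAccident <-
  acc.veh14$Number_of_Casualties / acc.veh14$Number_of_Vehicles

# e.g. 1 casualty across 2 vehicles gives 0.5 per vehicle, matching the
# 0.5 and 0.333 values visible in the str() output further down.
```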
In this data, there are a few variables that are duplicates or, more precisely, surrogates of other variables. These need to be removed. They are:
Age_of_Driver: which is captured by the Age_Band_of_Driver variable.
Location_Easting_OSGR, Location_Northing_OSGR: Ordnance Survey Grid Reference is a grid location surrogate for latitude and longitude.
Number_of_Casualties: already captured by ‘CasualtiesPerAccident’
LSOA_of_Accident_Location : Lower Layer Super Output Area is a geographical location surrogate for latitude and longitude.
Also there are variables that are unlikely to contribute to the accident or number of casualties, like:
Did_Police_Officer_Attend_Scene_of_Accident: police officer(s) are likely to have visited after the accident.
Accident_Index: Likely to have been assigned after the accident.
So we remove all these variables. Additionally, we do some ‘housekeeping’ operations like converting the time of accident, day, week of month and month of year into categorical variables.
From the data, it may be inferred that -1 is the NA string for the data set. As part of the 'housekeeping' operations, we remove the NA values. To limit data loss, we first check which columns have greater than 10% NA and selectively remove those columns. We then proceed to remove rows that contain any NAs.
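A sketch of this housekeeping, assuming the merged 2014 data frame is called df and all remaining columns are numeric codes at this point:

```r
# Recode the sentinel value -1 as NA across the data frame.
df[df == -1] <- NA

na.frac <- colMeans(is.na(df))     # fraction of NAs per column
df <- df[, na.frac <= 0.10]        # drop columns with > 10% NA
df <- df[complete.cases(df), ]     # drop rows with any remaining NA

sum(is.na(df))                     # 0 once cleaning succeeds, as printed below
```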
## [1] 0
## [1] 266872 38
## 'data.frame': 266872 obs. of 38 variables:
## $ Vehicle_Type : int 8 19 9 1 9 3 9 9 3 9 ...
## $ Towing_and_Articulation : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Vehicle_Manoeuvre : int 18 15 2 14 9 13 4 6 18 2 ...
## $ Vehicle_Location.Restricted_Lane : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Junction_Location : int 0 0 1 1 5 8 1 8 8 0 ...
## $ Skidding_and_Overturning : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Hit_Object_in_Carriageway : int 0 0 0 4 0 0 0 0 0 0 ...
## $ Vehicle_Leaving_Carriageway : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hit_Object_off_Carriageway : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1st_Point_of_Impact : int 4 1 3 4 3 1 1 3 1 3 ...
## $ Was_Vehicle_Left_Hand_Drive. : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Journey_Purpose_of_Driver : int 1 6 6 6 6 6 6 6 6 6 ...
## $ Sex_of_Driver : int 1 1 1 2 1 1 1 1 1 1 ...
## $ Accident_Severity : int 3 3 3 3 3 3 3 3 3 3 ...
## $ X1st_Road_Class : int 3 3 3 3 3 3 5 3 3 3 ...
## $ Road_Type : int 6 6 6 6 6 6 6 6 6 2 ...
## $ Speed_limit : int 30 30 30 30 30 30 30 30 30 30 ...
## $ Junction_Detail : int 0 0 5 5 3 3 3 7 7 0 ...
## $ Pedestrian_Crossing.Human_Control : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pedestrian_Crossing.Physical_Facilities: int 0 0 5 5 0 0 1 8 8 0 ...
## $ Light_Conditions : int 1 1 7 7 1 1 4 1 1 1 ...
## $ Weather_Conditions : int 2 2 1 1 1 1 1 1 1 1 ...
## $ Road_Surface_Conditions : int 2 2 1 1 1 1 1 1 1 1 ...
## $ Special_Conditions_at_Site : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Carriageway_Hazards : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Urban_or_Rural_Area : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Police_Force : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Number_of_Vehicles : int 2 2 2 2 2 2 1 2 2 3 ...
## $ Local_Authority_.District. : int 12 12 12 12 12 12 12 12 12 12 ...
## $ X1st_Road_Number : int 315 315 3218 3218 308 308 0 4 4 3220 ...
## $ X2nd_Road_Number : int 0 0 3220 3220 0 0 0 4 4 0 ...
## $ Time : Factor w/ 7 levels "Evening","Evening-Rush",..: 7 7 3 3 4 4 2 5 5 7 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WeekOfMonth : Factor w/ 5 levels "1","2","3","4",..: 2 2 3 3 3 3 3 2 2 3 ...
## $ Day : Factor w/ 7 levels "Friday","Monday",..: 5 5 2 2 6 6 7 5 5 1 ...
## $ CasualtiesPerAccident : num 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.333 ...
## $ Longitude : num -0.206 -0.206 -0.19 -0.19 -0.174 ...
## $ Latitude : num 51.5 51.5 51.5 51.5 51.5 ...
Let's begin with a few high-level summaries of the casualty data.
##
## 0.053 0.111 0.125 0.143 0.167 0.2 0.222 0.238 0.25 0.273
## 19 9 96 119 354 1039 9 21 4423 11
## 0.286 0.3 0.308 0.333 0.375 0.4 0.429 0.444 0.5 0.556
## 98 20 13 18907 48 675 91 9 138034 9
## 0.571 0.6 0.667 0.7 0.714 0.75 0.8 0.833 0.857 0.889
## 14 465 8985 10 28 1484 225 66 7 9
## 1 1.2 1.25 1.3 1.333 1.375 1.4 1.429 1.5 1.6
## 72331 50 276 10 1350 8 20 7 7904 30
## 1.667 1.75 1.8 2 2.2 2.25 2.333 2.5 2.667 2.75
## 593 72 10 6292 5 28 72 930 21 4
## 3 3.2 3.25 3.333 3.4 3.5 3.667 3.8 4 4.5
## 985 5 4 9 5 84 3 5 277 22
## 4.667 5 5.5 6 6.5 7 7.5 8 8.5 9
## 3 102 8 17 2 5 2 5 2 3
## 10 10.5 20.5 27 43.5 46.5 54
## 2 4 2 1 2 2 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0530 0.5000 0.5000 0.7246 1.0000 54.0000
The summary shows that the 75th percentile is 1, meaning that for almost three quarters of the accidents the number of casualties per accident is at most 1.
Let's look at the locations of these accidents to identify any areas with a high concentration of accidents.
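One way to produce such a density plot is sketched below with ggplot2; df is assumed to be the cleaned 2014 data frame with Longitude and Latitude columns:

```r
library(ggplot2)

# 2-D kernel density of accident locations; darker regions are hot spots.
ggplot(df, aes(x = Longitude, y = Latitude)) +
  stat_density2d(aes(fill = after_stat(level)), geom = "polygon") +
  coord_fixed() +
  labs(title = "Accident density, UK 2014")
```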
This reveals striking accident hot spots, which can easily be corroborated as corresponding to major cities. The hottest accident zone appears to be the London area, followed by the Birmingham, Manchester, Sheffield and Leeds, and Newcastle areas. It will be interesting to see whether the predictive models identify these places as the predicted hot spots.
The density plots also reveal that the westernmost and northernmost parts of the UK do not seem to have many accidents compared to the rest of the UK, especially the south-east.
Let's explore the data in greater detail. To make the exploratory plots meaningful, let's consider the factors that are plausibly associated with an accident, and hence with casualties per accident:
It has to be kept in mind that the meanings of the coded categories are not given in the data set, and no attempt shall be made to infer what the categories could be. They are taken at face value.
Let's begin by looking at the 'Day' of the week.
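The three tables that follow are, in order, the mean, median and standard deviation of casualties per accident by day. A sketch of how they could be produced, assuming the cleaned data frame df:

```r
# Per-day summaries of the casualty rate, rounded to two decimals.
round(tapply(df$CasualtiesPerAccident, df$Day, mean),   2)
round(tapply(df$CasualtiesPerAccident, df$Day, median), 2)
round(tapply(df$CasualtiesPerAccident, df$Day, sd),     2)
```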
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0.71 0.71 0.78 0.80 0.70 0.70 0.70
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0.42 0.54 0.48 0.52 0.41 0.58 0.40
There appears to be some effect of day on the number of casualties. Part of that trend is driven by outliers, which obscure the actual scale of the differences among the 'not-so-extreme' values; this can be inferred from the differences in the mean values. So the day of the week likely has some effect on the casualties, but not a large one.
Let's turn to the week of the month.
## 1 2 3 4 5
## 0.72 0.72 0.72 0.73 0.72
## 1 2 3 4 5
## 0.5 0.5 0.5 0.5 0.5
## 1 2 3 4 5
## 0.49 0.45 0.55 0.44 0.44
We can very clearly see that the fifth week of every month has far fewer accidents than the other weeks. But this is simply because few months have a 'full' fifth week. Apart from this, there are no notable trends.
On to month…
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.73 0.73 0.72 0.73 0.72 0.72 0.72 0.74 0.71 0.71 0.72 0.74
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.43 0.42 0.43 0.44 0.46 0.59 0.44 0.58 0.43 0.60 0.46 0.43
Again, similar results: no strong trends. There is high variability in the casualties in the months of June, August and October, visible in the boxplot. So month seems to have a limited effect on the outcomes.
Let's now look at the effect of the times at which these accidents occurred. We categorize the times into the customary rush/lean (akin to peak/off-peak) hours in the morning, evening, etc.
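The binning could be done with cut(), assuming the raw Time field is an "HH:MM" string. The cut points below are illustrative only; the report does not document the actual hour boundaries behind its seven levels:

```r
# Extract the hour and bin it into the seven rush/lean categories.
hour <- as.integer(substr(df$Time, 1, 2))
df$Time <- cut(hour,
               breaks = c(-1, 4, 6, 9, 14, 16, 19, 23),   # assumed boundaries
               labels = c("Mid-Night", "Morning", "Morning-Rush", "Noon",
                          "Evening-Rush", "Evening", "Night"))
```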
## Evening Evening-Rush Mid-Night Morning Morning-Rush
## 0.75 0.70 0.83 0.72 0.67
## Night Noon
## 0.81 0.73
## Evening Evening-Rush Mid-Night Morning Morning-Rush
## 0.5 0.5 0.5 0.5 0.5
## Night Noon
## 0.5 0.5
## Evening Evening-Rush Mid-Night Morning Morning-Rush
## 0.45 0.40 0.56 0.42 0.62
## Night Noon
## 0.53 0.43
Now we see some trends. There are observably more accidents during the 'daytime' periods of morning rush, noon and evening rush, while the mean casualties per accident are highest at mid-night and night. Either way, the time of day is a useful predictor.
Let's look at the effect of vehicle type.
## 1 2 3 4 5 8 9 10 11 16 17 18 19 20 21
## 0.54 0.63 0.63 0.66 0.68 0.80 0.75 0.89 0.98 0.54 0.64 0.93 0.69 0.67 0.66
## 22 23 90 97 98
## 0.64 0.72 0.70 0.69 0.62
## 1 2 3 4 5 8 9 10 11 16 17 18 19 20 21 22 23 90
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0.5
## 97 98
## 0.5 0.5
## 1 2 3 4 5 8 9 10 11 16 17 18 19 20 21
## 0.18 0.26 0.28 0.32 0.33 0.45 0.46 0.83 1.40 0.21 0.35 0.34 0.39 0.38 0.49
## 22 23 90 97 98
## 0.27 0.47 0.40 0.32 0.31
Let's replot the first plot on a more insightful scale.
Vehicle type is clearly an important predictor, all the more apparent when plotted on a more concise scale.
Exploring the vehicle manoeuvre that led to the accident.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.83 0.53 0.64 0.68 0.70 0.67 0.68 0.65 0.70 0.68 0.64 0.65 0.63 0.68 0.60
## 16 17 18
## 0.94 0.95 0.75
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 0.5
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.36 0.30 0.37 0.39 0.34 0.37 0.36 0.35 0.40 0.39 0.38 0.38 0.37 0.36 0.33
## 16 17 18
## 0.90 0.88 0.47
Clearly, some manoeuvres are more likely to result in casualties than others. It can also be inferred that the vehicle manoeuvre is a predictor for latitude and longitude, as one might associate certain manoeuvres with certain conditions, such as turning on a curvy stretch of road.
Let's conclude the exploratory section by investigating the effect of the ambient surroundings, namely light and weather conditions and whether the location is rural or urban, on casualties, broken down by the sex of the driver.
## driverSex x
## 1 1 0.7276312
## 2 2 0.7376330
## 3 3 0.6305622
## light driverSex x
## 1 1 1 0.7083315
## 2 4 1 0.7490012
## 3 5 1 0.7970448
## 4 6 1 0.9407551
## 5 7 1 0.6983057
## 6 1 2 0.7271157
## 7 4 2 0.7404526
## 8 5 2 0.7653305
## 9 6 2 0.9453716
## 10 7 2 0.7041431
## 11 1 3 0.6208016
## 12 4 3 0.6535934
## 13 5 3 0.6063437
## 14 6 3 0.6400392
## 15 7 3 0.6555103
Light condition #6 seems to affect male and female drivers the same way, in that both sexes have a higher casualty count under it.
Clearly the weather and light conditions contribute to the casualties; so does the sex of the driver, but to a lesser extent.
As a final plot, let's visualize the correlations among all the variables.
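A sketch of one way to build this plot, coercing the factor columns back to numeric codes and using the corrplot package (an assumption; any correlation-matrix plotter would do):

```r
library(corrplot)

# Convert every column to numeric (factors become their integer codes).
num <- data.frame(lapply(df, as.numeric))
M <- cor(num)

summary(M[upper.tri(M)])                    # distribution of pairwise correlations
corrplot(M, method = "circle", tl.cex = 0.5)
```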
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4366000 -0.0231000 0.0001786 0.0076190 0.0284400 0.9804000
## [1] 1
This plot gives good insight into the factors that affect the casualties. The most significant correlation is with Number_of_Vehicles, at close to -0.3. This correlation makes sense, as the number of vehicles involved would certainly influence the number of casualties. What's counter-intuitive is the sign of the correlation: it seems to suggest that the fewer the vehicles, the higher the casualties. This might indicate that some types of vehicle (possibly large ones) are 'prone' to get into an accident 'by themselves', resulting in high casualties.
We can also note that longitude is correlated with 'Police Force' and 'Local Authority District'.
Since we are dealing with a large data set, it's worth setting up parallel processing to save processing time. We do this with the 'doParallel' library.
## [1] "Number of registered cores is 4"
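The registration producing the message above could be sketched as follows; caret then spreads its resampling iterations across the registered workers:

```r
library(doParallel)

# Start a cluster with one worker per available core and register it.
cl <- makeCluster(detectCores())
registerDoParallel(cl)

paste("Number of registered cores is", getDoParWorkers())
```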
We split the data into training and test sets, using a 70/30 split.
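With caret, the split could look like this; the seed value is an assumption, chosen only to make the sketch reproducible:

```r
library(caret)

set.seed(1234)   # assumed seed, not stated in the report
inTrain  <- createDataPartition(df$CasualtiesPerAccident, p = 0.7, list = FALSE)
training <- df[inTrain, ]
testing  <- df[-inTrain, ]
```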
We build a Stochastic Gradient Boosting model for modelling the accident data.
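A sketch of the fit consistent with the caret output below: method = "gbm" with caret's defaults (bootstrap resampling, a 3 x 3 grid over n.trees and interaction.depth, shrinkage held at 0.1). Variable names are assumptions:

```r
# Fit the boosting model; verbose = TRUE prints the Iter/TrainDeviance trace.
modCasualty <- train(CasualtiesPerAccident ~ ., data = training,
                     method = "gbm", verbose = TRUE)

# Test-set error.
predCasualty <- predict(modCasualty, newdata = testing)
round(RMSE(predCasualty, testing$CasualtiesPerAccident), 2)
```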
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2079 nan 0.1000 0.0072
## 2 0.2020 nan 0.1000 0.0059
## 3 0.1973 nan 0.1000 0.0048
## 4 0.1934 nan 0.1000 0.0039
## 5 0.1902 nan 0.1000 0.0032
## 6 0.1875 nan 0.1000 0.0026
## 7 0.1853 nan 0.1000 0.0022
## 8 0.1835 nan 0.1000 0.0018
## 9 0.1819 nan 0.1000 0.0016
## 10 0.1807 nan 0.1000 0.0012
## 20 0.1743 nan 0.1000 0.0004
## 40 0.1696 nan 0.1000 0.0001
## 60 0.1675 nan 0.1000 0.0001
## 80 0.1662 nan 0.1000 0.0001
## 100 0.1655 nan 0.1000 0.0000
## 120 0.1644 nan 0.1000 0.0000
## 140 0.1639 nan 0.1000 0.0000
## 150 0.1633 nan 0.1000 0.0000
## Stochastic Gradient Boosting
##
## 186812 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD
## 1 50 0.4148981 0.1988442 0.02360560
## 1 100 0.4107483 0.2118668 0.02381988
## 1 150 0.4089926 0.2172836 0.02390877
## 2 50 0.4103671 0.2140572 0.02383323
## 2 100 0.4068020 0.2254588 0.02401245
## 2 150 0.4054773 0.2297384 0.02403900
## 3 50 0.4078799 0.2226987 0.02390043
## 3 100 0.4046594 0.2330524 0.02408430
## 3 150 0.4037384 0.2359248 0.02389452
## Rsquared SD
## 0.01806151
## 0.01919894
## 0.01965441
## 0.01932151
## 0.02029407
## 0.02039820
## 0.01980134
## 0.02069083
## 0.02000347
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.47
Similar models were fit for latitude and longitude as well.
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6593 nan 0.1000 0.2682
## 2 1.4363 nan 0.1000 0.2228
## 3 1.2594 nan 0.1000 0.1771
## 4 1.1118 nan 0.1000 0.1476
## 5 0.9923 nan 0.1000 0.1200
## 6 0.8933 nan 0.1000 0.0989
## 7 0.8165 nan 0.1000 0.0774
## 8 0.7508 nan 0.1000 0.0660
## 9 0.6971 nan 0.1000 0.0535
## 10 0.6589 nan 0.1000 0.0381
## 20 0.4225 nan 0.1000 0.0094
## 40 0.2735 nan 0.1000 0.0071
## 60 0.1971 nan 0.1000 0.0015
## 80 0.1440 nan 0.1000 0.0021
## 100 0.1164 nan 0.1000 0.0006
## 120 0.0979 nan 0.1000 0.0007
## 140 0.0863 nan 0.1000 0.0007
## 150 0.0797 nan 0.1000 0.0007
## Stochastic Gradient Boosting
##
## 186812 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD
## 1 50 0.8081185 0.7187798 0.002148632
## 1 100 0.6863972 0.7832400 0.002062606
## 1 150 0.6314684 0.8072319 0.001931428
## 2 50 0.5737028 0.8451793 0.002964524
## 2 100 0.4421717 0.9052640 0.003035276
## 2 150 0.3595747 0.9368570 0.003742017
## 3 50 0.4761200 0.8919498 0.004231754
## 3 100 0.3401443 0.9435872 0.002307612
## 3 150 0.2842114 0.9597367 0.002087082
## Rsquared SD
## 0.0019040074
## 0.0015607715
## 0.0013080224
## 0.0017634120
## 0.0014980194
## 0.0014387903
## 0.0026789789
## 0.0007870951
## 0.0006196809
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.9800938
## [1] 0.28
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6606 nan 0.1000 0.3021
## 2 1.4115 nan 0.1000 0.2471
## 3 1.2128 nan 0.1000 0.1983
## 4 1.0484 nan 0.1000 0.1647
## 5 0.9110 nan 0.1000 0.1384
## 6 0.7996 nan 0.1000 0.1109
## 7 0.6898 nan 0.1000 0.1097
## 8 0.6190 nan 0.1000 0.0706
## 9 0.5412 nan 0.1000 0.0777
## 10 0.4781 nan 0.1000 0.0630
## 20 0.2199 nan 0.1000 0.0137
## 40 0.1086 nan 0.1000 0.0026
## 60 0.0748 nan 0.1000 0.0007
## 80 0.0571 nan 0.1000 0.0005
## 100 0.0452 nan 0.1000 0.0007
## 120 0.0384 nan 0.1000 0.0002
## 140 0.0335 nan 0.1000 0.0001
## 150 0.0319 nan 0.1000 0.0001
## Stochastic Gradient Boosting
##
## 186812 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD
## 1 50 0.6915950 0.8562537 0.002699728
## 1 100 0.4862036 0.9119855 0.002474972
## 1 150 0.4145690 0.9217630 0.002160972
## 2 50 0.3694534 0.9408833 0.003757698
## 2 100 0.2704362 0.9641916 0.003695063
## 2 150 0.2247927 0.9750049 0.002794169
## 3 50 0.2976922 0.9583498 0.002668562
## 3 100 0.2127352 0.9779870 0.002685592
## 3 150 0.1790564 0.9840712 0.002694375
## Rsquared SD
## 0.0038390426
## 0.0011016636
## 0.0004799897
## 0.0014909819
## 0.0010122493
## 0.0006408788
## 0.0007504481
## 0.0005613151
## 0.0004254851
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.9921451
## [1] 0.18
Let's look at the model, predictions and results in a bit more detail:
The root mean square error of casualties per accident in the training set is 0.42. This means that, on average in the training set, the actual number of casualties differs by 0.42 from what the model has 'learnt' the casualty count to be under the given conditions.
The RMSE for the test data is 0.44. While the interpretation stays the same as above, it is very close to the RMSE of the training data, which suggests the model hasn't overfit.
The R-squared for the casualty prediction is on the lower side, at about 0.22.
On the other hand, the RMSEs for latitude and longitude in the training data are 0.18 and 0.28 respectively, and the same holds for the test data.
On one hand this is quite satisfying; on the other, it raises the question of whether any of the variables in the data set were surrogates for latitude and longitude. This could be explored further if more documentation were available for the variables and their definitions.
Let's visualize the results and explore the errors qualitatively.
The predictions for casualties align well with our observations from the exploratory plots.
The predictions for longitude also align well with our observation from the exploratory section that police force and local authority district are correlated with longitude.
Let's analyze qualitatively how close the training and test predictions are to the actual values.
## [1] 0.01
It can be seen that the fit isn't particularly great. The model doesn't seem to pick up the extreme values at all, possibly because of the very low incidence of extreme values (> 3 casualties/accident), i.e. around 0.1% of the training data.
Similar plots can be made for latitude and longitude.
## [1] 80060 3
These plots reveal that the model is ‘reasonably’ successful in predicting the hot-spots.
The research goal was to analyze the UK accident data set and predict accident hot spots and the number of casualties. Analyses were set up for the year 2014. The 3 data sets were cleaned and merged into a single data set. Exploratory analyses were performed, and it was observed that:
The number of casualties was correlated with vehicle type, manoeuvre, sex of driver, month, day, time, urban or rural area, and weather and light conditions.
The data were cleaned of missing values and split into training and test sets. Boosting models were developed, with RMSEs of 0.44, 0.18 and 0.28 for casualties, latitude and longitude respectively. This implies that, on average, the algorithm predicted the location of an accident to within (0.18, 0.28) degrees, which translates to roughly 37 km. So the algorithm is capable of predicting the location of an accident to within about 37 km of its actual location. Hypothesis testing and confidence-interval analysis could be performed on this statistic, but that is not covered in this study.
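The arithmetic behind the 37 km figure appears to take one degree as roughly 111 km in both latitude and longitude and combine the two RMSEs as a Euclidean distance. (Strictly, a degree of longitude at UK latitudes is nearer 69 km, which would shrink the figure to roughly 28 km.)

```r
# Degrees-to-km conversion behind the "within 37 km" claim.
lat.err <- 0.18 * 111    # ~20 km
lon.err <- 0.28 * 111    # ~31 km
round(sqrt(lat.err^2 + lon.err^2))   # ~37 km
```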
The diagnostics suggest there was little overfitting.