Road accidents are never a happy issue to discuss. They not only have severe consequences for those directly involved, they also affect the lives of many others, such as friends and family. With more vehicles on the road than ever before, it's important to understand accidents in greater detail, and possibly 'predict' their locations and consequences. Government agencies in the UK have been collecting data about reported accidents since the year 2005. The data includes generic and specific details about the vehicles, drivers, number of passengers and number of casualties.
With data available since 2005, one could develop a model to predict accidents. The database records only reported accidents, so we know for certain that these accidents 'happened'. We use this data to predict the location of an accident, in terms of latitude and longitude, and also the expected number of casualties.
The purpose of this research is to:
Identify and quantify associations (if any) between the number of casualties and other variables in the data set.
Explore whether it is possible to predict accident hotspots based on the data.
The data sets are quite large. The ‘Accidents0514’ file has over 1.6 million rows and 32 columns. This is the smallest of the 3 files. As a simplification, we limit our analysis to the accidents of the year 2014 alone.
We download the data from the source and perform a set of pre-processing operations to prepare the data for exploratory analysis and predictive modelling.
After downloading all the data sets, they were unzipped and read in.
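The download-and-read step could be sketched as below. The URL and file names are assumptions based on the UK Department for Transport open-data release; the actual source used in this report is not stated.

```r
# Hypothetical source URL; adjust to the actual DfT release being used.
url <- "http://data.dft.gov.uk/road-accidents-safety-data/Stats19_Data_2005-2014.zip"
download.file(url, destfile = "Stats19_Data_2005-2014.zip")
unzip("Stats19_Data_2005-2014.zip")

# Read the three files into data frames.
acc14  <- read.csv("Accidents0514.csv",  stringsAsFactors = FALSE)
veh14  <- read.csv("Vehicles0514.csv",   stringsAsFactors = FALSE)
casu14 <- read.csv("Casualties0514.csv", stringsAsFactors = FALSE)

dim(acc14); dim(veh14); dim(casu14)
```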
## [1] 146322 32
## [1] 268527 22
## [1] 194477 15
Let's look at a snapshot of the accidents in the UK from 2005 to 2014. Given the scale of the data, one might expect the plots to be 'crowded'.
## [1] 0
We see clear spikes around the longitude of 0 and latitude of 51.5. Not surprisingly, these are the coordinates of London. We also see spikes around the Manchester and Birmingham areas.
A slightly more subtle trend is that the number of accidents in the London area appears to have been increasing since 2005, as shown below. It may also be noted that the number of accidents elsewhere has remained the same or decreased.
These are the overall trends. All subsequent steps focus on the year 2014. Once we subset the 2014 data, it's important to collate the 3 data frames into one that can be explored for patterns and analyzed.
Before we do this, it's important to understand the problem at hand. The objective is to predict the number of casualties and the potential hot spots for accidents. Looking at the features of the 3 data frames above, we can observe the following:
'Accidents0514' contains the 'overall', or generic, factors that are typically not within the driver's control: date, time, weather, light conditions, longitude and latitude of the accident (it is fair to assume the location was not under the driver's control, since a driver does not set out to have an accident at a particular spot), number of vehicles, number of casualties, etc.
‘Vehicles0514’ contains vehicle and driver specific details like type of vehicle, was it being towed, engine capacity, age and sex of driver, etc.
‘Casualties0514’ contains the details of victims of accident.
We ASSUME that, for the purpose of predicting the number of casualties of an accident, the specific details of the individual victims do not contribute to the prediction. So we ignore the casu14 data in all subsequent analysis.
Now we have the immediate task of combining the 2 data frames into one in a meaningful manner. Since both hold data about accidents in a particular year and share a matching 'Accident_Index', we can join them using a SQL join.
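A minimal sketch of this join, assuming the two 2014 data frames are named acc14 and veh14 and share the key Accident_Index; the sqldf package expresses it as SQL, and base merge() is equivalent:

```r
library(sqldf)

# Inner join on the shared accident identifier.
acc.veh14 <- sqldf("SELECT * FROM acc14 JOIN veh14 USING (Accident_Index)")

# Equivalent base-R form:
# acc.veh14 <- merge(acc14, veh14, by = "Accident_Index")
```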
We also assume that if 'y' vehicles are involved in an accident (irrespective of whose fault it is) resulting in 'x' casualties, the casualty count attributed to each vehicle is 'x/y'. This is a fair assumption given that the task is to predict the number of casualties in an accident, not who caused it; all vehicles involved are taken to contribute equally to the casualty count.
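This equal-split assumption is one line of code on the joined data; the data-frame and column names are assumptions:

```r
# Each of the y vehicles in an accident with x casualties is credited x / y.
acc.veh14$CasualtiesPerAccident <-
  acc.veh14$Number_of_Casualties / acc.veh14$Number_of_Vehicles

# e.g. 1 casualty across 2 vehicles gives 0.5 per vehicle, matching the
# 0.5 and 0.333 values visible in the str() output further down.
```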
In this data, there are a few variables that are duplicates or, more precisely, surrogates of other variables. These need to be removed. They are:
Age_of_Driver: which is captured by the Age_Band_of_Driver variable.
Location_Easting_OSGR, Location_Northing_OSGR: Ordnance Survey Grid Reference is a grid location surrogate for latitude and longitude.
Number_of_Casualties: already captured by ‘CasualtiesPerAccident’
LSOA_of_Accident_Location : Lower Layer Super Output Area is a geographical location surrogate for latitude and longitude.
Also there are variables that are unlikely to contribute to the accident or number of casualties, like:
Did_Police_Officer_Attend_Scene_of_Accident: police officer(s) are likely to have visited after the accident.
Accident_Index: Likely to have been assigned after the accident.
So we remove all these variables. Additionally, we do some ‘housekeeping’ operations like converting the time of accident, day, week of month and month of year into categorical variables.
From the data, it may be inferred that -1 is the NA string for the data set. As part of the 'housekeeping' operations, we remove the NA values. To limit data loss, we first check which columns have greater than 10% NA and selectively remove those columns. We then proceed to remove rows that contain any NAs.
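A sketch of this housekeeping, assuming the merged 2014 data frame is called df and all remaining columns are numeric codes at this point:

```r
# Recode the sentinel value -1 as NA across the data frame.
df[df == -1] <- NA

na.frac <- colMeans(is.na(df))     # fraction of NAs per column
df <- df[, na.frac <= 0.10]        # drop columns with > 10% NA
df <- df[complete.cases(df), ]     # drop rows with any remaining NA

sum(is.na(df))                     # 0 once cleaning succeeds, as printed below
```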
## [1] 0
## [1] 266872 38
## 'data.frame': 266872 obs. of 38 variables:
## $ Vehicle_Type : int 8 19 9 1 9 3 9 9 3 9 ...
## $ Towing_and_Articulation : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Vehicle_Manoeuvre : int 18 15 2 14 9 13 4 6 18 2 ...
## $ Vehicle_Location.Restricted_Lane : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Junction_Location : int 0 0 1 1 5 8 1 8 8 0 ...
## $ Skidding_and_Overturning : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Hit_Object_in_Carriageway : int 0 0 0 4 0 0 0 0 0 0 ...
## $ Vehicle_Leaving_Carriageway : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Hit_Object_off_Carriageway : int 0 0 0 0 0 0 0 0 0 0 ...
## $ X1st_Point_of_Impact : int 4 1 3 4 3 1 1 3 1 3 ...
## $ Was_Vehicle_Left_Hand_Drive. : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Journey_Purpose_of_Driver : int 1 6 6 6 6 6 6 6 6 6 ...
## $ Sex_of_Driver : int 1 1 1 2 1 1 1 1 1 1 ...
## $ Accident_Severity : int 3 3 3 3 3 3 3 3 3 3 ...
## $ X1st_Road_Class : int 3 3 3 3 3 3 5 3 3 3 ...
## $ Road_Type : int 6 6 6 6 6 6 6 6 6 2 ...
## $ Speed_limit : int 30 30 30 30 30 30 30 30 30 30 ...
## $ Junction_Detail : int 0 0 5 5 3 3 3 7 7 0 ...
## $ Pedestrian_Crossing.Human_Control : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pedestrian_Crossing.Physical_Facilities: int 0 0 5 5 0 0 1 8 8 0 ...
## $ Light_Conditions : int 1 1 7 7 1 1 4 1 1 1 ...
## $ Weather_Conditions : int 2 2 1 1 1 1 1 1 1 1 ...
## $ Road_Surface_Conditions : int 2 2 1 1 1 1 1 1 1 1 ...
## $ Special_Conditions_at_Site : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Carriageway_Hazards : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Urban_or_Rural_Area : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Police_Force : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Number_of_Vehicles : int 2 2 2 2 2 2 1 2 2 3 ...
## $ Local_Authority_.District. : int 12 12 12 12 12 12 12 12 12 12 ...
## $ X1st_Road_Number : int 315 315 3218 3218 308 308 0 4 4 3220 ...
## $ X2nd_Road_Number : int 0 0 3220 3220 0 0 0 4 4 0 ...
## $ Time : Factor w/ 7 levels "Evening","Evening-Rush",..: 7 7 3 3 4 4 2 5 5 7 ...
## $ Month : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WeekOfMonth : Factor w/ 5 levels "1","2","3","4",..: 2 2 3 3 3 3 3 2 2 3 ...
## $ Day : Factor w/ 7 levels "Friday","Monday",..: 5 5 2 2 6 6 7 5 5 1 ...
## $ CasualtiesPerAccident : num 0.5 0.5 0.5 0.5 0.5 0.5 1 0.5 0.5 0.333 ...
## $ Longitude : num -0.206 -0.206 -0.19 -0.19 -0.174 ...
## $ Latitude : num 51.5 51.5 51.5 51.5 51.5 ...
Let's begin with a few high-level summaries of the casualty data.
##
## 0.053 0.111 0.125 0.143 0.167 0.2 0.222 0.238 0.25 0.273
## 19 9 96 119 354 1039 9 21 4423 11
## 0.286 0.3 0.308 0.333 0.375 0.4 0.429 0.444 0.5 0.556
## 98 20 13 18907 48 675 91 9 138034 9
## 0.571 0.6 0.667 0.7 0.714 0.75 0.8 0.833 0.857 0.889
## 14 465 8985 10 28 1484 225 66 7 9
## 1 1.2 1.25 1.3 1.333 1.375 1.4 1.429 1.5 1.6
## 72331 50 276 10 1350 8 20 7 7904 30
## 1.667 1.75 1.8 2 2.2 2.25 2.333 2.5 2.667 2.75
## 593 72 10 6292 5 28 72 930 21 4
## 3 3.2 3.25 3.333 3.4 3.5 3.667 3.8 4 4.5
## 985 5 4 9 5 84 3 5 277 22
## 4.667 5 5.5 6 6.5 7 7.5 8 8.5 9
## 3 102 8 17 2 5 2 5 2 3
## 10 10.5 20.5 27 43.5 46.5 54
## 2 4 2 1 2 2 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0530 0.5000 0.5000 0.7246 1.0000 54.0000
The summary shows that the 75th percentile is 1, meaning that for almost three quarters of the accidents the number of casualties per accident is at most 1.
Let's look at the locations of these accidents to identify any areas with a high concentration of accidents.
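One way to produce such a density plot is sketched below with ggplot2; df is assumed to be the cleaned 2014 data frame with Longitude and Latitude columns:

```r
library(ggplot2)

# 2-D kernel density of accident locations; darker regions are hot spots.
ggplot(df, aes(x = Longitude, y = Latitude)) +
  stat_density2d(aes(fill = after_stat(level)), geom = "polygon") +
  coord_fixed() +
  labs(title = "Accident density, UK 2014")
```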
This reveals striking accident hot spots, which can easily be corroborated as corresponding to major cities. The hottest accident zone appears to be the London area, followed by the Birmingham, Manchester, Sheffield and Leeds, and Newcastle areas. It will be interesting to see whether the predictive models identify these places as the predicted hot spots.
The density plots also reveal that the westernmost and northernmost parts of the UK do not seem to have many accidents compared to the rest of the UK, especially the south-east.
Let's explore the data in greater detail. To make the exploratory plots meaningful, let's consider the factors that are plausibly associated with an accident, and hence with casualties per accident:
It has to be kept in mind that the meanings of the coded categories are not given in the data set, and no attempt shall be made to infer what the categories could be. They are taken at face value.
Let's begin by looking at the 'Day' of the week.
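The three tables that follow are, in order, the mean, median and standard deviation of casualties per accident by day. A sketch of how they could be produced, assuming the cleaned data frame df:

```r
# Per-day summaries of the casualty rate, rounded to two decimals.
round(tapply(df$CasualtiesPerAccident, df$Day, mean),   2)
round(tapply(df$CasualtiesPerAccident, df$Day, median), 2)
round(tapply(df$CasualtiesPerAccident, df$Day, sd),     2)
```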
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0.71 0.71 0.78 0.80 0.70 0.70 0.70
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## Friday Monday Saturday Sunday Thursday Tuesday Wednesday
## 0.42 0.54 0.48 0.52 0.41 0.58 0.40
There appears to be some effect of day on the number of casualties. Part of that trend is driven by outliers, which obscure the actual scale of the differences among the 'not-so-extreme' values; this can be inferred from the differences in the mean values. So the day of the week likely has some effect on the casualties, but not a large one.
Let's turn to the week of the month.
## 1 2 3 4 5
## 0.72 0.72 0.72 0.73 0.72
## 1 2 3 4 5
## 0.5 0.5 0.5 0.5 0.5
## 1 2 3 4 5
## 0.49 0.45 0.55 0.44 0.44
We can very clearly see that the fifth week of every month has far fewer accidents than the other weeks. But this is simply because few months have a 'full' fifth week. Apart from this, there are no notable trends.
On to month…
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.73 0.73 0.72 0.73 0.72 0.72 0.72 0.74 0.71 0.71 0.72 0.74
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.43 0.42 0.43 0.44 0.46 0.59 0.44 0.58 0.43 0.60 0.46 0.43
Again, similar results: no strong trends. There is high variability in the casualties in the months of June, August and October, visible in the boxplot. So month seems to have a limited effect on the outcomes.
Let's now look at the effect of the times at which these accidents occurred. We categorize the times into the customary rush/lean (akin to peak/off-peak) hours in the morning, evening, etc.
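The binning could be done with cut(), assuming the raw Time field is an "HH:MM" string. The cut points below are illustrative only; the report does not document the actual hour boundaries behind its seven levels:

```r
# Extract the hour and bin it into the seven rush/lean categories.
hour <- as.integer(substr(df$Time, 1, 2))
df$Time <- cut(hour,
               breaks = c(-1, 4, 6, 9, 14, 16, 19, 23),   # assumed boundaries
               labels = c("Mid-Night", "Morning", "Morning-Rush", "Noon",
                          "Evening-Rush", "Evening", "Night"))
```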
## Evening Evening-Rush Mid-Night Morning Morning-Rush
## 0.75 0.70 0.83 0.72 0.67
## Night Noon
## 0.81 0.73
## Evening Evening-Rush Mid-Night Morning Morning-Rush
## 0.5 0.5 0.5 0.5 0.5
## Night Noon
## 0.5 0.5
## Evening Evening-Rush Mid-Night Morning Morning-Rush
## 0.45 0.40 0.56 0.42 0.62
## Night Noon
## 0.53 0.43
Now we see some trends. There are observably more accidents during the 'daytime' periods of morning rush, noon and evening rush, while the mean casualties per accident are highest at mid-night and night. Either way, the time of day is a useful predictor.
Let's look at the effect of vehicle type.
## 1 2 3 4 5 8 9 10 11 16 17 18 19 20 21
## 0.54 0.63 0.63 0.66 0.68 0.80 0.75 0.89 0.98 0.54 0.64 0.93 0.69 0.67 0.66
## 22 23 90 97 98
## 0.64 0.72 0.70 0.69 0.62
## 1 2 3 4 5 8 9 10 11 16 17 18 19 20 21 22 23 90
## 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 0.5 0.5 1.0 0.5 0.5 0.5 0.5 0.5 0.5
## 97 98
## 0.5 0.5
## 1 2 3 4 5 8 9 10 11 16 17 18 19 20 21
## 0.18 0.26 0.28 0.32 0.33 0.45 0.46 0.83 1.40 0.21 0.35 0.34 0.39 0.38 0.49
## 22 23 90 97 98
## 0.27 0.47 0.40 0.32 0.31
Let's replot the first plot on a more insightful scale.
Vehicle type is clearly an important predictor, all the more apparent when plotted on a more concise scale.
Exploring the vehicle manoeuvre that led to the accident.
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.83 0.53 0.64 0.68 0.70 0.67 0.68 0.65 0.70 0.68 0.64 0.65 0.63 0.68 0.60
## 16 17 18
## 0.94 0.95 0.75
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 1.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0 1.0 0.5
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.36 0.30 0.37 0.39 0.34 0.37 0.36 0.35 0.40 0.39 0.38 0.38 0.37 0.36 0.33
## 16 17 18
## 0.90 0.88 0.47
Clearly, some manoeuvres are more likely to result in casualties than others. It can also be inferred that the vehicle manoeuvre is a predictor for latitude and longitude, as one might associate certain manoeuvres with certain conditions, such as turning on a curvy stretch of road.
Let's conclude the exploratory section by investigating the effect of the ambient surroundings, namely light and weather conditions and whether the location is rural or urban, on casualties, broken down by the sex of the driver.
## driverSex x
## 1 1 0.7276312
## 2 2 0.7376330
## 3 3 0.6305622
## light driverSex x
## 1 1 1 0.7083315
## 2 4 1 0.7490012
## 3 5 1 0.7970448
## 4 6 1 0.9407551
## 5 7 1 0.6983057
## 6 1 2 0.7271157
## 7 4 2 0.7404526
## 8 5 2 0.7653305
## 9 6 2 0.9453716
## 10 7 2 0.7041431
## 11 1 3 0.6208016
## 12 4 3 0.6535934
## 13 5 3 0.6063437
## 14 6 3 0.6400392
## 15 7 3 0.6555103
Light condition #6 seems to affect male and female drivers the same way, in that both sexes have a higher casualty count under it.
Clearly the weather and light conditions contribute to the casualties; so does the sex of the driver, but to a lesser extent.
As a final plot, let's visualize the correlations among all the variables.
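A sketch of one way to build this plot, coercing the factor columns back to numeric codes and using the corrplot package (an assumption; any correlation-matrix plotter would do):

```r
library(corrplot)

# Convert every column to numeric (factors become their integer codes).
num <- data.frame(lapply(df, as.numeric))
M <- cor(num)

summary(M[upper.tri(M)])                    # distribution of pairwise correlations
corrplot(M, method = "circle", tl.cex = 0.5)
```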
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.4366000 -0.0231000 0.0001786 0.0076190 0.0284400 0.9804000
## [1] 1
This plot gives good insight into the factors that affect the casualties. The most significant correlation is with Number_of_Vehicles, at close to -0.3. This correlation makes sense, as the number of vehicles involved would certainly influence the number of casualties. What's counter-intuitive is the sign of the correlation: it seems to suggest that the fewer the vehicles, the higher the casualties. This might indicate that some types of vehicle (possibly large ones) are 'prone' to get into an accident 'by themselves', resulting in high casualties.
We can also note that longitude is correlated with 'Police Force' and 'Local Authority District'.
Since we are dealing with a large data set, it's worth setting up parallel processing to save processing time. We do this with the 'doParallel' library.
## [1] "Number of registered cores is 4"
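The registration producing the message above could be sketched as follows; caret then spreads its resampling iterations across the registered workers:

```r
library(doParallel)

# Start a cluster with one worker per available core and register it.
cl <- makeCluster(detectCores())
registerDoParallel(cl)

paste("Number of registered cores is", getDoParWorkers())
```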
We split the data into training and test sets, using a 70/30 split.
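With caret, the split could look like this; the seed value is an assumption, chosen only to make the sketch reproducible:

```r
library(caret)

set.seed(1234)   # assumed seed, not stated in the report
inTrain  <- createDataPartition(df$CasualtiesPerAccident, p = 0.7, list = FALSE)
training <- df[inTrain, ]
testing  <- df[-inTrain, ]
```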
We build a Stochastic Gradient Boosting model for modelling the accident data.
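A sketch of the fit consistent with the caret output below: method = "gbm" with caret's defaults (bootstrap resampling, a 3 x 3 grid over n.trees and interaction.depth, shrinkage held at 0.1). Variable names are assumptions:

```r
# Fit the boosting model; verbose = TRUE prints the Iter/TrainDeviance trace.
modCasualty <- train(CasualtiesPerAccident ~ ., data = training,
                     method = "gbm", verbose = TRUE)

# Test-set error.
predCasualty <- predict(modCasualty, newdata = testing)
round(RMSE(predCasualty, testing$CasualtiesPerAccident), 2)
```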
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2079 nan 0.1000 0.0072
## 2 0.2020 nan 0.1000 0.0059
## 3 0.1973 nan 0.1000 0.0048
## 4 0.1934 nan 0.1000 0.0039
## 5 0.1902 nan 0.1000 0.0032
## 6 0.1875 nan 0.1000 0.0026
## 7 0.1853 nan 0.1000 0.0022
## 8 0.1835 nan 0.1000 0.0018
## 9 0.1819 nan 0.1000 0.0016
## 10 0.1807 nan 0.1000 0.0012
## 20 0.1743 nan 0.1000 0.0004
## 40 0.1696 nan 0.1000 0.0001
## 60 0.1675 nan 0.1000 0.0001
## 80 0.1662 nan 0.1000 0.0001
## 100 0.1655 nan 0.1000 0.0000
## 120 0.1644 nan 0.1000 0.0000
## 140 0.1639 nan 0.1000 0.0000
## 150 0.1633 nan 0.1000 0.0000
## Stochastic Gradient Boosting
##
## 186812 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD
## 1 50 0.4148981 0.1988442 0.02360560
## 1 100 0.4107483 0.2118668 0.02381988
## 1 150 0.4089926 0.2172836 0.02390877
## 2 50 0.4103671 0.2140572 0.02383323
## 2 100 0.4068020 0.2254588 0.02401245
## 2 150 0.4054773 0.2297384 0.02403900
## 3 50 0.4078799 0.2226987 0.02390043
## 3 100 0.4046594 0.2330524 0.02408430
## 3 150 0.4037384 0.2359248 0.02389452
## Rsquared SD
## 0.01806151
## 0.01919894
## 0.01965441
## 0.01932151
## 0.02029407
## 0.02039820
## 0.01980134
## 0.02069083
## 0.02000347
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.47
Similar models were fit for latitude and longitude as well.
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6593 nan 0.1000 0.2682
## 2 1.4363 nan 0.1000 0.2228
## 3 1.2594 nan 0.1000 0.1771
## 4 1.1118 nan 0.1000 0.1476
## 5 0.9923 nan 0.1000 0.1200
## 6 0.8933 nan 0.1000 0.0989
## 7 0.8165 nan 0.1000 0.0774
## 8 0.7508 nan 0.1000 0.0660
## 9 0.6971 nan 0.1000 0.0535
## 10 0.6589 nan 0.1000 0.0381
## 20 0.4225 nan 0.1000 0.0094
## 40 0.2735 nan 0.1000 0.0071
## 60 0.1971 nan 0.1000 0.0015
## 80 0.1440 nan 0.1000 0.0021
## 100 0.1164 nan 0.1000 0.0006
## 120 0.0979 nan 0.1000 0.0007
## 140 0.0863 nan 0.1000 0.0007
## 150 0.0797 nan 0.1000 0.0007
## Stochastic Gradient Boosting
##
## 186812 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD
## 1 50 0.8081185 0.7187798 0.002148632
## 1 100 0.6863972 0.7832400 0.002062606
## 1 150 0.6314684 0.8072319 0.001931428
## 2 50 0.5737028 0.8451793 0.002964524
## 2 100 0.4421717 0.9052640 0.003035276
## 2 150 0.3595747 0.9368570 0.003742017
## 3 50 0.4761200 0.8919498 0.004231754
## 3 100 0.3401443 0.9435872 0.002307612
## 3 150 0.2842114 0.9597367 0.002087082
## Rsquared SD
## 0.0019040074
## 0.0015607715
## 0.0013080224
## 0.0017634120
## 0.0014980194
## 0.0014387903
## 0.0026789789
## 0.0007870951
## 0.0006196809
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.9800938
## [1] 0.28
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6606 nan 0.1000 0.3021
## 2 1.4115 nan 0.1000 0.2471
## 3 1.2128 nan 0.1000 0.1983
## 4 1.0484 nan 0.1000 0.1647
## 5 0.9110 nan 0.1000 0.1384
## 6 0.7996 nan 0.1000 0.1109
## 7 0.6898 nan 0.1000 0.1097
## 8 0.6190 nan 0.1000 0.0706
## 9 0.5412 nan 0.1000 0.0777
## 10 0.4781 nan 0.1000 0.0630
## 20 0.2199 nan 0.1000 0.0137
## 40 0.1086 nan 0.1000 0.0026
## 60 0.0748 nan 0.1000 0.0007
## 80 0.0571 nan 0.1000 0.0005
## 100 0.0452 nan 0.1000 0.0007
## 120 0.0384 nan 0.1000 0.0002
## 140 0.0335 nan 0.1000 0.0001
## 150 0.0319 nan 0.1000 0.0001
## Stochastic Gradient Boosting
##
## 186812 samples
## 35 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 186812, 186812, 186812, 186812, 186812, 186812, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees RMSE Rsquared RMSE SD
## 1 50 0.6915950 0.8562537 0.002699728
## 1 100 0.4862036 0.9119855 0.002474972
## 1 150 0.4145690 0.9217630 0.002160972
## 2 50 0.3694534 0.9408833 0.003757698
## 2 100 0.2704362 0.9641916 0.003695063
## 2 150 0.2247927 0.9750049 0.002794169
## 3 50 0.2976922 0.9583498 0.002668562
## 3 100 0.2127352 0.9779870 0.002685592
## 3 150 0.1790564 0.9840712 0.002694375
## Rsquared SD
## 0.0038390426
## 0.0011016636
## 0.0004799897
## 0.0014909819
## 0.0010122493
## 0.0006408788
## 0.0007504481
## 0.0005613151
## 0.0004254851
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
## [1] 0.9921451
## [1] 0.18
Let's look at the model, predictions and results in a bit more detail:
The root mean square error of casualties per accident in the training set is 0.42. This means that, on average in the training set, the actual number of casualties differs by 0.42 from what the model has 'learnt' the casualty count to be under the given conditions.
The RMSE for the test data is 0.44. While the interpretation stays the same as above, it is very close to the RMSE of the training data, which suggests the model hasn't overfit.
The R-squared for the casualty prediction is on the lower side, at about 0.22.
On the other hand, the RMSEs for latitude and longitude in the training data are 0.18 and 0.28 respectively, and the same holds for the test data.
On one hand this is quite satisfying; on the other, it raises the question of whether any of the variables in the data set were surrogates for latitude and longitude. This could be explored further if more documentation were available for the variables and their definitions.
Let's visualize the results and explore the errors qualitatively.
The predictions for casualties align well with our observations from the exploratory plots.
The predictions for longitude also align well with our observation from the exploratory section that police force and local authority district are correlated with longitude.
Let's analyze qualitatively how close the training and test predictions are to the actual values.
## [1] 0.01
It can be seen that the fit isn't particularly great. The model doesn't seem to pick up the extreme values at all, possibly because of the very low incidence of extreme values (> 3 casualties/accident), i.e. around 0.1% of the training data.
Similar plots can be made for latitude and longitude.
## [1] 80060 3
These plots reveal that the model is ‘reasonably’ successful in predicting the hot-spots.
The research goal was to analyze the UK accident data set and predict accident hot spots and the number of casualties. Analyses were set up for the year 2014. The 3 data sets were cleaned and merged into a single data set. Exploratory analyses were performed, and it was observed that:
The number of casualties was correlated with vehicle type, manoeuvre, sex of driver, month, day, time, urban or rural area, and weather and light conditions.
The data were cleaned of missing values and split into training and test sets. Boosting models were developed, with RMSEs of 0.44, 0.18 and 0.28 for casualties, latitude and longitude respectively. This implies that, on average, the algorithm predicted the location of an accident to within (0.18, 0.28) degrees, which translates to roughly 37 km. So the algorithm is capable of predicting the location of an accident to within about 37 km of its actual location. Hypothesis testing and confidence-interval analysis could be performed on this statistic, but that is not covered in this study.
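The arithmetic behind the 37 km figure appears to take one degree as roughly 111 km in both latitude and longitude and combine the two RMSEs as a Euclidean distance. (Strictly, a degree of longitude at UK latitudes is nearer 69 km, which would shrink the figure to roughly 28 km.)

```r
# Degrees-to-km conversion behind the "within 37 km" claim.
lat.err <- 0.18 * 111    # ~20 km
lon.err <- 0.28 * 111    # ~31 km
round(sqrt(lat.err^2 + lon.err^2))   # ~37 km
```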
The diagnostics suggest there was little overfitting.