Load Libraries
1
2
3
4
5
6
|
suppressPackageStartupMessages({
library(tidyverse)
library(janitor)
library(lubridate)
library(e1071)
})
|
Load and Explore Data
I found this huge and interesting dataset on NYC’s Open Data portal and wanted to use it to practice with the new knowledge and skills that I learned recently. My goal was to create a classifier model that can predict if motor vehicle collisions would result in injuries or not to anyone involved in the accident based on several predictors.
I saw there was a way to query the API directly and get the data into R without downloading it as a csv, and I did figure out how to do that for other open datasets that were smaller, but I had trouble with this large dataset. I ended up using the NYC Open Data’s query tool to filter the dataset beforehand and downloaded it from there. I chose the borough of Queens to work with as there are many more highways and potential for accidents resulting in injuries there, rather than Manhattan where vehicles move very slowly through the grid shaped streets. I also filtered it for the calendar year of 2023 to further reduce the number of observations.
1
2
3
|
motor_raw <- read.csv("/Users/michellegulotta/Desktop/motor_raw.csv")
motor_raw <- motor_raw %>%
janitor::clean_names(., "snake")
|
I wanted to also minimize missing values to deal with, so for the sake of simplicity I only focused on the data in the first columns for contributing factor and vehicle type.
1
2
|
motor_cat <- motor_raw %>%
dplyr::select(crash_date, crash_time, collision_id, number_of_persons_injured, number_of_persons_killed, contributing_factor_vehicle_1, vehicle_type_code_1)
|
Now to check out the variables remaining along with the data types associated with each column.
1
2
3
4
5
6
7
8
|
## 'data.frame': 17822 obs. of 7 variables:
## $ crash_date : chr "11/16/2023" "11/17/2023" "11/17/2023" "11/12/2023" ...
## $ crash_time : chr "15:45" "6:57" "14:30" "10:58" ...
## $ collision_id : int 4679634 4679868 4679994 4679991 4663636 4663437 4629913 4631543 4631801 4632069 ...
## $ number_of_persons_injured : int 1 0 0 0 0 0 0 4 0 1 ...
## $ number_of_persons_killed : int 0 0 0 0 0 0 0 0 0 0 ...
## $ contributing_factor_vehicle_1: chr "Driver Inexperience" "Other Vehicular" "Unspecified" "Unspecified" ...
## $ vehicle_type_code_1 : chr "Motorcycle" "Sedan" "Station Wagon/Sport Utility Vehicle" "Station Wagon/Sport Utility Vehicle" ...
|
Clean and Manipulate Data
Time is stored as character and is in the 24 hour format, I just want the hour as I am going to use that to create a time of day category to use as a predictor. I also made a month category using the lubridate package, but I did not end up using this as a predictor, it was interesting to see how the lubridate package can help me in future projects.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
|
motor_cat <- motor_cat %>%
mutate(hour = substr(motor_cat$crash_time,
start = 1,
stop = 2),
across(
.cols = c(contributing_factor_vehicle_1, vehicle_type_code_1),
.fns = ~if_else(.x == "", NA, .x)
),
month = as.Date(crash_date, format = "%m/%d/%Y")) %>%
rename(type = vehicle_type_code_1,
contrib = contributing_factor_vehicle_1)
motor_cat$hour <- str_replace_all(motor_cat$hour, ":", "")
motor_cat$hour <- as.integer(motor_cat$hour)
motor_cat$month <- month(ymd(motor_cat$month))
|
I’m going to come back to this later, I want to do all of the category creating at once so I’m going to work on the contributing factor and vehicle type variables. In the previous code I renamed these variables to make the name shorter as I’m going to be renaming a lot of stuff later as you’ll see.
1
|
motor_cat %>% tabyl(contrib)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
|
## contrib n percent
## Accelerator Defective 13 7.294355e-04
## Aggressive Driving/Road Rage 145 8.136012e-03
## Alcohol Involvement 408 2.289305e-02
## Animals Action 14 7.855460e-04
## Backing Unsafely 691 3.877230e-02
## Brakes Defective 69 3.871619e-03
## Cell Phone (hand-Held) 11 6.172147e-04
## Cell Phone (hands-free) 1 5.611043e-05
## Driver Inattention/Distraction 4204 2.358882e-01
## Driver Inexperience 372 2.087308e-02
## Driverless/Runaway Vehicle 24 1.346650e-03
## Drugs (illegal) 11 6.172147e-04
## Eating or Drinking 3 1.683313e-04
## Failure to Keep Right 35 1.963865e-03
## Failure to Yield Right-of-Way 1830 1.026821e-01
## Fatigued/Drowsy 27 1.514981e-03
## Fell Asleep 117 6.564920e-03
## Following Too Closely 988 5.543710e-02
## Glare 20 1.122209e-03
## Headlights Defective 1 5.611043e-05
## Illnes 36 2.019975e-03
## Lane Marking Improper/Inadequate 7 3.927730e-04
## Lost Consciousness 50 2.805521e-03
## Obstruction/Debris 27 1.514981e-03
## Other Electronic Device 3 1.683313e-04
## Other Lighting Defects 1 5.611043e-05
## Other Vehicular 362 2.031197e-02
## Outside Car Distraction 47 2.637190e-03
## Oversized Vehicle 33 1.851644e-03
## Passenger Distraction 34 1.907754e-03
## Passing Too Closely 656 3.680844e-02
## Passing or Lane Usage Improper 959 5.380990e-02
## Pavement Defective 14 7.855460e-04
## Pavement Slippery 60 3.366626e-03
## Pedestrian/Bicyclist/Other Pedestrian Error/Confusion 129 7.238245e-03
## Physical Disability 6 3.366626e-04
## Prescription Medication 2 1.122209e-04
## Reaction to Uninvolved Vehicle 125 7.013803e-03
## Shoulders Defective/Improper 1 5.611043e-05
## Steering Failure 53 2.973853e-03
## Tinted Windows 4 2.244417e-04
## Tire Failure/Inadequate 30 1.683313e-03
## Tow Hitch Defective 2 1.122209e-04
## Traffic Control Device Improper/Non-Working 6 3.366626e-04
## Traffic Control Disregarded 742 4.163394e-02
## Turning Improperly 468 2.625968e-02
## Unsafe Lane Changing 193 1.082931e-02
## Unsafe Speed 770 4.320503e-02
## Unspecified 3754 2.106385e-01
## Using On Board Navigation Device 1 5.611043e-05
## Vehicle Vandalism 5 2.805521e-04
## View Obstructed/Limited 150 8.416564e-03
## <NA> 108 6.059926e-03
## valid_percent
## 7.338828e-04
## 8.185616e-03
## 2.303263e-02
## 7.903353e-04
## 3.900869e-02
## 3.895224e-03
## 6.209778e-04
## 5.645252e-05
## 2.373264e-01
## 2.100034e-02
## 1.354861e-03
## 6.209778e-04
## 1.693576e-04
## 1.975838e-03
## 1.033081e-01
## 1.524218e-03
## 6.604945e-03
## 5.577509e-02
## 1.129050e-03
## 5.645252e-05
## 2.032291e-03
## 3.951677e-04
## 2.822626e-03
## 1.524218e-03
## 1.693576e-04
## 5.645252e-05
## 2.043581e-02
## 2.653269e-03
## 1.862933e-03
## 1.919386e-03
## 3.703286e-02
## 5.413797e-02
## 7.903353e-04
## 3.387151e-03
## 7.282376e-03
## 3.387151e-04
## 1.129050e-04
## 7.056565e-03
## 5.645252e-05
## 2.991984e-03
## 2.258101e-04
## 1.693576e-03
## 1.129050e-04
## 3.387151e-04
## 4.188777e-02
## 2.641978e-02
## 1.089534e-02
## 4.346844e-02
## 2.119228e-01
## 5.645252e-05
## 2.822626e-04
## 8.467879e-03
## NA
|
I’m going to categorize these into much fewer categories to make it easier to create a model, how about now with the type of car as the primary vehicle involved in the accident.
1
|
motor_cat %>% tabyl(type)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
|
## type n percent valid_percent
## 2 dr sedan 1 5.611043e-05 5.703205e-05
## 3-Door 2 1.122209e-04 1.140641e-04
## 4 dr sedan 11 6.172147e-04 6.273526e-04
## AMBU 2 1.122209e-04 1.140641e-04
## AMBULANCE 15 8.416564e-04 8.554808e-04
## AMBULENCE 2 1.122209e-04 1.140641e-04
## Ambulance 78 4.376613e-03 4.448500e-03
## Armored Truck 4 2.244417e-04 2.281282e-04
## BOOM MOPED 1 5.611043e-05 5.703205e-05
## Beverage Truck 2 1.122209e-04 1.140641e-04
## Bike 205 1.150264e-02 1.169157e-02
## Box Truck 176 9.875435e-03 1.003764e-02
## Bulk Agriculture 3 1.683313e-04 1.710962e-04
## Bus 247 1.385928e-02 1.408692e-02
## Carry All 7 3.927730e-04 3.992244e-04
## Chassis Cab 6 3.366626e-04 3.421923e-04
## Commercial 1 5.611043e-05 5.703205e-05
## Concrete Mixer 3 1.683313e-04 1.710962e-04
## Convertible 27 1.514981e-03 1.539865e-03
## Courier 1 5.611043e-05 5.703205e-05
## DELIVERY T 1 5.611043e-05 5.703205e-05
## DUMP 1 5.611043e-05 5.703205e-05
## DUMP TRUCK 1 5.611043e-05 5.703205e-05
## Dump 36 2.019975e-03 2.053154e-03
## E-Bike 154 8.641005e-03 8.782936e-03
## E-Scooter 75 4.208282e-03 4.277404e-03
## Electric m 1 5.611043e-05 5.703205e-05
## FDNY 1 5.611043e-05 5.703205e-05
## FDNY AMBUL 2 1.122209e-04 1.140641e-04
## FDNY FIRE 1 5.611043e-05 5.703205e-05
## FDNY truck 1 5.611043e-05 5.703205e-05
## FIRE ENGIN 1 5.611043e-05 5.703205e-05
## FIRE TRUCK 3 1.683313e-04 1.710962e-04
## FIRETRUCK 3 1.683313e-04 1.710962e-04
## FLYWING MO 1 5.611043e-05 5.703205e-05
## FOOD CART 1 5.611043e-05 5.703205e-05
## FORK LIFT 1 5.611043e-05 5.703205e-05
## FREIGHT VA 1 5.611043e-05 5.703205e-05
## Fire Truck 1 5.611043e-05 5.703205e-05
## Fire engin 1 5.611043e-05 5.703205e-05
## Fire truck 1 5.611043e-05 5.703205e-05
## Firetruck 3 1.683313e-04 1.710962e-04
## Flat Bed 26 1.458871e-03 1.482833e-03
## Flat Rack 7 3.927730e-04 3.992244e-04
## Ford FF 1 5.611043e-05 5.703205e-05
## Forklift 1 5.611043e-05 5.703205e-05
## GARBAGE TR 2 1.122209e-04 1.140641e-04
## GAS MOPED 1 5.611043e-05 5.703205e-05
## GOLFCART 1 5.611043e-05 5.703205e-05
## Garbage or Refuse 24 1.346650e-03 1.368769e-03
## Government 1 5.611043e-05 5.703205e-05
## LIMO 3 1.683313e-04 1.710962e-04
## Ladder 1 5.611043e-05 5.703205e-05
## Ladder tru 1 5.611043e-05 5.703205e-05
## Lift Boom 1 5.611043e-05 5.703205e-05
## MC 1 5.611043e-05 5.703205e-05
## METRO TRAN 1 5.611043e-05 5.703205e-05
## MOPED 5 2.805521e-04 2.851603e-04
## MOTOR SCOO 1 5.611043e-05 5.703205e-05
## MOTORIZEDS 1 5.611043e-05 5.703205e-05
## Minicycle 1 5.611043e-05 5.703205e-05
## Moped 94 5.274380e-03 5.361013e-03
## Motorbike 11 6.172147e-04 6.273526e-04
## Motorcycle 188 1.054876e-02 1.072203e-02
## Motorscooter 21 1.178319e-03 1.197673e-03
## Mta bus 1 5.611043e-05 5.703205e-05
## Multi-Wheeled Vehicle 1 5.611043e-05 5.703205e-05
## NYC FIRE T 1 5.611043e-05 5.703205e-05
## OMS 1 5.611043e-05 5.703205e-05
## Omnibus 1 5.611043e-05 5.703205e-05
## Open Body 1 5.611043e-05 5.703205e-05
## PAS 2 1.122209e-04 1.140641e-04
## PK 34 1.907754e-03 1.939090e-03
## Pick-up Truck 428 2.401526e-02 2.440972e-02
## RV 2 1.122209e-04 1.140641e-04
## Red moped 1 5.611043e-05 5.703205e-05
## Refrigerated Van 1 5.611043e-05 5.703205e-05
## Road sweep 1 5.611043e-05 5.703205e-05
## SCHOOL BUS 1 5.611043e-05 5.703205e-05
## SCHOOLBUS 1 5.611043e-05 5.703205e-05
## SCOOTER 1 5.611043e-05 5.703205e-05
## SELF INSUR 1 5.611043e-05 5.703205e-05
## SPC 2 1.122209e-04 1.140641e-04
## Sanitation 1 5.611043e-05 5.703205e-05
## School Bus 2 1.122209e-04 1.140641e-04
## Sedan 8491 4.764336e-01 4.842592e-01
## Sprinter v 1 5.611043e-05 5.703205e-05
## Stake or Rack 1 5.611043e-05 5.703205e-05
## Station Wagon/Sport Utility Vehicle 6625 3.717316e-01 3.778373e-01
## TOW TRUCK 1 5.611043e-05 5.703205e-05
## TRAILER 2 1.122209e-04 1.140641e-04
## TRUCK 1 5.611043e-05 5.703205e-05
## Tanker 8 4.488834e-04 4.562564e-04
## Taxi 251 1.408372e-02 1.431505e-02
## Tow Truck 2 1.122209e-04 1.140641e-04
## Tow Truck / Wrecker 17 9.538772e-04 9.695449e-04
## Tow truck 1 5.611043e-05 5.703205e-05
## Tractor Tr 1 5.611043e-05 5.703205e-05
## Tractor Truck Diesel 64 3.591067e-03 3.650051e-03
## Tractor Truck Gasoline 13 7.294355e-04 7.414167e-04
## Trailer 2 1.122209e-04 1.140641e-04
## Truck 3 1.683313e-04 1.710962e-04
## UHAUL 1 5.611043e-05 5.703205e-05
## USPS 1 5.611043e-05 5.703205e-05
## USPS VEHIC 1 5.611043e-05 5.703205e-05
## UTILITY 1 5.611043e-05 5.703205e-05
## Van 74 4.152171e-03 4.220372e-03
## Van Camper 1 5.611043e-05 5.703205e-05
## Waste truc 1 5.611043e-05 5.703205e-05
## ambulance 1 5.611043e-05 5.703205e-05
## commerical 1 5.611043e-05 5.703205e-05
## delv 1 5.611043e-05 5.703205e-05
## forklift 1 5.611043e-05 5.703205e-05
## pick up tr 1 5.611043e-05 5.703205e-05
## van 2 1.122209e-04 1.140641e-04
## <NA> 288 1.615980e-02 NA
|
Creating Categories
This dataset is a mess, I’m going to clean it up here.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
|
motor_cat <- motor_cat %>%
mutate(contrib_cat = as.factor(case_when(
contrib == "Accelerator Defective" | contrib == "Brakes Defective" | contrib == "Steering Failure" | contrib == "Headlights Defective" | contrib == "Other Lighting Defects" | contrib == "Tinted Windows" | contrib == "Tire Failure/Inadequate" | contrib == "Tow Hitch Defective" ~ "Car Defects",
contrib == "Backing Unsafely" | contrib == "Driver Inexperience" | contrib == "Failure to Keep Right" | contrib == "Failure to Yield Right-of-Way" | contrib == "Following Too Closely" | contrib == "Passing Too Closely" | contrib == "Passing or Lane Usage Improper" | contrib == "Turning Improperly" ~ "Driver Error",
contrib == "Alcohol Involvement" | contrib == "Drugs (illegal)" | contrib == "Prescription Medication" ~ "Substances",
contrib == "Animals Action" | contrib == "Driverless/Runaway Vehicle" | contrib == "Glare" | contrib == "Lane Marking Improper/Inadequate" | contrib == "Obstruction/Debris" | contrib == "Pavement Defective" | contrib == "Pavement Slippery" | contrib == "Pedestrian/Bicyclist/Other Pedestrian Error/Confusion" | contrib == "Reaction to Uninvolved Vehicle" | contrib == "Shoulders Defective/Improper" | contrib == "Traffic Control Device Improper/Non-Working" | contrib == "View Obstructed/Limited" | contrib == "Vehicle Vandalism" ~ "Road Conditions",
contrib == "Cell Phone (hand-Held)" | contrib == "Cell Phone (hands-free)" | contrib == "Driver Inattention/Distraction" | contrib == "Eating or Drinking" | contrib == "Other Electronic Device" | contrib == "Outside Car Distraction" | contrib == "Using Onboard Navigation Device" ~ "Distracted Driver",
contrib == "Fatigued/Drowsy" | contrib == "Fell Asleep" | contrib == "Illnes" | contrib == "Lost Consciousness" | contrib == "Physical Disability" ~ "Condition of Driver",
contrib == "Aggressive Driving/Road Rage" | contrib == "Traffic Control Disregarded" | contrib == "Unsafe Lane Change" | contrib == "Unsafe Speed" ~ "Dangerous Driving",
contrib == "Unspecified" | is.na(contrib) | contrib == "Other Vehicular" ~ NA
)),
type_cat = as.factor(case_when(
type == "2 dr sedan" | type == "3-Door" | type == "4 dr sedan" | type == "Sedan" ~ "Sedan",
type == "Taxi" ~ "Taxi",
type == "Refrigerated Van" | type == "Van" | type == "Van Camper" | type == "van" ~ "Van",
type == "Station Wagon/Sport Utility Vehicle" ~ "SUV",
type == "AMBU" | type == "AMBULANCE" | type == "AMBULENCE" | type == "Ambulance" | type == "FDNY AMBUL" | type == "ambulance" ~ "Ambulance",
type == "Armored Truck" | type == "Beverage Truck" | type == "Box Truck" | type == "Concrete Mixer" | type == "DELIVERY T" | type == "DUMP" | type == "RV" | type == "DUMP TRUCK" | type == "Dump" | type == "Flat Bed" | type == "Flat Rack" | type == "GARBAGE TR" | type == "Garbage or Refuse" | type == "Road sweep" | type == "Sanitation" | type == "TOW TRUCK" | type == "TRAILER" | type == "TRUCK" | type == "Tanker" | type == "Tow Truck" | type == "Tow Truck / Wrecker" | type == "Tow truck" | type == "Tractor Tr" | type == "Tractor Truck Diesel" | type == "Tractor Truck Gasoline" | type == "Trailer" | type == "Truck" | type == "UHAUL" | type == "USPS" | type == "USPS VEHIC" | type == "UTILITY" | type == "Waste truc" ~ "Truck",
type == "Chassis Cab" | type == "Pick-up Truck" | type == "Pick up tr" ~ "Pick Up Truck",
type == "BOOM MOPED" | type == "GAS MOPED" | type == "MOPED" | type == "Moped"| type == "Motorbike" | type == "Motorcycle" | type == "Red moped" ~ "Motorcycle/Moped",
type == "Bike" | type == "E-Bike" | type == "Electric m" | type == "FLYWING MO" ~ "Bike/E-Bike",
type == "E-Scooter" | type == "MOTOR SCOO" | type == "MOTORIZEDS" | type == "Motorscooter" | type == "SCOOTER" ~ "Scooter/E-Scooter",
type == "Bus" | type == "Mta bus" | type == "Omnibus" | type == "SCHOOL BUS" | type == "SCHOOLBUS" | type == "School Bus" ~ "Bus",
type == "FDNY" | type == "FDNY FIRE" | type == "FDNY truck" | type == "FIRE ENGIN" | type == "FIRE TRUCK" | type == "FIRETRUCK" | type == "Fire Truck" | type == "Fire engin" | type == "Fire truck" | type == "Firetruck" | type == "Ladder" | type == "Ladder tru"| type == "NYC Fire T" ~ "Fire Truck",
is.na(type) ~ NA,
TRUE ~ "Other"
)),
time_of_day = as.factor(case_when(
hour < 6 | hour > 20 ~ "Night",
hour < 12 ~ "Morning",
hour < 18 ~ "Afternoon",
hour <= 20 ~ "Evening"
)),
fatal_cat = as.factor(case_when(
number_of_persons_killed > 0 ~ "Fatal",
TRUE ~ "Non Fatal")
),
injured_cat = as.factor(case_when(
number_of_persons_injured > 0 ~ "Injuries",
TRUE ~ "No Injuries")
))
|
Phew, that was a lot of work. Now I’m going to check to make sure that I didn’t leave anything out and everything was either categorized or caught by the catch all at the end of the new type of vehicle category variable.
1
|
motor_cat %>% tabyl(contrib_cat)
|
1
2
3
4
5
6
7
8
9
|
## contrib_cat n percent valid_percent
## Car Defects 173 0.009707104 0.01297143
## Condition of Driver 236 0.013242060 0.01769513
## Dangerous Driving 1657 0.092974975 0.12424083
## Distracted Driver 4269 0.239535406 0.32008698
## Driver Error 5999 0.336606441 0.44980130
## Road Conditions 582 0.032656268 0.04363800
## Substances 421 0.023622489 0.03156632
## <NA> 4485 0.251655258 NA
|
Looks good, now the type of vehicle category.
1
|
motor_cat %>% tabyl(type_cat)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
## type_cat n percent valid_percent
## Ambulance 100 0.005611043 0.005703205
## Bike/E-Bike 361 0.020255864 0.020588571
## Bus 253 0.014195938 0.014429109
## Fire Truck 18 0.001009988 0.001026577
## Motorcycle/Moped 301 0.016889238 0.017166648
## Other 102 0.005723263 0.005817269
## Pick Up Truck 434 0.024351925 0.024751911
## Scooter/E-Scooter 99 0.005554932 0.005646173
## Sedan 8505 0.477219167 0.485057602
## SUV 6625 0.371731568 0.377837345
## Taxi 251 0.014083717 0.014315045
## Truck 407 0.022836943 0.023212045
## Van 78 0.004376613 0.004448500
## <NA> 288 0.016159802 NA
|
Here’s the final dataset, I generally display it in a tibble format as the blogdown format seems to display only the first few rows of tibbles which I like as it does not take up too much space on my blog post.
1
2
3
4
5
6
|
motor_cat_df <- motor_cat %>%
filter(!is.na(contrib_cat) & !is.na(type_cat)) %>%
dplyr::select(collision_id, month, hour, contrib_cat, type_cat, time_of_day, injured_cat)
motor_cat_final <- as_tibble(motor_cat_df)
motor_cat_final
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
## # A tibble: 13,150 × 7
## collision_id month hour contrib_cat type_cat time_of_day injured_cat
## <int> <dbl> <int> <fct> <fct> <fct> <fct>
## 1 4679634 11 15 Driver Error Motorcycl… Afternoon Injuries
## 2 4663636 9 7 Distracted Driver Sedan Morning No Injuries
## 3 4663437 9 0 Driver Error Sedan Night No Injuries
## 4 4629913 5 15 Driver Error Sedan Afternoon No Injuries
## 5 4631543 5 16 Driver Error SUV Afternoon Injuries
## 6 4631801 5 14 Driver Error SUV Afternoon No Injuries
## 7 4632066 5 15 Distracted Driver Pick Up T… Afternoon No Injuries
## 8 4631365 5 9 Distracted Driver Truck Morning No Injuries
## 9 4680340 11 18 Dangerous Driving Motorcycl… Evening Injuries
## 10 4631597 5 15 Driver Error SUV Afternoon No Injuries
## # ℹ 13,140 more rows
|
Data Visualization
I wanted to visualize the predictors that I am going to use in the model just to see how everything is distributed and if there are any obvious differences that the model might pick up on.
Contributing Factor
1
2
3
4
5
6
7
8
9
|
ggplot(data = motor_cat_final, aes(y = contrib_cat, fill = injured_cat)) +
geom_bar(color = "dimgray", alpha = 0.8) +
labs(title = "Count of Accidents and Their Outcomes by Contributing Factor",
y = "Primary Contributing Factor of Accident",
x = "Count",
caption = "Source: NYC Open Data Motor Vehicle Collisions- Crashes (January-December 2023)",
fill = "Injury Outcome") +
theme(plot.title = element_text(hjust=0.5)) +
theme_minimal()
|
The most common primary cause of the accident is distracted driving and driver error, this especially makes sense as distracted driving is made so easy these days by all the electronic devices that we have available.
Type of Vehicle
1
2
3
4
5
6
7
8
9
|
ggplot(data = motor_cat_final, aes(y = type_cat, fill = injured_cat)) +
geom_bar(color = "dimgray", alpha = 0.8) +
labs(title = "Count of Accidents and Their Outcomes by Type of Primary Vehicle",
y = "Primary Vehicle Involved in of Accident",
x = "Count",
caption = "Source: NYC Open Data Motor Vehicle Collisions- Crashes (January-December 2023)",
fill = "Injury Outcome") +
theme(plot.title = element_text(hjust=0.5)) +
theme_minimal()
|
One thing that stuck out to me, which is not very surprising now that I think about it, is the proportion of accidents that resulted in injuries when a bike/e-bike or motorcycle/moped is involved. This really emphasizes the importance of being safe and fully attentive when on the road not just for other drivers, but for bikes and motorcycles too, as well as the importance of bike infrastructure that would keep bikes seperate from other vehicles that can hurt them.
Time of Day
1
2
3
4
5
6
7
8
9
|
ggplot(data = motor_cat_final, aes(x = time_of_day, fill = injured_cat)) +
geom_bar(color = "dimgray", alpha = 0.8) +
labs(title = "Count of Accidents and Their Outcomes by Time of Day",
x = "Time of Day that Accident Occurred",
y = "Count",
caption = "Source: NYC Open Data Motor Vehicle Collisions- Crashes (January-December 2023)",
fill = "Injury Outcome") +
theme(plot.title = element_text(hjust=0.5)) +
theme_minimal()
|
Just inspecting this visually it does not look like there is that big of a difference between the different times of day and the proportion of accidents that result in injuries.
Classification Models to Predict Injury Outcome of Motor Vehicle Collision
Create Training and Test Data
In order to create and test my model that classifies if an accident results in injuries or not for anyone involved, I’m going to seperate my training and test data by taking an arbitrary amount of observations to be considered test and training using the sample_frac() and anti_join() functions.
1
2
3
4
5
6
7
8
|
set.seed(123)
training <- motor_cat_df %>%
sample_frac(.70)
test <- motor_cat_df %>% anti_join(training, by = "collision_id")
training_injury_outcome <- training$injured_cat
test_injury_outcome <- test$injured_cat
|
Logistic Regression
The first model that I wanted to fit was a logistic regression model, as there are only two different outcomes in the data (injuries vs. no injuries). Here I fitted that model with my predictors: contributing factor, type of vehicle, and time of day, along with my response variable which was if the accident resulted in injuries or not.
1
2
3
4
|
glm.inj <- glm(injured_cat ~ contrib_cat + type_cat + time_of_day,
data = training,
family = binomial)
summary(glm.inj)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
|
##
## Call:
## glm(formula = injured_cat ~ contrib_cat + type_cat + time_of_day,
## family = binomial, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0102 -1.2497 0.8528 1.0539 2.3111
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.74107 0.48089 3.621 0.000294 ***
## contrib_catCondition of Driver -0.53689 0.24739 -2.170 0.029991 *
## contrib_catDangerous Driving -0.38954 0.19907 -1.957 0.050375 .
## contrib_catDistracted Driver 0.09706 0.19357 0.501 0.616072
## contrib_catDriver Error -0.03027 0.19242 -0.157 0.874993
## contrib_catRoad Conditions -0.53907 0.21675 -2.487 0.012880 *
## contrib_catSubstances 0.24526 0.22726 1.079 0.280488
## type_catBike/E-Bike -3.93770 0.49501 -7.955 1.79e-15 ***
## type_catBus -1.76695 0.47378 -3.729 0.000192 ***
## type_catFire Truck -0.91710 0.94812 -0.967 0.333402
## type_catMotorcycle/Moped -3.50744 0.49026 -7.154 8.42e-13 ***
## type_catOther -1.65302 0.50726 -3.259 0.001119 **
## type_catPick Up Truck -1.30525 0.46299 -2.819 0.004814 **
## type_catScooter/E-Scooter -3.41098 0.54718 -6.234 4.55e-10 ***
## type_catSedan -1.54048 0.44326 -3.475 0.000510 ***
## type_catSUV -1.66967 0.44347 -3.765 0.000167 ***
## type_catTaxi -2.37742 0.47385 -5.017 5.24e-07 ***
## type_catTruck -1.11660 0.46499 -2.401 0.016336 *
## type_catVan -1.29899 0.56568 -2.296 0.021658 *
## time_of_dayEvening -0.01271 0.06711 -0.189 0.849830
## time_of_dayMorning 0.27367 0.05781 4.734 2.20e-06 ***
## time_of_dayNight 0.52662 0.05700 9.238 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12644 on 9204 degrees of freedom
## Residual deviance: 12059 on 9183 degrees of freedom
## AIC: 12103
##
## Number of Fisher Scoring iterations: 4
|
Then I used the predict() function to predict the outcome using my test data, and compared the real data along with the predicted data using a confusion matrix.
1
2
3
4
5
6
7
|
inj.probs <- predict(glm.inj,
test,
type = "response")
inj.pred <- rep("No Injuries", length(inj.probs))
inj.pred[inj.probs > 0.5] <- "Injuries"
table(inj.pred, test_injury_outcome)
|
1
2
3
4
|
## test_injury_outcome
## inj.pred Injuries No Injuries
## Injuries 1304 1980
## No Injuries 422 239
|
1
|
mean(inj.pred == test_injury_outcome)
|
Yikes, that’s worse than just random guessing. I’m going to use a different model now and see how that compares to the logistic regression.
Naive Bayes
I chose to use the naive bayes model as it does not assume any particular distribution in the data. I fit the model using the naiveBayes() function from the e1071 package and displayed the output of the model object below.
1
2
3
|
nb.inj <- naiveBayes(injured_cat ~ contrib_cat + type_cat + time_of_day,
data = training)
nb.inj
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
|
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Injuries No Injuries
## 0.4437806 0.5562194
##
## Conditional probabilities:
## contrib_cat
## Y Car Defects Condition of Driver Dangerous Driving
## Injuries 0.01126071 0.02105263 0.14369645
## No Injuries 0.01445313 0.01484375 0.10761719
## contrib_cat
## Y Distracted Driver Driver Error Road Conditions Substances
## Injuries 0.29987760 0.44504284 0.05483476 0.02423501
## No Injuries 0.34003906 0.45058594 0.03339844 0.03906250
##
## type_cat
## Y Ambulance Bike/E-Bike Bus Fire Truck
## Injuries 0.0014687882 0.0438188494 0.0166462668 0.0004895961
## No Injuries 0.0074218750 0.0044921875 0.0142578125 0.0009765625
## type_cat
## Y Motorcycle/Moped Other Pick Up Truck Scooter/E-Scooter
## Injuries 0.0337821297 0.0073439412 0.0205630355 0.0127294982
## No Injuries 0.0052734375 0.0072265625 0.0292968750 0.0023437500
## type_cat
## Y Sedan SUV Taxi Truck Van
## Injuries 0.4411260710 0.3801713586 0.0212974296 0.0173806610 0.0031823745
## No Injuries 0.5011718750 0.3804687500 0.0117187500 0.0310546875 0.0042968750
##
## time_of_day
## Y Afternoon Evening Morning Night
## Injuries 0.3618115 0.1703794 0.2296206 0.2381885
## No Injuries 0.2933594 0.1333984 0.2527344 0.3205078
|
Now to check the model to see how it compares to my worse-than-random-guessing logistic regression model.
1
2
|
nb.inj.pred <- predict(nb.inj, test)
table(nb.inj.pred, test_injury_outcome)
|
1
2
3
4
|
## test_injury_outcome
## nb.inj.pred Injuries No Injuries
## Injuries 461 263
## No Injuries 1265 1956
|
1
|
mean(nb.inj.pred == test_injury_outcome)
|
Well that’s not great, but it’s definitely an improvement from the previous logistic regression model.
Improve Logistic Regression Model
It looks like the primary contributing factor does not really have much statistical significance if there are injuries in a motor vehicle accident or not, I’m going to remove that variable from each model and see how that changes things.
1
2
3
4
|
glm.inj.new <- glm(injured_cat ~ type_cat + time_of_day,
data = training,
family = binomial)
summary(glm.inj.new)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
|
##
## Call:
## glm(formula = injured_cat ~ type_cat + time_of_day, family = binomial,
## data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1353 -1.2387 0.9115 1.0588 2.1639
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.651322 0.441470 3.741 0.000184 ***
## type_catBike/E-Bike -3.888045 0.493421 -7.880 3.28e-15 ***
## type_catBus -1.732093 0.472146 -3.669 0.000244 ***
## type_catFire Truck -0.806790 0.946818 -0.852 0.394155
## type_catMotorcycle/Moped -3.482104 0.488651 -7.126 1.03e-12 ***
## type_catOther -1.608261 0.505064 -3.184 0.001451 **
## type_catPick Up Truck -1.283938 0.461426 -2.783 0.005393 **
## type_catScooter/E-Scooter -3.353955 0.545475 -6.149 7.81e-10 ***
## type_catSedan -1.508263 0.441687 -3.415 0.000638 ***
## type_catSUV -1.633338 0.441928 -3.696 0.000219 ***
## type_catTaxi -2.305245 0.471983 -4.884 1.04e-06 ***
## type_catTruck -1.076075 0.463503 -2.322 0.020254 *
## type_catVan -1.271352 0.563454 -2.256 0.024048 *
## time_of_dayEvening -0.003336 0.066715 -0.050 0.960122
## time_of_dayMorning 0.267486 0.057502 4.652 3.29e-06 ***
## time_of_dayNight 0.520432 0.055919 9.307 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12644 on 9204 degrees of freedom
## Residual deviance: 12142 on 9189 degrees of freedom
## AIC: 12174
##
## Number of Fisher Scoring iterations: 4
|
Now checking again
1
2
3
4
5
6
7
|
inj.probs.new <- predict(glm.inj.new,
test,
type = "response")
inj.pred.new <- rep("No Injuries", length(inj.probs.new))
inj.pred.new[inj.probs.new > 0.5] <- "Injuries"
table(inj.pred.new, test_injury_outcome)
|
1
2
3
4
|
## test_injury_outcome
## inj.pred.new Injuries No Injuries
## Injuries 1524 2156
## No Injuries 202 63
|
1
|
mean(inj.pred.new == test_injury_outcome)
|
Still worse than guessing even though this does result in a slight improvement, but the Naive Bayes model still remains on top.
Conclusion
I learned so many new things trying to practice the skills that I’ve learned using classification models. I’m starting to become more familiar with the code on how to make predictions and check a classification model. Also, I learned a different way on how to seperate test and training data using the sample_frac() function in conjunction with the anti_join() function to create two different data frames that have a random fraction of observations in one data frame, and then the ones that were not selected in the other data frame.
I also see how messy working with real life data rather than data from an R package or a textbook is, this dataset is so huge and I only used a very pared down version of it. The original has millions of observations! There are also a lot of instances of human error and just differences in recording observations in the contributing factor and type of vehicle columns.