This post is based on a notebook I wrote a couple years ago. I’d like to revisit and expand on it, as well as correct some errors. The original notebook is here.
In this post, I analyze traffic collision data from Los Angeles County in January 2012. The analysis is sound, but the conclusion in the title is meant to be taken with a grain of salt, due to the limitations of the data source, including the time span and specific location.
The source of the data is a California state traffic data system called SWITRS (Statewide Integrated Traffic Records System). The system allowed for the download of about 5,000 incidents at a time. The data are submitted by local law enforcement agencies based on traffic incidents they respond to. The system allows downloads in different categories: victims, collisions and parties. The parties data includes all parties in a collision, including parties at fault and not at fault. This becomes important in the contextualization of the data.
Male-female, Fault-no fault
The data contains many dimensions that allow for much more in-depth analysis. However, for our purposes, the analysis is quite simple.
The key to this analysis is to bundle the data into a two-by-two contingency table and interpret it. Using pandas, the following counts can be obtained: 2231 male and 1369 female drivers at fault, and 2577 male and 1938 female drivers not at fault.
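As a rough sketch of how such a table could be built with pandas, the snippet below rebuilds one row per driver from the four counts quoted in this post and cross-tabulates them. The column names "fault" and "sex" are placeholders chosen for illustration, not actual SWITRS field names, and this is not necessarily how the original notebook did it.

```python
import pandas as pd

# Counts quoted in this post (LA County, January 2012).
counts = {
    ("at fault", "male"): 2231,
    ("at fault", "female"): 1369,
    ("not at fault", "male"): 2577,
    ("not at fault", "female"): 1938,
}

# Rebuild one row per driver so that pd.crosstab has something to count.
# "fault" and "sex" are hypothetical column names, not SWITRS fields.
rows = [
    {"fault": fault, "sex": sex}
    for (fault, sex), n in counts.items()
    for _ in range(n)
]
parties = pd.DataFrame(rows)

# Two-by-two contingency table: fault status by sex.
table = pd.crosstab(parties["fault"], parties["sex"])
print(table)
```

With the real parties download, the same pd.crosstab call would be applied to whichever columns hold fault status and sex.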
We can clearly see that for drivers at fault, males outnumber females 2231 to 1369, immediately suggesting that males are far worse drivers when it comes to causing collisions.
Assuming that males and females make up roughly equal portions of the population, we would expect that if they drive with equal safety, they should be equally represented among drivers at fault for collisions.
If you’re paying attention, you have probably noticed a problem with this logic. While the population is split equally, men may be over-represented as drivers, since they are more likely to work in the driving and delivery professions. In fact, with this data we can make some estimates about that.
Correcting errors in reasoning
In my notebook, I made an error in summing the at fault and not at fault as a representation of the overall driving population. This would have been a reasonable move if we had counted all the drivers who were not at fault, not just those who are in a collision. However, this is not feasible.
Instead, we take only the drivers in a collision who were not at fault as a representation of the drivers on the road at any given time. As far as we know, there is nothing special about these drivers; they just happened to be in the wrong place at the wrong time. The total number of drivers (in or not in a collision) in LA County not at fault is much higher, roughly 5 million [1,2]. Under our assumptions, the ratio of male to female drivers is thus 2577:1938 or 1.33:1. Scaling out, it’s about 2.85 million:2.15 million male:female drivers who are not at fault.
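The scaling step can be checked with a few lines of arithmetic. This is a minimal sketch that takes the 5 million figure as the post's rough estimate and splits it in proportion to the not-at-fault counts, which gives roughly 2.85 million male and 2.15 million female drivers.

```python
# Not-at-fault counts from the contingency table in this post.
male_not_at_fault = 2577
female_not_at_fault = 1938

# Assumed total number of drivers in LA County (the post's rough estimate).
total_drivers = 5_000_000

# Split the total in the same proportion as the not-at-fault sample.
sample = male_not_at_fault + female_not_at_fault
male_drivers = total_drivers * male_not_at_fault / sample
female_drivers = total_drivers * female_not_at_fault / sample

print(f"male:female ratio = {male_not_at_fault / female_not_at_fault:.2f}:1")
print(f"estimated male drivers:   {male_drivers / 1e6:.2f} million")
print(f"estimated female drivers: {female_drivers / 1e6:.2f} million")
```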
In comparison, the number of drivers at fault is quite small and is negligible relative to the overall number of drivers.
Now calculate the overall difference
The rate of male drivers causing an accident is thus 2231/2.85/100 = 7.83 drivers at fault / 10,000 male drivers.
The rate of female drivers causing an accident is thus 1369/2.15/100 = 6.37 drivers at fault / 10,000 female drivers.
The ratio of risk would then be 7.83/6.37 = 1.23 or a 23% greater risk for male vs. female drivers.
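For readers who want to reproduce the rates, here is a small sketch using the at-fault counts from the contingency table (2231 male, 1369 female) and the rounded population estimates of 2.85 million male and 2.15 million female drivers. The helper name per_10k is made up for this example. The last step recomputes the ratio from the raw counts alone, to illustrate that the 5 million estimate drops out.

```python
# At-fault counts from the contingency table in this post.
male_at_fault = 2231
female_at_fault = 1369

# Rounded population estimates, scaled from the not-at-fault split.
male_drivers = 2.85e6
female_drivers = 2.15e6


def per_10k(at_fault, drivers):
    """Return at-fault drivers per 10,000 drivers."""
    return at_fault / drivers * 10_000


male_rate = per_10k(male_at_fault, male_drivers)
female_rate = per_10k(female_at_fault, female_drivers)
risk_ratio = male_rate / female_rate

# The population estimate cancels: nearly the same ratio follows from the
# raw counts alone (the small gap comes from rounding the populations).
ratio_from_counts = (male_at_fault / 2577) / (female_at_fault / 1938)

print(f"male rate:   {male_rate:.2f} per 10,000 drivers")
print(f"female rate: {female_rate:.2f} per 10,000 drivers")
print(f"risk ratio:  {risk_ratio:.2f} (from counts only: {ratio_from_counts:.2f})")
```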
Conclusion: men are, on average, worse (more dangerous) drivers
[1] The actual number used here is an estimate. Intermediate values are therefore accurate only to the accuracy of the estimate. The number is important to show scale relative to the numbers of drivers at fault. However, the actual number would cancel out in the calculation of ratio between male and female risk.
[2] Using only number of drivers here is a bit misleading, since drivers are not on the road for equal amounts of time. In fact, it is reasonable to expect that number of drivers is roughly equal between men and women, but that men spend more hours driving, as they are more likely to be in the driving and delivery professions. Instead, we could estimate 5 million * 30 hrs / month = 150 million driver hours / month. Scaled out it is about 85.6 million:64.4 million driver hours male:female. The accident rates would then be: 2231/85.6/100 = 0.261 drivers at fault / 10,000 male driver hours and 1369/64.4/100 = 0.213 drivers at fault / 10,000 female driver hours.
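The driver-hour estimate in the second note can be reproduced the same way. This sketch takes the 30 hours per month figure as the note's own assumption and splits the total hours in proportion to the not-at-fault counts. Because the split uses the same proportion, the male-to-female ratio of the two rates comes out at about 1.23 again.

```python
# Assumptions taken from the note: 5 million drivers, 30 driving hours each per month.
total_hours = 5_000_000 * 30

# Split driver hours in proportion to the not-at-fault counts (2577 male, 1938 female).
male_share = 2577 / (2577 + 1938)
male_hours = total_hours * male_share
female_hours = total_hours - male_hours

# At-fault drivers per 10,000 driver hours.
male_rate = 2231 / male_hours * 10_000
female_rate = 1369 / female_hours * 10_000

print(f"male hours:   {male_hours / 1e6:.1f} million")
print(f"female hours: {female_hours / 1e6:.1f} million")
print(f"male rate:    {male_rate:.3f} per 10,000 driver hours")
print(f"female rate:  {female_rate:.3f} per 10,000 driver hours")
print(f"ratio:        {male_rate / female_rate:.2f}")
```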