Month: April 2018

Data vignette: men are worse drivers

This post is based on a notebook I wrote a couple years ago. I’d like to revisit and expand on it, as well as correct some errors. The original notebook is here.

In this post, I analyze traffic collision data from Los Angeles County in January 2012. The analysis is sound, but the conclusion in the title is meant to be taken with a grain of salt, due to the limitations of the data source, including the time span and specific location.

The source of the data is a California state traffic data system called SWITRS (Statewide Integrated Traffic Records System). The system allowed for the download of about 5,000 incidents at a time. The data are submitted by local law enforcement agencies based on traffic incidents they respond to. The system allows downloads in different categories: victims, collisions and parties. The parties data includes all parties in a collision, including parties at fault and not at fault. This becomes important in the contextualization of the data.

Male-female, Fault-no fault

The data contains many dimensions that allow for much more in-depth analysis. However, for our purposes, the analysis is quite simple.

The key to this analysis is to bundle up the data into a two-by-two contingency table and interpret it. Using pandas, the following result can be obtained:

                Male    Female
At fault        2231      1369
Not at fault    2577      1938

We can clearly see that for drivers at fault, males outnumber females 2231 to 1369, immediately suggesting that males are far worse drivers when it comes to causing collisions.
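As a rough sketch of how such a table might be built with pandas: the block below rebuilds one row per driver from the four counts quoted in this post, then cross-tabulates them. The column names `sex` and `fault` and the labels are hypothetical stand-ins, not the actual SWITRS field names, and the original notebook may have done this differently.

```python
import pandas as pd

# Counts quoted in this post (LA County, January 2012).
counts = {
    ("M", "at fault"): 2231,
    ("F", "at fault"): 1369,
    ("M", "not at fault"): 2577,
    ("F", "not at fault"): 1938,
}

# Rebuild one row per driver. The column names "sex" and "fault" are
# hypothetical; the real SWITRS parties file uses its own field names.
rows = [
    {"sex": sex, "fault": fault}
    for (sex, fault), n in counts.items()
    for _ in range(n)
]
parties = pd.DataFrame(rows)

# Two-by-two contingency table: fault status by sex.
table = pd.crosstab(parties["fault"], parties["sex"])
print(table)
```

With real party-level data, the same `pd.crosstab` call would apply once the fault and sex columns are identified.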

Assuming males and females make up roughly equal shares of the population, we would expect that, if they drove equally safely, they would be equally represented among drivers at fault for collisions.

If you’re paying attention, you have probably noticed a problem with this logic. While the population is evenly split, men may be over-represented as drivers, since they are more likely to work in the driving and delivery professions. In fact, with this data we can make some estimates about that.

Correcting errors in reasoning

In my notebook, I made an error in summing the at-fault and not-at-fault drivers as a representation of the overall driving population. This would have been a reasonable move if we had counted all the drivers who were not at fault, not just those who were in a collision. However, this is not feasible.

Instead, we take only the drivers who were in a collision but not at fault as a representation of the drivers on the road at any given time. As far as we know, there is nothing special about these drivers; they just happened to be in the wrong place at the wrong time. The total number of drivers in LA County (in or not in a collision) who are not at fault is much higher, roughly 5 million [1,2]. Under our assumptions, the ratio of male to female drivers is thus 2577:1938, or 1.33:1. Scaled out, that is about 2.85 million:2.15 million male:female drivers who are not at fault.

In comparison, the number of drivers at fault is quite small and is negligible relative to the overall number of drivers.

Now calculate the overall difference

The rate of male drivers causing an accident is thus 2231/2.85/100 = 7.83 drivers at fault / 10,000 male drivers.

The rate of female drivers causing an accident is thus 1369/2.15/100 = 6.37 drivers at fault / 10,000 female drivers.

The ratio of risk would then be 7.83/6.37 = 1.23 or a 23% greater risk for male vs. female drivers.
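The arithmetic above can be sketched in a few lines of Python. This version uses the at-fault count of 2231 males from the contingency table and keeps the driver estimates unrounded, so its rates can differ slightly from hand-rounded figures. The function names are illustrative, and the total driver count is an assumed input, not a measured one.

```python
# Counts quoted in this post (LA County, January 2012).
AT_FAULT = {"male": 2231, "female": 1369}
NOT_AT_FAULT = {"male": 2577, "female": 1938}


def fault_rates(total_drivers):
    """Drivers at fault per 10,000 drivers of each sex.

    total_drivers is an assumed population size. It is split between
    the sexes in the same proportion as the not-at-fault drivers.
    """
    sample_size = sum(NOT_AT_FAULT.values())
    rates = {}
    for sex, n_at_fault in AT_FAULT.items():
        est_drivers = total_drivers * NOT_AT_FAULT[sex] / sample_size
        rates[sex] = n_at_fault / est_drivers * 10_000
    return rates


def risk_ratio(total_drivers):
    """Male rate divided by female rate."""
    rates = fault_rates(total_drivers)
    return rates["male"] / rates["female"]


# The assumed total changes the individual rates but not their ratio.
for total in (1_000_000, 5_000_000):
    print(total, fault_rates(total), round(risk_ratio(total), 3))
```

Whatever total is plugged in, the ratio comes out at about 1.23, because the population estimate cancels when one rate is divided by the other. It reduces to (2231/2577) / (1369/1938).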

Conclusion: men are, on average, worse (more dangerous) drivers



[1] The actual number used here is an estimate. Intermediate values are therefore accurate only to the accuracy of the estimate. The number is important to show scale relative to the numbers of drivers at fault. However, the actual number would cancel out in the calculation of ratio between male and female risk.

[2] Using only number of drivers here is a bit misleading, since drivers are not on the road for equal amounts of time. In fact, it is reasonable to expect that number of drivers is roughly equal between men and women, but that men spend more hours driving, as they are more likely to be in the driving and delivery professions. Instead, we could estimate 5 million * 30 hrs / month = 150 million driver hours / month. Scaled out it is about 85.6 million:64.4 million driver hours male:female. The accident rates would then be: 2231/85.6/100 = 0.261 drivers at fault / 10,000 male driver hours and 1369/64.4/100 = 0.213 drivers at fault / 10,000 female driver hours.


For presentations, focus on narrative

You’ve collected the data, you’ve run the analysis, now you have to decide how to present. You’ve considered it from every angle, and you’re preparing a slide deck to match–detailed, lengthy and technical. Is this the right approach? Probably not.

Rule of thumb: include no more than one figure per topic

When you’re the technical expert, people aren’t always trying to prove you wrong. They expect you to do the work correctly, until you give them reason not to. You don’t have to take them through the entire process of analyzing the data. One figure is enough to make the point and move on. More than that becomes boring and forgettable.

This was an adjustment for me. Academic seminar talks are all technical, and audience questions take the presenter through the weeds.

There are exceptions to this rule, but for most audiences, one figure is the limit of attention. Choosing one figure to make the point will help to clarify your message. That said, be ready to answer questions from many angles.

If they don’t want data, what do they want? Narrative

Instead of spending most of your time thinking about and preparing figures, spend it thinking about narrative and storytelling. What are the keys?

A crucial element of narrative is an upward or progressive arc. We like companies that grow, societies that evolve, people that improve themselves and narratives that move from bad to better:

“We had some trouble early in Q2, but learned our lesson and fixed the problem.”

“We sustained initial losses but our heavy investments paid off in the long run.”

Another part of memorable narratives is the Goldilocks principle–not too much, not too little:

“We were over-invested in customer service. We were able to reduce costs without affecting our feedback scores.”

“This model achieves the right balance of simplicity and complexity and will maximize profit.”

Another type of narrative that appeals to me is the contrarian narrative: the experts say X, but here’s why they’re wrong. This one’s more of a personal taste and should be used carefully.

Narratives operate at different levels and in different contexts

Business executives craft a narrative about their business that must satisfy investors, motivate employees, generate positive PR, and be stable over time.

In a similar way, data scientists must craft narratives about data that are consistent with the facts, satisfying to leadership, palatable to stakeholders and memorable.

So next time you’re presenting your data, think about narrative, and see how much impact you can have.