Beware someone who says data can solve any problem. They’re naive, malicious, or won’t be around to deal with the aftermath.
Of course, data can accomplish a lot. As a data scientist, my job depends on it. But if we’re being honest, the limitations are becoming more and more obvious. As a community, data scientists have to start pumping the brakes on data hype, or risk riding a runaway train.
The data boosters
Data boosterism has taken many forms in many different fields. Much of this activity took place in the period from 2008-2014, before data science really caught on as a field in business. What lessons can we learn?
In journalism, writers like Ezra Klein and Nate Silver popularized data-driven journalism. At one point, Nate Silver’s columns reportedly accounted for up to 20% of the New York Times’ online traffic. Silver aggregated polls and used statistical models to predict elections. His reputation hit new heights after the 2012 election, when he correctly called the presidential result in all 50 states and nearly every Senate race as well.
In education, crusading reformers pushed test scores and graduation rates as critical metrics and targets. In 2007, the Washington, DC school board was stripped of power by the city legislature, and Michelle Rhee was appointed chancellor of DC schools by mayor Adrian Fenty. Rhee became chancellor at the age of 37, her reputation built on the test-score gains of students in her own classrooms. She managed to reform tenure and fired 266 employees who were deemed low-performing. In 2009, Beverly Hall of Atlanta was named superintendent of the year. Hall touted data-driven teaching practices and an increase in test scores during her tenure.
The EU increased fuel efficiency requirements for automobiles by tightening CO2 emission regulations. The limit is 130g CO2 per kilometer traveled, and sure enough, in the most recent year (2017), automobiles are beating the limit at 115g/km.
Then came the 2016 US presidential election. Data journalism had grown in popularity, and national newspapers released their own predictions. The New York Times predicted an 85% chance of a Clinton victory. Sam Wang, a Princeton neuroscientist who had waded into the election business, was all but certain of a Clinton win, claiming over 99% probability. Nate Silver was among the more reserved, putting Clinton’s chances at 65%.
In the end, Trump won in an electoral vote landslide. His keys to victory were a few states–Michigan, Wisconsin, and Pennsylvania–where he performed much better than polls predicted. In Wisconsin, a poll aggregator put Clinton’s advantage at 6.5%; Trump won the state by 0.7%, a swing of more than 7 points. What’s interesting is that national polls were close to accurate, within a point or so. It was the state polls, so accurate in 2012, that turned out to be off in 2016. Sam Wang ate a bug on live television in contrition.
Michelle Rhee’s political ally, mayor Adrian Fenty, lost re-election in 2010. She resigned shortly afterwards. Despite some successes, Rhee would develop a rocky relationship with data.
Even in 2007, the year she was appointed chancellor, Rhee could not provide proof of her own students’ improved test scores from her time teaching in Baltimore in the 1990s. In January 2011, a blogger uncovered the data, which had been available online the whole time. The positive results Rhee had touted turned out to be incorrect, or at least exaggerated. Her students’ scores did go up by her third year of teaching, but the gains were smaller than she had claimed, and in her first year the results were poor. Rhee countered that her principal at the school had told her about the results, which she recalled to the best of her ability.
Later in 2011, USA Today reported high levels of answer switching, after testing company McGraw-Hill flagged several classrooms for high rates of “wrong-to-right” answer changes. A test security consulting company hired by DC schools reported it “did not find evidence of cheating at any of the schools.” In 2013, a US Department of Education investigation into the alleged cheating confirmed that cheating had been used to influence federal grant dollars, but found that it was not widespread.
In Atlanta, a massive cheating scandal was uncovered. All told, 44 of the 56 schools in the district were affected. 178 school employees were implicated and 11 were convicted and sentenced to prison. An 800-page state report described a “data-driven environment” under the leadership of Beverly Hall. Atlanta Public Schools (APS) was awarded funds by the Broad Foundation and the Gates Foundation due to its rising scores. More generally, teacher pay is sometimes tied to test performance, with bonuses of up to $25,000 being awarded.
History seems to be repeating itself, at least in part. In 2018, DC schools are once again under fire. 900 students, one third of all graduating seniors, were improperly allowed to graduate in violation of the district’s own rules, despite truancy and other problems. The improper graduations allowed the district to report higher graduation rates. This comes after, according to the Washington Post:
…the D.C. school system has been the crown jewel of public policy in the nation’s capital, held up as a national model for education reformers and a shared source of pride for the District’s fractious elected officials.
As for the EU emissions tests, an independent study found that the emissions tests were not reflective of real-world driving conditions, and that the gap was growing wider every year. During the emissions tests, the cars are equipped with special tires and the test track conditions are ensured to be ideal.
The study used data from applications which motorists used to track actual real-world fuel usage and mileage. The largest of these apps is spritmonitor.de. The study authors were then able to compute the total gas mileage and CO2 emissions for the different automobiles and model years in question. Learn more from this YouTube video I produced.
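The arithmetic behind that kind of study is simple: burning a litre of petrol releases a roughly fixed mass of CO2, so logged real-world fuel consumption translates directly into g/km. Here is a minimal sketch; the 2310 g/L conversion factor is a common approximation for petrol, and the consumption figures are invented for illustration, not taken from the study.

```python
# Convert real-world fuel consumption (litres per 100 km) into CO2 emissions
# (grams per km). Burning 1 litre of petrol releases roughly 2310 g of CO2;
# this constant is an approximation, not a figure from the study itself.
CO2_G_PER_LITRE_PETROL = 2310

def co2_g_per_km(litres_per_100km: float) -> float:
    """CO2 in g/km for a given petrol consumption in L/100km."""
    return litres_per_100km * CO2_G_PER_LITRE_PETROL / 100

# A car logging 5.0 L/100km of real-world driving sits near the fleet figure:
print(round(co2_g_per_km(5.0), 1))  # ~115.5 g/km
# A thirstier 7.0 L/100km blows past the 130 g/km regulatory limit:
print(round(co2_g_per_km(7.0), 1))
```

The point is that once you have honest consumption data, the emissions number follows mechanically; the gap in the study comes from the test-track consumption figures, not the conversion.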
Why over-reliance on data is flawed
Looking across these cases of flawed data use, patterns emerge. In my view, over-reliance on data is flawed for at least three main reasons:
- Data processes evolve over time: The relationships between variables change over time, and in ways we cannot always predict. In the data journalism realm, polls that were once good predictors of outcomes stopped being accurate. It’s hard to say exactly why, but relying on the proxy metric of state polls proved badly wrong in many states.
- Incentives warp behavior: When teachers’ and administrators’ jobs and salaries depend on test scores or graduation rates, it’s not hard to see why metrics were faked. Similarly, when a certain level of CO2 emissions is mandated by law, manufacturers optimize their cars for the test rather than for real-world driving.
- Proxy metrics are imperfect: The real goal of schools–perhaps student learning–is hard and costly to measure. Proxy metrics like test scores are used, but these are only proxies for the real thing.
How does this apply to data science?
Goodhart’s law says:
When a measure becomes a target, it ceases to be a good measure.
When incentives are at stake, expect that metrics will be abused. This is relevant for data scientists and analysts in business, particularly in fields like personnel management.
In addition to deliberate gaming, we must also worry about more natural, gradual processes of change. These processes can limit the generalizability of machine learning models. Customers may grow tired of old products, or gradual cultural change may shape preferences more subtly. A model trained on last year’s data may no longer be accurate.
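This kind of drift is easy to demonstrate on synthetic data. The sketch below (numpy only, with an invented price–demand relationship) fits a linear model on “last year’s” data, then scores it on “this year’s” data, where the underlying slope has shifted; the error grows accordingly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)  # e.g. price points

# "Last year": demand falls with price at slope -2.
y_old = 50 - 2.0 * x + rng.normal(0, 1, x.size)
# "This year": preferences drifted; the true slope is now -3.
y_new = 50 - 3.0 * x + rng.normal(0, 1, x.size)

# Fit a linear model on last year's data only.
slope, intercept = np.polyfit(x, y_old, 1)
pred = slope * x + intercept

mse_old = np.mean((pred - y_old) ** 2)  # in-distribution error
mse_new = np.mean((pred - y_new) ** 2)  # error after drift

print(f"MSE on last year's data: {mse_old:.2f}")
print(f"MSE on this year's data: {mse_new:.2f}")  # substantially larger
```

Nothing about the model is “wrong” at training time; the world simply moved. That is why models need monitoring and periodic retraining, not just a one-time validation score.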
Trends in data science culture
I hope I’ve convinced you that data can be misleading without context. How is this relevant for data scientists?
Back to business
Data scientists need to get back to business. That was the message of a recent widely shared article on LinkedIn about data science. Data science has become associated with big data and advanced algorithms like deep learning. But at many companies, simpler models coupled with stronger business knowledge and presentation skills are more important for data scientists.
In Harvard Business Review, an article on data science touched on similar themes. It was written by the host of the DataFramed podcast, who has interviewed 35 data scientists about which skills matter most to them.
The article sounds a cautious note when it comes to AI and deep learning, saying “…healthy skepticism is in order”. The article also suggests that presentation skills are more valuable for data scientists than deep learning acumen. Data scientists should not be beholden to specific technical skills, but be willing to change approaches as needed, and have a focus on business decision making.
The “AI winter” is a concept that has picked up steam since an article by Filip Piekniewski. His argument is that the AI and deep learning revolutions are over, at least for now. The hype around “artificial general intelligence” is overblown. Big tech is letting go of prized researchers, like Yann LeCun at Facebook. Enthusiasm from earlier AI boosters like Andrew Ng has waned (far fewer tweets). And gains in model performance are not keeping pace with the growth in computational power required, a sign of diminishing (or no) returns.
The Information ran a great feature on the difficulty Google is having integrating DeepMind into its larger organization. Google acquired DeepMind for $600 million in 2014. DeepMind gained fame when they trained a computer to beat world champion Go players. Amazing science, and amazing PR, but the practical side–revenue generation–is coming along much more slowly. The AI winter concept is a fascinating idea to unpack. Check out the commentary at Hacker News.
How data can work
By now, I hope you have seen the danger of using metrics as a goal. In some cases, like sales, the metric is identical, or at least close, to the actual end goal. But in other cases, be careful setting metrics as a target, and if possible, avoid it. The moment metrics become targets, as we’ve seen, the trouble begins. Most metrics should be used observationally, i.e. to direct strategy rather than as a target. Don’t guarantee a measurable outcome. You’ve seen how dangerous that can be.
Second, the best way to learn is with experiments that are as fully controlled as possible. Any treatment should be compared to a control group that is as similar to it as possible, except for the treatment. In my work in internet media, a great example of this is article renovation–when old content is rewritten to keep it relevant in hopes of attracting more traffic (particularly from search). One approach is to take a group of articles, renovate them, and see how they perform afterwards. A good approach? Maybe, but you can do better. To control the experiment, you should choose a group of articles that are as similar as possible (traffic, topic, etc.) that are not renovated. This group will have the same seasonal changes and provide a clean baseline for comparison.
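That matching step can be sketched in a few lines. In this illustration (all article names and traffic numbers are invented), each renovated article is paired with the untouched article closest in pre-period traffic, and the renovation effect is read as the difference in lift between the pair rather than the renovated group’s raw before/after change.

```python
# Matched-control comparison for article renovation (illustrative data).
# Each article is (traffic before, traffic after). The naive read looks at
# before vs. after for renovated articles only; the controlled read subtracts
# the lift of a traffic-matched, untouched control article.

renovated = [(1000, 1300), (800, 1020), (1200, 1500)]
untouched = [(990, 1150), (810, 940), (1210, 1390), (500, 560)]

def lift(before, after):
    return (after - before) / before

controlled_lifts = []
for before, after in renovated:
    # Match on pre-period traffic: the closest untouched article.
    ctrl = min(untouched, key=lambda c: abs(c[0] - before))
    controlled_lifts.append(lift(before, after) - lift(*ctrl))

naive = sum(lift(b, a) for b, a in renovated) / len(renovated)
controlled = sum(controlled_lifts) / len(controlled_lifts)

print(f"naive lift: {naive:.1%}")            # includes seasonal growth
print(f"controlled lift: {controlled:.1%}")  # renovation effect alone
```

In this toy data the untouched articles also grew, so the naive estimate overstates the renovation effect; the controlled estimate strips out that shared baseline. In practice you would match on topic and seasonality as well as traffic.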
For data scientists at most companies, you’re not going to get much mileage out of building your own deep learning models. Many deep learning tasks can be solved with cheap APIs from Google or Microsoft. Professionally, you are going to be better off making sure your data is clean and accessible, simplifying your models, and learning to present. Learn to walk before you run.
Have faith in, and sell, a good process over metrics. The shame of the Michelle Rhee situation is that many of the things she did were good. She might have had more long-term success if she hadn’t pushed test scores and other metrics so hard, and had instead focused on the process of recruiting and retaining great educators and firing bad ones.