A typical big data analysis goes like this: First, a data scientist finds some obscure data accumulating on a server. Next, he or she spends days or weeks slicing and dicing the numbers, eventually stumbling upon some unusual insights. Then a meeting is organized to present the findings to business managers, after which the scientist feels disgruntled or even disrespected, while the managers wish they had their time back.
When these meetings fail, the main points of contention usually include unclear purpose; analyses that are too narrowly focused; and over-confidence in the science, which turns off non-technical managers. If you’re facing this situation, you should read the FiveThirtyEight article on mining the baby names dataset. When you’re done, send the article to your analytics team.
What FiveThirtyEight’s Nate Silver and Allison McCann did with the baby names dataset sets an example for all data analysts: They imbued it with a relevant business problem, attached complementary data, made a bold but acceptable assumption to patch a hole in the data, and elaborated their conclusion with a margin of error. Their article represents the best of data journalism, and it surpasses most examples of big data analytics as we know it.
Curated by the Social Security Administration (SSA), the dataset of the first names of all newborn Americans since 1880 is a star of big data. In the past few years, the baby names dataset has been mined to death (pardon the pun). Its fame can be traced to computer scientist Martin Wattenberg, who created the Baby Names Voyager, a user-friendly interface for visualizing the baby names. The purpose of the Voyager is to show which names were popular when. Since Wattenberg, a line of analysts has pursued numerous projects, such as identifying the trendiest names, the most poisoned names, and the most distinctive name in each state.
All this slicing and dicing has produced insights that are little more than sound bites or clickbait. Then Silver and McCann entered the picture.
They imbued the data with a relevant business problem.
Instead of asking what names were popular (or poisoned or trendy or distinctive) in a given period of time, the two data journalists turned the question around and investigated whether someone’s first name provides sufficient information to guess when he or she was born.
This framing of the issue immediately recalls the real-world problem of guessing someone’s religion or spoken languages from his or her name, place of residence, and other factors. Many sophisticated businesses use such demographic data to develop customer segmentation. If your business purchases third-party data with those variables, you are already benefiting from the type of analysis Silver and McCann presented. (In practice, direct information on people’s age is more readily available than religion or language.)
They attached complementary data.
It is rare for one dataset to contain all of the information needed to solve a business problem. The SSA data have information on births but not on deaths. Simply averaging the birth dates of every Elizabeth ever born leads to a vastly overstated average age, because some of those people are no longer living. To perform the analysis properly, the data journalists incorporated actuarial life tables, which contain estimates of death rates by age.
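The survival adjustment can be sketched in a few lines. This is only an illustration of the idea, not the article’s actual computation: the birth counts and survival probabilities below are invented numbers, where a real analysis would pull the former from the SSA files and the latter from an actuarial life table.

```python
# Sketch of survival-weighting a name's birth cohorts (illustrative numbers).
# births[year]  = babies given the name in that year (hypothetical counts)
# survival[year] = probability a person born that year is alive in the
#                  reference year, as an actuarial life table would estimate
CURRENT_YEAR = 2014

births = {1920: 30_000, 1950: 25_000, 1985: 15_000, 2005: 10_000}
survival = {1920: 0.02, 1950: 0.55, 1985: 0.97, 2005: 0.995}

# Naive average over everyone ever born: skews old, since it counts the dead.
naive_age = sum((CURRENT_YEAR - y) * n for y, n in births.items()) / sum(births.values())

# Weight each cohort by its chance of still being alive, then average again.
living = {y: n * survival[y] for y, n in births.items()}
adjusted_age = sum((CURRENT_YEAR - y) * n for y, n in living.items()) / sum(living.values())

print(f"naive: {naive_age:.1f}, survival-adjusted: {adjusted_age:.1f}")
```

Note that the life table supplies survival rates by birth year and gender only, so applying them to a single name’s cohorts already bakes in the assumption, discussed below, that death rates do not vary by first name.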
They patched a hole in the data.
Actuaries, however, do not care about first names: death rates can be split by gender, but not by name. An analyst could give up on the project at this stage, or make an assumption and trudge forward. Silver and McCann chose the latter route, assuming that death rates do not vary by first name. This is, without a doubt, a bold move, but one I’m comfortable with because it allows the analysis to reach a satisfactory state. Data analysts face this type of decision in the course of any big data work. (You can see the key analytical decisions in the footnotes of the article.)
They elaborated their conclusion with a margin of error.
The powerful graphics in the article clearly display the potential error incurred when using a first name to predict a person’s age. Silver and McCann showed that the level of accuracy depends on gender and on the shape of the name’s popularity trend. In the better examples, they could bracket someone’s age to within 10 years with 50% confidence. All too often, media reports of big data analyses omit any quantification of accuracy, a harsh irony given the field’s trumpeting of the scientific method.
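One simple way to produce such a bracket, sketched here with invented numbers rather than the article’s data, is to take the middle 50% of the living population’s birth-year distribution, i.e., the span from the 25th to the 75th percentile:

```python
# Hypothetical survival-adjusted living counts by birth year for one name.
# These values are illustrative only, not drawn from the SSA data.
living = {1960: 500, 1970: 2_000, 1980: 6_000, 1990: 4_000, 2000: 1_500}
CURRENT_YEAR = 2014

def quantile_year(counts, q):
    """Return the birth year at cumulative fraction q of the population."""
    total = sum(counts.values())
    cum = 0
    for year in sorted(counts):
        cum += counts[year]
        if cum >= q * total:
            return year
    return max(counts)

# A 50%-confidence bracket: the interquartile range of birth years.
lo_year = quantile_year(living, 0.25)
hi_year = quantile_year(living, 0.75)
print(f"50% chance born {lo_year}-{hi_year}, "
      f"i.e., aged {CURRENT_YEAR - hi_year} to {CURRENT_YEAR - lo_year}")
```

The width of that bracket is what the article’s graphics visualize: a name with one sharp popularity spike yields a narrow interquartile range, while a name popular for a century yields a nearly useless one.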
All the lessons described here apply easily to any business analytics team. Instead of generating sound bites with scant business relevance, data scientists should consult their business partners early and agree on an interesting business problem before digging into the data. As gigantic as many of today’s datasets are, they may still lack important variables, thus requiring augmentation. Big data analysis is highly valued because it can provide useful predictions, but analysts err when they fail to include a margin of error. Sound business decisions require understanding not only the most likely scenario, but also the range of possibilities. As the discipline of data science and analytics evolves, the process of generating business insights will improve, and there will be less all-around frustration when teams meet about data projects.