GIGO

I’ve watched the rise of “Bloomberg for Startups” companies with great interest. And I want to start by saying they’re dealing with some really difficult data problems. Crunchbase and AngelList, the two best sources for raw data on early-stage companies, do their best to be accurate. But Crunchbase is partially crowd-sourced, so there’s a lot of subjectivity in the classifications, and AngelList is limited to the companies that are on AngelList…while that’s an ever-increasing percentage, there are still gaps.

The lack of a widely-accepted taxonomy compounds the problem. For example, I like to research hardware. Since cleantech, solar, and medical device companies often have a hardware component, I could choose to include them. Usually I don’t, because in the VC ecosystem, those three sectors tend to sit apart…different groups of investors, different conferences. But really, this is a personal call, and I’m never entirely sure it’s the right one. The line between health-monitoring wearables and medical devices, in particular, can be very fine. So I try to be really clear about what I’m including.
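One way to keep that call honest is to write it down as code rather than leaving it implicit. A minimal sketch in Python; the records, tag names, and exclusion set here are hypothetical, not any provider’s actual schema:

```python
# Hypothetical company records with provider-assigned sector tags.
companies = [
    {"name": "DroneCo",      "tags": {"hardware", "robotics"}},
    {"name": "SolarPanelCo", "tags": {"hardware", "cleantech", "solar"}},
    {"name": "WearableCo",   "tags": {"hardware", "medical device"}},
]

# My call: hardware companies, minus the sectors that sit apart in the VC
# ecosystem. Changing this one set changes every downstream count.
EXCLUDED = {"cleantech", "solar", "medical device"}

def in_scope(company):
    return "hardware" in company["tags"] and not (company["tags"] & EXCLUDED)

print([c["name"] for c in companies if in_scope(c)])  # ['DroneCo']
```

Publishing the exclusion set alongside the analysis lets a reader disagree with the call and rerun the numbers, instead of guessing at them.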

It’s often hard to do an analysis even within a single sector. For a project on robotics, a friend and I pulled down data sets from several different providers. We happened to notice that Boston Dynamics was missing…went on Crunchbase, and there it was, tagged only as a software company. We fixed the entry for the next person (or scraper), but the point is, even a very limited scope doesn’t preclude dirty-data problems. There are plenty of companies that weren’t top-of-mind that we wouldn’t have caught.
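Once you know to look, the check itself is mechanical. Here’s a sketch of the cross-provider diff we effectively ran by eye, assuming each provider’s export boils down to a CSV with a name column (the file names are placeholders):

```python
import csv

def load_names(path):
    """Load and normalize company names from a provider's CSV export."""
    with open(path, newline="") as f:
        return {row["name"].strip().lower() for row in csv.DictReader(f)}

# Placeholder exports, each already filtered to "robotics" by the provider.
provider_a = load_names("provider_a_robotics.csv")
provider_b = load_names("provider_b_robotics.csv")

# A company one provider tags as robotics and the other misses entirely
# shows up here -- this is how a mis-tagged Boston Dynamics surfaces.
print("only in A:", sorted(provider_a - provider_b))
print("only in B:", sorted(provider_b - provider_a))
```

The catch: this only surfaces companies that at least one provider classified correctly. If every provider mis-tags a company, no amount of diffing will find it, which is exactly the top-of-mind problem.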

This brings us to the 2014 “VC Trend Reports” that have been popping up lately. One post claimed that the “number of seed rounds raised in 2014 [was] down by 30%” and attributed the drop to macro indicators. Another proclaimed that “2014 saw the highest number of seed VC deals since 2009 with 976 financings.” A third said that “seed rounds have declined by count over the past seven quarters…though an increase in convertible note usage might help explain the recent decrease.” These reports rely on many of the same data sources. And those sources are themselves largely dependent on seed companies choosing to announce their rounds, or raising priced equity rounds.

The reports are primarily content marketing for the companies, which charge hundreds to thousands of dollars per month for a license. That by itself is fine – customers can pay or not pay depending on the value they derive. People who take the time to clean or validate the data are entitled to charge for their value-adds.

That said, because the pieces are content marketing, the data sets upon which they’re based are kept opaque. We have investor “rankings” based on god knows what – pay to find out! We see articles on funding trends that are largely dependent on the biases of the person or company behind them – does a $50MM first round belong in an article on Series A funding trends? If not, where should that cap be? $10MM? $8MM? Do we use the Silicon Valley average to set it, or the national average?
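That cap isn’t a rhetorical quibble; it is the trend. A toy example with made-up round sizes, showing how the headline count moves with the cutoff:

```python
# Made-up first-round sizes ($MM) for a single quarter.
rounds = [2.5, 4, 6, 8, 9, 12, 15, 50]

for cap in (8, 10, 50):
    count = sum(1 for r in rounds if r <= cap)
    print(f'cap ${cap}MM: {count} "Series A" rounds')
# cap $8MM: 4 "Series A" rounds
# cap $10MM: 5 "Series A" rounds
# cap $50MM: 8 "Series A" rounds
```

Two reports using the same raw data but different caps can honestly report opposite “Series A trends,” which is why the cutoff belongs in the report, not behind the paywall.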

Analysis involves making tough calls. I’m sure none of these companies wants a subjective classification decision to result in an inaccurate picture. One solution would be to make the data sets underlying the reports public. That way, it would be possible to uncover classification discrepancies and perhaps converge on a standardized taxonomy. Because as it stands, these companies are far more Gartner than Bloomberg.

And the industry would really benefit from a Bloomberg.