The Wheat from the Chaff

If you’ve read any of my blog posts before, you may know that I hate traditional 5-star feedback systems. I think they miss a huge opportunity and often present a false signal of quality.

I was chatting with a marketplace founder recently and was pleasantly surprised that they are taking a big step forward with a feedback system that’s really designed to separate the proverbial wheat from the chaff.

It made me realize that we’ve all been making a big mistake with feedback systems. In most marketplaces, we’ve been operating in a one-size-fits-all mode. Everyone from eBay to Uber to Upwork to Amazon uses a 5-star feedback system on completed transactions to accomplish two very distinct goals.

Marketplaces have been misusing the 5 stars to poorly accomplish both of these goals.

Goal 1) Prevent bad transactions from happening in the future. (Prevent & Remove bad)

Goal 2) Identify the highest-quality good or service. (Retain & Optimize great)

The standard 5-star system has unwittingly accomplished the first goal of preventing bad transactions, although you really only need a 2- or 3-point scale to accomplish the same thing. However, for a variety of reasons, the 5-star system does a terrible job at identifying the highest-quality good or service. I believe it’s naive to think that the same approach can both identify the worst and the best in a marketplace. Note that for some marketplaces the goal may be only to prevent bad transactions (Uber / Lyft), while in other marketplaces (Farfetch / Toptal), the goal should really just be about identifying the best. It depends on whether you’re a commodity marketplace or a master marketplace.

I’d like to use this post to explore how to do a better job of identifying the best.  Let’s proceed.

Why don’t 5 star feedback systems work to identify the best services or goods?

I’ve blogged about this topic before – see here. There are multiple issues:

  • There is massive grade inflation – I give 5 stars even when the service was merely adequate.
  • There is an apathy problem – I don’t care to take the energy to rate a middling service.   
  • I want reciprocity – if I rate something 3 stars, will I be punished by the other user?
  • I have misaligned incentives – do I really care about the future users of a marketplace or do I mostly care about myself?

The result of all these things is that you end up with tons of “5-star” rated users or products and can’t tell the difference between them.   

I’m not going in depth on this one since you can read all about it here.  

So what elements could truly improve the ability of feedback systems to identify the best?

To Chess:

ELO ratings have been around since about 1960 and provide a remarkably good and accurate forced ranking of worldwide chess players.  (My ELO rating on chess master pro is a wimpy 1150 – a bright beginner.)  There are millions of chess players in the world, and I believe the vast majority of serious players know their ELO rating and also know exactly what it means to be rated 1000, 1500, or 2000.  It’s also remarkably accurate and a very good predictor of the outcome of a match.

As a thought experiment, could we rank all of the CEOs in the world in the same way? Could we rank all of the salespeople in the same way? The answer is probably not – but there’s still a lot to be learned.  ELO ratings work so well because chess is purely a game of skill (no chance or external factors involved) and every game is a pairwise competition involving 2 and only 2 players.  So we’ll never get to be as clean and accurate as chess, but shouldn’t we try? How valuable would it be if we knew a forced ranking of every software developer in the world? We don’t even need to be that accurate. If we could merely separate the top quartile, we would have ridiculously valuable data for recruiting and hiring.  We actually invested in a company that was using pairwise (1 vs 1) competition-style rankings of people to generate this exact data.  It didn’t work out for external factors, but I still believe the data would have been highly accurate and incredibly valuable.
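The pairwise mechanics behind ELO are simple enough to sketch in a few lines. This is a minimal illustration using the standard chess conventions (a 400-point scale and a K-factor step size); the ratings are made-up numbers, not data from any real marketplace.

```python
# A minimal sketch of an ELO-style pairwise rating update.
# K (the update step size) and the 400-point scale are the
# standard chess conventions.

def expected_score(rating_a, rating_b):
    """Probability that player A beats player B under the ELO model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, a_won, k=32):
    """Return both players' new ratings after a single pairwise match."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# An 1150 beginner who upsets a 1500 player gains far more rating
# than they would by beating another 1150 player.
print(update_elo(1150, 1500, a_won=True))
```

Note the zero-sum property: whatever rating one side gains, the other loses, which is what makes the scale drift-free as more matches accumulate.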

The lesson we should take from ELO ratings is really the same lesson all of us learned in college: people should be graded on a curve.  In marketplace 5-star lingo, someone who gets 4.8 stars might actually be 10 times better than a 4.7 person.  Or they might be the same.  You just need to graph out the distribution curve of feedback and grade people according to standard deviations.  In ELO rankings, this happens naturally by bumping people up or down the scale based on their wins or losses versus other users.  This can be accomplished easily in a marketplace if the same user has ranked 10 different things.

And now, off to Dribbble:

Who’s the best graphic designer in the world? It’s a very tough question to answer, but Dribbble provides some answers.  You can go to their site and sort descending by number of likes.  Pretty damn good designers at the top.  This user-generated data based on likes (and not on feedback) is way more meaningful for separating the wheat from the chaff.

And to Google:

Google is truly the innovator here.  SEO should serve as a very informative field, since Google does a pretty good job of picking the top 1-10 webpages out of 30 trillion options for every search query.  To oversimplify everything, Google started with the observation that website links were a great indicator of authority and importance.  Over the past 20 years, they’ve added another 150+ behavioral factors to assess the importance and relevance of any webpage to any search query.  One of the great (and scary) parts of this whole Internet thing is that we can track and analyze pretty much everything that everyone does online.  So, we know if people spend 5 seconds or 5 minutes looking at a profile.  Or if they hit like at a 20% rate or a 2% rate.  I believe that behavioral data — time on page, click-through %, like rate, and so on — will be one of the more meaningful opportunities to identify the best in any large network of people or goods.

Marketplaces are WAY behind – most have a single search relevance function based on some sort of keyword match and a simple feedback-score sort option.  Over time, marketplaces will develop more advanced algorithms that incorporate more behavioral data – things like job approval rate %, message response time, on-time delivery %, etc.
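One way to picture such an algorithm is a weighted blend of behavioral signals alongside the feedback score. The signals and weights below are invented for illustration; a real system would learn the weights from outcome data rather than hand-tune them.

```python
# A hedged sketch of a marketplace relevance score that blends
# behavioral signals with star feedback. Signal names and weights
# are hypothetical, chosen only to illustrate the idea.

def relevance_score(profile, weights=None):
    """Weighted sum of normalized (0-1) behavioral signals."""
    weights = weights or {
        "avg_rating": 0.2,            # stars / 5, normalized to 0-1
        "job_approval_rate": 0.3,
        "on_time_delivery_rate": 0.3,
        "response_speed": 0.2,        # 1.0 = instant replies, 0.0 = never
    }
    return sum(weights[k] * profile[k] for k in weights)

freelancer = {
    "avg_rating": 4.8 / 5,
    "job_approval_rate": 0.92,
    "on_time_delivery_rate": 0.97,
    "response_speed": 0.85,
}
print(round(relevance_score(freelancer), 3))  # 0.929
```

The point is less the exact formula than the shape: the star rating contributes only one term among several, so two "4.8-star" freelancers can land far apart once delivery and responsiveness are counted.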

Who’s the best chef in the world?

Easy.  Just look at the 180 or so 3-star Michelin chefs.  It’s probably one of those. Why doesn’t a similar ranking system exist for VCs, founders, sales reps, software developers, or admin assistants? Well, for one – the Michelin guide is a unique and amazing institution.  Two – nobody has really tried.  Could you survey 10,000 people and ask, “Who is the best VC you’ve ever worked with directly?” What would the results be? Would they be correlated with the VCs’ performance?  I started a habit several years ago that has been very helpful in my career.  After most meetings, if there’s a good connection, the 2 parties ask, “How can I help?”  I’ve tried several different asks, but my favorite is simply, “Please introduce me to the best entrepreneur you know.”  Interestingly enough, most people have no problem identifying who that is, and I get some fabulous intros this way.  However, if I asked, “Please introduce me to all the 5-star entrepreneurs you know,” I think the results would be very different and I’d be left with all the filtering on my plate.  I strongly believe that humans are exceptionally good at assessing relative quality and absolutely terrible at assessing absolute value.  It’s easy to answer which hamburger was better – McDonald’s or Roam Burger.  It’s very hard to answer how good McDonald’s was.

Is my dad qualified as a VC?

I love my dad – he’s as unique as they come, a lifelong electrical engineer, and he’s developed an interest in startup and angel investing.  He met a random startup entrepreneur and decided to invest.  I checked out the founder’s profile and thought it was in the bottom 20% of founders I’ve met.  So – the question is – who is more qualified to rank the founders?  And should it matter? I believe it does, but every single marketplace I know treats all user feedback the same.  A Yelp review from a Michelin food critic counts just as much as the Yelp review from an annoying teenager who got rude service on their Groupon date. How could that possibly be the right approach?  User authority and relevance are still completely missing from marketplace feedback systems.  Marketplaces will eventually follow in Google’s footsteps by introducing some notion of user authority.
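The mechanical fix is small: weight each review by the reviewer's authority instead of averaging them all equally, PageRank-style. The reviewers and numbers below are hypothetical, a sketch of the idea rather than any real scoring system.

```python
# A sketch of authority-weighted feedback: rather than a plain
# average, each star rating is weighted by the reviewer's
# authority score. All reviewers and numbers are hypothetical.

def weighted_rating(reviews):
    """reviews: list of (stars, reviewer_authority) tuples."""
    total_weight = sum(authority for _, authority in reviews)
    return sum(stars * authority for stars, authority in reviews) / total_weight

reviews = [
    (5.0, 0.9),   # a Michelin critic's review carries real weight
    (1.0, 0.05),  # an anonymous drive-by 1-star barely moves the needle
    (4.0, 0.6),
]
print(round(weighted_rating(reviews), 2))  # 4.48, vs. a plain mean of 3.33
```

The annoying teenager's 1-star still counts, but it no longer cancels out the critic's 5-star the way it does in every marketplace today.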


We’re clearly still in the early days of marketplace science.  We’re doing ok with our current systems to identify the negative users and products in a marketplace.  But, we have a very long way to go to get great at separating the wheat from the chaff and identifying the absolute best people and products. 

I believe the biggest opportunities are in the following themes: 

1) Grading on a curve (ELO)

2) Analyzing user behavior

3) Asking for relative vs absolute value (who’s better / best)

4) Utilizing user authority (like PageRank)

5) Using more social signals (like Dribbble)

If you’re a marketplace entrepreneur out there doing interesting things to improve feedback science, please do reach out.  Would love to hear from you.