What's the best way to predict user ratings?
I recently read an article on using machine learning to predict a user's 1 to 5 rating on a particular service or product. As test cases, the article considered ratings of a class on Coursera, ratings of a musical instrument on Amazon, and ratings of a hotel on Trip Advisor. In all cases the idea was to use the narrative text of the review to predict the rating. The specific question explored by the article was whether it worked better to approach this as a regression problem or a classification problem.
Let me give a clear example of each to illustrate the difference, starting with regression. Let's say you wanted to create a home appraisal algorithm. What would go into the algorithm would be various features of the house such as number of bedrooms and bathrooms, square footage, and neighborhood. What comes out would be an estimated market value for the house. These market values can be ordered with respect to each other; a market value of $420,000 is more than a value of $345,000, which is more than a value of $280,000. Furthermore, the input to the algorithm is related to the output in a systematic way; increase the number of bedrooms and the value will go up. Decrease the square footage and it will go down. This would be a typical regression problem.
Now let's suppose you want to predict what social media platform a person prefers to use based on various demographic information about them. That algorithm would spit out a platform like Instagram or TikTok or Facebook. But you can't put these in order the way you can for market value; Instagram is neither more nor less than Facebook, it's just different. And because of that, the relationship between inputs and outputs is also not so clear-cut. For example, gender is related to preferred social media platform, but a person's gender may make it more likely for them to use some platforms, less likely for them to use other platforms, and have no impact for yet other platforms. This is a typical classification problem.
So back to our original question - is predicting a user rating more of a regression problem (like appraising a house) or more of a classification problem (like predicting one's preferred social media platform)? The article's author initially saw it as a regression problem, on the grounds that ratings have a natural ordering - 5 stars is more than 4 stars is more than 3 stars. But in a head-to-head comparison between algorithms, the classification algorithm came out on top.
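To make the two framings concrete, here is a minimal sketch in Python. It is not the method from the article I read; it is a toy I made up, with invented ratings, that reduces each approach to its simplest possible form. The regression-style predictor treats ratings as numbers and returns their mean, and the classification-style predictor treats them as labels and returns the most common one.

```python
from collections import Counter
from statistics import mean

# Hypothetical ratings from five reviews that look alike to the model.
# These numbers are invented for illustration only.
ratings = [1, 1, 5, 5, 5]


def regression_style_prediction(values):
    """Treat ratings as numbers: the mean is the single guess
    that minimizes squared error."""
    return mean(values)


def classification_style_prediction(values):
    """Treat ratings as labels: the most frequent label is the single
    guess that is exactly right most often."""
    return Counter(values).most_common(1)[0][0]


numeric_guess = regression_style_prediction(ratings)
label_guess = classification_style_prediction(ratings)

print("regression-style guess:", numeric_guess)
print("classification-style guess:", label_guess)
```

With these made-up ratings the regression-style guess is 3.4, a rating nobody actually gave, while the classification-style guess is 5. When opinions are split like this, averaging can land in a place where no real user is.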
There are many possible reasons for this result. But I would like to focus on one of them, what economist Tim Harford calls "premature enumeration." When we offer a user a 1 to 5 scale for their rating, we are asking them to take some sort of experience and assign a number to it. But it is a mistake to automatically assume that these ratings will take on the properties we typically associate with numbers, like 5 being more than 4 being more than 3. To know whether that is a reasonable assumption or not, we need to understand how users are mapping experience to numbers.
To make this clear, let's consider one possible scenario. Suppose the rating has almost nothing to do with the actual experience, and mostly reflects the user's personality. Happy-go-lucky Hannahs give out 5's left and right. Equivocal Emilys default to a 3 unless the experience was extraordinarily good or bad. And Curmudgeonly Claires almost always give a 1. If you look at a set of ratings, you'll see bigger and smaller numbers. And you probably could predict the rating from a narrative review. But it would be inappropriate to interpret a 5 as being more than a 3. By using numeric ratings we create the superficial appearance of a regression problem, but this is effectively a classification problem. Furthermore, we would be classifying the users, not the product or service.
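The scenario above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `rate` function, the 0-to-1 experience scale, the cutoffs for an "extraordinary" experience, and the simplification that Hannah always gives a 5 and Claire always gives a 1.

```python
from statistics import mean


def rate(personality, experience):
    """Hypothetical rating rule. `experience` runs from 0 (awful) to
    1 (wonderful), but personality does almost all of the work."""
    if personality == "hannah":
        return 5  # simplified: always a 5
    if personality == "claire":
        return 1  # simplified: always a 1
    # Emily defaults to 3 unless the experience was extraordinary.
    if experience > 0.95:
        return 5
    if experience < 0.05:
        return 1
    return 3


# Two products that deliver exactly the same experience...
same_experience = 0.6

# ...but happen to attract different mixes of raters.
product_a_raters = ["hannah", "hannah", "hannah", "emily"]
product_b_raters = ["claire", "claire", "emily", "emily"]

avg_a = mean(rate(p, same_experience) for p in product_a_raters)
avg_b = mean(rate(p, same_experience) for p in product_b_raters)

print("product A average:", avg_a)
print("product B average:", avg_b)
```

Under these assumptions product A averages 4.5 and product B averages 2, even though the underlying experience is identical. The gap tells you who showed up to rate, not which product is better.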
This is a somewhat extreme example, but it makes the point that you always need to consider where data points are coming from before making assumptions about what they mean. As a psychologist I happen to think it helps to have a strong social science background, but even if you don't, you can get a lot of mileage out of considering what I will call "the usual suspects" - factors that are known to influence human behavior across a broad range of situations. I'll consider each one in turn.
In psychology the term individual differences is used to refer to relatively stable differences between people that impact how they perceive the world, how they think, and how they act. Personality would be one kind of individual difference, but depending on what kind of behavior you're looking at, all kinds of things can fall under this umbrella - whether someone is single or married, whether they are bilingual, or whether they have a disability. One study found that cultural values, specifically individualism versus collectivism, can influence user ratings.
A second factor that frequently shapes human behavior is context. Imagine someone giving a big, public presentation, like a TED talk. Now imagine that same person at a Happy Hour with some close friends. Even though it's the same person, they will likely talk and act very differently in those two situations. In this case the difference in context has to do with the setting, but for consumer behaviors the information available to a person also serves as part of the context. One way information can shape behavior is through anchoring effects. For example, one study found that people tend to make a different (and lower) payment on their credit card when a minimum payment amount is provided, compared to when the statement does not have a minimum payment. The minimum payment amount serves as an anchor that tends to pull the actual payment amount toward it.
In the case of user ratings, there is frequently an anchor present in the form of the average rating of other users. And there is evidence that seeing an average rating skews new ratings toward that average. For anyone who relies on average ratings to help guide purchase decisions (which is most of us), this is a sobering thought.
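One way to get a feel for this is a toy simulation. The model below is my own assumption, not something taken from the research: each new user has a private opinion, sees the current average, and reports a rating pulled part of the way toward that average. The `pull` parameter, the function names, and the opinions are all hypothetical, and ratings are left as decimals rather than rounded to whole stars.

```python
from statistics import mean


def anchored_rating(private_opinion, displayed_average, pull):
    """Move the private opinion a fraction `pull` of the way toward
    the displayed average (pull=0 means no anchoring at all)."""
    return private_opinion + pull * (displayed_average - private_opinion)


def simulate(private_opinions, pull):
    """Return the ratings actually posted, in order. The first user
    sees no average, so they post their private opinion unchanged."""
    posted = []
    for opinion in private_opinions:
        if posted:
            displayed = sum(posted) / len(posted)
            posted.append(anchored_rating(opinion, displayed, pull))
        else:
            posted.append(float(opinion))
    return posted


# One enthusiastic early reviewer followed by four unimpressed users.
opinions = [5, 2, 2, 2, 2]

unanchored = simulate(opinions, pull=0.0)
anchored = simulate(opinions, pull=0.5)

print("average without anchoring:", mean(unanchored))
print("average with anchoring:", mean(anchored))
```

In this sketch a single early 5 keeps lifting the ratings that follow, so the anchored average ends up above the average of what people privately thought. How strong the pull is in real data is an empirical question that the sketch does not answer.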
A third factor that commonly influences data on human behavior is reactivity. Reactivity occurs when people change their behavior in response to some aspect of the measuring process itself. One common form of reactivity is driven by social desirability - when we know we are being observed, we change our behavior to try to make a positive impression on those around us. For example, imagine you are in a meeting at work where a high-ranking executive is promoting the company's CSR efforts. In the middle of their presentation, they suddenly decide to go around the room and ask everyone how much they donated to charity last year. How honest would you be? Unless you happen to be particularly philanthropic, you might be tempted to distort the number upward to appear more generous.
In user ratings, reactivity is a potential concern any time it is possible to link a rating with a person's identity. But it can also crop up in more subtle ways. For example, consider the common practice of offering an incentive to get users to leave a review. Interestingly, according to one research study, this creates a pair of competing pressures. On the one side is the so-called reciprocity norm - someone does something nice for you, you do something nice for them in return. On the other side is consumers' desire to not be manipulated, which can push them to be more critical of a product. Even if on balance the two pressures cancel out, as they did in this study, the point remains that users' thinking was influenced by the incentive.
I started this article with the question of whether predicting user ratings is better thought of as a regression problem or a classification problem. The point I've tried to make with the rest of the article is that this may be the wrong question - or at least not the question to start with. It may be possible to develop a model that does a great job predicting one kind of data that you don't really understand using another kind of data that you don't really understand. And if the goal is just to come up with better algorithms, that may be a very sensible thing to do. But it would be very risky to use such a model as the basis for important business decisions. If the numbers are going to inform action, it's important to have a good grasp on what they do and don't mean. When the data concern human thought or behavior, that can be messy and complicated, and I realize that this messiness can be discouraging. But just as machine learning can help us see things we didn't see before, so can diving into the messiness. In fact, I think if the full promise of big data is going to be realized, we need to do both.