It’s Time to Move On From Liking Scores

Liking scores and the way they’re generated in traditional sensory testing are an inaccurate way of understanding consumer preference. In the Gastrograph AI system, we go through several steps to generate a Perceived Quality score — which provides a more accurate picture of consumer preference.

As we all know, love is a mystery. It’s notoriously difficult to predict which CPG products consumers will love.

Since the 1960s, most companies have used the nine-point hedonic scale to develop an average “liking score.” They use this score to inform all of their product decisions. But just because everyone is using liking scores, it doesn’t mean they’re any good.

For CPG companies that want to create products consumers will love, Gastrograph AI is the best option.

What Are Liking Scores?

Liking scores are a way of measuring consumers’ preferences for a product. The nine-point ‘hedonic scale’ runs from one (‘dislike extremely’) to nine (‘like extremely’). An average liking score of seven or above generally means that consumers find the product acceptable.
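To make the arithmetic concrete, here is a minimal Python sketch of how an average liking score might be computed and checked against the "seven or above" rule. The labels are the standard wording of the nine-point hedonic scale; the panel ratings, the function name, and the constant names are invented for illustration.

```python
from statistics import mean

# Standard wording of the nine-point hedonic scale.
HEDONIC_LABELS = {
    1: "dislike extremely",
    2: "dislike very much",
    3: "dislike moderately",
    4: "dislike slightly",
    5: "neither like nor dislike",
    6: "like slightly",
    7: "like moderately",
    8: "like very much",
    9: "like extremely",
}

ACCEPTABLE_THRESHOLD = 7  # "seven or above", as described in the text


def average_liking(scores):
    """Mean of a panel's ratings; rejects values that are not on the scale."""
    for s in scores:
        if s not in HEDONIC_LABELS:
            raise ValueError(f"{s!r} is not a point on the nine-point scale")
    return mean(scores)


# Hypothetical panel of ten consumers (made-up numbers).
panel = [7, 8, 6, 9, 7, 7, 8, 5, 7, 8]
avg = average_liking(panel)
is_acceptable = avg >= ACCEPTABLE_THRESHOLD
print(f"average liking = {avg}, acceptable = {is_acceptable}")
```

Notice that a single number comes out the other end. Nothing in it says why the panel scored the product the way it did.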

David R. Peryam and Beverley J. Kroll (the founders of P&K Research Corporation) popularized the hedonic scale for liking scores in the 1960s. They experimented with scales of different lengths: five, seven, nine, and eleven points.

The scientists chose nine because of limitations in paper size and fixed-pitch typewriters. A nine-point scale would literally fit best on the page.

Do you still own a typewriter? 

In the 1990s, they created a kid-friendly version of the scale.


The kids’ P&K scale uses more accessible language: from ‘super good’ to ‘super bad.’

The nine-point scale was quickly adopted by the food industry, and it’s now the most popular way of measuring liking scores. Even cosmetic and household product companies use it to measure consumer preference.

So, What’s the Problem?

Nine-point scales aren’t a good way of measuring liking because they don’t force people to differentiate between what they like and what they don’t. What’s worse, the way companies typically gather liking scores doesn’t generate accurate data.

A three-point scale has perfect separation. It’s a simple choice between:

  • I don’t like it
  • I’m neutral
  • I like it

But there’s no differentiation: there’s no space to say ‘I like it a lot’ or ‘I like it a little,’ for example.

A 100-point scale has loads of space for nuance. But there’s too much differentiation, so you get random flips: “two products that they like the same are gonna be ranked slightly differently each time,” explains Jason Cohen, Founder & CEO of Gastrograph AI. For example, you might rate the same product 96 one day but 97 the next day.

Because P&K Research popularized the nine-point scale, it’s now the industry standard. It gives you more differentiation than a three-point scale, but you avoid the variations you’d get on a 100-point scale.

What’s Wrong With a 9-point Scale?

The problem is that to most people, a nine-point scale feels the same as a ten-point scale. We’re used to scoring things out of ten — or watching judges do it on Dancing with the Stars, for example — so our brains treat the nine-point scale in the same way. We see five as neutral, anything below it as negative, and anything above it as positive.

When people are rating products in a tasting trial, they often don’t have the chance to form a confident opinion about them. When we don’t feel confident, we don’t like to commit, so we tend to avoid giving very low or very high scores, and the scores gather around the middle. “If it’s a four or if it’s a six, people are more likely to mark a five because they don’t want to differentiate,” says Jason.

Using a scale that creates this effect means you’re reducing your information gain. If you end up with an average score of five, you can’t be confident that people enjoyed your product.
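One way to picture the lost information is to compare the spread of answers before and after tasters retreat to the middle. The sketch below uses Shannon entropy as a stand-in for "information"; that choice, the function name, and the ratings are assumptions made for illustration, not a description of how any real panel was analyzed.

```python
import math
from collections import Counter
from statistics import mean


def shannon_entropy(scores):
    """Entropy, in bits, of the empirical distribution of a list of ratings."""
    total = len(scores)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(scores).values())


# Hypothetical nine-point ratings from eight tasters (made-up numbers).
committed = [3, 4, 4, 4, 6, 6, 6, 7]

# The same tasters, except everyone who felt "four" or "six" marks a five.
hedged = [5 if s in (4, 6) else s for s in committed]

# Both panels average exactly 5 ...
print(mean(committed), mean(hedged))

# ... but the hedged panel carries less information (about 1.81 vs 1.06 bits).
print(round(shannon_entropy(committed), 2), round(shannon_entropy(hedged), 2))
```

The two panels report the same average of five, yet the hedged one hides the fact that tasters were split between mild dislike and mild liking.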

At Gastrograph, we favor a seven-point scale. There’s no option for a neutral score (even a score that ‘feels’ neutral to most people, like five out of nine), so that forces people to make a decision about how much they like the product. Jason explains, “We find that it removes the avoidance of the endpoint. There’s less tendency towards the mean [because] people are forced to make [a] differentiation.”

Flawed Data Gathering

But the nine-point problem is only the start. A bigger issue in traditional sensory testing is the way companies gather their data.

Before they get consumers to give liking scores, most companies send their products through a descriptive round with expert tasters. Here, the experts try the products and identify the flavors that they taste. Then, the company asks consumers to rate the flavors listed by the experts.

So, perception (what you taste) is separate from preference (how much you like what you taste). Because your data is disjointed, liking scores might be able to tell you whether someone likes a product but not why they like it.

If a consumer tastes a flavor the experts haven’t listed, they have no way of communicating this. They’ll probably ‘offload’ their preference for that flavor onto one of the flavors that is on the list.

Let’s say a consumer tastes an awful cardboard flavor. They can’t add a description, so they translate ‘cardboard’ to ‘bitterness’ and give the ‘bitter’ flavor a low score. So you have no idea your product tastes like cardboard.

Another issue is that companies typically recruit ‘heavy users’ (people who already consume their product a lot). Then, they try to apply liking scores generated by heavy users to predict how much the general population will enjoy their product.

A famous example of this kind of sampling frame error comes from the 1936 US presidential election. Pollsters at The Literary Digest surveyed a sample of people selected from car registrations and telephone directories. The results of the survey predicted a Republican victory.


Spoiler alert: the Democrats won.

But in 1936, most Americans didn’t own cars or telephones, and those who did tended to be wealthier and more likely to support the Republicans. So, of course, the prediction was wrong.

Imagine you’re a yogurt brand looking to expand your market, and you want to test out a new passionfruit yogurt. For your sample, you choose yogurt fans. These people eat yogurt several times a day, and they tell you they LOVE your product. Great, you think. My product will be a success!

But the average non-yogurt-obsessed consumer doesn’t enjoy your product as much. The biased sample you selected means your results don’t accurately predict preference among the general population.
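The yogurt example comes down to simple arithmetic. Here is a rough sketch with invented numbers: the ratings, the 10% heavy-user share, and the variable names are all assumptions chosen only to show how far a heavy-user average can sit from a population-weighted one.

```python
from statistics import mean

# Hypothetical nine-point liking scores (made-up numbers).
heavy_user_scores = [8, 9, 8, 7, 9, 8]               # the yogurt fans you recruited
everyone_else_scores = [5, 4, 6, 5, 3, 6, 5, 4]      # consumers you never asked

# Assumed share of heavy users in the market you want to expand into.
heavy_user_share = 0.10

# What the biased sample tells you.
sample_estimate = mean(heavy_user_scores)

# What a population-weighted average would look like.
population_estimate = (
    heavy_user_share * mean(heavy_user_scores)
    + (1 - heavy_user_share) * mean(everyone_else_scores)
)

print(f"heavy users only:   {sample_estimate:.2f}")
print(f"general population: {population_estimate:.2f}")
```

With these numbers, the heavy-user sample clears the "seven or above" bar comfortably while the weighted estimate lands close to neutral.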


Passionfruit yogurt could be a hit with consumers.

As well as not being representative of typical consumers, heavy user samples make it difficult to differentiate. “You’ve already removed everyone who doesn’t like your product,” says Jason, so everyone gives it a high score of seven to nine. When most people give a high score, you don’t learn anything new about preference for your product.

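[Figure: distribution of liking scores on the nine-point scale, 1 to 9]

A tell-tale sign of a ‘heavy user’ sample is a graph that looks like this, with the scores bunched at the high end of the scale.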

Perceived Quality Scores With Gastrograph AI

In the Gastrograph system, we use a seven-point scale, we don’t separate perception from preference, and we have tasting models for populations all over the world. That means Perceived Quality scores from Gastrograph are more accurate than liking scores.

With a quality score out of seven, you don’t get the endpoint avoidance that you do with a nine-point scale. In our system, people also have to take time to identify what they taste before they rate the quality of the product.

Conventional testing: taste → rate out of nine

Gastrograph AI: taste → fill out a full sensory profile (identifying flavor attributes & their intensities) → rate out of seven

The result is that we allow consumer tasters to accurately describe what they taste, so we avoid flavor offloading.
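As a rough picture of what keeping perception and preference together could look like in data, here is a hypothetical record type. It is not Gastrograph's actual schema: the class name, field names, the 1 to 7 quality range, and the 0 to 5 intensity range are all assumptions for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TastingReview:
    """One taster's review: what they tasted and how they rated it, together."""
    product: str
    perceived_quality: int  # rated out of seven (assumed range: 1 to 7)
    # Flavor label -> perceived intensity (assumed range: 0 to 5).
    # Labels are free text, so a taster is not limited to a fixed list.
    flavor_intensities: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self):
        if not 1 <= self.perceived_quality <= 7:
            raise ValueError("perceived_quality must be between 1 and 7")
        for label, intensity in self.flavor_intensities.items():
            if not 0 <= intensity <= 5:
                raise ValueError(f"intensity for {label!r} must be between 0 and 5")


# A taster who notices a cardboard note can say so directly,
# instead of offloading it onto 'bitter'.
review = TastingReview(
    product="passionfruit yogurt",
    perceived_quality=3,
    flavor_intensities={"passionfruit": 4.0, "sour": 3.0, "cardboard": 2.5},
)
print(review)
```

Because the low quality score and the ‘cardboard’ label live in the same record, the reason for the score travels with it.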


Tasters can select a label to describe their flavor or write a new one in any language they want.

Our results are also indicative of sustained preference. Jason explains, “You can get a lot of preference from products that have novelty. If you’ve never tasted, say, a blue raspberry lemonade, and you taste one, and you’re like, oh, wow, that’s really interesting.” But as the novelty wears off, preference falls.

With Gastrograph, Jason says, “[tasters] already had to think about all the flavors and work to identify the flavors in our system,” so when they rate the products, the results are more indicative of a two to three-week trial. Getting an idea of preference over time is so important in the CPG industry because a product’s success depends on repeat purchases.

We’ve built our AI models on tasting data from all over the world, which allows us to avoid biased testing samples. In traditional testing, companies typically sample heavy users, then take an average and use that as an indication of how the general population will respond.

At Gastrograph, we’ve spent years creating perception and preference models for different subpopulations all over the world. Our system understands how different demographics perceive different flavors and how this influences their preference.

When we’re testing out a new product, we collect perceived quality scores from one demographic. We’re then able to translate those scores to predict how the product would fare with whichever population you want to target.

Solve the Mystery of Consumer Preference

Gastrograph gives you an abundance of useful data to inform your product decisions. We can accurately predict consumer preference, so you’re able to successfully launch products you know people will love.







