Metrics: Swallow the Red Pill

16 May 2024

This is the first part of a trilogy in which I will discuss some high-level ideas for measuring end-user performance for conversational voice assistants. In this article, I discuss how to think about metrics conceptually for a conversational voice assistant, along with some pitfalls from an engineer’s point of view.

The aim is to keep the discussion simple and fairly high level so that it is useful to everyone working in this domain.

Part 1 - Metrics: Swallow the Red Pill
Part 2 - Metrics Reloaded: The Oracle
Part 3 - The Metrics Revolution: Scaling

Quick Intro

I am a senior engineer and tech lead for the Google Assistant. Here is my LinkedIn profile in case you want to know more about me.

I also hold 2 patents in performance evaluation infrastructure for voice assistants:

Standardizing analysis metrics across multiple devices
Device Agnostic Framework to Measure Reliability During User Interactions (patent pending)

And yes, I am a big fan of the “Matrix” trilogy :)

Which metrics matter?

Metrics should be classified into two top-level categories:

User Reported Metrics
User Perceived Metrics

Note that the word “user” is present in both, which should be expected. The golden rule for any conversational product, or any product in general, is that the “user” should always be given top priority.

Before defining the two categories, note that a user-facing metric usually has two aspects:

Reliability - Was the user able to complete a specific action?
Latency - How long did it take for the user to complete a specific action?

Note that high latency can also mean low reliability. No user will wait 20 seconds for an action to complete, even if it eventually succeeds.
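To make the link between latency and reliability concrete, here is a minimal sketch in Python. It assumes each interaction is logged with a completion flag and a latency, and it counts a completed action as a failure once it exceeds a patience limit. The Interaction record, the function names and the 20 second cutoff (borrowed from the example above) are illustrative assumptions, not an actual production schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative cutoff: beyond this, we assume the user has given up.
PATIENCE_LIMIT_S = 20.0


@dataclass
class Interaction:
    """Hypothetical log record for one user request."""
    completed: bool
    latency_s: Optional[float]  # None when the action never completed


def raw_reliability(interactions: List[Interaction]) -> float:
    """Share of requests that completed, ignoring how long they took."""
    if not interactions:
        raise ValueError("no interactions to evaluate")
    return sum(i.completed for i in interactions) / len(interactions)


def effective_reliability(
    interactions: List[Interaction],
    patience_limit_s: float = PATIENCE_LIMIT_S,
) -> float:
    """Share of requests that completed within the patience limit."""
    if not interactions:
        raise ValueError("no interactions to evaluate")
    ok = sum(
        1
        for i in interactions
        if i.completed
        and i.latency_s is not None
        and i.latency_s <= patience_limit_s
    )
    return ok / len(interactions)


sample = [
    Interaction(True, 1.2),
    Interaction(True, 25.0),   # completed, but far too slowly
    Interaction(False, None),
    Interaction(True, 0.8),
]
print(raw_reliability(sample))        # 0.75
print(effective_reliability(sample))  # 0.5
```

The gap between the two numbers is the portion of "successful" requests that the user most likely experienced as failures.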

User Perceived Metrics

These metrics capture the reliability and latency of actions that can be directly observed by the user.

e.g.

How long does it take for the voice assistant to start playing music when the user asks it to?
How many times does the voice assistant actually start playing music when the user asks it to?

These metrics are similar to system-level metrics: they can be computed for every interaction and hence have high coverage, but because they are derived rather than stated by the user, they carry comparatively lower confidence.

User Reported Metrics

These metrics are reported directly by users through feedback channels, internal testing, surveys and similar sources. Because they come explicitly from the user, they have low coverage but comparatively higher confidence.

The Red Pill

A rule of thumb I follow for metrics is to swallow the red pill, which means never fully trusting a metric: always question both negative and positive movement in a metric.

The reason I say this is that, most of the time, metrics cannot be measured explicitly and need to be derived from several different signals.

Example 1: “How long does it take for the voice assistant to start playing music when the user asks it to?”

We might not be able to capture the true “latency” that the user experiences in this case with 100% accuracy. But we can observe different components, such as how long specific backends took to respond, and use heuristics to come up with a metric value that approximates the user-perceived latency to the best of our knowledge. This is where system-level component metrics come into the picture.
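One possible heuristic is sketched below in Python, under stated assumptions: stages that run one after another add up, while backends queried in parallel cost only as much as the slowest one. The ComponentTimings record and the stage names in the example are hypothetical, and a real pipeline would likely need to account for overlap, retries and client-side rendering as well.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ComponentTimings:
    """Hypothetical per-request timings collected from system-level metrics."""
    sequential_ms: Dict[str, float]
    parallel_backends_ms: Dict[str, float] = field(default_factory=dict)


def approximate_perceived_latency_ms(timings: ComponentTimings) -> float:
    """Approximate the latency the user perceived for one request.

    Assumption: sequential stages add up, and parallel backends
    contribute only the duration of the slowest one.
    """
    sequential = sum(timings.sequential_ms.values())
    slowest_backend = max(timings.parallel_backends_ms.values(), default=0.0)
    return sequential + slowest_backend


request = ComponentTimings(
    sequential_ms={
        "speech_recognition": 300.0,
        "intent_detection": 120.0,
        "audio_start": 200.0,
    },
    parallel_backends_ms={"music_catalog": 450.0, "user_profile": 150.0},
)
print(approximate_perceived_latency_ms(request))  # 1070.0
```

The result is only an estimate, which is exactly why movement in such a metric deserves to be questioned before it is trusted.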

Example 2: “How many times does the voice assistant actually start playing music when the user asks it to?”

This is difficult to determine from User Perceived Metrics alone. The rationale is simple: if the voice assistant is able to detect that the user wants to play music, then it will start playing music. Requests that it fails to recognize are never counted as music requests in the first place, so this metric will in practice report close to 100% reliability, which is incorrect. This is where User Reported Metrics come into the picture. However, since User Reported Metrics can be detached from the system-level components, it is important to connect them with system-level metrics or User Perceived Metrics to make them actionable.
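The Python sketch below illustrates both halves of this argument with made-up data. It assumes every interaction log and every user report carries a shared interaction_id (an assumption, along with all field and function names here), so a report can be joined back to what the system believed happened.

```python
from typing import Dict, List, Optional


def perceived_reliability(logs: List[Dict], intent: str) -> Optional[float]:
    """Reliability computed only over requests the system recognized."""
    detected = [e for e in logs if e["detected_intent"] == intent]
    if not detected:
        return None
    return sum(e["action_started"] for e in detected) / len(detected)


def attach_reports(logs: List[Dict], reports: List[Dict]) -> List[Dict]:
    """Join user reports to system logs so each complaint is actionable."""
    by_id = {e["interaction_id"]: e for e in logs}
    joined = []
    for report in reports:
        entry = by_id.get(report["interaction_id"])
        joined.append(
            {
                **report,
                # None means the report could not be matched to any log entry.
                "detected_intent": entry["detected_intent"] if entry else None,
            }
        )
    return joined


logs = [
    {"interaction_id": "a", "detected_intent": "play_music", "action_started": True},
    {"interaction_id": "b", "detected_intent": "play_music", "action_started": True},
    # The user asked for music, but the system heard a timer request.
    {"interaction_id": "c", "detected_intent": "set_timer", "action_started": True},
]
reports = [{"interaction_id": "c", "complaint": "asked for music, got a timer"}]

print(perceived_reliability(logs, "play_music"))  # 1.0
print(attach_reports(logs, reports)[0]["detected_intent"])  # set_timer
```

The derived metric looks perfect because the misrecognized request never enters its denominator. The joined report is what points to the component that needs attention.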

Conclusion

User Perceived Metrics and User Reported Metrics complement each other and form the foundation for a highly reliable conversational assistant. Enabling these metrics at a later stage might not be trivial and may require significant bandwidth, so metric evaluation should be given high priority when designing conversational assistant products. It is the only way to objectively evaluate the user experience, and it may influence some of the design decisions, which is why it is worthwhile to understand the metric requirements in the early stages of product development.

There is a need to invest in infrastructure to ensure these metrics are easy to use, accurate and actionable. We will discuss how to do that in more detail in the second part of this trilogy. So stay tuned!