2016 was the year of bots, conversational interfaces, and intelligent assistants. It showed the world a glimpse of the promise that exists in an important upcoming computing paradigm we call conversational intelligence. It also reminded us that building and evaluating conversational intelligences is hard. How are you supposed to know whether you can trust one, and with what tasks? Will the respective intelligence respond gracefully, or awkwardly? Who will win user trust and introduce these potentially life-changing interfaces into our daily lives?


At Clara, we know that anything less than near-perfect isn’t good enough for a successful conversational interface. Over the last three years, we’ve defined, built, and delivered on the exceptional level of service that our customers require and deserve. As we move the state of the art forward, we also want to commit to looking at our work to measure its precision and sophistication — the things that matter most to winning user trust.


Publishing quality metrics is standard practice for all sorts of established business services and products. It’s something consumers and businesses have come to expect. So we’re always wondering: what metrics will help consumers evaluate the intelligence of this new category of interface?


We are continuously seeking new ways to define and assess quality metrics. In fact, we’ve already cycled through several approaches. There’s no established standard for assessing conversational intelligences, but we believe that by sharing the ways we think about quality, we can help our customers better understand what they can, and should, expect from using Clara. Periodically, we’ll share our methodologies for capturing Clara’s performance and holding our service to the high standard our customers rely on. As a first step, we’ll share some of the early ways we’ve been measuring quality.


Two of Clara’s early quality metrics


  • Clara’s system-wide intelligence: measured by per-response success rate
  • Clara’s system-wide 24/7 response time: measured by median response time and 80th percentile response time


FAQ — methodology details


What is an error?


Typically when you think about an error, you think in binaries. Things work, or they don’t work. I clicked the button and got a result, or I got an error. Conversational interfaces are more challenging than most: you have to correctly interpret a huge number of variables at a time to produce an intelligent — fully correct — output.


We’re diligently working to clearly define the full scope of potential errors throughout the scheduling process. This means there is an ever-growing list of error categories as we grow Clara’s expertise and release additional features.


How do you know there was an error?


Our Clara Remote Assistants (CRAs) monitor outgoing Clara responses for errors. When an error is found (usually by a CRA, otherwise by a meeting guest who notifies us of the error), a CRA labels the error and its category.


Why do you measure the success rate per response?


We realize that success isn’t just determined by a meeting being successfully scheduled. Success is defined by Clara following every one of a customer’s preferences, adhering to every grammatical rule, and interacting with our customer’s contacts flawlessly every time. If Clara forgets that you only like coffee meetings before 10am, then Clara didn’t deliver. We are setting a high bar on quality and defining quality to ensure we don’t just get a meeting on the calendar, but that Clara is a trusted partner every time.


What does a response success rate mean in practice?


For example: a 97% per-response success rate means 3 errors in every 100 responses (equivalently, 1.5 in every 50), so the vast majority of responses are error-free. Because CRAs are involved in detecting errors, we’re often able to fix them before they affect the scheduled meeting or are even noticed by a customer, which makes the perceived service quality higher.
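To make the arithmetic concrete, here is a minimal sketch of how a per-response success rate could be computed. The function names and the idea of counting responses that carry at least one labeled error are assumptions made for illustration, not a description of Clara's internal tooling.

```python
def per_response_success_rate(total_responses: int, responses_with_errors: int) -> float:
    """Fraction of responses that had no labeled error (hypothetical helper)."""
    if total_responses <= 0:
        raise ValueError("total_responses must be positive")
    if not 0 <= responses_with_errors <= total_responses:
        raise ValueError("responses_with_errors must be between 0 and total_responses")
    return (total_responses - responses_with_errors) / total_responses


def expected_errors(success_rate: float, n_responses: int) -> float:
    """Average number of erroneous responses implied by a success rate."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return (1.0 - success_rate) * n_responses


# 3 erroneous responses out of 100 gives a 97% success rate,
# which corresponds to about 1.5 erroneous responses per 50.
rate = per_response_success_rate(total_responses=100, responses_with_errors=3)
errors_per_50 = expected_errors(rate, 50)
```

Counting per response rather than per scheduled meeting means a single meeting with several messages can contribute several successes or errors to the total.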


How is Clara’s response time measured?


Clara’s response time is measured by taking the date-time of every incoming message that requires a response and subtracting it from the date-time at which Clara’s response was sent. This means our response time takes into account our AI/machine learning, any human interactions needed to ensure our AI/machine learning is accurate, and our rigorous QA processes.
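The calculation described above can be sketched as follows, together with the two summary statistics named earlier (median and 80th percentile). This is an illustration under stated assumptions: the function names are hypothetical, response times are expressed in minutes, and the 80th percentile uses the nearest-rank method, since the percentile definition actually used is not specified here.

```python
import math
from datetime import datetime, timedelta
from statistics import median


def response_time(received_at: datetime, sent_at: datetime) -> timedelta:
    """Time between an incoming message and the response to it."""
    if sent_at < received_at:
        raise ValueError("response cannot be sent before the message arrived")
    return sent_at - received_at


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed definition, one of several in use)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_response_times(pairs: list[tuple[datetime, datetime]]) -> tuple[float, float]:
    """Return (median, 80th percentile) of response times in minutes."""
    minutes = [response_time(r, s).total_seconds() / 60 for r, s in pairs]
    return median(minutes), percentile_nearest_rank(minutes, 80)


# Hypothetical sample: five messages answered after 10, 20, 30, 40, and 50 minutes.
start = datetime(2017, 1, 1, 9, 0)
sample = [(start, start + timedelta(minutes=m)) for m in (10, 20, 30, 40, 50)]
median_minutes, p80_minutes = summarize_response_times(sample)
```

Reporting the 80th percentile alongside the median shows how long the slower responses take, which a median alone would hide.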


Measuring intelligence is challenging. One major advantage of cooperative intelligence over a purely human solution is that quality can be measured at a granular level. That measurement can — and should — be exposed to customers. Our ambition is to continue to evolve these metrics, and ultimately transition them to transparent SLAs for our customers. In the meantime, we maintain internal diagnostic tools to ensure we’re delivering the most sophisticated, dependable scheduling service on the market.