At Wysdom, two things never stop – experimentation and benchmarking.
Experimentation ensures that we continually challenge the way we do things and foster a playground of trials where we constantly develop, adapt, and adopt cutting-edge techniques that will allow enterprises to serve engaging interactions with their end customers using intelligent automation.
Benchmarking introduces rigour and discipline. As a team, we push hard to meet the KPIs that an experiment seeks to serve. After all, what gets measured is what gets optimized.
We found an NLU benchmarking test using data that is way out of our wheelhouse
A benchmarking exercise led by Nguyen Trong Canh that compares the leading NLU engines in the industry recently caught our attention. The exercise uses data aggregated from open question-answer datasets from Ask Ubuntu, Stack Exchange, and a German public transit chatbot to create 4 distinct corpora for testing. (The summary of the datasets can be found here.)
Because these datasets are distinctly different from the industries that Wysdom usually deals with, the Wysdom team geared up with excitement to complete the benchmarking exercise and test our own NLU engine on the data. After all, recent enhancements to our multi-stage NLU pipeline allow us to use a combination of statistical approaches, boosting, and deep learning engines, and give us the ability to automatically detect and trash garbage utterances, identify and respond to small talk, and more.
F1 scores: A measure of accuracy
Things like intent classification and entity extraction are critical components of a Natural Language Understanding (NLU) system in any bot platform, so ensuring accuracy is the most important goal.
An F-score is a measure of a test’s accuracy. It considers both the precision and the recall of the test to compute the score. Precision is the number of correct positive results divided by the number of all positive results returned by the classifier. Recall is the number of correct positive results divided by the number of all samples that should have been identified as positive. The F1 score is the harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall). It reaches its best value at 1 (perfect precision and recall) and its worst at 0.
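To make the definitions concrete, here is a minimal Python sketch of per-intent precision, recall, and F1 for an intent classifier. It is an illustration only, not the evaluation script used in the benchmark or by Wysdom; the function name `f1_per_intent` and the intent labels `find_route` and `departure_time` are hypothetical.

```python
from collections import Counter


def f1_per_intent(gold, predicted):
    """Return {intent: (precision, recall, f1)} for paired label lists.

    Hypothetical helper for illustration; assumes gold and predicted
    have the same length and are aligned utterance by utterance.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct positive for intent g
        else:
            fp[p] += 1   # p was returned but is wrong
            fn[g] += 1   # g should have been returned but was missed

    scores = {}
    for intent in set(gold) | set(predicted):
        returned = tp[intent] + fp[intent]   # all positives returned
        expected = tp[intent] + fn[intent]   # all true positives expected
        precision = tp[intent] / returned if returned else 0.0
        recall = tp[intent] / expected if expected else 0.0
        total = precision + recall
        # Harmonic mean of precision and recall; 0 when both are 0.
        f1 = 2 * precision * recall / total if total else 0.0
        scores[intent] = (precision, recall, f1)
    return scores


# Made-up labels for four utterances (not benchmark data).
gold = ["find_route", "find_route", "departure_time", "departure_time"]
predicted = ["find_route", "departure_time", "departure_time", "departure_time"]

for intent, (p, r, f1) in sorted(f1_per_intent(gold, predicted).items()):
    print(f"{intent}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this toy example, `find_route` has perfect precision but a recall of only 0.5, and its F1 comes out at about 0.67, below the simple average of 0.75. The harmonic mean penalizes an imbalance between precision and recall, which is why it is used instead of a plain average.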
As you can see from the results comparison, Wysdom offers one of the best NLU classification performances in the industry.
F1-scores for intent classification for each corpus:
While a good F1 score alone does not guarantee an effective bot, a poor F1 score definitely guarantees an ineffective one.
We should also note that the Wysdom Exchange, which provides pretrained models and data across specialized enterprise verticals such as telecommunications, banking, insurance and more, was not at play for this benchmarking exercise given the nature of the benchmarking data. The researchers compared F1 scores, a well-established metric from machine learning and information retrieval that is commonly used to measure the performance of NLU systems.
Wysdom’s NLU is among the best in the world
This exercise provided us with validation that Wysdom’s NLU is right up there with the best and, when combined with prebuilt, industry-specific knowledge from the Wysdom Exchange, outperforms even the biggest players in AI.
Interested in learning more about the Wysdom Exchange? Request a demo to see our Conversational AI in action.