By Cassandra Balentine
Technology continues to advance, and with tools like machine learning and artificial intelligence (AI), it is becoming smarter. One capability expected to disrupt both the consumer and enterprise technology sectors is the ability to comprehend speech and take action from voice commands. This is already happening at home, as consumers adopt intelligent personal assistants—or smart speakers.
However, the possibilities of the technology are much bigger than adding items to a shopping list. With voice-based commands, workers in the field don’t have to stop to type notes into a handheld device, improving productivity. It can also play a role in security, offering the option of voice authentication.
According to Speech and Voice Recognition Market by Technology (Speech Recognition, Voice Recognition), Vertical (Automotive, Consumer, Banking, Financial Services, and Insurance (BFSI), Retail, Education, Healthcare & Government) and Geography – Global Forecast to 2023, a report by MarketsandMarkets, the speech and voice recognition market was valued at $6.19 billion in 2017 and is expected to reach $18.30 billion by 2023, a compound annual growth rate of 19.80 percent over the forecast period.
Consumer and Enterprise Effect
Several factors drive the demand for voice and speech recognition solutions for both consumers and the enterprise.
Michael Kennewick, SVP, cloud development, Voicebox, points to three major trends he sees driving growth in the voice and speech recognition market. The first is connectivity and the rise of cloud computing. The second is an explosion in big data and machine learning. The third comes as a result of the first two, and includes a shift towards conversational speech-based interfaces.
Kennewick says conversational interfaces are fundamentally changing the way consumers interact with technology. “When interfaces were limited to screens and keyboards—or even today’s touchscreens—the model of interaction is that users must actively attend to the device in order to use it. They must pause whatever else they’re doing so they can look at the screen in order to do the pointing and clicking necessary to drive that interface,” he offers.
With conversational interfaces, the technology fades into the background. “It waits, ready to serve you when you need it without you needing to direct your full attention to it. You can simply ask it questions—often with an activation keyword, though we anticipate that requirement fading over time—and get your answers all by voice,” explains Kennewick.
This has a big impact on consumer technology; one aspect is cost. Kennewick points out that on many devices the most expensive component is the screen. “A microphone array and commodity speaker saves cost, allowing conversational devices to be much more price competitive while also giving users a more efficient interaction model. And as consumers are learning to expect smooth, seamless conversational experiences, they’re demanding them more,” he says.
Julia Webb, EVP sales and marketing, VoiceVault, says current driving factors in the voice recognition market include the recent epidemic of data breaches and organizations’ need to minimize fraud, significantly reducing costs while also enhancing the customer experience. Voice biometrics provide an enabling technology for multi-modal active consumer authentication.
“Rather than voice biometrics driving technology trends, consumer trends are currently driving increased adoption of voice biometric technology,” says Webb. “As the convenience of AI and Internet of Things (IoT) become more mainstream, consumers are more aware of the pitfalls of a purely speech recognition-based system.”
Webb says consumers are experiencing the power and ease of utilizing their voice and are now, more than ever, prepared for an omnichannel experience where their voice is the only thing needed for authentication. “For example, a bank’s customer needs only to enroll their voice on one channel in order to authenticate within not only the bank’s mobile application (app), but also within their interactive voice response (IVR) system and on the web.”
Kennewick says branded conversational interfaces will require businesses to either invest in their own in-house cloud infrastructures to handle conversational processing at internet scale or outsource that service to companies who can run the conversational interface for them. “Either way, this is a great time to be a speech scientist, computational linguist, or data scientist. Demand for those skills has never been higher,” says Kennewick.
In addition to promoting their own brands through branded conversational interfaces, enterprises also stand to reap huge rewards in terms of cost savings. Conversational interfaces are an excellent fit for customer care applications, providing answers to routine customer service requests and questions. “Big box retailers spend enormous sums of money today answering questions like where the nearest store is and how late it closes. With conversational interfaces, those answers can be automated,” says Kennewick.
Word Error Rates
One challenge holding back the integration of voice and speech recognition tools is the word error rate (WER), a key performance metric for speech and voice recognition. However, recent breakthroughs keep chipping away at the problem as major vendors race to reduce their WERs.
For example, in May 2017, Sundar Pichai, CEO, Google, reportedly stated that the company’s WER had dropped to 4.9 percent.
IBM claims its Watson speech recognition solution’s WER was at 5.5 percent in March 2017.
In an October 2016 blog, Microsoft commented that its “5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation.”
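WER is conventionally computed as the word-level edit distance—the minimum number of substitutions, insertions, and deletions needed to turn the recognizer’s hypothesis into the reference transcript—divided by the number of words in the reference. A minimal sketch in Python (the sample transcripts are hypothetical, not drawn from any vendor’s benchmarks):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word against a four-word reference gives 25 percent WER.
print(wer("turn on the radio", "turn on the radios"))  # → 0.25
```

By this measure, the vendor figures above mean roughly one word in twenty is transcribed incorrectly.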
Kennewick says WER varies across different domains, generally depending on how wide a domain’s vocabulary is. “Simple domains, such as controlling a vehicle’s heating and air conditioning system, don’t involve large vocabularies and can achieve very high recognition accuracy. Domains like music, which must contend with millions of idiosyncratic names of songs, albums, and artists, are more challenging.”
Kennewick adds that humans have an estimated recognition accuracy of about 97 percent. “The goal of Voicebox is to eventually perform as well as or better than a human across many different applications and environments. There isn’t much room left for improvement in terms of raw WER scores, but each improvement continues to make the systems more usable, continually widening the group of people they serve.”
Moving forward, Kennewick sees further improvement coming from matching the human advantage in detecting recognition errors and automatically correcting them using common-sense knowledge; you know when you’ve misheard someone, and you can usually figure out what they said. Detecting when a transcription makes no sense relative to the user’s situation and automatically finding a more likely interpretation, even if it’s not what the speech recognition software thought it heard, is still a very difficult software problem.
“The last push of recognition will depend on genuinely understanding users’ situations and what they’re really asking for in order to detect and correct misrecognitions,” explains Kennewick. “But if that’s all your system is doing then you’re missing nearly all of the real value because that same understanding will also allow conversational digital assistants to provide intelligent responses and assistance to users’ queries in much the same way that a human assistant can.”
Kennewick says that within the next decade we will see the next level of performance in speech recognition and digital assistants, “as our R&D teams break new ground in context algorithms, user models, and plan-based reasoning systems that can integrate a wide variety of information about a user’s situation with heuristics to approximate what we call common sense.”
Roundup
Here are several leaders in speech and voice recognition solutions.
Acapela Group creates voices that read, inform, explain, present, guide, educate, tell stories, help to communicate, alarm, notify, and entertain. Its text-to-speech solutions give voices to tiny toys or server farms, AI, screen readers or robots, cars and trains, smartphones, and IoT.
Alphabet Inc., the parent company of Google, offers speech recognition through the Google Cloud Speech API. The product enables developers to convert audio to text by applying powerful neural network models in an easy-to-use API. The API recognizes over 110 languages and variants. It can transcribe the text of users dictating to an application’s microphone, enable command-and-control through voice, or transcribe audio files, among many other use cases. It can also recognize audio uploaded in the request and integrates with audio stored on Google Cloud Storage, using the same technology Google uses to power its own products.
iFLYTEK Co., Ltd. is dedicated to the research of intelligent speech and language technologies, development of software and chip products, provision of speech information services, and integration of e-government systems. The company’s speech technologies, consisting mainly of speech synthesis and speech recognition, aim to make human-machine speech communication as convenient as human-human communication. Speech synthesis enables a machine to talk; speech recognition enables a machine to hear. As the mobile internet enters a voice era, iFLYTEK has launched the iFLYTEK Voice Cloud platform, which it describes as the world’s first platform providing intelligent speech interaction capabilities over the mobile internet. Based on the platform, iFLYTEK has launched demonstration applications such as iFLYTEK Voice Input and iFLYTEK ViaFly. In cooperation with many partners, iFLYTEK promotes speech applications in mobile phones, automobiles, home appliances, toys, and a number of other fields, spurring innovation in input and interaction modes in the mobile internet era.
LumenVox, LLC is a speech automation software company providing core speech technologies that include the LumenVox Speech Recognizer, Text-to-Speech Engine, Call Progress Analysis, Speech Tuner, and natural language solution support. Based on industry standards, LumenVox Speech Software is certified as one of the most accurate, natural sounding, and reliable solutions in the industry. LumenVox technology provides tools for effectively connecting and communicating with users, increasing user satisfaction, and improving employee productivity.
Microsoft Azure offers the Bing Speech API to convert spoken audio to text. The API can be directed to recognize audio coming from a microphone in real time, from a different real-time audio source, or from within a file. In all cases, real-time streaming is available, so partial recognition results are returned as the audio is sent to the server. The Speech to Text API enables users to build smart apps that are voice triggered.
Nuance Communications, Inc. offers the Dragon NaturallySpeaking and Nuance Recognizer speech recognition solutions. Dragon NaturallySpeaking is speech recognition software for the computer that enables users to create documents, spreadsheets, and email simply by speaking. According to the company, it is three times faster than typing and delivers up to 99 percent accuracy.
Nuance Recognizer is the software at the core of Nuance’s contact center automation solutions. With Nuance Recognizer, organizations can consistently deliver a great customer service experience while improving a self-service system’s containment rate. According to the company, the solution delivers the industry’s highest recognition accuracy even as it encourages natural, human-like conversations.
ReadSpeaker offers text to speech solutions for online and offline content of websites, mobile apps, eBooks, eLearning material, documents, telephony and transport systems, media, robotics, embedded devices, and IoT.
Sensory Inc. provides highly accurate, low-cost embedded speech recognition solutions for both integrated circuits and embedded software platforms. Technologies include voice recognition, speech and music synthesis, text-to-speech synthesis with voice morphing, speaker verification, and interactive robotic features.
Speechmatics offers a cloud-based and on-premises speech recognition platform. It allows the user to input audio or video content and receive a full transcript. The company’s solutions are used in various applications and industries, including call center analytics, call compliance, subtitling, interview and lecture transcription, and media monitoring.
VoiceBase provides AI for speech recognition, speech analytics, and predictive analytics to surface insights that every business needs. Enterprises utilize the company’s deep learning neural network technology to automatically transcribe audio and video, score contact center calls, predict customer behavior, and automate net promoter scores.
VoiceBox Technologies Corp. provides conversational interfaces. Voicebox Cloud and Voicebox Edge target businesses and enterprises first by supplying the technology that enables those companies to satisfy their consumers. “A good earlier example of this is our work with Samsung, where Voicebox’s software powers the voice interface on the Galaxy series smartphones and tablets. Or with Toyota, where our automotive voice systems help drivers manage their vehicle systems while keeping their eyes on the road. Those are perfect examples, too, of how Voicebox can help companies promote their own brands by helping them keep control of their conversational experiences,” says Kennewick.
VoiceVault, Inc. offers voice biometric solutions, which Webb says significantly reduce the time required to log into a mobile application or the time agents and callers spend verifying a caller’s identity. Customers no longer have to answer lengthy knowledge-based questions to authenticate. The VoiceVault solution is 100 percent focused on voice biometrics and utilizes the company’s own proprietary, military-grade algorithms. The solution offers flexible deployment options including on-premises, cloud-based, and hybrid models, and is platform independent with the ability to support a diversity of applications from the call center to web and mobile.
VoiceVault’s solutions are configurable and set to meet each client’s requirements. Webb says a typical high-security application would be configured to operate at a 0.01 percent false acceptance rate (FAR), which results in a false rejection rate (FRR) of around six to eight percent when implemented using VoiceVault’s SDK, while other use cases may require a specific FRR operating point to be set.
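The operating point Webb describes reflects a standard biometric trade-off: a verification engine produces a match score per attempt, and raising the acceptance threshold lowers the false acceptance rate (impostors let in) at the cost of a higher false rejection rate (genuine users turned away). A minimal sketch with made-up scores, not VoiceVault’s algorithms or data:

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """False acceptance / false rejection rates at a given score threshold.

    An attempt is accepted when its match score meets or exceeds the threshold.
    """
    far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr

# Hypothetical match scores (higher = more likely the genuine speaker).
genuine = [0.91, 0.85, 0.88, 0.62, 0.95, 0.79, 0.93, 0.87, 0.90, 0.58]
impostor = [0.12, 0.33, 0.05, 0.41, 0.22, 0.18, 0.27, 0.09, 0.36, 0.15]

# A stricter threshold drives FAR toward zero while FRR rises.
for t in (0.3, 0.5, 0.7):
    far, frr = far_frr(genuine, impostor, t)
    print(f"threshold={t}: FAR={far:.0%}, FRR={frr:.0%}")
```

Sweeping the threshold in this way is how a deployment is tuned to a target such as the 0.01 percent FAR figure quoted above.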
Technology and the Future
Speech and voice recognition tools are poised to become an essential element of next-generation technology for consumers and businesses alike. As adoption continues and WER declines, expect to see speech and voice related technologies integrated into more devices and apps.
January 2018, Software Magazine