Despite its potentially widespread applications, speech recognition software capable of cutting through the nuances and variations of the spoken word has proved difficult to build, with patchy success at best. Many companies have striven to create software that can recognize the words in a conversation as well as a human would, a key requisite for a truly immersive artificial intelligence experience.
In a major breakthrough in this endeavor, Microsoft announced Tuesday that it had created a technology that enabled speech recognition systems to transcribe a conversation with the same error rate as their human counterparts.
“We’ve reached human parity,” Xuedong Huang, Microsoft’s chief speech scientist, said in a statement. “This is an historic achievement.”
To achieve this goal, the researchers at Microsoft used the Computational Network Toolkit, a homegrown system available via an open source license on GitHub. The toolkit’s ability to process deep learning algorithms across multiple computers, coupled with the optimization of recurrent neural networks, enabled the team at Microsoft to reach human parity.
It is, however, important to note that Microsoft’s speech recognition system is not perfect, but then neither are humans. As the company explains in its statement, the system’s word error rate is 5.9 percent, the same as that of professional transcriptionists, who also get about 5.9 percent of the words wrong.
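For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Microsoft's evaluation code, and the function name `word_error_rate` and its whitespace-based word splitting are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Illustrative WER: word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace and compared exactly
    (no normalization of case or punctuation).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Under this definition, a transcript that gets one word wrong in a six-word sentence (for example, "the cat sat on a mat" against the reference "the cat sat on the mat") has a word error rate of 1/6, or roughly 16.7 percent. A 5.9 percent rate corresponds to about six word errors per hundred reference words.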
“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” Harry Shum, executive vice president for Microsoft’s artificial intelligence and research group, said in the statement.
Microsoft plans to use the improved technology in Cortana — its personal voice assistant for Windows and Xbox.
“This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum added.
The company said that the next step would be to ensure that the speech recognition software works as well in real-world settings, where there is a lot of background noise. And eventually, it aims to create software that can not only transcribe speech, but can even understand the words being said.