posted by Andrew Fletcher
The speech processing research community is a dynamic place these days. The commercial prominence of highly popular speech interfaces has given an already-thriving research community a transformative jolt of attention. The evidence was on full display at the annual Interspeech conference, the 2019 edition of which was held in Graz, Austria from September 15-19. Cobalt had five employees in attendance this year -- Jeff Adams, Ryan Lish, Stan Salvador, Rasmus Dall, and myself -- and we found it invigorating both technically and socially. Plus, we experienced the joys of European tourism! (As an aside, if you’re looking to join a dynamic team in speech and language, reach out to us at firstname.lastname@example.org.)
First, the locale. Jeff, Ryan, and I chose to stay in a quaint cottage at a family-owned Austrian winery. While this meant a commute into Graz, we got to experience the Austrian countryside in its early autumn grandeur. Stan and Rasmus both stayed in the city and thus had easy walking access to the lovely Graz city center. I think it’s fair to say we were all more than satisfied with the experience.
With thousands of speech scientists gathered in one place, the conference was a great chance to reconnect with friends and colleagues. Whether former Cobalt colleagues, friends from past jobs, or acquaintances made at previous Interspeeches, we all had multiple reunions each day. Jeff takes the prize in this department: he has spent over 25 years deep within the speech science community and is an extremely extroverted guy. Throughout the week he reconnected with friends in academia and industry, giving him the chance to metaphorically take the pulse of the broader speech community.
A valuable and enjoyable aspect of Interspeech is the chance to broaden our vision of both technical challenges and innovative solutions. We get to see the highlights and accomplishments of a talented group of men and women who have insights into both what problems are worth solving and how to go about finding the best solutions. As many researchers can attest, it requires focus and attention to detail to tackle hard problems. Time spent thinking about various techniques at Interspeech can help to avoid “over-fitting” (to borrow a term from machine learning). We certainly came away with many ideas to improve technology for our customers -- the challenge now will be prioritizing and implementing!
One particular trend we’re watching closely is the extensive research effort going into “end-to-end” speech recognition. Deep neural networks (DNNs) have revolutionized speech processing (and the greater machine learning community) over the last decade, and researchers continue to explore ways to leverage them to ever greater effect. Many traditional approaches (such as Kaldi) train separate acoustic and language models for speech recognition, and DNNs have proven useful for both models. Such methods are the foundation for production ASR systems and are now referred to as “hybrid.” This contrasts with efforts to solve the problem “end-to-end” with a single combined DNN. End-to-end ASR has been a topic for several years now and is a clear focus of the research community. It is fascinating to see the innovative methods used to overcome the acknowledged challenges, particularly with regard to the quantity of data required and the design of DNN topologies that exploit data in the most efficient ways.
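To give a concrete flavor of the end-to-end style: one widely used training criterion, connectionist temporal classification (CTC), lets a single network emit a label (or a special "blank") for every audio frame, and decoding then collapses repeats and strips blanks. The source doesn't name CTC specifically, so treat this as an illustrative sketch of one common end-to-end approach; the per-frame labels below are hypothetical argmax outputs from an imagined acoustic model.

```python
def ctc_greedy_decode(frame_labels, blank="-"):
    """Greedy CTC decoding: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:  # new non-blank symbol
            out.append(lab)
        prev = lab
    return "".join(out)

# Hypothetical per-frame argmax labels from an acoustic model:
frames = ["-", "c", "c", "-", "a", "a", "t", "t", "-"]
print(ctc_greedy_decode(frames))  # -> cat
```

Note how the blank symbol lets the model represent genuinely doubled letters (e.g. "l", "-", "l" decodes to "ll" while "l", "l" collapses to "l") -- one of the design details that makes single-network ASR workable.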
While end-to-end ASR has not yet overtaken hybrid systems in state-of-the-art performance, the achievements are both noteworthy and exciting. The very first survey talk of Interspeech 2019, by Ralf Schlüter of RWTH Aachen University, compared results to date and explored the place for each style of ASR. In particular, he presented results on the LibriSpeech corpus showing that the current state of the art is a hybrid system with an LSTM acoustic model and a transformer-based language model. The two talks immediately following presented end-to-end approaches that nearly matched the hybrid system. It was a welcome reminder, right as the conference began, that ASR continues to be an active competition of ideas pushing the community toward ever more impactful technology. It’s a fun time to be in the speech technology field!
We engaged with researchers discussing speech synthesis, natural language understanding and dialogue systems, voice biometrics, speech analytics, emotion detection, and more. I had some great discussions about techniques like generative adversarial networks, attention networks, and neural vocoders. I won’t presume to summarize the tremendous volume of research presented at the conference (the archive is available here). It was invigorating to be a part of it!
Cobalt remains a committed participant in this dynamic and innovative community. We’re staying current in the latest techniques and are always looking for opportunities to transition cutting-edge research into deployed technology. If you’re looking to make these breakthroughs, come see how we can help you.
About the Author
Dr. Andrew Fletcher is an experienced research engineer with broad expertise in system architecture, machine learning, information theory, communication systems, signal processing, and quantum computing. He has led projects at Cobalt in voice user interface design and implementation, speech recognition, natural language understanding, speaker biometrics, text to speech, and keyword spotting. Prior to joining Cobalt, Andrew spent 15 years at MIT Lincoln Laboratory. He led research projects on undersea laser communication and quantum cryptography and was a leading system engineer for free-space laser communication systems for NASA and the US Department of Defense. Andrew earned a Ph.D. from the Massachusetts Institute of Technology in Electrical Engineering. At MIT, he was in the Laboratory for Information and Decision Systems (LIDS), and was advised by Peter Shor, a leading theorist in quantum computing and information. He earned B.S. and M.S. degrees in Electrical and Computer Engineering at Brigham Young University.