What is the USM? (Universal Speech Model)

February 19, 2024

The Universal Speech Model (USM) is Google's effort to extend automatic speech recognition (ASR) to far more of the world's languages. USM is designed to recognize speech in over 300 languages, including under-resourced languages and those spoken by relatively small populations. It is part of Google's broader 1,000 Languages Initiative, which aims to make speech and language technologies more inclusive and accessible worldwide.

USM is a family of state-of-the-art speech models with 2 billion parameters, trained on 12 million hours of speech and 28 billion sentences of text. Training at this scale lets USM perform ASR not only on widely spoken languages like English and Mandarin but also on languages such as Punjabi, Assamese, and Balinese. This addresses a significant gap in current ASR technologies, which often overlook languages with fewer speakers or limited available data.

A key strength of USM is its training recipe: the encoder is first pre-trained on a large unlabeled multilingual dataset, and the model is then fine-tuned on a much smaller set of labeled data. This lets the model adapt to new languages and data without needing a large labeled corpus for each one. USM has also reported lower word error rates (WER) than existing models such as Whisper across multiple languages and tasks, even with limited supervised data. On YouTube captions, for example, it was evaluated across 73 languages and reported to outperform the previous state-of-the-art models.
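Since the comparison above rests on word error rate, it may help to see how that metric is typically computed. The following is a minimal sketch in plain Python, not code from USM: WER is taken as the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function names `word_error_rate` and `relative_wer_reduction` and the example sentences are illustrative assumptions, and real evaluations usually apply text normalization first, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction by which the candidate lowers WER relative to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - candidate_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up transcripts: one substitution and one deletion over six words.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {wer:.3f}")
    print(f"Relative reduction: {relative_wer_reduction(0.30, 0.20):.3f}")
```

Lower WER is better, and "relative" improvements of the kind quoted in model comparisons are computed against the baseline's WER rather than as an absolute difference.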

USM's encoder is built on the Conformer, a convolution-augmented Transformer that combines attention, feed-forward, and convolutional modules to process speech signals efficiently. With this architecture, USM achieves high-quality results in both ASR and automatic speech translation (AST), improving on existing systems in accuracy and language coverage.
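To make the combination of modules concrete, here is a structural sketch of a single Conformer block in plain Python. The ordering (a half-step feed-forward residual, then self-attention, then convolution, then a second half-step feed-forward residual, then layer normalization) follows the original Conformer design rather than anything stated in this article. The sub-modules (`toy_ffn`, `toy_attention`, `toy_conv`) are hypothetical stand-ins with no learned weights, so this shows only how the pieces are wired together, not how USM is implemented.

```python
import math
from typing import Callable, List

Frame = List[float]            # one feature vector per time step
Seq = List[Frame]              # a sequence of frames
Module = Callable[[Seq], Seq]  # any shape-preserving sub-module


def add(x: Seq, y: Seq, scale: float = 1.0) -> Seq:
    """Residual connection: x + scale * y, element by element."""
    return [[a + scale * b for a, b in zip(fx, fy)] for fx, fy in zip(x, y)]


def conformer_block(x: Seq, ffn_in: Module, attention: Module,
                    conv: Module, ffn_out: Module, norm: Module) -> Seq:
    """Wire the sub-modules together in the Conformer's 'macaron' order."""
    x = add(x, ffn_in(x), scale=0.5)   # half-step feed-forward
    x = add(x, attention(x))           # self-attention over time
    x = add(x, conv(x))                # convolution over time
    x = add(x, ffn_out(x), scale=0.5)  # second half-step feed-forward
    return norm(x)


# --- Hypothetical stand-in modules (no learned parameters) ---

def toy_ffn(x: Seq) -> Seq:
    """Per-frame nonlinearity standing in for a feed-forward network."""
    return [[max(0.0, v) for v in frame] for frame in x]


def toy_attention(x: Seq) -> Seq:
    """Uniform attention: every frame attends equally to all frames."""
    dim = len(x[0])
    mean = [sum(frame[d] for frame in x) / len(x) for d in range(dim)]
    return [list(mean) for _ in x]


def toy_conv(x: Seq) -> Seq:
    """Depthwise moving average over time (kernel size 3, zero padding)."""
    zero = [0.0] * len(x[0])
    padded = [zero] + x + [zero]
    return [[(a + b + c) / 3.0 for a, b, c in zip(*padded[t:t + 3])]
            for t in range(len(x))]


def layer_norm(x: Seq, eps: float = 1e-5) -> Seq:
    """Normalize each frame to zero mean and roughly unit variance."""
    out = []
    for frame in x:
        mean = sum(frame) / len(frame)
        var = sum((v - mean) ** 2 for v in frame) / len(frame)
        out.append([(v - mean) / math.sqrt(var + eps) for v in frame])
    return out


if __name__ == "__main__":
    frames = [[0.1, -0.4, 0.3], [0.5, 0.2, -0.1], [-0.3, 0.0, 0.6]]
    out = conformer_block(frames, toy_ffn, toy_attention, toy_conv,
                          toy_ffn, layer_norm)
    print(len(out), len(out[0]))  # same sequence length and feature size
```

In a real model each stand-in would be a trained layer (multi-head self-attention, a gated depthwise convolution module, and so on), and many such blocks would be stacked to form the encoder.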

The development of USM marks a significant milestone towards achieving Google's vision of making information universally accessible. By supporting a wide range of languages, USM has the potential to bring the benefits of speech technology to billions of people around the world, fostering greater inclusion and accessibility in the digital age.

Read the paper here: https://arxiv.org/abs/2303.01037