Spirit LM is an open-source multimodal language model developed by Meta that seamlessly integrates text and speech, enabling more natural and expressive human-computer interactions. Unlike traditional models that process text and speech separately, Spirit LM combines these modalities into a single stream of tokens, allowing for fluid transitions between spoken and written language.
Built on a 7-billion-parameter pretrained text language model (Llama 2), Spirit LM was further trained to incorporate speech using a word-level interleaving method: text and speech sequences are concatenated into a single token stream, alternating between the two modalities at word boundaries, so the model learns from both simultaneously. The training data includes a small, automatically curated speech-text parallel corpus, which provides the word-level alignment the interleaving relies on and helps the model understand and generate both written and spoken language.
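To make the interleaving idea concrete, here is a minimal Python sketch of how a word-level interleaved stream could be assembled from an aligned speech-text corpus. The `[TEXT]`/`[SPEECH]` markers, the `[Hu*]` unit names, and the fixed switching schedule are illustrative assumptions, not Spirit LM's actual vocabulary or sampling strategy; it is a simplification of the span-based modality switching described in the paper.

```python
from typing import List, Tuple

# Hypothetical modality-switch markers; the real special tokens may differ.
TEXT_MARKER = "[TEXT]"
SPEECH_MARKER = "[SPEECH]"


def interleave_word_level(
    aligned_words: List[Tuple[str, List[str]]],
    switch_every: int = 3,
) -> List[str]:
    """Build one token stream that alternates between text words and their
    speech-token counterparts, switching modality every few words.

    aligned_words: word-level alignment of (text_word, speech_tokens),
        e.g. produced by a forced aligner on a speech-text parallel corpus.
    switch_every: how many consecutive words stay in one modality before
        switching (a stand-in for random span sampling).
    """
    stream: List[str] = []
    use_text = False  # first toggle below switches the stream to text
    for i, (word, speech_tokens) in enumerate(aligned_words):
        if i % switch_every == 0:  # switch modality and emit a marker
            use_text = not use_text
            stream.append(TEXT_MARKER if use_text else SPEECH_MARKER)
        if use_text:
            stream.append(word)
        else:
            stream.extend(speech_tokens)
    return stream


if __name__ == "__main__":
    # Toy alignment: each text word paired with made-up speech units.
    alignment = [
        ("the", ["[Hu7]", "[Hu33]"]),
        ("cat", ["[Hu12]", "[Hu5]", "[Hu48]"]),
        ("sat", ["[Hu9]", "[Hu21]"]),
        ("down", ["[Hu3]", "[Hu17]", "[Hu40]"]),
    ]
    print(" ".join(interleave_word_level(alignment)))
    # -> [TEXT] the cat sat [SPEECH] [Hu3] [Hu17] [Hu40]
```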
There are two versions of Spirit LM:
- Spirit LM Base: Represents speech with phonetic tokens derived from the audio, focusing on capturing what is said.
- Spirit LM Expressive: Adds pitch and style tokens on top of the phonetic tokens, enabling the model to capture and convey how it is said, including emotions such as excitement or sadness, in its speech outputs.
This dual-version approach lets users choose between a standard speech model and one that adds emotional depth to interactions; the contrast between the two token streams is sketched below.
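As a rough illustration of the difference, the snippet below contrasts a Base-style stream of phonetic tokens with an Expressive-style stream that interleaves pitch and style tokens. All token names (`[Hu*]`, `[Pi*]`, `[St*]`) are placeholders, not the model's real vocabulary.

```python
# Spirit LM Base: speech is represented by phonetic tokens alone.
base_stream = ["[SPEECH]", "[Hu12]", "[Hu5]", "[Hu48]", "[Hu9]", "[Hu21]"]

# Spirit LM Expressive: pitch ([Pi*]) and style ([St*]) tokens are interleaved
# with the same phonetic tokens, so the stream also carries prosody.
expressive_stream = [
    "[SPEECH]", "[St3]", "[Pi42]",
    "[Hu12]", "[Hu5]",
    "[Pi17]",
    "[Hu48]", "[Hu9]", "[Hu21]",
]

# Stripping the expressive tokens recovers a Base-like stream.
stripped = [t for t in expressive_stream if not t.startswith(("[Pi", "[St"))]
assert stripped == base_stream
```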
Spirit LM can also learn new tasks across modalities from a few in-context examples, including automatic speech recognition (ASR), text-to-speech (TTS), and speech classification. By preserving the expressive qualities of human speech, the model can make applications such as virtual assistants and customer service bots more engaging and lifelike.
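Below is a minimal sketch of how few-shot prompting for ASR might look with an interleaved token stream: each in-context example pairs speech tokens with their transcript, and the prompt ends with a text marker so the model continues in text. The marker tokens, unit names, and the commented-out `generate` call are hypothetical stand-ins, not the released inference API.

```python
def build_asr_prompt(examples, query_speech_tokens):
    """Few-shot ASR: show (speech -> text) pairs, then ask for the transcript
    of a new utterance by ending the prompt with the text marker."""
    parts = []
    for speech_tokens, transcript in examples:
        parts.append("[SPEECH] " + " ".join(speech_tokens))
        parts.append("[TEXT] " + transcript)
    parts.append("[SPEECH] " + " ".join(query_speech_tokens))
    parts.append("[TEXT]")  # model is expected to continue with the transcript
    return "\n".join(parts)


few_shot = [
    (["[Hu7]", "[Hu33]", "[Hu12]"], "hello there"),
    (["[Hu9]", "[Hu21]", "[Hu3]"], "good morning"),
]
prompt = build_asr_prompt(few_shot, ["[Hu5]", "[Hu48]", "[Hu17]"])
print(prompt)
# transcript = generate(model, prompt)  # hypothetical inference call
```

The same pattern works in the other direction for TTS (end the prompt with the speech marker) or for speech classification (end it with a text label slot).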
Released under Meta's FAIR Noncommercial Research License, Spirit LM is available for non-commercial research purposes. This open-source release aims to encourage further exploration and development in the integration of speech and text within the AI research community.
For more background, read the research paper: Spirit LM: Interleaved Spoken and Written Language Model.
And check out the introductory video below: