Speech-to-text, also known as speech recognition, enables real-time transcription of audio streams into text. Your applications, tools, or devices can consume, display, and take action on this text as command input. This service is powered by the same recognition technology that.
-->Important
- Speech-to-Text and its enhanced phone call models are already powering Google Cloud’s powerful solution, Contact Center AI.
- Speech to Text The IBM Watson Speech to Text service uses speech recognition capabilities to convert Arabic, English, Spanish, French, Brazilian Portuguese, Japanese, Korean, German,.
- SpeechTexter is a free professional multilingual speech-to-text application aimed at assisting you with transcription of any type of documents, books, reports, blog posts, etc by using your voice. SpeechTexter's custom dictionary allows adding short commands for inserting frequently used data (punctuation marks, phone numbers, addresses, etc).
- The Speech to Text service converts the human voice into the written word. The service uses deep-learning AI to apply knowledge of grammar, language structure, and the composition of audio and voice signals to accurately transcribe human speech. It can be used in applications such as voice-automated chatbots, analytic tools for customer-service call centers, and multi-media transcription.
Transport Layer Security (TLS) 1.2 is now enforced for all HTTP requests to this service. For more information, see Azure Cognitive Services security.
In this overview, you learn about the benefits and capabilities of the text-to-speech service, which enables your applications, tools, or devices to convert text into human-like synthesized speech. Use human-like neural voices, or create a custom voice unique to your product or brand. For a full list of supported voices, languages, and locales, see supported languages.
Text To Speech Com
This documentation contains the following article types:
- Quickstarts are getting-started instructions to guide you through making requests to the service.
- How-to guides contain instructions for using the service in more specific or customized ways.
- Concepts provide in-depth explanations of the service functionality and features.
- Tutorials are longer guides that show you how to use the service as a component in broader business solutions.
Note
Bing Speech was decommissioned on October 15, 2019. If your applications, tools, or products are using the Bing Speech APIs or Custom Speech, we've created guides to help you migrate to the Speech service.
Core features
Speech synthesis - Use the Speech SDK or REST API to convert text-to-speech using standard, neural, or custom voices.
Asynchronous synthesis of long audio - Use the Long Audio API to asynchronously synthesize text-to-speech files longer than 10 minutes (for example audio books or lectures). Unlike synthesis performed using the Speech SDK or speech-to-text REST API, responses aren't returned in real time. The expectation is that requests are sent asynchronously, responses are polled for, and that the synthesized audio is downloaded when made available from the service. Only custom neural voices are supported.
Neural voices - Deep neural networks are used to overcome the limits of traditional speech synthesis with regard to stress and intonation in spoken language. Prosody prediction and voice synthesis are performed simultaneously, which results in more fluid and natural-sounding outputs. Neural voices can be used to make interactions with chatbots and voice assistants more natural and engaging, convert digital texts such as e-books into audiobooks, and enhance in-car navigation systems. With the human-like natural prosody and clear articulation of words, neural voices significantly reduce listening fatigue when you interact with AI systems. For a full list of neural voices, see supported languages.
Adjust speaking styles with SSML - Speech Synthesis Markup Language (SSML) is an XML-based markup language used to customize speech-to-text outputs. With SSML, you can adjust pitch, add pauses, improve pronunciation, speed up or slow down speaking rate, increase or decrease volume, and attribute multiple voices to a single document. See the how-to for adjusting speaking styles.
Visemes - Visemes are the key poses in observed speech, including the position of the lips, jaw and tongue when producing a particular phoneme. Visemes have a strong correlation with voices and phonemes. Using viseme events in Speech SDK, you can generate facial animation data, which can be used to animate faces in lip-reading communication, education, entertainment, and customer service.
Note
Viseme events are currently only supported for the en-US-AriaNeural
voice.
Get started
See the quickstart to get started with text-to-speech. The text-to-speech service is available via the Speech SDK, the REST API, and the Speech CLI
Sample code
Sample code for text-to-speech is available on GitHub. These samples cover text-to-speech conversion in most popular programming languages.
Customization
In addition to neural voices, you can create and fine-tune custom voices unique to your product or brand. All it takes to get started are a handful of audio files and the associated transcriptions. For more information, see Get started with Custom Voice
Pricing note
When using the text-to-speech service, you are billed for each character that is converted to speech, including punctuation. While the SSML document itself is not billable, optional elements that are used to adjust how the text is converted to speech, like phonemes and pitch, are counted as billable characters. Here's a list of what's billable:
- Text passed to the text-to-speech service in the SSML body of the request
- All markup within the text field of the request body in the SSML format, except for
<speak>
and<voice>
tags - Letters, punctuation, spaces, tabs, markup, and all white-space characters
- Every code point defined in Unicode
For detailed information, see Pricing.
Important
Speech To Text Google Docs
Each Chinese, Japanese, and Korean language character is counted as two characters for billing.