Speech

Speech‑to‑Text, Text‑to‑Speech, real‑time translation, and speaker recognition with Azure Cognitive Services.

Core services

Speech‑to‑Text (STT)

Real‑time or batch transcription with punctuation, diarization and multilingual support.

Text‑to‑Speech (TTS)

Natural neural voices; tone, rate and prosody controls.
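
The rate and prosody controls are usually expressed in SSML. The following is a minimal sketch of building such a document in Python, not a definitive implementation. The helper name `build_ssml`, its defaults, and the example voice `en-US-JennyNeural` are assumptions for illustration; check the voice list and the accepted `rate`/`pitch` values for your region before relying on them.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Standard W3C SSML namespace.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text: str, voice: str = "en-US-JennyNeural", lang: str = "en-US",
               rate: str = "+0%", pitch: str = "+0%") -> str:
    """Wrap plain text in <speak>/<voice>/<prosody> (hypothetical helper)."""
    ssml = (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escape &, < and > so user text cannot break the markup
        "</prosody></voice></speak>"
    )
    ET.fromstring(ssml)  # raises ParseError early if the result is not well-formed
    return ssml


if __name__ == "__main__":
    print(build_ssml("Your order has shipped.", rate="-10%", pitch="+2st"))
```

Escaping the text and quoting the attributes keeps user-supplied content from injecting extra SSML elements.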

Speech Translation

Live speech translation for meetings, support and multilingual content.

Speaker Recognition

Speaker identification and verification for security and personalization.

Custom Neural Voice

Create a branded voice (where allowed) with consent processes, review and monitoring.

Integration patterns

APIs & SDKs

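REST example for short audio (for continuous streaming, use the Speech SDK, which connects over WebSocket):

POST /speech/recognition/conversation/cognitiveservices/v1?language=en-US HTTP/1.1
Host: <region>.stt.speech.microsoft.com
Ocp-Apim-Subscription-Key: <key>
Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000
Accept: application/json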

Send 16 kHz, 16‑bit mono PCM, upload audio in small chunks, and retry transient failures with backoff.
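
Chunking and retries can be sketched in plain Python, independent of any particular SDK. The code below assumes raw 16 kHz, 16‑bit mono PCM. The names `pcm_chunks` and `with_retries`, the 100 ms chunk size, and the backoff values are illustrative assumptions, not service requirements.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")

SAMPLE_RATE_HZ = 16_000  # 16 kHz
BYTES_PER_SAMPLE = 2     # 16-bit PCM (assumed)
CHANNELS = 1             # mono


def pcm_chunks(pcm: bytes, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM; the last slice may be shorter."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def with_retries(call: Callable[[], T], attempts: int = 4,
                 base_delay_s: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(); on ConnectionError/TimeoutError wait 0.5 s, 1 s, 2 s, ... and retry."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay_s * 2 ** attempt)
    raise AssertionError("unreachable")
```

At these settings one second of audio is 32,000 bytes, so a 100 ms chunk is 3,200 bytes. Only transient network errors are retried here; authentication or format errors should fail immediately.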

Architecture

Use WebSocket or SignalR for real‑time scenarios, and Azure Functions with Blob Storage for batch jobs. For advanced customization, use Azure Machine Learning.

Quick comparison

| Service | When to use | Output |
|---|---|---|
| STT | Real‑time or batch transcription | Text with timestamps, diarization |
| TTS | Voice assistants, audio content | Synthesized audio, SSML |
| Translation | Multilingual meetings/support | Live transcriptions & translations |
| Speaker | Authentication & personalization | Speaker ID/verification with scores |
| Custom Voice | Controlled branded voice | Voice model + usage policies |

Best practices

Audio quality

Reduce noise and reverberation, use consistent microphones, set proper gain, and sample at 16 kHz.
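
The sampling recommendation can be checked before upload with Python's standard `wave` module. This is a sketch under the assumption that the target is 16 kHz, 16‑bit mono PCM in a WAV container; `check_wav` and `make_wav` are hypothetical helper names.

```python
import io
import wave


def check_wav(data: bytes, rate: int = 16_000, channels: int = 1,
              sample_width: int = 2) -> list:
    """Return a list of format problems; an empty list means the file matches.

    wave.open raises wave.Error for non-PCM or corrupt WAV data.
    """
    problems = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate} Hz")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channel(s), expected {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"{wav.getsampwidth() * 8}-bit samples, expected {sample_width * 8}-bit")
    return problems


def make_wav(pcm: bytes, rate: int = 16_000, channels: int = 1,
             sample_width: int = 2) -> bytes:
    """Wrap raw PCM bytes in a WAV header (used here only to demonstrate check_wav)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(sample_width)
        out.setframerate(rate)
        out.writeframes(pcm)
    return buf.getvalue()


if __name__ == "__main__":
    print(check_wav(make_wav(bytes(3200))))               # matching file
    print(check_wav(make_wav(bytes(3200), rate=44_100)))  # wrong sample rate
```

A check like this catches format mismatches only; noise, reverberation and gain still need to be addressed at the recording stage.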

Privacy & consent

Clear notices, minimal retention, anonymization and access roles.

Latency & costs

Streaming for real‑time, batch for long files; caching, compression and quota controls.
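
Caching is easiest to picture for TTS, where the same prompt is often synthesized repeatedly. The sketch below keeps results in memory, keyed by a hash of the voice and input text. `SynthesisCache` and the `synthesize` callable are hypothetical; in practice the callable would wrap whatever TTS client you use, and the store would likely be Blob Storage or a similar persistent cache.

```python
import hashlib
from typing import Callable, Dict


class SynthesisCache:
    """Memoize synthesized audio so repeated prompts are billed only once."""

    def __init__(self, synthesize: Callable[[str, str], bytes]):
        self._synthesize = synthesize      # hypothetical: (voice, text_or_ssml) -> audio bytes
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(voice: str, text: str) -> str:
        # The voice is part of the key: the same text in another voice is different audio.
        return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()

    def get(self, voice: str, text: str) -> bytes:
        cache_key = self.key(voice, text)
        if cache_key in self._store:
            self.hits += 1
            return self._store[cache_key]
        self.misses += 1
        audio = self._synthesize(voice, text)
        self._store[cache_key] = audio
        return audio
```

The hit and miss counters give a quick view of how much synthesis traffic the cache is actually saving.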

FAQ

Do I need client‑side GPUs?

No. Processing runs in Azure; optimize codec/bitrate and network conditions.

How do I handle accents and dialects?

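Pick the locale that matches your speakers (for example, en‑GB rather than en‑US), apply lexical adaptation for domain terms, and evaluate custom dictionaries against representative recordings.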

Can I moderate TTS output?

Yes. Apply content filters and SSML rules; add human review for public content.
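
One way to sketch the filter and SSML rules is a pre-synthesis check that rejects unexpected SSML elements and flagged terms. The allowlist, the `blocked_terms` parameter and the function name `review_ssml` are assumptions for illustration; a simple substring check like this complements, and does not replace, a real content filter and human review.

```python
import xml.etree.ElementTree as ET

# Assumed policy: only these SSML elements may appear in public content.
ALLOWED_TAGS = frozenset({"speak", "voice", "prosody", "break", "p", "s"})


def review_ssml(ssml: str, blocked_terms=frozenset()) -> list:
    """Return a list of policy issues; an empty list means the SSML passed."""
    try:
        root = ET.fromstring(ssml)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    issues = []
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if local_name not in ALLOWED_TAGS:
            issues.append(f"element <{local_name}> is not on the allowlist")

    spoken_text = " ".join(root.itertext()).lower()
    for term in sorted(blocked_terms):
        if term.lower() in spoken_text:
            issues.append(f"blocked term found: {term!r}")
    return issues
```

Anything that returns a non-empty list can be routed to a human reviewer instead of being synthesized automatically.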