Which metrics matter for STT/TTS?

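For STT, track word and character error rates (WER/CER) and latency; for TTS, track naturalness (mean opinion score, MOS), intelligibility and latency. Measure on samples that are representative of your speakers, vocabulary and recording conditions.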
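As a rough illustration of the STT metrics, the sketch below computes WER and CER as edit distance divided by reference length. The lowercasing and whitespace tokenization are simplifying assumptions; production scoring usually applies fuller text normalization (punctuation, numerals) first.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    ref_chars = list(reference.lower())
    hyp_chars = list(hypothesis.lower())
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

For example, `wer("turn on the kitchen lights", "turn on kitchen light")` counts one deletion and one substitution against five reference words.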

Can I build a custom voice?

Yes, with Custom Neural Voice where available. It requires documented consent from the voice talent and responsible AI controls; enable governance and auditing.

Use streaming when possible and batch processing for long files; cache repeated results and monitor quotas.

Speech‑to‑Text, Text‑to‑Speech, real‑time translation and speaker recognition with Azure Cognitive Services.

Real‑time or batch transcription with punctuation, diarization and multilingual support.

Natural neural voices; tone, rate and prosody controls.
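Rate and prosody controls are typically expressed in SSML. A minimal sketch of building such a document is shown here; the helper name `build_ssml`, the default voice name and the default `rate`/`pitch` values are illustrative assumptions, so check the voice list for your region before relying on them.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str,
               voice: str = "en-US-JennyNeural",  # assumed voice name
               lang: str = "en-US",
               rate: str = "medium",
               pitch: str = "medium") -> str:
    """Wrap text in speak/voice/prosody elements, escaping XML special characters."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>'
        f'{escape(text)}'
        '</prosody></voice></speak>'
    )
```

Escaping the text matters because characters such as `&` or `<` in user content would otherwise produce invalid SSML.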

Live speech translation for meetings, support and multilingual content.

Speaker identification and verification for security and personalization.

Create a branded voice (where allowed) with consent processes, review and monitoring.

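STT request example (REST API for short audio; the Speech SDK streams over WebSocket):

POST /speech/recognition/conversation/cognitiveservices/v1?language=en-US
Host: <region>.stt.speech.microsoft.com
Ocp-Apim-Subscription-Key: <key>
Content-Type: audio/wav; codecs=audio/pcm; samplerate=16000
Accept: application/json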
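The same request can be assembled in Python with the standard library. This is a sketch only: the function name `build_stt_request` and the region value are placeholders, and the request is constructed but not sent, so nothing here depends on a live key.

```python
import urllib.parse
import urllib.request


def build_stt_request(audio: bytes, key: str, region: str,
                      language: str = "en-US") -> urllib.request.Request:
    """Build (but do not send) a short-audio recognition request."""
    query = urllib.parse.urlencode({"language": language})
    url = (f"https://{region}.stt.speech.microsoft.com"
           f"/speech/recognition/conversation/cognitiveservices/v1?{query}")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=audio, headers=headers, method="POST")
```

Passing the result to `urllib.request.urlopen` would send it; keep the key out of source control and load it from configuration instead.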

Optimize the audio format (16 kHz mono PCM), send audio in small chunks and retry transient failures.
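Chunking and retries might look like the sketch below. The 100 ms chunk size, the retry count and the backoff constants are assumptions to tune per workload, and `with_retries` treats only connection and timeout errors as transient.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def iter_chunks(audio: bytes, chunk_ms: int = 100, rate: int = 16000,
                sample_width: int = 2, channels: int = 1) -> Iterator[bytes]:
    """Yield raw PCM in fixed-duration chunks (the last one may be shorter)."""
    size = rate * sample_width * channels * chunk_ms // 1000
    for start in range(0, len(audio), size):
        yield audio[start:start + size]


def with_retries(call: Callable[[], T], attempts: int = 4,
                 base_delay: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying transient errors with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without waiting.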

Real‑time with WebSocket/SignalR; batch with Functions + Blob Storage. For advanced customization use Azure ML.

Service	When to use	Output
STT	Real‑time or batch transcription	Text with timestamps, diarization
TTS	Voice assistants, audio content	Synthesized audio, SSML
Translation	Multilingual meetings/support	Live transcriptions & translations
Speaker	Authentication & personalization	Speaker ID/verification with scores
Custom Voice	Controlled branded voice	Voice model + usage policies

Reduce noise and reverberation, use consistent microphones, set proper gain and sample at 16 kHz.
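A quick pre-flight check can catch files that do not match the target format before they are uploaded. The sketch below uses Python's standard `wave` module; the 16-bit sample width is an assumption, and `make_silence` is a hypothetical helper included only to produce test input.

```python
import io
import wave
from typing import List


def check_wav(data: bytes, rate: int = 16000, channels: int = 1,
              sample_width: int = 2) -> List[str]:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(io.BytesIO(data), "rb") as wav:
        if wav.getframerate() != rate:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected {rate}")
        if wav.getnchannels() != channels:
            problems.append(f"{wav.getnchannels()} channels, expected {channels}")
        if wav.getsampwidth() != sample_width:
            problems.append(f"sample width is {wav.getsampwidth()} bytes, expected {sample_width}")
    return problems


def make_silence(rate: int, channels: int, seconds: float = 0.1) -> bytes:
    """Generate a short silent 16-bit WAV in memory (test helper)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds) * channels)
    return buf.getvalue()
```

This checks the container format only; noise, reverberation and gain still need to be assessed by listening or with signal-level measurements.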

Clear notices, minimal retention, anonymization and access roles.

Streaming for real‑time, batch for long files; caching, compression and quota controls.
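Caching is easiest to apply to repeated TTS prompts. The sketch below keeps synthesized audio in memory, keyed by a hash of voice and text; `SynthesisCache` and the `synthesize` callable are hypothetical names, and a real deployment would likely use a shared store with expiry.

```python
import hashlib
from typing import Callable, Dict


class SynthesisCache:
    """In-memory cache in front of a synthesize(text, voice) -> bytes callable."""

    def __init__(self, synthesize: Callable[[str, str], bytes]):
        self._synthesize = synthesize
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, voice: str) -> str:
        # Hash voice and text together so the same text in another voice is a new entry.
        return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()

    def get(self, text: str, voice: str) -> bytes:
        key = self._key(text, voice)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._synthesize(text, voice)
        return self._store[key]
```

The `hits` and `misses` counters give a simple way to see how much quota the cache is saving.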

Do I need client‑side GPUs?

No. Processing runs in Azure; optimize codec/bitrate and network conditions.

How to handle accents and dialects?

Pick the right locale, apply lexical adaptation and evaluate custom dictionaries.

Can I moderate TTS output?

Yes. Apply content filters and SSML rules; add human review for public content.