🗣️ TeLeS: Making AI More Honest About What It Hears

🎧 What If AI Could Say, “I’m Not Sure”?
Automatic Speech Recognition (ASR) is at the heart of voice assistants, transcription tools, and smart speakers. But here’s the problem: these systems often act overconfident — even when they get things wrong.
That’s dangerous when accuracy matters — like in healthcare, law, or low-resource languages.
So this paper tackles a fundamental problem: How can we make ASR models know when they might be wrong?
The answer? A clever new score called TeLeS — a blend of time alignment + word similarity — that teaches AI to estimate how confident it really is.
🧠 The Innovation: TeLeS = Temporal + Lexeme Similarity
Most confidence estimation models use binary logic:
- ✅ Right word → score = 1
- ❌ Wrong word → score = 0
But that’s… too blunt. What about:
- Minor typos? (e.g. “president” → “presidant”)
- Timing errors?
- Mixed accents?
TeLeS (Temporal Lexeme Similarity) solves this by assigning a score between 0 and 1 based on:
- 🕰️ Temporal similarity: Did the spoken and predicted words occur at the same time?
- 🔤 Lexeme similarity: How closely does the predicted word match the actual word, character by character?
Then it trains a separate word-level confidence model (called WLC) to predict this fine-grained score.
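To make this concrete, here's a minimal Python sketch of how such a blended score could be computed for one aligned word pair. The character-level similarity, the time-overlap measure, and the mixing weight `alpha` are illustrative stand-ins, not the paper's exact formulas:

```python
from difflib import SequenceMatcher

def lexeme_similarity(ref_word: str, hyp_word: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means identical spelling."""
    return SequenceMatcher(None, ref_word, hyp_word).ratio()

def temporal_similarity(ref_span, hyp_span) -> float:
    """Overlap of the two word time intervals divided by their union."""
    overlap = max(0.0, min(ref_span[1], hyp_span[1]) - max(ref_span[0], hyp_span[0]))
    union = max(ref_span[1], hyp_span[1]) - min(ref_span[0], hyp_span[0])
    return overlap / union if union > 0 else 0.0

def teles_score(ref_word, hyp_word, ref_span, hyp_span, alpha=0.5):
    """Blend temporal and lexeme similarity into a soft target in [0, 1].
    alpha is a hypothetical mixing weight, not taken from the paper."""
    return (alpha * temporal_similarity(ref_span, hyp_span)
            + (1 - alpha) * lexeme_similarity(ref_word, hyp_word))

# The near-miss from above gets a high-but-not-perfect target,
# instead of the hard 0 a binary scheme would assign:
print(teles_score("president", "presidant", (1.20, 1.85), (1.22, 1.88)))  # ~0.9
```

That soft target is exactly the granularity binary labels throw away.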
⚙️ How It Works (Simplified)
- The ASR model transcribes the speech, producing a word sequence with word time stamps.
- TeLeS aligns the ASR output with the ground truth and computes:
  - Lexical similarity (word overlap)
  - Temporal similarity (start/end time of words)
- These become the target scores for the WLC model, which learns to predict how confident the ASR should be (sketched right after this list).
- During testing, this confidence score helps identify:
  - What the AI thinks it got right
  - What it’s unsure about
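Here's the target-generation step from the list above, sketched in Python and reusing `teles_score` from the earlier snippet. The word alignment via `difflib` is a convenient stand-in; the paper aligns hypothesis and reference words using their time stamps:

```python
from difflib import SequenceMatcher

def teles_targets(ref_words, hyp_words, ref_spans, hyp_spans):
    """One soft target per hypothesis word. Inserted words keep a
    target of 0; deleted reference words have no hypothesis-side slot.
    Lexical alignment here approximates the paper's time-based one."""
    targets = [0.0] * len(hyp_words)
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref_words, hyp_words).get_opcodes():
        if tag in ("equal", "replace"):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                targets[j] = teles_score(ref_words[i], hyp_words[j],
                                         ref_spans[i], hyp_spans[j])
    return targets
```

These per-word targets are what the confidence model is trained to regress.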
🚀 Bonus: TeLeS Learns Smarter with “Shrinkage Loss”
Because most words in a transcript are correct, models can become biased toward high confidence.
To fix this, the paper uses shrinkage loss:
- Focuses on learning from hard-to-learn (wrong or borderline) examples
- Shrinks the contribution of the “too easy” (obviously correct) ones, rather than letting them dominate training
This makes the confidence model more balanced and robust.
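For the curious, one common form of shrinkage loss (borrowed from the regression literature) looks like this in PyTorch. The hyperparameters `a` and `c` are illustrative defaults, not values from the paper:

```python
import torch

def shrinkage_loss(pred: torch.Tensor, target: torch.Tensor,
                   a: float = 10.0, c: float = 0.2) -> torch.Tensor:
    """Squared error scaled by a sigmoid-shaped modulating factor.
    Easy examples (small absolute error l) are shrunk toward zero loss;
    hard ones (large l) keep nearly their full squared-error penalty.
    a sets how sharply the shrinkage kicks in, c at what error level."""
    l = (pred - target).abs()
    modulator = 1.0 / (1.0 + torch.exp(a * (c - l)))
    return (modulator * l.pow(2)).mean()
```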
💡 Active Learning: TeLeS-A Knows What to Ask
The authors go a step further with TeLeS-A — an active learning system that picks:
- 🤔 Uncertain predictions to send to human annotators
- 🤖 Confident ones to self-label and add to training data
This human-in-the-loop setup improves the ASR over time — using TeLeS scores to guide what to learn next.
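A minimal sketch of that routing step, with made-up thresholds (the paper's actual selection criteria and cutoffs may differ):

```python
def route_by_confidence(utterances, confidences, low=0.4, high=0.9):
    """Split utterances by (e.g. average per-word) TeLeS-style confidence.
    The low/high thresholds are illustrative, not from the paper."""
    to_annotate, to_pseudo_label = [], []
    for utt, conf in zip(utterances, confidences):
        if conf < low:
            to_annotate.append(utt)        # uncertain -> ask a human
        elif conf > high:
            to_pseudo_label.append(utt)    # confident -> self-label
    return to_annotate, to_pseudo_label
```

Utterances in the middle band can simply be skipped, keeping the annotation budget focused where it helps most.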
🌍 Tested in 3 Indian Languages
This isn’t theory — it’s been tried on real-world datasets in:
- 🇮🇳 Hindi (Prasar Bharati and Common Voice)
- 🏛️ Tamil (IISc-MILE)
- 🎙️ Kannada (IISc-MILE)
And even on mismatched domains (i.e., data very different from the training set), TeLeS held up, outperforming state-of-the-art baselines.
📊 Results: More Trustworthy Predictions
Compared to previous methods, TeLeS:
- Had better calibration (the gap between confidence and accuracy was smaller)
- Handled subtle errors better than binary-label approaches
- Achieved lower Word Error Rates (WER) in active learning settings
| Method | WER ↓ | Calibration Error ↓ | Score Quality ↑ |
| --- | --- | --- | --- |
| Class-Prob | ❌ High | ❌ Poor | ⚠️ Basic |
| Entropy-Based | ❌ Mid | ⚠️ So-so | ⚠️ Inconsistent |
| Binary Labels | ⚠️ Mid | ⚠️ Inaccurate | ⚠️ Blunt Scoring |
| TeLeS | ✅ Low | ✅ Well-calibrated | ✅ Fine-Grained |
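A note on “calibration error”: it measures the gap between how confident the model claims to be and how often it is actually right. One standard way to quantify it is Expected Calibration Error (ECE), sketched below with equal-width bins; the paper's exact metric may differ:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: per-bin |mean accuracy - mean confidence|, weighted by the
    fraction of words that fall into each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for k in range(n_bins):
        lo, hi = edges[k], edges[k + 1]
        in_bin = (confidences > lo) & (confidences <= hi)
        if k == 0:  # include exact zeros in the first bin
            in_bin |= confidences == 0.0
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean()
                                       - confidences[in_bin].mean())
    return ece
```

A well-calibrated model that says “90% confident” should be right about 90% of the time; lower ECE means the scores track reality more closely.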
🧵 TL;DR: TeLeS Makes Voice AI More Honest
By fusing how closely a predicted word matches the reference with when it was spoken, this paper introduces a new confidence scoring system that helps speech recognition systems:
- Be more aware of their own errors
- Learn smarter, with fewer annotations
- Work better across languages and accents
It’s a step toward trustworthy AI that knows its limits — and asks for help when it’s unsure.