
What Real Malaysian Voices Reveal About Speech AI Limitations

19 December 2025 · Insights

Speech AI performs reliably under constrained conditions and degrades as speech becomes less structured and more context-dependent.

This tension is evident across many speech systems in use today. Models achieve strong results when inputs are clean, predictable, and well-segmented. Performance weakens once those assumptions about structure, consistency, and acoustic clarity are no longer valid.

Over two days in Melaka, we observed how speech AI behaves when those assumptions are removed in practice, through a classroom-based workshop where language was informal, multilingual, and shaped by noise and interruption.

Evaluating Speech Under Narrow Assumptions

Most speech AI systems assume speech will be clear, captured in quiet environments, and delivered in one language at a time. These constraints support comparability, reproducibility, and measurable performance gains, while also shaping the range of speech patterns the system learns to interpret.

In everyday settings, speech rarely conforms to those boundaries. Informal phrasing, mixed languages, cultural references, interruptions, and background noise introduce forms of context that resist normalization. When such conditions are underrepresented during training and evaluation, systems perform reliably within benchmarks but degrade once the assumptions no longer apply.

This gap between benchmark performance and operational reliability emerges whenever evaluation conditions diverge from how speech is produced and understood in practice.

Understanding this gap requires observing system behavior beyond benchmark settings, in environments where assumptions are relaxed rather than enforced and where variability reveals which behaviors persist, which degrade, and why.

Observing Speech AI Under Real Conditions

To make these dynamics visible to students, Chemin partnered with four secondary schools in Melaka. The workshop was conducted in classrooms and designed to help students understand how speech systems behave when presented with informal, multilingual, and acoustically variable inputs. Students responded to open-ended prompts in the language with which they were most comfortable and were encouraged to speak naturally, rather than following scripted responses.

Across two days, the workshop produced:

  • 706 voice recordings
  • 21 hours of audio
  • Coverage across four languages: English, Malay, Mandarin, and Tamil
  • Participation from 178 students, with 95 active recorders on Day One and 82 on Day Two

Recordings took place in classrooms without acoustic controls. Background noise, interruptions, and uneven microphone placement were common. Within the scope of the workshop, these conditions were preserved rather than corrected, since they reflect how speech is typically produced outside studio environments.


Students and facilitators during the Melaka workshop, where speech AI systems were tested against informal, multilingual classroom speech.

From Voice to Transcription: How the System Interprets Speech

To make system behavior observable, Chemin implemented a simplified, production-inspired pipeline that enabled participants to see how spoken input is processed and constrained at each stage.

Students worked through four stages that mirrored how speech AI systems typically handle audio:

  1. Prompt selection, where each student was matched with a language-appropriate speaking task.
  2. Audio capture and speech recognition, where recorded speech was converted into text.
  3. Transcription processing, where raw output was structured and normalized.
  4. Transcript review, where students could see how their speech had been interpreted and where meaning was altered or lost.

Figure: A simplified view of how speech AI systems process spoken input, from raw voice capture through transcription, review, and downstream interaction.

This structure helped separate how speech was produced from how it was interpreted, making it easier to capture where meaning shifted and where errors occurred.
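
To make the flow concrete, here is a minimal Python sketch of the same four stages. It is illustrative only, not Chemin's implementation: the prompt table, file name, and normalization rules are assumptions, and the open-source whisper package stands in for whichever recognition engine a given pipeline uses.

```python
# Illustrative sketch of the four-stage flow, not Chemin's implementation.
# The prompts, file path, and normalization rules are invented; openai-whisper
# serves as a stand-in ASR engine (pip install openai-whisper).
import re

import whisper

# Stage 1: hypothetical language-to-prompt mapping
PROMPTS = {
    "en": "Describe your favourite local dish.",
    "ms": "Ceritakan tentang makanan kegemaran anda.",
}

def select_prompt(language: str) -> str:
    """Stage 1: match the speaker with a language-appropriate task."""
    return PROMPTS[language]

def recognize(audio_path: str) -> str:
    """Stage 2: convert recorded speech into raw text."""
    model = whisper.load_model("base")  # small multilingual model
    return model.transcribe(audio_path)["text"]

def normalize(raw_text: str) -> str:
    """Stage 3: structure and normalize the raw ASR output."""
    text = raw_text.strip()
    return re.sub(r"\s+", " ", text)  # collapse irregular whitespace

def review(prompt: str, transcript: str) -> None:
    """Stage 4: surface the transcript so the speaker can spot drift."""
    print(f"Prompt:     {prompt}")
    print(f"Transcript: {transcript}")

if __name__ == "__main__":
    task = select_prompt("en")
    transcript = normalize(recognize("student_recording.wav"))  # hypothetical file
    review(task, transcript)
```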

A separate demo environment showed how transcripts could be used in a basic voice agent. The purpose was to illustrate how transcription quality affects interaction, not to demonstrate conversational performance. Additionally, the demo operated entirely in isolation from Chemin's production systems.
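
As a hedged illustration of how transcript quality propagates downstream, the toy agent below routes on exact phrases in the transcript. The intent table, replies, and misrecognized input are invented and unrelated to the actual demo; the point is only that a single misrecognized phrase silently drops the agent into its fallback.

```python
# Toy intent router, invented for illustration; not the workshop demo.
INTENTS = {
    "char kuey teow": "Char kuey teow! Do you prefer it with extra cockles?",
    "nasi lemak": "Nasi lemak for breakfast is hard to beat.",
}

def respond(transcript: str) -> str:
    """Match known phrases in the transcript; fall back when none match."""
    lowered = transcript.lower()
    for phrase, reply in INTENTS.items():
        if phrase in lowered:
            return reply
    return "Sorry, I didn't catch that. Could you say it again?"

print(respond("I had char kuey teow for lunch"))  # correct transcript: routed
print(respond("I had char way town for lunch"))   # invented misrecognition: fallback
```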


Students reviewing automated transcripts to observe how their speech was interpreted, simplified, or misrecognized by the system.

How Real Speech Exposed System Limits

When students tested the system with informal, multilingual, and noisy speech, recurring patterns became visible. These patterns were not isolated errors. They reflected how the system behaved once the conditions it implicitly relied on for accuracy were no longer present.

From a commercial data perspective, many recordings resembled the kinds of everyday audio that pose challenges for speech systems. Several students prepared responses in advance to meet recording length requirements, which resulted in portions of audio sounding more scripted than conversational. Code-switching occurred frequently, often mid-sentence, as students moved between languages to express ideas more naturally. Ambient noise was present across most recordings due to classroom activity, movement, and interruptions.

Under these conditions, the system exhibited consistent behaviors:

  • Transcription accuracy declined when multiple languages were mixed within a single utterance.
  • Local terms and culturally specific phrases were frequently misrecognized or omitted.
  • Background noise interfered with recognition even when the speech volume was sufficient.
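
One common way to quantify this kind of accuracy decline is word error rate (WER), the standard transcription-accuracy metric. Below is a minimal, self-contained WER sketch with an invented code-switched example; the numbers are not drawn from the workshop data.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Invented code-switched example: a local dish name is misrecognized.
print(wer("saya makan char kuey teow semalam",
          "saya makan char way town semalam"))  # 0.33 (2 of 6 words wrong)
```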

In one session, the system failed to recognize the phrase "char kuey teow." This reflected limited exposure to local terms rather than a technical fault in the system itself. For students, the moment offered a clear example of how speech systems can struggle with culturally specific vocabulary that falls outside their training data.

Taken together, these behaviors followed a clear pattern. They emerged because the system was operating outside the conditions for which it had been trained and evaluated. As speech departed from structured assumptions, accuracy declined in expected ways.

What Real Speech Reveals About System Design

This workshop was not designed to improve performance metrics. Its value lay in making system limitations visible under real conditions.

Across 21 hours of multilingual audio, performance degraded steadily as speech became less structured and more context-dependent. At an individual level, students observed how their own speech was simplified, misinterpreted, or fragmented during transcription.

For Chemin, this exercise reinforced a critical distinction: clean data supports measurement, while representative data exposes risk. Both are necessary, but they serve different purposes in system design and evaluation.

All recordings were collected strictly for educational purposes, with full school approval and participant consent. The audio remained within the workshop environment and was not exported, retained for future use, or incorporated into any Chemin datasets.

This work highlights that speech AI limitations are not solely a question of model capability. They are also shaped by data selection, evaluation design, and the conditions under which systems are observed. Systems trained primarily on structured inputs will struggle when exposed to informal, multilingual, or noisy speech. These failures are repeatable and foreseeable, not exceptional.

Understanding how systems behave under real speech conditions is as important as measuring their performance under ideal conditions. For teams building speech-enabled systems, this understanding should inform data strategy, evaluation criteria, and expectations of performance in real-world use.

At Chemin, we collaborate with teams developing speech systems to examine how underlying assumptions hold up in real usage.

While the Melaka workshop was purely educational, similar principles guide our advisory work with partners. From multilingual voice collection to interpretation-focused evaluation pipelines, our work focuses on making system limitations visible before they surface in deployment.

If you're developing voice-driven systems and want a clearer understanding of how they behave beyond controlled settings, reach out to us or explore our case studies.
