April 19, 2026 · 7 min read

Why AI translation isn’t enough for real conversation.

The quality of machine translation has improved dramatically in the last decade. You can paste a paragraph of French into Google Translate and get an English version that is, on its own terms, fluent. What you cannot do is have a genuine conversation through it — and the reason is not a latency problem or a vocabulary problem. It is a fundamental difference between what translation does and what conversation is.

Translation is the movement of meaning between codes. You have a message in French. You produce the equivalent message in English. If the translation is accurate, the English message means what the French message meant. This is a well-defined problem, and AI is increasingly good at it.

Conversation is something else. Conversation is not the exchange of messages. It is the co-construction of a shared reality between two or more people, in real time, with incomplete information, using language that is as much about relationship as about content. When you say "I’ll think about it," you might mean that you’ll think about it, or you might mean no. The words are the same. The meaning depends on your relationship with the person, the context of what was just discussed, your tone of voice, and a hundred other factors that are not in the text.

The five gaps that AI translation hasn’t closed

1. The context gap

AI translation systems produce their best output when given full sentences or paragraphs of text. Real conversation is not structured in full sentences. It is fragments, interruptions, references to things said minutes or hours ago, and half-finished thoughts that both speakers complete together. "Did you see it?" in a conversation means something completely different depending on what came before. Translation systems that don’t carry the full context of a conversation — and most don’t — produce technically correct but contextually wrong output.

The most common failure mode is pronoun reference. English uses "it" as a placeholder for almost anything. So do Spanish, French, and German, but the rules for which pronoun matches which noun differ in each language, and they depend on what was said before. A translation system working on one utterance at a time will guess, and will sometimes be wrong in ways that are confusing or embarrassing.
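To make the failure mode concrete, here is a deliberately simplified sketch (not a real translation system) of why "it" cannot be translated into French without conversation history: French pronouns agree with the grammatical gender of their antecedent, so the correct choice depends on which noun was mentioned earlier. The noun table and function names are illustrative inventions.

```python
# Toy illustration: French subject pronouns agree with the grammatical
# gender of their antecedent. "bridge" (le pont) is masculine, "car"
# (la voiture) is feminine, so English "it" has no fixed translation.
FRENCH_GENDER = {
    "bridge": "m",
    "car": "f",
}

def translate_it(history):
    """Pick the French pronoun for English 'it', given the nouns
    mentioned earlier in the conversation (most recent last)."""
    for noun in reversed(history):
        gender = FRENCH_GENDER.get(noun)
        if gender:
            return "il" if gender == "m" else "elle"
    # No antecedent in context: a per-utterance system can only guess.
    return "il"

print(translate_it(["car"]))     # with context: "elle"
print(translate_it(["bridge"]))  # with context: "il"
print(translate_it([]))          # no context: a coin-flip guess
```

With the conversation history, the pronoun resolves correctly; strip the history away, as utterance-at-a-time systems do, and the output is a guess that is wrong roughly half the time.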

2. The latency gap

Current voice-to-voice translation systems add a delay that ranges from fractions of a second to several seconds, depending on the language pair, the system, and network conditions. This sounds trivial until you remember that human conversation is exquisitely sensitive to timing. The difference between a reply that comes 200 milliseconds after a question and one that comes two seconds after is the difference between a natural response and an awkward pause — and awkward pauses carry their own meaning. A two-second pause before "I love you" is very different from a two-second pause before "it was fine."

250ms
The average human conversational response time. Delays beyond 500ms start to feel socially unnatural. Many translation systems still introduce 1–3 seconds of latency on average, changing the character of the conversation entirely.

The latency problem is not just about comfort. It changes what people say. When there is a delay, speakers simplify their messages to reduce the translation burden. They speak in shorter sentences. They avoid idioms, jokes, and anything that requires context. The conversation that emerges through a translation system with meaningful latency is a flatter, simpler, less human version of the conversation that would have happened without it.

3. The prosody gap

A significant fraction of conversational meaning is carried not by words but by how words are said: pitch, pace, emphasis, pausing, vocal quality. "That’s interesting" said with rising intonation and wide eyes is curiosity. "That’s interesting" said with a flat tone and a long pause is skepticism verging on dismissal. The words are identical.

Most AI translation systems deliver their output as text, which loses the prosodic layer entirely, or as synthetic speech, which approximates prosody based on text-level cues but cannot replicate the emotional nuance of the original speaker’s voice. The listener is receiving a translation of what was said, not a translation of what was meant. For transactional exchanges, this rarely matters. For conversations where emotional register matters — almost all human conversation worth having — it matters enormously.

4. The cultural gap

Languages are not isomorphic. They do not carve reality into the same pieces. Some concepts that exist as single words in one language require phrases or entire sentences in another. Some concepts simply do not exist in certain languages because the culture that produced the language did not have occasion to name them. When a Japanese speaker uses "wa" as a topic marker, the grammatical function has no direct equivalent in English, and a literal translation loses something about the relationship between what is being said and what has already been established.

Humor is particularly brutal for machine translation. Puns, wordplay, and most forms of irony are language-internal — they work because of specific features of the source language that do not transfer. The best professional interpreters in the world routinely substitute jokes rather than translate them, finding the culturally equivalent moment of levity in the target language. AI systems do not have this option; they translate the words and the joke dies.

5. The friction gap

The final gap is the gap between a tool and infrastructure. A tool is something you pick up, use, and put down. It requires you to know you need it, decide to use it, and operate it. Infrastructure is something that works without any of that — you don’t "use" electricity or internet connectivity, you just do what you were going to do and they work.

Current AI translation tools are still tools. You open an app. You tap a microphone button. You hold the phone between you and another person in an awkward position. One of you speaks. There is a pause. A translation appears or plays. The other responds. You swap the microphone again. The conversation is mediated by a device that both parties are aware of at all times, which changes the dynamic of the interaction in subtle but important ways.

“The best technology disappears. You stop being aware of it and just do the thing you wanted to do. Translation hasn’t disappeared yet. It’s still a tool you can see yourself using.”

What the gap means in practice

The result of all five gaps is that AI translation tools work well for a specific subset of conversational scenarios: structured exchanges where context is obvious, the stakes of a misunderstanding are low, and the emotional register of the conversation is not the point. This is a genuinely useful subset. Ordering food in a country where you don’t speak the language, asking for directions, making simple transactional requests — for these, Google Translate’s Conversation Mode or similar tools are adequate and sometimes excellent.

The subset they don’t serve well is the one that defines human connection: the longer conversations where something real is being worked out between two people, where tone and timing matter, where the relationship is evolving through the exchange itself. This is where language barriers actually cost us the most — not the inability to order coffee, but the inability to build a friendship with someone who happened to grow up speaking a different language.

7,000+
Languages spoken in the world. Google Translate covers roughly 250. The vast majority of human languages — and the communities that speak them — remain outside even the best AI translation tools.

The goal for the next generation of multilingual communication infrastructure is not better translation in the technical sense. The translation quality of current systems is already good enough for most use cases. The goal is to close the five gaps: to make context persistent, latency invisible, prosody preserved, cultural register handled, and friction eliminated. When those gaps are closed, translation stops being something you do and starts being something that happens — and conversation between people who grew up speaking different languages becomes as natural as conversation between people who grew up speaking the same one.

Babel is building the infrastructure, not just the tool.

Real-time multilingual conversation — invisible, ambient, and built into every interaction.

Join the Waitlist →

Related reading: Why Real-Time Translation Changes Everything · Babel vs Google Translate · Babel vs DeepL

Babel — conversation without language barriers
