State of the Art

A common question we receive: "Why build specialized sign language technology when large language models are improving so rapidly?"

This document demonstrates why general-purpose AI models—despite their impressive capabilities in text and image generation—fundamentally fail at sign language translation. The gap between current AI capabilities and functional sign language translation is exactly the opportunity InReach addresses.

Why General AI Models Fail at Sign Language

Sign language is not simply "gestures" that can be generated from text descriptions. It's a complete language with:

Phonological structure: Handshape, location, movement, orientation, non-manual features
Spatial grammar: 3D space used for reference, agreement, classifiers
Simultaneity: Multiple articulators conveying different information at once
Linguistic precision: Small differences in hand position change meaning entirely

General AI models treat sign language as a visual pattern to mimic, not a language to translate. This is why they fail consistently.
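
To make the point about linguistic precision concrete, the sketch below models a sign as a bundle of the phonological parameters listed above and compares two signs. The `SignParameters` class, the `differing_parameters` helper, and the coarse feature values are illustrative assumptions, not a formal notation or part of InReach's system. The ASL signs for "mother" and "father" are used because they are commonly cited as differing only in location (chin versus forehead).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SignParameters:
    """Simplified phonological description of one sign (illustrative only)."""
    handshape: str
    location: str
    movement: str
    orientation: str
    non_manual: str


def differing_parameters(a: SignParameters, b: SignParameters) -> list[str]:
    """Return the names of the parameters whose values differ between two signs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Hand-written, coarse feature values (assumption); a real lexicon would use
# a formal notation such as SignWriting rather than free-text labels.
MOTHER = SignParameters("5", "chin", "tap", "palm-sideways", "neutral")
FATHER = SignParameters("5", "forehead", "tap", "palm-sideways", "neutral")

print(differing_parameters(MOTHER, FATHER))  # ['location']
```

A model that gets the handshape and movement roughly right but places the hand a few centimeters off has produced a different word, which is why "plausible-looking" output is not enough.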

Spoken-to-Signed Translation

Video Generation Models

State-of-the-art video generation models (Veo 2, Sora) can create photorealistic humans performing movements—but they cannot generate linguistically accurate sign language.

Google Veo 2 (2025/05/10)

Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

[Generated videos: Video 1, Video 2]

Result: Generates plausible-looking hand movements, but not the actual ASL sign for "House" (which requires two flat hands forming a roof shape).

OpenAI Sora (2024/12/14)

Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

Settings

Aspect ratio: 1:1
Resolution: 480p
Duration: 5 seconds
Quantity: 2

[Generated videos: Video 1, Video 2]

Result: Similar failure—generates hand movements that look sign-like but are linguistically incorrect.

Attempting Movement Descriptions

When we provide detailed movement instructions (using signwriting-description) rather than asking for translation:

Prompt

Sign language interpreter, green screen background, performing an American Sign Language sign. With your dominant hand open, touch your forehead and move your hand away, palm facing out.

[Generated videos: Video 1, Video 2]

Result: Better adherence to instructions, but still under-specified—the model doesn't understand that hand orientation, speed, and facial expression are part of the linguistic sign for "Hello."

Why this matters: Describing every phonological feature in text is impractical, since it would take paragraphs per sign. Even with such a description, the model has no understanding of sign language grammar, non-manual features, or spatial relationships.

Image Generation Models

Image models are more mature than video models, but face the same fundamental problem: they generate visual patterns, not linguistic structures.

OpenAI DALL-E 3 via ChatGPT 4o (2024/12/08)

ChatGPT 4o Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

ChatGPT reformulates the prompt for DALL-E 3:

DALL-E 3 Prompt

A professional sign language interpreter signing the American Sign Language (ASL) sign for 'House' with clear and accurate hand positioning. The background is a plain green screen, suitable for chroma keying. The interpreter is dressed in neutral, professional attire, and the scene is well-lit to ensure clarity of the sign and facial expressions.

DALL-E 3 Image Generated using ChatGPT 4o Prompt

Result: Wrong handshapes, incorrect positioning, not a linguistically valid sign. The model appears to have little or no training data mapping English words to ASL signs.

StabilityAI Stable Diffusion 3.5 Large (2024/12/08)

Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

Stable Diffusion 3.5 Large Image Generated using Prompt

Result: Similar failure—generates plausible-looking hand configurations that are not the actual sign.

Google DeepMind Imagen 3 (2024/12/05)

Result: The model refuses to generate images of sign language signs entirely.

Why these models fail: They are trained on internet images with text captions. There are virtually no labeled datasets of "ASL sign for [word]" at the scale needed to train these models. Sign language exists primarily in video, not static images with text labels.

Signed-to-Spoken Translation

OpenAI

OpenAI models (GPT-4o, GPT-4 Turbo) do not currently accept video input, so they cannot be tested on sign language recognition from video.

Google

Google is conducting research on sign language translation [2], but its publicly available models fail dramatically at the task.

Google Gemini 1.5 Pro (2024/12/08)

We tested Gemini 1.5 Pro with a video of a woman signing "What is your name?" in ASL.

Prompt + Attached video "example-sentence.mp4"

Translate the attached American Sign Language video "example-sentence.mp4" into English.

Gemini 1.5 Pro responses (5 attempts):

"I'm sorry, I don't understand."
"Excuse me. I have a question."
"Hello, my name is [name]. Nice to meet you."
"Excuse me. Do you mind if I sit here?"
"Excuse me. Do you have a second? Do you mind if I ask you a question?"

Actual meaning: "What is your name?"

Gemini 2.0 Flash (2024/12/15) responses:

"Don't understand."
"Stop... I am thinking about this... I'm not sure."
"Stop, I think I'm done?"
"I don't know."
"I don't understand."

Analysis: The model is hallucinating English text with no relationship to the actual signing. It cannot recognize individual signs, let alone understand sign language grammar or translate fluently.
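
The observation above can be checked mechanically. The following minimal sketch takes the responses quoted in this section, counts how many match the reference translation after normalization, and counts how many distinct outputs each model produced for the same video. The helper names (`tokens`, `exact_matches`, `distinct_outputs`) are assumptions made for illustration. This is a crude sanity check, not a translation-quality metric.

```python
import re

REFERENCE = "What is your name?"

# Responses copied from the test runs quoted in this section.
RESPONSES = {
    "Gemini 1.5 Pro": [
        "I'm sorry, I don't understand.",
        "Excuse me. I have a question.",
        "Hello, my name is [name]. Nice to meet you.",
        "Excuse me. Do you mind if I sit here?",
        "Excuse me. Do you have a second? Do you mind if I ask you a question?",
    ],
    "Gemini 2.0 Flash": [
        "Don't understand.",
        "Stop... I am thinking about this... I'm not sure.",
        "Stop, I think I'm done?",
        "I don't know.",
        "I don't understand.",
    ],
}


def tokens(text: str) -> list[str]:
    """Lowercase a sentence and keep only its words, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())


def exact_matches(reference: str, candidates: list[str]) -> int:
    """Count candidates whose normalized words equal the reference's."""
    ref = tokens(reference)
    return sum(tokens(c) == ref for c in candidates)


def distinct_outputs(candidates: list[str]) -> int:
    """Count how many different normalized sentences the model produced."""
    return len({" ".join(tokens(c)) for c in candidates})


for model, outputs in RESPONSES.items():
    print(model, exact_matches(REFERENCE, outputs), distinct_outputs(outputs))
```

A model that actually recognized the signs would be expected to converge on one translation across attempts. Five different answers to the same input is consistent with text being generated independently of the signing.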

Why this fails: General vision-language models are trained on internet videos with spoken audio or text captions. They have little or no training data for translating sign language video into text, so they process signing as hand movements rather than as language.

The Specialized Approach: Why InReach Works

General AI models fail at sign language because they:

Lack sign language training data at scale
Treat signing as visual patterns, not linguistic structures
Cannot handle spatial grammar or multi-channel simultaneity
Have no understanding of phonology (handshape, location, movement, orientation)
Cannot generate or recognize non-manual features (facial grammar)

InReach's specialized approach:

Spoken-to-Signed Pipeline

Text → SignWriting translation: Uses machine translation models trained on parallel sign language corpora
SignWriting → Pose generation: Converts linguistic notation to 3D skeletal poses
Pose → Video rendering: Generates photorealistic or avatar-based signing from poses

Why this works: By using SignWriting as an intermediate representation, we separate the translation problem (language-to-language) from the rendering problem (notation-to-video). General AI models try to solve both simultaneously and fail at both.
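
The separation described above can be sketched as three independently replaceable stages. Everything in this example is hypothetical: the type aliases, the `SpokenToSignedPipeline` class, and the stub stages are placeholders chosen for illustration and do not reflect InReach's actual code or models.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage types (assumptions, not InReach's data formats).
SignWriting = list[str]           # one notation token per sign
PoseSequence = list[list[float]]  # one flat keypoint vector per frame
Video = list[bytes]               # one encoded frame per pose

FRAMES_PER_SIGN = 2  # arbitrary value so the stub produces some motion


@dataclass(frozen=True)
class SpokenToSignedPipeline:
    translate: Callable[[str], SignWriting]
    animate: Callable[[SignWriting], PoseSequence]
    render: Callable[[PoseSequence], Video]

    def __call__(self, text: str) -> Video:
        notation = self.translate(text)  # language-to-language step
        poses = self.animate(notation)   # notation-to-motion step
        return self.render(poses)        # motion-to-pixels step


# Placeholder stages so the sketch runs; real stages would be trained models.
def stub_translate(text: str) -> SignWriting:
    return [f"SW({word.lower()})" for word in text.split()]


def stub_animate(notation: SignWriting) -> PoseSequence:
    return [[0.0, 0.0, 0.0] for _ in notation for _ in range(FRAMES_PER_SIGN)]


def stub_render(poses: PoseSequence) -> Video:
    return [b"frame" for _ in poses]


pipeline = SpokenToSignedPipeline(stub_translate, stub_animate, stub_render)
video = pipeline("What is your name")
print(len(video))  # 8 frames: 4 words x 2 frames per sign
```

Because each stage has a typed boundary, the translation step can be evaluated against parallel corpora while the renderer is swapped between an avatar and a photorealistic model without retraining anything upstream.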

Signed-to-Spoken Pipeline

Video → Pose estimation: MediaPipe Holistic extracts 543 3D keypoints
Pose → SignWriting transcription: Trained models recognize signs from pose sequences
SignWriting → Text translation: Machine translation to spoken language

Why this works: We use pose estimation to reduce the visual complexity, then apply linguistic models trained on sign language data. General AI models try to go directly from pixels to text and hallucinate.
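
As a rough illustration of how pose estimation reduces visual complexity, the 543 keypoints correspond to MediaPipe Holistic's per-frame landmark sets: 33 for the body pose, 468 for the face, and 21 for each hand. The sketch below assembles those sets into one fixed-length frame. The `frame_keypoints` function and the zero-fill policy for undetected parts are assumptions made for this example, and the input is synthetic. No MediaPipe calls are made here.

```python
from typing import Optional

Landmark = tuple[float, float, float]  # (x, y, z)

# Landmark counts per component; together they give the 543 keypoints.
LAYOUT = {"pose": 33, "face": 468, "left_hand": 21, "right_hand": 21}
TOTAL_KEYPOINTS = sum(LAYOUT.values())

# Assumption: undetected components are zero-filled so every frame has the
# same length. A real system might mask or interpolate them instead.
MISSING: Landmark = (0.0, 0.0, 0.0)


def frame_keypoints(
    detections: dict[str, Optional[list[Landmark]]]
) -> list[Landmark]:
    """Concatenate per-component landmarks into one fixed-length frame."""
    frame: list[Landmark] = []
    for part, count in LAYOUT.items():
        points = detections.get(part)
        if points is None:
            points = [MISSING] * count
        if len(points) != count:
            raise ValueError(f"{part}: expected {count} landmarks, got {len(points)}")
        frame.extend(points)
    return frame


# Synthetic frame in which the left hand was not detected.
example = {
    "pose": [(0.5, 0.5, 0.0)] * 33,
    "face": [(0.5, 0.3, 0.0)] * 468,
    "left_hand": None,
    "right_hand": [(0.6, 0.6, 0.0)] * 21,
}
frame = frame_keypoints(example)
print(TOTAL_KEYPOINTS, len(frame))  # 543 543
```

Each frame becomes 543 × 3 = 1,629 numbers, a far smaller input than raw pixels, and one that discards background, clothing, and lighting.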

The Competitive Moat

Question: "Won't OpenAI/Google/Meta eventually solve this?"

Answer: Yes, eventually—but not soon, and not better than specialized solutions. Here's why:

1. Data Scarcity

General models need millions of labeled examples
Sign language video data is scarce and expensive to annotate
Even Meta/Google's research efforts struggle with limited datasets

2. Linguistic Complexity

Sign languages are not "gestural encodings" of spoken languages
Spatial grammar, non-manual features, classifiers require linguistic understanding
General models don't have the architecture to handle multi-channel simultaneity

3. Deployment Model

Even if they achieve good translation, they'll offer it as cloud APIs
InReach's client-side processing provides privacy, offline capability, and zero server costs
Our deployment model (browser extension, zero redesign) is a competitive advantage independent of translation quality

4. Specialization Beats Generalization

Medical imaging AI beats general vision models for diagnosis
Code-specific models (GitHub Copilot) beat general LLMs for programming
Sign language requires specialized architecture, data, and evaluation

Timeline estimate: General AI companies might achieve functional sign language translation in 3-5 years. By then, InReach will have:

Millions of users
Partnerships with major platforms
Proprietary training data from user interactions
3-5 year technical lead in specialized architectures

Conclusion

The state-of-the-art in general AI demonstrates exactly why InReach exists:

✅ General models cannot translate sign language despite billions in R&D
✅ Specialized approaches work (proven in academic research)
✅ The gap between general AI and functional sign language translation is our opportunity
✅ Our deployment model (client-side, universal) provides a moat even as models improve

We're not building sign language technology because general AI is failing.
We're building it because sign language is a complete language that requires specialized, linguistically-informed approaches.

And we're building it now because the technology stack (MediaPipe, TensorFlow.js, Transformers, SignWriting) has finally matured enough to make real-time, client-side translation possible.

[1] Rust et al. 2024. Towards Privacy-Aware Sign Language Translation at Scale.
[2] Zhang et al. 2024. Scaling Sign Language Translation.
