State of the Art

A common question we receive: "Why build specialized sign language technology when large language models are improving so rapidly?"

This document demonstrates why general-purpose AI models—despite their impressive capabilities in text and image generation—fundamentally fail at sign language translation. The gap between current AI capabilities and functional sign language translation is exactly the opportunity InReach addresses.

Why General AI Models Fail at Sign Language

Sign language is not simply "gestures" that can be generated from text descriptions. It's a complete language with:

  • Phonological structure: Handshape, location, movement, orientation, non-manual features
  • Spatial grammar: 3D space used for reference, agreement, classifiers
  • Simultaneity: Multiple articulators conveying different information at once
  • Linguistic precision: Small differences in hand position change meaning entirely

General AI models treat sign language as a visual pattern to mimic, not a language to translate. This is why they fail consistently.
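
To make the point about linguistic precision concrete, here is a minimal sketch of how a sign's phonological parameters could be represented and compared. The `SignParameters` class, its field values, and the coarse glosses of MOTHER and FATHER are illustrative assumptions, not InReach's data model; real phonological descriptions are far richer.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SignParameters:
    """Simplified phonological description of one sign (illustrative only)."""
    handshape: str
    location: str
    movement: str
    orientation: str
    non_manual: str = "neutral"


def differing_parameters(a: SignParameters, b: SignParameters) -> List[str]:
    """Return the names of the parameters on which two signs differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Coarse, hand-written glosses of two ASL signs (assumed values for illustration).
MOTHER = SignParameters("open-5", "chin", "tap", "palm-left")
FATHER = SignParameters("open-5", "forehead", "tap", "palm-left")

print(differing_parameters(MOTHER, FATHER))  # ['location']
```

In this simplified encoding the two signs differ in a single parameter, location. A model that only approximates the overall visual pattern has no reason to preserve that one contrast.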


Spoken-to-Signed Translation

Video Generation Models

State-of-the-art video generation models (Veo 2, Sora) can create photorealistic humans performing movements—but they cannot generate linguistically accurate sign language.

Google Veo 2 (2025/05/10)

Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

[Videos 1 and 2: Veo 2 outputs generated using the prompt above]

Result: Generates plausible-looking hand movements, but not the actual ASL sign for "House" (which requires two flat hands forming a roof shape).

OpenAI Sora (2024/12/14)

Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

Settings

  • Aspect ratio: 1:1
  • Resolution: 480p
  • Duration: 5 seconds
  • Quantity: 2

[Videos 1 and 2: Sora outputs generated using the prompt above]

Result: Similar failure—generates hand movements that look sign-like but are linguistically incorrect.

Attempting Movement Descriptions

When we provide detailed movement instructions (using signwriting-description) rather than asking for translation:

Prompt

Sign language interpreter, green screen background, performing an American Sign Language sign. With your dominant hand open, touch your forehead and move your hand away, palm facing out.

[Videos 1 and 2: Sora outputs generated using the prompt above]

Result: The output follows the instructions more closely, but the description is still under-specified: the model does not treat hand orientation, speed, and facial expression as part of the sign for "Hello," which is the sign this description encodes.

Why this matters: Describing every phonological feature in text is impractical (it would take paragraphs per sign), and even with such a description the model has no understanding of sign language grammar, non-manual features, or spatial relationships.

Image Generation Models

Image models are more mature than video models, but face the same fundamental problem: they generate visual patterns, not linguistic structures.

OpenAI DALL-E 3 via ChatGPT 4o (2024/12/08)

ChatGPT 4o Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

ChatGPT reformulates the prompt for DALL-E 3:

DALL-E 3 Prompt

A professional sign language interpreter signing the American Sign Language (ASL) sign for 'House' with clear and accurate hand positioning. The background is a plain green screen, suitable for chroma keying. The interpreter is dressed in neutral, professional attire, and the scene is well-lit to ensure clarity of the sign and facial expressions.

[Image: DALL-E 3 output generated using the ChatGPT 4o prompt]

Result: Wrong handshapes, incorrect positioning, not a linguistically valid sign. The model appears to have little or no training data mapping English words to ASL signs.

StabilityAI Stable Diffusion 3.5 Large (2024/12/08)

Prompt

Sign language interpreter, green screen background, signing the American Sign Language sign for "House".

[Image: Stable Diffusion 3.5 Large output generated using the prompt above]

Result: Similar failure—generates plausible-looking hand configurations that are not the actual sign.

Google DeepMind Imagen 3 (2024/12/05)

Result: The model refuses to generate images of sign language signs entirely.

Why these models fail: They are trained on internet images with text captions. There are virtually no labeled datasets of "ASL sign for [word]" at the scale needed to train these models. Sign language exists primarily in video, not static images with text labels.


Signed-to-Spoken Translation

OpenAI

OpenAI models (GPT-4o, GPT-4 Turbo) did not accept video input at the time of writing, so sign language recognition from video cannot be tested with them directly.

Meta

Meta is conducting research on sign language translation[1], but their models are not publicly available and remain research-only.

Google

Google is also conducting research on sign language translation[2], but publicly available models fail dramatically at the task.

Google Gemini 1.5 Pro (2024/12/08)

We tested Gemini 1.5 Pro with a video of a woman signing "What is your name?" in ASL.

Prompt + Attached video "example-sentence.mp4"

Translate the attached American Sign Language video "example-sentence.mp4" into English.

Gemini 1.5 Pro responses (5 attempts):

  1. "I'm sorry, I don't understand."
  2. "Excuse me. I have a question."
  3. "Hello, my name is [name]. Nice to meet you."
  4. "Excuse me. Do you mind if I sit here?"
  5. "Excuse me. Do you have a second? Do you mind if I ask you a question?"

Actual meaning: "What is your name?"

Gemini 2.0 Flash (2024/12/15) responses:

  1. "Don't understand."
  2. "Stop... I am thinking about this... I'm not sure."
  3. "Stop, I think I'm done?"
  4. "I don't know."
  5. "I don't understand."

Analysis: Both models hallucinate English text with little or no relationship to the actual signing. They cannot recognize individual signs, let alone understand sign language grammar or translate fluently.

Why this fails: General vision-language models are trained on internet videos with spoken audio or text captions. They have little or no training data for sign language video → text translation. As a result, they process signing as hand movements, not as language.
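
The ten responses above can be scored against the reference with a crude lexical measure. The sketch below computes, for each response, the fraction of reference words it contains, and counts exact matches. This is an illustrative check we add here, not the evaluation protocol behind the tests; word overlap says nothing about meaning.

```python
import re

REFERENCE = "What is your name?"

GEMINI_1_5_PRO = [
    "I'm sorry, I don't understand.",
    "Excuse me. I have a question.",
    "Hello, my name is [name]. Nice to meet you.",
    "Excuse me. Do you mind if I sit here?",
    "Excuse me. Do you have a second? Do you mind if I ask you a question?",
]

GEMINI_2_0_FLASH = [
    "Don't understand.",
    "Stop... I am thinking about this... I'm not sure.",
    "Stop, I think I'm done?",
    "I don't know.",
    "I don't understand.",
]


def tokens(text: str) -> set:
    """Lowercase word set; apostrophes are kept so "don't" stays one token."""
    return set(re.findall(r"[a-z']+", text.lower()))


def reference_recall(hypothesis: str, reference: str) -> float:
    """Fraction of reference words that also appear in the hypothesis."""
    ref = tokens(reference)
    return len(ref & tokens(hypothesis)) / len(ref)


for model, outputs in (("Gemini 1.5 Pro", GEMINI_1_5_PRO),
                       ("Gemini 2.0 Flash", GEMINI_2_0_FLASH)):
    scores = [reference_recall(o, REFERENCE) for o in outputs]
    exact = sum(tokens(o) == tokens(REFERENCE) for o in outputs)
    print(model, scores, "exact matches:", exact)
# Gemini 1.5 Pro [0.0, 0.0, 0.5, 0.0, 0.0] exact matches: 0
# Gemini 2.0 Flash [0.0, 0.0, 0.0, 0.0, 0.0] exact matches: 0
```

The only non-zero score comes from attempt 3, which shares the words "is" and "name" with the reference while stating the opposite speech act (an introduction instead of a question).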


The Specialized Approach: Why InReach Works

General AI models fail at sign language because they:

  1. Lack sign language training data at scale
  2. Treat signing as visual patterns, not linguistic structures
  3. Cannot handle spatial grammar or multi-channel simultaneity
  4. Have no understanding of phonology (handshape, location, movement, orientation)
  5. Cannot generate or recognize non-manual features (facial grammar)

InReach's specialized approach:

Spoken-to-Signed Pipeline

  1. Text → SignWriting translation: Uses machine translation models trained on parallel sign language corpora
  2. SignWriting → Pose generation: Converts linguistic notation to 3D skeletal poses
  3. Pose → Video rendering: Generates photorealistic or avatar-based signing from poses

Why this works: By using SignWriting as an intermediate representation, we separate the translation problem (language-to-language) from the rendering problem (notation-to-video). General AI models try to solve both simultaneously and fail at both.
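
The separation described above can be sketched as three independently replaceable stages. Everything in this block is a hypothetical illustration: the type aliases, the `SpokenToSignedPipeline` class, and the toy stage functions are stand-ins we made up so the sketch runs, not InReach's actual interfaces or models.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

SignWriting = List[str]            # placeholder: one notation string per sign
PoseSequence = List[List[float]]   # placeholder: one flat keypoint list per frame


@dataclass(frozen=True)
class SpokenToSignedPipeline:
    translate: Callable[[str], SignWriting]         # stage 1: text -> SignWriting
    animate: Callable[[SignWriting], PoseSequence]  # stage 2: SignWriting -> poses
    render: Callable[[PoseSequence], bytes]         # stage 3: poses -> video bytes

    def run(self, text: str) -> bytes:
        notation = self.translate(text)   # language-to-language problem
        poses = self.animate(notation)    # notation-to-motion problem
        return self.render(poses)         # motion-to-pixels problem


# Toy stand-ins so the sketch executes; none of these is a real model.
def toy_translate(text: str) -> SignWriting:
    return [f"<sign:{word}>" for word in text.lower().split()]


def toy_animate(notation: SignWriting) -> PoseSequence:
    return [[0.0, 0.0, 0.0] for _ in notation]  # one dummy frame per sign


def toy_avatar_render(poses: PoseSequence) -> bytes:
    return f"avatar:{len(poses)} frames".encode()


def toy_photo_render(poses: PoseSequence) -> bytes:
    return f"photo:{len(poses)} frames".encode()


pipeline = SpokenToSignedPipeline(toy_translate, toy_animate, toy_avatar_render)
print(pipeline.run("hello world"))  # b'avatar:2 frames'

# Swapping the renderer leaves the translation and pose stages untouched.
photo_pipeline = replace(pipeline, render=toy_photo_render)
print(photo_pipeline.run("hello world"))  # b'photo:2 frames'
```

The design point is that each stage can be trained, evaluated, and replaced on its own, which an end-to-end text-to-video model does not allow.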

Signed-to-Spoken Pipeline

  1. Video → Pose estimation: MediaPipe Holistic extracts 543 3D keypoints
  2. Pose → SignWriting transcription: Trained models recognize signs from pose sequences
  3. SignWriting → Text translation: Machine translation to spoken language

Why this works: We use pose estimation to reduce the visual complexity, then apply linguistic models trained on sign language data. General AI models try to go directly from pixels to text and hallucinate.
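
As a rough illustration of the first step, the 543 keypoints break down into body pose, face, and two hands. The sketch below assembles one fixed-size frame from per-component detections and pads any undetected component with NaN points. It does not call MediaPipe; the component counts (33 pose, 468 face, 21 per hand) reflect our understanding of MediaPipe Holistic's output, and `assemble_frame` and the padding policy are assumptions made for this example.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]  # (x, y, z)

# Landmark counts per component (assumed to match MediaPipe Holistic).
COMPONENTS = (("pose", 33), ("face", 468), ("left_hand", 21), ("right_hand", 21))
TOTAL_KEYPOINTS = sum(count for _, count in COMPONENTS)  # 543

MISSING: Point = (math.nan, math.nan, math.nan)


def assemble_frame(detections: Dict[str, Optional[List[Point]]]) -> List[Point]:
    """Concatenate components in a fixed order, padding undetected ones with NaN."""
    frame: List[Point] = []
    for name, count in COMPONENTS:
        points = detections.get(name)
        if points is None:
            # A hand that leaves the camera view is typically not detected at all.
            frame.extend([MISSING] * count)
        elif len(points) != count:
            raise ValueError(f"{name}: expected {count} points, got {len(points)}")
        else:
            frame.extend(points)
    return frame


def count_missing(frame: List[Point]) -> int:
    """Number of keypoints in the frame that were padded with NaN."""
    return sum(math.isnan(x) for x, _, _ in frame)


# Example frame in which the left hand was not detected.
example = assemble_frame({
    "pose": [(0.5, 0.5, 0.0)] * 33,
    "face": [(0.5, 0.3, 0.0)] * 468,
    "left_hand": None,
    "right_hand": [(0.6, 0.6, 0.0)] * 21,
})
print(len(example), count_missing(example))  # 543 21
```

A fixed-length frame like this is far smaller than raw pixels, which is why downstream recognition models can operate on pose sequences instead of video.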


The Competitive Moat

Question: "Won't OpenAI/Google/Meta eventually solve this?"

Answer: Yes, eventually—but not soon, and not better than specialized solutions. Here's why:

1. Data Scarcity

  • General models need millions of labeled examples
  • Sign language video data is scarce and expensive to annotate
  • Even Meta/Google's research efforts struggle with limited datasets

2. Linguistic Complexity

  • Sign languages are not "gestural encodings" of spoken languages
  • Spatial grammar, non-manual features, classifiers require linguistic understanding
  • General models don't have the architecture to handle multi-channel simultaneity

3. Deployment Model

  • Even if they achieve good translation, they'll offer it as cloud APIs
  • InReach's client-side processing provides privacy, offline capability, and zero server costs
  • Our deployment model (browser extension, zero redesign) is a competitive advantage independent of translation quality

4. Specialization Beats Generalization

  • Medical imaging AI beats general vision models for diagnosis
  • Code-specific models (GitHub Copilot) beat general LLMs for programming
  • Sign language requires specialized architecture, data, and evaluation

Timeline estimate: General AI companies might achieve functional sign language translation in 3-5 years. By then, InReach will have:

  • Millions of users
  • Partnerships with major platforms
  • Proprietary training data from user interactions
  • 3-5 year technical lead in specialized architectures

Conclusion

The state-of-the-art in general AI demonstrates exactly why InReach exists:

  • General models cannot translate sign language despite billions in R&D
  • Specialized approaches work (proven in academic research)
  • The gap between general AI and functional sign language translation is our opportunity
  • Our deployment model (client-side, universal) provides a moat even as models improve

We're not building sign language technology because general AI is failing.
We're building it because sign language is a complete language that requires specialized, linguistically-informed approaches.

And we're building it now because the technology stack (MediaPipe, TensorFlow.js, Transformers, SignWriting) has finally matured enough to make real-time, client-side translation possible.


  1. Rust et al. 2024. Towards Privacy-Aware Sign Language Translation at Scale.

  2. Zhang et al. 2024. Scaling Sign Language Translation.