
Our Approach

We've built InReach as a modular sign language translation system that separates computer vision tasks from language translation tasks. This separation makes the system more robust and easier to maintain, and lets us improve each component independently.

System Architecture Overview

InReach runs all AI processing client-side, on each user's device, for privacy and offline capability. Below is our high-level system context:

InReach System Context Diagram

InReach integrates with any digital platform without requiring changes to that platform, serving deaf users, hearing users, and institutions through browser extensions, mobile apps, and optional APIs.

Spoken-to-Signed Language Translation

Below is the technical pipeline powering our spoken-to-signed translation. Each node represents a distinct module in our architecture.

Container Architecture

Spoken-to-Signed Container Diagram

The spoken-to-signed pipeline processes audio or text input through language detection, normalization, and translation to SignWriting, then generates pose sequences and renders them as sign language video—all on the user's device.

Pipeline Overview

Pipeline Status Legend:

  • Green edges: Production-ready, high-quality modules
  • Orange edges: Functional but need optimization
  • Red edges: In development, contributions welcome

InReach implements two parallel translation approaches, each with distinct trade-offs:

  1. Dictionary-Based Translation (Baseline)
  2. SignWriting-Based Machine Translation (Primary)

Dictionary-Based Translation

Our baseline approach uses dictionary lookup for rapid deployment but sacrifices linguistic accuracy and natural fluency. This method serves as a fallback and benchmark for measuring improvements in our primary approach.

Translation Pipeline

  1. Text-to-Gloss Conversion: Input text undergoes lemmatization, word reordering, and article removal to approximate sign language structure
  2. Gloss-to-Pose Mapping: Glosses are matched to pre-recorded skeletal poses from our sign language dictionary
  3. Pose-to-Video Rendering: Pose sequences are interpolated and rendered into video output
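The first two steps can be sketched in a few lines. This is a minimal illustration, not InReach's implementation: the lemma table, the article list, the "time words first" reordering rule, the fingerspelling fallback, and all function names are assumptions made for the example.

```python
import re

# Hypothetical toy resources; a real system would use a full lemmatizer
# and a much larger sign dictionary.
ARTICLES = {"a", "an", "the"}
LEMMAS = {"is": "be", "are": "be", "went": "go", "going": "go", "books": "book"}
TIME_WORDS = {"yesterday", "today", "tomorrow"}


def text_to_gloss(text: str) -> list[str]:
    """Approximate glossing: tokenize, drop articles, lemmatize, reorder."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [t for t in tokens if t not in ARTICLES]
    lemmas = [LEMMAS.get(t, t) for t in tokens]
    # Assumed reordering rule: move time expressions to the front.
    time = [t for t in lemmas if t in TIME_WORDS]
    rest = [t for t in lemmas if t not in TIME_WORDS]
    return [t.upper() for t in time + rest]


def gloss_to_pose_ids(glosses: list[str], dictionary: dict[str, str]) -> list[str]:
    """Look up each gloss; fall back to letter-by-letter entries when missing."""
    pose_ids = []
    for gloss in glosses:
        if gloss in dictionary:
            pose_ids.append(dictionary[gloss])
        else:
            # Assumed fallback: spell the word using the letter entries.
            pose_ids.extend(f"letter:{char}" for char in gloss)
    return pose_ids
```

With these toy tables, `text_to_gloss("The boy went to the store yesterday.")` returns `['YESTERDAY', 'BOY', 'GO', 'TO', 'STORE']`. The leftover `TO` shows how crude rule-based glossing is compared with real sign language grammar.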

Data Requirements

This approach requires a comprehensive dictionary of isolated signs (letters, words, phrases) in video format. Current coverage is ~10,000 signs across major sign languages.

Known Limitations

  • Incomplete representation: Glosses cannot capture the full grammatical complexity of sign languages
  • Missing linguistic features: Facial expressions, spatial grammar, and non-manual markers are poorly represented
  • Unnatural output: Interpolation between dictionary signs lacks the fluency of native signing
  • Limited expressiveness: Cannot handle classifiers, role-shifting, or contextual modifications
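The "unnatural output" limitation is easier to see with the interpolation step written out. The sketch below joins dictionary signs by linearly interpolating between the last frame of one sign and the first frame of the next. The 2D frame layout, the function names, and the choice of plain linear interpolation are assumptions for illustration; the production renderer may work differently.

```python
Frame = list[tuple[float, float]]  # one (x, y) position per keypoint


def lerp_frames(a: Frame, b: Frame, steps: int) -> list[Frame]:
    """Return `steps` in-between frames, excluding both endpoints."""
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        frames.append(
            [(ax + t * (bx - ax), ay + t * (by - ay))
             for (ax, ay), (bx, by) in zip(a, b)]
        )
    return frames


def concatenate_signs(signs: list[list[Frame]], steps: int = 3) -> list[Frame]:
    """Join per-sign pose sequences with interpolated transition frames."""
    out: list[Frame] = []
    for sign in signs:
        if out:
            out.extend(lerp_frames(out[-1], sign[0], steps))
        out.extend(sign)
    return out
```

Every transition here is a straight line at constant speed, regardless of which signs it connects. Fluent signers adapt handshape and movement to the neighboring signs, which simple interpolation cannot reproduce.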

Use case: Fallback for unsupported language pairs, baseline for quality comparison, rapid prototyping.

SignWriting-Based Machine Translation

Our primary approach treats sign language translation as a true machine translation problem, aiming for quality and fluency comparable to text translation systems like Google Translate. This enables bidirectional translation and respects the linguistic complexity of sign languages.

Translation Pipeline

  1. Text-to-SignWriting Translation: Neural machine translation models convert spoken language into SignWriting notation (Formal SignWriting format)
  2. SignWriting-to-Pose Generation: SignWriting sequences are animated into fluent 3D pose sequences using trained models
  3. Pose-to-Video Rendering: Pose sequences are rendered into photo-realistic or avatar-based video output
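To make the intermediate representation concrete, here is a small parser for a single sign in Formal SignWriting (FSW). The regular expressions are a simplified reading of the FSW format (an optional sorting prefix, a box marker with a maximum coordinate, then positioned symbol keys), and the example string in the usage note is illustrative only. The class and function names are assumptions, not part of InReach.

```python
import re
from dataclasses import dataclass

# Symbol key: "S" + 3-character base + fill (0-5) + rotation (0-f).
SYMBOL = r"S[123][0-9a-f]{2}[0-5][0-9a-f]"
COORD = r"[0-9]{3}x[0-9]{3}"
SIGN_RE = re.compile(
    rf"(?:A(?:{SYMBOL})+)?([BLMR])({COORD})((?:{SYMBOL}{COORD})*)"
)
PLACED_RE = re.compile(rf"({SYMBOL})([0-9]{{3}})x([0-9]{{3}})")


@dataclass
class PlacedSymbol:
    key: str  # symbol key, e.g. "S14c20"
    x: int
    y: int


def parse_fsw(sign: str) -> tuple[str, tuple[int, int], list[PlacedSymbol]]:
    """Split one FSW sign into box marker, max coordinate, and symbols."""
    match = SIGN_RE.fullmatch(sign)
    if match is None:
        raise ValueError(f"not a valid FSW sign: {sign!r}")
    box, max_coord, body = match.groups()
    max_x, max_y = (int(v) for v in max_coord.split("x"))
    symbols = [
        PlacedSymbol(key, int(x), int(y))
        for key, x, y in PLACED_RE.findall(body)
    ]
    return box, (max_x, max_y), symbols
```

For example, `parse_fsw("M518x529S14c20481x471S27106503x489")` yields the box marker `M`, the maximum coordinate `(518, 529)`, and two positioned symbols. A translation model can treat these symbol keys and positions as its output vocabulary, and the pose generator can consume them as structured input.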

Data Strategy

We synthesize large-scale training data by combining:

  • ~100k isolated signs: Manually transcribed with SignWriting annotations
  • Segmented continuous signing: Natural signing with phrase boundaries
  • Large video corpora: Automatically transcribed using the above as seed data

This bootstrapping approach generates millions of training examples for both translation stages without requiring full manual annotation.
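One common way to implement this kind of bootstrapping is confidence-filtered self-training. The sketch below shows a single round under that assumption. The `train` callable, the `(label, confidence)` model interface, and the 0.9 threshold are hypothetical, and the source does not state that InReach uses exactly this scheme.

```python
def bootstrap_round(labeled, unlabeled, train, min_confidence=0.9):
    """One self-training round.

    labeled:   list of (item, label) pairs, e.g. manually transcribed signs
    unlabeled: list of items, e.g. clips from a large video corpus
    train:     hypothetical callable that returns a model; the model maps
               an item to a (label, confidence) pair
    """
    model = train(labeled)
    accepted, remaining = [], []
    for item in unlabeled:
        label, confidence = model(item)
        if confidence >= min_confidence:
            accepted.append((item, label))  # keep as a pseudo-label
        else:
            remaining.append(item)  # retry in a later round
    return labeled + accepted, remaining
```

Repeating the round grows the labeled set while the threshold limits how much transcription noise enters the training data.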

Quality Targets

  • Linguistic accuracy: Preserves sign language grammar, spatial relationships, and non-manual features
  • Natural fluency: Generates signing that native signers recognize as natural
  • Expressiveness: Handles lexical signs, classifiers, facial expressions, and role-shifting
  • Bidirectionality: Enables both spoken-to-signed and signed-to-spoken translation

Target performance: Comparable to human interpreters on benchmark datasets (BLEU >40, native signer preference >70%).
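For reference, BLEU combines clipped n-gram precisions (usually n = 1 to 4) with a brevity penalty. The sketch below is a simplified single-reference corpus BLEU without smoothing, plus a check against the two thresholds above. What counts as a token (for example, SignWriting symbols or spoken-language words) and the `meets_targets` helper are assumptions for illustration. Reported scores should come from a standard implementation such as sacreBLEU.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            matches[n - 1] += sum((h & r).values())  # clipped matches
            totals[n - 1] += sum(h.values())
    if min(matches) == 0:
        return 0.0  # no smoothing: any zero precision gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


def meets_targets(bleu, preferred, total):
    """Check the stated targets: BLEU > 40 and preference rate > 70%."""
    return bleu > 40 and preferred / total > 0.70
```

A hypothesis identical to its reference scores 100. Under the stated targets, a system with BLEU 42 preferred in 36 of 50 native-signer judgments (72%) would pass.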

Why SignWriting Works Better

SignWriting serves as an intermediate representation that bridges the modality gap between spoken language (linear text) and sign language (spatial, multi-channel). Unlike glosses:

  • Captures non-manual features: Facial expressions, head movements, body posture
  • Preserves spatial grammar: Location, orientation, and movement in signing space
  • Supports linguistic analysis: Enables proper machine translation techniques
  • Universal notation: Works across 200+ sign languages with minor adaptations

Example 1: Robustness to Variations

Minor spelling variations in the input, including misspellings, produce the same correct SignWriting output, while dictionary translation fails. This indicates that our machine translation model captures meaning rather than matching strings.

Example 2: Emotional Context

Changing the punctuation (exclamation mark vs. question mark) produces different facial expressions in the SignWriting output, while the dictionary translation is identical for both inputs. Non-manual features such as facial expressions carry grammatical meaning in sign languages, and our system learns this from data.

Signed-to-Spoken Language Translation

Our signed-to-spoken pipeline enables deaf individuals to communicate with hearing individuals by translating sign language videos into spoken language text or audio.

Container Architecture

Signed-to-Spoken Container Diagram

The signed-to-spoken pipeline captures video, extracts pose keypoints using MediaPipe Holistic, segments signs using linguistic cues, recognizes them as SignWriting, and translates to spoken language—entirely on the user's device for privacy.

Pipeline Overview

Pipeline Status:

  • Pose Estimation: Production-ready using MediaPipe Holistic (543 keypoints)
  • Segmentation: Functional, based on linguistic cues
  • SignWriting Transcription: In development, improving accuracy
  • Translation: Leverages same models as spoken-to-signed (bidirectional)
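The 543 keypoints are the sum of MediaPipe Holistic's components: 33 body pose landmarks, 468 face landmarks, and 21 landmarks per hand. The sketch below splits one flat frame into those components. The concatenation order and the function names are assumptions for illustration; the pipeline may store frames differently.

```python
# Landmark counts per MediaPipe Holistic component.
COMPONENTS = {"pose": 33, "face": 468, "left_hand": 21, "right_hand": 21}


def component_slices(components=COMPONENTS):
    """Map each component to its slice in a flat frame; also return the total."""
    slices, start = {}, 0
    for name, count in components.items():
        slices[name] = slice(start, start + count)
        start += count
    return slices, start


def split_frame(frame):
    """Split one flat frame of keypoints into named components."""
    slices, total = component_slices()
    if len(frame) != total:
        raise ValueError(f"expected {total} keypoints, got {len(frame)}")
    return {name: frame[s] for name, s in slices.items()}
```

Keeping hands, face, and body separate is useful downstream: handshape is read from the hand landmarks, while non-manual markers come from the face and upper body.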

Key Challenge: Sign language recognition is inherently harder than production due to:

  • Signer variation (regional accents, personal styles)
  • Video quality and lighting conditions
  • Occlusion and motion blur
  • Co-articulation between signs

Current Focus: Improving segmentation accuracy and expanding training data with diverse signers.
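To illustrate why segmentation is difficult, here is a deliberately simple baseline that splits a pose sequence wherever motion drops below a threshold. This is a stand-in heuristic, not the linguistic-cue-based segmenter described above. The frame layout, the threshold, and the function names are assumptions.

```python
def motion_energy(frames):
    """Mean absolute keypoint displacement between consecutive frames."""
    energies = [0.0]  # the first frame has no predecessor
    for prev, cur in zip(frames, frames[1:]):
        total = sum(
            abs(cx - px) + abs(cy - py)
            for (px, py), (cx, cy) in zip(prev, cur)
        )
        energies.append(total / len(cur))
    return energies


def segment(frames, threshold=0.05, min_len=2):
    """Return half-open (start, end) frame ranges where motion exceeds threshold."""
    segments, start = [], None
    for i, energy in enumerate(motion_energy(frames)):
        if energy > threshold and start is None:
            start = i
        elif energy <= threshold and start is not None:
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_len:
        segments.append((start, len(frames)))
    return segments
```

A baseline like this only finds boundaries where the signer pauses. With co-articulation, one sign flows into the next without a pause, so several signs merge into one segment. That is one reason boundary detection benefits from linguistic cues rather than motion alone.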


Deployment Architecture

InReach's "zero redesign needed" promise is enabled by our client-side deployment model:

InReach Deployment Diagram

All processing happens on user devices (browser, mobile, desktop) with models cached locally. An optional CDN serves only the initial model downloads. Target platforms require no changes: extensions inject the accessibility layer on the client side.

Integration Methods

  1. Browser Extension: Injects sign language layer into any website
  2. Mobile Apps: Standalone translation with camera integration
  3. Desktop Apps: Screen capture and overlay functionality
  4. API/SDK: Optional integration for platforms wanting deeper embedding

Client-Side Processing

All translation happens on the user's device:

  • Privacy: No video or audio uploaded to servers
  • Offline capability: Works without an internet connection once models are cached locally
  • Zero platform changes: Host platforms remain unchanged
  • Scalability: No inference servers to run, so capacity grows with the number of user devices
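The cache-first loading pattern behind these properties can be sketched as follows. In the browser the cache would be IndexedDB and the download would come from the optional CDN. Here a local directory and an injected `fetch` callable stand in for both, and all names are assumptions for illustration.

```python
from pathlib import Path


def load_model(name: str, cache_dir: Path, fetch) -> tuple[bytes, bool]:
    """Return (model_bytes, from_cache); `fetch` runs only on a cache miss."""
    path = cache_dir / name
    if path.exists():
        return path.read_bytes(), True  # offline path: no network needed
    data = fetch(name)  # initial download only
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data, False
```

After the first successful load, later calls never touch the network, which is what makes offline use possible and keeps server involvement limited to the initial download.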

Technology Stack

  • Frontend: Progressive Web Apps, React Native, Electron
  • AI/ML: TensorFlow.js, MediaPipe Holistic, Transformer models
  • Rendering: Three.js (3D Avatar), WebGL (Skeleton), GANs (Photo-realistic)
  • Storage: IndexedDB (models, ~50-200 MB), localStorage (preferences)