More Than a Thousand Words

Teaching AI to See for the Visually Impaired

By Taraneh Ghandi

Ferdowsi University of Mashhad

The Challenge

A picture is worth a thousand words, but only if you can see it.

253 million people worldwide live with visual impairments.

For them, the digital world, built on images, can feel like a book with most of its pages torn out. Standard AI offers literal, unhelpful captions, missing the story and emotion that give an image meaning.

Standard AI sees: "A group of people on a street."

What's really there: a triumphant marathon finish line.

Our Innovation: A New Approach

We don't just need AI that labels images; we need AI that understands them.

Building a Blueprint of the Scene

Before our system writes a single word, it creates a "scene graph"—a structured blueprint of everything in the image. It doesn't just see a "bird"; it identifies the bird, its attributes ("small," "blue"), and its relationship to other objects ("flying over the water").

By converting visual chaos into an organized map, we give the AI a deeper understanding. We fuse this structured representation with powerful models such as CLIP and GPT-2, so that descriptions are constructed from a foundation of understood facts.
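To make the idea concrete, here is a minimal sketch of how a scene graph could be represented in code: objects carry attributes, and directed predicate edges link them. The class and field names are illustrative assumptions, not the project's actual data structures.

    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class SceneObject:
        """A detected object together with its descriptive attributes."""
        name: str
        attributes: List[str] = field(default_factory=list)


    @dataclass
    class Relation:
        """A directed relationship between two objects in the scene."""
        subject: SceneObject
        predicate: str
        target: SceneObject


    # Example from the text: a small blue bird flying over the water.
    bird = SceneObject("bird", ["small", "blue"])
    water = SceneObject("water")
    scene_graph = [Relation(bird, "flying over", water)]

    for rel in scene_graph:
        subject = " ".join(rel.subject.attributes + [rel.subject.name])
        print(f"{subject} --[{rel.predicate}]--> {rel.target.name}")
    # prints: small blue bird --[flying over]--> water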

Figure: The ClipCapGAT architecture, combining scene graphs with CLIP and GPT-2.
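As a rough illustration of that fusion, the PyTorch sketch below uses a graph attention layer (here PyTorch Geometric's GATConv) to summarize the scene-graph nodes, concatenates the pooled result with the CLIP image embedding, and maps the fused vector to a sequence of prefix embeddings for GPT-2, in the spirit of ClipCap. The dimensions, layer choices, and module names are assumptions for illustration, not the paper's exact configuration.

    import torch
    import torch.nn as nn
    from torch_geometric.nn import GATConv  # graph attention layer


    class ClipCapGATSketch(nn.Module):
        """Illustrative fusion of scene-graph features with a CLIP embedding
        into a GPT-2 prefix. Dimensions are assumed, not taken from the paper."""

        def __init__(self, node_dim=256, clip_dim=512, gpt2_dim=768, prefix_len=10):
            super().__init__()
            self.gat = GATConv(node_dim, node_dim, heads=4, concat=False)
            self.mapper = nn.Sequential(            # ClipCap-style mapping network
                nn.Linear(node_dim + clip_dim, gpt2_dim * prefix_len),
                nn.Tanh(),
            )
            self.prefix_len, self.gpt2_dim = prefix_len, gpt2_dim

        def forward(self, node_feats, edge_index, clip_embed):
            # Attend over scene-graph neighbours, then mean-pool node features.
            graph_feat = self.gat(node_feats, edge_index).mean(dim=0)
            fused = torch.cat([graph_feat, clip_embed], dim=-1)
            # Project the fused vector to prefix_len pseudo-token embeddings
            # that are prepended to the GPT-2 input.
            return self.mapper(fused).view(self.prefix_len, self.gpt2_dim)


    # Smoke test with random tensors: 3 nodes, 2 directed edges.
    model = ClipCapGATSketch()
    nodes = torch.randn(3, 256)
    edges = torch.tensor([[0, 2], [1, 0]])     # shape [2, num_edges]
    prefix = model(nodes, edges, torch.randn(512))
    print(prefix.shape)                        # torch.Size([10, 768])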

Captions on Demand

This deeper understanding unlocks our most important feature: hierarchical, controllable captions.

Putting the User in Control

Instead of one static description, users can explore an image at their own pace, starting with a brief summary and requesting more detail only when they want it. This simple shift gives them the agency to explore a visual world on their own terms. A kitchen scene, for example, can be described at three increasing levels of detail:

"A woman is standing in a kitchen."

"A smiling woman with a red apron is standing in a modern kitchen, holding a wooden spoon."

"A smiling woman with a red apron is standing in a modern kitchen, holding a wooden spoon over a steaming pot on the stove. Sunlight is streaming through a window on the left."

Measuring Success

We evaluated our model using standard metrics that compare AI-generated captions to human-written ones.

ClipCapGAT vs. Baseline

Metric     Baseline (ClipCap)    Our Model (ClipCapGAT)    Improvement
BLEU-1     35.6                  43.7                      +22.7%
METEOR     7.3                   9.5                       +30.1%
ROUGE-L    27.8                  32.0                      +15.1%
CIDEr      4.1                   6.8                       +65.8%

Results show significant improvements across all major captioning evaluation metrics, especially CIDEr, which measures consensus with human-written reference captions.
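For readers who want to run this kind of comparison themselves, the snippet below computes BLEU-1 for a single caption pair with NLTK. It only illustrates the metric family; the numbers in the table come from a full COCO-style evaluation over the whole test set, and the sentences here are made up.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    # One human-written reference and one generated caption (made-up examples).
    reference = "a smiling woman in a red apron stands in a modern kitchen".split()
    candidate = "a woman in a red apron is standing in a kitchen".split()

    # BLEU-1 places all weight on unigram precision; smoothing avoids zero
    # scores when higher-order n-grams are missing.
    bleu1 = sentence_bleu(
        [reference],
        candidate,
        weights=(1.0, 0.0, 0.0, 0.0),
        smoothing_function=SmoothingFunction().method1,
    )
    print(f"BLEU-1: {bleu1:.3f}")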

The Vision for a More Accessible World

This project is a step toward a future where AI serves as a true visual interpreter.

Future Work

  • Use sub-graphs to model different priority levels for even more nuanced descriptions.
  • Explore other forms of feature aggregation to capture more complex relationships.
  • Expand the model to understand and describe text within images.

The Goal

The work continues, but the vision is clear: to harness the power of AI not just to see the world, but to share its stories with everyone, offering not just access, but understanding.