The Challenge
A picture is worth a thousand words, but only if you can see it.
253 million people worldwide live with visual impairments.
For them, the digital world, built on images, can feel like a book with most of its pages torn out. Standard AI offers literal, unhelpful captions, missing the story and emotion that give an image meaning.
Standard AI sees: "A group of people on a street."
What's really there: a triumphant marathon finish line.
Our Innovation: A New Approach
We don't just need AI that labels images; we need AI that understands them.
Building a Blueprint of the Scene
Before our system writes a single word, it creates a "scene graph"—a structured blueprint of everything in the image. It doesn't just see a "bird"; it identifies the bird, its attributes ("small," "blue"), and its relationship to other objects ("flying over the water").
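To make the idea concrete, here is a minimal sketch (in Python) of how such a scene graph might be represented as data: objects become nodes, attributes live on the nodes, and relationships become labeled edges. The class names and fields are illustrative assumptions, not the project's exact schema.

```python
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    name: str                                             # e.g. "bird"
    attributes: list[str] = field(default_factory=list)   # e.g. ["small", "blue"]

@dataclass
class SceneEdge:
    subject_idx: int      # index of the source node
    predicate: str        # e.g. "flying over"
    object_idx: int       # index of the target node

@dataclass
class SceneGraph:
    nodes: list[SceneNode]
    edges: list[SceneEdge]

# The running example from the text: a small blue bird flying over the water.
graph = SceneGraph(
    nodes=[SceneNode("bird", ["small", "blue"]), SceneNode("water")],
    edges=[SceneEdge(subject_idx=0, predicate="flying over", object_idx=1)],
)
```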
By converting visual chaos into an organized map, we give the AI a deeper understanding. We fuse this structured representation with powerful pretrained models, CLIP and GPT-2, so that descriptions are constructed from a foundation of understood facts.
The ClipCapGAT architecture, combining scene graphs with CLIP and GPT-2.
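As a rough illustration of how these pieces could fit together, the sketch below encodes scene-graph nodes with a graph attention layer, pools them, concatenates the result with a CLIP image embedding, and projects that into prefix embeddings for GPT-2. The module choices, dimensions, and mean pooling are assumptions for illustration, not the exact ClipCapGAT implementation.

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv            # graph attention over scene-graph nodes
from transformers import GPT2LMHeadModel

class SceneGraphPrefixModel(nn.Module):
    """Sketch: scene-graph GAT features + CLIP image features -> GPT-2 prefix."""

    def __init__(self, node_dim=300, clip_dim=512, prefix_len=10):
        super().__init__()
        self.gat = GATConv(node_dim, 128, heads=4, concat=True)    # 4 heads -> 512-dim node features
        self.gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt2_dim = self.gpt2.config.n_embd                          # 768 for base GPT-2
        self.prefix_len = prefix_len
        # Fuse pooled graph features (512) with the CLIP embedding (512), then
        # project to prefix_len pseudo-token embeddings that prefix the caption.
        self.project = nn.Linear(512 + clip_dim, prefix_len * gpt2_dim)

    def forward(self, node_feats, edge_index, clip_embed, caption_embeds):
        # node_feats: [num_nodes, node_dim], edge_index: [2, num_edges]
        # clip_embed: [1, clip_dim], caption_embeds: [1, seq_len, gpt2_dim]
        graph_feats = self.gat(node_feats, edge_index)              # [num_nodes, 512]
        pooled = graph_feats.mean(dim=0, keepdim=True)              # [1, 512], mean pooling
        fused = torch.cat([pooled, clip_embed], dim=-1)             # [1, 1024]
        prefix = self.project(fused).view(1, self.prefix_len, -1)   # [1, prefix_len, 768]
        inputs = torch.cat([prefix, caption_embeds], dim=1)         # prepend prefix to caption
        return self.gpt2(inputs_embeds=inputs)                      # language-model outputs
```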
Captions on Demand
This deeper understanding unlocks our most important feature: hierarchical, controllable captions.
Putting the User in Control
Instead of a single static description, the user can explore the image at their own pace, starting with a brief summary and requesting more detail only when they want it. This simple shift gives them the agency to engage with a visual world on their own terms.
"A woman is standing in a kitchen."
"A smiling woman with a red apron is standing in a modern kitchen, holding a wooden spoon."
"A smiling woman with a red apron is standing in a modern kitchen, holding a wooden spoon over a steaming pot on the stove. Sunlight is streaming through a window on the left."
Measuring Success
We evaluated our model using standard captioning metrics (BLEU-1, METEOR, ROUGE-L, and CIDEr), which compare AI-generated captions against human-written reference captions.
ClipCapGAT vs. Baseline
| Metric | Baseline (ClipCap) | Our Model (ClipCapGAT) | Improvement |
|---|---|---|---|
| BLEU-1 | 35.6 | 43.7 | +22.7% |
| METEOR | 7.3 | 9.5 | +30.1% |
| ROUGE-L | 27.8 | 32.0 | +15.1% |
| CIDEr | 4.1 | 6.8 | +65.8% |
Results show significant improvements across all major captioning evaluation metrics, especially CIDEr, which is designed to measure how closely a generated caption matches the consensus of human-written descriptions.
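For reference, scores like these are typically computed with the COCO caption evaluation toolkit. The snippet below is a minimal sketch assuming the pycocoevalcap package and pre-tokenized, lowercased captions; the example captions are placeholders, not our evaluation data.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires a Java runtime
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider

# Reference (human) and candidate (model) captions keyed by image id.
# A real evaluation uses the full test split and the toolkit's PTB tokenizer.
gts = {"img_001": ["a smiling woman in a red apron stands in a kitchen"]}
res = {"img_001": ["a woman is standing in a kitchen"]}

scorers = [
    (Bleu(4), ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4"]),
    (Meteor(), "METEOR"),
    (Rouge(), "ROUGE-L"),
    (Cider(), "CIDEr"),
]

for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    if isinstance(name, list):                   # Bleu returns one score per n-gram order
        for n, s in zip(name, score):
            print(f"{n}: {s:.3f}")
    else:
        print(f"{name}: {score:.3f}")
```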
The Vision for a More Accessible World
This project is a step toward a future where AI serves as a true visual interpreter.
Future Work
- Use sub-graphs to model different priority levels for even more nuanced descriptions.
- Explore other forms of feature aggregation to capture more complex relationships.
- Expand the model to understand and describe text within images.
The Goal
The work continues, but the vision is clear: to harness the power of AI not just to see the world, but to share its stories with everyone, offering not just access, but understanding.