AI Image Caption Bot

We used the Flickr30K dataset from Kaggle (around 30,000 images). These images were fed as input to our ML/DL model, which generated a caption for each image.
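The image-in, caption-out pipeline can be sketched as a decoding loop. The project page does not specify the architecture, so `next_word_probs` below is a hypothetical stand-in for a trained CNN encoder + LSTM decoder; only the greedy decoding logic is meant to be illustrative.

```python
# Sketch of greedy caption decoding. In a real system, next_word_probs would
# be a trained decoder conditioned on CNN image features; here it is a toy
# deterministic stand-in so the loop itself can run end to end.

VOCAB = ["<start>", "<end>", "a", "dog", "runs", "on", "grass"]

def next_word_probs(image_features, partial_caption):
    # Hypothetical stand-in for a trained model: fixed word transitions.
    order = {"<start>": "a", "a": "dog", "dog": "runs",
             "runs": "on", "on": "grass", "grass": "<end>"}
    probs = {w: 0.0 for w in VOCAB}
    probs[order[partial_caption[-1]]] = 1.0
    return probs

def generate_caption(image_features, max_len=10):
    caption = ["<start>"]
    for _ in range(max_len):
        probs = next_word_probs(image_features, caption)
        word = max(probs, key=probs.get)  # greedy: take the most likely word
        if word == "<end>":
            break
        caption.append(word)
    return " ".join(caption[1:])

print(generate_caption(None))  # → "a dog runs on grass"
```

A trained system would replace the toy transition table with learned probabilities, and often uses beam search instead of pure greedy decoding.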

Created on 13th December 2020

The problem AI Image Caption Bot solves

The problems the AI captioning bot solves are as follows:

  1. Aid to the blind — We can build a product for the blind that guides them while travelling on roads without anyone else's support, by first converting the scene into text and then the text into voice.
  2. Self-driving cars — Automatic driving is one of the biggest challenges, and if we can properly caption the scene around the car, it can give a boost to the self-driving system.
  3. CCTV cameras are everywhere today, but if, along with viewing the world, we can also generate relevant captions, then we can raise alarms as soon as malicious activity occurs somewhere. This could help reduce some crime and/or accidents.
  4. Automatic captioning can help make Google Image Search as good as Google Search: every image could first be converted into a caption, and the search could then be performed on that caption.

Challenges we ran into

The first challenge stems from the compositional nature of natural language and visual scenes. While the training dataset contains co-occurrences of some objects in their context, a captioning system should be able to generalize by composing objects in other contexts.
The second challenge is the dataset bias impacting current captioning systems. Trained models overfit to the common objects that co-occur in a common context, which leads such systems to struggle to generalize to scenes where the same objects appear in unseen contexts. Although reducing dataset bias is itself an open research problem, we propose a diagnostic tool to quantify how biased a given captioning system is.
The third challenge is evaluating the quality of generated captions. Automated metrics, though partially helpful, are still unsatisfactory since they do not take the image into account. In many cases their scoring remains inadequate and sometimes even misleading — especially when scoring diverse and descriptive captions.
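The limitation above can be made concrete with a toy BLEU-style metric. The snippet computes clipped unigram precision against a reference caption only; the example captions and scores are illustrative, not from our system. Because the image is never consulted, a caption that misidentifies the subject can still score highly.

```python
# Toy illustration of why n-gram metrics can mislead: a BLEU-style clipped
# unigram precision compares word overlap with a reference caption, never
# looking at the image itself.
from collections import Counter

def unigram_precision(candidate, reference):
    ref = Counter(reference.split())
    cand = Counter(candidate.split())
    # Clipped matches: each candidate word counts at most as often as it
    # appears in the reference.
    matches = sum(min(c, ref[w]) for w, c in cand.items())
    return matches / sum(cand.values())

reference = "a dog runs on the grass"
good = "a dog runs on grass"      # faithful to the image
bad = "a cat runs on the grass"   # wrong subject, but similar wording

print(unigram_precision(good, reference))  # 1.0
print(unigram_precision(bad, reference))   # ~0.83 — still high despite the error
```

Full BLEU adds higher-order n-grams and a brevity penalty, but the core issue remains: text-only overlap cannot detect a caption that contradicts the image.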
Progress on automatic image captioning and scene understanding will make computer vision systems more reliable for use as personal assistants for visually impaired people and in improving their day-to-day life. The semantic gap in bridging language and vision points to the need for incorporating common sense and reasoning into scene understanding.
