Created on 13th December 2020
The challenges an AI captioning system must address are as follows:
The first challenge stems from the compositional nature of natural language and visual scenes. While the training dataset contains only certain objects co-occurring in certain contexts, a captioning system should be able to generalize by composing familiar objects into novel contexts.
The second challenge is the dataset bias affecting current captioning systems. Trained models overfit to common objects that co-occur in common contexts, so such systems struggle to generalize to scenes where the same objects appear in unseen contexts. Although reducing dataset bias is itself an open research problem, we propose a diagnostic tool to quantify how biased a given captioning system is.
The third challenge is evaluating the quality of generated captions. Automated metrics, though partially helpful, remain unsatisfactory because they do not take the image into account; in many cases their scoring is inadequate and sometimes even misleading, especially when scoring diverse and descriptive captions.
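To make this limitation concrete, the sketch below computes a simplified reference-based n-gram precision (a BLEU-style score, unsmoothed and without a brevity penalty); the captions and reference sentence are invented for illustration, not taken from any real benchmark. Because the metric sees only word overlap with reference captions and never the image, a correct but differently worded caption of the same scene is penalized.

```python
# Minimal sketch of a reference-based caption metric: clipped unigram
# precision, a simplified BLEU-style score. All captions are illustrative.
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(candidate, references, n=1):
    """Clipped n-gram precision of a candidate caption against references."""
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0
    # Each n-gram is credited at most the maximum count seen in any reference.
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref.lower().split(), n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

refs = ["a dog is playing with a ball in the park"]
# Both candidates could describe the same image, but the metric rewards
# only lexical overlap with the reference, never visual correctness.
print(ngram_precision("a dog plays with a ball", refs))           # high overlap
print(ngram_precision("a puppy chases a toy on the grass", refs)) # low overlap
```

The second caption may be an equally faithful description of the image, yet it scores far lower, which is exactly the failure mode described above.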
Progress on automatic image captioning and scene understanding will make computer vision systems more reliable as personal assistants for visually impaired people, improving their day-to-day lives. The semantic gap between language and vision points to the need to incorporate common sense and reasoning into scene understanding.