
VISION

Vision is a real-time, voice-automated web app for blind people. It describes what is happening around the user, reads short pieces of text, and identifies known and unknown faces, announcing the names of the known ones.

The problem VISION solves

It is easy for sighted people to see what is happening around them, make their way to places of interest such as a restaurant or a park, or recognise friends and family members. But what about those who cannot see? That is where VISION, a web app for the blind, comes in. The app has three features -
Identifying faces - Spotting our mom, dad or friends is effortless for us, but for a blind user it can be very challenging. If everyone around the user turns out to be unknown, the app can at least put them on alert. The app is customised for every user: it distinguishes known faces from unknown ones and announces the names of the known faces (a rough sketch of this follows after the list).
Describing what is going on around the user - This can warn a blind user that something harmful is happening nearby so that they can move away. And it is not only about danger: for almost everything happening around them, blind people otherwise have to rely on others. This feature can genuinely help them become more self-dependent.
Reading text - The app reads out short pieces of text such as an "under construction" board, a warning notice or a shop name. Without being able to read such boards, a blind person can easily run into trouble.
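
The write-up does not name the face-matching library, so the following is only a minimal sketch assuming the face_recognition package; the photo file names and the per-user set of known people are illustrative.

    # Minimal per-user face matching sketch, assuming the face_recognition
    # package; file names and the set of known people are illustrative only.
    import face_recognition

    # Encodings for the people this particular user has registered,
    # built once from reference photos (the app is customised per user).
    known = {
        "Mom": face_recognition.face_encodings(
            face_recognition.load_image_file("mom.jpg"))[0],
        "Dad": face_recognition.face_encodings(
            face_recognition.load_image_file("dad.jpg"))[0],
    }

    def identify_faces(image_path):
        """Return a spoken-style description of who is in the captured frame."""
        frame = face_recognition.load_image_file(image_path)
        names = []
        for encoding in face_recognition.face_encodings(frame):
            matches = face_recognition.compare_faces(list(known.values()), encoding)
            if True in matches:
                names.append(list(known.keys())[matches.index(True)])
            else:
                names.append("an unknown person")
        return ("I can see " + ", ".join(names)) if names else "I cannot see anyone."

    print(identify_faces("capture.jpg"))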
The app is completely voice automated, so the user has no trouble operating it: the user speaks a command, and the app automatically captures a picture and reads out whatever they wanted to know.

Challenges we ran into

Our initial plan was to do image captioning with the MS COCO dataset, but that dataset is around 14 GB. We started training on it but soon realised we would not be able to finish within the duration of the hackathon, so we switched to the Flickr8k dataset, which is around 1.4 GB, and we were able to train on it in ten hours.
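
The write-up does not describe the captioning model itself; a common design (assumed here) is a pretrained CNN encoder feeding a recurrent decoder trained on the Flickr8k captions. The sketch below only shows that assumed first step, encoding each image with a pretrained InceptionV3; the directory layout is also an assumption.

    # Sketch of the assumed first step of an encoder-decoder captioning
    # pipeline: encode every Flickr8k image with a pretrained CNN. The choice
    # of InceptionV3 and the directory name are assumptions.
    import os
    import numpy as np
    from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
    from tensorflow.keras.preprocessing import image
    from tensorflow.keras.models import Model

    base = InceptionV3(weights="imagenet")
    encoder = Model(base.input, base.layers[-2].output)   # 2048-d feature vector

    def encode(path):
        img = image.load_img(path, target_size=(299, 299))
        x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
        return encoder.predict(x, verbose=0)[0]

    features = {name: encode(os.path.join("Flickr8k_Dataset", name))
                for name in os.listdir("Flickr8k_Dataset")}
    # `features` then feeds a recurrent decoder trained on the Flickr8k captions.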
Text detection using EAST and pytesseract was producing unexpected special characters and spelling mistakes, so we first stripped out the special characters and then used TextBlob, which is essentially a dictionary-based method, to return the closest meaningful word.
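
A rough sketch of that clean-up step (the EAST region-detection stage is omitted for brevity): run pytesseract on the captured image, drop non-alphanumeric characters, and let TextBlob's spelling correction snap each word to the closest dictionary word. The regex and the example file name are assumptions.

    # OCR clean-up sketch: pytesseract output -> strip stray symbols ->
    # TextBlob spelling correction. Regex and file name are illustrative.
    import re
    from PIL import Image
    import pytesseract
    from textblob import TextBlob

    def read_sign(image_path):
        raw = pytesseract.image_to_string(Image.open(image_path))
        cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", raw)     # remove special characters
        corrected = str(TextBlob(cleaned).correct())      # closest meaningful words
        return " ".join(corrected.split())                # collapse extra whitespace

    print(read_sign("board.jpg"))   # e.g. "under construction"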
We deployed the three individual parts as separate Flask apps, but integrating them into a single service was difficult (a rough sketch of the combined backend follows below).
Taking voice input on the webpage, sending it as a request to the main Flask backend, and then sending the output from Flask back to the page and reading it out as speech was also tricky. We eventually implemented this with the HTML5 speech recognition and speech synthesis APIs.
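
A minimal sketch of how the integrated Flask backend might look once the page has transcribed the spoken command with the HTML5 speech recognition API: the page POSTs the text, the backend captures a frame, dispatches to one of the three parts, and returns text for the page to read aloud via speech synthesis. The route name, the keyword matching and the helper module are assumptions, not our exact code.

    # Sketch of a single integrated Flask endpoint. The helpers are assumed to
    # be the three parts wrapped in a hypothetical module; the route name and
    # keyword matching are illustrative.
    from flask import Flask, request, jsonify
    # Hypothetical module wrapping the three parts and the camera capture.
    from vision_models import capture_frame, identify_faces, describe_scene, read_sign

    app = Flask(__name__)

    @app.route("/command", methods=["POST"])
    def command():
        said = request.get_json()["command"].lower()   # text from speech recognition
        image_path = capture_frame()                   # grab one frame, then close camera
        if "who" in said or "face" in said:
            answer = identify_faces(image_path)
        elif "read" in said:
            answer = read_sign(image_path)
        else:
            answer = describe_scene(image_path)        # image-captioning model
        return jsonify({"speech": answer})             # page reads this aloud

    if __name__ == "__main__":
        app.run()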
As we were running this in real time by continuously capturing video, we realised that the battery drained quite fast. So instead, as soon as the user speaks a command we open the camera, capture a single image, and close the camera again immediately.
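
A minimal sketch of that open-capture-close pattern. The write-up does not say which library handles the capture (it may well happen in the browser), so server-side OpenCV is an assumption here.

    # Open the camera, grab exactly one frame, close it again immediately so
    # it does not keep draining the battery. OpenCV is an assumed choice.
    import cv2

    def capture_frame(path="capture.jpg"):
        cam = cv2.VideoCapture(0)          # open the default camera
        try:
            ok, frame = cam.read()         # grab a single frame
            if not ok:
                raise RuntimeError("Could not read from the camera")
            cv2.imwrite(path, frame)
            return path
        finally:
            cam.release()                  # release the camera right away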
Facial matching was not very accurate initially, so we used data augmentation to diversify our dataset, and after that the results were quite good.
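
A sketch of simple augmentation of the reference photos, assuming Keras' ImageDataGenerator; the transform ranges, file names and output directory are illustrative rather than the exact settings we used.

    # Generate augmented variants of one reference photo to diversify the
    # face dataset. Transform ranges and paths are illustrative.
    import os
    import numpy as np
    from tensorflow.keras.preprocessing.image import (
        ImageDataGenerator, load_img, img_to_array)

    augmenter = ImageDataGenerator(
        rotation_range=15,             # small head tilts
        brightness_range=(0.6, 1.4),   # lighting changes
        width_shift_range=0.1,
        height_shift_range=0.1,
        horizontal_flip=True,
    )

    os.makedirs("augmented", exist_ok=True)
    img = np.expand_dims(img_to_array(load_img("mom.jpg")), axis=0)
    flow = augmenter.flow(img, batch_size=1, save_to_dir="augmented",
                          save_prefix="mom", save_format="jpg")
    for _ in range(20):                # write 20 augmented copies to disk
        next(flow)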
