NewsWarriors
Using an AI Bot infrastructure and Web Crawler infrastructure to verify news articles to curb the spread of misinformation and fake news
Created on 11th October 2020
The problem NewsWarriors solves
The unchecked circulation of fake news and misinformation is dangerous to any community, and in unusual circumstances like the ones we face today it could be more dangerous than the pandemic itself. Existing methods of checking and verifying news are cumbersome and not user-friendly. To make this crucial process easy, we have implemented a bot infrastructure, currently on Telegram but scalable to WhatsApp and other social media platforms. The bot takes extracts of news articles as input from users and matches them against the articles that our crawler systems have already scraped and stored in our database. The database has been cleaned and preprocessed in advance to facilitate matching using Machine Learning and Natural Language Processing, and it contains only articles from credible and trustworthy websites. If a match is found, the bot tells the user that the news is verified and provides the source of the news and a link to the actual article.
If no match is found, the user receives a message saying that a credible source could not be found. By making the bots available as contacts on both Telegram and WhatsApp, we make news verification extremely easy and time-efficient, and by adding leaderboards and in-app points that reflect how active and enthusiastic users are, we believe we can attract a large user base.
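The verification flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Article` record, the `verify` function, the sample data, and the `threshold` value are all hypothetical, and a plain bag-of-words cosine similarity stands in for the real matching model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record for one crawled article from a trusted site."""
    source: str
    url: str
    text: str


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def verify(extract, articles, threshold=0.3):
    """Return the bot's reply for a user-supplied extract.

    The threshold is an assumed value chosen for this toy example.
    """
    query = Counter(tokenize(extract))
    best, best_score = None, 0.0
    for article in articles:
        score = cosine(query, Counter(tokenize(article.text)))
        if score > best_score:
            best, best_score = article, score
    if best is not None and best_score >= threshold:
        return f"Verified. Source: {best.source}. Link: {best.url}"
    return "A credible source could not be found for this news."


# Made-up sample database, standing in for the crawled articles.
ARTICLES = [
    Article(
        "Example Times",
        "https://example.com/vaccine-trial",
        "The health ministry announced that the vaccine trial has entered "
        "its third phase with thirty thousand volunteers.",
    ),
    Article(
        "Example Post",
        "https://example.com/monsoon",
        "Heavy monsoon rain flooded several districts and forced schools "
        "to close for two days.",
    ),
]

if __name__ == "__main__":
    print(verify("vaccine trial has entered its third phase", ARTICLES))
    print(verify("Aliens opened a bakery on Mars", ARTICLES))
```

In a real deployment the reply string would be sent back through the messaging platform's bot interface, and the similarity function would be replaced by the trained model.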
Challenges we ran into
Problem 1: Anti-scraping measures and SSL limitations on websites made it difficult to crawl the data, especially because the sites were dynamic. We had to adjust our crawler architecture regularly to keep getting data from the news sites.
Problem 2: Semantic matching of articles is easy when the user inputs the full article, but this is seldom the case. We had to match the user's input, typically a small extract, against the whole article, which is much harder. Because of the limited time in the hackathon, we directly used the similarity functions available in Doc2Vec models, which give decent but not very accurate matching. We plan to build on this model and use LSTMs and BERT to increase the matching accuracy.
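To illustrate why a short extract is hard to match against a whole article, the sketch below compares two scores on made-up text: similarity against the full article, and the best similarity over sliding windows the size of the extract. It uses a simple bag-of-words cosine as a stand-in, not the Doc2Vec model mentioned above, and the function names and sample text are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(tokens_a, tokens_b):
    """Cosine similarity between the token counts of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def full_article_score(extract, article):
    """Score the extract against the entire article at once."""
    return cosine(tokenize(extract), tokenize(article))


def best_window_score(extract, article, stride=1):
    """Score the extract against every extract-sized window, keep the best."""
    query = tokenize(extract)
    words = tokenize(article)
    size = max(len(query), 1)
    if len(words) <= size:
        return cosine(query, words)
    return max(
        cosine(query, words[i:i + size])
        for i in range(0, len(words) - size + 1, stride)
    )


# Made-up sample text for illustration only.
SAMPLE_ARTICLE = (
    "The city council met on Monday to discuss the annual budget. "
    "Officials confirmed that the new bridge will open to traffic next month. "
    "Residents welcomed the decision after years of delays."
)
SAMPLE_EXTRACT = "the new bridge will open to traffic next month"

if __name__ == "__main__":
    print("full article:", full_article_score(SAMPLE_EXTRACT, SAMPLE_ARTICLE))
    print("best window: ", best_window_score(SAMPLE_EXTRACT, SAMPLE_ARTICLE))
```

The full-article score is diluted by every sentence the user did not quote, and the dilution grows with article length, while the windowed score stays high when the extract appears in the article. Exact word overlap like this breaks down for paraphrased extracts, which is where learned models such as the planned LSTM and BERT approaches would be expected to help.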