ReSearch

ReSearch is an NLP-based search engine for querying and retrieving papers from CORD-19 (the COVID-19 Open Research Dataset).

The problem ReSearch solves

We want to speed up information retrieval for research scientists and drug developers. In the past 3 months, more than 50K research papers related to the coronavirus have been published. We want to narrow the gap between researchers and those papers. Today, a researcher has to search for papers by keyword, title, and abstract. ReSearch helps them get to the papers they need: researchers can enter natural language queries and get relevant results drawn from the paper abstracts. They can also narrow the search by adding their own keywords to the ones extracted from the query.

Challenges we ran into

We use two deep learning models (Universal Sentence Encoder and SciBERT) to retrieve the relevant sentences from each abstract. A deep learning model is only as good as the data fed into it, so we had to narrow down the data while also checking its quality. To do this, we extract keywords from the query and use a word2vec model trained in-house to retrieve keywords related to them. We then feed the query and the keywords to Pyserini to get relevant results, and pass those results to the models to rank them and retrieve the relevant sentences. Even then, the results are slightly off, so we also show the keywords to the user, who can remove any that seem unrelated and so improve the query.
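The flow described above (extract keywords, expand them, let the user prune them, then rank sentences) can be sketched in a few lines of plain Python. This is a minimal illustration under loud assumptions, not the ReSearch implementation: the tiny TOY_VECTORS table stands in for the in-house word2vec model, a bag-of-words cosine score stands in for the Universal Sentence Encoder and SciBERT rankers, and the Pyserini retrieval step is skipped entirely. All function names, the stopword list, and the thresholds are hypothetical.

```python
import math
import re
from collections import Counter

# Made-up 3-dimensional vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "coronavirus": [0.9, 0.1, 0.0],
    "covid": [0.85, 0.15, 0.05],
    "sars": [0.8, 0.2, 0.1],
    "vaccine": [0.1, 0.9, 0.1],
    "immunization": [0.15, 0.85, 0.1],
    "transmission": [0.1, 0.1, 0.9],
}

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"what", "is", "the", "of", "for", "a", "an", "in", "how", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extract_keywords(query):
    """Keep the query tokens that are not stopwords."""
    return [t for t in tokenize(query) if t not in STOPWORDS]


def expand_keywords(keywords, vectors, top_n=2, min_sim=0.9):
    """Append up to top_n nearest neighbours of each keyword."""
    expanded = list(keywords)
    for kw in keywords:
        if kw not in vectors:
            continue
        neighbours = sorted(
            ((cosine(vectors[kw], vec), word)
             for word, vec in vectors.items() if word != kw),
            reverse=True,
        )
        for sim, word in neighbours[:top_n]:
            if sim >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


def rank_sentences(terms, sentences):
    """Order sentences by bag-of-words cosine similarity to the terms."""
    query_counts = Counter(terms)
    scored = []
    for sentence in sentences:
        sent_counts = Counter(tokenize(sentence))
        vocab = sorted(set(query_counts) | set(sent_counts))
        score = cosine([query_counts[w] for w in vocab],
                       [sent_counts[w] for w in vocab])
        scored.append((score, sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]


keywords = extract_keywords("What is the transmission of coronavirus?")
suggested = expand_keywords(keywords, TOY_VECTORS)
# The user sees the suggested keywords and removes one that looks unrelated.
kept = [k for k in suggested if k != "sars"]
abstract_sentences = [
    "Vaccine trials started in several countries.",
    "Covid transmission occurs mainly through respiratory droplets.",
    "The coronavirus genome was sequenced early.",
]
ranked = rank_sentences(kept, abstract_sentences)
print(suggested)
print(ranked[0])
```

The pruning step is kept separate from the expansion so that the user's choices always override the model's suggestions, which matches the reason given above for showing the keywords at all.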
