researchGPT: The AI-Powered Research Assistant

Utilise documents without actually reading them!

The problem researchGPT solves

We have designed an aid for the everyday researcher and knowledge gatherer. You never have to comb through a sea of text to find that one statistic or that one conclusion that can win you an argument or complete your citation. Just plug those lengthy PDFs into this tool and ask questions of your very own research assistant (at a salary of Rs. 0/hr; impressive, right?).

A user can upload several well-formatted (mostly plaintext) PDFs. The tool focuses on research papers because their standard format makes text extraction straightforward. The need is also greater there: research papers contain a lot of data, and sifting through many of them for information related to a query quickly becomes tedious.

Challenges we ran into

The first issue we faced was cleaning the text while extracting it from the PDFs. Extraction was messy because the PDFs contained many special characters, and we also had to make sure the process didn't break sentences or misread data, which would harm the output. We therefore used the pypdf library for the basic PDF-to-text conversion and applied our own logic to produce text that can be fed to the model.
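A minimal sketch of this extract-then-clean step (the helper names and the exact cleanup rules here are illustrative assumptions, not our full cleaning logic):

```python
import re

def clean_text(raw: str) -> str:
    """Light cleanup of raw PDF text before it is fed to the model."""
    raw = raw.replace("-\n", "")                # re-join words hyphenated across line breaks
    raw = re.sub(r"[^\x20-\x7E\n]", " ", raw)   # drop stray non-ASCII/control characters
    return " ".join(raw.split())                # collapse runs of whitespace

def extract_text(path: str) -> str:
    """Basic PDF-to-text via pypdf, followed by our cleanup pass."""
    from pypdf import PdfReader  # third-party: pip install pypdf
    reader = PdfReader(path)
    return clean_text("\n".join((page.extract_text() or "") for page in reader.pages))
```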

We also ran into an issue while deploying the model: we used Hugging Face sentence-transformers for embeddings and the OpenAI completions API for generating better responses, and the resulting build was too large. We had to figure out which platform to deploy on and how to shrink the build so we could ship the product cost-effectively.
We therefore created two approaches. The local approach uses a fairly performant and cost-effective model (it can produce accurate and swift results without a GPU) with the dot product as its similarity metric. This model was chosen after careful deliberation for the following reasons:

- Best SBERT model in terms of performance
- High output vector size (768 dimensions)
- Based on the Microsoft MPNet model and then trained on 215M+ question-answer samples from various sources
- Cost-effective
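The local retrieval step boils down to a dot-product ranking, which can be sketched in plain Python. The model name below is an assumption: the write-up does not name the model, but its description (top SBERT QA model, 768 dimensions, MPNet-based, 215M+ question-answer pairs, dot-product-tuned) matches `multi-qa-mpnet-base-dot-v1`.

```python
def dot(a, b):
    # dot product of two equal-length vectors
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=3):
    # indices of the k document vectors most similar to the query, best first
    order = sorted(range(len(doc_vecs)), key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)
    return order[:k]

def embed(texts):
    # Assumed model choice (see lead-in); runs acceptably fast on CPU.
    from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
    model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
    return model.encode(texts).tolist()
```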
The second approach uses the OpenAI embeddings API, which creates 1536-dimensional dense vectors, offloading the demanding work to remote servers. This approach uses cosine similarity as its metric.
Whenever the user asks a question, it is converted into a vector with the same embedding model. Ranking the knowledge-base vectors by their similarity to the query vector gives relevance. The text metadata of the three most similar vectors is then passed as context to OpenAI's large-language-model-powered completions API, which produces a clear, human-readable answer.
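Putting the second approach together, here is a hedged end-to-end sketch. The specific model names and the prompt wording are our assumptions (the write-up names neither); the client calls follow the current `openai` Python SDK.

```python
import math

def cosine(a, b):
    # cosine similarity between two vectors
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def answer(question, chunks, chunk_vecs):
    """Embed the question, pick the 3 most similar chunks, and ask the LLM."""
    from openai import OpenAI  # third-party: pip install openai
    client = OpenAI()          # reads OPENAI_API_KEY from the environment
    q_vec = client.embeddings.create(
        model="text-embedding-ada-002",  # assumed 1536-dim embedding model
        input=[question],
    ).data[0].embedding
    top3 = sorted(range(len(chunks)), key=lambda i: cosine(q_vec, chunk_vecs[i]), reverse=True)[:3]
    context = "\n\n".join(chunks[i] for i in top3)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed completion model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```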

Discussion