Extracting relevant information and the main idea from research papers and transforming them into a format that can be quantitatively compared with other papers in the IPFS database requires a sophisticated approach.
One of the primary challenges is information extraction, which involves identifying and extracting relevant information from research papers. This challenge is compounded by the fact that research papers have varying structures and formats, making it difficult to extract the same information across all papers. Additionally, research papers often contain large amounts of irrelevant or redundant information that can make it challenging to identify the most important concepts and ideas.
To address these challenges, our approach to information extraction relies on natural language processing (NLP) techniques, which form the core of this product.
Firstly, we addressed the issue of irrelevant or redundant information within the papers by implementing a pre-processing step that involved the removal of stop words, followed by the lemmatization of the remaining text. This allowed us to distill the essence of the papers and focus only on the most important concepts and ideas.
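This pre-processing step can be sketched as follows. The stop-word list and the suffix-stripping lemmatizer below are simplified stand-ins for illustration; a real pipeline would use a full stop-word list and a proper lemmatizer (e.g. NLTK's WordNetLemmatizer).

```python
import re

# Small illustrative stop-word list; a production pipeline would use a
# full list such as NLTK's English stop words.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for"}

def lemmatize(token: str) -> str:
    # Crude suffix-stripping stand-in for a real lemmatizer.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text: str) -> list[str]:
    # Lowercase, split into alphabetic tokens, drop stop words, lemmatize.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("The networks of distributed systems")` keeps only the content words in lemmatized form.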
Next, we tokenized the pre-processed text to generate a weighted dictionary that captured the key themes and concepts within each paper. The weighting was determined by the frequency of occurrence of each token, allowing us to prioritize the most significant aspects of each paper.
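A minimal sketch of this weighting step, assuming relative token frequency as the weight (the exact weighting scheme is an implementation detail):

```python
from collections import Counter

def weighted_dictionary(tokens: list[str]) -> dict[str, float]:
    # Weight each token by its relative frequency within the paper,
    # so the most frequent concepts receive the highest weights.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}
```

Applied to the pre-processed token list of a paper, this yields one weighted dictionary per paper, ready for pairwise comparison.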
To facilitate a quantitative comparison of the papers, we used cosine similarity as a measure of their similarity. This allowed us to compute the degree of similarity between the weighted dictionaries for each pair of papers.
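Treating each weighted dictionary as a sparse vector over its tokens, the cosine similarity of two papers can be computed as below (a sketch of the standard formula, not the exact production code):

```python
import math

def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    # Each weighted dictionary is a sparse vector; tokens missing from
    # the other dictionary contribute zero to the dot product.
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The result ranges from 0 (no shared tokens) to 1 (identical weighted dictionaries), giving a single number to rank how similar any pair of papers is.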
We also ran into difficulties with HuddleSDK and LightHouseSDK, but our mentors helped us work through them throughout the hackathon.