Scientific Summarizer

A space to easily summarize long documents and download the summary.

Created on 21st January 2023

The problem Scientific Summarizer solves

  • Long research papers across many fields are published daily, and few readers have time to review each one. This summarizer condenses long documents into concise summaries while preserving the essential information.
  • Most long-document summarizers available today limit how long a document they can summarize. Our summarizer has no such limit because it first performs extractive summarization and then performs abstractive summarization on the result.
  • The summarizer combines transformer models with TextRank, a statistical, graph-based summarization algorithm.
  • The web app is flexible: a user can tweak hyperparameters such as beam width, length penalty, token batch length, repetition penalty, and no-repeat n-gram size.
  • The user can also download the generated summaries.
  • The application and model are hosted on the Hugging Face Hub, which provides a free CPU tier for model inference.
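
As a rough illustration of the tunable hyperparameters listed above, the sketch below groups them in one settings object and maps four of them to the argument names that Hugging Face `generate` accepts (`num_beams`, `length_penalty`, `repetition_penalty`, `no_repeat_ngram_size`). The class name, the helper name, and the default values are hypothetical, and it assumes "token batch length" means the number of tokens passed to the model per pass; the actual app may work differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class GenerationSettings:
    """User-tunable settings. Default values are illustrative assumptions."""
    beam_width: int = 4
    length_penalty: float = 1.0
    repetition_penalty: float = 1.2
    no_repeat_ngram_size: int = 3
    token_batch_length: int = 1024  # assumed: tokens fed to the model per pass

    def to_generate_kwargs(self) -> Dict[str, Union[int, float]]:
        # Keys follow the Hugging Face `generate` argument names.
        # token_batch_length is not a `generate` argument, so it is left out.
        return {
            "num_beams": self.beam_width,
            "length_penalty": self.length_penalty,
            "repetition_penalty": self.repetition_penalty,
            "no_repeat_ngram_size": self.no_repeat_ngram_size,
        }


def split_into_batches(token_ids: List[int], batch_length: int) -> List[List[int]]:
    """Split a token sequence into consecutive batches of at most batch_length."""
    if batch_length <= 0:
        raise ValueError("batch_length must be positive")
    return [
        token_ids[start:start + batch_length]
        for start in range(0, len(token_ids), batch_length)
    ]
```

With a loaded model, the dictionary could be passed as `model.generate(**inputs, **settings.to_generate_kwargs())`, once per batch produced by `split_into_batches`.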

Challenges I ran into

There were two main challenges:

  1. How can the maximum input length these models accept, about 1024 tokens, be increased?
  • Changing the maximum length directly was not feasible because the models are pre-trained with that limit, so I changed the approach: I perform extractive summarization first, followed by abstractive summarization.
  2. Which algorithm and model should be used for extractive and abstractive summarization, respectively?
  • For extractive summarization, algorithms such as LexRank, LDA, LSA, and PLSA came to mind, but I went with TextRank, a graph-based algorithm inspired by Google's PageRank.
    For abstractive summarization, there were three main candidates: Google's T5, Facebook's BART, and Google's Pegasus. I chose BART because it has a maximum token length of 1024 and was, in my assessment, the state of the art at the time.
    There were other architectural difficulties too, but in the end, all were resolved.
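
The two-stage approach described above can be sketched in plain Python. This is a minimal illustration, not the project's code: it uses the word-overlap similarity from the original TextRank formulation, a naive regex sentence splitter, and a word budget as a stand-in for the 1024-token limit. All function names are hypothetical, and the abstractive step is passed in as a callable so that a BART model can be plugged in.

```python
import math
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(a: List[str], b: List[str]) -> float:
    # Shared words, normalised by the log of each sentence length.
    if len(a) < 2 or len(b) < 2:
        return 0.0  # log(1) == 0, so skip very short sentences
    return len(set(a) & set(b)) / (math.log(len(a)) + math.log(len(b)))


def textrank(sentences: List[str], damping: float = 0.85,
             iterations: int = 50) -> List[float]:
    """Score sentences with PageRank-style power iteration on a similarity graph."""
    words = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    weights = [
        [similarity(words[i], words[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def extract(text: str, max_words: int) -> str:
    """Keep the highest-ranked sentences that fit the budget, in original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:
            chosen.append(i)
            used += length
    return " ".join(sentences[i] for i in sorted(chosen))


def summarize(text: str, abstractive: Callable[[str], str],
              max_words: int = 700) -> str:
    # Stage 1 shrinks the document; stage 2 rewrites the shortened text.
    return abstractive(extract(text, max_words))
```

Because the extractive stage always reduces the input to the chosen budget, the abstractive model never sees more text than it can accept, whatever the length of the original document.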
