There were two main challenges:
- How to work around the models' maximum input length, which is about 1024 tokens?
- Since the maximum length could not be changed (these models are pre-trained), I changed the approach instead: I performed extractive summarization first, followed by abstractive summarization, so the extractive pass shrinks the document to fit within the token limit (a sketch of this pipeline follows the list).
- Which algorithm and model to use for extractive and abstractive summarization, respectively?
- For extractive summarization, several algorithms came to mind, such as LexRank, LDA, LSA, and PLSA, but I went with TextRank, which is based on Google's PageRank algorithm.
- For abstractive summarization, there were three main candidates: Google's T5, Facebook's BART, and Google's PEGASUS. I chose BART because it has a maximum input length of 1024 tokens and is presently the state of the art.
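
Below is a minimal sketch of the two-stage pipeline. It assumes the `sumy` library's `TextRankSummarizer` for the extractive pass and the `facebook/bart-large-cnn` checkpoint from Hugging Face Transformers for the abstractive pass; the original implementation may have used different libraries or checkpoints.

```python
# Two-stage summarization: an extractive (TextRank) pass to fit the document
# into BART's 1024-token window, then an abstractive (BART) pass.
# Assumes `pip install sumy transformers torch`; sumy's English tokenizer may
# also require `nltk.download("punkt")`. Library and checkpoint choices are
# illustrative, not necessarily the ones used in this project.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from transformers import BartTokenizer, BartForConditionalGeneration


def extractive_pass(text: str, sentence_count: int = 20) -> str:
    """Keep the top TextRank-ranked sentences so the result fits in ~1024 tokens."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    return " ".join(str(s) for s in summarizer(parser.document, sentence_count))


def abstractive_pass(text: str) -> str:
    """Rewrite the extracted sentences into a fluent abstractive summary with BART."""
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    ids = model.generate(inputs["input_ids"], num_beams=4, min_length=40, max_length=150)
    return tokenizer.decode(ids[0], skip_special_tokens=True)


long_document = "..."  # any document longer than BART's 1024-token limit
print(abstractive_pass(extractive_pass(long_document)))
```

Chaining the two passes this way means the abstractive model only ever sees text that already fits its input window, so no content is silently truncated.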
There were other architectural difficulties too, but in the end, all were resolved.