There were two main challenges:
- How to work around the models' maximum input length, which is about 1024 tokens?
- Since the maximum length could not be changed (these models are pre-trained), I changed the approach instead: I performed extractive summarization first, followed by abstractive summarization, so the extractive pass shrinks the document to fit within the token limit (a sketch of this pipeline follows the list).
- Which algorithm and model to use for extractive and abstractive summarization, respectively?
- For extractive summarization, several algorithms came to mind, such as LexRank, LDA, LSA, and PLSA, but I went with TextRank, which is based on Google's PageRank algorithm.
- For abstractive summarization, there were three main candidates: Google's T5, Facebook's BART, and Google's PEGASUS. I chose BART because it has a maximum input length of 1024 tokens and is presently the state of the art.
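
Below is a minimal sketch of the two-stage pipeline. It assumes the `sumy` library's `TextRankSummarizer` for the extractive pass and the `facebook/bart-large-cnn` checkpoint from Hugging Face Transformers for the abstractive pass; the original implementation may have used different libraries or checkpoints.

```python
# Two-stage summarization: an extractive (TextRank) pass to fit the document
# into BART's 1024-token window, then an abstractive (BART) pass.
# Assumes `pip install sumy transformers torch`; sumy's English tokenizer may
# also require `nltk.download("punkt")`. Library and checkpoint choices are
# illustrative, not necessarily the ones used in this project.
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.text_rank import TextRankSummarizer
from transformers import BartTokenizer, BartForConditionalGeneration


def extractive_pass(text: str, sentence_count: int = 20) -> str:
    """Keep the top TextRank-ranked sentences so the result fits in ~1024 tokens."""
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    return " ".join(str(s) for s in summarizer(parser.document, sentence_count))


def abstractive_pass(text: str) -> str:
    """Rewrite the extracted sentences into a fluent abstractive summary with BART."""
    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    ids = model.generate(inputs["input_ids"], num_beams=4, min_length=40, max_length=150)
    return tokenizer.decode(ids[0], skip_special_tokens=True)


long_document = "..."  # any document longer than BART's 1024-token limit
print(abstractive_pass(extractive_pass(long_document)))
```

Chaining the two passes this way means the abstractive model only ever sees text that already fits its input window, so no content is silently truncated.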
There were other architectural difficulties too, but in the end, all were resolved.