Robotic process automation (RPA) has been gaining traction because many data sources are still unstructured and offer no easy way to consume their data. RPA faces severe challenges with respect to the performance of scraping data from websites and the accuracy of downloading multiple files.
Our project can download hundreds of files concurrently while handling file completeness checks, retries, and network disruptions.
We do not rely on the standard download manager of any browser; instead, we use our own download mechanism.
We achieve all these goals without sacrificing speed or accuracy.
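The following TypeScript sketch illustrates the idea of concurrent downloads with retries and a completeness check. It is an illustration under stated assumptions, not the project's actual code: the names `Fetcher`, `downloadWithRetry`, and `downloadAll` are hypothetical, the transport is injected as a function so the sketch stays independent of any HTTP library, and completeness is assumed to mean "received byte count equals the length the server declared".

```typescript
// Hypothetical types and names for illustration; none come from the project itself.
type FetchResult = { body: Uint8Array; expectedLength: number | null };
type Fetcher = (url: string) => Promise<FetchResult>;

interface DownloadOutcome {
  url: string;
  ok: boolean;
  attempts: number;
  bytes: number;
}

// Try one URL up to maxAttempts times, backing off exponentially between tries.
async function downloadWithRetry(
  url: string,
  fetcher: Fetcher,
  maxAttempts: number,
  baseDelayMs: number
): Promise<DownloadOutcome> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const { body, expectedLength } = await fetcher(url);
      // Completeness check (assumed rule): accept only if the byte count
      // matches the declared length, or if no length was declared.
      if (expectedLength === null || body.length === expectedLength) {
        return { url, ok: true, attempts: attempt, bytes: body.length };
      }
    } catch {
      // A thrown error is treated as a network disruption; fall through and retry.
    }
    if (attempt < maxAttempts) {
      const delay = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  return { url, ok: false, attempts: maxAttempts, bytes: 0 };
}

// Run a fixed-size pool of workers that pull URLs from a shared index.
async function downloadAll(
  urls: string[],
  fetcher: Fetcher,
  concurrency: number = 8,
  maxAttempts: number = 3,
  baseDelayMs: number = 200
): Promise<DownloadOutcome[]> {
  const results: DownloadOutcome[] = new Array(urls.length);
  let next = 0;

  async function worker(): Promise<void> {
    while (next < urls.length) {
      const i = next++; // safe: JavaScript runs this step without interleaving
      results[i] = await downloadWithRetry(urls[i], fetcher, maxAttempts, baseDelayMs);
    }
  }

  const poolSize = Math.min(concurrency, urls.length);
  await Promise.all(Array.from({ length: poolSize }, () => worker()));
  return results;
}
```

A worker pool keeps at most `concurrency` requests in flight, which limits load on the target site while still downloading many files in parallel. Results come back in the same order as the input URLs, and a file that never passes the completeness check is reported with `ok: false` rather than silently kept.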
At first, our approach was to use Selenium to make queries and get the required data. We soon realised that this wouldn't work, because the website maintained sessions across all its pages, so we had to look for an alternative.
We arrived at Puppeteer, and it ticked all the boxes. Puppeteer is a Node.js library that drives a headless Chrome or Chromium browser: it does everything a regular browser does without opening a window, and it provides APIs that let developers control the browser from code.
Managing cookies was tricky, but we managed to overcome it with Puppeteer.
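One way session cookies could be carried from the browser into a separate download mechanism is sketched below in TypeScript. This is an assumption about the approach, not a description of the project's code: the `SessionCookie` interface is declared locally (its fields mirror the cookie objects that Puppeteer calls such as `page.cookies()` return), `cookieHeaderFor` is a hypothetical helper, and the domain and path matching is deliberately simpler than the full rules in RFC 6265.

```typescript
// Local, dependency-free cookie shape. Field names mirror Puppeteer's cookie
// objects, but this interface is an assumption made for the sketch.
interface SessionCookie {
  name: string;
  value: string;
  domain: string;  // leading "." means the cookie also applies to subdomains
  path: string;
  expires: number; // Unix time in seconds; -1 is treated as a session cookie
  secure: boolean;
}

// Simplified domain match: exact host, or any subdomain when the cookie
// domain starts with a dot.
function domainMatches(host: string, cookieDomain: string): boolean {
  if (cookieDomain.startsWith(".")) {
    const bare = cookieDomain.slice(1);
    return host === bare || host.endsWith("." + bare);
  }
  return host === cookieDomain;
}

// Build the value of a Cookie request header for one target URL.
function cookieHeaderFor(
  targetUrl: string,
  cookies: SessionCookie[],
  nowSeconds: number = Date.now() / 1000
): string {
  const url = new URL(targetUrl);
  return cookies
    .filter((c) => domainMatches(url.hostname, c.domain))
    .filter((c) => url.pathname.startsWith(c.path)) // simplified path match
    .filter((c) => !c.secure || url.protocol === "https:")
    .filter((c) => c.expires === -1 || c.expires > nowSeconds)
    .map((c) => `${c.name}=${c.value}`)
    .join("; ");
}
```

The resulting string can be sent as the `Cookie` header on each download request, so the server sees the same session that the headless browser established while navigating the site.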
Technologies used
Discussion