Created on 1st May 2021
Ever since the onset of the pandemic, people have been frustrated by months of lockdown and quarantine, with no opportunity to go out and spend time with their friends and family. Their one hope was the vaccine, a beacon of light that was expected to free them from the seemingly never-ending lockdown. When the vaccine arrived, it delivered on that hope. But it also came with side effects, and news outlets and social media amplified concerns about its safety, so fear of the vaccine grew. That is where we come in. We predict the possible side effects of the vaccine so that people know what to expect, see that there is little to fear, and can get vaccinated promptly to protect themselves and their community.
The first and possibly the biggest challenge was preprocessing the data. We used the VAERS (Vaccine Adverse Event Reporting System) dataset, which records adverse events reported after vaccines are administered. The data was unstructured: some columns held free text, others held categorical features, and the list of side effects sat in a separate, poorly formatted CSV file. It is a general-purpose record, not a feature-engineered dataset prepared for analysis. We had to process it manually, combining columns, selecting everything we needed, and writing the result into a single CSV file. Next, the symptoms, which were stored as text, had to be converted into multi-hot features, since they are the labels we need to predict. After that, the patients' previous medical history had to be converted into proper numeric columns. The medical history was free text with no consistent format, and different rows described it in different ways. To turn it into a categorical feature, we counted how often each word occurred in that column, sorted the counts in descending order, and chose the most frequent medical conditions so that they could be encoded as numbers. Another challenge was the scarcity of data, so we kept only the labels (side effects) that were common and significant.
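The encoding steps described above can be illustrated with a minimal sketch. It is not the project's actual code: the toy records, the field names `symptoms` and `history`, the stopword list, and the thresholds `k` and `min_count` are all assumptions made for illustration, and it uses only the Python standard library rather than whatever tooling was used on the real VAERS files.

```python
import re
from collections import Counter

# Hypothetical toy records; the field names are illustrative, not the VAERS schema.
records = [
    {"symptoms": ["Headache", "Fatigue"], "history": "Asthma, hypertension"},
    {"symptoms": ["Headache"], "history": "hypertension; diabetes"},
    {"symptoms": ["Fatigue", "Chills"], "history": "None"},
    {"symptoms": ["Headache", "Fatigue"], "history": "diabetes and HYPERTENSION"},
]


def tokenize(text):
    """Lowercase free text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(texts, k, stopwords=frozenset({"and", "none"})):
    """Return the k most frequent words across all texts, most common first."""
    counts = Counter(
        word for text in texts for word in tokenize(text) if word not in stopwords
    )
    return [word for word, _ in counts.most_common(k)]


def common_labels(label_lists, min_count):
    """Keep labels that appear in at least min_count records, sorted by name."""
    counts = Counter(label for labels in label_lists for label in set(labels))
    return sorted(label for label, n in counts.items() if n >= min_count)


def multi_hot(items, vocabulary):
    """Encode a collection as a 0/1 vector over a fixed vocabulary."""
    present = set(items)
    return [1 if entry in present else 0 for entry in vocabulary]


# Medical history: count words, keep the top k, encode their presence per row.
history_vocab = top_terms([r["history"] for r in records], k=2)

# Side effects: drop rare labels, then multi-hot encode the remaining ones.
label_vocab = common_labels([r["symptoms"] for r in records], min_count=2)

X = [multi_hot(tokenize(r["history"]), history_vocab) for r in records]
Y = [multi_hot(r["symptoms"], label_vocab) for r in records]
```

In this toy example the rare label `Chills` is dropped because it appears in only one record, which mirrors the decision to keep only common side effects when data is scarce. Word-level counting is a rough proxy: multi-word conditions would need phrase handling that this sketch does not attempt.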