Eureka

Eureka! Eureka! We've found it!

The problem Eureka solves

Ever had difficulty coming up with ideas for a hackathon?
Remember your first hackathon and not having any idea what you’re supposed to do?

Or perhaps an old hackathon project that had potential but you just kinda forgot about it?

We at team Caffeine Overflow certainly do, and we’ve come up with a solution to help hackers and tech enthusiasts get into the world of hackathons.

Eureka is a website that collects data from previous hackathon projects, clusters the projects, and presents them to encourage discussion and further development of ideas.

Challenges we ran into

  1. While scraping, we had to deal with a lot of duplicate data because we couldn't control how often the site was refreshed.
    SOLUTION: We stored previously scraped data in a local database, so that whenever new data came in we could validate it against the database and keep only the unique entries.

  2. Since the data we scraped was not labeled, we could not train a classifier, so the only other option was to cluster the data. This proved harder than we expected: the data had a lot of outliers, where the descriptions were either poorly written or missing altogether, and because every event had a different theme, the projects were very diverse. A simple algorithm like k-means wasn't enough to cluster them.
    SOLUTION: We tried clustering the data with both expectation-maximization and latent Dirichlet allocation (LDA). The LDA model worked best on the data we had, because we realized that a heavy emphasis on the TF-IDF of the words was not the best approach.
    LDA is a mixed-membership model: instead of assigning each document to a single theme, it predicts the probability that the document belongs to each of several themes.
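
The deduplication step from challenge 1 can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: it assumes each scraped project is a dict with `title` and `description` fields, uses SQLite as the local database, and treats two projects as duplicates when their normalized title and description match. The helper names `project_key` and `store_new_projects` and the table layout are hypothetical.

```python
import hashlib
import sqlite3


def project_key(project):
    """Fingerprint a project by its normalized title and description (assumed fields)."""
    text = "|".join(
        " ".join(project.get(field, "").lower().split())
        for field in ("title", "description")
    )
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def store_new_projects(conn, scraped):
    """Insert only projects not seen before; return the ones that were new."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS projects ("
        "key TEXT PRIMARY KEY, title TEXT, description TEXT)"
    )
    fresh = []
    for project in scraped:
        cur = conn.execute(
            # The primary key makes SQLite skip rows whose fingerprint already exists.
            "INSERT OR IGNORE INTO projects VALUES (?, ?, ?)",
            (
                project_key(project),
                project.get("title", ""),
                project.get("description", ""),
            ),
        )
        if cur.rowcount == 1:  # 1 when the row was inserted, 0 when it was ignored
            fresh.append(project)
    conn.commit()
    return fresh
```

Calling `store_new_projects` after every scrape with the same connection means repeated refreshes of the site only ever add unseen projects to the database.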

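To illustrate the mixed-membership idea behind LDA from challenge 2, here is a small from-scratch sketch using collapsed Gibbs sampling. It is not the model the team trained (a real project would more likely rely on an existing library implementation); the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the iteration count are assumptions chosen for illustration. Documents are assumed to be pre-tokenized lists of words.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return one topic mixture per document."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]                  # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]    # count of word v in topic k
    topic_total = [0] * num_topics                                # tokens assigned to topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][vocab[w]] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = vocab[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample its topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed per-document topic probabilities; an empty document gets a uniform mixture.
    return [
        [
            (doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
            for t in range(num_topics)
        ]
        for d, doc in enumerate(docs)
    ]
```

Each returned row is a probability distribution over themes, so a project can belong partly to several themes at once, and a project with no description simply gets a uniform mixture instead of breaking the clustering.
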