The Problem DevFoolio Solves
Hackathons are designed to foster innovation, but with the sheer volume of project submissions, ensuring each entry’s originality has become a significant challenge. Copying or subtly repurposing existing projects undermines the spirit of creativity and fair competition. Manually verifying projects for originality is not only time-consuming but also susceptible to human error, making it difficult to maintain a level playing field.
How This Project Helps
Our platform offers automated and reliable plagiarism detection specifically designed for Devfolio hackathons. By scanning and analyzing new project submissions and comparing them against a database of past projects, our tool efficiently identifies similarities and potential duplicates.
Key Benefits:
- Ensures Integrity: Helps hackathon organizers verify originality, upholding the integrity of the event.
- Supports Authentic Work: Allows participants to showcase their unique ideas with confidence, knowing their work will stand out.
- Reduces Manual Effort: Eliminates the need for manual checks, streamlining the review process while minimizing human error.
With our system, hackathons can remain true to their mission of encouraging fresh, authentic ideas and fostering an environment of fair competition.
Challenges We Ran Into
1. Handling Large-Scale Data
- Hurdle: With over 180,000 projects on Devfolio, comparing every new submission against the full corpus risked slow processing and heavy resource usage.
- Solution: We combined optimized data structures, efficient database indexing, and preprocessing to streamline the corpus, enabling fast similarity checks without sacrificing accuracy; a sketch of the search step follows below.
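As a rough illustration of the search step, here is a minimal sketch using FAISS for nearest-neighbor lookup over normalized embeddings. The library choice, embedding dimensionality, and placeholder data are assumptions for the example, not a description of our exact production stack.

```python
# Sketch: fast similarity search over ~180,000 project vectors.
# FAISS and the 384-dim embeddings are illustrative assumptions.
import numpy as np
import faiss

DIM = 384  # assumed embedding size (typical for small sentence encoders)

# `embeddings` stands in for one vector per past Devfolio project.
embeddings = np.random.rand(180_000, DIM).astype("float32")
faiss.normalize_L2(embeddings)  # so inner product == cosine similarity

index = faiss.IndexFlatIP(DIM)
index.add(embeddings)

# Compare a new submission against the whole corpus in one call.
query = np.random.rand(1, DIM).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=10)  # top-10 closest past projects
```

An exact flat index already answers a single query over ~180k vectors in milliseconds; approximate indexes only become worthwhile at much larger scales.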
2. Dynamic Content Parsing and Scraping
- Hurdle: Many Devfolio project pages render their content dynamically, and class names change between builds, so a naive scraper would break from one page load to the next.
- Solution: We built a resilient scraper around robust tooling and selector fallback methods, letting us reliably extract the required data even when the page structure shifts (see the sketch below).
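The core of the fallback idea fits in a few lines. This is a hedged sketch: the CSS selectors below are hypothetical placeholders, not Devfolio's actual markup, and a real scraper would also need to handle JavaScript-rendered pages.

```python
# Sketch of selector fallbacks: try several CSS selectors in order so that
# changing class names don't break extraction. Selectors are hypothetical.
from bs4 import BeautifulSoup

FALLBACK_SELECTORS = [
    "div.project-description",      # a stable semantic class, if one exists
    "[data-testid='description']",  # test-id attributes often outlive classes
    "main section p",               # loose structural last resort
]

def extract_description(html: str) -> str | None:
    soup = BeautifulSoup(html, "html.parser")
    for selector in FALLBACK_SELECTORS:
        node = soup.select_one(selector)
        if node and node.get_text(strip=True):
            return node.get_text(" ", strip=True)
    return None  # signal the caller to retry, e.g. with a headless browser
```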
3. Textual Crux Extraction and Vectorization
- Hurdle: Summarizing each project’s description into a concise "crux" for efficient comparison was challenging, especially with the variation in text length and structure across projects.
- Solution: We used an NLP summarization model to distill each description into a concise crux capturing the project's essence, then vectorized these summaries to enable high-speed similarity calculations (sketched below).
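The summarize-then-embed step looks roughly like the sketch below. The model names are common public defaults used purely for illustration; we are not asserting these are the exact models in our pipeline.

```python
# Sketch: distill a description to its "crux", then embed it.
# Model names are illustrative defaults, not necessarily ours.
from transformers import pipeline
from sentence_transformers import SentenceTransformer

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim vectors

def crux_vector(description: str):
    # Condense a long, free-form description into a short summary...
    summary = summarizer(
        description, max_length=60, min_length=15, do_sample=False
    )[0]["summary_text"]
    # ...then embed it; normalized vectors make cosine similarity a dot product.
    return summary, encoder.encode(summary, normalize_embeddings=True)
```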
4. Ensuring Accuracy in Similarity Detection
- Hurdle: Balancing accuracy and speed in similarity detection was complex: set the threshold too high and real duplicates slip through; set it too low and false positives pile up.
- Solution: We fine-tuned the similarity threshold against labeled test data, refining the model’s parameters to strike a balance between precision and recall (see the sketch below).
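The tuning itself can be reproduced with a standard precision-recall sweep. The numbers below are made-up stand-ins for labeled test pairs; the point is the procedure, not the data.

```python
# Sketch: pick the similarity cutoff that best balances precision and recall
# on a labeled set of (similarity score, is_duplicate) pairs. Data is fake.
import numpy as np
from sklearn.metrics import precision_recall_curve

similarities = np.array([0.92, 0.88, 0.75, 0.62, 0.55, 0.40, 0.31])
is_duplicate = np.array([1, 1, 1, 0, 1, 0, 0])

precision, recall, thresholds = precision_recall_curve(is_duplicate, similarities)
f1 = 2 * precision * recall / np.clip(precision + recall, 1e-9, None)
best = thresholds[np.argmax(f1[:-1])]  # final P/R point has no threshold
print(f"chosen similarity threshold: {best:.2f}")
```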