Lynx
Find the fraud hiding in plain public data.
The problem Lynx solves
Fragmented public procurement data hides ₹1-3 lakh crore annual fraud/waste across India's 50+ govt portals (eProcure tenders, PFMS payments, CAG audits), where "three ledgers" (plans vs. money vs. reality) never cross-reference.
What People Use It For
Lynx is a procurement intelligence engine that helps journalists, NGOs, and auditors uncover hidden discrepancies in infrastructure projects.
- Journalists: Generate publishable stories from one query (e.g., NH-12 fraud)—cuts weeks of manual scraping to minutes, with citable evidence chains.
- NGOs/Activists: File precise RTIs and advocate with risk-scored reports—automates RTI drafting, boosts credibility via source-linked facts.
- Govt Auditors: Screen thousands of projects for red flags (cost overruns, cartels)—scales vigilance 100x, prioritizes high-risk cases safely.
How It Makes Tasks Easier/Safer
Replaces risky manual digging through JS portals and PDFs with automated, verifiable intelligence. It is safer (no proxy blocks or legal gray areas), faster (ReAct agent + dashboard), and scalable (caches compound nationwide).
Challenges we ran into
1. Brightdata .gov.in Blockade (CRITICAL)
Hurdle: Brightdata blocks all Indian govt domains (wbtenders.gov.in, eProcure.gov.in, PFMS) by policy—"Access denied: Government domain." Lost 70% of primary tender/payment sources.
Solution:
- Archive.org workaround: Cached govt portals (2015-2023) load perfectly in Brightdata's scraping browser
- 11 alternative sources (PPP India DB, Tofler, BidAssist) captured 85% of the needed data
- VPS roadmap: Self-hosted Playwright on Indian IPs for production
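The Archive.org workaround can be sketched with the Wayback Machine availability endpoint, which returns the snapshot closest to a given date. This is a minimal sketch, not the project's actual code: the function names are hypothetical, and the HTTP fetch itself is left out so the helpers stay testable offline.

```python
from typing import Optional
from urllib.parse import urlencode

# Public Wayback Machine availability endpoint.
WAYBACK_AVAILABILITY = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: str) -> str:
    """Build a query for the snapshot closest to `timestamp` (YYYYMMDD)."""
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{WAYBACK_AVAILABILITY}?{query}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the archived URL from a parsed JSON response, or None if absent."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")
```

The returned archive URL is what would then be handed to the scraping browser in place of the blocked .gov.in address.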
2. PDF Extraction Hell
Hurdle: CAG reports and SOR rate tables returned binary garbage. Scanned Indic PDFs + table layouts broke basic scrapers completely.
Solution:
PyMuPDF (digital PDFs) → Surya OCR (scanned) → Claude API (table structure)
Built quality scoring—low-confidence extractions queue for human review.
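A rough sketch of the fallback-plus-scoring idea follows. Everything here is an assumption for illustration: the `digital` and `ocr` callables are hypothetical stand-ins for the PyMuPDF and Surya OCR steps, the score is a simple "share of non-garbage characters" heuristic, and the 0.8 threshold is arbitrary. The Claude table-structure step is not modeled.

```python
import unicodedata
from typing import Callable, NamedTuple


class Extraction(NamedTuple):
    text: str
    method: str
    score: float
    needs_review: bool


def quality_score(text: str) -> float:
    """Fraction of characters that are not control bytes or U+FFFD (0.0 if empty)."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch == "\ufffd"
        or (unicodedata.category(ch).startswith("C") and not ch.isspace())
    )
    return 1.0 - bad / len(text)


def extract_with_fallback(
    path: str,
    digital: Callable[[str], str],  # hypothetical: wraps the digital-PDF step
    ocr: Callable[[str], str],      # hypothetical: wraps the OCR step
    min_score: float = 0.8,         # assumed threshold, tune on real data
) -> Extraction:
    """Try the text layer first; fall back to OCR when the result looks like garbage."""
    text, method = digital(path), "digital"
    score = quality_score(text)
    if score < min_score:
        ocr_text = ocr(path)
        ocr_score = quality_score(ocr_text)
        if ocr_score > score:
            text, score, method = ocr_text, ocr_score, "ocr"
    # Anything still below the threshold is flagged for the human review queue.
    return Extraction(text, method, score, needs_review=score < min_score)
```

Passing the extractors in as callables keeps the routing logic independent of any one PDF or OCR library.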
3. Entity Name Chaos
Hurdle: "MS Sharma Constructions" vs "Sharma Const." vs "SHARMA PVT LTD" across portals. CIN rare for small contractors.
Solution:
- CIN/DIN exact match (large contractors)
- Jaro-Winkler fuzzy matching (similarity of 85% or above auto-merges; 70-85% goes to human review)
- Alias persistence (once resolved, variants auto-link)
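The fuzzy-matching tier can be illustrated with a plain-Python Jaro-Winkler and a threshold router. This is a sketch under assumptions: the `normalize` rules and the `route` labels are hypothetical, and the 0.85 / 0.70 cutoffs are read from the percentages above rather than taken from the project's code.

```python
import re


def normalize(name: str) -> str:
    """Uppercase, drop punctuation, collapse whitespace (assumed normalization)."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    return " ".join(cleaned.split())


def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = half_transpositions = 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def route(a: str, b: str, auto: float = 0.85, review: float = 0.70) -> str:
    """Classify a name pair as auto_merge, human_review, or distinct."""
    score = jaro_winkler(normalize(a), normalize(b))
    if score >= auto:
        return "auto_merge"
    return "human_review" if score >= review else "distinct"
```

Names that differ only in case or punctuation normalize to the same string and merge automatically; pairs in the middle band are left for a reviewer, whose decision can then be stored as an alias.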
4. JS-Heavy Portals (GeM, PFMS, TenderTiger)
Hurdle: Empty HTML from basic scrapers—needed full browser rendering + session auth.
Solution: Brightdata scrapingbrowsernavigate for dynamic sites. GeM API exploration underway.
5. Historical Tender Data Gap
Hurdle: State portals show only recent tenders. Needed 2015-2023 for longitudinal analysis.
Solution: Archive.org monthly snapshots + BidAssist/TenderTiger archives = 1,225+ WB PWD road tenders captured.
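One way to enumerate monthly snapshots is the Wayback CDX server, which can collapse results to one capture per month. The sketch below only builds the query and turns the JSON rows into replay URLs; the function names and the 2015-2023 defaults are illustrative assumptions, and fetching is left to whatever HTTP client the pipeline uses.

```python
from typing import List
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query(page_url: str, start_year: int = 2015, end_year: int = 2023) -> str:
    """Build a CDX query returning at most one successful capture per month."""
    params = {
        "url": page_url,
        "output": "json",
        "from": str(start_year),
        "to": str(end_year),
        "filter": "statuscode:200",
        # Collapsing on the first 6 timestamp digits (YYYYMM) keeps one row per month.
        "collapse": "timestamp:6",
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def replay_urls(rows: List[List[str]]) -> List[str]:
    """Convert CDX JSON output (header row + records) into Wayback replay URLs."""
    if not rows:
        return []
    header, *records = rows
    ts, orig = header.index("timestamp"), header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]
```

Each replay URL points at a frozen copy of the tender listing for that month, which is what makes longitudinal comparison possible when the live portal only shows recent tenders.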
Tracks Applied (3)
AI/ML
Open Innovation
Requestly
