GitHub Ingestion Engine
Ingesting All the Data
Created on 8th November 2025
The problem GitHub Ingestion Engine solves
For any given username, the project ingests their most recent GitHub activity, including user information, issues, pull requests, commits, stars, and more. The data is not refreshed if the last fetch occurred less than 24 hours ago; this is a trade-off between reducing API usage and keeping the data reasonably up to date.
Since the data store is primarily used for analytics or awarding users based on their activity, real-time updates are not necessary. The data is only refreshed if the same user is re-ingested or if a query requests an update and the existing data is more than 24 hours old.
The project includes measures such as rate limiting and a retry mechanism to reduce GitHub API usage. Both are implemented by overriding a couple of methods on the standard Go http.Client.
Requests sent to the engine are queued and picked up by a worker pool. The queue holds up to 1000 requests and there are 10 workers (the defaults), which allows the engine to ingest a lot of data at once.

Challenges I ran into
The GitHub API ended up being harder to work with than expected. I found a pretty good Go library, but it still behaved in ways I did not expect. Filtering everything by user was a little unconventional, and the semantics around that were generally not great. I used the github-go library and had to query for everything with odd queries of the following type:
issueQuery := "author:" + username + " type:issue"
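Other activity types follow the same search-qualifier pattern. A small illustrative helper that assembles per-user queries in that style; the exact set of queries the project builds may differ:

```go
package main

import "fmt"

// buildQueries assembles GitHub search queries scoped to one user,
// in the same style as the issue query above. The set of query kinds
// here is illustrative, not the project's full list.
func buildQueries(username string) map[string]string {
	return map[string]string{
		"issues": "author:" + username + " type:issue",
		"prs":    "author:" + username + " type:pr",
	}
}

func main() {
	q := buildQueries("octocat")
	fmt.Println(q["issues"])
	fmt.Println(q["prs"])
}
```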
Implementing the worker pool, the job queue, and rate limiting also ended up being more challenging than I thought. Although there are plenty of existing implementations, I learnt a lot about concurrency patterns, channels, and the Go standard library.
Technologies used