
Calmcode Labs Presents

gitlit.

Our 6th experiment involves scraping GitHub.

A while ago we started investigating methods to reduce testing times on some of our open-source repositories. It started with our pytest-duration-insights tool, but we've since expanded our efforts to also include a scraper.

We've called the scraper project gitlit. We've provisioned a cronjob on a cloud VM that queries the GitHub API for the GitHub Actions run times of popular open source repositories. Once the data is fetched, it is committed to a GitHub repository so that anyone can download it.
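To give a feel for the aggregation step, here is a minimal sketch of how run times could be summed per day. It is not gitlit's actual code. It assumes each run is a dictionary with `created_at` and `updated_at` timestamps, as found in GitHub's workflow-run payloads, and it treats the gap between the two as the run's duration, which is only an approximation. The function names and the sample data are made up for illustration.

```python
from datetime import datetime

# GitHub timestamps look like "2021-05-01T12:00:00Z"
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def run_minutes(run: dict) -> float:
    """Approximate duration of one workflow run in minutes.

    Assumption: `updated_at` is close to the moment the run finished.
    """
    start = datetime.strptime(run["created_at"], TIMESTAMP_FORMAT)
    end = datetime.strptime(run["updated_at"], TIMESTAMP_FORMAT)
    return (end - start).total_seconds() / 60


def minutes_per_day(runs: list) -> dict:
    """Sum the approximate run minutes per calendar day (UTC)."""
    totals = {}
    for run in runs:
        day = run["created_at"][:10]  # the "YYYY-MM-DD" part
        totals[day] = totals.get(day, 0.0) + run_minutes(run)
    return totals


# Hypothetical sample data, shaped like entries from the API response.
sample_runs = [
    {"created_at": "2021-05-01T12:00:00Z", "updated_at": "2021-05-01T12:30:00Z"},
    {"created_at": "2021-05-01T13:00:00Z", "updated_at": "2021-05-01T13:15:00Z"},
    {"created_at": "2021-05-02T00:00:00Z", "updated_at": "2021-05-02T01:00:00Z"},
]

print(minutes_per_day(sample_runs))
```

A run can contain several jobs that execute in parallel, so wall-clock time per run may understate the total compute. A more careful scraper would likely sum the durations of the individual jobs instead.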

Because the dataset is hosted on GitHub, it's also very easy for us to deploy a hosted streamlit app that uses the same dataset. Streamlit is a neat way to quickly turn a dataset into an interactive service and there's a free tier that allows you to host these interactive visuals.

The only thing you need to do is to add a file to the repository that streamlit can detect. If this sounds new to you, you may appreciate our course on streamlit to learn more.

This is a very nice setup because every time that our GitHub repository is updated our streamlit dashboard will also refresh.

Useful Deployment Pattern

As a data sharing pattern, it's a pretty neat setup.

The cronjob is the only service you'd need to maintain. Once the file is updated in GitHub, it's all static files, and streamlit is able to give you a neat serverless setup.

Quick Insights

You can view the hosted streamlit app to learn some interesting lessons on testing times.

  • Just running GitHub Actions for numpy takes about 20 hours a day on average.
  • The Apache Airflow project takes a fair bit more. Just running the unit tests takes about 50 hours a day but it also requires another 50 to build the whole project.
  • Pandas is a heavyweight too, averaging just over 60 hours a day.
  • Many projects that are led by an individual, as opposed to a company or a large community, seem to be more lightweight. The FastAPI project only needs about 30 minutes for its unit tests.

The project isn't able to collect reliable CI information for every open source project because many projects don't use GitHub Actions. We hope, though, that the gitlit project as-is gives a meaningful peek into the compute that is required to run some of these open source projects. GitHub offers CI for free for open source projects, but assuming the $0.008 per minute rate, a project like pandas would cost about $28.80 a day (60 hours × 60 minutes × $0.008). That's about $10K a year. That's not nothing!
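The back-of-the-envelope math above can be reproduced with a few lines of Python. This is just a sketch of the arithmetic: the $0.008 per minute rate and the 60 hours a day for pandas come from the text, while the function name is our own. Since pandas averages "just over" 60 hours, treat the result as a lower bound.

```python
RATE_PER_MINUTE = 0.008  # USD per minute, the rate assumed in the text


def daily_cost(hours_per_day: float, rate: float = RATE_PER_MINUTE) -> float:
    """Cost in USD of running CI for `hours_per_day` hours at a per-minute rate."""
    return hours_per_day * 60 * rate


pandas_daily = daily_cost(60)      # roughly 60 hours of CI per day
pandas_yearly = pandas_daily * 365

print(f"per day:  ${pandas_daily:.2f}")
print(f"per year: ${pandas_yearly:.0f}")
```

Plugging in the other numbers from the list works the same way, for example `daily_cost(20)` for numpy.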
