Fetch data for all PyPI packages by hugovk · Pull Request #41 · hugovk/top-pypi-packages
I bumped into your blog post, super interesting thanks (I originally came to your blog for setup-python free-threaded support)! Thanks as well for this repo, I found it super useful in the past, in particular to get some data without having to worry about running Google BigQuery queries, free quota and all 🙏!
Thanks!
I completely get the operational cost thing and that it is more convenient to take all downloads (rather than only `installer='pip'`) and that fetching the data for all the packages costs the same, so you may as well provide the data for all PyPI packages.
By the way, I tried fetching all but for some reason the process was killed on the server, so I moved it back down to 15k (still up from 8k): #44. I might try increasing it again in the future...
Having said this, I am a bit worried that for packages that are downloaded less, people will start trusting these numbers blindly without being aware of the caveats (most downloads are from mirrors and not `pip` or `uv` installs). Even for packages that are downloaded more, the difference can be 10%, which is not small. For example, recently I had a closer look at skore downloads and out of the 12k downloads in the last month, only 1.5k (so 12%) were from `pip` or `uv`.
Yeah, there's always trade-offs, and that's a good point about less-downloaded packages.
Let's compare skore with the package currently at rank 100 (referencing):
❯ pypinfo --markdown --all --percent --start-date 2025-03 --end-date 2025-03 skore installer
Served from cache: False
Data processed: 4.18 GiB
Data billed: 4.18 GiB
Estimated cost: $0.03

| installer_name | percent | download_count |
|---|---|---|
| bandersnatch | 34.32% | 775 |
| requests | 18.11% | 409 |
| uv | 15.32% | 346 |
| pip | 13.55% | 306 |
| Browser | 10.27% | 232 |
| None | 7.66% | 173 |
| conda | 0.62% | 14 |
| poetry | 0.13% | 3 |
| Total | | 2,258 |
❯ pypinfo --markdown --all --percent --start-date 2025-03 --end-date 2025-03 referencing installer
Served from cache: False
Data processed: 3.50 GiB
Data billed: 3.50 GiB
Estimated cost: $0.02

| installer_name | percent | download_count |
|---|---|---|
| pip | 77.01% | 95,788,632 |
| uv | 16.35% | 20,336,635 |
| poetry | 4.80% | 5,974,879 |
| requests | 1.44% | 1,793,027 |
| None | 0.29% | 355,183 |
| Nexus | 0.05% | 62,962 |
| Browser | 0.02% | 28,823 |
| Bazel | 0.02% | 25,304 |
| pdm | 0.01% | 11,603 |
| bandersnatch | 0.00% | 4,624 |
| Total | | 124,381,672 |
That's 29% uv+pip for skore (15.32% + 13.55%) and 93% uv+pip for referencing (77.01% + 16.35%).
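For anyone who wants to recompute the uv+pip share, here is a minimal sketch that uses only the download counts from the two tables above. The `installer_share` helper and the dictionary names are made up for illustration; they are not part of pypinfo or this repo.

```python
# Download counts copied from the two pypinfo tables above (March 2025).
skore = {
    "bandersnatch": 775, "requests": 409, "uv": 346, "pip": 306,
    "Browser": 232, "None": 173, "conda": 14, "poetry": 3,
}
referencing = {
    "pip": 95_788_632, "uv": 20_336_635, "poetry": 5_974_879,
    "requests": 1_793_027, "None": 355_183, "Nexus": 62_962,
    "Browser": 28_823, "Bazel": 25_304, "pdm": 11_603,
    "bandersnatch": 4_624,
}


def installer_share(counts, installers=("pip", "uv")):
    """Return the fraction of all downloads made by the given installers."""
    total = sum(counts.values())
    selected = sum(counts.get(name, 0) for name in installers)
    return selected / total


for package, counts in (("skore", skore), ("referencing", referencing)):
    print(f"{package}: {installer_share(counts):.1%} uv+pip")
# skore: 28.9% uv+pip
# referencing: 93.4% uv+pip
```

Passing a different `installers` tuple, for example `("bandersnatch",)`, gives the mirror share the same way.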
I expect we'll get a similarly high uv+pip share near the top, and a similarly high mirror share near the bottom? And that the share will change fairly smoothly as we go down the list?
If so, in any given month, I don't think it should affect relative positioning too much? Those ranking near the top will have similar numbers and a similar share, and those near skore will also have similar numbers and a similar share. There'll of course be some outliers.
Comparing absolute numbers one month to another isn't necessarily a good idea, because of the changes I unfortunately need to make to stay within quota. I've tried to list them in the changelog for visibility: https://hugovk.github.io/top-pypi-packages/#changelog.
However, I usually see people using this data when they want to do some study on the most popular packages (see https://hugovk.github.io/top-pypi-packages/#users), so these relative numbers are hopefully useful as a recent accessible snapshot.
Full disclosure: I have an upcoming PyCon Italia talk about some of the PyPI download stats caveats I have discovered along my journey, here is the abstract.
Looks interesting! I'll be at PyCon Italia too :)