Fetch for 29 days, not 30 which uses too much quota by hugovk · Pull Request #42 · hugovk/top-pypi-packages
As seen in #36 and https://dev.to/hugovk/a-surprising-thing-about-pypis-bigquery-data-2g9o, we were very close to going over the monthly quota with 30 days of data.
And as luck would have it, we've gone over again.
It's possible to perform a dry run to find out how many bytes would be processed:
https://cloud.google.com/bigquery/docs/running-queries#dry-run
For example:
```python
from google.cloud import bigquery

# Construct a BigQuery client object.
# client = bigquery.Client()  # default credentials
client = bigquery.Client.from_service_account_json("/path/to/credentials.json")

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Start the query, passing in the extra configuration.
query_job = client.query(
    """
    SELECT
      file.project AS project,
      COUNT(*) AS download_count,
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -31 DAY)
                        AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
    GROUP BY
      project
    ORDER BY
      download_count DESC
    """,
    job_config=job_config,
)  # Make an API request.

# A dry run query completes immediately.
print(
    f"This query will process {query_job.total_bytes_processed:,} bytes or"
    f" {query_job.total_bytes_processed / 2**40:.2f} TiB."
)
```
This query will process 1,109,890,761,919 bytes or 1.01 TiB.
Changing to 29 days of data (a window from 30 days ago to 1 day ago):

```diff
-WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -31 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
+WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -30 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
```
This query will process 1,072,148,343,976 bytes or 0.98 TiB.
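To compare windows without editing the query by hand each time, the dry run can be wrapped in a small helper. This is only a sketch, not code from this repo: `dry_run_bytes` and `QUERY_TEMPLATE` are made-up names, and the credentials path is a placeholder.

```python
from google.cloud import bigquery

# Illustrative sketch only: `dry_run_bytes` and `QUERY_TEMPLATE` are made-up
# names and the credentials path is a placeholder.
QUERY_TEMPLATE = """
SELECT
  file.project AS project,
  COUNT(*) AS download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -{start} DAY)
                    AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
"""


def dry_run_bytes(client: bigquery.Client, start_days: int) -> int:
    """Dry-run the query for a window ending 1 day ago and return bytes processed."""
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(QUERY_TEMPLATE.format(start=start_days), job_config=job_config)
    return job.total_bytes_processed


if __name__ == "__main__":
    client = bigquery.Client.from_service_account_json("/path/to/credentials.json")
    for start in (31, 30):  # 30 days of data vs. 29 days of data
        n = dry_run_bytes(client, start)
        print(f"INTERVAL -{start} DAY: {n:,} bytes = {n / 2**40:.2f} TiB")
```

Dry runs don't process any data, so checking a few candidate windows this way costs nothing against the quota.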
Let's try this.
For now, the filenames are still called `top-pypi-packages-30-days*`. At some point I'll rename them to remove `-30-days`, possibly with a redirect to make the change clear, but maybe not.