Fetch for 29 days, not 30 which uses too much quota by hugovk · Pull Request #42 · hugovk/top-pypi-packages
As seen in #36 and https://dev.to/hugovk/a-surprising-thing-about-pypis-bigquery-data-2g9o, we were very close to going over the monthly quota with 30 days of data.
And as luck would have it, we've gone over again.
It's possible to perform a dry run to find out how many bytes would be processed:
https://cloud.google.com/bigquery/docs/running-queries#dry-run
For example:
```python
from google.cloud import bigquery

# Construct a BigQuery client object.
# client = bigquery.Client()  # default credentials
client = bigquery.Client.from_service_account_json("/path/to/credentials.json")

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Start the query, passing in the extra configuration.
query_job = client.query(
    """
    SELECT
      file.project AS project,
      COUNT(*) AS download_count,
    FROM `bigquery-public-data.pypi.file_downloads`
    WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -31 DAY)
                        AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
    GROUP BY
      project
    ORDER BY
      download_count DESC
    """,
    job_config=job_config,
)  # Make an API request.

# A dry run query completes immediately.
print(
    f"This query will process {query_job.total_bytes_processed:,} bytes or"
    f" {query_job.total_bytes_processed / 2**40:.2f} TiB."
)
```
This query will process 1,109,890,761,919 bytes or 1.01 TiB.
Changing to 29 days of data (a window from 30 days ago to 1 day ago):

```diff
-WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -31 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
+WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -30 DAY) AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
```
This query will process 1,072,148,343,976 bytes or 0.98 TiB.
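To compare windows without editing the query by hand each time, the dry run can be wrapped in a small helper. This is only a sketch, not code from this repo: `dry_run_bytes` and `QUERY_TEMPLATE` are made-up names, and the credentials path is a placeholder.

```python
from google.cloud import bigquery

# Illustrative sketch only: `dry_run_bytes` and `QUERY_TEMPLATE` are made-up
# names and the credentials path is a placeholder.
QUERY_TEMPLATE = """
SELECT
  file.project AS project,
  COUNT(*) AS download_count,
FROM `bigquery-public-data.pypi.file_downloads`
WHERE timestamp BETWEEN TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -{start} DAY)
                    AND TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -1 DAY)
GROUP BY
  project
ORDER BY
  download_count DESC
"""


def dry_run_bytes(client: bigquery.Client, start_days: int) -> int:
    """Dry-run the query for a window ending 1 day ago and return bytes processed."""
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(QUERY_TEMPLATE.format(start=start_days), job_config=job_config)
    return job.total_bytes_processed


if __name__ == "__main__":
    client = bigquery.Client.from_service_account_json("/path/to/credentials.json")
    for start in (31, 30):  # 30 days of data vs. 29 days of data
        n = dry_run_bytes(client, start)
        print(f"INTERVAL -{start} DAY: {n:,} bytes = {n / 2**40:.2f} TiB")
```

Dry runs don't process any data, so checking a few candidate windows this way costs nothing against the quota.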
Let's try this.
For now, the filenames are still called `top-pypi-packages-30-days*`. At some point I'll rename them to remove `-30-days`, possibly with a redirect to make the change clear, but maybe not.