Unable to open an S3 object with # in the URL · Issue #25945 · pandas-dev/pandas

```python
import pandas as pd

df = pd.read_csv('s3://bucket/key#1.csv')
df = pd.read_csv('s3://bucket/key%231.csv')
```

Problem description

Pandas can't open an S3 object whose key contains a # sign, whether or not the # is percent-encoded in the URL. The reason is that urllib.parse.urlparse(), which io/s3.py uses to parse the URL, treats the # sign as the start of the URL fragment. In the non-percent-encoded case, everything after the # is dropped; in the percent-encoded case, the %23 escape is never decoded, so the key passed on to s3fs does not match the actual object key.
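To illustrate the parsing behaviour, here is what the standard library does with the two URL forms from the example above (this only demonstrates urlparse itself, not the pandas code path):

```python
from urllib.parse import urlparse

# The '#' starts the fragment, so the key is truncated to '/key'.
print(urlparse('s3://bucket/key#1.csv'))
# ParseResult(scheme='s3', netloc='bucket', path='/key', params='',
#             query='', fragment='1.csv')

# With percent encoding the path survives parsing, but it still contains
# the literal characters '%23' rather than '#'.
print(urlparse('s3://bucket/key%231.csv'))
# ParseResult(scheme='s3', netloc='bucket', path='/key%231.csv', params='',
#             query='', fragment='')
```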

I see two possible solutions to the problem, but I'm not sure which one is best, since there does not seem to be a specification for the S3 URL scheme (at least none that I can find):

  1. Use allow_fragments=False when calling urllib.parse.urlparse(). This would make the non-percent-encoded case work, but seems slightly wrong.
  2. Call urllib.parse.unquote() on S3 paths before passing them to s3fs. s3fs seems to want just a bucket/key as input, so pandas would have to remove the percent encoding itself. This would make the percent-encoded case work and seems a bit more correct, but it might change existing behavior for users who are currently loading URLs with literal % characters in them. (A minimal sketch of both options follows this list.)
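For reference, a minimal sketch of what each option could look like. This is not the actual pandas implementation in io/s3.py; the helper names and the netloc + path composition are assumptions for illustration only:

```python
from urllib.parse import urlparse, unquote


def s3_path_option1(url):
    # Option 1 (hypothetical): disable fragment parsing so '#' stays in the key.
    parsed = urlparse(url, allow_fragments=False)
    return parsed.netloc + parsed.path
    # s3_path_option1('s3://bucket/key#1.csv') -> 'bucket/key#1.csv'


def s3_path_option2(url):
    # Option 2 (hypothetical): decode percent escapes so s3fs receives the
    # literal key. Note this also decodes any other %XX sequences.
    parsed = urlparse(url)
    return unquote(parsed.netloc + parsed.path)
    # s3_path_option2('s3://bucket/key%231.csv') -> 'bucket/key#1.csv'
```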