Unable to open an S3 object with # in the URL · Issue #25945 · pandas-dev/pandas
```python
import pandas as pd

df = pd.read_csv('s3://bucket/key#1.csv')    # raw '#' in the key
df = pd.read_csv('s3://bucket/key%231.csv')  # percent-encoded '#'
```
Problem description
Pandas can't open an object from S3 if the key contains a # sign, whether the URL path is percent-encoded or not. The reason is that `urllib.parse.urlparse()`, which io/s3.py uses to parse the URL, treats the # sign as the beginning of the URL fragment: in the non-percent-encoded case everything after the # is dropped from the path, and in the percent-encoded case the %23 is never decoded back to #, so s3fs is asked for a key that does not exist.
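For illustration, here is how `urllib.parse.urlparse()` handles such a URL (bucket and key names are placeholders):

```python
from urllib.parse import urlparse

# By default the fragment is split off at the first '#', truncating the key.
parsed = urlparse('s3://bucket/key#1.csv')
print(parsed.path)      # '/key'
print(parsed.fragment)  # '1.csv'

# With allow_fragments=False the '#' stays in the path.
parsed = urlparse('s3://bucket/key#1.csv', allow_fragments=False)
print(parsed.path)      # '/key#1.csv'
```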
I see two possible solutions to the problem, but I'm not sure which one is best, since there does not seem to be a specification for the S3 URL scheme (at least none that I can find); a rough sketch of both options follows the list:
- Use `allow_fragments=False` when calling `urllib.parse.urlparse()`. This would allow the non-percent-encoded case to work, but seems slightly wrong.
- Call `urllib.parse.unquote()` on S3 paths before passing them to s3fs. s3fs seems to want just a bucket/key as input, so pandas would have to remove the percent encoding. This would allow the percent-encoded case to work. It seems a bit more correct, but it might change existing behavior for users who load URLs with literal % characters in them.