Issue 21509: json.load fails to read UTF-8 file with a Byte Order Mark (BOM)

Created on 2014-05-14 20:32 by Kristian.Benoit, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (7)

msg218573

Author: Kristian Benoit (Kristian.Benoit)

Date: 2014-05-14 20:32

I'm trying to parse a JSON file and keep getting ValueError. The file command reports it as "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ...

The json.load docs say it supports UTF-8 out of the box.

Here's a link to the file : http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json

msg218579

Author: STINNER Victor (vstinner) (Python committer)

Date: 2014-05-14 21:49

In Python 2, json.loads() accepts str and unicode types. You can support JSON starting with a UTF-8 BOM using the Python codec "utf-8-sig". Example:

>>> codecs.BOM_UTF8 + b'{\n}'
'\xef\xbb\xbf{\n}'
>>> json.loads(codecs.BOM_UTF8 + b'{\n}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
{}
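For the original use case, loading a BOM-prefixed JSON file from disk, the same codec can be applied when opening the file. A minimal sketch, assuming the downloaded file is saved as matieres-residuelles.json (io.open accepts an encoding argument on both Python 2.6+ and Python 3):

import io
import json

# "utf-8-sig" strips an optional UTF-8 BOM before json ever sees the text.
with io.open('matieres-residuelles.json', encoding='utf-8-sig') as f:
    data = json.load(f)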

msg218594

Author: Serhiy Storchaka (serhiy.storchaka) (Python committer)

Date: 2014-05-15 07:08

Currently json.load/loads don't support binary input. See and .

msg218643

Author: Chris Rebert (cvrebert)

Date: 2014-05-16 04:45

The new JSON RFC now at least mentions BOM handling (https://tools.ietf.org/html/rfc7159#section-8.1):

Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
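A decoder following that MAY clause would strip a leading BOM itself rather than treating it as an error. A rough sketch of the idea (the helper name is made up for illustration, it is not part of the stdlib):

import codecs
import json

def loads_ignoring_bom(raw_bytes):
    # RFC 7159 section 8.1: a parser MAY ignore a byte order mark.
    if raw_bytes.startswith(codecs.BOM_UTF8):
        raw_bytes = raw_bytes[len(codecs.BOM_UTF8):]
    return json.loads(raw_bytes.decode('utf-8'))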

msg218705

Author: Kristian Benoit (Kristian.Benoit)

Date: 2014-05-17 15:06

I added code to skip the BOM if present when the encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how can a user figure out there are 3 hidden bytes at the start of the file?

Kristian
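The patch itself is attached rather than quoted, but the behavior described above, skipping a UTF-8 BOM when the encoding is None or "utf-8", could look roughly like this sketch (the function name and placement are assumptions, not the actual patch):

import codecs

def _skip_utf8_bom(s, encoding=None):
    # Assumed helper: drop a leading UTF-8 BOM only for the default or
    # explicit "utf-8" encoding; leave other encodings untouched.
    if encoding in (None, 'utf-8') and s.startswith(codecs.BOM_UTF8):
        return s[len(codecs.BOM_UTF8):]
    return s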

msg218823

Author: Santoso Wijaya (santoso.wijaya)

Date: 2014-05-19 22:33

I think you should use codecs.BOM_UTF8 rather than hardcoding the string "\xef\xbb\xbf" directly.

And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
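Handling the other encodings generically would mean checking each known BOM and decoding with the matching codec. One possible sketch of that idea (illustrative only, not code from either patch):

import codecs

# UTF-32 BOMs must be checked before UTF-16, because BOM_UTF32_LE
# starts with the same two bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF8, 'utf-8'),
]

def decode_with_bom(raw_bytes, default='utf-8'):
    for bom, encoding in _BOMS:
        if raw_bytes.startswith(bom):
            return raw_bytes[len(bom):].decode(encoding)
    return raw_bytes.decode(default)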

msg289168

Author: Serhiy Storchaka (serhiy.storchaka) (Python committer)

Date: 2017-03-07 15:53

This issue is outdated since automatic encoding detection was implemented in .
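For reference, on Python 3.6 and later json.loads accepts bytes directly and detects the encoding from the data, including a UTF-8 BOM, so the originally reported failure no longer reproduces when the raw bytes are passed in:

import codecs
import json

# The bytes are decoded with "utf-8-sig", so the BOM is silently dropped.
print(json.loads(codecs.BOM_UTF8 + b'{"ok": true}'))  # {'ok': True}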

History

Date                 User              Action  Args
2022-04-11 14:58:03  admin             set     github: 65708
2017-03-07 15:53:47  serhiy.storchaka  set     status: open -> closed; resolution: out of date; messages: +; stage: resolved
2014-05-19 22:33:07  santoso.wijaya    set     nosy: + santoso.wijaya; messages: +
2014-05-17 16:17:42  Kristian.Benoit   set     files: + json.v2.patch
2014-05-17 15:07:00  Kristian.Benoit   set     files: + json.patch; keywords: + patch; messages: +
2014-05-16 04:45:26  cvrebert          set     nosy: + cvrebert; messages: +
2014-05-15 07:08:40  serhiy.storchaka  set     messages: +
2014-05-15 00:52:14  pitrou            set     nosy: + serhiy.storchaka
2014-05-14 21:49:58  vstinner          set     nosy: + vstinner; messages: +
2014-05-14 20:32:52  Kristian.Benoit   create