Issue 21509: json.load fails to read UTF-8 file with (BOM) Byte Order Marks
Created on 2014-05-14 20:32 by Kristian.Benoit, last changed 2022-04-11 14:58 by admin. This issue is now closed.
Messages (7)
Author: Kristian Benoit (Kristian.Benoit) *
Date: 2014-05-14 20:32
I'm trying to parse a JSON file and keep getting ValueError. The file(1) command reports the file as "UTF-8 Unicode (with BOM) text", vim reports it as UTF-8, ...
The json.load docs say it supports UTF-8 out of the box.
Here's a link to the file : http://donnees.ville.sherbrooke.qc.ca/storage/f/2014-03-10T17%3A45%3A18.959Z/matieres-residuelles.json
Author: STINNER Victor (vstinner) *
Date: 2014-05-14 21:49
In Python 2, json.loads() accepts str and unicode types. You can support JSON starting with a UTF-8 BOM using the Python codec "utf-8-sig". Example:
>>> codecs.BOM_UTF8 + b'{\n}'
'\xef\xbb\xbf{\n}'
>>> json.loads(codecs.BOM_UTF8 + b'{\n}')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
>>> json.loads((codecs.BOM_UTF8 + b'{\n}').decode('utf-8-sig'))
{}
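For a file on disk the same codec can be passed to open(). A minimal sketch, not part of the original message, using io.open so it runs on both Python 2 and 3 (the file name is just the reporter's example):

import io
import json

# Decode with "utf-8-sig", which strips a leading UTF-8 BOM if present.
with io.open('matieres-residuelles.json', encoding='utf-8-sig') as f:
    data = json.load(f)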
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2014-05-15 07:08
Currently json.load()/loads() don't support binary input. See the related issues.
Author: Chris Rebert (cvrebert) *
Date: 2014-05-16 04:45
The new JSON RFC now at least mentions BOM handling: https://tools.ietf.org/html/rfc7159#section-8.1 :
Implementations MUST NOT add a byte order mark to the beginning of a JSON text. In the interests of interoperability, implementations that parse JSON texts MAY ignore the presence of a byte order mark rather than treating it as an error.
Author: Kristian Benoit (Kristian.Benoit) *
Date: 2014-05-17 15:06
I added code to skip the BOM, if present, when the encoding is either None or "utf-8". The problem I have with Victor's solution is that users don't know these files are not plain UTF-8. Most text editors say the file is UTF-8 encoded, so how is a user supposed to figure out there are 3 hidden bytes at the start of the file?
Kristian
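The attached patch itself is not reproduced in this thread; the following is only a rough sketch of the described idea (skip a leading UTF-8 BOM when the encoding is unspecified or UTF-8), with a hypothetical helper name:

import codecs
import json

def loads_skipping_bom(data, encoding=None):
    # Hypothetical illustration, not the attached json.patch: drop a leading
    # UTF-8 BOM when the encoding is unspecified or UTF-8, then parse.
    if encoding in (None, 'utf-8') and data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return json.loads(data.decode(encoding or 'utf-8'))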
Author: Santoso Wijaya (santoso.wijaya) *
Date: 2014-05-19 22:33
I think you should use codecs.BOM_UTF8 rather than hardcoding the string "\xef\xbb\xbf" directly.
And why special-case UTF-8 while we're at it? What about other encodings and their BOMs?
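A sketch of what a non-UTF-8-specific check could look like, built on the codecs BOM constants (an illustration, not code from either attached patch):

import codecs

# Longer BOMs first: BOM_UTF32_LE starts with the same bytes as BOM_UTF16_LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF8, 'utf-8'),
]

def strip_bom(data):
    # Return (payload, encoding) with any recognized BOM removed;
    # encoding is None when no BOM is found.
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return data[len(bom):], encoding
    return data, None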
Author: Serhiy Storchaka (serhiy.storchaka) *
Date: 2017-03-07 15:53
This issue is outdated since automatic encoding detection was implemented.
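For reference, on Python 3.6 and later json.loads() accepts bytes and auto-detects the encoding, so the failing example from earlier in the thread should now work directly; a quick check, assuming that behaviour:

import codecs
import json

# The bytes start with a UTF-8 BOM; detection treats this as "utf-8-sig",
# so the BOM is skipped before parsing.
print(json.loads(codecs.BOM_UTF8 + b'{"key": "value"}'))
# {'key': 'value'}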
History
Date                 User              Action  Args
2022-04-11 14:58:03  admin             set     github: 65708
2017-03-07 15:53:47  serhiy.storchaka  set     status: open -> closed; resolution: out of date; messages: +; stage: resolved
2014-05-19 22:33:07  santoso.wijaya    set     nosy: + santoso.wijaya; messages: +
2014-05-17 16:17:42  Kristian.Benoit   set     files: + json.v2.patch
2014-05-17 15:07:00  Kristian.Benoit   set     files: + json.patch; keywords: + patch; messages: +
2014-05-16 04:45:26  cvrebert          set     nosy: + cvrebert; messages: +
2014-05-15 07:08:40  serhiy.storchaka  set     messages: +
2014-05-15 00:52:14  pitrou            set     nosy: + serhiy.storchaka
2014-05-14 21:49:58  vstinner          set     nosy: + vstinner; messages: +
2014-05-14 20:32:52  Kristian.Benoit   create