json_normalize does not normalize subrecords properly if any subrecords values are NoneType · Issue #20030 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import pandas as pd

# First record has no nested dict at all: 'info' is None and 'author_name' is absent.
data_fail_to_normalize = [
    {'info': None},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

# First record has a nested 'author_name', but 'info' is None.
data_partial_fail = [
    {'info': None,
     'author_name': {'first': 'Smith', 'last_name': 'Appleseed'}},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

pd.io.json.json_normalize(data_fail_to_normalize)
Output 1
| | author_name | info |
|---|---|---|
| 0 | nan | None |
| 1 | {'first': 'Jane', 'last_name': 'Doe'} | {'created_at': '11/08/1993', 'last_updated': '26/05/2012'} |
pd.io.json.json_normalize(data_partial_fail)
Output 2
| | author_name.first | author_name.last_name | info | info.created_at | info.last_updated |
|---|---|---|---|---|---|
| 0 | Smith | Appleseed | nan | nan | nan |
| 1 | Jane | Doe | nan | 11/08/1993 | 26/05/2012 |
Problem description
I expected json_normalize to account for None values in nested dictionaries. It currently does not, which leads to 2 separate problems (if I should open these as 2 separate issues, let me know).
I have already written a fix for this. If anyone else can confirm that this is not the intended behaviour, I can set up a PR.
Output 1
The records are not unnested at all when the first occurrence of a subrecord is None (see line 192 of pandas/io/json/normalize.py).
Output 2
The None value is kept when it is encountered, e.g. [{k: {'alpha': 'foo', 'beta': 'bar'}}, {k: None}] (see the nested_to_record function). This creates an additional column of nans that would not occur if that particular key were removed from the record.
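
The Output 1 behaviour is consistent with a check that inspects only the first record before deciding whether to flatten. The following is a minimal sketch of that idea in plain Python, not the pandas source: the helper names `first_record_has_dict` and `any_record_has_dict` are hypothetical, and the assumption that the decision is made from the first record alone is inferred from the description above.

```python
# Hypothetical helpers for illustration only; this is not pandas code.

def first_record_has_dict(records):
    """Decide whether to flatten by looking at the first record only."""
    return any(isinstance(value, dict) for value in records[0].values())


def any_record_has_dict(records):
    """Decide whether to flatten by looking at every record."""
    return any(
        isinstance(value, dict)
        for record in records
        for value in record.values()
    )


data_fail_to_normalize = [
    {'info': None},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

# The first record contains no dict, so a first-record check skips flattening.
print(first_record_has_dict(data_fail_to_normalize))  # False
# Checking all records finds the nested dicts in the second record.
print(any_record_has_dict(data_fail_to_normalize))    # True
```

Under this assumption, data_partial_fail passes the first-record check because its first record contains a nested author_name, which would explain why Output 2 is flattened while Output 1 is not.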
Expected Output
Output 1
| | author_name.first | author_name.last_name | info.created_at | info.last_updated |
|---|---|---|---|---|
| 0 | nan | nan | nan | nan |
| 1 | Jane | Doe | 11/08/1993 | 26/05/2012 |
Output 2
| | author_name.first | author_name.last_name | info.created_at | info.last_updated |
|---|---|---|---|---|
| 0 | Smith | Appleseed | nan | nan |
| 1 | Jane | Doe | 11/08/1993 | 26/05/2012 |
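
To make the expected tables concrete, here is a rough pure-Python sketch of a flattening step that produces them. It is not the fix mentioned in this issue and not pandas' nested_to_record; `flatten` and `normalize` are hypothetical names. It assumes every None value can simply be dropped, which is a simplification: a key that is None in every record would disappear entirely, and missing cells come back as None rather than nan.

```python
# Hypothetical sketch; not the proposed pandas fix.

def flatten(record, prefix="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = prefix + sep + key if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        elif value is not None:
            flat[name] = value
        # A None value is dropped, so it cannot create a stray column.
    return flat


def normalize(records):
    """Return (columns, rows); cells missing from a record become None."""
    flat_rows = [flatten(record) for record in records]
    columns = sorted({name for row in flat_rows for name in row})
    rows = [[row.get(column) for column in columns] for row in flat_rows]
    return columns, rows


data_partial_fail = [
    {'info': None,
     'author_name': {'first': 'Smith', 'last_name': 'Appleseed'}},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

columns, rows = normalize(data_partial_fail)
print(columns)
for row in rows:
    print(row)
```

Because every record is flattened independently and the column set is the union over all rows, the result does not depend on which record comes first, and no bare info column appears.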
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-104-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None