json_normalize does not normalize subrecords properly if any subrecords values are NoneType · Issue #20030 · pandas-dev/pandas

Code Sample, a copy-pastable example if possible

# First record has only a None subrecord; second record is fully nested.
data_fail_to_normalize = [
    {'info': None},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

# First record mixes a None subrecord with a nested one.
data_partial_fail = [
    {'info': None,
     'author_name': {'first': 'Smith', 'last_name': 'Appleseed'}},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

import pandas as pd

pd.io.json.json_normalize(data_fail_to_normalize)

Output 1

   author_name                            info
0  nan                                    None
1  {'first': 'Jane', 'last_name': 'Doe'}  {'created_at': '11/08/1993', 'last_updated': '26/05/2012'}

pd.io.json.json_normalize(data_partial_fail)

Output 2

   author_name.first  author_name.last_name  info  info.created_at  info.last_updated
0  Smith              Appleseed              nan   nan              nan
1  Jane               Doe                    nan   11/08/1993       26/05/2012

Problem description

I expected json_normalize to handle None values that appear where other records have a nested dictionary. It does not, which causes two separate problems (let me know if I should open these as two separate issues).

I have already written a fix that solves this issue - if anyone else can validate that this is not working as intended, I can set up a PR.

Output 1

json_normalize does not unnest the JSON at all when the first occurrence of a subrecord is None; see line 192 of pandas/io/json/normalize.py.
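To illustrate why Output 1 stays nested, here is a minimal sketch. It assumes the check at that line decides whether to flatten by looking only at the first record; the issue does not quote the code, so both helper functions below are hypothetical and are not pandas APIs.

```python
# Records from the report: the first one contains no dict values at all.
data_fail_to_normalize = [
    {'info': None},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]


def has_nested_first_only(records):
    # Assumed shape of the existing check: inspects records[0] only.
    return any(isinstance(v, dict) for v in records[0].values())


def has_nested_any(records):
    # Hypothetical alternative: inspects the values of every record.
    return any(isinstance(v, dict) for r in records for v in r.values())


first_only = has_nested_first_only(data_fail_to_normalize)  # False
any_record = has_nested_any(data_fail_to_normalize)         # True
```

Under that assumption, a first record whose only value is None makes the whole input look flat, so no unnesting is attempted for the later records either.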

Output 2

The nested_to_record function keeps a None value as-is when it encounters one, for example with input [{k: {'alpha': 'foo', 'beta': 'bar'}}, {k: None}]. This creates an additional column of nans (the bare info column above) that would not exist if that key were dropped from the record holding None.

Expected Output

Output 1

   author_name.first  author_name.last_name  info.created_at  info.last_updated
0  nan                nan                    nan              nan
1  Jane               Doe                    11/08/1993       26/05/2012

Output 2

   author_name.first  author_name.last_name  info.created_at  info.last_updated
0  Smith              Appleseed              nan                nan
1  Jane               Doe                    11/08/1993       26/05/2012
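Until this is fixed, the expected outputs above can be approximated by flattening the records before building the frame. The following is a minimal sketch of such a workaround, not the fix mentioned in this report: flatten_record and flatten_skipping_none_subrecords are hypothetical helper names, and only top-level None subrecords are handled.

```python
def flatten_record(record, sep=".", parent=""):
    """Flatten one nested dict into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else str(key)
        if isinstance(value, dict):
            flat.update(flatten_record(value, sep=sep, parent=name))
        else:
            flat[name] = value
    return flat


def flatten_skipping_none_subrecords(records, sep="."):
    """Flatten every record, dropping None where other records hold a dict."""
    # Top-level keys that hold a dict in at least one record are subrecords.
    nested_keys = {k for r in records for k, v in r.items()
                   if isinstance(v, dict)}
    flat_records = []
    for r in records:
        # Drop a None value only when the same key is a subrecord elsewhere.
        cleaned = {k: v for k, v in r.items()
                   if not (v is None and k in nested_keys)}
        flat_records.append(flatten_record(cleaned, sep=sep))
    return flat_records


data_partial_fail = [
    {'info': None,
     'author_name': {'first': 'Smith', 'last_name': 'Appleseed'}},
    {'info': {'created_at': '11/08/1993', 'last_updated': '26/05/2012'},
     'author_name': {'first': 'Jane', 'last_name': 'Doe'}},
]

flat = flatten_skipping_none_subrecords(data_partial_fail)
```

Passing the resulting list of flat dicts to pd.DataFrame should fill the missing keys with NaN and leave no bare info column, although the column order may differ from the tables above.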

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-104-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.13.1
scipy: None
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None