SparseDataFrame constructor has horrible performance for df with many columns

Code Sample

This example is taken directly from the docs, except that I've changed the sparsity of the arrays from 90% to 99%.

    import pandas as pd
    from scipy.sparse import csr_matrix
    import numpy as np

    arr = np.random.random(size=(1000, 5))
    arr[arr < .99] = 0
    sp_arr = csr_matrix(arr)
    %timeit sdf = pd.SparseDataFrame(sp_arr)

 4.78 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now, here's what happens when I increase the number of columns from 5 to 2000:

    import pandas as pd
    from scipy.sparse import csr_matrix
    import numpy as np

    arr = np.random.random(size=(1000, 2000))
    arr[arr < .99] = 0
    sp_arr = csr_matrix(arr)
    %timeit sdf = pd.SparseDataFrame(sp_arr)

8.69 s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Note that initializing the scipy.sparse.csr_matrix object itself is way (!!!) faster:

    import pandas as pd
    from scipy.sparse import csr_matrix
    import numpy as np

    arr = np.random.random(size=(1000, 2000))
    arr[arr < .99] = 0
    %timeit sp_arr = csr_matrix(arr)

13 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Problem description

The construction of a SparseDataFrame with many columns is ridiculously slow. I've traced the problem to this line in the SparseDataFrame._init_dict() function. I don't know why the data frame is constructed by assigning individual columns of a DataFrame object one at a time. I think DataFrame._init_dict uses a much more efficient approach.
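To make the two construction strategies concrete, here is a minimal sketch that builds the same frame both ways. It is not the pandas internals: `build_column_by_column` and `build_from_dict` are hypothetical helper names, and a plain `DataFrame` stands in for `SparseDataFrame` (which was removed in later pandas versions), so it only illustrates the pattern described above.

```python
import numpy as np
import pandas as pd


def build_column_by_column(arr):
    """Assign each column individually to an initially empty DataFrame.

    Hypothetical helper: mimics the per-column assignment pattern only.
    """
    df = pd.DataFrame(index=range(arr.shape[0]))
    for i in range(arr.shape[1]):
        df[i] = arr[:, i]
    return df


def build_from_dict(arr):
    """Hand all columns to the DataFrame constructor in a single call.

    Hypothetical helper: the constructor sees every column at once.
    """
    return pd.DataFrame({i: arr[:, i] for i in range(arr.shape[1])})


# Same kind of data as in the report, with fewer columns to keep it quick.
arr = np.random.random(size=(1000, 50))
arr[arr < .99] = 0

df_slow = build_column_by_column(arr)
df_fast = build_from_dict(arr)

# Both strategies should produce the same values.
assert np.array_equal(df_slow.to_numpy(), df_fast.to_numpy())
```

Timing each helper with `%timeit` while increasing the number of columns should show whether the per-column path scales worse on a given pandas version.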

Output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.1
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None
