SparseDataFrame constructor has horrible performance for df with many columns
Code Sample
This example is taken directly from the docs, except that I've changed the sparsity of the arrays from 90% to 99%.
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.random(size=(1000, 5))
arr[arr < .99] = 0
sp_arr = csr_matrix(arr)
%timeit sdf = pd.SparseDataFrame(sp_arr)
4.78 ms ± 381 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now, here's what happens when I increase the number of columns from 5 to 2000:
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.random(size=(1000, 2000))
arr[arr < .99] = 0
sp_arr = csr_matrix(arr)
%timeit sdf = pd.SparseDataFrame(sp_arr)
8.69 s ± 208 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note that initializing the scipy.sparse.csr_matrix object itself is way (!!!) faster:
import pandas as pd
from scipy.sparse import csr_matrix
import numpy as np

arr = np.random.random(size=(1000, 2000))
arr[arr < .99] = 0
%timeit sp_arr = csr_matrix(arr)
13 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Problem description
Constructing a SparseDataFrame with many columns is extremely slow: about 8.7 s for the 1000 × 2000 example above, versus 13 ms to build the csr_matrix itself. I've traced the problem to this line in the SparseDataFrame._init_dict() function. I don't know why the data frame is built by assigning individual columns to a DataFrame object one at a time. I think the DataFrame._init_dict method takes a much more efficient approach.
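To illustrate the pattern I suspect is at fault, here is a rough sketch that compares inserting columns one at a time with passing all columns to the constructor in a single call. It uses an ordinary dense DataFrame, not the SparseDataFrame internals, so it only approximates what _init_dict does. The sizes and the two helper function names are hypothetical, chosen just for this comparison.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical sizes for illustration (fewer columns than the 2000 used above).
n_rows, n_cols = 1000, 200
rng = np.random.RandomState(0)
columns = {"c%d" % i: rng.random_sample(n_rows) for i in range(n_cols)}
index = pd.RangeIndex(n_rows)


def build_by_column_assignment(cols, index):
    """Insert columns one at a time into an initially empty frame."""
    frame = pd.DataFrame(index=index)
    for name, values in cols.items():
        frame[name] = values
    return frame


def build_in_one_call(cols, index):
    """Hand all columns to the DataFrame constructor in a single call."""
    return pd.DataFrame(cols, index=index)


start = time.perf_counter()
slow = build_by_column_assignment(columns, index)
t_loop = time.perf_counter() - start

start = time.perf_counter()
fast = build_in_one_call(columns, index)
t_bulk = time.perf_counter() - start

print("column-by-column: %.4f s, single call: %.4f s" % (t_loop, t_bulk))
```

Both helpers produce the same data, so any timing gap comes from how the frame is assembled rather than from the values. The actual numbers will depend on the pandas version and the machine.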
Output of pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 36.0.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 6.1.0
sphinx: 1.6.1
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: 4.6.0
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None