BUG: Series.rank(pct=True).max() != 1 for a large series of floats · Issue #18271 · pandas-dev/pandas
Code Sample, a copy-pastable example if possible
import numpy as np
import pandas as pd

rs = np.random.RandomState(seed=0)

len_df = 20000000
df = pd.DataFrame(data=rs.rand(len_df), columns=['abc']).sort_values('abc')

df['Rank'] = df['abc'].rank()
df['Rank_Pct'] = df['abc'].rank(pct=True)
df['Rank_Pct_Manual'] = df['Rank'] / len_df

df.describe()
Output:
                abc          Rank      Rank_Pct  Rank_Pct_Manual
count  2.000000e+07  2.000000e+07  2.000000e+07     2.000000e+07
mean   4.999223e-01  1.000000e+07  5.960465e-01     5.000000e-01
std    2.886891e-01  5.773503e+06  3.441276e-01     2.886751e-01
min    1.036192e-08  1.000000e+00  5.960464e-08     5.000000e-08
25%    2.498756e-01  5.000001e+06  2.980233e-01     2.500000e-01
50%    4.999781e-01  1.000000e+07  5.960465e-01     5.000000e-01
75%    7.499111e-01  1.500000e+07  8.940697e-01     7.500000e-01
max    1.000000e+00  2.000000e+07  1.192093e+00     1.000000e+00
Problem description
I have a set of 20 million floats, and I am trying to follow this StackOverflow example. That answer discusses calculating the percentile ranking of a column either by using the pct=True option of the rank() function, or by manually dividing the output of rank() by the length of the Series.
I noticed that the former values have a maximum that is not 1, while the latter have the expected maximum of 1.
I have tried this with the latest (0.21.0) version of pandas, and can replicate it with an array of random floats.
It seems to be related to the number of rows being greater than 2^24 (16,777,216): you can see this by comparing the output when len_df=16770000 and len_df=16780000.
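The two lengths quoted above sit on either side of 2^24 = 16,777,216, which is the largest value up to which a 32-bit float can represent every integer exactly. One possible reading (an assumption on my part, not something confirmed in this report) is that a count is being held in single precision somewhere and stops growing at that value. A minimal sketch of that arithmetic, using only NumPy:

```python
import numpy as np

# Assumption for illustration: a running count is stored as float32.
limit = np.float32(2**24)          # 16,777,216

# Adding 1 in float32 rounds straight back to 2**24, so the count stalls.
stuck = limit + np.float32(1)
count_stalls = bool(stuck == limit)

# Dividing the largest rank (20,000,000) by a count stalled at 2**24
# gives a ratio above 1.
ratio = 20_000_000 / 2**24

print(count_stalls, round(ratio, 6))
```

The ratio comes out at roughly 1.192093, which lines up with the reported maximum of Rank_Pct, and 1 / 2^24 is roughly 5.96e-08, which lines up with its minimum.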
I believe the responsible code is pandas/_libs/algos.pyx/rank_1d_float64().
I'm working on a PR now.
Expected Output
I would expect the values of Rank_Pct and Rank_Pct_Manual to be the same, and that the maximum of both should be 1.
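As a point of comparison, the expectation above can be checked on a Series small enough to stay well below the problematic size. This is only a sketch: the length of 1000 and the variable names are my own choices, and the data contains no NaN values or ties.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(seed=0)
s = pd.Series(rs.rand(1000))       # small, NaN-free sample (illustrative size)

rank_pct = s.rank(pct=True)
rank_pct_manual = s.rank() / len(s)

# Both approaches should agree, and the top-ranked value should map to 1.
same = bool(np.allclose(rank_pct, rank_pct_manual))
max_is_one = bool(rank_pct.max() == 1.0)

print(same, max_is_one)
```

Running the same comparison with len_df on either side of the threshold described above would show where the two columns start to diverge.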
Output of pd.show_versions()
commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.3.0
Cython: None
numpy: 1.13.3
scipy: 0.19.1
pyarrow: 0.7.1
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: 1.1.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None