drop_duplicates destroys non-duplicated data under 0.17 (original) (raw)

The drop_duplicates() function in Python 3 is broken. Take the following example snippet:

import pandas as pd

raw_data = {'x': [7,6,3,3,4,8,0],'y': [0,6,5,5,9,1,2]}
df = pd.DataFrame(raw_data, columns = ['x', 'y'])

print("Before:", df)
df = df.drop_duplicates()
print("After:", df)

When run under python 2, the results are correct, but when running under python 3, pandas removes 6,6 from the frame, which is a completely unique row. When using this function with large CSV files, it causes thousands of lines of unique data loss.

See:
http://stackoverflow.com/questions/33224356/why-is-pandas-dropping-unique-rows