Use np.random's RandomState when seed is None by ariddell · Pull Request #13161 · pandas-dev/pandas

If I move the seed() call after instantiating the dataframe, I still get inconsistent behavior with calls to sample(), except when I provide the random_state arg. Example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> np.random.seed(12345678)
>>> df.sample(n=2)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2)
          a         b         c         d         e
4  0.895497  0.609853 -1.548664 -1.238415 -1.058904
5  0.196420  0.472877 -0.918205  1.019862 -0.631993
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679

Note that this time the first call to sample() uses the seed, but the second call does not. Is it expected that seed() needs to be called before every sample() call? I thought the seed was supposed to be "set once", after which all future randomization-related calls would use it (including in my original example, where randn() is called after seed() and before sample()).

(Also note that I did verify that calling seed() before every call to sample() does indeed produce the same sampled rows.)
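
For reference, the reseed-before-every-call check can be sketched with NumPy alone. This is a minimal sketch, not pandas' actual code path: it assumes that sample() without random_state draws from NumPy's global generator, and it uses np.random.choice as a stand-in for the row selection. The helper name draw() and the constants are hypothetical.

```python
import numpy as np

SEED = 12345678
N_ROWS = 10  # same number of rows as the DataFrame in the example


def draw(n=2):
    # Hypothetical stand-in for df.sample(n=2): pick n distinct row
    # positions from NumPy's global generator (an assumption about
    # what sample() does when random_state is None).
    return np.random.choice(N_ROWS, size=n, replace=False)


# Seed once: every draw consumes generator state, so consecutive
# draws continue one seeded sequence rather than restarting it.
np.random.seed(SEED)
first = draw()
second = draw()

# Reseed and repeat: the whole sequence of draws is reproduced.
np.random.seed(SEED)
first_again = draw()
second_again = draw()

# A dedicated RandomState built from the same seed starts from the
# same state, so its first draw matches the first global draw.
local = np.random.RandomState(SEED).choice(N_ROWS, size=2, replace=False)

print(first, second)
print(first_again, second_again)
print(local)
```

Under these assumptions, seeding once makes the sequence of draws reproducible across runs, while passing the seed to a fresh RandomState (as random_state=12345678 presumably does) restarts from the seeded state on every call.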