pandas.Index.factorize — pandas 0.24.0rc1 documentation
Index.factorize(sort=False, na_sentinel=-1)
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().
Parameters:
    sort : boolean, default False
        Sort uniques and shuffle labels to maintain the relationship.
    na_sentinel : int, default -1
        Value to mark “not found”.

Returns:
    labels : ndarray
        An integer ndarray that’s an indexer into uniques. uniques.take(labels) will have the same values as values.
    uniques : ndarray, Index, or Categorical
        The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.

        Note: Even if there’s a missing value in values, uniques will not contain an entry for it.
See also
Discretize continuous-valued array.
Find the unique value in an array.
Examples
These examples all show factorize as a top-level function, as in pd.factorize(values). The results are identical for methods like Series.factorize().
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> labels
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
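Since this page documents the Index method, the same call can be sketched on an Index, together with the round trip described under Returns (uniques.take(labels) reproducing the input). This is a minimal illustration; the names idx and restored are introduced here and are not part of the original page.

```python
import pandas as pd

# Hypothetical input: the same values as above, wrapped in an Index.
idx = pd.Index(['b', 'b', 'a', 'c', 'b'])

# Method form of factorize; uniques is an Index because the input is one.
labels, uniques = idx.factorize()

# labels index into uniques, so take() rebuilds the original values.
restored = uniques.take(labels)
```

With no missing values in the input, restored should hold the same values as idx, in the same order.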
With sort=True, the uniques will be sorted, and labels will be shuffled so that the relationship is maintained.
>>> labels, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> labels
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)
Missing values are indicated in labels with na_sentinel (-1 by default). Note that missing values are never included in uniques.
>>> labels, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> labels
array([ 0, -1, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
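One practical consequence of the sentinel is worth sketching: when uniques is a NumPy array, indexing it with -1 selects the last element, so a naive round trip would silently turn missing entries into a real value. The snippet below is one possible way to rebuild the input while keeping the gaps. It assumes the default sentinel of -1, and the names values, missing and restored are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical input with one missing entry, passed as an object ndarray.
values = np.array(['b', None, 'a', 'c', 'b'], dtype=object)
labels, uniques = pd.factorize(values)

# Positions that received the sentinel (assumed to be the default, -1).
missing = labels == -1

# Fill valid positions from uniques and leave the missing ones as None.
restored = np.full(len(labels), None, dtype=object)
restored[~missing] = uniques[labels[~missing]]
```

Masking before indexing avoids relying on how a negative index happens to be interpreted.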
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]
Notice that 'b' is in uniques.categories, despite not being present in cat.values.
For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> labels, uniques = pd.factorize(cat)
>>> labels
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
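As a further sketch of the "Index of the appropriate type" behaviour, the example below calls the method directly on an integer Index with sort=True. The input values are made up for illustration, and the exact Index subclass or dtype shown in the repr may vary between pandas versions.

```python
import pandas as pd

# Hypothetical integer Index with a repeated value.
idx = pd.Index([3, 1, 3, 2])

# sort=True orders the uniques and remaps labels to match.
labels, uniques = idx.factorize(sort=True)

# Each label still points at its original value in the sorted uniques.
restored = uniques.take(labels)
```

Here uniques is expected to be an integer Index holding 1, 2, 3, and labels index into that sorted order rather than the order of first appearance.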