pandas.Index.factorize — pandas 1.0.1 documentation

Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().

Parameters

sort : bool, default False

Sort uniques and shuffle codes to maintain the relationship.

na_sentinel : int, default -1

Value to mark “not found”.

Returns

codes : ndarray

An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.

uniques : ndarray, Index, or Categorical

The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.

Note

Even if there’s a missing value in values, uniques will not contain an entry for it.
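The indexer relationship described under Returns can be checked directly. The following is a minimal sketch, assuming an input with no missing values; the variable names `values` and `reconstructed` are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative input with no missing values (an assumption of this sketch).
values = np.array(['b', 'b', 'a', 'c', 'b'], dtype=object)

codes, uniques = pd.factorize(values)

# Each code is a position in uniques, so taking those positions
# rebuilds the original sequence of values.
reconstructed = uniques.take(codes)

print(reconstructed.tolist())  # ['b', 'b', 'a', 'c', 'b']
```

The round trip holds here because every code is a valid position in uniques. When values contains missing entries, their codes are the sentinel rather than a real position, so the sentinel has to be handled separately.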

See also

cut

Discretize continuous-valued array.

unique

Find the unique values in an array.

Examples

These examples all show factorize as a top-level function, like pd.factorize(values). The results are identical for methods like Series.factorize().

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)

Missing values are indicated in codes with na_sentinel (-1 by default). Note that missing values are never included in uniques.

>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)
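Because the sentinel is not a real position in uniques, rebuilding the original values needs an explicit mask. The sketch below is one possible way to do that, assuming the default sentinel of -1 and an object-dtype NumPy array as input; the names `missing`, `taken`, and `restored` are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative input containing one missing value.
values = np.array(['b', None, 'a', 'c', 'b'], dtype=object)

codes, uniques = pd.factorize(values)

# Positions whose code is the default sentinel (-1) were missing.
missing = codes == -1

# NumPy treats -1 as "last element", so take() alone would silently
# return 'c' at the missing position; the mask overrides it with None.
taken = uniques.take(codes)
restored = np.where(missing, None, taken)

print(codes.tolist())     # [0, -1, 1, 2, 0]
print(restored.tolist())  # ['b', None, 'a', 'c', 'b']
```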

Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
[a, c]
Categories (3, object): [a, b, c]

Notice that 'b' is in uniques.categories, despite not being present in cat.values.

For all other pandas objects, an Index of the appropriate type is returned.

>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')
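Since this page documents Index.factorize(), the method form may be worth seeing as well. The following is a small sketch using an illustrative Index named `idx`; it relies only on the behavior described above, namely that a pandas object yields an Index of uniques.

```python
import pandas as pd

# Illustrative Index with repeated labels.
idx = pd.Index(['b', 'b', 'a', 'c', 'b'])

# Method form; equivalent in result to pd.factorize(idx).
codes, uniques = idx.factorize()

print(codes.tolist())               # [0, 0, 1, 2, 0]
print(list(uniques))                # ['b', 'a', 'c']
print(isinstance(uniques, pd.Index))  # True
```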