WIP/DO NOT MERGE: Categorical improvements by jankatins · Pull Request #7444 · pandas-dev/pandas
This is a PR to make discussing the doc changes easier. See #7217 for the main PR
Categorical¶
New in version 0.15.
Note
While pandas.Categorical existed in earlier versions, the ability to use categorical data in Series and DataFrame is new.
This is a short introduction to the pandas Categorical type, including a short comparison with R’s factor.
Categoricals are a pandas data type corresponding to categorical variables in statistics: a variable that can take on only a limited, and usually fixed, number of possible values (commonly called levels). Examples are gender, social class, blood type, country affiliation, observation time or ratings via Likert scales.
In contrast to statistical categorical variables, a Categorical might have an order (e.g. ‘strongly agree’ vs ‘agree’ or ‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, ...) are not possible.
All values of the Categorical are either in levels or np.nan. Order is defined by the order of the levels, not the lexical order of the values. Internally, the data structure consists of a levels array and an integer array of level_codes which point to the real values in the levels array.
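As a rough sketch in plain numpy (an illustration of the idea, not the actual pandas internals), the split into a small levels array and compact integer codes looks like this:
import numpy as np
levels = np.array(["low", "medium", "high"], dtype=object)  # the possible values
level_codes = np.array([0, 2, 2, 1, 0], dtype=np.int8)      # one small integer per entry
levels[level_codes]                                         # decodes back to the real values
# -> array(['low', 'high', 'high', 'medium', 'low'], dtype=object)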
Categoricals are useful in the following cases:
- A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.
- The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the levels, sorting and min/max will use the logical order instead of the lexical order.
- As a signal to other python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types)
See also the API docs on Categoricals.
Object Creation¶
Categorical Series or columns in a DataFrame can be created in several ways:
By passing a Categorical object to a Series or assigning it to a DataFrame:
In [1]: raw_cat = pd.Categorical(["a","b","c","a"])
In [2]: s = pd.Series(raw_cat)
In [3]: s
Out[3]:
0 a
1 b
2 c
3 a
dtype: category
In [4]: df = pd.DataFrame({"A":["a","b","c","a"]})
In [5]: df["B"] = raw_cat
In [6]: df
Out[6]:
A B
0 a a
1 b b
2 c c
3 a a
By converting an existing Series or column to a category type:
In [7]: df = pd.DataFrame({"A":["a","b","c","a"]})
In [8]: df["B"] = df["A"].astype('category')
In [9]: df
Out[9]:
A B
0 a a
1 b b
2 c c
3 a a
By using some special functions:
In [10]: df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
In [11]: labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
In [12]: df['group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
In [13]: df.head(10)
Out[13]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
Categoricals have a specific category dtype:
In [14]: df.dtypes
Out[14]:
value       int32
group    category
dtype: object
Note
In contrast to R’s factor function, a Categorical does not convert input values to strings; the levels end up being the same data type as the original values.
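As a quick, hedged check (the exact repr may differ in this WIP):
pd.Categorical([1, 2, 3]).levels   # expected to be an Int64Index, not strings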
Note
In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use levels to change the levels after creation.
To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical):
In [15]: s = pd.Series(["a","b","c","a"])
In [16]: s
Out[16]:
0 a
1 b
2 c
3 a
dtype: object
In [17]: s2 = s.astype('category')
In [18]: s2
Out[18]:
0 a
1 b
2 c
3 a
dtype: category
In [19]: s3 = s2.astype('string')
In [20]: s3
Out[20]:
0 a
1 b
2 c
3 a
dtype: object
In [21]: np.asarray(s2.cat)
Out[21]: array(['a', 'b', 'c', 'a'], dtype=object)
Working with levels¶
Categoricals have a levels property, which lists their possible values. If you don’t manually specify levels, they are inferred from the passed-in values. Series of type category expose the same interface via their cat property.
In [22]: raw_cat = pd.Categorical(["a","b","c","a"])
In [23]: raw_cat.levels
Out[23]: Index([u'a', u'b', u'c'], dtype='object')
In [24]: raw_cat.ordered
Out[24]: True
Series of type "category" also expose this interface via the .cat property:
In [25]: s = pd.Series(raw_cat)
In [26]: s.cat.levels
Out[26]: Index([u'a', u'b', u'c'], dtype='object')
In [27]: s.cat.ordered
Out[27]: True
Note
A new Categorical is automatically ordered if the passed-in values are sortable or a levels argument is supplied. This differs from R’s factors, which are unordered unless explicitly told to be ordered (ordered=TRUE).
It’s also possible to pass in the levels in a specific order:
In [28]: raw_cat = pd.Categorical(["a","b","c","a"], levels=["c","b","a"])
In [29]: s = pd.Series(raw_cat)
In [30]: s.cat.levels
Out[30]: Index([u'c', u'b', u'a'], dtype='object')
In [31]: s.cat.ordered
Out[31]: True
Note
Passing in a levels argument implies ordered=True.
Any value omitted in the levels argument will be replaced by np.nan:
In [32]: raw_cat = pd.Categorical(["a","b","c","a"], levels=["a","b"])
In [33]: s = pd.Series(raw_cat)
In [34]: s.cat.levels
Out[34]: Index([u'a', u'b'], dtype='object')
In [35]: s
Out[35]:
0 a
1 b
2 NaN
3 a
dtype: category
Renaming levels is done by assigning new values to the Categorical.levels or Series.cat.levels property:
In [36]: s = pd.Series(pd.Categorical(["a","b","c","a"]))
In [37]: s
Out[37]:
0 a
1 b
2 c
3 a
dtype: category
In [38]: s.cat.levels = ["Group %s" % g for g in s.cat.levels]
In [39]: s
Out[39]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
In [40]: s.cat.levels = [1,2,3]
In [41]: s
Out[41]:
0 1
1 2
2 3
3 1
dtype: category
Note
In contrast to R’s factor function, a Categorical can have levels of types other than string.
Levels must be unique or a ValueError is raised:
In [42]: try:
....: s.cat.levels = [1,1,1]
....: except ValueError as e:
....: print("ValueError: " + str(e))
....:
ValueError: Categorical levels must be unique
Appending a level can be done by assigning a levels list longer than the current levels:
In [43]: s.cat.levels = [1,2,3,4]
In [44]: s.cat.levels
Out[44]: Int64Index([1, 2, 3, 4], dtype='int64')
In [45]: s
Out[45]:
0 1
1 2
2 3
3 1
dtype: category
Removing a level is also possible, but only the last level(s) can be removed, by assigning a list shorter than the current levels. Values which are omitted are replaced by np.nan.
In [46]: s.cat.levels = [1,2]
In [47]: s
Out[47]:
0 1
1 2
2 NaN
3 1
dtype: category
Note
It’s only possible to remove or add a level at the last position. If that’s not where you want to remove an old or add a new level, use the Categorical.reorder_levels(new_order) or Series.cat.reorder_levels(new_order) methods before or afterwards.
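For example, to drop a level that is not currently in the last position, one hedged approach following this note (using the API as described above) is:
s = pd.Series(pd.Categorical(["a","b","c","a"]))  # levels are ["a","b","c"]
s.cat.reorder_levels(["a","c","b"])               # move the unwanted level "b" to the end
s.cat.levels = ["a","c"]                          # drop it; former "b" values become np.nan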
Removing unused levels can also be done:
In [48]: raw = pd.Categorical(["a","b","a"], levels=["a","b","c","d"])
In [49]: c = pd.Series(raw)
In [50]: raw
Out[50]:
a
b
a
Levels (4): Index(['a', 'b', 'c', 'd'], dtype=object), ordered
In [51]: raw.remove_unused_levels()
In [52]: raw
Out[52]:
a
b
a
Levels (2): Index(['a', 'b'], dtype=object), ordered
In [53]: c.cat.remove_unused_levels()
In [54]: c
Out[54]:
0 a
1 b
2 a
dtype: category
Note
In contrast to R’s factor function, passing a Categorical as the sole input to the Categorical constructor will not remove unused levels but create a new Categorical which is equal to the passed in one!
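A small hedged sketch of this behaviour:
raw = pd.Categorical(["a","b","a"], levels=["a","b","c","d"])
again = pd.Categorical(raw)       # re-wrap the existing Categorical
again.levels                      # expected to still contain the unused levels "c" and "d"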
Ordered or not...¶
If a Categorical is ordered (cat.ordered == True), the order of the levels has a meaning and certain operations are possible. If the Categorical is unordered, these operations raise a TypeError.
In [55]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
In [56]: try:
....: s.sort()
....: except TypeError as e:
....: print("TypeError: " + str(e))
....:
TypeError: Categorical not ordered
In [57]: s = pd.Series(pd.Categorical(["a","b","c","a"], ordered=True))
In [58]: s.sort()
In [59]: s
Out[59]:
0 a
3 a
1 b
2 c
dtype: category
In [60]: print(s.min(), s.max())
('a', 'c')
Note
ordered=True is not strictly needed in the second case, as lists of strings are sortable and so the resulting Categorical is ordered anyway.
Sorting will use the order defined by levels, not any lexical order present on the data type. This is even true for strings and numeric data:
In [61]: s = pd.Series(pd.Categorical([1,2,3,1]))
In [62]: s.cat.levels = [2,3,1]
In [63]: s
Out[63]:
0 2
1 3
2 1
3 2
dtype: category
In [64]: s.sort()
In [65]: s
Out[65]:
0 2
3 2
1 3
2 1
dtype: category
In [66]: print(s.min(), s.max())
(2, 1)
Reordering the levels is possible via the Categorical.reorder_levels(new_levels) or Series.cat.reorder_levels(new_levels) methods:
In [67]: s2 = pd.Series(pd.Categorical([1,2,3,1]))
In [68]: s2.cat.reorder_levels([2,3,1])
In [69]: s2
Out[69]:
0 1
1 2
2 3
3 1
dtype: category
In [70]: s2.sort()
In [71]: s2
Out[71]:
1 2
2 3
0 1
3 1
dtype: category
In [72]: print(s2.min(), s2.max())
(2, 1)
Note
Note the difference between assigning new level names and reordering the levels: the first renames the levels and therefore also the individual values in the Series, but keeps the sort order (a renamed value that sorted last still sorts last). Reordering changes how the values are sorted afterwards, but does not change the individual values in the Series.
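A compact, hedged side-by-side sketch of the two operations:
s = pd.Series(pd.Categorical(["a","b","a"]))
s.cat.levels = ["x","y"]               # renaming: the values themselves become "x","y","x"
s2 = pd.Series(pd.Categorical(["a","b","a"]))
s2.cat.reorder_levels(["b","a"])       # reordering: values stay "a","b","a", but "b" now sorts first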
Operations¶
The following operations are possible with categorical data:
Getting the minimum and maximum, if the categorical is ordered:
In [73]: s = pd.Series(pd.Categorical(["a","b","c","a"], levels=["c","a","b","d"]))
In [74]: print(s.min(), s.max())
('c', 'b')
Note
If the Categorical is not ordered, Categorical.min() and Categorical.max() and the corresponding operations on Series will raise TypeError.
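A hedged illustration of this note (the exact error message may differ):
s_unordered = pd.Series(pd.Categorical(["a","b","c","a"], ordered=False))
try:
    s_unordered.min()
except TypeError as e:
    print("TypeError: " + str(e))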
The mode:
In [75]: raw_cat = pd.Categorical(["a","b","c","c"], levels=["c","a","b","d"])
In [76]: s = pd.Series(raw_cat)
In [77]: raw_cat.mode()
Out[77]:
c
Levels (4): Index(['c', 'a', 'b', 'd'], dtype=object), ordered
In [78]: s.mode()
Out[78]:
0 c
dtype: category
Note
Numeric operations like +, -, *, / and operations based on them (e.g. .median(), which would need to compute the mean between two values if the length of an array is even) do not work and raise a TypeError.
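For example (a hedged sketch; the exact error wording may differ):
s = pd.Series(pd.Categorical([1, 2, 3, 1]))
try:
    s + s          # numeric operations are expected to raise TypeError
except TypeError as e:
    print("TypeError: " + str(e))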
Series methods like Series.value_counts() will use all levels, even if some levels are not present in the data:
In [79]: s = pd.Series(pd.Categorical(["a","b","c","c"], levels=["c","a","b","d"]))
In [80]: s.value_counts()
Out[80]:
c 2
b 1
a 1
d 0
dtype: int64
Groupby will also show “unused” levels:
In [81]: cats = pd.Categorical(["a","b","b","b","c","c","c"], levels=["a","b","c","d"])
In [82]: df = pd.DataFrame({"cats":cats,"values":[1,2,2,2,3,4,5]})
In [83]: df.groupby("cats").mean()
Out[83]:
values
cats
a 1
b 2
c 4
d NaN
In [84]: cats2 = pd.Categorical(["a","a","b","b"], levels=["a","b","c"])
In [85]: df2 = pd.DataFrame({"cats":cats2,"B":["c","d","c","d"], "values":[1,2,3,4]})
This doesn't work yet with two columns -> see failing unittests
In [86]: df2.groupby(["cats","B"]).mean()
Out[86]:
values
cats B
a c 1
d 2
b c 3
d 4
Pivot tables:
In [87]: raw_cat = pd.Categorical(["a","a","b","b"], levels=["a","b","c"])
In [88]: df = pd.DataFrame({"A":raw_cat,"B":["c","d","c","d"], "values":[1,2,3,4]})
In [89]: pd.pivot_table(df, values='values', index=['A', 'B'])
Out[89]:
A B
a c 1
d 2
b c 3
d 4
Name: values, dtype: int64
Data munging¶
The optimized pandas data access methods .loc, .iloc, .ix, .at, and .iat work as normal; the only differences are the return type (for getting) and that only values already in the levels can be assigned.
Getting¶
If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.
In [90]: cats = pd.Categorical(["a","b","b","b","c","c","c"], levels=["a","b","c"])
In [91]: idx = pd.Index(["h","i","j","k","l","m","n",])
In [92]: values= [1,2,2,2,3,4,5]
In [93]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [94]: df.iloc[2:4,:]
Out[94]:
cats values
j b 2
k b 2
In [95]: df.iloc[2:4,:].dtypes
Out[95]:
cats category
values int64
dtype: object
In [96]: df.loc["h":"j","cats"]
Out[96]:
h a
i b
j b
Name: cats, dtype: category
In [97]: df.ix["h":"j",0:1]
Out[97]:
cats
h a
i b
j b
In [98]: df[df["cats"] == "b"]
Out[98]:
cats values
i b 2
j b 2
k b 2
An example where the Categorical is not preserved is if you take one single row: the resulting Series is of dtype object:
get the complete "h" row as a Series
In [99]: df.loc["h", :]
Out[99]:
cats      a
values    1
Name: h, dtype: object
Returning a single item from a Categorical will also return the value, not a Categorical of length “1”.
In [100]: df.iat[0,0]
Out[100]: 'a'
In [101]: df["cats"].cat.levels = ["x","y","z"]
In [102]: df.at["h","cats"] # returns a string
Out[102]: 'x'
Note
This differs from R’s factor function, where factor(c(1,2,3))[1] returns a single-value factor.
To get a single value Series of type category pass in a single value list:
In [103]: df.loc[["h"],"cats"]
Out[103]:
h    x
Name: cats, dtype: category
Setting¶
Setting values in a categorical column (or Series) works as long as the value is included in the levels:
In [104]: cats = pd.Categorical(["a","a","a","a","a","a","a"], levels=["a","b"])
In [105]: idx = pd.Index(["h","i","j","k","l","m","n"])
In [106]: values = [1,1,1,1,1,1,1]
In [107]: df = pd.DataFrame({"cats":cats,"values":values}, index=idx)
In [108]: df.iloc[2:4,:] = [["b",2],["b",2]]
In [109]: df
Out[109]:
cats values
h a 1
i a 1
j b 2
k b 2
l a 1
m a 1
n a 1
In [110]: try:
.....: df.iloc[2:4,:] = [["c",3],["c",3]]
.....: except ValueError as e:
.....: print("ValueError: " + str(e))
.....:
ValueError: cannot setitem on a Categorical with a new level, set the levels first
Setting values by assigning a Categorical will also check that the levels match:
In [111]: df.loc["j":"k","cats"] = pd.Categorical(["a","a"], levels=["a","b"])
In [112]: df
Out[112]:
cats values
h a 1
i a 1
j a 2
k a 2
l a 1
m a 1
n a 1
In [113]: try:
.....: df.loc["j":"k","cats"] = pd.Categorical(["b","b"], levels=["a","b","c"])
.....: except ValueError as e:
.....: print("ValueError: " + str(e))
.....:
ValueError: cannot set a Categorical with another, without identical levels
Assigning a Categorical to parts of a column of other types will use the values:
In [114]: df = pd.DataFrame({"a":[1,1,1,1,1], "b":["a","a","a","a","a"]})
In [115]: df.loc[1:2,"a"] = pd.Categorical(["b","b"], levels=["a","b"])
In [116]: df.loc[2:3,"b"] = pd.Categorical(["b","b"], levels=["a","b"])
In [117]: df
Out[117]:
a b
0 1 a
1 b a
2 b b
3 1 b
4 1 a
In [118]: df.dtypes
Out[118]:
a object
b object
dtype: object
Merging¶
You can concat two DataFrames containing categorical data together, but the levels of these Categoricals need to be the same:
In [119]: cat = pd.Categorical(["a","b"], levels=["a","b"])
In [120]: vals = [1,2]
In [121]: df = pd.DataFrame({"cats":cat, "vals":vals})
In [122]: res = pd.concat([df,df])
In [123]: res
Out[123]:
cats vals
0 a 1
1 b 2
0 a 1
1 b 2
In [124]: res.dtypes
Out[124]:
cats category
vals int64
dtype: object
In [125]: df_different = df.copy()
In [126]: df_different["cats"].cat.levels = ["a","b","c"]
In [127]: try:
.....: pd.concat([df,df_different])
.....: except ValueError as e:
.....: print("ValueError: " + str(e))
.....:
The same applies to df.append(df).
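A minimal hedged sketch of the append case, reusing df and df_different from above:
res2 = df.append(df)             # works: both halves share identical levels
res2.dtypes                      # "cats" is expected to stay of type category
try:
    df.append(df_different)      # levels differ, expected to raise
except ValueError as e:
    print("ValueError: " + str(e))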
Getting Data In/Out¶
Writing data (Series, DataFrames) to an HDF store and reading it back in its entirety works. Querying the HDF store does not yet work.
In [128]: hdf_file = "test.h5"
In [129]: s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'c'], levels=['a','b','c','d']))
In [130]: df = pd.DataFrame({"s":s, "vals":[1,2,3,4,5,6]})
In [131]: df.to_hdf(hdf_file, "frame")
In [132]: df2 = pd.read_hdf(hdf_file, "frame")
In [133]: df2
Out[133]:
s vals
0 a 1
1 b 2
2 b 3
3 a 4
4 a 5
5 c 6
In [134]: try:
.....: pd.read_hdf(hdf_file, "frame", where = ['index>2'])
.....: except TypeError as e:
.....: print("TypeError: " + str(e))
.....:
TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety
Writing to a CSV file will convert the data, effectively removing any information about the Categorical (levels and ordering). So if you read the CSV file back, you have to convert the relevant columns back to category and assign the correct levels and level ordering.
In [135]: s = pd.Series(pd.Categorical(['a', 'b', 'b', 'a', 'a', 'd']))
rename the levels
In [136]: s.cat.levels = ["very good", "good", "bad"]
add new levels at the end
In [137]: s.cat.levels = list(s.cat.levels) + ["medium", "very bad"]
reorder the levels
In [138]: s.cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
In [139]: df = pd.DataFrame({"s":s, "vals":[1,2,3,4,5,6]})
In [140]: df.to_csv(csv_file)
IndexError Traceback (most recent call last)
in ()
----> 1 df.to_csv(csv_file)
c:\data\external\pandas\pandas\util\decorators.pyc in wrapper(*args, **kwargs)
58 else:
59 kwargs[new_arg_name] = old_arg_value
---> 60 return func(*args, **kwargs)
61 return wrapper
62 return _deprecate_kwarg
c:\data\external\pandas\pandas\core\frame.pyc in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, **kwds)
1139 doublequote=doublequote,
1140 escapechar=escapechar)
-> 1141 formatter.save()
1142
1143 if path_or_buf is None:
c:\data\external\pandas\pandas\core\format.pyc in save(self)
1312
1313 else:
-> 1314 self._save()
1315
1316 finally:
c:\data\external\pandas\pandas\core\format.pyc in _save(self)
1412 break
1413
-> 1414 self._save_chunk(start_i, end_i)
1415
1416 def _save_chunk(self, start_i, end_i):
c:\data\external\pandas\pandas\core\format.pyc in _save_chunk(self, start_i, end_i)
1424 d = b.to_native_types(slicer=slicer, na_rep=self.na_rep,
1425 float_format=self.float_format,
-> 1426 date_format=self.date_format)
1427
1428 for col_loc, col in zip(b.mgr_locs, d):
c:\data\external\pandas\pandas\core\internals.pyc in to_native_types(self, slicer, na_rep, **kwargs)
446 values = self.values
447 if slicer is not None:
--> 448 values = values[:, slicer]
449 values = np.array(values, dtype=object)
450 mask = isnull(values)
c:\data\external\pandas\pandas\core\categorical.pyc in __getitem__(self, key)
669 return self.levels[i]
670 else:
--> 671 return Categorical(values=self._codes[key], levels=self.levels,
672 ordered=self.ordered, fastpath=True)
673
IndexError: too many indices
In [141]: df2 = pd.read_csv(csv_file)
CParserError Traceback (most recent call last)
in ()
----> 1 df2 = pd.read_csv(csv_file)
c:\data\external\pandas\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
450 infer_datetime_format=infer_datetime_format)
451
--> 452 return _read(filepath_or_buffer, kwds)
453
454 parser_f.__name__ = name
c:\data\external\pandas\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
232
233 # Create the parser.
--> 234 parser = TextFileReader(filepath_or_buffer, **kwds)
235
236 if (nrows is not None) and (chunksize is not None):
c:\data\external\pandas\pandas\io\parsers.pyc in __init__(self, f, engine, **kwds)
540 self.options['has_index_names'] = kwds['has_index_names']
541
--> 542 self._make_engine(self.engine)
543
544 def _get_options_with_defaults(self, engine):
c:\data\external\pandas\pandas\io\parsers.pyc in _make_engine(self, engine)
677 def _make_engine(self, engine='c'):
678 if engine == 'c':
--> 679 self._engine = CParserWrapper(self.f, **self.options)
680 else:
681 if engine == 'python':
c:\data\external\pandas\pandas\io\parsers.pyc in __init__(self, src, **kwds)
1039 kwds['allow_leading_cols'] = self.index_col is not False
1040
-> 1041 self._reader = _parser.TextReader(src, **kwds)
1042
1043 # XXX
c:\data\external\pandas\pandas\parser.pyd in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4629)()
c:\data\external\pandas\pandas\parser.pyd in pandas.parser.TextReader._get_header (pandas\parser.c:6092)()
CParserError: Passed header=0 but only 0 lines in file
In [142]: df2.dtypes
Out[142]:
s category
vals int64
dtype: object
In [143]: df2["vals"]
Out[143]:
0 1
1 2
2 3
3 4
4 5
5 6
Name: vals, dtype: int64
Redo the category
In [144]: df2["vals"] = df2["vals"].astype("category")
In [145]: df2["vals"].cat.levels = list(df2["vals"].cat.levels) + ["medium", "very bad"]
In [146]: df2["vals"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
ValueError Traceback (most recent call last)
in ()
----> 1 df2["vals"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])
c:\data\external\pandas\pandas\core\categorical.pyc in reorder_levels(self, new_levels, ordered)
342
343 if len(new_levels) != len(self._levels):
--> 344 raise ValueError('Reordered levels must be of same length as old levels')
345 if len(new_levels-self._levels):
346 raise ValueError('Reordered levels be the same as the original levels')
ValueError: Reordered levels must be of same length as old levels
In [147]: df2.dtypes
Out[147]:
s category
vals category
dtype: object
In [148]: df2["vals"]
Out[148]:
0 1
1 2
2 3
3 4
4 5
5 6
Name: vals, dtype: category
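Putting the steps described above together, a minimal hedged sketch of the intended CSV round trip (the file name is hypothetical):
s = pd.Series(pd.Categorical(["very good", "good", "bad", "good"]))
df = pd.DataFrame({"s": s, "vals": [1, 2, 3, 4]})
df.to_csv("categorical_example.csv")
df2 = pd.read_csv("categorical_example.csv")
df2["s"].dtype                                    # plain object after the round trip
df2["s"] = df2["s"].astype("category")            # convert back to category
df2["s"].cat.levels = list(df2["s"].cat.levels) + ["medium", "very bad"]          # add the missing levels
df2["s"].cat.reorder_levels(["very bad", "bad", "medium", "good", "very good"])   # restore the ordering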
Missing Data¶
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
There are two ways np.nan can be represented in a Categorical: either the value is not available, or np.nan is itself a valid level.
In [149]: s = pd.Series(pd.Categorical(["a","b",np.nan,"a"]))
In [150]: s
Out[150]:
0 a
1 b
2 NaN
3 a
dtype: category
only two levels
In [151]: s.cat.levels
Out[151]: Index([u'a', u'b'], dtype='object')
In [152]: s2 = pd.Series(pd.Categorical(["a","b","c","a"]))
In [153]: s2.cat.levels = [1,2,np.nan]
In [154]: s2
Out[154]:
0 1
1 2
2 NaN
3 1
dtype: category
three levels, np.nan included
Note: as int arrays can't hold NaN the levels were converted to float
In [155]: s2.cat.levels
Out[155]: Float64Index([1.0, 2.0, nan], dtype='float64')
Gotchas¶
Categorical is not a numpy array¶
Currently, Categorical and the corresponding category Series are implemented as Python objects and not as a low-level numpy array dtype. This leads to some problems.
numpy itself doesn’t know about the new dtype:
In [156]: try:
.....: np.dtype("category")
.....: except TypeError as e:
.....: print("TypeError: " + str(e))
.....:
TypeError: data type "category" not understood
In [157]: dtype = pd.Categorical(["a"]).dtype
In [158]: try:
.....: np.dtype(dtype)
.....: except TypeError as e:
.....: print("TypeError: " + str(e))
.....:
TypeError: data type not understood
dtype comparisons work:
In [159]: dtype == np.str_
Out[159]: False
In [160]: np.str_ == dtype
Out[160]: False
Using numpy functions on a Series of type category does not work, as Categoricals are not numeric data (even when .levels is numeric).
In [161]: s = pd.Series(pd.Categorical([1,2,3,4]))
In [162]: try:
.....: np.sum(s)
.....: except TypeError as e:
.....: print("TypeError: " + str(e))
.....:
TypeError: Categorical cannot perform the operation sum
Side effects¶
Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to the Series will in most cases change the original Categorical:
In [163]: cat = pd.Categorical([1,2,3,10], levels=[1,2,3,4,10])
In [164]: s = pd.Series(cat, name="cat")
In [165]: cat
Out[165]:
1
2
3
10
Levels (5): Int64Index([ 1, 2, 3, 4, 10], dtype=int64), ordered
In [166]: s.iloc[0:2] = 10
In [167]: cat
Out[167]:
10
10
3
10
Levels (5): Int64Index([ 1, 2, 3, 4, 10], dtype=int64), ordered
In [168]: df = pd.DataFrame(s)
In [169]: df["cat"].cat.levels = [1,2,3,4,5]
In [170]: cat
Out[170]:
5
5
3
5
Levels (5): Int64Index([1, 2, 3, 4, 5], dtype=int64), ordered
Use copy=True to prevent such behaviour:
In [171]: cat = pd.Categorical([1,2,3,10], levels=[1,2,3,4,10])
In [172]: s = pd.Series(cat, name="cat", copy=True)
In [173]: cat
Out[173]:
1
2
3
10
Levels (5): Int64Index([ 1, 2, 3, 4, 10], dtype=int64), ordered
In [174]: s.iloc[0:2] = 10
In [175]: cat
Out[175]:
1
2
3
10
Levels (5): Int64Index([ 1, 2, 3, 4, 10], dtype=int64), ordered
Note
This also happens in some cases when you supply a numpy array instead of a Categorical: using an int array (e.g. np.array([1,2,3,4])) will exhibit the same behaviour, but using a string array (e.g. np.array(["a","b","c","a"])) will not.
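A hedged sketch of this note, using a plain int array:
arr = np.array([1, 2, 3, 4])
s = pd.Series(arr, name="a")
s.iloc[0] = 10
arr                # expected: array([10, 2, 3, 4]) -- the underlying array was shared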
Danger of confusion¶
Both Series and Categorical have a .reorder_levels() method, but for different purposes. For a Series of type category this means that there is some danger of confusing the two methods.
In [176]: s = pd.Series(pd.Categorical([1,2,3,4]))
In [177]: print(s.cat.levels)
Int64Index([1, 2, 3, 4], dtype='int64')
wrong and raises an error:
In [178]: try:
.....: s.reorder_levels([4,3,2,1])
.....: except Exception as e:
.....: print("Exception: " + str(e))
.....:
Exception: Can only reorder levels on a hierarchical axis.
right
In [179]: s.cat.reorder_levels([4,3,2,1])
In [180]: print(s.cat.levels)
Int64Index([4, 3, 2, 1], dtype='int64')
See also the API documentation for pandas.Series.reorder_levels() and pandas.Categorical.reorder_levels().
Old style constructor usage¶
In earlier versions, a Categorical could be constructed by passing in precomputed level_codes (then called labels) instead of values together with levels. The level_codes are interpreted as pointers into the levels, with -1 as NaN. This usage is now deprecated and not available unless compat=True is passed to the constructor of Categorical.
In [181]: cat = pd.Categorical([1,2], levels=[1,2,3], compat=True)
In [182]: cat.get_values()
Out[182]: array([2, 3], dtype=int64)
In the default case (compat=False) the first argument is interpreted as values.
In [183]: cat = pd.Categorical([1,2], levels=[1,2,3], compat=False)
In [184]: cat.get_values()
Out[184]: array([1, 2], dtype=int64)
Warning
Using Categorical with precomputed level_codes and levels is deprecated and a FutureWarning is raised. Please change your code to use one of the proper constructor modes instead of adding compat=True.
No categorical index¶
There is currently no index of type category, so setting the index to a Categorical will convert the Categorical to a normal numpy array first and therefore remove any custom ordering of the levels:
In [185]: cats = pd.Categorical([1,2,3,4], levels=[4,2,3,1])
In [186]: strings = ["a","b","c","d"]
In [187]: values = [4,2,3,1]
In [188]: df = pd.DataFrame({"strings":strings, "values":values}, index=cats)
In [189]: df.index
Out[189]: Int64Index([1, 2, 3, 4], dtype='int64')
This should sort by levels but does not as there is no CategoricalIndex!
In [190]: df.sort_index()
Out[190]:
strings values
1 a 4
2 b 2
3 c 3
4 d 1
Note
This could change if a CategoricalIndex is implemented (see #7629).
dtype in apply¶
pandas currently does not preserve the dtype in apply functions: if you apply along rows you get a Series of object dtype (the same as getting a row: getting one element returns a basic type), and applying along columns will also convert to object.
In [191]: df = pd.DataFrame({"a":[1,2,3,4], "b":["a","b","c","d"], "cats":pd.Categorical([1,2,3,2])})
In [192]: df.apply(lambda row: type(row["cats"]), axis=1)
Out[192]:
0 <type 'long'>
1 <type 'long'>
2 <type 'long'>
3 <type 'long'>
dtype: object
In [193]: df.apply(lambda col: col.dtype, axis=0)
Out[193]:
a object
b object
cats object
dtype: object
Future compatibility¶
As Categorical is not a native numpy dtype, the implementation details of Series.cat can change if such a numpy dtype is implemented.