ENH: categorical scatter plot by MarcoGorelli · Pull Request #34293 · pandas-dev/pandas

In your last example, the column passed to c is a column with color names, and we use those to color the points

Yeah, right now c accepts quite a wide range of inputs, and colors referred to by name are one of the allowed cases. It accepts a color name, a color code, a list of either, or a column name/position whose numeric values point to the positions of colors in the colormap.

distinguish the case where the passed "categorical" values are 1) color names or

I think color names and color codes are natively and widely supported in matplotlib, so I'm not sure we even need to put effort into distinguishing them. In other words, whether the column is already of categorical dtype or of general object dtype, the values should be plotted correctly. Therefore, I think the code below should also work even without this PR (I haven't tried it, though, so I might be wrong):

import pandas as pd

df = pd.DataFrame(
    [[5.1, 3.5], [4.9, 3.0], [7.0, 3.2], [6.4, 3.2], [5.9, 3.0]],
    columns=["length", "width"],
)

# Categorical column whose values are matplotlib color names
df['specifics'] = pd.Categorical(['r', 'b', 'b', 'r', 'g'])
df.plot.scatter(x=0, y=1, c='specifics')

2) discrete non-numeric values that we want to use default colors for

For me, as a user, c is about colors, and colors are what we want to use to color the points; that's why I had some initial concerns about what we should accept here.

I think the feature @MarcoGorelli wants to implement here covers the case where the categorical values are not color names that matplotlib can recognise, e.g. arbitrary category names, and we need to figure out how they should map to colors. To me, supporting sorted categorical values is the more intuitive route, because the order can be viewed as discrete numbers, which we could use to look up colors in the colormap. (That is how numeric values are already used to find colors; the difference is that in this case the numeric values are discrete, so it's kind of a special case of numeric values.) Thus, I would prefer to see the discrete names plotted in the color bar (as in his second plot), which would present the link between each category name and the colormap. For unsorted categorical values, we could just treat the order as the order of appearance, as we do for unsorted objects.
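To make the mapping idea concrete, here is a minimal sketch in plain Python. It is only an illustration of "category -> discrete number -> position in the colormap", not what this PR implements; the helper names and the species labels are made up for the example. pandas exposes comparable integer codes through Categorical.codes and pandas.factorize.

```python
def category_codes(values, categories=None):
    """Map each value to an integer code (hypothetical helper).

    If `categories` is given, its order defines the codes (sorted case);
    otherwise codes follow the order of first appearance (unsorted case).
    """
    if categories is None:
        # dict.fromkeys keeps unique values in order of first appearance
        categories = list(dict.fromkeys(values))
    lookup = {cat: code for code, cat in enumerate(categories)}
    return [lookup[v] for v in values], list(categories)


def colormap_positions(codes, n_categories):
    """Spread codes evenly over [0, 1], the range a colormap is sampled on."""
    if n_categories == 1:
        return [0.0 for _ in codes]
    return [code / (n_categories - 1) for code in codes]


# Made-up category names that matplotlib would not recognise as colors
species = ["setosa", "virginica", "setosa", "versicolor", "virginica"]

# Unsorted case: codes follow the order of appearance
codes, cats = category_codes(species)
positions = colormap_positions(codes, len(cats))

# Sorted case: an explicit category order defines the codes
ordered_codes, ordered_cats = category_codes(
    species, categories=["setosa", "versicolor", "virginica"]
)
```

In both cases the points end up with discrete numeric values, so they can go through the same colormap lookup that numeric columns already use.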

So the above is the main reason why I would prefer discrete values in the colormap for categorical cases over legends (which might also introduce ambiguity for users), although they might not look as nice as legends in the plot.
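As a rough sketch of what "discrete values in the colormap" could mean for the color bar: each category code gets its own band, with the tick placed at the code and labelled with the category name. The helper and the category names below are hypothetical, and the layout is an assumption about one reasonable design, not the PR's actual code. Values like these could then be handed to matplotlib, for example as the boundaries of a BoundaryNorm and as the color bar's ticks and tick labels.

```python
def discrete_colorbar_layout(categories):
    """Return band boundaries, tick positions and tick labels (hypothetical helper).

    Category i has integer code i; its band spans [i - 0.5, i + 0.5],
    so the tick at i sits in the middle of the band.
    """
    n = len(categories)
    boundaries = [i - 0.5 for i in range(n + 1)]
    ticks = list(range(n))
    labels = list(categories)
    return boundaries, ticks, labels


# Made-up categories, already in their intended order
boundaries, ticks, labels = discrete_colorbar_layout(["low", "medium", "high"])
```

With this layout the color bar itself carries the link between category name and color, which is the role a legend would otherwise play.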

Sorry for so many words; I'll be happy to discuss if you have further questions. @jorisvandenbossche