DOC: update the pandas.DataFrame.replace docstring by math-and-data · Pull Request #20271 · pandas-dev/pandas
- PR title is "DOC: update the docstring"
- The validation script passes: `scripts/validate_docstrings.py <your-function-or-method>`
- The PEP8 style check passes: `git diff upstream/master -u -- "*.py" | flake8 --diff`
- The html version looks good: `python doc/make.py --single <your-function-or-method>`
- It has been proofread on language by another sprint participant
Note: Just did a minor improvement, not a full change!
Still a few verification errors:
- Errors in parameters section
- Parameter "to_replace" description should start with capital letter
- Parameter "axis" description should finish with "."
- Examples do not pass tests
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################
Replace values given in 'to_replace' with 'value'.
Values of the DataFrame or a Series are being replaced with
other values. One or several values can be replaced with one
or several values.
Parameters
----------
to_replace : str, regex, list, dict, Series, numeric, or None
* numeric, str or regex:
- numeric: numeric values equal to ``to_replace`` will be
replaced with ``value``
- str: string exactly matching ``to_replace`` will be replaced
with ``value``
- regex: regexs matching ``to_replace`` will be replaced with
``value``
* list of str, regex, or numeric:
- First, if ``to_replace`` and ``value`` are both lists, they
**must** be the same length.
- Second, if ``regex=True`` then all of the strings in **both**
lists will be interpreted as regexs otherwise they will match
directly. This doesn't matter much for ``value`` since there
are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
* dict:
- Dicts can be used to specify different replacement values
for different existing values. For example,
{'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and
'y' with 'z'. To use a dict in this way the ``value``
parameter should be ``None``.
- For a DataFrame a dict can specify that different values
should be replaced in different columns. For example,
{'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and
the value 'z' in column 'b' and replaces these values with
whatever is specified in ``value``. The ``value`` parameter
should not be ``None`` in this case. You can treat this as a
special case of passing two lists except that you are
specifying the column to search in.
- For a DataFrame nested dictionaries, e.g.,
{'a': {'b': np.nan}}, are read as follows: look in column 'a'
for the value 'b' and replace it with NaN. The ``value``
parameter should be ``None`` to use a nested dict in this
way. You can nest regular expressions as well. Note that
column names (the top-level dictionary keys in a nested
dictionary) **cannot** be regular expressions.
* None:
- This means that the ``regex`` argument must be a string,
compiled regular expression, or list, dict, ndarray or Series
of such elements. If ``value`` is also ``None`` then this
**must** be a nested dictionary or ``Series``.
See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
Value to replace any values matching ``to_replace`` with.
For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be
filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace : boolean, default False
If True, in place. Note: this will modify any
other views on this object (e.g. a column from a DataFrame).
Returns the caller if this is True.
limit : int, default None
Maximum size gap to forward or backward fill.
regex : bool or same types as ``to_replace``, default False
Whether to interpret ``to_replace`` and/or ``value`` as regular
expressions. If this is ``True`` then ``to_replace`` *must* be a
string. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
``to_replace`` must be ``None``.
method : string, optional, {'pad', 'ffill', 'bfill'}, default is 'pad'
The method to use for replacement, when ``to_replace`` is a
scalar, list or tuple and ``value`` is None.
axis : None
Deprecated.
.. versionchanged:: 0.23.0
Added to DataFrame
See Also
--------
DataFrame.fillna : Fill NA/NaN values
DataFrame.where : Replace values based on boolean condition
Returns
-------
DataFrame
Some values have been substituted for new values.
Raises
------
AssertionError
* If ``regex`` is not a ``bool`` and ``to_replace`` is not
``None``.
TypeError
* If ``to_replace`` is a ``dict`` and ``value`` is not a ``list``,
``dict``, ``ndarray``, or ``Series``
* If ``to_replace`` is ``None`` and ``regex`` is not compilable
into a regular expression or is a list, dict, ndarray, or
Series.
* When replacing multiple ``bool`` or ``datetime64`` objects and
the arguments to ``to_replace`` do not match the type of the
value being replaced
ValueError
* If a ``list`` or an ``ndarray`` is passed to ``to_replace`` and
`value` but they are not the same length.
Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
cannot provide, for example, a regular expression matching floating
point numbers and expect the columns in your frame that have a
numeric dtype to be matched. However, if those floating point
numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
Examples
--------
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0 5
1 1
2 2
3 3
4 4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
... 'B': [5, 6, 7, 8, 9],
... 'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
A B C
0 5 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace([0, 1, 2, 3], 4)
A B C
0 4 5 a
1 4 6 b
2 4 7 c
3 4 8 d
4 4 9 e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
A B C
0 4 5 a
1 3 6 b
2 2 7 c
3 1 8 d
4 4 9 e
>>> s.replace([1, 2], method='bfill')
0 0
1 3
2 3
3 3
4 4
dtype: int64
>>> df.replace({0: 10, 1: 100})
A B C
0 10 5 a
1 100 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace({'A': 0, 'B': 5}, 100)
A B C
0 100 100 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace({'A': {0: 100, 4: 400}})
A B C
0 100 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 400 9 e
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
... 'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
A B
0 new abc
1 foo new
2 bait xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
A B
0 new abc
1 foo bar
2 bait xyz
>>> df.replace(regex=r'^ba.$', value='new')
A B
0 new abc
1 foo new
2 bait xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
A B
0 new abc
1 xyz new
2 bait xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
A B
0 new abc
1 new new
2 bait xyz
Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the ``to_replace`` parameter must match the data
type of the value being replaced:
>>> df = pd.DataFrame({'A': [True, False, True],
... 'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False}) # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.
Compare the behavior of
``s.replace('a', None)`` and ``s.replace({'a': None})`` to understand
the peculiarities of the ``to_replace`` parameter.
``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``,
because when ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter to do the replacement.
So this is why the 'a' values are being replaced by 30 in rows 3 and 4
and 'b' in row 6 in this case. However, this behaviour does not occur
when you use a dict as the ``to_replace`` value. In this case, it is
like the value(s) in the dict are equal to the value parameter.
>>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a'])
>>> print(s)
0 10
1 20
2 30
3 a
4 a
5 b
6 a
dtype: object
>>> print(s.replace('a', None))
0 10
1 20
2 30
3 30
4 30
5 b
6 b
dtype: object
>>> print(s.replace({'a': None}))
0 10
1 20
2 30
3 None
4 None
5 b
6 None
dtype: object
################################################################################
################################## Validation ##################################
################################################################################
Errors found:
Errors in parameters section
Parameter "to_replace" description should start with capital letter
Parameter "axis" description should finish with "."
Examples do not pass tests
################################################################################
################################### Doctests ###################################
################################################################################
**********************************************************************
Line 229, in pandas.DataFrame.replace
Failed example:
df.replace({'a string': 'new value', True: False}) # raises
Exception raised:
Traceback (most recent call last):
File "C:\Users\thisi\AppData\Local\conda\conda\envs\pandas_dev\lib\doctest.py", line 1330, in __run
compileflags, 1), test.globs)
File "<doctest pandas.DataFrame.replace[17]>", line 1, in <module>
df.replace({'a string': 'new value', True: False}) # raises
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
method=method, axis=axis)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5208, in replace
limit=limit, regex=regex)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
method=method, axis=axis)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5257, in replace
regex=regex)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in replace_list
masks = [comp(s) for i, s in enumerate(src_list)]
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in <listcomp>
masks = [comp(s) for i, s in enumerate(src_list)]
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3694, in comp
return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 5122, in _maybe_compare
b=type_names[1]))
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
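A note on the doctest failure above: the example's expected output contains only the final `TypeError: ...` line. doctest treats an exception as expected only when the expected output begins with the `Traceback (most recent call last):` header; the stack body can then be elided with `...`. The sketch below shows both forms using only the standard library. `raise_type_error` is a hypothetical stand-in for the failing `df.replace` call so the snippet runs without pandas, and the message text is shortened for illustration.

```python
import doctest


def raise_type_error(value):
    """Hypothetical stand-in for a call that raises, like the failing example."""
    raise TypeError("Cannot compare types 'bool' and 'str'")


# Expected output with the traceback header: doctest accepts the exception.
WITH_HEADER = """
>>> raise_type_error('a string')
Traceback (most recent call last):
    ...
TypeError: Cannot compare types 'bool' and 'str'
"""

# Expected output with only the final line: doctest reports "Exception raised".
WITHOUT_HEADER = """
>>> raise_type_error('a string')
TypeError: Cannot compare types 'bool' and 'str'
"""


def run_example(text):
    """Run one doctest snippet and return doctest's result counts."""
    test = doctest.DocTestParser().get_doctest(
        text, {"raise_type_error": raise_type_error}, "example", None, 0
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Discard the failure report so only the counts are printed below.
    return runner.run(test, out=lambda message: None)


print(run_example(WITH_HEADER))     # no failures expected
print(run_example(WITHOUT_HEADER))  # one failure expected
```

Under this assumption, rewriting the docstring example's expected output in the first form should let the "Examples do not pass tests" check succeed, provided the final message line matches what pandas actually raises.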
| .. versionchanged:: 0.23.0 |
| Added to DataFrame |
| .. versionchanged:: 0.23.0 |
This isn't what you want to do - make sure you keep the versionchanged directive below the method argument as that's what was added in v0.23
| list, dict, or array of regular expressions in which case |
|---|
| ``to_replace`` must be ``None``. |
| method : string, optional, {'pad', 'ffill', 'bfill'} |
| method : string, optional, {'pad', 'ffill', 'bfill'}, default is 'pad' |
method : {'pad', 'ffill', 'bfill', `None`}
| The method to use for replacement, when ``to_replace`` is a |
|---|
| scalar, list or tuple and ``value`` is None. |
| axis : None |
| Deprecated. |
Warning says this will be removed in v0.13? Woof...I guess OK to document for this change but should have a follow up change to actually go ahead and remove - care to take a stab at that?
I'm happy to take a stab at this - always nice when I can remove code too
@math-and-data awesome thanks! Can you open a separate issue for this?
will do.
@WillAyd I was waiting for this PR to be approved; then I would open a new request where I change the relevant code (remove the 'axis' reference) and edit the documentation accordingly. Is there anything else I missed in this PR (other than the suggestion of breaking out the DataFrame and Series examples)?
| _shared_docs['replace'] = (""" |
|---|
| Replace values given in 'to_replace' with 'value'. |
| Values of the DataFrame or a Series are being replaced with |
Not sure this extended description is adding much. Better served to make mention of how this can replace values with a dynamic set of inputs like dicts
done, thank you for the suggestion
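As a rough illustration of the dict-style inputs discussed in this thread, the sketch below models the two dict forms from the docstring in plain Python: a flat dict maps old values to new ones in every column, while a nested dict scopes the mapping to a named column. The helpers `replace_flat` and `replace_nested` and the dict-of-lists `frame` are hypothetical stand-ins, not pandas APIs; the data mirrors columns A and B of the docstring examples.

```python
def replace_flat(columns, mapping):
    """Apply one old -> new mapping to every column."""
    return {
        name: [mapping.get(value, value) for value in values]
        for name, values in columns.items()
    }


def replace_nested(columns, per_column):
    """Apply a separate old -> new mapping to each named column only."""
    return {
        name: [per_column.get(name, {}).get(value, value) for value in values]
        for name, values in columns.items()
    }


# Plain-Python stand-in for the DataFrame used in the docstring examples.
frame = {'A': [0, 1, 2, 3, 4], 'B': [5, 6, 7, 8, 9]}

# Mirrors df.replace({0: 10, 1: 100}): 0 and 1 are replaced wherever found.
print(replace_flat(frame, {0: 10, 1: 100}))
# {'A': [10, 100, 2, 3, 4], 'B': [5, 6, 7, 8, 9]}

# Mirrors df.replace({'A': {0: 100, 4: 400}}): only column 'A' is searched.
print(replace_nested(frame, {'A': {0: 100, 4: 400}}))
# {'A': [100, 1, 2, 3, 400], 'B': [5, 6, 7, 8, 9]}
```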
| Parameters |
| ---------- |
| to_replace : str, regex, list, dict, Series, numeric, or None |
Say int, float instead of numeric (if float is even valid?)
| the peculiarities of the ``to_replace`` parameter. |
|---|
| ``s.replace('a', None)`` is actually equivalent to |
| ``s.replace(to_replace='a', value=None, method='pad')``, |
| because when ``value=None`` and ``to_replace`` is a scalar, list or |
This is interesting as I was not aware of this behavior. Certainly great to have it documented, though I would move the majority of the writing into the Notes section and shorten the blurb introducing the comparison here.
| ``s.replace(to_replace='a', value=None, method='pad')``, |
|---|
| because when ``value=None`` and ``to_replace`` is a scalar, list or |
| tuple, ``replace`` uses the method parameter to do the replacement. |
| So this is why the 'a' values are being replaced by 30 in rows 3 and 4 |
Maybe just reinforce that it's the fill behavior that is really replacing values here
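To make the fill behavior concrete, here is a pure-Python sketch of what the docstring describes for `method='pad'`: each matched value is overwritten with whatever precedes it in the output. The function `replace_with_pad` is hypothetical and is not how pandas implements this; in particular, a match in the first position is simply left untouched here, which is a simplification.

```python
def replace_with_pad(values, to_replace):
    """Forward-fill ('pad') every element equal to ``to_replace``."""
    result = []
    for value in values:
        if value == to_replace and result:
            # A matched value copies the previous output position, so a run
            # of matches all inherit the last non-matching value.
            result.append(result[-1])
        else:
            # Non-matching values pass through unchanged. A match in the
            # first position has nothing to copy and is kept as-is.
            result.append(value)
    return result


print(replace_with_pad([10, 'a', 'a', 'b', 'a'], 'a'))
# [10, 10, 10, 'b', 'b']
```

The result lines up with the `s.replace('a', None)` output quoted in the docstring: the 'a' values take the preceding 10, and the final 'a' takes 'b'.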
| like the value(s) in the dict are equal to the value parameter. |
|---|
| >>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a']) |
| >>> print(s) |
This Series is simple enough where you don't need to explicitly print it - the constructor shows you everything of interest
I personally have found the visual of inspecting the changes before/after easier for such replacements (both in vertical positions). You have more experience and I'll rely on your suggestion and make the change.
| 5 b |
|---|
| 6 b |
| dtype: object |
| >>> print(s.replace({'a': None})) |
I would put this example first as it is (from my perspective) the behavior most would expect. Having it first makes it a better segue into the nuance that you want to describe with the other example
| when you use a dict as the ``to_replace`` value. In this case, it is |
|---|
| like the value(s) in the dict are equal to the value parameter. |
| >>> s = pd.Series([10, 20, 30, 'a', 'a', 'b', 'a']) |
Just to keep things concise why don't you get rid of 10 and 20 in this example? They don't serve any real purpose but make the documentation longer. Can also replace 30 with 1
great suggestion of simplifying.
| 5 b |
|---|
| 6 a |
| dtype: object |
| >>> print(s.replace('a', None)) |
You don't need the prints; use a blank line between cases. Having an explanation for each case is also nice.
done.
- Docstring validation not passing
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################
Replace values given in 'to_replace' with 'value'.
Values of the DataFrame or a Series are being replaced with
other values in a dynamic way. Instead of replacing values in a
specific cell (row/column combination), this method allows for more
flexibility with replacements. For instance, values can be replaced
by specifying lists of values and replacements separately or
with a dynamic set of inputs like dicts.
Parameters
----------
to_replace : str, regex, list, dict, Series, int, float, or None
* numeric, str or regex:
- numeric: numeric values equal to ``to_replace`` will be
replaced with ``value``
- str: string exactly matching ``to_replace`` will be replaced
with ``value``
- regex: regexs matching ``to_replace`` will be replaced with
``value``
* list of str, regex, or numeric:
- First, if ``to_replace`` and ``value`` are both lists, they
**must** be the same length.
- Second, if ``regex=True`` then all of the strings in **both**
lists will be interpreted as regexs otherwise they will match
directly. This doesn't matter much for ``value`` since there
are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
* dict:
- Dicts can be used to specify different replacement values
for different existing values. For example,
{'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and
'y' with 'z'. To use a dict in this way the ``value``
parameter should be ``None``.
- For a DataFrame a dict can specify that different values
should be replaced in different columns. For example,
{'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and
the value 'z' in column 'b' and replaces these values with
whatever is specified in ``value``. The ``value`` parameter
should not be ``None`` in this case. You can treat this as a
special case of passing two lists except that you are
specifying the column to search in.
- For a DataFrame nested dictionaries, e.g.,
{'a': {'b': np.nan}}, are read as follows: look in column
'a' for the value 'b' and replace it with NaN. The ``value``
parameter should be ``None`` to use a nested dict in this
way. You can nest regular expressions as well. Note that
column names (the top-level dictionary keys in a nested
dictionary) **cannot** be regular expressions.
* None:
- This means that the ``regex`` argument must be a string,
compiled regular expression, or list, dict, ndarray or
Series of such elements. If ``value`` is also ``None`` then
this **must** be a nested dictionary or ``Series``.
See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
Value to replace any values matching ``to_replace`` with.
For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be
filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace : boolean, default False
If True, in place. Note: this will modify any
other views on this object (e.g. a column from a DataFrame).
Returns the caller if this is True.
limit : int, default None
Maximum size gap to forward or backward fill.
regex : bool or same types as ``to_replace``, default False
Whether to interpret ``to_replace`` and/or ``value`` as regular
expressions. If this is ``True`` then ``to_replace`` *must* be a
string. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
``to_replace`` must be ``None``.
method : {'pad', 'ffill', 'bfill', `None`}
The method to use for replacement, when ``to_replace`` is a
scalar, list or tuple and ``value`` is `None`.
.. versionchanged:: 0.23.0
Added to DataFrame.
axis : None
Deprecated.
See Also
--------
DataFrame.fillna : Fill `NaN` values
DataFrame.where : Replace values based on boolean condition
Returns
-------
DataFrame
Object after replacement.
Raises
------
AssertionError
* If ``regex`` is not a ``bool`` and ``to_replace`` is not
``None``.
TypeError
* If ``to_replace`` is a ``dict`` and ``value`` is not a ``list``,
``dict``, ``ndarray``, or ``Series``
* If ``to_replace`` is ``None`` and ``regex`` is not compilable
into a regular expression or is a list, dict, ndarray, or
Series.
* When replacing multiple ``bool`` or ``datetime64`` objects and
the arguments to ``to_replace`` do not match the type of the
value being replaced
ValueError
* If a ``list`` or an ``ndarray`` is passed to ``to_replace`` and
`value` but they are not the same length.
Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
cannot provide, for example, a regular expression matching floating
point numbers and expect the columns in your frame that have a
numeric dtype to be matched. However, if those floating point
numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
* When a dict is used as the ``to_replace`` value, the key(s) in the
dict act as the ``to_replace`` part and the value(s) in the dict act
as the ``value`` parameter.
Examples
--------
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0 5
1 1
2 2
3 3
4 4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
... 'B': [5, 6, 7, 8, 9],
... 'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
A B C
0 5 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace([0, 1, 2, 3], 4)
A B C
0 4 5 a
1 4 6 b
2 4 7 c
3 4 8 d
4 4 9 e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
A B C
0 4 5 a
1 3 6 b
2 2 7 c
3 1 8 d
4 4 9 e
>>> s.replace([1, 2], method='bfill')
0 0
1 3
2 3
3 3
4 4
dtype: int64
>>> df.replace({0: 10, 1: 100})
A B C
0 10 5 a
1 100 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace({'A': 0, 'B': 5}, 100)
A B C
0 100 100 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace({'A': {0: 100, 4: 400}})
A B C
0 100 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 400 9 e
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
... 'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
A B
0 new abc
1 foo new
2 bait xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
A B
0 new abc
1 foo bar
2 bait xyz
>>> df.replace(regex=r'^ba.$', value='new')
A B
0 new abc
1 foo new
2 bait xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
A B
0 new abc
1 xyz new
2 bait xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
A B
0 new abc
1 new new
2 bait xyz
Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the ``to_replace`` parameter must match the data
type of the value being replaced:
>>> df = pd.DataFrame({'A': [True, False, True],
... 'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False}) # raises
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.
Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the ``to_replace`` parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the ``to_replace`` value, it is like the
value(s) in the dict are equal to the value parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:
>>> s.replace({'a': None})
0 10
1 None
2 None
3 b
4 None
dtype: object
When ``value=None`` and ``to_replace`` is a scalar, list or
tuple, ``replace`` uses the method parameter (default 'pad') to do the
replacement. So this is why the 'a' values are being replaced by 10
in rows 1 and 2 and 'b' in row 4 in this case.
The command ``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``:
>>> s.replace('a', None)
0 10
1 10
2 10
3 b
4 b
dtype: object
################################################################################
################################## Validation ##################################
################################################################################
Errors found:
Errors in parameters section
Parameter "to_replace" description should start with capital letter
Examples do not pass tests
################################################################################
################################### Doctests ###################################
################################################################################
**********************************************************************
Line 233, in pandas.DataFrame.replace
Failed example:
df.replace({'a string': 'new value', True: False}) # raises
Exception raised:
Traceback (most recent call last):
File "C:\Users\thisi\AppData\Local\conda\conda\envs\pandas_dev\lib\doctest.py", line 1330, in __run
compileflags, 1), test.globs)
File "<doctest pandas.DataFrame.replace[17]>", line 1, in <module>
df.replace({'a string': 'new value', True: False}) # raises
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
method=method, axis=axis)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5205, in replace
limit=limit, regex=regex)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\frame.py", line 3136, in replace
method=method, axis=axis)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\generic.py", line 5254, in replace
regex=regex)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in replace_list
masks = [comp(s) for i, s in enumerate(src_list)]
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3696, in <listcomp>
masks = [comp(s) for i, s in enumerate(src_list)]
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 3694, in comp
return _maybe_compare(values, getattr(s, 'asm8', s), operator.eq)
File "C:\Users\thisi\Documents\GitHub\pandas\pandas\core\internals.py", line 5122, in _maybe_compare
b=type_names[1]))
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
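On the regex examples in the docstring above: the Notes state that substitution goes through ``re.sub`` and only applies to strings. The standard-library sketch below checks which of the example strings the pattern `^ba.$` actually matches, and shows a non-string value passing through untouched. `regex_replace` is a hypothetical helper written for this note, not a pandas function.

```python
import re


def regex_replace(values, pattern, replacement):
    """Apply re.sub to string elements only; leave everything else as-is."""
    compiled = re.compile(pattern)
    return [
        compiled.sub(replacement, value) if isinstance(value, str) else value
        for value in values
    ]


# '^ba.$' needs exactly three characters starting with 'ba', so 'bat' and
# 'bar' match while 'bait' (four characters) does not.
print(regex_replace(['bat', 'foo', 'bait', 'abc', 'bar', 'xyz'], r'^ba.$', 'new'))
# ['new', 'foo', 'bait', 'abc', 'new', 'xyz']

# A float is skipped even though its string form would match the pattern;
# the string '1.5' is replaced.
print(regex_replace([1.5, '1.5'], r'^1\.5$', 'x'))
# [1.5, 'x']
```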
Section headers.
Consistent quoting.
Formatting.
Traceback.
Updated
################################################################################
##################### Docstring (pandas.DataFrame.replace) #####################
################################################################################
Replace values given in `to_replace` with `value`.
Values of the DataFrame are replaced with other values dynamically.
This differs from updating with ``.loc`` or ``.iloc``, which require
you to specify a location to update with some value.
Parameters
----------
to_replace : str, regex, list, dict, Series, int, float, or None
How to find the values that will be replaced.
* numeric, str or regex:
- numeric: numeric values equal to `to_replace` will be
replaced with `value`
- str: string exactly matching `to_replace` will be replaced
with `value`
- regex: regexs matching `to_replace` will be replaced with
`value`
* list of str, regex, or numeric:
- First, if `to_replace` and `value` are both lists, they
**must** be the same length.
- Second, if ``regex=True`` then all of the strings in **both**
lists will be interpreted as regexs otherwise they will match
directly. This doesn't matter much for `value` since there
are only a few possible substitution regexes you can use.
- str, regex and numeric rules apply as above.
* dict:
- Dicts can be used to specify different replacement values
for different existing values. For example,
``{'a': 'b', 'y': 'z'}`` replaces the value 'a' with 'b' and
'y' with 'z'. To use a dict in this way the `value`
parameter should be `None`.
- For a DataFrame a dict can specify that different values
should be replaced in different columns. For example,
``{'a': 1, 'b': 'z'}`` looks for the value 1 in column 'a'
and the value 'z' in column 'b' and replaces these values
with whatever is specified in `value`. The `value` parameter
should not be ``None`` in this case. You can treat this as a
special case of passing two lists except that you are
specifying the column to search in.
- For a DataFrame nested dictionaries, e.g.,
``{'a': {'b': np.nan}}``, are read as follows: look in column
'a' for the value 'b' and replace it with NaN. The `value`
parameter should be ``None`` to use a nested dict in this
way. You can nest regular expressions as well. Note that
column names (the top-level dictionary keys in a nested
dictionary) **cannot** be regular expressions.
* None:
- This means that the `regex` argument must be a string,
compiled regular expression, or list, dict, ndarray or
Series of such elements. If `value` is also ``None`` then
this **must** be a nested dictionary or Series.
See the examples section for examples of each of these.
value : scalar, dict, list, str, regex, default None
Value to replace any values matching `to_replace` with.
For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be
filled). Regular expressions, strings and lists or dicts of such
objects are also allowed.
inplace : boolean, default False
If True, in place. Note: this will modify any
other views on this object (e.g. a column from a DataFrame).
Returns the caller if this is True.
limit : int, default None
Maximum size gap to forward or backward fill.
regex : bool or same types as `to_replace`, default False
Whether to interpret `to_replace` and/or `value` as regular
expressions. If this is ``True`` then `to_replace` *must* be a
string. Alternatively, this could be a regular expression or a
list, dict, or array of regular expressions in which case
`to_replace` must be ``None``.
method : {'pad', 'ffill', 'bfill', `None`}
The method to use for replacement, when `to_replace` is a
scalar, list or tuple and `value` is ``None``.
.. versionchanged:: 0.23.0
Added to DataFrame.
axis : None
Deprecated.
See Also
--------
DataFrame.fillna : Fill `NaN` values.
DataFrame.where : Replace values based on boolean condition.
Series.str.replace : Simple string replacement.
Returns
-------
DataFrame
Object after replacement.
Raises
------
AssertionError
* If `regex` is not a ``bool`` and `to_replace` is not
``None``.
TypeError
* If `to_replace` is a ``dict`` and `value` is not a ``list``,
``dict``, ``ndarray``, or ``Series``.
* If `to_replace` is ``None`` and `regex` is not compilable
into a regular expression or is a list, dict, ndarray, or
Series.
* When replacing multiple ``bool`` or ``datetime64`` objects and
the arguments to `to_replace` do not match the type of the
value being replaced.
ValueError
* If a ``list`` or an ``ndarray`` is passed to `to_replace` and
`value` but they are not the same length.
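As a rough illustration of the ``ValueError`` case, the sketch below passes
`to_replace` and `value` lists of different lengths. The frame and values are
made up for this illustration, and the exact wording of the error message may
differ between pandas versions.

```python
import pandas as pd

# Hypothetical frame, used only for this illustration.
df = pd.DataFrame({'A': [0, 1, 2], 'B': [5, 6, 7]})

try:
    # Two values to replace but only one replacement value: lengths differ.
    df.replace([0, 1], [4])
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
else:
    raised = False
```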
Notes
-----
* Regex substitution is performed under the hood with ``re.sub``. The
rules for substitution for ``re.sub`` are the same.
* Regular expressions will only substitute on strings, meaning you
cannot provide, for example, a regular expression matching floating
point numbers and expect the columns in your frame that have a
numeric dtype to be matched. However, if those floating point
numbers *are* strings, then you can do this.
* This method has *a lot* of options. You are encouraged to experiment
and play with this method to gain intuition about how it works.
* When a dict is used as the `to_replace` value, the key(s) in the
dict act as the `to_replace` part and the value(s) in the dict
act as the `value` parameter.
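Two of the notes above can be checked with a small sketch. The frames and
column names (``num``, ``txt``, ``ints``) are invented for this illustration,
and the behaviour shown is what recent pandas versions are expected to do
rather than a guarantee for every release.

```python
import pandas as pd

# Regex replacement only touches string data, not numeric columns.
df = pd.DataFrame({'num': [1.5, 2.5], 'txt': ['1.5', '2.5']})
by_regex = df.replace(to_replace=r'^1\.5$', value='x', regex=True)
# 'num' keeps its floats; only the string '1.5' in 'txt' is replaced.

# A flat dict behaves like parallel `to_replace` / `value` lists.
ints = pd.DataFrame({'A': [0, 1, 2], 'B': [1, 0, 3]})
by_dict = ints.replace({0: 10, 1: 100})
by_lists = ints.replace([0, 1], [10, 100])

print(by_regex)
print(by_dict.equals(by_lists))
```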
Examples
--------
**Scalar `to_replace` and `value`**
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.replace(0, 5)
0 5
1 1
2 2
3 3
4 4
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
... 'B': [5, 6, 7, 8, 9],
... 'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
A B C
0 5 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
**List-like `to_replace`**
>>> df.replace([0, 1, 2, 3], 4)
A B C
0 4 5 a
1 4 6 b
2 4 7 c
3 4 8 d
4 4 9 e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
A B C
0 4 5 a
1 3 6 b
2 2 7 c
3 1 8 d
4 4 9 e
>>> s.replace([1, 2], method='bfill')
0 0
1 3
2 3
3 3
4 4
dtype: int64
**dict-like `to_replace`**
>>> df.replace({0: 10, 1: 100})
A B C
0 10 5 a
1 100 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace({'A': 0, 'B': 5}, 100)
A B C
0 100 100 a
1 1 6 b
2 2 7 c
3 3 8 d
4 4 9 e
>>> df.replace({'A': {0: 100, 4: 400}})
A B C
0 100 5 a
1 1 6 b
2 2 7 c
3 3 8 d
4 400 9 e
**Regular expression `to_replace`**
>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
... 'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
A B
0 new abc
1 foo new
2 bait xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
A B
0 new abc
1 foo bar
2 bait xyz
>>> df.replace(regex=r'^ba.$', value='new')
A B
0 new abc
1 foo new
2 bait xyz
>>> df.replace(regex={r'^ba.$':'new', 'foo':'xyz'})
A B
0 new abc
1 xyz new
2 bait xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
A B
0 new abc
1 new new
2 bait xyz
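The parameter description mentions that regular expressions can be nested
inside a dict. A minimal sketch of that form, reusing the frame from the
examples above: the outer key is a literal column label and the inner dict
maps a pattern to its replacement.

```python
import pandas as pd

df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
                   'B': ['abc', 'bar', 'xyz']})

# Outer key: column label (never a regex). Inner dict: pattern -> replacement.
result = df.replace({'A': {r'^ba.$': 'new'}}, regex=True)
print(result)
```

Only column 'A' is searched here, so 'bar' in column 'B' is expected to stay
unchanged even though it matches the pattern.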
Note that when replacing multiple ``bool`` or ``datetime64`` objects,
the data types in the `to_replace` parameter must match the data
type of the value being replaced:
>>> df = pd.DataFrame({'A': [True, False, True],
... 'B': [False, True, False]})
>>> df.replace({'a string': 'new value', True: False}) # raises
Traceback (most recent call last):
...
TypeError: Cannot compare types 'ndarray(dtype=bool)' and 'str'
This raises a ``TypeError`` because one of the ``dict`` keys is not of
the correct type for replacement.
Compare the behavior of ``s.replace({'a': None})`` and
``s.replace('a', None)`` to understand the peculiarities
of the `to_replace` parameter:
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
When one uses a dict as the `to_replace` value, the value(s) in the
dict are used as the `value` parameter.
``s.replace({'a': None})`` is equivalent to
``s.replace(to_replace={'a': None}, value=None, method=None)``:
>>> s.replace({'a': None})
0 10
1 None
2 None
3 b
4 None
dtype: object
When ``value=None`` and `to_replace` is a scalar, list or
tuple, `replace` uses the method parameter (default 'pad') to do the
replacement. This is why the 'a' values are replaced by 10
in rows 1 and 2 and by 'b' in row 4 in this case.
The command ``s.replace('a', None)`` is actually equivalent to
``s.replace(to_replace='a', value=None, method='pad')``:
>>> s.replace('a', None)
0 10
1 10
2 10
3 b
4 b
dtype: object
################################################################################
################################## Validation ##################################
################################################################################
Docstring for "pandas.DataFrame.replace" correct. :)
I would personally split this docstring in separate ones for series and dataframe, it's becoming quite a monster :)
One very minor edit but otherwise lgtm
See Also
--------
- %(klass)s.fillna : Fill NA/NaN values
+ %(klass)s.fillna : Fill `NaN` values
This would be better as Fill NA values since it is talking about the concept of missing data and not necessarily the NaN value itself
minor change as requested
Fixed the linting failure. Let's get this merged when that passes.
