Right-to-left characters in repr

February 12, 2026, 12:42pm 1

On modern Unicode-aware terminals and browsers, right-to-left characters are not only displayed in an order different from how they are stored, but also affect neutral characters, including punctuation, spaces, and digits. For example:

>>> '12\u05be34'
'12־34'

You should see '1234X', where X is the RTL character. It is shown as the 5th character, even though it is actually the 3rd character.

Worse, this can escape the string’s repr and affect the representation of other items:

>>> list('12\u05be34')
['1', '2', '־', '3', '4']

You should see ['1', '2', '4', '3', 'X']. Not only does it move the item with the RTL character to the end of the list (it is the third item), but it also shows the following items in reversed order.

This is bad. The repr() of a string is supposed to show you an accurate representation of the string's content; this is why it escapes control characters (including the marks that explicitly control the text direction). The only solution is to treat strong right-to-left characters as non-printable and escape them in repr().
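A minimal sketch of what such escaping could look like, written in pure Python rather than as a change to CPython. The helper name `escape_strong_rtl` is hypothetical, and treating the bidi classes `R` and `AL` as "strong right-to-left" is an assumption about the scope of the proposal.

```python
import ast
import unicodedata

# Bidi classes assumed to count as "strong right-to-left" in this sketch.
STRONG_RTL = {"R", "AL"}

def escape_strong_rtl(text: str) -> str:
    """Replace strong RTL characters in text with \\uXXXX or \\UXXXXXXXX escapes."""
    out = []
    for ch in text:
        if unicodedata.bidirectional(ch) in STRONG_RTL:
            cp = ord(ch)
            out.append(f"\\u{cp:04x}" if cp <= 0xFFFF else f"\\U{cp:08x}")
        else:
            out.append(ch)
    return "".join(out)

s = "12\u05be34"
escaped_repr = escape_strong_rtl(repr(s))  # applied to the repr, not to s itself
print(escaped_repr)                        # '12\u05be34'
print(ast.literal_eval(escaped_repr) == s) # True: the value still round-trips
```

Applying the helper to the output of repr() keeps the quoting and backslash handling that repr() already does, so the result is still a valid string literal.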

As a side effect, str.isprintable() will return False for such characters. This may be surprising, but this is its definition: it returns True for exactly those characters which are not escaped in repr(). In particular, it returns False for whitespace other than the space, and True for characters which may not be supported by your terminal.
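The link between str.isprintable() and repr() can be checked directly. This is only an illustrative sketch; the sample labels are made up for the example, and the result for the maqaf (U+05BE) reflects the behaviour before any change of the kind proposed here.

```python
import unicodedata

# Label -> character; the labels are for display only.
samples = {
    "space": " ",
    "tab": "\t",
    "RLM (U+200F)": "\u200f",
    "maqaf (U+05BE)": "\u05be",
}

for label, ch in samples.items():
    print(
        f"{label:16}"
        f" isprintable={ch.isprintable()!s:5}"
        f" bidi={unicodedata.bidirectional(ch):3}"
        f" repr={ch!r}"
    )
```

Characters for which isprintable() is False (the tab and the right-to-left mark) come out escaped in the repr column, while the space is shown as-is.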

encukou (Petr Viktorin) February 12, 2026, 2:13pm 2

This would break the reprs of Arabic or Hebrew text; that seems to contradict today’s blog post from the PSF.

I guess an alternative could go along the lines of adding left-to-right marks (LRMs) outside the quotes if strong RTL characters are present, and allowing LRMs in the tokenizer.

>>> print(*(f'\N{LRM}{repr(c)}\N{LRM}' for c in '12\u05be34'))
'1'‎ ‎'2'‎ ‎'־'‎ ‎'3'‎ ‎'4'‎

steve.dower (Steve Dower) February 12, 2026, 2:34pm 3

(Background: I was quite deeply involved in the Unicode working group that analysed the risk of RTL and invisible characters in source code and how it should be treated by programming languages, so I’m drawing on a lot of on-topic discussions with the literal experts in this field.)

Basically, we need to draw a line between rendering of text and storage of text, because the way we represent stored text is allowed to be different from how it’s rendered.

In short, I think Serhiy is correct. Our repr function should produce a representation of the stored string, which means control characters are escaped rather than functional. If you’d like an easier example here, the repr of a string containing an ANSI colour code should show the code itself, not the colour.

If you’d like an even easier example, the repr of the string <b>Text</b> should show the HTML elements, rather than making Text bold.
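Both examples, plus an explicit directional mark, can be tried directly. This small sketch only shows what repr() already does today; the variable names are arbitrary.

```python
ansi = "\x1b[31mred\x1b[0m"  # ANSI escape sequence that would colour the text red
marked = "abc\u200fdef"      # contains RIGHT-TO-LEFT MARK, a format control
html = "<b>Text</b>"         # markup is just characters as far as repr() is concerned

print(repr(ansi))    # '\x1b[31mred\x1b[0m'
print(repr(marked))  # 'abc\u200fdef'
print(repr(html))    # '<b>Text</b>'
```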

When it comes to rendering a string, the only place we do that is in our repl. Everywhere else, rendering is left to user code. Currently, our repl doesn’t support a RTL mode, and so trying to render bidirectional text is ultimately going to fail, but it’s the one place where we could attempt to do it. Other repls, such as those built into IDEs, are able to render it[1].

Explicit directional characters[2] are control characters, and as such should be converted to visible codes in a repr. Our renderer isn’t able to process them, so it could reasonably convert them to something visible, while our storage format treats them normally so that RTL-aware renderers can handle them properly.

[1] It’s permitted for RTL text to be “weird” if the user hasn’t chosen an overall RTL mode - you get a mix if the code generating the text assumes the default will be RTL and it isn’t, so the user gets to set the default. Highly defensive code will generate marks at the start and end of each paragraph.
[2] Bit of a side note, but some printable characters implicitly control the current RTL direction as well.

JamesParrott (James Parrott) February 12, 2026, 2:50pm 4

Is the code that’s shifting the RTL character a bug, or an intentional normalization? If there are many possible reprs for that string, repr must make a choice, just like how it decides to use ' or " for string literals.

storchaka (Serhiy Storchaka) February 12, 2026, 3:01pm 5

This would break a lot of code that does something like repr(s)[1:-1].

An alternative could be adding left-to-right marks inside the quotes, but then we would lose the invariant s == eval(repr(s)).
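The two trade-offs can be illustrated with a short sketch. The fenced strings are built by hand here to imitate what a modified repr() might return; nothing in current Python produces them.

```python
import ast

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
s = "12\u05be34"
plain = repr(s)

# Option A (hypothetical): marks outside the quotes.
outside = LRM + plain + LRM
print(plain[1:-1] == s)    # True: today's slicing idiom works for this string
print(outside[1:-1] == s)  # False: the slice now strips the marks, not the quotes

# Whether the fenced form parses depends on the tokenizer accepting LRM.
try:
    ast.literal_eval(outside)
    parses = True
except (SyntaxError, ValueError):
    parses = False
print(parses)

# Option B (hypothetical): marks inside the quotes.
inside = plain[0] + LRM + plain[1:-1] + LRM + plain[-1]
print(ast.literal_eval(inside) == s)  # False: the marks became part of the value
```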

storchaka (Serhiy Storchaka) February 12, 2026, 3:14pm 6

This is definitely not intentional. Python does nothing special with this. It is the terminal and the browser that reorder the output. This is a relatively new issue. More primitive terminals, like xterm, don’t have this issue (yet): they output characters in the order in which they are stored.
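One way to confirm the stored order independently of how a terminal renders it is to print code points rather than the characters themselves. A small sketch:

```python
import unicodedata

s = "12\u05be34"

# Index, code point, bidi class and name are all ASCII, so nothing gets reordered.
for index, ch in enumerate(s):
    print(index, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch), unicodedata.name(ch))

# ascii() gives a compact view of the same logical order.
print(ascii(s))  # '12\u05be34'
```

The maqaf is reported at index 2 with bidi class `R`, between the two pairs of digits.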

guido (Guido van Rossum) February 12, 2026, 3:17pm 7

I think it’s okay to evolve our thinking here. Long ago, repr() of unicode strings would escape non-ASCII characters. We fixed that. I think we ought to do the same for RTL text – if necessary by adding RTL markers to restrict the scope of the RTL.

IOW I want repr("Hello مرحبا world") to render as

"Hello مرحبا world"

I’d be okay if the RTL/LTR markers were shown as escaped sequences though.

storchaka (Serhiy Storchaka) February 12, 2026, 4:43pm 8

paulehoffman (Paul Hoffman) February 12, 2026, 7:06pm 9

Changing the output of repr at this late date is certain to cause innumerable problems in code that uses it. Even though repr is only used in Python for the Python repl, we have no idea where people have used it in their code.

Having said that, I think creating something new like urepr from scratch that follows all the hard-fought Unicode guidance would be a good thing, and I volunteer to help on it. (My background is being active in Unicode in the early 2000s, and being one of the primary authors of the IDNA standard for using beyond-ASCII characters in the DNS.)

Rosuav (Chris Angelico) February 12, 2026, 7:12pm 10

The reprs for various objects HAVE changed though. What code is depending on the specifics?

steve.dower (Steve Dower) February 12, 2026, 8:21pm 11

The repr for str has been pretty consistently “safe to serialise and reevaluate later”, which basically requires assuming a lack of active (unescaped) control characters. Changing that is considerably more drastic than the type names in object reprs, which are the ones that we tend to allow to change.

Strictly speaking, sys.displayhook is used for the repl, and so we can already choose to render strings differently in the repl than what repr does. (There’s also pprint, which might be a better home.)
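As a rough sketch of the sys.displayhook route: the hook below leaves repr() untouched and only changes what the interactive prompt prints. The function names are hypothetical, and falling back to ascii() is just one possible policy, chosen here because it is an existing builtin.

```python
import builtins
import sys
import unicodedata

def has_strong_rtl(text: str) -> bool:
    """True if text contains a character of bidi class R or AL."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)

def display_text(value) -> str:
    """Text to show at the prompt: the repr, or ascii() if the repr has strong RTL."""
    text = repr(value)
    return ascii(value) if has_strong_rtl(text) else text

def rtl_escaping_displayhook(value) -> None:
    # Mirrors the documented default hook: skip None and store the result in `_`.
    if value is None:
        return
    builtins._ = None
    print(display_text(value))
    builtins._ = value

def install() -> None:
    """Opt in for the current interactive session."""
    sys.displayhook = rtl_escaping_displayhook

print(display_text("12\u05be34"))  # '12\u05be34'
```

Note that ascii() escapes every non-ASCII character in the affected value, not only the right-to-left ones, so this is coarser than escaping strong RTL characters alone.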

Changing print()'s behaviour probably requires changing repr, but I don’t know that we want to claim that print() supports RTL output when it’s entirely beholden to the version of libc you’ve compiled with?

Jos_Verlinde (Jos Verlinde) February 12, 2026, 10:33pm 12

I’d think that identifying and fixing a potential issue now is to be preferred to working around a possible error forever.
(Forever > Late)

JamesParrott (James Parrott) February 13, 2026, 11:58am 13

To expand on Steve’s distinction between representation and rendering, I think the issue is how these terminals render Python’s repr.

I can reproduce this in Pyodide in Chrome browser, and in Konsole 26.03.70 (so far, I’m not a fan) on Windows (but not in Windows Terminal).

It’s tricky to decide how to change Python code to guarantee the same behaviour in different environments at the best of times. Suffice to say, there is great potential for confusing users.

>>> s0 = '12\u05be34'
>>> s1 = '1234\u05be'
>>> s0 == s1
False
>>> repr(s0) == repr(s1)
False
>>> repr(s0)
"'12־34'"
>>> repr(s1)
"'1234־'"
>>> list(s0)[-1]
'4'
>>> list(s0)
['1', '2', '־', '3', '4']
>>> repr(s0)[-2]
'4'

Similar behaviour is seen with .py files that print reprs of such strings.

Carmina16 (Carmina16) February 18, 2026, 11:41am 14

The optimal solution will be:

Modify repr to fence the strings containing RTL characters with LRMs;
Modify the Python lexer to skip over the BiDi formatting characters outside of strings, so the new output can be parsed back.

shaib (Shai Berger) March 3, 2026, 11:07am 15

Hi, I have no opinion about the correct fix here, but I’d like to point out that if any fencing with BiDi control characters is going to be part of it, then the control characters to use should not be LRM/RLM but rather FSI/PDI (First Strong Isolate and Pop Directional Isolate, U+2068/U+2069). These protect both the context from the string and the string from the context, to preserve

where possible.
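A minimal sketch of isolate-based fencing applied at display time. The helper names `isolate` and `strip_isolates` are hypothetical, and how the fenced text actually renders depends on the terminal or browser.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

def isolate(text: str) -> str:
    """Wrap text in an FSI..PDI pair."""
    return f"{FSI}{text}{PDI}"

def strip_isolates(text: str) -> str:
    """Remove the FSI/PDI fences again."""
    return text.replace(FSI, "").replace(PDI, "")

items = ["1", "2", "\u05be", "3", "4"]

# Fence each element's repr separately so one element cannot reorder its neighbours.
display = "[" + ", ".join(isolate(repr(item)) for item in items) + "]"

print(strip_isolates(display) == repr(items))  # True
```

Stripping is unambiguous here because repr() already escapes FSI and PDI when they occur inside a string, so any literal isolate in the output must be a fence.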

mkzeender (Marckie Z) March 4, 2026, 11:32am 16

We should be careful about allowing arbitrary control characters in the source code. I imagine a supply-chain attack where the code reviewer sees, for example:

if job.action == 'command':
   executor.run_command(job.body)

but the parser sees

if job.action == 'dnammoc':
   exec(job.body).utorrun_command

which introduces a backdoor triggered by “command” spelled backwards.

Hopefully this would be mostly avoidable if the control characters were only allowed directly next to strings?
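For review tooling, a simple scan for bidi control characters is one possible safeguard. This sketch is an assumption about what such a check could look like, not an existing Python feature; it flags the characters wherever they occur, including inside string literals, and the sample source line is invented.

```python
# Explicit bidi control characters and their usual abbreviations.
BIDI_CONTROLS = {
    "\u061c": "ALM",
    "\u200e": "LRM", "\u200f": "RLM",
    "\u202a": "LRE", "\u202b": "RLE", "\u202c": "PDF",
    "\u202d": "LRO", "\u202e": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}

def find_bidi_controls(source: str):
    """Yield (line_number, column, abbreviation) for each bidi control found."""
    for line_number, line in enumerate(source.splitlines(), start=1):
        for column, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                yield line_number, column, BIDI_CONTROLS[ch]

# Invented sample: an override hidden inside a string literal.
suspicious = "if job.action == '\u202edrawkcab\u202c':\n    pass\n"
print(list(find_bidi_controls(suspicious)))  # [(1, 18, 'RLO'), (1, 27, 'PDF')]
```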

vstinner (Victor Stinner) March 4, 2026, 5:43pm 17

cben (Beni Cherniavsky-Paskin) March 4, 2026, 11:33pm 18

The issues are not limited to strings. Example:

>>> from dataclasses import dataclass
>>> @dataclass  # shorter repr without English `object at 0x`
... class אבג: pass
... 
>>> x = אבג()
>>> 
>>> [x, 0, 1, 2]
[אבג(), 0, 1, 2]
>>> [3, x, 4, 5]
[3, אבג(), 4, 5]
>>> [6, 7, 8, x]
[6, 7, 8, אבג()]

The outputs are correct in logical order and show fine on LTR xterm, but render confusingly on say gnome-terminal (and here in Discourse).

[This is a contrived example: while Python syntax supports Unicode identifiers fine, there is AFAIK no editor that sanely handles a mix of LTR keywords and APIs with RTL names. It’s just too painful/ambiguous/confusing for anyone to use. (Educational languages like Hedy get away with localizing keywords and core APIs, but that’s out of scope for Python.)
But other examples are possible with RTL reprs, from Enum to custom __repr__…]

The trouble is, at the time repr() runs, there is no way to know where its output is destined: to a dumb LTR terminal, to a bidi-aware terminal, to a full-terminal TUI that has to handle bidi on its own, to a log file (which again can go anywhere), to a bidi-aware browser, or to other forms of UI…

Many of these can be overridden too, by context like LRO/RLO control chars or CSS…
(inspect the code block above, add unicode-bidi: bidi-override;)
Terminal bidi is not well standardized but terminal-wg draft spec was a serious attempt. Support, or lack of, is generally impossible to detect, and some terminals allow overriding it by escape sequences. E.g. VTE (gnome-terminal etc.) implemented:

>>> print("\x1b[8l")  # disable bidi ("explicit" mode) => same reprs render like xterm!  
>>> # ...repeat prints from above code block...  
>>> print("\x1b[8h")  # enable bidi ("implicit" mode)

Additionally, repr() is frequently recursive, used to build up bigger reprs (and/or other not-quite-Python notations).
Bidi control chars like FSI..PDI do nest, though max_depth=125 limits that to, I think, ~64 levels?

It’s a complex area, I don’t want to jump to conclusions

As a Hebrew speaker, I’d hate to lose ability to read my language in strings!
Sure, invisible controls SHOULD be escaped, but the same issue affects all printable letters in Arabic/Farsi/Hebrew/etc.
On one hand, I feel repr() is too early to handle bidi; it belongs later, e.g. in sys.displayhook. This may be good enough if the answer is to simply suppress bidi, e.g. with LRO..PDF, which is not totally crazy! Letter-by-letter LTR does impact reading (doubly so if it breaks shaping), but it’s still way better than escaping all letters as \u…, and it is the best way to see the logical order in strings. (In limited cases RTL users could want RLO…PDF too.)
OTOH, wrapping in FSI..PDI while building up the structure is the Right way to allow per-level bidi but preserve structure. It allows the best string reading order, but it’s inherently ambiguous to read, even for strings, and grows more ambiguous with nesting.
Or perhaps suppressing bidi is the safe default, and non-default recursive solutions belong in pprint/reprlib?
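The "suppress bidi" option can be sketched as a display-time wrapper that leaves repr() itself alone. The helper name `force_ltr` is hypothetical, and whether a given terminal honours the override is not something the code can detect.

```python
import ast

LRO = "\u202d"  # LEFT-TO-RIGHT OVERRIDE
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING

def force_ltr(text: str) -> str:
    """Wrap text so a bidi-aware renderer lays it out strictly left to right."""
    return f"{LRO}{text}{PDF}"

x = ["Hello \u05d0\u05d1\u05d2", 0, 1]
shown = force_ltr(repr(x))
print(shown)

# The wrapper is display-only: dropping the two controls recovers the plain repr.
print(ast.literal_eval(shown[1:-1]) == x)  # True
```

An override does not carry across a paragraph separator, so a repr spanning several lines would presumably need each line wrapped separately.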

steve.dower (Steve Dower) March 5, 2026, 8:19pm 19

Correct, but UTS #55 has those conclusions for you. We spent a few months going through all the alternatives and aspects you raise, and they should all be answered.

Though the general takeaway for us here is that languages(/syntax/compilers) should deal with character streams, representing “bytes/chars in memory” so to speak, rather than worrying about rendering. Editors/viewers should worry about rendering. The repr of a string looks like a string, so any bidi-aware code renderer should be able to render it correctly within the quotes.