Python 3.X Strings Tutorial by Mark Lutz
This article was originally written for Pythons 3.0 and 2.6, but applies to all 3.X and 2.X. It became a new chapter in the book Learning Python and was revised and expanded in the 5th Edition. After evolving independently here, its 3.X coverage formed the genesis of the 6th Edition's Unicode chapter. See also the related reading links ahead for more resources.
- Introduction
- [The Basics](#The Basics)
- [Character Representations](#Character Representations)
- [Character Encoding Schemes](#Character Encoding Schemes)
- [Python's String Types](#Python's String Types)
- [Text and Binary Files](#Text and Binary Files)
- [Python 3.0 Strings in Action](#Python 3.0 Strings in Action)
- [Literals and Basic Properties](#Literals and Basic Properties)
- [String Type Conversions](#String Type Conversions)
- [Coding Unicode Strings in Python 3.0](#Coding Unicode Strings in Python 3.0)
- [Coding Unicode Strings in Python 2.6](#Coding Unicode Strings in Python 2.6)
- [Source-File Encoding Declarations](#Source-File Encoding Declarations)
- [Processing 3.0 Bytes Objects](#Processing 3.0 Bytes Objects)
- [Method Calls](#Method Calls)
- [Sequence Operations](#Sequence Operations)
- [Other Ways to Make Bytes](#Other Ways to Make Bytes)
- [Mixing String Types](#Mixing String Types)
- [Using 3.0 bytearray Objects](#Using 3.0 bytearray Objects)
- [Python 3.0 File Modes in Action](#Python 3.0 File Modes in Action)
- [Text File Basics](#Text File Basics)
- [Using Text and Binary Modes](#Using Text and Binary Modes)
- [Using Unicode Text Files](#Using Unicode Text Files)
- [Other String Tool Changes in 3.0](#Other String Tool Changes in 3.0)
- [The re Pattern-Matching Module](#The re Pattern-Matching Module)
- [The struct Binary-Data Module](#The struct Binary-Data Module)
- For More Reading
Jun-2009 (last polished Apr-2024)
Strings in 3.X: Unicode and Binary Data
One of the most noticeable changes in Python 3.0 is the mutation of string object types. In a nutshell, 2.X's str
and unicode
types have morphed into 3.X's bytes
and str
types, and a new mutable bytearray
type has been added. Especially if you process data that is either Unicode or binary in nature, this can have substantial impacts on your code. As a general rule of thumb, how much you need to care about this topic depends in large part upon which of the following categories you fall into:
- If you deal with non-ASCII Unicode text (for instance, in the context of internationalized applications, Internet content, or XML parsers), you will find support for text encodings to be different in 3.0, but also probably more direct, accessible, and seamless than in 2.6, thanks to 3.0's all-Unicode `str`.
- If you deal with binary data (for example, in the form of image or audio files, network transfers, or packed data processed with the `struct` module), you will need to understand 3.0's new `bytes` object, and its different and sharper distinction between text and binary data and files.
- If you fall into neither of the prior two categories, you can generally use strings in 3.0 much as you would in 2.6: with the general `str` string type, text files, and all the familiar string operations. Your strings will be encoded and decoded using your platform's default encoding (e.g., ASCII, UTF-8, or Latin-1; the `locale` module's `getpreferredencoding()` gives your `open()` default if you must know), but you probably won't notice.
For example, if text is still always ASCII in your corner of the software world, you might be able to get by with normal string objects and text files, and can avoid most of the following story. As we'll see in a moment, ASCII is a simple kind of Unicode and a subset of other encodings, so string operations and files "just work" if your programs process ASCII text.
Even if you fall into the last category above, though, a basic understanding of 3.0's string model can help, both to demystify some of the underlying details now, and to help you master Unicode or binary data issues if they impact you in the future. Given the prominence of the web in most software careers today, that impact may be more a matter of "when" than "if."
The Basics
Before looking at code, let's begin with a general overview of the 3.0 string model. To understand why 3.0 went the way it did, we have to start with a brief look at how characters are actually represented in computers.
Character Representations
Most programmers think of strings as a series of characters (really, their integer codes) used to represent textual data. That's still true in the brave new world of Unicode, but the way characters are stored in a computer's memory and files can vary, depending on both what sort of characters are recorded, and how programmers choose to record them.
For many programmers in the US, the ASCII standard defines their notion of text strings. ASCII is a standard created in the US, which defines character codes 0..127, and thus allows each character to be stored in one 8-bit byte. For example, the ASCII standard maps character 'a' to the integer value 97 (61 in hex), which can be stored in a single byte both in memory and on files. If you wish to check, Python's ord()
shows the integer code of a given character; chr()
reveals the character of a given integer code; and hex()
gives the code's byte value as two hex digits, each of which fits a 4-bit nibble; the first of these is the value of a character's code—and byte—in ASCII:

    >>> ord('a')            # character => code
    97
    >>> chr(97)             # code => character
    'a'
    >>> hex(97)             # byte value: fits 8 bits
    '0x61'

ASCII makes text processing simple, because characters directly correlate to bytes. Sometimes, though, this isn't enough. Accented characters and special symbols, for example, do not fit into the range of character codes defined by ASCII. To allow for some such extra characters, other standards allow all possible values in an 8-bit byte, 0..255, to be used as codes, and assign values 128..255 to additional characters. One such standard is known as Latin-1, and is widely used in Western Europe. In Latin-1, character codes above 127 are assigned to accented and otherwise-special characters. For instance, the character which Latin-1 assigns to code 196 (a.k.a. byte value 0xc4
) is a specially marked and non-ASCII character. Per Python 3.X:

    >>> chr(196)            # too big for ASCII
    'Ä'
    >>> ord('Ä')            # okay for Latin-1
    196
    >>> hex(ord('Ä'))       # byte value in Latin-1
    '0xc4'

Still, some alphabets define so many characters that it is impossible to represent them as one byte-sized code per character. The integer codes of the symbols and characters in the following, for example, require more space than a byte—as do those of all the silly emojis that may not work in some viewers and editors, but manage to crop up in your emails anyhow:

    >>> ord('☞')
    9758
    >>> hex(ord('☞'))                        # too big for one byte
    '0x261e'

    >>> [hex(ord(c)) for c in '真Л⇨']        # ditto: Unicode required
    ['0x771f', '0x41b', '0x21e8']

    >>> [hex(ord(c)) for c in '🙂🙊👍']       # emojis > two bytes (16 bits)
    ['0x1f642', '0x1f64a', '0x1f44d']

Unicode provides the generality we need to deal with text containing non-ASCII characters and symbols like these. In fact, it defines and assigns enough character codes to represent almost every natural language in use, plus a large set of symbols. Unicode is sometimes referred to as "wide-character" strings, because its range of characters is so broad that multiple bytes may be needed to represent individual character codes. To allow for this, it also defines standard ways to map character codes to bytes for storage and transmission that are both platform and language neutral—the encodings we'll explore in the next section.
The takeaway here is that Unicode's combination of all-encompassing character codes and their predefined encodings make it a highly flexible model, and the standard way that programs deal with non-English and other text that may have more characters than 8-bit bytes can handle. As an added bonus, earlier schemes like ASCII also fall under the Unicode umbrella unchanged, but we have to move on to the next section to see how.
Character Encoding Schemes
The key to understanding how Unicode works lies in the way its character codes (a.k.a. "code points") in memory are mapped to their encoded forms as needed for efficient storage or transfer. We say that characters are translated to and from raw bytes using an _encoding_—the rules for translating a Unicode string into a sequence of bytes, and extracting the string from a sequence of bytes. More procedurally, this translation back and forth between bytes and strings is defined by two terms:
- Encoding is the process of translating a string of characters into its raw-bytes form, per any desired encoding that's broad enough to store its characters.
- Decoding is the process of translating a string of raw bytes into its character-string form, per the encoding originally used to create the bytes string.
As noted, Unicode defines both character codes and a set of standard encodings. For some of the encodings it defines, the translation process is trivial—ASCII and Latin-1, for instance, map each character to a single byte, so little or no work is required to encode and decode. For other encodings, the mapping can be more complex, and yield multiple bytes per character.
The widely used UTF-8 encoding, for example, allows more characters to be represented by employing a variable-number-of-bytes scheme that's both general and economical. Character codes less than 128 are represented as a single byte; codes between 128 and 0x7ff (2047) are turned into two bytes, where each byte has a value between 128 and 255; and codes above 0x7ff are turned into three- or four-byte sequences having values between 128 and 255. This keeps simple ASCII strings compact, sidesteps byte ordering issues, and avoids null (zero) bytes that can cause problems for C libraries and networking.
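As a rough illustration of the variable-length scheme just described, the following sketch encodes one sample character from each code-point range and counts the resulting bytes. The sample characters and the helper name `utf8_length` are choices made here for illustration only, not part of any library.

```python
# One sample character per UTF-8 length range (picked here for illustration):
# code points 0x61, 0xC4, 0x261E, and 0x1F642
samples = ['a', 'Ä', '☞', '🙂']

def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode one character."""
    return len(ch.encode('utf-8'))

lengths = [utf8_length(ch) for ch in samples]
print(lengths)                       # one entry per sample character

# Every byte in a multi-byte sequence falls in 128..255, so no zero
# bytes or ASCII look-alikes appear inside an encoded non-ASCII character
multibyte = '☞'.encode('utf-8')
high_only = all(byte >= 128 for byte in multibyte)
print(multibyte, high_only)
```

Code points below 128 take one byte, those up to 0x7ff take two, and larger ones take three or four, which matches the ranges given above.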
Despite such details, it's important to note that ASCII is a _subset_ of both Latin-1 and UTF-8. This is true because these encodings both assign ASCII characters to the same codes, and encode those characters to bytes the same way. This makes Unicode compatible with existing ASCII data: every character string encoded per ASCII is also valid according to the Latin-1 and UTF-8 encodings, and every ASCII file is a valid Latin-1 and UTF-8 file. Technically, the ASCII encoding is a 7-bit subset of the other two: it's binary compatible for all character codes less than 128. Latin-1 and UTF-8 simply allow for additional characters: Latin-1 for characters mapped to values 128..255 within a byte, and UTF-8 for characters that may be represented with multiple bytes.
Other encodings support richer character sets in other ways. For instance, UTF-16 and UTF-32 use a fixed and larger 2 or 4 bytes per character, respectively, the former with a special "surrogate pair" protocol for codes too large for 2 bytes. We'll skip further details here, but keep in mind that all of these (ASCII, Latin-1, UTF-8, and others) are simply alternative Unicode encodings that yield the same Unicode code-point text when decoded. This net effect ensures that text is _portable_ across all the tools that use it, in exchange for minor translation costs:
- When decoded, character code points may or may not occupy multiple bytes in memory, depending on programming-language implementation. Pythons 3.3 and later, for example, use a variable-length scheme to store decoded text with 1, 2, or 4 bytes per character, depending on string content. Earlier Pythons instead store each character in a fixed 2 or 4 bytes, depending on compilation settings.
- When encoded, the format of character code points is wholly determined by the standard Unicode encoding applied. This format is the same regardless of which programming language creates or processes the text, making it ideal for storage and transfer—especially in the wildly diverse realm of the Internet. This format is often less ideal for programs to use, though, which is why it's normally decoded when loaded.
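To make the decoded/encoded distinction concrete, here is a small sketch that encodes one string under several schemes and then decodes each result. The variable names are illustrative, and the four encodings are just a sample of those Python ships with.

```python
text = 'AÄBèC'                       # 5 code points, 2 of them non-ASCII

# Encode the same text under several schemes: the bytes differ per scheme
encoded_forms = {name: text.encode(name)
                 for name in ('latin-1', 'utf-8', 'utf-16', 'utf-32')}
sizes = {name: len(raw) for name, raw in encoded_forms.items()}
print(sizes)

# Decoding each form with its own scheme yields the same code points again
decoded = [raw.decode(name) for name, raw in encoded_forms.items()]
same = all(result == text for result in decoded)
print(same)
```

The encoded sizes vary with the scheme, but the decoded text is identical in every case, which is the portability property described above.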
To Python programmers, an encoding is specified as a string containing the encoding's name. Python comes with roughly 100 different encodings; see the Python Library Reference for a list. Importing module encodings
and asking for help(encodings)
shows you many as well; some are implemented in Python, and some in C. Some encodings have multiple names too; for example, "latin-1", "iso_8859_1" and "8859" are all synonyms for the same encoding, Latin-1. We'll revisit encodings [later](#Coding Unicode Strings in Python 3.0) in this article, when we study Unicode coding techniques.
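If you want to check that several names refer to the same encoding, one option is the standard `codecs.lookup()` call, which resolves a name to its codec. The sketch below assumes only the three aliases named above; the variable names are illustrative.

```python
import codecs

aliases = ['latin-1', 'iso_8859_1', '8859']

# Each alias resolves to a codec; collect the distinct canonical names
canonical = {codecs.lookup(name).name for name in aliases}
print(canonical)

# All three aliases also produce the same bytes for the same text
results = {'Ä'.encode(name) for name in aliases}
print(results)
```

Both sets should collapse to a single member if the names are true synonyms.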
For another take on the Unicode backstory, see the Python standard manual set. It includes a "Unicode HOWTO" in its "Python HOWTOs" section which provides additional details which we will skip here in the interest of space.
Update: some encodings also require or allow markers at the start of encoded text, known as Unicode BOMs. These markers can designate byte order and encoding type, and may be present whether the encoded text is stored in files or memory. Though not covered in this early-draft article, there is a brief look at this topic on this site here, and more complete surveys in later books. For more on why using the correct encoding matters in general, read the Latin-1/CP-1252 saga here.
Update: this doc also does not discuss Unicode normalization, an advanced but essential topic. In short, the Unicode standard oddly allows some non-ASCII characters (e.g., `ñ` and `Ä`) to be represented with multiple and differing code-point sequences when decoded. This in turn forces many text- and filename-processing tools to make disparate forms equivalent before running comparisons. For more details on this border case, see this site's off-page coverage here and here.
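As a brief sketch of the normalization issue, the standard `unicodedata` module can convert between the differing forms. The example below assumes the NFC (composed) form as the common target; other forms exist and may suit other tools better.

```python
import unicodedata

composed = '\u00f1'          # ñ as a single code point
decomposed = 'n\u0303'       # n followed by a combining tilde

# The two display alike but compare unequal as raw code-point sequences
equal_raw = (composed == decomposed)
print(len(composed), len(decomposed), equal_raw)

# Normalizing both to one form (NFC here) makes the comparison succeed
equal_nfc = (unicodedata.normalize('NFC', composed) ==
             unicodedata.normalize('NFC', decomposed))
print(equal_nfc)
```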
Python's String Types
At a more concrete level, the Python language provides multiple string data types to represent content in your script: both _textual data_—integer code-point values of decoded Unicode characters in memory, as well as _binary data_—raw byte values, including text that is in encoded form. These types differ in the two Python lines.
For example, Python 2.X has a general string type for representing both simple 8-bit character text like ASCII and binary data, along with a specific type for representing richer Unicode text that may occupy multiple bytes when encoded or decoded:
- **`str`**: for representing both 8-bit text and binary data
- **`unicode`**: for representing Unicode text (decoded code points)

Python 2.X's two string types are different (`unicode` allows for the extra range of Unicode characters, and has extra support for encoding and decoding), but their operation sets largely overlap. Because the `str` string type in 2.X represents both text that can be represented with 8-bit bytes as well as binary data, it can be used for both textual and non-textual content.
By contrast, Python 3.0 comes with 3 string object types:
- **`str`**: for representing Unicode text (decoded code points)
- **`bytes`**: for representing binary data (including encoded text)
- **`bytearray`**: a mutable flavor of the `bytes` type

All 3 types support similar operation sets, but have different roles. The main goal behind this change was to merge the normal and Unicode string types of 2.X (its `str` and `unicode`) into a single string type that supports both byte-oriented and richer Unicode text. Developers wanted to remove the 2.X string dichotomy, and make Unicode processing more uniform and natural.
To achieve this, the 3.0 **str**
type is defined as an immutable sequence of characters (really, code points that are not necessarily bytes). Its content may contain both simple text such as ASCII whose encoded and decoded forms yield one byte per character, as well as richer Unicode text whose encoded and decoded forms may both require multiple bytes per character. In memory, a str
is just a sequence of Unicode code points. When transferred to and from files, a str
is automatically encoded and decoded using either the platform default, or a provided encoding name to translate with an explicit scheme.
While 3.0's new str
type does achieve 2.X str
/unicode
merging for text, many programs still need to process raw binary content that is not encoded per any Unicode format—as well as the bytes used to store text when it is encoded. Image files, and packed data you might process with Python's struct
module fall into this category. To support this, a new type, **bytes**
, also was introduced to support processing of truly binary data. bytes
is just bytes, not Unicode characters, though its content may include still-encoded text.
In 2.X, the general str
type filled this binary data role, because strings were just sequences of bytes (the separate unicode
type handled richer text). In 3.0, the bytes
type is defined as an immutable sequence of 8-bit integers representing byte values, and supports almost all the same operations that the str
type does; this includes string methods, sequence operations, and even re
module pattern matching, but not formatting (till later in 3.X's evolution: see the update [ahead](#3.X bytes formatting)).
A bytes
object really is a sequence of small integers, each of which is in the range 0..255; indexing a bytes
returns an int
, slicing one returns another bytes
, and running list()
on one returns a list of integers, not characters. However, when processed with operations that assume characters (e.g., the isalpha()
method), the contents of bytes
objects are assumed to be ASCII-encoded bytes. Further, bytes
items whose values fall in the range of ASCII character codes are printed as ASCII characters instead of integers; this is done purportedly for convenience, though it may also confuse the distinction between text and binary data.
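The integer nature of `bytes` can be seen in a short sketch like the following. The variable names are illustrative, and the byte values are arbitrary examples.

```python
# A bytes object can be built directly from small integers (0..255)
B = bytes([115, 112, 97, 109])
is_spam = (B == b'spam')            # same object content as the literal form
first = B[0]                        # indexing yields an int, not a character

# Character-oriented methods assume ASCII-encoded content
alpha_ascii = b'spam'.isalpha()     # all bytes are ASCII letters
alpha_high = b'\xc4'.isalpha()      # 0xC4 is not an ASCII letter

# Display mixes ASCII characters with hex escapes for other byte values
shown = repr(b'\x01sp\xff')
print(is_spam, first, alpha_ascii, alpha_high, shown)
```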
While it was at it, Python also sprouted **bytearray**
in 3.0, a variant of bytes
, which is mutable, and so supports in-place changes. The bytearray
type supports the usual string operations that str
and bytes
do, but also has many of the same in-place change operations as lists (e.g., append()
and extend()
, and assignment to indexes). Assuming your strings can be treated as raw bytes, bytearray
finally adds direct in-place mutability for string data—something not possible in 2.X, or with 3.0's str
or bytes
.
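A minimal sketch of the in-place operations mentioned above follows. The content is arbitrary sample data, and decoding as ASCII at the end assumes every byte is in the ASCII range.

```python
BA = bytearray(b'spam')      # mutable sequence of small integers

BA[0] = ord('S')             # index assignment takes an int
BA.append(ord('!'))          # append one byte value, as for lists
BA.extend(b' eggs')          # extend with any bytes-like object
BA[1:3] = b'PA'              # slice assignment takes bytes-like content
print(BA)

text = BA.decode('ascii')    # convert to str once the changes are done
print(text)
```

None of these statements would work on a `str` or `bytes`, since both are immutable.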
Text and Binary Files
File I/O has also been revamped in 3.0 to reflect the str
/bytes
distinction. Really, text is just decoded integer character codes when it is in memory; it's when text is transferred to and from external interfaces like files that Unicode encodings come into play. By contrast, truly binary data may have nothing at all to do with encodings (or text at all). Because of this, Python now makes a sharp platform-independent distinction between text files and binary files:
- When a file is opened in text mode, reading its data automatically decodes its content (per a platform default or a provided encoding), and returns it as a `str`; writing takes a `str`, and automatically encodes it before transferring to the file. Text mode files also support universal end-of-line translation, and encoding specification arguments.
- When a file is opened in binary mode by adding a "b" to the mode string argument in the `open()` call, reading its data does not decode it in any way, and simply returns its content raw and unchanged, as a `bytes` object; writing takes a `bytes` object and transfers it to the file unchanged. Binary-mode files also accept a `bytearray` object for the content to be written to the file.
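The two modes can be compared on the same file with a sketch along these lines. The scratch file name `demo.txt` and the temporary folder are assumptions made for this example; the text is written without a line end so that end-of-line translation does not affect the bytes stored.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder, used only for this demo
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

# Text mode: str is encoded automatically on writes
with open(path, 'w', encoding='utf-8') as file:
    file.write('Äè')

# Text mode: content is decoded automatically on reads
with open(path, 'r', encoding='utf-8') as file:
    as_text = file.read()

# Binary mode: the same content comes back raw and still encoded
with open(path, 'rb') as file:
    as_bytes = file.read()

print(repr(as_text), len(as_text))
print(as_bytes, len(as_bytes))
```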
Because str
and bytes
are sharply differentiated by the language, the net effect is that you must decide whether your data is text or binary in nature, and use str
or bytes
objects to represent its content in your script, respectively. Ultimately, the mode in which you open a file will dictate which type of object your script will use to represent its content.
- If you are processing image files, packed data created by other programs whose content you must extract, or some device data streams, chances are good that you will want to deal with them using `bytes` and binary-mode files. You might also opt for `bytearray` to update the data without making copies of it in memory.
- If instead you are processing something that is textual in nature such as program output, HTML, JSON, internationalized text, and CSV or XML files, you probably want to use `str` and text-mode files.
Notice that the mode string argument to open()
(its second argument) becomes fairly crucial in Python 3.0—its content not only specifies a file processing mode, but also implies a Python object type. By adding a "b" (lower-case only) to the mode string, you specify a binary mode file, and will receive, or must provide, a bytes
object to represent the file's content when reading or writing. Without the "b", your file is processed in text mode, and you'll use str
objects to represent its content. For example, modes "rb", "wb", and "rb+", imply bytes
; "r", "w+", and "rt" (the default) imply str
.
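The link between mode strings and object types can be verified with a short sketch such as the one below. The scratch file name `modes.bin` is an assumption for this example.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary folder, used only for this demo
path = os.path.join(tempfile.mkdtemp(), 'modes.bin')

error = None
with open(path, 'wb') as file:
    file.write(b'spam')              # binary mode requires bytes-like content
    try:
        file.write('eggs')           # a str is rejected in binary mode
    except TypeError as exc:
        error = exc

with open(path, 'rb') as file:
    kind_binary = type(file.read())  # "b" in the mode implies bytes

with open(path, 'r', encoding='ascii') as file:
    kind_text = type(file.read())    # no "b" implies str

print(kind_binary, kind_text, type(error).__name__)
```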
If you're anxious to see files in action, watch for the examples [ahead](#Python 3.0 File Modes in Action), especially those of Unicode-text [files](#Using Unicode Text Files). To understand file usage in full, though, we first need to explore string basics.
Python 3.0 Strings in Action
Let's step through a few examples that demonstrate how the 3.0 string types are used. Note up front that the code in this section was run with and applies to 3.0 only, unless noted otherwise. That said, although there is no bytes
type in Python 2.6 (it has just the general str
), some cross-version compatibility is still possible: in 2.6, the call bytes(X)
is present as a synonym for str(X)
, and the new literal form b'...'
is taken to be the same as the normal string literal '...'
(in this article, ...
means a string's characters). You may still run into version skew in some cases, though; the 2.6 bytes()
call, for instance, does not allow the second argument (encoding name) required by 3.0's bytes()
.
Literals and Basic Properties
Python 3.0 string objects originate when you call a function such as str()
or bytes()
; process a file created by calling open()
(described [later](#Python 3.0 File Modes in Action) in this article); or code literal syntax in your script. For the latter, a new literal form, b'...'
(and equivalently, B'...'
) is used to create bytes
objects in 3.0, and bytearray
objects may be created by calling the bytearray()
function, with a variety of possible arguments.
More formally, in 3.0 all the current string literal forms—'...'
, "..."
, and triple-quoted blocks—generate a str
; adding a "b" or "B" just before them creates a bytes
instead. This new b'...'
bytes literal is similar in spirit to the r'...'
raw string, which suppresses backslash escapes. Consider the following:

    C:\misc> c:\python30\python
    >>> B = b'spam'            # make a bytes object (8-bit bytes)
    >>> S = 'eggs'             # make a str object (Unicode characters, 8-bit or wider)

    >>> type(B), type(S)
    (<class 'bytes'>, <class 'str'>)

    >>> B                      # prints as a character string, really a sequence of ints
    b'spam'
    >>> S
    'eggs'

    >>> B[0], S[0]             # indexing returns an int for bytes, str for str
    (115, 'e')

    >>> B[1:], S[1:]           # slicing makes another bytes or str
    (b'pam', 'ggs')

    >>> list(B), list(S)       # bytes is really ints
    ([115, 112, 97, 109], ['e', 'g', 'g', 's'])

    >>> B[0] = 'x'             # both are immutable
    TypeError: 'bytes' object does not support item assignment

    >>> S[0] = 'x'
    TypeError: 'str' object does not support item assignment

    >>> # bytes prefix works on single, double, triple quotes
    >>> B = B"""
    ... xxxx
    ... yyyy
    ... """
    >>> B
    b'\nxxxx\nyyyy\n'

As mentioned, for forward compatibility, in Python 2.6 the 3.0 b'...'
literal is present but is the same as '...'
and makes a 2.X str
, and bytes()
is just a synonym for str()
; in 3.0, both these address the distinct bytes
type, as shown above for the literal. Also note that the u'...'
and U'...'
unicode
string literal forms in 2.6 discussed [ahead](#Coding Unicode Strings in Python 2.6) are gone in 3.0; use '...'
in 3.0 instead, since all text strings are Unicode in the 3.X line, even if they contain only ASCII characters.
Update: Python 3.X later reinstated 2.X's unicode
string literals to ease migration of 2.X code: a 2.X u'...'
unicode
literal in Python 3.3 and later is now just a synonym for a 3.X '...'
str
literal. This makes sense given 3.X's all-Unicode str
type, and is the backward-compatible equivalent of 2.X's forward-compatible b'...'
support. It's tempting to read into this that 2.X's str
and unicode
simply become 3.X's bytes
and str
, but the division of these types' roles is much sharper in 3.X, as the next section explains.
String Type Conversions
Syntax aside, the first thing you might notice about Python 3.0 strings is what they cannot do. Although Python 2.X allows its str
and unicode
objects to be freely mixed (if the str
contains only 7-bit ASCII text, at least), 3.X draws a much sharper distinction—str
and bytes
never mix automatically in expressions, and as a rule are not converted to one another automatically when passed to functions. That is, a function that expects an argument to be a str
object won't generally accept a bytes
(and vice versa), and operators are just as rigid in 3.X:

    >>> 'eggs' + b'spam'
    TypeError: can only concatenate str (not "bytes") to str

This is easier to understand if you remember that a text string may be radically different in its encoded and decoded forms, and Python has no idea what the content of a bytes
is: if the bytes
is encoded text its encoding is unknown, but it may also be binary data that has nothing to do with text at all. Because of this ambiguity, Python 3.0 basically requires that you either commit to one type or the other, or perform manual, explicit conversions with the following tools:
- _S_.encode() and bytes(_S_, _encoding_) encode a `str` _S_ to a new `bytes`
- _B_.decode() and str(_B_, _encoding_) decode a `bytes` _B_ to a new `str`

Both the _S_.encode() and _B_.decode() methods above and the file `open()` call we'll explore [ahead](#Using Unicode Text Files) use either an explicitly passed-in encoding name or a default. In Python 3.X, the methods' default is always UTF-8, but `open()` uses a value in the `locale` module that may vary per platform (and environment settings). In 2.X both defaults are usually ASCII, as exposed in the `sys` module (which allows changes at start-up). For example, in 3.X:

    >>> S = 'eggs'
    >>> S.encode()                       # str to bytes: encode text into raw bytes
    b'eggs'

    >>> bytes(S, encoding='ascii')       # str to bytes, alternative
    b'eggs'

    >>> B = b'spam'
    >>> B.decode()                       # bytes to str: decode raw bytes into text
    'spam'

    >>> str(B, encoding='ascii')         # bytes to str, alternative
    'spam'

Putting this together solves our original type error, and allows us to mix strings and bytes in 3.X as either encoded or decoded text:

    >>> S, B
    ('eggs', b'spam')

    >>> S.encode('ascii') + B            # bytes + bytes (encoded)
    b'eggsspam'

    >>> S + B.decode('ascii')            # str + str (code points)
    'eggsspam'

Two cautions here. First of all, your platform's various default encodings are available in the sys
and locale
modules, but the encoding argument to bytes()
is not optional, even though it is in _S_.encode()
(and _B_.decode()
). Second, although str()
does not require the encoding argument like bytes()
does, leaving it off in str()
calls does not mean it defaults—instead, a str()
without an encoding returns the bytes
object's print string, not its decoded and converted str
form (this is usually not what you'll want!). Assuming B
and S
are still as in the prior listing:

    >>> import sys, locale
    >>> sys.platform                             # underlying platform
    'win32'
    >>> locale.getpreferredencoding(False)       # Windows open() default: a Latin-1 superset
    'cp1252'
    >>> sys.getdefaultencoding()                 # but str() does not use defaults
    'utf-8'

    >>> bytes(S)
    TypeError: string argument without an encoding

    >>> str(B)                                   # str() without encoding
    "b'spam'"                                    # print string, not conversion!
    >>> len(str(B))
    7

    >>> len(str(B, encoding='ascii'))            # use encoding to convert to str
    4

Update: as of 2024, Python's docs state that the default encoding for file content is now `locale.getencoding()`, not `locale.getpreferredencoding(False)`, but this is not true: the former is ignorant of a new UTF-8 mode option that can be enabled by environment variable or command-line argument, though the difference won't matter after Python 3.15 enables UTF-8 mode everywhere (per current plans). You also shouldn't generally care: pass explicit encodings in your `open()` calls to avoid interoperability hurdles today.
Having said all that, it's important to also note that encoding and decoding are substantially more than simple programming-language type conversions; really, they produce very different kinds of data. Encoding returns the bytes that result from transforming a text string per a Unicode scheme, and decoding returns the text string that is produced by undoing that transformation. While this is a conversion of sorts, and the mapping may seem trivial for simple text like ASCII, Unicode tends to make much more sense if you avoid blurring the distinction—especially for richer types of text like that in the next section.
Coding Unicode Strings in Python 3.0
Encoding and decoding get more meaningful when you start dealing with actual non-ASCII Unicode text. To code Unicode characters that may be difficult to type on your keyboard, Python string literals support both:
- \x_NN_ hex escapes, where 2 hex digits (_NN_) specify a character code as a 1-byte (8-bit) numeric value
- \u_NNNN_ and \U_NNNNNNNN_ Unicode escapes, where the first form gives 4 hex digits to denote a 2-byte (16-bit) character code, and the second gives 8 hex digits for a 4-byte (32-bit) code.
Importantly, in str
objects all three of the escapes listed above are used to give a Unicode character's code point value, not its encoded bytes; use bytes
objects if you need to represent a character's encoded bytes instead.
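The difference between a code point in a `str` and encoded bytes in a `bytes` can be sketched as follows. The character chosen (code point 0xC4) is just an example.

```python
# In a str, all three escape forms name the same code point (0xC4)
S = '\xc4'
same = (S == '\u00c4' == '\U000000c4')
print(same, len(S))

# The code point is not the encoded form: UTF-8 needs two bytes for it
as_utf8 = S.encode('utf-8')
print(as_utf8)

# In a bytes literal, \xNN gives one raw byte value instead
B = b'\xc4'
matches_latin1 = (B.decode('latin-1') == S)   # Latin-1 stores 0xC4 as one byte
print(len(B), matches_latin1)
```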
Let's see how this all translates to code. Simple 7-bit ASCII text is formatted with one character per byte under most of the encoding schemes described near the start of this article (again, this is why ASCII passes as a binary-compatible subset of many other schemes):

    >>> ord('X')               # 'X' has binary value 88 in the default encoding
    88
    >>> chr(88)                # 88 stands for character 'X'
    'X'

    >>> S = 'XYZ'              # str (code points displayed as their character glyphs)
    >>> S
    'XYZ'
    >>> len(S)                 # 3 characters long
    3

    >>> S.encode('ascii')      # values 0..127 in 1 byte each (ASCII bytes shown as chars)
    b'XYZ'
    >>> S.encode('latin-1')    # values 0..255 in 1 byte each
    b'XYZ'
    >>> S.encode('utf-8')      # values 0..127 in 1 byte, 128..2047 in 2, others in 3 or 4
    b'XYZ'

By contrast, the less common UTF-16 and UTF-32 use at least 2 bytes and exactly 4 bytes per character, respectively, even for simple text like ASCII. This can make these encodings' data fast to process, but it may consume extra space and bandwidth, which renders them subpar in many applications. In the following, ASCII bytes print as characters, non-ASCIIs print as \x_NN_ escapes, and each result has a 2- or 4-byte BOM header at the front whose details we're largely skipping here (see the earlier update):

    >>> S
    'XYZ'

    >>> S.encode('utf-16')     # always 2 or 4 bytes per character, plus a BOM header
    b'\xff\xfeX\x00Y\x00Z\x00'

    >>> S.encode('utf-32')
    b'\xff\xfe\x00\x00X\x00\x00\x00Y\x00\x00\x00Z\x00\x00\x00'

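To isolate the BOM header in results like these, one approach is to compare the generic encoding name with a byte-order-specific one, which omits the marker. This sketch assumes the standard `utf-16-le` codec name and the BOM constants in the `codecs` module.

```python
import codecs

S = 'XYZ'
with_bom = S.encode('utf-16')        # generic name: BOM plus platform byte order
no_bom = S.encode('utf-16-le')       # explicit little-endian: no BOM added

# The generic form starts with one of the two UTF-16 byte-order marks
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
extra = len(with_bom) - len(no_bom)
print(has_bom, extra, no_bom)

# Decoding with the generic name consumes the BOM and restores the text
restored = with_bom.decode('utf-16')
print(restored)
```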
To code non-ASCII characters, you can use hex and Unicode escapes in your strings. The numeric values coded as hexadecimal literals 0xC4
and 0xE8
, for instance, are the Unicode code points used to represent two special characters outside the 7-bit range of ASCII; we can embed them in str
objects, because str
supports Unicode in 3.X today:

    >>> chr(0xc4)              # 0xC4 and 0xE8 are accented characters outside ASCII's range
    'Ä'
    >>> chr(0xe8)
    'è'

    >>> S = '\u00c4\u00e8'     # 16-bit Unicode escapes
    >>> S
    'Äè'
    >>> len(S)                 # 2 characters long (not number of bytes!)
    2

Now, if we try to encode a non-ASCII string to raw bytes as ASCII, we'll get an error. Encoding as Latin-1 works, though, and allocates one byte per character; encoding as UTF-8 allocates 2 bytes per character here instead. If you write this string to a file, the raw bytes shown are what is actually stored in the file for the encoding types given:
    >>> S = '\u00c4\u00e8'
    >>> S.encode('ascii')
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

    >>> S.encode('latin-1')             # one byte per character
    b'\xc4\xe8'

    >>> S.encode('utf-8')               # two bytes per character
    b'\xc3\x84\xc3\xa8'

    >>> len(S.encode('latin-1'))        # 2 bytes in latin-1, 4 in utf-8
    2
    >>> len(S.encode('utf-8'))
    4
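In scripts, the same experiment is usually wrapped in exception handling rather than typed interactively. The following is only a sketch: try_encodings is a hypothetical helper name, while the errors argument of str.encode() is a standard option for substituting characters a codec cannot represent.

```python
def try_encodings(text, encodings=('ascii', 'latin-1', 'utf-8')):
    """Hypothetical helper: byte count per encoding, or None if encoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = len(text.encode(name))
        except UnicodeEncodeError:
            results[name] = None        # this codec cannot represent the text
    return results

report = try_encodings('\u00c4\u00e8')
print(report)

# errors='replace' trades data loss for a result instead of an exception
lossy = '\u00c4\u00e8'.encode('ascii', errors='replace')
print(lossy)
```

Whether a lossy fallback is acceptable depends on the program; for stored data, an explicit failure is often the safer choice.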
Note that you can also go the other way—from raw bytes back to a Unicode string. You could read raw bytes from a file and decode manually this way, but the encoding mode you give to the open()
call causes this decoding to be done for you automatically (and avoids issues that may arise from reading partial character sequences when reading by blocks of bytes):
    >>> B = b'\xc4\xe8'
    >>> B
    b'\xc4\xe8'
    >>> len(B)                          # 2 raw bytes, 2 characters
    2
    >>> B.decode('latin-1')             # decode to latin-1 text
    'Äè'

    >>> B = b'\xc3\x84\xc3\xa8'
    >>> len(B)                          # 4 raw bytes
    4
    >>> B.decode('utf-8')
    'Äè'
    >>> len(B.decode('utf-8'))          # 2 Unicode characters
    2
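The partial-character issue mentioned above can be sketched as follows. This example assumes the bytes arrive in arbitrary blocks (the chunk boundaries here are made up to split both characters) and uses the standard codecs.getincrementaldecoder() call to carry incomplete sequences from one block to the next.

```python
import codecs

data = 'Äè'.encode('utf-8')                     # b'\xc3\x84\xc3\xa8'
chunks = [data[:1], data[1:3], data[3:]]        # block boundaries split both characters

# Decoding a block on its own fails when it ends mid-character
try:
    chunks[0].decode('utf-8')
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder buffers the incomplete bytes until the rest arrives
decoder = codecs.getincrementaldecoder('utf-8')()
pieces = [decoder.decode(chunk) for chunk in chunks]
pieces.append(decoder.decode(b'', final=True))  # flush: raises if bytes are left over
text = ''.join(pieces)
print(naive_ok, pieces, text)
```

Text-mode files opened with an encoding perform the equivalent buffering internally, which is one reason to prefer them over manual decoding of blocks.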
When needed, you can also specify both 16- and 32-bit Unicode code-point values for characters in your str
strings: use \u...
with 4 hex digits for the former, and \U...
with 8 hex digits for the latter. As the last example in the following shows, you can also build such strings up piecemeal using chr()
, but it might become tedious for large strings:
    >>> S = 'A\u00c4B\U000000e8C'
    >>> S                               # A, B, C, and 2 non-ASCII characters
    'AÄBèC'
    >>> len(S)                          # 5 characters long
    5

    >>> S.encode('latin-1')
    b'A\xc4B\xe8C'
    >>> len(S.encode('latin-1'))        # 5 bytes in latin-1
    5

    >>> S.encode('utf-8')
    b'A\xc3\x84B\xc3\xa8C'
    >>> len(S.encode('utf-8'))          # 7 bytes in utf-8
    7

    >>> S.encode('cp500')               # two other western european encodings
    b'\xc1c\xc2T\xc3'
    >>> S.encode('cp850')               # 5 bytes each
    b'A\x8eB\x8aC'

    >>> S = 'spam'                      # ascii text is the same in most
    >>> S.encode('latin-1')
    b'spam'
    >>> S.encode('utf-8')
    b'spam'
    >>> S.encode('cp500')               # cp500 is ibm ebcdic
    b'\xa2\x97\x81\x94'
    >>> S.encode('cp850')
    b'spam'

    >>> S = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'
    >>> S
    'AÄBèC'
Notice that Python 3.0 allows special characters' code points to be coded with both hex and Unicode escapes in str
string literals, but allows only hex escapes in bytes
literals; in fact, Unicode escape sequences are taken verbatim in bytes
, and not as escapes. This makes sense if you remember that bytes
objects hold characters' encoded bytes—not their decoded code points. This is true even though code-point and encoded-byte values happen to be the same for some characters in some encodings (confusingly!). Because bytes
are not code points, they also must be decoded to str
to print their non-ASCII characters properly:
    >>> S = 'A\xC4B\xE8C'               # str recognizes hex and Unicode escapes
    >>> S
    'AÄBèC'

    >>> S = 'A\u00C4B\U000000E8C'       # 4- and 8-digit Unicode escapes (str only)
    >>> S
    'AÄBèC'

    >>> B = b'A\xC4B\xE8C'              # bytes recognizes hex but not Unicode
    >>> B
    b'A\xc4B\xe8C'

    >>> B = b'A\u00C4B\U000000E8C'      # Unicode escape sequences taken literally
    >>> B                               # bytes are encoded bytes, not code points
    b'A\\u00C4B\\U000000E8C'

    >>> B = b'A\xC4B\xE8C'              # use hex escapes for latin-1 bytes
    >>> B                               # prints non-ASCII as hex
    b'A\xc4B\xe8C'
    >>> print(B)
    b'A\xc4B\xe8C'
    >>> B.decode('latin-1')             # decode to str to interpret as text
    'AÄBèC'
Finally, notice that bytes
literals assume that embedded characters are ASCII, and require escapes for byte values > 127; str
literals allow embedding any character supported by the file's source-code encoding (which defaults to UTF-8 in 3.X, unless encoding declarations are given—discussed ahead):
    >>> S = 'AÄBèC'                     # chars from UTF-8 if no encoding declaration
    >>> S
    'AÄBèC'

    >>> B = b'AÄBèC'
    SyntaxError: bytes can only contain ASCII literal characters.

    >>> B = b'A\xC4B\xE8C'              # chars must be ASCII, or escapes
    >>> B                               # non-ASCIIs are latin-1 encoded bytes
    b'A\xc4B\xe8C'
    >>> B.decode('latin-1')
    'AÄBèC'

    >>> S.encode()                      # source code encoded per UTF-8 by default
    b'A\xc3\x84B\xc3\xa8C'              # encode() uses UTF-8 unless an encoding is passed
    >>> S.encode('utf-8')
    b'A\xc3\x84B\xc3\xa8C'
    >>> B.decode()                      # raw bytes do not correspond to utf-8
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: ...

    >>> S = 'AÄBèC'
    >>> S
    'AÄBèC'
    >>> S.encode()                      # default utf-8 encoding
    b'A\xc3\x84B\xc3\xa8C'

    >>> T = S.encode('cp500')           # convert to EBCDIC
    >>> T
    b'\xc1c\xc2T\xc3'

    >>> U = T.decode('cp500')           # convert back to Unicode
    >>> U
    'AÄBèC'

    >>> U.encode()                      # back to UTF-8 bytes, by default
    b'A\xc3\x84B\xc3\xa8C'
Coding Unicode Strings in Python 2.6
Now that you've seen the basics of Unicode strings in 3.0, it's also important to know that you can do much the same in 2.6, though the tools differ. Unicode is already available in Python 2.6, but it is a distinct data type from str
, and 2.6 allows free mixing of normal and unicode
strings when compatible. In fact, you can essentially pretend 2.6's str
is 3.0's bytes
when it comes to decoding into a unicode
string, as long as it's in proper form.
Here's 2.6 string support in action (all other sections in this topic but this one are run under 3.0):
    >>> import sys
    >>> sys.version
    '2.6 (r26:66721, Oct 2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)]'

    >>> S = 'A\xC4B\xE8C'               # string of 8-bit bytes
    >>> print S                         # some are non-ascii
    AÄBèC

    >>> S.decode('latin-1')             # decode bytes per latin-1 to unicode
    u'A\xc4B\xe8C'

    >>> S.decode('utf-8')               # not formatted as utf-8
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-2: invalid data

    >>> S.decode('ascii')               # outside ascii range
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)
To store arbitrarily encoded Unicode text, make a Unicode object with the u'...'
literal form; this is no longer available in 3.0, since all strings support Unicode in that version (update: as noted [earlier](#3.X unicode literal), Python 3.X later reinstated 2.X's u'...'
unicode
string literals to ease migration of 2.X code):
    >>> U = u'A\xC4B\xE8C'              # make unicode string, hex escapes
    >>> U
    u'A\xc4B\xe8C'
    >>> print U
    AÄBèC
Once created, you can convert Unicode text to different encodings; this is similar to encoding str
objects into bytes
objects in 3.0:
    >>> U.encode('latin-1')             # encode per latin-1: 8-bit bytes
    'A\xc4B\xe8C'
    >>> U.encode('utf-8')               # encode per utf-8: multi-byte
    'A\xc3\x84B\xc3\xa8C'
Non-ASCII characters can be coded with hex or Unicode escapes in string literals just as in 3.0, but just as for bytes
in 3.0, the \u...
and \U...
escapes are recognized only for unicode
strings in 2.6, not 8-bit str
strings:
    >>> U = u'A\xC4B\xE8C'              # hex escapes for non-ascii
    >>> U
    u'A\xc4B\xe8C'
    >>> print U
    AÄBèC

    >>> U = u'A\u00C4B\U000000E8C'      # unicode escapes for non-ASCII
    >>> U                               # \u = 16 bits, \U = 32 bits
    u'A\xc4B\xe8C'
    >>> print U
    AÄBèC

    >>> S = 'A\xC4B\xE8C'               # hex escapes work
    >>> S
    'A\xc4B\xe8C'
    >>> print S                         # but some print oddly, unless decoded
    A-BFC
    >>> print S.decode('latin-1')
    AÄBèC

    >>> S = 'A\u00C4B\U000000E8C'       # not unicode escapes: taken literally!
    >>> S
    'A\\u00C4B\\U000000E8C'
    >>> print S
    A\u00C4B\U000000E8C
    >>> len(S)
    19
Like 3.0's str
and bytes
, 2.6's unicode
and str
share nearly identical operation sets, so you can often treat unicode
as though it were str
unless you need to convert to other encodings. One of the primary differences between 2.6 and 3.0 is that unicode
and non-unicode str
objects can be freely mixed in expressions, as long as the non-unicode object contains only 7-bit ASCII characters; the non-unicode str
is automatically converted up to unicode
in the process (in 3.0, str
and bytes
never mix automatically, and require manual conversions):
    >>> u'ab' + 'cd'                    # can mix if compatible (if str is all ASCII)
    u'abcd'

    >>> S = 'A\xC4B\xE8C'               # can't mix if incompatible
    >>> U = u'A\xC4B\xE8C'
    >>> S + U
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 1: ordinal not in range(128)

    >>> S.decode('latin-1') + U         # manual conversion still required
    u'A\xc4B\xe8CA\xc4B\xe8C'

    >>> print S.decode('latin-1') + U
    AÄBèCAÄBèC
Finally, note that 2.6's open()
call supports only files of 8-bit bytes, and returns their content as str
strings; it's up to you to interpret that content as text or binary data. To read and write Unicode files and encode or decode their content in the process, see 2.6's library manual for information on the **codecs.open()**
call. This call provides much the same functionality as 3.0's open()
, and uses 2.6 unicode
objects to represent file content—reading a file translates encoded bytes into decoded Unicode characters, and writing translates Unicode strings to the desired encoding specified when opened. We'll see more on files in both Pythons [ahead](#Python 3.0 File Modes in Action).
Source-File Encoding Declarations
One last note on coding non-ASCII text: Unicode escapes suffice for the occasional Unicode character in string literals, but can become tedious if you need to code non-ASCII text in your strings frequently. For string literals and other text that you embed in your script files, Python uses the UTF-8 encoding in 3.X (and ASCII in 2.X) by default to read your code's text, but allows you to change this per file to use an arbitrary encoding—and hence directly embed any unescaped characters that the chosen encoding supports.
To make this work, simply include a comment which names the encoding used to save your source file. This special encoding-declaration comment must appear as either the first or second line in your script, and is usually of the following form (see Python's manuals for other formats it accepts):
    # -*- coding: latin-1 -*-
When present, Python will recognize strings represented natively in the given encoding. That way, you can edit your script file in a text editor that accepts, displays, and saves accented and other non-ASCII characters, and Python will correctly decode them when reading your string literals and other program-file text.
For example, notice how the comment at the top of the following file, "text.py," allows Python to recognize Latin-1 characters embedded in strings when the file is saved with this encoding:
    # -*- coding: latin-1 -*-
    # any of the following string literal forms work in latin-1;
    # changing the encoding above to either ascii or utf-8 fails,
    # because the 0xc4 and 0xe8 in myStr1 are not valid in either

    myStr1 = 'aÄBèC'

    myStr2 = 'A\u00c4B\U000000e8C'

    myStr3 = 'A' + chr(0xC4) + 'B' + chr(0xE8) + 'C'

    import sys, locale
    print('Sys default encoding: ', sys.getdefaultencoding())
    print('Open default encoding:', locale.getencoding())     # later Python 3.X

    for aStr in myStr1, myStr2, myStr3:
        print('{0}, strlen={1}, '.format(aStr, len(aStr)), end='')

        bytes1 = aStr.encode()              # per default utf-8: 2 bytes for non-ASCII
        bytes2 = aStr.encode('latin-1')     # one byte per char
        #bytes3 = aStr.encode('ascii')      # ascii fails: outside 0..127 range

        print('byteslen1={0}, byteslen2={1}'.format(len(bytes1), len(bytes2)))
    C:\misc>c:\python30\python text.py
    Sys default encoding:  utf-8
    Open default encoding: cp1252
    aÄBèC, strlen=5, byteslen1=7, byteslen2=5
    AÄBèC, strlen=5, byteslen1=7, byteslen2=5
    AÄBèC, strlen=5, byteslen1=7, byteslen2=5
Since most programmers are likely to fall back on the default source encodings (especially the general UTF-8 in Python 3.X), we'll defer to Python's standard manual set for more details on this option, as well as more advanced Unicode support such as properties and character-name escapes in strings that we'll skip here.
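If you want to experiment with encoding declarations without saving a file in a particular encoding, one option is to build the source as bytes and compile it in memory. The following is a sketch under that assumption: the variable names and the '<latin-1 demo>' pseudo-filename are arbitrary, and the built-in compile() applies the declaration when handed bytes source.

```python
# Build Latin-1 encoded source bytes, including the declaration comment
source_text = "# -*- coding: latin-1 -*-\nmyStr = 'A\u00c4B\u00e8C'\n"
source_bytes = source_text.encode('latin-1')    # Ä and è stored as single bytes 0xC4, 0xE8

# compile() reads the declaration and decodes the bytes accordingly
namespace = {}
exec(compile(source_bytes, '<latin-1 demo>', 'exec'), namespace)
decoded = namespace['myStr']
print(decoded, len(decoded))
```

This mirrors what Python does when it loads a script file whose first or second line carries the same comment.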
Processing 3.0 Bytes Objects
We'll see the string types we've met in action again when we study files [ahead](#Python 3.0 File Modes in Action). First, though, let's take a brief detour to dig a bit deeper into the operation sets provided by the new bytes
type in 3.0.
As mentioned [earlier](#Python's String Types), the 3.0 bytes
type supports sequence operations and most of the same methods available on str
(and present in 2.X's str
type). However, bytes
does not support the format()
method or the %
formatting expression (until 3.5, per the update [ahead](#3.X bytes formatting)). Moreover, you cannot mix and match bytes
and str
without explicit conversions—you generally will use all str
type objects and text files for text data, and all bytes
type objects and binary files for binary data.
Method Calls
If you really want to see what attributes str
has that bytes
doesn't, you can always check their dir()
results; this can also tell you something about the expression operators they support (e.g., __mod__
and __rmod__
implement the %
operator):
    C:\misc>c:\python30\python
    Python 3.0 (r30:67507, Dec 3 2008, 20:14:27) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.

    # Attributes unique to str

    >>> set(dir('abc')) - set(dir(b'abc'))
    {'isprintable', 'format', '__mod__', 'encode', 'isidentifier',
    '_formatter_field_name_split', 'isnumeric', '__rmod__', 'isdecimal',
    '_formatter_parser', 'maketrans'}

    # Attributes unique to bytes

    >>> set(dir(b'abc')) - set(dir('abc'))
    {'decode', 'fromhex'}
As you can see, str
and bytes
have almost identical functionality; their unique attributes are generally methods that don't apply to the other. For instance, decode()
translates a raw bytes
into its str
representation, and encode()
translates a str
into its raw bytes
representation. Most methods are shared between str
and bytes
, though. Moreover, bytes
are immutable just like str
in both 2.6 and 3.0 (error messages here have been shortened for brevity):
    >>> B = b'spam'                     # b'...' bytes literal
    >>> B.find(b'pa')
    1

    >>> B.replace(b'pa', b'XY')
    b'sXYm'

    >>> B
    b'spam'

    >>> B[0] = 'x'
    TypeError: 'bytes' object does not support item assignment
One notable exception to this rule: string formatting works only on str
in 3.0, not on bytes
. As told here, 3.0 also convolutes the string formatting story in general by adding redundant functionality, but that story is beyond the scope of this page:
    >>> b'%s' % 99
    TypeError: unsupported operand type(s) for %: 'bytes' and 'int'

    >>> '%s' % 99
    '99'

    >>> b'{0}'.format(99)
    AttributeError: 'bytes' object has no attribute 'format'

    >>> '{0}'.format(99)
    '99'
Update: Python 3.5 eventually extended %
formatting (only) to bytes
objects per this page, for better or worse. The extension has a heavy ASCII bias that clashes badly with the generalized Unicode text model of 3.X, but may be useful in limited contexts.
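For a rough idea of what the 3.5 extension looks like in practice, here is a short sketch that assumes Python 3.5 or later. The values are made up for illustration; note that text must still be encoded by hand before it can be substituted.

```python
# Requires Python 3.5+: % formatting on bytes
header = b'%s: %d bytes' % (b'spam', 4)     # %s takes bytes-like objects, %d takes ints
hexed = b'%02x' % 255

# A str is still rejected for %s in a bytes format: no implicit encoding
try:
    b'%s' % 'spam'
    str_rejected = False
except TypeError:
    str_rejected = True

encoded_ok = b'%s' % 'spam'.encode('ascii')  # explicit encoding works
print(header, hexed, str_rejected, encoded_ok)
```

The format() method remains unavailable on bytes; only the % expression was extended.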
Sequence Operations
Besides method calls, all the usual generic sequence operations you know (and possibly love) from Python 2.X strings and lists work as expected on both str
and bytes
in 3.0; this includes indexing, slicing, concatenation, and so on. Notice in the following that indexing bytes
returns an integer giving the byte's binary value; bytes
really is a sequence of 8-bit integers, but it prints as a string of ASCII-coded characters (plus non-ASCII escapes) when displayed as a whole. To check a given byte's text interpretation, use chr()
to convert it back to its character:
    >>> B = b'spam'
    >>> B
    b'spam'

    >>> B[0]
    115
    >>> B[-1]
    109

    >>> chr(B[0])
    's'

    >>> B[1:], B[:-1]
    (b'pam', b'spa')

    >>> len(B)
    4

    >>> B + b'lmn'
    b'spamlmn'
    >>> B * 4
    b'spamspamspamspam'
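One consequence of the integer-on-index behavior is worth a quick sketch: indexing and one-item slicing give different types, which matters when you compare or concatenate the results. The variable names here are arbitrary.

```python
B = b'spam'

first_int = B[0]            # indexing yields an int
first_bytes = B[0:1]        # a one-item slice yields a bytes object
print(first_int, first_bytes)

# An int never equals a one-byte bytes object, so compare like with like
print(B[0] == b's', B[0:1] == b's', B[0] == ord('s'))

# Iteration also yields ints, and bytes() accepts them back
values = [byte for byte in B]
rebuilt = bytes(values)
print(values, rebuilt)
```

This differs from str, where both indexing and slicing return another str.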
Other Ways to Make Bytes
So far, we've been making bytes
objects with the b'...'
literal syntax; they can also be created by calling the bytes()
constructor with a str
and an encoding name, calling bytes
with an iterable of integers representing byte values, or encoding a str
object per the default (or passed-in) encoding. Encoding takes a str
and returns the raw binary bytes value of the string according to its encoding specification; decoding takes a raw bytes
sequence and decodes it to its string representation, a series of Unicode characters:
    >>> B = b'abc'
    >>> B
    b'abc'

    >>> B = bytes('abc', 'ascii')
    >>> B
    b'abc'

    >>> ord('a')
    97
    >>> B = bytes([97, 98, 99])
    >>> B
    b'abc'

    >>> B = 'spam'.encode()             # or bytes()
    >>> B
    b'spam'

    >>> S = B.decode()                  # or str()
    >>> S
    'spam'
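A few more construction forms round out the picture. This sketch uses only built-in calls, and it also shows a common trap in the str() route: without an encoding argument, str() returns the display form of the bytes object rather than decoded text.

```python
zeros = bytes(3)                        # an int argument makes that many zero bytes
from_hex = bytes.fromhex('c3 84')       # hex digit pairs, whitespace allowed
from_range = bytes(range(97, 100))      # any iterable of ints in 0..255
print(zeros, from_hex, from_range)

decoded = str(b'abc', 'ascii')          # with an encoding: same as b'abc'.decode('ascii')
display = str(b'abc')                   # without one: the print string, not a decode
print(decoded, display)
```

When the intent is to convert bytes to text, passing the encoding (or calling decode()) avoids the second result.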
From a larger perspective, the last two of these operations can also be seen as tools for converting between str
and bytes
, introduced [earlier](#String Type Conversions) and expanded upon in the next section.
Mixing String Types
Notice in the replace()
call of the earlier method-calls [section](#Method Calls) how we have to pass in two bytes objects—str
types won't work there. Although Python 2.X automatically converts str
to and from unicode
when possible (that is, when the str
is only 7-bit ASCII text), Python 3.0 requires specific string types in some contexts, and expects manual conversions if needed:
    # Must pass expected types to function and method calls

    >>> B = b'spam'

    >>> B.replace('pa', 'XY')
    TypeError: expected an object with the buffer interface

    >>> B.replace(b'pa', b'XY')
    b'sXYm'

    >>> B = B'spam'
    >>> B.replace(bytes('pa'), bytes('xy'))
    TypeError: string argument without an encoding

    >>> B.replace(bytes('pa', 'ascii'), bytes('xy', 'utf-8'))
    b'sxym'

    # Must convert manually in mixed-type expressions

    >>> b'ab' + 'cd'
    TypeError: can't concat bytes to str

    >>> b'ab'.decode() + 'cd'               # bytes to str
    'abcd'

    >>> b'ab' + 'cd'.encode()               # str to bytes
    b'abcd'

    >>> b'ab' + bytes('cd', 'ascii')        # str to bytes
    b'abcd'
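Programs that receive either type at their boundaries sometimes centralize these manual conversions in small helpers. The following is only one possible sketch: to_bytes and to_text are hypothetical names, and the UTF-8 default is an assumption you would adjust to your data.

```python
def to_bytes(value, encoding='utf-8'):
    """Hypothetical helper: return bytes, encoding str input if needed."""
    if isinstance(value, str):
        return value.encode(encoding)
    return bytes(value)                 # also copies a bytearray to immutable bytes

def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return str, decoding bytes or bytearray input if needed."""
    if isinstance(value, (bytes, bytearray)):
        return value.decode(encoding)
    return value

print(b'ab' + to_bytes('cd'))           # mixed inputs, bytes result
print(to_text(b'ab') + 'cd')            # mixed inputs, str result
```

Converting once at the edges and keeping a single type inside the program tends to be simpler than converting at each expression.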
Two footnotes here. First, remember that encoding and decoding are more than a simple type conversion; as we learned in the fuller coverage [earlier](#String Type Conversions), they create different types of data altogether. Second, although you can create bytes
objects yourself to represent packed binary data, they can also be made automatically by reading files opened in binary mode, as we'll see [later](#Python 3.0 File Modes in Action) in this article. First, though, let's briefly meet bytes
' changeable cousin.
Using 3.0 bytearray Objects
So far, we've focused on str
and bytes
, since they subsume 2.6's unicode
and str
. Python 3.0 has a third string type, though—bytearray
is essentially a mutable variant of bytes
, and thus a mutable sequence of integers in the range 0..255. As such, it supports the same string methods and sequence operations as bytes
, as well as the mutable in-place-change operations found on lists:
    # Creation: a mutable sequence of small (0..255) ints

    >>> B = b'spam'                     # str 'spam' works in 2.X only
    >>> C = bytearray(B)
    >>> C
    bytearray(b'spam')
    >>> C[0], chr(C[0])                 # ASCII integer code for 's'
    (115, 's')

    # Mutable, but must assign ints, not strings

    >>> C[0] = 'x'
    TypeError: an integer is required

    >>> C[0] = b'x'
    TypeError: an integer is required

    >>> C[0] = ord('x')
    >>> C
    bytearray(b'xpam')

    >>> C[1] = b'Y'[0]
    >>> C
    bytearray(b'xYam')

    # Methods overlap with both str and bytes, but also has list's mutable methods

    >>> set(dir(b'abc')) - set(dir(bytearray(b'abc')))
    {'__getnewargs__'}

    >>> set(dir(bytearray(b'abc'))) - set(dir(b'abc'))
    {'insert', '__alloc__', 'reverse', 'extend', '__delitem__', 'pop', '__setitem__',
    '__iadd__', 'remove', 'append', '__imul__'}

    # Mutable method calls

    >>> C
    bytearray(b'xYam')

    >>> C.append(b'LMN')
    TypeError: an integer is required

    >>> C.append(ord('L'))
    >>> C
    bytearray(b'xYamL')

    >>> C.extend(b'MNO')
    >>> C
    bytearray(b'xYamLMNO')

    # Sequence operations and string methods

    >>> C + b'!#'
    bytearray(b'xYamLMNO!#')

    >>> C[0]
    120

    >>> C[1:]
    bytearray(b'YamLMNO')

    >>> len(C)
    8

    >>> C
    bytearray(b'xYamLMNO')

    >>> C.replace('xY', 'sp')
    TypeError: Type str doesn't support the buffer API

    >>> C.replace(b'xY', b'sp')
    bytearray(b'spamLMNO')

    >>> C
    bytearray(b'xYamLMNO')

    >>> C * 4
    bytearray(b'xYamLMNOxYamLMNOxYamLMNOxYamLMNO')
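As a further sketch of the list-like side of bytearray, the following applies several in-place changes to one object and confirms that a second reference sees them all. The names are arbitrary; each operation used is a standard bytearray operation.

```python
C = bytearray(b'spam')
alias = C                       # second reference to the same object

C[0:2] = b'xY'                  # slice assignment accepts bytes: xYam
C += b'!'                       # in-place concatenation: xYam!
del C[-1]                       # item deletion: xYam
C.insert(0, ord('>'))           # insert takes an int: >xYam
C.reverse()                     # in-place reversal: maYx>

frozen = bytes(C)               # immutable snapshot of the current content
print(C, alias is C, frozen)
```

Because every step changes the object in place, no new bytearray is created along the way, which is the main reason to choose this type for large buffers that change often.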
Finally, by way of summary, the following examples demonstrate how bytes
and bytearray
are sequences of ints
, and str
is a sequence of characters (i.e., decoded Unicode code points); although all three can contain character values and support many of the same operations, you should use str
for textual data, bytes
for binary data, and bytearray
for binary data you wish to change in place:
    >>> B
    b'spam'
    >>> list(B)
    [115, 112, 97, 109]

    >>> C
    bytearray(b'xYamLMNO')
    >>> list(C)
    [120, 89, 97, 109, 76, 77, 78, 79]

    >>> S = 'spam'
    >>> list(S)
    ['s', 'p', 'a', 'm']
Python 3.0 File Modes in Action
Now that we've learned all about Python's string types, let's turn to their roles in files—the main context in which most programmers will likely encounter Unicode and bytes, and the last major topic of this tutorial.
As also mentioned [above](#Text and Binary Files), the mode in which you open a file is crucial: it determines which object type you will use to represent the file's content in your script. Text mode implies str
objects, and binary mode implies bytes
:
- Text mode files interpret file contents according to an encoding: either the default for your platform, or one whose name you pass in. By passing in an encoding name to open(), you can force conversions for various types of Unicode files. Text mode files may also perform universal line-end translations for you; by default, all line-end forms map to the \n character in your script, regardless of which platform you are on.
- Binary mode files instead return file content to you raw, as a sequence of integers representing byte values, with no encoding or decoding, and no line-end translations.
In terms of code, the second positional argument to open()
determines whether you want text or binary processing and types, just as it does in 2.X Python—adding a "b" to the string implies binary mode. The default mode is "rt" which is the same as "r", which means text input, just as in 2.X. In 3.0, though, this mode argument to open()
also implies an object type for file content representation regardless of the underlying platform—text files return a str
for reads and expect one for writes, but binary files return a bytes
for reads and expect bytes
(or bytearray
) for writes.
Text File Basics
To demonstrate, let's begin with basic file I/O. As long as you're processing basic text files (e.g., ASCII) and don't care about circumventing the platform-default encoding of strings, files look and feel much as they do in 2.X (for that matter, so do strings in general). The following, for instance, writes one line of text to a file and reads it back in 3.0, exactly as it would in 2.6 (note that file
is no longer a built-in name in 3.0, and it's perfectly okay to use it as a variable here either way):
    C:\misc>c:\python30\python

    # Basic text files (and strings) work the same as in 2.X

    >>> file = open('temp', 'w')
    >>> size = file.write('abc\n')      # returns number of characters written
    >>> file.close()                    # manual close to flush output buffer

    >>> file = open('temp')             # default mode is "r" (== "rt"), which means text input
    >>> text = file.read()
    >>> text
    'abc\n'
Using Text and Binary Modes
Next, we'll write a text file and read it back in both modes in 3.0. Notice that we are required to provide a str
for writing, but reading gives us a str
or bytes
depending on the open mode (I've strung operations together here into one-liners just for brevity):
    # Write and read a text file

    >>> open('temp', 'w').write('abc\n')        # text mode output, provide a str
    4

    >>> open('temp', 'r').read()                # text mode input, returns a str
    'abc\n'

    >>> open('temp', 'rb').read()               # binary mode input, returns a bytes
    b'abc\r\n'
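The \r\n in the binary read above reflects the Windows platform this session ran on. To reproduce the same effect on any platform, the following sketch forces the line-end translation with the newline argument of open(); the temporary path is an assumption made only so the example cleans up after itself.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'temp.txt')     # throwaway location

with open(path, 'w', newline='\r\n') as f:      # force Windows-style line ends on output
    count = f.write('abc\n')                    # count of characters passed in

with open(path, 'r') as f:                      # text mode maps \r\n back to \n
    text = f.read()

with open(path, 'rb') as f:                     # binary mode shows what is really stored
    raw = f.read()

print(count, repr(text), raw)
os.remove(path)
```

Note that the count returned by write() reflects the characters in the str, not the number of bytes that end up in the file.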
Now, let's do the same, but with a binary file; we must provide a bytes
to write, and still get back a str
or bytes
depending on the input mode:
    # Write and read a binary file

    >>> open('temp', 'wb').write(b'abc\n')      # binary mode output, provide a bytes
    4

    >>> open('temp', 'r').read()                # text mode input, returns a str
    'abc\n'

    >>> open('temp', 'rb').read()               # binary mode input, returns a bytes
    b'abc\n'
Notice that the same holds even if the data we're writing to the binary file is truly binary in nature; in the following, the \x00
is a binary zero byte, and not a printable character (though it passes as a text code point in the default encoding):
    # Write and read binary data

    >>> open('temp', 'wb').write(b'a\x00c')
    3

    >>> open('temp', 'r').read()
    'a\x00c'

    >>> open('temp', 'rb').read()
    b'a\x00c'
Binary mode files always return contents as a bytes
object, but accept either a bytes
or bytearray
object for writing. This naturally follows, given that bytearray
is mostly just a mutable variant of bytes
. In fact, most APIs in Python 3.0 that accept a bytes
also allow a bytearray
:
    # Bytearrays work too

    >>> BA = bytearray(b'\x01\x02\x03')

    >>> open('temp', 'wb').write(BA)
    3

    >>> open('temp', 'r').read()
    '\x01\x02\x03'

    >>> open('temp', 'rb').read()
    b'\x01\x02\x03'
Finally, notice that you can't get away with violating Python's str
/bytes
distinction when it comes to files; in the following we get errors (shortened here) if we try to write a bytes
to a text file, or a str
to a binary file. Although it is often possible to convert between these two types (as described [earlier](#String Type Conversions) in this article), you will usually want to stick to str
for text data and bytes
for binary data:
    # Types are not flexible for file content

    >>> open('temp', 'w').write('abc\n')        # auto encodes str to bytes
    4
    >>> open('temp', 'w').write(b'abc\n')       # but bytes is not decoded text
    TypeError: can't write bytes to text stream

    >>> open('temp', 'wb').write(b'abc\n')      # writes raw bytes
    4
    >>> open('temp', 'wb').write('abc\n')       # but str is not raw bytes
    TypeError: can't write str to binary stream
This may seem strict, but Python cannot guess how you may wish to interpret the contents of a bytes
or str
when used in the opposite context, and wisely refuses to decode or encode content implicitly. Moreover, because str
and bytes
operation sets largely intersect, the choice of types won't be much of a dilemma for most programs. See earlier in this article for more on bytes
[operations](#Processing 3.0 Bytes Objects) and mixed-type [constraints](#String Type Conversions), and the struct
module coverage ahead for another binary-file [example](#The struct Binary-Data Module).
Using Unicode Text Files
Update: this draft article originally stopped short here before demonstrating open()
encodings for text files. In brief, text files allow a specific Unicode encoding-scheme name to be passed in with an encoding
argument, and use it to automatically decode and encode text on input and output, respectively:
- open(_filepathname_, 'r', encoding='utf8') decodes on reads
- open(_filepathname_, 'w', encoding='latin1') encodes on writes
- codecs.open() is equivalent in 2.X
- BOMs may require special handling or encoding names
The first of the above, for instance, assumes the file's content is encoded per UTF-8, and automatically decodes its data to str
code points when read by the program. Similarly, the second bullet above encodes str
code points to their Latin-1 format as they are output to the file. File transfers raise exceptions whenever a requested format doesn't work.
For example, in Python 3.X:
    >>> file = open('uni.txt', 'w', encoding='utf8')            # auto encodes to bytes
    >>> file.write('spÄm')
    4
    >>> file.close()

    >>> text = open('uni.txt', 'r', encoding='utf8').read()     # auto decodes to str
    >>> text
    'spÄm'

    >>> raw = open('uni.txt', 'rb').read()                      # no decoding applied
    >>> raw
    b'sp\xc3\x84m'

    >>> text = open('uni.txt', 'r', encoding='ascii').read()    # Ä's utf8 bytes aren't ascii
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128)

    >>> import codecs
    >>> codecs.open('uni.txt', 'r', encoding='utf8').read()     # 2.X's flavor in 3.X
    'spÄm'
In the absence of encoding
, text files still encode and decode per a platform- and version-specific default noted earlier. Although you may not notice these translations if your default and files agree, you generally should not rely on the default; it makes your programs dependent on the context in which their files were created, and can lead to portability issues. A program run on a UTF-8 default platform, for instance, may have trouble using a file made under a Latin-1 default (an interoperability pitfall also noted here).
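On the BOM point raised in the list above, here is a hedged sketch of one common case: a UTF-8 file that starts with a BOM. It assumes a throwaway temporary path and relies on Python's standard utf-8-sig codec name, which writes the BOM on output and discards it on input.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'uni.txt')      # throwaway location

with open(path, 'w', encoding='utf-8-sig') as f:        # writes a BOM, then the text
    f.write('spÄm')

with open(path, 'rb') as f:
    raw = f.read()                                      # BOM bytes are visible here

with open(path, 'r', encoding='utf-8') as f:
    with_bom = f.read()                                 # plain utf-8 keeps the BOM as '\ufeff'

with open(path, 'r', encoding='utf-8-sig') as f:
    without_bom = f.read()                              # utf-8-sig strips it

print(raw, repr(with_bom), repr(without_bom))
os.remove(path)
```

Passing the encoding explicitly in each call also sidesteps the platform-default dependence described above.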
For more coverage and examples of these topics, try this site's posts here and here, and see this article's later version in this book.
Other String Tool Changes in 3.0
In closing, it's worth noting that many of the popular string-processing tools in Python's standard library have also been revamped for the new str
/bytes
dichotomy. We won't cover any of these application-focused tools in much detail in this core-language book, but as a sample, here's a quick look at two of the major tools impacted.
The re Pattern-Matching Module
Python's re
pattern-matching module has been generalized to work on objects of any string type in 3.0: str
, bytes
, and bytearray
. Note that you can't mix str
and bytes
types in its calls' arguments, though:
    >>> import re
    >>> S = 'Bugger all down here on earth!'
    >>> B = b'Bugger all down here on earth!'

    >>> re.match('(.*) down (.*) on (.*)', S).groups()
    ('Bugger all', 'here', 'earth!')

    >>> re.match(b'(.*) down (.*) on (.*)', B).groups()
    (b'Bugger all', b'here', b'earth!')

    >>> re.match('(.*) down (.*) on (.*)', B).groups()
    ...
    TypeError: can't use a string pattern on a bytes-like object

    >>> re.match(b'(.*) down (.*) on (.*)', S).groups()
    ...
    TypeError: can't use a bytes pattern on a string-like object

    >>> re.match(b'(.*) down (.*) on (.*)', bytearray(B)).groups()
    (bytearray(b'Bugger all'), bytearray(b'here'), bytearray(b'earth!'))

    >>> re.match('(.*) down (.*) on (.*)', bytearray(B)).groups()
    ...
    TypeError: can't use a string pattern on a bytes-like object
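The pattern's type affects more than which subjects it accepts. As a sketch of one difference, character classes such as \w are Unicode-aware for str patterns but ASCII-only for bytes patterns, so the same text can split differently once encoded. The sample text is made up for illustration.

```python
import re

text = 'spÄm eggs'
data = text.encode('utf-8')                     # Ä becomes the two bytes \xc3\x84

str_words = re.findall(r'\w+', text)            # str pattern: Ä counts as a word character
bytes_words = re.findall(rb'\w+', data)         # bytes pattern: non-ASCII bytes do not
ascii_words = re.findall(r'\w+', text, re.ASCII)    # str pattern limited to ASCII classes

print(str_words, bytes_words, ascii_words)
```

For text processing, this is another reason to decode to str first and match on characters instead of encoded bytes.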
The struct Binary-Data Module
Along similar lines, the Python struct
module, used to create and extract packed binary data from strings, works in 3.0 as it does in 2.X, but in 3.X operates on bytes
and bytearray
only, not str
(which makes sense, given that it's intended for processing binary data, not decoded text):
    >>> import struct
    >>> B = struct.pack('>i4sh', 7, b'spam', 8)     # 's' requires bytes as of 3.2
    >>> B                                           # (it encodes str as utf8 in 3.0/3.1)
    b'\x00\x00\x00\x07spam\x00\x08'

    >>> vals = struct.unpack('>i4sh', B)            # packed data is bytes, not str
    >>> vals
    (7, b'spam', 8)

    >>> vals = struct.unpack('>i4sh', B.decode())
    TypeError: 'str' does not have the buffer interface
Apart from the new syntax for bytes
, creating and reading binary files works almost the same in 3.0 as it does in 2.X (and as described briefly [earlier](#Using Text and Binary Modes) in this article, and in more detail in this book):
    C:\misc>c:\python30\python.exe

    >>> F = open('data.bin', 'wb')                      # open binary output file
    >>> import struct
    >>> data = struct.pack('>i4sh', 7, b'spam', 8)      # create packed binary data
    >>> data                                            # bytes in 3.0, not str
    b'\x00\x00\x00\x07spam\x00\x08'
    >>> F.write(data)                                   # write to the file
    10
    >>> F.close()

    >>> F = open('data.bin', 'rb')                      # open binary input file
    >>> data = F.read()                                 # read bytes
    >>> data
    b'\x00\x00\x00\x07spam\x00\x08'
    >>> values = struct.unpack('>i4sh', data)           # extract packed binary data
    >>> values                                          # back to Python objects
    (7, b'spam', 8)
Update: Python 3.2 changed the struct.pack()
call to require a bytes
(or bytearray
) object for its "s" conversion code; using a str
is an error. In 3.0 and 3.1 a str
is allowed and automatically encoded to bytes as UTF-8 text—arguably too large an assumption, but 3.2 changes working and documented behavior. To placate 3.2 and later, the examples above simply use a b'...'
literal instead of '...'
; in your code, encode as needed (e.g., mystr.encode('utf8')
).
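When the packed field really is text, one way to follow this advice is to wrap the encode and decode steps around the struct calls. This is a sketch, not a prescribed pattern: pack_record and unpack_record are hypothetical names, and ASCII is assumed for the 4-byte label so that one character always maps to one byte.

```python
import struct

FORMAT = '>i4sh'                        # big-endian: int, 4-byte string, short
size = struct.calcsize(FORMAT)          # 4 + 4 + 2 bytes, no padding with '>'

def pack_record(count, label, flag):
    """Hypothetical helper: encode the text field, then pack."""
    return struct.pack(FORMAT, count, label.encode('ascii'), flag)

def unpack_record(data):
    """Hypothetical helper: unpack, then decode the text field."""
    count, label, flag = struct.unpack(FORMAT, data)
    return count, label.decode('ascii'), flag

packed = pack_record(7, 'spam', 8)
record = unpack_record(packed)
print(size, packed, record)
```

With a multibyte encoding such as UTF-8, the fixed 4s field width would count bytes, not characters, so the format would need to be sized for the encoded form.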
For more on re
, struct
, and other string-related modules impacted by 3.0's new Unicode support, consult the Python library manual; this article's published version, which also covers pickle
object serialization and XML parsing; or application-focused follow-up books such as Programming Python.
For More Reading
An edited and enhanced version of this page's material appeared in this book, and was later expanded in this edition; see the latter for additional coverage of strings in both Python 3.X and 2.X.
For related resources online at this site, you may also be interested in a 2016 review of Python 3.5 bytes-string formatting; 2018 program-usage notes regarding Unicode BOMs, defaults, and encodings; and the 2022 coverage of normalization.