Strings represent human-readable text and are one of the most basic and important data types in computer programming. However, each character in a text string is represented by one or more bytes of binary data. Applications such as input and output operations and data transmission require strings to be converted to bytes using a specific encoding.
This tutorial explores the techniques of converting strings to bytes in Python. If you're interested in the reverse operation, check out my tutorial on how to convert bytes to strings in Python.
Before getting into details, let’s start with a short answer for those of you in a hurry.
Short Answer: How to Convert String to Bytes in Python
Python makes it straightforward to convert a string into bytes using the built-in .encode() method:
my_string = "Hello, world!"
bytes_representation = my_string.encode(encoding="utf-8")
# Optional: Specify the desired encoding (UTF-8 is the default)
print(bytes_representation)
# Output: b'Hello, world!'
The .encode() method returns a new bytes object representing the encoded version of the original string. By default, it uses UTF-8 encoding, but you can specify other encodings like 'ascii' or 'latin-1' if needed.
Let's explore this conversion process in more detail.
Understanding Strings and Bytes in Python
Two of the core built-in data types in Python are str and bytes. These data types share common features but have key differences.
Both str and bytes are immutable sequences, meaning we can't modify their elements after creation. A string is an immutable sequence of characters, whereas a bytes object is an immutable sequence of integers between 0 and 255. This range of integers can be represented by 8 bits, which is one byte. Therefore, a bytes object is a sequence of bytes.
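As a small illustration of the shared immutability and the different element types, the sketch below tries to assign to the first element of each object. The variable names are chosen for this example only.

```python
text = "Python"
data = b"Python"

# Both types reject item assignment because they are immutable sequences.
for value in (text, data):
    try:
        value[0] = value[0]  # attempted in-place modification
    except TypeError as error:
        print(type(value).__name__, "->", error)

# Indexing reveals the different element types.
first_character = text[0]  # a one-character string
first_byte = data[0]       # an integer between 0 and 255
print(repr(first_character), first_byte)
```

Indexing a str gives back another str, while indexing a bytes object gives back an int.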
ASCII character encoding
Let's consider ASCII characters first. ASCII (American Standard Code for Information Interchange) is a character encoding that contains only 128 characters. Therefore, any ASCII character can be represented by seven bits, which is fewer than a single byte.
We can create a bytes object by adding b in front of the single, double, or triple quotes we normally use for strings:
word_as_bytes = b"Python"
print(word_as_bytes)
print(type(word_as_bytes))
b'Python'
<class 'bytes'>
Although the code displays the characters spelling Python, each element in the bytes object is an integer between 0 and 255:
print(word_as_bytes[0])
print(list(word_as_bytes))
80
[80, 121, 116, 104, 111, 110]
The first element of word_as_bytes is the integer 80, which is the ASCII code for uppercase P:
print(chr(80))
P
When converting word_as_bytes to a list, the list contains the integers representing each byte. The integers are the ASCII codes for the letters in the word Python.
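The mapping also works in the other direction. As a minimal sketch, the built-in ord() gives the code for each character, and passing a list of integers to bytes() rebuilds the same object:

```python
# Look up the code of each character in the word.
codes = [ord(character) for character in "Python"]
print(codes)

# A list of integers in the range 0-255 can be turned back into bytes.
rebuilt = bytes(codes)
print(rebuilt)
print(rebuilt == b"Python")
```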
However, the ASCII character set is limited.
UTF-8 character encoding
The most common character encoding is UTF-8, a variable-width Unicode encoding built from 8-bit units. The 128 ASCII characters are represented by the same single-byte values in UTF-8, and every other Unicode character is represented by a sequence of two to four bytes.
Let's create a bytes object using non-ASCII characters. A bytes literal can contain only ASCII characters, so we'll use the bytes() constructor instead:
word_as_bytes = bytes("café", "utf-8")
print(word_as_bytes)
b'caf\xc3\xa9'
The bytes object displays the first three letters in café directly. However, the accented é is not an ASCII character, and it's represented by two bytes, which are displayed as \xc3 and \xa9. These bytes represent the hexadecimal numbers c3 and a9, which are the integers 195 and 169. These two bytes combined represent é in UTF-8:
print(len(word_as_bytes))
print(list(word_as_bytes))
5
[99, 97, 102, 195, 169]
A five-element bytes object represents the four-letter string.
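To check that the five bytes really do carry the four characters, we can decode them again. This is a short sketch using the bytes .decode() method, which is the reverse operation and is not covered further in this tutorial:

```python
word = "café"
encoded = bytes(word, "utf-8")

# The string has four characters, but its UTF-8 form has five bytes.
print(len(word), len(encoded))

# The last two bytes on their own decode to the single character é.
last_character = encoded[3:].decode("utf-8")
print(last_character)

# Decoding the whole object recovers the original string.
decoded = encoded.decode("utf-8")
print(decoded == word)
```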
Converting Strings to Bytes: The encode() Method
Earlier, we used the bytes() constructor to convert a text string into a bytes object. A more common way of converting Python strings to bytes is to use the .encode() string method, which gives control over encoding and error handling. This method returns a bytes object that represents the string:
word_as_bytes = "Hello Python!".encode()
print(word_as_bytes)
print(type(word_as_bytes))
b'Hello Python!'
<class 'bytes'>
The .encode() method defaults to the UTF-8 encoding. UTF-8 is the most widely used encoding format, and it supports a much wider range of characters than ASCII. UTF-8 represents each character with a sequence of one to four bytes.
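The variable width is easy to see by encoding single characters. The four sample characters below are picked for illustration, one for each possible length:

```python
# One sample character for each UTF-8 width (chosen for this example).
samples = ["a", "é", "€", "😀"]

# Map each character to the number of bytes in its UTF-8 encoding.
widths = {character: len(character.encode("utf-8")) for character in samples}

for character, width in widths.items():
    print(repr(character), width, character.encode("utf-8"))
```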
We can call .encode() with an alternative encoding as an argument:
word_as_bytes = "Hello Python!".encode("utf-16")
print(word_as_bytes)
b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00P\x00y\x00t\x00h\x00o\x00n\x00!\x00'
The bytes object is different in this case as it represents the UTF-16 encoding of the same text string.
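The first two bytes of the UTF-16 output, \xff\xfe, are a byte order mark (BOM) that records the byte order used for the rest of the data. The sketch below compares the size of the same string in a few encodings. It uses the standard codecs module only to get the BOM constants, and the "utf-16-le" codec to show a variant that fixes the byte order and writes no BOM:

```python
import codecs

text = "Hello Python!"

# Number of bytes needed for the same 13-character string in each encoding.
encodings = ("utf-8", "utf-16", "utf-16-le", "utf-32")
sizes = {name: len(text.encode(name)) for name in encodings}
print(sizes)

# The generic "utf-16" codec starts its output with a byte order mark.
bom = text.encode("utf-16")[:2]
has_bom = bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
print(has_bom)

# "utf-16-le" fixes the byte order, so no BOM is written.
print(text.encode("utf-16-le")[:4])
```

For text made up of ASCII characters, UTF-8 is the most compact of these encodings.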
Encoding Errors
Since not all encodings include all characters, errors can occur when encoding a string into a bytes object.
Let's consider the string "Café • £2.20", which has three non-ASCII characters. We can encode this using the default UTF-8 encoding:
word_as_bytes = "Café • £2.20".encode()
print(word_as_bytes)
b'Caf\xc3\xa9 \xe2\x80\xa2 \xc2\xa32.20'
In the output, each non-ASCII character appears as the hexadecimal escape sequences of its UTF-8 bytes. However, .encode() raises an error if the same string is encoded using ASCII since several characters aren't present in the ASCII encoding:
word_as_bytes = "Café • £2.20".encode("ascii")
print(word_as_bytes)
Traceback (most recent call last):
...
word_as_bytes = "Café • £2.20".encode("ascii")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 3: ordinal not in range(128)
The .encode() string method has a parameter errors, which has a default value of "strict". The "strict" argument forces .encode() to raise a UnicodeEncodeError when a character can't be encoded.
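With the default "strict" behavior, one way to deal with the problem is to catch the exception and inspect it. The following is a minimal sketch that reads the encoding, start, end, and object attributes of UnicodeEncodeError to find the offending character. The variable names are illustrative:

```python
text = "Café • £2.20"
caught = None

try:
    text.encode("ascii")  # errors="strict" is the default
except UnicodeEncodeError as error:
    caught = error  # keep a reference, since `error` is cleared after the block

if caught is not None:
    # The exception records which codec failed and where in the string.
    bad_character = caught.object[caught.start:caught.end]
    print(caught.encoding)
    print(caught.start)
    print(repr(bad_character))
```

Only the first character that can't be encoded is reported, since encoding stops at that point.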
However, there are other options for handling errors. One option is to ignore the errors using the "ignore" argument:
word_as_bytes = "Café • £2.20".encode("ascii", errors="ignore")
print(word_as_bytes)
b'Caf  2.20'
This option doesn't raise an error. Instead, .encode() returns a bytes object, and the accented é, the bullet point, and the pound sign are omitted.
Another option is to replace the characters that can't be encoded with something else. There are several replacement options. One of these is the errors="replace" argument:
word_as_bytes = "Café • £2.20".encode("ascii", errors="replace")
print(word_as_bytes)
b'Caf? ? ?2.20'
The three non-ASCII characters, which can't be encoded in this example, are replaced with a question mark. Therefore, each missing character is replaced with another single character that acts as a placeholder.
We can also replace the missing characters with more informative text:
word_as_bytes = "Café • £2.20".encode("ascii", errors="backslashreplace")
print(word_as_bytes)
b'Caf\\xe9 \\u2022 \\xa32.20'
Calling .encode() with errors="backslashreplace" replaces the characters that can't be encoded with backslash escape sequences. The sequence \xe9 stands for the accented é (code point U+00E9). The bullet point is the Unicode character U+2022, shown as \u2022, and \xa3 stands for the pound sign (U+00A3). The doubled backslashes in the output show that each backslash is a literal character in the bytes object.
We can also use "xmlcharrefreplace" to replace the missing characters with their XML numeric character references:
word_as_bytes = "Café • £2.20".encode("ascii", errors="xmlcharrefreplace")
print(word_as_bytes)
b'Caf&#233; &#8226; &#163;2.20'
Another option is to replace invalid characters with their formal name using errors="namereplace":
word_as_bytes = "Café • £2.20".encode("ascii", errors="namereplace")
print(word_as_bytes)
b'Caf\\N{LATIN SMALL LETTER E WITH ACUTE} \\N{BULLET} \\N{POUND SIGN}2.20'
Different situations may require tailored error handling, and the .encode() string method provides several options for dealing with characters that can't be encoded.
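For a side-by-side view, the handlers discussed in this section can be applied to the same string in a loop. This is only a summary sketch, and the dictionary name results is chosen for the example:

```python
text = "Café • £2.20"

# The non-strict error handlers covered in this section.
handlers = [
    "ignore",
    "replace",
    "backslashreplace",
    "xmlcharrefreplace",
    "namereplace",
]

# Encode the same string to ASCII with each handler.
results = {handler: text.encode("ascii", errors=handler) for handler in handlers}

for handler, encoded in results.items():
    print(f"{handler:>18}: {encoded}")
```

Only "ignore" and "replace" lose information about which characters were present. The other three handlers keep enough detail to identify the original characters.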
Applications of String-to-Byte Conversion in Data Science
String-to-byte conversion is a fundamental operation that finds applications in various data science domains:
- Natural language processing (NLP): When working with text data for tasks like sentiment analysis, topic modeling, or machine translation, we often preprocess text by tokenizing it into words or subwords. This tokenization process frequently involves converting strings to byte sequences for efficient representation and manipulation.
- Data cleaning and preprocessing: Byte-level operations can be useful for cleaning text data, such as removing invalid characters or normalizing text based on specific byte patterns.
- Feature engineering: In some cases, byte-level features (e.g., n-grams of bytes) can be extracted from text data and used as input features for machine learning models.
- Web scraping and data extraction: When scraping data from websites, we often receive HTML or other text-based content that might need to be parsed and processed at the byte level to extract relevant information.
- Data compression: Certain data compression algorithms operate on byte sequences, so converting strings to bytes can be a necessary step before applying compression techniques.
Understanding these applications can help us identify situations where converting strings to bytes can be useful.
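As a rough sketch of two of these cases, the example below encodes a string before compressing it with the standard zlib module, and counts byte bigrams as a simple byte-level feature. The sample text and variable names are made up for illustration, and real pipelines will differ:

```python
import zlib
from collections import Counter

# Repetitive sample text (made up for this example).
text = "to be or not to be, that is the question. " * 20

# zlib works on bytes, so the string is encoded first.
raw = text.encode("utf-8")
compressed = zlib.compress(raw)
print(len(raw), len(compressed))

# Decompress and decode to get the original string back.
restored = zlib.decompress(compressed).decode("utf-8")
print(restored == text)

# Byte-level features: count each pair of neighboring bytes (bigrams).
bigrams = Counter(zip(raw, raw[1:]))
print(bigrams.most_common(3))
```

Passing the str object directly to zlib.compress() raises a TypeError, which is why the encoding step comes first.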
Conclusion
Strings are sequences of human-readable characters. These characters are encoded as bytes of binary data, which can be stored in a bytes object. A bytes object is a sequence of integers, and each integer represents a byte.
Many applications require strings to be converted to bytes objects, which we can do using either the bytes() constructor or the string .encode() method. Mastering the conversion between strings and bytes enables more flexible data manipulation.
String to Bytes Conversion FAQs
What’s the difference between str and bytes?
A string is an immutable sequence of characters, whereas a bytes object is an immutable sequence of integers. Each integer represents a byte.
Should I use bytes() or str.encode() to convert a string to bytes?
The str.encode() method is the preferred way to convert strings to bytes in Python, offering clarity and flexibility in choosing the encoding and error handling strategy. The bytes() constructor can be used in specific scenarios where str.encode() isn't directly applicable.
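For a plain string, the two approaches give the same result, as this short sketch shows. One difference is that bytes() has no default encoding for a string argument, so leaving it out raises a TypeError:

```python
text = "café"

# Both calls produce the same bytes object.
via_method = text.encode("utf-8")
via_constructor = bytes(text, "utf-8")
print(via_method == via_constructor)

# bytes() needs an explicit encoding when given a string.
message = None
try:
    bytes(text)
except TypeError as error:
    message = str(error)
print(message)
```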
I studied Physics and Mathematics at UG level at the University of Malta. Then, I moved to London and got my PhD in Physics from Imperial College. I worked on novel optical techniques to image the human retina. Now, I focus on writing about Python, communicating about Python, and teaching Python.