Python’s standard library comes equipped with several built-in packages that let developers begin reaping the benefits of the language instantly. One such package is the multiprocessing module, which enables a program to run multiple processes simultaneously. In other words, developers can break an application into smaller processes that run independently of one another. The operating system then allocates these processes to the available processor cores, allowing them to run in parallel and improving the performance and efficiency of your Python programs.
If words such as threads, processes, and processors are unfamiliar to you, don’t worry. In this article, we will cover what a process is, how processes differ from threads, and how to use the multiprocessing module.
What are processes in Python?
Understanding processes and threads makes it much easier to comprehend how an operating system manages programs through their various execution stages. A process is merely a reference to a computer program in execution. Each running program has at least one process associated with it.
As you read this article, your computer is highly likely to have many processes running at any one time, even if you only have a few programs open – this is due to most operating systems having several tasks running in the background. For example, you may have only three programs running at this present moment, but your computer may have more than 30 active processes running simultaneously.
How to check the active processes currently running on your computer depends on your operating system:
- On Windows: press Ctrl+Shift+Esc to start Task Manager.
- On Mac: open Spotlight Search, type "Activity Monitor," then hit Return.
- On Linux: click on Application Menu and search for System Monitor.
Python provides access to real system-level processes. Creating an instance of the Process class from the multiprocessing module gives developers a Python object that references an underlying native process. The native process itself is created behind the scenes when the process is started. The lifecycle of a Python process consists of three stages: the new process, the running process, and the terminated process. We will cover each stage in the section “The basics of Python’s Multiprocessing module.”
All processes are made up of one or more threads. Recall that a process refers to a running computer program. Each running Python program is a process that starts with one default thread called the main thread. The main thread is responsible for executing the instructions in your Python program. However, it’s important to note that processes and threads are different.
Multiprocessing vs. Threading
To be more specific, one instance of Python’s interpreter – the tool that converts the code written in Python to the language a computer can understand – is equal to one process. A process will consist of at least one thread, called the “main thread” in Python, although other threads may be created within the same process – all other threads created within a process will belong to that process.
The thread serves as a representation of how your Python program will be executed, and once all of the non-background (non-daemon) threads have terminated, the Python process will terminate.
- Process: One process is an instance of the Python interpreter that consists of at least one thread called the main thread.
- Thread: A representation of how a Python program is executed within a Python process.
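As a quick check of these definitions, the sketch below reports the current process and the thread it is running on. The helper name describe_runtime is our own and not part of any library; the calls it wraps come from the standard os, multiprocessing, and threading modules.

```python
import multiprocessing
import os
import threading


def describe_runtime():
    """Return basic facts about the current process and thread."""
    current_thread = threading.current_thread()
    return {
        "pid": os.getpid(),  # operating-system ID of this process
        "process_name": multiprocessing.current_process().name,
        "thread_name": current_thread.name,
        "is_main_thread": current_thread is threading.main_thread(),
    }


if __name__ == "__main__":
    for key, value in describe_runtime().items():
        print(f"{key}: {value}")
```

Run as a plain script, this should report a single process whose code executes on the main thread.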
Python has two extremely similar classes that grant us more control over processes and threads: multiprocessing.Process and threading.Thread.
Let’s review some of their similarities and differences.
Similarities
#1 Concurrency
Concurrency is a concept in which different program parts may be executed out-of-order or in partial order without the final outcome being affected. Both classes were originally intended for concurrency.
#2 Support for concurrency primitives
The multiprocessing.Process and threading.Thread classes support the same concurrency primitives: tools such as locks, events, and semaphores that enable the synchronization and coordination of threads and processes.
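To make the idea of a primitive concrete, here is a minimal sketch that uses threading.Lock to protect a shared counter. The function names add_many and count_with_threads are illustrative only. multiprocessing.Lock offers the same acquire/release interface, so the same pattern carries over to processes.

```python
import multiprocessing
import threading


def add_many(counter, lock, times):
    """Increment counter["value"] `times` times, holding the lock each time."""
    for _ in range(times):
        with lock:  # both Lock types work as context managers
            counter["value"] += 1


def count_with_threads(n_threads=4, times=1000):
    """Run add_many in several threads and return the final count."""
    counter = {"value": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=add_many, args=(counter, lock, times))
        for _ in range(n_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return counter["value"]


if __name__ == "__main__":
    print(count_with_threads())  # prints 4000 (4 threads x 1000 increments)

    # The process-safe lock is used in exactly the same way.
    process_lock = multiprocessing.Lock()
    with process_lock:
        print("holding a process-safe lock")
```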
#3 Uniform API
Once you’ve grasped the multiprocessing.Process API, then you can transfer that knowledge to the threading.Thread API, and vice versa. They were intentionally designed this way.
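The sketch below illustrates this uniformity: one helper (the hypothetical run_worker) creates, starts, and joins a worker, and it works unchanged whether it is handed the Thread class or the Process class. The `if __name__ == "__main__":` guard is included because platforms that start child processes with the spawn method need it.

```python
from multiprocessing import Process
from threading import Thread


def greet(name):
    print(f"Hello, {name}!")


def run_worker(worker_class, name):
    """Create, start, and wait for a worker of the given class."""
    worker = worker_class(target=greet, args=(name,))
    worker.start()
    worker.join()
    return worker


if __name__ == "__main__":
    run_worker(Thread, "thread")    # runs greet() in a new thread
    run_worker(Process, "process")  # runs greet() in a new process
```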
Differences
#1 Functionality
Despite their APIs being nearly identical, processes and threads are different. A process is a higher level of abstraction than a thread: a process represents a running computer program, while a thread belongs to a process. This difference is inherent in the classes: they represent two different native constructs managed by the underlying operating system.
#2 Access to the shared state
The two classes access shared state differently. Since threads belong to a process, they can share memory within that process. Thus, a function executed in a new thread still has access to the same data and state as the rest of the process. This way of sharing state is known as “shared memory” and is pretty straightforward. In contrast, sharing state between processes is much more involved: the state must be serialized and transmitted between processes. In other words, processes do not share state through ordinary shared memory because each process has its own separate memory. Instead, state is shared between processes using a technique called “inter-process communication,” and performing it in Python requires explicit tools like multiprocessing.Pipe or multiprocessing.Queue.
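The difference can be seen in a small experiment. In this sketch (the names shared_items, append_item, and send_item are ours), a thread's change to a module-level list is visible to the parent, a child process's change is not, and a multiprocessing.Queue is used to pass the data back explicitly.

```python
from multiprocessing import Process, Queue
from threading import Thread

shared_items = []  # lives in the memory of the process that defines it


def append_item(item):
    """Mutate the module-level list directly."""
    shared_items.append(item)


def send_item(queue, item):
    """Send the item back to the parent through a queue instead."""
    queue.put(item)


if __name__ == "__main__":
    # A thread shares memory with its parent, so the append is visible.
    thread = Thread(target=append_item, args=("from thread",))
    thread.start()
    thread.join()
    print(shared_items)  # ['from thread']

    # A child process works on its own copy; the parent's list is unchanged.
    process = Process(target=append_item, args=("from process",))
    process.start()
    process.join()
    print(shared_items)  # ['from thread']

    # To get data back from a process, pass it explicitly, e.g. via a Queue.
    queue = Queue()
    process = Process(target=send_item, args=(queue, "from process"))
    process.start()
    print(queue.get(timeout=10))  # from process
    process.join()
```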
#3 GIL
The Python Global Interpreter Lock (GIL) is a lock that permits only one thread at a time to hold control of the Python interpreter. All threads within a process are subject to the GIL, which often makes multithreading a poor choice for CPU-bound work in Python: true multi-core execution through multithreading is not supported on the standard CPython interpreter. Processes, however, are not limited in this way because each Python process has its own GIL; the lock applies within a process but not across processes.
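One rough way to observe the effect of the GIL is to time the same CPU-bound function in two threads and then in two processes. The sketch below does this with hypothetical helpers count_down and timed_run. Timings depend heavily on the machine and interpreter, so no output is shown; on a multi-core machine running standard CPython, the process version would typically be expected to finish faster.

```python
import time
from multiprocessing import Process
from threading import Thread


def count_down(n):
    """A CPU-bound loop with no I/O."""
    while n > 0:
        n -= 1


def timed_run(worker_class, n_workers, n):
    """Run count_down in n_workers workers and return the elapsed seconds."""
    workers = [worker_class(target=count_down, args=(n,)) for _ in range(n_workers)]
    start = time.perf_counter()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 3_000_000
    print(f"threads:   {timed_run(Thread, 2, n):.2f} s")
    print(f"processes: {timed_run(Process, 2, n):.2f} s")
```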
The benefits of Python Multiprocessing
Think of a processor as an entrepreneur. As the entrepreneur's business grows, there are more tasks that must be managed to keep up with the growth of the business. If the entrepreneur decides to take on all of these tasks alone (i.e., accounting, sales, marketing, innovation, etc.), she risks hampering the overall efficiency and performance of the business since one person can only do so much at any one time. For example, before moving on to innovation tasks, she must stop the sales tasks - this is known as running tasks “sequentially.”
Most entrepreneurs understand that trying to do everything alone is a bad idea. Consequently, they typically offset the growing number of tasks by hiring employees to manage various departments. This way, tasks can be done in parallel, meaning one task does not have to be stopped for another to run. Hiring more employees to perform specific tasks is akin to using multiple processors to carry out operations. For example, computer vision projects are quite demanding since you will typically be required to process lots of image data, which is time-consuming. To speed up this procedure, you could process multiple images in parallel.
Thus, we can say that multiprocessing is useful for making programs more efficient by dividing and assigning tasks to different processors. Python’s multiprocessing module simplifies this further by serving as a high-level tool to increase your programs' efficiency by assigning tasks to different processes.
The basics of Python’s Multiprocessing module
In the “What are processes in Python?” section, we mentioned that the lifecycle of a Python process consists of three stages: the new process, the running process, and the terminated process. This section will delve deeper into each phase of the lifecycle and provide code examples.
The new process
A new process is one that has been created by instantiating the Process class. At this point only a Python object exists: no child process is spawned until the object's start() method is called.
    from multiprocessing import Process

    # Create a new process
    process = Process()
Right now, our process instance is not doing anything because we initialized an empty Process object. We could alter our Process object's configurations to execute a specific function by passing a function we want to run in a different process to the target parameter of the class.
    # Create a new process with a specified function to execute.
    def example_function():
        pass

    new_process = Process(target=example_function)
If our target function also takes parameters, we simply pass them to the args parameter of the Process object as a tuple.
    # Create a new process with specific function to execute with args.
    def example(args):
        pass

    process = Process(target=example, args=("Hi",))
Note that we have only created a new process, but it is not yet running.
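A few attributes of the Process object confirm that nothing is running yet. This is a small sketch; the process name "example-process" is an arbitrary label chosen for illustration.

```python
from multiprocessing import Process


def example(message):
    print(message)


# Creating the object only configures it; no child process exists yet.
process = Process(target=example, args=("Hi",), name="example-process")

print(process.name)        # example-process
print(process.is_alive())  # False
print(process.pid)         # None (no process ID has been assigned)
print(process.exitcode)    # None (the process has not run)
```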
Let’s see how we can run a new process.
The running process
Running a new process is quite straightforward: simply call the start() method of the process instance.
    # Run the new process
    process.start()
This action begins the process’s activity: start() creates the child process and arranges for the run() method of the Process instance to be called inside it. The run() method, in turn, calls the custom function passed to the target parameter (if one was specified).
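The relationship between start() and run() is easier to see with a subclass that overrides run(). The class PidReporter below is a hypothetical example: started through start(), its run() body executes in a child process, whereas calling run() directly is just an ordinary method call in the current process.

```python
import os
from multiprocessing import Process


class PidReporter(Process):
    """Hypothetical Process subclass that overrides run()."""

    def run(self):
        # When invoked through start(), this body executes in the child.
        print(f"run() executing in process {os.getpid()}")


if __name__ == "__main__":
    print(f"parent process is {os.getpid()}")

    reporter = PidReporter()
    reporter.start()  # a child process is created and run() is called there
    reporter.join()

    # Calling run() directly does not create a new process:
    # the method simply executes in the parent.
    PidReporter().run()
```

The two printed process IDs from run() should differ: the first belongs to the child, the second matches the parent.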
Recall that earlier in the tutorial, we stated each process has at least one thread called the main thread - this is the default. Thus, when a child process is started, the main thread is created for that child process and is started. The main thread is responsible for executing all of our code in the child process.
Using the is_alive() method on our Process instance, we can check whether the process is alive: it returns True from the moment the start() method returns until the child process terminates.
Note: If you’re testing this in a notebook, the is_alive() call must be in the same cell as the call to the start() method for it to catch the running process.
    # Run the new process
    process.start()

    # Check process is running
    process.is_alive()

    """
    True
    """
If the process was not running, the call to the is_alive() method would return False.
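The following sketch traces is_alive() across the whole lifecycle. The target function nap is a stand-in that keeps the child busy long enough for the check to observe it running.

```python
import time
from multiprocessing import Process


def nap(seconds):
    """Keep the process busy by sleeping."""
    time.sleep(seconds)


if __name__ == "__main__":
    process = Process(target=nap, args=(0.5,))
    print("before start():", process.is_alive())  # False

    process.start()
    print("after start(): ", process.is_alive())  # True while the child sleeps

    process.join()  # wait for the child to finish
    print("after join():  ", process.is_alive())  # False
```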
The terminated process
When the run() method returns or exits, the process is terminated: we don’t necessarily have to do anything explicitly. In other words, you can expect a process to be terminated after the instructions in the target function are complete. At that point, the is_alive() method returns False.
    # Check process is terminated - should return False.
    process.is_alive()

    """
    False
    """
However, a process may also be terminated if it encounters an unhandled exception or an error is raised. For example, if there is an error in the function you set as the target, the process will terminate.
    # Create a new process with a specific function that has an error
    # to execute with args.
    def example(args):
        split_args = list(args.split())
        # "name" variable is not in the function namespace - should raise error
        return name

    # New process
    process = Process(target=example, args=("Hi",))

    # Running the new process
    process.start()

    # Check process is running
    process.is_alive()

    """
    True
    Process Process-15:
    Traceback (most recent call last):
      File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
        self.run()
    """
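One way to tell a normal finish from a crash after the fact is the exitcode attribute of the Process object. In this sketch, the functions succeed and fail are illustrative targets; the traceback from the failing child is printed by the child itself.

```python
from multiprocessing import Process


def succeed():
    pass


def fail():
    raise ValueError("something went wrong")


if __name__ == "__main__":
    for target in (succeed, fail):
        process = Process(target=target)
        process.start()
        process.join()
        # exitcode is 0 after a clean finish and 1 after an unhandled exception
        print(f"{target.__name__}: exitcode={process.exitcode}")
```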
The Process object also has terminate() and kill() methods that enable users to terminate a process forcefully.
    # Create a new process with a specific function to execute with args.
    def example(args):
        split_args = list(args.split())
        return split_args

    # New process
    process = Process(target=example, args=("Hi",))

    # Running the new process
    process.start()

    if process.is_alive():
        process.terminate()  # You can also use process.kill()
        print("Process terminated forcefully")

    """
    Process terminated forcefully
    """
It is important to note that forceful terminations are not the recommended way to terminate a process: they may not close all open resources safely or store the required program state. A better solution is to use a controlled shutdown with a process-safe boolean flag or similar tool.
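A controlled shutdown of this kind could be sketched with multiprocessing.Event, which behaves like a process-safe boolean flag. The worker function and its poll_interval parameter are assumptions made for illustration; a real worker would do useful work in each loop iteration and release its resources before returning.

```python
import time
from multiprocessing import Event, Process


def worker(stop_event, poll_interval=0.1):
    """Loop until the parent sets the stop flag, then return normally."""
    while not stop_event.is_set():
        time.sleep(poll_interval)  # stand-in for one unit of real work
    print("Worker saw the stop flag and is exiting cleanly")


if __name__ == "__main__":
    stop_event = Event()  # process-safe boolean flag, initially unset
    process = Process(target=worker, args=(stop_event,))
    process.start()

    time.sleep(0.5)   # let the worker run for a moment
    stop_event.set()  # ask the worker to stop
    process.join()    # wait for it to finish on its own

    print(process.exitcode)  # 0: the process ended normally
```

Because the worker returns on its own, it has the chance to finish its current step, unlike a process stopped with terminate() or kill().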
Python Multiprocessing Tutorial
Now that you understand the basics of multiprocessing, let’s work on an example to demonstrate how to do concurrent programming in Python.
The function we create will simply print a statement, sleep for 1 second, then print another statement. You can learn more about functions in this Python functions tutorial.
    import time

    def do_something():
        print("I'm going to sleep")
        time.sleep(1)
        print("I'm awake")
The first step is to create a new process: we are going to create two.
    # Create new child process
    process_1 = Process(target=do_something)
    process_2 = Process(target=do_something)
Here’s how a concurrent program will look in a notebook environment:
    %%time

    # Starts both processes
    process_1.start()
    process_2.start()

    """
    I'm going to sleep
    CPU times: user 810 µs, sys: 7.34 ms, total: 8.15 ms
    Wall time: 6.04 ms
    I'm going to sleep
    """
Given the output of the program, it’s pretty evident that there’s a problem somewhere in our code. The timer is printed midway through our first process, and the second print statement is not printed.
This occurs because there are three processes running: main process, process_1, and process_2. The process that is tracking the time and printing it is the main process. For our main process to wait before printing the time, we must call the join() method on our two processes after we run them.
Note: If you’re interested in learning more, check out this Stack Overflow discussion.
Let’s look at our new code snippet:
    %%time

    # Create new child process (Cannot run a process more than once)
    new_process_1 = Process(target=do_something)
    new_process_2 = Process(target=do_something)

    # Starts both processes
    new_process_1.start()
    new_process_2.start()

    new_process_1.join()
    new_process_2.join()

    """
    I'm going to sleep
    I'm going to sleep
    I'm awake
    I'm awake
    CPU times: user 0 ns, sys: 14 ms, total: 14 ms
    Wall time: 1.01 s
    """
The wall time of this run (about 1 second) is much longer than that of the first run (about 6 milliseconds) because the main process now waits for both child processes to finish before the time is reported. Note that although each process sleeps for 1 second, the total is roughly 1 second rather than 2, which shows that the two processes ran concurrently. We can apply this same reasoning to make more processes run concurrently.
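As a sketch of how this scales, the helper below (run_concurrently, a name of our choosing) starts any number of processes in a loop and joins them afterwards. It is written as a script with an `if __name__ == "__main__":` guard rather than as a notebook cell, and it uses time.perf_counter in place of the %%time magic.

```python
import time
from multiprocessing import Process


def do_something():
    print("I'm going to sleep")
    time.sleep(1)
    print("I'm awake")


def run_concurrently(n_processes):
    """Start n_processes children, wait for all, return elapsed seconds."""
    processes = [Process(target=do_something) for _ in range(n_processes)]
    start = time.perf_counter()
    for process in processes:
        process.start()
    for process in processes:  # join only after every process has started
        process.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_concurrently(5)
    print(f"Finished 5 processes in {elapsed:.2f} s")
```

Starting every process before joining any of them matters: calling join() immediately after each start() would make the processes run one after another.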
In this tutorial, you learned about how to make Python programs more efficient by running them concurrently. Specifically, you learned:
- What processes are and how you can view them in your computer.
- The similarities and differences between Python’s multiprocessing and threading modules.
- The basics of the multiprocessing module and how to run a Python program concurrently using multiprocessing.
Want to learn more about programming in Python? Check out DataCamp’s Python Programmer Track. No prior programming experience is required, and by the end of the track, you will have gained the necessary skills to successfully develop software, wrangle data, and perform advanced analysis in Python.