Python Data Classes: A Comprehensive Tutorial
Data classes are one of those Python features that, once you discover them, you never want to go back to the old way. Consider this regular class:
class Exercise:
    def __init__(self, name, reps, sets, weight):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight
To me, that class definition is very inefficient: in the __init__ method, you repeat each parameter at least three times. This may not sound like a big deal, but think about how often you write classes in your lifetime with many more parameters.
In comparison, take a look at the data class alternative to the code above:
from dataclasses import dataclass

@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float  # Weight in lbs
This modest-looking piece of code is far more concise than a regular class. The tiny @dataclass decorator implements the __init__, __repr__, and __eq__ methods behind the scenes, which would have taken around 20 lines of code to write manually.
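For a sense of what the decorator saves, here is a rough hand-written equivalent of the three generated methods. It is a sketch, not the exact code the dataclasses module produces, and the class name ManualExercise is made up for this illustration.

```python
class ManualExercise:
    """Hand-written approximation of what @dataclass generates."""

    def __init__(self, name: str, reps: int, sets: int, weight: float):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight

    def __repr__(self) -> str:
        return (
            f"{type(self).__name__}(name={self.name!r}, reps={self.reps!r}, "
            f"sets={self.sets!r}, weight={self.weight!r})"
        )

    def __eq__(self, other: object) -> bool:
        # Only instances of the same class are compared field by field
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.name, self.reps, self.sets, self.weight) == (
            other.name, other.reps, other.sets, other.weight,
        )


ex = ManualExercise("Bench press", 10, 3, 52.5)
print(ex)
print(ex == ManualExercise("Bench press", 10, 3, 52.5))
```

Every field name appears in all three methods, which is exactly the repetition the decorator removes.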
Besides, many other features, such as comparison operators, object ordering, and immutability, are all a single line away from being magically created for our class.
So, the purpose of this tutorial is to show you why data classes are one of the best things to happen to Python if you love object-oriented programming.
Let’s get started!
Basics of Python Data Classes
Let’s cover some of the fundamental concepts of Python data classes that make them so useful.
Some methods are automatically generated in data classes
Despite all their features, data classes are regular classes that take much less code to implement the same functionality. Here is the Exercise class again:
from dataclasses import dataclass

@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float

ex1 = Exercise("Bench press", 10, 3, 52.5)

# Verifying Exercise is a regular class
ex1.name
'Bench press'
Right now, Exercise already has the __repr__ and __eq__ methods implemented. Let's verify it:
repr(ex1)
"Exercise(name='Bench press', reps=10, sets=3, weight=52.5)"
The repr of an object should, where possible, return code that can recreate the object, and we can see that this is exactly the case for ex1.
In comparison, Exercise defined in the old way would look like this:
class Exercise:
    def __init__(self, name, reps, sets, weight):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight

ex3 = Exercise("Bench press", 10, 3, 52.5)
ex3
<__main__.Exercise at 0x7f6834100130>
Looks pretty awful and useless!
Now, let’s verify the existence of __eq__, which implements the equality operator:
# Redefine the class
@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float

ex1 = Exercise("Bench press", 10, 3, 52.5)
ex2 = Exercise("Bench press", 10, 3, 52.5)
Comparing an instance to itself and to another instance with identical field values must return True:
ex1 == ex2
True
ex1 == ex1
True
And so it does! In regular classes, this logic would have been a pain to write.
Data classes require type hints
As you might have noticed, data classes require type hints when defining fields. In fact, data classes allow any type from the typing module. For example, here is how to create a field that accepts any data type by annotating it with Any:
from typing import Any

@dataclass
class Dummy:
    attr: Any
However, one idiosyncrasy of Python is that even though data classes require type hints, the types aren’t actually enforced at runtime.
For example, creating an instance of the Exercise class with completely incorrect data types runs without errors:
silly_exercise = Exercise("Bench press", "ten", "three sets", 52.5)
silly_exercise.sets
'three sets'
If you want to enforce data types, you must use type checkers such as Mypy.
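Static checkers catch these mistakes before the code runs. For a runtime check, one possible approach is to walk over the field definitions and compare each value against its annotation. The sketch below is an assumption about how that could look, not part of the dataclasses module: the helper find_type_mismatches is hypothetical, and it only handles plain classes such as str or int, not typing constructs.

```python
from dataclasses import dataclass, fields


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


def find_type_mismatches(instance) -> list:
    """Return the names of fields whose value does not match the annotation.

    Annotations that are not plain classes (for example List[int]) are skipped.
    """
    mismatches = []
    for f in fields(instance):
        value = getattr(instance, f.name)
        if isinstance(f.type, type) and not isinstance(value, f.type):
            mismatches.append(f.name)
    return mismatches


silly_exercise = Exercise("Bench press", "ten", "three sets", 52.5)
print(find_type_mismatches(silly_exercise))
```

Note that this strict check would also flag an int passed for a float field, so a real validator would need looser rules.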
Data classes allow default values in fields
Until now, we haven’t added any defaults to our classes. Let’s fix that:
@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0

# Now, all fields have defaults
ex5 = Exercise()
ex5
Exercise(name='Push-ups', reps=10, sets=3, weight=0)
Keep in mind that non-default fields can’t follow default fields. For example, the below code will throw an error:
@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float  # NOT ALLOWED

ex5 = Exercise()
ex5
TypeError: non-default argument 'weight' follows default argument
In practice, you will rarely define defaults with the name: type = value syntax.
Instead, you will use the field function, which allows more control over each field definition:
from dataclasses import field

@dataclass
class Exercise:
    name: str = field(default="Push-up")
    reps: int = field(default=10)
    sets: int = field(default=3)
    weight: float = field(default=0)

# Now, all fields have defaults
ex5 = Exercise()
ex5
Exercise(name='Push-up', reps=10, sets=3, weight=0)
The field function has more parameters, such as:
- repr
- init
- compare
- default_factory
and so on. We will discuss these in the coming sections.
Data classes can be created with a function
A final note on the data class basics is that their definition can be even shorter by using the make_dataclass function:
from dataclasses import make_dataclass

Exercise = make_dataclass(
    "Exercise",
    [
        ("name", str),
        ("reps", int),
        ("sets", int),
        ("weight", float),
    ],
)

ex3 = Exercise("Deadlifts", 8, 3, 69.0)
ex3
Exercise(name='Deadlifts', reps=8, sets=3, weight=69.0)
But you will sacrifice readability, so I don’t recommend using this function.
Advanced Python Data Classes
In this section, we will discuss advanced features of data classes that bring more benefits. One such feature is a default factory.
Default factories
To explain default factories, let’s create another class named WorkoutSession that accepts two fields:
from dataclasses import dataclass
from typing import List

@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0

@dataclass
class WorkoutSession:
    exercises: List[Exercise]
    duration_minutes: int
By using the List type, we are specifying that WorkoutSession accepts a list of Exercise instances.
# Define the Exercise instances for HIIT training
ex1 = Exercise(name="Burpees", reps=15, sets=3)
ex2 = Exercise(name="Mountain Climbers", reps=20, sets=3)
ex3 = Exercise(name="Jump Squats", reps=12, sets=3)
exercises_monday = [ex1, ex2, ex3]
hiit_monday = WorkoutSession(exercises=exercises_monday, duration_minutes=30)
Right now, each workout session instance requires exercises to be initialized. But this doesn’t mirror how people work out — first, they start a session (probably in an app), and then they add exercises as they work out.
So, we must be able to create sessions with no exercises and no duration. Let’s make this happen by adding an empty list as a default value for exercises:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = []
    duration_minutes: int = None

hiit_monday = WorkoutSession()
ValueError: mutable default <class 'list'> for field exercises is not allowed: use default_factory
However, we got an error — it turns out data classes don’t allow mutable default values.
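The restriction guards against a classic Python pitfall: a mutable default is created once and then shared by every instance. The sketch below reproduces the problem with a regular class; NaiveSession is a made-up name used only for this illustration.

```python
class NaiveSession:
    # The same list object is reused as the default on every call
    def __init__(self, exercises=[]):
        self.exercises = exercises


monday = NaiveSession()
wednesday = NaiveSession()
monday.exercises.append("Burpees")

# Wednesday's session now contains Monday's exercise as well
print(wednesday.exercises)
print(monday.exercises is wednesday.exercises)
```

Data classes refuse list, dict, and set defaults at class definition time so that this bug cannot slip in silently.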
Thankfully, we can fix this by using a default factory:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=list)  # PAY ATTENTION
    duration_minutes: int = 0

hiit_monday = WorkoutSession()
hiit_monday
WorkoutSession(exercises=[], duration_minutes=0)
The default_factory parameter accepts a zero-argument callable that returns an initial value for a data class field. This means that it can accept any such callable:
- tuple
- dict
- set
- Any user-defined custom function
This holds regardless of whether the result of the function is mutable or not.
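Because the factory is called again for every new instance, each object gets its own fresh value. The following sketch illustrates this; SketchSession and its tags field are hypothetical names introduced only for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchSession:
    # list() is called once per instance, so nothing is shared
    exercises: List[str] = field(default_factory=list)
    # Any zero-argument callable works, including a lambda
    tags: List[str] = field(default_factory=lambda: ["untagged"])


first = SketchSession()
second = SketchSession()
first.exercises.append("Burpees")

print(first.exercises, second.exercises)
print(first.tags is second.tags)
```

Only the first session receives the new exercise, and the two tags lists are separate objects.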
Now, if we think about it, most people start their training with warm-up exercises that are typically similar for any kind of workout. So, initializing sessions with no exercises may not be what some people want.
Instead, let’s create a function that returns three warm-up Exercise instances:
def create_warmup():
    return [
        Exercise("Jumping jacks", 30, 1),
        Exercise("Squat lunges", 10, 2),
        Exercise("High jumps", 20, 1),
    ]

@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5  # Increase the default duration as well

hiit_monday = WorkoutSession()
hiit_monday
WorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), Exercise(name='Squat lunges', reps=10, sets=2, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0)], duration_minutes=5)
Now, any time we create a session, it will come with some warm-up exercises already logged. The new version of WorkoutSession has a default duration of five minutes to account for that.
Adding methods to data classes
Since data classes are regular classes, adding methods to them stays the same. Let’s add two methods to our WorkoutSession data class:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes
Using these methods, we can now log any new activity to a session:
hiit_monday = WorkoutSession()
# Log a new exercise
new_exercise = Exercise("Deadlifts", 6, 4, 60)
hiit_monday.add_exercise(new_exercise)
hiit_monday.increase_duration(15)
But there is a problem:
hiit_monday
WorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), Exercise(name='Squat lunges', reps=10, sets=2, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0), Exercise(name='Deadlifts', reps=6, sets=4, weight=60)], duration_minutes=20)
When we print the session, its default representation is too verbose and unreadable since it contains the code to recreate the object. Let’s fix that.
__repr__ and __str__ in data classes
Data classes implement __repr__ automatically but not __str__. This makes the class fall back on __repr__ when we call print on it.
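The fallback is easy to confirm. In this small sketch, the class body gains a __repr__ but no __str__, so str() and repr() return the same text.

```python
from dataclasses import dataclass


@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10


ex = Exercise()

# @dataclass adds __repr__ to the class body but leaves __str__ alone
print("__repr__" in vars(Exercise), "__str__" in vars(Exercise))

# With no custom __str__, str() delegates to __repr__
print(str(ex) == repr(ex))
```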
So, let’s override this behavior by defining __str__ on our own:
@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0

    def __str__(self):
        base = f"{self.name}: {self.reps}/{self.sets}"
        if self.weight == 0:
            return base
        return base + f", {self.weight} lbs"

ex1 = Exercise(name="Burpees", reps=15, sets=3)
ex1
Exercise(name='Burpees', reps=15, sets=3, weight=0)
The __repr__ is still the same, but when we call print on it:
print(ex1)
Burpees: 15/3
The class’s string representation is much nicer. Now, let’s fix WorkoutSession as well:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5  # Increase the default duration as well

    def add_exercise(self, exercise: Exercise):
        self.exercises.append(exercise)

    def increase_duration(self, minutes: int):
        self.duration_minutes += minutes

    def __str__(self):
        base = ""
        for ex in self.exercises:
            base += str(ex) + "\n"
        base += f"\nSession duration: {self.duration_minutes} minutes."
        return base

hiit_monday = WorkoutSession()
print(hiit_monday)
Jumping jacks: 30/1
Squat lunges: 10/2
High jumps: 20/1

Session duration: 5 minutes.
Now, we have got a readable and compact output.
Comparison in data classes
For many classes, it makes sense to compare their objects by some logic. For workouts, it can be the workout duration, the exercise intensity or the weight.
First, let’s see what happens if we try to compare two workouts in the current state:
hiit_wednesday = WorkoutSession()
hiit_wednesday.add_exercise(Exercise("Pull-ups", 7, 3))
print(hiit_wednesday)
Jumping jacks: 30/1
Squat lunges: 10/2
High jumps: 20/1
Pull-ups: 7/3
Session duration: 5 minutes.
hiit_monday > hiit_wednesday
TypeError: '>' not supported between instances of 'WorkoutSession' and 'WorkoutSession'
We receive a TypeError because data classes don't implement ordering operators by default. But this is easily fixable by setting the order parameter to True:
@dataclass(order=True)
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5
    ...

hiit_monday = WorkoutSession()
# hiit_monday.add_exercise(...)
hiit_monday.increase_duration(10)
hiit_wednesday = WorkoutSession()
hiit_monday > hiit_wednesday
True
This time, comparison works, but what are we even comparing?
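In data classes, comparison is performed field by field, in the order in which the fields are defined, as if each instance were a tuple of its fields. Right now, both sessions hold equal lists in the first field, exercises, so the result is decided by the next field, the workout duration. Note that if the exercise lists differed, Python would try to order the Exercise objects themselves, which may raise a TypeError unless Exercise also defines ordering.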
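The sketch below illustrates how order=True behaves like tuple comparison over the fields, and what happens when a list field holds objects that cannot be ordered. The Note and Session classes are hypothetical and exist only for this example.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass(order=True)
class Exercise:
    name: str
    reps: int
    sets: int


@dataclass
class Note:  # hypothetical class without ordering support
    text: str


@dataclass(order=True)
class Session:
    notes: List[Note]
    duration_minutes: int


a = Exercise("Burpees", 15, 3)
b = Exercise("Burpees", 20, 3)
# Names are equal, so reps decides; this matches plain tuple comparison
print(a < b, astuple(a) < astuple(b))

# Equal first fields: the comparison falls through to duration_minutes
print(Session([Note("easy")], 15) > Session([Note("easy")], 5))

# Different first fields: Python has to order Note objects and cannot
try:
    Session([Note("easy")], 15) > Session([Note("hard")], 5)
    ordering_failed = False
except TypeError:
    ordering_failed = True
print(ordering_failed)
```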
We can verify this by increasing the duration of the Wednesday session:
hiit_monday = WorkoutSession()
# hiit_monday.add_exercise(...)
hiit_wednesday = WorkoutSession()
hiit_wednesday.increase_duration(10)
hiit_monday > hiit_wednesday
False
As expected, we received False.
But what would happen if the first field of WorkoutSession was another type of field, say, a string? Let's try and find out:
@dataclass(order=True)
class WorkoutSession:
    date: str = None  # DD-MM-YYYY
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5
    ...

hiit_monday = WorkoutSession("25-02-2024")
hiit_monday.increase_duration(10)
hiit_wednesday = WorkoutSession("27-02-2024")
hiit_monday > hiit_wednesday
False
Even though the Monday session lasts longer, the comparison tells us that it is smaller than Wednesday. The reason is that “25” comes before “27” in Python string comparison.
So, how do we keep the order of the fields and still sort sessions based on the workout duration? This is easy through the field function:
@dataclass(order=True)
class WorkoutSession:
    date: str = field(default=None, compare=False)
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5
    ...

hiit_monday = WorkoutSession("25-02-2024")
hiit_monday.increase_duration(10)
hiit_wednesday = WorkoutSession("27-02-2024")
hiit_monday > hiit_wednesday
True
By setting compare to False for a field, we exclude it from both equality checks and ordering, as evidenced by the above result.
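One side effect worth knowing is that compare=False affects == as well as the ordering operators. A minimal sketch, using a simplified, hypothetical Session class:

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class Session:
    # Ignored by ==, <, <=, > and >=
    date: str = field(default="", compare=False)
    duration_minutes: int = 5


monday = Session("25-02-2024", 15)
wednesday = Session("27-02-2024", 15)

# Dates differ, but only duration_minutes takes part in comparisons
print(monday == wednesday)
print(monday > wednesday)
```

Two sessions on different dates count as equal here, which may or may not be what you want for your own classes.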
Post-init field manipulation
Right now, we have a default session duration of five minutes to account for warm-up exercises. However, this only makes sense if a user starts a session with a warm-up. What if they start a session with other exercises:
new_session = WorkoutSession(exercises=[Exercise("Diamond push-ups", 10, 3)])
new_session.duration_minutes
5
For just a single exercise, the total duration is five minutes, which is illogical. Each session should estimate its duration dynamically, based on the number of sets of each exercise. This means we should make duration_minutes dependent on the exercises field.
Let’s implement it:
@dataclass
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = field(default=0, init=False)

    def __post_init__(self):
        set_duration = 3
        for ex in self.exercises:
            self.duration_minutes += ex.sets * set_duration
    ...
This time, we are defining duration_minutes with init set to False, which removes it from the __init__ parameters so that it always starts at its default of 0.
Then, inside the special method __post_init__, which runs right after the generated __init__, we update its value based on the number of sets in each Exercise.
Now, when we initialize WorkoutSession, duration_minutes is dynamically increased by three minutes for each set in each exercise.
# Adding an exercise with three sets
hiit_friday = WorkoutSession([Exercise("Diamond push-ups", 10, 3)])
hiit_friday.duration_minutes
9
In general, if you want to define a field that depends on other fields of your data class, you can use the __post_init__ logic.
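The same pattern works for any derived value. As a further sketch, here is an Exercise with a total_volume field; that field is a hypothetical addition for illustration and is not part of the classes above.

```python
from dataclasses import dataclass, field


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float = 0
    # Derived field: not an __init__ parameter, filled in by __post_init__
    total_volume: float = field(init=False)

    def __post_init__(self):
        self.total_volume = self.reps * self.sets * self.weight


ex = Exercise("Deadlifts", 6, 4, 60)
print(ex.total_volume)
```

Keep in mind that the derived value is computed once, at creation; changing reps afterwards does not update total_volume.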
Immutability in Data Classes
Our WorkoutSession data class is almost ready; it just needs to be protected. Right now, it can be messed up pretty easily:
hiit_friday.duration_minutes = 1000
hiit_friday.duration_minutes
1000
del hiit_friday.exercises
We want to protect all fields of our classes so that they can be modified only in a way we want. To accomplish this, the @dataclass decorator offers a convenient frozen argument:
@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int
    weight: int | float = 0

ex1 = FrozenExercise("Muscle-ups", 5, 3)
Now, if we want to modify any field, we get an error:
ex1.sets = 5
FrozenInstanceError: cannot assign to field 'sets'
Setting frozen to True automatically adds __setattr__ and __delattr__ methods to the class that raise FrozenInstanceError, so fields are protected from updates or deletion after initialization. Others won't be able to add new fields either:
ex1.new_field = 10
FrozenInstanceError: cannot assign to field 'new_field'
This functionality would take considerably more code to implement by hand in a traditional class.
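When a frozen instance does need to change, the usual route is to build a modified copy instead. The sketch below uses dataclasses.replace for that; the variable names are chosen for illustration.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int
    weight: float = 0


ex1 = FrozenExercise("Muscle-ups", 5, 3)

try:
    ex1.sets = 5
except FrozenInstanceError as err:
    print(f"Blocked: {err}")

# replace() builds a new instance and leaves the original untouched
ex2 = replace(ex1, sets=5)
print(ex1.sets, ex2.sets)
```

The original keeps its three sets, while the copy carries the new value.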
However, please note that we can’t make our classes truly immutable. For example, let’s rewrite the WorkoutSession with frozen set to True:
@dataclass(frozen=True)
class ImmutableWorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

session1 = ImmutableWorkoutSession()
As expected, we can’t directly modify the list of exercises:
session1.exercises = [Exercise()]
FrozenInstanceError: cannot assign to field 'exercises'
However, exercises is a list, which is fully mutable, making the following operation possible:
# Changing one of the elements in a list
session1.exercises[1] = FrozenExercise("Totally new exercise", 5, 5)
print(session1)
ImmutableWorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), FrozenExercise(name='Totally new exercise', reps=5, sets=5, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0)], duration_minutes=5)
So, to protect from accidental changes, it is recommended to use immutable objects such as tuples for field values.
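As a sketch of that recommendation, the session below stores its exercises in a tuple, so item assignment fails as well. TupleWorkoutSession is a hypothetical name, and the example assumes the exercises themselves are frozen too.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int


@dataclass(frozen=True)
class TupleWorkoutSession:
    # A tuple is immutable, so it is also safe as a plain default
    exercises: Tuple[FrozenExercise, ...] = ()
    duration_minutes: int = 5


session = TupleWorkoutSession((FrozenExercise("Jumping jacks", 30, 1),))

try:
    session.exercises[0] = FrozenExercise("Totally new exercise", 5, 5)
except TypeError as err:
    print(f"Blocked: {err}")
```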
Inheritance in data classes
One last point we will cover is the order of fields in parent and child classes.
Since data classes are regular classes, inheritance works as usual:
@dataclass(frozen=True)
class ImmutableWorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

@dataclass(frozen=True)
class CardioWorkoutSession(ImmutableWorkoutSession):
    pass
But, since the last field in the parent class (ImmutableWorkoutSession) has a default value, all fields in child classes must have default values.
For example, this is not allowed:
@dataclass(frozen=True)
class ImmutableWorkoutSession:
    exercises: List[Exercise] = field(default_factory=create_warmup)
    duration_minutes: int = 5

@dataclass(frozen=True)
class CardioWorkoutSession(ImmutableWorkoutSession):
    intensity_level: str  # Not allowed, must have a default

TypeError: non-default argument 'intensity_level' follows default argument
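The simplest way around this is to give the child field a default as well. Here is a minimal sketch with simplified, hypothetical class names and an arbitrary default value:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseSession:
    duration_minutes: int = 5


@dataclass(frozen=True)
class CardioSession(BaseSession):
    # A default keeps the generated __init__ signature valid
    intensity_level: str = "moderate"


cardio = CardioSession(intensity_level="high")
print(cardio)
```

Parent fields come first in the generated __init__, followed by the child's fields, which is why the ordering rule carries over to subclasses.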
Disadvantages of Data Classes and Further Resources
Data classes have been steadily improving since Python 3.7 (they were great to begin with) and cover many use cases where you might need to write classes. But they might be disadvantageous in the following scenarios:
- Custom __init__ methods
- Custom __new__ methods
- Various inheritance patterns
And many more, as discussed in this great Reddit thread. If you want a more detailed rationale for why data classes were introduced and why they aren’t drop-in replacements to regular class definitions, read PEP 557.
Fundamentally, data classes are fancier structures to hold and retrieve data more efficiently. However, Python has many other data structures that perform this task in a more or less similar manner. For example, you can learn about counters, defaultdicts and namedtuples in the last chapter of the Data Types for Data Science course.
I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.