
Python Data Classes: A Comprehensive Tutorial

A beginner-friendly tutorial on Python data classes and how to use them in practice
Updated Mar 2024  · 9 min read

Data classes are one of those Python features that, once you discover them, make it hard to go back to the old way. Consider this regular class:

class Exercise:
   def __init__(self, name, reps, sets, weight):
       self.name = name
       self.reps = reps
       self.sets = sets
       self.weight = weight

To me, that class definition is very repetitive: in the __init__ method, you write each parameter name three times. This may not sound like a big deal, but think about how many classes you will write over the years, often with many more parameters.

In comparison, take a look at the data classes alternative of the above code:

from dataclasses import dataclass


@dataclass
class Exercise:
   name: str
   reps: int
   sets: int
   weight: float  # Weight in lbs

This modest-looking piece of code is far more compact than the regular class. The tiny @dataclass decorator implements the __init__, __repr__, and __eq__ methods behind the scenes, which would have taken at least 20 lines of code to write manually.

Besides, many other features, such as comparison operators, object ordering, and immutability, are all a single line away from being magically created for our class.

So, the purpose of this tutorial is to show you why data classes are one of the best things to happen to Python if you love object-oriented programming.

Let’s get started!

Basics of Python Data Classes

Let’s cover some of the fundamental concepts of Python data classes that make them so useful.

Some methods are automatically generated in data classes

Despite all their features, data classes are regular classes that take much less code to implement the same functionality. Here is the Exercise class again:

from dataclasses import dataclass


@dataclass
class Exercise:
   name: str
   reps: int
   sets: int
   weight: float


ex1 = Exercise("Bench press", 10, 3, 52.5)

# Verifying Exercise is a regular class
ex1.name
'Bench press'

Right now, Exercise already has the __repr__ and __eq__ methods implemented. Let's verify it:

repr(ex1)
"Exercise(name='Bench press', reps=10, sets=3, weight=52.5)"

By convention, the repr of an object should return code that can recreate that object, and we can see that is exactly the case for ex1.

In comparison, Exercise defined in the old way would look like this:

class Exercise:
   def __init__(self, name, reps, sets, weight):
       self.name = name
       self.reps = reps
       self.sets = sets
       self.weight = weight


ex3 = Exercise("Bench press", 10, 3, 52.5)

ex3
<__main__.Exercise at 0x7f6834100130>

Looks pretty awful and useless!

Now, let’s verify the existence of __eq__, the method behind the == operator:

# Redefine the class
@dataclass
class Exercise:
   name: str
   reps: int
   sets: int
   weight: float


ex1 = Exercise("Bench press", 10, 3, 52.5)
ex2 = Exercise("Bench press", 10, 3, 52.5)

Comparing an instance to itself and to another instance with identical field values must return True:

ex1 == ex2
True
ex1 == ex1
True

And so it does! In regular classes, this logic would have been a pain to write.
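To get a feel for what the decorator saves us, here is a rough hand-written sketch of the same behavior. ManualExercise is a hypothetical name used only for illustration, and the code that @dataclass actually generates differs in its details:

```python
class ManualExercise:
    def __init__(self, name, reps, sets, weight):
        self.name = name
        self.reps = reps
        self.sets = sets
        self.weight = weight

    def __repr__(self):
        # Mimic the "code that recreates the object" style of data classes
        return (
            f"ManualExercise(name={self.name!r}, reps={self.reps!r}, "
            f"sets={self.sets!r}, weight={self.weight!r})"
        )

    def __eq__(self, other):
        # Only compare against objects of exactly the same class
        if other.__class__ is not self.__class__:
            return NotImplemented
        return (self.name, self.reps, self.sets, self.weight) == (
            other.name,
            other.reps,
            other.sets,
            other.weight,
        )


ex1 = ManualExercise("Bench press", 10, 3, 52.5)
ex2 = ManualExercise("Bench press", 10, 3, 52.5)

print(repr(ex1))
print(ex1 == ex2)
```

Every new field would have to be added in four places here, which is the kind of bookkeeping the decorator takes over.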

Data classes require type hints

As you might have noticed, data classes require type hints when defining fields. In fact, data classes allow any type from the typing module. For example, here is how to create a field that can accept Any data type:

from typing import Any


@dataclass
class Dummy:
   attr: Any

However, the idiosyncrasy of Python is that even though data classes require type hints, types aren’t actually enforced.

For example, creating an instance of Exercise class with completely incorrect data types can be run without errors:

silly_exercise = Exercise("Bench press", "ten", "three sets", 52.5)

silly_exercise.sets

'three sets'

If you want to enforce data types, you must use type checkers such as Mypy.
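If a static checker is not an option, a rough runtime check is also possible. The sketch below uses the fields function from the dataclasses module; find_type_mismatches is a hypothetical helper written for this example, and it assumes every annotation is a plain class such as str or int (it would not handle generics like List[Exercise]):

```python
from dataclasses import dataclass, fields


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    weight: float


def find_type_mismatches(instance):
    """Return the names of fields whose value does not match the annotation.

    Assumes each annotation is a plain class, not a typing generic.
    """
    return [
        f.name
        for f in fields(instance)
        if not isinstance(getattr(instance, f.name), f.type)
    ]


silly_exercise = Exercise("Bench press", "ten", "three sets", 52.5)
mismatches = find_type_mismatches(silly_exercise)
print(mismatches)
```

This only reports problems after the object exists, so a type checker remains the more thorough option.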

Data classes allow default values in fields

Until now, we haven’t added any defaults to our classes. Let’s fix that:

@dataclass
class Exercise:
   name: str = "Push-ups"
   reps: int = 10
   sets: int = 3
   weight: float = 0


# Now, all fields have defaults
ex5 = Exercise()
ex5
Exercise(name='Push-ups', reps=10, sets=3, weight=0)

Keep in mind that non-default fields can’t follow default fields. For example, the below code will throw an error:

@dataclass
class Exercise:
   name: str = "Push-ups"
   reps: int = 10
   sets: int = 3
   weight: float  # NOT ALLOWED


ex5 = Exercise()
ex5
TypeError: non-default argument 'weight' follows default argument

For simple defaults, the name: type = value syntax is enough.

When you need more control over a field definition, you will use the field function instead:

from dataclasses import field


@dataclass
class Exercise:
   name: str = field(default="Push-up")
   reps: int = field(default=10)
   sets: int = field(default=3)
   weight: float = field(default=0)


# Now, all fields have defaults
ex5 = Exercise()
ex5
Exercise(name='Push-up', reps=10, sets=3, weight=0)

The field function has more parameters, such as:

  • repr
  • init
  • compare
  • default_factory

and so on. We will discuss these in the coming sections.
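As a small illustration of one of these parameters, the sketch below hides a field from the generated representation with repr=False. Hiding weight in particular is just an assumption made for the example:

```python
from dataclasses import dataclass, field


@dataclass
class Exercise:
    name: str = field(default="Push-up")
    reps: int = field(default=10)
    sets: int = field(default=3)
    # repr=False keeps the field on the object but leaves it out of __repr__
    weight: float = field(default=0, repr=False)


ex = Exercise()
text = repr(ex)
print(text)
print(ex.weight)
```

The field still exists and still takes part in __init__ and __eq__; only the printed representation changes.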

Data classes can be created with a function

A final note on the data class basics is that their definition can be even shorter by using the make_dataclass function:

from dataclasses import make_dataclass

Exercise = make_dataclass(
   "Exercise",
   [
       ("name", str),
       ("reps", int),
       ("sets", int),
       ("weight", float),
   ],
)

ex3 = Exercise("Deadlifts", 8, 3, 69.0)
ex3
Exercise(name='Deadlifts', reps=8, sets=3, weight=69.0)

But you will sacrifice readability, so I don’t recommend using this function.

Advanced Python Data Classes

In this section, we will discuss advanced features of data classes that bring more benefits. One such feature is a default factory.

Default factories

To explain default factories, let’s create another class named WorkoutSession that accepts two fields:

from dataclasses import dataclass
from typing import List


@dataclass
class Exercise:
   name: str = "Push-ups"
   reps: int = 10
   sets: int = 3
   weight: float = 0


@dataclass
class WorkoutSession:
   exercises: List[Exercise]
   duration_minutes: int

By using the List type, we are specifying that WorkoutSession accepts a list of Exercise instances.

# Define the Exercise instances for HIIT training
ex1 = Exercise(name="Burpees", reps=15, sets=3)
ex2 = Exercise(name="Mountain Climbers", reps=20, sets=3)
ex3 = Exercise(name="Jump Squats", reps=12, sets=3)
exercises_monday = [ex1, ex2, ex3]

hiit_monday = WorkoutSession(exercises=exercises_monday, duration_minutes=30)

Right now, each workout session instance requires exercises to be initialized. But this doesn’t mirror how people work out — first, they start a session (probably in an app), and then they add exercises as they work out.

So, we must be able to create sessions with no exercises and no duration. Let’s make this happen by adding an empty list as a default value for exercises:

@dataclass
class WorkoutSession:
   exercises: List[Exercise] = []
   duration_minutes: int = None


ValueError: mutable default <class 'list'> for field exercises is not allowed: use default_factory

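However, we got an error as soon as the class was defined: data classes reject mutable defaults such as list, dict, and set, because a single default object would be shared by every instance.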
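To see why a shared default is risky, here is a sketch with a regular class that stores a list as a class attribute. SharedListSession is a hypothetical class used only to show the pitfall:

```python
class SharedListSession:
    # A class attribute: one list object shared by all instances
    exercises = []

    def add_exercise(self, name):
        self.exercises.append(name)


monday = SharedListSession()
wednesday = SharedListSession()

monday.add_exercise("Burpees")

# The exercise logged on Monday also shows up on Wednesday
leaked = wednesday.exercises
print(leaked)
```

Both sessions end up pointing at the same list, which is the situation the ValueError is designed to prevent.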

Thankfully, we can fix this by using a default factory:

@dataclass
class WorkoutSession:
   exercises: List[Exercise] = field(default_factory=list)  # PAY ATTENTION
   duration_minutes: int = 0


hiit_monday = WorkoutSession()
hiit_monday
WorkoutSession(exercises=[], duration_minutes=0)

The default_factory parameter accepts any zero-argument callable, which is called each time a new instance needs a default value for the field. For example:

  • tuple
  • dict
  • set
  • Any user-defined custom function

This holds regardless of whether the result of the function is mutable or not.
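A short sketch can confirm that the factory runs once per instance, so each session gets its own objects. The tags field and its lambda factory are assumptions added for illustration:

```python
from dataclasses import dataclass, field


@dataclass
class WorkoutSession:
    exercises: list = field(default_factory=list)
    # Any zero-argument callable works, including a lambda
    tags: dict = field(default_factory=lambda: {"type": "HIIT"})


session_a = WorkoutSession()
session_b = WorkoutSession()

session_a.exercises.append("Burpees")

print(session_a.exercises)
print(session_b.exercises)
```

Appending to one session leaves the other untouched because the two lists are separate objects.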

Now, if we think about it, most people start their training with warm-up exercises that are typically similar for any kind of workout. So, initializing sessions with no exercises may not be what some people want.

Instead, let’s create a function that returns three warm-up Exercises:

def create_warmup():
   return [
       Exercise("Jumping jacks", 30, 1),
       Exercise("Squat lunges", 10, 2),
       Exercise("High jumps", 20, 1),
   ]

@dataclass
class WorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5  # Increase the default duration as well


hiit_monday = WorkoutSession()
hiit_monday

WorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), Exercise(name='Squat lunges', reps=10, sets=2, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0)], duration_minutes=5)

Now, any time we create a session, it will come with some warm-up exercises already logged. The new version of WorkoutSession has a default duration of five minutes to account for that.

Adding methods to data classes

Since data classes are regular classes, adding methods to them stays the same. Let’s add two methods to our WorkoutSession data class:

@dataclass
class WorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5

   def add_exercise(self, exercise: Exercise):
       self.exercises.append(exercise)

   def increase_duration(self, minutes: int):
       self.duration_minutes += minutes

Using these methods, we can now log any new activity to a session:

hiit_monday = WorkoutSession()

# Log a new exercise
new_exercise = Exercise("Deadlifts", 6, 4, 60)

hiit_monday.add_exercise(new_exercise)
hiit_monday.increase_duration(15)

But there is a problem:

hiit_monday

WorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), Exercise(name='Squat lunges', reps=10, sets=2, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0), Exercise(name='Deadlifts', reps=6, sets=4, weight=60)], duration_minutes=20)

When we print the session, its default representation is too verbose and unreadable since it contains the code to recreate the object. Let’s fix that.

__repr__ and __str__ in data classes

Data classes implement __repr__ automatically but not __str__. This makes the class fall back on __repr__ when we call print on it.

So, let’s override this behavior by defining __str__ on our own:

@dataclass
class Exercise:
   name: str = "Push-ups"
   reps: int = 10
   sets: int = 3
   weight: float = 0

   def __str__(self):
       base = f"{self.name}: {self.reps}/{self.sets}"
       if self.weight == 0:
           return base
       return base + f", {self.weight} lbs"


ex1 = Exercise(name="Burpees", reps=15, sets=3)
ex1
Exercise(name='Burpees', reps=15, sets=3, weight=0)

The __repr__ is still the same, but when we call print on it:

print(ex1)
Burpees: 15/3

The class’ string representation is much nicer. Now, let’s fix WorkoutSession as well:

@dataclass
class WorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5  # Increase the default duration as well

   def add_exercise(self, exercise: Exercise):
       self.exercises.append(exercise)

   def increase_duration(self, minutes: int):
       self.duration_minutes += minutes

   def __str__(self):
       base = ""

       for ex in self.exercises:
           base += str(ex) + "\n"
       base += f"\nSession duration: {self.duration_minutes} minutes."

       return base


hiit_monday = WorkoutSession()
print(hiit_monday)

Jumping jacks: 30/1
Squat lunges: 10/2
High jumps: 20/1

Session duration: 5 minutes.


Now, we have got a readable and compact output.

Comparison in data classes

For many classes, it makes sense to compare their objects by some logic. For workouts, it can be the workout duration, the exercise intensity or the weight.

First, let’s see what happens if we try to compare two workouts in the current state:

hiit_wednesday = WorkoutSession()

hiit_wednesday.add_exercise(Exercise("Pull-ups", 7, 3))
print(hiit_wednesday)

Jumping jacks: 30/1
Squat lunges: 10/2
High jumps: 20/1
Pull-ups: 7/3

Session duration: 5 minutes.

hiit_monday > hiit_wednesday
TypeError: '>' not supported between instances of 'WorkoutSession' and 'WorkoutSession'

We receive a TypeError because data classes don't implement ordering operators such as > and < by default. But this is easily fixable by setting the order parameter to True:

@dataclass(order=True)
class WorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5

   ...

hiit_monday = WorkoutSession()
# hiit_monday.add_exercise(...)
hiit_monday.increase_duration(10)

hiit_wednesday = WorkoutSession()

hiit_monday > hiit_wednesday

True

This time, comparison works, but what are we even comparing?

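With order=True, data classes compare instances as tuples of their fields, in the order in which the fields are defined. Right now, both sessions hold the same warm-up exercises, so the first field, exercises, compares equal and the result is decided by the workout duration. If the two lists differed at some position, Python would have to order two Exercise objects, which raises a TypeError unless Exercise also supports ordering.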

We can verify this by increasing the duration of the Wednesday session:

hiit_monday = WorkoutSession()
# hiit_monday.add_exercise(...)

hiit_wednesday = WorkoutSession()
hiit_wednesday.increase_duration(10)

hiit_monday > hiit_wednesday
False

As expected, we received False.
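One way to probe what is actually being compared is to give the two sessions different exercise lists. The sketch below uses a simplified WorkoutSession without the warm-up factory (an assumption made to keep the example short). Sessions with equal lists are decided by duration_minutes, while sessions whose lists differ raise a TypeError because Exercise itself does not define ordering:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Exercise:
    name: str = "Push-ups"
    reps: int = 10
    sets: int = 3
    weight: float = 0


@dataclass(order=True)
class WorkoutSession:
    exercises: List[Exercise] = field(default_factory=list)
    duration_minutes: int = 5


# Equal exercise lists: the comparison falls through to the duration
short_session = WorkoutSession(duration_minutes=5)
long_session = WorkoutSession(duration_minutes=15)
decided_by_duration = long_session > short_session

# Different exercise lists: Python must order two Exercise objects
burpees_session = WorkoutSession([Exercise("Burpees")], 5)
pullups_session = WorkoutSession([Exercise("Pull-ups")], 15)
try:
    burpees_session > pullups_session
    raised_type_error = False
except TypeError:
    raised_type_error = True

print(decided_by_duration, raised_type_error)
```

Adding order=True to Exercise as well would make the second comparison work, at the cost of ordering sessions by exercise name first.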

But what would happen if the first field of WorkoutSession was another type of field, say, a string? Let's try and find out:

@dataclass(order=True)
class WorkoutSession:
   date: str = None  # DD-MM-YYYY
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5

   ...

hiit_monday = WorkoutSession("25-02-2024")
hiit_monday.increase_duration(10)

hiit_wednesday = WorkoutSession("27-02-2024")

hiit_monday > hiit_wednesday
False

Even though the Monday session lasts longer, the comparison tells us that it is smaller than Wednesday. The reason is that “25” comes before “27” in Python string comparison.

So, how do we keep the order of the fields and still sort sessions based on the workout duration? This is easy through the field function:

@dataclass(order=True)
class WorkoutSession:
   date: str = field(default=None, compare=False)
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5

   ...

hiit_monday = WorkoutSession("25-02-2024")
hiit_monday.increase_duration(10)

hiit_wednesday = WorkoutSession("27-02-2024")

hiit_monday > hiit_wednesday
True

By setting compare to False for a field, we exclude it from all generated comparison methods (the ordering operators as well as ==), as evidenced by the above result.
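Here is a minimal sketch of what that means for sorting and equality. It uses a trimmed-down WorkoutSession with only two fields and an empty-string default for date, both assumptions made to keep the example small:

```python
from dataclasses import dataclass, field


@dataclass(order=True)
class WorkoutSession:
    date: str = field(default="", compare=False)  # ignored in comparisons
    duration_minutes: int = 5


sessions = [
    WorkoutSession("25-02-2024", 30),
    WorkoutSession("27-02-2024", 10),
    WorkoutSession("29-02-2024", 20),
]

# sorted() relies on the generated ordering methods, so only duration counts
dates_by_duration = [s.date for s in sorted(sessions)]

# Equality ignores the date as well
same_duration_equal = WorkoutSession("25-02-2024") == WorkoutSession("27-02-2024")

print(dates_by_duration)
print(same_duration_equal)
```

Keep the second effect in mind: two sessions on different dates are considered equal here as long as their compared fields match.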

Post-init field manipulation

Right now, we have a default session duration of five minutes to account for warm-up exercises. However, this only makes sense if a user starts a session with a warm-up. What if they start a session with other exercises:

new_session = WorkoutSession([Exercise("Diamond push-ups", 10, 3)])

new_session.duration_minutes
5

For just a single exercise, the total duration is still the five-minute warm-up default, which doesn't reflect the actual workout. Each session should estimate its duration from the number of sets in each exercise. This means we should make duration_minutes dependent on the exercises field.

Let’s implement it:

@dataclass
class WorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = field(default=0, init=False)

   def __post_init__(self):
       set_duration = 3
       for ex in self.exercises:
           self.duration_minutes += ex.sets * set_duration

   ...

This time, we are defining duration_minutes with init set to False, which removes it from the parameters of the generated __init__ method. The field simply starts at its default value of 0.

Then, inside a special method __post_init__, we are updating its value based on the total number of sets in each Exercise.

Now, when we initialize WorkoutSession, the duration_minutes is dynamically increased by three minutes for each set in each exercise.

# Adding an exercise with three sets
hiit_friday = WorkoutSession([Exercise("Diamond push-ups", 10, 3)])

hiit_friday.duration_minutes
9

In general, if you want to define a field that depends on other fields of your data class, you can use the __post_init__ logic.
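The same pattern works for any derived value. In the sketch below, total_reps is a hypothetical field added to Exercise for illustration. It is computed in __post_init__ and, because of init=False, cannot be passed to the constructor:

```python
from dataclasses import dataclass, field


@dataclass
class Exercise:
    name: str
    reps: int
    sets: int
    # Not an __init__ parameter; filled in after the other fields are set
    total_reps: int = field(init=False)

    def __post_init__(self):
        self.total_reps = self.reps * self.sets


ex = Exercise("Diamond push-ups", 10, 3)
print(ex.total_reps)

# Passing the derived field explicitly is rejected by the generated __init__
try:
    Exercise("Diamond push-ups", 10, 3, total_reps=99)
    rejected = False
except TypeError:
    rejected = True

print(rejected)
```

Note that the derived value is computed once at creation time and is not updated if reps or sets change later.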

Immutability in Data Classes

Our WorkoutSession data class is almost ready; it just needs to be protected. Right now, it can be messed up pretty easily:

hiit_friday.duration_minutes = 1000

hiit_friday.duration_minutes
1000

del hiit_friday.exercises

We want to protect all fields of our classes so that they can be modified only in a way we want. To accomplish this, the @dataclass decorator offers a convenient frozen argument:

@dataclass(frozen=True)
class FrozenExercise:
   name: str
   reps: int
   sets: int
   weight: int | float = 0


ex1 = FrozenExercise("Muscle-ups", 5, 3)

Now, if we want to modify any field, we get an error:

ex1.sets = 5
FrozenInstanceError: cannot assign to field 'sets'

Setting frozen to True automatically adds __setattr__ and __delattr__ methods to the class that raise FrozenInstanceError, so fields are protected from updates or deletion after initialization. Others won't be able to add new fields either:

ex1.new_field = 10
FrozenInstanceError: cannot assign to field 'new_field'

This functionality would take dozens of lines of code to replicate in traditional classes.
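When a frozen instance does need to change, one option is to build a modified copy with the replace function from the dataclasses module. The sketch below also shows how to catch FrozenInstanceError, which is importable from the same module:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int
    weight: float = 0


ex1 = FrozenExercise("Muscle-ups", 5, 3)

try:
    ex1.sets = 5
    blocked = False
except FrozenInstanceError:
    blocked = True

# replace() returns a new instance and leaves the original untouched
ex2 = replace(ex1, sets=5)

print(blocked, ex1.sets, ex2.sets)
```

The original object keeps its values, so any code holding a reference to ex1 is unaffected.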

However, please note that we can’t make our classes truly immutable. For example, let’s rewrite the WorkoutSession with frozen set to True:

@dataclass(frozen=True)
class ImmutableWorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5


session1 = ImmutableWorkoutSession()

As expected, we can’t directly modify the list of exercises:

session1.exercises = [Exercise()]
FrozenInstanceError: cannot assign to field 'exercises'

However, exercises is a list, which is fully mutable, making the following operation possible:

# Changing one of the elements in a list
session1.exercises[1] = FrozenExercise("Totally new exercise", 5, 5)

print(session1)

ImmutableWorkoutSession(exercises=[Exercise(name='Jumping jacks', reps=30, sets=1, weight=0), FrozenExercise(name='Totally new exercise', reps=5, sets=5, weight=0), Exercise(name='High jumps', reps=20, sets=1, weight=0)], duration_minutes=5)

So, to protect from accidental changes, it is recommended to use immutable objects such as tuples for field values.
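As a sketch of that advice, the version below stores the exercises in a tuple. This create_warmup is a variant written for this example that returns a tuple of FrozenExercise objects instead of a list:

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass(frozen=True)
class FrozenExercise:
    name: str
    reps: int
    sets: int
    weight: float = 0


def create_warmup():
    # Returns a tuple, so the container itself cannot be modified
    return (
        FrozenExercise("Jumping jacks", 30, 1),
        FrozenExercise("Squat lunges", 10, 2),
        FrozenExercise("High jumps", 20, 1),
    )


@dataclass(frozen=True)
class ImmutableWorkoutSession:
    exercises: Tuple[FrozenExercise, ...] = field(default_factory=create_warmup)
    duration_minutes: int = 5


session1 = ImmutableWorkoutSession()

try:
    session1.exercises[1] = FrozenExercise("Totally new exercise", 5, 5)
    item_replaced = True
except TypeError:
    item_replaced = False

print(item_replaced)
```

With a tuple of frozen exercises, neither the container nor its elements can be changed in place.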

Inheritance in data classes

One last point we will cover is the order of fields in parent and child classes.

Since data classes are regular classes, inheritance works as usual:

@dataclass(frozen=True)
class ImmutableWorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5


@dataclass(frozen=True)
class CardioWorkoutSession(ImmutableWorkoutSession):
   pass

But, since the last field in the parent class (ImmutableWorkoutSession) has a default value, all fields in child classes must have default values.

For example, this is not allowed:

@dataclass(frozen=True)
class ImmutableWorkoutSession:
   exercises: List[Exercise] = field(default_factory=create_warmup)
   duration_minutes: int = 5


@dataclass(frozen=True)
class CardioWorkoutSession(ImmutableWorkoutSession):
   intensity_level: str  # Not allowed, must have a default

TypeError: non-default argument 'intensity_level' follows default argument

Disadvantages of Data Classes and Further Resources

Data classes have been steadily improving since Python 3.7 (they were great to begin with) and cover many use cases where you might need to write classes. But they might be disadvantageous in the following scenarios:

  • Custom __init__ methods
  • Custom __new__ methods
  • Various inheritance patterns

And many more, as discussed in this great Reddit thread. If you want a more detailed rationale for why data classes were introduced and why they aren’t drop-in replacements to regular class definitions, read PEP 557.

If you are interested in object-oriented programming in general, here is a course to continue your journey:

Fundamentally, data classes are fancier structures to hold and retrieve data more efficiently. However, Python has many other data structures that perform this task more or less in a similar manner. For example, you can learn about counters, defaultdicts and namedtuples in the last chapter of the Data Types for Data Science course.


Author
Bex Tuychiev

I am a data science content creator with over 2 years of experience and one of the largest followings on Medium. I like to write detailed articles on AI and ML with a bit of a sarcastic style because you've got to do something to make them a bit less dull. I have produced over 130 articles and a DataCamp course to boot, with another one in the making. My content has been seen by over 5 million pairs of eyes, 20k of whom became followers on both Medium and LinkedIn.
