Course
How to Save a Pandas DataFrame to CSV
If you create a DataFrame or modify an existing one, you may want to export it so that other people can work with it, or you can save it for your future work. For example, you may have to submit a prediction from a machine learning project to your examiner or send a modified dataset to a coworker.
In Python, you can export a DataFrame as a CSV file using Pandas’ .to_csv()
method.
In this article, I’ll walk you through the main steps of the process and explain the method's parameters.
If you want to learn more about Pandas, check out this course on Data Manipulation with Pandas.
I’ll start with a quick answer for those of you in a hurry, and then I’ll get into the details.
Short Answer: How to Save Pandas DataFrame to CSV
To save a Pandas DataFrame as a CSV, use the DataFrame.to_csv()
method:
df.to_csv('your_file_name.csv', index=False)
Replace 'your_file_name.csv'
with the desired name and path for your file. The index=False
argument prevents Pandas from adding an index column to your CSV.
Creating a Pandas DataFrame
Now, let’s look at the longer answer. Instead of creating a DataFrame from scratch, I’ll import a dataset, modify it, and save the resulting DataFrame in CSV format. The first step is to import the necessary libraries:
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
Then, we can import a seaborn
dataset. The imported dataset is already a Pandas Dataframe, so we don’t need to transform it.
# We’ll import the 'titanic' dataset from seaborn
df = sns.load_dataset('titanic')
print(df.head())
survived pclass sex age sibsp parch fare embarked class
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
Modifying a Pandas DataFrame
Now, let’s do some processing to get a new DataFrame. Since my purpose is to explore the .to_csv()
method from Pandas, I’ll only do a min-max normalization on numerical variables.
scaler = MinMaxScaler()
# Choose the columns that have integer or float data-types
numerical columns - df.select_dtypes(include=['float64','int64']).columns
new_df = df.copy()
# The min-max scaler will represent all numbers on a 0-to-1 scale
new_df[numerical_columns] = scaler.fit_transform(df[numerical_columns])
print(new_df.head())
survived pclass sex age sibsp parch fare embarked
0 0.0 1.0 male 0.271174 0.125 0.0 0.014151 S
1 1.0 0.0 female 0.472229 0.125 0.0 0.139136 C
2 1.0 1.0 female 0.321438 0.000 0.0 0.015469 S
3 1.0 0.0 female 0.434531 0.125 0.0 0.103644 S
4 0.0 1.0 male 0.434531 0.000 0.0 0.015713 S
class who adult_male deck embark_town alive alone
Third man True NaN Southampton no False
First woman False C Cherbourg yes False
Third woman False NaN Southampton yes True
First woman False C Southampton yes False
Third man True NaN Southampton no True
Saving a Pandas DataFrame as CSV
We’ll use the DataFrame.to_csv()
method to save the modified DataFrame as a CSV file.
new_df.to_csv('new_titanic.csv')
This will save the CSV file in the current directory. If you need a refresher, read this tutorial on how to get the current directory in Python.
The way I called this method ensured all parameters were set to their default values. Now, I’ll present the parameters and how you can change their values according to your needs.
Parameters of DataFrame.to_csv()
Explained
Here’s the same .to_csv()
method listed with all the parameters:
new_df.to_csv(
path_or_buf='new_titanic.csv',
sep=',',
na_rep='',
float_format=None,
columns=[],
header=True,
index=True,
index_label=None,
mode='w',
encoding='utf-8',
quoting=csv.QUOTE_MINIMAL,
lineterminator=os.linesep,
storage_options={},
compression='infer',
quotechar='',
chunksize=None,
date_format=None,
doublequote=True,
escapechar=None,
decimal='.',
errors='strict'
)
Some of these parameters are optional, such as columns
, encoding
, quoting
, lineterminator
and storage_options
, so I just set values for them, not necessarily default values.
The first parameter can be either a string, which will be the name of the file, or a file object. Look at this example:
with open('new_titanic.csv', 'w', encoding = 'utf=8') as file:
new_df.to_csv(file)
Separators and missing values
The next parameter is the separator, which refers to how the data are separated. The default separator is the comma, but we can change it to a tab or semicolon. A tab or semicolon separator can be useful when your exported data needs to be in a certain format. Also, for data that contain commas, like texts, using the comma as a separator can bring unwanted changes.
For example, if you have ‘Hello, world’ as a datapoint, then if you want to use the comma separator, you also need quotations to keep it as a single datapoint. Using a tab or semicolon separator instead can be the better option since it may take less memory than using commas and quotation marks.
new_df.to_csv('new_titanic.csv', sep = '\t')
new_df_tab = pd.read_csv('new_titanic.csv')
0\t0\t3\tmale\t22.0\t1\t0\t7.25\tS\tThird\tman...
1\t1\t1\tfemale\t38.0\t1\t0\t71.2833\tC\tFirst...
2\t1\t3\tfemale\t26.0\t0\t0\t7.925\tS\tThird\t...
3\t1\t1\tfemale\t35.0\t1\t0\t53.1\tS\tFirst\tw...
4\t0\t3\tmale\t35.0\t0\t0\t8.05\tS\tThird\tman...
The third parameter, na_rep
, refers to the representation of the missing values. Common choices would be to leave the cell empty to insert ‘NaN’
or ‘Na’
, but you can also insert your custom string. I will insert ‘Datacamp’
in the following example for every missing value.
new_df.to_csv('new_titanic.csv', na_rep='Datacamp')
new_df_datacamp = pd.read_csv('new_titanic.csv')
new_df_datacamp = pd.DataFrame(new_df_datacamp)
#find the missing values
missing_values = new_df_datacamp.isnull().sum()
print(missing_values)
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
Handling columns and indexes
Let’s see how the missing values are represented for the age
column:
print(new_df_datacamp['age'][:10])
0.2711736617240512
0.4722292033174164
0.32143754712239253
0.4345312892686604
0.4345312892686604
Datacamp
0.6732847449107816
0.01985423473234481
0.33400351847197784
0.17064589092736868
If you want fewer decimals, use the parameter float_format
to specify the number of decimals. Here, we’ll use only three.
new_df.to_csv('new_titanic.csv', na_rep='Datacamp', float_format='%.3f')
new_df_datacamp = pd.read_csv('new_titanic.csv')
new_df_datacamp = pd.DataFrame(new_df_datacamp)
print(new_df_datacamp['age'][:10])
0.271
0.472
0.321
0.435
0.435
Datacamp
0.673
0.020
0.334
0.171
If you don’t want to see the index, set the index
parameter to False
.
You can also decide whether to keep the column names, using the header
parameter and setting it to True
or False
, which is the way to go (or you can set the names for the columns).
The columns
parameter allows you to select which columns to keep. The index_label
parameter allows you to set a name for the column index or a list of names if you have multiple indexes.
new_df.to_csv(
'new_titanic.csv',
index=True,
header=True,
columns=['age', 'alive'],
index_label='Index'
)
new_df = pd.read_csv('new_titanic.csv')
new_df = pd.DataFrame(new_df)
print(new_df.head())
Index age alive
0 0.271174 no
1 0.472229 yes
2 0.321438 yes
3 0.434531 yes
4 0.434531 no
File opening, encoding, and compression
The mode
parameter allows you to decide what to do when you already have a CSV file with the same name as the one you have built. If you set the parameter to ‘w’
and already have a new_titanic.csv
file, the one you create will replace the old one. If you set it to ‘x’
, the new file is created only if there is no existing file with the same. Finally, if you set the parameter to ‘a’
, it’ll append the new file to the older one with the same name (maybe you just want to add some rows).
The encoding parameter sets the encoding of the new file. If your data contains special characters, such as letters of other alphabets, you can set it to ‘utf-8’
or another encoding.
The compression
parameter sets the way the new file is compressed. By default, it’s set to ‘infer’
, which means that it compresses the file according to the specified extension, but you can also specify your preferred extension.
# Infer the compression
new_df.to_csv('new_titanic.csv.gz')
# Specify the compression explicitly
new_df.to_csv('new_data.csv.gz', compression='gzip')
# Specify the compression via a dictionary
new_df.to_csv(
'new_data.zip',
compression={
'method': 'zip',
'archive_name': 'data.csv'
}
)
How to use quoting
The quoting
parameter sets the kind of data to be put between quotes. As I said earlier, when you have a comma separator and also cells containing commas (such as texts), you have to decide what to put between quotes.
# Default value, quotes only fields with special characters or separators like tabs and commas
new_df.to_csv('new_titanic.csv', quoting=csv.QUOTE_MINIMAL)
# Quotes all fields
new_df.to_csv('new_titanic.csv', quoting=csv.QUOTE_ALL)
# Quotes only fields with non-numeric values
new_df.to_csv('new_titanic.csv', quoting=csv.QUOTE_NONNUMERIC)
# No field is quoted
new_df.to_csv('new_titanic.csv', quoting=csv.QUOTE_NONE)
Chunksize and date-time data
The chunksize
parameter specifies how much of your file will be written at once. By letting the default value, all the rows are written at once, but you can specify a number, which can be useful when the file is very large.
The date_format
parameter sets the representation of datetime objects. You can specify it using the strftime format.
# Write DataFrame to CSV with dates formatted as 'MM/DD/YYYY'
new_df.to_csv('new_titanic.csv', date_format='%m/%d/%Y')
# You can set datetime objects up to seconds
new_df.to_csv('new_titanic.csv', date_format='%Y-%m-%d %H:%M:%S')
Conclusion
Hopefully, you will now be ready to export your DataFrame in the format that suits your needs!
I encourage you to keep learning—here are some resources you can try:
I am finishing my master's degree in data science at the University of Bucharest, Romania, with a thesis about traffic flow prediction using various models based on Graph Neural Networks. The main skills I learned were data cleaning and visualization in Python and Power BI, preprocessing for various machine learning tasks, and building models with PyTorch (or TensorFlow). At the same time, as a PhD student in philosophy, I am teaching seminars on first-order logic while working on my thesis on Bayesian confirmation theory. I'm eager to learn more about data and building models and to teach what I learn.
Learn Pandas with these courses!
Course
Reshaping Data with pandas
Course
Writing Efficient Code with pandas
cheat-sheet
Reshaping Data with pandas in Python
tutorial
Pandas Tutorial: DataFrames in Python
tutorial
pandas read csv() Tutorial: Importing Data
tutorial
How to Drop Columns in Pandas Tutorial
DataCamp Team
3 min
tutorial
Pandas Add Column Tutorial
DataCamp Team
4 min
tutorial
Pandas Reset Index Tutorial
Satyam Tripathi
8 min