Tutorials
python
+1

Converting Strings to Dates as datetime Objects

Learn how to convert strings to datetime objects in Python and why doing so has become standard practice for working data scientists today.

Analyzing datasets with dates and times is often very cumbersome. Months of different lengths, different distributions of weekdays and weekends, leap years, and the dreaded timezones are just a few things you may have to consider depending on your context. For this reason, Python has a data type specifically designed for dates and times called datetime.

However, in many datasets, you'll find that dates are represented as strings. So, in this tutorial, you'll learn how to convert date strings to the datetime format and see how you can use its powerful set of tools to work effectively with complicated time series data.

Dealing with different representations of dates

The main challenge is often specifying how date strings are expressed. For example, 'Wednesday, June 6, 2018' can be represented as '6/6/18' and '06-06-2018' too. These all inform you of the same date, but you can probably imagine that the code to convert each of these is slightly different. Take a moment to examine the function calls below:

from datetime import datetime

# Define dates as strings
date_str1 = 'Wednesday, June 6, 2018'
date_str2 = '6/6/18'
date_str3 = '06-06-2018'

# Define dates as datetime objects
date_dt1 = datetime.strptime(date_str1, '%A, %B %d, %Y')
date_dt2 = datetime.strptime(date_str2, '%m/%d/%y')
date_dt3 = datetime.strptime(date_str3, '%m-%d-%Y')

# Print converted dates
print(date_dt1)
print(date_dt2)
print(date_dt3)
2018-06-06 00:00:00
2018-06-06 00:00:00
2018-06-06 00:00:00

First, the datetime type is imported from the datetime module. Then, the date string is passed to the .strptime() method, followed by what's called Python's strptime directives. You can combine directives, special characters (e.g. ,, /, or - as in the cases above), and spaces to match the date string you are trying to parse. As you can see, the resulting datetime objects are identical because all three date strings represent the same date.

You can find the full list of directives in the Python Documentation, but below is a table most relevant to what you saw above:

Code Meaning Example
%A Weekday as locale’s full name. Wednesday
%a Weekday as locale’s abbreviated name. Wed
%B Month as locale’s full name. June
%d Day of the month. 06
%m Month as a number. 6
%Y Four-digit year. 2018
%y Two-digit year. 18

Converting the date string column

Now that you're familiar with Python's strptime directives, let's learn how to convert a whole column of date strings in a dataset to the datetime format.

From now on, you'll be working with a DataFrame called eth that contains some historical data on ether, a cryptocurrency whose blockchain is generated by the Ethereum platform. Your dataset has the following columns:

  • date: Date, daily at 00:00 UTC.
  • txVolume: Unadjusted measure of the total value, in US dollars, of outputs on the blockchain.
  • txCount: Number of transactions happening on the public blockchain.
  • marketCap: Unit price in US dollars multiplied by the number of units in circulation.
  • price: Opening price in US dollars at 00:00 UTC.
  • generatedCoins: Number of new coins that have been mined into existence.
  • exchangeVolume: The volume, measured by US dollars, at exchanges like GDAX and Bitfinex.

Here are the first few rows of your dataset. Note how the dates are represented so you can use the right directives later:

print(eth.head())
         date  txVolume(USD)  txCount  marketcap(USD)  price(USD)  \
0  2015-08-10   1.193012e+06     2037      43130000.0    0.713989   
1  2015-08-11   1.052027e+06     4963      42796500.0    0.708087   
2  2015-08-12   7.923370e+05     2036      64018400.0    1.060000   
3  2015-08-13   2.181182e+06     2842      73935400.0    1.220000   
4  2015-08-14   4.154763e+06     3174     109594000.0    1.810000   

   generatedCoins  exchangeVolume(USD)  
0     27817.34375             405283.0  
1     28027.81250            1463100.0  
2     27370.93750            2150620.0  
3     28268.12500            4068680.0  
4     31106.71875            4637030.0

And let's confirm that the date column needs to be converted:

# Confirm the date column is a string
print(eth.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1014 entries, 0 to 1013
Data columns (total 7 columns):
date                   1014 non-null object
txVolume(USD)          1014 non-null float64
txCount                1014 non-null int64
marketcap(USD)         1014 non-null float64
price(USD)             1014 non-null float64
generatedCoins         1014 non-null float64
exchangeVolume(USD)    1014 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 55.5+ KB
None

The date column is indeed a string, which—remember—is denoted as an object type in Python. You can convert it to the datetime type with the .to_datetime() method in pandas. The console below contains the call to convert the column. Can you complete it by specifying the directives according to how dates are expressed in your dataset?

eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiaW1wb3J0IHBhbmRhcyBhcyBwZFxuZXRoID0gcGQucmVhZF9jc3YoXCJodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vdG9tbXlqZWUvYXJ0aWNsZXMvbWFzdGVyL3BhbmRhcy9kYXRldGltZS9kYXRhL2V0aF9maW4uY3N2XCIpIiwic2FtcGxlIjoiIyBDb21wbGV0ZSB0aGUgY2FsbCB0byBjb252ZXJ0IHRoZSBkYXRlIGNvbHVtblxuZXRoWydkYXRlJ10gPSAgcGQudG9fZGF0ZXRpbWUoZXRoWydkYXRlJ10sXG4gICAgICAgICAgICAgICAgICAgICAgICAgICAgICBmb3JtYXQ9J19fXycpXG5cbiMgQ29uZmlybSB0aGUgZGF0ZSBjb2x1bW4gaXMgaW4gZGF0ZXRpbWUgZm9ybWF0XG5wcmludChldGguaW5mbygpKSJ9

Did you complete it using '%Y-%m-%d'? Great!

Components of datetime objects

Now that you have datetime objects as your date column, you can extract specific components of the date such as the month, day, or year, all of which are available as the object's attributes:

# Print datetime attributes
print(eth['date'][0].month)
print(eth['date'][0].day)
print(eth['date'][0].year)
8
10
2015

Date attributes are frequently used to group data by a particular time frame. For example, you can see how many ethers were generated on a yearly basis:

from collections import defaultdict
import matplotlib.pyplot as plt

# Initialize defaultdict of type float
yearly_total_coins = defaultdict(float)

# Loop over the rows of eth
for day in eth.iterrows():
    # Get the date
    dates = day[1][0]

    # Get the number of coins generated
    num_coins = day[1][5]

    # Add the total number of coins to the current value for the year
    yearly_total_coins[dates.year] += num_coins

# Print yearly_total_coins
print(yearly_total_coins)

# Visualize aggregated data
plt.bar(range(len(yearly_total_coins)), list(yearly_total_coins.values()), align='center')
plt.xticks(range(len(yearly_total_coins)), list(yearly_total_coins.keys()))
plt.title('# of ethers generated by year')
plt.show()
defaultdict(<class 'float'>, {2015: 3805167.8125, 2016: 11321892.96875, 2017: 9230132.65625, 2018: 2849975.625})

Adding and subtracting time

Another common case when working with dates is to get a date 30, 60, or 90 days in the past from some date. In Python, the timedelta object from the datetime module is used to represent differences in datetime objects. You can create a timedelta by passing any number of keyword arguments such as days, seconds, microseconds, milliseconds, minutes, hours, and weeks to it.

Once you have a timedelta object, you can add or subtract it from a datetime object to get another datetime object. Try it in the console below:

eyJsYW5ndWFnZSI6InB5dGhvbiIsInByZV9leGVyY2lzZV9jb2RlIjoiZnJvbSBkYXRldGltZSBpbXBvcnQgZGF0ZXRpbWUiLCJzYW1wbGUiOiIjIEltcG9ydCB0aW1lZGVsdGEgZnJvbSBkYXRldGltZSBtb2R1bGVcbmZyb20gZGF0ZXRpbWUgaW1wb3J0IHRpbWVkZWx0YVxuXG4jIERlZmluZSB0aW1lZGVsdGEgb2YgNTkgZGF5cywgNCBob3VycywgNDIgbWludXRlc1xuZ2xhbmNlYmFjayA9IHRpbWVkZWx0YShkYXlzPTU5LFxuICAgICAgICAgICAgICAgICAgICAgICBob3Vycz00LFxuICAgICAgICAgICAgICAgICAgICAgICBtaW51dGVzPTQyKVxuXG4jIEdldCBjdXJyZW50IGRhdGUgYW5kIHRpbWVcbnJpZ2h0X25vdyA9IGRhdGV0aW1lLm5vdygpXG5wcmludChyaWdodF9ub3cpXG5cbiMgV2hlbiB3YXMgNTkgZGF5cywgNCBob3VycywgNDIgbWludXRlcyBhZ28/XG5wcmludChyaWdodF9ub3cgLSBnbGFuY2ViYWNrKSJ9

Indexing pandas DataFrames with DatetimeIndex

After you've converted the date column to the datetime format, it is usually a good idea to index your DataFrame by the date, creating the DatetimeIndex. Like the datetime type, the DatetimeIndex is a special index type designed to work with dates and times. By using the .set_index() method with the inplace argument set equal to True, you can remove the date column from your dataset and append it as the index of your DataFrame:

# Set index
eth.set_index('date', inplace=True)
print(eth.head(5))
                txVolume  txCount    marketCap     price  generatedCoins  \
date                                                                       
2015-08-10  1.193012e+06     2037   43130000.0  0.713989     27817.34375   
2015-08-11  1.052027e+06     4963   42796500.0  0.708087     28027.81250   
2015-08-12  7.923370e+05     2036   64018400.0  1.060000     27370.93750   
2015-08-13  2.181182e+06     2842   73935400.0  1.220000     28268.12500   
2015-08-14  4.154763e+06     3174  109594000.0  1.810000     31106.71875   

            exchangeVolume  
date                        
2015-08-10        405283.0  
2015-08-11       1463100.0  
2015-08-12       2150620.0  
2015-08-13       4068680.0  
2015-08-14       4637030.0

Setting the DatetimeIndex in your DataFrame unlocks a whole set of useful functionalities. For example, when visualizing your time series data, pandas automatically creates reasonably spaced date labels for the x-axis:

eth.txCount.plot(title='# of transactions')
plt.show()

Pretty neat, right?

Partial string indexing and slicing

Perhaps the most useful functionalities are partial string indexing and slicing, which allow you to easily subset your data. Let's say you wanted to take a closer look at the peak around January 2018 as seen in the graph above. The following code uses partial string indexing and slicing to observe the number of transactions between December 2017 and February 2018.

# Subset data around peak
peak_eth = eth['2017-12': '2018-2']
peak_eth.txCount.plot(title='# of transactions around peak')

Well, that was easy! Did you notice that the boundaries are inclusive for partial string indexing unlike for typical Python indexing?

Recap

From stock prices to flight timings, data containing dates and times are abudant in a wide variety of domains. Being able to effectively work with such data is essential for answering questions that start with when, how long, or how often.

In this tutorial, you:

  • Learned about Python's directives to deal with the various ways dates can be represented.
  • Learned how to convert strings to dates with the .strptime() method in the datetime module.
  • Learned how to convert date string columns in DataFrames with the .to_datetime() method in pandas.
  • Learned about the components of the datetime object and how to access them as the object's attributes.
  • Learned how to do date arithmetic with the timedelta object.
  • Learned to index DataFrames with DatetimeIndex and explored some examples of its usefulness.

The datetime and DatetimeIndex objects in Python offer a much more manageable and intuitive way to deal with dates and times. If you're curious about learning more advanced techniques and eager to get some hands-on practice manipulating, analyzing, and visualizing time series data, check out the Manipulating Time Series Data in Python, Introduction to Time Series Analysis in Python, and Visualizing Time Series Data in Python courses on DataCamp.

Happy learning!

Want to leave a comment?