1. Scala's real-world project repository data
With almost 30,000 commits and a history spanning over ten years, Scala is a mature programming language. It is a general-purpose language that has recently become another prominent choice for data scientists.
Scala is also an open source project. Open source projects have the advantage that their entire development histories -- who made changes, what was changed, code reviews, etc. -- are publicly available.
We're going to read in, clean up, and visualize the real-world project repository of Scala, which spans data from a version control system (Git) as well as a project hosting site (GitHub). We will find out who has had the most influence on its development and who the experts are.
The dataset we will use, which has been previously mined and extracted from GitHub, comprises three files:

pulls_2011-2013.csv contains the basic information about the pull requests and spans from the end of 2011 up to (but not including) 2014.

pulls_2014-2018.csv contains identical information and spans from 2014 up to 2018.

pull_files.csv contains the files that were modified by each pull request.
# Importing pandas
import pandas as pd
# Loading in the data
pulls_one = pd.read_csv('datasets/pulls_2011-2013.csv')
pulls_two = pd.read_csv('datasets/pulls_2014-2018.csv')
pull_files = pd.read_csv('datasets/pull_files.csv')
2. Preparing and cleaning the data
First, we will need to combine the data from the two separate pull DataFrames.
Next, the raw data extracted from GitHub contains dates in ISO 8601 format. However, pandas imports them as regular strings. To make our analysis easier, we need to convert the strings into Python's DateTime objects. DateTime objects have the important property that they can be compared and sorted.
The pull request times are all in UTC (also known as Coordinated Universal Time). The commit times, however, are in the local time of the author with time zone information (number of hours difference from UTC). To make comparisons easy, we should convert all times to UTC.
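As a quick illustration of this normalization (using a made-up timestamp with a local offset, not a value from the dataset), passing utc=True to pd.to_datetime parses an ISO 8601 string and converts it to Coordinated Universal Time:

```python
import pandas as pd

# A hypothetical ISO 8601 timestamp with a +02:00 local offset (not from the dataset)
local = pd.Series(['2016-05-01 14:30:00+02:00'])

# utc=True parses the string and converts it to UTC
as_utc = pd.to_datetime(local, utc=True)

print(as_utc.iloc[0])  # 2016-05-01 12:30:00+00:00
```

The resulting values are timezone-aware, so timestamps that originally carried different local offsets become directly comparable.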
pulls_one.head()
pulls_two.head()
# Concatenate the two pull request DataFrames
pulls = pd.concat([pulls_one, pulls_two], ignore_index=True)
# Convert the date column to timezone-aware datetimes in UTC
pulls['date'] = pd.to_datetime(pulls['date'], utc=True)
pull_files.info()
pull_files.head()
pull_files.pid.nunique()
pulls.info()
pulls.pid.nunique()
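Comparing nunique to the total row count is a quick sanity check for duplicate pull request ids. A tiny sketch with made-up ids (not values from the dataset) shows how a duplicate would surface:

```python
import pandas as pd

# Hypothetical pull request ids, with 102 appearing twice
pids = pd.Series([101, 102, 102, 103])

# nunique counts distinct values; a gap versus len signals duplicates
print(pids.nunique(), len(pids))  # 3 4
```

If nunique equals the row count, each row describes a distinct pull request.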
3. Merging the DataFrames
The extracted data comes in two separate files. Merging the two DataFrames will make it easier for us to analyze the data in future tasks.
# Merge the two DataFrames
data = pd.merge(left=pulls, right=pull_files, on='pid')
data.head()
data.shape
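Because each pull request can touch several files, this is a one-to-many merge: every row of pulls is repeated once per matching file. A miniature sketch with hypothetical rows (not from the real dataset) makes the row multiplication visible:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
toy_pulls = pd.DataFrame({'pid': [1, 2], 'user': ['alice', 'bob']})
toy_files = pd.DataFrame({'pid': [1, 1, 2],
                          'file': ['src/A.scala', 'src/B.scala', 'src/C.scala']})

# An inner merge on 'pid' repeats each pull request row once per modified file
toy_data = pd.merge(left=toy_pulls, right=toy_files, on='pid')

print(toy_data.shape)  # (3, 3)
```

This is why data has (roughly) as many rows as pull_files rather than as pulls: the row count is driven by the file side of the relationship.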