1. Scala's real-world project repository data
With almost 30k commits and a history spanning over ten years, Scala is a mature, general-purpose programming language that has recently become another prominent language for data scientists.
Scala is also an open source project. Open source projects have the advantage that their entire development histories -- who made changes, what was changed, code reviews, etc. -- are publicly available.
We're going to read in, clean up, and visualize the real-world project repository of Scala that spans data from a version control system (Git) as well as a project hosting site (GitHub). We will find out who has had the most influence on its development and who the experts are.
The dataset we will use, which has been previously mined and extracted from GitHub, comprises three files:

pulls_2011-2013.csv contains the basic information about the pull requests, and spans from the end of 2011 up to (but not including) 2014.

pulls_2014-2018.csv contains identical information, and spans from 2014 up to 2018.

pull_files.csv contains the files that were modified by each pull request.
Task 1: Instructions
Import the dataset into the notebook. All the relevant files can be found in the datasets subfolder.

Import the pandas module. Load in 'datasets/pulls_2011-2013.csv' and 'datasets/pulls_2014-2018.csv' as pandas DataFrames and assign them to pulls_one and pulls_two respectively. Similarly, load in 'datasets/pull_files.csv' and assign it to pull_files.

Good to know: for this project, you need to be comfortable with pandas. The skills required to complete this project are covered in Data Manipulation with pandas and Joining Data with pandas.
HINT If you have imported pandas like this:
import pandas as pd

You can load in a CSV file like this:
my_data = pd.read_csv('my_data.csv')
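As a minimal runnable sketch of what read_csv does, you can parse an in-memory CSV string (the rows and column names here are hypothetical; the real files live in the datasets/ folder):

```python
import pandas as pd
from io import StringIO

# Hypothetical CSV content standing in for one of the dataset files
csv_text = "pid,user,date\n1,alice,2011-12-01T10:00:00Z\n2,bob,2012-01-03T08:30:00Z\n"

# read_csv accepts any file-like object, so StringIO works like a file path
my_data = pd.read_csv(StringIO(csv_text))
print(my_data.shape)  # (2, 3)
```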
# Importing pandas
import pandas as pd
# Loading in the data
pulls_one = pd.read_csv('datasets/pulls_2011-2013.csv')
pulls_two = pd.read_csv('datasets/pulls_2014-2018.csv')
pull_files = pd.read_csv('datasets/pull_files.csv')
2. Preparing and cleaning the data
First, we will need to combine the data from the two separate pull DataFrames.
Next, the raw data extracted from GitHub contains dates in the ISO 8601 format. However, pandas imports them as regular strings. To make our analysis easier, we need to convert the strings into Python's DateTime objects. DateTime objects have the important property that they can be compared and sorted.
The pull request times are all in UTC (also known as Coordinated Universal Time). The commit times, however, are in the local time of the author with time zone information (number of hours difference from UTC). To make comparisons easy, we should convert all times to UTC.
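This behavior can be seen in a small sketch (the timestamps are illustrative, not from the dataset): parsing ISO 8601 strings with utc=True yields timezone-aware values that can be compared and sorted directly.

```python
import pandas as pd

# Two illustrative timestamps, one in UTC and one with a +02:00 offset;
# utc=True normalizes both to UTC
dates = pd.to_datetime(
    ["2013-05-17T10:00:00Z", "2012-01-03T08:30:00+02:00"], utc=True
)
print(dates.min())  # the 2012 timestamp, now comparable to the other
print(dates[1])     # 2012-01-03 06:30:00+00:00 (local offset removed)
```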
Task 2: Instructions Combine the two pulls DataFrames and then convert date to a DateTime object.
Append pulls_one to pulls_two and assign the result to pulls. Convert the date column for the pulls object from a string into a DateTime object. For the conversion, we recommend using pandas' to_datetime() function. Set the utc parameter to True, as this will simplify future operations.
Coordinated Universal Time (UTC) is the basis for civil time today. This 24-hour time standard is kept using highly precise atomic clocks combined with the Earth's rotation.
HINT You can append one DataFrame to another using df1.append(df2, ignore_index=True). ignore_index ensures that the existing index labels are not used, and the resulting axis will be labeled 0, 1, ..., n-1. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat([df1, df2], ignore_index=True) is the modern equivalent.
If you have a DataFrame named df, you can convert a column named column to a DateTime object in UTC like so:

df['column'] = pd.to_datetime(df['column'], utc=True)
# Append pulls_one to pulls_two
pulls = pulls_two.append(pulls_one, ignore_index=True)
# Convert the date for the pulls object
pulls['date'] = pd.to_datetime(pulls['date'], utc=True)
3. Merging the DataFrames
The data extracted comes in two separate files. Merging the two DataFrames will make it easier for us to analyze the data in the future tasks.
Task 3: Instructions Merge the two DataFrames.
Merge pulls and pull_files on the pid column. Assign the result to the data variable. The pandas DataFrame has a merge method that will perform the joining of two DataFrames on a common field.
HINT If df_1 and df_2 are DataFrames that you would like to merge on a column named column, the code to do so would look like:
new_df = df_1.merge(df_2, on='column')
# Merge the two DataFrames
data = pulls.merge(pull_files, on='pid')
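The effect of this merge can be sketched with hypothetical miniature stand-ins for pulls and pull_files: each pull request (pid) can touch several files, so the merged frame has one row per (pull request, file) pair.

```python
import pandas as pd

# Tiny illustrative stand-ins (not the real dataset)
pulls = pd.DataFrame({"pid": [1, 2], "user": ["alice", "bob"]})
pull_files = pd.DataFrame(
    {"pid": [1, 1, 2], "file": ["a.scala", "b.scala", "a.scala"]}
)

# Inner join on the shared pid column
data = pulls.merge(pull_files, on="pid")
print(len(data))  # 3 rows: pull 1 appears twice, once per file
```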
4. Is the project still actively maintained?
The activity in an open source project is not very consistent. Some projects might be active for many years after the initial release, while others can slowly taper out into oblivion. Before committing to contributing to a project, it is important to understand the state of the project. Is development going steadily, or is there a drop? Has the project been abandoned altogether?
The data used in this project was collected in January of 2018. We are interested in the evolution of the number of contributions up to that date.
For Scala, we will do this by plotting a chart of the project's activity. We will calculate the number of pull requests submitted each (calendar) month during the project's lifetime. We will then plot these numbers to see the trend of contributions.
A helpful reminder of how to access various components of a date can be found in this exercise of Data Manipulation with pandas.

Additionally, recall that you can group by multiple variables by passing a list to .groupby(). This video from Data Manipulation with pandas should help!
Task 4: Instructions
Calculate and plot project activity in terms of pull requests.

Group data by month and year (i.e. '2011-01', '2011-02', etc.), and count the number of pull requests (pid). Store the counts in a variable called counts. There are a number of ways to accomplish this: one way would be to create two new columns containing the year and month attributes of the date column, and then group by these two variables. Plot counts using a bar chart (this has been done for you).

Note: the scaffolding exists to help you create the two columns as suggested above. However, this exercise will only check whether you create counts correctly, so alternate solutions are more than welcome!
HINT If you wanted to extract the month and day from a date column of x, one solution would be to create two new columns using the following:
x['month'] = x['date'].dt.month
x['day'] = x['date'].dt.day
Don't forget that you can group by multiple variables by passing a list to groupby(). Using the example above, you could group by month and day using x.groupby(['month', 'day'])['pid'].count()
%matplotlib inline
# Create a column that will store the month
data['month'] = data['date'].dt.month
# Create a column that will store the year
data['year'] = data['date'].dt.year
# Group by the month and year and count the pull requests
counts = data.groupby(['year','month'])['pid'].count()
# Plot the results
counts.plot(kind='bar', figsize=(12, 4))
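An equivalent sketch uses a single monthly period column instead of separate year and month columns ('sample' here is a tiny illustrative stand-in for data):

```python
import pandas as pd

# Illustrative stand-in for the merged DataFrame
sample = pd.DataFrame(
    {"pid": [1, 2, 3],
     "date": pd.to_datetime(
         ["2011-12-01", "2011-12-15", "2012-01-02"], utc=True)}
)

# Drop the timezone before converting to monthly periods, which avoids
# a pandas warning about discarding timezone information
period = sample["date"].dt.tz_localize(None).dt.to_period("M")
monthly = sample.groupby(period)["pid"].count()
print(monthly)  # 2011-12 -> 2, 2012-01 -> 1
```

The exercise checker may expect the year/month-column approach, so treat this as an alternative rather than the canonical solution.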
5. Is there camaraderie in the project?
The organizational structure varies from one project to another, and it can influence your success as a contributor. A project with a very small community might not be the best one to start working on, as a small community might indicate a high barrier to entry. This can be caused by several factors: a community reluctant to accept pull requests from "outsiders," a code base that is hard to work with, and so on. However, a large community can serve as an indicator that the project regularly accepts pull requests from new contributors. Such a project would be a good place to start.
In order to evaluate the dynamics of the community, we will plot a histogram of the number of pull requests submitted by each user. A distribution showing that there are only a few people who contribute a small number of pull requests can be used as an indicator that the project is not welcoming of new contributors.
Task 5: Instructions Plot pull requests by user.
Group the pull requests by each user and count the number of pull requests they submitted. Store the counts in a variable called by_user. Plot the histogram for by_user.
HINT The agg function allows you to specify an aggregating function after you perform a groupby. For example, to select the last date after grouping by file, you can run the following code (where max is the aggregating function):

commits.groupby('file').agg({'date': 'max'})

Pandas' DataFrame object has a hist() function that will draw a histogram.
data.head()
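The answer cell for Task 5 is not shown above. A sketch of one possible solution, assuming the merged DataFrame has a 'user' column for the pull request author (the miniature 'data' below is an illustrative stand-in for the real merged frame from Task 3):

```python
import pandas as pd

# Tiny illustrative stand-in for the merged DataFrame
data = pd.DataFrame({"user": ["alice", "alice", "bob"], "pid": [1, 1, 2]})

# Group by the submitter and count their pull requests
by_user = data.groupby("user")["pid"].count()
print(by_user["alice"])  # 2

# In the notebook, plot the distribution:
# by_user.hist()
```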