The GitHub History of the Scala Language
    1. Scala's real-world project repository data

    With almost 30k commits and a history spanning over ten years, Scala is a mature programming language. It is a general-purpose programming language that has recently become prominent in data science as well.

    Scala is also an open source project. Open source projects have the advantage that their entire development histories -- who made changes, what was changed, code reviews, etc. -- are publicly available.

    We're going to read in, clean up, and visualize the real-world project repository of Scala, combining data from a version control system (Git) and a project hosting site (GitHub). We will find out who has had the most influence on its development and who the experts are.

    The dataset we will use, which has been previously mined and extracted from GitHub, consists of three files:

    1. pulls_2011-2013.csv contains the basic information about the pull requests, and spans from the end of 2011 up to (but not including) 2014.
    2. pulls_2014-2018.csv contains identical information, and spans from 2014 up to 2018.
    3. pull_files.csv contains the files that were modified by each pull request.
    # Importing pandas
    import pandas as pd
    
    # Loading in the data
    pulls_one = pd.read_csv('datasets/pulls_2011-2013.csv')
    pulls_two = pd.read_csv('datasets/pulls_2014-2018.csv')
    pull_files = pd.read_csv('datasets/pull_files.csv') 
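
    As an aside, pandas can also parse the date column while reading the CSVs. The following is a minimal sketch, not part of the original workflow; depending on the pandas version and on whether all the timestamps carry the same offset, the explicit conversion to UTC in the next task may still be needed.

    # Optional alternative: let read_csv parse the 'date' column at load time
    pulls_one = pd.read_csv('datasets/pulls_2011-2013.csv', parse_dates=['date'])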

    2. Preparing and cleaning the data

    First, we will need to combine the data from the two separate pull DataFrames.

    Next, the raw data extracted from GitHub contains dates in the ISO 8601 format. However, pandas imports them as regular strings. To make our analysis easier, we need to convert the strings into datetime objects (pandas Timestamps). Datetime objects have the important property that they can be compared and sorted.

    The pull request times are all in UTC (also known as Coordinated Universal Time). The commit times, however, are in the local time of the author with time zone information (number of hours difference from UTC). To make comparisons easy, we should convert all times to UTC.
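
    To see why converting to UTC matters, here is a small sketch with made-up timestamps (not from the dataset): two strings with different local offsets normalize to the same UTC clock, so they compare correctly.

    # Both timestamps are hypothetical examples
    a = pd.to_datetime('2013-01-15T10:00:00+02:00', utc=True)  # 08:00 UTC
    b = pd.to_datetime('2013-01-15T09:30:00+01:00', utc=True)  # 08:30 UTC
    print(a < b)  # True: the comparison happens on the UTC clock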

    # Peek at the two pull request DataFrames before combining them
    pulls_one.head()
    pulls_two.head()

    # Append pulls_one to pulls_two
    pulls = pd.concat([pulls_one, pulls_two], ignore_index=True)

    # Convert the date column to timezone-aware datetimes in UTC
    pulls['date'] = pd.to_datetime(pulls['date'], utc=True)

    # Inspect the result: column types, sample rows, and the number of
    # distinct pull request ids in each DataFrame
    pull_files.info()
    pull_files.head()
    pull_files.pid.nunique()
    pulls.info()
    pulls.pid.nunique()
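
    With the conversion done, the pull requests can be ordered chronologically. A brief usage sketch:

    # Sorting is now chronological because 'date' is a true datetime
    # column rather than a string
    pulls.sort_values('date').head()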

    3. Merging the DataFrames

    The data we need now lives in two separate DataFrames: pulls and pull_files. Merging the two DataFrames will make it easier for us to analyze the data in future tasks.

    # Merge the pull request metadata with the per-file records on the
    # shared pull request id
    data = pd.merge(left=pulls, right=pull_files, on='pid')

    # Inspect the merged result
    data.head()
    data.shape
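
    Note that pd.merge performs an inner join by default, so only rows whose pid appears in both DataFrames are kept. A small sanity-check sketch (a verification added here, not an original step):

    # With an inner join, every pid in the merged data must exist in
    # both inputs
    assert set(data['pid']) <= set(pulls['pid'])
    assert set(data['pid']) <= set(pull_files['pid'])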