Project: A Solution Implementing Big Data Analytics
Introduction
The goal of this project was to implement a big data solution to analyse a dataset of academic publications. The tasks involve finding unique paper IDs, calculating the average number of authors per paper, identifying the number of distinct journals, ranking authors by publication count and impact factor, and analysing publication trends over the years.
Data Description
The analysis will be performed on two datasets:
- Academic publications: a JSON file containing information about publications, including the authors, the name of the journal the paper was published in, the number of citations, and other relevant information. This dataset is a subset of the S2ORC dataset released through the Semantic Scholar Public API under the ODC-By 1.0 licence.
- Journals: a CSV file containing information about journals, including the journal name and impact factor.
Environment Setup
The code in this notebook was written and executed using Python and Apache Spark 3.5.0 on the Databricks platform, a cloud-based platform for large-scale data processing and machine learning. The following libraries will be used:
- pyspark.sql for working with Spark DataFrames and SQL-like operations
- The functions module from pyspark.sql imported as F for data manipulation and transformation tasks
- All functions from pyspark.sql.functions imported for data manipulation
- pyspark.ml.feature for feature engineering
- Bucketizer class for discretizing or binning continuous features (a brief usage sketch follows the imports below)
- pandas library for data manipulation
- matplotlib library for data visualisation
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
from pyspark.sql import functions as F
from pyspark.sql.functions import *
from pyspark.ml.feature import Bucketizer
import pandas as pd
import matplotlib.pyplot as plt
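To illustrate the Bucketizer class mentioned above, the following minimal sketch bins a toy column of citation counts into ranges. The column name, split points, and data are hypothetical and are not taken from the project datasets.
# Hypothetical Bucketizer example: bin citation counts into three ranges.
# The column name "citationcount" and the split points are illustrative only.
example_df = spark.createDataFrame([(0.0,), (5.0,), (42.0,), (300.0,)], ["citationcount"])
bucketizer = Bucketizer(splits=[0.0, 10.0, 100.0, float("inf")], inputCol="citationcount", outputCol="citation_bucket")
bucketizer.transform(example_df).show()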
The datasets were loaded from the Databricks File System (DBFS) into Spark DataFrames using the following code:
filepath = '/FileStore/tables/large.json.gz'
publications = spark.read.json(filepath)
journal = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/tables/journal_information.csv")
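As a quick sanity check (not part of the original task list), the schemas and row counts of the loaded DataFrames can be inspected:
# Inspect the inferred schemas and the number of rows in each DataFrame
publications.printSchema()
journal.printSchema()
print(f"Publications: {publications.count()} rows, Journals: {journal.count()} rows")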
Task 1: Programmatically confirm that all papers have unique IDs and output the number of papers in the file.
def count_unique_papers_rdd(file_path):
    # Load the JSON file and extract the paper IDs as an RDD
    df = spark.read.json(file_path)
    rdd = df.select('corpusid').rdd.map(lambda row: row[0])
    general_count = rdd.count()
    unique_count = rdd.distinct().count()
    duplicate_count = general_count - unique_count
    print(f"The number of unique papers in this file is: {unique_count}")
    if general_count == unique_count:
        print("All papers in this file are unique")
    else:
        print(f"There are {duplicate_count} duplicated papers")
def count_unique_papers_df(file_path):
    # Load the JSON file and compare the total row count with the
    # number of distinct paper IDs
    df = spark.read.json(file_path)
    general_count = df.count()
    unique_count = df.select('corpusid').distinct().count()
    duplicate_count = general_count - unique_count
    print(f"The number of unique papers in this file is: {unique_count}")
    if general_count == unique_count:
        print("All papers in this file are unique")
    else:
        print(f"There are {duplicate_count} duplicated papers")
count_unique_papers_rdd(filepath)
count_unique_papers_df(filepath)
Answer: All paper IDs are unique; the file contains 150,000 unique IDs.
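Had the two counts disagreed, the duplicated IDs could be listed with a query along the following lines (a hypothetical follow-up, not part of the original solution):
# Hypothetical check: group by paper ID and keep any ID appearing more than once
duplicated_ids = publications.groupBy("corpusid").count().filter(F.col("count") > 1)
duplicated_ids.show()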
Task 2: Find out the average number of authors per paper
RDD Implementation: The calculate_average_authors function takes an RDD of rows (dictionary-like Row objects, one per paper) and the key of the authors column as input. It maps each row to the number of authors on that paper (the length of the author list), reduces the resulting RDD to obtain the total number of authors and the total number of papers, and computes the average by dividing the first by the second. The average is returned when the function is called.
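A minimal sketch of the approach described above is shown below; the exact original implementation may differ, and the "authors" field name is assumed from the dataset description.
# Sketch of the described map/reduce approach; "authors" is the assumed column name.
def calculate_average_authors(rdd, authors_key="authors"):
    # Map each row to a (number_of_authors, 1) pair, treating a missing
    # author list as zero authors.
    pairs = rdd.map(lambda row: (len(row[authors_key]) if row[authors_key] is not None else 0, 1))
    # Reduce to (total_authors, total_papers) in a single pass, then divide.
    total_authors, total_papers = pairs.reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    return total_authors / total_papers

average_authors = calculate_average_authors(publications.rdd, "authors")
print(f"Average number of authors per paper: {average_authors:.2f}")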