ℹ️ Introduction to Data Science Notebooks
You can skip this section if you are already familiar with data science notebooks.
Data science notebooks
A data science notebook is a document that contains text cells (what you're reading right now) and code cells. What makes a notebook unique is that it's interactive: you can change or add code cells, and then run a cell by selecting it and clicking the Run button above (▶, or Run All) or pressing control + enter.
The result will be displayed directly in the notebook.
Try running the cell below:
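# Run this cell to see the result
# Note: df must already be loaded (see the read_csv() cell further down),
# otherwise this cell raises a NameError
lab = df.manufacturer.value_counts()
print(lab)
Modify the code (for example, count the values of a different column) and rerun the cell.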
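Because the cell above depends on a DataFrame that is only loaded later in the notebook, here is a minimal standalone sketch of what value_counts() does. The toy DataFrame and its manufacturer names are made up for illustration and are not taken from the chocolate dataset.

```python
import pandas as pd

# Hypothetical toy data; these manufacturer names are invented for illustration
toy = pd.DataFrame({"manufacturer": ["A", "B", "A", "C", "A", "B"]})

# value_counts() tallies each distinct value, most frequent first
counts = toy["manufacturer"].value_counts()
print(counts)
```

Try adding another "C" to the list and rerunning the cell to see the tally change.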
Data science notebooks & data analysis
Notebooks are great for interactive data analysis. Let's create a pandas DataFrame using the read_csv() function.
We will load the dataset "chocolate_bars.csv" containing different characteristics of over 2,500 chocolate bars and their reviews.
By using the .head() method, we display the first five rows of data:
# Importing pandas, NumPy, and the plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',20)
# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')
# Define the background style of the charts
plt.style.use('ggplot')
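# Note: with inplace=False this returns a filtered copy and leaves df unchanged;
# the rows are actually removed further down with inplace=True
df.dropna(subset=["num_ingredients", "ingredients"], inplace=False)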
df.head()
df.manufacturer.value_counts()
df.company_location.value_counts()
# Load a table of country codes and coordinates
countries=pd.read_csv('https://gist.githubusercontent.com/tadast/8827699/raw/f5cac3d42d16b78348610fc4ec301e9234f82821/countries_codes_and_coordinates.csv')
print(countries)
# Generate an overview of the dataframe
df.info()
# Count the missing values in each column
df.isnull().sum()  # to count duplicate rows instead, use: df.duplicated().sum()
# Remove missing values
df.dropna(subset=["num_ingredients", 'ingredients'], inplace=True)
# Remove duplicated rows (drop_duplicates returns a new DataFrame, so reassign it)
df = df.drop_duplicates()
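Checking for duplicates and removing them are two separate steps, and drop_duplicates() only has a lasting effect if its result is kept. A minimal sketch on a made-up toy DataFrame (the column values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with one exact duplicate row
toy = pd.DataFrame({"bar": ["x", "x", "y"], "rating": [3.0, 3.0, 3.5]})

# duplicated() flags rows that repeat an earlier row; summing counts them
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame and leaves toy untouched
deduped = toy.drop_duplicates()

print(n_dupes, len(toy), len(deduped))
```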
# Plot each numeric column against the others in the dataframe
sns.pairplot(df)
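# Display the correlation coefficient for all pairs of numeric columns
# (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x)
corr = df.corr(numeric_only=True)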
corr.plot(kind='bar')
plt.show()
corr
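Correlation is only defined for numeric columns, so text columns have to be left out before calling corr(). A minimal sketch on made-up toy data, assuming select_dtypes() is an acceptable way to keep only the numeric columns (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one text column and two numeric columns
toy = pd.DataFrame({
    "name": ["p", "q", "r", "s"],
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # exactly 2 * a, so perfectly correlated with a
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
toy_corr = toy.select_dtypes(include="number").corr()
print(toy_corr)
```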
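# What is the average rating by country of bean origin?
# (head() keeps the first five origins in alphabetical order)
x = df.groupby("bean_origin")["rating"].mean().head()
x = x.sort_values(ascending=True)  # sort_values returns a new Series, so reassign it
x.plot(kind='bar')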
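Note that the order of head() and sort_values() matters: calling head() first keeps the first five groups alphabetically, while sorting first keeps the groups with the highest or lowest means. A minimal sketch of the sort-then-head variant on made-up toy data (the ratings below are invented, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data; the ratings are invented for illustration
toy = pd.DataFrame({
    "bean_origin": ["Peru", "Peru", "Ghana", "Ghana", "Belize"],
    "rating": [3.0, 3.5, 2.5, 3.0, 3.75],
})

# Mean rating per origin, sorted from highest to lowest
mean_by_origin = (
    toy.groupby("bean_origin")["rating"].mean().sort_values(ascending=False)
)

# Taking head() after sorting keeps the top-rated origins
top = mean_by_origin.head(2)
print(top)
```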