ℹ️ Introduction to Data Science Notebooks

You can skip this section if you are already familiar with data science notebooks.

Data science notebooks

A data science notebook is a document that contains text cells (what you're reading right now) and code cells. What makes a notebook unique is that it is interactive: you can change or add code cells, then run a cell by selecting it and clicking the Run button above (or Run All to run every cell), or by pressing Control + Enter.

The result will be displayed directly in the notebook.

Try running the cell below:

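# Run this cell to see the result
# (df is created in the "Reading in the data" cell further down, so run that cell first)
lab = df.manufacturer.value_counts()
print(lab)

Modify the code, for example by counting the values of a different column, and rerun the cell.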

Data science notebooks & data analysis

Notebooks are great for interactive data analysis. Let's create a pandas DataFrame using the read_csv() function.

We will load the dataset "chocolate_bars.csv" containing different characteristics of over 2,500 chocolate bars and their reviews.

By using the .head() method, we display the first five rows of data:

# Importing the required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_columns',10)
pd.set_option('display.max_rows',20)

# Reading in the data
df = pd.read_csv('data/chocolate_bars.csv')

# Define the background style of the charts
plt.style.use('ggplot')

# Display the first five rows (rows with missing values are removed in a later cell)
df.head()
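The loading step above can be tried without the data file. This is a minimal sketch, assuming a small made-up CSV held in a string; the rows and the reduced set of columns are hypothetical and only stand in for data/chocolate_bars.csv.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data/chocolate_bars.csv
csv_text = """manufacturer,bean_origin,rating
A,Peru,3.5
B,Ghana,3.0
C,Belize,3.75
A,Peru,4.0
D,Ghana,2.5
E,Ecuador,3.25
F,Belize,3.5
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default; head(n) returns the first n
first_rows = sample.head()
print(first_rows)
```

Passing a number, as in sample.head(3), changes how many rows are shown.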


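# Count the bars per manufacturer and per company location
print(df.manufacturer.value_counts())
print(df.company_location.value_counts())
# Checking the shape of the dataframe
print(df.shape)
# Load a lookup table of country codes and coordinates
countries = pd.read_csv('https://gist.githubusercontent.com/tadast/8827699/raw/f5cac3d42d16b78348610fc4ec301e9234f82821/countries_codes_and_coordinates.csv')
print(countries)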
# Generate an overview of the dataframe
df.info()
# Count the missing values in each column
# (duplicated rows are a separate check: df.duplicated().sum())
df.isnull().sum()
# Remove missing values
df.dropna(subset=["num_ingredients", 'ingredients'], inplace=True)
# Remove duplicated rows (drop_duplicates returns a new dataframe, so reassign it)
df = df.drop_duplicates()
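The cleaning steps above (counting missing values, dropping incomplete rows, removing duplicates) can be checked on a tiny table. This is a minimal sketch, assuming a made-up dataframe named toy; its values are hypothetical and only the column names ingredients, num_ingredients and rating come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 has missing values, row 3 repeats row 0
toy = pd.DataFrame({
    "ingredients": ["B,S", None, "B,S,C", "B,S"],
    "num_ingredients": [2.0, np.nan, 3.0, 2.0],
    "rating": [3.5, 3.0, 4.0, 3.5],
})

# Missing values are counted per column
missing_per_column = toy.isnull().sum()

# Duplicated rows are counted separately
duplicate_rows = toy.duplicated().sum()

# Both methods return a new dataframe unless the result is assigned back
cleaned = toy.dropna(subset=["num_ingredients", "ingredients"]).drop_duplicates()

print(missing_per_column)
print("duplicated rows:", duplicate_rows)
print("rows before:", len(toy), "rows after:", len(cleaned))
```

Calling dropna() or drop_duplicates() without assigning the result (or without inplace=True) leaves the original dataframe unchanged, which is why toy still has all of its rows here.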
# Plot each numeric column against every other numeric column
sns.pairplot(df)
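# Display the correlation coefficient for every pair of numeric columns
# (text columns are excluded first, since corr() cannot handle them in recent pandas versions)
corr = df.select_dtypes(include='number').corr()
corr.plot(kind='bar')
plt.show()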

corr
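The correlation step can be illustrated on a small table that mixes numeric and text columns. This is a minimal sketch, assuming a made-up dataframe; the column name cocoa_percent and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns and one text column
toy = pd.DataFrame({
    "cocoa_percent": [60.0, 70.0, 80.0, 90.0],
    "rating": [3.0, 3.5, 4.0, 4.5],
    "manufacturer": ["A", "B", "C", "D"],
})

# Keep only the numeric columns before computing correlations
numeric = toy.select_dtypes(include="number")

# corr() returns a square table of pairwise Pearson coefficients
toy_corr = numeric.corr()
print(toy_corr)
```

Selecting the numeric columns explicitly keeps the call working across pandas versions, whether or not corr() silently skips text columns.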
# What is the average rating by country of origin?

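# Average rating for the first five origins (in alphabetical order)
x = df.groupby("bean_origin")["rating"].mean().head()
# sort_values returns a new Series, so keep the result before plotting
x = x.sort_values(ascending=True)
x.plot(kind='bar')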
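The group-and-average step can be verified on a few rows. This is a minimal sketch, assuming a made-up dataframe; the ratings are hypothetical and only the column names bean_origin and rating come from the notebook.

```python
import pandas as pd

# Hypothetical toy ratings for three origins
toy = pd.DataFrame({
    "bean_origin": ["Peru", "Peru", "Ghana", "Ghana", "Belize"],
    "rating": [3.0, 4.0, 2.5, 3.0, 3.75],
})

# Mean rating per origin, sorted from lowest to highest
avg_rating = (
    toy.groupby("bean_origin")["rating"]
    .mean()
    .sort_values(ascending=True)
)
print(avg_rating)
```

Sorting before taking .head() or .tail() would give the lowest or highest rated origins, rather than the first five in alphabetical order.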