Course Notes

  • In this course we analyze our data step by step:
  1. Read your data from a CSV file
  2. Summarize the number of missing values and the statistics of the numeric data
  3. Use a histogram to look at the distribution of numeric data
  4. How to select numeric or categorical data from a DataFrame and how to create new columns
  5. Use Seaborn plots: a boxplot to show the median, a barplot to show the mean
  6. To examine the relationship between two values, use a scatterplot
  7. Strategies for addressing missing data
  8. Imputing summary statistics
  9. Converting and analyzing categorical data
  10. How to clean outliers
  11. DateTime data, correlations, relative class frequency (crosstab), hypotheses
# Import any packages you want to use here
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
import seaborn as sb
# Display the dataset
clean_books = pd.read_csv('datasets/clean_books.csv', encoding='utf-8')
clean_books
# Summarize the non-null count (and so the missing values) in each column, plus data types and memory usage
clean_books.info()
# Summary statistics for the numeric columns
clean_books.describe()
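info() reports non-null counts, so the number of missing values has to be read off by subtraction. A minimal sketch of counting them directly with isna(), assuming a small made-up DataFrame (toy_books and its values are hypothetical, not taken from clean_books.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for clean_books; the values are made up.
toy_books = pd.DataFrame({
    "name": ["Book A", "Book B", "Book C", "Book D"],
    "rating": [4.5, np.nan, 4.1, 4.8],
    "genre": ["Fiction", "Non Fiction", None, "Fiction"],
})

# Number of missing values in each column
missing_counts = toy_books.isna().sum()
print(missing_counts)

# Share of missing values in each column (mean of the True/False mask)
missing_share = toy_books.isna().mean()
print(missing_share)
```

The same two calls should work on clean_books unchanged, since isna() is defined for any DataFrame.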
# Plot the data: use a histogram to look at the distribution of numeric data
sns.histplot(x="rating", data=clean_books)
plt.show()
# Look at the data type of each column
clean_books.dtypes
# Check which rows match any of the listed values with the isin() method
clean_books['genre'].isin(['Fiction', 'Non Fiction'])
# Count the values
clean_books.value_counts('genre')
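Raw counts are easier to compare across categories as proportions. A small sketch, assuming a made-up genre column (toy_genres is hypothetical), that uses the normalize=True option of value_counts() to get relative frequencies:

```python
import pandas as pd

# Hypothetical genre column; the values are made up for illustration.
toy_genres = pd.Series(
    ["Fiction", "Non Fiction", "Fiction", "Childrens"], name="genre"
)

# Absolute number of rows per genre
genre_counts = toy_genres.value_counts()
print(genre_counts)

# Relative frequency per genre (the shares sum to 1)
genre_shares = toy_genres.value_counts(normalize=True)
print(genre_shares)
```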
# Use the tilde operator (~) to negate the mask: returns True where the genre is NOT in the list
~clean_books['genre'].isin(['Fiction'])
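# Flag rows whose year is not 2020 (remove the ~ to flag rows that are 2020).
# The year column is numeric, so compare against the number 2020, not the string '2020'.
(~clean_books['year'].isin([2020])).head()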
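The boolean Series returned by isin() is mostly useful as a filter. A minimal sketch, assuming a made-up DataFrame (toy_books is hypothetical), of using the mask and its negation to select rows, and of any() to answer whether a value exists at all:

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy_books = pd.DataFrame({
    "genre": ["Fiction", "Non Fiction", "Childrens", "Fiction"],
    "year": [2019, 2020, 2020, 2018],
})

# Boolean mask: True where genre is "Fiction"
is_fiction = toy_books["genre"].isin(["Fiction"])

# Keep only the matching rows, or only the non-matching rows with ~
fiction_only = toy_books[is_fiction]
not_fiction = toy_books[~is_fiction]
print(fiction_only)
print(not_fiction)

# any() collapses the mask to a single answer: does 2020 appear at all?
has_2020 = toy_books["year"].isin([2020]).any()
print(has_2020)
```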
# Box plot of the rating distribution for each year
sns.boxplot(x=clean_books["year"].astype(int), y=clean_books['rating'])
plt.xticks(rotation=45)
plt.show()
# What is the median year?
sns.set()
sns.boxenplot(x=clean_books["year"].astype(int))
plt.show()
import numpy as np
print(np.median(clean_books['year']), 'Median')
print(np.max(clean_books['year']), 'Max')
print(np.min(clean_books['year']), 'Min')
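The same three numbers can also be obtained from pandas itself, without NumPy. A short sketch, assuming a made-up year column (toy_years is hypothetical), that uses Series.agg() to compute them in one call:

```python
import pandas as pd

# Hypothetical year column; the values are made up for illustration.
toy_years = pd.Series([2009, 2012, 2013, 2016, 2019], name="year")

# Compute several summary statistics at once; the result is a Series
# indexed by the statistic names.
year_summary = toy_years.agg(["median", "min", "max"])
print(year_summary)
```

One difference worth knowing: the pandas methods skip missing values by default, while np.median returns NaN on a plain NumPy array that contains NaN.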