Course Notes: Preprocessing for Machine Learning in Python
Introduction to Data Preprocessing
Dropping missing data
# Drop the Latitude and Longitude columns from volunteer
volunteer_cols = volunteer.drop(["Latitude", "Longitude"], axis=1)
# Drop rows with missing category_desc values from volunteer_cols
volunteer_subset = volunteer_cols.dropna(subset=["category_desc"])
# Print out the shape of the subset
print(volunteer_subset.shape)
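The snippet above relies on a preloaded volunteer DataFrame. As a self-contained sketch of the same two steps, the example below uses a small made-up DataFrame (toy and its values are hypothetical stand-ins, not the course data) to show that drop removes whole columns while dropna(subset=...) removes only the rows missing a value in the named column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the volunteer dataset
toy = pd.DataFrame({
    "title": ["Tutor", "Park cleanup", "Food drive", "Mentor"],
    "category_desc": ["Education", None, "Helping Neighbors in Need", "Education"],
    "Latitude": [np.nan, np.nan, 40.7, np.nan],
    "Longitude": [np.nan, np.nan, -74.0, np.nan],
})

# Drop whole columns by name (axis=1 means columns)
toy_cols = toy.drop(["Latitude", "Longitude"], axis=1)

# Drop only the rows where category_desc is missing
toy_subset = toy_cols.dropna(subset=["category_desc"])

# Compare shapes before and after dropping rows
print(toy_cols.shape)
print(toy_subset.shape)
```

Comparing the two shapes is a quick way to see how many rows the dropna call removed.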
Working with data types
Converting a column type
# Print the head of the hits column
print(volunteer["hits"].head())
# Convert the hits column to type int
volunteer["hits"] = volunteer["hits"].astype("int")
# Look at the dtypes of the dataset
print(volunteer.dtypes)
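A minimal sketch of the same conversion on made-up data (toy and dirty are hypothetical names, not part of the course dataset). It assumes the column holds numbers stored as strings. The second half is an optional alternative, not part of the course snippet: astype("int") raises an error on values that cannot be parsed, while pd.to_numeric with errors="coerce" turns them into NaN instead.

```python
import pandas as pd

# Hypothetical column of numbers stored as strings
toy = pd.DataFrame({"hits": ["737", "22", "62"]})
print(toy["hits"].head())

# Convert the column to an integer type
toy["hits"] = toy["hits"].astype("int")
print(toy.dtypes)

# Optional alternative for columns with unparseable or missing values:
# errors="coerce" replaces anything that cannot be parsed with NaN
dirty = pd.Series(["10", "n/a", None])
cleaned = pd.to_numeric(dirty, errors="coerce")
print(cleaned)
```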
Training and test sets
Stratified sampling
# Create a DataFrame with all columns except category_desc
X = volunteer.drop("category_desc", axis=1)
# Create a category_desc labels dataset
y = volunteer[["category_desc"]]
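# Import the splitting function from scikit-learn
from sklearn.model_selection import train_test_split
# Use stratified sampling to split the dataset according to the labels in y
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)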
# Print the category_desc counts from y_train
print(y_train["category_desc"].value_counts())
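To check what stratify=y does, here is a small sketch on made-up data (toy and its two categories are hypothetical). It assumes the label column has no missing values, since stratification needs a class for every row. With the default 75/25 split, the class proportions in the training and test sets should match those of the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 20 rows, 60% "Education" and 40% "Health"
toy = pd.DataFrame({
    "hits": range(20),
    "category_desc": ["Education"] * 12 + ["Health"] * 8,
})

# Features are every column except the label
X = toy.drop("category_desc", axis=1)
y = toy[["category_desc"]]

# Stratify on y so each split keeps the same class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Compare class proportions in the two splits
print(y_train["category_desc"].value_counts(normalize=True))
print(y_test["category_desc"].value_counts(normalize=True))
```

Passing normalize=True to value_counts prints proportions rather than raw counts, which makes the train and test distributions easier to compare.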