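Course

Preprocessing for Machine Learning in Python

Intermediate skill level

Updated December 2025

Learn how to clean and prepare your data for machine learning!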


Python, Machine Learning

4 hours

20 videos

62 exercises

4,700 XP

66,582

Statement of Accomplishment


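Course Description

This course covers when and how to preprocess data. Preprocessing is a critical step in any machine learning project: it gets your data ready for modeling. It sits between importing and cleaning your data and fitting a machine learning model. You'll learn how to standardize your data so that it is in the form your model expects, how to create new features that make better use of the information in your dataset, and how to select the most suitable features to improve model fit. Finally, you'll practice preprocessing on a dataset of UFO sightings to get it ready for modeling.

Prerequisites

Cleaning Data in Python

Supervised Learning with scikit-learn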

1

Introduction to Data Preprocessing

In this chapter you'll learn exactly what it means to preprocess data. You'll take the first steps in any preprocessing journey, including exploring data types and dealing with missing data.

Introduction to preprocessing

Exploring missing data

Dropping missing data

Working with data types

Exploring data types

Converting a column type

Training and test sets

Class imbalance

Stratified sampling
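
As a rough illustration of the steps this chapter lists (exploring and dropping missing data, converting a column type, and stratified sampling), here is a minimal sketch using pandas and scikit-learn. The DataFrame and its column names (`hours`, `score`, `label`) are made up for this example and are not taken from the course materials.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 'hours' arrives as strings and 'score' has gaps.
df = pd.DataFrame({
    "hours": ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"],
    "score": [10.0, np.nan, 30.0, 40.0, 50.0, 60.0, np.nan, 80.0, 90.0, 100.0],
    "label": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

# Explore missing data: number of missing values in each column.
missing_counts = df.isna().sum()

# Drop the rows where 'score' is missing.
clean = df.dropna(subset=["score"]).copy()

# Convert a column type: numeric strings to integers.
clean["hours"] = clean["hours"].astype("int64")

X = clean[["hours", "score"]]
y = clean["label"]

# Stratified split: stratify=y keeps the class proportions of y
# roughly the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)
```

Without `stratify=y`, a random split of an imbalanced dataset can leave one of the sets with very few examples of the minority class.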

2

Standardizing Data

This chapter is all about standardizing data. Often a model will make some assumptions about the distribution or scale of your features. Standardization is a way to make your data fit these assumptions and improve the algorithm's performance.

Standardization

When to standardize

Modeling without normalizing

Log normalization

Checking the variance

Log normalization in Python

Scaling data for feature comparison

Scaling data - investigating columns

Scaling data - standardizing columns

Standardized data and modeling

KNN on non-scaled data

KNN on scaled data
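
The two techniques named in this chapter, log normalization and scaling, can be sketched as follows. This is an illustrative example on invented data, assuming NumPy, pandas, and scikit-learn's `StandardScaler`; the column names `small` and `large` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 'large' has a much higher variance than 'small'.
df = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [10.0, 100.0, 1000.0, 10000.0],
})

# Check the variance of each column before transforming anything.
variances = df.var()

# Log normalization: take the natural log of the high-variance column.
df["large_log"] = np.log(df["large"])

# Standardization: rescale each column to mean 0 and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[["small", "large_log"]])
```

When a train/test split is involved, the scaler is usually fit on the training set only and then applied to the test set with `scaler.transform`, so that no information from the test set leaks into the preprocessing.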

3

Feature Engineering

In this chapter you'll learn about feature engineering. You'll explore different ways to create new, more useful features from the ones already in your dataset. You'll see how to encode, aggregate, and extract information from both numerical and textual features.

Feature engineering

Feature engineering knowledge test

Identifying areas for feature engineering

Encoding categorical variables

Encoding categorical variables - binary

Encoding categorical variables - one-hot

Engineering numerical features

Aggregating numerical features

Extracting datetime components

Engineering text features

Extracting string patterns

Vectorizing text

Text classification using tf/idf vectors
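
The feature engineering steps listed above (binary and one-hot encoding, aggregating numerical columns, extracting datetime components, and vectorizing text) might look roughly like this. It is a sketch on a small made-up DataFrame, assuming pandas and scikit-learn's `TfidfVectorizer`; none of the column names come from the course datasets.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical data with categorical, numerical, date, and text columns.
df = pd.DataFrame({
    "subscribed": ["y", "n", "y"],
    "color": ["red", "blue", "red"],
    "date": ["2024-01-15", "2024-02-20", "2024-03-05"],
    "run1": [5.0, 6.0, 7.0],
    "run2": [7.0, 6.0, 5.0],
    "text": ["bright light in sky", "light moved fast", "dark sky"],
})

# Binary encoding: map a two-valued column to 1/0.
df["subscribed_enc"] = (df["subscribed"] == "y").astype(int)

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["color"])

# Aggregating numerical features: row-wise mean of related columns.
df["run_mean"] = df[["run1", "run2"]].mean(axis=1)

# Extracting datetime components: parse the strings, then take the month.
df["month"] = pd.to_datetime(df["date"]).dt.month

# Vectorizing text: convert each document into a tf-idf weighted vector.
tfidf = TfidfVectorizer()
text_vectors = tfidf.fit_transform(df["text"])
```

The result of `fit_transform` is a sparse matrix with one row per document and one column per vocabulary term, which can be passed to a scikit-learn classifier.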

4

Selecting Features for Modeling

This chapter goes over a few different techniques for selecting the most important features from your dataset. You'll learn how to drop redundant features, work with text vectors, and reduce the number of features in your dataset using principal component analysis (PCA).

Feature selection

When to use feature selection

Identifying areas for feature selection

Removing redundant features

Selecting relevant features

Checking for correlated features

Selecting features using text vectors

Exploring text vectors, part 1

Exploring text vectors, part 2

Training Naive Bayes with feature selection

Dimensionality reduction

Training a model with PCA
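
To make the ideas in this chapter concrete, the following sketch checks for correlated features, drops a redundant one, and reduces dimensionality with PCA. The data is synthetic and the column names (`a`, `a_copy`, `b`) are invented for illustration; it assumes NumPy, pandas, and scikit-learn.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: 'a_copy' is almost a rescaled duplicate of 'a'.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "a_copy": 2 * a + 0.01 * rng.normal(size=100),
    "b": b,
})

# Checking for correlated features: absolute pairwise correlations.
corr = df.corr().abs()

# Removing a redundant feature: 'a_copy' carries almost no new information.
reduced = df.drop(columns=["a_copy"])

# Dimensionality reduction: project the three columns onto two components.
pca = PCA(n_components=2)
components = pca.fit_transform(df)

# Fraction of the total variance retained by the two components.
explained = pca.explained_variance_ratio_.sum()
```

PCA components are linear combinations of the original columns, so they are harder to interpret than a hand-picked subset of features; which approach fits better depends on the modeling goal.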

5

Putting It All Together

Now that you've learned all about preprocessing, you'll try these techniques out on a dataset that records information on UFO sightings.

UFOs and preprocessing

Checking column types

Dropping missing data

Categorical variables and standardization

Extracting numbers from strings

Identifying features for standardization

Engineering new features

Encoding categorical variables

Features from dates

Text vectorization

Feature selection and modeling

Selecting the ideal dataset

Modeling the UFO dataset, part 1

Modeling the UFO dataset, part 2

Congratulations!
