Course

Feature Engineering for Machine Learning in Python

Intermediate skill level

Updated February 2023

Create new features to improve the performance of your Machine Learning models.


Python, Machine Learning

4 hours

16 videos

53 exercises

4,350 XP

38,853

Statement of Accomplishment


Course Description

Every day you read about breakthroughs in how the newest applications of machine learning are changing the world. This reporting often glosses over the fact that a huge amount of data munging and feature engineering must be done before any of these models can be used. In this course, you will learn how to do just that. You will work with the Stack Overflow Developer Survey and historic US presidential inauguration addresses to understand how best to preprocess and engineer features from categorical, continuous, and unstructured data. This course will give you hands-on experience preparing data for your own machine learning models.

Prerequisites

Supervised Learning with scikit-learn

1

Creating Features

In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

Why generate features?

Getting to know your data

Selecting specific data types

Dealing with categorical features

One-hot encoding and dummy variables

Dealing with uncommon categories

Numeric variables

Binarizing columns

Binning values
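
As an illustration of the techniques listed for this chapter, here is a minimal pandas sketch. It is not taken from the course materials: the DataFrame, its column names (`Country`, `YearsExperience`), and the thresholds are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey-style data; column names and values are made up.
df = pd.DataFrame({
    "Country": ["USA", "India", "USA", "Fiji", "India", "USA"],
    "YearsExperience": [1, 12, 4, 25, 0, 8],
})

# Uncommon categories: relabel countries that appear fewer than 2 times as "Other".
counts = df["Country"].value_counts()
rare = counts[counts < 2].index
df["Country"] = df["Country"].where(~df["Country"].isin(rare), "Other")

# One-hot encoding: one indicator column per remaining category.
one_hot = pd.get_dummies(df, columns=["Country"], prefix="C")

# Binarizing: 1 if the respondent has any experience, else 0.
df["HasExperience"] = (df["YearsExperience"] > 0).astype(int)

# Binning: map a numeric column onto labelled ranges.
df["ExperienceLevel"] = pd.cut(
    df["YearsExperience"],
    bins=[-np.inf, 2, 10, np.inf],
    labels=["junior", "mid", "senior"],
)
```

Grouping rare categories before encoding keeps the number of indicator columns small. The cut-off of 2 rows is arbitrary here and would need tuning on real data.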

2

Dealing with Messy Data

This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore multiple approaches to dealing with them. You will also use string manipulation techniques to deal with unwanted characters in your dataset.

Why do missing values exist?

How sparse is my data?

Finding the missing values

Dealing with missing values (I)

Listwise deletion

Replacing missing values with constants

Dealing with missing values (II)

Filling continuous missing values

Imputing values in predictive models

Dealing with other data issues

Dealing with stray characters (I)

Dealing with stray characters (II)

Method chaining
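
The topics in this chapter could be sketched in pandas roughly as follows. This is an illustrative example under assumed data, not the course's own code: the columns (`Gender`, `Age`, `RawSalary`) and the fill value `"Not Given"` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical messy data; column names and values are made up.
df = pd.DataFrame({
    "Gender": ["Male", None, "Female", "Male"],
    "Age": [25.0, 31.0, np.nan, 40.0],
    "RawSalary": ["$50,000", "£42,000", "38,250", "61,500"],
})

# Finding missing values: count them per column.
missing_per_column = df.isna().sum()

# Listwise deletion: drop every row that has at least one missing value.
complete_rows = df.dropna(how="any")

# Replace missing categorical values with a constant label.
df["Gender"] = df["Gender"].fillna("Not Given")

# Fill missing continuous values with the column mean.
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Stray characters: strip separators and currency symbols, then convert
# to a numeric type, all in a single method chain.
df["Salary"] = (
    df["RawSalary"]
    .str.replace(",", "", regex=False)
    .str.replace("$", "", regex=False)
    .str.replace("£", "", regex=False)
    .astype(float)
)
```

Listwise deletion is simple but discards two of the four rows here, which is why filling values is often preferred when data is scarce.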

3

Conforming to Statistical Assumptions

In this chapter, you will focus on analyzing the underlying distribution of your data and whether it will impact your machine learning pipeline. You will learn how to deal with skewed data and situations where outliers may be negatively impacting your analysis.

Data distributions

What does your data look like? (I)

What does your data look like? (II)

When don't you have to transform your data?

Scaling and transformations

Normalization

Standardization

Log transformation

When can you use normalization?

Removing outliers

Percentage based outlier removal

Statistical outlier removal

Scaling and transforming new data

Train and testing transformations (I)

Train and testing transformations (II)
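
A minimal sketch of the scaling, transformation, and outlier-removal ideas in this chapter, assuming pandas and scikit-learn are available. The `Salary` column, its values, and the 95th-percentile cut-off are hypothetical choices for illustration, not the course's own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical train/test split; the column name and values are made up.
train = pd.DataFrame({"Salary": [30.0, 40.0, 50.0, 60.0, 400.0]})
test = pd.DataFrame({"Salary": [45.0, 80.0]})

# Percentage-based outlier removal: drop training rows above the 95th percentile.
cutoff = train["Salary"].quantile(0.95)
train = train[train["Salary"] <= cutoff].copy()

# Fit the scalers on the training data only.
minmax = MinMaxScaler().fit(train[["Salary"]])
standard = StandardScaler().fit(train[["Salary"]])

# Normalization (min-max scaling) applied to both sets with the fitted scaler.
train["Salary_norm"] = minmax.transform(train[["Salary"]]).ravel()
test["Salary_norm"] = minmax.transform(test[["Salary"]]).ravel()

# Standardization (zero mean, unit variance on the training data).
train["Salary_std"] = standard.transform(train[["Salary"]]).ravel()
test["Salary_std"] = standard.transform(test[["Salary"]]).ravel()

# Log transformation to reduce right skew (values must be positive).
train["Salary_log"] = np.log(train["Salary"])
```

Because the scalers are fitted on the training set alone, a test value can fall outside the training range: the test salary of 80 normalizes to a value above 1. Refitting on the test data would hide this and leak information about the test set into preprocessing.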

4

Dealing with Text Data

Finally, in this chapter, you will work with unstructured text data and learn ways to engineer columnar features out of a text corpus. You will compare how different approaches affect how much context is extracted from a text, and how to balance the need for context against creating too many features.

Encoding text

Cleaning up your text

High level text features

Word counts

Counting words (I)

Counting words (II)

Limiting your features

Text to DataFrame

Term frequency-inverse document frequency

Inspecting Tf-idf values

Transforming unseen data

Using longer n-grams

Finding the most common words
