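Feature Engineering with PySpark

Skill level: Advanced

Updated 01/2026

Learn the hands-on practice of data wrangling and feature engineering, the work on which data scientists spend 70-80% of their time.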

Spark, Data Manipulation

4 hours

16 videos

60 exercises

5,000 XP

17,764

Certificate of completion

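Course Description

Real-world data is messy, and your job is to make sense of it. Toy datasets such as MTCars and Iris have been carefully curated and cleaned, yet even they must be transformed before powerful machine learning algorithms can extract meaning from them and use it to forecast, classify, or cluster. This course covers the gritty side of the work on which data scientists are said to spend 70-80% of their time: data wrangling and feature engineering. With datasets growing ever larger, let's use PySpark to tackle this big data problem in a scalable way!

Prerequisites

Supervised Learning with scikit-learn

Introduction to PySpark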

1

Exploratory Data Analysis

Get to know a bit about your problem before you dive in! Then learn how to statistically and visually inspect your dataset!

Where to Begin

Where to begin?

Check Version

Load in the data

Defining A Problem

What are we predicting?

Verifying Data Load

Verifying DataTypes

Visually Inspecting Data / EDA

Using Corr()

Using Visualizations: distplot

Using Visualizations: lmplot
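
As a rough illustration of what a correlation check in this chapter measures, the sketch below computes a Pearson correlation coefficient by hand. It is written in plain Python rather than PySpark so that it runs without a Spark installation; the column names and values are hypothetical and not taken from the course dataset.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical columns: price is an exact linear function of sqft.
sqft = [800.0, 1000.0, 1200.0, 1500.0]
price = [160.0, 200.0, 240.0, 300.0]

corr = pearson_corr(sqft, price)
inverse_corr = pearson_corr(sqft, [-p for p in price])
```

A value near 1 or -1 suggests a strong linear relationship with the target, while a value near 0 suggests little linear relationship. The function assumes neither column is constant, since that would make the denominator zero.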

2

Wrangling with Spark Functions

Real data is rarely clean and ready for analysis. In this chapter, learn to remove unneeded information, handle missing values, and add additional data to your analysis.

Dropping data

Dropping a list of columns

Using text filters to remove records

Filtering numeric fields conditionally

Adjusting Data

Custom Percentage Scaling

Scaling your scalers

Correcting Right Skew Data

Working with Missing Data

Visualizing Missing Data

Imputing Missing Data

Calculate Missing Percents

Getting More Data

A Dangerous Join

Spark SQL Join

Checking for Bad Joins
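
The arithmetic behind three of the topics listed above (missing-value percentages, percentage scaling, and correcting right skew) can be sketched as follows. This is a minimal plain-Python illustration, not the course's PySpark code; the records, column names, and helper functions are hypothetical, and the log transform is only one common way to reduce right skew.

```python
import math

# Hypothetical toy records; None marks a missing value.
rows = [
    {"price": 100.0, "sqft": 800.0},
    {"price": 250.0, "sqft": None},
    {"price": 400.0, "sqft": 1200.0},
    {"price": 1000.0, "sqft": None},
]


def missing_percent(rows, column):
    """Percentage (0-100) of rows in which `column` is missing."""
    missing = sum(1 for row in rows if row[column] is None)
    return 100.0 * missing / len(rows)


def percentage_scale(values):
    """Min-max scale values onto a 0-100 range (assumes max > min)."""
    low, high = min(values), max(values)
    return [100.0 * (v - low) / (high - low) for v in values]


def log_transform(values):
    """Compress a right-skewed, strictly positive column with a natural log."""
    return [math.log(v) for v in values]


prices = [row["price"] for row in rows]
sqft_missing = missing_percent(rows, "sqft")
scaled_prices = percentage_scale(prices)
log_prices = log_transform(prices)
```

A column with a high missing percentage is a candidate for dropping, while one with a low percentage may be worth imputing; where to draw that line is a judgment call that depends on the dataset.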

3

Feature Engineering

In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.

Feature Generation

Differences

Deeper Features

Time Features

Time Components

Joining On Time Components

Extracting Features

Extracting Text to New Features

Splitting & Exploding

Pivot & Join

Binarizing, Bucketing & Encoding

Binarizing Day of Week

One Hot Encoding
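
To make the time-component, binarizing, and one-hot encoding topics above concrete, here is a small plain-Python sketch. It is an assumption-laden illustration rather than the course's PySpark code: the dates and helper names are hypothetical, and weekdays follow Python's ISO convention (Monday = 1, Sunday = 7), which may differ from the convention used by other tools.

```python
from datetime import date

# Hypothetical listing dates.
listing_dates = [date(2017, 7, 1), date(2017, 7, 3), date(2017, 12, 25)]


def time_components(d):
    """Split a date into separate numeric features."""
    return {"year": d.year, "month": d.month, "weekday": d.isoweekday()}


def is_weekend(d):
    """Binarize day of week: 1 for Saturday or Sunday, otherwise 0."""
    return 1 if d.isoweekday() >= 6 else 0


def one_hot(value, categories):
    """Encode `value` as a 0/1 vector with one slot per category."""
    return [1 if value == category else 0 for category in categories]


months = list(range(1, 13))
components = [time_components(d) for d in listing_dates]
weekend_flags = [is_weekend(d) for d in listing_dates]
month_vectors = [one_hot(d.month, months) for d in listing_dates]
```

One-hot encoding replaces a single categorical column with one indicator column per category, so a model does not read an ordering into values such as month numbers.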

4

Building a Model

In this chapter, we'll learn how to choose a type of model, fit it to our data, and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later use!

Choosing the Algorithm

Which MLlib Module?

Creating Time Splits

Adjusting Time Features

Feature Engineering Assumptions for RFR

Feature Engineering For Random Forests

Dropping Columns with Low Observations

Naively Handling Missing and Categorical Values

Building a Model

Building a Regression Model

Evaluating & Comparing Algorithms

Understanding Metrics

Interpreting, Saving & Loading

Interpreting Results

Saving & Loading Models

Final Thoughts
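
Two ideas from this chapter, splitting data by time and scoring a regression model, can be sketched in plain Python as shown below. This is one possible approach under stated assumptions, not the course's implementation: the function names, the `test_fraction` parameter, and the toy records are all hypothetical, and RMSE is only one of several metrics a regression model might be judged by.

```python
import math
from datetime import date, timedelta


def time_split(records, test_fraction=0.2):
    """Split (date, value) records so the test set is the most recent slice.

    The cutoff sits `test_fraction` of the full date range before the last
    date; records on or before the cutoff train, later ones test.
    """
    dates = [d for d, _ in records]
    first, last = min(dates), max(dates)
    span_days = (last - first).days
    cutoff = last - timedelta(days=int(span_days * test_fraction))
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test, cutoff


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Hypothetical daily records spanning eleven consecutive days.
records = [(date(2017, 1, 1) + timedelta(days=i), float(i)) for i in range(11)]
train, test, cutoff = time_split(records, test_fraction=0.2)
error = rmse([3.0, 5.0], [1.0, 5.0])
```

Splitting on time rather than at random keeps future observations out of the training set, which matters when the model will be used to predict values for dates it has not yet seen.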
