Course

Machine Learning with PySpark

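Skill level: Advanced

Updated November 2025

Learn how to make predictions from data with Apache Spark, using decision trees, logistic regression, linear regression, ensembles, and pipelines.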

Spark, Machine Learning

4 hours

16 videos

56 exercises

4,550 XP

29,676

Certificate of completion

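Course Description

Learn how to use Apache Spark for machine learning

Spark is a powerful, general-purpose tool for working with big data. It transparently handles the distribution of compute tasks across a cluster. This makes operations fast, and it also lets you focus on the analysis rather than on the technical details. In this course you will learn how to get data into Spark and then work through three core Spark machine learning topics: linear regression, logistic regression and other classifiers, and building pipelines.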

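Build and test decision trees

Building your own decision tree is a great way to start exploring machine learning models. You will use an algorithm called "recursive partitioning", which finds the predictor that most informatively separates the two classes in the data, splits the data on it, and then repeats the process on each resulting node. You can then use the decision tree to make predictions on new data.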
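As an illustration of a single step of recursive partitioning, here is a minimal pure-Python sketch. It assumes Gini impurity as the measure of how well a split separates the classes, and the toy flight data and function names are hypothetical. It is not how Spark implements decision trees internally.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) of the best binary split."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send rows to both sides
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (feature, threshold, impurity)
    return best


# Hypothetical toy data: [distance_km, departure_hour] -> delayed (1) or on time (0)
rows = [[200, 8], [300, 9], [1500, 18], [1800, 20], [250, 19], [1700, 7]]
labels = [0, 0, 1, 1, 0, 1]
feature, threshold, impurity = best_split(rows, labels)
```

In this toy data the distance column separates the two classes perfectly, so the chosen split is on feature 0. A full tree would apply `best_split` again to the rows on each side until a stopping rule is met.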

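Master logistic and linear regression in PySpark

Logistic regression and linear regression are core machine learning techniques supported by PySpark. After learning to build and evaluate logistic regression models, you will move on to creating linear regression models and learn how to narrow the predictors down to only the most relevant ones.

By the end of the course, the hands-on exercises and practice datasets spread throughout it will leave you confident in applying your new machine learning knowledge.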

Prerequisites

Supervised Learning with scikit-learn

Introduction to PySpark

1

Introduction

Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

Machine Learning & Spark

Characteristics of Spark

Components in a Spark Cluster

Connecting to Spark

Location of Spark master

Creating a SparkSession

Loading Data

Loading flights data

Loading SMS spam data
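The connection and loading steps covered in this chapter might look roughly like the sketch below. The helper names `local_master` and `load_csv`, the application name, and the choice to treat "NA" as a missing value are assumptions made for illustration. The PySpark calls themselves (`SparkSession.builder`, `spark.read.csv`) are the standard API.

```python
def local_master(cores=None):
    """Build a master URL for a local Spark instance.

    "local[*]" uses all available cores; "local[4]" uses exactly four.
    """
    return "local[*]" if cores is None else f"local[{cores}]"


def load_csv(path, master=None, app_name="ml_example"):
    """Start (or reuse) a SparkSession and read a CSV file into a DataFrame."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master or local_master())
        .appName(app_name)
        .getOrCreate()
    )
    # header, inferSchema and nullValue are standard CSV reader options;
    # treating "NA" as missing is an assumption about the file.
    df = spark.read.csv(path, sep=",", header=True, inferSchema=True, nullValue="NA")
    return spark, df
```

A call such as `spark, flights = load_csv("flights.csv")` (the file name is hypothetical) returns the session and the DataFrame. Call `spark.stop()` when finished. For a remote standalone cluster, the master URL takes the form `spark://<host>:7077` instead of `local[*]`.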

2

Classification

Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.

Data Preparation

Removing columns and rows

Column manipulation

Categorical columns

Assembling columns

Decision Tree

Train/test split

Build a Decision Tree

Evaluate the Decision Tree

Logistic Regression

Build a Logistic Regression model

Evaluate the Logistic Regression model

Turning Text into Tables

Punctuation, numbers and tokens

Stopwords and hashing

Training a spam classifier
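To make the "text into tables" idea concrete, here is a small pure-Python sketch of the same steps: strip punctuation and numbers, tokenize, drop stop words, and hash the remaining tokens into a fixed-length count vector. Spark ML ships transformers for these steps (for example `Tokenizer`, `StopWordsRemover` and `HashingTF`). This sketch does not use them, and it hashes with CRC32 purely for reproducibility, so the bucket indices will not match Spark's. The stop word list and the message are made up.

```python
import re
import zlib

STOPWORDS = {"a", "an", "the", "to", "is", "you", "and", "of"}  # tiny illustrative list


def tokenize(text):
    """Lower-case, replace punctuation and digits with spaces, split on whitespace."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()


def hash_features(tokens, num_features=16):
    """Count tokens into a fixed-length vector using the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        # CRC32 is an assumption made for this sketch, not Spark's hash function.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


message = "WINNER!! You have won a prize of 1000 pounds, call to claim the prize"
tokens = [t for t in tokenize(message) if t not in STOPWORDS]
features = hash_features(tokens)
```

Every message ends up as a vector of the same length regardless of its vocabulary, which is what lets a classifier treat free text as ordinary numeric columns. The trade-off is that unrelated words can collide in the same bucket when `num_features` is small.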

3

Regression

Next you'll learn to create Linear Regression models. You'll also find out how to augment your data by engineering new predictors, and you'll learn a robust approach to selecting only the most relevant predictors.

One-Hot Encoding

Encoding flight origin

Encoding shirt sizes

Flight duration model: Just distance

Interpreting the coefficients

Flight duration model: Adding origin airport

Interpreting coefficients

Bucketing & Engineering

Bucketing departure time

Flight duration model: Adding departure time

Regularization

Flight duration model: More features!

Flight duration model: Regularization!
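The two feature-engineering ideas in this chapter, one-hot encoding and bucketing, can be sketched in plain Python as follows. The function names, shirt sizes and six-hour departure buckets are hypothetical. The sketch mirrors two behaviours of Spark's `OneHotEncoder` and `Bucketizer` as documented: the last category level is dropped by default, and each bucket includes its lower bound but not its upper bound, except the final bucket.

```python
from bisect import bisect_right


def one_hot(value, categories, drop_last=True):
    """Encode a category as a 0/1 list, optionally dropping the last level."""
    levels = categories[:-1] if drop_last else categories
    return [1 if value == level else 0 for level in levels]


def bucket_index(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) containing value."""
    if not splits[0] <= value <= splits[-1]:
        raise ValueError(f"{value} is outside the range covered by splits")
    # The final bucket also includes its upper bound.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


sizes = ["S", "M", "L", "XL"]        # hypothetical shirt sizes
encoded_m = one_hot("M", sizes)      # [0, 1, 0]
encoded_xl = one_hot("XL", sizes)    # [0, 0, 0]: the dropped level is all zeros

hour_splits = [0, 6, 12, 18, 24]     # hypothetical 6-hour departure buckets
morning = bucket_index(7.5, hour_splits)  # 1
```

Dropping one level avoids a redundant column, since the last category is implied when all the others are zero. The coefficients of the remaining levels are then read relative to that dropped level.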

4

Ensembles & Pipelines

In the final chapter you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. You'll finish by trying out two types of ensemble model.

Flight duration model: Pipeline stages

Flight duration model: Pipeline model

SMS spam pipeline

Cross-Validation

Cross validating simple flight duration model

Cross validating flight duration model pipeline

Grid Search

Optimizing flights linear regression

Dissecting the best flight duration model

SMS spam optimised

How many models for grid search?

Delayed flights with Gradient-Boosted Trees

Delayed flights with a Random Forest

Evaluating Random Forest

Closing thoughts
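The cost of combining grid search with cross-validation can be estimated with a short calculation: every parameter combination is fitted once per fold. The sketch below is plain Python with hypothetical helper names and a made-up grid. `regParam` and `elasticNetParam` are real parameters of Spark's `LinearRegression`. The optional extra fit reflects the final refit of the best parameters on the full training data, which Spark's `CrossValidator` performs.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(param_grid)
    return [dict(zip(names, values)) for values in product(*param_grid.values())]


def count_fits(param_grid, num_folds, refit_best=True):
    """Number of model fits for grid search with k-fold cross-validation."""
    fits = len(expand_grid(param_grid)) * num_folds
    return fits + 1 if refit_best else fits


# Hypothetical grid for a regularized linear regression
grid = {"regParam": [0.01, 0.1, 1.0, 10.0], "elasticNetParam": [0.0, 0.5, 1.0]}
combinations = expand_grid(grid)             # 4 * 3 = 12 combinations
total_fits = count_fits(grid, num_folds=5)   # 12 * 5 + 1 = 61 fits
```

Because the count grows multiplicatively with each added parameter and with the number of folds, it is worth running this arithmetic before launching a large search on a cluster.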
