Introduction to Big Data with PySpark

Skill level: Advanced

Updated 02/2025

Learn the fundamentals of working with Big Data using PySpark.

Spark, Data Engineering

4 hours

16 videos

55 exercises

4,600 XP

65,217

Certificate of completion

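Course Description

Big Data has attracted a great deal of attention in recent years and is now in common use at many companies. But what is Big Data? This course covers the fundamentals of Big Data through PySpark. Spark is a "lightning-fast cluster computing" framework for Big Data. It provides a general-purpose data processing engine and can run up to 100x faster than Hadoop in memory, or up to 10x faster on disk. You will use PySpark, which lets you work with Spark from Python, along with powerful higher-level libraries such as Spark SQL and MLlib (for machine learning). You will also analyze the works of William Shakespeare, explore FIFA 2018 data, and cluster a genomic dataset. By the end of the course, you will have a solid understanding of PySpark and be able to apply it to general Big Data analysis.

Prerequisites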

Introduction to Python

1

Introduction to Big Data analysis with Spark

This chapter introduces the exciting world of Big Data, as well as the various concepts and different frameworks for processing Big Data. You will understand why Apache Spark is considered the best framework for Big Data.

What is Big Data?

The 3 V's of Big Data

PySpark: Spark with Python

Understanding SparkContext

Interactive Use of PySpark

Loading data in PySpark shell

Review of functional programming in Python

Use of lambda with map()

Use of lambda with filter()
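
As a quick illustration of the last two exercises, here is a minimal sketch in plain Python (no Spark required). The list `numbers` is made up for this example and is not taken from the course.

```python
# Plain Python: lambda expressions passed to the built-ins map() and filter().
numbers = [1, 2, 3, 4, 5, 6]  # illustrative data, not from the course

# map() applies the lambda to every element.
squares = list(map(lambda x: x * x, numbers))

# filter() keeps only the elements for which the lambda returns True.
evens = list(filter(lambda x: x % 2 == 0, numbers))

print(squares)  # [1, 4, 9, 16, 25, 36]
print(evens)    # [2, 4, 6]
```

The same lambda style carries over to Spark, where `map` and `filter` are methods on an RDD rather than built-in functions.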

2

Programming in PySpark RDDs

The main abstraction Spark provides is the resilient distributed dataset (RDD), the fundamental data type of the engine. This chapter introduces RDDs and shows how to create them and operate on them using RDD transformations and actions.
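
To make the split between transformations and actions concrete, the following is a rough word-count sketch, not the course's own solution. It assumes an existing `SparkContext` named `sc` (the PySpark shell provides one); the helper names and the tiny `STOP_WORDS` set are hypothetical and chosen only for illustration.

```python
from operator import add

# Hypothetical stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"the", "a", "and", "of", "to", "is"}


def tokenize(line):
    """Split a line into lowercase words."""
    return line.lower().split()


def is_content_word(word):
    """Return True for words that are not in STOP_WORDS."""
    return word not in STOP_WORDS


def top_word_counts(sc, lines, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs.

    `sc` is assumed to be an existing SparkContext.
    """
    base_rdd = sc.parallelize(lines)        # base RDD from a Python list
    words = base_rdd.flatMap(tokenize)      # transformation: lines -> words
    kept = words.filter(is_content_word)    # transformation: drop stop words
    counts = kept.map(lambda w: (w, 1)).reduceByKey(add)  # pair RDD summed per key
    # takeOrdered() is an action: it triggers the computation and returns a list.
    return counts.takeOrdered(n, key=lambda kv: -kv[1])
```

Nothing is computed until the final action runs; the `flatMap`, `filter`, `map`, and `reduceByKey` calls only describe the pipeline. In the PySpark shell, a call such as `top_word_counts(sc, ["to be or not to be"], n=3)` would run it on a one-line dataset.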

Abstracting Data with RDDs

RDDs from Parallelized collections

RDDs from External Datasets

Partitions in your data

Basic RDD Transformations and Actions

Map and Collect

Filter and Count

Pair RDDs in PySpark

ReduceByKey and Collect

SortByKey and Collect

Advanced RDD Actions

CountingByKeys

Create a base RDD and transform it

Remove stop words and reduce the dataset

Print word frequencies

3

PySpark SQL & DataFrames

In this chapter, you'll learn about Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. This chapter shows how Spark SQL allows you to use DataFrames in Python.
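
The two faces of Spark SQL described above, the DataFrame abstraction and the SQL query engine, can be sketched as follows. This is an illustrative example rather than course material: it assumes an existing `SparkSession` named `spark`, and the `people` view, the `name` and `age` columns, and the function names are all hypothetical.

```python
def adults_query(table_name, min_age):
    """Build a SQL string selecting the names of people aged min_age or older."""
    return f"SELECT name FROM {table_name} WHERE age >= {int(min_age)}"


def names_of_adults_sql(spark, rows, min_age=18):
    """Filter with a SQL query.

    `spark` is assumed to be an existing SparkSession;
    `rows` is a list of (name, age) tuples.
    """
    df = spark.createDataFrame(rows, schema=["name", "age"])  # DataFrame from local data
    df.createOrReplaceTempView("people")                      # expose it to SQL
    result = spark.sql(adults_query("people", min_age))       # run the query
    return [row["name"] for row in result.collect()]


def names_of_adults_api(spark, rows, min_age=18):
    """The same filter written with DataFrame methods instead of SQL."""
    df = spark.createDataFrame(rows, schema=["name", "age"])
    result = df.filter(df["age"] >= min_age).select("name")
    return [row["name"] for row in result.collect()]
```

Both functions should return the same names for the same input; which style to use is largely a matter of preference. To start from a file instead of in-memory rows, `spark.read.csv(path, header=True, inferSchema=True)` returns a DataFrame that can be used in the same way.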

Abstracting Data with DataFrames

RDD to DataFrame

Loading CSV into DataFrame

Operating on DataFrames in PySpark

Inspecting data in PySpark DataFrame

PySpark DataFrame subsetting and cleaning

Filtering your DataFrame

Interacting with DataFrames using PySpark SQL

Running SQL Queries Programmatically

SQL queries for filtering a table

Data Visualization in PySpark using DataFrames

PySpark DataFrame visualization

Part 1: Create a DataFrame from CSV file

Part 2: SQL Queries on DataFrame

Part 3: Data visualization

4

Machine Learning with PySpark MLlib

PySpark MLlib is Apache Spark's scalable machine learning library for Python, consisting of common learning algorithms and utilities. Throughout this last chapter, you'll learn important machine learning algorithms. You will build a movie recommendation engine and a spam filter, and use k-means clustering.
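
As a small taste of the RDD-based MLlib API, here is a hedged k-means sketch. It assumes an existing `SparkContext` named `sc` and input lines of comma-separated numbers; that file format, the function names, and the default values of `k` and `max_iterations` are assumptions made for illustration, not details from the course.

```python
def parse_point(line):
    """Turn a comma-separated line such as '1.5,2' into a list of floats."""
    return [float(value) for value in line.split(",")]


def cluster_centers(sc, lines, k=3, max_iterations=20):
    """Fit k-means on the parsed points and return the cluster centers.

    `sc` is assumed to be an existing SparkContext.
    """
    # Imported here so that parse_point() can be used without Spark installed.
    from pyspark.mllib.clustering import KMeans

    points = sc.parallelize(lines).map(parse_point)  # RDD of numeric vectors
    model = KMeans.train(points, k, maxIterations=max_iterations)
    return model.clusterCenters
```

With a real dataset, `sc.textFile(path).map(parse_point)` would replace the `parallelize` call, and the returned centers could then be plotted alongside the points.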

Overview of PySpark MLlib

PySpark ML libraries

PySpark MLlib algorithms

Collaborative filtering

Loading MovieLens dataset into RDDs

Model training and predictions

Model evaluation using MSE

Classification

Loading spam and non-spam data

Feature hashing and LabeledPoint

Logistic Regression model training

Loading and parsing the 5000 points data

K-means training

Visualizing clusters

Congratulations!
