Course

Machine Learning with PySpark

Skill level: Advanced

Updated 11/2025

Learn how to make predictions from data with Apache Spark, using decision trees, logistic regression, linear regression, ensembles, and pipelines.

Spark, Machine Learning

4 hr

16 videos

56 exercises

4,550 XP

29,676

Statement of Accomplishment

Course Description

Learn to Use Apache Spark for Machine Learning

Spark is a powerful, general-purpose tool for working with Big Data. Spark transparently handles the distribution of compute tasks across a cluster. This means that operations are fast, and it also lets you focus on the analysis rather than worry about technical details. In this course you'll learn how to get data into Spark and then delve into the core Spark Machine Learning topics: Linear Regression, Logistic Regression and other classifiers, and creating pipelines.

Build and Test Decision Trees

Building your own decision trees is a great way to start exploring machine learning models. You'll use an algorithm called 'Recursive Partitioning': find the predictor in your data that gives the most informative split between the two classes, divide the data on it, and repeat the process at each resulting node. You can then use your decision tree to make predictions with new data.
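
As a rough illustration of a single partitioning step, the sketch below scores every candidate split by weighted Gini impurity and keeps the lowest. It is plain Python rather than Spark code; the helper names (`gini`, `best_split`) and the toy flight rows are assumptions made up for this example, and Gini is only one of several possible impurity measures.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_impurity) for the most informative split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        # Try every observed value of the feature as a "<= threshold" split.
        for threshold in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[f] > threshold]
            if not left or not right:
                continue  # a split must send rows to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical toy data: (distance_km, departure_hour) -> delayed (1) or not (0)
rows = [(300, 8), (450, 9), (500, 18), (1200, 19), (1500, 7), (2000, 20)]
labels = [0, 0, 1, 1, 0, 1]
split = best_split(rows, labels)
print(split)
```

A full tree would call `best_split` again on the rows that fall on each side of the chosen threshold, stopping when a node is pure or too small to split.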

Master Logistic and Linear Regression in PySpark

Logistic and linear regression are essential machine learning techniques that are supported by PySpark. You'll learn to build and evaluate logistic regression models, then move on to linear regression models and learn how to narrow your predictors down to only the most relevant ones.

By the end of the course, you’ll feel confident in applying your new-found machine learning knowledge, thanks to hands-on tasks and practice data sets found throughout the course.

Prerequisites

Supervised Learning with scikit-learn

Introduction to PySpark

1

Introduction

Spark is a framework for working with Big Data. In this chapter you'll cover some background about Spark and Machine Learning. You'll then find out how to connect to Spark using Python and load CSV data.

Machine Learning & Spark

Characteristics of Spark

Components in a Spark Cluster

Connecting to Spark

Location of Spark master

Creating a SparkSession

Loading Data

Loading flights data

Loading SMS spam data
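
The connection and loading steps in this chapter might look roughly like the sketch below. The helpers `master_url` and `load_csv` are hypothetical names invented here, the file name in the usage note is a placeholder, and treating "NA" as the missing-value marker is an assumption about the data; only the `SparkSession` builder and `spark.read.csv` calls come from PySpark itself.

```python
def master_url(host=None, port=7077, cores="*"):
    """Build a Spark master URL: local mode when no host is given, else a standalone cluster."""
    if host is None:
        return f"local[{cores}]"      # e.g. local[*] uses all available cores
    return f"spark://{host}:{port}"   # 7077 is the default standalone master port

def load_csv(path, master=None, app_name="flights"):
    """Start (or reuse) a SparkSession and read a CSV file into a DataFrame."""
    # Imported here so the URL helper above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master or master_url())
        .appName(app_name)
        .getOrCreate()
    )
    # Assumption: the file has a header row and marks missing values as "NA".
    df = spark.read.csv(path, sep=",", header=True, inferSchema=True, nullValue="NA")
    return spark, df
```

For example, `spark, flights = load_csv("flights.csv")` would start a local session and load the file; call `spark.stop()` once you are finished with the session.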

2

Classification

Now that you are familiar with getting data into Spark, you'll move on to building two types of classification model: Decision Trees and Logistic Regression. You'll also find out about a few approaches to data preparation.

Data Preparation

Removing columns and rows

Column manipulation

Categorical columns

Assembling columns

Decision Tree

Train/test split

Build a Decision Tree

Evaluate the Decision Tree

Logistic Regression

Build a Logistic Regression model

Evaluate the Logistic Regression model

Turning Text into Tables

Punctuation, numbers and tokens

Stopwords and hashing

Training a spam classifier
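
The text-preparation steps listed above (stripping punctuation and numbers, tokenizing, removing stop words, hashing) can be sketched in plain Python as follows. This is a conceptual illustration only: the stop-word list is a tiny made-up sample, CRC32 stands in for whatever hash function Spark's own transformers use, and the function names are assumptions for this example.

```python
import re
import zlib

STOP_WORDS = {"a", "an", "the", "to", "is", "you", "and", "of"}  # tiny illustrative list

def tokenize(text):
    """Lower-case the text, replace punctuation and digits with spaces, split on whitespace."""
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    return cleaned.split()

def hash_features(tokens, num_features=16):
    """Map tokens to a fixed-length count vector using the hashing trick."""
    vector = [0] * num_features
    for token in tokens:
        if token in STOP_WORDS:
            continue  # stop words carry little signal, so skip them
        # Assumption: CRC32 is used here only because it is deterministic and built in.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

tokens = tokenize("WIN a FREE prize!!! Call 0800-123 to claim the prize.")
features = hash_features(tokens)
print(tokens)
print(features)
```

Each message becomes a row of numbers with a fixed width, which is the tabular form a classifier needs. A larger `num_features` reduces the chance that two different words collide in the same slot.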

3

Regression

Next you'll learn to create Linear Regression models. You'll also find out how to augment your data by engineering new predictors as well as a robust approach to selecting only the most relevant predictors.

One-Hot Encoding

Encoding flight origin

Encoding shirt sizes

Flight duration model: Just distance

Interpreting the coefficients

Flight duration model: Adding origin airport

Interpreting coefficients

Bucketing & Engineering

Bucketing departure time

Flight duration model: Adding departure time

Regularization

Flight duration model: More features!

Flight duration model: Regularization!
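
To make the bucketing and one-hot encoding ideas from this chapter concrete, here is a small plain-Python sketch. The 3-hour departure buckets and the helper names are assumptions for illustration; dropping the last level mirrors the common practice of omitting one redundant category, which you can switch off with `drop_last=False`.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the index of the bucket [splits[i], splits[i+1]) that contains value."""
    if not splits[0] <= value <= splits[-1]:
        raise ValueError(f"{value} is outside {splits[0]}..{splits[-1]}")
    # The last bucket is closed on the right so the upper bound itself is valid.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)

def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a 0/1 list, optionally dropping the last (redundant) level."""
    size = num_categories - 1 if drop_last else num_categories
    return [1 if i == index else 0 for i in range(size)]

splits = [0, 3, 6, 9, 12, 15, 18, 21, 24]   # hypothetical 3-hour departure buckets
bucket = bucketize(9.5, splits)             # a 09:30 departure
encoded = one_hot(bucket, len(splits) - 1)
print(bucket, encoded)
```

With the last level dropped, a departure in the final bucket is encoded as all zeros, so that bucket acts as the reference level against which the other coefficients are interpreted.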

4

Ensembles & Pipelines

Finally you'll learn how to make your models more efficient. You'll find out how to use pipelines to make your code clearer and easier to maintain. Then you'll use cross-validation to better test your models and select good model parameters. To close, you'll dabble in two types of ensemble model.

Flight duration model: Pipeline stages

Flight duration model: Pipeline model

SMS spam pipeline

Cross-Validation

Cross validating simple flight duration model

Cross validating flight duration model pipeline

Grid Search

Optimizing flights linear regression

Dissecting the best flight duration model

SMS spam optimised

How many models for grid search?

Delayed flights with Gradient-Boosted Trees

Delayed flights with a Random Forest

Evaluating Random Forest

Closing thoughts
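
The question of how many models a grid search trains comes down to simple arithmetic, sketched below in plain Python. The grid values and the fold count are made up for illustration, and `expand_grid` and `count_fits` are hypothetical helpers rather than part of any library; the parameter names follow those of PySpark's `LinearRegression`.

```python
from itertools import product

def expand_grid(grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

def count_fits(grid, num_folds):
    """Number of models trained during cross-validation: one per combination per fold."""
    return len(expand_grid(grid)) * num_folds

# Hypothetical grid over two linear regression parameters
grid = {"regParam": [0.01, 0.1, 1.0, 10.0], "elasticNetParam": [0.0, 0.5, 1.0]}
combos = expand_grid(grid)
total = count_fits(grid, num_folds=5)
print(len(combos), total)
```

A cross-validator typically refits the best combination once more on the full training set, which would add one model to this count. The total grows multiplicatively with each parameter added to the grid, so small grids are a sensible starting point.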
