Course

Feature Engineering with PySpark

Advanced skill level

Updated Jan 2026

Dive into the practical work of data cleaning and feature engineering, the core tasks on which data scientists spend 70-80% of their time.

Spark

Data Manipulation

4 hours

16 videos

60 exercises

5,000 XP

17,763

Statement of Accomplishment

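Course Description

Real-world data is always messy, and our job is to find the meaning in it. Even toy datasets such as MTCars and Iris, which have been carefully curated and cleaned, still need appropriate transformation before powerful machine learning algorithms can extract meaning from them and use them for prediction, classification, or clustering. This course covers the practical side of data cleaning and feature engineering, the work on which data scientists spend 70-80% of their time. With datasets growing ever larger, learn to tackle Big Data problems efficiently with PySpark!

Prerequisites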

Supervised Learning with scikit-learn

Introduction to PySpark

1

Exploratory Data Analysis

Get to know a bit about your problem before you dive in! Then learn how to statistically and visually inspect your dataset!

Where to Begin

Where to begin?

Check Version

Load in the data

Defining A Problem

What are we predicting?

Verifying Data Load

Verifying DataTypes

Visually Inspecting Data / EDA

Using Corr()

Using Visualizations: distplot

Using Visualizations: lmplot

2

Wrangling with Spark Functions

Real data is rarely clean and ready for analysis. In this chapter, learn to remove unneeded information, handle missing values, and add additional data to your analysis.

Dropping data

Dropping a list of columns

Using text filters to remove records

Filtering numeric fields conditionally

Adjusting Data

Custom Percentage Scaling

Scaling your scalers

Correcting Right Skew Data

Working with Missing Data

Visualizing Missing Data

Imputing Missing Data

Calculate Missing Percents

Getting More Data

A Dangerous Join

Spark SQL Join

Checking for Bad Joins

3

Feature Engineering

In this chapter, learn how to create new features for your machine learning model to learn from. We'll look at generating them by combining fields, extracting values from messy columns, or encoding them for better results.

Feature Generation

Differences

Deeper Features

Time Features

Time Components

Joining On Time Components

Extracting Features

Extracting Text to New Features

Splitting & Exploding

Pivot & Join

Binarizing, Bucketing & Encoding

Binarizing Day of Week

One Hot Encoding

4

Building a Model

In this chapter we'll learn how to choose which type of model we want. Then we will learn how to apply our data to the model and evaluate it. Lastly, we'll learn how to interpret the results and save the model for later!

Choosing the Algorithm

Which MLlib Module?

Creating Time Splits

Adjusting Time Features

Feature Engineering Assumptions for RFR

Feature Engineering For Random Forests

Dropping Columns with Low Observations

Naively Handling Missing and Categorical Values

Building a Model

Building a Regression Model

Evaluating & Comparing Algorithms

Understanding Metrics

Interpreting, Saving & Loading

Interpreting Results

Saving & Loading Models

Final Thoughts
