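Course

Feature Engineering for NLP in Python

Skill level: Advanced

Updated Nov 2024

Learn techniques for extracting useful information from text and processing it into a format suitable for machine learning.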

Python, Machine Learning

4 hours

15 videos

52 exercises

4,200 XP

29,225

Statement of Accomplishment

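Course Description

In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for ML models. Specifically, you will learn about POS tagging, named entity recognition (NER), readability scores, and the n-gram and tf-idf models, and how to implement them with scikit-learn and spaCy. You will also compute how similar two documents are to each other. Along the way, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. By the end of the course, you will be able to engineer key features from any text and tackle challenging problems in data science.

Prerequisites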

Introduction to Natural Language Processing in Python

Supervised Learning with scikit-learn

1

Basic features and readability scores

Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.

Introduction to NLP feature engineering

Data format for ML algorithms

One-hot encoding

Basic feature extraction

Character count of Russian tweets

Word count of TED talks

Hashtags and mentions in Russian tweets

Readability tests

Readability of 'The Myth of Sisyphus'

Readability of various publications
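The basic features and readability scores named in this chapter can be sketched in a few lines of plain Python. This is a minimal illustration, not the course's own code: the function names `basic_features`, `count_syllables`, and `flesch_reading_ease` are hypothetical, the sample tweet is made up, and the syllable counter is a crude vowel-run heuristic, so the readability score is only approximate.

```python
import re

def basic_features(text: str) -> dict:
    """Compute simple surface-level features of a text string."""
    words = text.split()
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Count tokens that start with '#' or '@' and have at least one more character.
        "num_hashtags": sum(1 for w in words if w.startswith("#") and len(w) > 1),
        "num_mentions": sum(1 for w in words if w.startswith("@") and len(w) > 1),
    }

def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count runs of vowels, minimum of one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease: higher scores indicate text that is easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * len(words) / len(sentences) - 84.6 * syllables / len(words)

# Made-up example inputs.
tweet = "Loving the new #NLP course by @datacamp! #python"
features = basic_features(tweet)
score = flesch_reading_ease("The cat sat. The dog ran.")
```

Splitting on whitespace keeps hashtags and mentions intact as single tokens, which is why the feature counts use `str.split` while the readability function uses a letters-only pattern.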

2

Text preprocessing, POS tagging and NER

In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. After mastering these concepts, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

Tokenization and Lemmatization

Identifying lemmas

Tokenizing the Gettysburg Address

Lemmatizing the Gettysburg Address

Text cleaning

Cleaning a blog post

Cleaning TED talks in a dataframe

Part-of-speech tagging

POS tagging in Lord of the Flies

Counting nouns in a piece of text

Noun usage in fake news

Named entity recognition

Named entities in a sentence

Identifying people mentioned in a news article
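The tokenization and text cleaning steps from this chapter can be illustrated without any external library. The sketch below is an assumption-laden stand-in: it uses a regular-expression tokenizer and a tiny hypothetical stopword set instead of spaCy, and it does not attempt lemmatization, POS tagging, or NER, which need a trained language model.

```python
import re

# Tiny illustrative stopword set (hypothetical; real stopword lists are much larger).
STOPWORDS = {"a", "an", "and", "the", "of", "on", "in", "to", "is"}

def tokenize(text: str) -> list[str]:
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def clean(text: str) -> list[str]:
    """Lowercase the text, then drop punctuation, non-alphabetic tokens, and stopwords."""
    tokens = tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]

text = "Four score and seven years ago our fathers brought forth on this continent, a new nation."
tokens = tokenize(text)   # words plus the comma and the final period
cleaned = clean(text)     # lowercase content words only
```

A library tokenizer handles contractions, hyphenation, and URLs far better than this pattern, so treat the regex only as a way to see what "machine-friendly" text looks like after cleaning.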

3

N-Gram models

Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

Building a bag of words model

Word vectors with a given vocabulary

BoW model for movie taglines

Analyzing dimensionality and preprocessing

Mapping feature indices with feature names

Building a BoW Naive Bayes classifier

BoW vectors for movie reviews

Predicting the sentiment of a movie review

Building n-gram models

n-gram models for movie taglines

Higher order n-grams for sentiment analysis

Comparing performance of n-gram models
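To make the bag-of-words and n-gram ideas in this chapter concrete, here is a minimal standard-library sketch. It is not scikit-learn's implementation: the helper names `ngrams` and `bag_of_ngrams` are hypothetical, tokenization is a plain whitespace split, and the two review snippets are made up.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def bag_of_ngrams(docs, n_min=1, n_max=2):
    """Build a sorted vocabulary and one count vector per document."""
    counts = [Counter(ngrams(doc.lower().split(), n_min, n_max)) for doc in docs]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

# Made-up review snippets.
docs = ["the movie was good", "the movie was not good"]

uni_vocab, uni_vectors = bag_of_ngrams(docs, 1, 1)   # unigrams only
bi_vocab, bi_vectors = bag_of_ngrams(docs, 1, 2)     # unigrams and bigrams
```

With unigrams alone, the two reviews differ only in the count for "not". Adding bigrams introduces features such as "not good", which is one reason higher-order n-grams can help sentiment models, at the cost of a larger vocabulary.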

4

TF-IDF and similarity scores

Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

Building tf-idf document vectors

tf-idf weight of commonly occurring words

tf-idf vectors for TED talks

Cosine similarity

Range of cosine scores

Computing dot product

Cosine similarity matrix of a corpus

Building a plot line based recommender

Comparing linear_kernel and cosine_similarity

Plot recommendation engine

The recommender function

TED talk recommender

Beyond n-grams: word embeddings

Generating word vectors

Computing similarity of Pink Floyd songs

Congratulations!
