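Feature Engineering for NLP in Python

Advanced skill level

Updated November 2024

Learn techniques to extract useful information from text and process it into a format suitable for machine learning.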

Python, Machine Learning

4 hours

15 videos

52 exercises

4,200 XP

29,225

Statement of Accomplishment

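Course Description

In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for machine learning models. Specifically, the course covers part-of-speech (POS) tagging, named entity recognition (NER), readability scores, and n-gram and tf-idf models, and shows how to implement them using scikit-learn and spaCy. You will also learn how to compute the similarity between two documents. Along the way, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. After completing the course, you will be able to engineer important features from any text and solve some of the challenging problems in data science.

Prerequisites

Introduction to Natural Language Processing in Python
Supervised Learning with scikit-learn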

1

Basic features and readability scores

Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.
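
As a rough illustration of the basic features named above, the following sketch computes them with plain Python string operations. The function name `basic_features`, the whitespace-based word splitting, and the sample tweet (including its handle) are assumptions made for this example, not material taken from the course.

```python
def basic_features(text: str) -> dict:
    """Compute simple surface-level features for one piece of text."""
    # Assumption: words are whitespace-separated chunks, punctuation included.
    words = text.split()
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # Count tokens that start with '#' or '@' and have at least one more character.
        "num_hashtags": sum(1 for w in words if w.startswith("#") and len(w) > 1),
        "num_mentions": sum(1 for w in words if w.startswith("@") and len(w) > 1),
    }


# Made-up sample tweet, used only for illustration.
tweet = "Loving the new #NLP course by @datacamp! #python"
features = basic_features(tweet)
print(features)
```

Each value is a single number, so the resulting dictionary could be added as extra columns next to other features in a dataframe.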

Introduction to NLP feature engineering

Data format for ML algorithms

One-hot encoding

Basic feature extraction

Character count of Russian tweets

Word count of TED talks

Hashtags and mentions in Russian tweets

Readability tests

Readability of 'The Myth of Sisyphus'

Readability of various publications
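
One widely used readability test is the Flesch reading ease score, defined as 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words), where higher scores indicate easier text. The sketch below is a minimal version of that formula. It is not necessarily the implementation or library the course uses, and its syllable counter is a naive vowel-group heuristic (an assumption of this example) that will miscount some words.

```python
import re


def count_syllables(word: str) -> int:
    """Naive heuristic: count vowel groups and discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch reading ease: higher scores mean the text is easier to read."""
    # Assumption: sentences end with '.', '!' or '?'; words are runs of letters.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


simple = "The cat sat on the mat. The dog ran to the park."
dense = (
    "Institutional heterogeneity complicates comparative evaluation "
    "of educational methodologies considerably."
)
print(flesch_reading_ease(simple), flesch_reading_ease(dense))
```

The short, monosyllabic sentences score far higher than the long, polysyllabic one, which is the behaviour a readability feature is meant to capture.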

2

Text preprocessing, POS tagging and NER

In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. After mastering these concepts, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

Tokenization and Lemmatization

Identifying lemmas

Tokenizing the Gettysburg Address

Lemmatizing the Gettysburg Address

Text cleaning

Cleaning a blog post

Cleaning TED talks in a dataframe

Part-of-speech tagging

POS tagging in Lord of the Flies

Counting nouns in a piece of text

Noun usage in fake news

Named entity recognition

Named entities in a sentence

Identifying people mentioned in a news article
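
The tokenization and text-cleaning steps in this chapter can be sketched with spaCy as follows. To keep the example free of model downloads it uses `spacy.blank("en")`, which provides only the rule-based tokenizer and lexical attributes. The helper name `clean` and the sample sentence are assumptions of this example. Lemmas, POS tags, and named entities need a trained pipeline, so those calls appear only as comments and assume that `en_core_web_sm` is installed.

```python
import spacy

# A blank English pipeline: tokenizer and lexical attributes only, no trained model.
nlp = spacy.blank("en")


def clean(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic, non-stopword tokens."""
    doc = nlp(text)
    return [token.lower_ for token in doc if token.is_alpha and not token.is_stop]


sentence = "The cats are running quickly, and 3 dogs bark!"
tokens = [token.text for token in nlp(sentence)]
cleaned = clean(sentence)
print(tokens)
print(cleaned)

# With a trained pipeline (assumption: en_core_web_sm has been downloaded),
# the same Doc object also exposes lemmas, POS tags, and named entities:
#   nlp = spacy.load("en_core_web_sm")
#   doc = nlp(sentence)
#   lemmas = [token.lemma_ for token in doc]
#   pos_tags = [(token.text, token.pos_) for token in doc]
#   entities = [(ent.text, ent.label_) for ent in doc.ents]
```

Filtering on `is_alpha` removes punctuation and numbers in one pass, which is often enough cleaning before building bag-of-words features.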

3

N-Gram models

Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

Building a bag of words model

Word vectors with a given vocabulary

BoW model for movie taglines

Analyzing dimensionality and preprocessing

Mapping feature indices with feature names

Building a BoW Naive Bayes classifier

BoW vectors for movie reviews

Predicting the sentiment of a movie review

Building n-gram models

n-gram models for movie taglines

Higher order n-grams for sentiment analysis

Comparing performance of n-gram models
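
A minimal sketch of the bag-of-words and n-gram workflow from this chapter, using scikit-learn's `CountVectorizer` and `MultinomialNB`. The four-review corpus and its labels are invented for this example, and a real sentiment model would need far more data and a held-out test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus invented for illustration: 1 = positive, 0 = negative.
reviews = [
    "a great and moving film",
    "great acting and a great story",
    "a dull and boring film",
    "boring plot and dull acting",
]
labels = [1, 1, 0, 0]

# ngram_range=(1, 2) builds features from both single words and word pairs.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(reviews)

clf = MultinomialNB()
clf.fit(X, labels)

# New text must be transformed with the already-fitted vocabulary.
new_reviews = ["great story", "boring film"]
predictions = clf.predict(vectorizer.transform(new_reviews))
print(predictions)
```

Raising the upper bound of `ngram_range` captures more word order but also increases the number of features quickly, which is the trade-off behind comparing n-gram models of different orders.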

4

TF-IDF and similarity scores

Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

Building tf-idf document vectors

tf-idf weight of commonly occurring words

tf-idf vectors for TED talks

Cosine similarity

Range of cosine scores

Computing dot product

Cosine similarity matrix of a corpus

Building a plot line based recommender

Comparing linear_kernel and cosine_similarity

Plot recommendation engine

The recommender function

TED talk recommender

Beyond n-grams: word embeddings

Generating word vectors

Computing similarity of Pink Floyd songs

Congratulations!
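
The tf-idf and cosine similarity ideas from this chapter can be sketched as a tiny plot-based recommender. The four plot summaries and the helper name `recommend` are assumptions made for this example. Because `TfidfVectorizer` L2-normalizes each row by default, `linear_kernel` (a plain dot product) returns the same scores as `cosine_similarity`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up plot summaries used only for illustration.
plots = [
    "A young wizard attends a school of magic",
    "A wizard and his friends fight a dark lord at the school of magic",
    "A detective investigates a murder in London",
    "Two detectives chase a killer through London",
]

tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(plots)

# Rows are unit-length, so the dot product equals the cosine similarity.
cosine_sim = cosine_similarity(matrix)
dot_sim = linear_kernel(matrix)


def recommend(index: int, sim: np.ndarray, top_n: int = 1) -> list[int]:
    """Return the indices of the top_n documents most similar to `index`."""
    order = np.argsort(-sim[index])  # highest similarity first
    return [int(i) for i in order if i != index][:top_n]


print(recommend(0, cosine_sim))
print(recommend(2, cosine_sim))
```

On a large corpus the dot-product form is usually preferred because it skips the normalization that the vectorizer has already applied.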
