Feature Engineering for NLP in Python

Skill level: Advanced

Updated 11/2024

Learn techniques to extract useful information from text and process it into a format suitable for machine learning.

Python, Machine Learning

4 hours

15 videos

52 exercises

4,200 XP

29,233

Certificate of achievement

Course description

In this course, you will learn techniques for extracting useful information from text and processing it into a format suitable for machine learning models. Specifically, you will learn about POS tagging, named entity recognition, readability scores, and the n-gram and tf-idf models, and how to implement them using scikit-learn and spaCy. You will also learn to compute how similar two documents are to each other. Along the way, you will predict the sentiment of movie reviews and build movie and TED Talk recommenders. By the end of the course, you will be able to engineer key features from text for use in data science problems.

Prerequisites

Introduction to Natural Language Processing in Python

Supervised Learning with scikit-learn

1

Basic features and readability scores

Learn to compute basic features such as number of words, number of characters, average word length and number of special characters (such as Twitter hashtags and mentions). You will also learn to compute readability scores and determine the amount of education required to comprehend a piece of text.

Introduction to NLP feature engineering

Data format for ML algorithms

One-hot encoding

Basic feature extraction

Character count of Russian tweets

Word count of TED talks

Hashtags and mentions in Russian tweets

Readability tests

Readability of 'The Myth of Sisyphus'

Readability of various publications
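
The features described in this chapter can be sketched with the standard library alone. The example below is a minimal illustration, not the course's own solution: `basic_features`, `count_syllables`, and `flesch_reading_ease` are hypothetical helper names, the sample tweet is made up, and the syllable counter is a crude vowel-group heuristic. The score itself follows the standard Flesch reading ease formula.

```python
import re


def basic_features(text):
    """Return simple surface features of a piece of text."""
    words = text.split()
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
        # A lone "#" or "@" is not counted as a hashtag or mention.
        "num_hashtags": sum(1 for w in words if w.startswith("#") and len(w) > 1),
        "num_mentions": sum(1 for w in words if w.startswith("@") and len(w) > 1),
    }


def count_syllables(word):
    """Approximate syllables as runs of vowels (assumption: rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Flesch reading ease: higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


features = basic_features("Loving #NLP with @spacy_io today")
score = flesch_reading_ease("The cat sat on the mat.")
```

A dedicated readability library will count syllables more accurately than this heuristic, so treat the score as an approximation.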

2

Text preprocessing, POS tagging and NER

In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. After mastering these concepts, you will make the Gettysburg Address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

Tokenization and Lemmatization

Identifying lemmas

Tokenizing the Gettysburg Address

Lemmatizing the Gettysburg Address

Text cleaning

Cleaning a blog post

Cleaning TED talks in a dataframe

Part-of-speech tagging

POS tagging in Lord of the Flies

Counting nouns in a piece of text

Noun usage in fake news

Named entity recognition

Named entities in a sentence

Identifying people mentioned in a news article
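
The text cleaning step from this chapter can be illustrated without any trained model. The sketch below is an assumption-laden stand-in rather than the course's spaCy code: `clean_text` is a hypothetical helper, whitespace splitting replaces a real tokenizer, and `STOPWORDS` is a deliberately tiny, made-up list. Lemmatization, part-of-speech tagging, and named entity recognition need a trained pipeline and are not reproduced here.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "that", "on"})


def clean_text(text, stopwords=STOPWORDS):
    """Lowercase, strip surrounding punctuation, and keep alphabetic non-stopwords."""
    cleaned = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        # isalpha() drops numbers, empty strings, and tokens such as "don't".
        if token.isalpha() and token not in stopwords:
            cleaned.append(token)
    return cleaned


tokens = clean_text("Four score and 7 years ago, our fathers brought forth a new nation!")
```

Dropping every non-alphabetic token is a blunt rule: it also discards contractions and hyphenated words, which a proper tokenizer would handle more carefully.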

3

N-Gram models

Learn about n-gram modeling and use it to perform sentiment analysis on movie reviews.

Building a bag of words model

Word vectors with a given vocabulary

BoW model for movie taglines

Analyzing dimensionality and preprocessing

Mapping feature indices with feature names

Building a BoW Naive Bayes classifier

BoW vectors for movie reviews

Predicting the sentiment of a movie review

Building n-gram models

n-gram models for movie taglines

Higher order n-grams for sentiment analysis

Comparing performance of n-gram models
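
To make the bag-of-words and n-gram ideas concrete, here is a from-scratch sketch using only the standard library. It is not the scikit-learn workflow the course teaches: `ngrams`, `build_vocabulary`, and `vectorize` are hypothetical helpers, tokenization is plain whitespace splitting, and the two example reviews are made up.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, n_min=1, n_max=1):
    """Collect every n-gram (n_min <= n <= n_max) seen in the corpus, sorted."""
    vocab = set()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(n_min, n_max + 1):
            vocab.update(ngrams(tokens, n))
    return sorted(vocab)


def vectorize(doc, vocab, n_min=1, n_max=1):
    """Map a document to a count vector aligned with the vocabulary."""
    tokens = doc.lower().split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        counts.update(ngrams(tokens, n))
    return [counts[term] for term in vocab]


reviews = ["the movie was good", "the movie was not good"]

unigram_vocab = build_vocabulary(reviews)
unigram_vectors = [vectorize(r, unigram_vocab) for r in reviews]

bigram_vocab = build_vocabulary(reviews, n_min=2, n_max=2)
bigram_vectors = [vectorize(r, bigram_vocab, n_min=2, n_max=2) for r in reviews]
```

With unigrams the two reviews differ only in the count for "not", while the bigram vocabulary contains "not good" and "was good" as separate features. That is the kind of context higher-order n-grams add, at the cost of a larger vocabulary.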

4

TF-IDF and similarity scores

Learn how to compute tf-idf weights and the cosine similarity score between two vectors. You will use these concepts to build a movie recommender and a TED Talk recommender. Finally, you will learn about word embeddings and use word vector representations to compute similarities between various Pink Floyd songs.

Building tf-idf document vectors

tf-idf weight of commonly occurring words

tf-idf vectors for TED talks

Cosine similarity

Range of cosine scores

Computing dot product

Cosine similarity matrix of a corpus

Building a plot line based recommender

Comparing linear_kernel and cosine_similarity

Plot recommendation engine

The recommender function

TED talk recommender

Beyond n-grams: word embeddings

Generating word vectors

Computing similarity of Pink Floyd songs

Congratulations!
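
The tf-idf and cosine similarity ideas in this chapter can be sketched from scratch as follows. This is an illustrative assumption, not the course's scikit-learn code: the weight used is the textbook form tf × ln(N / df), whereas library implementations typically apply smoothing and normalization, so their numbers will differ. `tfidf_vectors`, `cosine_similarity`, and `recommend` are hypothetical helpers, and the three short documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) using weight = tf * ln(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(index, vectors, top_n=1):
    """Indices of the top_n documents most similar to vectors[index]."""
    scores = [
        (cosine_similarity(vectors[index], v), j)
        for j, v in enumerate(vectors)
        if j != index
    ]
    scores.sort(key=lambda pair: -pair[0])
    return [j for _, j in scores[:top_n]]


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab, vectors = tfidf_vectors(docs)
closest = recommend(0, vectors)
```

Because "the" appears in every document, its weight is ln(3/3) = 0, which shows how tf-idf downweights common words. The recommender simply ranks the other documents by cosine similarity to the query document.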

In the following tracks

Machine Learning Scientist in Python

Natural Language Processing in Python

Instructor

Rounak Banik

Data Scientist at Fractal Analytics

Course resources

Russian Troll Tweets (dataset)

Movie Overviews and Taglines (dataset)

Preprocessed Movie Reviews (dataset)

TED Talk Transcripts (dataset)

Real and Fake News Headlines (dataset)
