Course Notes: Feature Engineering for NLP in Python
Part 1: Basic features and readability scores
Text Processing
Text processing covers the steps that prepare raw text data for analysis.
STANDARDIZE THE DATASET
- Converting words to lowercase: "Reduction" becomes "reduction"
- Converting words to their base form: "reduction" becomes "reduce"
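The two standardization steps above can be sketched in a few lines. This is only an illustration: the `BASE_FORMS` lookup table and the `standardize` helper are hypothetical stand-ins, and a real pipeline would normally use a lemmatizer from a library such as spaCy instead of a hand-written dictionary.

```python
# Hypothetical toy lookup standing in for a real lemmatizer (assumption, not from the notes).
BASE_FORMS = {"reduction": "reduce", "reducing": "reduce", "reduces": "reduce"}

def standardize(text):
    """Lowercase the text, split on whitespace, and map each word to its base form."""
    tokens = text.lower().split()
    # Words missing from the lookup are kept unchanged.
    return [BASE_FORMS.get(tok, tok) for tok in tokens]

print(standardize("Reduction REDUCING reduces noise"))
```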
Vectorization
Vectorization converts textual data into numerical data by representing each document as a vector.
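One simple way to picture vectorization is a bag-of-words count, sketched below in plain Python. The `bag_of_words` helper and the two sample sentences are assumptions made for illustration; the notes do not say which vectorizer the course uses at this point.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one word-count vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is every distinct word across all documents, in sorted order.
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the cat saw the dog"]  # hypothetical sample documents
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each position in a vector corresponds to one vocabulary word, so documents of different lengths end up with vectors of the same size.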
Basic Features
- Word Count
- Character Count
- Average word length
- Hashtag count
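As a rough sketch of the last item in the list, a hashtag count can be computed by splitting on whitespace and keeping the tokens that start with `#`. The `hashtag_count` helper and the sample text are hypothetical, and the rule that a lone `#` does not count is an assumption.

```python
def hashtag_count(text):
    """Count whitespace-separated tokens that start with '#'."""
    words = text.split()
    # Require at least one character after '#', so a lone '#' is ignored (assumption).
    hashtags = [word for word in words if word.startswith("#") and len(word) > 1]
    return len(hashtags)

print(hashtag_count("Loving the #NLP course on #Python # today"))  # 2
```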
POS Tagging
This extracts features from individual words.
Part-of-Speech Tagging
POS tagging labels each word with its part of speech. For example, in "I have a dog", "I" is a pronoun, "have" is a verb, "a" is a determiner, and "dog" is a noun.
Named Entity Recognition
NER identifies whether a particular noun refers to a person, an organization, or a country.
# Import any packages you want to use here
import pandas as pd
!pip install textatistic
!pip install spacy
!spacy download en_core_web_sm
# Perform one-hot encoding
df1 = pd.get_dummies(df1, columns=['feature 5'])
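The one-hot encoding call above assumes `df1` already exists. A minimal, self-contained sketch with a made-up DataFrame might look like the following; only the column name 'feature 5' comes from the notes, while the other column and all values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column name 'feature 5' comes from the notes.
df1 = pd.DataFrame({
    "feature 1": [10, 20, 30],
    "feature 5": ["female", "male", "female"],
})

# Replace the categorical column with one indicator column per category.
df1 = pd.get_dummies(df1, columns=["feature 5"])
print(df1.columns.tolist())
```

The original 'feature 5' column is dropped and replaced by one column per distinct value, each named with the 'feature 5_' prefix.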
Basic feature extraction
Character Count (Includes white space)
"I don't know." # 13 characters
text = "I don't know."
num_chars = len(text)
print(num_chars)
reviews = pd.read_csv('datasets/movie_reviews_clean.csv')
# Create a new feature that stores the character count of reviews
reviews['num_chars'] = reviews['review'].apply(len)
reviews['num_chars'].head()
Word Count
Assuming that every word is separated by a space, we can use the split() method to convert the text to a list where every element is a word.
text = "Mary had a little lamb."
words = text.split()
# Print the list containing words
print(words)
# Print number of words
print(len(words))
# To make things easier, let us convert it to a function
def word_count(text):
    # Split the text on whitespace to get a list of words
    words = text.split()
    # Return number of words
    return len(words)
reviews['num_words'] = reviews['review'].apply(word_count)
reviews['num_words'].head()
Average Word Length
Calculate the average word length of each review:
average word length = (sum of the lengths of all words) / (number of words in the review)
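The formula above can be sketched as a small helper in the same style as `word_count`. The name `avg_word_length` and the choice to return 0.0 for empty text are assumptions; note that punctuation attached to a word counts toward its length because the text is only split on whitespace.

```python
def avg_word_length(text):
    """Return the mean length of the whitespace-separated words in text."""
    words = text.split()
    # Guard against empty text to avoid dividing by zero (assumed behaviour).
    if not words:
        return 0.0
    total_length = sum(len(word) for word in words)
    return total_length / len(words)

print(avg_word_length("Mary had a little lamb."))
```

Assuming the `reviews` DataFrame loaded earlier, the helper could be applied with `reviews['review'].apply(avg_word_length)` to create a new feature column.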