Course Notes: Feature Engineering for NLP in Python

Part 1: Basic features and readability scores

Text Processing

Text preprocessing covers the steps taken to prepare text data for analysis.

STANDARDIZE THE DATASET

Converting words to lowercase: Reduction -> reduction
Converting words to their base form: reduction -> reduce
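
To see why the lowercasing step matters, here is a minimal sketch that counts a word before and after standardization. The sample sentence is made up for this illustration. Reducing words to their base form usually needs a linguistic tool and is not shown here.

```python
from collections import Counter

# Hypothetical sample sentence, chosen only for this illustration
text = "Reduction of noise and reduction of cost"

# Without lowercasing, "Reduction" and "reduction" are different tokens
raw_counts = Counter(text.split())

# After lowercasing, both spellings collapse into one token
lower_counts = Counter(text.lower().split())

print(raw_counts["reduction"])
print(lower_counts["reduction"])
```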

Vectorization

Vectorization converts textual data into numerical vectors that a machine learning model can work with.
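
As a rough illustration, the sketch below turns two short sentences into count vectors over a shared vocabulary (a bag-of-words representation). This is only one of several possible vectorization schemes. The helper name bag_of_words and the sample sentences are assumptions made for this example, not part of the course material.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors) for a list of text documents."""
    # Lowercase and split on whitespace: a deliberately naive tokenizer
    tokenized = [doc.lower().split() for doc in documents]

    # The vocabulary is the sorted set of all distinct tokens
    vocabulary = sorted({token for tokens in tokenized for token in tokens})

    # Each document becomes a list of counts, one entry per vocabulary word
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample documents
docs = ["the lion is the king", "the lion sleeps"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)
print(vectors)
```

Each position in a vector corresponds to one vocabulary word, so documents of different lengths end up as vectors of the same length.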

Basic Features

Word Count
Character Count
Average word length
Hashtag count
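
Character count, word count and average word length are worked through later in these notes. For the hashtag count, a minimal sketch could look like the following, assuming that a hashtag is any whitespace-separated word that starts with '#' and has at least one more character. The function name hashtag_count and the sample tweet are made up for this example.

```python
def hashtag_count(text):
    """Count the words in text that start with '#' and have at least one more character."""
    words = text.split()

    # Keep only the words that look like hashtags; a lone '#' is ignored
    hashtags = [word for word in words if word.startswith('#') and len(word) > 1]
    return len(hashtags)

# Hypothetical sample tweet
tweet = "Loving the #NLP course on #Python feature engineering"
print(hashtag_count(tweet))
```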

POS Tagging

This involves extracting features from individual words.

For example:

Parts of Speech Tagging

POS tagging labels each word with its part of speech.

Named Entity Recognition

Named entity recognition determines whether a particular noun refers to a person, an organization, or a country.

# Install the packages used in these notes (notebook shell commands)
!pip install textatistic
!pip install spacy
!python -m spacy download en_core_web_sm

# Import any packages you want to use here
import pandas as pd

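# Perform one-hot encoding on the categorical column 'feature 5'
# (df1 is assumed to be an existing DataFrame that contains this column)
df1 = pd.get_dummies(df1, columns=['feature 5'])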

Basic feature extraction

Character Count (Includes white space)

"I don't know." # 13 characters

text = "I don't know."
num_chars = len(text)
print(num_chars)

reviews = pd.read_csv('datasets/movie_reviews_clean.csv')
# Create a new feature that stores the character count of reviews
reviews['num_chars'] = reviews['review'].apply(len)
reviews['num_chars'].head()

Word Count

Assuming that every word is separated by a space, we can use the split() method to convert the text to a list where every element is a word.

text = "Mary had a little lamb."
words = text.split()

# Print the list containing words
print(words)

# Print number of words
print(len(words))

# To make this reusable, wrap the logic in a function
def word_count(text):
    # Split the text on whitespace to get a list of words
    words = text.split()

    # Return number of words
    return len(words)

reviews['num_words'] = reviews['review'].apply(word_count)
reviews['num_words'].head()

Average Word Length

Calculate the average word length of each review

Average word length = (sum of the lengths of all words in the review) / (number of words in the review)
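
The formula above can be sketched as a small function. This is a minimal version that, like the word count, assumes words are separated by whitespace, so punctuation attached to a word counts toward its length. The function name avg_word_length and the empty-text guard are choices made for this example.

```python
def avg_word_length(text):
    """Return the average length of the whitespace-separated words in text."""
    words = text.split()

    # Guard against empty text to avoid dividing by zero
    if not words:
        return 0.0

    # Sum of the lengths of all words, divided by the number of words
    total_length = sum(len(word) for word in words)
    return total_length / len(words)

print(avg_word_length("Mary had a little lamb."))
```

Assuming the reviews DataFrame loaded earlier, the function could be applied in the same way as word_count, for example with reviews['review'].apply(avg_word_length).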
