Part 1: Basic features and readability scores

Text Processing

Text processing covers the steps that prepare raw text data for analysis.

STANDARDIZE THE DATASET

  • Converting words to lowercase: "Reduction" becomes "reduction"
  • Converting words to their base form: "reduction" becomes "reduce"
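
The two standardization steps above can be sketched in plain Python. The `BASE_FORMS` lookup table and the `standardize` helper are hypothetical names used only for illustration; a real pipeline would normally get base forms from a library lemmatizer or stemmer rather than a hand-written table.

```python
# Toy lookup table of base forms (an assumption for illustration only;
# a real pipeline would use a lemmatizer or stemmer instead of a dict).
BASE_FORMS = {"reduction": "reduce"}


def standardize(text):
    """Lowercase each word, then map it to its base form if one is known."""
    words = text.lower().split()
    return [BASE_FORMS.get(word, word) for word in words]


print(standardize("Reduction"))  # ['reduce']
```

Lowercasing runs first so that "Reduction" and "reduction" hit the same entry in the lookup table.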

Vectorization

Vectorization converts textual data to numerical data by representing each text as a vector.
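
One simple way to do this is a bag-of-words count, sketched below in plain Python. The `bag_of_words` function and the sample sentences are assumptions made for illustration, not part of the original notes; library vectorizers offer many more options.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, vectors), where each vector holds word counts."""
    tokenized = [text.lower().split() for text in texts]
    # A sorted vocabulary gives every word a fixed position in the vector
    vocabulary = sorted({word for words in tokenized for word in words})
    vectors = []
    for words in tokenized:
        counts = Counter(words)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocabulary)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vectors)     # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each text becomes a list of numbers of the same length, so the texts can be compared or passed to a model.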

Basic Features

  • Word Count
  • Character Count
  • Average word length
  • Hashtag count
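
Word count, character count, and average word length are covered later in these notes. Hashtag count can be sketched along the same lines. The `hashtag_count` helper below is a hypothetical name, and it assumes a hashtag is any whitespace-separated token that starts with "#" and has at least one more character.

```python
def hashtag_count(text):
    """Count whitespace-separated tokens that look like hashtags."""
    words = text.split()
    # Keep tokens that start with '#' and are longer than the '#' itself
    hashtags = [word for word in words if word.startswith("#") and len(word) > 1]
    return len(hashtags)


print(hashtag_count("Loving the new #python release! #nlp #"))  # 2
```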

POS Tagging

This means extracting features from individual words.

For example:

Parts of Speech Tagging

POS tagging labels each word with its part of speech (noun, verb, adjective, and so on).

Named Entity Recognition

Named entity recognition determines whether a particular noun refers to a person, an organization, or a country.

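# Import any packages you want to use here
import pandas as pd

# Notebook shell commands: install the libraries and the spaCy English model
!pip install textatistic
!pip install spacy
!spacy download en_core_web_sm

# Perform one-hot encoding
# (assumes df1 is an existing DataFrame with a categorical column 'feature 5')
df1 = pd.get_dummies(df1, columns=['feature 5'])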

Basic feature extraction

Character Count (Includes white space)

"I don't know." # 13 characters

text = "I don't know."
num_chars = len(text)
print(num_chars)
reviews = pd.read_csv('datasets/movie_reviews_clean.csv')
# Create a new feature that stores the character count of reviews
reviews['num_chars'] = reviews['review'].apply(len)
reviews['num_chars'].head()

Word Count

Assuming that words are separated by whitespace, we can use the split() method to convert the text to a list in which every element is a word.

text = "Mary had a little lamb."
words = text.split()

# Print the list containing words
print(words)

# Print number of words
print(len(words))
# To make this reusable, wrap it in a function
def word_count(text):
    words = text.split()

    # Return number of words
    return len(words)
reviews['num_words'] = reviews['review'].apply(word_count)
reviews['num_words'].head()

Average Word Length

Calculate the average word length of each review:

average word length = (sum of the length of each word) / (number of words in the review)
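
The formula above can be sketched as a small function. The name `avg_word_length` and the choice to return 0.0 for empty text are assumptions made for illustration. Words are split on whitespace, so attached punctuation counts toward a word's length.

```python
def avg_word_length(text):
    """Return the mean number of characters per whitespace-separated word."""
    words = text.split()
    # Guard against empty text to avoid dividing by zero
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


print(avg_word_length("Mary had a little lamb."))  # 3.8
```

Assuming the `reviews` DataFrame loaded earlier, the same function could be applied per review with `reviews['review'].apply(avg_word_length)`, following the pattern used for the word count.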