In this course, you'll learn natural language processing (NLP) basics, such as how to identify and separate words, how to extract topics from a text, and how to build your own fake news classifier. You'll also learn how to use basic libraries such as NLTK, alongside libraries that use deep learning to solve common NLP problems. This course will give you the foundation to process and parse text as you move forward in your Python learning.
Regular expressions & word tokenization
This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and more difficult tokenization you might find.
Exercises: Introduction to regular expressions; Which pattern?; Practicing regular expressions: re.split() and re.findall(); Introduction to tokenization; Word tokenization with NLTK; More regex with re.search(); Advanced tokenization with NLTK and regex; Choosing a tokenizer; Regex with NLTK tokenization; Non-ascii tokenization; Charting word length with NLTK; Charting practice
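As a rough illustration of the regular-expression functions named in this chapter (re.split(), re.findall(), and re.search()), here is a minimal sketch using only Python's standard library. The sample sentence and the patterns are assumptions chosen for illustration, not taken from the course exercises, and the word pattern is far simpler than what a tokenizer such as NLTK's would do.

```python
import re

# Sample text, made up for illustration.
text = "Let's write RegEx! Won't that be fun? I sure think so."

# Split into sentences: break on whitespace that follows ., ? or !
sentences = re.split(r"(?<=[.?!])\s+", text)

# Naive word tokenization: runs of letters and apostrophes.
words = re.findall(r"[A-Za-z']+", text)

# Find the word-initial capitalized fragments.
capitalized = re.findall(r"\b[A-Z]\w*", text)

# re.search() returns the first match object (or None if nothing matches).
match = re.search(r"\bfun\b", text)

print(sentences)
print(words)
print(capitalized)
print(match.group(), match.start(), match.end())
```

Note that the capitalized-word pattern stops at the apostrophe, so "Let's" yields "Let". Handling contractions and punctuation properly is one reason to reach for a dedicated tokenizer.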
Simple topic identification
This chapter will introduce you to topic identification, which you can apply to any text you encounter in the wild. Using basic NLP models, you will identify topics from texts based on term frequencies. You'll experiment with and compare two simple methods, bag-of-words and tf-idf, using NLTK and a new library, Gensim.
Exercises: Word counts with bag-of-words; Bag-of-words picker; Building a Counter with bag-of-words; Simple text preprocessing; Text preprocessing steps; Text preprocessing practice; Introduction to gensim; What are word vectors?; Creating and querying a corpus with gensim; Gensim bag-of-words; Tf-idf with gensim; What is tf-idf?; Tf-idf with Wikipedia
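To make the two methods concrete, the sketch below builds a bag-of-words with collections.Counter and computes a textbook tf-idf weight by hand. The three documents, the tokenize helper, and the tf_idf function are hypothetical names and data invented for illustration. The weighting here is one common textbook variant; Gensim applies its own weighting and normalization, so its numbers may differ.

```python
import math
import re
from collections import Counter

# Toy documents, made up for illustration.
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "The dog and the cat are good pets.",
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

# Bag-of-words: one Counter of token frequencies per document.
bags = [Counter(tokenize(doc)) for doc in docs]

def tf_idf(term, bag, all_bags):
    """Textbook tf-idf: (count / document length) * log(N / document frequency)."""
    tf = bag[term] / sum(bag.values())
    df = sum(1 for b in all_bags if term in b)
    return tf * math.log(len(all_bags) / df) if df else 0.0

# Most frequent token in the first document.
print(bags[0].most_common(1))

# "the" appears in every document, so its tf-idf weight is zero;
# "mat" appears in only one, so it gets a positive weight.
print(tf_idf("the", bags[0], bags))
print(tf_idf("mat", bags[0], bags))
```

Raw counts rank "the" highest in the first document, while tf-idf pushes it to zero because it occurs everywhere. That contrast is the reason to compare the two methods.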
Named-entity recognition
This chapter will introduce a slightly more advanced topic: named-entity recognition. You'll learn how to identify the who, what, and where of your texts using pre-trained models on English and non-English text. You'll also learn how to use some new libraries, polyglot and spaCy, to add to your NLP toolbox.
Exercises: Named Entity Recognition; NER with NLTK; Charting practice; Stanford library with NLTK; Introduction to spaCy; Comparing NLTK with spaCy NER; spaCy NER Categories; Multilingual NER with polyglot; French NER with polyglot I; French NER with polyglot II; Spanish NER with polyglot
Building a "fake news" classifier
You'll apply the basics of what you've learned along with some supervised machine learning to build a "fake news" detector. You'll begin by learning the basics of supervised machine learning, and then move forward by choosing a few important features and testing ideas to identify and classify fake news articles.
Exercises: Classifying fake news using supervised learning with NLP; Which possible features?; Training and testing; Building word count vectors with scikit-learn; CountVectorizer for text classification; TfidfVectorizer for text classification; Inspecting the vectors; Training and testing a classification model with scikit-learn; Text classification models; Training and testing the "fake news" model with CountVectorizer; Training and testing the "fake news" model with TfidfVectorizer; Simple NLP, complex problems; Improving the model; Improving your model; Inspecting your model
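The chapter itself works with scikit-learn's CountVectorizer and TfidfVectorizer. As a dependency-free stand-in, the sketch below hand-rolls a word-count Naive Bayes classifier to show the general idea of supervised text classification. The headlines, the labels, and the fit and predict helpers are all invented for illustration. This is not the course's code, and it skips the train/test split that a real evaluation needs.

```python
import math
from collections import Counter, defaultdict

# Toy labelled headlines, invented for illustration only.
train = [
    ("scientists publish peer reviewed climate study", "REAL"),
    ("city council approves new budget", "REAL"),
    ("shocking miracle cure doctors hate", "FAKE"),
    ("you will not believe this shocking secret", "FAKE"),
]

def fit(samples):
    """Collect per-class word counts, class frequencies, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = fit(train)
print(predict("shocking secret cure", model))
print(predict("council approves climate budget", model))
```

With only four training headlines the model simply memorizes vocabulary. A realistic detector needs far more data, a held-out test set, and inspection of which words drive each prediction.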
Prerequisites: Python Data Science Toolbox (Part 2)
Katharine Jarmul
Katharine Jarmul runs a data analysis company called kjamistan that specializes in helping companies analyze data and in training others on data analysis best practices, particularly with Python. She has been using Python for 8 years for a variety of data work, including telling stories at major national newspapers, building large-scale aggregation software, making decisions based on customer analytics and marketing spend, and advising new ventures on the competitive landscape.