This course introduces students to the main steps of topic modeling: preparing a corpus, fitting topic models with the Latent Dirichlet Allocation (LDA) algorithm (from the topicmodels package), and visualizing the results with ggplot2 and wordclouds.
Quick introduction to the workflow
This chapter introduces the workflow used in topic modeling: preparation of a document-term matrix, model fitting, and visualization of results with ggplot2.
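The workflow described above can be sketched in a few lines of R. This is a minimal, illustrative example, not course material: the tiny `docs` data frame and its column names are invented here, and it assumes the tidytext, dplyr, topicmodels, and ggplot2 packages are installed.

```r
library(tidytext)
library(dplyr)
library(topicmodels)
library(ggplot2)

# Two toy documents (hypothetical data, for illustration only)
docs <- data.frame(
  doc_id = c(1, 2),
  text = c("cats chase mice", "dogs chase cats"),
  stringsAsFactors = FALSE
)

# Step 1: tokenize and count words per document
word_counts <- docs %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word)

# Step 2: cast the counts to a document-term matrix and fit a 2-topic LDA model
dtm <- word_counts %>% cast_dtm(doc_id, word, n)
mod <- LDA(dtm, k = 2, control = list(seed = 42))

# Step 3: extract per-topic word probabilities (beta) and visualize with ggplot2
terms <- tidy(mod, matrix = "beta")
ggplot(terms, aes(term, beta, fill = factor(topic))) +
  geom_col(position = "dodge") +
  coord_flip()
```

The same three steps (document-term matrix, model fit, visualization) recur throughout the course, only with larger corpora.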
Wordclouds, stopwords, and control arguments
This chapter explains how to use join functions to remove or keep words in the document-term matrix, how to make wordcloud charts, and how to use some of the many control arguments.

- Random nature of LDA algorithm (50 xp)
- Probabilities of words in topics (100 xp)
- Effect of argument alpha (100 xp)
- Manipulating the vocabulary (50 xp)
- Making a dtm - refresher (100 xp)
- Removing stopwords (100 xp)
- Keeping the needed words (100 xp)
- Word clouds (50 xp)
- Wordcloud of term frequency (100 xp)
- History of the Byzantine Empire (50 xp)
- LDA model fitting - first iteration (100 xp)
- Capturing the actions - dtm with verbs (100 xp)
- Making a chart (100 xp)
- Use wordclouds (100 xp)
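As a rough sketch of the join-based vocabulary manipulation covered here: `anti_join` drops stopwords, while `semi_join` would instead keep only words from a custom list. The toy `word_counts` table below is invented for illustration; the example assumes the tidytext, dplyr, and wordcloud packages are installed (`stop_words` is a lexicon shipped with tidytext).

```r
library(tidytext)
library(dplyr)
library(wordcloud)

# Hypothetical word counts (not course data)
word_counts <- data.frame(
  word = c("the", "byzantine", "empire", "emperor"),
  n = c(50, 12, 10, 7),
  stringsAsFactors = FALSE
)

# anti_join removes rows whose word appears in the stopword lexicon
cleaned <- word_counts %>% anti_join(stop_words, by = "word")

# Draw a wordcloud sized by term frequency
wordcloud(cleaned$word, cleaned$n, min.freq = 1)
```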
Named entity recognition as unsupervised classification
This chapter goes into detail on how LDA topic models can be used as classifiers. It covers the importance of the Dirichlet shape parameter alpha, construction of word contexts for named entities using regex, and technical issues like corpus alignment and held-out data.

- Using topic models as classifiers (50 xp)
- Same k, different alpha (100 xp)
- Probabilities of words in topics (100 xp)
- From word windows to dtm (50 xp)
- Regex patterns for entity matching (100 xp)
- Making a corpus (100 xp)
- From dtm to topic model (100 xp)
- Corpus alignment and classification (50 xp)
- Train a topic model (100 xp)
- Align corpus (100 xp)
- Classify test data (100 xp)
- Explore the results (50 xp)
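One piece of this chapter, the effect of alpha, can be sketched directly: alpha is the Dirichlet shape parameter for the per-document topic distribution, and smaller values push each document toward fewer topics. This example is a sketch under the assumption that fixing alpha via `estimate.alpha = FALSE` is acceptable for illustration; it uses the AssociatedPress document-term matrix that ships with topicmodels.

```r
library(topicmodels)

data("AssociatedPress")           # small dtm bundled with topicmodels
dtm <- AssociatedPress[1:20, ]    # subset to keep the fit fast

# Fix alpha at two values instead of estimating it
low  <- LDA(dtm, k = 2, control = list(seed = 1, alpha = 0.1, estimate.alpha = FALSE))
high <- LDA(dtm, k = 2, control = list(seed = 1, alpha = 10,  estimate.alpha = FALSE))

# posterior()$topics gives per-document topic probabilities (gamma):
# with low alpha they sit near 0/1; with high alpha they are more even
head(posterior(low)$topics)
head(posterior(high)$topics)
```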
How many topics is enough?
This chapter explains the basic methods used in the search for the optimal number of topics. It also covers how to use a single document as a source of data, and how topic numbering can be controlled using seed words.

- Finding the best number of topics (50 xp)
- Preparing the dtm (100 xp)
- Filtering by word frequency (100 xp)
- Fitting one model (100 xp)
- Using perplexity to find the best k (100 xp)
- Topic models fitted to novels (50 xp)
- Generating chunk numbers (100 xp)
- Inner join and cast dtm (100 xp)
- Finding the best value for k (50 xp)
- Locking topics by using seed words (50 xp)
- Topics without seedwords (100 xp)
- Topics with seedwords (100 xp)
- Final words (and more things to learn) (50 xp)
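The perplexity-based search for k can be sketched as follows: fit a model per candidate k on training documents, then score each on held-out documents with `perplexity()`; lower values suggest a better fit. The train/test split sizes and candidate k values below are arbitrary choices for illustration, again using the AssociatedPress data bundled with topicmodels.

```r
library(topicmodels)

data("AssociatedPress")
train <- AssociatedPress[1:40, ]   # training documents
test  <- AssociatedPress[41:50, ]  # held-out documents

ks <- c(2, 4, 8)                   # candidate numbers of topics
perp <- sapply(ks, function(k) {
  mod <- LDA(train, k = k, control = list(seed = 1))
  perplexity(mod, newdata = test)  # evaluate on held-out data
})

# Plot perp against ks and look for an elbow / minimum
perp
```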
Associate Director, Quantitative Analysis Center, Wesleyan University
Pavel Oleinikov uses his background in the social and natural sciences to advance the application of quantitative methods to data from the social world. He teaches courses on the basics of Big Data, network analysis, and text mining, as well as skills-focused courses. A large part of his work lies in assisting Wesleyan faculty with their diverse projects.