!pip install scrapy
!pip install pyLDAvis==3.3.0
!pip install bertopic==0.9.3
!python3 -m spacy download en_core_web_sm

Generic questions

"Self-identification" is worthy of comment here: I was the one actually grouping several self-descriptions under one, fairly uncommon category defined by me, "digital ethics".

A first attempt at clarifying the initial question would be to identify the trendy research topics in the field. I will try to outline the difficulties in answering this apparently simple question: dataset conception and collection, choice of treatment methods (is frequency a good way to identify the topic of a work, especially in short texts?), and changes of granularity.

I have extracted research topics from texts that were mostly a general presentation of the researcher, not an exclusive presentation of their research topics.

However, there could be other objects of analysis than topics, such as approaches, disciplines of origin, etc. There is a difference between the "topic of a document describing research" and the "topic of research". Topic identification refers to the former, so we will have to analyze what type of information about research actually comes out of our NLP processing.

Suggestions for further treatment

Should we add a disciplinary level of analysis in the NLP treatment? A linguistic level (in the future)?

Should we add a more predictive (opaque) model, if only to enable a discussion of explainability and bias? Training from scratch is not possible in this project, but transfer learning may be. What should be the object of such a model? Suggestion: topic detection to see which events are about digital ethics. Ideally, scrape them on a daily basis, put them on the website, and archive them for research and further online training. Again, explore the possibility of becoming multilingual. Could we also use Twitter (and other social networks, if they have the required API) as a source for events?

NLP treatment

Research topics: self-description

Do not limit the NLP treatment to research topics: see whether other pieces of information could also be exploited.

We should see whether we can find a method to streamline the application of the main treatments at different scales, i.e. to different databases.

Research topics: publication data

Projected pipeline:
1. Build a list of identifiers for all researchers in the database, to be used in advanced queries for "publications written by".
2. Automatically collect all publication titles (and their abstracts) for the researchers in the list built in step 1.
3. Subset the data on digital ethics (manual work?).
4. Perform NLP treatments for topic identification.

The main difficulty in step 1 will come from people who do not have a Google profile and/or who have a very common name.
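As a minimal sketch of steps 1 and 2, one possibility (not settled above) is the scholarly package; the 'Name' column and the handling of missing profiles are assumptions for illustration only.

# Sketch of pipeline steps 1-2 with the scholarly package (one possible tool).
# The 'Name' column is an assumption about our database.
import pandas as pd
from scholarly import scholarly

database = pd.read_csv('research_topics_list.csv')
names = database['Name'].dropna().unique()

publication_titles = {}
for name in names:
    search = scholarly.search_author(name)
    try:
        # Step 1: resolve the researcher to a Google Scholar profile
        author = scholarly.fill(next(search), sections=['publications'])
        # Step 2: collect the titles of their publications
        publication_titles[name] = [pub['bib']['title']
                                    for pub in author['publications']]
    except StopIteration:
        # No profile found: the difficulty anticipated for step 1
        publication_titles[name] = []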

This should enable a comparison with the results on the self-descriptions, on the principle "same treatment, other dataset". However, we should add various pieces of information to our current self-description data to enable comparison at different levels: an identifier (for the individual level) and a localisation (for the country level). Finally, the two datasets could be merged to give a new vision at all scales (individual, country, global).

See Octoparse, Parsehub and SerpAPI as potential tools for scraping Google Scholar data. It is important to insert artificial wait times (Octoparse) or to enable IP rotation in order to avoid Google anti-scraping tools such as reCAPTCHA. It can also be recommended to run the web scraper from the cloud, so that requests do not come from your own IP address. For all blocking-avoidance strategies, see https://serpapi.com/blog/how-to-reduce-chance-of-being-blocked-while-web/

https://python.plainenglish.io/scrape-google-scholar-with-python-fc6898419305 : script examples for various Google Scholar scraping tasks, including author info and Google Scholar profiles, but using BeautifulSoup and SerpAPI instead of Scrapy. It contains a function that performs a list of queries, which could take our list of identifiers from step 1 as an input.

For an example using Scrapy: https://www.scraperapi.com/blog/scrape-data-google-search-results/
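If we go the Scrapy route, the artificial wait times mentioned above map onto Scrapy's download-delay and AutoThrottle settings. A minimal sketch of a project settings file follows; the setting names are standard Scrapy, the values are only indicative, and IP rotation itself still requires a proxy service or middleware.

# Sketch of Scrapy settings (settings.py) to slow the crawler down and reduce
# the chance of being blocked; values are indicative only.
DOWNLOAD_DELAY = 5                   # artificial wait time between requests (seconds)
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay to look less robotic
AUTOTHROTTLE_ENABLED = True          # adapt the delay to the server's responses
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one request at a time per domain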

Topic Modeling

It would be good to present several topic modeling techniques. However, can we consider Latent Dirichlet Allocation a method distinct from probabilistic Latent Semantic Analysis, given that LDA can be presented as a Bayesian version of pLSA?
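To put several techniques side by side, here is a minimal sketch comparing LSA (gensim's LsiModel) and LDA on the same gensim corpus and dictionary built in the code cells below; the number of topics is only indicative.

# Sketch: fit LSA (LsiModel) and LDA on the same gensim corpus and dictionary
# (both are built in the code cells below) and compare the extracted topics.
from gensim.models import LsiModel, LdaModel

lsa_model = LsiModel(corpus=corpus, id2word=dictionary, num_topics=5)
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
                     random_state=100)

print("LSA topics:")
for topic in lsa_model.print_topics():
    print(topic)

print("LDA topics:")
for topic in lda_model.print_topics():
    print(topic)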

Automatic update and Future evolution

How should we save data to study the evolution of the field through time?

Automating updates on the website

Some of the outputs of our work should be updated regularly:
- the database, without the addresses;
- the analyses of the most frequent research topics (at least for the self-descriptions; harder for the publications);
- anything else?

Intermediate results should be saved to document the evolution of the field over time.
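A minimal sketch of such snapshotting (the folder layout and the example dataframe name are assumptions):

# Sketch: save dated snapshots of intermediate results so that the evolution
# of the field can be documented over time.
import os
from datetime import date
import pandas as pd

def save_snapshot(df, name, folder='snapshots'):
    """Write df to <folder>/<name>_<YYYY-MM-DD>.csv and return the path."""
    os.makedirs(folder, exist_ok=True)
    path = f"{folder}/{name}_{date.today().isoformat()}.csv"
    df.to_csv(path, index=False)
    return path

# Example with a hypothetical dataframe of topic frequencies:
# save_snapshot(topic_frequencies_df, 'topic_frequencies')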

Analysis of the most frequent research topics from the self-descriptions, and its interpretation

The tf-idf analysis is there to determine the specific thematic identity of a researcher, i.e. their most original research topic. This is not the same thing as their thematic identity per se (which would reflect the researcher's main research topic or topics), because that thematic identity may be expressed through topics that are widely shared in the corpus and therefore have a low tf-idf score. It is not certain that the naive interpretation of tf-idf as giving the topic of the document is correct in this respect. We should therefore compare the tf-idf results with a list of the most frequent terms in each document.
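A minimal sketch of that comparison, assuming scikit-learn (not used elsewhere in this notebook) and the research_topics_list_clean list built in the code cells below:

# Sketch: for each document, compare the most frequent terms with the highest
# tf-idf terms (assumes scikit-learn and research_topics_list_clean, built below).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vec = CountVectorizer(stop_words='english')
tfidf_vec = TfidfVectorizer(stop_words='english')

counts = count_vec.fit_transform(research_topics_list_clean)
tfidf = tfidf_vec.fit_transform(research_topics_list_clean)

count_terms = count_vec.get_feature_names_out()
tfidf_terms = tfidf_vec.get_feature_names_out()

for i in range(min(5, counts.shape[0])):  # first few documents only
    top_count = count_terms[np.argsort(counts[i].toarray().ravel())[::-1][:5]]
    top_tfidf = tfidf_terms[np.argsort(tfidf[i].toarray().ravel())[::-1][:5]]
    print(f"doc {i} | frequent: {list(top_count)} | tf-idf: {list(top_tfidf)}")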

We will have to think about the comparison with the topics emerging from the analysis of publication titles (and perhaps abstracts).

Visualization

Map visualization

Active and inactive researchers must be visually distinguished on the map.
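A minimal sketch with folium, assuming hypothetical 'Latitude', 'Longitude' and 'Active' columns in the database; only the colour distinction matters here.

# Sketch: colour researchers on a map by activity status (assumes folium and
# hypothetical 'Latitude', 'Longitude' and 'Active' columns in the database).
import folium
import pandas as pd

data = pd.read_csv('research_topics_list.csv')
m = folium.Map(location=[51.0, 10.0], zoom_start=5)  # centred on Germany

for _, row in data.dropna(subset=['Latitude', 'Longitude']).iterrows():
    folium.CircleMarker(
        location=[row['Latitude'], row['Longitude']],
        radius=5,
        color='green' if row.get('Active', False) else 'gray',
        fill=True,
    ).add_to(m)

m  # display the map in the notebook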

Can we define a sensible metric for the topics?

Visualization of research topics

What do we want here? A word representation where:
- the size of the words represents their frequency;
- the distance between words corresponds to a notion of semantic proximity.

The word cloud is a fairly crude "frequency as size" visualization: https://www.analyticsvidhya.com/blog/2021/05/how-to-build-word-cloud-in-python/

It is not certain that such a visualization would be fruitful if there are too many terms, or if no terms are much more frequent than the others.

The semantic map is a form of visualization of the key concepts of a document that can indicate certain relations between concepts, notably hierarchical ones. We need to see which option is right for us.
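One lightweight option, sketched here under the assumption that a term co-occurrence network is an acceptable stand-in for a semantic map: nodes are frequent tokens from tokenized_topics_clean (built in the code cells below) and edges link tokens that appear in the same self-description.

# Sketch of a simple "semantic map": a co-occurrence network where nodes are
# frequent tokens and edges link tokens appearing in the same self-description.
# Assumes networkx and the tokenized_topics_clean list built below.
from collections import Counter
from itertools import combinations
import matplotlib.pyplot as plt
import networkx as nx

term_counts = Counter(token for doc in tokenized_topics_clean for token in doc)
frequent = {term for term, count in term_counts.items() if count >= 3}

G = nx.Graph()
for doc in tokenized_topics_clean:
    for a, b in combinations(sorted(set(doc) & frequent), 2):
        if G.has_edge(a, b):
            G[a][b]['weight'] += 1   # co-occurs in one more document
        else:
            G.add_edge(a, b, weight=1)

pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos, node_size=50, font_size=8, edge_color='lightgray')
plt.axis('off')
plt.show()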

See Vosviewer for a piece of software specialized in the visualization of scientometric data, notably term clouds.

We should decide where the cleaning should be done: in the original database (for instance, for lines made only of blanks), in the pandas dataframe, or in the list produced from it. This depends on the semantics of dropna() and of the other data-cleaning methods.
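A minimal sketch of the dropna() point: it only drops actual NaN values, so lines made only of blanks survive unless they are first converted to NA.

# Sketch: dropna() removes NaN but not whitespace-only strings, so blank lines
# must be converted to NA first if the cleaning is done in pandas.
import pandas as pd

data = pd.read_csv('research_topics_list.csv')
col = data['Translated Research Topics']

only_nan_dropped = col.dropna()                                    # keeps "   "
fully_cleaned = col.replace(r'^\s*$', pd.NA, regex=True).dropna()  # drops "   " too

print(len(only_nan_dropped), len(fully_cleaned))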

# Word Frequency and Word Cloud Germany clean

# FREQUENCY ANALYSIS AND WORDCLOUD

# Import generic packages for data analysis and vizualisation
import pandas as pd
import matplotlib.pyplot as plt

# Import NLP packages
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from gensim.corpora.dictionary import Dictionary
import string
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')


# CODE BLOCK: clean and tokenize text data
# Download csv database
data = pd.read_csv('research_topics_list.csv')


# Select column containing self-description and convert it to list
research_topics_list = data['Translated Research Topics'].dropna().to_list()

# Clean data from whitespace and formatting characters
pattern = ('(\\n|\\xa0|\\t|\\u200b|\\u202f|\\xad|\\u2028|\\r)')
research_topics_list_clean = [re.sub(pattern, ' ', doc) for doc in research_topics_list]

# Tokenize words in lower-cased data
tokenized_topics = [word_tokenize(doc.lower())
                    for doc in research_topics_list_clean]


# Clean data from punctuation signs, stopwords and generic words
stop_words = stopwords.words("english")
stop_words.extend(['include', 'well', 'different', 'especially', 'new'])
extra_signs = ['``', "''", '”', '“', '"']
list_generic_words = ['digital', 'digitalization', 'digitalisation',
                      'ethics', 'ethical', 'political', 'social',
                      'society', 'legal', 'law', 'philosophy',
                      'sociology', 'studies', 'research', 'theory', 'methods',
                      'questions', 'context', 'project', 'analysis', 'areas',
                      'perspective', 'technology', 'technologies',
                      'artificial', 'intelligence', 'ai', 'focuses', 'focus',
                      'data', 'science', 'dissertation',
                      'interests', 'interested']


tokenized_topics_clean = [[token for token in doc
                           if token not in stop_words
                           and token not in list_generic_words
                           and token not in string.punctuation
                           and token not in extra_signs]
                          for doc in tokenized_topics]


# Convert list of lists of tokens into a flat list of tokens
token_list = [token for doc in tokenized_topics_clean for token in doc]


# CODE BLOCK: frequency analysis and word cloud visualization
# Compute and map most frequent tokens in data
token_frequency = nltk.FreqDist(token_list)
#print(token_frequency)
token_frequency.plot(20)
plt.show()

# Map each unique token to an integer id with a gensim Dictionary
dictionary = Dictionary(tokenized_topics_clean)


# Formation of a  gensim corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_topics_clean]


# WordCloud visualization of most frequent words in research topics
wordcloud = WordCloud(max_words=40, background_color='white').fit_words(
    token_frequency)


# Display the generated image:
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
# LATENT DIRICHLET ALLOCATION LDA Germany clean

# Import generic packages for data analysis and vizualisation

import pyLDAvis.gensim_models
from nltk.tokenize import word_tokenize
from gensim.corpora.dictionary import Dictionary
import pyLDAvis
import gensim
from gensim.models import CoherenceModel
from nltk.corpus import stopwords
import re
import pandas as pd
from pprint import pprint

# Import NLP packages
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# CODE BLOCK: cleaning and tokenizing text data

# Download csv database
database = pd.read_csv('research_topics_list.csv')

# Select column containing self-description and convert it to list
research_topics_list = database['Translated Research Topics'].dropna(
).to_list()

# Clean data from whitespace and formatting characters
pattern = ('(\\n|\\xa0|\\t|\\u200b|\\u202f|\\xad|\\u2028|\\r)')
research_topics_list_clean = [re.sub(pattern, ' ', doc) for doc in research_topics_list]


# Tokenize words in lower-cased data
tokenized_topics = [word_tokenize(doc.lower())
                    for doc in research_topics_list_clean]

# Clean data from various signs, stopwords and generic words
stop_words = stopwords.words("english")
stop_words.extend(['include', 'well', 'different', 'especially', 'new'])
extra_signs = ['``', "''", '”', '“', '"']
list_generic_words = ['digital', 'digitalization', 'digitalisation', 'ethics',
                      'ethical', 'political', 'social', 'society', 'legal',
                      'law', 'philosophy', 'sociology', 'studies', 'research',
                      'theory', 'methods',
                      'questions', 'context', 'project', 'analysis', 'areas',
                      'perspective', 'technology', 'technologies',
                      'artificial', 'intelligence', 'ai', 'focuses', 'focus',
                      'data', 'science', 'dissertation', 'interests',
                      'interested']

tokenized_topics_clean = [[token for token in doc
                           if token not in stop_words
                           and token not in extra_signs
                           and token not in list_generic_words
                           and token not in string.punctuation]
                          for doc in tokenized_topics]

# CODE BLOCK: formation of gensim corpus

# Convert list of lists of tokens into a flat list of tokens
token_list = [token for doc in tokenized_topics_clean for token in doc]


# Map each unique token to an integer id with a gensim Dictionary
dictionary = Dictionary(tokenized_topics_clean)

# Formation of a  gensim corpus
corpus = [dictionary.doc2bow(doc) for doc in tokenized_topics_clean]


# CODE BLOCK Latent Dirichlet Allocation

# Fix the parameter "number of topics"
num_topics = 5

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=dictionary,
                                       num_topics=num_topics)

# Print topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Visualization of LDA results with pyLDAvis package
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
# LATENT DIRICHLET ALLOCATION LDA Germany clean and fine-tuned

# Import generic packages for data analysis and vizualisation

import pyLDAvis.gensim_models
import spacy
from nltk.tokenize import word_tokenize
from gensim.corpora.dictionary import Dictionary
from gensim.utils import simple_preprocess
import pyLDAvis
import gensim
import gensim.corpora as corpora
from nltk.corpus import stopwords
import re
import pandas as pd
from pprint import pprint
import numpy as np
import tqdm
from gensim.models.coherencemodel import CoherenceModel

# Visualization modules
import matplotlib.pyplot as plt
import seaborn as sns


# Import NLP packages
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# CODE BLOCK: cleaning and tokenizing text data

# Download csv database
database = pd.read_csv('research_topics_list.csv')

# Select column containing self-description and convert it to list
research_topics_list = database['Translated Research Topics'].dropna(
).to_list()

# Function to remove punctuation and tokenize sentences into words
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  

research_topics_words = list(sent_to_words(research_topics_list))

# Build the bigram and trigram models
bigram = gensim.models.Phrases(research_topics_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[research_topics_words], threshold=100) 

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# Define extended stopwords 
stop_words = stopwords.words("english")
stop_words.extend(['include', 'well', 'different', 'especially', 'new', 'digital', 'digitalization', 'digitalisation', 'ethics',
                      'ethical', 'political', 'social', 'society', 'legal',
                      'law', 'philosophy', 'sociology', 'studies', 'research',
                      'theory', 'methods',
                      'questions', 'context', 'project', 'analysis', 'areas',
                      'perspective', 'technology', 'technologies',
                      'artificial', 'intelligence', 'ai', 'focuses', 'focus',
                      'data', 'science', 'dissertation', 'interests',
                      'interested'])
#extra_signs = ['``', "''", '”', '“', '”', '”', '``', '”', '"']

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out



# Remove Stop Words
research_topics_words_nostops = remove_stopwords(research_topics_words)

# Form Bigrams
research_topics_words_bigrams = make_bigrams(research_topics_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])
# Do lemmatization keeping only noun, adj, vb, adv
research_topics_lemmatized = lemmatization(research_topics_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

#print(research_topics_lemmatized[:10])

# Create Dictionary
id2word = corpora.Dictionary(research_topics_lemmatized)

# Create Corpus
texts = research_topics_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
#print(corpus[:1][0][:30])

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=10, 
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       per_word_topics=True)
# Print the Keyword in the 10 topics
#pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=research_topics_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

# supporting function
def compute_coherence_values(corpus, dictionary, k, a, b):
    
    lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=dictionary,
                                           num_topics=k, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=a,
                                           eta=b)
    
    coherence_model_lda = CoherenceModel(model=lda_model, texts=research_topics_lemmatized, dictionary=id2word, coherence='c_v')
    
    return coherence_model_lda.get_coherence()

grid = {}
grid['Validation_Set'] = {}

# Topics range
min_topics = 2
max_topics = 11
step_size = 1
topics_range = range(min_topics, max_topics, step_size)

# Alpha parameter
alpha = list(np.arange(0.01, 1, 0.3))
alpha.append('symmetric')
alpha.append('asymmetric')

# Beta parameter
beta = list(np.arange(0.01, 1, 0.3))
beta.append('symmetric')

# Validation sets
num_of_docs = len(corpus)
corpus_sets = [gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.75)), 
               corpus]

corpus_title = ['75% Corpus', '100% Corpus']

model_results = {'Validation_Set': [],
                 'Topics': [],
                 'Alpha': [],
                 'Beta': [],
                 'Coherence': []
                }

# Grid search over hyperparameters; can take a long time to run
run_grid_search = True
if run_grid_search:
    pbar = tqdm.tqdm(total=(len(beta)*len(alpha)*len(topics_range)*len(corpus_title)))
    
    # iterate through validation corpuses
    for i in range(len(corpus_sets)):
        # iterate through number of topics
        for k in topics_range:
            # iterate through alpha values
            for a in alpha:
                # iterate through beta values
                for b in beta:
                    # get the coherence score for the given parameters
                    cv = compute_coherence_values(corpus=corpus_sets[i], dictionary=id2word, 
                                                  k=k, a=a, b=b)
                    # Save the model results
                    model_results['Validation_Set'].append(corpus_title[i])
                    model_results['Topics'].append(k)
                    model_results['Alpha'].append(a)
                    model_results['Beta'].append(b)
                    model_results['Coherence'].append(cv)
                    
                    pbar.update(1)
    #pd.DataFrame(model_results).to_csv('./results/lda_tuning_results.csv', index=False)
    pbar.close()

model_results_df = pd.DataFrame(model_results)

model_results_df_filtered = model_results_df[(model_results_df['Alpha'] == 0.01)
                                             & (model_results_df['Beta'] == 0.01)]

sns.relplot(data=model_results_df_filtered, x='Topics', y='Coherence', kind='line')

num_topics = 3

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=num_topics, 
                                           random_state=100,
                                           chunksize=100,
                                           passes=10,
                                           alpha=0.91,
                                           eta=0.01)


# Visualization of LDA results with pyLDAvis package
pyLDAvis.enable_notebook()
pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word)
# BERTOPIC Germany

# Import necessary packages
import pandas as pd
import re
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bertopic import BERTopic

# Import database and extract column of interest
research_topics = pd.read_csv('research_topics_list.csv')



docs = research_topics['Translated Research Topics'].dropna().to_list()
clean_docs = [x for x in docs if x != 'Not available']


# Clean data from whitespace and formatting characters
pattern = ('(\\n|\\xa0|\\t|\\u200b|\\u202f|\\xad|\\u2028|\\r)')
super_clean_docs = [re.sub(pattern, ' ', doc) for doc in clean_docs]



# Tokenize words in lower-cased data
tokenized_docs = [word_tokenize(doc.lower()) for doc in super_clean_docs]


        
# Clean data from various signs, stopwords and generic words
stop_words = stopwords.words("english")
stop_words.extend(['include', 'well', 'different', 'especially', 'new'])
extra_signs = ['``', "''", '”', '“', '"']
list_generic_words = ['digital', 'digitalization', 'digitalisation', 'ethics',
                      'ethical', 'political', 'social', 'society', 'legal',
                      'law', 'philosophy', 'sociology', 'studies', 'research',
                      'theory', 'methods', 'questions', 'context', 'project',
                      'analysis', 'areas', 'perspective', 'technology',
                      'technologies', 'artificial', 'intelligence', 'ai',
                      'focuses', 'focus', 'data', 'science', 'dissertation',
                      'interests', 'interested']

super_clean_docs_list = [[token for token in doc
                          if token not in stop_words
                          and token not in extra_signs
                          and token not in list_generic_words
                          and token not in string.punctuation]
                         for doc in tokenized_docs]



# Convert list of lists of tokens into a flat list of tokens
super_clean_docs = [token for doc in super_clean_docs_list for token in doc]

# Concatenate the token list with itself 10 times to augment its size
super_clean_docs_dup = super_clean_docs * 10


# Instantiate Bertopic with specific embedding
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", nr_topics=5)
#embedding_model = "all-MiniLM-L6-v2"

#Fit model and transform data
topics, probs = topic_model.fit_transform(super_clean_docs_dup)



#Perform sanity check on results
#topic_model.get_topic_info()

#look into topic model i
#topic_model.get_topic(0)

#Look at representative docs for topic i
#topic_model.get_representative_docs(0)


# Interactive Intertopic distance visualization of topics
map_chart = topic_model.visualize_topics()
map_chart.show()

# Bar Visualization of main words

bar_chart = topic_model.visualize_barchart(n_words=10)
bar_chart.show()
# BERTOPIC France

# Import necessary packages
import pandas as pd
import re
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bertopic import BERTopic

# Import database and extract column of interest
research_topics = pd.read_csv('france_trans.csv')



docs = research_topics['Translated Research Topics'].dropna().to_list()
clean_docs = [x for x in docs if x != 'Not available']


# Clean data from whitespace and formatting characters
pattern = ('(\\n|\\xa0|\\t|\\u200b|\\u202f|\\xad|\\u2028|\\r)')
super_clean_docs = [re.sub(pattern, ' ', doc) for doc in clean_docs]



# Tokenize words in lower-cased data
tokenized_docs = [word_tokenize(doc.lower()) for doc in super_clean_docs]


        
# Clean data from various signs, stopwords and generic words
stop_words = stopwords.words("english")
stop_words.extend(['include', 'well', 'different', 'especially', 'new'])
extra_signs = ['``', "''", '”', '“', '"']
list_generic_words = ['digital', 'digitalization', 'digitalisation', 'ethics',
                      'ethical', 'political', 'social', 'society', 'legal',
                      'law', 'philosophy', 'sociology', 'studies', 'research',
                      'theory', 'methods', 'questions', 'context', 'project',
                      'analysis', 'areas', 'perspective', 'technology',
                      'technologies', 'artificial', 'intelligence', 'ai',
                      'focuses', 'focus', 'data', 'science', 'dissertation',
                      'interests', 'interested']

super_clean_docs_list = [[token for token in doc
                          if token not in stop_words
                          and token not in extra_signs
                          and token not in list_generic_words
                          and token not in string.punctuation]
                         for doc in tokenized_docs]



# Convert list of lists of tokens into a flat list of tokens
super_clean_docs = [token for doc in super_clean_docs_list for token in doc]

# Concatenate the token list with itself 10 times to augment its size
super_clean_docs_dup = super_clean_docs * 10


# Instantiate Bertopic with specific embedding
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", nr_topics=5)
#embedding_model = "all-MiniLM-L6-v2"

#Fit model and transform data
topics, probs = topic_model.fit_transform(super_clean_docs_dup)



#Perform sanity check on results
#topic_model.get_topic_info()

#look into topic model i
#topic_model.get_topic(0)

#Look at representative docs for topic i
#topic_model.get_representative_docs(0)


# Interactive Intertopic distance visualization of topics
map_chart = topic_model.visualize_topics()
map_chart.show()

# Bar Visualization of main words

bar_chart = topic_model.visualize_barchart(n_words=10)
bar_chart.show()
# BERTOPIC clean Germany

# Import necessary packages
from bertopic import BERTopic
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import pandas as pd
import re
import string
import nltk
nltk.download('punkt')
nltk.download('stopwords')

# CODE BLOCK: cleaning and tokenizing text data
# Import database
research_topics = pd.read_csv('research_topics_list.csv')


# Extract the column of research topics in English, drop NaN values and convert
# to list
docs = research_topics['Translated Research Topics'].dropna().to_list()


# Clean out "Not available" values
clean_docs = [x for x in docs if x != 'Not available']


# Clean data from whitespace and formatting characters
pattern = ('(\\n|\\xa0|\\t|\\u200b|\\u202f|\\xad|\\u2028|\\r)')
super_clean_docs = [re.sub(pattern, ' ', doc) for doc in clean_docs]


# Tokenize words in lower-cased data
tokenized_docs = [word_tokenize(doc.lower()) for doc in super_clean_docs]


# Clean data from various signs, stopwords and generic words
stop_words = stopwords.words("english")
stop_words.extend(['include', 'well', 'different', 'especially', 'new'])
extra_signs = ['``', "''", '”', '“', '"']
list_generic_words = ['digital', 'digitalization', 'digitalisation',
                      'ethics', 'ethical', 'political', 'social', 'society',
                      'legal', 'law', 'philosophy', 'sociology', 'studies',
                      'research', 'theory', 'methods', 'questions', 'context',
                      'project', 'analysis', 'areas', 'perspective',
                      'technology', 'technologies', 'artificial',
                      'intelligence', 'ai', 'focuses', 'focus', 'data',
                      'science', 'dissertation', 'interests', 'interested']

super_clean_docs_list = [[token for token in doc
                          if token not in stop_words
                          and token not in extra_signs
                          and token not in list_generic_words
                          and token not in string.punctuation]
                         for doc in tokenized_docs]


# Convert list of lists of tokens into a flat list of tokens
super_clean_docs = [token for doc in super_clean_docs_list for token in doc]



# CODE BLOCK: Bertopic
# Concatenate the token list with itself 10 times to augment its size
super_clean_docs_dup = super_clean_docs * 10


# Instantiate Bertopic with specific embedding
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2", nr_topics=5)


# Fit model and transform data
topics, probs = topic_model.fit_transform(super_clean_docs_dup)


# CODE BLOCK: visualization of Bertopic results
# Perform sanity check on results
topic_model.get_topic_info()

# Look into topic model 0
topic_model.get_topic(0)

# Look at representative docs for topic 0
topic_model.get_representative_docs(0)


# Interactive Intertopic distance visualization of topics
map_chart = topic_model.visualize_topics()
map_chart.show()

# Bar Visualization of main words
bar_chart = topic_model.visualize_barchart(n_words=10)
bar_chart.show()