Autism vs Autism Spectrum RSV Analysis Part 1

This notebook is Part 1 of the analysis of Google searches in different languages for "Autism" and "Autism spectrum disorder". Part 1 focuses on:

  • data cleaning
  • merging data frames

Reading data in

I downloaded a series of Google Trends data files to compare people's interest in "autism spectrum disorder" and "autism" across different languages.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

To make handling multiple data files easier, I created a function that lists the uploaded files, so the data can be read in more easily.

def scrape_filenames():
  '''
  lists the names of uploaded files
  cleans them and returns a list of file names
  '''
  string_list = !ls
  file_list = []

  for s in string_list:
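    # ls may separate names with runs of spaces or with tabs; turn each separator into a comma
    # (runs of spaces produce empty strings after the split, which are filtered out below)
    s = s.replace('  ', ',').replace('\t', ',')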
    file_list.extend(s.split(','))

  file_list = [s for s in file_list if len(s)>0]
  file_list = [s.strip() for s in file_list]

  return file_list
file_list = scrape_filenames()
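
The `!ls` syntax only works inside IPython/Jupyter. As a rough alternative, the sketch below lists the files with `pathlib` instead. `list_csv_files` is a hypothetical helper that is not part of the original notebook, and it assumes that every data file ends in `.csv`.

```python
from pathlib import Path

def list_csv_files(directory='.'):
  '''
  returns the sorted names of the .csv files in a directory
  (hypothetical helper, not part of the original notebook)
  '''
  return sorted(
    p.name for p in Path(directory).iterdir()
    if p.is_file() and p.suffix == '.csv'
  )
```

Because it keeps only `.csv` files, notebooks and folders never end up in the list, so no cleaning of the `ls` output is needed.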

The file names follow a convention:

  • _short.csv files contain relative search volume (RSV) data for "Autism" in a given language
  • _long.csv files contain RSV data for "Autism Spectrum Disorder" in a given language

I used Wikipedia to find the translations of both terms.

print(file_list)
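
The naming convention can also be used to split a file name into its language and term variant. This is only a sketch: `parse_filename` is a hypothetical helper that is not part of the original notebook, and it assumes names shaped like `arabic_long.csv`.

```python
def parse_filename(name):
  '''
  splits a name like 'arabic_long.csv' into (language, variant)
  (hypothetical helper; assumes the <language>_<variant>.csv convention)
  '''
  # drop the extension, then split on the last underscore
  stem = name.rsplit('.', 1)[0]
  language, _, variant = stem.rpartition('_')
  return language, variant

print(parse_filename('arabic_long.csv'))
print(parse_filename('english_short.csv'))
```

Splitting on the last underscore keeps a language name that itself contains an underscore in one piece.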

Cleaning and merging individual languages

def read_batch(file_list):
  '''
  takes a list of file names
  reads every .csv file in it
  and returns one data frame with a column per file, merged on date
  '''

  all_languages = None
  for name in file_list:
    # skip the notebooks and anything else that is not a csv export
    if not name.endswith('.csv'):
      continue
    # e.g. 'arabic_long.csv' -> 'arabic_long'
    col_name = name.split('.')[0]
    # the column headers in these exports are in Polish: 'Miesiąc' means 'Month'
    df = pd.read_csv(name, skiprows = 2, parse_dates=['Miesiąc'])
    df = df.rename(columns = {df.columns[1]:col_name})
    # the first file starts the frame, every later file is merged onto it
    if all_languages is None:
      all_languages = df
    else:
      all_languages = all_languages.merge(df, how='outer', on='Miesiąc')

  all_languages = all_languages.rename(columns = {'Miesiąc': 'date'})

  return all_languages
df_lang = read_batch(file_list)
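
To show what the outer merge on the date column does, here is a minimal sketch on two tiny data frames. The values are invented for illustration and are not real Google Trends data. Folding the merge with `functools.reduce` is one possible way to write it, not the method used in the notebook.

```python
from functools import reduce
import pandas as pd

# invented values for illustration only, not real Google Trends data
dates = pd.to_datetime(['2004-01-01', '2004-02-01', '2004-03-01'])
frames = [
  pd.DataFrame({'date': dates, 'english_short': [80, 75, 90]}),
  # this frame deliberately lacks the last month
  pd.DataFrame({'date': dates[:2], 'english_long': [5, 6]}),
]

# outer-merge the frames on the shared date column, one pair at a time
merged = reduce(
  lambda left, right: left.merge(right, how='outer', on='date'),
  frames,
)
merged = merged.set_index('date')
print(merged)
```

With `how='outer'`, a month that is missing from one file is kept and filled with NaN in that file's column instead of being dropped.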
df_lang.head()
df_lang.set_index('date', inplace=True)
# visuals and statistics for english short
english_short = df_lang['english_short']
english_long = df_lang['english_long']