Python Data Science Toolbox (Part 2)

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import matplotlib.pyplot as plt

# Import the course datasets 
world_ind = pd.read_csv('datasets/world_ind_pop_data.csv')
tweets = pd.read_csv('datasets/tweets.csv')

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

# Define count_entries()
def count_entries(csv_file, c_size, colname):
    """Return a dictionary with counts of
    occurrences as value for each key."""
    
    # Initialize an empty dictionary: counts_dict
    counts_dict = {}

    # Iterate over the file chunk by chunk
    for chunk in pd.read_csv(csv_file, chunksize=c_size):

        # Iterate over the column in DataFrame
        for entry in chunk[colname]:
            if entry in counts_dict:
                counts_dict[entry] += 1
            else:
                counts_dict[entry] = 1

    # Return counts_dict
    return counts_dict

# Call count_entries(): result_counts
result_counts = count_entries("datasets/tweets.csv", c_size=10, colname="lang")

# Print result_counts
print(result_counts)
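
The chunked counting above can be checked without the course files. The sketch below is only an illustration: it uses a small in-memory CSV with made-up rows (not the course's tweets data) and a hypothetical helper name, count_entries_chunked, and compares the chunk-by-chunk result with what value_counts() gives on the fully loaded column.

```python
import io

import pandas as pd

# Hypothetical stand-in for tweets.csv: five made-up rows held in memory
sample_csv = (
    "lang,text\n"
    "en,hello world\n"
    "en,RT hi there\n"
    "et,tere\n"
    "und,hmm\n"
    "en,goodbye\n"
)


def count_entries_chunked(csv_source, c_size, colname):
    """Count the values in colname while reading csv_source chunk by chunk."""
    counts = {}
    for chunk in pd.read_csv(csv_source, chunksize=c_size):
        for entry in chunk[colname]:
            # dict.get() supplies 0 the first time a value is seen
            counts[entry] = counts.get(entry, 0) + 1
    return counts


# Read two rows at a time so the loop sees several chunks
chunked_counts = count_entries_chunked(io.StringIO(sample_csv), c_size=2, colname="lang")

# Reference result: load everything at once and count with pandas
full_counts = pd.read_csv(io.StringIO(sample_csv))["lang"].value_counts().to_dict()

print(chunked_counts)
print(chunked_counts == full_counts)
```

Reading in chunks keeps only c_size rows in memory at a time, which is the reason to prefer it over value_counts() when the file is too large to load whole.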

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Create a zip object containing the CountryName and CountryCode columns in world_ind. Unpack the resulting zip object and print the tuple values.
  • Use a list comprehension to extract the first 25 characters of the text column of the tweets DataFrame, provided that the tweet is not a retweet (i.e., it does not start with "RT").
  • Create an iterable reader object so that you can use next() to read datasets/world_ind_pop_data.csv in chunks of 20 rows.
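
One possible way to approach the three exercises above is sketched below. To keep it runnable on its own, it builds small stand-in DataFrames and an in-memory CSV with made-up values; these are assumptions for illustration, and in the workspace you would use the world_ind and tweets DataFrames from the first cell and the real file path instead.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the course DataFrames (values are made up)
world_ind = pd.DataFrame({
    "CountryName": ["Arab World", "Caribbean small states", "Central Europe and the Baltics"],
    "CountryCode": ["ARB", "CSS", "CEB"],
})
tweets = pd.DataFrame({"text": [
    "RT @someone: this is a retweet and should be skipped",
    "An original tweet that is longer than twenty-five characters",
    "Short one",
]})

# 1. Zip two columns, then unpack the zip object with * into a list of tuples
pairs = zip(world_ind["CountryName"], world_ind["CountryCode"])
pairs_list = [*pairs]
for name, code in pairs_list:
    print(name, code)

# 2. Keep the first 25 characters of every tweet that is not a retweet
first_25 = [text[:25] for text in tweets["text"] if not text.startswith("RT")]
print(first_25)

# 3. Build a 45-row CSV in memory (stand-in for datasets/world_ind_pop_data.csv)
rows = "\n".join(f"Country {i},C{i:02d},{1960 + i}" for i in range(45))
csv_text = "CountryName,CountryCode,Year\n" + rows + "\n"

# With chunksize set, read_csv returns an iterable reader instead of a DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=20)
first_chunk = next(reader)   # rows 0 to 19
second_chunk = next(reader)  # rows 20 to 39
print(len(first_chunk), len(second_chunk))
```

A zip object is a one-shot iterator, so once it has been unpacked into pairs_list it is exhausted. The reader behaves the same way: each next() call moves on to the following 20 rows.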