Collecting data through the Twitter API

First, we need to authenticate ourselves with the Twitter API using Tweepy. Authentication requires two credential pairs: the consumer key and secret, which identify our application, and the access token and secret, which identify our account. This step must be completed before we can collect any data.

from tweepy import OAuthHandler
from tweepy import API

# Consumer key authentication
auth = OAuthHandler(consumer_key, consumer_secret)

# Access key authentication
auth.set_access_token(access_token, access_token_secret)

# Set up the API with the authentication handler
api = API(auth)

After successful authentication, we are ready to collect data from Twitter based on specific keywords. We use Tweepy's Stream class to do this. In this step, we specify the keywords we want to monitor and start collecting data that matches these keywords.

from tweepy import Stream

# Set up words to track
keywords_to_track = ["#rstats", "#python"]

# Instantiate the custom SListener object, which handles incoming tweets
listen = SListener(api)

# Instantiate the Stream object
stream = Stream(auth, listen)

# Begin collecting data
stream.filter(track=keywords_to_track)

With these steps, we have successfully set up authentication and initiated the collection of Twitter data based on specific keywords. This allows us to gather data relevant to our interests or projects.
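The SListener object used above is not part of Tweepy itself; it is a custom listener class (in Tweepy v3.x, typically a StreamListener subclass) that decides what to do with each incoming tweet. A rough standalone sketch of its behavior; the file-naming scheme and the fprefix parameter are assumptions:

```python
import time

class SListener:
    """Sketch of a stream listener. In real use this would subclass
    tweepy.streaming.StreamListener (Tweepy v3.x) so the Stream object
    can call its methods. Each raw tweet JSON line is appended to a
    timestamped file."""

    def __init__(self, api=None, fprefix='streamer'):
        self.api = api
        fname = fprefix + '.' + time.strftime('%Y%m%d-%H%M%S') + '.json'
        self.output = open(fname, 'a')

    def on_data(self, data):
        # Called for every incoming message; skip keep-alive blank lines
        if data.strip():
            self.output.write(data)
        return True  # keep the stream open

    def on_error(self, status_code):
        # Returning False disconnects; HTTP 420 means we are rate limited
        return status_code != 420
```

Writing the raw JSON lines to a file means the stream can run unattended and the data can be parsed later, which is the pattern the rest of this section follows.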

In this section, we load a single tweet stored as a JSON string (tweet_json), convert it into a Python dictionary with json.loads(), and access various parts of it: the tweet's text and unique ID, plus user-level fields such as the handle, follower count, location, and description. We then examine a retweet (rt): its text, the text of the original tweet that was retweeted, the handle of the user who retweeted it, and the handle of the user who posted the original tweet.

# Import the json module
import json

# Convert from JSON to Python object
tweet = json.loads(tweet_json)

# Print tweet text
print(tweet['text'])

# Print tweet id
print(tweet['id'])

# Print user handle
print(tweet['user']['screen_name'])

# Print user follower count
print(tweet['user']['followers_count'])

# Print user location
print(tweet['user']['location'])

# Print user description
print(tweet['user']['description'])

# Print the text of the retweet
print(rt['text'])

# Print the text of the original tweet that was retweeted
print(rt['retweeted_status']['text'])

# Print the handle of the user who retweeted
print(rt['user']['screen_name'])

# Print the handle of the original tweet's author
print(rt['retweeted_status']['user']['screen_name'])
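The snippets above assume tweet_json (a single tweet as a JSON string) and rt (a parsed retweet) already exist. A self-contained toy example, with all field values invented for illustration:

```python
import json

# Hypothetical minimal tweet payload; real Twitter JSON has many more fields
tweet_json = json.dumps({
    "id": 1234567890,
    "text": "Learning data science with #rstats and #python",
    "user": {
        "screen_name": "data_fan",
        "followers_count": 42,
        "location": "Internet",
        "description": "Tweets about data"
    }
})

# Convert the JSON string into a nested Python dictionary
tweet = json.loads(tweet_json)

print(tweet['text'])                 # Learning data science with #rstats and #python
print(tweet['user']['screen_name'])  # data_fan
```

Once parsed, every field is reached by ordinary dictionary indexing, with user-level fields nested under the 'user' key.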

Processing Twitter Text

Tweet Items and Tweet Flattening

Tweets carry textual data in several fields of the Twitter JSON: the tweet text itself, the user's description, and the user's location. Two complications are worth noting: extended tweets, which hold the full text of messages longer than 140 characters, and quoted tweets, which contain both the original tweet's text and the quoting user's commentary.

# Print the tweet text
print(quoted_tweet['text'])

# Print the quoted tweet text
print(quoted_tweet['quoted_status']['text'])

# Print the quoted tweet's extended (140+) text
print(quoted_tweet['quoted_status']['extended_tweet']['full_text'])

# Print the quoted user location
print(quoted_tweet['quoted_status']['user']['location'])
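Not every tweet carries an extended_tweet key, so indexing it directly can raise a KeyError. One defensive pattern, shown here on an invented sample that lacks extended text, is to chain dict.get() calls and fall back to the truncated text field:

```python
# Hypothetical quoted tweet with no 'extended_tweet' key
quoted_tweet = {
    "text": "Great thread!",
    "quoted_status": {
        "text": "Short original tweet",
        "user": {"location": "Earth"}
    }
}

status = quoted_tweet.get('quoted_status', {})
# Use full_text when present, otherwise fall back to the truncated text
full_text = status.get('extended_tweet', {}).get('full_text',
                                                 status.get('text'))
print(full_text)  # Short original tweet
```

The same guard is what the flatten_tweets() function below relies on when it checks key membership before copying nested fields.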

A Tweet Flattening Function

In Twitter analysis, we often deal with hundreds or thousands of tweets. To streamline processing, we define a flatten_tweets() function that lifts the nested JSON fields we care about into a top-level dictionary. We'll reuse this function frequently, adjusting it as needed for different analyses.

def flatten_tweets(tweets_json):
    """ Flattens out tweet dictionaries so relevant JSON
        is in a top-level dictionary."""
    tweets_list = []
    
    # Iterate through each tweet
    for tweet in tweets_json:
        tweet_obj = json.loads(tweet)
    
        # Store the user screen name in 'user-screen_name'
        tweet_obj['user-screen_name'] = tweet_obj['user']['screen_name']
    
        # Check if this is a 140+ character tweet
        if 'extended_tweet' in tweet_obj:
            # Store the extended tweet text in 'extended_tweet-full_text'
            tweet_obj['extended_tweet-full_text'] = tweet_obj['extended_tweet']['full_text']
    
        if 'retweeted_status' in tweet_obj:
            # Store the retweet user screen name in 'retweeted_status-user-screen_name'
            tweet_obj['retweeted_status-user-screen_name'] = tweet_obj['retweeted_status']['user']['screen_name']

            # Store the retweet text in 'retweeted_status-text'
            tweet_obj['retweeted_status-text'] = tweet_obj['retweeted_status']['text']
            
        tweets_list.append(tweet_obj)
    return tweets_list

Loading Tweets into a DataFrame

Now let's load this data into a pandas DataFrame for scalable tweet analysis. We'll work with a dataset of tweets containing either the '#rstats' or '#python' hashtag, stored as a list of tweet JSON strings in data_science_json.

# Import pandas
import pandas as pd

# Flatten the tweets and store in `tweets`
tweets = flatten_tweets(data_science_json)

# Create a DataFrame from `tweets`
ds_tweets = pd.DataFrame(tweets)

# Print the text of the first 5 tweets in this dataset
print(ds_tweets['text'].values[0:5])
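Once the tweets are in a DataFrame, pandas string methods make keyword checks straightforward. A small sketch on invented rows shaped like flatten_tweets() output:

```python
import pandas as pd

# Hypothetical flattened tweets; field values invented for illustration
tweets = [
    {"text": "Tidy data in #rstats", "user-screen_name": "alice"},
    {"text": "pandas tricks in #python", "user-screen_name": "bob"},
    {"text": "#python and #rstats together", "user-screen_name": "carol"},
]
ds_tweets = pd.DataFrame(tweets)

# Case-insensitive check for the hashtag in each tweet's text
has_python = ds_tweets['text'].str.contains('#python', case=False)
print(has_python.sum())  # 2
```

The boolean Series returned by str.contains() can also be used to filter the DataFrame down to only the matching tweets.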