Capturing Tweets from Twitter's Streaming Endpoints

The code for this project isn’t super complex, but it requires a little knowledge of how Twitter streams data from its Streaming APIs, so I thought I’d walk through it for folks who are unfamiliar. We’ll use the python-twitter library to interact with Twitter’s API. You can install it via pip with:

pip install python-twitter==3.1

or by cloning the repository from GitHub:

git clone git@github.com:bear/python-twitter.git
cd python-twitter
pip install .

Whichever route you choose (pip is generally preferred), you should now have a working install of python-twitter and be ready to continue.
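If you want a quick sanity check that the install worked, you can import the library and print its version string (recent releases set a __version__ attribute on the package):

import twitter

# If this runs without an ImportError, the library is installed;
# the version should match what you asked pip for above.
print(twitter.__version__)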

If you’re used to working with Tweepy or Twython, both of which provide a “StreamHandler”-style class, note that python-twitter instead provides a few methods on the main API instance that you wrap in your own handler. This means a little extra boilerplate code, but it’s usually not bad and (I think) ends up being a little simpler to use in most cases. So let’s get started.
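To make that concrete: the “handler” is usually just a function that you call from a loop over the stream generator. A minimal sketch of the pattern (the function names here are my own, purely for illustration):

def handle_message(message):
    # Do whatever you like with each raw stream message.
    print(message)

def run_stream(api, terms):
    # python-twitter hands back a generator of raw messages, so
    # "wrapping it in a handler" is just looping and dispatching.
    for message in api.GetStreamFilter(track=terms):
        handle_message(message)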

We’ll instantiate the API instance first with our application keys/tokens:

import twitter
api = twitter.Api(
    consumer_key=[consumer key],
    consumer_secret=[consumer secret],
    access_token_key=[access token],
    access_token_secret=[access token secret])

(If you don’t have credentials, check out the Getting Started guide, which walks you through creating an application.)
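As an aside, rather than pasting keys directly into a script, you can pull them from environment variables. Here’s a quick sketch; the variable names are my own convention, not anything python-twitter requires:

import os
import twitter

api = twitter.Api(
    consumer_key=os.environ['TWITTER_CONSUMER_KEY'],
    consumer_secret=os.environ['TWITTER_CONSUMER_SECRET'],
    access_token_key=os.environ['TWITTER_ACCESS_TOKEN'],
    access_token_secret=os.environ['TWITTER_ACCESS_TOKEN_SECRET'])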

Next up, let’s say we want to archive all tweets containing a hashtag; let’s go with “#TheHighlanderWasAGreatMovie” (true and pithy!). We have a few ways of interacting with Twitter’s streaming API endpoints, so let’s go through those:

statuses/filter.json: good for tracking public tweets matching certain words or phrases, tweeted by certain people, or tweeted from a location. Method call: api.GetStreamFilter()
statuses/sample.json: good for getting a large volume of random public tweets. Method call: api.GetStreamSample()
user.json: good for getting tweets/events/etc. specific to the authenticated user. Method call: api.GetUserStream()

And that’s it! There are really only three streams with which we can interact (at least without additional permission from Twitter). Since we want to track tweets by more than one person (so not the user stream) and we want to track a specific word or phrase, we’ll use the GetStreamFilter() method, which returns a generator.
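Because GetStreamFilter() is a generator, creating it is cheap; the streaming request happens once you start iterating. If you just want to see a single message, you can pull one off with next() (a sketch, assuming the api object from above):

stream = api.GetStreamFilter(track=["#TheHighlanderWasAGreatMovie"])

# next() blocks until the first matching tweet arrives, then returns it
first_message = next(stream)
print(first_message)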

GetStreamFilter() can take a few arguments:

track: filters the stream to a list of words or phrases
follow: filters the stream to the public events/tweets of a list of users (specified by user ID)
locations: filters the stream to a list of geographic bounding boxes, each given as longitude,latitude pairs for the box’s southwest and northeast corners (e.g. “-122.75,36.8,-121.75,37.8”)

We’re not going to get anywhere near the limits on the number of phrases, accounts, or locations you can track, so don’t worry about that. If you’re curious, you can check out Twitter’s documentation here.
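For reference, the three keywords can be combined in a single call. Here’s a sketch; the user IDs and coordinates below are made up for illustration:

stream = api.GetStreamFilter(
    track=["#TheHighlanderWasAGreatMovie"],  # words/phrases to match
    follow=["12345", "67890"],  # user IDs, passed as strings
    locations=["-122.75,36.8", "-121.75,37.8"])  # SW then NE corner of a box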

So, continuing on. We’ve imported our library, created our Api instance, and we’re ready to track our hashtag. GetStreamFilter() expects a list of strings to track, so we’ll store our hashtag in a list for later reference. (If we were tracking a large number of tags, it’d get tedious to read, so I tend to put whatever we’re tracking in a list and pass that around.)

import twitter
api = twitter.Api(
    consumer_key=[consumer key],
    consumer_secret=[consumer secret],
    access_token_key=[access token],
    access_token_secret=[access token secret])

hashtags_to_track = [
    "#TheHighlanderWasAGreatMovie",
]

stream = api.GetStreamFilter(track=hashtags_to_track)
for line in stream:
    print(line)

If you’re tracking a hashtag with a large volume of tweets, you’ll probably see data returned pretty quickly. Since python-twitter doesn’t enforce any data types on the stream (i.e., the code above yields raw JSON dictionaries rather than twitter.Status objects), let’s convert each message into a twitter.Status object now:

import twitter
api = twitter.Api(
    consumer_key=[consumer key],
    consumer_secret=[consumer secret],
    access_token_key=[access token],
    access_token_secret=[access token secret])

hashtags_to_track = [
    "#TheHighlanderWasAGreatMovie",
]

stream = api.GetStreamFilter(track=hashtags_to_track)
for line in stream:
    # A message containing an 'in_reply_to_status_id' key is an actual
    # tweet (the stream also emits other events, such as delete notices)
    if 'in_reply_to_status_id' in line:
        tweet = twitter.Status.NewFromJsonDict(line)

        # Let's only print the user & text of the tweet for now
        print("User: {user}, Tweet: '{tweet}'".format(
            user=tweet.user.screen_name,
            tweet=tweet.text))

As noted, if you’re tracking a large-volume hashtag or a lot of users, you’ll probably get a fair amount of data back immediately. If you want to check that you can connect and that your code is working properly, replace

stream = api.GetStreamFilter(track=hashtags_to_track)

with

stream = api.GetStreamSample()

and you’ll start receiving a large number of random tweets almost immediately.
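If you’d rather not flood your terminal while testing, you can also break out of the loop after a handful of messages. A quick sketch:

stream = api.GetStreamSample()

# Pull ten messages as a smoke test, then disconnect
for i, line in enumerate(stream):
    print(line)
    if i >= 9:
        break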

That’s fine if you just want to display a bunch of tweets in the terminal, but we can also store them to analyze later, to find film connoisseurs with tastes as refined as ours. What storage format you use is up to you, but for now, let’s store our data in a CSV file. Luckily, Python’s csv module is pretty straightforward.

import csv
import twitter
api = twitter.Api(
    consumer_key=[consumer key],
    consumer_secret=[consumer secret],
    access_token_key=[access token],
    access_token_secret=[access token secret])

hashtags_to_track = [
    "#TheHighlanderWasAGreatMovie",
]

stream = api.GetStreamFilter(track=hashtags_to_track)

# newline='' keeps the csv module from inserting extra blank lines on Windows
with open('tweets.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    for line in stream:
        # A message containing an 'in_reply_to_status_id' key is an actual tweet
        if 'in_reply_to_status_id' in line:
            tweet = twitter.Status.NewFromJsonDict(line)
            print(tweet.id)
            row = [tweet.id, tweet.user.screen_name, tweet.text]
            csv_writer.writerow(row)

And that’s basically it at this point! If you don’t want to store all these tweets locally, you can set up a database and run this as a background service, but that’s another blog post.
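In the meantime, getting the data back out for a first pass at analysis is just as easy. For example, here’s a sketch that counts archived tweets per user using only the standard library:

import csv
from collections import Counter

# Each row is [tweet id, screen name, tweet text], so row[1] is the user
with open('tweets.csv') as csv_file:
    tweets_per_user = Counter(row[1] for row in csv.reader(csv_file))

# The ten most prolific Highlander appreciators in our archive
for user, count in tweets_per_user.most_common(10):
    print("{user}: {count}".format(user=user, count=count))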