Using Twitter to assess political strategy and position

In this notebook we'll explore the networks of both sides of the US political aisle: TheDemocrats and the GOP. We'll identify like-minded political and social interest communities, and use these communities as landmarks to quantify social distance.

What can this data really tell us?

Our world is messy and complicated. Online social networks like Twitter (and the internet in general) give us a peek into this nuanced world.

Think of Twitter as a network of human sensors. -- Rick Lawrence, IBM, Machine Learning & Decision Analytics

What's possible now that our interests and relationships are digitized and available for download? The following is a small example.

This notebook depends on:

This will get you everything you need:

 $ git clone https://github.com/timmytw/graphreduce.git

 $ cd graphreduce/; pip install -r requirements.txt
In [6]:
import os, math, inspect
from IPython.display import display_html
from operator import mul
import graphlab as gl
from graphreduce.graph_wrapper import GraphWrapper

Downloading and compressing our network

First we'll download the preassembled 2-degree ego networks of the DNC and RNC. Then we'll mine the combined network for compression patterns (communities).

This will be the most time-consuming part of our exercise. It takes roughly 6.5 minutes on my laptop (magnetic drive, 8 GB RAM, i7).

In [3]:
this_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
cache_dir = this_dir+'/.twitter_politics/'
if os.path.exists(cache_dir+'parent'):
    gw = GraphWrapper.from_previous_reduction(cache_dir)
else:
    v_path = 'http://static.smarttypes.org/static/graphreduce/test_data/TheDemocrats_GOP.vertex.csv.gz'
    e_path = 'http://static.smarttypes.org/static/graphreduce/test_data/TheDemocrats_GOP.edge.csv.gz'
    gw, mdls = GraphWrapper.reduce(v_path, e_path)
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    gw.save(cache_dir)
[INFO] Start server at: ipc:///tmp/graphlab_server-20361 - Server binary: /home/timmyt/.virtualenvs/graphreduce/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1423869317.log
[INFO] GraphLab Server Version: 1.2.1


---------------------------------------------
Top level detection, 1 partition(s)
---------------------------------------------
 - partition 1 of 1, v_count: 1581
   + found 16 communities, mdl: 9.91881

---------------------------------------------
Bottom level detection, 1 partition(s)
---------------------------------------------
 - partition 1 of 1, v_count: 37012
   + found 380 communities, mdl: 13.8409

---------------------------------------------
Top level detection, 1 partition(s)
---------------------------------------------
 - partition 1 of 1, v_count: 380
   + found 1 communities, mdl: 8.09308

[13.8409]
total runtime: 0:06:24.760820

Network community detection

The topic of community detection is broad and deep. The method here, the map equation, uses information theory to quantify the compression of a random walk. Relaxmap is a parallel implementation of the map equation objective.
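
To make the map equation a bit more concrete, here is a minimal sketch of the two-level version for a small undirected, unweighted graph. It is not Relaxmap's implementation and it ignores directed links and teleportation; the helper names (plogp, map_equation) and the toy graph are assumptions made for illustration. The mdl figures in the log above are presumably description lengths of this kind.

```python
from __future__ import division
import math
from collections import defaultdict

def plogp(p):
    """Return p * log2(p), with the convention 0 * log2(0) = 0."""
    return p * math.log(p, 2) if p > 0 else 0.0

def map_equation(edges, module_of):
    """Two-level map equation (in bits) for an undirected, unweighted graph.

    edges: list of (u, v) pairs; module_of: dict mapping node -> module id.
    """
    degree = defaultdict(int)
    exit_links = defaultdict(int)  # edge ends that cross a module boundary
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        if module_of[u] != module_of[v]:
            exit_links[module_of[u]] += 1
            exit_links[module_of[v]] += 1
    two_m = 2 * len(edges)
    # On an undirected graph a random walker visits nodes in proportion to degree.
    visit = dict((n, d / two_m) for n, d in degree.items())
    module_visit = defaultdict(float)
    for n, p in visit.items():
        module_visit[module_of[n]] += p
    exit_rate = dict((m, exit_links[m] / two_m) for m in module_visit)
    q_total = sum(exit_rate.values())
    return (plogp(q_total)
            - 2 * sum(plogp(q) for q in exit_rate.values())
            - sum(plogp(p) for p in visit.values())
            + sum(plogp(exit_rate[m] + module_visit[m]) for m in module_visit))

# Toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
one_module = dict((n, 'a') for n in range(6))
two_modules = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
one_len = map_equation(edges, one_module)
two_len = map_equation(edges, two_modules)
```

Splitting the toy graph into its two triangles gives a shorter description than keeping everything in one module, which is the sense in which a good partition "compresses" the random walk.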

Let's take a look at the most popular communities, ordered by pagerank:

In [18]:
def display_table(rows):
    table_template = '<table>%s</table>'
    row_template = '<tr>%s</tr>'
    header_column_template = '<th>%s</th>'
    normal_column_template = '<td>%s</td>'
    rows_html = []
    for i, row in enumerate(rows):
        row_html = []
        for column in row:
            col_template = header_column_template if i == 0 else normal_column_template
            row_html.append(col_template % column)
        rows_html.append(row_template % ''.join(row_html))
    display_html(table_template % ''.join(rows_html), raw=True)

def mk_labels_html(labels):
    labels_html_template = '<span style="color:#0000FF;padding:5px;">%s</span>'
    labels_html = []
    for x in labels:
        labels_html.append(labels_html_template % x)
    return ''.join(labels_html)

def display_communities(results, score_name, header):
    output_rows = [header]
    for x in results:
        output_row = [str(x[score_name])[:4], x['member_count'], mk_labels_html(x['top_labels'])]
        output_rows.append(output_row)
    display_table(output_rows) 

min_members = 25
communities = gw.g.get_vertices()
communities = communities[communities['member_count'] >= min_members]
display_html('<h3>Popular communities</h3>', raw=True)
header = ['Pagerank', 'Member count', 'Top labels']
display_communities(communities.sort('pr', ascending=False)[:10], 'pr', header)

Popular communities

Pagerank  Member count  Top labels
4.87      9311          editor cnn politics correspondent political
4.45      2616          actor official twitter actress writer
3.59      985           founder tech ceo marketing technology
3.55      6333          conservative tcot christian liberty libertarian
3.47      715           endorsement official archived military twitter
3.08      1279          football espn sports official twitter
2.61      196           world international poverty nations global
2.59      1361          uniteblue liberal progressive obama p2
2.44      314           organizing action volunteers maintained ofa
2.39      217           food chef restaurant cook recipes

Communities close to the respective parties

Let's have a look at communities close to the respective parties. The first output column is a reciprocal_interest score (explanation forthcoming), the second is the number of members in the community, and the third is a list of community labels. For more on community labeling you can check out the source here, def label_communities(self).

In [19]:
def reciprocal_interest(scores):
    def _score(row):
        return row['user_interest'] * row['community_interest']
    return scores.apply(_score)

user_community_scores = gw.child.user_community_scores(reciprocal_interest, min_members)

def users_top_communities(user_id, scores):
    user_scores = scores[scores['user_id'] == user_id]
    user_scores = user_scores.join(communities, {'community_id':'__id'})
    user_scores.remove_column('community_id.1')
    return user_scores.sort('score', ascending=False)

header = ['Score', 'Member count', 'Top labels']

display_html('<h3>DNC communities</h3>', raw=True)
dem_id = '14377605'
dem_communities = users_top_communities(dem_id, user_community_scores)
display_communities(dem_communities[:10], 'score', header)

display_html('<h3>RNC communities</h3>', raw=True)
rep_id = '11134252'
rep_communities = users_top_communities(rep_id, user_community_scores)
display_communities(rep_communities[:10], 'score', header)

DNC communities

Score  Member count  Top labels
3.21   314           organizing action volunteers maintained ofa
3.21   9311          editor cnn politics correspondent political
3.21   298           democrats college collegedems university democratic
0.30   321           pa pennsylvania pittsburgh county philadelphia
0.26   377           massachusetts boston ma mapoli state
0.23   1361          uniteblue liberal progressive obama p2
0.19   683           texas dallas austin state tx
0.18   405           seattle washington wa state king
0.17   340           ohio columbus state dayton cincinnati
0.15   537           michigan detroit state migop mi

RNC communities

Score  Member count  Top labels
3.21   6333          conservative tcot christian liberty libertarian
2.65   9311          editor cnn politics correspondent political
2.41   534           district congressional representing congressman proudly
0.11   395           virginia va richmond district delegates
0.10   535           florida tampa jacksonville miami political
0.10   347           iowa moines des ia iowan
0.06   237           minnesota mn minneapolis startribune paul
0.06   291           nh hampshire state manchester granite
0.05   683           texas dallas austin state tx
0.05   211           georgia atlanta ga ajc state

What can we glean from this?

The 'score' here is the product of user_interest and community_interest. Twitter is a directed network, so our objective function rewards relationships where an account follows many people in a community and many people in that community follow the account back: a reciprocal_interest function.
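
As a rough illustration of why a product rewards two-way relationships, here is a small pure-Python sketch. The definitions of user_interest and community_interest below (simple fractions of community members followed and following) are assumptions for the example; graphreduce may normalize these quantities differently, and the account names and follow edges are made up.

```python
from __future__ import division

def reciprocal_interest(account, community, follows):
    """Product of two hypothetical interest fractions.

    follows: set of (follower, followee) pairs.
    """
    members = [m for m in community if m != account]
    if not members:
        return 0.0
    # Assumed definition: share of the community that the account follows.
    user_interest = sum((account, m) in follows for m in members) / len(members)
    # Assumed definition: share of the community that follows the account.
    community_interest = sum((m, account) in follows for m in members) / len(members)
    return user_interest * community_interest

follows = set([('party', 'a'), ('party', 'b'),
               ('a', 'party'), ('b', 'party'), ('c', 'party'), ('d', 'party'),
               ('fan', 'a'), ('fan', 'b'), ('fan', 'c'), ('fan', 'd')])
community = ['a', 'b', 'c', 'd']
mutual = reciprocal_interest('party', community, follows)  # 0.5 * 1.0
one_way = reciprocal_interest('fan', community, follows)   # 1.0 * 0.0
```

The 'fan' account follows the whole community but nobody follows it back, so its score collapses to zero, while 'party' scores well because interest flows in both directions.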

I'm not really sure what to conclude, but a few things are worth mentioning.

The DNC is aligned heavily with volunteers, colleges, and the news media, followed by supportive and swing states. I was surprised to see Texas; see "Can Democrats Turn Texas and Arizona Blue by 2016?"

The RNC is aligned primarily with the conservative community, the media, and congressional representation, then a mix of its own supportive states and swing states.

We'll use the communities closest to each party as features (landmarks) to measure similarity. Let's look at accounts close to the respective parties:

In [21]:
def users_top_users(user_id, scores, feature_ids):
    assert scores['score'].min() >= 0
    scores = scores.groupby('user_id', 
        {'score':gl.aggregate.CONCAT('community_id', 'score')},
        {'num_features':gl.aggregate.COUNT('community_id')})
    scores = scores[scores['num_features'] > len(feature_ids) * .20]
    user_score = scores[scores['user_id'] == user_id][0]
    def distance(row):
        total_distance = 0
        for x in feature_ids:
            score1 = user_score['score'].get(x)
            score2 = row['score'].get(x)
            if score1 and score2:
                dis = abs(score1 - score2)
            elif score1 or score2:
                dis = (score1 or score2) * 2
            else:
                dis = 0
            total_distance+=dis
        return total_distance
    scores['distance'] = scores.apply(distance)
    scores = scores.join(gw.verticy_descriptions, {'user_id':'__id'})
    scores['distance'] = (scores['distance'] - scores['distance'].mean()) \
        / (scores['distance'].std())
    return scores.sort('distance')

feature_ids = list(rep_communities['community_id'][:5])
feature_ids += list(dem_communities['community_id'][:5])
feature_ids = list(set(feature_ids))

def mk_twitter_link(screen_name):
    return '<a target="_blank" href="https://twitter.com/%s">%s</a>' % (screen_name, screen_name)

def display_accounts(results, score_name, header):
    output_rows = [header]
    for x in results:
        output_row = [str(x[score_name])[:4], 
                      mk_twitter_link(x['screen_name']), 
                      x['description']]
        output_rows.append(output_row)
    display_table(output_rows) 

header = ['Distance', 'Account', 'Description']

display_html('<h3>Accounts similar to the DNC</h3>', raw=True)
dem_users = users_top_users(dem_id, user_community_scores, feature_ids)
display_accounts(dem_users[:10], 'distance', header)

display_html('<h3>Accounts similar to the RNC</h3>', raw=True)
rep_users = users_top_users(rep_id, user_community_scores, feature_ids)
display_accounts(rep_users[:10], 'distance', header)

Accounts similar to the DNC

Distance  Account          Description
-5.2      TheDemocrats     This is the official Twitter account of the Democratic Party. Follow our tweets to get the latest info on Democratic news, issues, and events.
-4.0      jeremybird       Founding Partner w @270Strategies; Senior Advisor w @BGTX; Former National Field Director, Obama for America; believer in grassroots, empowerment campaigning
-3.8      OFA_HQ
-3.8      dscc             Democratic Senatorial Campaign Committee | Committed to Electing a Democratic Majority
-3.7      Messina2012      2012 Obama Campaign Manager, former White House Deputy Chief of Staff. Proud Montanan
-3.6      CollegeDems      College Democrats of America | The official youth branch of the Democratic National Committee. Become a fan on Facebook: http://t.co/YPg5VNqv3R
-3.4      MarlonDMarshall  Kansas Jayhawk for life, grassroots organizer, sports fan, proud St. Louis native, full-time please believer.
-3.4      DNCYouthCouncil  The official Twitter page of the DNC Youth Council. We work to increase youth participation in the Democratic Party.
-3.4      woodhouseb       President, Americans United for Change and American Bridge, former DNC Comm. Dir./Obama '08/'12 Surrogate, Husband, Father, Fisherman, BBaller, NFL junkie.
-3.3      JonCarsonOFA     Jon Carson is Executive Director of Organizing for Action.

Accounts similar to the RNC

Distance  Account         Description
-3.9      GOP             Updates from the Republican National Committee.
-3.7      DailyCaller     Politics, entertainment, slideshows. You're welcome.
-3.4      PRyan           Husband; Proud father of 3; Wisconsinite; Go Pack Go!
-3.4      SpeakerBoehner  Official Twitter account for U.S. House Speaker John Boehner (R-OH)
-3.4      SenRandPaul     I fight for the Constitution, individual liberty and the freedoms that make this country great.
-3.3      Senate_GOPs     News and updates from Republican senators and their staff.
-3.2      megynkelly      Happily married to Doug, crazy in love with my children Yates, Yardley, and Thatcher, and anchor of The Kelly File on Fox News Channel
-3.2      johnboehner     I represent Ohio's 8th Congressional District and serve as Speaker of the House; am fighting for a smaller, more accountable government.
-3.1      Heritage        A think tank devoted to the principles of free enterprise, limited government, individual freedom, traditional American values, and a strong national defense.
-3.1      MittRomney      Former Governor of Massachusetts
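
The distances in the two tables above come from the distance function inside users_top_users, followed by a z-score step, which is why the closest accounts have negative values. A GraphLab-free sketch of that logic follows; the function names landmark_distance and standardize, the community ids, and the scores are all made up for illustration, and the population standard deviation is an assumption about what SArray.std() computes by default.

```python
from __future__ import division
import math

def landmark_distance(scores_a, scores_b, feature_ids):
    """Sum of per-landmark gaps between two accounts' community scores."""
    total = 0.0
    for f in feature_ids:
        a = scores_a.get(f)
        b = scores_b.get(f)
        if a and b:
            total += abs(a - b)      # both accounts score on this landmark
        elif a or b:
            total += (a or b) * 2    # only one does: doubled penalty
        # neither account scores on the landmark: contributes nothing
    return total

def standardize(values):
    """Convert raw distances to z-scores (population standard deviation)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

party = {'c1': 3.0, 'c2': 1.0}
other = {'c1': 2.5}
raw = landmark_distance(party, other, ['c1', 'c2', 'c3'])  # 0.5 + 2.0 + 0
z = standardize([1.0, 2.0, 3.0])
```

Because a missing landmark costs twice the other account's score, accounts that share the party's communities rank well ahead of accounts that overlap on only one or two of them.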

Accounts of interest to both sides

Now let's look for accounts of interest to both the DNC and RNC. These are accounts whose reciprocal_interest score distributions are similar to those of both the DNC and the RNC:

In [22]:
def users_in_between(distances):
    n_dimensions = len(distances)
    _distances = distances[0]
    for x in distances[1:]:
        _distances = _distances.append(x)
    distances = _distances
    distances = distances.groupby('user_id', {'distances':gl.aggregate.CONCAT('distance')})
    def between(row):
        if len(row['distances']) != n_dimensions:
            return None
        x = gl.SArray(row['distances'])
        if x.std() > .15:
            return None
        return x.mean() + x.std()
    distances['distance'] = distances.apply(between)
    distances = distances.dropna().join(gw.verticy_descriptions, {'user_id':'__id'})
    return distances.sort('distance')

display_html('<h3>Of interest to the DNC and RNC</h3>', raw=True)
equidistant_users = users_in_between([dem_users, rep_users])
display_accounts(equidistant_users[:10], 'distance', header)

Of interest to the DNC and RNC

Distance  Account         Description
-2.1      DavidMDrucker   @dcexaminer Senior Congressional Correspondent covering Capitol Hill, campaigns & national political trends.
-2.0      postpolitics    The latest political news and analysis from The Washington Post.
-1.8      sppeoples       NH native who covers national politics for The Associated Press, bleeds with New England sports fans, and has yet to find good tacos in DC.
-1.8      kararowland     Capitol Hill producer w/ Fox News; UVA and LSE alum. Anglophile. Also fond of cute animals, craft beer & Charlottesville. RT don't = endorsements, and all that.
-1.8      katiezez        @washingtonpost White House correspondent. @nytimes Boston bureau, @AP New Jersey, @UMKnightWallace alum. katie.zezima@washpost.com
-1.7      JamesPindell    Political reporter for The Boston Globe focused on presidential primaries. Political analyst for WMUR-TV in New Hampshire.
-1.7      SunlenSerfaty   National Correspondent, CNN Newsource
-1.7      DanielStrauss4  @TPM reporter. RTs ≠ endorsements. E-mail: daniel@talkingpointsmemo.com
-1.7      PattyMurray     Official account of U.S. Senator Patty Murray (D-WA) | Tweets come from staff unless signed “-PM” by Senator Murray | RT≠endorsement
-1.7      nedrapickler    White House reporter for The Associated Press and a proud native of Flint, Michigan
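
The selection rule in users_in_between can be restated in plain Python: keep accounts that have a distance to every party, drop those whose distances differ too much (standard deviation above 0.15), and rank the rest by mean plus standard deviation. The sketch below uses plain dicts instead of SFrames; the function name in_between, the account names, and the numbers are hypothetical, and the population standard deviation is again an assumption about SArray.std().

```python
from __future__ import division
import math

def in_between(distance_tables, max_std=0.15):
    """distance_tables: list of dicts mapping user_id -> standardized distance."""
    # Only accounts that appear in every table can be "in between".
    common = set(distance_tables[0])
    for table in distance_tables[1:]:
        common &= set(table)
    ranked = []
    for user in common:
        values = [table[user] for table in distance_tables]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        if std > max_std:
            continue  # noticeably closer to one party than the other
        ranked.append((mean + std, user))
    return [user for _, user in sorted(ranked)]

to_dnc = {'reporter': -2.0, 'partisan': -3.5, 'stranger': 1.0}
to_rnc = {'reporter': -2.1, 'partisan': 0.5, 'stranger': 1.1, 'only_rnc': -1.0}
result = in_between([to_dnc, to_rnc])
```

Adding the standard deviation to the mean breaks ties in favor of accounts that sit at nearly the same distance from both parties, which is consistent with political reporters dominating the table above.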

Give me some more.

The focus of this notebook has been unsupervised learning. In a future notebook we'll look at using labeled data and network compression patterns (communities) to predict future actions. Say hi @SmartTypes if this is of interest.
