In this notebook we'll explore the networks of both sides of the US political aisle: TheDemocrats and the GOP. We'll identify like-minded political and social interest communities, and use these communities as landmarks to quantify social distance.
Our world is messy and complicated. Online social networks like Twitter (and the internet in general) give us a peek into this nuanced world.
"Think of Twitter as a network of human sensors." -- Rick Lawrence, IBM, Machine Learning & Decision Analytics
What's possible now that our interests and relationships are digitized and available for download? The following is a small example.
This notebook depends on GraphLab Create and graphreduce. The following will get you everything you need:
$ git clone https://github.com/timmytw/graphreduce.git
$ cd graphreduce/; pip install -r requirements.txt
import os, math, inspect
from IPython.display import display_html
from operator import mul
import graphlab as gl
from graphreduce.graph_wrapper import GraphWrapper
First we'll download the pre-assembled two-degree ego networks of the DNC and RNC, then we'll mine the combined network for compression patterns (communities).
This is the most time-consuming part of our exercise; it takes roughly 6.5 minutes on my laptop (magnetic drive, 8GB RAM, i7).
# cache the reduction next to this notebook so reruns are fast
this_dir = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe())))
cache_dir = this_dir+'/.twitter_politics/'
if os.path.exists(cache_dir+'parent'):
    # reuse a previously computed reduction
    gw = GraphWrapper.from_previous_reduction(cache_dir)
else:
    # download the combined DNC/RNC ego networks and mine them for communities
    v_path = 'http://static.smarttypes.org/static/graphreduce/test_data/TheDemocrats_GOP.vertex.csv.gz'
    e_path = 'http://static.smarttypes.org/static/graphreduce/test_data/TheDemocrats_GOP.edge.csv.gz'
    gw, mdls = GraphWrapper.reduce(v_path, e_path)
    if not os.path.exists(cache_dir):
        os.makedirs(cache_dir)
    gw.save(cache_dir)
The topic of community detection is broad and deep. The method used here, the map equation, uses information theory to quantify how well a partition of the network compresses the description of a random walk over it. RelaxMap is a parallel implementation of the map equation objective.
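For reference, a sketch of the two-level map equation as given by Rosvall and Bergstrom (the objective RelaxMap minimizes):

L(M) = q_{\curvearrowright} H(\mathcal{Q}) + \sum_{i=1}^{m} p_{\circlearrowright}^{i} H(\mathcal{P}^{i})

Here q_{\curvearrowright} is the rate at which the random walk switches between modules, H(\mathcal{Q}) is the entropy of the module names, p_{\circlearrowright}^{i} is the fraction of time the walk spends within module i, and H(\mathcal{P}^{i}) is the entropy of movements within module i. A partition M that minimizes L(M) compresses the walk well, and its modules are the communities we use below.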
Let's take a look at the most popular communities, ordered by pagerank:
def display_table(rows):
    table_template = '<table>%s</table>'
    row_template = '<tr>%s</tr>'
    header_column_template = '<th>%s</th>'
    normal_column_template = '<td>%s</td>'
    rows_html = []
    for i, row in enumerate(rows):
        row_html = []
        for column in row:
            col_template = header_column_template if i == 0 else normal_column_template
            row_html.append(col_template % column)
        rows_html.append(row_template % ''.join(row_html))
    display_html(table_template % ''.join(rows_html), raw=True)

def mk_labels_html(labels):
    labels_html_template = '<span style="color:#0000FF;padding:5px;">%s</span>'
    labels_html = []
    for x in labels:
        labels_html.append(labels_html_template % x)
    return ''.join(labels_html)

def display_communities(results, score_name, header):
    output_rows = [header]
    for x in results:
        output_row = [str(x[score_name])[:4], x['member_count'], mk_labels_html(x['top_labels'])]
        output_rows.append(output_row)
    display_table(output_rows)
min_members = 25
communities = gw.g.get_vertices()
communities = communities[communities['member_count'] >= min_members]
display_html('<h3>Popular communities</h3>', raw=True)
header = ['Pagerank', 'Member count', 'Top labels']
display_communities(communities.sort('pr', ascending=False)[:10], 'pr', header)
Let's have a look at communities close to the respective parties. The first output column is a reciprocal_interest score (explained below), the second is the number of members in the community, and the third is a list of community labels. For more on community labeling, see def label_communities(self) in the source.
def reciprocal_interest(scores):
    def _score(row):
        return row['user_interest'] * row['community_interest']
    return scores.apply(_score)

user_community_scores = gw.child.user_community_scores(reciprocal_interest, min_members)

def users_top_communities(user_id, scores):
    user_scores = scores[scores['user_id'] == user_id]
    user_scores = user_scores.join(communities, {'community_id':'__id'})
    user_scores.remove_column('community_id.1')
    return user_scores.sort('score', ascending=False)
header = ['Score', 'Member count', 'Top labels']
display_html('<h3>DNC communities</h3>', raw=True)
dem_id = '14377605'
dem_communities = users_top_communities(dem_id, user_community_scores)
display_communities(dem_communities[:10], 'score', header)
display_html('<h3>RNC communities</h3>', raw=True)
rep_id = '11134252'
rep_communities = users_top_communities(rep_id, user_community_scores)
display_communities(rep_communities[:10], 'score', header)
The 'score' here is the product of user_interest and community_interest. Twitter is a directed network, so our objective function rewards relationships where an account follows many people in a community and many people in the community follow the account back: hence reciprocal_interest.
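As a toy illustration (made-up numbers, not values from the dataset): taking the product favors balanced, mutual attention over lopsided attention with the same total.

# Hypothetical interest scores in [0, 1]; illustration only.
mutual = {'user_interest': 0.6, 'community_interest': 0.6}
lopsided = {'user_interest': 1.0, 'community_interest': 0.2}

def _score(row):
    return row['user_interest'] * row['community_interest']

print(_score(mutual))    # 0.36: mutual attention scores higher
print(_score(lopsided))  # 0.20: one-sided attention scores lower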
What can we glean from this? I'm not really sure. But there are a few things worth mentioning.
The DNC is aligned heavily with volunteers, colleges, and the news media, and then with supportive and swing states. I was surprised to see Texas: Can Democrats Turn Texas and Arizona Blue by 2016?
The RNC is aligned primarily with the conservative community, the media, and congressional representation, then a mix of its own support and swing states.
We'll use the communities closest to each party as features (landmarks) to measure similarity. Let's look at accounts close to the respective parties:
def users_top_users(user_id, scores, feature_ids):
    assert scores['score'].min() >= 0
    # collect each user's community scores into a {community_id: score} dict
    scores = scores.groupby('user_id',
        {'score':gl.aggregate.CONCAT('community_id', 'score')},
        {'num_features':gl.aggregate.COUNT('community_id')})
    # drop accounts with too few community affiliations to compare
    scores = scores[scores['num_features'] > len(feature_ids) * .20]
    user_score = scores[scores['user_id'] == user_id][0]
    def distance(row):
        # L1-style distance over the landmark communities, doubling the
        # penalty when only one of the two accounts scores a community
        total_distance = 0
        for x in feature_ids:
            score1 = user_score['score'].get(x)
            score2 = row['score'].get(x)
            if score1 and score2:
                dis = abs(score1 - score2)
            elif score1 or score2:
                dis = (score1 or score2) * 2
            else:
                dis = 0
            total_distance += dis
        return total_distance
    scores['distance'] = scores.apply(distance)
    scores = scores.join(gw.verticy_descriptions, {'user_id':'__id'})
    # standardize distances (zero mean, unit variance)
    scores['distance'] = (scores['distance'] - scores['distance'].mean()) \
        / scores['distance'].std()
    return scores.sort('distance')
feature_ids = list(rep_communities['community_id'][:5])
feature_ids += list(dem_communities['community_id'][:5])
feature_ids = list(set(feature_ids))
def mk_twitter_link(screen_name):
    return '<a target="_blank" href="https://twitter.com/%s">%s</a>' % (screen_name, screen_name)

def display_accounts(results, score_name, header):
    output_rows = [header]
    for x in results:
        output_row = [str(x[score_name])[:4],
                      mk_twitter_link(x['screen_name']),
                      x['description']]
        output_rows.append(output_row)
    display_table(output_rows)
header = ['Distance', 'Account', 'Description']
display_html('<h3>Accounts similar to the DNC</h3>', raw=True)
dem_users = users_top_users(dem_id, user_community_scores, feature_ids)
display_accounts(dem_users[:10], 'distance', header)
display_html('<h3>Accounts similar to the RNC</h3>', raw=True)
rep_users = users_top_users(rep_id, user_community_scores, feature_ids)
display_accounts(rep_users[:10], 'distance', header)
Now let's look for accounts of interest to both the DNC and RNC: accounts whose reciprocal_interest score distributions are similar to those of both parties:
def users_in_between(distances):
    n_dimensions = len(distances)
    # stack the per-party distance tables into one SFrame
    _distances = distances[0]
    for x in distances[1:]:
        _distances = _distances.append(x)
    distances = _distances
    distances = distances.groupby('user_id', {'distances':gl.aggregate.CONCAT('distance')})
    def between(row):
        # keep only accounts with a distance to every party
        if len(row['distances']) != n_dimensions:
            return None
        x = gl.SArray(row['distances'])
        # discard accounts much closer to one party than the other
        if x.std() > .15:
            return None
        # rank by closeness, breaking ties in favor of equidistance
        return x.mean() + x.std()
    distances['distance'] = distances.apply(between)
    distances = distances.dropna().join(gw.verticy_descriptions, {'user_id':'__id'})
    return distances.sort('distance')
display_html('<h3>Of interest to the DNC and RNC</h3>', raw=True)
equidistant_users = users_in_between([dem_users, rep_users])
display_accounts(equidistant_users[:10], 'distance', header)
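To make the between score concrete, here's a toy re-implementation in plain Python with made-up standardized distances (the notebook itself computes this with gl.SArray):

import math

def between_score(ds, max_std=.15):
    # population mean/std; assumed to match gl.SArray's default behavior
    mean = sum(ds) / len(ds)
    std = math.sqrt(sum((d - mean) ** 2 for d in ds) / len(ds))
    if std > max_std:
        return None  # much closer to one party: not 'in between'
    return mean + std  # close to both parties ranks first

print(between_score([-0.5, -0.4]))  # ~ -0.4: near both parties, ranks high
print(between_score([-1.2, 0.8]))   # None: partisan, filtered out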
The focus of this notebook has been unsupervised learning. In a future notebook we'll look at using labeled data and network compression patterns (communities) to predict future actions. Say hi @SmartTypes if this is of interest.