In this post, a gentle introduction to several Python packages will let you create network graphs that users can interact with. With a few steps into graph theory, you can apply these methods to data as varied as the severity of terrorist attacks or the prices of taxi cabs. In this tutorial, you'll use information from Twitter to make graphs anyone can appreciate.
Table of contents:
- Get Started
- Extract Tweets and Followers
- Process the Data
- Create the Graph
- Evaluate the Graph
- Plot the Map
1. Get Started
Make sure you’re familiar with using a command line interface such as Terminal and you can download the necessary Python packages (chart-studio, matplotlib, networkx, pandas, plotly and python-twitter). You can use Anaconda to download them. This tutorial will introduce parts of the script you can run from the command line to extract tweets and visualize them.
If you don’t have a Twitter developer account, you’ll need to log in on the Twitter developer site and apply for one. Then create an app and find your keys and secret codes for the consumer and access tokens. These let you extract information from Twitter.
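One way to keep those keys out of your scripts is to read them from environment variables at runtime. A minimal sketch of that approach (the TWITTER_* variable names here are an assumption of this sketch, not a Twitter requirement):

```python
import os

# Export the keys in your shell before running, e.g.:
#   export TWITTER_CONSUMER_KEY="..."
# Then read them here instead of hard-coding them in the script.
CONSUMER_KEY = os.environ.get("TWITTER_CONSUMER_KEY")
CONSUMER_SECRET = os.environ.get("TWITTER_CONSUMER_SECRET")
ACCESS_TOKEN_KEY = os.environ.get("TWITTER_ACCESS_TOKEN_KEY")
ACCESS_TOKEN_SECRET = os.environ.get("TWITTER_ACCESS_TOKEN_SECRET")

# Warn early if anything is missing rather than failing mid-extraction.
missing = [n for n in ("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
                       "TWITTER_ACCESS_TOKEN_KEY", "TWITTER_ACCESS_TOKEN_SECRET")
           if os.environ.get(n) is None]
if missing:
    print("Set these environment variables first:", missing)
```

This keeps the secrets out of version control entirely.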
2. Extract Tweets and Followers
To extract Tweets, run the script below. In this example, the tweets of the UCSC Science Communication class of 2020 are analyzed, so their Twitter handles are listed in screennames. Replace the variables currently set to None below with your own keys and tokens. Keep these keys and codes safe and don’t share them with others. Set datadir to the output directory where the data will be stored.

The code begins with import statements for the required packages, including os, which comes installed with Python.
import json
import os
import pickle
import twitter

screennames = ["science_ari", "shussainather", "laragstreiff",
               "scatter_cushion", "jessekathan", "jackjlee",
               "erinmalsbury", "joetting13", "jonathanwosen",
               "heysmartash"]

CONSUMER_KEY = None
CONSUMER_SECRET = None
ACCESS_TOKEN_KEY = None
ACCESS_TOKEN_SECRET = None

datadir = "data/twitter"
Next, extract the information you need. This code goes through each screen name and retrieves their tweets and follower information, then saves both to output JSON and pickle files.
t = twitter.Api(consumer_key = CONSUMER_KEY,
consumer_secret = CONSUMER_SECRET,
access_token_key = ACCESS_TOKEN_KEY,
access_token_secret = ACCESS_TOKEN_SECRET)
for sn in screennames:
    # For each user, get the followers and tweets and save them
    # to output pickle and JSON files.
    fo = datadir + "/" + sn + ".followers.pickle"
    # Get the follower information.
    fof = t.GetFollowers(screen_name=sn)
    with open(fo, "wb") as fofpickle:
        pickle.dump(fof, fofpickle, protocol=2)
    with open(fo, "rb") as fofpickle:
        with open(fo.replace(".pickle", ".json"), "w") as fofjson:
            fofdata = pickle.load(fofpickle)
            json.dump([f.AsDict() for f in fofdata], fofjson)
    # Get the user's timeline with the 500 most recent tweets.
    timeline = t.GetUserTimeline(screen_name=sn, count=500)
    tweets = [i.AsDict() for i in timeline]
    with open(datadir + "/" + sn + ".tweets.json", "w") as tweetsjson:
        json.dump(tweets, tweetsjson)  # Store the information in a JSON file.
This should extract the followers and tweets for each user and save them to pickle and JSON files in the datadir directory.
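As a quick sanity check of the output format, you can write a followers file the way the loop above does and read it back. A minimal sketch with stand-in follower dictionaries (the values are hypothetical):

```python
import json
import os
import tempfile

# Stand-in output directory and follower data (hypothetical values).
datadir = tempfile.mkdtemp()
sn = "science_ari"
followers = [{"screen_name": "jackjlee", "id": 2},
             {"screen_name": "erinmalsbury", "id": 3}]

# Write the followers JSON the same way the extraction loop does.
path = os.path.join(datadir, sn + ".followers.json")
with open(path, "w") as fofjson:
    json.dump(followers, fofjson)

# Read it back to confirm the round trip worked.
with open(path) as fofjson:
    data = json.load(fofjson)
print(len(data), data[0]["screen_name"])  # 2 jackjlee
```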
3. Process the Data
Now that you have an input JSON file of tweets, set its filename as the tweetsjson variable in the code below to read it as a DataFrame. For the rest of the tutorial, start a new script for convenience.
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
from plotly.offline import iplot, plot
from operator import itemgetter
Use pandas to import the JSON file as a DataFrame.

df = pd.read_json(tweetsjson)

Set up tfinal as the final DataFrame to build.
tfinal = pd.DataFrame(columns=["created_at", "id", "in_reply_to_screen_name",
                               "in_reply_to_status_id", "in_reply_to_user_id",
                               "retweeted_id", "retweeted_screen_name",
                               "user_mentions_screen_name", "user_mentions_id",
                               "text", "user_id", "screen_name", "followers_count"])
Then, extract the columns you’re interested in and add them to tfinal.
eqcol = ["created_at", "id", "text"]
tfinal[eqcol] = df[eqcol]
tfinal = filldf(tfinal)
tfinal = tfinal.where((pd.notnull(tfinal)), None)
Use the following functions to extract the information. Each function extracts information from the input df DataFrame and adds it to the tfinal DataFrame.

First, get the basic information: screen name, user ID and how many followers.

def getbasics(tfinal):
    """Get the basic information about the user."""
    tfinal["screen_name"] = df["user"].apply(lambda x: x["screen_name"])
    tfinal["user_id"] = df["user"].apply(lambda x: x["id"])
    tfinal["followers_count"] = df["user"].apply(lambda x: x["followers_count"])
    return tfinal

Then, get information on which tweets have been retweeted.

def getretweets(tfinal):
    """Get retweets."""
    # Inside the tag "retweeted_status" we will find "user" and get "screen_name" and "id".
    tfinal["retweeted_screen_name"] = df["retweeted_status"].apply(lambda x: x["user"]["screen_name"] if x is not np.nan else np.nan)
    tfinal["retweeted_id"] = df["retweeted_status"].apply(lambda x: x["user"]["id_str"] if x is not np.nan else np.nan)
    return tfinal

Figure out which tweets are replies and to whom they are replying.

def getinreplies(tfinal):
    """Get reply info."""
    # Just copy the "in_reply" columns to the new DataFrame.
    tfinal["in_reply_to_screen_name"] = df["in_reply_to_screen_name"]
    tfinal["in_reply_to_status_id"] = df["in_reply_to_status_id"]
    tfinal["in_reply_to_user_id"] = df["in_reply_to_user_id"]
    return tfinal

The following function runs each of these functions to get the information into tfinal.

def filldf(tfinal):
    """Put it all together."""
    getbasics(tfinal)
    getretweets(tfinal)
    getinreplies(tfinal)
    return tfinal
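To see the extraction pattern in isolation, here is a toy stand-in for the tweet DataFrame (the user dictionaries and values are hypothetical) showing how apply with a lambda pulls fields out of the nested "user" column:

```python
import pandas as pd

# A toy stand-in for the tweet DataFrame: each row's "user" cell holds a
# nested dictionary, just like the JSON Twitter returns.
df = pd.DataFrame({"user": [
    {"screen_name": "science_ari", "id": 1, "followers_count": 10},
    {"screen_name": "jackjlee", "id": 2, "followers_count": 25},
]})

# apply(lambda ...) visits each dictionary and extracts one field.
screen_names = df["user"].apply(lambda x: x["screen_name"]).tolist()
followers = df["user"].apply(lambda x: x["followers_count"]).tolist()
print(screen_names)  # ['science_ari', 'jackjlee']
print(followers)     # [10, 25]
```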
You’ll use this getinteractions() function in the next step when creating the graph. It takes the information from the tfinal DataFrame and puts it into a format the graph can use.
def getinteractions(row):
    """Get the interactions between different users."""
    # From every row of the original DataFrame,
    # first obtain the "user_id" and "screen_name".
    user = row["user_id"], row["screen_name"]
    # Be careful if there is no user id.
    if user[0] is None:
        return (None, None), set()
For the remainder of the function, gather the interaction information where it exists.
    # The interactions are going to be a set of tuples.
    interactions = set()

    # Add all interactions.
    # First, add the interactions corresponding to replies, adding
    # the id and screen_name.
    interactions.add((row["in_reply_to_user_id"], row["in_reply_to_screen_name"]))
    # After that, add the interactions with retweets.
    interactions.add((row["retweeted_id"], row["retweeted_screen_name"]))
    # And later, the interactions with user mentions.
    interactions.add((row["user_mentions_id"], row["user_mentions_screen_name"]))

    # Discard the user's own id if it appears in interactions.
    interactions.discard((row["user_id"], row["screen_name"]))
    # Discard all non-existing values.
    interactions.discard((None, None))
    # Return user and interactions.
    return user, interactions
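Here is a quick check of what getinteractions() returns, using a plain dictionary as a stand-in for a DataFrame row (the user IDs and names are made up; the function is repeated so the sketch runs on its own):

```python
def getinteractions(row):
    """Get the interactions between different users."""
    user = row["user_id"], row["screen_name"]
    if user[0] is None:
        return (None, None), set()
    interactions = set()
    interactions.add((row["in_reply_to_user_id"], row["in_reply_to_screen_name"]))
    interactions.add((row["retweeted_id"], row["retweeted_screen_name"]))
    interactions.add((row["user_mentions_id"], row["user_mentions_screen_name"]))
    interactions.discard((row["user_id"], row["screen_name"]))
    interactions.discard((None, None))
    return user, interactions

# A hand-made "row": alice replied to bob, mentioned herself, retweeted no one.
row = {"user_id": 1, "screen_name": "alice",
       "in_reply_to_user_id": 2, "in_reply_to_screen_name": "bob",
       "retweeted_id": None, "retweeted_screen_name": None,
       "user_mentions_id": 1, "user_mentions_screen_name": "alice"}
user, interactions = getinteractions(row)
print(user)          # (1, 'alice')
print(interactions)  # {(2, 'bob')}
```

The self-mention and the empty retweet are discarded, leaving only the reply as a real interaction.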
4. Create the Graph
Initialize the graph with networkx.
graph = nx.Graph()
Loop through the tfinal DataFrame and get the interaction information. Use the getinteractions function to get each user and the interactions involved with each tweet.
for index, tweet in tfinal.iterrows():
    user, interactions = getinteractions(tweet)
    user_id, user_name = user
    tweet_id = tweet["id"]
    for interaction in interactions:
        int_id, int_name = interaction
        graph.add_edge(user_id, int_id, tweet_id=tweet_id)
        graph.nodes[user_id]["name"] = user_name
        graph.nodes[int_id]["name"] = int_name
5. Evaluate the Graph
In the field of social network analysis (SNA), researchers use measurements of nodes and edges to tell what graphs are like. This lets you separate the signal from noise when looking at network graphs.
First, look at the degrees and edges of the graph. The degree of a node is the number of edges connected to it; here, that means how many interactions a user is involved in.
degrees = [val for (node, val) in graph.degree()]
print("The maximum degree of the graph is " + str(np.max(degrees)))
print("The minimum degree of the graph is " + str(np.min(degrees)))
print("There are " + str(graph.number_of_nodes()) + " nodes and " + str(graph.number_of_edges()) + " edges present in the graph")
print("The average degree of the nodes in the graph is " + str(np.mean(degrees)))
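You can sanity-check these statistics on a small stand-in graph where the answers are known, for instance a path of four nodes (0-1-2-3), whose end nodes have degree 1 and middle nodes degree 2:

```python
import networkx as nx
import numpy as np

# A path graph 0-1-2-3: degrees are [1, 2, 2, 1].
g = nx.path_graph(4)
degrees = [val for (node, val) in g.degree()]
print(np.max(degrees))   # 2
print(np.min(degrees))   # 1
print(np.mean(degrees))  # 1.5
print(g.number_of_nodes(), g.number_of_edges())  # 4 3
```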
Are all the nodes connected?

if nx.is_connected(graph):
    print("The graph is connected")
else:
    print("The graph is not connected")
print("There are " + str(nx.number_connected_components(graph)) + " connected components in the graph.")
Information about the largest subgraph can tell you what sort of tweets represent the majority.
largestsubgraph = graph.subgraph(max(nx.connected_components(graph), key=len))
print("There are " + str(largestsubgraph.number_of_nodes()) + " nodes and " + str(largestsubgraph.number_of_edges()) + " edges present in the largest component of the graph.")
The clustering coefficient tells you how close together the nodes congregate using the density of the connections surrounding a node. If many nodes are connected in a small area, there will be a high clustering coefficient.
print("The average clustering coefficient is " + str(nx.average_clustering(largestsubgraph)) + " in the largest subgraph")
print("The transitivity of the largest subgraph is " + str(nx.transitivity(largestsubgraph)))
print("The diameter of our graph is " + str(nx.diameter(largestsubgraph)))
print("The average distance between any two nodes is " + str(nx.average_shortest_path_length(largestsubgraph)))
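These measures are easiest to check on a graph where the answers are known. In a complete graph every pair of a node's neighbors is itself connected, so clustering and transitivity both hit their maximum of 1, and every node is one step from every other:

```python
import networkx as nx

# A complete graph on 4 nodes: every possible edge exists.
g = nx.complete_graph(4)
print(nx.average_clustering(g))            # 1.0
print(nx.transitivity(g))                  # 1.0
print(nx.diameter(g))                      # 1
print(nx.average_shortest_path_length(g))  # 1.0
```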
Centrality measures how important a node is in the network, and there are several ways to measure it. “Degree centrality” counts each node’s direct, “one step,” connections to other nodes in the network. “Betweenness centrality” identifies nodes that act as “bridges” between nodes in a network by finding the shortest paths and counting how many times each node falls on one. “Closeness centrality,” instead, scores each node based on the sum of its shortest paths to all other nodes.
graphcentrality = nx.degree_centrality(largestsubgraph)
maxde = max(graphcentrality.items(), key=itemgetter(1))
graphcloseness = nx.closeness_centrality(largestsubgraph)
graphbetweenness = nx.betweenness_centrality(largestsubgraph, normalized=True, endpoints=False)
maxclo = max(graphcloseness.items(), key=itemgetter(1))
maxbet = max(graphbetweenness.items(), key=itemgetter(1))
print("The node with ID " + str(maxde[0]) + " has a degree centrality of " + str(maxde[1]) + ", which is the max of the graph.")
print("The node with ID " + str(maxclo[0]) + " has a closeness centrality of " + str(maxclo[1]) + ", which is the max of the graph.")
print("The node with ID " + str(maxbet[0]) + " has a betweenness centrality of " + str(maxbet[1]) + ", which is the max of the graph.")
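A star graph makes these three measures easy to verify: the hub touches every node, sits on every shortest path between leaves, and is one step from everyone, so it maximizes all three centralities:

```python
import networkx as nx
from operator import itemgetter

# A star graph: node 0 is the hub, nodes 1-3 are leaves.
g = nx.star_graph(3)
maxde = max(nx.degree_centrality(g).items(), key=itemgetter(1))
maxclo = max(nx.closeness_centrality(g).items(), key=itemgetter(1))
maxbet = max(nx.betweenness_centrality(g, normalized=True).items(), key=itemgetter(1))
print(maxde, maxclo, maxbet)  # (0, 1.0) (0, 1.0) (0, 1.0)
```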
6. Plot the Map
Get the edges and store their endpoint coordinates in lists Xe and Ye for the x- and y-directions; the None entries break the line between one edge and the next. The node positions pos come from the Fruchterman-Reingold layout computed below, and the node coordinates go in Xn and Yn.

Xn = [pos[k][0] for k in G.nodes()]
Yn = [pos[k][1] for k in G.nodes()]
Xe = []
Ye = []
for e in G.edges():
    Xe.extend([pos[e[0]][0], pos[e[1]][0], None])
    Ye.extend([pos[e[0]][1], pos[e[1]][1], None])
Define the Plotly “trace” for nodes and edges. Plotly uses these traces as a way of storing the graph data right before it’s plotted.
trace_nodes = dict(type="scatter", x=Xn, y=Yn, mode="markers", marker=dict(size=28, color="rgb(0,240,0)"), text=labels, hoverinfo="text")
trace_edges = dict(type="scatter", mode="lines", x=Xe, y=Ye, line=dict(width=1, color="rgb(25,25,25)"), hoverinfo="none")
Plot the graph with the Fruchterman-Reingold layout algorithm, a force-directed layout designed to make the connections between nodes clear.

G = largestsubgraph
pos = nx.fruchterman_reingold_layout(G)
Use the axis and layout variables to customize what appears on the graph. With the showline=False option (and its companions), you hide the axis line, grid, tick labels and title of the graph. Then the fig variable creates the actual figure.
axis = dict(showline=False,
            zeroline=False,
            showgrid=False,
            showticklabels=False,
            title="")

layout = dict(title="My Graph",
              showlegend=False,
              xaxis=axis,
              yaxis=axis,
              hovermode="closest",
              plot_bgcolor="#EFECEA") # Set background color.

fig = dict(data=[trace_edges, trace_nodes], layout=layout)
Annotate with the information you want others to see on each node. Use the labels variable to list (with the same length as pos) what should appear as an annotation.

labels = range(len(pos))

def make_annotations(pos, anno_text, font_size=14, font_color="rgb(10,10,10)"):
    L = len(pos)
    if len(anno_text) != L:
        raise ValueError("The lists pos and text must have the same len")
    annotations = []
    for k in range(L):
        annotations.append(dict(text=str(anno_text[k]),
                                x=pos[k][0],
                                y=pos[k][1] + 0.075, # This additional value is chosen by trial and error.
                                xref="x1", yref="y1",
                                font=dict(color=font_color, size=font_size),
                                showarrow=False))
    return annotations
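To finish, attach the annotations to the layout and draw the figure with Plotly. A minimal sketch of that wiring, with tiny stand-in positions and the helper repeated so the sketch runs on its own (the actual plot call is shown but left commented out, since it writes an HTML file):

```python
def make_annotations(pos, anno_text, font_size=14, font_color="rgb(10,10,10)"):
    # Same helper as above, repeated so this sketch is self-contained.
    L = len(pos)
    if len(anno_text) != L:
        raise ValueError("The lists pos and text must have the same len")
    annotations = []
    for k in range(L):
        annotations.append(dict(text=str(anno_text[k]),
                                x=pos[k][0], y=pos[k][1] + 0.075,
                                xref="x1", yref="y1",
                                font=dict(color=font_color, size=font_size),
                                showarrow=False))
    return annotations

pos = {0: (0.0, 0.0), 1: (1.0, 0.5)}  # hypothetical layout positions
labels = list(range(len(pos)))
layout = dict(title="My Graph", annotations=make_annotations(pos, labels))
fig = dict(data=[], layout=layout)  # real data would be [trace_edges, trace_nodes]
# To render the interactive graph:
# from plotly.offline import plot
# plot(fig, filename="twitter_graph.html")
print(len(layout["annotations"]))  # 2
```

Opening the resulting HTML file shows the interactive network, with each node labeled by its annotation.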