The Emergent Beauty of Mathematics

The first 30 seconds of a Brownian tree (code can be found here).

Like a flower in full bloom, nature reveals its patterns shaped by mathematics. Or as particles collide with one another, they create snowflake-like fractals through emergence. How do fish swarm in schools or consciousness come about from the brain? Simulations can provide answers.

Through code, you can simulate how living cells or physical particles would interact with one another. Using equations that govern the behavior of how cells act when they meet one another or how they would grow and evolve into larger structures. In the gif above, you can use diffusion-limited aggregation to create Brownian trees. These are the structures that emerge when particles move randomly with respect to one another. Particles in fluid (like dropping color dye into water) take these patterns when you look at them under a microscope. As the particles collide and form trees, they create shapes and patterns like water crystals on glass. These visuals can give you a way of appreciating how beautiful mathematics is. The way mathematical theory can borrow from nature and how biological communities of living organisms themselves depend on physical laws shows how such an interdisciplinary approach provides a way to bridge different disciplines.

After about 20 minutes, the branches of the Brownian tree take form.

In the code, the particles are set to move with random velocities in two dimensions and, if they collide with the tree (a central particle at the beginning), they form parts of the tree. As the tree grows bigger over time, it takes the shapes of branches the same way neurons in the brain form trees that send signals between one another. These fractals, in their uniqueness, give them a kind of mathematical beauty.

Conway’s game of life represents another way something emerges from randomness.

Flashing lights coming and going away like stars shining in the sky are more than just randomness. These models of cellular interactions are known as cellular automaton. The gif above shows an example of Conway’s game of life, a simulation of how living cells interact with one another.

These cells “live” and “die” according to four simple rules: (1) live cells with fewer than two live neighbors die, as if by underpopulation, (2) live cells with two or three live neighbors live on to the next generation, (3) live cells with more than three live neighbors die, as if by overpopulation and (4) dead cells with exactly three live neighbors become live cells, as if by reproduction.

Conus textile shows a similar cellular automaton pattern on its shell.

Through these rules, specific shapes emerge such as “gliders” or “knightships” you can further describe with rules and equations. You’ll find natural versions of cells obeying rules like the colorful patterns on a seashell. Complex structures emerging from more basic, fundamental sets of rules unite these observations. While the beauty of these structures becomes more and more apparent from the patterns between different disciplines, searching for these patterns in other contexts can be more difficult such as human behavior.

Recent writing like Genesis: The Deep Origin of Societies by biologist E.O. Wilson take on the debate over how altruism in humans evolved. While the shape of snowflakes can emerge from the interactions between water molecules, humans getting along with one another seems far more complicated and higher-level. Though you can find similar cooperation in ants and termites creating societies, how did science let this happen?

Biologists have answered that organisms choose to mate with individuals and increase the survival chances of themselves and their offspring while passing on their genes. Though they’ve argued this for decades, Wilson offers a contrary point of view. In groups, selfish organisms defeat altruistic ones, but altruistic groups beat selfish groups overall. This group selection drives the emergence of altruism. Through these arguments, both sides have appealed to the mathematics of nature, showing its growing importance in recognizing the patterns of life.

Wilson clarifies that data analysis and mathematical modeling should come second to the biology itself. Becoming experts on organisms themselves should be a priority. Regardless of what form it takes, the beauty is still there, even if it’s below the surface.

How to Create Interactive Network Graphs (from Twitter or elsewhere) with Python

In this post, a gentle introduction to different Python packages will let you create network graphs users can interact with. Taking a few steps into graph theory, you can apply these methods to anything from links between the severity of terrorist attacks or the prices of taxi cabs. In this tutorial, you can use information from Twitter to make graphs anyone can appreciate.

The code for steps 1 and 2 can be found on GitHub here, and the code for the rest, here.

Table of contents:

  1. Get Started
  2. Extract Tweets and Followers
  3. Process the Data
  4. Create the Graph
  5. Evaluate the Graph
  6. Plot the Map

1. Get Started

Make sure you’re familiar with using a command line interface such as Terminal and you can download the necessary Python packages (chart-studio, matplotlib, networkx, pandas, plotly and python-twitter). You can use Anaconda to download them. This tutorial will introduce parts of the script you can run from the command line to extract tweets and visualize them.

If you don’t have a Twitter developer account, you’ll need to login here and get one. Then create an app and find your keys and secret codes for the consumer and access tokens. This lets you extract information from Twitter.

2. Extract Tweets and Followers

To extract Tweets, run the script below. In this example, the tweets of the UCSC Science Communication class of 2020 are analyzed (in screennames) so their Twitter handles are used. Replace the variables currently defined as None below with them. Keep these keys and codes safe and don’t share them with others. Set datadir to the output directory to store the data.

The code begins with import statements to use the required packages including json and os, which should come installed with Python.

import json
import os
import pickle
import twitter 

screennames = ["science_ari", "shussainather", "laragstreiff",                  "scatter_cushion", "jessekathan", "jackjlee",                 "erinmalsbury", "joetting13", "jonathanwosen",                 "heysmartash"] 

CONSUMER_KEY = None
CONSUMER_SECRET = None
ACCESS_TOKEN_KEY = None
ACCESS_TOKEN_SECRET = None

datadir = "data/twitter"

Extract the information we need. This code goes through each screen name and accesses their tweet and follower information. It then saves the data of both of them to output JSON and pickle files.

t = twitter.Api(consumer_key = CONSUMER_KEY,
consumer_secret = CONSUMER_SECRET,
access_token_key = ACCESS_TOKEN_KEY,
access_token_secret = ACCESS_TOKEN_SECRET)
for sn in screennames:
"""
For each user, get the followers and tweets and save them
to output pickle and JSON files.
"""
fo = datadir + "/" + sn + ".followers.pickle"
# Get the follower information.
fof = t.GetFollowers(screen_name = sn)
with open(fo, "w") as fofpickle:
pickle.dump(fof, fofpickle, protocol = 2)
with open(fo, "r") as fofpickle:
with open(fo.replace(".pickle", ".json"), "w") as fofjson:
fofdata = pickle.load(fofpickle)
json.dump(fofdata, fofjson) # Get the user's timeline with the 500 most recent tweets.
timeline = t.GetUserTimeline(screen_name=sn, count=500)
tweets = [i.AsDict() for i in timeline]
with open(datadir + "/" + sn + ".tweets.json", "w") as tweetsjson:
json.dump(tweets, tweetsjson) # Store the informtion in a JSON.

This should extract the follower and tweets and save them to pickle and JSON files in the datadir.

3. Process the Data

Now that you have an input JSON file of tweets, you can set it to the tweetsjson variable in the code below to read it as a pandas DataFrame.

For the rest of the tutorial, start a new script for convenience.

import json
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import re

from plotly.offline import iplot, plot
from operator import itemgetter

Use pandas to import the JSON file as a pandas DataFrame.

df = pd.read_json(tweetsjson)

Set tfinal as the final DataFrame to make.

tfinal = pd.DataFrame(columns = ["created_at", "id", "in_reply_to_screen_name", "in_reply_to_status_id", "in_reply_to_user_id", "retweeted_id", "retweeted_screen_name", "user_mentions_screen_name", "user_mentions_id", "text", "user_id", "screen_name", "followers_count"])

Then, extract the columns you’re interested in and add them to tfinal.

eqcol = ["created_at", "id", "text"]
tfinal[eqcol] = df[eqcol]
tfinal = filldf(tfinal)
tfinal = tfinal.where((pd.notnull(tfinal)), None)

Use the following functions to extract information from them. Each function extracts information form the input df DataFrame and adds it to the tfinal one.

First, get the basic information: screen name, user ID and how many followers.

def getbasics(tfinal):
"""
Get the basic information about the user.
"""
tfinal["screen_name"] = df["user"].apply(lambda x: x["screen_name"])
tfinal["user_id"] = df["user"].apply(lambda x: x["id"])
tfinal["followers_count"] = df["user"].apply(lambda x: x["followers_count"])
return tfinal

Then, get information on which tweets have been retweeted.

def getretweets(tfinal):
"""
Get retweets.
"""
# Inside the tag "retweeted_status" will find "user" and will get "screen name" and "id".
tfinal["retweeted_screen_name"] = df["retweeted_status"].apply(lambda x: x["user"]["screen_name"] if x is not np.nan else np.nan)
tfinal["retweeted_id"] = df["retweeted_status"].apply(lambda x: x["user"]["id_str"] if x is not np.nan else np.nan)
return tfinal

Figure out which tweets are replies and to who they are replying.

def getinreply(tfinal):
"""
Get reply info.
"""
# Just copy the "in_reply" columns to the new DataFrame.
tfinal["in_reply_to_screen_name"] = df["in_reply_to_screen_name"]
tfinal["in_reply_to_status_id"] = df["in_reply_to_status_id"]
tfinal["in_reply_to_user_id"]= df["in_reply_to_user_id"]
return tfinal

The following function runs each of these functions to get the information into tfinal.

def filldf(tfinal):
"""
Put it all together.
"""
getbasics(tfinal)
getretweets(tfinal)
getinreply(tfinal)
return tfinal

You’ll use this getinteractions() function in the next step when creating the graph. This takes the actual information from the tfinal DataFrame and puts it into the format that a graph can use.

def getinteractions(row): """ Get the interactions between different users. """ # From every row of the original DataFrame. # First we obtain the "user_id" and "screen_name". user = row["user_id"], row["screen_name"] # Be careful if there is no user id. if user[0] is None: return (None, None), []

For the remainder of the for loop, get the information if it’s there.

    # The interactions are going to be a set of tuples.
    interactions = set()

    # Add all interactions. 
    # First, we add the interactions corresponding to replies adding 
    # the id and screen_name.
    interactions.add((row["in_reply_to_user_id"], 
    row["in_reply_to_screen_name"]))
    # After that, we add the interactions with retweets.
    interactions.add((row["retweeted_id"], 
    row["retweeted_screen_name"]))
    # And later, the interactions with user mentions.
    interactions.add((row["user_mentions_id"], 
    row["user_mentions_screen_name"]))

    # Discard if user id is in interactions.
    interactions.discard((row["user_id"], row["screen_name"]))
    # Discard all not existing values.
    interactions.discard((None, None))
    # Return user and interactions.
    return user, interactions

4. Create the Graph

Initialize the graph with networkx.

graph = nx.Graph()

Loop through the tfinal DataFrame and get the interaction information. Use the getinteractions function to get each user and interaction involved with each tweet.

for index, tweet in tfinal.iterrows():
user, interactions = getinteractions(tweet)
user_id, user_name = user
tweet_id = tweet["id"]
for interaction in interactions:
int_id, int_name = interaction
graph.add_edge(user_id, int_id, tweet_id=tweet_id)
graph.node[user_id]["name"] = user_name
graph.node[int_id]["name"] = int_name

5. Evaluate the Graph

In the field of social network analysis (SNA), researchers use measurements of nodes and edges to tell what graphs re like. This lets you separate the signal from noise when looking at network graphs.

First, look at the degrees and edges of the graph. The print statements should print out the information about these measurements.

degrees = [val for (node, val) in graph.degree()]
print("The maximum degree of the graph is " + str(np.max(degrees)))
print("The minimum degree of the graph is " + str(np.min(degrees)))
print("There are " + str(graph.number_of_nodes()) + " nodes and " + str(graph.number_of_edges()) + " edges present in the graph")
print("The average degree of the nodes in the graph is " + str(np.mean(degrees)))

Are all the nodes connected?

if nx.is_connected(graph):
print("The graph is connected")
else:
print("The graph is not connected")
print("There are " + str(nx.number_connected_components(graph)) + " connected in the graph.")

Information about the largest subgraph can tell you what sort of tweets represent the majority.

largestsubgraph = max(nx.connected_component_subgraphs(graph), key=len)
print("There are " + str(largestsubgraph.number_of_nodes()) + " nodes and " + str(largestsubgraph.number_of_edges()) + " edges present in the largest component of the graph.")

The clustering coefficient tells you how close together the nodes congregate using the density of the connections surrounding a node. If many nodes are connected in a small area, there will be a high clustering coefficient.

print("The average clustering coefficient is " + str(nx.average_clustering(largestsubgraph)) + " in the largest subgraph")
print("The transitivity of the largest subgraph is " + str(nx.transitivity(largestsubgraph)))
print("The diameter of our graph is " + str(nx.diameter(largestsubgraph)))
print("The average distance between any two nodes is " + str(nx.average_shortest_path_length(largestsubgraph)))

Centrality tells you how many direct, “one step,” connections each node has to other nodes in the network, and there are two ways to measure it. “Betweenness centrality” represents which nodes act as “bridges” between nodes in a network by finding the shortest paths and counting how many times each node falls on one. “Closeness centrality,” instead, scores each node based on the sum of the shortest paths.

graphcentrality = nx.degree_centrality(largestsubgraph)
maxde = max(graphcentrality.items(), key=itemgetter(1))
graphcloseness = nx.closeness_centrality(largestsubgraph)
graphbetweenness = nx.betweenness_centrality(largestsubgraph, normalized=True, endpoints=False)
maxclo = max(graphcloseness.items(), key=itemgetter(1))
maxbet = max(graphbetweenness.items(), key=itemgetter(1))

print("The node with ID " + str(maxde[0]) + " has a degree centrality of " + str(maxde[1]) + " which is the max of the graph.")
print("The node with ID " + str(maxclo[0]) + " has a closeness centrality of " + str(maxclo[1]) + " which is the max of the graph.")
print("The node with ID " + str(maxbet[0]) + " has a betweenness centrality of " + str(maxbet[1]) + " which is the max of the graph.")

6. Plot the Map

Get the edges and store them in lists Xe and Ye in the x- and y-directions.

Xe=[]
Ye=[]
for e in G.edges():
Xe.extend([pos[e[0]][0], pos[e[1]][0], None])
Ye.extend([pos[e[0]][1], pos[e[1]][1], None])

Define the Plotly “trace” for nodes and edges. Plotly uses these traces as a way of storing the graph data right before it’s plotted.

trace_nodes = dict(type="scatter",
                 x=Xn, 
                 y=Yn,
                 mode="markers",
                 marker=dict(size=28, color="rgb(0,240,0)"),
                 text=labels,
                 hoverinfo="text")

trace_edges = dict(type="scatter",                  
                 mode="lines",                  
                 x=Xe,                  
                 y=Ye,                 
                 line=dict(width=1, color="rgb(25,25,25)"),                                         hoverinfo="none")

Plot the graph with the Fruchterman-Reingold layout algorithm. This image shows an example of a graph plotted with this algorithm, designed to provide clear, explicit ways the nodes are connected.

The force-directed Fruchterman-Reingold algorithm to draw nodes in an understandable way.

pos = nx.fruchterman_reingold_layout(G)

Use the axis and layout variables to customize what appears on the graph. Using the showline=False, option, you will hide the axis line, grid, tick labels and title of the graph. Then the fig variable creates the actual figure.

axis = dict(showline=False,
zeroline=False,
showgrid=False,
showticklabels=False,
title=""
)


layout = dict(title= "My Graph",
font= dict(family="Balto"),
width=600,
height=600,
autosize=False,
showlegend=False,
xaxis=axis,
yaxis=axis,
margin=dict(
l=40,
r=40,
b=85,
t=100,
pad=0,
),
hovermode="closest",
plot_bgcolor="#EFECEA", # Set background color.
)


fig = dict(data=[trace_edges, trace_nodes], layout=layout)

Annotate with the information you want others to on each node. Use the labels variable to list (with the same length as pos) what should appear as an annotation.

labels = range(len(pos))

def make_annotations(pos, anno_text, font_size=14, font_color="rgb(10,10,10)"):
L=len(pos)
if len(anno_text)!=L:
raise ValueError("The lists pos and text must have the same len")
annotations = []
for k in range(L):
annotations.append(dict(text=anno_text[k],
x=pos[k][0],
y=pos[k][1]+0.075,#this additional value is chosen by trial and error
xref="x1", yref="y1",
font=dict(color= font_color, size=font_size),
showarrow=False)
)
return annotations
fig["layout"].update(annotations=make_annotations(pos, labels))

Finally, plot.

iplot(fig)

An example graph

Make a word cloud in a single line of Python

Moby-Dick, visualized

This is a concise way to make a word cloud using Python. It can teach you basics of coding while creating a nice graphic.

It’s actually four lines of code, but making the word cloud only takes one line, the final one.

import nltk
from wordcloud import WordCloud
nltk.download("stopwords")
WordCloud(background_color="white", max_words=5000, contour_width=3, contour_color="steelblue").generate_from_text(" ".join([r for r in open("mobydick.txt", "r").read().split() if r not in set(nltk.corpus.stopwords.words("english"))])).to_file("wordcloud.png")

Just tell me what to do now!

The first two lines lines specify the required packages you must download with these links: nltk and wordcloud. You may also try these links: nltk and wordcloud to download them. The third line downloads the stop words (common words like “the”, “a” and “in”) that you don’t want in your word cloud.

The fourth line is complicated. Calling the WordCloud() method, you can specify the background color, contour color and other options (found here). generate_from_text() takes a string of words to put in the word cloud.

The " ".join() creates this string of words separated by spaces from a list of words. The for loop in the square brackets[] creates this list of each word from the input file (in this case, mobydick.txt) with the r variable letting you use each word one at a time in the list.

The input file is open(), read() and split() into its words under the condition (using if) they aren’t in nltk.corpus.stopwords.words("english"). Finally, to_file() saves the image as wordcloud.png.

How to use this code

In the code, change "mobydick.txt" to the name of your text file (keep the quotation marks). Save the code in a file makewordcloud.py in the text file’s directory, and use a command line interface (such as Terminal) to navigate to the directory.

Run your script using python makewordcloud.py, and check out your wordcloud.png!

Global Mapping of Critical Minerals

The periodic table below illustrates the global abundance of critical minerals in the Earth’s crust in parts per million (ppm). Hover over each element to view! Lanthanides and actinides are omitted due to lack of available data.

Data is obtained from the USGS handbook “Critical Mineral Resources of the United States— Economic and Environmental Geology and Prospects for Future Supply.” The code used is found here.

Bokeh Plot

Because these minerals tend to concentrate in specific countries like niobium in Brazil or antimony in China and remain central to many areas of society such as national defense or engineering, governments like the US have come forward with listing these minerals as “critical.”

The abundance across different continents is shown in the map above.

You can find gallium, the most abundant of the critical minerals, in place of aluminum and zinc, elements smaller than gallium. Processing bauxite ore or sphalerite ore (from the sediment-hosted, Mississippi Valley-type and volcanogenic massive sulfide) of zinc yield gallium. The US meets its gallium needs through primary, recycled and refined forms of the element.

German and indium have uses in electronics, flat-panel display screens, light-emitting diodes (LEDs) and solar power arrays. China, Belgium, Canada, Japan and South Korea are the main producers of indium while germanium production can be more complicated. In many cases, countries import primary germanium from other ones, such as Canada importing from the US or Finland from the Democratic Republic of the Congo, to recover them.

Rechargable battery cathodes and jet aircraft turbine engines make use of cobalt. While the element is the central atom in vitamin B12, excess and overexposure can cause lung and heart dysfunction and dermatitis.

As one of only three countries that processes beryllium into products, the US doesn’t put much time or money into exploring for new deposits within its own borders because a single producer dominates the domestic berllyium market. Beryllium finds uses in magnetic resonance imaging (MRI) and medical lasers.

A Deep Learning Overview with Python

This course proposes a quick introduction to deep learning and two of its major networks, convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The purpose is to give an intuitive sense of how to implement deep learning approaches for various tasks. To use this iPython notebook, run the python code in separate files for each cell. The content below each cell of this notebook is the output for running those cells.

Simple perceptron

In [1]:
import numpy as np

# sigmoid function
def sigmoid(x,deriv=False):
    if(deriv==True):
        return x*(1-x)
    return 1/(1+np.exp(-x))
    
# input dataset
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
    
# output dataset            
y = np.array([[0,0,1,1]]).T

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
syn0 = 2*np.random.random((3,1)) - 1

for j in range(100000):

    # forward propagation
    l0 = X
    l1 = sigmoid(np.dot(l0,syn0))

    # how much did we miss?
    l1_error = y - l1
    if (j% 10000) == 0:
        print("Error:" + str(np.mean(np.abs(l1_error))))

    # multiply how much we missed by the 
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * sigmoid(l1,True)

    # update weights
    syn0 += np.dot(l0.T,l1_delta)

print()
print("Prediction after Training:")
print(l1)
Error:0.517208275438
Error:0.00795484506673
Error:0.0055978239634
Error:0.00456086918013
Error:0.00394482243339
Error:0.00352530883742
Error:0.00321610234673
Error:0.00297605968522
Error:0.00278274003022
Error:0.0026227273927

Prediction after Training:
[[ 0.00301758]
 [ 0.00246109]
 [ 0.99799161]
 [ 0.99753723]]

What is the loss function here? How is it calculated?

Any idea how it would perform on non-linearly separable data? How could we test it?

Multilayer perceptron

Let’s use the fact that the sigmoid is differenciable (while the step function we saw in the slides is not). This allows us to add more layers (hence more modelling power).

In [2]:
import numpy as np

def sigmoid(x,deriv=False):
	if(deriv==True):
	    return x*(1-x)

	return 1/(1+np.exp(-x))
    
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
                
y = np.array([[0],
			  [1],
			  [1],
			  [0]])

np.random.seed(1)

# randomly initialize our weights with mean 0
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1

for j in range(100000):

	# Feed forward through layers 0, 1, and 2
    l0 = X
    l1 = sigmoid(np.dot(l0,syn0))
    l2 = sigmoid(np.dot(l1,syn1))

    # how much did we miss the target value?
    l2_error = y - l2
    
    if (j% 10000) == 0:
        print("Error:" + str(np.mean(np.abs(l2_error))))
        
    # in what direction is the target value?
    # were we really sure? if so, don't change too much.
    l2_delta = l2_error*sigmoid(l2,deriv=True)

    # how much did each l1 value contribute to the l2 error (according to the weights)?
    l1_error = l2_delta.dot(syn1.T)
    
    # in what direction is the target l1?
    # were we really sure? if so, don't change too much.
    l1_delta = l1_error * sigmoid(l1,deriv=True)

    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
    
print()
print(l2)
Error:0.496410031903
Error:0.00858452565325
Error:0.00578945986251
Error:0.00462917677677
Error:0.00395876528027
Error:0.00351012256786
Error:0.00318350238587
Error:0.00293230634228
Error:0.00273150641821
Error:0.00256631724004

[[ 0.00199094]
 [ 0.99751458]
 [ 0.99771098]
 [ 0.00294418]]

Setting up the environment

We have done toy examples for feedforward networks. Things quickly become complicated, so let’s go deeper by relying on high-level frameworks: TensorFlow and Keras. Most technicalities are thus avoided so that you can directly play with networks.

In [ ]:
!conda install tensorflow keras
In [3]:
import tensorflow as tf
import keras
/Users/syedather/.local/lib/python3.6/site-packages/matplotlib/__init__.py:1067: UserWarning: Duplicate key in file "/Users/syedather/.matplotlib/matplotlibrc", line #2
  (fname, cnt))
Using TensorFlow backend.
In [4]:
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
b'Hello, TensorFlow!'

CNNs

We are going to use the MNIST dataset for our first task. The code below loads the dataset and shows one training example and its label.

In [5]:
from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
from pylab import *

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print("The first training instance is labeled as: "+str(y_train[0]))
The first training instance is labeled as: 5
In [6]:
figure(1)
imshow(x_train[0], interpolation='nearest')
Out[6]:
<matplotlib.image.AxesImage at 0x1259b2320>

Now study the following code. What is the network we use? How many layers? What hyper parameters?

In [7]:
# Setup some hyper parameters
batch_size = 128
num_classes = 10
epochs = 15

# input image dimensions
img_rows, img_cols = 28, 28

# This is some technicality regarding Keras' dataset
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# We convert the matrices to floats as we will use real numbers
x_train = x_train.astype('float32')[:1000]
x_test = x_test.astype('float32')[:200]
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)[:1000]
y_test = keras.utils.to_categorical(y_test, num_classes)[:200]


# Build network
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
# model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

# Train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

# Evaluate on test data
score = model.evaluate(x_test, y_test, verbose=0)
print()
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Evaluate on training data
score = model.evaluate(x_train, y_train, verbose=0)
print()
print('Train loss:', score[0])
print('Train accuracy:', score[1])
x_train shape: (1000, 28, 28, 1)
1000 train samples
200 test samples
Train on 1000 samples, validate on 200 samples
Epoch 1/15
1000/1000 [==============================] - 4s 4ms/step - loss: 1.7244 - acc: 0.5660 - val_loss: 0.9116 - val_acc: 0.7900
Epoch 2/15
1000/1000 [==============================] - 4s 4ms/step - loss: 0.5967 - acc: 0.8320 - val_loss: 0.5148 - val_acc: 0.8100
Epoch 3/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.4394 - acc: 0.8670 - val_loss: 0.3056 - val_acc: 0.8600
Epoch 4/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.3296 - acc: 0.9050 - val_loss: 0.3263 - val_acc: 0.9000
Epoch 5/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.2205 - acc: 0.9360 - val_loss: 0.2092 - val_acc: 0.9200
Epoch 6/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.1684 - acc: 0.9560 - val_loss: 0.1870 - val_acc: 0.9450
Epoch 7/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.1325 - acc: 0.9690 - val_loss: 0.1597 - val_acc: 0.9350
Epoch 8/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.0990 - acc: 0.9740 - val_loss: 0.1617 - val_acc: 0.9400
Epoch 9/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.0636 - acc: 0.9840 - val_loss: 0.1434 - val_acc: 0.9450
Epoch 10/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.0393 - acc: 0.9960 - val_loss: 0.1545 - val_acc: 0.9400
Epoch 11/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.0267 - acc: 0.9950 - val_loss: 0.1444 - val_acc: 0.9400
Epoch 12/15
1000/1000 [==============================] - 4s 4ms/step - loss: 0.0158 - acc: 1.0000 - val_loss: 0.1642 - val_acc: 0.9350
Epoch 13/15
1000/1000 [==============================] - 3s 3ms/step - loss: 0.0090 - acc: 1.0000 - val_loss: 0.1475 - val_acc: 0.9450
Epoch 14/15
1000/1000 [==============================] - 4s 4ms/step - loss: 0.0057 - acc: 1.0000 - val_loss: 0.1556 - val_acc: 0.9350
Epoch 15/15
1000/1000 [==============================] - 4s 4ms/step - loss: 0.0041 - acc: 1.0000 - val_loss: 0.1651 - val_acc: 0.9350

Test loss: 0.165074422359
Test accuracy: 0.935

Train loss: 0.00311407446489
Train accuracy: 1.0

Is there anything wrong here?

How do you think a linear classifier performs?

In [8]:
# Setup some hyper parameters
batch_size = 128
num_classes = 10
epochs = 15

# input image dimensions
img_rows, img_cols = 28, 28

# This is some technicality regarding Keras' dataset
if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# We convert the matrices to floats as we will use real numbers
x_train = x_train.astype('float32')[:1000]
x_test = x_test.astype('float32')[:200]
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)[:1000]
y_test = keras.utils.to_categorical(y_test, num_classes)[:200]


# Build network
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adam(),
              metrics=['accuracy'])

# Train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

# Evaluate on test data
score = model.evaluate(x_test, y_test, verbose=0)
print()
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Evaluate on training data
score = model.evaluate(x_train, y_train, verbose=0)
print()
print('Train loss:', score[0])
print('Train accuracy:', score[1])
x_train shape: (1000, 28, 28, 1)
1000 train samples
200 test samples
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-a1470fe28059> in <module>()
     53           epochs=epochs,
     54           verbose=1,
---> 55           validation_data=(x_test, y_test))
     56 
     57 # Evaluate on test data

~/anaconda3/lib/python3.6/site-packages/keras/models.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
    961                               initial_epoch=initial_epoch,
    962                               steps_per_epoch=steps_per_epoch,
--> 963                               validation_steps=validation_steps)
    964 
    965     def evaluate(self, x=None, y=None,

~/anaconda3/lib/python3.6/site-packages/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, **kwargs)
   1628             sample_weight=sample_weight,
   1629             class_weight=class_weight,
-> 1630             batch_size=batch_size)
   1631         # Prepare validation data.
   1632         do_validation = False

~/anaconda3/lib/python3.6/site-packages/keras/engine/training.py in _standardize_user_data(self, x, y, sample_weight, class_weight, check_array_lengths, batch_size)
   1478                                     output_shapes,
   1479                                     check_batch_axis=False,
-> 1480                                     exception_prefix='target')
   1481         sample_weights = _standardize_sample_weights(sample_weight,
   1482                                                      self._feed_output_names)

~/anaconda3/lib/python3.6/site-packages/keras/engine/training.py in _standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
    111                         ': expected ' + names[i] + ' to have ' +
    112                         str(len(shape)) + ' dimensions, but got array '
--> 113                         'with shape ' + str(data_shape))
    114                 if not check_batch_axis:
    115                     data_shape = data_shape[1:]

ValueError: Error when checking target: expected dense_4 to have 2 dimensions, but got array with shape (1000, 10, 10)

Let’s use this model to predict a value for the first training instance we vizualized.

In [ ]:
print(model.predict(np.expand_dims(x_train[0], axis=0)))

Is the model correct here? What is the output of the network?

RNNs

We will now switch to RNNs. These require more resources, so we can’t do the fanciest applications during the workshop. We will do some sentiment classification of movie reviews.

In [9]:
from __future__ import print_function
import numpy as np
import keras
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Embedding, LSTM, Bidirectional
from keras.datasets import imdb

# Number of considered words, based on frequencies
max_features = 20000
# cut texts after this number of words
maxlen = 100
batch_size = 32

print('Loading data...')
(x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=max_features, index_from=3)

# This is just for pretty printing the sentences...
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+3) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value:key for key,value in word_to_id.items()}

print("Here's the input for the first training instance:")
print(' '.join(id_to_word[id] for id in x_train[0] ))
Loading data...
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz
17465344/17464789 [==============================] - 2s 0us/step
Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step
Here's the input for the first training instance:
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all

What do you think about this text? Is it a positive or negative review?

In [10]:
print("Here are the dataset shapes")
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print("And the input for the first instance is represented as:")
print(x_train[0])
Here are the dataset shapes
25000 train sequences
25000 test sequences
And the input for the first instance is represented as:
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

What do these numbers represent? Is there any limitation you can imagine coming from this?

In [11]:
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)[:5000]
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)[:5000]
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
y_train = np.array(y_train)[:5000]
y_test = np.array(y_test)[:5000]

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))

model.compile('adam', 'binary_crossentropy', metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=4,
          validation_data=[x_test, y_test])
Pad sequences (samples x time)
x_train shape: (5000, 100)
x_test shape: (5000, 100)
Train...
Train on 5000 samples, validate on 5000 samples
Epoch 1/4
5000/5000 [==============================] - 54s 11ms/step - loss: 0.6032 - acc: 0.6570 - val_loss: 0.4283 - val_acc: 0.8056
Epoch 2/4
5000/5000 [==============================] - 54s 11ms/step - loss: 0.2761 - acc: 0.8918 - val_loss: 0.4403 - val_acc: 0.7948
Epoch 3/4
5000/5000 [==============================] - 61s 12ms/step - loss: 0.1101 - acc: 0.9670 - val_loss: 0.6366 - val_acc: 0.8026
Epoch 4/4
5000/5000 [==============================] - 56s 11ms/step - loss: 0.0478 - acc: 0.9868 - val_loss: 0.6637 - val_acc: 0.7954
Out[11]:
<keras.callbacks.History at 0x1392d76d8>
In [12]:
print("The neural net predicts that the first instance sentiment is:")
print(model.predict(np.expand_dims(x_train[0], axis=0)))
The neural net predicts that the first instance sentiment is:
[[ 0.99445081]]

Remarks? Comments?

How do the training scores compare to the test scores? How can we improve this? What are the current limitations?

This RNN use case takes more time to train but it is definitely more impressive. We will model the language, by training on a novel. For each (set of) word(s) in the novel, the objective is to predict the following word. This can be done on any text, and we don’t need annotated data – the text itself is enough.

Have a look at the following piece of code and try to understand what it does. Then, run it and see the network generating text! At first, the output is not meaningful, but it becomes so over time. This is the magic I was referring to.

Beware: this will take longer to run on a CPU. A GPU is recommended, but you can still try to run it for a while to see the predictions evolve. On my laptop, an epoch takes 6mins so the full training takes 6hrs. About 20 epochs are required for the generated text to be somewhat meaningful.

Note, however, that although this seems long, training actual deep learning models for concrete tasks takes days, even on multiple GPUs. This is mostly because of the data size and the much deeper networks.

In [ ]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils.data_utils import get_file
import numpy as np
import random
import sys
import io

# We load a text from Nietzsche
path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')
with io.open(path, encoding='utf-8') as f:
    text = f.read().lower()
print('corpus length:', len(text))

# We create dictionaries of character > index and the other way around
chars = sorted(list(set(text)))
print('total chars:', len(chars))
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

# cut the text in semi-redundant sequences of maxlen characters
maxlen = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

print('Vectorization...')
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1


# build the model: a single LSTM
print('Build model...')
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dense(len(chars)))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)


def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)


def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

model.fit(x, y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])
Downloading data from https://s3.amazonaws.com/text-datasets/nietzsche.txt
606208/600901 [==============================] - 0s 0us/step
corpus length: 600893
total chars: 57
nb sequences: 200285
Vectorization...
Build model...
Epoch 1/60
200285/200285 [==============================] - 281s 1ms/step - loss: 1.9553

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "to
agree with many people. "good" is no "
to
agree with many people. "good" is no and it is the the of the same the of the sention of the strenge of the most the self-our of the inderent that the sensive indeed the one of the constitute of the most of the semple of the desire of the sensive of the most of the semple of the sempathy of the one of the into the every to a soul of the some of the persent the free of the semple of the most of the sention of the of the spiritual the 
----- diversity: 0.5
----- Generating with seed: "to
agree with many people. "good" is no "
to
agree with many people. "good" is no may a suptimes and also orage mankind the one of indeed of one streng the possible the sensition and the inderenation of a sul the in a sould be the orting a solitiarity of religions in a man of such and a scient, in every of and the self-to and of a revilued it is the most in the indeed, and it is assual that the ord of the of the distiture in its all the manter of the soul permans the decours of
----- diversity: 1.0
----- Generating with seed: "to
agree with many people. "good" is no "
to
agree with many people. "good" is no causest and hew the fown of every groktulr
destined a the art it noteriness of one it all and
and cothinded of that rendercaterfroe to doe," in the pational the is the onl yutre
allor upitsoon,--one
viburan mused a "master in the that niver if
a pridicle quesiles of
the shoold enss nowxing to
feef ma.t--wute disequerly that then her rewadd finale the eeblive alse rusurefver" a selovery catte he re
----- diversity: 1.2
----- Generating with seed: "to
agree with many people. "good" is no "
to
agree with many people. "good" is no likeurenes, it is novamentstisuser'stone, indos paces. fund, wethel feel the
que let doee new eveny that is that the catel. thotgy is
within ceoks of theregeritades) and itwas brutmes ageteron
clyrelogilabl freephi; its. by an? andaver happ
one of his absuman artificss? itself old a
ooker himsood and bus hray
fined in smuch is sudtirers of rerarder from and
afutty
mest utfered with to "bewnook one
Epoch 2/60
 81664/200285 [===========>..................] - ETA: 2:37 - loss: 1.6395

Web Scraping with Python Made Easy

Imagine you run a business selling shoes online and wanted to monitor how your competitors price their products. You could spend hours a day clicking through page after page or write a script for a web bot, an automated piece of software that keeps track a site’s updates. That’s where web scraping comes in.

Scraping websites lets you extract information from hundreds or thousands of webpages at once. You can search websites like Indeed for job opportunities or Twitter for tweets. In this gentle introduction to web scraping, we’ll go over the basic code to scrape websites such that anyone, regardless of background, can extract and analyze these kinds of results.

Getting Started

Using my GitHub repository on web scraping, you can install the software and run the scripts as instructed. Click on the src directory on the repository page to see the README.md file that explains each script and how to run them.

Examining the Site

You can use a sitemap file to located where websites upload content without crawling every single web page. Here’s a sample one. You can also find out how large a site is and how much information you can actually extract from it. You can search a site using Google’s Advanced Search to figure out how many pages you may need to scrape. This will come in handy when creating a web scraper that may need to pause for updates or act in a different manner after reaching a certain number of pages.

You can also run the identify.py script in the src directory to figure out more information bout how each site was built. This should give info about the frameworks, programming languages, and servers used in building each website as well as the registered owner for the domain. This also uses robotparser to check for restrictions.

Many websites have a robots.txt file with crawling restrictions. Make sure you check out this file for a website for more information about how to crawl a website or any rules that you should follow. The sample protocol can be found here.

Crawling a Site

There are three general approaches to crawling a site: Crawling a sitemap, Iterating through an ID for each webpage, and following webpage links. download.py shows how to download a webpage with methods of sitemap crawling, results.py shows you how to scrape those results while iterating through webpage IDs, and indeedScrape.py uses the webpage links for crawling. download.py also contains information on inserting delays, returning a list of links from HTML, and supporting proxies that can let you access websites through blocked requests.

Scraping the Data

In the file compare.py, you can compare the efficiency of the three web scraping methods.

You can use regular expressions (known as regex or regexp) to perform neat tricks with text for getting information from websites. The script regex.py shows how this is done.

You can also use the browser extension Firebug Lite to get information from a webpage. In Chrome, you can click View >> Developer >> View Source to get the source behind a webpage.

Beautiful Soup, one of the requried packages to run indeedScrape.py, parses a webpage and provides a convenient interface to navigate the content, as shown in bs4test.py. Lxml also does this in lxmltest.py. A comparison of these three scraping methods are in the following table.

Scraping methodPerformanceEase of useEase of install
RegexFastHardEasy
Beautiful SoupSlowEasyEasy
lxmlFastEasyHard

The callback.py script lets you scrape data and save it to an output .csv file.

Caching Downloads

Caching crawled webpages lets you store them in a manageablae format while only having to download them once. In download.py, there’s a python class Downloader that shows how to cache URLs after downloading their webpages. cache.py has a python class that maps a URL to a filename when caching.

Depending on which operating system you’re using, there’s a limit to how much you can cache.

Operating systemFile systemInvalid filename charactersMax filename length
LinuxExt3/Ext4/, \0255 bytes
OS X HFS Plus:, \0255 UT-16 code units
WindowsNTFS \, /, ?, :, *, >, <, |255 characters

Though cache.py is easy to use, you can take the hash of the URL itself to use as the filename to ensure your files directly map to the URLs of the saved cache. Using MongoDB, you can build ontop of the current file system database system and avoid the file system limitations. This method is found in mongocache.py using pymongo, a Python wrapper for MongoDB.

Test out the other scripts such as alexacb.py for downloading information on the top sites by Alexa ranking. mongoqueue.py has functionality for queueing the MongoDB inquiries that can be imported to other scripts.

You can work with dynamic webpages using the code from browserrender.py. The majority of leading websites using JavScript for functionality, meaning you can’t view all their content in barebones HTML.

A Comparison of Copper in the U.S.

As humanity’s oldest metal, copper comes in many forms. People have used copper for thousands of years. When the ancient Romans mined the element “cyprium” from Cyprus, the metal soon became known in English as “copper.” 

Copper is produced and consumed in many forms, from the lining of electrical motors to the coating of pennies. Thanks to its high thermal and electrical conductivity, the material is frequently used in telecommunication technologies and as a building material.

The process of copper production includes mining, refining, smelting, and electrowinning. Through smelting and electrolytic refining, engineers and scientists transform mined ores to copper cathodes. Cathodes are thin sheets of pure copper used as raw material for processing the metal into high-quality products. 

Using data available to the public from the U.S. Geological Survey, the copper market has changed to society’s needs over the past years. 

The four major types of copper are mined copper, secondary copper, refined copper and refined electrowon copper. Secondary copper comes from recycled and scrap materials such as tubes, sheets, cables, radiators and castings, as well as from residues like dust or slag. 

Engineers and scientists transform mined pure copper metal and copper from concentrated low-grade ores through smelting and electrolytic refining in creating copper cathodes. Acid leaching of oxidized ores produces more copper.

Thanks to the chemical and physical properties of copper, the material is suitable for electrical and thermal conductivity. Copper’s high ductility and malleability give it key roles in industrial applications of coil wining, power transmission and generation and telecommunication technologies.

The different methods of processing copper have remained constant for the most part between 1990 and 2010. The data is from “U.S. Mineral Dependence—Statistical Compilation of U.S. and World Mineral Production, Consumption, and Trade, 1990–2010” by James J. Barry, Grecia R. Matos and W. David Menzie. The rise in refined copper reflects market trends for the rising demand for refined copper, according to a report in Mining.com. Oxide and sulfur ores generally have between 0.5 and 2.0% copper. The process involves concentrating the ore to remove gangue and other materials.


Differences between reported and apparent processed copper consumption in the U.S. have decreased from 2005 to 2009. Copper consumption itself has dropped.

The various types of copper produced by the U.S. have remained constant over the time period. 

Mined copper has remained the dominant copper produced around the world, though refined copper has come close or equal to it from 1996 to 2001. Refined electrowon copper has steadily surpassed secondary copper over the time period, too.