1. Winter is Coming. Let's load the dataset ASAP

If you haven't heard of Game of Thrones, then you must be really good at hiding. Game of Thrones is the hugely popular television series by HBO based on the (also) hugely popular book series A Song of Ice and Fire by George R.R. Martin. In this notebook, we will analyze the co-occurrence network of the characters in the Game of Thrones books. Here, two characters are considered to co-occur if their names appear in the vicinity of 15 words from one another in the books.

This dataset constitutes a network and is given as a text file describing the edges between characters, with some attributes attached to each edge. Let's start by loading in the data for the first book A Game of Thrones and inspect it.

In [181]:
# Importing modules
# ... YOUR CODE HERE ...
import pandas as pd

# Reading in datasets/book1.csv
book1 = pd.read_csv('datasets/book1.csv')

# Printing out the head of the dataset
# ... YOUR CODE HERE ...
print(book1.head())
                            Source              Target        Type  weight  \
0                   Addam-Marbrand     Jaime-Lannister  Undirected       3   
1                   Addam-Marbrand     Tywin-Lannister  Undirected       6   
2                Aegon-I-Targaryen  Daenerys-Targaryen  Undirected       5   
3                Aegon-I-Targaryen        Eddard-Stark  Undirected       4   
4  Aemon-Targaryen-(Maester-Aemon)      Alliser-Thorne  Undirected       4   

   book  
0     1  
1     1  
2     1  
3     1  
4     1  
In [182]:
%%nose

def test_task1():
    assert len(book1) == 684, \
    'Make sure you have imported the correct book, i.e book1.csv'
Out[182]:
1/1 tests passed

2. Time for some Network of Thrones

The resulting DataFrame book1 has 5 columns: Source, Target, Type, weight, and book. Source and target are the two nodes that are linked by an edge. A network can have directed or undirected edges and in this network all the edges are undirected. The weight attribute of every edge tells us the number of interactions that the characters have had over the book, and the book column tells us the book number.

Once we have the data loaded as a pandas DataFrame, it's time to create a network. We will use networkx, a network analysis library, and create a graph object for the first book.

In [183]:
# Importing modules
# ... YOUR CODE FOR TASK 2 HERE ...
import networkx as nx

# Creating an empty graph object
G_book1 = nx.Graph()
In [184]:
%%nose

import networkx as nx

def test_type_of_graph():
    assert isinstance(G_book1, type(nx.Graph())) , \
    'Make sure you have created an object of graph type'
Out[184]:
1/1 tests passed

3. Populate the network with the DataFrame

Currently, the graph object G_book1 is empty. Let's now populate it with the edges from book1. And while we're at it, let's load in the rest of the books too!

In [185]:
# Iterating through the DataFrame to add edges
# ... YOUR CODE HERE ...
for _, edge in book1.iterrows():
    G_book1.add_edge(edge["Source"], edge["Target"], weight=edge["weight"])

# Creating a list of networks for all the books
books = [G_book1]
book_fnames = ['datasets/book2.csv', 'datasets/book3.csv', 'datasets/book4.csv', 'datasets/book5.csv']
for book_fname in book_fnames:
    book = pd.read_csv(book_fname)
    G_book = nx.Graph()
    for _, edge in book.iterrows():
        G_book.add_edge(edge['Source'], edge['Target'], weight=edge['weight'])
    books.append(G_book)
In [186]:
%%nose

def test_placeholder():
    assert len(G_book1) == 187, \
    'Make sure that you have iterated through the DataFrame'
Out[186]:
1/1 tests passed

4. Finding the most important character in Game of Thrones

Is it Jon Snow, Tyrion, Daenerys, or someone else? Let's see! Network Science offers us many different metrics to measure the importance of a node in a network. Note that there is no "correct" way of calculating the most important node in a network, every metric has a different meaning.

First, let's measure the importance of a node in a network by looking at the number of neighbors it has, that is, the number of nodes it is connected to. For example, an influential account on Twitter, where the follower-followee relationship forms the network, is an account which has a high number of followers. This measure of importance is called degree centrality.

Using this measure, let's extract the top ten important characters from the first book (book[0]) and the fifth book (book[4]).

In [187]:
# Calculating the degree centrality of book 1
deg_cen_book1 = nx.degree_centrality(books[0])

# Calculating the degree centrality of book 5
deg_cen_book5 = nx.degree_centrality(books[4])

# Sorting the dictionaries according to their degree centrality and storing the top 10
sorted_deg_cen_book1 = sorted(deg_cen_book1.items(), key=lambda x: x[1], reverse=True)

# Sorting the dictionaries according to their degree centrality and storing the top 10
sorted_deg_cen_book5 = sorted(deg_cen_book1.items(), key=lambda x: x[1], reverse=True)

# Printing out the top 10 of book1 and book5
# ... YOUR CODE FOR TASK 4 ...
print(sorted_deg_cen_book1[:10])
print(sorted_deg_cen_book5[:10])
[('Eddard-Stark', 0.3548387096774194), ('Robert-Baratheon', 0.2688172043010753), ('Tyrion-Lannister', 0.24731182795698928), ('Catelyn-Stark', 0.23118279569892475), ('Jon-Snow', 0.19892473118279572), ('Sansa-Stark', 0.18817204301075272), ('Robb-Stark', 0.18817204301075272), ('Bran-Stark', 0.17204301075268819), ('Joffrey-Baratheon', 0.16129032258064518), ('Cersei-Lannister', 0.16129032258064518)]
[('Eddard-Stark', 0.3548387096774194), ('Robert-Baratheon', 0.2688172043010753), ('Tyrion-Lannister', 0.24731182795698928), ('Catelyn-Stark', 0.23118279569892475), ('Jon-Snow', 0.19892473118279572), ('Sansa-Stark', 0.18817204301075272), ('Robb-Stark', 0.18817204301075272), ('Bran-Stark', 0.17204301075268819), ('Joffrey-Baratheon', 0.16129032258064518), ('Cersei-Lannister', 0.16129032258064518)]
In [188]:
%%nose

def test_task4():
    assert sorted_deg_cen_book1[0][0] == 'Eddard-Stark', \
    'Make sure you have sorted the degree centrality dictionary in the correct order'
Out[188]:
1/1 tests passed

5. Evolution of importance of characters over the books

According to degree centrality, the most important character in the first book is Eddard Stark but he is not even in the top 10 of the fifth book. The importance of characters changes over the course of five books because, you know, stuff happens... ;)

Let's look at the evolution of degree centrality of a couple of characters like Eddard Stark, Jon Snow, and Tyrion, which showed up in the top 10 of degree centrality in the first book.

In [189]:
%matplotlib inline

# Creating a list of degree centrality of all the books
evol = [nx.degree_centrality(book) for book in books]

# Creating a DataFrame from the list of degree centralities in all the books
degree_evol_df = pd.DataFrame.from_records(evol)

# Plotting the degree centrality evolution of Eddard-Stark, Tyrion-Lannister and Jon-Snow
# YOUR CODE FOR TASK 5 HERE ...
degree_evol_df[["Eddard-Stark", "Tyrion-Lannister", "Jon-Snow"]].plot()
Out[189]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd339673128>
In [190]:
%%nose

def test_task5():
    assert degree_evol_df.shape == (5, 796), \
    'Check the construction of the DataFrame degree_evol_df'
Out[190]:
1/1 tests passed

6. What's up with Stannis Baratheon?

We can see that the importance of Eddard Stark dies off as the book series progresses. With Jon Snow, there is a drop in the fourth book but a sudden rise in the fifth book.

Now let's look at various other measures like betweenness centrality and PageRank to find important characters in our Game of Thrones character co-occurrence network and see if we can uncover some more interesting facts about this network. Let's plot the evolution of betweenness centrality of this network over the five books. We will take the evolution of the top four characters of every book and plot it.

In [191]:
# Creating a list of betweenness centrality of all the books just like we did for degree centrality
evol = [nx.betweenness_centrality(book, weight="weight") for book in books]

# Making a DataFrame from the list
betweenness_evol_df = pd.DataFrame.from_records(evol).fillna(0)

# Finding the top 4 characters in every book
set_of_char = set()
for i in range(5):
    set_of_char |= set(list(betweenness_evol_df.T[i].sort_values(ascending=False)[0:4].index))
list_of_char = list(set_of_char)

# Plotting the evolution of the top characters
# ... YOUR CODE FOR TASK 6 HERE ...
betweenness_evol_df[list_of_char].plot(figsize=(13,7))
Out[191]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd3283bd668>
In [192]:
%%nose

def test_task6():
    assert betweenness_evol_df.shape == (5, 796), \
    'Check the construction of the DataFrame betweenness_evol_df'
Out[192]:
1/1 tests passed

7. What does the Google PageRank algorithm tell us about Game of Thrones?

We see a peculiar rise in the importance of Stannis Baratheon over the books. In the fifth book, he is significantly more important than other characters in the network, even though he is the third most important character according to degree centrality.

PageRank was the initial way Google ranked web pages. It evaluates the inlinks and outlinks of webpages in the world wide web, which is, essentially, a directed network. Let's look at the importance of characters in the Game of Thrones network according to PageRank.

In [193]:
# Creating a list of pagerank of all the characters in all the books
evol = [nx.pagerank(book) for book in books]

# Making a DataFrame from the list
pagerank_evol_df = pd.DataFrame.from_records(evol)

# Finding the top 4 characters in every book
set_of_char = set()
for i in range(5):
    set_of_char |= set(list(pagerank_evol_df.T[i].sort_values(ascending=False)[0:4].index))
list_of_char = list(set_of_char)

# Plotting the top characters
# ... YOUR CODE FOR TASK 6 HERE ...
pagerank_evol_df[list_of_char].plot(figsize=(13,7))
Out[193]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd328168e48>
In [194]:
%%nose

def test_task7():
    assert pagerank_evol_df.shape == (5, 796), \
    'Check the construction of the DataFrame pagerank_evol_df'
Out[194]:
1/1 tests passed

8. Correlation between different measures

Stannis, Jon Snow, and Daenerys are the most important characters in the fifth book according to PageRank. Eddard Stark follows a similar curve but for degree centrality and betweenness centrality: He is important in the first book but dies into oblivion over the book series.

We have seen three different measures to calculate the importance of a node in a network, and all of them tells us something about the characters and their importance in the co-occurrence network. We see some names pop up in all three measures so maybe there is a strong correlation between them?

Let's look at the correlation between PageRank, betweenness centrality and degree centrality for the fifth book using Pearson correlation.

In [195]:
# Creating a list of pagerank, betweenness centrality, degree centrality
# of all the characters in the fifth book.
measures = [nx.pagerank(books[4]), 
            nx.betweenness_centrality(books[4], weight='weight'), 
            nx.degree_centrality(books[4])]

# Creating the correlation DataFrame
cor = pd.DataFrame.from_records(measures)

# Calculating the correlation
# ... YOUR CODE FOR TASK 6 HERE ...
cor.T.corr()
Out[195]:
0 1 2
0 1.000000 0.793372 0.971493
1 0.793372 1.000000 0.833816
2 0.971493 0.833816 1.000000
In [196]:
%%nose

def test_task8():
    assert len(cor) == 3, \
    'Make sure you have calculated the correlation correctly'
Out[196]:
1/1 tests passed

9. Conclusion

We see a high correlation between these three measures for our character co-occurrence network.

So we've been looking at different ways to find the important characters in the Game of Thrones co-occurrence network. According to degree centrality, Eddard Stark is the most important character initially in the books. But who is/are the most important character(s) in the fifth book according to these three measures?

In [197]:
# Finding the most important character in the fifth book,  
# according to degree centrality, betweenness centrality and pagerank.
p_rank, b_cent, d_cent = cor.idxmax(axis=1)

# Printing out the top character accoding to the three measures
# ... YOUR CODE FOR TASK 6 HERE ...
print(p_rank, b_cent, d_cent)
Jon-Snow Stannis-Baratheon Jon-Snow
In [198]:
%%nose

def test_task9():
    assert p_rank == 'Jon-Snow', \
    'Nope, wrong answer. Make sure you have found the max using the correct axis'
Out[198]:
1/1 tests passed