Path Lengths

In [1]:
import pandas as pd
import numpy as np
import json

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

path = "/Users/mark/Dropbox/Math/Complex_Systems/research/wikipedia-network/paper/writeup/graphics/"
In [2]:
#load results into dataframe (~ 1min runtime)

results_path = "/Users/mark/Desktop/wiki_v4/"
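with open(results_path + "lengths.json") as f:
    lengths = json.load(f)  # maps page title -> path length
#list() keeps this working on Python 3, where dict.items() is a view
df = pd.DataFrame(list(lengths.items()), columns=['page', 'path length'])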

Articles with the Longest Path Length

Path length: the number of first-link traversals made from an article before reaching a repeated article or an invalid link.
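
The definition above can be illustrated with a minimal sketch. The function name `path_length` and the `first_link` dictionary are hypothetical and are not part of the original crawl code. The sketch also assumes that only traversals landing on a new, valid article are counted, so an article whose first link is invalid gets a path length of 0.

```python
def path_length(start, first_link):
    """Count first-link traversals from `start` until a repeat or invalid link.

    `first_link` is a hypothetical dict mapping each article title to the
    title of its first link, or to None when that link is invalid.
    """
    visited = {start}
    current = start
    length = 0
    while True:
        nxt = first_link.get(current)
        # Stop at an invalid link (missing or None) or at a repeated article.
        if nxt is None or nxt in visited:
            return length
        visited.add(nxt)
        current = nxt
        length += 1


# Toy graph: A -> B -> C -> B (a cycle); D has an invalid first link.
toy_links = {"A": "B", "B": "C", "C": "B", "D": None}
toy_lengths = {page: path_length(page, toy_links) for page in toy_links}
```

Under this counting convention, very long paths such as the 365-step liturgics chains arise when each page's first link points to a page not yet seen on the path.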

In [6]:
df.sort_values(by='path length', ascending=False).head(50)
 
Out[6]:
page path length
3822443 Holy fathers slain at sinai and raithu 366
4307586 Martyrs of Raithu 366
348831 Holly fathers slain at Sinai and Raithu 366
5246648 January 19 (Eastern Orthodox liturgics) 365
4725606 March 5 (Eastern Orthodox liturgics) 365
6494442 September 12 (Orthodox Liturgics) 365
3852463 June 28 (Eastern Orthodox liturgics) 365
9746286 October 8 (Eastern Orthodox liturgics) 365
6916418 June 10 (Eastern Orthodox Liturgics) 365
222840 October 2 (Eastern Orthodox liturgics) 365
10403282 September 10 (Eastern Orthodox liturgics) 365
7255605 May 22 (Eastern Orthodox Liturgics) 365
2627595 September 19 (Eastern Orthodox liturgics) 365
7238051 April 21 (Orthodox Liturgics) 365
7866409 May 1 (Eastern Orthodox liturgics) 365
4833573 February 20 (Eastern Orthodox liturgics) 365
1704007 March 13 (Eastern Orthodox liturgics) 365
5140090 June 13 (Eastern Orthodox liturgics) 365
4947568 June 12 (Orthodox Liturgics) 365
3076660 October 3 (Eastern Orthodox liturgics) 365
3384560 March 31 (Eastern Orthodox liturgics) 365
10471222 April 2 (Orthodox Liturgics) 365
541479 March 9 (Eastern Orthodox liturgics) 365
6363710 January 9 (Eastern Orthodox liturgics) 365
10628182 April 22 (Orthodox Liturgics) 365
8188141 June 16 (Orthodox liturgics) 365
7037526 January 24 (Eastern Orthodox liturgics) 365
8535203 May 7 (Eastern Orthodox Liturgics) 365
6240919 July 31 (Eastern Orthodox liturgics) 365
8927665 July 21 (Eastern Orthodox liturgics) 365
7484123 July 6 (Eastern Orthodox liturgics) 365
10912810 September 10 (Orthodox Liturgics) 365
1806792 January 13 (Eastern Orthodox liturgics) 365
26741 April 29 (Eastern Orthodox Liturgics) 365
5427351 July 12 (Eastern Orthodox liturgics) 365
4721922 April 24 (Eastern Orthodox liturgics) 365
4120771 April 4 (Orthodox Liturgics) 365
10787369 May 1 (Orthodox Liturgics) 365
3076639 September 29 (Eastern Orthodox liturgics) 365
1993279 January 17 (Eastern Orthodox liturgics) 365
11152069 September 21 (Orthodox Liturgics) 365
2371682 April 18 (Eastern Orthodox liturgics) 365
11152068 May 19 (Eastern Orthodox Liturgics) 365
8892979 May 2 (Eastern Orthodox Liturgics) 365
6928389 December 29 (Eastern Orthodox liturgics) 365
8675961 July 9 (Eastern Orthodox liturgics) 365
4086292 January 11 (Eastern Orthodox liturgics) 365
5352005 May 2 (Orthodox Liturgics) 365
8813140 March 6 (Eastern Orthodox liturgics) 365
3625925 April 11 (Eastern Orthodox liturgics) 365

Excluding Liturgics...

In [3]:
top1k = df.sort_values(by='path length', ascending=False).head(1000)
top1k[top1k['page'].apply(lambda e: "liturgics" not in e.lower())]
Out[3]:
page path length
3822443 Holy fathers slain at sinai and raithu 366
4307586 Martyrs of Raithu 366
348831 Holly fathers slain at Sinai and Raithu 366
9349535 Holy fathers slain at Sinai and Raithu 365
3747101 2007 in Scotland 104
1672735 2005 in Scotland 103
5940038 2006 in Scotland 103
3427709 2003 in Scotland 102
5483187 2004 in Scotland 102
9082217 2001 in Scotland 101
9811445 2002 in Scotland 101
1265515 2000 in Scotland 100
2855529 1999 in Scotland 100
10463952 1998 in Scotland 99
10417119 1997 in Scotland 99
5922640 1996 in Scotland 98
7824403 1995 in Scotland 98
6168445 1994 in Scotland 97
5707635 1993 in Scotland 97
7005527 1991 in Scotland 96
2790716 1992 in Scotland 96
4807989 1989 in Scotland 95
2927664 1990 in Scotland 95
7288844 1988 in Scotland 94
6827142 1987 in Scotland 94
4544731 1986 in Scotland 93
1988634 1985 in Scotland 93
384869 1984 in Scotland 92
2551937 1983 in Scotland 92
7451802 1982 in Scotland 91
... ... ...
1611530 List of 2009 Canadian incumbents 60
3163047 1553 in architecture 60
8257496 1560s in architecture 60
8181836 Ontario general election, 1959 60
2002197 Karyaku 60
2315768 1836 in the UK 60
5147993 1557 in architecture 60
5277691 1920 in Scotland 60
3287735 Encho 60
1346830 1670 in England 60
8382878 1838 in the United Kingdom 60
2089886 1558 in architecture 60
8548725 1551 in architecture 60
4227662 Genko 60
6011737 1559 in architecture 60
2401677 Portal del Sur 60
5521742 1837 in the UK 60
2938682 1552 in architecture 60
4676620 1550 in architecture 60
7609224 1839 in the United Kingdom 60
9434873 Shochu (era) 60
2402488 Genkō (disambiguation) 60
5954972 1556 in architecture 60
7372394 1671 in England 60
8093557 Shōchū (era) 59
204011 1669 in England 59
8957378 1546 in architecture 59
7821574 1834 in the UK 59
2919238 26th Legislative Assembly of Ontario 59
4301033 1668 in England 59

386 rows × 2 columns
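
The same case-insensitive exclusion can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a sketch on a toy frame (four rows copied from the outputs above), not the notebook's full data.

```python
import pandas as pd

# Toy frame; titles and path lengths are taken from the outputs above.
toy = pd.DataFrame({
    "page": [
        "May 1 (Eastern Orthodox liturgics)",
        "April 2 (Orthodox Liturgics)",
        "2007 in Scotland",
        "Martyrs of Raithu",
    ],
    "path length": [365, 365, 104, 366],
})

# case=False matches both "liturgics" and "Liturgics".
is_liturgics = toy["page"].str.contains("liturgics", case=False)
non_liturgics = toy[~is_liturgics]
```

The boolean mask `is_liturgics` can be reused, for example to count how many of the top pages are liturgics pages with `is_liturgics.sum()`.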

What's the total path length traversed?

In [52]:
df['path length'].sum()
Out[52]:
232356935

How many articles have a path length of 0?

In [20]:
df[df['path length'] == 0].count()
Out[20]:
page           550220
path length    550220
dtype: int64

A sample of these articles:

In [21]:
df[df['path length'] == 0].head(10)
Out[21]:
page path length
21 Lowry model 0
32 Mismagius (Pokémon) 0
74 2002 Summer Camp Music Festival 0
126 Rvd tv 0
146 La Resistencia (gang) 0
158 Presidency of William McKinley 0
191 Chen Ying (Three Kingdoms) 0
204 Margot Grimmer 0
208 Ghost (novel) 0
252 Small Swiss Hound 0

How many articles have a path length of 2?

The count includes paths that end at an invalid link.

In [48]:
df[df['path length'] == 2].count()
Out[48]:
page           326814
path length    326814
dtype: int64

How many articles have a path length of 3?

The count includes paths that end at an invalid link.

In [50]:
df[df['path length'] == 3].count()
Out[50]:
page           247101
path length    247101
dtype: int64

What's the most frequent path length?

In [45]:
df['path length'].mode()
Out[45]:
0    29
dtype: int64

How many pages have a path length of 29?

In [46]:
df[df['path length'] == 29].count()
Out[46]:
page           698089
path length    698089
dtype: int64
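
The per-length counts and the mode computed in the cells above can also be read from a single frequency table. This is a minimal sketch on toy values, assuming a Series shaped like `df['path length']`.

```python
import pandas as pd

# Toy stand-in for df['path length'].
toy_lengths = pd.Series([0, 2, 2, 3, 29, 29, 29], name="path length")

# One pass: number of pages at each path length.
counts = toy_lengths.value_counts()

n_zero = counts.get(0, 0)            # pages with path length 0
n_two = counts.get(2, 0)             # pages with path length 2
most_common_length = counts.idxmax() # the mode
n_at_mode = counts.max()             # pages at the mode
```

On the full data, `counts` would answer each of the questions above without rescanning the frame for every length.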

Summary Statistics

In [25]:
df.describe()
Out[25]:
path length
count 11277534.000000
mean 20.603523
std 12.504190
min 0.000000
25% 7.000000
50% 26.000000
75% 30.000000
max 366.000000
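
As a consistency check, the mean reported here should equal the total path length computed earlier divided by the page count. The sketch below uses only the two numbers printed in the outputs above.

```python
total_path_length = 232356935  # output of df['path length'].sum()
page_count = 11277534          # "count" row of df.describe()

# float() keeps the division exact under Python 2 as well.
mean_path_length = float(total_path_length) / page_count
```

The result agrees with the reported mean of 20.603523 to six decimal places.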

75% of pages traverse 30 or fewer first links!
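
A percentile bounds the share of values at or below it, which is not the same as the share strictly below it. The difference matters here because many pages sit exactly at the quartile value. A small sketch on toy data, not the notebook's data:

```python
import pandas as pd

# Toy path lengths chosen so that the 75th percentile is exactly 30.
toy = pd.Series([0, 7, 20, 26, 28, 29, 30, 30, 40])

q75 = toy.quantile(0.75)
share_at_most = (toy <= q75).mean()  # share of pages with length <= q75
share_below = (toy < q75).mean()     # share of pages with length < q75
```

On the full data, `(df['path length'] <= 30).mean()` would give the exact share of pages with 30 or fewer traversals.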

In [8]:
plt.figure(figsize=(8,6))

sns.set_style("whitegrid")
sns.boxplot(x=df["path length"])

plt.savefig(path+'path_lengths_boxplot.png', format='png', dpi=300, bbox_inches='tight')

Density Plots

In [4]:
plt.figure(figsize=(8,6))

sns.distplot(df["path length"])
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b42b050>
In [5]:
# runtime ~3min

plt.figure(figsize=(8,6))

with sns.axes_style("whitegrid"):
    sns.kdeplot(df["path length"], shade=True, legend=False, color='g')
    sns.despine(left=True)

#make axis font size larger
plt.tick_params(axis='both', which='major', labelsize=14)

#labels
plt.xlabel("path length", fontsize=14)
plt.ylabel("density", fontsize=14)

in log space

In [6]:
# runtime ~3min

plt.figure(figsize=(8,6))

with sns.axes_style("whitegrid"):
    sns.kdeplot(np.log10(df["path length"] + 1), shade=True, legend=False, color='g')
    sns.despine(left=True)

#make axis font size larger
plt.tick_params(axis='both', which='major', labelsize=14)

#labels
plt.xlabel(r"$\log_{10}$(path length + 1)", fontsize=14)
plt.ylabel("density", fontsize=14)
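
The `+ 1` inside the logarithm is needed because 550,220 pages have a path length of 0, and the logarithm of 0 is not finite. A minimal sketch of the effect on toy values:

```python
import numpy as np

toy_lengths = np.array([0, 9, 99])

# Shifting by 1 maps a path length of 0 to log10(1) = 0.
shifted = np.log10(toy_lengths + 1)

# Without the shift, the zero entry becomes -inf (warning suppressed here).
with np.errstate(divide="ignore"):
    unshifted = np.log10(toy_lengths)
```

The shift changes the axis meaning slightly: a plotted value of 1 corresponds to a path length of 9, not 10.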

combined plot

In [ ]:
#defaults
sns.set()
plt.figure(figsize=(8,6))

with sns.axes_style("whitegrid"):
    sns.kdeplot(df["path length"], shade=True, legend=False, color='g')
    sns.despine(left=True)

#make axis font size larger
plt.tick_params(axis='both', which='major', labelsize=14)

#labels
plt.xlabel("path length", fontsize=14)
plt.ylabel("density", fontsize=14)

#hide the first tick label on each axis
axes = plt.gca()
xticks = axes.xaxis.get_major_ticks()
xticks[0].label1.set_visible(False)
yticks = axes.yaxis.get_major_ticks()
yticks[0].label1.set_visible(False)

sns.set_style("dark")
#subplot in top corner
a = plt.axes([.53, .51, .37, .37], facecolor='y')  #axisbg was removed in newer matplotlib

with sns.axes_style("whitegrid"):
    sns.kdeplot(np.log10(df["path length"] + 1), shade=True, legend=False, color='g')
    sns.despine(left=True)

#make axis font size larger
plt.tick_params(axis='both', which='major', labelsize=14)

#labels
plt.xlabel(r"$\log_{10}$(path length + 1)", fontsize=14)
plt.ylabel("density", fontsize=14)

#hide the first tick label on each axis
axes = plt.gca()
xticks = axes.xaxis.get_major_ticks()
xticks[0].label1.set_visible(False)
yticks = axes.yaxis.get_major_ticks()
yticks[0].label1.set_visible(False)

#transparent
a.patch.set_alpha(0.1)

#back to defaults
sns.set()

#save figure
plt.savefig(path+'path_lengths_dist.png', format='png', dpi=300, bbox_inches='tight')

Distribution of Path Lengths

In [41]:
#sort once, then take the top-n path lengths
sorted_lengths = df.sort_values(by='path length', ascending=False)['path length']
toppdf_10k = sorted_lengths.head(10000)
toppdf_1k = sorted_lengths.head(1000)
toppdf_100 = sorted_lengths.head(100)
In [31]:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
In [40]:
visits_plot1 = toppdf_10k.plot(kind='hist', title="Pages with Longest Paths (top 10k)")
visits_plot1.set_xlabel("path length")
visits_plot1.set_ylabel("frequency")
Out[40]:
<matplotlib.text.Text at 0x168885e90>
In [42]:
visits_plot2 = toppdf_1k.plot(kind='hist', title="Pages with Longest Paths (top 1k)")
visits_plot2.set_xlabel("path length")
visits_plot2.set_ylabel("frequency")
Out[42]:
<matplotlib.text.Text at 0x162bcedd0>
In [44]:
visits_plot3 = toppdf_100.plot(kind='hist', title="Pages with Longest Paths (top 100)")
visits_plot3.set_xlabel("path length")
visits_plot3.set_ylabel("frequency")
Out[44]:
<matplotlib.text.Text at 0x1d40ffc90>