What are the highest-ranking articles by traversal funnels?

Metric: traversal funnels

The number of paths that an article directs toward a cycle or an invalid link.

In [1]:
from collections import defaultdict

import pandas as pd
from scipy import stats 
import numpy as np
import json

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

results_path = "/Users/mark/Desktop/wiki_v4/"
path = "/Users/mark/Dropbox/Math/Complex_Systems/research/wikipedia-network/paper/writeup/graphics/"
In [3]:
# Load the per-article traversal funnel counts (article -> count)
with open(results_path + "feed_count.json") as f:
    feeder_dict = json.load(f)
# list(...) keeps this working on Python 3, where dict.items() returns a view
feeder_df = pd.DataFrame(list(feeder_dict.items()))
feeder_df.columns = ['article', 'traversal funnels']
feeder_df = feeder_df.sort_values(by='traversal funnels', ascending=False)
In [3]:
feeder_df.head(50)
Out[3]:
article traversal funnels
7948850 Philosophy 7374892
224026 Presentation 30799
9030902 Tree of life (biology) 29274
1344349 Southeast Europe 25745
11029885 Feudalism 19276
632584 Census-Designated Place 17483
7652704 United States Constitution 13952
7974918 Reality 13416
8629119 Health care 10762
7739754 BBC 8945
7580925 Hip Hop Music 7166
4495967 Consciousness 6587
5516532 Balkans 6547
3381363 Quality (philosophy) 5712
5866358 Biological system 5568
1749140 Secondary school 4624
4561872 Reservoir 4571
8281815 Armenia 3943
10404923 Dwelling 3767
5437039 Cancer 3219
5264253 Kingdom of France 3177
5253481 Web page 3113
3198232 Jurisdiction 3111
3442364 Affection 3075
8622161 Photography 2955
6420014 Secondary education 2939
7102628 Music magazine 2588
203945 Decimal 2573
8151674 Provinces of Armenia 2475
5744095 Dam 2472
5354565 Residency (domicile) 2336
1647549 Angola 2238
7334137 Rowing (sport) 2139
11206470 Leonard Bloomfield 2120
10011184 Telugu cinema 1941
10085700 City-State 1851
870223 Combat vehicle 1820
5964482 Marriage 1714
1496615 Cold War 1692
5806173 Photograph 1639
4681122 Parliament of New South Wales 1595
9452835 Namibia 1562
5260882 Fossil Fuel 1521
3638527 Women's Tennis Association 1517
8487023 Feminism 1429
5970692 Cinema of India 1341
10130316 Physical geography 1292
3323371 Provinces of the Netherlands 1282
9597819 Vocal group 1263
10235159 Court (royal) 1218

Articles with no traversal funnels

In [5]:
feeder_df.tail(20)
Out[5]:
article traversal funnels
3762162 Galge 0
3762180 Suriyan 0
3762163 Craig Hall 0
3762164 Anthony holborne 0
3762165 Dartmouth Chronicle and Advertiser 0
3762166 Cryptothele 0
3762167 Fishers Pond 0
3762168 Cabannes, Bouches-du-Rhone 0
3762169 Galgo 0
3762170 Gymnastics at the 1960 Summer Olympics – Men's... 0
3762171 Levoberejnii District 0
3762172 Bucket seating 0
3762173 Daisuke (website) 0
3762174 George Flake 0
3762175 1999 San Francisco 49ers season 0
3762176 Batman computer and video games 0
3762177 Joseph Vargas 0
3762178 Ma Rulong 0
3762179 The Phantom (game system 0
11277533 Dungannon Middle 0

What's the distribution of traversal funnels?

In [6]:
feeder_df.describe()
Out[6]:
traversal funnels
count 11277534.000000
mean 0.692524
std 2196.166545
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 7374892.000000
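
The summary statistics hint at how concentrated the metric is. As a rough check, the total number of funnels can be approximated as mean × count and compared with the maximum value. The sketch below uses only the figures printed in the output above; the variable names are illustrative and do not appear in the notebook.

```python
# Figures copied from the describe() output above
count = 11277534          # number of articles
mean = 0.692524           # mean traversal funnels per article (rounded as printed)
max_funnels = 7374892     # largest single value in the column

# mean * count approximates the total; the printed mean is rounded, so this is not exact
total_funnels = mean * count
top_share = max_funnels / total_funnels

print("approximate total funnels: %.0f" % total_funnels)
print("share held by the top article: %.1f%%" % (100 * top_share))
```

A share this large for one article is also why the standard deviation is so much bigger than the mean.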

How many articles have at least one traversal funnel?

In [7]:
feeder_df[feeder_df['traversal funnels'] > 0].count()
Out[7]:
article              17821
traversal funnels    17821
dtype: int64
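
To put that count in proportion, here is a small sketch using the two row counts reported above (17821 nonzero articles out of 11277534 in total). The variable names are illustrative, not from the notebook.

```python
nonzero_articles = 17821      # rows with 'traversal funnels' > 0
total_articles = 11277534     # total rows, from describe()

# float() keeps the division exact on Python 2 as well as Python 3
nonzero_fraction = nonzero_articles / float(total_articles)
print("fraction with at least one funnel: %.4f%%" % (100 * nonzero_fraction))
```

Since far fewer than a quarter of the articles have a nonzero value, the 25%, 50%, and 75% quantiles in the summary are all 0.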

What's the distribution of nonzero traversal funnels?

In [8]:
sns.boxplot(x='traversal funnels', data=feeder_df[feeder_df['traversal funnels'] > 0])
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x11dc8f510>

The extreme outlier is Philosophy.

Log scale

In [4]:
feeder_df['rank'] = np.arange(1, feeder_df.shape[0]+1)
feeder_df['log(rank)'] = np.log10(feeder_df['rank'])
# Add 1 before taking the log so that articles with zero funnels map to 0 rather than -inf
feeder_df['log(traversal funnels)'] = np.log10(feeder_df['traversal funnels'] + 1)
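
As an illustration of the rank and log transforms, the sketch below applies the same steps to a made-up four-row table. The `toy` DataFrame and its values are hypothetical and only serve to show what each column ends up holding.

```python
import numpy as np
import pandas as pd

# Toy counts, not the real data
toy = pd.DataFrame({'article': ['A', 'B', 'C', 'D'],
                    'traversal funnels': [0, 99, 9, 0]})
toy = toy.sort_values(by='traversal funnels', ascending=False)

# Rank 1 is the largest count; tied counts get consecutive ranks
# in whatever order the sort leaves them
toy['rank'] = np.arange(1, toy.shape[0] + 1)
toy['log(rank)'] = np.log10(toy['rank'])
# The +1 shift maps a count of 0 to log10(1) = 0 instead of -inf
toy['log(traversal funnels)'] = np.log10(toy['traversal funnels'] + 1)

print(toy)
```

Note that the ranks are assigned by position after sorting, so the DataFrame must already be sorted in descending order when the `rank` column is created.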

Top Articles (on log scale)

In [7]:
feeder_df.head(50).iloc[::-1].plot(x="article", y="log(traversal funnels)", kind="barh", fontsize=14,
                            legend=False, figsize=(6,16), color="#268bd2")
#no background
ax = plt.gca()
ax.patch.set_visible(False) 


plt.xlabel(r"$\log_{10}$(Traversal Funnels)", fontsize=14)
plt.ylabel("")
ax.xaxis.tick_top()
ax.xaxis.set_label_position('top')

plt.tick_params(axis='x', which='major', labelsize=14)

#save figure
plt.savefig(path+'top_funnels.png', format='png', dpi=300, bbox_inches='tight')

Distribution (log-log)

In [ ]:
plt.scatter(feeder_df['log(rank)'], feeder_df['log(traversal funnels)'], color="#87CEFA")
plt.title("Distribution of Traversal Funnels")
plt.xlabel("log(rank)")
plt.ylabel("log(traversal funnels)")
plt.legend()

Two regimes appear

Fit for log(rank) < 4

In [5]:
feeder_df[feeder_df["log(rank)"] < 4]
Out[5]:
article traversal funnels rank log(rank) log(traversal funnels)
7948850 Philosophy 7374892 1 0.000000 6.867756
224026 Presentation 30799 2 0.301030 4.488551
9030902 Tree of life (biology) 29274 3 0.477121 4.466497
1344349 Southeast Europe 25745 4 0.602060 4.410710
11029885 Feudalism 19276 5 0.698970 4.285039
632584 Census-Designated Place 17483 6 0.778151 4.242641
7652704 United States Constitution 13952 7 0.845098 4.144668
7974918 Reality 13416 8 0.903090 4.127655
8629119 Health care 10762 9 0.954243 4.031933
7739754 BBC 8945 10 1.000000 3.951629
7580925 Hip Hop Music 7166 11 1.041393 3.855337
4495967 Consciousness 6587 12 1.079181 3.818754
5516532 Balkans 6547 13 1.113943 3.816109
3381363 Quality (philosophy) 5712 14 1.146128 3.756864
5866358 Biological system 5568 15 1.176091 3.745777
1749140 Secondary school 4624 16 1.204120 3.665112
4561872 Reservoir 4571 17 1.230449 3.660106
8281815 Armenia 3943 18 1.255273 3.595937
10404923 Dwelling 3767 19 1.278754 3.576111
5437039 Cancer 3219 20 1.301030 3.507856
5264253 Kingdom of France 3177 21 1.322219 3.502154
5253481 Web page 3113 22 1.342423 3.493319
3198232 Jurisdiction 3111 23 1.361728 3.493040
3442364 Affection 3075 24 1.380211 3.487986
8622161 Photography 2955 25 1.397940 3.470704
6420014 Secondary education 2939 26 1.414973 3.468347
7102628 Music magazine 2588 27 1.431364 3.413132
203945 Decimal 2573 28 1.447158 3.410609
8151674 Provinces of Armenia 2475 29 1.462398 3.393751
5744095 Dam 2472 30 1.477121 3.393224
... ... ... ... ... ...
3862595 Derivate (disambiguation) 1 9970 3.998695 0.301030
8348092 Lotta Triven 1 9971 3.998739 0.301030
5892841 Lucius Licinius Lucullus 1 9972 3.998782 0.301030
9667472 Benjamin Stone (character) 1 9973 3.998826 0.301030
4480007 Insignificant Others 1 9974 3.998869 0.301030
2229578 Epps 1907 Monoplane 1 9975 3.998913 0.301030
2093214 O. F. Snelling 1 9976 3.998956 0.301030
6626331 List of films based on war books — fantasy 1 9977 3.999000 0.301030
4658632 Hagop Goudsouzian 1 9978 3.999043 0.301030
4210272 Nee Soon Central Single Member Constituency 1 9979 3.999087 0.301030
1752009 February 25 (Eastern Orthodox liturgics) 1 9980 3.999131 0.301030
1303199 New Zealand Law Students' Association 1 9981 3.999174 0.301030
5363410 The Guide Association 1 9982 3.999218 0.301030
7995441 Kovalevskaya Top 1 9983 3.999261 0.301030
10763091 X-Wind technology 1 9984 3.999305 0.301030
4952982 Zaid Mohseni 1 9985 3.999348 0.301030
2714753 Sons of the Prophet 1 9986 3.999392 0.301030
1585058 Dugazon family 1 9987 3.999435 0.301030
5566838 Taja Sevelle 1 9988 3.999479 0.301030
2742422 Helldorado Days (Tombstone) 1 9989 3.999522 0.301030
1870776 Janet Wu (WHDH) 1 9990 3.999565 0.301030
550615 Christopher Grimm 1 9991 3.999609 0.301030
11074857 Moeru Tairiku 1 9992 3.999652 0.301030
8021241 Kim Hamilton 1 9993 3.999696 0.301030
4155346 Lexicon Avenue 1 9994 3.999739 0.301030
7298908 Donald Canfield 1 9995 3.999783 0.301030
954733 Dominique Perrier 1 9996 3.999826 0.301030
2851002 Craig Shapiro 1 9997 3.999870 0.301030
10581105 Playboys (song) 1 9998 3.999913 0.301030
791086 967 BC 1 9999 3.999957 0.301030

9999 rows × 5 columns

In [7]:
# Fit a line in log-log space to the 8103 top-ranked articles
top = feeder_df[:8103]
slope, intercept, r_value, p_value, std_err = stats.linregress(top["log(rank)"],
                                                               top["log(traversal funnels)"])
print(slope, intercept, r_value, p_value, std_err)
-1.08410279645 4.64007221841 -0.990266116712 0.0 0.00169296457747
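
Because the fit is done in log-log space, the slope and intercept translate into a power law of the form (funnels + 1) ≈ 10^intercept × rank^slope. Here is a minimal sketch of that conversion, assuming the coefficients printed above; `predicted_funnels` is a hypothetical helper, not part of the notebook.

```python
import math

# Coefficients copied from the linregress output above
slope = -1.08410279645
intercept = 4.64007221841

def predicted_funnels(rank):
    """Funnel count implied by the fitted line at a given rank (hypothetical helper).

    The regression was run on log10(funnels + 1), so the +1 shift is undone here.
    """
    return 10 ** (intercept + slope * math.log10(rank)) - 1

for rank in (1, 10, 100, 1000):
    print("rank %d: %.0f" % (rank, predicted_funnels(rank)))
```

At rank 1 the fitted line gives a value on the order of 10^4.6, far below the count actually observed for Philosophy, which is consistent with Philosophy being an extreme outlier.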
In [6]:
plt.scatter(feeder_df['log(rank)'][:8103], feeder_df['log(traversal funnels)'][:8103], color="#87CEFA", label='r = -0.99')
plt.title("Traversal Funnels Top Regime (log(rank) < 4)")
plt.xlabel("log(rank)")
plt.ylabel("log(traversal funnels)")
plt.legend()
Out[6]:
<matplotlib.legend.Legend at 0x1470ff290>

The power-law exponent for the top regime is -1.084 (the slope of the log-log fit above).

Combined plots of the distribution

In [8]:
#defaults
sns.set()
plt.figure(figsize=(8,6))

plt.scatter(feeder_df['log(rank)'], feeder_df['log(traversal funnels)'], color="#87CEFA")
plt.xlabel("$\log_{10}$(rank)", fontsize=14)
plt.ylabel("$\log_{10}$(traversal funnels)", fontsize=14)
plt.tick_params(axis='both', which='major', labelsize=14)
plt.legend()
ax = plt.gca()
ax.legend().set_visible(False)

#define plot axis limits
axes = plt.gca()
xticks = axes.xaxis.get_major_ticks()
xticks[0].label1.set_visible(False)
yticks = axes.yaxis.get_major_ticks()
yticks[0].label1.set_visible(False)

sns.set_style("dark")
#inset axes in top-right corner (facecolor replaces axisbg, which newer matplotlib releases no longer accept)
a = plt.axes([.50, .50, .38, .34], facecolor='y')
a.scatter(feeder_df['log(rank)'][:8103], feeder_df['log(traversal funnels)'][:8103], color="#F08080",
          label=r"$\alpha$ = -1.08"+"\n$\gamma$ = 0.07\n"+"Pearson\'s r = -0.99")
plt.legend(fontsize=14)
plt.title("Top Regime ($\log_{10}$(rank) < 4)", fontsize=14)
plt.xlabel("$\log_{10}$(rank)", fontsize=14)
plt.ylabel("$\log_{10}$(traversal funnels)", fontsize=14)
plt.tick_params(axis='both', which='major', labelsize=14)

#transparent
a.patch.set_alpha(0.1)

#define plot axis limits
axes = plt.gca()
xticks = axes.xaxis.get_major_ticks()
xticks[0].label1.set_visible(False)
yticks = axes.yaxis.get_major_ticks()
yticks[0].label1.set_visible(False)

#save figure
plt.savefig(path+'funnels_distribution.png', format='png', dpi=300, bbox_inches='tight')

#back to defaults
sns.set()