Foundational paper:

Allotaxonometry and rank-turbulence divergence:
A universal instrument for comparing complex systems

P. S. Dodds, J. R. Minot, M. V. Arnold, T. Alshaabi, J. L. Adams, D. R. Dewhurst, T. J. Gray, M. R. Frank, A. J. Reagan, and C. M. Danforth




Abstract:


Complex systems often comprise many kinds of components which vary over many orders of magnitude in size: Populations of cities in countries, individual and corporate wealth in economies, species abundance in ecologies, word frequency in natural language, and node degree in complex networks.

Comparisons of component size distributions for two complex systems—or a system with itself at two different time points—generally employ information-theoretic instruments, such as Jensen-Shannon divergence. We argue that these methods lack transparency and adjustability, and should not be applied when component probabilities are non-sensible or are problematic to estimate.

Here, we introduce 'allotaxonometry' along with 'rank-turbulence divergence', a tunable instrument for comparing any two (Zipfian) ranked lists of components.

We analytically develop our rank-based divergence in a series of steps, and then establish a rank-based allotaxonograph which pairs a map-like histogram for rank-rank pairs with an ordered list of components according to divergence contribution.

We explore the performance of rank-turbulence divergence for a series of distinct settings including: Language use on Twitter and in books, species abundance, baby name popularity, market capitalization, performance in sports, mortality causes, and job titles.

We provide a series of supplementary flipbooks which demonstrate the tunability and storytelling power of rank-based allotaxonometry.

 

 

Flipbooks:


First, some notes:

\( \newcommand{\zipfrank}{r} \newcommand{\elementsymbol}{\tau} \newcommand{\rtdelement}[1]{\delta D^{{\rm R}}_{#1,\elementsymbol}} \newcommand{\flipbooktwitter}{\mbox{S}1} \newcommand{\flipbooktwitterRT}{\mbox{S}2} \newcommand{\flipbooktwittertimediff}{\mbox{S}3} \newcommand{\flipbooktrees}{\mbox{S}4} \newcommand{\flipbookgirlsyears}{\mbox{S}5} \newcommand{\flipbookboysyears}{\mbox{S}6} \newcommand{\flipbookgirlsalphas}{\mbox{S}7} \newcommand{\flipbookboysalphas}{\mbox{S}8} \newcommand{\flipbookmarketcapsyears}{\mbox{S}9} \newcommand{\flipbooktwittertrunc}{\mbox{S}10} \newcommand{\flipbooktreestrunc}{\mbox{S}11} \newcommand{\flipbookgirlnamestrunc}{\mbox{S}12} \newcommand{\flipbookboynamestrunc}{\mbox{S}13} \newcommand{\flipbookcompaniestrunc}{\mbox{S}14} \newcommand{\flipbooknba}{\mbox{S}15} \newcommand{\flipbookgoogleonegrams}{\mbox{S}16} \newcommand{\flipbookgooglebigrams}{\mbox{S}17} \newcommand{\flipbookgoogletrigrams}{\mbox{S}18} \newcommand{\flipbookharrypotter}{\mbox{S}19} \newcommand{\flipbookharrypotternocaps}{\mbox{S}20} \newcommand{\flipbookdeathcauses}{\mbox{S}21} \newcommand{\flipbookjobnames}{\mbox{S}22} \)
Adapted from Sec. IV of the main paper. References to Sections and Figures apply to the format of the arXiv version.

To help demonstrate rank-turbulence divergence as an allotaxonometric instrument, we have referenced a number of PDF 'Flipbooks' (or 'kineographs') throughout the paper.

Some details:
  • Flipbooks are best 'flipped through' back and forth using a PDF reader with the view set to 'single page' rather than continuous. Viewing in a browser will likely prove to be disappointing.
  • Flipbooks follow various formats which include: Comparisons of two systems with varying rank-turbulence divergence parameter $\alpha$; Comparisons of a series of system pairs, often through time; and Comparisons of systems with truncation applied (Sec. III F).
  • The first 14 Flipbooks concern the four main case studies in the paper and are explored in detail there. The remaining 8 Flipbooks are included to show the range of application for allotaxonometry, and have some discussion as part of the list below.
  • When $\alpha$ is varied the values are $0$, $\frac{1}{12}$, $\frac{2}{12}$, $\frac{3}{12}$, $\frac{4}{12}$, $\frac{5}{12}$, $\frac{6}{12}$, $\frac{8}{12}$, $1$, $2$, $5$, and $\infty$.
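As a rough illustration of what tuning $\alpha$ does, the following Python sketch evaluates the unnormalized per-component core of rank-turbulence divergence, $\frac{\alpha+1}{\alpha}\left|\zipfrank_1^{-\alpha} - \zipfrank_2^{-\alpha}\right|^{1/(\alpha+1)}$, for the $\alpha$ values listed above. The formula is quoted from the main paper rather than from this page, the overall normalization factor is deliberately omitted, and the function name `rtd_element` is a hypothetical choice made for this example. The limits $\alpha \to 0$ and $\alpha \to \infty$ are handled as special cases.

```python
import math


def rtd_element(r1, r2, alpha):
    """Unnormalized contribution of one component with ranks r1 and r2.

    Assumption: this follows the per-component term of rank-turbulence
    divergence as given in the main paper, without the normalization factor.
    """
    if r1 == r2:
        # A component with the same rank in both systems contributes nothing.
        return 0.0
    if alpha == 0:
        # Limit alpha -> 0: absolute log ratio of the two ranks.
        return abs(math.log(r1 / r2))
    if math.isinf(alpha):
        # Limit alpha -> infinity: only the better (smaller) rank matters.
        return max(1.0 / r1, 1.0 / r2)
    core = abs(r1 ** -alpha - r2 ** -alpha) ** (1.0 / (alpha + 1.0))
    return (alpha + 1.0) / alpha * core


if __name__ == "__main__":
    # The alpha values used when alpha is varied in the Flipbooks.
    alphas = [0, 1/12, 2/12, 3/12, 4/12, 5/12, 6/12, 8/12, 1, 2, 5, math.inf]
    # Illustrative rank pairs: a big jump near the top, and a shift deep in the tail.
    for r1, r2 in [(51, 1), (5000, 10000)]:
        for alpha in alphas:
            print(r1, r2, alpha, rtd_element(r1, r2, alpha))
```

Running the script shows the qualitative behavior the Flipbooks are meant to convey: for small $\alpha$ the two rank pairs contribute comparably because only the ratio of ranks matters, while for large $\alpha$ the change near the top of the ranking dominates and the change deep in the tail contributes almost nothing.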

The Flipbooks:


Feeling peckish? Download all 22 Flipbooks at once:
$\mbox{Flipbook}~\flipbooktwitter$—Word use on Twitter: US Presidential Election (2016-11-09) versus the Charlottesville Unite the Right Rally (2017-08-13); Variation of $\alpha$.

$\mbox{Flipbook}~\flipbooktwitterRT$—Word use on Twitter: US Presidential Election (2016-11-09) versus the Charlottesville Unite the Right Rally (2017-08-13); Variation of inclusion of retweets from 1% to 100%; $\alpha = 1/3$.

$\mbox{Flipbook}~\flipbooktwittertimediff$—Word use on Twitter: Variation of time comparing 2019/01/04 going forward roughly logarithmically in number of days to a year ahead, 2020/01/03, the day of the assassination of Qasem Soleimani; $\alpha = 1/3$.

$\mbox{Flipbook}~\flipbooktrees$—Tree species abundance on Barro Colorado Island: Fig. 3 with variation of $\alpha$. The Flipbook shows how increasing $\alpha$ from 0 leads to an increasingly poor fit on the rank-rank histogram.

$\mbox{Flipbook}~\flipbookgirlsyears$—Baby girl names over time: Described in Sec. III D, comparisons of baby girl name distributions 50 years apart starting in 1880 and going forward in 5-year increments, with $\alpha = 1/3$. Ends with Fig. 4.

$\mbox{Flipbook}~\flipbookboysyears$—Baby girl names, 1968 vs. 2018: Described in Sec. III D, shows effect of varying $\alpha$, with Fig. 4 as the fifth page.

$\mbox{Flipbook}~\flipbookgirlsalphas$—Baby boy names over time: Described in Sec. III D, comparisons of baby boy name distributions 50 years apart starting in 1880 and going forward in 5-year increments, with $\alpha = 1/3$. Ends with Fig. 5.

$\mbox{Flipbook}~\flipbookboysalphas$—Baby boy names, 1968 vs. 2018: Described in Sec. III D, shows effect of varying $\alpha$, with Fig. 5 as the fifth page.

$\mbox{Flipbook}~\flipbookmarketcapsyears$—Market caps: Comparison of market caps for publicly traded companies in the fourth quarter six years apart, starting with 1995 versus 2001 and ending with 2012 versus 2018, and with $\alpha$ fixed at 1/3.

$\mbox{Flipbook}~\flipbooktwittertrunc$—Word use on Twitter, truncated: Full series of allotaxonographs corresponding to histograms of row 1 in Fig. 7 with $\alpha=1/3$.

$\mbox{Flipbook}~\flipbooktreestrunc$—Tree species abundance, truncated: Full series of allotaxonographs corresponding to histograms of row 2 in Fig. 7 with $\alpha=0$.

$\mbox{Flipbook}~\flipbookgirlnamestrunc$—Baby girl names, truncated: Full series of allotaxonographs corresponding to histograms of row 3 in Fig. 7 with $\alpha=\infty$.

$\mbox{Flipbook}~\flipbookboynamestrunc$—Baby boy names, truncated: Full series of allotaxonographs corresponding to histograms of row 4 in Fig. 7 with $\alpha=\infty$.

$\mbox{Flipbook}~\flipbookcompaniestrunc$—Market caps, truncated: Full series of allotaxonographs corresponding to histograms of row 5 in Fig. 7 with $\alpha=1/3$.

$\mbox{Flipbook}~\flipbooknba$—Season total points scored by players in the National Basketball Association: Season-to-season comparison of total player points per season, $\alpha = 1/3$. The Flipbook starts with 1996–1997 versus 1997–1998 and ends with 2017–2018 versus 2018–2019. Rookies, retirements, and injuries are all in evidence. For $\alpha=1/3$, Carmelo Anthony in 2003–2004 has the strongest debut, just ahead of LeBron James in the same year. Overall, Dwyane Wade's 2008–2009 season produced the highest $\rtdelement{1/3}$, moving from $\zipfrank = 51$ to 1 over the previous year, when injuries limited his playing time. In 2008–2009, Wade's points per game of 30.2 would be the highest of his career but his team, the Miami Heat, would founder, achieving the worst record in the NBA.

$\mbox{Flipbook}~\flipbookgoogleonegrams$—Google Books, Fiction in 1948 versus 1987, 1-grams: The first of three Flipbooks exploring $n$-gram usage in books by varying $\alpha$. We have elsewhere documented the deeply problematic influence of scientific literature and individual books, rendering the Google Books $n$-grams project unreliable, as is. Nevertheless, the Version 2 $n$-grams dataset for English fiction is worth exploring with different instruments, and we are endeavoring separately to provide corrective measures. For 1948, we see characters and place names dominate, and these come from a few books (e.g., Upton Sinclair's 'Lanny Budd', 'Raintree County'). The 1987 side shows words that are not tied to specific books but rather cultural and temporal phenomena, as well as cruder language: 'KGB', 'CIA', 'Vietnam', 'lesbian', 'television', 'computer', and 'fucking'. Tuning $\alpha$ towards $\infty$, we can see pronouns changing slightly in rank with 'her' and 'she' elevating and 'he' and 'his' dropping.

$\mbox{Flipbook}~\flipbookgooglebigrams$—Google Books, Fiction in 1948 versus 1987, 2-grams: For 2-grams, we again see character names dominate 1948 for low $\alpha$ ('Sung Chiang', 'the Perfessor'), while 'the CIA' and 'the KGB' stand out for 1987. Increasing $\alpha$ brings in the same words as for 1-grams preceded by 'the' ('the phone', 'the computer'). As $\alpha \rightarrow \infty$, bigrams containing 'not' appear more strongly for 1987.

$\mbox{Flipbook}~\flipbookgoogletrigrams$—Google Books, Fiction in 1948 versus 1987, 3-grams: For 3-grams, while we still see characters and place names for 1948, we now have what we call 'pathological hapax legomena', words (or trigrams in this case) that occur once in many books. The 3-grams are all from standardized, legal-speak front matter coming from outside of the story: 'change without notice', 'your local bookstore', and 'Cover art by'. A second kind of dominant trigram appears to be part of a book's title printed in the header or footer of every page. As we increase $\alpha$, we again see 'not' appearing in contributing 1987 trigrams. Because of the combinatorial explosion around words like 'computer' and 'phone', we no longer see them in the trigram lists. One upshot of this brief inspection of Google Books is to highlight the value of separately examining $n$-grams. We also note that the 3-gram example is our largest system-system comparison with system sizes on the order of $10^9$.

$\mbox{Flipbook}~\flipbookharrypotter$—Harry Potter books, all 1-grams: Comparison of each Harry Potter book relative to all other books in the series combined, using $\alpha = 1/2$ (the single book is the right-hand system, the merged set of 6 books the left-hand system). Character names and major objects and places dominate, and the first book is most different from the others combined.

$\mbox{Flipbook}~\flipbookharrypotternocaps$—Harry Potter books, uncapitalized 1-grams: The same comparison as the previous Flipbook but now with all capitalized words excluded, as an example attempt to use a different lens on our allotaxonometer. Hagrid's speech in part separates Book 1 ('yer', 'ter'), Book 3 has 'rat', 'dementor', and a relative abundance of em dashes ('—'), Book 7 has 'sword', 'wand', and 'goblin'. The dominant elements are things, places, and repeated actions (e.g., spells) and descriptors. To examine changes in functional word usage, which may reveal changes in Rowling's writing, we would increase $\alpha$ as we did for Google Books. Again, we see the relative ease of taking subsets with ranks for allotaxonometry.

$\mbox{Flipbook}~\flipbookdeathcauses$—Causes of Death in Hong Kong: Five-year gap comparison of causes of death reported per year in Hong Kong, starting with 2001 versus 2006 and moving through to 2012 versus 2017. Overall, pneumonia is the leading cause of death. In the second half of the time frame, 'kidney disease' and 'dementia' stand out as becoming more prevalent. Deaths listed as due to heroin drop off markedly in 2012 and 2013 relative to 5 years before. We note that changes in diagnoses, practices, and categorization are all confounding issues.

$\mbox{Flipbook}~\flipbookjobnames$—Job titles: US job titles based on text analysis of online postings, 2007 compared with 2018; variation across three kinds of job categorization, from coarse- to fine-grained groupings, with suitable variation of $\alpha$ ($\alpha=0$, $\alpha=1/12$, and $\alpha=1/3$).