Zipf's law holds for phrases, not words

Jake Ryland Williams, Paul R. Lessard, Suma Desu, Eric Clark, James P. Bagrow, Christopher M. Danforth, and Peter Sheridan Dodds

Abstract

Over the last century, the elements of many disparate systems have been found to approximately follow Zipf's law: that element size is inversely proportional to element size rank. Examples range from city populations to firm sizes and family names. But although Zipf's law was originally and most famously observed for word frequency, it is surprisingly limited in its applicability to human language, holding over only a few orders of magnitude before hitting a clear break in scaling. Here, building on the simple observation that a mixture of words and phrases comprises the coherent units of meaning in language, we show empirically that Zipf's law for English phrases extends over seven to nine orders of rank magnitude, rather than the typical two to three for words alone. In doing so, we develop a simple, principled, and scalable method of random phrase partitioning, which crucially opens up a rich frontier of rigorous text analysis via a rank ordering of mixed-length phrases rather than words.
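
To make the idea concrete, the sketch below randomly partitions a text into mixed-length phrases by cutting each whitespace boundary independently with a fixed probability q, then rank-orders the resulting phrase frequencies to inspect Zipf-like scaling. This is a minimal illustration only: the function name, the uniform cut probability q, and the corpus path are assumptions for demonstration, since the abstract does not specify the authors' exact partitioning scheme.

    import random
    from collections import Counter

    def random_phrase_partition(text, q=0.5, seed=None):
        # Cut each whitespace boundary independently with probability q,
        # joining uncut runs of words into phrases. q = 1 recovers
        # ordinary words; smaller q yields longer mixed-length phrases.
        rng = random.Random(seed)
        phrases, current = [], []
        for word in text.split():
            current.append(word)
            if rng.random() < q:
                phrases.append(" ".join(current))
                current = []
        if current:  # flush any trailing, uncut phrase
            phrases.append(" ".join(current))
        return phrases

    # Rank-order phrase frequencies; Zipf-like scaling means frequency
    # falls off roughly as 1/rank over many orders of rank magnitude.
    text = open("corpus.txt").read()  # hypothetical corpus file
    counts = Counter(random_phrase_partition(text, q=0.5, seed=1))
    for rank, (phrase, freq) in enumerate(counts.most_common(10), start=1):
        print(rank, freq, phrase)

A single pass over the text suffices, so the procedure scales linearly with corpus size, consistent with the abstract's emphasis on scalability.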