Is Zipf’s law a new dimension in the relation between language and statistics?
Various disciplines use language as a means to communicate and to reflect the way they see the world. Zipf’s law, sometimes described as the science of citizens, has opened a new dimension for research in fields such as psychology, with considerable positive impact. Although language sets humans apart and lets us express ourselves through boundless media such as poetry or music, it is also subject to strict constraints in the form of mathematical and statistical principles.
The law is named after the linguist George Kingsley Zipf, who popularised it. It concerns word frequencies and states that “given a large sample of words used, the frequency of any word is inversely proportional to its rank in the frequency table.” It is usually represented graphically as a scatter plot of frequency against rank. Although he brought the law to wide attention, Zipf did not claim to have originated it. The French stenographer Jean-Baptiste Estoup had noticed the same regularity earlier.
The law is most easily observed as a scatter plot on log-log axes. The procedure starts with counting how often each word is used in a written or spoken text, and then ranking the words from the most frequent to the least frequent. The striking observation is that the most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. This means,
for a word of rank $r$, the frequency $f(r)$ satisfies
$f(r) \propto 1/r$.
The most generally used modern form of this relation is
$f(r) \propto 1/r^a$,
where the parameter $a$ is supposed to be close to 1.
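The counting and ranking steps described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `rank_frequencies`, the simple tokenising rule, and the sample sentence are assumptions made for this example, not part of Zipf’s own procedure.

```python
import re
from collections import Counter

def rank_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first.

    Tokenisation is deliberately crude (lower-case runs of letters and
    apostrophes); a real study would need a more careful tokeniser.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = counts.most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Hypothetical sample text, far too short to show Zipf's law itself.
sample = "the cat and the dog and the bird"
table = rank_frequencies(sample)
for rank, word, count in table:
    print(rank, word, count)
```

On a realistic corpus, the `count` column would be expected to fall off roughly as $1/r^a$ with the `rank` column.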
The relation above is an instance of what is known as a power law. One of the primary characteristics of a power law is that it appears as a straight line on a log-log plot. Zipf’s law follows this behaviour as well, as the following illustrates.
$f(r) = C / r^a$, (2)
where $C$ is some constant. For example, if $C$ is equal to the frequency of the most frequent word (i.e., the word of rank $r = 1$), and $a = 1$ (as Zipf argued), then we get the sequence $f(1) = C$, $f(2) = f(1)/2$, $f(3) = f(1)/3$, and so on, as described above.
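As a small worked example of this sequence, the snippet below evaluates $C / r^a$ for the first few ranks. The function name `zipf_frequencies` and the value $C = 1200$ are chosen purely for illustration and do not come from any real corpus.

```python
def zipf_frequencies(c, a, n):
    """Ideal frequencies f(r) = c / r**a for ranks r = 1..n."""
    return [c / r**a for r in range(1, n + 1)]

# Assumed values: the top word occurs 1200 times, and a = 1 as Zipf argued.
ideal = zipf_frequencies(c=1200, a=1.0, n=4)
print(ideal)  # [1200.0, 600.0, 400.0, 300.0]
```

The second entry is half of the first and the third is a third of it, matching the pattern $f(2) = f(1)/2$, $f(3) = f(1)/3$.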
Next, we can take the logarithm on both sides of equation (2) and rewrite it as follows:
$\log f(r) = \log(C / r^a) = \log C - a \log r$
Remembering that the general equation of a straight line is $y = mx + b$, where $m$ is the slope of the line, we see that the logarithm of the frequency, $\log f(r)$, does indeed follow a linear relationship in terms of the logarithm of the rank, $\log r$. In this case the slope is $-a$ and the intercept is $\log C$.
This phenomenon, as observed in word frequencies, is known as Zipf’s law. Interestingly, though, the same phenomenon has also been observed in many other areas. For example, it occurs in areas closely related to language such as music or computer code, but also in completely unrelated systems such as sizes of cities or connections in networks like the internet or the power grid. It even shows up in snooker statistics! The main difference, though, is that the value of the parameter $a$ in the corresponding power law (i.e., the magnitude of the slope of the linear log-log relationship) can be quite different for these different systems.
To apply the law in practice, a linear regression on the log-log data can be run in R, which is available on all major platforms. For the text of The Origin of Species by Charles Darwin, the results are shown in figure 2. The estimated parameter value for the corresponding power law is $a = 0.829$, a little less close to 1 than for the COCA database, but the fit of the straight line is even better: $R^2 = 0.99$.
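The regression step can be sketched as follows. The article’s analysis was done in R; the snippet below is an illustrative Python equivalent using ordinary least squares on the log-transformed data, not the original script. The function name `fit_power_law` and the synthetic input are assumptions for this example, and the figures quoted above for The Origin of Species are not reproduced here.

```python
import math

def fit_power_law(frequencies):
    """Least-squares fit of log10(f) against log10(rank).

    `frequencies` must be positive, sorted from most to least frequent,
    and contain at least two values. Returns (a, log10_c, r_squared),
    where the fitted line is log10 f = log10_c - a * log10 r.
    """
    xs = [math.log10(r) for r in range(1, len(frequencies) + 1)]
    ys = [math.log10(f) for f in frequencies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Coefficient of determination: 1 - residual SS / total SS.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return -slope, intercept, r_squared

# Synthetic, perfectly Zipfian data (C = 1000, a = 1) as a sanity check.
synthetic = [1000 / r for r in range(1, 51)]
a, log_c, r2 = fit_power_law(synthetic)
print(round(a, 3), round(log_c, 3), round(r2, 3))
```

On real word counts the fitted `a` and `r2` would deviate from these ideal values, which is exactly what the regression is meant to quantify.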
Zipf’s law has also proved useful in ethical hacking and password security. In one specific application, researchers propose using the number of unique passwords used in the regression, together with the absolute value of the slope of the regression line, as a metric for assessing the strength of password datasets, and they prove its correctness in a mathematically rigorous manner. They also report extensive experiments (including optimal attacks, simulated optimal attacks and state-of-the-art cracking sessions) to demonstrate the practical effectiveness of the metric. In two of four cases, the metric outperforms Bonneau’s α-guesswork in simplicity and, to the best of the authors’ knowledge, it is the first one that is both easy to approximate and accurate enough to facilitate comparisons. It gives security administrators a useful tool for gaining a precise grasp of the strength of their password datasets and for adjusting password policies more reasonably.
The law has also been applied to famous scripts, novels and speeches to help understand the psychology of legendary orators and speakers. It has uses in a myriad of subjects such as biology, physiology, city planning, information retrieval and quantitative linguistics.
The largest cities, the most frequently used words, the income of the richest countries, and the wealthiest billionaires can all be described in terms of Zipf’s Law, a rank-size rule capturing the relation between the frequency of a set of objects or events and their size. It is assumed to be one of many manifestations of an underlying power law like Pareto’s or Benford’s. Contrary to popular belief, however, simple random sampling from a distribution of, say, city sizes does not yield Zipf’s law for the largest cities. This pathology is reflected in the fact that Zipf’s Law has a functional form that depends on the number of events N. Obtaining the law requires a fundamental property of the sample distribution that has been called ‘coherence’, which corresponds to a ‘screening’ between various elements of the set, and this property should be accounted for when fitting Zipf’s Law.
In terms of psychology, how authors express their thoughts reveals much about their character, according to psychologist James Pennebaker. In particular, an author’s use of so-called function words (such as pronouns, articles, and a few other categories of words that, on their own, convey little meaning) is apparently directly linked to their social and psychological states. Simply put, your choice of words says something about your personality.
Pennebaker and colleagues have developed a sophisticated computer program (unfortunately not free, though) that collects statistics about an author’s use of words in specific categories (such as function words). With this software they then analysed thousands of books, blogs, and speeches, and were able to link an author’s specific word use to their personality, honesty, social skills, and intentions. This connection had already been discovered earlier, but with this new software tool it has become possible to investigate it in much more detail and on a much larger scale, firmly establishing the link between linguistics and psychology.
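To give a rough idea of what collecting statistics about function words can mean, here is a toy sketch. It is not Pennebaker’s software and makes no claim about how that program works: the tiny word list `FUNCTION_WORDS`, the function name `function_word_share`, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical, deliberately tiny list of function words (pronouns,
# articles, a few prepositions and conjunctions). Real tools use far
# larger, carefully curated categories.
FUNCTION_WORDS = {
    "i", "you", "he", "she", "it", "we", "they",
    "the", "a", "an", "of", "in", "on", "and", "but", "or",
}

def function_word_share(text):
    """Fraction of the words in `text` that appear in FUNCTION_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in FUNCTION_WORDS)
    return hits / len(words)

share = function_word_share("I think the dog and the cat like it")
print(f"{share:.2f}")
```

Comparing such shares across many texts by different authors is the kind of word-category statistic that analyses of this sort build on.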
The language of statistics is not always easy to understand. But statistical analysis provides a useful and versatile tool. And this mathematical language, as Zipf, Pennebaker, and others have shown, can in turn be used to analyse natural language as well, making your words count by counting your words: the statistics of language. That is what makes this law such a remarkable subject of research.