Words in Numbers

Nobel prizes quantified

Cover Art

Language has always been our species' most wonderful and valuable tool. We use it to express ideas, feelings and opinions, granting other people a glimpse into our inner workings. It bridges the gap between two minds like no other form of communication can. The most talented practitioners of this art can put emotions and human experience into writing so vividly that readers experience them as if they were their own. While literature is an art form and intuitively a world of feeling, a numerate eye can be turned on it to reveal the inner structure of language and how those who have mastered this art use their tools.

Two masters of literature are the German writers Hermann Hesse and Thomas Mann, who both won a Nobel Prize for their works. I decided to collect several novels by each of them in digital form and apply statistical methods to the data, hoping to reveal differences in their styles and recurring patterns in their language. The methods used are simple and understandable to a layperson, as I am no linguist myself, not even close, and as I believe that simple counting and averaging can reveal an astonishing amount of insight.


Author Bios

AUTHORS

Hermann Hesse and Thomas Mann are German authors who both contributed to world literature with their works. Their portraits, birthplaces and lifespans are meant to give an idea of the person behind each name and of the times in which they gathered the experiences their stories are built on.

The books analysed here are far from their complete works, which is why the books included in the analysis are listed. The Nobel Committee's official statements on its choices also give an idea of why their work is considered outstanding. Hesse apparently won for his literary achievement as a whole, while Mann won foremost for his novel "Buddenbrooks".


Vocabulary

VOCABULARY I

On the lowest level, stories consist of words. They are the smallest building blocks with which a writer constructs a story. The size of a writer's vocabulary gives a feeling for how many of those building blocks are needed to build complex stories and to relate topics of human experience to each other.

Simply counting the words, including every instance of each word, gives a total of a few hundred thousand for each author. The number of unique words is only about a tenth of that. The question is whether each unique word therefore appears roughly ten times. It turns out that this assumption is far from true. Word frequencies in a text follow a so-called Pareto distribution, popularly known as the 80/20 rule (for word counts, this pattern is usually called Zipf's law). This means that a few words are extremely frequent and make up the bulk of the total word count, while most of the vocabulary appears only a handful of times. The 100 most frequent words make up over 40% of the whole word volume for each author, with the conjunction "und" (German for "and") being the most frequent word of all, accounting for 4% of the word volume on its own.
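
For readers who want to try this kind of counting themselves, here is a minimal Python sketch. It is not the code behind the numbers above: the function names, the simple letters-only tokenizer and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (umlauts included)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequency_summary(text, top_n=100):
    """Count total and unique words and the share covered by the top_n words."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return {
        "total": total,
        "unique": len(counts),
        "unique_ratio": len(counts) / total if total else 0.0,
        "top_share": top_total / total if total else 0.0,
        "most_common": counts.most_common(top_n),
    }


# Made-up sample sentence, not taken from any of the novels.
sample = "Der Hund und die Katze und der Vogel und das Haus"
summary = frequency_summary(sample, top_n=1)
print(summary["total"], summary["unique"], summary["most_common"])
```

Run on a whole novel with the default top_n=100, the "top_share" value is the quantity that corresponds to the 40% figure quoted above.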

Looking at the most used nouns might reveal the topics and concepts a writer consistently returns to. Especially in the case of Hermann Hesse, the top 10 nouns coincide beautifully with his affinity for the Bildungsroman, a literary genre that usually follows a young protagonist on their journey through life, illustrating their moral and spiritual growth along the way. Words like Gesicht, Augen, Kopf and Hand (German for "face", "eyes", "head" and "hand" respectively), some of which appear in both authors' top nouns, seem to emphasise the importance of describing the characters' appearance, the direction of their attention and their interaction with each other and the world.


Vocabulary accumulation in DEMIAN
Vocabulary accumulation in BUDDENBROOKS

VOCABULARY II

Another question is how new words, those that have not occurred earlier in the text, accumulate over the course of a book. While all words in the first sentence are new, later sentences mostly contain words already seen. A visualisation confirms this intuitive assumption. Plotting a dot for each first sighting of a word produces a cloud that is dense at the beginning and slowly fades towards the end of the book.
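
The idea behind such a plot can be sketched in a few lines of Python. This is only an illustration under my own assumptions (hypothetical function names, a made-up word list, and chunk counting as a stand-in for the dot cloud), not the code that produced the figures.

```python
def first_occurrences(words):
    """Return the positions at which a word appears for the first time."""
    seen = set()
    positions = []
    for index, word in enumerate(words):
        if word not in seen:
            seen.add(word)
            positions.append(index)
    return positions


def new_words_per_chunk(words, chunk_size):
    """Count first occurrences in consecutive chunks of chunk_size words."""
    chunk_count = (len(words) + chunk_size - 1) // chunk_size
    counts = [0] * chunk_count
    for position in first_occurrences(words):
        counts[position // chunk_size] += 1
    return counts


# Made-up word list, not taken from any of the novels.
words = "der hund und die katze und der hund und die maus".split()
print(first_occurrences(words))
print(new_words_per_chunk(words, chunk_size=4))
```

The positions returned by first_occurrences are what one would draw as dots, and the per-chunk counts give a rough measure of how the density of new words falls off along the book.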


Grammar

PARTS OF SPEECH I

Above the word level lies the level of word categories. Words are grouped by grammatical function into Parts of Speech, such as nouns, verbs, adjectives and pronouns. I have to trust that readers remember this from their school years, as diving into all the different types would go beyond the scope of this article.

I used a machine learning algorithm (courtesy of WZB) to tag each word with its word category. The algorithm was even able to identify subcategories of each main word type, such as the difference between a possessive pronoun and a personal pronoun. Aggregating these tags across all books of each author resulted in a grammatical cross section, detailing the ratios in which each word type occurs. Apart from small differences, both authors use parts of speech in similar ratios, and the subcategory ratios seem to be almost constant.

One small detail that stood out was the ratio of the grammatical genders of German nouns. In German, each noun has one of three possible genders, marked by their respective articles: der (masculine), die (feminine) and das (neuter). To my surprise, the ratios in which they occur in the texts are far from equal. The feminine gender is the most frequent, though at a ratio of almost 1:1 to the masculine. The neuter gender appears only about half as often.
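
Once every word carries a tag, the aggregation step is again simple counting. The Python sketch below assumes the tagger's output is available as (word, tag) pairs; the tag names, the two tiny "books" and the function name are hypothetical and hand-written for illustration, and the tagging itself is not shown.

```python
from collections import Counter


def pos_ratios(tagged_tokens):
    """Return the share of each tag among a sequence of (word, tag) pairs."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in counts.items()}


# Hand-tagged toy examples standing in for the output of a real tagger.
book_a = [("Der", "DET"), ("Hund", "NOUN"), ("schläft", "VERB")]
book_b = [("Die", "DET"), ("Katze", "NOUN"), ("sieht", "VERB"),
          ("den", "DET"), ("Hund", "NOUN")]

# Aggregate across all books of one author by pooling their tokens.
ratios = pos_ratios(book_a + book_b)
print(ratios)
```

Pooling the tokens before dividing weights each book by its length; averaging per-book ratios instead would be an equally defensible choice with slightly different results.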


Grammar along SIDDHARTHA
Grammar along KÖNIGLICHE HOHEIT

PARTS OF SPEECH II

So far we have looked at the word types occurring across all books of a specific author. But the word categories can also be tracked along a single document, as ratios that change over time as we progress through the book.

To achieve that, I let a window 500 words wide roll along the text of a single book, counting how many words of each category are inside the window at each position, thus creating a timeline that details the change in grammatical makeup along the document.

Although the ratios stay fairly constant overall, large local changes can be seen where the occurrence of a word type suddenly rises or falls sharply.
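
The rolling window can be sketched in Python as follows. This is a minimal illustration under my own assumptions: the input is a plain list of tags, one per word, the function name is hypothetical, and the toy example uses a window of 3 instead of 500 so that the result can be checked by hand.

```python
from collections import Counter


def rolling_tag_counts(tags, window=500):
    """Return one dict of tag counts per window position along the text."""
    if window <= 0 or len(tags) < window:
        return []
    counts = Counter(tags[:window])
    timeline = [dict(counts)]
    for start in range(1, len(tags) - window + 1):
        # Slide by one word: drop the word leaving, add the word entering.
        counts[tags[start - 1]] -= 1
        counts[tags[start + window - 1]] += 1
        timeline.append({tag: n for tag, n in counts.items() if n > 0})
    return timeline


# Hand-written toy tag sequence, not taken from any of the novels.
tags = ["NOUN", "VERB", "NOUN", "DET", "NOUN"]
timeline = rolling_tag_counts(tags, window=3)
for position, counts in enumerate(timeline):
    print(position, counts)
```

Updating the counts incrementally, rather than recounting each window from scratch, keeps the work per step constant, which matters for a novel of more than a hundred thousand words.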


Calculations and Code