Is fiction infinitely far removed from science? For readers, perhaps. Literary scholars are much less categorical. How does graph theory help predict who will win the Game of Thrones, and how did programmers uncover J.K. Rowling's secret?

Author, author

The story that unfolded in 2013 around the detective novel "The Cuckoo's Calling" turned out to be far more gripping than the book's own plot about the murder of a beautiful supermodel. The author, retired military man Robert Galbraith, sent his debut to several publishers and received rejection after rejection. The head of Orion Publishing later recalled: "The novel seemed to me well written, but somehow quiet. Nothing in it really hooked me." Finally, a small publishing house, Sphere Books, took the novel on. Even so, the entire print run might have stayed in the warehouse had it not been for reporters. No one remembers how the journalists of The Sunday Times came across Galbraith's novel, but they were the first to realize that a sensation might be hiding under its modest cover.

Investigative journalism is the strong suit of The Sunday Times. In the 1970s, the newspaper became famous for an article exposing the British manufacturer of thalidomide. The drug, prescribed to pregnant women and considered absolutely safe, in fact caused children to be born with congenital deformities. The manufacturer, Distillers, set up a trust fund for the injured children, but its size was paltry compared with the firm's assets. That is what the journalists of The Sunday Times drew attention to. The article caused so much discussion in the media that the company had to increase the fund by one and a half times.

So The Sunday Times had a solid track record of scoops, and in the case of the aspiring writer its intuition worked flawlessly. The author is a retired military man, and his hero is also a retired military man, now working as a private investigator. Both had apparently worn uniforms for most of their lives. Where, then, did such knowledge of fashion come from? Galbraith's heroine tries on in a boutique not just a raincoat but a sequined trench coat, and not just a tight-fitting dress but a dress with a concealed underwired corset. Of course, the author might have had consultants versed in fashion, but doubts had already arisen. The newspaper's staff immediately shared their suspicions on social networks.

Right away, as in a classic detective story, an anonymous "well-wisher" appeared. In the comments, he said that the so-called Robert Galbraith was none other than the famous J.K. Rowling, author of "Harry Potter". The Sunday Times investigators got down to business. They discovered that Sphere Books belonged to the same publishing group that had printed Rowling's first book for adults, "The Casual Vacancy", a year earlier, and that the same literary agent was handling both books. That was something, but it was clearly not enough evidence: hundreds of manuscripts pass through a literary agent's hands every year, and the Little, Brown and Company group includes a dozen publishers. At this point linguists and developers of stylometry software came to the journalists' aid.


Unique style

Stylometry is a set of methods for studying the style of a text, relying primarily on statistical tools. The manner of writing and the choice of vocabulary are as individual as fingerprints, so a more or less reliable picture of a person's artistic "handwriting" requires taking many parameters into account. Stylometric programs measure the average length of words, sentences and paragraphs, and calculate the ratio of the parts of speech used: are there more verbs or more nouns in the text? If a word has synonyms, the algorithms keep track of which one the writer chooses most often. For example, when speaking of something big, does the author say "huge" or "gigantic"? During the analysis, the program compiles a frequency dictionary, that is, a list of the words the author uses most often. And so on: the program that analyzed "The Cuckoo's Calling" takes hundreds of thousands of different parameters into account.
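A few of the measurements listed above are easy to try at home. The sketch below is only an illustration, not the software used in the Rowling case: the function name `stylometric_profile`, the crude rule for splitting sentences and the sample text are all assumptions made for this example.

```python
import re
from collections import Counter

def stylometric_profile(text, top_n=5):
    """Return a few simple style measurements for a text (illustrative only)."""
    # Rough heuristic: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens; apostrophes are kept inside words.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # The "frequency dictionary": the most common words and their counts.
        "top_words": Counter(words).most_common(top_n),
    }

sample = "The cat sat. The cat ran away! Where is the cat?"
profile = stylometric_profile(sample, top_n=2)
print(profile)
```

A real stylometric program would add many more features, such as part-of-speech ratios and synonym preferences, but the principle is the same: each text is turned into a set of numbers.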

When the statistical portrait of a text is ready, the algorithm compares it with similar portraits of other texts by the presumed author. To check whether J.K. Rowling was hiding behind the pseudonym, the program was "fed" the final volume of the Harry Potter series ("Harry Potter and the Deathly Hallows", released in 2007) and the realist novel "The Casual Vacancy" (published in 2012). "The Cuckoo's Calling" was then also compared with several detective novels by contemporary authors. It turned out that the books of Rowling and Galbraith really do have much more in common with each other than with any of the other detective novels examined. Many features coincided, from the way prepositions are used to a fondness for classical quotations.
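The comparison step can be sketched in the same spirit. The example below assumes one common simplification: each text is reduced to the relative frequencies of a handful of function words, and two texts are compared by cosine similarity. The word list, the function names and the toy texts are hypothetical; the source does not say which measure the actual program used.

```python
import math
import re
from collections import Counter

# Hypothetical short list of function words; real studies use far longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "that", "it", "with", "for"]

def function_word_profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_author(disputed, candidates):
    """Name of the candidate whose profile is most similar to the disputed text."""
    target = function_word_profile(disputed)
    return max(
        candidates,
        key=lambda name: cosine_similarity(target, function_word_profile(candidates[name])),
    )

disputed = "the owl of the night flew to the tower of the castle"
candidates = {
    "A": "the king of the hill went to the edge of the wood",
    "B": "it is a truth that a man with a plan is in need of a map",
}
best = closest_author(disputed, candidates)
print(best)
```

With texts this short the result means nothing, of course; the point is only to show how "having more in common" can be turned into a number.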

Can stylometric analysis be considered 100% proof of authorship? No, just as the results of DNA analysis cannot be treated as exhaustive proof of a defendant's guilt (after all, a close relative may always turn out to be involved). Experts weigh the whole set of factors. In the case of J.K. Rowling, there was enough evidence: the journalists contacted the writer's literary agent, and in the summer of 2013 she admitted that Robert Galbraith does not exist. According to the writer, she needed the pseudonym to gather honest opinions from publishers about her detective debut.

The story of "The Cuckoo's Calling" seems almost too dramatic, and a thought immediately arises: who was the anonymous informant who started it all? It is possible that Rowling and her agent directed the entire plot from start to finish. Even if so, everyone won in the end: readers got a new book, the publishing house got its income, and linguists and programmers got room for new research.


Social network "Thrones"

The exact sciences can do more than identify the author of a work; they can also be indispensable in studying a book's plot. In April 2016, Andrew Beveridge, a mathematics professor from Minnesota, and his student Jie Shan demonstrated this. Like many of us, they wondered which of the characters in George Martin's "A Song of Ice and Fire" and the TV series "Game of Thrones" would make it to the end of the cycle. Martin is famous for his ruthlessness: dozens of beloved heroes were killed in the first five books. The cycle is planned as seven novels, and the remaining two should finally reveal who gets the Iron Throne of the Seven Kingdoms and who at least lives to the last page, which in the world of Thrones is an achievement in itself.

As their method, Beveridge and Shan chose network science, a field built on graph theory. In mathematics, a graph is a non-empty set of vertices together with a set of connections (edges) between them. Simply put, graph theory studies how a network of connections between objects of a certain kind is built: which objects interact with each other, which "gather" many other objects around themselves, and which remain practically alone. Graph theory has many practical applications. For example, it is used in geographic information systems to find the most convenient routes.

The material for the research was the third volume of the series, "A Storm of Swords". Shan and Beveridge explain that by this book the plot has gathered momentum, and the heroes have scattered across the vast map of fictional lands and made a sufficient number of contacts. From the text of "A Storm of Swords", the scientists built a social graph (a social network) whose objects (vertices) are 107 characters of the novel. The book has 12 narrators, each of whom mentions dozens of other heroes. As a single "connection" between characters (such a connection is called an edge of the graph), the mathematicians counted every case in which the names of two heroes appeared within 15 words of each other. This does not mean that the characters are friends, but they clearly interact in some way.
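The rule "two names within 15 words of each other make an edge" is simple enough to sketch in a few lines. This is a minimal illustration, not the researchers' actual code: the function name `cooccurrence_edges` and the toy text are made up, and the sketch assumes every character is referred to by a single-word name.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrence_edges(text, names, window=15):
    """Count how often two different names appear within `window` words of each other."""
    words = re.findall(r"[A-Za-z']+", text)
    # Positions of every name mention (single-word names only, an assumption of this sketch).
    mentions = [(i, w) for i, w in enumerate(words) if w in names]
    edges = Counter()
    # Mentions are in text order, so j is always greater than i.
    for (i, a), (j, b) in combinations(mentions, 2):
        if a != b and j - i <= window:
            # Sort the pair so that (A, B) and (B, A) count as the same edge.
            edges[tuple(sorted((a, b)))] += 1
    return edges

text = "Tyrion spoke with Sansa at dinner. " + "Meanwhile " * 20 + "Jon wrote to Sansa."
edges = cooccurrence_edges(text, {"Tyrion", "Sansa", "Jon"})
print(dict(edges))
```

Here Tyrion and Jon get no edge because their names are more than 15 words apart. The counts can serve as edge weights: the more often two names occur together, the stronger the connection.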


Who will take the Iron Throne?

The social network of fictional characters turned out to have features that are also found in graphs describing the social connections of real people. It includes several tightly connected "subnets" that form around the most influential, most frequently mentioned heroes. Visually, the graph almost reproduces the map of the Seven Kingdoms. For example, the dense "subnet" around Daenerys, the mother of dragons, lies at a distance from the main mass of connections, since the heroine herself is far from most of the characters, on another continent.

Beveridge and Shan identified seven main "communities" that have formed around the protagonists. Which of them is the most influential, and who has the best chance of winning the final battle? One can approach the question psychologically, taking the characters' personalities into account. One can compare behavioral strategies and choose the most profitable one. Graph theory offers its own method: calculate which of the heroes has the most social connections and interacts with others most often. It turned out that in the world of Thrones, influence does not always coincide with wealth and titles: among the most active characters are the illegitimate Jon Snow and Sansa Stark, daughter of the unjustly convicted Hand (advisor) of the King. The central vertex of the graph turned out to be the much-loved Tyrion Lannister, who is involved in many political intrigues.
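Counting connections is the simplest way to measure a character's importance in a graph. The sketch below computes two such scores, the number of distinct neighbours (degree) and the total number of co-mentions (weighted degree), for a hypothetical toy edge list. The numbers are invented for illustration and are not taken from the study; network science also offers other centrality measures that are not shown here.

```python
from collections import defaultdict

# Hypothetical toy edge list: (character, character, number of co-mentions).
edges = [
    ("Tyrion", "Sansa", 5),
    ("Tyrion", "Jon", 1),
    ("Tyrion", "Cersei", 8),
    ("Sansa", "Jon", 2),
]

def degree_centrality(edge_list):
    """Return {character: (degree, weighted degree)} for an undirected edge list."""
    degree = defaultdict(int)
    weighted = defaultdict(int)
    for a, b, weight in edge_list:
        for node in (a, b):
            degree[node] += 1         # one more neighbour
            weighted[node] += weight  # total strength of all connections
    return {node: (degree[node], weighted[node]) for node in degree}

scores = degree_centrality(edges)
# Rank by degree first, then by weighted degree.
ranking = sorted(scores, key=lambda name: scores[name], reverse=True)
print(ranking)
```

In this toy network Tyrion comes first because he is connected to everyone else, which is the same kind of reasoning that puts him at the centre of the real graph.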

According to Andrew Beveridge and Jie Shan, their research demonstrated the power of network science and confirmed many of the expectations of "A Song of Ice and Fire" fans. Questions inevitably arise: if the connections between characters can be described statistically, and this allows conclusions to be drawn about how the plot will develop, how much freedom does the author have when writing the next books? And at what point does the plot acquire a logic of its own and become stronger than its creator?


Molly Bloom's voice

Mathematical methods make it possible to analyze not only individual books but entire literary movements. A team of Polish scientists recently did just that. They studied works in which the stream of consciousness, a technique characteristic of modernist literature, plays an important role. Modernism appeared at the end of the 19th century, succeeding the era of the classic novel. A modernist writer is no longer a bearer of absolute truth; reality in these works is woven from many points of view. Conveying this multiplicity required new means of expression, and one of them was the stream of consciousness.

The technique conveys the hero's inner speech and way of thinking in an extremely natural, "raw" form. The text mixes fragments of phrases, often connected not logically but by association. Thoughts flow almost incoherently into one another, and memories suddenly "flash". This is how Molly Bloom's monologue, the concluding chapter of James Joyce's Ulysses, begins:

"Yes because he never did a thing like that before as ask to get his breakfast in bed with a couple of eggs since the City Arms hotel … and she never left us a farthing all for masses for herself and her soul greatest miser ever was actually afraid to lay out 4d for her methylated spirit telling me all her ailments she had too much old chat in her about politics and earthquakes and the end of the world let us have a bit of fun first God help the world if all the women were her sort down on bathingsuits and lownecks of course nobody wanted her to wear them …"

There is not a single punctuation mark on the final pages of Ulysses. Molly thinks about old acquaintances, sings songs, remembers to sprinkle her coat with mothballs. Likewise, in passing, she ponders what true love is. From this chaos of thoughts and impressions, the heroine's unique voice, her view of events, is formed.

What does mathematics have to do with it, one might ask? Initially, the Polish researchers set out to study how sentence lengths correlate in texts of different eras, from the Bible to The Lord of the Rings. An abundance of short, "chopped" sentences makes a narrative dry, while overly long sentences are hard to follow. The study was meant to clarify how authors arrange long and short sentences to create the rhythm of a piece. The main characteristic of a text in the Polish scientists' work is sentence length variability (SLV). They examined it in 113 texts (each at least 5,000 sentences long) in seven languages: English, French, German, Russian, Polish, Italian and Spanish. The scientists "translated" the books into the language of mathematics: each sentence was reduced to the number of words in it, so that every text turned into a sequence of numbers.
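Turning a text into a sequence of sentence lengths takes only a few lines. The sketch below is an illustration under simple assumptions: sentences are split on ., ! and ? followed by whitespace, and the standard deviation is used as a crude stand-in for variability. It is not the SLV measure or the fractal analysis used in the study, and the function name and sample text are made up.

```python
import re
import statistics

def sentence_lengths(text):
    """Return the number of words in each sentence, in order of appearance."""
    # Rough heuristic: a sentence ends with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(re.findall(r"[\w']+", s)) for s in sentences]

text = "It rained. The river rose over the old stone bridge. We left."
lengths = sentence_lengths(text)
print(lengths)

# Population standard deviation: a simple way to see how much the lengths vary.
spread = statistics.pstdev(lengths)
print(round(spread, 2))
```

A sequence like this, thousands of numbers long, is the raw material in which the researchers then looked for patterns.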

20th century multifractals

It turned out that fractal structures occur in all the texts, and in some of them multifractal ones. A fractal is a set that has the property of self-similarity: the object as a whole has the same shape as one or more of its parts. When people speak of fractals, they usually mean monofractals, which can be described by a single self-similarity index (the fractal dimension). Describing a multifractal requires a whole fractal spectrum, a range of self-similarity indices belonging to different elements of the structure. Multifractals are used, for example, to describe the behavior of financial markets.

Fractal structures were found in all the texts studied. But almost all the works in which multifractals with a complex spectrum of self-similarity indices were found belonged to the modernist and postmodernist literature of the twentieth century and contained a stream of consciousness. This is how James Joyce, Virginia Woolf and Julio Cortázar built their texts. The exception, oddly enough, was the Old Testament, which also turned out to be multifractal. Joyce's experimental novel Finnegans Wake has an almost ideal multifractal structure. At the same time, many works in which critics had seen features of the stream of consciousness did not pass the mathematical "test". For example, Ayn Rand's novel Atlas Shrugged contains virtually no multifractal structures.


Does this research have a practical application? Undoubtedly. The scientists themselves say that their method will help assign texts to a particular literary movement more objectively. The study of quantitative characteristics will complement the usual methods of attribution, which rely on the time a work was written and on shared literary and aesthetic principles. This will be useful both for literary scholars and for their electronic "colleagues": the cataloguing algorithms of online libraries and text corpora (large collections of texts gathered and processed according to certain rules, which serve as material for linguistic research). In addition, such an analysis will help determine whether certain literary techniques can be considered a "marker" of a particular movement. For example, the stream of consciousness can definitely (with a few exceptions) be counted among the characteristic features of modernism.

Another area in which such research is in demand is the development of artificial intelligence. Many companies are working on algorithms that can produce texts themselves by analyzing large amounts of information. Such programs already exist: at the Associated Press, for example, a news robot writes short notes about companies' earnings based on their reports. It is unlikely that software will ever completely replace journalists, but robots will definitely write for newspapers and magazines. The requirements for them will, on the whole, be the same as for people: to write informatively and without boring the reader.

The original goal of the Polish scientists' study was to determine how long and short sentences should alternate so that a text does not look dry and "clerical" and is at the same time easy to understand. Such work reveals patterns that can teach electronic journalists to present information coherently and conveniently for the reader.

These are far from the only ways of applying mathematics to the study of fiction. Suffice it to recall game theory, which is used by researchers of conflict in drama, or the works of Andrei Kolmogorov devoted to poetry. All these studies show that different fields of science are closely related and actively borrow methods and materials from one another. Perhaps in the future people will not even remember that there was once a division into "physicists" and "lyricists".

