Dear Language Nerd,
I am fascinated by the news that J.K. Rowling wrote a murder mystery, The Cuckoo’s Calling, under the pen name of Robert Galbraith… and got found out! But the news articles I’ve read skip what seems to me to be the most interesting part of the story: what showed that the two books were by the same author?
From Lynda Eggstein
Deciding which of two (or more) people is the more likely author of a piece of writing is its own little subfield of linguistics, called authorship comparison. Note that I did not say “definitively deciding who wrote something.” Your quality forensic linguist is going to say something along the lines of “given [potential authors] x and y, x or y is the more likely candidate of the two” to have written unknown paper u (with the role of “quality forensic linguist” being filled here by John Olsson, pg. 44). Your lesser or pseudolinguist is going to say something more like “mmm, yes, this was 100% written by Eduardo, indubitably” (that quote I just made up), but this kind of excessive certainty is a bad sign.
If there is only one person in question as the possible author, the linguist can say that paper u has similarities to the style of papers known to be written by x, or that it shows notable differences, but without other people involved it’s hard to see how strong or weak a correlation is. As is so often the case, the trustworthy people are the ones qualifying their answers.* People who are too sure of themselves are probably doing something wrong.
Authorship comparison (more often but less accurately called authorship attribution or authorship identification) comes up all the time in court.** For example, forensic linguists might be called on to determine if an e-mail tirade is actually similar to the language of the CEO, or is more like the language of the disgruntled employee known to have figured out the passwords. Or they might analyze whether the language of a suicide note is more similar to the language of the dead man or to the language of his eldest daughter — who now stands to inherit his fortune!
But what exactly can we quantify as similarities? There are aspects of style that are pretty obvious – spelling, layout, running themes, particular phrases that get used frequently. Lovecraft is the king of florid prose. Tolkien has long, winding sentences with lots of poetry and world-description. Dave Barry is likely to tell us that a phrase makes a good rock band name. And while it is less recognizable, this is true for people generally — maybe your cousin is the only person in your family who uses disjunctive pronouns in coordination, or your mom tends to write run-on sentences. (It’s been suggested that I have a proclivity for parentheticals.)
Unfortunately, these obvious style choices are not super useful for authorship comparison. Precisely because they are obvious and notable in a person’s writing, they’re what a person attempting to imitate a particular writer would copy first. And that’s not necessarily a bad thing: legions of imitators have followed Hemingway when writing he-man fiction, or Raymond Chandler when writing detective stories, or Tolkien when writing fantasy. Learning from beloved authors, seeing how a particular style conveys an idea or connects to a reader, is a useful step in learning how to write in one’s own voice.
But if the conniving daughter has realized that interrobangs are a hallmark of her father’s style, well, that’s the first thing she’ll throw in if she’s trying to forge that note.
And, to come back around to the point, there are some overt similarities between Galbraith and Rowling, but by themselves they prove very little. Do the similarities mean they’re the same person? Or that Galbraith is someone totally different who was inspired by Rowling’s work, read her Potter books six or eight dozen times growing up, or just purposefully copied her prose style because her books are so popular? There’s no knowing. We have to go deeper.
Time to leave behind the more conscious aspects of style and look at things that are a little harder to imitate. Even these don’t prove that a particular writer wrote a particular book. Someone can copy a style thoroughly enough to get similar scores here, too, and as Patrick Juola, the author of one of the original Galbraith analyses, puts it, “Certainly one could do worse than imitate the style of one of the most successful writers of this generation.” But they’re much better indicators of authorship than “eh, that kinda looks like what Rowling would do.” Juola analyzed the text of The Cuckoo’s Calling for four characteristics:
1. Distribution of word length. What percent of words in the text have one letter? What percent have two letters? What percent have thirty-four letters?
2. Distribution of the 100 most common English words. What percent of the text was “the”? What percent was “of”? What percent was “and”? Etc.
3. Distribution of word bigrams. That just means any set of two words. So “that just,” “just means,” “means any,” “any set,” “set of,” “of two,” and “two words” are all word bigrams in that last sentence.
4. Distribution of character 4grams. That one means four characters together, rather than words. So “That” is a character 4gram, and “ract” from “character” is a character 4gram, and so is “is a” because the space counts as a character too.
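For the curious, here is a minimal Python sketch of how those four distributions could be counted. This is not Juola's actual software. The function names, the simple regex tokenizer, and the tiny three-word stand-in for the "100 most common words" list are all my own assumptions for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a "word" is a run of letters, optionally with internal
    # apostrophes (straight or curly), lowercased. Real tools may differ.
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())

def normalize(counts):
    # Turn raw counts into proportions that sum to 1.
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}

def word_length_dist(text):
    # 1. What share of the words have 1 letter, 2 letters, and so on?
    return normalize(Counter(len(word) for word in tokenize(text)))

def common_word_dist(text, common_words):
    # 2. What share of all words in the text is each common word?
    words = tokenize(text)
    counts = Counter(words)
    return {w: counts[w] / len(words) for w in common_words} if words else {}

def word_bigram_dist(text):
    # 3. Every pair of neighboring words.
    words = tokenize(text)
    return normalize(Counter(zip(words, words[1:])))

def char_4gram_dist(text):
    # 4. Every run of four characters; spaces count as characters.
    return normalize(Counter(text[i:i + 4] for i in range(len(text) - 3)))

sentence = "That just means any set of two words."
bigrams = word_bigram_dist(sentence)
print(len(bigrams))  # 7 bigrams, matching the list in item 3 above
print(common_word_dist("the cat and the dog", ["the", "of", "and"]))
```

Each function hands back a dictionary of proportions, which is the kind of "fingerprint-ish but not a fingerprint" profile that can then be compared across books.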
Juola also ran these tests on four other books, one each by Ruth Rendell, P.D. James, and Val McDermid, and then The Casual Vacancy by Rowling.*** I’m gonna go ahead and oversimplify how the comparison to The Cuckoo’s Calling breaks down for each category:
1. Word length
Very Similar: Rowling and James
Also Similar: —
Not Similar at All: McDermid and Rendell
2. Common words
Very Similar: Rowling and McDermid
Also Similar: James and Rendell
Not Similar at All: —
3. Word bigrams
Very Similar: Rowling
Also Similar: McDermid
Not Similar at All: James and Rendell
4. Character 4grams
Very Similar: McDermid
Also Similar: Rowling
Not Similar at All: James and Rendell
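How do you get from a pile of percentages to "very similar" or "not similar at all"? You need some measure of distance between two distributions. The source I'm drawing on doesn't spell out which measure was used for each test, so the sketch below uses cosine distance purely as a stand-in, and the author names and numbers in it are invented toy data, not real results.

```python
import math

def cosine_distance(p, q):
    # p and q map features (word lengths, bigrams, ...) to proportions.
    # 0.0 means the same direction; larger values mean less similar.
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0  # nothing to compare against
    return 1.0 - dot / (norm_p * norm_q)

def rank_candidates(unknown, candidates):
    # Order candidate authors from most to least similar to the unknown text.
    return sorted(candidates, key=lambda name: cosine_distance(unknown, candidates[name]))

# Invented toy word-length profiles (length -> share of words).
unknown = {1: 0.1, 2: 0.3, 3: 0.6}
candidates = {
    "Author A": {1: 0.1, 2: 0.3, 3: 0.6},
    "Author B": {1: 0.6, 2: 0.3, 3: 0.1},
}
print(rank_candidates(unknown, candidates))  # ['Author A', 'Author B']
```

Notice that all this produces is a ranking among the candidates you fed in. It says who is closest of the bunch, not who wrote the book.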
And the conclusion of all this? Juola again: “All it really ‘proves’ — suggests, rather — is that out of the four authors studied, the most likely candidate author is probably Rowling. […] It was fair to say that there was a lot of evidence pointing Rowling as the author and nothing specifically suggesting that she wasn’t.”****
So this analysis was not the final word (heh) on The Cuckoo’s Calling. A lot of good old-fashioned gumshoeing went into the news article as well, and the question was only really resolved when Rowling herself owned up. But it does show that a thorough look at the linguistic data in the texts made a connection between Rowling and Galbraith plausible, which is exactly what authorship comparison is supposed to do. Good show, sir.
There you go. Another exhaustive linguistic post by the Language Nerd.
The Language Nerd
* This is why I’ve avoided the phrase “linguistic fingerprint” in this post – it suggests an unreasonable level of certainty.
** It’s also been used on historical documents, to decide who wrote which bits of the Federalist Papers, for example, or the Bible. But that’s a whole other post.
*** It’s lucky that Rowling had already published an adult book under her own name, because the Potter books are a different genre, and as we’ve discussed previously, genre can be a stronger influence on writing than the author is. Two different authors writing adult novels might be more similar than the same author writing one adult detective novel and one children’s fantasy novel.
**** McDermid also gets a pretty good score, but Juola says that “the word length distribution seemed almost entirely uncharacteristic of her.”
Got a language question? Ask the Language Nerd! firstname.lastname@example.org
Or: Twitter @AskTheLeague / facebook.com/asktheleagueofnerds
Just two references this week. All the meaty info about authorship comparison comes from John Olsson’s basic textbook used in forensic linguistics classes the world over, which is titled, uh, Forensic Linguistics. And all the specifics about the Rowling case come from this post by Patrick Juola himself at Language Log, which is totally readable and highly recommended to anyone interested in this sort of work. In fact, the whole Language Log site is a great read and an amazing resource, with about twenty linguists contributing, peeps like Roger Shuy and Geoffrey Pullum and Sally Thomason. Those wacky folks with all their, like, multi-decade linguistics research backgrounds and whatnot. Somewhat lacking in fabulous illustrations, though.