Sunday, 29 January 2017

How algorithms expose who we really are as writers — even if we are a potential troll

COMPUTERS are often better at analysing who someone is as a writer than we are as readers. They can identify things in text that we would skip over, caught up as we are in following a novel's plot or trying to work out what someone wants from us by reading a business proposal. Computers produce detailed answers quicker than we can read, and they aren't burdened by the individual baggage we bring to our reading.

A computer program, for instance, can 'guess' with a reasonable degree of accuracy if an author is male or female, American or English, based on their use of the word 'the'. The claim comes from Jodie Archer and Matthew Jockers, authors of The Bestseller Code, who made this slightly bizarre discovery when creating a program that can diagnose a bestselling book before anyone goes to the trouble of publishing it.

We can argue that the program has a 50/50 chance of being right with its male/female or American/English verdict, but there's still ample evidence from a mix of writing niches to support the notion that algorithms know us as writers better than we know ourselves.


An exclusive story published by The Sunday Times in 2013 began with an anonymous tweet which claimed that Robert Galbraith, debut crime novelist and author of The Cuckoo's Calling, was not who he seemed to be. It ended with the discovery, backed by forensic linguistics experts, that Harry Potter author JK Rowling was Galbraith's secret alter ego.

The way The Sunday Times built its case about Galbraith's identity rightly made headlines of its own. When a tweet from a reporter remarked that The Cuckoo's Calling didn't feel like the work of a debut author, it drew an anonymous reply about Rowling's possible involvement. 

Minor detective work supported the tip's credibility. Galbraith and Rowling shared the same agent, and The Cuckoo's Calling had the same editor and publisher as Rowling's previous non-Harry Potter book, The Casual Vacancy.

Sunday Times arts editor Richard Brooks called in two forensic linguistics experts to solve the mystery. Both worked separately on opposite sides of the Atlantic. Both had eight texts to analyse: The Cuckoo's Calling, The Casual Vacancy, and two novels apiece from three other British crime authors — Ruth Rendell, PD James and Val McDermid.

It took Patrick Juola, a professor at Duquesne University in Pittsburgh, just 90 minutes to crunch the texts using software he'd been developing with students for more than ten years. Peter Millican, of Oxford University, asked for extra texts from each author and put them through Signature, his own linguistics software.

Both men arrived at the same verdict: Rowling was Galbraith. Curiously, their results rested on mundane writing features such as word, sentence and paragraph lengths, the most commonly used words, and letter and punctuation frequency.
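To give a feel for how mundane those features are, here is a minimal Python sketch that measures a few of them. It is not Juola's software or Millican's Signature; the function name style_features, the crude sentence-splitting rule and the choice of measurements are all assumptions made purely for illustration.

```python
import re
import string
from collections import Counter


def style_features(text, top_n=5):
    """Return a few simple style measurements for a text (illustrative only)."""
    # Words: runs of letters, optionally with one internal apostrophe (e.g. "didn't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        raise ValueError("text contains no words")

    # Sentences: split on ., ! or ? (a crude rule that ignores abbreviations).
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]

    word_counts = Counter(words)
    punctuation = Counter(ch for ch in text if ch in string.punctuation)

    return {
        "mean_word_length": sum(len(w) for w in words) / len(words),
        "mean_sentence_length": len(words) / len(sentences),
        "the_per_1000_words": 1000 * word_counts["the"] / len(words),
        "top_words": [w for w, _ in word_counts.most_common(top_n)],
        "punctuation": dict(punctuation),
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran!"
    for name, value in style_features(sample, top_n=3).items():
        print(name, value)
```

Running the same function over two texts and comparing the numbers is the basic idea. Real authorship tools use far more features and proper statistical tests before reaching a verdict.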

When The Sunday Times published its story, The Cuckoo's Calling shot from 4,709th on Amazon's bestseller list to number one. Sales of the book rose by more than 500,000 per cent.

It's an example not just of the sophistication of forensic linguistics, but of how our writing reputation — built gradually with each piece of our work — attracts or repels readers.


One algorithm can even use our individual writing fingerprint to work out who we might become.

Two researchers from Cornell University have found a way to identify future website trolls before they start any online abuse. It sounds more than a little like the pre-crime world of the film Minority Report — and it is.

Using only five posts as their sample, the researchers' algorithm has an accuracy rate of 80 per cent in predicting those who will become internet trolls. With ten posts, the algorithm's accuracy improves by two percentage points.

Will we ever use the algorithm to ban potential trolls from using the internet before they've done harm? It's an ethical issue yet to be explored.


  • Many companies already use algorithms to refine their marketing based on online reviews of their products. Quirky regional language tics in our writing show where products are loved or hated, and companies act on that to boost sales and profits.
  • The US Secret Service has been looking for a Twitter sarcasm detector for some time — and now it has one. The aim is to separate those who simply want to use social media to let off steam from those actively bent on causing harm. 
  • One fun website can identify 11 personality traits from a handful of tweets. Of course, it can be scary to analyse leading politicians and discover exactly how angry, arrogant or depressed they are. Those of a nervous disposition may prefer to leave this box unopened.
  • Wikipedia has taken steps to safeguard the credibility of the site by developing software to identify 'sock puppet' accounts set up to give over-friendly edits to the site's content.
  • Murderers and terrorists have already been captured, tried and locked up on the basis of linguistics, sometimes with as little as a text message or a spray-painted threat left at the scene of a crime.


Chances are you wouldn't be able to tell if this blog was written by a computer program or a person. And yet, financial reports are already being written by computers instead of journalists. Associated Press generates more than 3,000 of them each quarter, the idea being to free journalists to interpret the meaning of the results.

Algorithms are muscling into the book trade too. A marketing professor at INSEAD, Philip M Parker, has patented a system for using algorithms to compile data into book form. Thanks to Parker, Amazon has more than 800,000 books for sale that use his system. With digital distribution and print-on-demand as an option, the books don't even have to be written until someone buys a copy. The system can produce a book on a subject in a few hours.

Computer programs can't yet write a news story about a terrorist attack or interview experts. They can handle formulaic writing, but only we can decide what's meaningful and what's irrelevant.

And that brings us back to the fundamentals of what drives us to write. In time, we may harness algorithms even more extensively to tackle the mundane parts of our writing world. But there are elements of successful writing that algorithms can't be programmed to replicate. Passion and purpose. Our need to connect with readers. A willingness to break language rules without ruining our work.

Algorithms can already show us who we are as writers. They can help us with our grammar, punctuation and spelling. They can even show us how our personality comes across in our writing.

And yet we're pretty savvy when it comes to working out what's going on between the lines of someone's words. As a species, we've survived and evolved by being able to decide who to trust and who to avoid. We're equipped to tell truth from lies, openness from deception, and we apply these skills to interpret someone's writing.

Trouble is, we often doubt ourselves, held back by the feeling that we lack something concrete on which to base our assessment. But we still form an opinion, even if we choose to ignore it, since our view is based on the stored information we have in our brains about what a piece of writing tells us about someone.

We should trust that opinion a little more than we do. In a later post, I'll show you how to harness it.

We're not infallible, but neither are the algorithms. The Bestseller-o-meter doesn't always get it right. Wikipedia's sock puppet algorithms are usually matched by their human equivalent.

We've always been visible through our words. The difference is that now we have convincing evidence that we have a unique writing fingerprint — we are what we write. And if Cornell's researchers are correct, we are what we might write.