Antwerp researchers identify anonymous internet culprits
By collecting vast amounts of data from social media, researchers at Antwerp University have developed a way to uncover information about a text’s author, including their age and gender.
Troves of data
Members of the Computational Linguistics & Psycholinguistics research group have been collecting vast amounts of data from social media to study how language technology can detect online dangers to young people, such as cyberbullying or sexually transgressive behaviour.
They use author profiling to reach conclusions about a text’s author based on how they write. “We analyse a lot of data and look at how different socio-demographic groups use language,” explains Guy De Pauw (pictured left).
It’s fairly obvious that young people write differently online from older people, he says, but you can also analyse the differences between genders, for instance. “For people with bad intentions, it’s very easy to pretend to be somebody they’re not. We look at language features that are not under your conscious control.”
Previous research has shown that in very broad terms, females use more personal pronouns (I, you, we) while males use more quantifiers (one, many) and determiners (a, an, the). These aren’t hard-and-fast rules, but if you feed all that information into a computational model, you can identify somebody’s age and gender with a fair degree of accuracy.
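To make the idea concrete, here is a minimal sketch in Python of how such language features might be turned into numbers a model can use. It is not the researchers' actual method: the function name `function_word_rates` is hypothetical, and the word lists contain only the examples quoted above, where a real system would presumably use far larger lexicons and many more features.

```python
import re
from collections import Counter

# Illustrative word lists only: these are the examples named in the article,
# not a complete lexicon of each category.
PERSONAL_PRONOUNS = {"i", "you", "we"}
QUANTIFIERS = {"one", "many"}
DETERMINERS = {"a", "an", "the"}


def function_word_rates(text):
    """Return the share of tokens in `text` that fall into each word category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens) or 1  # avoid division by zero on empty input
    counts = Counter(tokens)

    def rate(words):
        return sum(counts[w] for w in words) / total

    return {
        "pronoun_rate": rate(PERSONAL_PRONOUNS),
        "quantifier_rate": rate(QUANTIFIERS),
        "determiner_rate": rate(DETERMINERS),
    }
```

Calling `function_word_rates("I think we saw the dog")` gives a pronoun rate of 2 in 6 tokens and a determiner rate of 1 in 6. On their own such rates say little about any one writer; they only become useful when combined with many other signals in a statistical model.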
Between the lines
“If we have several thousand messages that we know to have been written by females, the computer can construct a model around that data,” says De Pauw. “And if you keep adding to the data, the computer will adapt to the language rather than us having to tell it what to look for.”
That’s a very powerful technique, he continues, “especially for situations such as sexual predators. Someone with bad intentions may know how youngsters chatted five years ago, but this changes all the time.”
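The approach De Pauw describes, learning from labelled messages and adapting as new ones arrive, can be sketched with a small word-count classifier. The example below uses naive Bayes with add-one smoothing, chosen here purely for illustration; the article does not say which algorithm the group uses. The class name `IncrementalNaiveBayes`, the labels and the sample messages are all invented.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Toy multinomial naive Bayes that can be updated one message at a time."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of messages
        self.vocab = set()                       # every word seen so far

    def update(self, text, label):
        """Add one labelled message to the model."""
        tokens = text.lower().split()
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def predict(self, text):
        """Return the most probable label, or None if nothing has been learned."""
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            # Add-one smoothing so unseen words do not zero out the score.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented example messages, for illustration only.
model = IncrementalNaiveBayes()
model.update("omg lol c u l8r", "younger")
model.update("kind regards see you later", "older")
print(model.predict("lol omg"))

# New slang can be folded in later without rewriting any rules by hand.
model.update("fr no cap", "younger")
print(model.predict("no cap"))
```

Nobody has to tell the model which words matter: each call to `update` shifts the word statistics, so vocabulary that appears in newly labelled messages starts to influence predictions straight away.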
Their technology can also be used in the fight against radicalisation. “The methods are the same,” explains colleague Tom De Smedt (pictured right). “By studying texts written by people who we know are fighting in Syria or otherwise spreading hatred – because they are posting inflammatory tweets or pictures of weapons – we can see they use certain language and combinations of words.”
Using this data, they can trawl vast amounts of text to identify potential radicalisation. “It must be used cautiously,” De Smedt adds, “but we are exploring opportunities for collaboration with security and intelligence agencies.”
The technology is to a large degree language-independent: all you need is the data. The team mostly focus on Dutch and English in a research context; their spin-off Textgain, launched at the end of last year, aims to commercialise this expertise.
“Textgain allows us to extract facts, opinions and demographics information from social media, newspaper articles and so on in a wide range of languages,” explains De Pauw. “That type of information has invaluable applications in big data and e-marketing.”
The service is broadly aimed at anyone who has so much text they cannot possibly read it all, but who knows there’s useful information in it. “There’s no way anybody can read all the tweets that are published about the iPhone, for example,” says De Smedt, “but our technology could have a machine read it all, take out the salient points, find out whether people are being positive or negative, estimate who those people are, whether they are older or younger, male or female...”
On a grammatical level, he adds, the machine knows what has been written, but it also needs to understand the topic and the content of the text. “That’s the ultimate goal: to have a computer understand what is actually in the text. Technology is not there yet, but we are definitely fast approaching it.”
Photo courtesy Textgain