[liberationtech] Linguistics identifies anonymous users

Gregory Foster gfoster at
Wed Jan 9 07:20:25 PST 2013

29c3 - "Stylometry and Online Underground Markets" w/ Aylin Caliskan 
Islam, Rachel Greenstadt, and Sadia Afroz:


On 1/9/13 7:34 AM, Shava Nerad wrote:
> Such a framework can be social engineered as easily as SEO.  I make a 
> small living as a ghost writer and speech writer - the informal 
> version of that very process. Several of my clients say my writing 
> sounds more like them in print than they do, because they are less 
> facile writers - but that is a fault that could be avoided in 
> competent forgeries. ;)
> SN
> On Jan 9, 2013 8:25 AM, "Eugen Leitl" <eugen at> wrote:
>> Linguistics identifies anonymous users
>> By Darren Pauli on Jan 9, 2013 9:49 AM
>> Researchers reveal carders, hackers on underground forums.
>> Up to 80 percent of certain anonymous underground forum users can be
>> identified using linguistics, researchers say.
>> The techniques compare user posts to track them across forums and could
>> even unveil authors of thesis papers or blogs who had taken to underground
>> networks.
>> "If our dataset contains 100 users we can at least identify 80 of them,"
>> researcher Sadia Afroz told an audience at the 29C3 Chaos Communication
>> Congress in Germany.
>> "Function words are very specific to the writer. Even if you are writing a
>> thesis, you'll probably use the same function words in chat messages.
>> "Even if your text is not clean, your writing style can give you away."
>> The analysis techniques could also reveal botnet owners and malware tool
>> authors, and provide insight into the size and scope of underground
>> markets, making the research appealing to law enforcement.
>> To achieve their results the researchers used techniques including
>> stylometric analysis, the authorship attribution framework JStylo, and
>> Latent Dirichlet allocation, which can distinguish a conversation on
>> stolen credit cards from one on exploit-writing, and similarly help
>> identify interesting people.
>> The analysis was applied across millions of posts from tens of thousands
>> of users of a series of multilingual underground websites including
>>,,, and
>> It found up to 300 distinct discussion topics in the forums, with some of
>> the most popular being carding, encryption services, password cracking and
>> blackhat search engine optimisation tools.
>> While successful, the work faces a series of challenges. Analysis could
>> only be performed using a minimum of 5000 words (this research used the
>> "gold standard" of 6500 words) which culled the list of potential targets
>> from tens of thousands to mere hundreds.
>> It also needs to separate discussion on product information like credit
>> cards, exploits and drugs from conversational text in order to facilitate
>> machine learning to automate the process, according to researcher Aylin
>> Caliskan Islam.
>> And posts must be translated to English, a process which boosted author
>> identification from 66 to around 80 per cent but was imperfect using
>> freely available tools like Google and Bing.
>> However, both of these tasks were performed successfully, and further
>> development including the use of "exclusive" language translation tools
>> would only serve to boost the identification accuracy.
>> Leetspeak, an alternative alphabet popular in some forum circles, cannot
>> be translated.
>> The project is ongoing and future work promises to increase the capacity
>> to unmask users. This, Islam said, would include temporal information
>> which would exploit users who logged into forums from the same IP
>> addresses and wrote posts at around the same time.
>> [Figure: Antichat user analysis]
>> "They might finish work, come home and log in," Islam said.
>> It could also tie user identities to the topics they write about and
>> produce a map of their interactions, identify multiple accounts held by a
>> single author, and combine forum messages with internet relay chat (IRC)
>> data sets.
>> "We want to automate the whole process."
>> Afroz said while the work appeals to law enforcement and government
>> agencies, it is not designed to catch users out.
>> "We aren't trying to identify users, we are trying to show them that this
>> is possible," she said.
>> To this end, the researchers released tools last year, updated last
>> December, which help users to anonymise their writing.
>> One tool, Anonymouth, takes a 500 word sample of a user's writing to
>> identify unique features such as function words which could make them
>> identifiable.
>> The other, JStylo, is the machine learning engine which powers Anonymouth.
>> The Drexel and George Mason universities research team is composed of
>> Sadia Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and
>> Damon McCoy.
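
For anyone wondering what "function words are very specific to the writer" might look like in practice, here is a rough sketch. It is not JStylo or the researchers' method: the tiny word list, the helper names (profile, cosine, closest_author) and the use of cosine similarity are all assumptions made for illustration, and a real system would use far more features and a trained classifier.

```python
import math
import re
from collections import Counter

# Illustrative list only (an assumption); real feature sets are much larger.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is",
                  "it", "for", "but", "with", "as", "on", "not"]


def profile(text):
    """Return the relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def closest_author(unknown_text, known_texts):
    """Pick the known author whose function-word profile is most similar.

    known_texts maps a (hypothetical) author label to a writing sample.
    """
    target = profile(unknown_text)
    return max(known_texts,
               key=lambda name: cosine(target, profile(known_texts[name])))
```

Calling closest_author(post, {"alice": sample_a, "bob": sample_b}) returns the label with the most similar profile. With samples far shorter than the 5000 words mentioned in the article, the result should not be trusted.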

Gregory Foster || gfoster at
@gregoryfoster
