[liberationtech] Linguistics identifies anonymous users
gfoster at entersection.org
Wed Jan 9 07:20:25 PST 2013
29c3 - "Stylometry and Online Underground Markets" w/ Aylin Caliskan
Islam, Rachel Greenstadt, and Sadia Afroz:
On 1/9/13 7:34 AM, Shava Nerad wrote:
> Such a framework can be social engineered as easily as SEO. I make a
> small living as a ghost writer and speech writer - the informal
> version of that very process. Several of my clients say my writing
> sounds more like them in print than they do, because they are less
> facile writers - but that is a fault that could be avoided in
> competent forgeries. ;)
> On Jan 9, 2013 8:25 AM, "Eugen Leitl" <eugen at leitl.org> wrote:
>> Linguistics identifies anonymous users
>> By Darren Pauli on Jan 9, 2013 9:49 AM
>> Researchers reveal carders, hackers on underground forums.
>> Up to 80 percent of certain anonymous underground forum users can be
>> identified using linguistics, researchers say.
>> The techniques compare user posts to track them across forums and could
>> even unveil authors of thesis papers or blogs who had taken to underground
>> "If our dataset contains 100 users we can at least identify 80 of them,"
>> researcher Sadia Afroz told an audience at the 29C3 Chaos Communication
>> Congress in Germany.
>> "Function words are very specific to the writer. Even if you are writing
>> a thesis, you'll probably use the same function words in chat messages.
>> "Even if your text is not clean, your writing style can give you away."
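
[Inline note] To make the function-word point concrete, here is a minimal
sketch of comparing texts by their function-word frequencies. It is not
JStylo's feature set or the researchers' classifier: the word list, the
cosine measure, and the helper names (function_word_profile,
most_similar_author) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Illustrative function-word list (an assumption, not JStylo's feature set).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "is",
                  "for", "on", "with", "as", "but", "or", "if", "so", "not")


def function_word_profile(text):
    """Return the relative frequency of each function word in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar_author(unknown_text, known_texts):
    """known_texts maps author name -> writing sample; return the closest author."""
    target = function_word_profile(unknown_text)
    return max(
        known_texts,
        key=lambda name: cosine_similarity(
            target, function_word_profile(known_texts[name])
        ),
    )
```

With only a few words per author this is a toy; the article's figures suggest
real attribution needs thousands of words per author and a much richer
feature set.
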
>> The analysis techniques could also reveal botnet owners, malware tool
>> and provide insight into the size and scope of underground markets,
>> the research appealing to law enforcement.
>> To achieve their results the researchers used techniques including
>> stylometric analysis, the authorship attribution framework JStylo, and
>> Latent Dirichlet allocation, which can distinguish a conversation on
>> stolen cards from one on exploit-writing, and similarly help identify
>> The analysis was applied across millions of posts from tens of thousands
>> of users of a series of multilingual underground websites including
>> thebadhackerz.com, blackhatpalace.com, www.carders.cc, free-hack.com,
>> hackel1te.info, hack-sector.forumh.net, rootwarez.org, L33tcrew.org and
>> It found up to 300 distinct discussion topics in the forums, with some of
>> the most popular being carding, encryption services, password cracking
>> and blackhat search engine optimisation tools.
>> While successful, the work faces a series of challenges. Analysis could
>> only be performed using a minimum of 5000 words (this research used the
>> "gold standard" of 6500 words) which culled the list of potential targets
>> from tens of thousands to mere hundreds.
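
[Inline note] The word-count cut-off described above amounts to a simple
filter over each user's combined posts. A minimal sketch, assuming posts
arrive as (user, text) pairs and that words are split on whitespace; the
function names are hypothetical and only the two thresholds come from the
article.

```python
from collections import defaultdict

MIN_WORDS = 5000   # minimum reported in the article
GOLD_WORDS = 6500  # "gold standard" the article says the research used


def words_per_user(posts):
    """Sum whitespace-delimited word counts over all posts by each user."""
    totals = defaultdict(int)
    for user, text in posts:
        totals[user] += len(text.split())
    return dict(totals)


def eligible_users(posts, threshold=MIN_WORDS):
    """Return, in sorted order, the users with at least `threshold` words in total."""
    totals = words_per_user(posts)
    return sorted(user for user, count in totals.items() if count >= threshold)
```

Calling eligible_users(posts, threshold=GOLD_WORDS) applies the stricter
limit, which is how a pool of tens of thousands of accounts can shrink to a
few hundred candidates.
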
>> It also needs to separate discussion on product information like credit
>> cards, exploits and drugs from conversational text in order to facilitate
>> machine learning to automate the process, according to researcher Aylin
>> Caliskan Islam.
>> And posts must be translated to English, a process which boosted author
>> identification from 66 to around 80 per cent but was imperfect using
>> available tools like Google and Bing.
>> However, both of these tasks were performed successfully, and further
>> development including the use of "exclusive" language translation tools
>> would only serve to boost the identification accuracy.
>> Leetspeak, an alternative alphabet popular in some forum circles,
>> cannot be
>> The project is ongoing and future work promises to increase the capacity
>> to unmask users. This, Islam said, would include temporal information
>> which would exploit users who logged into forums from the same IP
>> addresses and wrote posts at around the same time.
>> [Image caption: Antichat user analysis]
>> "They might finish work, come home and log in," Islam said.
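
[Inline note] The article does not say how the temporal signal would be
computed. One plausible reading, offered purely as an assumption, is to
compare when accounts tend to post. The sketch below builds an hour-of-day
histogram per account and scores two accounts by histogram intersection;
the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime


def hour_histogram(timestamps):
    """Return the fraction of posts made in each hour of the day (24 bins)."""
    counts = Counter(ts.hour for ts in timestamps)
    total = len(timestamps) or 1  # avoid division by zero for empty input
    return [counts[h] / total for h in range(24)]


def histogram_overlap(p, q):
    """Histogram intersection: 1.0 for identical habits, 0.0 for disjoint ones."""
    return sum(min(a, b) for a, b in zip(p, q))


# Two accounts active in the evening, one active in the early morning.
evening_a = [datetime(2013, 1, 9, 19, 5), datetime(2013, 1, 9, 19, 40),
             datetime(2013, 1, 10, 20, 0)]
evening_b = [datetime(2013, 1, 8, 19, 15), datetime(2013, 1, 8, 20, 30)]
night_c = [datetime(2013, 1, 8, 3, 0)]

score_ab = histogram_overlap(hour_histogram(evening_a), hour_histogram(evening_b))
score_ac = histogram_overlap(hour_histogram(evening_a), hour_histogram(night_c))
```

Here score_ab comes out higher than score_ac. On its own this is weak
evidence, since many people post in the evening; it would only be useful
combined with other signals such as writing style.
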
>> It could also tie user identities to the topics they write about and
>> a map of their interactions, identify multiple accounts held by a single
>> author, and combine forum messages with internet relay chat (IRC)
>> data sets.
>> "We want to automate the whole process."
>> Afroz said that while the work appeals to law enforcement and government
>> agencies, it is not designed to catch users out.
>> "We aren't trying to identify users, we are trying to show them that this
>> is possible," she said.
>> To this end, the researchers released tools last year, updated last
>> which help users to anonymise their writing.
>> One tool, Anonymouth, takes a 500 word sample of a user's writing to
>> unique features such as function words which could make them
>> The other, JStylo, is the machine learning engine which powers
>> The Drexel and George Mason universities research team is composed of
>> Afroz, Aylin Caliskan Islam, Ariel Stolerman, Rachel Greenstadt, and
Gregory Foster || gfoster at entersection.org
@gregoryfoster <> http://entersection.com/