[liberationtech] Author Identification
eugen at leitl.org
Mon Dec 17 04:59:27 PST 2012
Authorship attribution is an important problem in many areas including
information retrieval and computational linguistics, but also in applied
areas such as law and journalism where knowing the author of a document (such
as a ransom note) may be able to save lives. The most common framework for
testing candidate algorithms is a text classification problem: given known
sample documents from a small, finite set of candidate authors, which if any
wrote a questioned document of unknown authorship? It has been observed,
however, that this may be an unreasonably easy task. A more demanding problem
is author verification: given a set of documents by a single author and a
questioned document, determine whether the questioned document was written by
that particular author. This may more accurately reflect the experience of
professional forensic linguists, who are often called upon to answer this
kind of question. Given a small set (no
more than 10, possibly as few as one) of "known" documents by a single person
and a "questioned" document, the task is to determine whether the questioned
document was written by the same person who wrote the known document set.
One problem comprises a set of known documents by a single person and a
questioned document. There will be several such problems covering English,
Greek, and Spanish (about 20 cases per language) and a varying number of
known documents (1-10). All documents within a single problem will be in the
same language and every effort will be made to ensure that within-problem
documents are matched for genre, register, theme, and date of writing. The
documents will possibly be fragmentary, with a minimum length of 1,000 words.
Data download: release by mid-December.
Participants are asked to provide a simple "yes/no" binary answer for each
problem. Grading will be based on the percentage of correct answers. Beyond
the accuracy on the entire corpus, separate rankings will be provided for the
subsets of problems for each language. In addition, participants may also
provide a score, a real number in the interval [0,1], where 0
corresponds to NO and 1 to YES. In that case, ROC curves will be produced and
the area under the curve will be used to grade participant systems.
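
As a rough illustration of the two measures described above, the following
Python sketch computes the fraction of correct yes/no answers and the area
under the ROC curve from the optional scores. It is not the organizers'
evaluation script. The function names, the sample values, and the 0.5 cut-off
used to turn scores into yes/no answers are assumptions made for this example
only.

```python
from typing import Sequence


def accuracy(truth: Sequence[bool], answers: Sequence[bool]) -> float:
    """Fraction of problems whose yes/no answer matches the ground truth."""
    if len(truth) != len(answers) or not truth:
        raise ValueError("truth and answers must be non-empty and equally long")
    correct = sum(t == a for t, a in zip(truth, answers))
    return correct / len(truth)


def roc_auc(truth: Sequence[bool], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen YES problem gets a higher score than a randomly chosen NO problem
    (ties count as one half)."""
    if len(truth) != len(scores):
        raise ValueError("truth and scores must be equally long")
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    if not pos or not neg:
        raise ValueError("need at least one YES and one NO problem")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up ground truth and scores for four problems (illustration only).
truth = [True, False, True, False]
scores = [0.9, 0.4, 0.35, 0.2]

# Assumption: a score of 0.5 or more is read as YES.
answers = [s >= 0.5 for s in scores]

acc = accuracy(truth, answers)
auc = roc_auc(truth, scores)
print(f"accuracy = {acc:.2f}, AUC = {auc:.2f}")
```

The pairwise formulation is quadratic in the number of problems, which is
harmless for a corpus of a few dozen cases.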
We refer you to:
PAN @ CLEF'12 (overview paper),
PAN @ CLEF'11 (overview paper),
Patrick Juola. Authorship Attribution. Foundations and Trends in Information
Retrieval, Volume 1, Issue 3, December 2006.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Computational Methods in
Authorship Attribution. Journal of the American Society for Information
Science and Technology, Volume 60, Issue 1, pages 9-26, January 2009.
Efstathios Stamatatos. A Survey of Modern Authorship Attribution Methods.
Journal of the American Society for Information Science and Technology,
Volume 60, Issue 3, pages 538-556, March 2009.
We ask you to prepare your software so that it can be executed via a command
line. However, you can choose freely among the available programming
languages and among the operating systems Microsoft Windows 7 and Ubuntu
12.04. We will ask you to deploy your software onto a virtual machine that
will be made accessible to you after registration. You will be able to reach
the virtual machine via ssh and via remote desktop.
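
For participants unsure what a command-line entry point might look like, here
is a minimal Python sketch. Everything about the interface is hypothetical:
the announcement does not specify argument names, the directory layout, or the
answer file format. The sketch assumes one subdirectory per problem containing
known*.txt files and a single unknown.txt, and it writes one line per problem
with the problem name, Y or N, and the score. The verify function is a
placeholder that always returns 0.5, not an attribution method.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def verify(known_texts: List[str], questioned_text: str) -> float:
    """Placeholder scorer: always returns 0.5. Replace with a real method
    that returns a value in [0,1], where 1 means 'same author'."""
    return 0.5


def run(input_dir: Path, output_file: Path) -> int:
    """Score every problem directory and write one answer line per problem."""
    lines = []
    # Assumed layout: <input_dir>/<problem>/known*.txt and unknown.txt
    for problem in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        known = [f.read_text(encoding="utf-8")
                 for f in sorted(problem.glob("known*.txt"))]
        questioned = (problem / "unknown.txt").read_text(encoding="utf-8")
        score = verify(known, questioned)
        answer = "Y" if score >= 0.5 else "N"  # assumed cut-off
        lines.append(f"{problem.name} {answer} {score:.3f}")
    output_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(
        description="Author verification runner (illustrative sketch)")
    parser.add_argument("input_dir", type=Path,
                        help="directory with one subdirectory per problem")
    parser.add_argument("-o", "--output", type=Path,
                        default=Path("answers.txt"),
                        help="file to write the answers to")
    args = parser.parse_args(argv)
    count = run(args.input_dir, args.output)
    print(f"wrote {count} answers to {args.output}")
    return 0

# In a standalone script, finish with:
#   if __name__ == "__main__":
#       raise SystemExit(main())
```

Saved as, say, verify.py (a hypothetical name) with the guard above enabled,
it could be run as "python verify.py problems/ -o answers.txt" on either of
the supported operating systems.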