DATE: Tuesday, Apr 13, 2010
TIME: 3:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Wikipedia Vandalism Detection
PRESENTER: Leanne Seaward
University of Ottawa
ABSTRACT:

The 4th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse (PAN 2010) is holding a Wikipedia Vandalism Detection competition. The task involves classifying edits as vandalism or not. The training corpus consists of 30,000 labeled edits, and the test corpus will comprise 100,000 edits. This talk will introduce current research and explore feature selection for Wikipedia vandalism detection. The presenter's work in progress on this topic, in the context of the competition, will also be presented.

According to www.wikipedia.org, "Wikipedia is a free, web-based, collaborative, multilingual encyclopedia project supported by the non-profit Wikimedia Foundation." Basically, it is a free online encyclopedia that anyone can edit. Not only can anyone edit the encyclopedia, but editing can even be done anonymously. It seems incredible that such a system results in an encyclopedia which contains over 15 million articles in 23 languages and has been shown to have accuracy comparable to that of commercial encyclopedias. Wikipedia works because vandalism is usually quickly detected and reverted.

Wikipedia vandalism occurs when someone knowingly adds false information, or irrelevant or obscene comments, to a topic. Page blanking, or deleting some or all of the contents of a topic, is another form of vandalism. Wikipedia defines vandalism as "any addition, removal, or change of content made in a deliberate attempt to compromise the integrity of Wikipedia."

Wikipedia vandalism detection schemes currently in use are in their infancy and rely mainly on manually created rule-based heuristics. Current research treats this problem as a cross between spam detection and authorship attribution.
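
To make the idea of manually created rule-based heuristics concrete, the following is a minimal illustrative sketch in Python. It is not the presenter's method nor any tool actually used on Wikipedia: the feature names, the tiny word list, and all thresholds are assumptions chosen only for illustration. It computes a few simple features of an edit (how much text was removed, whether offensive words were added, how much of the text is upper case, whether the editor is anonymous) and applies hand-written rules to them.

```python
from collections import Counter
import re

# Hypothetical word list for illustration; a real lexicon would be far larger.
OFFENSIVE_WORDS = frozenset({"stupid", "idiot", "sucks"})


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def extract_features(old_text, new_text, is_anonymous):
    """Compute a few illustrative features for a single edit."""
    # Tokens that occur more often in the new revision than in the old one.
    added = Counter(tokenize(new_text)) - Counter(tokenize(old_text))
    letters = [c for c in new_text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    # Size of the new revision relative to the old one (1.0 for a new page).
    size_ratio = len(new_text) / len(old_text) if old_text else 1.0
    return {
        "offensive_added": sum(n for w, n in added.items() if w in OFFENSIVE_WORDS),
        "upper_ratio": upper_ratio,
        "size_ratio": size_ratio,
        "is_anonymous": is_anonymous,
    }


def looks_like_vandalism(features):
    """Apply hand-written rules; every threshold here is an arbitrary assumption."""
    if features["size_ratio"] < 0.1:  # most of the page was removed (blanking)
        return True
    if features["offensive_added"] > 0:  # the edit introduced an offensive word
        return True
    # Anonymous edit leaving text that is almost entirely upper case.
    return features["is_anonymous"] and features["upper_ratio"] > 0.8
```

For example, looks_like_vandalism(extract_features(old, "", True)) flags an edit that blanks a non-empty page. A learned classifier, as in the competition task, could take the same feature dictionary as input in place of the fixed rules.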