DATE: | Monday, Sep 27, 2010 |
TIME: | ***EXCEPTIONALLY at 2:30 pm*** |
PLACE: | Council Room (SITE 5-084) |
TITLE: | 2010 PAN Wikipedia Vandalism Competition |
PRESENTER: | Leanne Seaward, University of Ottawa |
ABSTRACT:
The 4th International Workshop on Uncovering Plagiarism, Authorship, and Social Software Misuse, PAN 2010, hosted an international Wikipedia Vandalism Detection competition. The task was to classify edits as vandalism or regular edits. The training corpus consisted of 14961 labeled edits. Participants were required to label 297541 edits as vandalism or not and to give the confidence of each label as a number between 0 and 1. Of these edits, 17433 had known labels and were used to compute an ROC curve. There were 10 participants, and our method ranked 5th of 10. This talk discusses the features and classification methods used. “Wikipedia is a free, web-based, collaborative, multilingual encyclopaedia project supported by the non-profit Wikimedia Foundation.” In short, it is a free online encyclopaedia that anyone can edit, even anonymously. It seems incredible that such a system results in an encyclopaedia which contains over 15 million articles in 23 languages and has been shown to have accuracy comparable to that of commercial encyclopaedias. Wikipedia works because vandalism is usually detected and reverted quickly. Wikipedia vandalism occurs when someone knowingly adds false information or irrelevant or obscene comments to an article. Page blanking, that is, deleting some or all of the contents of an article, is another form of vandalism. Wikipedia defines vandalism as “any addition, removal, or change of content made in a deliberate attempt to compromise the integrity of Wikipedia.” Wikipedia vandalism detection schemes currently in use are in their infancy and rely mainly on manually created rule-based heuristics. Current research treats the problem as a cross between spam detection and authorship attribution. |
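
For readers unfamiliar with the evaluation, the sketch below illustrates how an ROC curve, and the area under it, can be computed from known labels and per-edit confidence values between 0 and 1. It is a minimal illustration only, not the competition's official evaluation script; the function names `roc_points` and `auc` and the toy data are assumptions made for this example.

```python
def roc_points(labels, scores):
    """Return ROC points (false positive rate, true positive rate).

    labels: 1 for vandalism, 0 for a regular edit (hypothetical encoding).
    scores: confidence that the edit is vandalism, between 0 and 1.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one vandalism and one regular edit")

    # Sweep the decision threshold from the highest confidence downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Edits with identical confidence are handled as one step.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


if __name__ == "__main__":
    # Toy data: four edits, two of them vandalism.
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.1]
    print(auc(roc_points(toy_labels, toy_scores)))
```

An area of 1.0 means every vandalism edit received a higher confidence than every regular edit, while 0.5 corresponds to confidences that carry no ranking information.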