DATE: Tuesday, Mar. 27, 2007
TIME: 2:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Sentence Selection Representation with Syntactic and Semantic Features
PRESENTER: Maria Fernanda Caropreso
University of Ottawa
ABSTRACT:

The basic Bag of Words representation commonly used in Text Categorization discards important syntactic and semantic information about the documents. This loss can be particularly problematic when the texts are short.

We study the contribution of incorporating syntactic and semantic information into the representation for a Sentence Selection task on a genomics corpus. We analyze the use of a hierarchical technical dictionary in two ways: either replacing each gene or protein name with a generic term, or adding the name's ancestor terms to the representation. We then introduce the hierarchical terms into a syntactic representation that uses relations between words in the sentences. We show that using hierarchical technical dictionaries together with syntactic relations is beneficial for our problem when using state-of-the-art machine learning algorithms. These results are validated on a larger dataset of a similar nature, as well as on a dataset from the Legal domain.
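
The two dictionary-based variants described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the presenter's implementation: the hierarchy entries, the generic token GENE_OR_PROTEIN, the whitespace tokenizer, and the function names are all hypothetical.

```python
from collections import Counter

# Toy hierarchy with made-up entries (an assumption for illustration):
# each name maps to its ancestor terms, most specific first.
HIERARCHY = {
    "abc1": ["kinase", "enzyme", "protein"],
    "xyz2": ["transcription_factor", "protein"],
}

# Hypothetical generic token used by the replacement variant.
GENERIC_TERM = "GENE_OR_PROTEIN"


def tokenize(sentence):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return sentence.lower().split()


def bow_replace(sentence):
    """Bag of words in which every known name becomes the generic term."""
    tokens = [GENERIC_TERM if t in HIERARCHY else t for t in tokenize(sentence)]
    return Counter(tokens)


def bow_add_ancestors(sentence):
    """Bag of words that keeps each name and also adds its ancestor terms."""
    tokens = []
    for t in tokenize(sentence):
        tokens.append(t)
        tokens.extend(HIERARCHY.get(t, []))
    return Counter(tokens)


if __name__ == "__main__":
    example = "abc1 binds xyz2"
    print(bow_replace(example))
    print(bow_add_ancestors(example))
```

In the replacement variant, two sentences that mention different names share the same feature; in the ancestor variant, the specific name is kept and the shared ancestors (here "protein") provide the generalization.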

We believe that, because of the short length and highly specific vocabulary of this corpus and the particular characteristics of the classification task, the use of syntactic and semantic knowledge could be more beneficial here than in a collection of a more general nature. We will also present a few preliminary results on the collection of abstracts from which these sentences were extracted and on the Reuters collection.