ABSTRACT:
The basic Bag of Words representation usually used in Text Categorization
loses important syntactic and semantic information of the documents. When
the texts are of a short length this may be particularly problematic.
We study the contribution of incorporating syntactic and semantic
information into the representation in a Sentence Selection task in a
genomics corpus. We analyze the use of a hierarchical technical dictionary
by either replacing a gene or protein name by a generic term or adding its
ancestor terms for each gene or protein name in the representation. We
then introduce the hierarchical terms into a syntactic representation that
uses relations between words in the sentences. We show that using
hierarchical technical dictionaries together with syntactic relations is
beneficial for our problem when using state of the art machine learning
algorithms. These results are validated in a bigger dataset of a similar
nature, as well as in a dataset from the Legal domain.
We believe that because of the short length and the highly specific
vocabulary of this corpus, and the particular characteristics of the
classification, the use of syntactic and semantic knowledge could be more
beneficial than in a collection of a more general nature. We will also
present a few preliminary results in the collection of abstracts from
where these sentences were extracted and in the Reuters collection.
|