Mar. 3, 2005

DATE:	Thursday, Mar. 3, 2005
TIME:	1:30 pm
PLACE:	Council Room (SITE 5-084)
TITLE:	Corpus Construction for Terminology
PRESENTER:	Caroline Barriere, NRC
ABSTRACT: Texts in any domain are now easily accessible on the Web. In computational terminology, the problem is therefore not the availability of texts but their quality, or even just their usefulness, for understanding a domain. In this research, our theoretical goal is to investigate what characterizes relevant documents from a terminological point of view, and our practical goal is to develop a web application that helps terminologists build a domain-specific corpus. In this presentation, we will first introduce previous research on "knowledge patterns" and show how these patterns are used to find relevant information in text and to structure this information into semantic networks. We then propose using these knowledge patterns to estimate the relevance of a document for terminological studies. We present TerminoWeb, a tool that filters documents retrieved from the Web by their density of knowledge patterns. The tool is flexible: a user can construct multiple domain-specific corpora and obtain, for each one, a set of web documents sorted by decreasing knowledge-richness.
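
To make the idea of ranking documents by density of knowledge patterns concrete, here is a minimal sketch in Python. It is not TerminoWeb's implementation: the pattern list, the density definition (pattern matches divided by word count), and the names `KNOWLEDGE_PATTERNS`, `pattern_density`, and `rank_documents` are all assumptions made for illustration.

```python
import re

# Hypothetical knowledge patterns, chosen only for illustration.
# The actual patterns used by TerminoWeb are not given in the abstract.
KNOWLEDGE_PATTERNS = [
    r"\bis a kind of\b",
    r"\bis defined as\b",
    r"\bis used (?:to|for)\b",
    r"\bis part of\b",
    r"\bconsists of\b",
    r"\bsuch as\b",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in KNOWLEDGE_PATTERNS]


def pattern_density(text):
    """Return pattern matches per word (an assumed density measure)."""
    words = text.split()
    if not words:
        return 0.0
    hits = sum(len(rx.findall(text)) for rx in _COMPILED)
    return hits / len(words)


def rank_documents(docs):
    """Sort documents (name -> text) by decreasing pattern density."""
    scored = [(name, pattern_density(text)) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "rich": "A compost bin is a kind of container. "
                "It is used to recycle organic waste such as leaves.",
        "poor": "We went to the store yesterday and bought some things.",
    }
    for name, score in rank_documents(docs):
        print(f"{name}: {score:.3f}")
```

Normalizing by word count keeps long documents from ranking first merely because they are long; other normalizations (per sentence, per occurrence of a domain term) would be equally plausible under the abstract's description.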