The Text Analysis and Machine Learning Group


 


Fall 1999


 

SPEAKER:             Heide Brucher               

TOPIC:                   Clustering queries               

DATE:                      Friday, September 10, 1999  

PLACE:                  Room 318, MacDonald Hall, University of Ottawa  

ABSTRACT:         Every time a user issues a query, it tells us something about the user's current information needs. Users often want to reuse queries they issued some time ago, perhaps with a slight shift in topic, and the earlier query is usually a good starting point. Within a sequence of queries there will be sets of queries that belong together because they deal with the same topic, but topically related queries are usually not issued in one contiguous stream. A user who wants to reuse a query from that history must therefore be able to recover it, which requires organizing past queries in a way that makes reuse possible. One possibility is to cluster the queries according to the topics they relate to (content-based clustering).
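
For readers who want to try the idea, here is a minimal sketch of content-based query clustering in Python. The single-pass strategy, the raw term-frequency vectors, the cosine measure, and the 0.4 threshold are illustrative assumptions, not the method presented in the talk.

    from collections import Counter
    from math import sqrt

    def cosine(a, b):
        # cosine similarity between two term-frequency vectors (Counters)
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm_a = sqrt(sum(v * v for v in a.values()))
        norm_b = sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    def cluster_queries(queries, threshold=0.4):
        # single-pass clustering: add each query to the most similar
        # existing cluster, or open a new cluster if none is similar enough
        clusters = []  # list of (centroid Counter, member queries)
        for q in queries:
            vec = Counter(q.lower().split())
            best, best_sim = None, threshold
            for cluster in clusters:
                sim = cosine(vec, cluster[0])
                if sim >= best_sim:
                    best, best_sim = cluster, sim
            if best is None:
                clusters.append((vec, [q]))
            else:
                best[0].update(vec)  # fold the query into the centroid
                best[1].append(q)
        return [members for _, members in clusters]

    history = ["clustering search queries", "topic clustering of queries",
               "bilingual text alignment", "alignment of bilingual corpora"]
    for topic_group in cluster_queries(history):
        print(topic_group)

A real system would more likely weight terms (e.g. TF-IDF) and recluster the full query log, but the sketch shows the core step: grouping queries by the topical similarity of their terms.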

   

SPEAKER:             Jérôme Tétreault, University of Ottawa, jtetreau@site.uottawa.ca

TOPIC:                   Bilingual text alignment based on word occurrence information               

DATE:                    Friday, October 22, 1999  

PLACE:                  Room 318, MacDonald Hall, University of Ottawa  

ABSTRACT:         Bilingual parallel corpora are one of the most valuable sources of information for the development of translation resources. Aligned corpora, obtained by aligning corresponding segments (usually sentences) of texts, have proved very useful in many tasks, such as statistical machine translation, bilingual lexicography, and word sense disambiguation. In this talk, I will give a brief overview of published work on parallel text alignment, outlining different approaches and their domains of application. I will present, in more detail, an algorithm that uses dynamic programming techniques to compare word occurrence vectors. The algorithm is based on previous work by Fung, to which some modifications have been introduced. It aims at extracting approximate bilingual lexicons from bilingual corpora, assuming no knowledge of either language and no prior sentence-level or paragraph-level alignment. Results of bi-lexicons extracted from the Hansard corpus will be presented. We envisage that the extracted bi-lexicon could further be used to produce a set of anchor points between the texts, allowing alignment at a finer level with high accuracy.
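
As a rough illustration of the occurrence-vector idea, the Python sketch below pairs each source-language word with the target-language word whose binary occurrence vector over k text segments is most similar, loosely in the spirit of Fung's earlier work. The segment count, the Dice coefficient, and the best-match pairing are assumptions for illustration; the talk's algorithm additionally uses dynamic programming, which this sketch omits.

    def occurrence_vector(tokens, word, k=20):
        # binary vector: does `word` occur in each of k equal text segments?
        seg = max(1, len(tokens) // k)
        vec = [0] * k
        for i, tok in enumerate(tokens):
            if tok == word:
                vec[min(i // seg, k - 1)] = 1
        return vec

    def dice(u, v):
        # Dice coefficient between two binary vectors
        both = sum(1 for a, b in zip(u, v) if a and b)
        total = sum(u) + sum(v)
        return 2 * both / total if total else 0.0

    def approximate_lexicon(src_tokens, tgt_tokens, src_words, tgt_words, k=20):
        # pair each source word with the target word whose occurrence
        # vector is most similar: a crude approximate bilingual lexicon
        pairs = []
        for s in src_words:
            sv = occurrence_vector(src_tokens, s, k)
            best = max(tgt_words,
                       key=lambda t: dice(sv, occurrence_vector(tgt_tokens, t, k)))
            pairs.append((s, best))
        return pairs

Words that are translations of each other tend to occur in corresponding regions of a parallel text, so their occurrence vectors agree even when nothing is known about either language.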

   

SPEAKER:             Ebenezer Ntienjem, ntienjem@usa.net  

TOPIC:                   Completion of Logic Programs with respect to Negation-As-Finite-Failure.  

DATE:                    Friday, October 29, 1999  

PLACE:                  Room 318, MacDonald Hall, University of Ottawa  

ABSTRACT:         The procedural semantics of logic programs, expressed by the so-called SLDNF-resolution, treats a variable occurring in a negative literal in the body but not in the head of a program clause as universally quantified, whereas Clark's completion treats such a variable as existentially quantified. To close this gap between the declarative and procedural semantics of logic programs, we define a new approach to the completion of logic programs. To successfully define this new approach, we augment the syntax of a language for first-order predicate logic. We compare this new approach with Clark's completion and with the partial completion of logic programs. We also relate the two-valued models of this completion of logic programs to rule-based inference systems.
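
To make the quantifier gap concrete, here is an illustrative example (not taken from the talk): a clause whose variable x occurs in a negative body literal but not in the head, with the two readings the abstract contrasts, written in LaTeX notation.

    % an illustrative program clause
    p \leftarrow \neg\, q(x)

    % Clark's completion quantifies the local variable existentially:
    p \leftrightarrow \exists x\, \neg q(x)

    % SLDNF succeeds on p only when q(x) finitely fails for the unbound x,
    % which amounts to the universal reading:
    p \leftrightarrow \forall x\, \neg q(x)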

 

SPEAKER:             Tom Mitchell, President-Elect, American Association for Artificial Intelligence
                                 Professor of AI and Learning; Director,
                                 Center for Automated Learning and Discovery,
                                 School of Computer Science, Carnegie Mellon University, email: tom.mitchell@cmu.edu

DATE:                    Friday, November 12, 1999  

TITLE:                    Extracting Information from the World Wide Web  

ABSTRACT:         Consider the fact that although your computer workstation can now retrieve any of 600,000,000 pages on the World Wide Web, it unfortunately cannot understand their content. This is, of course, because web pages are written to be understandable to people, not computers. The goal of our research is to automatically extract a very large database of facts that mirrors the content of the Web and that can be manipulated by computer. If we can achieve this goal, it will enable using the web as a gargantuan database and knowledge base to support a rich variety of applications. Our approach is to use machine learning algorithms to train a system to automatically extract information from web hypertext. For example, in one set of experiments our system was trained to extract descriptions of faculty, students, research projects, and courses from web sites of computer science departments. It then used these learned extraction routines to build a database containing thousands of new entries by automatically browsing new university web sites. The system is currently running 24 hours per day, and over the past eight months has built a knowledge base containing over 100,000 assertions, with an accuracy of roughly 70%. This talk will present the machine learning algorithms we have developed to date, along with experimental results suggesting these methods can be quite effective for information extraction in certain domains.
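
As a toy illustration of the kind of learned classifier underlying such a system, here is a minimal naive Bayes page classifier in Python. Naive Bayes, the two labels, and the training snippets are assumptions for illustration; the talk does not say these are the actual extraction routines.

    from collections import Counter, defaultdict
    from math import log

    class NaiveBayes:
        # a toy bag-of-words naive Bayes text classifier
        def __init__(self):
            self.word_counts = defaultdict(Counter)  # label -> term frequencies
            self.page_counts = Counter()             # label -> training pages

        def train(self, label, text):
            self.page_counts[label] += 1
            self.word_counts[label].update(text.lower().split())

        def classify(self, text):
            words = text.lower().split()
            vocab = len(set().union(*self.word_counts.values()))
            total_pages = sum(self.page_counts.values())

            def score(label):
                counts = self.word_counts[label]
                total = sum(counts.values())
                # log prior plus add-one-smoothed log likelihoods
                return (log(self.page_counts[label] / total_pages)
                        + sum(log((counts[w] + 1) / (total + vocab))
                              for w in words))

            return max(self.page_counts, key=score)

    nb = NaiveBayes()
    nb.train("course", "syllabus lecture notes homework assignments exam grading")
    nb.train("student", "advisor thesis research interests publications homepage")
    print(nb.classify("thesis research publications advisor homepage"))  # student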

BIO: Tom M. Mitchell is the Fredkin Professor of Artificial Intelligence and Learning in the School of Computer Science, Carnegie Mellon University. He is also the Founding Director of CMU's Center for Automated Learning and Discovery, an interdisciplinary center for research on data mining. Mitchell is best known for his research on machine learning, in which he has developed applications such as online calendars that learn their users' scheduling preferences, web browsers that learn to extract information from hypertext, and systems that predict birth risks in new pregnancies based on hospital records of previous pregnancies. Mitchell is the author of the widely used textbook "Machine Learning" (McGraw Hill, 1997), President-Elect of the American Association for Artificial Intelligence, and a member of the Computer Science and Telecommunications Board of the National Academy of Sciences' National Research Council. Mitchell received his B.S. degree from the Massachusetts Institute of Technology in 1973, and his Ph.D. in electrical engineering from Stanford University in 1979.

   

SPEAKER:             Berry DeBruijn, University of Ottawa, Email: debruijn@csi.uottawa.ca

TOPIC:                   Evaluation of Interactive Information Retrieval  

DATE:                    Friday, November 19, 1999  

PLACE:                  Room 318, MacDonald Hall, University of Ottawa  

ABSTRACT:         Our experiment platform, a full-text information retrieval system, supports Query Expansion (QE) by Relevance Feedback (RF): the system suggests additional query terms that are typical for relevant documents and atypical for irrelevant or unretrieved documents. Evaluating QE/RF is somewhat problematic. A user study is costly, and because of tensions between experiment size, expected variation between users, and variation between retrieval tasks, effects are difficult to establish statistically and the results are difficult to generalize. A system study lacks the input of users and disregards the interactive character of the system, and a system study with a single simulated user lacks essential credibility. A system study with a "cohort" of simulated users, each one behaving slightly differently, could be a step toward solving these evaluation problems. The presentation describes: (1) the place of this new experimental method among existing methods, (2) the design of the study, (3) the design of the group of simulated users, (4) the results of the experiments, (5) the restrictions and assumptions that apply, and (6) examples of possible future studies within the framework of the Intelligent Information Access project. How does it relate to my previous TAMALE presentation? A more extensive methodological discussion, more results, further analyses, in-depth interpretation; still no clip-art, still no donuts.
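
The term-suggestion step can be illustrated with a few lines of Python: score candidate terms by how frequent they are in the documents the user judged relevant and how rare they are in the rest. The frequency-difference score is an illustrative assumption, not the weighting used in the actual system.

    from collections import Counter

    def suggest_terms(relevant_docs, other_docs, n=5):
        # terms frequent in judged-relevant documents and rare elsewhere
        rel, other = Counter(), Counter()
        for doc in relevant_docs:
            rel.update(doc.lower().split())
        for doc in other_docs:
            other.update(doc.lower().split())

        def score(term):
            return (rel[term] / len(relevant_docs)
                    - other[term] / max(1, len(other_docs)))

        return sorted(rel, key=score, reverse=True)[:n]

    relevant = ["machine learning text classification",
                "learning algorithms text categorization"]
    others   = ["database transaction processing",
                "text compression algorithms"]
    print(suggest_terms(relevant, others))

A cohort in the sense of the talk could then be simulated by running many artificial users that accept or reject such suggestions under slightly different policies.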

 

SPEAKER:             Joel Martin, National Research Council, Email: joel.martin@iit.nrc.ca  

TOPIC:                   Design of a Better Question Answering System                    

DATE:                    Friday, December 3, 1999 

PLACE:                  Room 318, MacDonald Hall, University of Ottawa 

ABSTRACT:         A better search engine would allow you to ask a natural language question and would return an answer instead of 10,000 web pages. In this talk I will review our current design for question answering and summarize the designs of the other systems presented at TREC-8 in Gaithersburg, MD. Our system, like almost all the others, does 'passage retrieval' (find a short passage likely to contain the answer) and then scans for answer types that match the question (for a 'Who' question, look for a person). The most obvious deficiencies in our current design are speed and the accuracy of answer-type identification: other passage retrieval systems can do a search in 5-25 seconds while ours takes minutes, and other named-entity components have greater than 90% accuracy while ours is around 50%. I will finish the talk with an outline of the design of a better, more robust question answering system. This work was done with Chris Lankester from the University of Ottawa.
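
The two-stage design (retrieve a passage, then scan for the expected answer type) can be sketched in a few lines of Python. The question-word table and the regular-expression "named entities" are toy stand-ins for illustration, not the components of the system described in the talk.

    import re

    # map a question word to the expected answer type
    ANSWER_TYPE = {"who": "PERSON", "when": "DATE", "where": "PLACE"}

    # toy stand-ins for a named-entity recognizer
    PATTERNS = {
        "PERSON": re.compile(r"\b(?:[A-Z][a-z]+ ){1,2}[A-Z][a-z]+\b"),
        "DATE":   re.compile(r"\b(?:1[89]|20)\d{2}\b"),  # a bare year
        "PLACE":  re.compile(r"\bin ([A-Z][a-z]+(?: [A-Z][a-z]+)*)"),
    }

    def answer(question, passage):
        # assume `passage` was already retrieved for this question;
        # scan it for an entity matching the expected answer type
        atype = ANSWER_TYPE.get(question.split()[0].lower())
        if atype is None:
            return None
        match = PATTERNS[atype].search(passage)
        if match is None:
            return None
        return match.group(1) if match.lastindex else match.group(0)

    print(answer("Who invented the telephone?",
                 "The telephone was patented by Alexander Graham Bell in 1876."))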

 
