The Text Analysis and Machine Learning Group


 


Winter 2001


 

SPEAKER:             Rob Cattral, School of Computer Science, Carleton University  

DATE:                    Thursday 1st February, 2001  

TITLE:                    Rule Acquisition with a Genetic Algorithm  

ABSTRACT:         RAGA is a genetic algorithm (GA) and genetic programming (GP) hybrid designed for the tasks of supervised and certain types of unsupervised knowledge extraction from large and possibly noisy databases. Applying the GA to the task of data mining required that certain changes be made to the traditional algorithm, particularly with respect to knowledge representation. RAGA uses a system of rules of varying length and complexity, while the traditional GA has chromosomes of fixed length using a binary alphabet. Although it is similar to GP in this respect, RAGA uses mutation operators and does not evolve programs. Unlike both GA and GP, it promotes validity and nonredundancy by intergenerational processing on a fluctuating number of individuals. It also implements a form of elitism that causes a wide exploration of the dataset, and, by making data coverage a component of fitness, it automatically evolves default hierarchies of rules.
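
The sketch below is not RAGA itself; it is only a toy illustration of two ideas from the abstract: chromosomes that are variable-length lists of attribute-value conditions rather than fixed-length bit strings, and a fitness function that includes data coverage. The dataset, the 0.7/0.3 weighting, and the operators are invented for the example.

```python
# Toy sketch only (not the RAGA implementation): a tiny GA over
# variable-length IF-THEN rules, with data coverage folded into fitness.
import random

DATA = [({"colour": "red",  "size": "big"},   True),
        ({"colour": "red",  "size": "small"}, True),
        ({"colour": "blue", "size": "big"},   False),
        ({"colour": "blue", "size": "small"}, False)]
ATTRS = {"colour": ["red", "blue"], "size": ["big", "small"]}

def random_rule():
    # A chromosome is a variable-length list of (attribute, value) conditions.
    n = random.randint(1, len(ATTRS))
    return [(a, random.choice(vals)) for a, vals in random.sample(list(ATTRS.items()), n)]

def fitness(rule):
    covered = [(x, y) for x, y in DATA if all(x[a] == v for a, v in rule)]
    if not covered:
        return 0.0
    accuracy = sum(1 for _, y in covered if y) / len(covered)
    coverage = len(covered) / len(DATA)        # coverage is a fitness component
    return 0.7 * accuracy + 0.3 * coverage     # weights chosen arbitrarily

def mutate(rule):
    # Mutation: re-sample the value of one attribute (adding it if absent).
    attr, vals = random.choice(list(ATTRS.items()))
    return [c for c in rule if c[0] != attr] + [(attr, random.choice(vals))]

population = [random_rule() for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    elite = population[:5]                     # simple elitism
    population = elite + [mutate(random.choice(elite)) for _ in range(15)]

best = max(population, key=fitness)
print("best rule:", best, "fitness:", round(fitness(best), 2))
```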

 

SPEAKER:             Yves Kodratoff, CNRS & University Paris-Sud  

DATE:                    Thursday 8th February, 2001  

TITLE:                    Knowledge Discovery in Texts (= "Text Mining")  

ABSTRACT:         First, we will give a short demo of the industrial software TROPES developed by the company Acetic. It will illustrate the importance of working with a "good" taxonomy of concepts when detecting occurrences of concepts in texts. This presentation aims at underlining two rather difficult problems raised by Knowledge Discovery in Texts (KDT). This discipline uses two kinds of tools, both imperfect: Natural Language Processing (NLP) tools, and tools for the discovery of "associations" (implications without variables). We deal with these two problems as follows. For NLP, we insist on the absolute necessity of developing sufficiently reliable tools to create taxonomies of concepts from texts. We noticed that the syntactic situation of a word in the sentence most often determines which concept the word refers to. We will describe LRI's ASIUM tool, which helps build taxonomies from texts, and ROWAN, still under development (a demo will be given), which will soon allow characterizing the syntactic position linked to a particular meaning of a word. For association detection, we will underline that large coverage is of little interest. We will show the equivalence, in practice, between: 1. measures of dependency, 2. intensity of the implication, and 3. naive Bayes. We will illustrate the necessity, as a consequence, of constructing complete Bayesian networks from data.

 

SPEAKER:             Rob Holte, School of Information Technology and Engineering, University of Ottawa  

DATE:                    Thursday 15th February, 2001  

TITLE:                    Combinatorial Auctions, Knapsack Problems, and Hill-climbing Search  

ABSTRACT:         This talk examines the performance of hill-climbing algorithms on standard test problems for combinatorial auctions (CAs).  On single-unit CAs, deterministic hill-climbers are found to perform well, and their performance can be improved significantly by randomizing them and restarting them several times, or by using them collectively.  For some problems this good performance is shown to be no better than chance; on others it is due to a well-chosen scoring function. Multi-unit CAs have been studied widely under a different name: multidimensional knapsack problems (MDKP).  On standard test problems for MDKP, one of the deterministic hill-climbers generates solutions that are on average 99% of optimal.
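
For readers unfamiliar with the setup, the following sketch shows the generic idea of a randomized hill-climber with restarts on a toy multidimensional knapsack instance. The instance, neighbourhood, and restart count are made up for illustration, and this is not the scoring function studied in the talk.

```python
# Minimal sketch (not the paper's code): randomized hill-climbing with
# restarts on a tiny invented multidimensional knapsack instance.
import random

values     = [10, 13, 7, 8, 9]                  # value of each item
weights    = [[2, 3, 1, 4, 2],                  # resource consumption,
              [1, 2, 3, 1, 2]]                  # one row per dimension
capacities = [6, 5]

def feasible(x):
    return all(sum(w[i] * x[i] for i in range(len(x))) <= c
               for w, c in zip(weights, capacities))

def score(x):
    return sum(v * xi for v, xi in zip(values, x)) if feasible(x) else -1

def hill_climb(x):
    improved = True
    while improved:
        improved = False
        for i in random.sample(range(len(x)), len(x)):   # randomized order
            y = x[:]
            y[i] = 1 - y[i]                              # flip one item in/out
            if score(y) > score(x):
                x, improved = y, True
    return x

best = [0] * len(values)
for _ in range(20):                                      # random restarts
    start = [random.randint(0, 1) for _ in values]
    if not feasible(start):
        start = [0] * len(values)
    candidate = hill_climb(start)
    if score(candidate) > score(best):
        best = candidate
print("best packing:", best, "value:", score(best))
```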

   

SPEAKER:             Svetlana Kiritchenko, School of Information Technology and Engineering, University of Ottawa  

DATE:                    Thursday 1st March, 2001  

TITLE:                    Co-Training: a new algorithm for using unlabeled data in Supervised Learning  

ABSTRACT:          In Supervised Learning there is often a shortage of labeled data.  We may have plenty of unlabeled examples, but obtaining labeled data means manually labeling each example, which is expensive and time-consuming work. In such cases, we would like a way of putting unlabeled data to use in the training process. Co-Training is a new, effective algorithm suggested by T. Mitchell (and presented in this seminar series in the Fall of 1999) that lets you start with only a few labeled examples and then use unlabeled data to improve your initial classifier. In this presentation we will talk about how Co-Training works, when it can be applied, and what advantages it has compared with other learning algorithms that use unlabeled data. I will also present our experimental results of applying Co-Training to the e-mail classification problem.
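
As a rough sketch of the loop described above (not the system used in our experiments), the code below trains two classifiers on two different views of the data and lets each one pseudo-label the unlabelled example it is most confident about. scikit-learn and the synthetic two-view data are assumptions made purely for illustration.

```python
# Illustrative co-training loop: two classifiers, one per "view", take
# turns labelling the unlabelled examples they are most confident about.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)          # toy target visible in both views
view1, view2 = X[:, :2], X[:, 2:]                # two (roughly independent) views

labelled = list(range(10))                       # start with only 10 labelled examples
unlabelled = list(range(10, 200))

for _ in range(10):
    c1, c2 = GaussianNB(), GaussianNB()
    c1.fit(view1[labelled], y[labelled])
    c2.fit(view2[labelled], y[labelled])
    if not unlabelled:
        break
    for clf, view in ((c1, view1), (c2, view2)):
        proba = clf.predict_proba(view[unlabelled])
        best = int(np.argmax(proba.max(axis=1)))  # most confident unlabelled example
        idx = unlabelled.pop(best)
        # Real co-training uses the classifier's own prediction as the label.
        y[idx] = proba[best].argmax()
        labelled.append(idx)

print("final training set size:", len(labelled))
```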

   

SPEAKER:             Dale Schuurmans, Department of Computer Science, University of Waterloo (joint work with Finnegan Southey, University of Waterloo)  

DATE:                    Wednesday 7th March, 2001  

TITLE:                   Monte Carlo inference via greedy importance sampling  

ABSTRACT: It is well known that general inference and learning with Bayesian networks is computationally hard, and it is therefore necessary to consider heuristic and approximate algorithms for these tasks.  Among the most convenient and successful techniques are Monte Carlo estimators which can be easily applied to complex inference problems that overwhelm deterministic approaches.  However, Monte Carlo estimation often suffers from high variance, which renders the approach ineffective in many natural situations. In this talk I will discuss a new approach to Monte Carlo estimation that incorporates a deterministic search for important points in the target distribution---in an attempt to reduce the dependence on a poor proposal distribution.  The challenge is to explicitly incorporate search while still maintaining an unbiased estimate of the desired quantities.  I will introduce the basic problem and our approach, and demonstrate that the method yields unbiased estimates while simultaneously reducing the variance of well known methods (including importance sampling, Gibbs sampling, Metropolis sampling, and hybrid Monte Carlo).
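
As background, here is a minimal sketch of ordinary (self-normalized) importance sampling, the estimator whose variance the greedy, search-based variant aims to reduce. The target, proposal, and test function are chosen only so the answer is easy to check; the greedy search itself is not reproduced.

```python
# Background sketch: plain importance sampling with weights w(x) = p(x)/q(x).
import numpy as np

rng = np.random.default_rng(1)

def p(x):                      # unnormalized target density: standard normal
    return np.exp(-0.5 * x**2)

def q(x):                      # proposal density (up to a constant): N(0, 3^2)
    return np.exp(-0.5 * (x / 3.0) ** 2) / 3.0

def f(x):                      # quantity whose expectation under p we want
    return x**2

xs = rng.normal(0.0, 3.0, size=100_000)     # draw from the proposal
w = p(xs) / q(xs)                            # importance weights
estimate = np.sum(w * f(xs)) / np.sum(w)     # self-normalized estimator
print("E_p[x^2] estimate:", round(float(estimate), 3))   # should be near 1.0
```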

   

SPEAKER:             Dale Schuurmans, Department of Computer Science, University of Waterloo (joint work with Finnegan Southey, University of Waterloo, and Rob Holte, University of Ottawa)  

DATE:                    Thursday 8th March, 2001  

TITLE:                    The exponentiated subgradient algorithm for heuristic boolean programming  

ABSTRACT: Boolean linear programs (BLPs) are ubiquitous in artificial intelligence research.  Satisfiability testing, planning with resource constraints, and winner determination in combinatorial auctions are all examples of this type of problem.  Although increasingly well-informed by work in operations research, current work in AI has tended to focus on specialized algorithms for each type of BLP task and has only loosely patterned new algorithms on effective methods from other tasks. In this talk I will introduce a single general-purpose local search procedure that can be simultaneously applied to the entire range of BLP problems, without modification.  Although one might suspect that a general-purpose algorithm would not perform as well as specialized algorithms, we find that this is not the case here.  Our experiments show that our generic algorithm simultaneously achieves performance comparable with the state of the art in satisfiability search and winner determination in combinatorial auctions---two very different BLP problems.  Our algorithm is simple, and combines an old idea from OR with recent ideas from AI.
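
The exponentiated subgradient algorithm itself is not reproduced here; the sketch below only illustrates the general shape of a problem-independent, flip-based local search that takes a scoring function as input, instantiated with a tiny invented satisfiability instance.

```python
# Sketch of a generic flip-based local search over 0/1 assignments
# (illustrative only; this is not the exponentiated subgradient algorithm).
import random

def local_search(n_vars, score, steps=200):
    x = [random.randint(0, 1) for _ in range(n_vars)]
    best = x[:]
    for _ in range(steps):
        neighbours = []
        for i in range(n_vars):
            y = x[:]
            y[i] ^= 1                            # flip one variable
            neighbours.append(y)
        x = max(neighbours, key=score)           # greedy move
        if score(x) > score(best):
            best = x[:]
        elif random.random() < 0.1:              # occasional random restart
            x = [random.randint(0, 1) for _ in range(n_vars)]
    return best

# One possible BLP instance: a tiny SAT problem. Each clause is a list of
# (variable index, required value) pairs; the score counts satisfied clauses.
clauses = [[(0, 1), (1, 0)], [(1, 1), (2, 1)], [(0, 0), (2, 0)]]

def sat_score(x):
    return sum(any(x[v] == want for v, want in clause) for clause in clauses)

assignment = local_search(3, sat_score)
print("assignment:", assignment, "clauses satisfied:", sat_score(assignment))
```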

   

SPEAKER:             Berry de Bruijn, Institute for Information Technology, National Research Council of Canada  

DATE:                    Thursday 15th March, 2001  

TITLE:                    Sentence extraction for justifying text categorization  

ABSTRACT: Our SVM-based text categorization algorithm separates medical articles (or rather, abstracts) into those that report a protein interaction and those that don't. Joel Martin talked about that in a previous Tamale seminar. In this presentation, I will talk about efforts to extract the key sentence from a positive document, i.e. a protein interaction abstract.  A good sentence contains enough evidence for a human judge to certify the true class of the document. An automaton that can find the good sentences could reduce a judge's reading time. I will discuss two methods for sentence extraction, results for tests on two data collections, and related work (notably automatic summarization and question answering).
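
Purely to make the task concrete (the actual extraction methods of the talk are not reproduced), a naive baseline might rank sentences by invented cue words and return the highest-scoring one:

```python
# Toy sketch only: pick the sentence with the most protein-interaction
# cue words. The cue list and scoring are hypothetical.
import re

CUES = {"interacts", "interaction", "binds", "binding", "complex"}

def key_sentence(abstract: str) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    def evidence(s):
        words = set(re.findall(r"[a-z]+", s.lower()))
        return len(words & CUES)
    return max(sentences, key=evidence)

abstract = ("We studied two yeast proteins. "
            "Protein A binds protein B and forms a stable complex. "
            "The growth phenotype was unchanged.")
print(key_sentence(abstract))   # prints the middle, evidence-bearing sentence
```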

   

SPEAKER:             Fernando C. Gomes, Federal University at Fortaleza, Brazil (on sabbatical leave at SITE)  

DATE:                    Thursday 22nd March, 2001  

TITLE:                    Practical SVM Usage  

ABSTRACT: Over the last three years Support Vector Machines (SVMs) have been successfully used in a large variety of domains. The central idea behind SVM is to map the input data into some other feature space via a nonlinear map.  This makes SVM a serious competitor to high-performance classifiers. However, SVM shows some weaknesses and difficulties related to its implementation and use, namely computational performance, choice of mapping function, and parameter tuning. In this seminar I will present the method, as a tutorial, and show how I have been using SVM for classification and regression in image analysis and diagnosis.
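
As a present-day illustration of the practical points above (using scikit-learn, an assumption of this sketch that postdates the talk), the kernel choice and parameter tuning the abstract mentions can be handled with a simple grid search over synthetic data:

```python
# Usage sketch: grid-search the kernel (mapping function) and its
# parameters instead of guessing them. Data are synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = GridSearchCV(
    SVC(),
    {"kernel": ["rbf", "poly"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    cv=5,
)
search.fit(X_tr, y_tr)
print("best parameters:", search.best_params_)
print("held-out accuracy:", round(search.score(X_te, y_te), 3))
```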

   

SPEAKER:             Istvan Hernadvolgyi, School of Information Technology and Engineering, University of Ottawa  

DATE:                    Thursday 29th March, 2001  

TITLE:                    Searching for Macro Operators with Automatically Generated Heuristics  

ABSTRACT: The Rubik's Cube has 4.3*10^19 states, 40,000,000 times more than the 15-puzzle. Because of its size, this puzzle has been a driving force of research both in Mathematics and Computer Science. Recently (1997), for the first time, a few dozen optimal solutions were found by heuristic search. Each instance took several days of CPU time and large heuristic tables. Suboptimal solutions, while taking significantly less computational effort, are often 5 times the length of the optimal solution. Our aim is to come up with a technique which is general and can find short solutions fast. Our method is to find optimal macros with heuristic search and to reduce the number of subgoals which confine the solution path.  The heuristics are automatically derived by generating an abstract space, while the subgoals are reduced by merging consecutive ones. Using this method, the solution lengths are reduced by 44% over previous suboptimal techniques, at reasonable extra computational effort.  This method is a general search technique, almost entirely automatic, and it is not in any way limited to solving the Rubik's Cube. I will also present yet another technique which exploits symmetries in the subgoals and makes the same macro table applicable to find even shorter solutions.

   

SPEAKER:             Erick Alphonse, School of Information Technology and Engineering, University of Ottawa  

DATE:                    Thursday 12th April, 2001  

TITLE:                    Learning Relational Concepts by Lazy Propositionalization  

ABSTRACT: Inductive Logic Programming is a domain of machine learning which deals with representation languages in subsets of First Order Logic (FOL). It is a well-known fact that concept learning in restrictions of FOL has to cope with complexity issues that stem from both the size of the search space and the complexity of the subsumption test. Both aspects are traditionally handled in Inductive Logic Programming (ILP) directly in a FOL framework by adopting a generate-and-test approach carefully controlled through sophisticated language and search bias. A novel approach to learning relational concepts from positive and negative examples is presented. We show how one can perform a test incorporation into the generator in order to conduct a data-driven search strategy. This upgrade of the well-known attribute-value technique, motivated by Winston and used in the AQ family, into a downward refinement operator enables us to dynamically reduce the dimensionality of ILP problems by pruning irrelevant branches of the refinement graph.  Furthermore, this refinement operator is computed efficiently in polynomial time. We present an AQ-like algorithm exploiting this approach and then provide a first empirical evaluation on a standard benchmark dataset for ILP, the Mutagenesis problem. If we have enough time, we will discuss the soundness of our approach and underline the link between the completeness of the two learning strategies in ILP.

   

SPEAKER:             Aijun An, Department of Computer Science, University of Waterloo  

DATE:                    Thursday 10th May, 2001  

TITLE:                    A Rule Induction System and its Applications  

ABSTRACT: I present ELEM2, a rule induction system that induces decision rules from a set of data based on the selection of attribute-value pairs. ELEM2 is distinguished from other rule induction systems in three respects. First, in the search for attribute-value pairs, ELEM2 uses a heuristic function that reflects the degree of relevance of an attribute-value pair to a target concept. The function leads to selection of the most relevant pairs to formulate rules. Second, ELEM2 handles inconsistent training examples by defining an unlearnable region of a concept based on the probability distribution of that concept in the training data. The unlearnable region is used as a stopping criterion for the concept learning process, which resolves conflicts without removing inconsistent examples. Third, ELEM2 makes use of a rule quality measure in its post-pruning process to prevent rules from overfitting the data. The default rule quality formula used in ELEM2 measures the extent to which a rule can discriminate between the positive and negative examples of the target concept. The rule quality formula can also be selected from a set of formulas based on characteristics of the data set and a set of formula-selection rules. I will provide an overview of experimental results obtained by applying ELEM2 to real and synthetic data sets and compare the results to those from some other systems. I will also discuss some recent industrial applications of the research, including identifying active compounds for drug discovery and target marketing of products and services for telecommunication companies.
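
The fragment below is only a simplified, hypothetical sketch of the first idea, greedy selection of attribute-value pairs by a relevance score; the score used here is plain precision, not ELEM2's heuristic, and the unlearnable region and rule quality measures are not shown.

```python
# Simplified sketch (not ELEM2): greedily pick the attribute-value pair
# most "relevant" to the target concept, specialize, and repeat.
DATA = [  # (attribute-value dict, belongs to target concept?)
    ({"outlook": "sunny", "windy": "no"},  True),
    ({"outlook": "sunny", "windy": "yes"}, True),
    ({"outlook": "rain",  "windy": "no"},  False),
    ({"outlook": "rain",  "windy": "yes"}, False),
    ({"outlook": "sunny", "windy": "yes"}, False),
]

def induce_rule(data):
    rule = {}
    remaining = data
    while remaining and not all(y for _, y in remaining):
        pairs = {(a, v) for x, _ in remaining for a, v in x.items() if a not in rule}
        if not pairs:
            break
        def relevance(pair):
            a, v = pair
            covered = [y for x, y in remaining if x.get(a) == v]
            return sum(covered) / len(covered) if covered else 0.0
        a, v = max(pairs, key=relevance)         # most relevant pair
        rule[a] = v
        remaining = [(x, y) for x, y in remaining if x.get(a) == v]
    return rule

print("induced rule:", induce_rule(DATA))   # e.g. {'outlook': 'sunny', 'windy': 'no'}
```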

   

SPEAKER:             Mario Jarmasz, School of Information Technology and Engineering, University of Ottawa  

DATE:                    Thursday 17th May, 2001  

TITLE:                    Roget's Thesaurus: a Lexical Resource to Treasure  

ABSTRACT: WordNet proved that it is possible to construct a large-scale electronic lexical database on the principles of relational lexical semantics. It has been accepted and used extensively by computational linguists ever since it was released. Inspired by its success, I propose as an alternative a similar resource based on the 1987 Penguin edition of Roget's Thesaurus of English Words and Phrases. This seminar describes the steps involved in creating a machine-tractable version of Roget's Thesaurus. A comparison to WordNet and an evaluation of its merits for natural language processing shall be discussed. Roget's classification system as well as its lexical entry format will be shown, and the organization and distribution of words and phrases described in depth. After analyzing the Thesaurus, a methodology for mapping Roget's senses onto WordNet and labelling implicit semantic relations shall be explained. A survey of how Roget's has been used in computational linguistics will be presented. Finally, three applications of the computerized Thesaurus shall be discussed: measuring the semantic similarity of words and phrases, word sense disambiguation, and text classification.
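
To make the first application concrete, here is a toy sketch of semantic similarity as path distance in a small invented hierarchy; Roget's actual structure and the measure evaluated in the talk are not reproduced.

```python
# Toy sketch: word similarity as path distance in a tiny made-up taxonomy.
PARENT = {"car": "vehicle", "truck": "vehicle", "vehicle": "entity",
          "dog": "animal", "animal": "entity", "entity": None}

def path_to_root(word):
    path = [word]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def distance(w1, w2):
    # Shortest path through the lowest common ancestor; smaller = more similar.
    p1, p2 = path_to_root(w1), path_to_root(w2)
    common = set(p1) & set(p2)
    return min(p1.index(c) + p2.index(c) for c in common)

print(distance("car", "truck"))   # 2: close in the hierarchy
print(distance("car", "dog"))     # 4: farther apart
```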

 
