The Text Analysis and Machine Learning Group

Fall 2000


 

SPEAKER:             Igor Jurisica, Ontario Cancer Institute / Princess Margaret Hospital, University Health Network, Division of Cancer Informatics, University of Toronto

DATE:                    Wednesday 20th September, 2000  

TITLE:                    Systematic Knowledge Management: Taking Advantage of High-Throughput Methods in Biology

ABSTRACT:         Biomedical domains are characterized by substantial amounts of complex data, many unknowns, a lack of complete theories, and rapid evolution. In such domains, reasoning is often based on experience rather than on general knowledge. Experts remember positive experiences (cases) for possible reuse of solutions; negative experiences are used to avoid potentially unsuccessful outcomes. In general, knowledge management systems support the representation, organization, acquisition, creation, usage, and evolution of knowledge in its many forms. Knowledge-based decision-support systems can successfully support problem solving by aiding classification, diagnosis, planning, or prediction. The competence and performance of these systems depend on the knowledge, its representation schema, and the reasoning mechanism used. It is usually desirable to have an expressive knowledge representation and a scalable reasoning mechanism, where both are also understandable to human users. In addition, the system must be flexible in order to support the evolution of domain knowledge. Existing decision-support systems have mostly been designed and implemented by knowledge engineers, using domain expert knowledge compiled into appropriate representation schemes. However, manually evolving the system to meet changing user needs and domain knowledge is neither efficient nor even feasible in many situations. Preferably, we would like the system to assist in knowledge evolution automatically. We show how a case-based reasoning (CBR) system can be used for knowledge management in biology and medicine. CBR is an important paradigm for knowledge management because: (a) it is similar to human problem solving and thus can complement the user; (b) it supports evolving domain models and thus can help to increase domain understanding; and (c) it diminishes the problems of exceptions and over-generalization. We discuss extensions of the CBR paradigm with automated image-analysis and knowledge-discovery tools. This supports systematic knowledge management, and thus helps to ensure that important information is not missing or incorrect, that the representation formalism is not overly limited, and that more than a subset of the available experience is represented and used during reasoning.
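
Since the retrieval of similar past cases is the core mechanism described above, here is a minimal sketch of the retrieve-and-reuse step of a generic CBR system, assuming cases are stored as numeric feature vectors with attached solutions and compared by Euclidean distance. The case structure, distance measure, and data are illustrative assumptions, not the system presented in the talk.

    # Minimal case-based reasoning sketch (illustrative; not the talk's actual system).
    # A "case" pairs a problem description (feature vector) with a recorded solution.
    from dataclasses import dataclass
    from math import dist

    @dataclass
    class Case:
        features: tuple      # numeric problem description
        solution: str        # outcome or plan recorded for this past experience

    def retrieve(case_base, query_features):
        """Return the stored case whose problem description is closest to the query."""
        return min(case_base, key=lambda c: dist(c.features, query_features))

    case_base = [
        Case((0.9, 0.1, 0.3), "treatment plan A"),
        Case((0.2, 0.8, 0.5), "treatment plan B"),
    ]
    best = retrieve(case_base, (0.85, 0.15, 0.4))
    print(best.solution)     # reuse the solution of the most similar past case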

   

SPEAKER:             George Foster, Département d'informatique et recherche opérationnelle, Université de Montréal (currently at SITE, Ottawa)
 

DATE:                    Thursday 12th October, 2000  

TITLE:                    A Maximum Entropy Translation Model         

ABSTRACT:         Statistical translation (from one human language to another) can be cast as a classification problem which requires that diverse sources of predictive information be combined in an effective way. Two statistical machine learning techniques which provide flexible frameworks for combining different information sources are linear interpolation and Maximum Entropy / Minimum Divergence (MEMD) modelling.  In this talk, I will describe a simple linear translation model and an equivalent MEMD model, and compare their performances using a theoretical measure. I will also illustrate the practical differences between these models by showing how they perform in a prototype text-prediction tool for translators.
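
As a rough illustration of the two combination schemes contrasted in the abstract, the sketch below compares linear interpolation of two word-prediction distributions with a log-linear (maximum-entropy-style) combination of the same sources. The toy distributions and weights are invented for illustration; the actual MEMD translation model is considerably more elaborate.

    # Toy comparison of linear interpolation vs. a log-linear (MaxEnt-style) combination
    # of two predictive distributions over target words. All numbers are invented.
    p_lang  = {"chat": 0.6, "chien": 0.3, "maison": 0.1}   # e.g. a language-model source
    p_trans = {"chat": 0.2, "chien": 0.7, "maison": 0.1}   # e.g. a translation source

    def linear(w, lam=0.5):
        return lam * p_lang[w] + (1 - lam) * p_trans[w]

    def loglinear(w, alpha=1.0, beta=1.0):
        # unnormalized product of powers of the sources, then normalize
        score = lambda x: p_lang[x] ** alpha * p_trans[x] ** beta
        z = sum(score(x) for x in p_lang)
        return score(w) / z

    for w in p_lang:
        print(w, round(linear(w), 3), round(loglinear(w), 3))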

   

SPEAKER:             Tony White, Systems and Computer Engineering, Carleton University  

DATE:                    Thursday 19th October, 2000  

TITLE:                    SynthECA: A Synthetic Ecology of Chemical Agents  

ABSTRACT:         Emergence, or self-organization, is increasingly associated with distributed or decentralized problem solving. Naturally occurring systems demonstrate that complex problem solving with simple agents is possible and that such systems are extremely robust to changes in their environment and to the loss of individual agents. Furthermore, such systems are self-managing, requiring no "blind watchmaker" to ensure that the system operates efficiently. Systems that depend upon the interaction of a large number of simple agents for their problem-solving properties stand in stark contrast to the monolithic agent systems that have been the target of considerable research. The frailty of symbolic systems in the face of unforeseen situations and the failure of individual agents provides the motivation for the research documented in this talk. Social insect behaviour, and swarm systems generally, provide the inspiration for the research described here, as does the increasing importance of activation-oriented computing. This talk proposes an agent architecture that relies on local communication only, with no algorithm having knowledge of the global scene. The agent architecture is analysed in terms of the requirements necessary to support emergent problem solving. Solutions to problems in the communications domain, specifically routing and fault localization, demonstrate the utility of the proposed architecture, along with algorithms that demonstrate the self-management of agent swarms.
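
To make the idea of local-only communication concrete, here is a tiny, hypothetical stigmergy sketch in the spirit of ant-inspired routing: agents never see the whole network; they only read and deposit "chemical" markers at the node they currently occupy, and markers evaporate over time. The names and update rules are illustrative assumptions, not the SynthECA architecture itself.

    # Tiny stigmergy sketch: agents communicate only through local "chemical" markers.
    import random

    pheromone = {n: 1.0 for n in ["A", "B", "C", "D"]}   # marker level at each node
    neighbours = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}

    def step(position):
        """One agent move: read neighbouring markers locally, move, and reinforce."""
        options = neighbours[position]
        weights = [pheromone[n] for n in options]        # local information only
        nxt = random.choices(options, weights=weights)[0]
        pheromone[nxt] += 0.5                            # deposit a marker on arrival
        return nxt

    pos = "A"
    for _ in range(20):
        pos = step(pos)
        for n in pheromone:                              # environmental evaporation
            pheromone[n] *= 0.95
    print(pheromone)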

   

SPEAKER:             Robert C. Holte, School of Information Technology and Engineering, University of Ottawa  

DATE:                    Thursday 26th October, 2000  

TITLE:                    Combinatorial Auctions and Knapsack Problems  

ABSTRACT:         This talk will largely be tutorial in nature.  I will introduce "combinatorial auctions", a topic of much recent interest in AI which naturally generalizes to the multi-dimensional knapsack problem.  I will discuss applications (including weighted constraint satisfaction problems) and review the main algorithms and results for these problems.  I will also present new results showing that simple hill-climbing, with an appropriate objective function, does well on these problems.
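
As a concrete, purely illustrative instance of the hill-climbing idea mentioned at the end of the abstract, the sketch below greedily flips bids in and out of a candidate allocation for a tiny combinatorial auction, accepting a flip only when it raises total revenue while keeping the accepted bids disjoint. The bids, objective, and neighbourhood are simple assumptions, not the algorithms covered in the talk.

    # Hill-climbing for combinatorial-auction winner determination (toy illustration).
    # Each bid is (set_of_items, price); accepted bids must not share items.
    bids = [({"a", "b"}, 5), ({"b", "c"}, 4), ({"c"}, 3), ({"a"}, 2)]

    def feasible(selection):
        items = set()
        for idx in selection:
            if bids[idx][0] & items:
                return False
            items |= bids[idx][0]
        return True

    def revenue(selection):
        return sum(bids[i][1] for i in selection)

    current = set()
    improved = True
    while improved:
        improved = False
        for i in range(len(bids)):                   # try flipping each bid in or out
            candidate = current ^ {i}
            if feasible(candidate) and revenue(candidate) > revenue(current):
                current, improved = candidate, True
    print(sorted(current), revenue(current))         # bids for {a,b} and {c}: revenue 8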

   

SPEAKER:             Florence d'Alché-Buc (joint work with G. Siolas), Laboratoire d'Informatique de Paris 6, France
 

DATE:                    Thursday 2nd November, 2000  

TITLE:                    Support Data Machines and Semantic Kernels for Text  

ABSTRACT:         The learning principle of Support Vector Machines can be applied to any data (not only vectors) that can be described as proximity data by means of a kernel. In many pattern recognition applications, this makes it possible to use more expressive, possibly structured, representations of the data, combined with well-founded statistical learning tools. Text categorization, and more generally information retrieval tasks, offer a very promising application domain for testing these ideas. We explore and compare different ways of representing documents that rely on kernels encoding semantic relationships between words. Results on some text categorization tasks are presented and discussed.
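
A minimal, hypothetical illustration of a semantic kernel: document vectors are compared not with a plain dot product but through a word-by-word relatedness matrix S, so that documents using related rather than identical words can still score as similar. The vocabulary, the matrix S, and the documents are made up for the example; the kernels studied in the talk are not reproduced here.

    # Semantic kernel sketch: k(d1, d2) = d1 . S . d2, where S encodes word relatedness.
    import numpy as np

    vocab = ["car", "automobile", "banana"]
    S = np.array([[1.0, 0.9, 0.0],      # "car" and "automobile" are near-synonyms
                  [0.9, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

    d1 = np.array([1.0, 0.0, 0.0])      # a document mentioning only "car"
    d2 = np.array([0.0, 1.0, 0.0])      # a document mentioning only "automobile"

    def semantic_kernel(x, y):
        return float(x @ S @ y)

    print(semantic_kernel(d1, d2))      # 0.9: related despite sharing no word
    print(float(d1 @ d2))               # 0.0: the plain dot product sees them as unrelated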

   

SPEAKER:             Mario Marchand, School of Information Technology and Engineering, University of Ottawa

DATE:                    Thursday 9th November, 2000  

TITLE:                    Learning with the Set Covering Machine  

ABSTRACT:         We generalize the classical algorithms of Valiant and Haussler for learning conjunctions and disjunctions of Boolean attributes to the problem of learning these functions over arbitrary sets of features. The result is a general-purpose learning machine, suitable for practical learning tasks, that we call the Set Covering Machine. We present a version of the Set Covering Machine that uses generalized balls for its set of features and compare its performance to that of the well-known Support Vector Machine.
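
The sketch below is a rough, hypothetical rendering of the greedy idea behind such a machine: build a conjunction by repeatedly picking the Boolean feature that excludes the most remaining negative examples while keeping all positives, in the spirit of the classical set-cover greedy. The data-dependent ball features and the exact selection rule of the Set Covering Machine are not reproduced here.

    # Greedy conjunction learning in the set-cover spirit (toy sketch).
    # features[j][i] is the value of Boolean feature j on example i.
    positives = [0, 1]            # indices of positive examples
    negatives = [2, 3, 4]         # indices of negative examples
    features = [
        [1, 1, 0, 1, 0],          # feature 0
        [1, 1, 1, 0, 0],          # feature 1
        [1, 1, 1, 1, 1],          # feature 2 (useless: never excludes a negative)
    ]

    def consistent_on_positives(j):
        return all(features[j][i] == 1 for i in positives)

    chosen, uncovered = [], set(negatives)
    while uncovered:
        # pick the feature that rules out the most still-uncovered negatives
        best = max((j for j in range(len(features)) if consistent_on_positives(j)),
                   key=lambda j: sum(1 for i in uncovered if features[j][i] == 0))
        chosen.append(best)
        uncovered -= {i for i in uncovered if features[best][i] == 0}
    print("conjunction of features:", chosen)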

   

SPEAKER:             Fabrizio Sebastiani (co-authors: Alessandro Sperduti and Nicola Valdambrini), Information Engineering research group, Italian National Research Council, Pisa, Italy

DATE:                    Friday 10th November, 2000  

TITLE:                    An improved boosting algorithm and its application to automated text categorization  

ABSTRACT:         We describe AdaBoost.MH^{KR}, an improved boosting algorithm, and its application to text categorization. Boosting is a supervised learning method which has successfully been applied to many different domains, and which has proven to be one of the best performers in text categorization exercises so far. Boosting relies on the collective judgment of a committee of classifiers that are trained sequentially. In training the i-th classifier, special emphasis is placed on the correct categorization of the training documents that have proven harder for the previously trained classifiers. AdaBoost.MH^{KR} is based on the idea of building, at every iteration of the learning phase, not a single classifier but a sub-committee of the K classifiers that look most promising at that iteration. We report the results of systematic experiments with this method on the standard Reuters-21578 benchmark. These experiments show that AdaBoost.MH^{KR} is both more efficient to train and more effective than the original AdaBoost.MH^{R} algorithm.
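
For readers unfamiliar with the base mechanism, here is a compact sketch of the standard binary AdaBoost loop on one-feature decision stumps: at each round, examples misclassified by the current weak learner are reweighted upward so the next learner focuses on them. This is the generic scheme only; the multi-label AdaBoost.MH variants and the K-sub-committee extension discussed in the talk are not shown, and all data are invented.

    # Standard binary AdaBoost with threshold stumps (generic sketch, not AdaBoost.MH^{KR}).
    import numpy as np

    X = np.array([0.1, 0.2, 0.35, 0.8, 0.9])    # one-dimensional toy data
    y = np.array([-1, -1, 1, 1, 1])             # labels in {-1, +1}
    w = np.ones(len(X)) / len(X)                # example weights, initially uniform

    ensemble = []                               # list of (alpha, threshold, sign)
    for _ in range(3):
        # pick the stump (threshold, sign) with the lowest weighted error
        best = min(((t, s) for t in X for s in (1, -1)),
                   key=lambda ts: np.sum(w[np.sign((X - ts[0]) * ts[1] + 1e-12) != y]))
        t, s = best
        pred = np.sign((X - t) * s + 1e-12)
        err = max(np.sum(w[pred != y]), 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        w = w * np.exp(-alpha * y * pred)       # boost weights of misclassified examples
        w /= w.sum()
        ensemble.append((alpha, t, s))

    def predict(x):
        return np.sign(sum(a * np.sign((x - t) * s + 1e-12) for a, t, s in ensemble))

    print([predict(x) for x in X])              # recovers the training labels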

   

SPEAKER:             Evangelos Milios (joint work with Kori MacCara and Nathalie Japkowicz), Faculty of Computer Science, Dalhousie University

DATE:                    Thursday 16th November, 2000  

TITLE:                    Text Similarity Based on a Lexical Ontology  

ABSTRACT:         Measuring the similarity of documents is a key task in information retrieval. The classical vector representation of documents based on TF-IDF information risks classifying documents that have the same semantic content but use different words as dissimilar. The problem is more serious for short documents (for example, abstracts). Our research is motivated by the desire to build focused Web crawlers that rely on potentially very short text fragments, such as anchor text, to decide on the relevance of links. In this talk we describe a novel domain-specific text similarity measure that relies on a hierarchical lexical ontology to assess the similarity of documents at various levels of semantic detail. The approach is inspired by scale-space techniques for measuring shape similarity at various levels of geometric or visual detail.
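
For reference, here is a minimal sketch of the baseline the talk contrasts with: TF-IDF vectors compared by cosine similarity, which scores zero for two texts that share no surface words even when they are semantically close. The ontology-based, multi-level measure itself is not reproduced; the documents below are invented.

    # Baseline TF-IDF + cosine similarity (the representation the talk's measure improves on).
    import math
    from collections import Counter

    docs = [["car", "engine", "repair"], ["automobile", "motor", "fix"], ["banana", "bread"]]

    def tfidf(doc):
        n = len(docs)
        df = lambda w: sum(1 for d in docs if w in d)
        tf = Counter(doc)
        return {w: tf[w] * math.log(n / df(w)) for w in tf}

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in a if w in b)
        den = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
        return num / den if den else 0.0

    v = [tfidf(d) for d in docs]
    print(cosine(v[0], v[1]))   # 0.0: near-synonymous documents, but no shared words
    print(cosine(v[0], v[2]))   # 0.0 as well: the baseline cannot tell these cases apart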

   

SPEAKER:             Joel Martin, Institute for Information Technology, National Research Council of Canada  

DATE:                    Thursday 23rd November, 2000  

TITLE:                    Surface Features for Text Classification: Beyond Words and NGrams  

ABSTRACT:         What features should be used to classify text?  Words, phrases, ngrams and their frequencies are often used because they are efficient to calculate and work quite well.  Unlike other recent talks, I'm going to ignore background knowledge and rely solely on surface-level features.  Moreover, I am going to take an extreme perspective and consider a text feature to be a partially ordered set of substrings. For example, this abstract has the text feature '1:"classif", 2:"phrase", before(1,2)' occurring five times.  In addition to words, phrases, and ngrams, this definition of a text feature includes all substrings and all subsequences.  Partly because of this power, these text features have three problems.  First, it is not practical to enumerate all such features.  I address this problem using simpler feature types and kernels for support vector machines.  Second, most of these features are nuisance features and will lead to spurious correlations.  Third, the complex features may carry no more information than is already available from single words.  I try to address these last two problems empirically, but will only present preliminary evidence with some artificial collections and one real-world document collection.
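
To make the feature definition concrete, here is a small, hypothetical counter for a two-part feature of the kind given in the abstract: it counts how many times one substring occurrence precedes another in a text. The counting convention (pairs of start offsets) is an assumption; the talk may define occurrences differently.

    # Count occurrences of a text feature '1:s1, 2:s2, before(1,2)':
    # pairs of positions where an occurrence of s1 starts before an occurrence of s2.
    def starts(text, sub):
        """All start offsets of sub in text (overlaps allowed)."""
        i, out = text.find(sub), []
        while i != -1:
            out.append(i)
            i = text.find(sub, i + 1)
        return out

    def count_before(text, s1, s2):
        pos2 = starts(text, s2)
        return sum(sum(1 for j in pos2 if i < j) for i in starts(text, s1))

    text = "we classify phrases, then classify the remaining phrases again"
    print(count_before(text, "classif", "phrase"))   # 3: each "classif" before each later "phrase"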

   

SPEAKER:             Nathalie Japkowicz, School of Information Technology and Engineering, University of Ottawa

DATE:                    Thursday 30th November, 2000  

TITLE:                    A Combination Scheme for Learning from Imbalanced Data Sets  

ABSTRACT:         When learning algorithms designed to maximize accuracy are presented with class-imbalanced data sets, they typically produce biased classifiers which have high predictive accuracy on the over-represented class but low predictive accuracy on the under-represented class. Several methods have previously been proposed to address this problem, including methods that compensate for the asymmetry in the training set and methods that ignore it. This talk is concerned with methods that compensate for the class imbalance.  In particular, it addresses the problem of how to tune these methods optimally.  In the first part, I will present an experimental study designed to derive optimal parameters in domains of increasing complexity and increasing imbalance. I will show how different tunings of our compensation approach yield different solutions to our classification problems, suggesting that a combination approach may be more effective than the use of a single classifier with optimal tuning. The second part of the talk will present the combination scheme that we propose, as well as the results it yielded on a text classification task.
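
As a rough, hypothetical illustration of combining differently compensated classifiers, the sketch below trains several copies of a simple learner, each on the minority class oversampled at a different rate, and lets them vote. The resampling rates, the threshold learner, and the voting rule are assumptions made for illustration, not the scheme proposed in the talk.

    # Toy combination scheme for imbalanced data: several classifiers, each trained on the
    # minority class oversampled at a different rate, vote on the final label.
    majority = [(x, 0) for x in [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.5]]
    minority = [(x, 1) for x in [0.7, 0.9]]

    def train_threshold(data):
        """Pick the threshold that maximizes accuracy on its own training data."""
        candidates = sorted(x for x, _ in data)
        acc = lambda t: sum((x > t) == y for x, y in data) / len(data)
        return max(candidates, key=acc)

    def oversample(rate):
        return majority + minority * rate            # duplicate minority examples 'rate' times

    committee = [train_threshold(oversample(r)) for r in (1, 3, 5)]

    def predict(x):
        votes = sum(x > t for t in committee)        # majority vote of the committee
        return int(votes > len(committee) / 2)

    print([predict(x) for x in (0.3, 0.8)])          # expected: [0, 1]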

 

SPEAKER:             Petko Valtchev (joint work with Rokia Missaoui and Pierre Lebrun), Université du Québec à Montréal  

DATE:                    Thursday 7th December, 2000  

TITLE:                    A Partition-based Approach Towards Building Galois (concept) Lattices  

ABSTRACT:         Galois lattices and formal concept analysis of binary relations have attracted significant attention because they have proven useful in the resolution of many problems of theoretical and practical interest. Recent applications, especially in data mining and software engineering, emphasize the need for algorithms that build the lattice both efficiently and flexibly.  In this talk we present a novel approach to lattice building based on the apposition of binary relation fragments, previously used for visualization purposes. We extend the existing theory into a complete characterization of the global Galois (concept) lattice as a substructure of the direct product of the partial lattices related to the two relation fragments. These structural properties underlie a procedure that extracts the global lattice from the direct product, which in turn is the basis for a full-scale divide-and-conquer lattice-building algorithm. Our solution represents a unified framework for both incremental and batch approaches, and therefore enables the design of novel algorithms implementing hybrid batch-incremental strategies in a flexible, data-driven way. We then provide a complexity analysis together with some results on the practical performance of the algorithm. We also identify a class of binary relations for which our algorithm outperforms some known lattice-building methods, and discuss the manifold impact of our study on the current state of knowledge in the domain.
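
For readers new to formal concept analysis, here is a small, self-contained sketch that enumerates the formal concepts (closed object/attribute pairs) of a tiny binary relation by brute force. It only illustrates what the lattice's elements are; the partition- and apposition-based construction presented in the talk is far more refined, and the example context is invented.

    # Brute-force enumeration of the formal concepts of a small binary context.
    # A concept is a pair (A, B): A = all objects having every attribute in B,
    # and B = all attributes shared by every object in A.
    from itertools import combinations

    objects = {"o1": {"a", "b"}, "o2": {"b", "c"}, "o3": {"a", "b", "c"}}
    attributes = {"a", "b", "c"}

    def common_attrs(objs):
        return set.intersection(*(objects[o] for o in objs)) if objs else set(attributes)

    def objs_with(attrs):
        return {o for o, owned in objects.items() if attrs <= owned}

    concepts = set()
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            A = objs_with(common_attrs(set(subset)))      # closure of the object set
            B = common_attrs(A)
            concepts.add((frozenset(A), frozenset(B)))
    for A, B in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
        print(sorted(A), sorted(B))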

 
