The Text Analysis and Machine Learning Group
SPEAKER: Igor Jurisica, Ontario Cancer Institute
DATE:
TITLE: Systematic Knowledge Management:
ABSTRACT:
Biomedical domains are characterized by substantial amounts of complex data, many unknowns, a lack of complete theories, and rapid evolution. In such domains, reasoning is often based on experience rather than on general knowledge. Experts remember positive experiences (cases) for possible reuse of their solutions; negative experiences are used to avoid potentially unsuccessful outcomes. In general, knowledge management systems support the representation, organization, acquisition, creation, usage, and evolution of knowledge in its many forms. Knowledge-based decision-support systems can successfully support problem solving by aiding classification, diagnosis, planning, or prediction. The competence and performance of these systems depend on the knowledge, its representation schema, and the reasoning mechanism used. It is usually desirable to have an expressive knowledge representation and a scalable reasoning mechanism, where both are also understandable to human users. In addition, the system must be flexible in order to support domain knowledge evolution.

Existing decision-support systems have mostly been designed and implemented by knowledge engineers, using domain expert knowledge compiled into appropriate representation schemes. However, manually evolving the system to meet changing user needs and domain knowledge is neither efficient nor even feasible in many situations. Preferably, we would like the system to assist in knowledge evolution automatically. We show how a case-based reasoning (CBR) system can be used for knowledge management in biology and medicine. CBR is an important paradigm for knowledge management because: (a) it is similar to human problem solving and thus can complement the user; (b) it supports evolving domain models and thus can help to increase domain understanding; and (c) it diminishes the problem of exceptions and over-generalizations.

We discuss extensions of the CBR paradigm with automated image-analysis and knowledge-discovery tools. These are used to support systematic knowledge management, and thus help to ensure that important information is not missing or incorrect, that the representation formalism is not unduly limited, and that reasoning is not restricted to only a subset of the available experience.
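As a minimal sketch of the retrieve-and-reuse cycle at the heart of CBR (the case features, weights, and similarity measure below are invented for illustration; they are not details of the system described above):

```python
# Minimal case-based reasoning retrieval: find the stored case most
# similar to a new problem and reuse its solution.
# All feature names and weights here are invented for illustration.

def similarity(problem, case_features, weights):
    """Weighted overlap between two feature dictionaries (0..1)."""
    total = sum(weights.values())
    score = sum(w for f, w in weights.items()
                if problem.get(f) == case_features.get(f))
    return score / total

case_base = [
    {"features": {"tissue": "breast", "stain": "high", "grade": 2},
     "solution": "protocol A"},
    {"features": {"tissue": "lung", "stain": "low", "grade": 3},
     "solution": "protocol B"},
]
weights = {"tissue": 2.0, "stain": 1.0, "grade": 1.0}

def retrieve(problem):
    """Return the nearest case; its solution is the candidate for reuse."""
    return max(case_base,
               key=lambda c: similarity(problem, c["features"], weights))

new_problem = {"tissue": "breast", "stain": "low", "grade": 2}
print(retrieve(new_problem)["solution"])  # reuse: "protocol A"
```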
SPEAKER: George Foster, Département d'informatique et recherche opérationnelle
DATE:
TITLE: A Maximum Entropy Translation Model
ABSTRACT:
Statistical translation (from one human language to another) can be cast as a classification problem which requires that diverse sources of predictive information be combined in an effective way. Two statistical machine learning techniques which provide flexible frameworks for combining different information sources are linear interpolation and Maximum Entropy / Minimum Divergence (MEMD) modelling. In this talk, I will describe a simple linear translation model and an equivalent MEMD model, and compare their performance using a theoretical measure. I will also illustrate the practical differences between these models by showing how they perform in a prototype text-prediction tool for translators.
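To make the contrast concrete, here is a hedged sketch of the two combination schemes over two invented information sources (the scores and weights are made up; this is not Foster's actual model):

```python
import math

# Two invented information sources scoring candidate target words w
# for some source context: a translation score and a language-model score.
trans = {"maison": 0.7, "domicile": 0.2, "foyer": 0.1}
lm    = {"maison": 0.4, "domicile": 0.5, "foyer": 0.1}

# Linear interpolation: a weighted average of the component distributions.
lam = 0.6
linear = {w: lam * trans[w] + (1 - lam) * lm[w] for w in trans}

# Log-linear (maximum-entropy style) combination: a weighted product of
# the components, renormalized to sum to one.
alpha, beta = 1.0, 0.8
unnorm = {w: math.exp(alpha * math.log(trans[w]) + beta * math.log(lm[w]))
          for w in trans}
z = sum(unnorm.values())
memd = {w: v / z for w, v in unnorm.items()}

print(linear)
print(memd)
```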
SPEAKER: Tony White, Systems and Computer Engineering
DATE:
TITLE: SynthECA: A Synthetic Ecology of Chemical Agents
ABSTRACT:
Emergence, or self-organization, is increasingly associated with distributed or decentralized problem solving. Naturally occurring systems demonstrate that complex problem solving with simple agents is possible, and that such systems are extremely robust to changes in their environment and to the loss of individual agents. Furthermore, such systems are self-managing, requiring no "blind watchmaker" to ensure that the system operates efficiently. Systems that depend upon the interaction of a large number of simple agents for their problem-solving properties stand in stark contrast to the monolithic agent systems that have been the target of considerable research. The frailty of symbolic systems in the face of unforeseen situations and the failure of individual agents provides motivation for the research documented in this talk. Social insect behaviour, and swarm systems generally, provide the inspiration for the research described here, as does the increasing importance of activation-oriented computing. This talk proposes an agent architecture that relies on local communication only, with no algorithm having knowledge of the global scene. The agent architecture is analysed in terms of the requirements necessary to support emergent problem solving. Solutions to problems in the communications domain, specifically routing and fault localization, demonstrate the utility of the proposed architecture, along with algorithms that demonstrate the self-management of agent swarms.
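A toy sketch of the kind of local, pheromone-style mechanism such agents rely on (the network, update rule, and constants are assumptions for illustration, not SynthECA's actual chemical-agent design):

```python
import random

# Agents deposit "chemical" markers on links; each agent sees only the
# marker levels on links leaving its current node (local information only).
links = {("A", "B"): 1.0, ("A", "C"): 1.0}  # initial marker concentration
EVAPORATION, DEPOSIT = 0.05, 1.0

def choose_next(node):
    """Pick an outgoing link with probability proportional to its marker."""
    out = [(l, p) for l, p in links.items() if l[0] == node]
    r = random.uniform(0, sum(p for _, p in out))
    for link, p in out:
        r -= p
        if r <= 0:
            return link
    return out[-1][0]  # guard against floating-point residue

def step(node, good_link):
    for l in links:                 # evaporation keeps the system adaptive
        links[l] *= (1 - EVAPORATION)
    link = choose_next(node)
    if link == good_link:           # reinforce links that worked
        links[link] += DEPOSIT

for _ in range(100):
    step("A", ("A", "B"))
print(links)  # marker concentrates on the successful route A -> B
```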
SPEAKER: Robert C. Holte
DATE:
TITLE: Combinatorial Auctions and Knapsack Problems
ABSTRACT:
This talk will largely be tutorial in nature. I will introduce "combinatorial auctions", a topic of much recent interest in AI which naturally generalizes to the multi-dimensional knapsack problem. I will discuss applications (including weighted constraint satisfaction problems) and review the main algorithms and results for these problems. I will also present new results showing that simple hill-climbing, with an appropriate objective function, does well on these problems.
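A minimal sketch of hill-climbing on the winner-determination problem of a combinatorial auction (the bids, the flip move set, and the plain revenue objective are invented; the objective function used in the talk may differ):

```python
# Winner determination: choose a set of non-conflicting bids
# (no item sold twice) maximizing total revenue.
bids = [({"a", "b"}, 10), ({"b", "c"}, 8), ({"c"}, 5), ({"a"}, 4)]

def feasible(chosen):
    """True if no item appears in two accepted bids."""
    items = set()
    for i in chosen:
        if bids[i][0] & items:
            return False
        items |= bids[i][0]
    return True

def value(chosen):
    return sum(bids[i][1] for i in chosen)

def hill_climb():
    """Flip single bids in/out while any flip improves revenue."""
    current, improved = set(), True
    while improved:
        improved = False
        for i in range(len(bids)):
            cand = current ^ {i}
            if feasible(cand) and value(cand) > value(current):
                current, improved = cand, True
    return current

best = hill_climb()
print(best, value(best))  # accepts bids {a,b} and {c}: revenue 15
```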
SPEAKER:
DATE:
TITLE: Support Data Machines and Semantic Kernels for Text
ABSTRACT:
The learning principle of Support Vector Machines can be applied to any data (not only vectors) that can be described as proximity data by means of a kernel. In many pattern recognition applications, this leads one to consider using more expressive, possibly structured, representations of the data, combined with well-founded statistical learning tools. Text categorization, and more generally information retrieval tasks, offer a very promising application domain for testing these ideas. We explore and compare different ways of representing documents, relying on kernels that encode semantic relationships between words. Results on some text categorization tasks are presented and discussed.
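One way to read "kernels that encode semantic relationships" is the following sketch, where a word-relatedness matrix S smooths the bag-of-words comparison (S here is a made-up stand-in for a real semantic resource):

```python
import numpy as np

# Bag-of-words vectors over the vocabulary [car, automobile, banana].
d1 = np.array([1.0, 0.0, 0.0])   # "car"
d2 = np.array([0.0, 1.0, 0.0])   # "automobile"

# Plain linear kernel: synonyms look completely dissimilar.
print(d1 @ d2)                   # 0.0

# Semantic smoothing matrix: S[i, j] encodes relatedness of words i and j.
S = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

def semantic_kernel(x, y):
    """k(x, y) = x^T S S^T y: documents are compared after being
    mapped through the word-relatedness matrix."""
    return x @ S @ S.T @ y

print(semantic_kernel(d1, d2))   # > 0: the synonymy is now visible
```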
SPEAKER: Mario Marchand
DATE:
TITLE: Learning with the Set Covering Machine
ABSTRACT:
We generalize the classical algorithms of Valiant and Haussler for learning conjunctions and disjunctions of Boolean attributes to the problem of learning these functions over arbitrary sets of features. The result is a general-purpose learning machine, suitable for practical learning tasks, that we call the Set Covering Machine. We present a version of the Set Covering Machine that uses generalized balls for its set of features, and compare its performance to that of the famous Support Vector Machine.
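A hedged sketch of the greedy set-covering step that gives the machine its name, using simple Boolean threshold features rather than the generalized balls of the talk (all data and features are invented):

```python
# Set Covering Machine idea (greedy sketch): build a conjunction of
# Boolean features, each chosen to "cover" (reject) as many negative
# examples as possible while keeping every positive example accepted.

positives = [(1, 1), (1, 0)]
negatives = [(0, 0), (0, 1), (1, 2)]

features = {
    "x0 == 1": lambda x: x[0] == 1,
    "x1 <= 1": lambda x: x[1] <= 1,
}

def greedy_scm():
    chosen, uncovered = [], list(negatives)
    while uncovered:
        # Keep only features consistent with every positive example.
        usable = {n: f for n, f in features.items()
                  if all(f(p) for p in positives) and n not in chosen}
        # Pick the feature rejecting the most still-uncovered negatives.
        name, f = max(usable.items(),
                      key=lambda nf: sum(not nf[1](x) for x in uncovered))
        chosen.append(name)
        uncovered = [x for x in uncovered if f(x)]
    return chosen

print(greedy_scm())  # the conjunction of the selected features fits the data
```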
SPEAKER: Fabrizio Sebastiani (co-authors: Alessandro Sperduti and Nicola Valdambrini)
DATE: Friday 10th November
TITLE: An improved boosting algorithm and its application to automated text categorization
ABSTRACT:
We describe AdaBoost.MH^{KR}, an improved boosting algorithm, and its application to text categorization. Boosting is a method for supervised learning which has successfully been applied to many different domains and has proven to be one of the best performers in text categorization exercises so far. Boosting is based on the idea of relying on the collective judgment of a committee of classifiers that are trained sequentially. In training the i-th classifier, special emphasis is placed on the correct categorization of the training documents which have proven harder for the previously trained classifiers. AdaBoost.MH^{KR} is based on the idea of building, at every iteration of the learning phase, not a single classifier but a sub-committee of the K classifiers which, at that iteration, look the most promising. We report the results of systematic experimentation with this method on the standard Reuters-21578 benchmark. These experiments show that AdaBoost.MH^{KR} is both more efficient to train and more effective than the original AdaBoost.MH^{R} algorithm.
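A compact sketch of the reweighting idea that boosting rests on, using binary labels and decision stumps (this is generic AdaBoost, not the AdaBoost.MH^{KR} variant itself):

```python
import math

# Toy 1-D data: (x, label) with labels in {-1, +1}.
data = [(0.1, 1), (0.3, 1), (0.5, -1), (0.8, -1), (0.9, 1)]
weights = [1 / len(data)] * len(data)

def stump(theta):
    return lambda x: 1 if x < theta else -1

def best_stump():
    """Pick the threshold with the lowest *weighted* error."""
    cands = [stump(t) for t in (0.2, 0.4, 0.6, 1.0)]
    errs = [sum(w for (x, y), w in zip(data, weights) if h(x) != y)
            for h in cands]
    e = min(errs)
    return cands[errs.index(e)], e

ensemble = []
for _ in range(3):
    h, err = best_stump()
    alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
    ensemble.append((alpha, h))
    # Increase the weight of examples this classifier got wrong, so the
    # next round focuses on the "hard" documents.
    weights = [w * math.exp(-alpha * y * h(x))
               for (x, y), w in zip(data, weights)]
    z = sum(weights)
    weights = [w / z for w in weights]

def predict(x):
    """Committee vote, weighted by each member's reliability alpha."""
    return 1 if sum(a * h(x) for a, h in ensemble) > 0 else -1

print([predict(x) for x, _ in data])
```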
SPEAKER: Evangelos Milios (joint work with Kori MacCara and Nathalie Japkowicz)
DATE:
TITLE: Text Similarity Based on a Lexical Ontology
ABSTRACT:
Measuring the similarity of documents is a key task in information retrieval. The classical vector representation of documents based on TF-IDF information suffers from the potential of classifying as dissimilar documents that have the same semantic content but use different words. The problem is more serious for short documents (for example, abstracts). Our research is motivated by the desire to build focused Web crawlers that rely on potentially very short text fragments, such as anchor text, to decide on the relevance of links. In this talk we describe a novel domain-specific text similarity measure that relies on a hierarchical lexical ontology to assess the similarity of documents at various levels of semantic detail. The approach is inspired by scale-space techniques for measuring shape similarity at various levels of geometric or visual detail.
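A toy sketch of the coarse-to-fine comparison (the miniature hierarchy is hand-made; the actual measure uses a full lexical ontology and a more refined matching):

```python
# Each word maps to its chain of ancestors in a tiny hand-made hierarchy;
# truncating the chains at depth k compares documents at a coarser level
# of semantic detail, echoing scale-space matching.
ontology = {
    "dog":    ["entity", "animal", "canine", "dog"],
    "wolf":   ["entity", "animal", "canine", "wolf"],
    "banana": ["entity", "plant", "fruit", "banana"],
}

def concepts(words, depth):
    return {ontology[w][min(depth, len(ontology[w]) - 1)] for w in words}

def similarity(doc_a, doc_b, depth):
    """Jaccard overlap of the concepts both documents map to at `depth`."""
    a, b = concepts(doc_a, depth), concepts(doc_b, depth)
    return len(a & b) / len(a | b)

d1, d2 = ["dog"], ["wolf"]
for depth in range(4):
    print(depth, similarity(d1, d2, depth))
# Identical at coarse levels (animal, canine); distinct only at depth 3.
```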
SPEAKER: Joel Martin, Institute for Information Technology, National Research Council of Canada
DATE:
TITLE: Surface Features for Text Classification: Beyond Words and NGrams
ABSTRACT:
What features should be used to classify text? Words, phrases, ngrams, and their frequencies are often used because they are efficient to calculate and work quite well. Unlike other recent talks, I'm going to ignore background knowledge and rely solely on surface-level features. Moreover, I am going to take an extreme perspective and consider a text feature to be a partially ordered set of substrings. For example, this abstract has the text feature '1:"classif", 2:"phrase", before(1,2)' occurring five times. In addition to words, phrases, and ngrams, this definition of a text feature includes all substrings and all subsequences.

Partly because of this power, these text features have three problems. First, it is not practical to enumerate all such features; I address this problem using simpler feature types and kernels for support vector machines. Second, most of these features are nuisance features and will lead to spurious correlations. Third, the complex features may carry no more information than is already available from single words. I try to address these last two problems empirically, but will only present preliminary evidence with some artificial collections and one real-world document collection.
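The feature from the example can be evaluated in a few lines; the sketch below counts occurrences of an ordered pair of substrings (the convention of pairing every occurrence of the first string with every later occurrence of the second is one assumption among several possible):

```python
# A text feature as a partially ordered set of substrings:
# here, the pair (s1, s2) with the constraint before(1, 2).

def occurrences(text, sub):
    """Start offsets of every (possibly overlapping) occurrence of sub."""
    out, i = [], text.find(sub)
    while i != -1:
        out.append(i)
        i = text.find(sub, i + 1)
    return out

def count_before(text, s1, s2):
    """Count pairs of occurrences where s1 starts before s2."""
    pos1, pos2 = occurrences(text, s1), occurrences(text, s2)
    return sum(1 for i in pos1 for j in pos2 if i < j)

text = "to classify text, a classifier maps each phrase to a class"
print(count_before(text, "classif", "phrase"))  # 2: both "classif" precede "phrase"
```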
SPEAKER: Nathalie Japkowicz
DATE:
TITLE: A Combination Scheme for Learning from Imbalanced Data Sets
ABSTRACT:
When learning algorithms designed to maximize accuracy are presented with class-imbalanced data sets, they typically produce biased classifiers which have high predictive accuracy over the over-represented class but low predictive accuracy over the under-represented class. Several methods have previously been proposed to address this problem, including methods that compensate for the asymmetry in the training set and methods that ignore this asymmetry. This talk is concerned with methods that compensate for the class imbalance; in particular, it addresses the problem of how to tune these methods optimally.

In the first part, I will present an experimental study designed to derive optimal parameters in domains of increasing complexity and increasing imbalance. I will show how different tunings of our compensation approach yield different solutions to our classification problems, suggesting that a combination approach may be more effective than the use of a single classifier with optimal tuning. The second part of the talk will present the combination scheme that we propose, as well as the results it yielded on a text classification task.
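A hedged sketch of the compensate-then-combine idea, using random oversampling at several rates and a majority vote (the trivial threshold classifier and the data are stand-ins; the scheme presented in the talk may differ):

```python
import random

# Imbalanced toy data: 1-D points, minority class +1 (5 of 100 examples).
majority = [(random.uniform(0.0, 1.0), -1) for _ in range(95)]
minority = [(random.uniform(0.8, 1.2), 1) for _ in range(5)]

def train_threshold(data):
    """A trivial 'classifier': the threshold minimizing training error."""
    best_t, best_err = 0.0, float("inf")
    for t in [x for x, _ in data]:
        err = sum((1 if x >= t else -1) != y for x, y in data)
        if err < best_err:
            best_t, best_err = t, err
    return best_t

# Compensate at several oversampling rates, then combine by majority vote.
thresholds = []
for rate in (1, 5, 19):  # 19x roughly rebalances the 95:5 split
    sample = majority + minority * rate
    thresholds.append(train_threshold(sample))

def predict(x):
    votes = sum(1 if x >= t else -1 for t in thresholds)
    return 1 if votes > 0 else -1

print(predict(0.9), predict(0.3))
```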
SPEAKER: Petko Valtchev (Rokia Missaoui & Pierre Lebrun), Université du Québec à
DATE:
TITLE: A Partition-based Approach Towards Building Galois (concept) Lattices
ABSTRACT:
Galois lattices and formal concept analysis of binary relations have attracted significant attention, as they have proven useful in the resolution of many problems of theoretical or practical interest. Recent applications, especially in data mining and software engineering, put the emphasis on the need for algorithms that build the lattice both efficiently and flexibly.

In this talk we present a novel approach to lattice building based on the apposition of binary relation fragments, previously used for visualization purposes. We extend the existing theory into a complete characterization of the global Galois (concept) lattice as a substructure of the direct product of the partial lattices related to the relation fragments. The structural properties underlie a procedure that extracts the global lattice from the direct product, and which is the basis for a full-scale divide-and-conquer lattice-building algorithm. Our solution represents a unified framework for both incremental and batch approaches, and therefore enables the design of novel algorithms implementing hybrid batch-incremental strategies in a flexible, data-driven way. We then provide a complexity analysis together with some results on the practical performance of the algorithm. We also indicate a class of binary relations for which our algorithm outperforms some known lattice-building methods, and discuss the manifold impact of our study on the current state of knowledge in the domain.
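As a baseline for what such algorithms compute, here is a naive sketch that enumerates the formal concepts of a toy binary context by brute force (the divide-and-conquer method of the talk is, of course, far more refined):

```python
from itertools import combinations

# A small formal context: objects x attributes (a binary relation).
context = {
    "o1": {"a", "b"},
    "o2": {"a", "c"},
    "o3": {"a", "b", "c"},
}
attributes = {"a", "b", "c"}

def common_attrs(objs):
    """Attributes shared by every object in objs (all attributes if empty)."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def common_objs(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

# A formal concept is a pair (extent, intent) closed under both maps.
concepts = set()
for r in range(len(context) + 1):
    for objs in combinations(context, r):
        intent = common_attrs(set(objs))
        extent = common_objs(intent)
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(extent), sorted(intent))
```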