The Text Analysis and Machine Learning Group
SPEAKER: Takashi Gomi, Applied AI Systems, Inc.
DATE:
PLACE: Room 318, MacDonald Hall
TITLE: Non-Cartesian Robotics and Evolutionary Robotics
ABSTRACT:
A new trend in intelligent robotics began with the appearance of
Behavior-Based Artificial Intelligence, exemplified by Rodney Brooks'
subsumption architecture (MIT 1986). Behavior-based
AI offers the potential to create fast, flexible, robust, extremely
efficient, animate, and economical robots using methods much simpler than
those employed in conventional robot development.
Robots developed using this methodology have been called non- or
post-Cartesian because of their
great divergence from the philosophical perspective underlying conventional
robotics. Debate between the two
schools recently spilled over into general AI research in the United States,
where behavior-based AI (more commonly called New AI in Europe) challenges
the notion of intelligence advocated for
the past 25 years by conventional AI researchers (now often termed GOFAI, or
Good Old Fashioned AI). While
efforts to identify and formalize the origin of the desirable
characteristics mentioned above are in general speculative and at best
tentative, research and development in the new paradigm has progressed
steadily for a decade and is on the verge of producing applications which
can be used in the real world. Dr.
Gomi will provide examples highlighting
how behavior-based robotics achieves each of the listed characteristics. A
list of Dr. Gomi's recent publications on behavior-based AI is available on
request.

BIO:
Takashi Gomi was born in

SPEAKER: Rokia Missaoui, UQAM, email: missaoui.rokia@uqam.ca
DATE:
PLACE: Room 318, MacDonald Hall
TITLE: Mining from Data Fragments
ABSTRACT:
In order to reduce the complexity of the knowledge discovery process,
one can perceive the mining of a very large data set as the mining of a set
of data fragments obtained from a decomposition of the original input. In
this talk, we first give some background about formal concept analysis (Ganter
and Wille 1999), which is a conceptual clustering approach. From the context
(O, P, R) describing a set O of objects, a set P of descriptors and a binary
relation R between O and P, a unique ordered set can be derived, which
describes the inherent lattice structure defining natural groupings and
relationships among the objects and their descriptors. This structure is
known as a concept lattice and allows the identification of concepts and
rules. We then present an approach to mining a set of data fragments
resulting from a vertical decomposition of a relational table (or context).
The approach makes use of formal concept analysis and explores the power of
nested line diagrams, which express the nesting of concept lattices
corresponding to individual fragments. The approach is useful for many
reasons:
- It enhances the readability and interpretation of the discovered knowledge
  by offering a good visualization tool.
- It offers a progressive and exploratory way to perform data mining.
- The discovery of concepts and rules from a given input can be handled by
  exploring independently and separately each fragment resulting from a
  vertical decomposition of that input.
Our contribution lies in the development of procedures for the following tasks:
(i) building lattices at two and even higher levels of nesting, (ii)
creating mappings between nested and un-nested lattices, and (iii)
generating concepts and rules from the nested structure without considering
the input as a whole. This work can be useful for distributed and parallel
mining.
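As a small illustration of the formal concept analysis background (a toy sketch only, not the procedures developed in this work, and assuming the relation R is given as a Python dict from objects to their descriptor sets), the code below derives all formal concepts of a context (O, P, R) by brute-force closure; ordering these concepts by inclusion of their extents gives the concept lattice.

    from itertools import combinations

    # Toy context (O, P, R): R maps each object in O to its set of descriptors in P.
    # Illustrative only; not the procedures developed in the talk.
    R = {"o1": {"small", "round"},
         "o2": {"small", "square"},
         "o3": {"large", "round"}}

    def common_descriptors(objs, relation):
        """Descriptors shared by every object in objs (the derivation operator on O)."""
        if not objs:                      # empty object set -> all known descriptors
            return set.union(*relation.values())
        return set.intersection(*(relation[o] for o in objs))

    def common_objects(descs, relation):
        """Objects possessing every descriptor in descs (the derivation operator on P)."""
        return {o for o, attrs in relation.items() if descs <= attrs}

    def concepts(relation):
        """Enumerate all formal concepts (extent, intent) by brute force."""
        objects = list(relation)
        found = set()
        for r in range(len(objects) + 1):
            for subset in combinations(objects, r):
                intent = common_descriptors(set(subset), relation)
                extent = common_objects(intent, relation)
                found.add((frozenset(extent), frozenset(intent)))
        return found

    # Ordering the concepts by inclusion of their extents gives the concept lattice.
    for extent, intent in sorted(concepts(R), key=lambda c: (len(c[0]), sorted(c[0]))):
        print(sorted(extent), sorted(intent))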
SPEAKER: Martin Fontaine, University of Ottawa, email: mfontaine@omnisig.com
DATE: Friday, February 4th, 2000
PLACE: Room 318, MacDonald Hall, University of Ottawa
TITLE: Structural Identification of Unintelligible Documents
ABSTRACT:
This presentation describes some techniques and approaches to solve
the document identification problem in a particular context. We are
trying to classify documents by their structure, not by their content. This
means that we do not assume that the documents are written in a natural
language. The only assumption
made is that the target concept that we are trying to learn (with machine
learning techniques) can be expressed with a regular expression. The
following topics will be discussed:
1. A new approach for textual feature extraction and feature pruning, inspired
   by compression techniques and specially developed to efficiently extract
   features from large training sets containing significantly large documents.
2. The utility of grammatical inference techniques for text classification.
3. A hybrid approach combining grammatical inference and decision trees.
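As a purely hypothetical illustration of what a structural, content-independent target concept expressible as a regular expression might look like (the pattern and example documents below are invented, not taken from the presentation):

    import re

    # Hypothetical target concept expressed as a regular expression over the raw
    # characters of a document: a header line of comma-separated fields followed
    # by one or more lines of comma-separated numbers.  Purely illustrative.
    STRUCTURE = re.compile(
        r"\A[^\n,]+(,[^\n,]+)*\n"                          # header line
        r"(([-+]?\d+(\.\d+)?)(,[-+]?\d+(\.\d+)?)*\n)+\Z"   # numeric record lines
    )

    def matches_structure(document: str) -> bool:
        """Classify a document by its structure, not by its content."""
        return STRUCTURE.match(document) is not None

    print(matches_structure("name,age\n34,7\n12,9\n"))    # True
    print(matches_structure("ordinary running text"))      # False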
SPEAKER: Dr. Richard Nock, Université des Antilles-Guyane (France)
DATE: Thursday, February 3rd, 2000
PLACE: Room 318, MacDonald Hall, University of Ottawa
TITLE: Learning logical formulas having limited size: theoretical aspects and algorithms (with related results)
ABSTRACT:
The work I will present in ML is related to the improvement of
classifiers' accuracy, succinctness and interpretability. Two types of
algorithms I have co-developed will be detailed: the first one preprocesses
the data, the second one unifies and improves the induction of some types of
concept representations. My COLT work has been focused on formal analyses of
some of the ML goals presented before for various kinds of concept
representations. I will present a result on compression algorithms also
known as Ockham's razors, and an application of some results in Network
Supervision. This talk will end by a short presentation of some recent
results on Image Segmentation in optimal time/space complexities, using
specifically developed statistical concentration bounds.

BIO:
Richard Nock was born in France in 1970. In 1993, he received an
engineering degree in Agronomy from the graduate school of engineering:
ENSAM (Ecole Nationale Superieure Agronomique de Montpellier/Institut
National de la Recherche Agronomique, France), and an M.Sc.
in Computer Science from the LIRMM, Univ. Montpellier II (Lab. of Computer
Science, Robotics and Microelectronics, France), ranking first in his class. He received
his PhD in Computer Science in 1998 from the LIRMM, and then served as an
Assistant Professor at the Univ. Montpellier II and the Univ. des Antilles-Guyane.
During his military service in 1996, Richard carried out contractual consulting
work in Statistics and Computer Science for a Research Center on
Virtual Reality and Human Factors of the DGA (General Army Delegation). His
research topics include Machine Learning, Statistics, Algorithmic
Complexity, and Image Processing, with applications to domains such as Data
Mining and Network Supervision. In this talk, I will synthesize some of the
research I have pursued in Machine Learning (ML), Computational Learning
Theory (COLT) and Image Processing.
SPEAKER: Dr. B. John Oommen, Professor, School of Computer Science, Carleton University
DATE: Friday, February 11th, 2000
PLACE: Room A707, Colonel By
TITLE: Parameter Learning from Stochastic Teachers and Stochastic Compulsive Liars
ABSTRACT:
All of the research that has been done in learning has involved
learning from a Teacher who is either deterministic or stochastic. In this
talk, we shall present the first known results of how a learning mechanism
can learn while interacting with either a stochastic teacher or a stochastic
compulsive liar. In the first instance, the teacher intends to teach the
learning mechanism. In the second, the compulsive liar intends to
consciously mislead the learning mechanism. We shall present a formal
strategy for the mechanism to perform ε-optimal learning without knowing
whether it is interacting with a teacher or a compulsive liar. Believe It Or
Not - IT WORKS!
This is joint work with Dr. Govindachari (Bangalore) and Dr. Kuipers (Texas).
SPEAKER: Elizabeth Scarlett, University of Ottawa, email: scarlett@site.uottawa.ca
DATE: Friday, February 18th, 2000
PLACE: Room A707, Colonel By, University of Ottawa
TITLE: An Evaluation of a Rule-Based Parser of English Sentences
ABSTRACT:
Parsers are essential components of many Natural Language Processing
applications. In such
applications, parsing is the process of analyzing and assigning grammatical
types to each word, phrase, and clause in a sentence.
A broad-coverage parser attempts to analyze any grammatical sentence
of a natural language. A parser depends upon a grammar, which is a set of
codified rules of language, to analyze the sentence structure. Often in such
systems, the strength of the parser/grammar has a direct effect on the
quality of the results. Achieving good
results rests on reliable evaluation methods to determine and eliminate
weaknesses in the parser or grammar. DIPETT (Domain Independent Parser of
English Technical Text) is a broad-coverage parser of English technical text
that was developed for a Ph.D. research project (Delisle, 1994), and is used
primarily in the TANKA (Text Analysis for Knowledge Acquisition) project.
The TANKA project seeks to build a model of a technical domain by
semi-automatically processing written text that describes the domain. No
other source of domain-specific knowledge is available. The accuracy and
completeness of a semantic representation generated by TANKA is partly
determined by the accuracy of DIPETT's syntactic analysis of the text. Test
suites have long been accepted in NLP because they provide controlled
data that is systematically organized and documented. I will discuss the use
of test suites for evaluating the performance of parsers of English
sentences, and present two test suites used to evaluate DIPETT. The
evaluation results were used to make significant improvements to DIPETT, and
the test suites were used in part to compare DIPETT's performance to that of
several other publicly available large-coverage parsers.

SPEAKER: Vivi Nastase, University of Ottawa, email: vnastase@csi.uottawa.ca
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: Attempting to unify semantic representations across syntactic levels
ABSTRACT:
There are many forms of linguistic expression, and a speaker may
choose one or another depending on their goal, or on the hearer's background
or expectations. We can use one long composite sentence, or a few short,
simple clauses. We can choose to pack much information into a noun phrase,
or to spread this information over the surrounding clause. We would like
language processing computer systems to recognize the fact that the same
concept/situation/relation is presented, even if it takes different forms. I
consider three syntactic levels of language: multi-clause sentences, clauses,
and noun phrases. My work starts with three separate lists of
semantic relations, one for each of these syntactic levels. The intended
result is to find regularities, or maybe even rules that associate semantic
relations on different levels. I have decomposed this task into several
subtasks. In this talk I will present these subtasks, some theories about
the underlying phenomena, and some preliminary experiments. I will talk
about the next experiment I am preparing, and some lexical-semantic problems
I need to tackle.

SPEAKER: Nathalie Japkowicz, email: nat@paul.rutgers.edu
DATE: Friday, March 17th, 2000
PLACE: Room A707, Colonel By, University of Ottawa
TITLE: Learning from Imbalanced Data Sets
ABSTRACT:
As the field of machine learning makes a rapid transition from the
status of "academic discipline" to that of "applied
science", a myriad of new
issues, not previously considered by the machine learning research
community, is now coming to light. One such issue is the problem of
imbalanced data sets. Indeed, the majority of learning systems
previously designed and tested on toy problems or carefully crafted
benchmark data sets usually assume that the training sets are well
balanced. In the case of concept-learning, for example, classifiers
typically expect that their training set contains as many examples of the
positive as of the negative class. Unfortunately,
this assumption of balance is often violated in real-world settings. Indeed,
there exist many domains for which one class is better represented than the
other. This is the case, for example, in fault-monitoring tasks where
non-faulty examples are plentiful since they typically involve recording
from the machine during normal operation whereas faulty examples involve
recording from a malfunctioning machine, which is not always possible, easy,
or financially worthwhile. The purpose of this talk is 1) to demonstrate
experimentally that, at least in the case of connectionist systems, class
imbalances hinder the performance of standard classifiers; 2) to compare the
performance of several approaches previously proposed to deal with the
problem; 3) to present an elaboration of a scheme previously used on this
problem; and 4) to present a method for combining several schemes in the
context of Text Classification.
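As an illustrative sketch of one of the simplest re-balancing schemes in this area (the function and data below are hypothetical and not taken from the talk), random over-sampling duplicates minority-class examples until both classes are equally represented:

    import random

    # Illustrative sketch only: re-balance a training set by randomly
    # over-sampling the under-represented class(es) until every class has
    # as many examples as the largest one.
    def oversample_minority(examples, label_of):
        """examples: list of training items; label_of: item -> class label."""
        by_class = {}
        for ex in examples:
            by_class.setdefault(label_of(ex), []).append(ex)
        target = max(len(items) for items in by_class.values())
        balanced = []
        for items in by_class.values():
            balanced.extend(items)
            balanced.extend(random.choices(items, k=target - len(items)))
        random.shuffle(balanced)
        return balanced

    # Hypothetical fault-monitoring data: 5 faulty examples vs. 95 normal ones.
    data = [("faulty", 1)] * 5 + [("normal", 0)] * 95
    balanced = oversample_minority(data, label_of=lambda ex: ex[1])
    print(sum(1 for ex in balanced if ex[1] == 1), "faulty of", len(balanced))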
SPEAKER: Ismaïl Biskri, Université du Québec à Trois-Rivières, email: ismail_biskri@uqtr.uquebec.ca
DATE: Friday, March 24th, 2000
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: Hybrid, Semi-Automatic Approaches to NLP
ABSTRACT:
We present hybrid approaches to NLP that result from a synergistic
combination of numerical computations and linguistic grammars (or
linguistic filters). Numerical data mining tools are generally quite robust
but only provide coarse-granularity results; such tools can handle very
large inputs. Computational linguistic tools are able to provide
fine-granularity results but are less robust; such tools usually handle
relatively short inputs. We apply our hybrid approach to knowledge
extraction, multi-word term identification and web customisation (work in
progress). Our software tools are different from most other term
identification or knowledge extraction tools in that they are by design
semi-automatic: i.e., they are interactive and constantly under the user's
control. The software supports the work of the knowledge engineer, the
(corpus) domain expert, or the linguist, by helping them do their job more
efficiently. We justify this semi-automatic approach by the need to have a
more flexible and customisable tool to perform certain term identification
tasks and certain knowledge extraction tasks. More specifically, in some
applications we want to allow the user's perspective, knowledge and
subjectivity to influence the results: all this within certain limits, of
course. An example of such an application on which we are currently working
is that of Web personalisation: to allow individuals to develop their own
vision of information universes of interest to them, we need flexible and
customisable tools that can support them in such a challenging task, not
tools that will impose on them a pseudo-standardised vision of the world.

SPEAKER: Caroline Barriere, University of Ottawa, email: barriere@site.uottawa.ca
DATE: Friday, March 31st, 2000
PLACE: Room A707, Colonel By, University of Ottawa
TITLE: Extraction and classification of causal relations from semi-technical texts
ABSTRACT:
This work investigates the causal relation as it is expressed in
text, more specifically in semi-technical texts.
The purpose of semi-technical texts is to explain notions within a
particular domain to a non-initiated person, which means they contain
useful information for the domain model.
Our goal is twofold: first, we aim at gaining some insight into the
feasibility of automatic extraction of causal relations from text; and
second, we propose a classification of the causal relations to
capture the variations expressed in the text.
SPEAKER: Eibe Frank
DATE: Friday, April 7th, 2000
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: Pruning decision trees and lists with significance tests
ABSTRACT:
Decision trees and lists are potentially powerful predictors and
embody an explicit representation of the structure in a dataset. Their
accuracy and comprehensibility depend on how concisely the learning
algorithm can summarize this structure.
The final model should not incorporate spurious effects---patterns
that are not genuine features of the underlying domain.
Given an efficient mechanism for determining when a particular effect
is due to chance alone, non-predictive parts of a model can be pruned.
Pruning mechanisms require a sensitive instrument that uses the data
to detect whether there is a genuine relationship between the components of
a model and the domain. Statistical
significance tests are theoretically well-founded tools for doing exactly
that. In this talk I will
discuss pruning algorithms for decision trees and lists that are based on
significance tests. I will
explain why pruning is often necessary to obtain small and accurate models
and show that the performance of standard pruning algorithms can be improved
by taking the statistical significance of observations into account.
I will compare the effect of parametric and non-parametric tests,
show why current pruning algorithms for decision lists often prune too
aggressively, and propose a new algorithm for pruning decision lists that is
designed to overcome this problem.
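As a rough, generic illustration of significance-test-based pruning (this is a standard chi-squared test sketch, not the specific algorithms presented in the talk), the code below judges whether a binary split separates the classes more than chance alone would; a split that fails the test would be pruned:

    # Decide whether a binary split in a decision tree reflects a genuine
    # relationship with the class or chance alone, using a chi-squared test.
    def chi_squared(split_counts):
        """split_counts: [[pos_left, neg_left], [pos_right, neg_right]]."""
        total = sum(sum(branch) for branch in split_counts)
        class_totals = [sum(branch[c] for branch in split_counts) for c in (0, 1)]
        stat = 0.0
        for branch in split_counts:
            branch_total = sum(branch)
            for c in (0, 1):
                expected = branch_total * class_totals[c] / total
                if expected > 0:
                    stat += (branch[c] - expected) ** 2 / expected
        return stat

    CRITICAL_1DF_5PCT = 3.841  # chi-squared critical value, 1 degree of freedom, 5% level

    def keep_split(split_counts):
        """Keep the split only if it is statistically significant; otherwise prune."""
        return chi_squared(split_counts) >= CRITICAL_1DF_5PCT

    print(keep_split([[40, 10], [12, 38]]))  # clearly informative split -> True
    print(keep_split([[26, 24], [24, 26]]))  # likely due to chance     -> False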
SPEAKER: Istvan Hernadvolgyi
DATE:
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: Using Pattern Databases to Find Macro Operators
ABSTRACT:
Macro operators (macros) reach subgoals without search. The goal is
reached by applying macros in the order of subgoals. While macro search does
not find optimal solutions, if the macros are nearly optimal the solutions
are short and are obtained very fast. For the first time, we applied
(semi-)automatically generated heuristics to find optimal macro operators
and built optimal macro tables for the Rubik's Cube. We contrast our
technique with two powerful methods: Korf's "bi-directional
partial-match" and the Schreier-Simms algorithm well known in
computational group theory.
http://www.site.uottawa.ca/~istvan/phd/presentations/macro.html
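For readers unfamiliar with pattern databases, here is a minimal sketch (on the 8-puzzle rather than the Rubik's Cube, and not the authors' system): abstracting states down to a few tiles and running a breadth-first search from the abstracted goal yields a table of exact abstract distances, which can serve as an admissible heuristic or as raw material for building macro tables.

    from collections import deque

    # The 8-puzzle is abstracted by keeping only tiles 1-3 and the blank (0);
    # every other tile becomes an indistinguishable '*'.
    def abstract(state):
        return tuple(t if t in (0, 1, 2, 3) else '*' for t in state)

    def successors(state):
        """Slide the blank up/down/left/right on the 3x3 board."""
        b = state.index(0)
        row, col = divmod(b, 3)
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = row + dr, col + dc
            if 0 <= r < 3 and 0 <= c < 3:
                s = list(state)
                s[b], s[r * 3 + c] = s[r * 3 + c], s[b]
                yield tuple(s)

    def build_pattern_database(goal):
        """Exact distance from every abstract state to the abstract goal (backward BFS)."""
        start = abstract(goal)
        dist = {start: 0}
        queue = deque([start])
        while queue:
            state = queue.popleft()
            for nxt in successors(state):
                if nxt not in dist:
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
        return dist

    GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
    pdb = build_pattern_database(GOAL)
    scrambled = (1, 2, 3, 4, 5, 6, 7, 0, 8)
    print(pdb[abstract(scrambled)])   # exact abstract distance (here 1), an admissible lower bound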
SPEAKER: Chris Drummond
DATE: Friday, April 28th, 2000
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: Explicitly Representing Expected Cost: An Alternative to ROC Representation
ABSTRACT:
In this talk we present an alternative to ROC representation, in
which the expected cost of a classifier is represented explicitly. From our
experience in using ROC analysis we found that despite all of its strengths
the graphs produced were not always easy to interpret. We show that the
expected cost representation maintains many of the advantages of the ROC
representation, but is easier to understand.
It allows the experimenter to immediately see the range of costs and
class frequencies where a particular classifier is the best and
quantitatively how much better it is than other classifiers. One interesting
feature of our expected cost representation is that there is a point/line
duality between it and the ROC representation. A point in ROC space
representing a classifier becomes a line segment spanning the full range of
costs and class frequencies. This duality produces equivalent operations in
the two spaces, allowing techniques used in ROC analysis to be readily
reproduced in the cost space. It also means that it is easy to switch between
the two representations and the choice might depend on what aspects of the
learning task are under scrutiny.
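A minimal sketch of the point/line duality, as read from the abstract (the exact axis conventions used in the talk may differ): a classifier that is a single point (fpr, tpr) in ROC space becomes a straight line in cost space, giving its normalized expected cost as a function of a probability-cost factor pc that folds together class frequencies and misclassification costs.

    # Hedged sketch of the point/line duality described above.  pc is the
    # probability-cost factor combining class frequencies and misclassification costs.
    def cost_line(fpr, tpr, pc):
        """Normalized expected cost of the classifier (fpr, tpr) at operating point pc."""
        return (1.0 - tpr) * pc + fpr * (1.0 - pc)

    for pc in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Compare two classifiers across the whole range of costs / class frequencies.
        print(pc, cost_line(0.1, 0.8, pc), cost_line(0.3, 0.95, pc))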
SPEAKER: Jelber Sayyad
DATE: Friday, May 5th, 2000
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: Learning the Concept of Relevance Among Files in a Software Maintenance Context
ABSTRACT:
Software maintenance is the most costly part of the software life cycle.
Estimates of the cost of software maintenance vary from 50% to in excess of
70% of the total cost of the system. The 20-80 rule suggests that 20% of the
development work is performed during original application development, and
the other 80% is performed during maintenance. Consequently it is expected
that assisting software engineers in maintenance activities will result in a
reduction in the overall cost and effort spent during the lifetime of a
software system. In the early 1980s, in an attempt to address some of the
existing problems in software engineering, researchers proposed the use of
artificial intelligence in different phases of the software development process.
The application of artificial intelligence technology to software
engineering is known as Knowledge Based Software Engineering (KBSE). While
this definition is fairly broad, in practice it has meant the creation of
systems which employ knowledge bases that explicitly encode the knowledge
regarding the particular task of interest. On the other hand, machine
learning techniques which have the potential of learning general concepts
from previously seen examples have not been widely applied to the software
engineering domain. The focus of our research is assisting software
engineers with low level, daily maintenance activities. When a maintenance
programmer is looking at a piece of code, such as a
file, one of the important questions which he needs to answer is:
"which other files should I know about, i.e. what else might be
relevant to this piece of code?" In this presentation we will discuss
our research, which employs machine learning to learn the concept of
relevance among files in a legacy software system. We use the records of
program updates, and tasks performed by software engineers to learn, within
the context of software maintenance, what makes two files in our target
system become relevant to each other.
SPEAKER: Fernando de Carvalho Gomes, Postdoc Fellow on leave from UFC - Brazil
DATE: Friday, May 12th, 2000
PLACE: Room 318, MacDonald, University of Ottawa
TITLE: A Stochastic Algorithm for Learning Decision Lists with Limited Complexity
ABSTRACT:
This work deals with learning decision lists from examples. In real
world problems, data are often noisy and imperfectly described. It is
commonly acknowledged that in such cases, consistent but inevitably complex
classification procedures usually cause overfitting: results are perfect on
the learning set but worse on new examples. Therefore, one searches for less
complex procedures which are almost consistent or, in other words, for a
good compromise between complexity and goodness of fit. We propose a
stochastic optimization algorithm to search the space of complexity-limited
decision lists, where the minimization criterion is Vapnik's empirical risk.
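For reference, the empirical risk used as the minimization criterion is standardly written as (textbook formulation, not a detail taken from the abstract):

    \[
      R_{\mathrm{emp}}(h) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\bigl[\, h(x_i) \neq y_i \,\bigr]
    \]

where (x_1, y_1), ..., (x_n, y_n) are the training examples and h ranges over decision lists whose complexity is kept below a fixed bound.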