The Text Analysis and Machine Learning Group
SPEAKER: Rob Cattral
DATE:
TITLE: Rule Acquisition with a Genetic Algorithm
ABSTRACT:
RAGA is a genetic algorithm (GA) and genetic programming (GP) hybrid
designed for the tasks of supervised and certain types of unsupervised
knowledge extraction from large and possibly noisy databases. Applying the
GA to the task of data mining required that certain changes be made to the
traditional algorithm, particularly with respect to knowledge
representation. RAGA uses a system of rules of varying length and
complexity, while the traditional GA has chromosomes of fixed length using a
binary alphabet. Although it is similar to GP in this respect, RAGA uses
mutation operators and does not evolve programs. Unlike both GA and GP, it
promotes validity and nonredundancy by intergenerational processing on a
fluctuating number of individuals. It also implements a form of elitism that
causes a wide exploration of the dataset, and, by making data coverage a
component of fitness, it automatically evolves default hierarchies of rules.

SPEAKER: Yves Kodratoff, CNRS & University Paris-Sud
DATE:
TITLE: Knowledge Discovery in Texts (= "Text Mining")
ABSTRACT:
We will begin with a short demo of TROPES, an industrial software package developed by the company Acetic. It illustrates the importance of working with a "good" taxonomy of concepts when detecting occurrences of concepts in texts. This presentation aims at underlining two sufficiently difficult problems raised by KDT. This discipline uses two kinds of tools, both imperfect: Natural Language Processing (NLP) tools, and tools for the discovery of "associations" (implications without variables). We deal with these two problems as follows. For NLP, we insist on the absolute necessity of developing tools reliable enough to create taxonomies of concepts from texts. We have noticed that the syntactic position of a word in a sentence most often determines which concept the word refers to. We will describe LRI's ASIUM tool, which helps build taxonomies from texts, and ROWAN, still under development (a demo will be given), which will soon allow characterizing the syntactic position linked to a particular meaning of a word. For association detection, we will underline the limited value of large coverage. We will show the equivalence, in practice, between (1) measures of dependency, (2) intensity of implication, and (3) naive Bayes, and, as a consequence, illustrate the necessity of constructing complete Bayesian networks from the data.

SPEAKER: Rob Holte
DATE:
TITLE: Combinatorial Auctions, Knapsack Problems, and Hill-climbing Search
ABSTRACT:
This talk examines the performance of hill-climbing algorithms on
standard test problems for combinatorial auctions (CAs).
On single-unit CAs, deterministic hill-climbers are found to perform
well, and their performance can be improved significantly by randomizing
them and restarting them several times, or by using them collectively.
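As a concrete illustration, a randomized hill-climber with restarts for a single-unit CA might look like the following minimal sketch. The greedy start, the swap neighborhood, and the toy bids are my own choices; the talk's actual algorithms and benchmark problems differ.

```python
import random

def revenue(selected, bids):
    return sum(bids[i][1] for i in selected)

def hill_climb(bids, rng):
    # Greedy start from a random bid order: accept a bid whenever its item
    # set is disjoint from every bid accepted so far.
    order = list(range(len(bids)))
    rng.shuffle(order)
    selected = set()
    for i in order:
        if all(bids[i][0].isdisjoint(bids[j][0]) for j in selected):
            selected.add(i)
    # Local improvement: swap in a rejected bid whenever its price exceeds
    # the total price of the accepted bids it conflicts with.
    improved = True
    while improved:
        improved = False
        for i in range(len(bids)):
            if i in selected:
                continue
            conflicts = {j for j in selected
                         if not bids[i][0].isdisjoint(bids[j][0])}
            if bids[i][1] > sum(bids[j][1] for j in conflicts):
                selected -= conflicts
                selected.add(i)
                improved = True
    return selected

def restart_hill_climb(bids, restarts=50, seed=0):
    # Randomized restarts: keep the best local optimum found.
    rng = random.Random(seed)
    best = max((hill_climb(bids, rng) for _ in range(restarts)),
               key=lambda s: revenue(s, bids))
    return best, revenue(best, bids)
```

For example, with bids = [({'a','b'}, 3), ({'b','c'}, 4), ({'a'}, 2), ({'c'}, 2)], a single climb started from the first bid gets stuck at revenue 5, while restarting from other orders reaches the optimal selection of the last three items at revenue 6 — the effect the talk reports on single-unit CAs.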
For some problems this good performance is shown to be no better than
chance; on others it is due to a well-chosen scoring function. Multi-unit
CAs have been studied widely under a different name: multidimensional
knapsack problems (MDKP). On
standard test problems for MDKP, one of the deterministic hill-climbers
generates solutions that are on average 99% of optimal.

SPEAKER: Svetlana Kiritchenko
DATE:
TITLE: Co-Training: a New Algorithm for Using Unlabeled Data in Supervised Learning
ABSTRACT:
In supervised learning there is often a shortage of labeled data. We may have plenty of unlabeled examples, but getting labeled data means manually labeling each example, which is expensive and time-consuming. In such a case, we would like a way of putting unlabeled data to use in the training process. Co-Training is a new, effective algorithm suggested by T. Mitchell (and presented in this seminar series in the Fall of 1999) that lets you start with only a few labeled examples and then use unlabeled data to improve your initial classifier. In this presentation we will talk about how Co-Training works, when it can be applied, and what advantages it has compared with other learning algorithms that use unlabeled data. I will also present our experimental results from applying Co-Training to the e-mail classification problem.

SPEAKER: Dale Schuurmans
DATE:
TITLE:
ABSTRACT:
It is well known that general inference and learning with Bayesian networks
is computationally hard, and it is therefore necessary to consider heuristic
and approximate algorithms for these tasks.
Among the most convenient and successful techniques are

SPEAKER: Dale Schuurmans
DATE:
TITLE: The exponentiated subgradient algorithm for heuristic boolean programming
ABSTRACT:
Boolean linear programs (BLPs) are
ubiquitous in artificial intelligence research. Satisfiability
testing, planning with resource constraints, and winner determination in
combinatorial auctions are all examples of this type of problem.
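A BLP in this sense is: maximize c·x subject to A·x ≤ b with each x_i in {0, 1}. A brute-force reference solver makes the shared structure concrete — enumeration is for illustration only (the talk is about local search), and the example instance is my own:

```python
from itertools import product

def solve_blp(c, A, b):
    """Maximize sum(c[i]*x[i]) subject to, for each row r,
    sum(A[r][i]*x[i]) <= b[r], with x binary.
    Exhaustive search: only feasible for tiny instances."""
    best_x, best_val = None, float("-inf")
    for x in product((0, 1), repeat=len(c)):
        feasible = all(sum(a * xi for a, xi in zip(row, x)) <= bound
                       for row, bound in zip(A, b))
        if feasible:
            val = sum(ci * xi for ci, xi in zip(c, x))
            if val > best_val:
                best_x, best_val = x, val
    return best_x, best_val

# Winner determination as a BLP: two bids share one item, so at most one
# may win (x1 + x2 <= 1); the higher bid should be selected.
x, val = solve_blp(c=[3, 4], A=[[1, 1]], b=[1])
# x == (0, 1), val == 4
```

Satisfiability and resource-constrained planning fit the same template with different c, A, and b.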
Although increasingly well-informed by work in operations research, current
work in AI has tended to focus on specialized algorithms for each type of
BLP task and has only loosely patterned new algorithms on effective methods
from other tasks. In this talk I will introduce a single general-purpose
local search procedure that can be simultaneously applied to the entire
range of BLP problems, without modification. Although one might
suspect that a general-purpose algorithm would not perform as well as
specialized algorithms, we find that this is not the case here. Our
experiments show that our generic algorithm simultaneously achieves
performance comparable with the state of the art in satisfiability search
and winner determination in combinatorial auctions---two very different BLP
problems. Our algorithm is simple and combines an old idea from OR with recent ideas from AI.

SPEAKER: Berry de Bruijn
DATE:
TITLE: Sentence Extraction for Justifying Text Categorization
ABSTRACT:
Our
SVM-based text categorization algorithm separates medical articles (or
rather, abstracts) into those that report a protein interaction and those
that don't. Joel Martin talked about that in a previous Tamale seminar. In
this presentation, I will talk about efforts to extract the key sentence from a positive document, i.e. a protein
interaction abstract.
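Schematically, extraction amounts to ranking a document's sentences and returning the top one. A deliberately simplistic cue-word scorer illustrates the shape of the task — the cue words and scoring here are hypothetical, and the two methods discussed in the talk are different:

```python
def extract_key_sentence(abstract_text,
                         cue_words=frozenset({"interacts", "binds", "complex"})):
    # Split into sentences, score each by how many cue words it contains,
    # and return the highest-scoring sentence.
    sentences = [s.strip() for s in abstract_text.split(".") if s.strip()]
    def score(sentence):
        return sum(w.strip(",;").lower() in cue_words
                   for w in sentence.split())
    return max(sentences, key=score)

text = ("The assay procedure was described previously. "
        "Protein A interacts directly with protein B. "
        "The results were reproducible.")
# extract_key_sentence(text) -> "Protein A interacts directly with protein B"
```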
A good sentence contains enough evidence for a human judge to certify
the true class of the document. An automaton that can find the good
sentences could reduce a judge's reading time. I will discuss two methods
for sentence extraction, results from tests on two data collections, and related work
(notably automatic summarization and question answering).

SPEAKER: Fernando C. Gomes
DATE:
TITLE: PRACTICAL SVM USAGE
ABSTRACT:
Over the last three years, Support Vector Machines (SVMs) have been successfully used in a large variety of domains. The central idea behind the SVM is to map the input data into some other feature space via a nonlinear map.
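That mapping is typically applied implicitly through a kernel function; a minimal illustration with a degree-2 polynomial kernel (a toy example of my own, not taken from the talk):

```python
import math

def phi(x):
    # Explicit degree-2 feature map: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2).
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    # The "kernel trick": this equals <phi(x), phi(z)> without ever
    # constructing the feature vectors.
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
implicit = poly_kernel(x, z)
# both equal 1.0 (up to floating-point rounding)
```

The kernel evaluation costs one dot product in the input space regardless of how large the mapped feature space is, which is what makes nonlinear SVMs tractable.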
This makes the SVM a serious competitor among high-performance classifiers. However, SVMs show some weaknesses and difficulties related to their implementation and use, namely computational performance, choice of mapping function, and parameter tuning. In this seminar I will present the method, as a tutorial, and show how I have been using SVMs for classification and regression in image analysis and diagnosis.

SPEAKER: Istvan Hernadvolgyi
DATE:
TITLE: Searching for Macro Operators with Automatically Generated Heuristics
ABSTRACT:
The Rubik's Cube has 4.3*10^19 states, millions of times more than the 15-puzzle. Because of its size, this puzzle has been a driving force of research in both Mathematics and Computer Science. Recently (1997), for the first time, a
few dozen optimal solutions were found by heuristic search. Each instance required several days of CPU time and large heuristic tables. Suboptimal solutions, while taking significantly less computational effort, are often five times the length of the optimal solution. Our aim is to come up with a technique which
is general and can find short solutions fast. Our method is to find optimal
macros with heuristic search and to reduce the number of subgoals which
confine the solution path. The
heuristics are automatically derived by generating an abstract space, while
the subgoals are reduced by merging consecutive ones. Using
this method, the solution lengths are reduced by 44% over previous
suboptimal techniques, at reasonable extra computational effort.
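The automatically generated heuristics are of the abstraction flavor: distances computed in an abstract version of the state space lower-bound distances in the real one. A toy sketch, with an adjacent-swap permutation puzzle standing in for the Cube and the macro/subgoal machinery of the talk omitted:

```python
from collections import deque

GOAL = (0, 1, 2, 3)

def neighbors(state):
    # Toy puzzle: any adjacent pair of tiles may be swapped.
    for i in range(len(state) - 1):
        s = list(state)
        s[i], s[i + 1] = s[i + 1], s[i]
        yield tuple(s)

def abstract(state, keep=frozenset({0, 1})):
    # Abstraction: tiles outside `keep` become indistinguishable wildcards,
    # collapsing many concrete states into one abstract state.
    return tuple(t if t in keep else "*" for t in state)

def build_heuristic(keep=frozenset({0, 1})):
    # BFS from the goal over the (tiny) concrete space; record the minimum
    # goal distance seen for each abstract state. Looking a state up through
    # its abstraction then never overestimates: an admissible heuristic.
    dist = {GOAL: 0}
    queue = deque([GOAL])
    while queue:
        s = queue.popleft()
        for n in neighbors(s):
            if n not in dist:
                dist[n] = dist[s] + 1
                queue.append(n)
    table = {}
    for s, d in dist.items():
        a = abstract(s, keep)
        table[a] = min(table.get(a, d), d)
    return table

h = build_heuristic()
# h[abstract(GOAL)] == 0; h never exceeds the true goal distance
```

For the real Cube the abstract space is built the same way but is vastly smaller than the concrete one, which is what makes the table affordable.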
This method is a general search technique, almost entirely automatic, and is not in any way limited to solving the Rubik's Cube. I
will also present yet another technique which exploits symmetries in the
subgoals and makes the same macro table applicable to find even shorter
solutions.

SPEAKER: Erick Alphonse
DATE:
TITLE: Learning Relational Concepts by Lazy Propositionalization
ABSTRACT:
Inductive Logic Programming is a domain of machine learning which deals with representation languages in subsets of First Order Logic (FOL). It is a
well known fact that concept learning in restrictions of FOL has to cope
with complexity issues that stem from both the size of the search space and
the complexity of the subsumption test. Both
aspects are traditionally handled in Inductive Logic Programming (ILP)
directly in a FOL framework by adopting a generate and test approach
carefully controlled through sophisticated language and search bias. A novel
approach to learning relational concepts from positive and negative examples
is presented. We show how one can perform a test incorporation into the
generator in order to conduct a data-driven search strategy. This upgrade of
the well-known attribute-value technique, motivated by Winston and used in
the AQ-family, into a downward refinement operator enables us to dynamically
reduce the dimensionality of ILP problems by pruning irrelevant branches of
the refinement graph. Furthermore,
this refinement operator is computed efficiently in polynomial time. We
present an AQ-like algorithm exploiting this approach and then provide a
first empirical evaluation on a standard benchmark dataset for ILP, the
Mutagenesis problem. Time permitting, we will discuss the soundness of our approach and underline the link between the completeness of the two learning strategies in ILP.

SPEAKER: Aijun An
DATE:
TITLE: A Rule Induction System and its Applications
ABSTRACT:
I present ELEM2, a rule induction system that induces decision rules from a set of data based on selection of attribute-value pairs. ELEM2 is
distinguished from other rule induction systems in three aspects. First, in
the search for attribute-value pairs, ELEM2 uses a heuristic function that
reflects the degree of relevance of an attribute-value pair to a target
concept. The function leads to selection of the most relevant pairs to
formulate rules. Second, ELEM2 handles inconsistent training examples by
defining an unlearnable region of a concept based on the probability
distribution of that concept in the training data. The unlearnable region is
used as a stopping criterion for the concept learning process, which
resolves conflicts without removing inconsistent examples. Third, ELEM2
makes use of a rule quality measure in its post-pruning process to prevent
rules from overfitting the data. The default rule quality formula used in ELEM2 measures the extent to which a rule can discriminate between the positive and negative examples of the target concept. The rule quality formula can
also be selected from a set of formulas based on characteristics of the data
set and a set of formula-selection rules. I
will provide an overview of experimental results obtained by applying ELEM2
to real and synthetic data sets and compare them with the results from
some other systems. I will also discuss some recent industrial applications
of the research, including identifying active compounds for drug discovery
and target marketing of products and services for telecommunication
companies.

SPEAKER: Mario Jarmasz
DATE:
TITLE: Roget's Thesaurus: a Lexical Resource to Treasure
ABSTRACT:
WordNet
proved that it is possible to construct a large-scale electronic lexical
database on the principles of relational lexical semantics. It has been
accepted and used extensively by computational linguists ever since it was
released. Inspired by its success, I propose as an alternative a similar
resource based on the 1987 Penguin edition of Roget's Thesaurus of English
Words and Phrases. This
seminar describes the steps involved in creating a machine tractable version
of Roget's Thesaurus. A comparison to WordNet and an evaluation
of its merits for natural language processing shall be discussed.
Roget's classification system as well as its lexical entry format will be shown, and the organization and distribution of words and phrases described in depth. After analyzing the
Thesaurus, a methodology for mapping Roget's senses onto WordNet and
labelling implicit semantic relations shall be explained. A
survey of how Roget's has been used in computational linguistics will be
presented. Three applications of the computerized Thesaurus shall be discussed: measuring the semantic similarity of words and phrases, word sense disambiguation, and text classification.
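For the first of these applications, path length through the hierarchy is the standard starting point for thesaurus-based similarity; a sketch over an invented mini-taxonomy (NOT actual Roget classes — Roget's real hierarchy has classes, sections, and heads):

```python
# Hypothetical child -> parent edges standing in for a thesaurus hierarchy.
TAXONOMY = {
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
    "dog": "canine", "canine": "animal", "animal": "organism",
    "artifact": "entity", "organism": "entity",
}

def path_to_root(word):
    # Follow parent links until a word has no parent (the root).
    path = [word]
    while path[-1] in TAXONOMY:
        path.append(TAXONOMY[path[-1]])
    return path

def edge_distance(w1, w2):
    # Number of edges on the path through the lowest common ancestor;
    # a smaller distance means the words are more semantically similar.
    p1, p2 = path_to_root(w1), path_to_root(w2)
    for i, node in enumerate(p1):
        if node in p2:
            return i + p2.index(node)
    return None  # no common ancestor

# edge_distance("car", "truck") == 2  (meet at "vehicle")
# edge_distance("car", "dog") == 7    (meet only at the root "entity")
```

Related words meet low in the hierarchy and score a short distance; unrelated words meet only near the root, which is the intuition behind thesaurus-based similarity measures.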