DATE: Friday, Jan. 24, 2003
TIME: 3:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Pattern discovery in genomics data
PRESENTER: Marcel Turcotte
University of Ottawa
ABSTRACT:

I will present two ongoing research projects in the field of bioinformatics: i) detecting over-represented motifs in genomic sequences and ii) learning grammatical representations of regulatory elements.

Genomic data is being produced at an incredible rate. It has been estimated that one sequence enters the genomic sequence databases every minute. The complete genomic content of more than 800 organisms is now known, and this information is available in public databases. There is an urgent need for tools to assist the process of genome annotation and knowledge discovery.

The objective of our first research project is to develop tools to study local interspersed repeated motifs in large genomic sequences. The cmv1r allele of the mouse genome is used as a case study. This particular region is known to confer resistance to cytomegalovirus. The hypothesis space is a subset of all strings over a four-letter alphabet. Our approach consists of searching the set of all pairs of substrings from the original input that are at a user-defined distance (Hamming or edit distance) from one another, selecting the interesting pairs, and finally clustering them.
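The search, select and cluster steps described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the fixed substring length `k`, the reading of the user-defined distance as an upper bound, the use of Hamming distance only, the rule that a pair is "interesting" when its windows do not overlap, and the single-linkage clustering are all assumptions made for illustration. The brute-force pair enumeration is quadratic in the sequence length, so it would not scale to large genomic sequences as written.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(seq: str, k: int, max_dist: int):
    """Return (i, j, distance) for pairs of length-k substrings of seq.

    Assumption: a pair is kept when its two windows do not overlap and
    their Hamming distance is at most max_dist.
    """
    starts = range(len(seq) - k + 1)
    pairs = []
    for i, j in combinations(starts, 2):
        if j - i < k:  # skip overlapping windows
            continue
        d = hamming(seq[i:i + k], seq[j:j + k])
        if d <= max_dist:
            pairs.append((i, j, d))
    return pairs


def cluster(pairs):
    """Group start positions linked by at least one pair (single linkage).

    Uses a small union-find structure; returns sorted lists of positions.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _ in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    toy = "ACGTACGTTTACGA"  # hypothetical toy sequence, not real data
    found = similar_pairs(toy, k=4, max_dist=1)
    for group in cluster(found):
        print(group, [toy[i:i + 4] for i in group])
```

Replacing `hamming` with an edit-distance function would cover the second distance mentioned above, at the cost of comparing substrings of different lengths.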

The second project is set in the context of the study of gene regulation at the level of translation. Understanding the regulation of gene expression is fundamental and has an impact on the development of new therapeutic strategies. The regulation of gene expression occurs at two main levels: transcription and translation. At the level of transcription, regulation usually involves the binding of regulatory proteins near the transcription start site, which affects the rate of initiation. Several pattern recognition algorithms have been developed to predict the location of the binding sites of the transcription factors. These programs rely on experimental evidence indicating that a set of genes is co-regulated. Based on this information, the programs then infer a sequence motif, deterministic or probabilistic, that is common to all the genes.
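To make the distinction between a deterministic and a probabilistic motif concrete, here is a small Python sketch. It assumes, for simplicity, that the binding sites have already been located and aligned, which is precisely the hard part that the programs mentioned above must solve; the function names and the example sites are hypothetical. The per-position frequency table plays the role of a probabilistic motif, and the consensus string that of a deterministic one.

```python
from collections import Counter

BASES = "ACGT"


def position_frequencies(sites):
    """Per-column base frequencies for equal-length, pre-aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("sites must all have the same length")
    n = len(sites)
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({base: counts[base] / n for base in BASES})
    return matrix


def consensus(matrix):
    """Most frequent base per column (ties broken in A, C, G, T order)."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in matrix)


if __name__ == "__main__":
    # Hypothetical aligned sites, for illustration only.
    sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    freqs = position_frequencies(sites)
    print(consensus(freqs))
    for position, col in enumerate(freqs):
        print(position, col)
```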

Translational control is the final step in a complex network of regulatory processes involved in the control of gene expression. In contrast to transcriptional regulation, translational regulation is believed to involve the binding of proteins not only to RNA sequence motifs but also to structural motifs. The aim of this research project is therefore to automatically discover higher-order, grammatical motifs in RNA sequences.

Basic concepts of biology, genomics and evolution will be presented :)