CCMB Seminar Series 2005-2006
_______________________________________________________ Events
Center for Computational Molecular Biology Seminar Series
Luis E. Ortiz, Postdoctoral Lecturer
Department of Electrical Engineering and Computer Science
Massachusetts Institute of Technology
Computer Science and Artificial Intelligence Laboratory (CSAIL)
Game Theory, Biology and the DNA Binding Game
Abstract:
We propose a game-theoretic approach to learn and predict coordinate binding of
multiple DNA binding regulators. The framework implements resource constrained allocation
of proteins to local neighborhoods as well as to sites themselves, and explicates
coordinate and competitive binding relations among proteins with affinity to the
site or region. Our model permits us to make numerical predictions genome-wide under
different perturbations.
This talk will emphasize the mathematical and computational foundations of the new
modeling approach. I will start by formally presenting our proposed model: the DNA
Binding game. I will establish its ability to make predictions under any perturbations
by showing that an equilibrium exists in any instantiation of the game. I will present
in some detail a simple iterative algorithm that monotonically converges to an equilibrium
of the game (thus providing a constructive proof of existence). Time permitting,
I will show a small-scale illustration of our approach on a well-known biological
subsystem: lambda-phage. I will conclude by briefly discussing work in progress
on learning games from data to address large-scale biological problems.
Joint work with Luis Perez-Breva, Chen-Hsiang Yeang and Tommi Jaakkola.
Wednesday, May 17, 2006
4:00 - 5:00 pm
Applied Mathematics Building
182 George Street ~ Room 110
Host: Professor Charles E. Lawrence
_____________________________________________
Eran Halperin
International Computer Science Institute, Berkeley
New Applications of DNA Pools for Disease Association Studies
Abstract:
The recent release of the Haplotype Mapping project (Nature, Oct. 26, 2005) and
the rapid reduction in genotyping costs open new directions and opportunities in
the study of complex genetic diseases such as cancer or Alzheimer's disease. The
datasets collected for many of these studies include data on single nucleotide
polymorphisms (SNPs), which are DNA sequence variations that occur when a single
nucleotide (A, T, C, or G) in the genome is altered.
Even though technological improvements have recently reduced the genotyping costs
considerably, the genotyping burden on disease association studies is still heavy.
One technique that may be able to reduce this burden is the use of DNA pools. In
DNA pools, the DNA samples of a group of individuals are pooled, and the resulting
pool is then genotyped, yielding a measure of the allele frequency in the pool.
In this talk, I will describe new methods that use DNA pools for association studies
involving unrelated individuals or mother-father-child trios. I will show how some
combinations of DNA pools can reduce the genotyping burden considerably, or alternatively,
can serve as "error detecting codes". I will also describe some wet lab
experiments that support these results.
Thursday, May 11, 2006
3:00 - 4:00 pm
CIT Building, Room 368
115 Waterman Street
Host: Professor Sorin Istrail
_________________________________________________
Russell J. Turner
Johns Hopkins University
Applied Physics Laboratory
Visualizing Comparative Genomics Data
Abstract:
Since the sequencing of the human genome in 2000, over 25 large eukaryotic genomes
have been assembled. Many of these species are closely related to humans, ranging
in evolutionary distance from 5 million (chimpanzee) to 400 million (fish) years.
In the current post-genomic era, much of the focus of genomic research has shifted
to comparative genomics, the study of the similarities and differences between the
entire genomes of related species.
Comparative genomics can not only shed light on the evolutionary relationships among
species, but also be used as a tool to annotate genes on newly sequenced genomes
by projecting known gene locations from similar species.
Visualizing comparative genomics data presents a challenge due to the complexity,
range of scale, and discontinuities in the differences between genomes. These can
vary in size from single nucleotide polymorphisms to rearrangements of major portions
of entire chromosomes. In this talk, we will present some of the techniques we have
developed and experimented with to visualize comparative genomic data at Applied
Biosystems Corporation, and discuss their implementation in a visualization tool
we have developed called Atavist.
Biography:
Dr. Turner is a Senior Computer Scientist at Johns Hopkins University Applied Physics
Laboratory. His research interests include bioinformatics visualization, interactive
2D and 3D graphics, object-oriented software design and 3D character animation.
Before working at APL, he was a member of the Informatics Research group at Applied
Biosystems and technical lead for development of the Celera Genome Browser at Celera
Genomics.
Friday, April 28, 2006
3:00 - 4:00 pm
CIT Building, Room 368
Host: Professor Sorin Istrail
________________________________________
Hagit Shatkay
School of Computing
Queen's University
Hairpins in Bookstacks: Information Retrieval for Biomedical Text Mining
Abstract:
Current advances in high-throughput biology are accompanied by a tremendous increase
in the number of related publications. Much biomedical information is reported in
the abundant literature. The ability to rapidly and effectively survey the literature
can support both the design and the interpretation of large-scale experiments, and
the curation of structured biomedical knowledge in public databases.
In an effort to meet these goals, a variety of text-mining methods are being applied
to the biomedical literature.
This talk will briefly survey such methods, and will focus on two applications in
which we use information retrieval, in non-traditional ways, to directly support
biomedical discovery.
Tuesday, April 18, 2006
11:00 am
Lubrano Conference Room (CIT 4th floor)
Host: Professor Sorin Istrail
____________________________________________
Speakers:
Ruan, Ph.D. and Chia Lin Wei, Ph.D.
Genome Institute of Singapore
Genome Sequencing After the Human Genome Sequencing
Abstract:
Our primary interest is to elucidate the structures and dynamics of all functional
DNA elements in complex genomes through transcriptome characterizations. To facilitate
such understanding we have been developing highly efficient and accurate tag-based
DNA sequencing and mapping methodologies to characterize transcripts and transcription
regulatory elements in the human genome. We are also pushing to apply these technologies
to address complex biological questions such as how cancer cells progress and how
stem cells maintain their unique properties. Another major interest in our lab is
to discover previously uncharacterized viruses and bacteria that reside in human
body cavities. To this end, we have developed a metagenome analysis capability
that uses shotgun sequencing and genome sequence assembly techniques to uncover genomes
from uncultured microorganisms. We are currently focusing on characterizing the
microbiota in the human gastrointestinal (GI) system.
Tuesday, February 14, 2006
2:00 - 3:00 pm
LMM Room 107
70 Ship Street
Host: Professor Charles E. Lawrence
____________________________________________
Computational Analysis of ChIP-chip on Affymetrix Tiled
Arrays
Xiaole Shirley Liu
Department of Biostatistics
Harvard School of Public Health
Abstract:
Chromatin immunoprecipitation coupled with DNA microarray analysis (ChIP-chip) has
evolved as a popular technique to study the genome level in vivo binding of transcription
factors and chromatin remodeling and modifying proteins. Recently genome tiled microarrays
have been developed that allow biologists to conduct unbiased genome-wide ChIP-chip
experiments in mammalian genomes. However, they also generate massive amounts of
data, and pose challenges for the development of effective analysis algorithms.
I will present an approach to analyze ChIP-chip on Affymetrix tiled arrays, which
are the cheapest yet the most difficult to analyze. The low-level analysis pools
data from multiple samples to estimate probe behavior, then uses a hidden Markov
model to detect genomic regions bound by the transcription factor. The high-level
analysis finds common sequence patterns from regions enriched by the transcription
factor ChIP-chip, thus characterizing the binding of the transcription factor and
its cooperative binding partners. I will present our analysis of p53 and estrogen
receptor ChIP-chip on chr21/22 tiled arrays.
Wednesday, November 2, 2005
4:00 - 5:00 pm
BMC 291
____________________________________________
Structure, Function, and Evolution of Transient and Obligate
Protein-protein Interactions
Zhiping Weng
Bioinformatics Program
Biomedical Engineering Department
Boston University
Abstract:
Recent analyses of high-throughput protein interaction data coupled with large-scale
investigations of evolutionary properties of interaction networks have left some
unanswered questions. To what extent do protein interactions act as constraints
during evolution of the protein sequence? How does the type of interaction, specifically
transient or obligate, play into these constraints? Are the mutations in the binding
site of an interacting protein correlated with mutations in the binding site of
its partner? We address these and other questions by relying on a carefully curated
dataset of protein complex structures. Results point to the importance of distinguishing
between transient and obligate interactions. We conclude that residues in the interfaces
of obligate complexes tend to evolve at a relatively slower rate, allowing them
to coevolve with their interacting partners. In contrast, the plasticity inherent
in transient interactions leads to an increased rate of substitution for the interface
residues and leaves little or no evidence of correlated mutations across the interface.
Wednesday, October 26, 2005
2:30 pm
BMC 291
__________________________________________
Space-efficient Whole Genome Comparisons with Burrows-Wheeler
Transforms
Ross A. Lippert
Massachusetts Institute of Technology
Department of Mathematics
Abstract:
Many genome-scale search or comparison projects require the creation of a data-structure
which supports the efficient location of nucleotide or amino acid words. Such indices
can, for example, provide the seeds for genome alignments (to proteins, ESTs, or
other genomes) or an initial set of "overlaps" for assembly. These indices tend
to be space intensive. For example, the suffix tree, a popular data structure for
this purpose, requires more than an order of magnitude more space than the original
sequence. This requires at least some part of the project to be run on a "big memory"
machine or a cluster of computers, posing a significant obstacle to resource-poor
researchers. With a recent data-structure, the compressed suffix array (CSA) implemented
via the Burrows-Wheeler transform, we can trade time-efficiency for space-efficiency,
taking equal or logarithmically more time, but typically taking up less space than
that of the indexed sequence. This is more than an order of magnitude trade between
the run time and the memory required. If space is more expensive than time, this
is an appropriate approach to consider. I developed a space-efficient implementation
of the CSA on nucleotide data requiring less than 5 bits per nucleotide character
to build, and less than 2.5 bits per character, once built. I will present a description
of this data structure and how it can be used to obtain matches. My implementation
was demonstrated by aligning two mammalian genomes on a modest workstation equipped
with under 2 GB of free RAM in time superior to that of the implementations of other
data structures. I will also give rough comparisons to a few publicly available
indexing structures.
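As a rough illustration of the kind of index the abstract above describes, the following Python sketch locates nucleotide words through a Burrows-Wheeler transform using backward search. It is not the speaker's implementation: the function names `bwt_index` and `find` are hypothetical, and this toy version keeps the full suffix array and uncompressed occurrence tables, so it shows only the search principle and none of the space savings of a real compressed suffix array.

```python
from collections import Counter


def bwt_index(text):
    """Build a naive BWT index (illustrative only, not space-efficient)."""
    text += "$"  # sentinel: assumed absent from text; sorts before A, C, G, T
    # Suffix array by direct sorting; acceptable for toy inputs only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each suffix in sorted order.
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: number of characters in text strictly smaller than c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for k, ch in enumerate(bwt):
        for c in occ:
            occ[c][k + 1] = occ[c][k] + (ch == c)
    return sa, first, occ


def find(pattern, index):
    """Return sorted start positions of pattern, via backward search."""
    sa, first, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = bwt_index("ACGTACGA")
print(find("ACG", index))
```

Each step of the search narrows a range of suffix-array rows using only the transformed text and its counts. Compressed implementations build on this property by sampling or compressing the tables that this sketch stores in full.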
Wednesday, October 19, 2005
BMC 291
_________________________________________
Structural Analysis of Protein-DNA Complexes: Insights into
the Mechanism and Evolution of Transcriptional Regulation
Alexandre Morozov
Rockefeller University
Center for Studies in Physics and Biology
Abstract:
Structural modeling of protein-DNA complexes is complementary to genomic sequence
based bioinformatics methods - the two can be used together to understand transcriptional
regulatory networks. Using structural analysis, evolution of transcription factor
binding sites due to mutations at the protein-DNA binding interface can be characterized.
I will demonstrate how genome-wide sequence-structure threading can be used to study
the degree of protein-DNA interface conservation across multiple genomes. Focusing
on protein-DNA interfaces provides classification of transcription factors by their
binding specificity, and allows us to find orthologs and paralogs in related species,
complementing existing algorithms based on the overall sequence similarity. When
a suitable structural template for modeling a transcription factor is available,
transcription factor binding sites and energies can be directly predicted by computational
modeling, and compared with experimental data.
Wednesday, October 12, 2005
4:00 pm
BMC 291
______________________________________________
Coding SNPs, Evolution and Disease Phenotype: Genome-wide
Bioinformatics Predictions and Experimental Functional Studies
Paul D. Thomas
Computational Biology
Applied Biosystems
Abstract:
Most human variation is selectively neutral, but a large number of both rare and
common allelic variants are associated with human disease. Predicting which allelic
variants may be causative for disease is an open problem. Many diseases, both Mendelian
and complex, have been associated with single-nucleotide changes (SNPs) that lead
to an amino acid substitution in the encoded protein (nonsynonymous SNPs, or nsSNPs).
Because of ascertainment bias, nsSNPs may not necessarily be the dominant cause
of human disease. Nevertheless, nsSNPs provide an excellent testing ground for using
evolutionary analysis to predict the functional effects of genetic variation, as
computational methods for inferring selective pressure in protein-coding sequences
are well-established. We have applied models of both negative and positive selection.
One signature of negative selection is that in groups of related protein sequences,
many positions in the protein are “conserved”; for instance, all serine proteases
must possess the catalytic serine residue. To quantify this negative selection,
we developed a “substitution position-specific evolutionary conservation” (subPSEC)
score. We then analyzed a large number of nsSNPs from a number of data sets: “normal”
variation, Mendelian disease associated mutations and complex disease associated
variation. We find that while Mendelian disease-associated nsSNPs tend to occur
at highly conserved positions in proteins, complex disease nsSNPs do not. In contrast,
applying a method for estimating positive selection, we show that genes involved
in complex disease tend to have relatively large Ka/Ks ratios between human and
mouse orthologs, suggesting that measures of recent positive selection may be useful
in identifying complex disease-associated genetic variation. In collaboration with
Dr. M.R. Hayden and colleagues at UBC we have experimentally and computationally
characterized amino acid substitutions in one disease-associated gene, ABCA1, to
assess evolutionary prediction methods in detail. The ABCA1 transporter has been
implicated in both Mendelian and complex disease. We find that evolutionary conservation
is, in most cases, an excellent predictor of functional importance of an amino acid
in ABCA1. However, we also find that measures of positive selection are critical
for predicting some of the mutational effects.
Wednesday, September 28, 2005
4:00 pm
BMC 291