E-Book Overview
Written by a pioneer of the use of bioinformatics in research, the second edition of Introduction to Bioinformatics introduces the student to the power of bioinformatics as a set of scientific tools. Retaining and enhancing the rich pedagogy and lucid presentation of the first edition, this new edition explains how to access the data archives of genomes and proteins, and the kind of questions these data and tools can answer. It also discusses how to make inferences from the data archives, how to make connections among them, and how to derive useful and interesting predictions. The book is accompanied by a fully integrated companion website.
E-Book Content
1: Introduction
  A scenario
  Life in space and time
  Dogmas: central and peripheral
  Observables and data archives
  Curation, annotation, and quality control
  The World Wide Web
    The hURLy-bURLy
    Electronic publication
  Computers and computer science
    Programming
  Biological classification and nomenclature
  Use of sequences to determine phylogenetic relationships
    Use of SINES and LINES to derive phylogenetic relationships
  Searching for similar sequences in databases
  Introduction to protein structure
    The hierarchical nature of protein architecture
    Classification of protein structures
  Protein structure prediction and engineering
    Critical Assessment of Structure Prediction (CASP)
    Protein engineering
  Clinical implications
    The future
  Recommended reading
  Exercises, Problems, and Weblems
Biology has traditionally been an observational rather than a deductive science. Although recent developments have not altered this basic orientation, the nature of the data has radically changed. It is arguable that until recently all biological observations were fundamentally anecdotal, admittedly with varying degrees of precision, some very high indeed. However, in the last generation the data have become not only much more quantitative and precise, but, in the case of nucleotide and amino acid sequences, they have become discrete. It is possible to determine the genome sequence of an individual organism or clone not only completely, but in principle exactly. Experimental error can never be avoided entirely, but for modern genomic sequencing it is extremely low. Not that this has converted biology into a deductive science. Life does obey principles of physics and chemistry, but for now life is too complex, and too dependent on historical contingency, for us to deduce its detailed properties from basic principles.
A second obvious property of the data of bioinformatics is their very large volume. Currently the nucleotide sequence databanks contain 6 × 10⁹ bases (abbreviated 6 Gbp). If we use the approximate size of the human genome (3 × 10⁹ letters) as a unit, this amounts to two HUman Genome Equivalents (or 2 huges, an apt name). For a comprehensible standard of comparison, 1 huge is comparable to the number of characters appearing in six complete years of issues of The New York Times. The database of macromolecular structures contains 15 000 entries, each giving the full three-dimensional coordinates of a protein, of average length ∼400 residues. Not only are the individual databanks large, but their sizes are increasing at a very high rate. Figure 1.1 shows the growth over the past decade of GenBank (archiving nucleic acid sequences) and the Protein Data Bank (PDB) (archiving macromolecular structures). It would be precarious to extrapolate.
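The unit arithmetic behind the "huge" can be checked with a few lines of Python. This is only an illustrative sketch: the function names `to_huges` and `to_mbp` are hypothetical, and the constants are the approximate figures quoted in the text, not current databank sizes.

```python
# Approximate human genome size used in the text as the unit (assumption: 3 x 10^9 bases).
HUMAN_GENOME_BASES = 3 * 10**9


def to_huges(bases: int) -> float:
    """Convert a base count to HUman Genome Equivalents (huges)."""
    return bases / HUMAN_GENOME_BASES


def to_mbp(bases: int) -> float:
    """Convert a base count to megabase pairs (1 Mbp = 10^6 bases)."""
    return bases / 10**6


# Databank size quoted in the text: 6 x 10^9 bases.
databank_bases = 6 * 10**9

print(to_huges(databank_bases))  # 2.0 huges
print(to_mbp(databank_bases))    # 6000.0 Mbp, i.e. 6 Gbp
```

Expressing the total in Mbp as well is convenient because that is the unit on the vertical axis of Figure 1.1(a).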
[Figure 1.1, panel (a). Vertical axis: Size (Mbp).]