Bringing it all back home: next-generation sequencing technology and you
This essay was written by Mike Gilchrist and was first published in the 2010 Mill Hill Essays.
Technological advances in experimental biology are accelerating at such a pace that today a California company can propose, for ten thousand dollars and given a sample of your DNA, to do something that would have cost us three billion dollars a decade ago – sequence a human genome. How can they do that? And possibly more importantly, why would you want them to do that? What would it tell you about your state of health, both now and in the future? What might it tell you about your parents’ and your children’s prospects?
To understand what is really quite a revolution in experimental biology we will have to look not only at the architecture and organisation of the genome, but also at the methods we have been using to get at the information it carries, and how this information is used by the organism to grow correctly from an embryo to an adult. We will also need to understand how small differences between the genomes of different individuals arise, and how these lead to some of the observable differences that make each of us unique. Let us start with the genome.
The molecular structure and information content of our genome
The human genome has two primary functions: it carries the coded instructions for how to build and run a human being; and it conveys this genetic information from generation to generation, while allowing a little mixing and small amounts of change so that future generations remain robust and can evolve. Since Crick and Watson solved the structure of DNA we have understood how the molecular structure of the genome allows it to do these things. The genome is carried on twenty-three long molecules of DNA, our chromosomes, and we have two copies of it in most of the cells in our body. The genetic information on the chromosomes is encoded into the sequence of bases that link the twin helical backbones of the DNA molecule (see Figure 1). This genetic code is very simple, consisting of just four different bases, adenine, cytosine, guanine and thymine, which we usually refer to by the letters A, C, G and T. A base on one strand of the DNA must bond with its complementary base on the other strand, A with T, C with G, and vice versa, so that the two strands make exact but complementary copies of each other.
This is what facilitates the precise replication of the chromosomes when cells divide during growth of an organism. Although the basis of the genetic code is simple, the sheer length of the DNA – there are approximately 100 million base pairs in an average size chromosome – provides more than enough variation to build the tens of thousands of distinct molecules that are needed for life.
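The pairing rule can be written down in a few lines of code. The following is only a minimal sketch: the example sequence is made up, the function name is our own, and we ignore the fact that the two strands of real DNA run in opposite directions.

```python
# Pairing rules as described in the text: A bonds with T, C bonds with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


template = "ATGCCGTA"  # made-up example sequence
partner = complementary_strand(template)
print(partner)

# Copying the partner strand regenerates the original, which is the
# property that makes faithful replication possible.
print(complementary_strand(partner) == template)
```

Because each base has exactly one partner, either strand alone is enough to rebuild the other.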
The international Human Genome Project set out in 1990 to determine the sequence of our DNA, involving hundreds of scientists from North America, Europe, Asia and the Pacific Rim. It took ten years to produce a draft version of the genome and we now know that it contains close to 3 x 10^9, or three thousand million, base pairs, and we know what most of its sequence is. Can we get a feel for how much ‘information’ the genome contains? In simple terms a good thick airport novel contains about a million letters, and so a library of 3,000 such books would hold the same amount of ‘text’ as the human genome. This would fit easily into bookcases lining one wall of a generously sized living room. If we were to devote our leisure time to reading these ‘books’, and could get through one a week, it would take sixty years to plough through our entire genome – a nice fit with our lifespan, and hopefully a few years left to digest what we have read.
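For readers who like to check the sums, the library analogy can be reproduced as below. The genome size, the letters per novel and the reading rate are the round figures used in the text; the only added assumption is a 52-week reading year.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # close to 3 x 10^9, as quoted above
LETTERS_PER_NOVEL = 1_000_000      # "a good thick airport novel"
BOOKS_PER_YEAR = 52                # one book a week (assumed 52-week year)

books = GENOME_BASE_PAIRS // LETTERS_PER_NOVEL
years = books / BOOKS_PER_YEAR

print(f"{books} books, roughly {years:.0f} years of reading")
```

The result comes out a little under sixty years, which is the figure rounded up in the text.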
As we read through our genome, however, we would quickly be assailed by a recurring sense of déjà vu: that we had read this or that bit before, possibly many times over. It turns out that not all of the genome ‘text’ is unique, and not all of it carries useful information, at least so far as we can currently interpret it. We could improve our analogy to take this into account. Suppose that our book collection starts off with all the novels of (say) Austen, Tolstoy and Zola. This would make up a little under 2% of the total of 3,000 books. Then we need to imagine that the balance is made up of 2950 copies of Marcel Proust’s À la recherche du temps perdu: Du côté de chez Swann. Furthermore, these multiple copies are themselves cut up into fragments ranging from a few characters to several pages, and the resulting mixture, much of which is highly repetitive, is distributed randomly amongst the pages of the other books.
Going back to the real genome, what this means is that our genome sequence is largely made up of repetitive, or uninformative, sequence, which we think of as having little impact on the day-to-day running of our bodies. This material is sometimes called ‘junk’ DNA, and as you might guess, it turns out that the repetitive elements have a significant adverse impact on our ability to experimentally determine the sequence of our genome. Our genes, which do most of the useful work, occupy the remaining ‘interesting’ 2% of the genome, and are the key to our biology. We have about 20,000 – 25,000 genes, and each is responsible for making one of the many smaller molecules (mostly proteins) that we need to grow and to function. The recipe for each gene’s product is contained in its DNA – think of it as long paragraphs of unique text interspersed amongst the junk – and the precise sequence of the gene determines the structure and nature of (say) the protein it produces. In addition, the sequence around a gene contains signals which, in concert with other genes, tell the body where and when to produce that gene’s protein. This is the origin of our need as biologists to sequence the genome, as only in this way can we begin to understand the functioning of all these genes.
So how do we go about sequencing a genome?
Sequencing the human genome: old style
Consider first what we could do ten years ago, at the turn of the millennium. The primary method for sequencing DNA then was called Sanger sequencing (after the Nobel Prize winner, Fred Sanger, who invented it), and we could routinely sequence sections of DNA for up to about 800 bases with high accuracy. This however was as far as we could go, and even though we could sequence the ends of quite large fragments of DNA, we could not sequence the section in the middle (see Figure 2a). Extracting someone’s DNA is easy enough, but how do we begin to sequence a whole genome when our technology allows us to access only the very end sections of DNA molecules? As the Human Genome Project progressed, the methods used to sequence the genome changed quite radically. At first it was a methodical sequencing of small regions of known sequence to build up larger ones; now the standard approach for vertebrate genomes is so-called shotgun sequencing. In very simple terms we take the genome and shatter it into many thousands of molecular fragments, and then sequence the ends of all the fragments. And we do this not just with one copy of the genome, but with millions of copies, although this is easy, as our body contains trillions of cells, each one containing its own copies of the genome. Then, using a considerable amount of computing power, we begin to look for sequenced ends that match each other, and gradually we assemble a copy of the genome sequence from these overlapping matched ends, not entirely unlike a very large jigsaw puzzle (see Figure 2b) with tens of millions of pieces.
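The jigsaw idea of matching overlapping ends can be illustrated with a toy example. This is a deliberately naive sketch, not how production assembly software works: the function names and the three short ‘reads’ are invented, and we assume error-free reads from a single strand with no repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return reads  # nothing left to join
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]


# Three made-up overlapping fragments of one short sequence.
fragments = ["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]
print(greedy_assemble(fragments))
```

With only three clean fragments the pieces fall into place at once; with tens of millions of pieces, sequencing errors and repeats, the same matching problem demands the considerable computing power mentioned above.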
The repetitive nature of the genome is, however, quite a problem for the assembly process. By working with DNA fragments that are generally larger than most of the repetitive regions we can effectively ‘step over’ these regions when assembling the genome pieces in the right order. Then, when the overall order is correct the individual repetitive regions are easier to assemble. A simple shotgun sequence strategy might combine groups of carefully size-selected fragments of 5kb (i.e. 5,000 base pairs), 20kb and 100kb. This enables the assembly software to handle quite large repeat regions, and increases the chances of a robust and accurate assembly.
In many ways it is amazing that we have been able to do this, and we have sequenced a few more species with genomes as large and complex as ours, but it is prohibitively expensive and it is unlikely there will be any more large genome projects like this.
The new sequencing technology
New sequencing technologies began to emerge during the closing phases of the Human Genome Project with the promise of very much higher throughput than Sanger sequencing, and these methods quickly became known generically as massively parallel, next-generation sequencing. The most widely used system today is owned by the US company Illumina Inc., which bought the UK company Solexa that developed the technology. The technical details are quite fascinating: it involves the simultaneous sequencing of millions of tiny fragments of DNA on the surface of a glass slide about the size of a large matchbox, essentially by imaging them as they grow. Fragments of DNA to be sequenced are anchored at one end to the surface of the glass slide, in an enclosed flow cell through which reagents can be passed. The fragments have been transformed into single-stranded DNA and a second strand can now be re-synthesised using the anchored strand as a template (see Figure 3). Individual building blocks of DNA, or nucleotides, are used, which have been modified so that each of the four bases (A, C, G or T) emits a different coloured light when excited by a laser. When these are washed over the template fragments on the glass slide the nucleotide with the next correct base is chemically locked in place on the growing second strand. The glass slide is then photographed at very high resolution while laser light is shone on it, so that each growing fragment shows up as a tiny dot of light, coloured depending on which base was just added. The light-emitting capability is then removed chemically, and the process repeated in cycles until the point where it becomes unreliable. The resultant stack of images is then analysed by computer, and the DNA sequence of each fragment can be easily ‘read out’ according to the sequence of colour changes at each dot.
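The final ‘read out’ step amounts to a simple lookup repeated for every dot in every image. The sketch below is purely illustrative: the colour assigned to each base, the dot names and the data layout are our own assumptions, and real instruments have to cope with noisy, overlapping signals.

```python
# Hypothetical colour assignment; the real dyes and colours may differ.
COLOUR_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}


def read_out(image_stack):
    """Turn one image per cycle into one sequence per dot.

    Each image is modelled as a dict mapping a dot's name to the colour
    seen at that dot in that cycle.
    """
    reads = {}
    for image in image_stack:  # images are in cycle order
        for dot, colour in image.items():
            reads[dot] = reads.get(dot, "") + COLOUR_TO_BASE[colour]
    return reads


# Three cycles, two dots (made-up data).
stack = [
    {"dot1": "green", "dot2": "red"},
    {"dot1": "blue", "dot2": "red"},
    {"dot1": "yellow", "dot2": "green"},
]
print(read_out(stack))
```

Each extra cycle adds one base to every read at once, which is where the massive parallelism comes from.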
One of these new technology sequencing machines costs about the same as the older sequencing machines to run, but instead of a few hundred sequences each time, we generate hundreds of millions of sequences for the same effort, and this is why they are so powerful. But there is a catch, and in fact there are two catches: first, the sequences are very short, only tens of bases, and secondly, the maximum fragment size is limited, in the simplest version of the technology, to a few hundred bases.
Could one assemble a genome sequence with these short reads, if there were enough of them? Small microbial genomes can be assembled in this way, as they lack the great number and size of the repeat sequences found in genomes like our own. In larger genomes, however, even quite small repeat regions of just a few thousand bases become impenetrable barriers to assembly when the two sequenced ends of your DNA fragments are no more than a few hundred bases apart, and so this new technology is unsuitable for the complete assembly of human-size genomes. The resolution of this apparent problem lay in an unexpected direction – after all, why assemble the human genome again from scratch when all our genomes are very nearly identical? Simply by matching the sequence of these short reads against the existing ‘reference’ genome and laying them out alongside it we can effectively re-sequence the genome for any individual. Furthermore, this works best just where we have the most interest: the relatively unique regions containing genes and their control signals.
But why would we want to sequence other humans if we have already sequenced one? The answer is that a lot of what makes us different from each other is determined by many small differences in our DNA, and next-generation sequencing technology gives us a quick and powerful way of finding these differences. What we see if we line up all the short sequence reads for a given individual against the reference genome is that at many positions the individual has a different sequence to the reference, often over as little as a single base (see Figure 4). In fact we may see two different bases at some positions, and this reflects the inheritance of different genomes from our parents. In order to understand why some of these differences are so important we need to return to the genome sequence and see how signals are encoded in DNA, and how changes to the DNA arise.
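Lining reads up against a reference and looking for differences can be sketched as follows. This is a toy under stated assumptions: the reference and reads are invented, each read is placed by brute force at the position with the fewest mismatches (allowing at most one), and real alignment software is far more sophisticated.

```python
from collections import defaultdict


def best_position(read, reference, max_mismatches=1):
    """Slide the read along the reference; return the best-matching offset."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(r != g for r, g in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[0]):
            best = (mismatches, start)
    return None if best is None else best[1]


def find_differences(reads, reference):
    """Report positions where the bases seen in the reads differ from the reference."""
    pileup = defaultdict(set)
    for read in reads:
        start = best_position(read, reference)
        if start is None:
            continue  # read could not be placed
        for offset, base in enumerate(read):
            pileup[start + offset].add(base)
    return {
        pos: sorted(bases)
        for pos, bases in sorted(pileup.items())
        if bases != {reference[pos]}
    }


reference = "ATGGCGTACGTTAGC"  # made-up reference sequence
reads = ["GCGTATGT", "GTACGTTA", "TATGTTAG"]  # made-up reads from one person
print(find_differences(reads, reference))
```

In this example two reads carry a T and one carries the reference C at the same position, the situation described above in which the two inherited copies of the genome differ.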
The genetic code and signals in the DNA sequence
We have established already that the DNA of our genome is a long and variable sequence of the four bases A, C, G and T. The precise sequence of these bases determines many things: where on a chromosome a gene begins and ends, what the nature and structure of the molecule it produces is, and when and where the gene is active in the body. We can usefully think of the bits of sequence that do these things as signals between the genome and the machinery of the cell which has to translate the signals into meaningful molecular action. If these signals are changed or lost because of damage to our DNA, then our cells may misbehave with sometimes unfortunate consequences.
For example there are two simple signals in the sequence of a protein coding gene: an ATG where the cell should start using the sequence to build the protein, and TAA, TAG or TGA where it should stop. The important point is that if either signal is lost then the gene will not produce the right protein. The converse is also true: if the DNA within the gene sequence is altered so that a new ‘stop’ signal is gained in the wrong place then the protein it produces may be shortened, and in all likelihood will not work as it should (see Figure 5a). There are many other types of signal in our DNA which control the behaviour of genes; some are very precise, and some, to our eyes, look uncomfortably ill-defined, but all have in common the property that changes to the DNA sequence will have effects on the functioning of our cells and our body.
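The start and stop signals, and the effect of gaining a stop in the wrong place, can be shown with a short example. It is a simplified sketch: the sequences are invented, we take the first ATG as the start, and we ignore complications such as the interruptions found within real human genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def coding_region(dna: str):
    """Return the stretch from the first ATG to the first in-frame stop signal."""
    start = dna.find("ATG")
    if start == -1:
        return None  # no start signal
    for i in range(start, len(dna) - 2, 3):  # read three bases at a time
        if dna[i:i + 3] in STOP_CODONS:
            return dna[start:i + 3]
    return None  # no stop signal found in frame


normal = "CCATGGCACAGTTGTAAGG"  # made-up gene
mutant = "CCATGGCATAGTTGTAAGG"  # one base changed: CAG has become TAG

print(coding_region(normal))
print(coding_region(mutant))
```

A single changed base turns an ordinary triplet into a stop signal, and the region the cell would use to build the protein is cut short.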
Fortunately DNA is quite a robust molecule and can generally be repaired by the cell, but it is not immune to alteration. External agents like radiation or chemicals can cause breaks in the DNA, which may be mis-repaired, or they can cause the base at a given position to change. Of course this is damaging the DNA in just one of the body’s many cells. In most cases the effect will probably be insignificant, but not if the damage occurs in one of the specialised cells from which our offspring derive. These germ cells, eggs in women and sperm in men, carry only one copy of the genome, and it is from the fusing of a single sperm with a single egg that a new human being grows. The critical point, for this story, is that any changes which have occurred in the DNA of either of these two parental copies of the genome are now copied into every cell of the new body as it divides and grows. Changes in the functioning of the altered gene may affect any, or all, of the body’s tissues where the protein from that gene is required.
Human genetic variation and complex diseases
When a change in the DNA is created it is called a mutation, but if it is passed on to future generations and becomes established in part of the population we refer to it as a polymorphism – i.e. a difference in the genetic code present in some people at a specific position in their genomes. Small pieces of DNA may be lost or inserted, but the commonest form of polymorphism is the alteration of a single base, or nucleotide, to a different base, and we refer to these as single nucleotide polymorphisms, or SNPs (pronounced ‘snips’). We inherit one complete copy of the genome from each of our parents, so at the site of each known polymorphism we might inherit the ‘normal’ version from one parent, and the ‘deleterious’ version from the other parent, or any other possible combination depending on which versions of the polymorphism our parents had inherited from their parents, and so on (see Figure 5b). We refer to the different versions of the sequence in the populations as alleles. If one allele is clearly deleterious, we may refer to it as a mutation even when established in the population, and we will use this meaning here.
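The possible combinations a child can inherit at one polymorphic site can be enumerated directly. In this minimal sketch the allele labels ‘N’ (normal) and ‘d’ (deleterious) are our own shorthand, and we assume each parent passes on either of their two alleles with equal chance.

```python
from collections import Counter
from itertools import product


def offspring_genotypes(mother, father):
    """Chance of each child genotype, given the two alleles carried by each parent."""
    combos = Counter("".join(sorted(pair)) for pair in product(mother, father))
    total = sum(combos.values())
    return {genotype: count / total for genotype, count in combos.items()}


# Both parents carry one normal and one deleterious allele.
print(offspring_genotypes(("N", "d"), ("N", "d")))
```

With two such parents, a quarter of the possible outcomes carry two normal alleles, half carry one of each, and a quarter carry two deleterious alleles.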
What are the effects on each of us of carrying these polymorphisms around in our DNA? Human variation, by which we mean everything that distinguishes us as individuals, from hair and eye colour, to height and weight, and the ability to play chess or make music, is the combination of two things: our genetic inheritance and the environment in which we grow up. Some of these, like eye colour, are purely genetic in quite a simple way, i.e. the eye colours of our ancestors determine the possible eye colours we may be born with. But some characteristics are much more complex, and in ways we are only beginning to understand. For example, we know that height is a combination of genetic and environmental factors. By careful study of family relationships we estimate that well over half of human height variation should come from our genes, but even with masses of data on this most measurable of traits, we cannot yet identify all the genes that are contributing.
We can make a broad distinction between simple traits and complex traits. With simple traits we can generally track down the causative gene with ease, and indeed some have been known for many years. Complex traits are more of a puzzle, and we do not yet have the ability to find all the contributing genes. Where this creates difficulties for us is in developing our understanding of the causes of genetically complex diseases, like diabetes, asthma, cancer, autism and schizophrenia.
As with other simple traits, the genetic causes of simple or monogenic diseases, such as cystic fibrosis and sickle cell anemia, are well known and we can identify specific mutations in a given gene as the cause with some degree of certainty. If you are unfortunate enough to have inherited such a genetic variant you will almost certainly develop the disease at some point. With complex diseases we only know enough to say that if any of your parents or grandparents had the disease, then there is some likelihood, greater than in the general population, that you may develop the disease yourself. We do know some of the contributory genetic variants and affected genes but we have a pretty good idea that, even for well studied diseases, there are many we have not found yet.
Genome wide association studies
Early attempts to track down the genes involved in complex diseases using data from affected families largely failed, probably because we underestimated the numbers of genes involved. Now that we have catalogued large amounts of human genetic variation we can use a more brute force approach.
Genome wide association studies (GWAS) rely on being able to determine the genetic status of many individuals (i.e. which allele they have) at many known polymorphic sites in the genome, a process we call genotyping. Initially we have no idea which polymorphisms, and hence which genes, are associated with a disease being investigated, but by studying the genotypes of a sufficiently large group of people with the disease, and comparing them with a similar-sized group of healthy people, we can begin to track down the genes involved. For polymorphisms that have an effect on the likelihood of you getting a disease, one of the alleles will predispose you to the disease, and the other will be protective. This should show up as a statistically significant difference in allele frequency between the two groups in the study: i.e. the disease group are more likely to carry the disease-associated allele than are the control (healthy) group. This both identifies the polymorphism as being disease-associated, and tells us which allele increases our risk of developing the disease. In contrast, polymorphisms that are not associated with the disease will show the same allele distribution in both groups. The outcome from this type of experiment is therefore a list of genomic locations where there are polymorphisms whose alleles are to some extent predictive of our disease status.
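The comparison at a single polymorphic site can be sketched with a standard chi-square test on a two-by-two table of allele counts. The counts below are invented for illustration (1,000 patients and 1,000 controls, two alleles each), the function name is our own, and a real study applies much stricter significance thresholds because so many sites are tested at once.

```python
import math


def allele_association(case_risk, case_other, control_risk, control_other):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts."""
    a, b, c, d = case_risk, case_other, control_risk, control_other
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail probability of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio


# Hypothetical counts of one allele versus the other in each group.
chi2, p_value, odds_ratio = allele_association(600, 1400, 500, 1500)
print(f"chi2={chi2:.2f}  p={p_value:.2g}  odds ratio={odds_ratio:.2f}")
```

An odds ratio above one says the allele is commoner among patients than controls; the p-value says how surprising that difference would be if the allele had no connection with the disease.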
The first surprise is that, for a given disease, these lists are quite long; recent studies in type-I diabetes and Crohn’s disease have suggested more than 30 genetic risk indicators in both cases. The second surprise is that although we can estimate the genetic contribution for each polymorphism associated with a given disease, adding them all together falls some way short of explaining the proportion of that disease we expect to be genetic from a study of its behaviour in families. A pervasive concern is that this ‘missing heritability’ may be in very rare alleles, found in only a small proportion of the population, and that these polymorphisms will, as a consequence of their rarity, be very difficult to find. So the picture we have now of complex diseases is that they are the result of possibly quite subtle failures in some, but not all, of the genes associated with developing the disease; and it is proving quite difficult to track down all the genes involved, and even more so to understand precisely what functional role they each play.
Let us try and pull these diverse threads together. We have already assembled a complete version of the human genome – a reference copy – using ‘old fashioned’ long-read sequencing, and this was expensive and time consuming. Now we have new sequencing technologies that can easily produce many millions of short sequence fragments of our own DNA. These cannot be ‘assembled’ in the traditional sense, but they can be lined up against the reference genome, to see where our own DNA differs from the reference. We note in passing that the reference genome is just another person’s genome and is no more ‘right’ than our own, it is just the one we sequenced first. We also have an extensive catalogue of human polymorphisms. Some of these we know indicate clear causality for simple diseases; many are associated, with varying degrees of risk but little clear causality, with a steadily growing list of complex diseases. The vast majority, however, have a quite unknown effect, even those which are clearly likely to affect the functioning of a gene.
So it is entirely possible to have one’s genome ‘sequenced’. The cost would be measured in thousands, rather than tens of millions, of dollars, and the data would be ready in just a few weeks. The important stuff will be there, as the re-sequencing approach works just where it is needed: in the regions of relatively unique sequence where the genes lie and are controlled.
What would we learn from our genome? For each polymorphism that was known or suggested to be associated with a disease we could get a read-out of the two alleles we have at that position, our genotype, and from that an indication of the risk of having or developing that disease. For complex disease the excess risk for each polymorphism can be aggregated to give an estimated overall risk, although our confidence in these calculations is not as high as we would like. Some of these pieces of information may help us make beneficial adjustments to our lifestyle: for example if our genetic profile indicates a 10% additional risk for type-II diabetes then we might make extra effort not to put on weight, as that is a known aggravating factor that we can control. For known monogenic diseases there would probably be no big surprises. These are mostly rare and mostly severe, so the likelihood is that you already know you have it, or for later onset disease, that it is in your family and you already know the chance of developing it. The same might also be true for some complex diseases where we have specific genetic tests for some of the contributory factors, like breast cancer.
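One simple way to aggregate per-polymorphism risks, offered here only as an assumption and not as the method any particular company uses, is to multiply the relative risks together and scale a baseline population risk. All the numbers below are invented, and the approach assumes the polymorphisms act independently, which is one reason our confidence in such estimates is limited.

```python
import math


def combined_relative_risk(relative_risks):
    """Multiply per-polymorphism relative risks (assumes they act independently)."""
    return math.prod(relative_risks)


baseline_risk = 0.05                # hypothetical population risk of the disease
genotype_risks = [1.10, 0.95, 1.20]  # hypothetical relative risks for our genotypes

combined = combined_relative_risk(genotype_risks)
estimated_risk = baseline_risk * combined
print(f"combined relative risk {combined:.3f}, estimated risk {estimated_risk:.2%}")
```

Note that protective alleles (relative risk below one) pull the estimate down just as risk alleles push it up.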
There is a further twist: although we have made progress in understanding the genetic basis for disease, even in those cases where we can pinpoint the precise causative mutation in a particular gene it remains an extremely challenging problem to figure out how to repair broken genes in vivo. We hope that in the future we will be able to use gene therapy, where we introduce a new copy of a gene into the genome in the cells where it is needed, but just now this is a risky and somewhat controversial approach. A more promising short term gain may be in the field of personalised medicine, which starts from the observation that the effectiveness of some drugs can be quite variable between individuals, and this may have a genetic component. Our genotype may lead to more appropriate drug treatment.
We may be curious to know everything that is in our genome, but there may be unpleasant surprises that we are not prepared for, and can do nothing about. The arguments here are not much different from those associated with genetic counselling, where people at known risk of a genetic disorder are contemplating taking a test: what is the value of knowing? If there are no cures for what you find, are you better off not knowing? But what if you are contemplating having a family? This brings us to our final point, and that is that in some senses the information in your genome is not yours alone. Choices you might make to ‘know’ your genetic profile will affect people you share your genes with: your parents, your siblings and your children. There are some profound ethical issues here, and, at least until we can make more and better use of the knowledge gained, it may be best to think carefully whether or not to read Pandora’s genome.