If you’ve been genotyped by 23andMe, AncestryDNA, MyHeritage, Family Tree DNA (FTDNA), or Living DNA, you may believe you had all of your DNA sequenced. What if you learned that 23andMe only genotypes around 0.02% of your DNA? Whole genome sequencing gives you over 4,000 times more data and over 900 times more known genetic variants. Yes, you read that right: over four thousand times more data!
Surprised? You are not alone. You can now get affordable Whole Exome Sequencing (WES) and Whole Genome Sequencing (WGS) from several providers. With Whole Genome Sequencing, you can sequence close to 100% of your DNA.
The Difference Between Genotyping and Whole Genome Sequencing
Comparing consumer genotyping data to Whole Genome Sequencing data is like comparing a small mound to a mountain. Or, for a better perspective, it’s like comparing a 1/4000 scale Star Wars Imperial Star Destroyer to an actual Imperial Star Destroyer!
Don’t get me wrong, a 1/4000 scale Imperial Star Destroyer is pretty cool. But do you really think it can take on a real Star Destroyer? I think not! Jokes aside, you honestly can’t compare genotyping to sequencing. It’s like comparing apples to oranges.
Genotyping is like picking specific words from a page, and sequencing is like reading and storing the entire book. With genotyping, if you want a word from the book that was not on the list, you are out of luck. You have to be genotyped again, looking for that specific word. In more scientific terms, genotyping gives you selected Single Nucleotide Polymorphisms (SNPs) and Insertion/Deletion variants (Indels). Whole Genome Sequencing gives you the SNPs, the Indels, and the rest of your DNA as well.
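For readers who think in code, here is a minimal Python sketch of the book analogy. Everything in it is made up for illustration: the short sequence, the three "chip" positions, and the lookup helper are assumptions, not real data or any provider's actual format.

```python
# Toy "book": a short made-up DNA sequence (not real data).
genome = "ACGTTAGCCATGACGTAGCT"

# A genotyping chip only reports a preselected set of positions (0-based here).
chip_positions = [2, 7, 15]
genotype_data = {pos: genome[pos] for pos in chip_positions}

# Sequencing aims to record every position.
sequence_data = {pos: base for pos, base in enumerate(genome)}


def lookup(data, pos):
    """Return the base at pos, or None if that position was never measured."""
    return data.get(pos)


print(lookup(genotype_data, 7))  # a position that is on the chip
print(lookup(genotype_data, 8))  # not on the chip, so nothing to report
print(lookup(sequence_data, 8))  # sequencing has it
```

The second lookup comes back empty, which is the "out of luck" case: the only way to fill it in is to measure again.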
I borrowed this comparison from Helix. Perhaps they state this concept better than I did in this video:
Back to the Book Analogy
When sequencing, we use some pretty strange techniques. We basically take your book (your DNA), shred it into tiny fragments, and then xerox it 30 times (for 30x sequencing). These fragments are usually around 100-250 letters long (scientifically referred to as base pairs). So it’s usually shredded into somewhere between 12 million and 30 million unique parts depending on the type of sequencing. Counting the xeroxed copies, there are around 360 million to 900 million fragments.
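If you want to check the arithmetic, here is a rough sketch in Python. It assumes a genome of about 3 billion letters, which is a round number I picked to reproduce the figures above, and it ignores the fact that real fragments overlap at random rather than tiling the book neatly.

```python
# Round numbers assumed for illustration.
GENOME_LENGTH = 3_000_000_000  # approximate human genome size in base pairs
COVERAGE = 30                  # "30x" sequencing


def fragment_counts(read_length, genome_length=GENOME_LENGTH, coverage=COVERAGE):
    """Return (unique_fragments, total_fragments) for a given read length."""
    unique = genome_length // read_length
    total = unique * coverage
    return unique, total


for read_length in (250, 100):
    unique, total = fragment_counts(read_length)
    print(f"{read_length} bp reads: {unique:,} unique, {total:,} with copies")
```

With 250-letter fragments this gives 12 million unique pieces and 360 million with copies; with 100-letter fragments it gives 30 million and 900 million.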
After we are done shredding the book and xeroxing the shreds, we take all these shreds and figure out where they belong again to the best of our ability. We reference another book to do this. This reference book differs slightly from our book. But the purpose of this process is to determine what the differences are between our book and the reference book. And we can almost put this book completely back together even with the slight differences. About 5% of the book is too difficult to assemble. But it’s the best we can currently do!
After we finish assembling the book, we store all the pieces with a map and index of exactly where they belong. We call this book “My Genome.” We don’t usually discard the pieces we couldn’t map. Instead, we store them in an envelope called “unmapped reads.” These unmapped reads may belong to parts of the book that couldn’t be assembled. Or they may be DNA from things that contaminated the book: viruses, bacteria, archaea, fungi, and parasites. Gross! There are a lot of germs on books!
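Here is a toy version of "figuring out where the shreds belong," sketched in Python. It simply slides each shred along the reference and keeps the spot with the fewest mismatched letters. This brute-force approach is my own simplification for illustration; real alignment tools use indexed, far faster methods. The sequences, the function name, and the one-mismatch limit are all made up.

```python
def map_read(read, reference, max_mismatches=1):
    """Return (position, mismatches) for the best placement, or None if unmapped."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        # Keep the first placement with the fewest mismatches seen so far.
        if best is None or mismatches < best[1]:
            best = (start, mismatches)
    if best is not None and best[1] <= max_mismatches:
        return best
    return None  # too different from the reference: goes in the "unmapped" envelope


reference = "ACGTTAGCCATGACGTAGCT"  # made-up reference "book"

print(map_read("TAGCCA", reference))  # exact match
print(map_read("ATGTCG", reference))  # one letter differs from the reference
print(map_read("GGGGGG", reference))  # nothing close enough
```

The second shred still finds its place despite a one-letter difference, and that difference is exactly the kind of thing we are looking for. The third shred has no good home and would be set aside.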
In technical jargon, the shredded book is a set of random reads of your DNA stored in compressed computer files called FASTQ files. A high-end computer or a cloud computing service then takes the reads in these files and works out where each one belongs on the reference genome. The computer stores the reads and the map in a compressed binary file called a BAM (Binary Alignment Map) file, along with an index for fast lookup (usually kept as a small companion file). These files can vary in size, but for a whole genome, it’s not uncommon for the FASTQ and BAM files to be around 50-120 gigabytes each and about 100-240 gigabytes combined.
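To give a feel for what a FASTQ file holds, here is a small Python sketch that reads two made-up records. It assumes the common layout of four lines per read (an ID line starting with "@", the letters, a "+" line, and a quality string) and the Phred+33 quality encoding. The reader function and the example reads are hypothetical, and real files are usually gzip-compressed and vastly larger.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        # Phred+33: each character's ASCII code minus 33 is its quality score.
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Two made-up reads, not real data.
example = "@read1\nACGT\n+\nIII#\n@read2\nTTGA\n+\n!5?I\n"
records = list(read_fastq(io.StringIO(example)))

for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

Notice that a record says nothing about where the read came from in the genome. That only gets added when the reads are mapped.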
But this book is very large and hard to carry around. And we are mostly interested in the differences between “My Genome” and the reference book. So we scan the assembled version of this book and create notes of where all the differences are. We also reference the xeroxed copies to determine not only how many times each sentence of the book was read, but how accurately we read it. Remember, when you xerox things a lot of times, parts will inevitably come out blurry and it can be hard to read each letter or word. If the accuracy is good, we mark things as pass. If we can’t tell exactly what the book says, we mark why it didn’t pass. We then have a notebook filled with everything that’s different between “My Genome” and the reference book and whether or not it passed our accuracy check.
Going technical again, what we are doing is generating what’s called a VCF (Variant Call Format) file from the BAM file. The VCF file tells us where our genome didn’t match the reference genome. In simpler words, this helps show us where our genetic polymorphisms and mutations are. This file is compressed to save disk space and make it more portable. VCF files are typically between 150 and 300 megabytes for a Whole Genome Sequence. But they can be a little bigger or smaller than this depending on the tools used to make them.
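The note-taking step can be sketched in a few lines of Python. This is a deliberately naive version: it stacks the mapped shreds on top of the reference, takes the most common letter at each spot, and writes a note wherever that letter differs. The minimum depth of 3 and the "LowDepth" label are assumptions I made up for the example. Real tools use statistical models and the per-letter quality scores instead of a simple vote.

```python
from collections import Counter, defaultdict


def call_variants(reference, mapped_reads, min_depth=3):
    """Return notes of (position, ref_base, seen_base, depth, status)."""
    pileup = defaultdict(Counter)
    for start, seq in mapped_reads:
        for offset, base in enumerate(seq):
            pileup[start + offset][base] += 1

    notes = []
    for pos in sorted(pileup):
        counts = pileup[pos]
        depth = sum(counts.values())  # how many reads covered this position
        base, _ = counts.most_common(1)[0]
        if base != reference[pos]:
            status = "PASS" if depth >= min_depth else "LowDepth"
            notes.append((pos, reference[pos], base, depth, status))
    return notes


reference = "ACGTTAGCCATG"  # made-up reference
# (start position, letters) for reads that were already mapped
mapped_reads = [(0, "ACGTAA"), (1, "CGTAAG"), (2, "GTAAGC"), (6, "GCCTTG")]

for note in call_variants(reference, mapped_reads):
    print(note)
```

The first difference is seen in three separate reads, so it passes. The second shows up in only one read, so it gets flagged instead of trusted.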
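As a rough illustration of what is inside a VCF file, here is a Python sketch that reads two made-up variant lines. It relies on the standard tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and skips header lines that start with "#". The positions, the "LowQual" filter value, and the simple SNP/Indel labeling rule are invented for the example.

```python
def parse_vcf(text):
    """Return a list of simple dicts, one per variant line."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt, _qual, filt = line.split("\t")[:7]
        first_alt = alt.split(",")[0]  # only look at the first alternate allele
        if len(ref) == 1 and len(first_alt) == 1:
            kind = "SNP"
        elif len(ref) != len(first_alt):
            kind = "Indel"
        else:
            kind = "Other"
        variants.append(
            {"chrom": chrom, "pos": int(pos), "ref": ref,
             "alt": first_alt, "filter": filt, "kind": kind}
        )
    return variants


# Made-up example content, not real data.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t12345\t.\tA\tG\t50\tPASS\t.\n"
    "1\t22222\t.\tCT\tC\t12\tLowQual\t.\n"
)

variants = parse_vcf(example_vcf)
for variant in variants:
    print(variant)
```

Each line is one note from the notebook: where the difference is, what the reference says, what your genome says, and whether it passed the accuracy check.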
Summary of What We Just Learned
Wow, that was a lot. But in summary:
- Consumer genotyping gives you about 0.02% of your DNA while Whole Genome Sequencing gives you roughly 4000x that amount.
- Consumer genotyping represents a tiny fraction of your DNA while Whole Genome Sequencing represents close to 100% of your DNA.
- For 30x sequencing, random reads of your DNA are read 30x on average and stored in files called FASTQ files. Since these reads are random, they are not yet mapped to any position in the genome.
- We use intensive computing (usually cloud computing) to map the reads in these FASTQ files to the reference genome and store the result in a BAM file. This BAM file contains all the data in our FASTQ files along with a map of where the reads align to the reference genome, plus an index for fast lookup.
- From this BAM file, we make a more compact file called a VCF file that tells us exactly where we differ from the reference genome. These are our variants and mutations. VCF files contain single-letter differences from the reference genome (SNPs) and insertions or deletions of one or more letters (Indels).
If you are still confused, I recommend watching this two-minute video titled What is Genomic Sequencing? from Mayo Clinic:
I hope that helped explain the difference between genotyping and sequencing and explained the concepts of sequencing without breaking your brain. I’m pretty sure I broke my brain a few times trying to write this, so don’t feel bad if your brain feels broken too.
If you have any feedback or think I got a concept wrong, please let me know in the comments.