Advances in genomics research have ushered in a new era of analysis of our physiological and behavioral traits, uncovering insights about our identity and its determinants. This research links our phenotype (observable characteristics) with our genotype (genetic composition) in hopes of identifying common attributes that allow us to better understand and predict differences in our personality and health. The human genome, however, is remarkably large: it contains three billion base pairs of genetic material, which, if written out, would fill over 200 New York City telephone books (averaging 1000 pages each) [1]. It is also remarkably complex: there are roughly 20,000 genes and even more regions that control how these genes are expressed. Recently developed techniques in machine learning offer a new opportunity to understand genetic differences at scale, giving companies like 23andMe the chance to drive consumer applications in precision medicine and genetic lifestyle analysis.
The Big Problem:
The Human Genome Project, completed in 2003, offered our first mapping of the human genome, identifying all of the roughly 20,500 genes in human DNA and determining the sequence of its 3 billion chemical base pairs [2]. However, more than a decade after its completion, scientists are still looking to establish more causal relationships between genotype and phenotype. Why has progress been slow? Two primary factors hinder it: scale and decoding complexity. On scale, small variations in genes and regulatory regions are ultimately what make each of us unique (and, unfortunately, can result in disease), and these small variations are very hard to detect within such a large data set. On decoding complexity, manual genome sequencing is still very error prone and may lead to misread DNA elements.
The Cluster of Solutions:
To tackle these issues, 23andMe has used statistical techniques such as deep neural nets to understand correlations in genetic data, generate hypotheses for clinical validation, and standardize the genome reconstruction process. This is primarily done in two ways:
- Identify connections between genotypic and phenotypic patterns using unsupervised machine learning, a technique that clusters genes by their expression in cells and tissues.
- Improve sequencing methodologies by transforming the genome reconstruction problem into an image classification problem, which results in greater speed and accuracy [3].
Unsupervised machine learning groups individuals within an unlabeled data set, surfacing shared traits that are relevant to the genetic factors driving our susceptibility to disease. 23andMe applies this method to ancestry data as well, mapping the locations in your genome that carry small amounts of information about your family’s geographic history.
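To make the clustering idea concrete, here is a minimal sketch of k-means, one common unsupervised method, applied to toy gene expression profiles. The gene names, tissue values, and the choice of k-means itself are illustrative assumptions; nothing here reflects 23andMe's actual data or pipeline.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means: returns a cluster index for each point."""
    # Deterministic start: use the first k points as the initial centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical expression levels for each gene in three tissues.
profiles = {
    "gene_a": [9.1, 8.7, 0.4],
    "gene_b": [0.3, 0.5, 7.9],
    "gene_c": [8.8, 9.3, 0.2],
    "gene_d": [0.6, 0.2, 8.4],
    "gene_e": [9.4, 8.9, 0.7],
}

labels = kmeans(list(profiles.values()), k=2)
clusters = {}
for gene, label in zip(profiles, labels):
    clusters.setdefault(label, []).append(gene)
print(clusters)
```

No labels are supplied, yet genes with similar expression patterns end up in the same group. Real analyses work with thousands of genes and many more dimensions, and typically need a more careful choice of initialization and of k.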
23andMe's use of machine learning to improve sequencing methodologies has likewise improved accuracy. The genome sequencing process begins with high-throughput sequencing (HTS), a method that produces roughly 1 billion short sequences of bases that are, unfortunately, not organized into a human-recognizable genome sequence [4]. Deep learning methods reconstruct a true genome sequence (guanine, cytosine, adenine and thymine organized into 23 pairs of chromosomes) from HTS sequencer data with significantly greater accuracy than previous classical methods [5].
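One way to picture the "image classification" framing is to stack the short reads that align to a stretch of the reference genome into a grid, much like the pixels of an image. The sketch below is a simplified illustration under assumed conventions: the two-channel layout, the integer base codes, and the function name `encode_pileup` are hypothetical and are not the encoding used by any particular variant caller.

```python
# Integer codes for the four bases; 0 is reserved for "no read coverage".
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}

def encode_pileup(reference, reads):
    """Encode aligned reads as a grid indexed [row][column][channel].

    reads is a list of (start_offset, sequence) pairs aligned to reference.
    Channel 0 holds the base code; channel 1 is 1 where a read disagrees
    with the reference. Row 0 is the reference itself.
    """
    width = len(reference)
    rows = [[[BASE_CODE[b], 0] for b in reference]]
    for start, seq in reads:
        row = [[0, 0] for _ in range(width)]
        for i, base in enumerate(seq):
            pos = start + i
            if 0 <= pos < width:
                row[pos] = [BASE_CODE[base], int(base != reference[pos])]
        rows.append(row)
    return rows

# Toy example: two of the three reads carry an A where the reference has a T.
reference = "ACGTAC"
reads = [(0, "ACGT"), (1, "CGAAC"), (2, "GAAC")]
image = encode_pileup(reference, reads)

# Count disagreements per column across the read rows.
mismatch_counts = [
    sum(row[col][1] for row in image[1:]) for col in range(len(reference))
]
print(mismatch_counts)
```

A classifier trained on many such grids could then learn to distinguish a genuine variant (a column where several reads consistently disagree with the reference) from an isolated sequencing error.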
Our Genetic Future:
Applications of machine learning in genomics have enabled a wave of research and consumer innovation towards understanding how our DNA influences who we are. However, genetic makeup is only one piece of our map of human physiology. The next wave of innovation will arise from combining this data with other clinical research to build a true picture of health and disease. Such combinations can uncover additional findings, including how our phenotype expression changes under varying environmental conditions (e.g. lack of sleep, high levels of exercise), through comparison with other clinical measurements.
Machine learning applications in genetics research can produce generative innovations as well. By learning the characteristics of genotypes that produce desired phenotype expressions, we may begin to edit and change our own genomes to reduce our susceptibility to disease.
While ripe with promise, machine learning applications in genetics raise deep philosophical questions about the nature of privacy, data sharing, and the human condition. Critics push back on data sharing and commercial opportunities: should we allow 23andMe to negotiate contractual agreements to share data with third parties? And should the company be able to profit from its genetic analysis [6]?
Perhaps more broadly, these applications provoke questions about what makes us human. If we are able to sequence and modify our own genome to optimize for desired traits, should we? Are we sacrificing our morality to improve our material wealth? Genomics research can answer many questions about our physiology, but these questions will require a deeper philosophical reflection into the essence of our humanity and values.
[1] “NOVA Online | Cracking The Code Of Life | Genome Facts”. 2018. PBS.org. https://www.pbs.org/wgbh/nova/genome/facts.html.
[2] “Human Genome Project Information”. 2018. Web.ornl.gov. https://web.ornl.gov/sci/techresources/Human_Genome/index.shtml.
[3] 2018. bioRxiv. https://www.biorxiv.org/content/biorxiv/early/2016/12/14/092890.full.pdf.
[4] Thangaraj, Andrew. 2018. “Evaluating DeepVariant: A New Deep Learning Variant Caller From The Google Brain Team”. Inside DNAnexus. https://blog.dnanexus.com/2017-12-05-evaluating-deepvariant-googles-machine-learning-variant-caller/.
[5] 2018. Permalinks.23andme.com. https://permalinks.23andme.com/pdf/23_17-GeneticWeight_Feb2017.pdf.
[6] Brodwin, Erin. 2018. “DNA-Testing Company 23andMe Has Signed A $300 Million Deal With A Drug Giant. Here’s How To Delete Your Data If That Freaks You Out.” Business Insider. https://www.businessinsider.com/dna-testing-delete-your-data-23andme-ancestry-2018-7.