Still struggling with genome assemblies?

06.07.2017 | Detlef Janssen

Circos plot: Number of contigs based on short-read sequencing and long-read sequencing
Circos plots of the assembled genomes [5]. Alignments of the HGAP assembly of the INVIEW DE NOVO GENOME 2.0 data (left) and the Velvet assembly (right) of the short-read Illumina data against the E. coli DH1 reference are shown. Regions of homology are highlighted by colored ribbons.

Sequencing and assembly of whole genomes, even small ones, was cost- and labour-intensive for a long time. The advent of next-generation sequencing simplified this process dramatically.

But a single contig in genome assembly is nearly impossible to reach with sequencing based on short reads. The ultra-long reads gained through PacBio’s SMRT sequencing technology, combined with optimised bioinformatics analysis, enable high-quality genome assemblies. Additionally, plasmids present in your DNA preparation, which often carry valuable information, will be assembled as well.

Read the full story of how to proceed towards a “single contig” genome assembly with INVIEW DE NOVO GENOME 2.0.

Infographic: Evolution of Sanger sequencing - 1970 in comparison to 2017

Five years ago, Cologne welcomed a distinguished biotech company to the city. GATC Biotech had just centralised its Sanger sequencing laboratories from Constance, London and Düsseldorf to Cologne. The Cologne lab was equipped with technology spread over 470 m² of space and was staffed with 20 experienced employees eager to get busy processing Sanger sequencing samples.

The location of the laboratory, in close proximity to the Cologne airport, made overnight Sanger sequencing an accepted and expected standard for GATC customers all over Europe. The company established a precedent in speedy sequencing and high quality data and is proud to deliver the fastest results to this day.

In the last five years, the Cologne laboratory has nearly doubled in size. Currently, 27 employees work round the clock to deliver quick, reliable data to GATC’s Sanger sequencing customers. To honour the lab’s birthday, we took a brief walk down GATC’s Sanger sequencing memory lane with our Director of Sales & Customer Care, Mr. Jochen Schäfer.

1. What machines were used for Sanger sequencing in the 1990s?

In the early 1990s, we used our self-developed machines called “Direct Blotting System GATC 1500”. In 1995, the first 48-hour sequencing service was established with the ABI 373, the so-called “Plate Sequencer”. In 2000, we changed over to capillary sequencing with the ABI 3700 and introduced the world’s first 24-hour sequencing service. Since 2005 we have been using the ABI 3730 XL, and since 2006 we have been offering overnight sequencing with our NightXpress service.

2. What did the data look like back then?

In the 1990s the read length was only 450 bases. In the following years it grew to 650 bases, and now we reach up to 1,100 bases.

3. How long did it take to process and deliver the data?

At the beginning, it took 48 hours to sequence the samples; in 2004 it took approximately 24 hours, and since 2007 we have offered the “NightXpress” service. Customers drop off their samples in the evening and get their results in the morning between 8 a.m. and 11 a.m.

4. How was the data delivered?

At the beginning, data was delivered on floppy disks, in the late 1990s by e-mail and currently via the Internet. Customers receive their results directly in the Watchboxes of their online myGATC accounts.

5. What is the GATC “Watchbox” and how did this name come to be?

In general, a “watch box” is a small container where you can store your watches or, in our case, your sequencing data. With such a box you can keep track of the time or, in our case, keep detailed track of all your sequencing samples. At GATC, this analogy was adopted around the time the first LIMS was introduced in 1999. Customers can track all their samples with a Watchbox, which makes it an essential part of GATC’s processes.

6. How much did sequencing cost?

In 1990 the selling price was 20 to 25 DM (Deutsche Mark) for one Sanger read. By 2004, the price had changed to 12 to 15 EUR, and with the ABI 3730 we reduced the selling price to under 10 EUR. Today a customer can purchase a barcode for a Sanger read starting at a couple of euros, depending on the sequencing service.

7. How has GATC Biotech innovated the Sanger sequencing field?

GATC Biotech holds a lot of industry firsts. We had the first non-radioactive sequencer, the first 24-hour sequencing service, the first online ordering system for the life sciences and the first overnight sequencing service with results the next morning after sample receipt. (See our latest video about Logistics@GATC: Böxle on the road)

In addition, we were one of the first to introduce a barcoding system, where each sample is identified by a unique barcode. This made full automation of the Sanger sequencing workflow possible, facilitated the ordering process and enabled easy online sample tracking for our customers.

8. How do you see the Sanger sequencing market changing?

It is getting faster and faster, because the orders are increasing and the applications are getting more diverse. We celebrated our one-millionth Watchbox in February 2015 and the two-millionth in December 2016, and we expect the three-millionth within 12 months.

9. What products does GATC Biotech offer for Sanger sequencing today?

We offer a variety of Sanger sequencing products. Our LIGHTRUN service is our simplest and most convenient sequencing service for both tubes and plates. The service offers quick, reliable sequencing of DNA samples pre-mixed with primer.

Our SUPREMERUN service is ideal for more challenging templates. This product is also available for single samples or high-throughput Sanger sequencing. The DNA and primer are provided separately. Here, the customer can choose freely from our expansive list of universal primers. 

Cell culture contaminated with Mycoplasma
Image by courtesy of Multiplexion

Mycoplasma detection is an essential task for any cell culture laboratory. Nearly all prestigious scientific journals require evidence of absence of mycoplasma contamination before publication of data from immortalised cell lines. A 2013 survey in Australia and New Zealand found that about 75% of participating researchers perform mycoplasma testing. Interestingly, when testing was performed, 18-20% of scientists detected contamination in at least one sample (Shannon et al. 2016). 

The survey also found that 32% of researchers perform mycoplasma detection in their own laboratories, whereas others choose to use in-house services or external providers for mycoplasma testing (Shannon et al. 2016).

Here are three tips for how to make mycoplasma testing more efficient and less tedious:

1. Find a trustworthy test – Choose the test that is most convenient and dependable for you. Many researchers opt for the PCR method, as it is the quickest option and it picks up the majority of mycoplasma species. The sensitivity of the technique is increased when using qPCR and standardised protocols, such as the ones suggested by the “World Health Organization International Standard to Harmonize Assays for Detection of Mycoplasma DNA” (Nübling et al., 2015).

2. Establish a strict mycoplasma testing program and follow it regularly – Importantly, test all actively growing cell lines at defined time intervals. Typically, scientists test monthly to quarterly depending on the volume of cell culture work and on individual risk assessment. If you receive new, non-tested cell lines, quarantine them until they test negative for mycoplasma contamination. When possible, maintain cell cultures for two to three months only, then discard them and replace them with fresh vials from the same pre-tested working stocks.

3. Consider outsourcing – One way to make the task of mycoplasma testing more pleasant is to have someone else do it for you. Consider outsourcing to a service provider to save resources and material costs in your own laboratory and to gain access to trained specialists with mycoplasma detection expertise. Outsourcing allows you to save people-power and invest scientists’ time in more productive experimental work. There is also no need to train new staff in the methods of mycoplasma testing. Moreover, you will gain the advantage of objectivity, as an independent party is more likely to judge the testing results impartially. Consider the many advantages services like MYCOPLASMACHECK can offer, including quality-certified testing with proper controls and with no risk of cross-contamination, as well as results delivered in reports ready for journal submission.


Corral-Vazquez C. (2017). Cell lines authentication and mycoplasma detection as minimun quality control of cell lines in biobanking. Cell and Tissue Banking 18(2):271-280.

Shannon M. et al. (2016). Is cell culture a risky business? Risk analysis based on scientist survey data. Int J Cancer 138(3):664–670.

Nübling CM et al. (2015). World Health Organization International Standard to Harmonize Assays for Detection of Mycoplasma DNA. Applied and Environmental Microbiology, 81(17):5694-702.

Davis L. (2015). The risky business of cell culture. Retrieved from (June 2, 2017): 

Infographic: Liquid biopsy market overview

As a non-invasive test for cancer research and diagnostics, liquid biopsy has already gained lots of traction. The market is hot and business analysts are anticipating more future growth as liquid biopsy is increasingly adopted by healthcare providers. The clinical acceptance of liquid biopsies will likely be boosted by several advantages the tests have over traditional tissue biopsies. Some of these benefits include lower total test cost, quicker turnaround times, ability to capture tumour heterogeneity, ability to monitor recurrence and the minimally invasive nature of liquid biopsies.

Currently, three major biomarkers are explored by liquid biopsies. A report by Research and Markets identifies over 50 liquid biopsy tests that are presently offered on the market. Of these, 50% of the tests are based on detection of cancer biomarkers in circulating tumour DNA (ctDNA). Roughly 37% of the tests are based on characterisation of circulating tumour cells (CTCs) and the remaining 13% draw conclusions from exosome analysis. 

A Kalorama information report shows that the most common genes currently analysed in cell-free DNA (cfDNA) include BRAF, EGFR, ESR1, KRAS, MET, PIK3CA, TP53, KIT and PDGFRA. The report points to a variety of clinical uses of liquid biopsies in oncology including early detection, identification of mutations for targeted therapy, patient stratification, companion diagnostics, tracking of minimal residual disease, characterisation of molecular heterogeneity, monitoring of tumour dynamics and metastases, cancer prognosis and others. 

Financially, liquid biopsies are expected to be a lucrative investment. Research and Markets predicts that the global liquid biopsy market will reach nearly $4.5 billion by 2020. The cancer application segment is expected to make up $2.5 billion of the market. Research and Markets also predicts that four major cancer types (prostate, breast, colorectal and lung cancer) will be the main market drivers by 2030, accounting for over 70% of the total liquid biopsy market.

Convinced of the potential of liquid biopsy to transform patient care, GATC Biotech has established a unique service line for non-invasive analysis of cfDNA. GATCLIQUID offers three services for accurate tumour mutation profiling from blood. GATCLIQUID ONCOEXOME is a unique service for whole exome sequencing of cfDNA that provides an unbiased overview of all mutations in protein coding regions. GATCLIQUID ONCOPANEL ALL-IN-ONE is a next-generation sequencing based cancer panel that offers targeted screening of key cancer drivers. GATCLIQUID ONCOTARGET enables ultra-sensitive monitoring of the most important tumour mutation in a given case. Together, the services aim to improve cancer research and diagnostics today and in the years to come.


Kalorama Information. (2017). Cell-free DNA (cfDNA): Market Size and Share Analysis (Report No. KLI15188961).

Research and Markets. (2016). Liquid Biopsy Research Tools, Services and Diagnostics: Global Markets (Report No. 3632954).

Research and Markets. (2015). Non-Invasive Cancer Diagnostics Market, 2015-2030 (Report No. 3454294).

Figure 1: Example of an “index hopping” measurement using a lane with 6 ChIP libraries
Figure 2: Example of an “index hopping” measurement using a lane with 8 RNA libraries
Figure 3: Plots of analysis of “index hopping” events
Figure 4: Library preparation for HiSeq 4000 at GATC Biotech


A recent publication by Sinha et al. from April 2017 stimulated a lively discussion about a phenomenon referred to as “index hopping”, “index swapping” or “barcode mis-assignment”. It occurs when multiplexed samples are sequenced on Illumina’s HiSeq 3000/4000/X Ten systems using Exclusion-Amplification (ExAmp) chemistry. The authors observed that “up to 5-10% of sequencing reads are incorrectly assigned from a given sample to other samples in a multiplexed pool”. Illumina reacted with a white paper describing the impact of barcode mis-assignment and best practices for minimising it, and reported “index hopping” rates below 2% on patterned flow cells. “Index hopping” rates depended on the library preparation method, with the highest rates seen for PCR-free libraries and libraries contaminated with free adapters and primers. While the underlying mechanism remains elusive, the overall consensus from Illumina’s white paper, as well as from bloggers at Enseqlopedia and the UC Davis Genome Centre, is that clean sequencing libraries are essential for sequencing on the HiSeq 3000/4000/X Ten. Moreover, they declared that “for the majority of applications ‘index hopping’ between clean libraries will be minimal and will have minimal or no impact on the data analysis”.


Since we run a large number of HiSeq 4000 projects, we took the matter very seriously and started digging into our data to assess the level of “index hopping” at GATC Biotech. From two recent HiSeq 4000 sequencing runs, 5 lanes were selected with 5 to 9 libraries per lane comprising different library types: strand-specific RNA libraries from different organisms (2 lanes), all exome-enriched DNA libraries (WES) (1 lane), and ChIP libraries (2 lanes).

The number of reads with unexpected dual index combinations, i.e. combinations not matching those of the loaded libraries, was retrieved from the file ‘DemuxSummaryF1L[1-8].txt’, which is generated automatically for each lane during demultiplexing. For each possible dual index combination (including the ones present in the pool and all combinations not present in the pool), the number of reads was divided by the total number of reads of the lane to get percent index representation values. The results for one lane with 6 ChIP libraries with unique i7 and i5 indices are shown in figure 1 and those for another lane with 8 RNA libraries in figure 2. The observed levels of “index hopping” were substantially lower than the ones reported in the Illumina white paper. The background read distribution seems to be random, as every possible combination of indices was detected. We could not observe a significant correlation between the library type and the level of “index hopping”. The three other analysed lanes containing RNA, ChIP and WES libraries showed similar levels of “mis-assignments” (data not shown). Analysing all “index hopping” events across 5 lanes, a median value of 0.008% was determined (Figure 3).

By summing up all “index hopping” events per lane, we obtained a median cumulative frequency of “index hopping” per lane of 0.27%. In contrast, when this measure is applied to the example data presented in Illumina’s white paper (Figure 3 of Illumina’s white paper), the cumulative rate of “index hopping” is 1.59%, which is nearly six times higher than what we observed at GATC Biotech.
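The two measures described above (percent index representation per unexpected combination, and the cumulative rate per lane) can be sketched in a few lines of Python. This is only an illustration: the read counts and index names below are invented, and the sketch starts from counts already held in a dictionary rather than parsing the demultiplexing summary file, whose layout is not described here.

```python
from statistics import median

def unexpected_index_percentages(read_counts, loaded_combinations):
    """Percent of all lane reads carried by each index pair that was NOT loaded."""
    total_reads = sum(read_counts.values())
    return {
        combo: 100.0 * count / total_reads
        for combo, count in read_counts.items()
        if combo not in loaded_combinations
    }

# Hypothetical toy lane with two libraries carrying unique dual indices.
toy_counts = {
    ("i7_A", "i5_A"): 499_000,  # loaded library 1
    ("i7_B", "i5_B"): 499_000,  # loaded library 2
    ("i7_A", "i5_B"): 1_200,    # unexpected combination
    ("i7_B", "i5_A"): 800,      # unexpected combination
}
toy_loaded = {("i7_A", "i5_A"), ("i7_B", "i5_B")}

hopped_percent = unexpected_index_percentages(toy_counts, toy_loaded)
median_percent = median(hopped_percent.values())   # per-combination measure
cumulative_percent = sum(hopped_percent.values())  # per-lane measure

print(f"median per combination: {median_percent:.3f}%")
print(f"cumulative per lane:    {cumulative_percent:.3f}%")
```

The cumulative value grows with the number of possible unexpected combinations, which is why it is much larger than the per-combination median on a real lane.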


The data presented was derived from currently ongoing customer projects and was randomly selected. We assume that the library preparation has the highest influence on levels of “index hopping” events. As our library preparation process results in clean high-quality libraries (i.e. no detectable primer and adapter dimers), we consequently have extremely low rates of “index hopping”. At GATC Biotech most steps of the library preparation are automated using liquid handlers and very strict purification steps are performed, which seem to mitigate this effect to nearly negligible amounts (Figure 4).

With our workflow, a non-uniquely dual indexed library may on average receive 0.008% of the lane’s reads from a library sharing one of its index sequences. For example, if a non-uniquely dual indexed library was loaded with approximately 10% of total reads (e.g. 30 million read pairs) per lane and was affected by “index hopping” because another library on the lane shared one of its indices, then 0.08% of the reads (e.g. 24,000 read pairs) of the affected library would originate from the contaminating library. This equals roughly 1 mis-assigned read per 1,250 correctly assigned reads.
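The arithmetic of this example can be checked with a short Python sketch. It assumes that the 0.008% median is a share of all reads on the lane (only that reading reproduces the 24,000 read pairs quoted above) and that the lane holds 300 million read pairs, which follows from 30 million being 10% of the total. The function and variable names are illustrative.

```python
def contamination_in_library(lane_read_pairs, library_share, hop_percent_of_lane):
    """Return (mis-assigned read pairs, percent of the affected library)."""
    library_pairs = lane_read_pairs * library_share
    misassigned_pairs = lane_read_pairs * hop_percent_of_lane / 100.0
    percent_of_library = 100.0 * misassigned_pairs / library_pairs
    return misassigned_pairs, percent_of_library

# Assumed lane size: 30 million read pairs are 10% of 300 million.
misassigned, percent = contamination_in_library(
    lane_read_pairs=300_000_000,
    library_share=0.10,
    hop_percent_of_lane=0.008,
)
reads_per_misassigned = 100.0 / percent

print(f"{misassigned:,.0f} mis-assigned read pairs")
print(f"{percent:.2f}% of the affected library")
print(f"about 1 in {reads_per_misassigned:,.0f} reads")
```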

Does this level of mis-assigned reads influence data interpretation?
For many study types such as whole genome sequencing, whole exome sequencing and bisulphite sequencing no influence is expected. 

This includes re-sequencing projects aiming at detecting minor allele frequencies down to 1%, where a sequencing depth of 300x average coverage is usually recommended. This means that at least 3 unique reads with a specific mutation are needed in order to call a mutation. At 300x average coverage and an “index hopping” rate of 0.08%, there is a <30% chance that a single mis-assigned read with the mutation will be detected, which is well below the threshold of 3 mutated reads. Moreover, this will only be the case if the mutation was present at 100% in the “contaminating library”. If the mutation frequency is lower, the likelihood of carry-over is reduced even further. Therefore, rare mutation detection studies are very unlikely to be affected at GATC Biotech. If the “contaminating library” belongs to a different organism, most of the “index hopping” reads will not map, leaving the experimental data unaffected.
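One way to reproduce the “<30%” estimate is a simple binomial model. The sketch below assumes that each of the 300 reads covering a position is independently a mis-assigned read with probability 0.08%, and that every mis-assigned read carries the mutation (the worst case of a mutation present at 100% in the contaminating library). This model is an assumption made for illustration, not necessarily the calculation used for the figure above.

```python
from math import comb

def prob_at_least(k, depth, rate):
    """P(X >= k) for X ~ Binomial(depth, rate)."""
    below_k = sum(
        comb(depth, i) * rate**i * (1.0 - rate) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - below_k

depth = 300         # average coverage
hop_rate = 0.0008   # 0.08% of the library's reads are mis-assigned

expected_reads = depth * hop_rate              # mean mis-assigned reads per site
p_one_or_more = prob_at_least(1, depth, hop_rate)
p_three_or_more = prob_at_least(3, depth, hop_rate)

print(f"expected mis-assigned reads per site: {expected_reads:.2f}")
print(f"P(at least 1 mis-assigned read): {p_one_or_more:.3f}")
print(f"P(at least 3 mis-assigned reads): {p_three_or_more:.4f}")
```

Under these assumptions the chance of seeing at least one mis-assigned read at a position is about 0.21, and the chance of reaching the calling threshold of 3 reads is well below 1%.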

Another concern is RNA-Seq on the HiSeq 4000, where gene expression levels can vary substantially between sample types or treatments. The impact on an experiment, however, is in most cases very low. For example, if a cell line upregulated a certain transcript by a factor of 100 upon treatment with a compound, e.g. from 10 FPKM to 1,000 FPKM, “index hopping” could increase the FPKM of the untreated control from 10 to 11 FPKM (~0.1% of 1,000 FPKM). In conclusion, the fold change will not be substantially different.
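The effect on the fold change can be worked through with a small Python sketch. It uses a simple additive model, assumed here for illustration, in which a fixed fraction (~0.1%) of the signal from the other library is added to each sample.

```python
def observed_fpkm(own_fpkm, other_fpkm, hop_fraction):
    """Expression seen in a sample after signal leaks in from another library."""
    return own_fpkm + other_fpkm * hop_fraction

control_true, treated_true = 10.0, 1000.0
hop_fraction = 0.001  # ~0.1%, as in the example above

control_observed = observed_fpkm(control_true, treated_true, hop_fraction)
treated_observed = observed_fpkm(treated_true, control_true, hop_fraction)

true_fold_change = treated_true / control_true
observed_fold_change = treated_observed / control_observed

print(f"control: {control_true} -> {control_observed:.2f} FPKM")
print(f"fold change: {true_fold_change:.0f} -> {observed_fold_change:.1f}")
```

In this model the 100-fold induction is reported as roughly 91-fold, so the transcript is still clearly called as strongly upregulated.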

Nevertheless, for single cell RNA-Seq where commonly up to 384 libraries are pooled on a single lane, it is recommended to use uniquely indexed libraries if very different cell types are analysed. 

Overall, similar to the reports from Sinha et al. and Illumina, we observed “index hopping” on the HiSeq 4000, but at significantly lower levels. GATC’s proprietary library preparation protocols and high degree of automation show that this effect can be reduced by preparing high-quality libraries and applying strict purification and size selection steps.

In any case, we will continue to monitor “index hopping” on a regular basis to ensure only the highest quality standards are achieved for our customers. 


1. Sinha R et al. (2017). Index Switching Causes “Spreading-Of-Signal” Among Multiplexed Samples In Illumina HiSeq 4000 DNA Sequencing. BioRxiv: doi:

2. Illumina (2017). Effects of Index Misassignment on Multiplexing and Downstream Analysis [white paper].

3. (2017). Update on @illmina index-swapping [Blog post].

4. Froenicke L. (2017). Update on Barcode Mis-Assignment Issue [Blog post]. 

Happy DNA Day!

24.04.2017 | Detlef Janssen

Infographic: DNA fun facts

Today is none other than DNA Day! The special day is celebrated every year on April 25 to commemorate the first publication of the structure of DNA in 1953, as well as the completion of the Human Genome Project in 2003.

DNA Day was first marked on April 25, 2003 in the United States. Annual DNA Day celebrations have since been organised by the National Human Genome Research Institute. The purpose of the event is to offer students, teachers and the public an opportunity to learn about the latest advances in genomic research.

GATC Biotech is proud to offer expertise in the DNA sequencing field, ranging from Sanger sequencing to whole genome sequencing to targeted sequencing. But besides technical knowledge, we’ve also found out a thing or two that can get anyone excited about DNA. Read some DNA fun facts below:

1. Half-man, half-microorganism
Not quite, but humans harbour as many as 145 genes that have jumped from bacteria, viruses or other single-celled organisms through the process of horizontal gene transfer. Most of these genes play established roles in metabolism, immune responses and other biochemical processes.

2. No T-Rex resurrection

Scientists believe that DNA has a half-life of 521 years. This means that at a temperature of -5°C, every bond would be destroyed after a maximum of 6.8 million years. DNA would cease to be readable much earlier, roughly at 1.5 million years. Bad luck for T-rex, as dinosaurs are believed to have lived 65 million years ago.

3. Are you a pumpkin head?
Humans and pumpkins share about 75% of their DNA. About 98% of our genetic make-up is identical to that of chimpanzees, and human-to-human genetic variation is only 0.5% to 1%.

4. Get out of jail free card
DNA-based evidence has exonerated more than 300 wrongly convicted prisoners in the U.S. since 1989. Twenty of these prisoners had been on death row.

5. DNA goes sugar-free
Xeno nucleic acid (XNA) is a synthetic alternative to DNA. XNA is created by exchanging DNA’s sugar group for any number of artificially produced molecules. Six of these XNAs already exist, such as glycol nucleic acid (GNA), threose nucleic acid (TNA) and peptide nucleic acid (PNA).

6. To Pluto and back? You’ve got it in you! 
If the DNA in all cells of the human body were uncoiled, it would stretch 16 billion kilometres. Depending on the location in their orbits, the distance from Earth to Pluto varies between 4 and 7.5 billion kilometres.

7. DNA in the cell’s power generator
Human mitochondrial DNA (mtDNA) contains only 37 genes. Of these genes, 13 code for proteins of the electron transport chain and the rest code for transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs). In mammals, mtDNA is usually inherited from the mother, as mitochondria in mammalian sperm are usually lost or destroyed in the process of fertilization.

8. DNA and CSI
In forensics, DNA profiling is based on polymerase chain reaction (PCR) and uses short tandem repeats (STRs) that are highly variable. DNA analysts in North America look at 13 specific DNA loci, whereas those based in the UK use a 17-locus system. The odds that two individuals have the same 13-locus DNA profile are about one in a billion.

9. An octoploid coffee bean
Humans are diploid organisms with two sets of 23 chromosomes, or 46 in total. Some C. arabica coffee plants are octoploids with eight sets of 11 chromosomes, or 88 in total.

10. All in a day’s work
It takes about 8 hours for a mammalian cell to completely copy its DNA. Human DNA replicates at a rate of 50 nucleotides per second at 20 to 80 origins of replication. In contrast, E. coli DNA replicates at a rate of 1,000 nucleotides per second at one single origin of replication. The process takes about 40 minutes. 

11. Birds of a feather flock together
A controversial 2014 study of 2,000 Americans found that people tend to befriend those with similar DNA to their own. The authors analysed 500,000 markers from across the genome to conclude that friends share about 0.1% more DNA than strangers. This level of similarity is expected for fourth cousins.

12. Should Anne of Green Gables join X-Men?
Red hair, freckles and blue eyes are genetically considered mutations. Red hair appears in people with a recessive allele on chromosome 16 which produces an altered version of the MC1R protein. The MC1R gene is also often implicated in the presence of freckles. A specific mutation in the HERC2 gene, which affects the function of OCA2, is strongly linked to the appearance of blue eyes.

Image of FFPE sample in comparison to blood sample

Analysis of single nucleotide polymorphisms (SNPs) and insertions and deletions (InDels) in the human exome is one of the most popular applications of next-generation sequencing (NGS). More and more clinical researchers are turning to exome sequencing to help in the diagnosis, prognosis, treatment and prevention of disorders caused by genomic abnormalities.  

Clinical samples can be quite challenging to sequence. This is especially true for starting material commonly used for tumour mutation profiling, such as formalin-fixed, paraffin-embedded (FFPE) and blood samples, from which cell-free DNA (cfDNA) is extracted. Below we offer insights into why FFPE-extracted DNA and cfDNA are difficult to sequence and how optimisation of certain steps can help perform efficient exome sequencing of cfDNA and FFPE samples.

A blood sample of a cancer patient contains not only circulating tumour DNA (ctDNA) but also high levels of cfDNA from non-cancerous cells. Moreover, ctDNA levels tend to vary significantly between patients, cancer type and the health status of the patient. The short ctDNA length of only about 160 bp complicates the analysis even further. Often, DNA isolation from plasma results in DNA concentrations ranging from as little as 1 ng to 10 ng/ml of plasma. With variant allele percentage as low as 1% in early stages of the disease, highly sensitive methods are required to achieve accurate variant calling.  
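To see why such sensitivity is needed, the figures above can be converted into approximate numbers of mutant DNA copies. The sketch below assumes a mass of about 3.3 pg per haploid human genome, a commonly used approximation that is not taken from this article. The function name is illustrative.

```python
HAPLOID_GENOME_PG = 3.3  # assumed approximate mass of one haploid human genome (pg)

def mutant_copies(cfdna_ng, variant_allele_fraction, genome_pg=HAPLOID_GENOME_PG):
    """Return (haploid genome equivalents, expected mutant copies) in a sample."""
    genome_equivalents = cfdna_ng * 1000.0 / genome_pg  # 1 ng = 1,000 pg
    return genome_equivalents, genome_equivalents * variant_allele_fraction

# cfDNA recovered from 1 ml of plasma at the low and high end of the range,
# with a variant allele fraction of 1%.
low_equivalents, low_mutant = mutant_copies(1.0, 0.01)
high_equivalents, high_mutant = mutant_copies(10.0, 0.01)

print(f"1 ng:  ~{low_equivalents:.0f} genome equivalents, ~{low_mutant:.0f} mutant copies")
print(f"10 ng: ~{high_equivalents:.0f} genome equivalents, ~{high_mutant:.0f} mutant copies")
```

Under this assumption, 1 ng of cfDNA holds only a few mutant copies of a variant present at 1%, so any loss during library preparation directly limits what can be detected.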

Besides proper plasma preparation, the library preparation step is crucial for successful ctDNA exome sequencing. A library preparation method that works both with low-input DNA and across a broad range of DNA input is needed. The library preparation steps must be performed with extreme caution in order to maximise yield and quality of the genomic material. Only high-fidelity enzymes should be used during the procedure.

The process of fixation and the storage conditions for FFPE samples can cause substantial DNA damage. Genomic DNA derived from FFPE tissue is often partially degraded or available only in very limited quantity. Damaged DNA tends to promote jumping between templates and to induce DNA polymerase errors during any PCR steps. There is often extensive variability in the amount and types of damage in DNA extracted from FFPE material. Lesions such as inter- and intra-strand crosslinks, as well as accumulated strand breaks, are common in FFPE-derived DNA. FFPE material also has higher rates of C>T deamination artefacts, as well as high levels of other base substitutions.

If you want to perform exome sequencing of DNA extracted from FFPE samples, make sure you measure the final DNA concentrations on both a NanoDrop spectrophotometer and a Qubit fluorometer. Establish a quality threshold and do not continue the experiment if the DNA quality falls below it. Perform exome sequencing with a high sequencing depth in order to achieve accurate variant calling. Ideally, samples should be run in duplicate and a minimum coverage cut-off per sample should be established prior to downstream data analysis.

Infographic: A brief history of DNA sequencing

It would be really unfortunate to exist for nearly 4 billion years without anybody noticing your presence. But that’s exactly what happened to one lonely molecule called deoxyribonucleic acid (DNA). The molecule duplicated, mutated and evolved without anyone giving it any thought. Even complex multicellular Homo sapiens carried on for thousands of years completely ignorant of their DNA, although each of their billions of cells carries two meters of the genetic material. 

DNA had to wait until 1869 for its first physical encounter with a human. The lucky man was Friedrich Miescher, a Swiss physician who successfully isolated DNA in the form of chromatin from pus-soaked hospital bandages.

Scientists at that time were not fully convinced DNA was worth getting excited about. Most still believed proteins were the carriers of genetic information. The false notion began to change in the 1940s and in 1952 the matter was finally laid to rest with an elegantly simple experiment from Alfred Hershey and Martha Chase, which demonstrated once and for all that DNA is genetic material. 

Just one year later, in 1953, Francis Crick and James Watson, with a crystallography hint from Rosalind Franklin, introduced the world to the double helical structure of DNA. With the structure now in the bag, scientists began their search for DNA function. The answer came from Marshall Nirenberg, who in 1961 showed that different combinations of DNA bases code for specific amino acids, the building blocks of proteins.

With the realisation that DNA was the blueprint for life came the curiosity to “read” the plan which DNA held within. A great molecule to start sequencing with was RNA, as these types of nucleic acids are single-stranded and often considerably shorter than DNA. Indeed, in 1965, Robert Holley and his co-workers became the first people to read the bases of a nucleic acid when they sequenced yeast transfer RNA (tRNA) using RNases with base specificity. In 1970, Ray Wu was first to decipher a short sequence of DNA by using a technique called primer extension. Two years later, Walter Fiers read the first ever sequence of a whole gene, the one encoding the coat protein of the RNA bacteriophage MS2. One year later, Walter Gilbert and Allan Maxam developed a way of sequencing DNA which used chemicals to cut DNA at certain bases. In 1975, Frederick Sanger introduced his first alternative method of DNA sequencing, called the “plus and minus” technique. The approach used polyacrylamide gels to separate products of primed synthesis in order of increasing chain length. In 1977, Sanger modified Ray Wu’s primer extension technique to develop the chain-terminator method, also called dideoxy sequencing or simply Sanger sequencing as we know it. The technique went on to dominate the sequencing world for the next 30 years.

Sanger used his newly developed method to sequence the first ever genome in 1977. The genome of bacteriophage φX174 became the most popular DNA positive control in labs around the world. A few years later, in 1982, researchers linked a DNA mutation to human cancer for the first time. The documented case was a single DNA base change in the HRAS gene that could affect the onset of bladder cancer by altering the structure of its protein product.

Meanwhile, improvements to the Sanger sequencing method were constantly made. In 1984, Fritz Pohl developed the first non-radioactive sequencing technology platform, GATC1500. In 1986, Leroy Hood in collaboration with Applied Biosystems developed the first semi-automated DNA sequencing machine where sequencing data could be directly collected by a computer. The following year, Applied Biosystems launched the first automated DNA sequencing machine, selling at $300,000 apiece. Nearly 10 years later, ABI would become the first commercial provider to use capillary electrophoresis rather than a slab gel, establishing truly automated DNA sequencing.

Meanwhile, in 1990, the ambitious Human Genome Project began with the astronomical cost of $75 per DNA base. In 1995, Haemophilus influenzae became the first bacterium to have its genome sequenced using the “shotgun” sequencing technique. The longer and more complex genome of the yeast Saccharomyces cerevisiae followed in 1996.

1996 was not just the year of the yeast; it was also the year in which next-generation sequencing (NGS) first came to be. It was during this year that Mostafa Ronaghi introduced a new DNA reading technique called pyrosequencing, based on a sequencing-by-synthesis method. Two years later, Shankar Balasubramanian and David Klenerman founded Solexa, the precursor to Illumina. The two men combined efforts to develop a new sequencing-by-synthesis technique based on fluorescent dyes. 1998 was also the year that the first animal genome was successfully sequenced, that of the microscopic worm Caenorhabditis elegans. One year later, an international collaboration published the first human chromosome sequence, introducing the scientific community to chromosome 22.

The beginning of the 21st century was certainly an exciting time for DNA. Genomics success stories were pouring in from every corner of the world. In 2000, Arabidopsis thaliana became the first plant and Drosophila melanogaster the first insect to have their genomes sequenced. The first year of the new millennium also saw the much awaited first draft version of the human genome sequence, a combined effort attributed to project leaders Francis Collins from the U.S. National Institutes of Health and Craig Venter, founder of Celera. In 2001, the draft human genome sequence, based on samples from 12 anonymous volunteers, was officially published. In 2002, the complete genome sequence of the mouse followed, showing 90% identity to that of humans. In 2003, the human genome sequence of around 3 billion base pairs was finalised, although a few gaps still exist to this day.

The developers of next-generation sequencers were not sitting idly by during the human genome sequencing craze. In 2005, Jonathan Rothberg and colleagues used pyrosequencing to develop the 454 system, the first next-generation sequencing platform to come on the market. Meanwhile, Solexa researchers used their own sequencing-by-synthesis technique to read the whole genome of a virus called phiX-174. In 2007, Illumina took over Solexa in a $600 million buy-out, going on to provide the most widely used next-generation sequencing technology in the world. That same year, a new competitor to 454 and Illumina was released in the form of the SOLiD system, which was based on sequencing by ligation. A few years later, in 2011, Life Technologies released another competing sequencer, the Ion Torrent, which used a form of sequencing-by-synthesis based on detecting the hydrogen ions released whenever a base is added to a growing DNA strand.

Next-generation sequencing was becoming more and more accepted in the scientific community. In 2008, the International Cancer Genome Consortium was launched with the goal of using NGS to analyse thousands of tumour samples and profile cancer-related mutations. This was a tremendous year for cancer research, as scientists also managed to decode the whole DNA sequence of a cancer for the first time. To achieve this, they used NGS to read the genetic code from leukaemia cells isolated from a 50-year-old patient. Also in 2008, James Watson became the first person to have his whole genome read using NGS.

In 2009, third-generation sequencing technology came into the spotlight for the first time with the release of the Helicos sequencer. This technique used single-molecule fluorescent sequencing to read DNA, but it quickly fell out of favour due to high error rates. Single-molecule sequencing was more successful in the hands of Pacific Biosciences, who launched their first single-molecule real-time (SMRT) sequencing platform in 2011.

The latest sequencing technology to hit the mainstream was nanopore sequencing, where DNA is passed through a tiny nanopore in a membrane. The order of bases is then determined based on changes in the electrical current across the pore. Oxford Nanopore Technologies became the first company to commercialise this new form of sequencing in 2012.

DNA sequencing is now as popular as ever, with scientists using sequence information for countless applications. Now that good quality sequencing data is becoming cheaper and easier to generate, it is highly probable that the need for bioinformatics analysis will grow. A truly multidisciplinary approach will be needed in order to interpret and make use of the vast amount of generated data. Nowhere is this truer than in the field of personalised medicine. For newly emerging applications like liquid biopsies for non-invasive cancer detection, DNA analysis holds the great promise of personalised, effective and painless care.

Odds are it won’t take another four billion years to get there.

Successful female GATC colleagues

There was no shortage of enthusiasm from the three bosses interviewed for this blog post. The ladies readily put their busy schedules aside to make time for a lively conversation in honour of International Women’s Day. If after the first sentence you were surprised to find out the three bosses were female, then you, like most of us, need to read on about what it takes for a woman to climb up the biotech ladder. 

Julia Bottlang – The Lab Powerhouse on beating the odds 

Mrs. Bottlang found her groove during the set-up of a Sanger sequencing laboratory in London. The lab technician found herself willing to make decisions rather than constantly waiting for directions. This initiative did not go unnoticed. When Mrs. Bottlang applied for the position of Head of Pre-Sequencing NextGen Sequencing Lab, she beat out an all-male applicant pool to land the job. To get to where she is, she believes young girls need to be encouraged to pursue their interests fully so that they can grow into confident women who are smart enough to recognise an opportunity and self-confident enough to take it.  

Silvana Mamone – The Data Queen on the importance of the next generation 

Mrs. Mamone is always eager to learn something new. She feels her enthusiasm for new knowledge is what led to a string of promotions and ultimately her current post as Head of the Data Analysis and Processing Team at GATC Biotech. Her newest position was probably the most challenging for her, not least because she had just become a mother. Neither she nor her partner considered becoming stay-at-home parents. So she sharpened her negotiating skills and landed flexible working hours, the opportunity to work from home when her child is sick and enough free days to cover all kindergarten holidays. She acknowledges she has reached a great work-life balance in part thanks to her employer’s full support of employees with families. Whenever Mrs. Mamone is not analysing sequencing data, she is busy instilling self-confidence in her daughter, who belongs to the next generation of powerful women.

Kerstin Stangier – The Sequencing Mastermind on bringing a female perspective to a man’s world

It might be difficult for a daughter to follow in the footsteps of a 1980s feminist with a job, a family and a political agenda. But becoming one of two female directors in a blooming biotech company is a pretty good start. Dr. Stangier proudly credits her mother for giving her the strength to go on and succeed in male-dominated fields. From earning a graduate degree in chemistry to becoming Director of Production at GATC Biotech, Dr. Stangier is used not only to voicing her opinion in front of male crowds, but to actually having it heard. She believes that being a woman is advantageous in business and scientific discussions, as she can often offer a different perspective and suggest more creative solutions than her male counterparts. Although she is happy with the improvements she sees in the workplace for women, she acknowledges that top jobs in companies are still almost exclusively filled by men. Her advice for women, and even men, pursuing these top jobs is to have full self-confidence in their education, experience, abilities and judgement. In her opinion, the winning card is self-esteem rather than gender, and with enough practice, any employee can deal this winning card to themselves.

Lab impression: Illumina sequencers

There are several important factors that you need to consider before doing next-generation sequencing (NGS) on an Illumina platform. A well-planned experiment could easily maximise the success of the final outcome regardless of whether you perform the sequencing yourself or outsource to a service provider. 

  • Starting material – The starting point of your project is the choice of a proper DNA / RNA extraction method for your organism of interest. The lysis and homogenisation steps of the protocol should be tailored to the specific material in order to maximise yield and quality. The protocol should be performed by experienced users to avoid degradation of the nucleic acid due to missteps or delays. For some materials, such as plants, the removal of inhibitors is essential. In all cases, the quality of your DNA/RNA should be assessed using capillary gel electrophoresis or similar methods.
  • Number of samples – The number of samples very much depends on the aim of your sequencing experiment. If you want to check for the presence of a gene/SNP in a bacterial strain, one sample might be enough to answer this question. However, if you want to analyse effects on groups of samples, replicates should be included where possible to allow proper statistical analysis. In general, biological replicates are preferred over technical ones. Also consider including controls to validate your results.
  • Number of reads – Your project will require a minimum number of sequencing reads in order to generate reliable data. If you are sequencing amplicons, small RNAs or re-sequencing small genomes, as few as 5 million reads can be sufficient. For re-sequencing projects, the required number of reads follows directly from the genome size and the desired sequencing depth/coverage. Larger eukaryotic genomes need more sequencing reads than small prokaryotic genomes. Over-sequencing not only costs more money and time, it also complicates downstream data analysis.
  • Sequencing depth – The sequencing coverage, or the average number of times a single base is read during a run, is also of importance. The more often a base is sequenced, the more reliable the base call is likely to be. This parameter is also highly dependent on the application. For example, 30x coverage is usually sufficient to reliably identify germline mutations. However, 100x or more is needed to detect somatic mutations in tumour samples.
  • Sequencing mode – Illumina offers two distinct types of sequencing: the single-read and the paired-end mode. Single-read runs sequence each DNA fragment from one end only; how far the read extends towards the other end depends on the fragment length and the sequencing length. The single-read mode is fast and cheap and can be beneficial for some RNA-seq and ChIP-Seq experiments. In paired-end mode, the fragment is read first from one end and then, in a second read, from the opposite end. Two paired reads are thereby generated for each fragment. Although this sequencing mode is more expensive, it generates more data and makes the mapping to the genomic reference more reliable. Therefore, it is the preferred choice for applications like SNP analysis and genome assembly.
  • Multiplexing – During library preparation, each sample is labelled with a sample-specific molecular tag, called a barcode. This allows multiple samples to be processed in the same sequencing reaction and then separated during BioIT analysis. Besides lowering costs, multiplexing allows for randomisation and can help minimise sequencing bias. In a perfect world, the experimental design would involve pooling all controls and experimental samples together and sequencing them on the same lane. If this cannot be achieved, samples should be randomised so that each batch contains both cases and controls. For low complexity samples, such as amplicons and bisulfite-treated DNA, pooling with high complexity samples can increase the sequencing quantity and quality.
  • BioIT analysis – As sequencing costs are lower than ever before, the current bottleneck of NGS tends to be the bioinformatics analysis. Thinking in advance about how you will analyse your data can help ensure you have included all necessary controls. You can carry out the data analysis yourself using free or commercially available software, but these programs often come with a steep learning curve. Alternatively, you can outsource the BioIT analysis to an experienced provider.
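
The link between number of reads, genome size and sequencing depth described in the list above can be sketched in a few lines of Python. This is only a rough planning aid based on the common textbook estimate (coverage = number of reads × read length / genome size). It ignores duplicates, unmapped reads and uneven coverage, so real projects usually need a safety margin. The function names, the 150 bp read length and the genome sizes used in the example are illustrative assumptions, not values prescribed by any sequencing provider.

```python
import math


def reads_needed(genome_size_bp, coverage, read_length_bp, paired_end=False):
    """Estimate how many reads (or read pairs) give the desired average coverage.

    Hypothetical helper for illustration: applies
    reads = genome size * coverage / read length.
    With paired_end=True the result is the number of read PAIRS (fragments),
    since each fragment contributes two reads.
    """
    if genome_size_bp <= 0 or coverage <= 0 or read_length_bp <= 0:
        raise ValueError("genome size, coverage and read length must be positive")
    total_bases = genome_size_bp * coverage          # bases that must be sequenced
    reads = math.ceil(total_bases / read_length_bp)  # individual reads required
    return math.ceil(reads / 2) if paired_end else reads


def expected_coverage(n_reads, read_length_bp, genome_size_bp):
    """Average coverage obtained from a given number of individual reads."""
    return n_reads * read_length_bp / genome_size_bp


if __name__ == "__main__":
    human = 3_000_000_000   # roughly 3 billion base pairs
    ecoli = 4_600_000       # roughly 4.6 Mb, assumed here for a small genome

    # Germline variant calling at 30x with 150 bp single reads
    print(reads_needed(human, 30, 150))                    # 600000000
    # Somatic variant calling at 100x, counted as 150 bp read pairs
    print(reads_needed(human, 100, 150, paired_end=True))  # 1000000000
    # 5 million 150 bp reads on a small bacterial genome
    print(round(expected_coverage(5_000_000, 150, ecoli)))
```

Running the numbers this way before ordering a run makes it easier to see when a project is heading towards under- or over-sequencing.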