Jane Rogers

Jane Rogers has been part of the IWGSC since its beginning. She was instrumental in helping the consortium develop its strategic roadmaps during the first 5 years and joined the IWGSC Coordinating Committee in 2008, representing The Genome Analysis Centre (TGAC, now Earlham Institute), to work on the development of physical maps for several chromosomes. In 2010, she started leading the IWGSC chromosome survey sequencing initiative, which resulted in the publication of the chromosome-based draft sequence in Science in July 2014.

Jane joined the staff of the IWGSC in 2013 to lead the efforts to complete the production of the physical maps and the development of the strategy for the reference sequence. She led the IWGSC-Bayer CropScience projects that developed Whole Genome Profiling (WGP™) based physical maps for three chromosomes and two chromosome arms and, subsequently, the genome-wide tag sequences through the IWGSC-Bayer CropScience WGP™ tag sequencing project. Over the last year, she was a co-leader of the IWGSC RefSeq project.

Even though Jane officially retired from the IWGSC at the end of December 2017, she was still very much involved in the writing of the manuscript about the analysis of the reference wheat genome sequence. Now that this phase is almost finished, Jane took the time to answer our questions and share her views about her experience with the IWGSC and the wheat sequencing project.

You have been involved with the IWGSC since almost its beginning, how would you describe this experience?

I have been involved directly with the IWGSC since 2008 when I became Director of The Genome Analysis Centre in Norwich (now Earlham Institute) and we began to look at how the new institute could contribute to producing a reference sequence for wheat to support long-standing research programmes in the UK. As the genome is large and complex, it made sense for it to be tackled through a consortium effort and I was aware from earlier discussions about potential sequencing strategies for wheat with Kellye Eversole (in 2005) that the IWGSC had a programme that it might be possible to join. The key thing about the consortium for me was that they were sticking to the goal to produce an annotated genome sequence of the highest quality possible that would have immediate practical application for wheat researchers and breeders. The strategy adopted by the consortium was viewed by some to be old fashioned by advocating BAC-based sequencing from chromosome-specific physical maps, but until short read sequencing and assembly technologies improved to become applicable to a large hexaploid genome, the sequential approach would generate useful resources for the community. It also lent itself to sharing the cost and workload amongst groups across the world. Working with a group of people committed to achieving the same goal, who were flexible to incorporating new approaches as they emerged (e.g. chromosome arm and later whole genome shotgun sequencing, and whole genome profiling) and with whom we could work through the sometimes conflicting challenges of contributing to the project and achieving individually recognizable milestones was a real pleasure.

During your career, you have worked on many sequencing projects, what did you find most challenging about developing a reference sequence of the wheat genome?

The greatest challenge in the wheat genome project was maintaining the momentum for achieving a high quality sequence amongst the many participating groups, in the first instance to produce physical maps for all chromosomes and, subsequently, to deliver the sequence. One challenge was that once the chromosome-based strategies had been developed and published for the first chromosomes, it became increasingly difficult for novel publications, required by funders and for career development by the people undertaking the work, to be generated from the production of maps and sequences for the remainder. An additional challenge came from widely publicized claims that draft genome sequences generated from whole genome shotgun sequencing using ‘next generation’ sequencing technologies had delivered a wheat genome sequence for a lower cost and ahead of the schedule set out by the IWGSC. The amplification of such claims in the media made it increasingly difficult for IWGSC-affiliated groups to make the case for a high quality product based on a tried and tested approach to which they would make a partial contribution. As it turned out, the physical maps and associated chromosomal sequences for each of the chromosomes that were generated by the IWGSC have proved invaluable for the assignment and organization of the sequences assembled from whole genome data. By integrating data from the two approaches we have been able to produce a sequence with high coverage and long-range organization that is proving really useful to the wheat community. However, if you asked me how I would go about sequencing the genome today, it would not be the approach that I would advocate. Probably a combination of whole genome short read and long read sequencing combined with some means of assigning contigs / scaffolds to the individual chromosomes / genomes. The chromosome survey sequence data proved particularly useful in this regard for the RefSeq and I was pleased that at TGAC I was in a position to push that project through. It seems to me that it’s not yet clear whether chromosome-specific data will need to be generated for assignment of the contigs/scaffolds from other wheat cultivars for comparative genome analyses.

Over the last year, you co-led the IWGSC RefSeq project, from coordinating the teams analyzing the data to the writing of the manuscript, can you describe how the work was organized and how so many people worked together so effectively? Any challenges?

The workflows for preparing genomic data and analyses for publications describing genome assemblies have become pretty well established. There is a series of steps that flows from building an assembly, QC, annotation of elements such as repeats and genes, annotation QC, then analyses based on the assembly + annotation that relate to how the genome functions and finally, the description of the work for the manuscript. Once the framework for the work had been set out in face to face meetings with groups who wanted to contribute, it was fairly straightforward to put together working groups for each of the activities and to organize regular conference calls to review progress. Predicting the time needed for each stage is always challenging, since many analyses are dependent on the preceding stages and I think everyone tends to underestimate how much time the work needs. It is also difficult to predict what additional analyses may be prompted by initial results and where to draw the line for an initial publication. Nils, as overall leader of the project, insisted on setting realistic timelines to keep the project on track and his discipline was certainly very useful in keeping the momentum moving forward.

In the IWGSC RefSeq project we were keen to draw on the breadth of expertise amongst the consortium members, many of whom had been involved in annotation and analyses of other genomes or wheat data sets. One opportunity we were able to take was to involve several groups in the gene annotation process who had each previously developed automated pipelines and gene annotations used by the community. The results generated by two pipelines from INRA and the Helmholtz Institute were integrated and compared by the Earlham Institute to select the gene models best supported by the data available from transcriptome and comparative studies. All three sets of gene models are available and can be compared with models generated previously for whole genome analyses or for individual genes in genome browsers. Whilst we have already seen through analyses of specific gene families that manual curation will improve on the automated gene annotation, the integrated set of models provides an excellent starting point for gene-based studies.

Were there surprises or unexpected results in the analysis of the reference sequence?

Not really a surprise, but one of the most pleasing features to emerge from the genome analysis was that the IWGSC had achieved a genome assembly that was highly representative of the genome, with an estimated 94% coverage and providing an unprecedented level of long range sequence organization across the 21 chromosomes that comprise the A, B and D sub-genomes. It provides an excellent starting point for the functional analyses associating phenotype with genotype that will now follow. Many of the features described in the genome have been described previously for single chromosomes, e.g. chromosomal distribution of genes, repeats, recombination, but the whole genome analyses provided the first insights into the distribution of genomic features across the whole genome and the potential importance that this may have for gene / genome regulation and future genome manipulation. Understanding and learning how to manipulate gene regulation is going to be fascinating in a genome containing three genomes and a high proportion of duplicated genes whose patterns of expression have already hinted at the complexity of the interactions that determine phenotype and function.

Another aspect of the analysis that came to the fore was the potential for novel bioinformatics approaches to be used to complement molecular, cellular and whole plant studies. Phylogenomics analyses based on sets of more complete gene annotations than have hitherto been available contributed to the dissection of a variety of gene networks associated with specific traits. They also facilitated comparisons of gene family members, duplicated genes and pseudogenes distributed across the three genomes that may be differentially distributed and regulated. Understanding regulatory systems in wheat will be critical to future use of genome information in research and breeding programmes and the IWGSC RefSeq provides an excellent foundation for this work.

The RefSeq project has now almost reached its end, what would you say were the main challenges or roadblocks along the way?

The greatest challenge for the wheat genome project from its inception was to maintain the focus on producing a genome sequence that is an accurate representation of the bread wheat genome and not a product dictated by sequencing technology trends or focused only on gene sequences or other surrogate targets (e.g. sequencing diploid genomes). The sequential strategy adopted by the IWGSC based on physical maps for individual chromosomes flow-sorted from hexaploid bread wheat was considered high risk, yet somewhat reactionary by funders because the whole genome sequence would require many groups to participate and deliver data to the same standards, it would require a long-term effort beyond the normal lifetime of individual grants (typically 3-5 years) and, on the face of it, did not take advantage of the technological improvements in DNA sequencing and sequence assembly that were driving down the costs and timelines of sequencing other genomes. These perceptions meant that the IWGSC members faced considerable political pressure to demonstrate the efficacy of the strategy in order to secure funding to complete the project, adding to the pressures of overcoming the technical challenges of a project tackling a genome whose complexity, i.e. being large, hexaploid and highly repetitive, posed, and still poses, a daunting proposition for any genome sequencing strategy that is based on a single approach.

Back in 2005, genomes such as those of human, mouse, Arabidopsis and rice had been delivered using costly and time-consuming approaches following a sequencing strategy based on physical maps. These were successful, but cheaper and more rapid strategies, such as whole genome sequencing as advertised by Celera for the human genome, were needed for other genomes. Whole genome sequencing generally delivered sequences with variable coverage more rapidly and cheaply than map-based approaches and also offered exciting opportunities for algorithm improvement for bioinformaticians. The greatest disadvantage of whole genome sequencing was that the genome was delivered in a number of pieces (or contigs) and it was difficult to assess how well the genome was represented, especially if it was repetitive and / or complex. It was also difficult, subsequent to the whole genome being ‘sequenced’, to improve the product without, essentially, starting again. The challenges of using this approach for wheat were demonstrated in the Brenchley et al. publication in 2012 of the whole genome shotgun sequencing of Chinese Spring wheat using Roche-454 technology and the subsequent chromosome-based survey sequence (CSS) of the IWGSC produced using Illumina short read technology (IWGSC, 2014). Both projects produced fragmented genic sequences that had some utility, but they fell a long way short of representing the full gene catalogue and lacked positional information and context, although the CSS did assign gene sequences to chromosomes.

You have often mentioned the importance of having a high quality reference sequence. Why is this so important? Whole genome sequencing has long been the easier route to sequencing complex plants yet the IWGSC developed BAC libraries, physical maps, and supported sequencing of BACs. Why are BAC libraries and maps important for future research?

In my opinion, a reference sequence for an organism should be a reliable source of information about the genome composition, sequence ordering along chromosomes and annotated gene structures and provide a template for additional annotations, such as regulatory features, assignment of genetic markers etc. It should also be accurate enough to enable the design of targets for genotyping or gene modification. The more complete the sequence is, the more chance there is that the assembled genome will provide an accurate representation of an organism’s genome and thus improve its utility. Although it has become relatively quick and cheap to generate whole genome sequence data using short read sequencing technologies and assembly quality has improved with algorithm improvements, as gauged by contig and scaffold lengths, it has been more challenging to define the long range organization of sequenced fragments. For many genomes the additional data needed to do this, either in the form of orthogonal data (e.g. genetic or physical maps), long read sequence data, or chromosomal profiling (e.g. using Hi-C), has not been generated or integrated and the “genome sequence” is left in a very fragmentary state that often needs additional work to use. In many cases that work is not captured and used to improve the genome sequence, leaving the sequence in an unsatisfactory state and subsequent users in the frustrating position of needing to repeat work before they can undertake their own projects.

The aim of the IWGSC has been to make publicly available the data sets and physical resources that have been used for the genome assembly. The chromosome-specific BAC libraries are available for experimental work that might include gene isolation, gene cloning and the development of probes to target specific features in different wheat genomes. Although the recent development of gene-editing technologies such as CRISPR will likely reduce the use that is made of the BACs, I think they will still be a valuable resource for manipulating specific regions of DNA, avoiding the need to manage the whole genome.

According to you, what other resources are still needed to meet the needs of wheat scientists and breeders?

Following the production of a reference sequence additional genomic resources that will benefit users are likely to be:

  1. additional annotation / curation of the genome sequence – including improvement of the sequence itself with gap filling or resolution of errors, improved curation of gene models, identification of genome features such as regulatory regions, e.g. transcription factor binding sites, histone modification sites, etc, and more detailed pseudogene annotation;
  2. browser updates to incorporate additional data and provide new tools (e.g. for data mining / visualization) and access to additional data as they become available;
  3. sequences of other wheat cultivars and tools to facilitate comparative analysis between cultivars and with the reference sequence;
  4. platforms to facilitate access, sharing and manipulation of large data sets related to functional characterization of genomic features and genotype-phenotype relationships.

What role could the IWGSC play in generating these resources?

IWGSC members will, of course, determine their priorities and the activities that would benefit from international coordination. Hopefully these will be integrated with the work of other groups, including the Wheat Initiative, CIMMYT etc. In my view, with its diverse membership the IWGSC could continue to play an important role in coordinating efforts to update the reference sequence and associated data sets and to promote sharing of new data and tools for data analysis. They could also continue to facilitate production of resources that would be widely useful in the community, similar to the way they worked with the SNP chips.

Based on your experiences in the human genome and other sequencing projects, how important is manual curation of the reference sequence?

Although gene prediction algorithms have improved significantly over the years, producing an algorithm that is trained to identify all of the features of a gene using an automated process is still challenging. Algorithms are doing a much better job of identifying the most significant elements of gene models, but human intervention still enables different data to be reviewed and often picks up additional, or alternate, exons, splice sites, start / stop sites, etc., that can be missed. In some cases this can significantly alter the structures of genes and also alter their context, i.e. relationship to surrounding DNA that may be involved in regulation. It can also identify problems in the sequence itself, which may be corrected.

For the human genome, systematically improving the annotation through manual curation enabled the sequence to be used for downstream studies, e.g. studies of human variation and understanding the implications of variants. Notably, manual curation and analysis still play an important role in the interpretation of individual genomic data, particularly in diagnostic circumstances. The value of manual curation in enhancing the utility of genome sequences has also been well illustrated for other species, e.g. Arabidopsis, rice, mouse, zebrafish and multiple bacterial and eukaryotic pathogen genomes. It has also been identified as important for researchers looking to apply genomics to agricultural species. So, in my view, manual curation is very important to enable the widest interpretation of genome sequence. The biggest issue is how to afford it. At a minimum, a system that collects the curation that is done by individual scientists when working on specific genes would enable information to be shared and prevent wasteful duplication. Having experts who can provide advice and training and also instigate common rules for curation is also helpful. One question for the wheat community to decide upon will be how useful manual curation of the Chinese Spring genome is for the interpretation of variation across other wheat genomes. For the human genome, manual curation of the reference has been critical to its application, although individual genomes are now needing further analysis. How useful would a first round of manual curation of the reference be for wheat?

About Jane 

Initially trained as a biochemist, Jane obtained her PhD in 1979 from Southampton University (UK) and subsequently held postdoctoral research positions in Utrecht, The Netherlands, and in the Departments of Biochemistry and Pharmacology at Cambridge University.

Jane led the establishment and operation of the high throughput sequencing facility at the Wellcome Trust Sanger Institute from its foundation in 1993 to 2007. In this role she managed the delivery of the UK’s contribution to the Human Genome Project and the production of reference genome sequences for other model organisms (e.g. mouse and zebrafish), pathogen and vertebrate genomes, for which the Institute became renowned. She was a member of the Board of Management of the Sanger Institute for 14 years and has served on international advisory groups for genomics strategy and policies.

From 2008 to 2013, she served as the first director of The Genome Analysis Centre (now Earlham Institute) in Norwich, laying the foundations for its development into a world-class centre for applied genomics and bioinformatics. Jane joined the staff of the IWGSC in 2013.

Since relocating to a small village just outside Cambridge in 2013 with her husband, she has become interested in fruit growing, an activity with which the area has had a long association; it used to supply plums and greengages to markets in Cambridge and London. They have restored an old orchard and are now learning about the different varieties of trees and how to manage the harvested fruit.

Publication date: 08/13/2018