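Co-hosted by BGI;
the Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, The University of Tokyo;
and the Grant-in-Aid for Scientific Research on Innovative Areas "Genetic Bases for the Evolution of Complex Adaptive Traits"
Informatics Open Seminar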


"Challenge to de novo sequence of relatively large genomes with new sequence technologies"

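Organizers:
Tomoaki Nishiyama (Advanced Science Research Center, Kanazawa University), Koji Kadota (Agricultural Bioinformatics Research Unit, Graduate School of Agricultural and Life Sciences, The University of Tokyo), Mitsuyasu Hasebe (National Institute for Basic Biology)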

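With the spread of next-generation sequencers, we are entering an era in which a single laboratory can sequence the genomes not only of model organisms but also of non-model organisms. However, sequencing gigabase-scale genomes, such as those of plants and animals, is still far from easy. In this seminar, speakers will present and discuss the latest research on methods for assembling complex, repeat-rich genomes, improvements to computing frameworks, and methods for efficiently correcting PacBio RS reads. As concrete examples, we will examine the sequencing of the oyster and migratory locust genomes with the HiSeq 2000, and the sequencing of the genome of the carnivorous plant Cephalotus follicularis with the PacBio RS and HiSeq 2000.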

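Tuesday, September 25, 2012, 13:30-18:00
Chemistry Lecture Room 1, Building 2, Faculty of Agriculture, The University of Tokyo
* No registration required; please come directly to the venue on the day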

13:30-14:15
Zhiyong HUANG (BGI)
“Assembling of Crassostrea gigas and Locusta migratoria reveals the new method in large and complicated genome assembly”

14:15-15:00
Xiaodong FANG (BGI)
“Flexible Computing Frameworks for Large Genome de novo Assembly”

Coffee Break (15:00-15:30)

15:30-16:15
Tomoaki Nishiyama (Advanced Science Research Center, Kanazawa University)
“Illumina and PacBio sequences for de novo assembly”

16:15-17:00
Likai MAO (BGI)
“Assembling the genome of Cephalotus follicularis”

Coffee Break (17:00-17:15)

17:15-18:00
Masahiro Kasahara (Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo)
“Error correction algorithms for de novo genome assembly using PacBio reads”

==========
Zhiyong HUANG (BGI)
“Assembling of Crassostrea gigas and Locusta migratoria reveals the new method in large and complicated genome assembly”

The Pacific oyster (Crassostrea gigas) belongs to the Mollusca, one of the most species-rich but genomically least explored phyla. The oyster genome is highly polymorphic and rich in repetitive sequences, some of which are still actively shaping the genome, which makes it very difficult to assemble. Here we report the sequencing and assembly of the oyster genome using short reads and a fosmid-pooling strategy. The final assembly comprises 559 Mbp, with a length-weighted median (N50) contig size of 19.4 Kbp and an N50 scaffold size of 401 Kbp.
The migratory locust (Locusta migratoria) is one of the world's most destructive insect pests, affecting the livelihoods of one in ten people on Earth. Locust biology has long been the object of intense scientific study in search of the best control strategy. Locusts are also of considerable interest as a model for studying phenotypic plasticity and collective movement. However, the large genome size (approximately 7 Gb) has slowed genome analysis. To improve the assembly, especially the gene structures, RNA-seq information was used for scaffold construction and gene model correction. The RNA-seq data were generated from various samples, including one normalized library. The final assembly has an N50 contig size of 9.293 kilobases (Kb), an N50 scaffold size of 320.3 Kb, and a total gapped size of 6.9 gigabases (Gb).
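
As a rough illustration of the N50 statistic quoted in the abstract above, the following Python sketch computes the length-weighted median of a list of sequence lengths. The function name `n50` and the toy lengths are assumptions made for illustration only; they are not taken from the oyster or locust assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the largest length L such that sequences of length >= L
    together cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy contig lengths (hypothetical, not from any real assembly).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies to contigs or scaffolds; only the input list of lengths changes.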


Xiaodong FANG (BGI)
“Flexible Computing Frameworks for Large Genome de novo Assembly”

The availability of a reference genome sequence is of critical importance for understanding a species, because it enables scientists to study the organism systematically at the genome-wide level. De novo sequencing and assembly is the way to decode a species and obtain its reference sequence, but current sequencing technologies and assembly algorithms struggle with highly complex genomes. Genome sizes vary widely among organisms, ranging from several Kbp in viruses to as much as 670 Gb in Amoeba. The largest genomes published so far are mammalian, typically about 3 Gb in size, but many other important species with large genomes, such as wheat and sugarcane, are still waiting to be decoded. The challenges of large genome assembly stem from sequencing technology, assembly algorithms, and computational complexity. The combination of Illumina sequencing technology and SOAPdenovo, software designed for de novo genome assembly, has proven successful in many genome projects, but assembling organisms with large genomes remains a challenge, and one of the major problems is computational complexity. The larger the genome, the more memory is required, which means that high-performance computers are needed and the cost increases. Here, we provide a flexible computing solution for large genome de novo assembly that requires only a cluster rather than a supercomputer.


Tomoaki Nishiyama (Advanced Science Research Center, Kanazawa University)
“Illumina and PacBio sequences for de novo assembly”

Next-generation sequencers, which produce very cheap and accurate but short sequences, have greatly accelerated the sequencing of a number of organisms, including animals and plants. Since the publication of the Giant Panda Genome assembled with SOAPdenovo, a number of genome assemblers aimed at relatively large genomes have been published. Two major difficulties in the assembly of non-model organisms are the presence of repetitive sequences and the presence of heterozygosity. A mate-pair library with a distance larger than the repetitive sequence is essential for resolving such repeats. The third-generation sequencer, the PacBio RS, which implements the Single Molecule Real Time (SMRT) sequencing technology of Pacific Biosciences, produces reads several thousand nucleotides in length, albeit with a high error rate of 10 to 20%.
Since the introduction of the PacBio RS as a shared facility of the Grant-in-Aid for Scientific Research on Innovative Areas "Genetic Bases for the Evolution of Complex Adaptive Traits", we have sequenced a bacterium and several plants and animals. We will present the performance, the error rate and error patterns as estimated against a reference, and our experience in assembly with available programs. Because of the high error rate, assembly with PacBio data alone is unlikely to be feasible in the near future, but hybrid strategies using both Illumina and PacBio data are considered promising. There are two hybrid approaches: one is error correction of PacBio data using Illumina sequences, and the other is the use of PacBio data for gap filling of Illumina-based scaffolds. The performance of published programs is tested on the bacterial genome.


Likai MAO (BGI)
“Assembling the genome of Cephalotus follicularis”

Cephalotus follicularis has a relatively large genome (~2 Gb by our estimate). Its genome also appears to contain a high proportion of repetitive sequences, which adds another layer of difficulty to the assembly. By 2nd-generation sequencing, we obtained more than 50X of clean data. However, these data could not be assembled into a high-quality genome. Third-generation sequencing, which produces long reads, offers great hope for difficult tasks such as the assembly of large genomes. However, we still need to deal with the problem of its high error rate. In this talk, we will present our work on assembling the genome of Cephalotus follicularis, illustrating a promising hybrid strategy that combines 2nd- and 3rd-generation sequencing data with error correction.


Masahiro Kasahara (Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo)
“Error correction algorithms for de novo genome assembly using PacBio reads”

With the advent of the Illumina HiSeq, de novo assembly of genomes using Illumina sequencers has become common because it is very cost efficient. However, complex genomes with a large amount of repetitive elements are still hard to assemble using the short reads produced by Illumina sequencers. To accurately assemble large and complex genomes, longer reads are needed. A recently launched sequencer, the PacBio RS, yields the longest reads among commercially available sequencers, and many groups around the world are pursuing its application to de novo assembly.
To put PacBio reads into a portfolio of sequencing reads, we must either devise an assembly algorithm that accommodates a much higher sequencing error rate or correct the sequencing errors in PacBio reads. The sequencing error rate is approximately 15% (although the figure depends on read filtering parameters), and the errors are dominated by insertions and deletions, which are harder for existing alignment and assembly algorithms to handle in a reasonable running time. For example, PacBioToCA is the best-known pipeline for correcting sequencing errors in long PacBio reads using other reads such as Illumina reads, but it runs too slowly for larger genomes and was unable to produce results, at least at our site.
To this end, we developed a new error correction algorithm that is more efficient and scalable. We will present its performance on a pilot project and discuss a way to assemble large and complex genomes, particularly those of wild-type individuals.