DNA sequencing has become one of the major tools in biology. Besides the classical genome assembly task, there are dozens, if not hundreds, of applications. The algorithmic complexity and the demand on computing resources are overwhelming.
In this lecture, you will gain a deeper insight into genome assembly algorithms and become familiar with preparing reads (from the sequencing machine), using existing software to assemble genomes, and performing differential expression analysis based on sequencing (RNA-seq). This is achieved through a combination of theory input, "demos" by me, and small projects for you as homework.
Part one - Workflows:
I will present “close-to-real” workflows for quality control (using two different software products), for genome assembly (de novo and using a reference sequence), for SNP calling (matching reads against a reference genome) and for differential expression analysis using RNA-seq. A good understanding of the principles of genome assembly (e.g. from the corresponding chapter of GT) will help a lot.
Part two - Theory input:
I will present a few more details on the algorithmic background of assembly, for example how to do error correction, the specifics of single-cell sequencing, or the integration of reads in EULER-SR.
This means that the chapter on assembly algorithms in GT ("Genomics and Transcriptomics") is a prerequisite to GA.
Part three - Projects:
The projects will be done as homework. I will be available at JKU on most of the dates to answer questions, and of course anytime via email.
Projects should be done in groups of two. You may form the groups as you like. If you want to work alone, contact me.
The projects are not yet fully defined, but you can assume that they will revolve around the following three themes:
Project A (Experiments with DBG)
You will write a few R programs that crudely simulate sequencing (creating reads from a given genome, including errors) and build a de Bruijn graph (DBG) and an overlap graph from those reads. Play around with the parameters and visualize the DBG using Cytoscape.
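To make the idea concrete, here is a minimal sketch of such a simulation. It is written in Python purely for illustration (the project itself is to be done in R) and is not the project specification. The function names, the uniform sampling of read start positions, the substitution-only error model and the tab-separated edge list are all assumptions made for this example.

```python
import random
from collections import defaultdict


def simulate_reads(genome, read_len, n_reads, error_rate, rng):
    """Sample reads at uniformly random positions and add substitution errors.

    Assumption: errors are substitutions only (no indels), each base
    independently with probability error_rate.
    """
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads


def de_bruijn_graph(reads, k):
    """Build a DBG: nodes are (k-1)-mers, each k-mer is an edge prefix -> suffix.

    Returns a dict mapping (prefix, suffix) to the number of times the
    corresponding k-mer occurs in the reads.
    """
    edges = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[(kmer[:-1], kmer[1:])] += 1
    return dict(edges)


def overlap_graph(reads, min_overlap):
    """Build an overlap graph with a naive all-against-all comparison.

    An edge a -> b with weight L means the longest suffix of a that equals
    a prefix of b has length L >= min_overlap. Quadratic in the number of
    reads, so only suitable for toy data.
    """
    edges = {}
    for a in reads:
        for b in reads:
            if a == b:
                continue
            for olen in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
                if a[-olen:] == b[:olen]:
                    edges[(a, b)] = olen
                    break
    return edges


def edge_list_lines(dbg):
    """Format DBG edges as tab-separated lines: source, target, count."""
    return [f"{src}\t{dst}\t{count}" for (src, dst), count in sorted(dbg.items())]


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    genome = "".join(rng.choice("ACGT") for _ in range(200))  # toy genome
    reads = simulate_reads(genome, read_len=30, n_reads=100, error_rate=0.01, rng=rng)
    dbg = de_bruijn_graph(reads, k=8)
    print("\n".join(edge_list_lines(dbg)[:10]))
```

Varying `k`, `error_rate` and the number of reads changes the shape of the graph noticeably, which is the kind of experiment the project is about. A tab-separated edge list like the one printed here is a simple form that network tools can usually import as a table.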
Project B (Genome assembly)
You will be given a data set containing reads from a full genome sequencing project of a small organism.
Your task is to assemble the genome in a few variants and to compare the results. Assembly will go only up to the level of contigs; there will be no mate pairs. The focus is on the discussion of the results.
Project C (RNA-seq, differential expression)
You will be given two groups of data sets, each containing RNA-seq reads from several samples.
Your task is to identify all genes that can be considered differentially expressed between the two groups.
There will be an exam covering the lecture part.
Grading will be based on the sum of points from the exam and from the exercises. The weight for those two will be around 50:50.
You will need access to a Linux installation to do the projects. The genomes used will be small, so Linux on a notebook or PC will be enough. Upon request, you can also use the Linux machine of the BI Institute.
Knowledge of R is required. At least one of the projects will have to be done in R.
... during the first lecture.