MCDB231L RNAseq data analysis Assignment
Subject Code: MCDB231L
To integrate all the data analysis steps you have learned, your final assignment is to analyze actual RNAseq data, all the way from the raw reads to differential gene expression, and to write a brief report on your findings. You should have most of the scripts from the assignments we did over the course of the semester. The final analysis and the report are individual work.
The data files you will get for this are:
control read files
experimental read files (tissue dissections)
some information on the intended dissection
a planarian transcriptome
a transcriptome annotation table
If you are taking the course for graduate credit please make sure that you have contacted me about the datasets you are planning to use instead. In this case you will also have to find your own transcriptome (or in a few cases genome) to map to.
As a reminder, the steps to take for the analysis are the following:
quality control to identify any potential issues (fastQC)
trimming and filtering to remove poor quality data (cutadapt)
generate a transcriptome index (bowtie2)
mapping the reads to the transcriptome (bowtie2)
summarize the mapping to obtain a counts table (samtools)
calling differential expression (DESeq2)
merging in the annotations (R)
making some plots or other summaries of the data (R)
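The command-line steps above can be sketched as a list of commands. The following Python sketch only assembles the command lines for one single-end sample and does not run them. The file names (`ctl_rep1.fastq.gz`, `transcriptome.fasta`), the output prefixes, and the quality and length cutoffs are illustrative assumptions, not values prescribed by the course; check each tool's help output for the options available in the installed version.

```python
import shlex

def pipeline_commands(reads, transcriptome, index_prefix, sample):
    """Return the command lines (as argument lists) for one single-end sample.

    All file names passed in are placeholders chosen by the caller.
    """
    trimmed = f"{sample}.trimmed.fastq.gz"
    sam = f"{sample}.sam"
    bam = f"{sample}.sorted.bam"
    return [
        # 1. quality control report, written to an existing qc/ directory
        ["fastqc", "-o", "qc", reads],
        # 2. trim low-quality 3' ends (-q) and drop reads that end up too short (-m);
        #    the cutoffs of 20 are assumptions, not course-mandated values
        ["cutadapt", "-q", "20", "-m", "20", "-o", trimmed, reads],
        # 3. build the transcriptome index (needed only once per transcriptome)
        ["bowtie2-build", transcriptome, index_prefix],
        # 4. map the trimmed single-end reads (-U) to the index (-x), writing SAM (-S)
        ["bowtie2", "-x", index_prefix, "-U", trimmed, "-S", sam],
        # 5. sort and index the alignments, then count mapped reads per transcript
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
        ["samtools", "idxstats", bam],
    ]

commands = pipeline_commands(
    "ctl_rep1.fastq.gz", "transcriptome.fasta", "planarian_index", "ctl_rep1"
)
for cmd in commands:
    print(shlex.join(cmd))
```

`samtools idxstats` prints one line per reference sequence with its name, its length, and the number of mapped and unmapped read segments; when mapping to a transcriptome, the third column can serve as the raw count for that transcript. For paired-end samples, bowtie2 takes the two read files with `-1` and `-2` instead of `-U`.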
The most typical analysis will be the comparison between the gene expression in one experimental condition and the control, but if you wish you can include multiple experimental conditions, or include other aspects.
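For such a comparison, the per-sample counts need to end up in a single table with one column per sample, together with a record of which condition each column belongs to. A minimal sketch in Python, assuming each sample's counts were saved as `samtools idxstats` output (tab-separated: transcript name, length, mapped reads, unmapped reads). The transcript names, sample names, and numbers in the example are made up for illustration.

```python
import csv
import io

def read_idxstats(text):
    """Parse samtools idxstats output into {transcript: mapped_read_count}."""
    counts = {}
    for name, _length, mapped, _unmapped in csv.reader(io.StringIO(text), delimiter="\t"):
        if name != "*":  # the '*' line holds reads not assigned to any reference
            counts[name] = int(mapped)
    return counts

def build_counts_table(per_sample):
    """Combine {sample: {transcript: count}} into (header, rows), filling gaps with 0."""
    samples = sorted(per_sample)
    transcripts = sorted({t for counts in per_sample.values() for t in counts})
    header = ["transcript"] + samples
    rows = [[t] + [per_sample[s].get(t, 0) for s in samples] for t in transcripts]
    return header, rows

def condition_of(sample):
    """Derive the condition from a name such as 'ctl_rep1' (this naming is an assumption)."""
    return sample.rsplit("_rep", 1)[0]

# Made-up idxstats output for two hypothetical samples
example = {
    "ctl_rep1": "tx1\t1000\t10\t0\ntx2\t500\t0\t0\n*\t0\t0\t7\n",
    "sample1_rep1": "tx1\t1000\t3\t0\ntx2\t500\t25\t0\n*\t0\t0\t4\n",
}
per_sample = {s: read_idxstats(text) for s, text in example.items()}
header, rows = build_counts_table(per_sample)
design = [(s, condition_of(s)) for s in header[1:]]
```

Written to a file, `header` and `rows` give a counts matrix and `design` gives the matching sample sheet; both can then be read into R for DESeq2.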
In the report, make sure to include the following sections:
Abstract
Brief summary of what you did and what you found. Typically no more than 300 words.
Introduction and Aim
Briefly describe the background to the project, what type of analysis you are planning to do, and what you hope to learn from this analysis; ~1.5 page.
Sample description
What samples you have used for your analysis (what organism, what treatments, what might you expect); ~0.5 page.
Data processing
What steps you have taken in the cleanup and processing (and include plots or statistics where informative); ~0.5 page.
Differential expression analysis
What comparison are you making for the differential expression? What stands out? Show informative plots or summaries of the resulting data; typically ~4 pages.
Conclusions
Your interpretation of the outcome, ideally tying it back to the introduction and aim; ~1.5 page.
(Page indications are for single spaced 12pt font, 0.75 margins)
Please also submit your annotated R code for the analysis. This can include more graphing and analysis than what made it into your report.
The report and script are due at the end of finals week, on Dec 21 at 5pm, by email to josien.van.wolfswinkel@yale.edu and yan.cheng@yale.edu. Please include your name in the names of the documents. We will start evaluating the reports after the Xmas weekend, and only look at the last submitted version.
read files
The files provided are bulk RNAseq data from the planarian Schmidtea mediterranea.
The sequencer output (fastq.gz) files are located in
/gpfs/ycga/project/mcdb231l/mcdb231l_jv434/dataFiles
There are 25 files in this folder.
control dataset:
These are 3 files that all start with ctl_ (3 replicates). All are single end.
experimental datasets:
Naming for the experimental files is as follows:
sample1_rep1_1.fastq.gz
sample1: sample name
_rep1: replicate
_1: read (of pair)
.fastq.gz: file extension
For each of the conditions there are replicates, but the number varies: most have 3 replicates, but samples 1 and 5 have only 2 replicates. Some are single end, some are paired end. You can choose whether to do a paired end or single end analysis.
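Because the number of replicates and the read layout differ between conditions, it can help to tabulate the files before mapping. A minimal sketch in Python that parses names of the form shown above and reports, per sample and replicate, whether one or two read files are present. The example file names are hypothetical, and the pattern assumes every experimental file carries a `_1` or `_2` suffix; adjust it if the names in the data folder differ.

```python
import re
from collections import defaultdict

# <sample>_rep<N>_<read>.fastq.gz, e.g. sample1_rep1_1.fastq.gz
NAME_PATTERN = re.compile(r"^(?P<sample>.+)_rep(?P<rep>\d+)_(?P<read>[12])\.fastq\.gz$")

def group_read_files(filenames):
    """Group files as {sample: {replicate: {read_number: filename}}}."""
    groups = defaultdict(lambda: defaultdict(dict))
    for name in filenames:
        match = NAME_PATTERN.match(name)
        if match is None:
            continue  # e.g. control files, whose exact naming is not assumed here
        groups[match["sample"]][int(match["rep"])][int(match["read"])] = name
    return groups

def layout(reads):
    """Return 'paired' if both read 1 and read 2 are present, else 'single'."""
    return "paired" if {1, 2} <= set(reads) else "single"

# Hypothetical listing; replace with the actual file names in the data folder
files = [
    "sample1_rep1_1.fastq.gz", "sample1_rep1_2.fastq.gz",
    "sample1_rep2_1.fastq.gz", "sample1_rep2_2.fastq.gz",
    "sample2_rep1_1.fastq.gz",
]
groups = group_read_files(files)
for sample, reps in sorted(groups.items()):
    for rep, reads in sorted(reps.items()):
        print(sample, rep, layout(reads))
```

If you opt for a single-end analysis of a paired-end sample, one option is to map only the read 1 files.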
Each of the conditions is a tissue isolation of unknown purity. Potential intended tissue isolations are intestine, epidermis, stem cells, pharynx, and brain. Stem cells have been isolated by Fluorescence-Activated Cell Sorting (FACS); the other tissues have been isolated by microsurgery. Once you have completed your analysis, you can suggest which isolation you think the sample was and comment on its predicted purity based on the RNAseq data.