$$ cd /home/santhilal/bin/ $$ wget http://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.5.zip $$ unzip fastqc_v0.11.5.zip
$$ cd /home/santhilal/bin/ $$ wget ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.5-Linux_x86_64.zip $$ unzip hisat2-2.0.5-Linux_x86_64.zip
$$ cd /home/santhilal/bin/ $$ wget https://superb-dca2.dl.sourceforge.net/project/samstat/samstat-1.5.1.tar.gz $$ tar -zxvf samstat-1.5.1.tar.gz
$$ cd /home/santhilal/bin/ $$ wget https://excellmedia.dl.sourceforge.net/project/subread/subread-1.5.1/subread-1.5.1-Linux-x86_64.tar.gz $$ tar -zxvf subread-1.5.1-Linux-x86_64.tar.gz
$$ cd /home/santhilal/bin/ $$ wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 $$ tar -xvjf samtools-1.9.tar.bz2 $$ cd samtools-1.9 $$ ./configure $$ make
>chr1 CTGATGGTATGTACCTTTAATATAATTTGGTGAGAATGGTACTTCACCTC TGTGGTCTTCCTCCCCAAAACCCATAACCCCAGTCAAACCCTAAGAAAAA CATCAGCCAAACTCAAATTTAGGGCATTCTGCAAAAATAACTGACTGACT AGTACTCCCTCAAACTGTCAGGGTCACCAAAAACAAGGAAAGTCTAAGAA ACTCACAGCCAAGAGGTACCTAAAATGACAAATAAATTCAATGTGGTGGC CCGGATGGGATCCTGGAACAAAGAAAAAGAATGTTAGCTAAAAACTAAGG AAATCTAAATAAATGTATAGGCTTTAGTTAATAATACTCTATTGAATATT CATTAGTTGTGACAAATATGAGAGGTTAGTAATAGGAAAAAAATGGGAAC TCTGTGTATCTTTGAAATTTTTCTGTATATCTGAACTATTCTAAAATTTT AGTACCACTAAAGGTTTTTATTACATAATAGTGTTTTAATTTAATCTCCA TGCTATTTAATGAAATTTACCAGTTGTATTTGTGTTTGATTAGACATTAT CTCAAGTTGATTGGAATTACTCATATTTCTTAATGCCATTGTTGCAGCTG TAATAAGGTTGACTTTTCTACAATAGACACAGAAATCATTACATCACCAT TTAATTTTTGAAAAATAATCCGATCCCCTGTGATTATAAACTTGAGATTC TTTAAGAAAGTAGAAAGGTTGTAAGATATTATGAGTCATAGGATATTTCC AGTTTTGAAGTAATAAGACAAAAATCATTAACTAAACAAATTGGCATTAA AAAAAAACAGTATTTTTGAGAGGACCAGTCCAGACAATGTGTCTGTAAGA TAGACTTTACAAACTTAGTTTGATTAATTTTTGTTTTTCCGAGGATATGA TTCTTTTCTATAATTACTGTTCTGTTTTTTTCTGACAGGCTATCAAATAT TGGCTGCCTTAAGCTCTCTGTGAACTGAGTTATTCTCAATAATACTCTGT1) Download the recent version of genome file for Homo sapiens from Ensembl database in directory 'genome/Chromosomes'. Navigate to the directory and then type this command.
$$ cd /data1/santhilal/rnaseq_analysis/genome/Chromosomes $$ wget ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gzNote: File downloaded/link obtained from source 'http://www.ensembl.org/info/data/ftp/index.html'
$$ gunzip -d Homo_sapiens.GRCh38.dna.primary_assembly.fa.gzTranscript annotation:
$$ wget ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gzNote: File downloaded/link obtained from source 'http://www.ensembl.org/info/data/ftp/index.html'
$$ gzip sample.fastqThe above command will generate a compressed file 'sample.fastq.gz'. In this tutorial I will provide you with already compressed file. To view fastq file you can use 'cat' and for compressed fastq use 'zcat' (remember to display only head).
@SRR1958169.1 HISEQ:269:HAMADADXX:1:1101:1168:1870 length=51 NGCGTGGAAGGGGTGGAGCCGCACCCGGATATGGAAGCCATCTTTGCCACA +SRR1958169.1 HISEQ:269:HAMADADXX:1:1101:1168:1870 length=51 #1=DDFFFHHHHHFHIIJJJJJJJJJJJJJJJJIJJJJJJHHHHHHFFFFF @SRR1958169.2 HISEQ:269:HAMADADXX:1:1101:1295:1887 length=51 NTAGGGTGAGCCCCACATCGGCCTGTGTATATCCCAGGGTGATCCTCTTCT +SRR1958169.2 HISEQ:269:HAMADADXX:1:1101:1295:1887 length=51 #1=DDDDDHHHHHIIIIIIDDGHGE9?:?DGDFHDD<BDH;FFHGGGIIIH
Dataset will be used from published study "The lncRNA DEANR1 Facilitates Human Endoderm Differentiation by Activating FOXA2 Expression". Steps to be followed to download the RAW sequencing file from the GEO repository.
1) Go to https://www.ncbi.nlm.nih.gov/geo/. In the search box paste the GSE accession id/number you found from the article (Tip: You can easily find GEO id in PDF of the article).
2) Note down and copy the PRJNAxxx (project id) and get information about the samples from http://www.ebi.ac.uk/ena
3) In the 'Select Columns' section select (check the box) 'Experiment title and Sample title'.
4) You will find 'WT', 'FOXA2KD' and 'DEANR1KD' samples (2 replicate each = 6 samples) along with other samples from the study.
5) Save the generated text file by clicking on the link 'Text'. Save it with name 'PRJNA192215.txt'.
6) You will find the links in 'fastq_ftp' column of the text file. Download the samples corresponding to 'WT', 'FOXA2KD' and 'DEANR1KD' knockdown samples (6 samples).
7) Rename the files into simple names 'WT_1.fastq.gz', 'WT_2.fastq.gz', 'DEANR1KD_1.fastq.gz', 'DEANR1KD_2.fastq.gz', 'FOXA2KD_1.fastq.gz' and 'FOXA2KD_1.fastq.gz'.
Downloading: Below mentioned command will download the files from the path mentioned in the downloaded information file 'PRJNA192215.txt'. Since downloading will take long time, you can submit this job in background. In the following command I am redirecting normal nohup.out and saving into 'download_nohup.out'. If you do not want to rename you just use nohup YOUR_COMPLETE_COMMAND &. This will save your progress in default 'nohup.out'.
Tip: (nohup/background processing)
$$ cd /data1/santhilal/rnaseq_analysis/fastq/ $$ nohup cat /data1/santhilal/rnaseq_analysis/fastq/PRJNA192215.txt | sed -n '3,8p' | cut -f2 | xargs wget > download_nohup.out 2>&1 & # Print only 3rd to 8th row from the file 'PRJNA192215.txt' and cut column 12 containing download linksKeep checking tail/last lines of 'download_nohup.out'. If it is still downloading you will see different line each time you 'tail' nohup.out. Orelse simple relay on 'top' command. Easy and reliable way to check is 'grep "saved" /data1/santhilal/rnaseq_analysis/fastq/download_nohup.out'. When the sample is downloaded and saved to your disk, wget assigns for each sample 'saved' status. Once you see 'saved' six times means all your six samples completed downloading and saved to your disk.
$$ nohup /home/santhilal/bin/FastQC/fastqc -o /data1/santhilal/rnaseq_analysis/QualityCheck/ -t 6 /data1/santhilal/rnaseq_analysis/fastq/*.fq.gz &Remember from the above command you are running all 6 jobs parellel. You can utilize 'top' command and 'nohup.out' file to check the status of submitted job (Refer lecture slides).
$$ cd /data1/santhilal/rnaseq_analysis/genome/HISAT2_index $$ nohup /home/santhilal/bin/hisat2-2.0.5/hisat2-build -p 8 /data1/santhilal/rnaseq_analysis/genome/Chromosomes/Homo_sapiens.GRCh38.dna.primary_assembly.fa /data1/santhilal/rnaseq_analysis/genome/HISAT2_index/hg38 &HISAT2 Alignment: Next step is alignment of downloaded fastq files to the indexed genome. Since we are going to submit more than one alignment jobs in the background. All the logs from the submitting jobs will be overwritten to the same file '/data1/santhilal/rnaseq_analysis/Alignment/nohup.out'. To avoid this, we are going to give unique 'nohup' name for each submission. For that we will be renaming the default 'nohup.out' to a name appropriate for the job. Every alignment program outputs alignment in SAM format.
$$ cd /data1/santhilal/rnaseq_analysis/Alignment $$ mkdir WT_1 WT_2 DEANR1KD_1 DEANR1KD_2 FOXA2KD_1 FOXA2KD_2 $$ nohup /home/santhilal/bin/hisat2-2.0.5/hisat2 --phred33 -p 8 --qc-filter -x /data1/santhilal/rnaseq_analysis/genome/HISAT2_index/hg38 -U /data1/santhilal/rnaseq_analysis/fastq/DEANR1KD_1.fastq.gz | /home/santhilal/bin/samtools-1.9/samtools view -@ 8 -bS - | /home/santhilal/bin/samtools-1.9/samtools sort -@ 8 -O BAM -o /data1/santhilal/rnaseq_analysis/Alignment/DEANR1KD_1/DEANR1KD_1_sorted.bam - &Similarly submit alignment jobs for other samples based on available CPUs/memory on the server.
SRR1958167.113755 256 10 27135 1 51M * 0 0 TCATCACCAAATAGTGCTACCTAAAGTAATATACAACCAATTTAATAAGTT CCCFFFFFHHHHHJHGIJJJJJJJJJHIJJJJJJJJJJJJJJJJJJJJJIJ NH:i:3 HI:i:2 AS:i:50 nM:i:0 SRR1958167.6172299 256 10 27763 1 51M * 0 0 AGCTCAGCCATGGCCTTGGGTCTGAAGCACAGAGCTGTGTGGAGTCTGGGC B@CFFFFFHHHHHJJJJJJJIJJJIJJJJJJHIIHIJJJIIIIJJJJJJJJ NH:i:3 HI:i:2 AS:i:50 nM:i:0 SRR1958167.16554448 256 10 28612 0 1S50M * 0 0 GGCAGGAGAGACTGAGGTGATGAGGGTCCAGAGTGGACCCCCAGAAAACCA CCCFFFFFHGHHHJIJJFHIIJJJIJIJJJJIJGIJJJIJJJIJIJIJJJJ NH:i:5 HI:i:5 AS:i:49 nM:i:0 SRR1958167.4929810 272 10 31436 0 51M * 0 0 GATCACCTGAGGTGAGGAGTTTGAGACCAGCCTGACCAACATGGAGAAACC HD999F;FDG?3IIGFHCIIIIIHHEHGHGECBEIIIIHGHHFEDDB=1@= NH:i:7 HI:i:4 AS:i:50 nM:i:0 SRR1958167.11547776 272 10 31436 0 51M * 0 0 GATCACCTGAGGTGAGGAGTTTGAGACCAGCCTGACCAACATGGAGAAACC JIIHHJJJJJJJJJIJJJJJJJJIIIJJJJIIJJIJJJHHHHGFFFFFBCC NH:i:7 HI:i:4 AS:i:50 nM:i:0 SRR1958167.23532591 272 10 31436 0 51M * 0 0 GATCACCTGAGGTGAGGAGTTTGAGACCAGCCTGACCAACATGGAGAAACC JJIIHJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJHHHHHFFFFDB@B NH:i:7 HI:i:4 AS:i:50 nM:i:0 SRR1958167.4156273 272 10 31546 0 51M * 0 0 AGCTACTTGGGAGGCTGAGGCAGGAGAGTCACTTGAACCCGGGAGGCAGAG IJGJJJJJJJIJJJJJJJJJJHJJIJJJJIIJJJJHGJHHHHHFFFFFCCC NH:i:10 HI:i:2 AS:i:50 nM:i:0 SRR1958167.19438259 16 10 31558 1 47M4S * 0 0 GGCTGAGGCAGGAGAGTCACTTGAACCCGGGAGGCAGAGGTTGCGGTGGGC HAEHGEGEGHGGCIIIEHFIJIIIJJIIIIJGGIHIJJHHHHHFFFDFCC@ NH:i:4 HI:i:3 AS:i:46 nM:i:0 SRR1958167.6675573 272 10 39012 0 51M * 0 0 AGACACAACCAAAAAAGAGAATTTTAGACCACTATCCTTGATGAACATTGG DBGHEIHFJJJJJJIHEJJHGGHHHHGHIIJJIJJJGIFHDHHFFDDB@@@ NH:i:10 HI:i:2 AS:i:48 nM:i:1 SRR1958167.6855956 0 10 46995 3 1S50M * 0 0 CGGCTGTCAGAACACAGTAAAGAATCCACACTGCTTCCCCCCTTTACCTAG CCCFFFFFHHHHHJIJJIJJJJIJJJJJJJJIJIJJJJJJJJHIJHJJIJG NH:i:2 HI:i:1 AS:i:49 nM:i:0Refer SAM format manual section 1.4 for complete description.
$$ export PATH=$PATH:/home/santhilal/bin/samtools/samtools $$ /home/santhilal/bin/samstat-1.5.1/src/samstat /data1/santhilal/rnaseq_analysis/Alignment/DEANR1KD_1/DEANR1KD_1.bamor you can run all the jobs in background.
$$ export PATH=$PATH:/home/santhilal/bin/samtools-1.9/samtools $$ cd /data1/santhilal/rnaseq_analysis/Alignment $$ nohup /home/santhilal/bin/samstat-1.5.1/src/samstat /data1/santhilal/rnaseq_analysis/Alignment/*/*.bam > nohupSAMstat.out 2>&1The above command will generate a HTML file '/data1/santhilal/rnaseq_analysis/Alignment/DEANR1KD_1/DEANR1KD_1.bam.html'. This file can be opened with any web browser. More information on SAMstat is here.
$$ nohup /home/santhilal/bin/subread-1.5.0-p1-source/bin/featureCounts -T 6 -t exon -g gene_id -Q 30 -F GTF -a /data1/santhilal/rnaseq_analysis/annotation/Homo_sapiens.GRCh38.87.gtf -o /data1/santhilal/rnaseq_analysis/quantification/hg38_ensembl_GRCh38.87_counts.txt /data1/santhilal/rnaseq_analysis/Alignment/WT_1/WT_1_sorted.bam /data1/santhilal/rnaseq_analysis/Alignment/WT_2/WT_2_sorted.bam /data1/santhilal/rnaseq_analysis/Alignment/DEANR1KD_1/DEANR1KD_1_sorted.bam /data1/santhilal/rnaseq_analysis/Alignment/DEANR1KD_2/DEANR1KD_2_sorted.bam /data1/santhilal/rnaseq_analysis/Alignment/FOXA2KD_1/FOXA2KD_1_sorted.bam /data1/santhilal/rnaseq_analysis/Alignment/FOXA2KD_2/FOXA2KD_2_sorted.bam > nohupFeatureCounts.out 2>&1
samples labels condition WT_1 control1 control WT_2 control2 control DEANR1KD_1 DEANR1KD1 DEANR1 DEANR1KD_2 DEANR1KD2 DEANR1 FOXA2KD_1 FOXA2KD1 FOXA2 FOXA2KD_2 FOXA2KD2 FOXA2Prepare quantification table: From the quantification table, we will have lot of information which will not be required in next steps. We have to remove those extra columns and just create a matrix with the required information.
$$ cat /data1/santhilal/rnaseq_analysis/quantification/hg38_ensembl_GRCh38.87_counts.txt | grep -v "^#" | cut -f1,7,8,9,10,11,12 | sed '1d' | sed '1i\Geneid\tWT1_1\tWT1_2\tDEANR1KD_1\tDEANR1KD_2\tFOXA2KD_1\tFOXA2KD_2' > /data1/santhilal/rnaseq_analysis/quantification/hg38_ensembl_GRCh38.87_counts.matrixPrepare annotation table: Make a gene-wise table from downloaded GTF ('/data1/santhilal/rnaseq_analysis/annotation/Homo_sapiens.GRCh38.87.gtf') as explained in the Main assignment 2 and save it to the directory '/data1/santhilal/rnaseq_analysis/annotation/' with name 'Homo_sapiens.GRCh38.87_gene_annotation.txt'.
$$ R ## try http:// if https:// URLs are not supported >> source("https://bioconductor.org/biocLite.R") # Connecting repository >> biocLite("edgeR") # Installing >> library("edgeR") # Loading installed edgeR librariesLoad your design matrix to a variable 'mydesign' and your quantification table in 'x' variable. Remember we prepared an annotation table using GTF, load that into 'dbs' variable. Final thing is specifying the path where your outputs from edgeR to be saved (in this case the 'outpath' will have path stored in the variable). In the following commands,
>> mydesign <- read.table("/data1/santhilal/rnaseq_analysis/DE/exp_design.txt", header=T, sep="\t") >> x <- read.table("./data1/santhilal/rnaseq_analysis/quantification/hg38_ensembl_GRCh38.87_counts.matrix", header=T, sep="\t", row.names=1) >> dbs <- read.table("/data1/santhilal/rnaseq_analysis/annotation/Homo_sapiens.GRCh38.87_gene_annotation.txt", header=T, sep="\t") >> outpath <- "/data1/santhilal/rnaseq_analysis/DE/"Check if the above variables are created properly by using 'ls()'. Next check the contents of the variables just by typing the variables names you created above (if the variable has huge data, do not print everything instead you can just print first ten entries using head).
>> ls() >> mydesign >> head(x) >> head(dbs) >> outpathGive information about the groups to edgeR along with loaded counts matrix.
>> group <- factor(mydesign$condition) >> d <- DGEList(counts=x, group=group)Normalizing reads to counts per million (CPM) and filtering the genes by keeping only the genes with CPM > 5 in atleast one sample (prominantly expressed in atleast one sample). We should be carefull in choosing this because if you are analyzing for non-coding RNAs, this value (cpm>5) is too high. Since our aim in this exercise is to get only top significantly expressed protein-coding genes, we are using more stringent criteria.
>> dim(d) >> keep <- rowSums(cpm(d)>5) >= 1 >> d <- d[keep,] >> dim(d) >> d$samples$lib.size <- colSums(d$counts)Dispersion estimates between samples (even within group) and between groups. This estimate is important for edgeR to calculate the differential expression because it gives information about how close or how variable the samples or groups. Read more about dispersion on page 44 of edgeR user guide.
>> d <- calcNormFactors(d) >> d1 <- estimateCommonDisp(d, verbose=T) >> d1 <- estimateTagwiseDisp(d1)## Plotting correlation of samples with normalized counts
>> png(paste0(outpath,"sample_correlation.png")) >> pheatmap(cor(log2(cpm(d)+1)), main="SampleCorrelation") >> dev.off()## Plotting library sizes of samples (Total number of sequenced reads assigned to all genes from the GTF file)
>> png(paste0(outpath,"sample_librarySizes.png")) >> barplot(d$samples$lib.size,las=2,names=colnames(d),cex.names=0.6,horiz=T,main="LibrarySizes") >> dev.off()## Differential expression between WT and DEANR1
>> et <- exactTest(d1, pair=c("control","DEANR1")) # Testing for differential expression betwee groups >> res <- topTags(et, n=length(et)) # Extracting differentially expressed genes (if you just specify, topTags(et) will give you first ten DEGs) >> res$table[,"Geneid"] <- rownames(res$table) # making rownames of the results to be first column of the results with column name 'Geneid' (easy for saving) >> colnames(res$table) # Checking the columns of results >> resF <- res$table[,c("Geneid","logFC","logCPM","PValue","FDR")] # Storing selective columns to a variable 'resF'
>> annotated <- merge( dbs, resF, by='Geneid')
>> write.table(annotated,paste0(outpath,"DEANR1/DEANR1_DEGs_annotated.txt"), sep="\t", quote=F,row.names=F)## Saving normalized read counts (counts per million) for all samples
>> nc <- cpm(d1$counts, normalized.lib.sizes=FALSE) # Normalization >> nc <- cbind(rownames(nc),nc) # converting rownames ('Geneid') into first column of 'nc', it will be useful for annotation. >> colnames(nc)[1] <- "Geneid" # Naming the first column as 'Geneid' >> nc_annotated <- merge( dbs, nc, by='Geneid') # Merging 'dbs' and 'nc' by matching column 'Geneid' >> write.table(nc_annotated,paste0(outpath,"normalized_cpm.txt"), sep="\t", quote=F,row.names=FALSE)
>> annotated_filt <- subset(annotated,abs(annotated$logFC)>=1 & annotated$FDR<=0.01) >> write.table(annotated_filt,paste0(outpath,"DEANR1/DEANR1_DEGs_annotated_filtered_all.txt"), sep="\t", quote=F,row.names=F)# Filtered with log-fold change >=1 amd FDR < 0.05 (only protein coding)
>> annotated_filt_pc <- subset(annotated,abs(annotated$logFC)>=1 & annotated$FDR<=0.01 & annotated$Class=="protein_coding") >> write.table(annotated_filt_pc,paste0(outpath,"DEANR1/DEANR1_DEGs_annotated_filtered_PC.txt"), sep="\t", quote=F,row.names=F)
>> summary(annotated_filt$Class)Question RS3: Find differentially expressed protein coding genes between WT and FOXA2. Submit the complete edgeR code for this analysis.