2
$\begingroup$

I am fairly new at RNA-sequencing analysis, but I have attempted to analyse the data I obtained from RNA sequencing on my own by following an online EdX class and by basing the majority of the analysis on the recommendations of Azenta in the following guide: A Quick Start Guide to RNA-Seq Data Analysis .

For context, I am looking at making two different comparisons each time between human A549 cells treated with control siRNA (non-targeted siRNA) and a targeted knockdown using siRNA. I have 3 sequenced replicates for each condition.

As mentioned I mainly followed the steps recommended by azenta:

  1. FASTQC to look at the quality of the samples (everything seems fine there)

  2. Trimming reads with trimmomatic (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:2:keepBothReads LEADING:3 TRAILING:3 MINLEN:36)

  3. Aligning the reads to the hg19 reference genome with STAR (the relevant files for the genome assembly, fasta and gtf, were retrieved from the gencode database)

  4. Calculate gene hit counts with featureCounts (in R; isGTFAnnotationFile=TRUE, countMultiMappingReads=FALSE, GTF.featureType="exon", largestOverlap=TRUE, GTF.attrType="gene_id")

  5. Compare hit counts between groups using DESeq2 with LFCshrink type Apeglem.

Without taking fold change into account, by only looking at the p-adjusted value < 0.05, we only identify 19 genes matching this parameter in one comparison and 2 in the other comparison. In each comparison, one of the genes that is identified as being differentially expressed is the one that was knocked down by our siRNA treatment.

We were expecting to identify much more differentially expressed genes, but since we do identify the knocked down gene, we are unsure whether something is wrong with our pipeline or with the samples? Or whether this could be interpreted as a "normal" amount of differentially expressed genes?

I tried different types of LFCshrink normalization which did not significantly alter the results. I also tried to run the easily accessible airway dataset through the DESeq2 pipeline, this dataset did seem to behave normally. I also tried a genome assembly based on files obtained from the UCSC database, which also did not alter the data.

Any guidance or advise would be appreciated to help figure out whether something is wrong with our analysis or not! Thank you for your help.

MA plot of comparison with 19 genes with padj < 0.05

PCA plot of comparison with 19 genes with padj < 0.05

$\endgroup$
4
  • 1
    $\begingroup$ Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. $\endgroup$ Commented May 31, 2024 at 21:07
  • $\begingroup$ What is the paired read length? Is it a stranded sequencing? Often there is no point of trimming RNA-Seq prior to mapping. STAR should be able to better figure out where to map a given pair of reads when you keep "a not so great quality but okeish sequence-wise" read ends. $\endgroup$ Commented Jun 5, 2024 at 9:19
  • $\begingroup$ While I doubt this will change the results drastically, I think it makes sense to use the best available version of the genome (T2T) and the best gene annotation. At least hg38/Gencode $\endgroup$ Commented Jun 5, 2024 at 10:01
  • $\begingroup$ Can you please change the Y range of that MA plot so that it includes all the data? There are currently what looks like hundreds of out-of-range genes. $\endgroup$ Commented Jun 6, 2024 at 19:42

2 Answers 2

2
$\begingroup$

Generally, the recommended way of doing such analysis is exploring data by a dimensionality reduction technique such as PCA, e.g. implemented in DESeq2 via plotPCA. I personally always also look at MA-plots (plotMA). This will inform how samples cluster, and if there is evidence that some unwanted technical variation is present, or that samples are simply not different, or that normalization might be an issue. Without these plots one really cannot help here.

It is of course possible that there are no DEGs. No difference is not necessarily a technical error.

$\endgroup$
4
  • 1
    $\begingroup$ thank you very much for your answer! I added the PCA and MA plots to the question. Based on your answer, I assume that since there is no clear clustering the case here is most likely that there is too much variation within the replicates and no issue with the analysis itself. $\endgroup$ Commented Jun 3, 2024 at 14:15
  • $\begingroup$ It might be helpful if you project other variables (other than treatment-control, for example batch, sex, ...) in your PCA plot, especially the replicates that you define above. For example I see treatment-control pairs in your PCA plots, if these pairs belong to individual replicates/batches, you can add that to your design model as a covariate and expect to find more DE genes. $\endgroup$ Commented Jun 4, 2024 at 6:59
  • $\begingroup$ I can only second what swbarnes2 is saying. The treatment does not seem to be doing much. Has the experiment been done in batches? $\endgroup$ Commented Jun 4, 2024 at 10:52
  • $\begingroup$ Yes the experiment has been done in batches, but the replicates treatment-control do not correspond to the seemingly nice pairing in the PCA plot. $\endgroup$ Commented Jun 4, 2024 at 13:34
2
$\begingroup$

The reason is that, according to your PCA, your treatment doesn't do very much. It's not what you want to hear, but that's probably the truth. You probably didn't do anything wrong, it's just what your experiment is.

I would not find it surprising that your knockdown affects only a handful of genes.

$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.