Run differential expression (DE) analysis for several comparisosns at a time. Internally it uses
DESeq2::DESeq()
to perform DE analysis.
Source: R/rnaseq_related.R
run_deseq_analysis.Rd
DESeq2 is a popular method to perform DE analysis for RNA-seq data. Pre and post DESeq run, however,
involves several data wrangling steps. For example, prior to run DESeq genes may need to be filtered out based on
number of reads mapped to genes across replicates. Similarly, post DESeq run user needs to set cutoffs (log2fc, pvalue and padj)
to define up
and down
regulated genes. For large RNA-seq experiments involving several DE comparison such as sub-setting
and different cutoffs creates messy and less readable code. This function helps to make things bit tidy and increase
the code readability. Besides that, output of this function can subsequently used for several other functions of this package.
Usage
run_deseq_analysis(
counts,
column_geneid,
sample_info,
group_numerator,
group_denominator,
delim = "\t",
comment_char = "#",
min_counts = 10,
min_replicates = 1,
cutoff_lfc = 1,
cutoff_pval = 0.05,
cutoff_padj = 0.01,
regul_based_upon = 1,
print_rows_all_zero = FALSE
)
Arguments
- counts
a character string of the path to a count file or an object of dataframe having raw counts for each gene. See details below to know more about format of the count file and count dataframe.
- column_geneid
a character string denoting a column of geneid in
counts
- sample_info
a character string denoting a name of sample information file or a dataframe. A file or a dataframe both must have at least two columns WITHOUT column names. First column denotes to samples names and second column denotes group name for each sample in first column. For e.g.
control_rep_1 Control control_rep_2 Control control_rep_3 Control treat_1_rep_1 Treatment_1 treat_1_rep_2 Treatment_1 treat_1_rep_3 Treatment_1 treat_2_rep_1 Treatment_2 treat_2_rep_2 Treatment_2 treat_2_rep_3 Treatment_2 - group_numerator
a character vector denoting sample groups to use in numerator to calculate fold change.
- group_denominator
a character vector denoting sample groups to use in denominator to calculate fold change.
- delim
a character denoting deliminator for
count
file. Only valid ifcount
is a file path.- comment_char
a character denoting comments line in count file. Only valid if
count
is a file path.- min_counts
a numeric value, default 10, denoting minimum counts for a gene to be used to consider a gene for differential expression analysis.
- min_replicates
a numeric value, default 1, denoting minimum samples within a sample group must have
minimum_counts
. Value provided must not be higher than number of samples in each group. For example for given valuesmin_replicates = 2
andminimum_counts = 10
the genes which have minimum counts 10 in at least 2 samples within a groups will be accounted for DEG. Rest will be filtered out prior to run DESeq.- cutoff_lfc
minimal threshold for log2fold change, default 1 (2 fold).
- cutoff_pval
minimal threshold for pvalue, default 0.05. P-value threshold will be applied only when
regul_based_upon
is either 1 or 3.- cutoff_padj
minimal threshold for Padj, default 0.01. Padj threshold will be applied only when
regul_based_upon
is either 2 or 3.- regul_based_upon
one of the numeric choices 1, 2, or 3.
if 1 ...
Up : log2fc >= cutoff_lfc & pvalue <= cutoff_pval
Down : log2fc <= (-1) * cutoff_lfc & pvalue <= cutoff_pval
Other : remaining genes
- print_rows_all_zero
logical, default FALSE, denoting whether to print genes with value 0 in all columns.
Value
Return object is a dataframe (tibble) having each row denoting a unique differential comparison. There are total 8 columns as explained below.
de_comparisons
: It stores the name of differential comparison for each row.numerator
: It stores name of samples which were used as numerator for the differential comparison in each row.denominator
: It stores name of samples which were used as denominator for the differential comparison in each row.norm_counts
: This is anamed-list
column. Each row in this column is a list of two containing normalised genes expression values in a dataframe for the samples - numerator and denominator. The first column of the dataframe isgene_id
and subsequent columns are gene expression values in replicates of corresponding samples. This normalised gene expression values are obtained usingcounts
slot of a DESeqDataSet object. e.g.:counts(dds,normalized=TRUE)
dsr
: This is anamed-list
column stores an object of classDESeq2::DESeqResults()
for the differential comparison in each row.dsr_tibble
: This is anamed-list
column stores and an output ofDESeq2::DESeqResults()
in the dataframe format for the differential comparison in each row.dsr_tibble_deg
: The data in this column is same as in the columndsr_tibble
except it contains two extra columnssignif
andregul
. Values in thesignif
specifies statistical and fold change significance of the gene while values in theregul
denotes whether the gene isup
ordown
regulated.deg_summmary
: This is anamed-list
column. Each element of the list is a dataframe summarizing number of differential expressed gene for the differential comparison for each row.
Details
: For the argument count
user can either provide a character string denoting a file or an object of dataframe.
In each case required format is explained below.
Count file
: Count file is a table of row read counts usually derived from .bam file for different genomic features (e.g. genes, transcripts etc.). Data in the count file must be in a tabular format with a valid column deliminator (e.g., tab, comma etc.). First row and first column will be considered as column names and row names respectively. Values for column names and row names are usually character string or combination of character, numbers and special characters such as_
, or.
. Both row names and column names must have unique values.Count dataframe
: Count data in a dataframe format having same requirement of row names and column names explained forcount file
.
Examples
count_file <- system.file("extdata","toy_counts.txt" , package = "parcutils")
count_data <- readr::read_delim(count_file, delim = "\t",show_col_types = FALSE)
sample_info <- count_data %>% colnames() %>% .[-1] %>%
tibble::tibble(samples = . , groups = rep(c("control" ,"treatment1" , "treatment2"), each = 3) )
res <- run_deseq_analysis(counts = count_data ,
sample_info = sample_info,
column_geneid = "gene_id" ,
cutoff_lfc = 1 ,
cutoff_pval = 0.05,
group_numerator = c("treatment1", "treatment2") ,
group_denominator = c("control"))
#> ℹ Running DESeq2 ...
#> converting counts to integer mode
#> Warning: some variables in design formula are characters, converting to factors
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> ✔ Done.
#> ℹ Summarizing DEG ...
#> ✔ Done.
res
#> ┌────────────────────────────┐
#> │ │
#> │ Summary of DE analysis │
#> │ │
#> └────────────────────────────┘
#>
#> Total number of comparisons: 2
#> Total number of genes used for DE analysis: 4034
#>
#> DE gene counts by comparison...
#>
#> treatment1_VS_control
#> • number of up genes : 13.
#> • number of down genes : 28.
#> ──────────────────────────────
#>
#> treatment2_VS_control
#> • number of up genes : 333.
#> • number of down genes : 502.
#> ──────────────────────────────
#>
## all comparisons
print(res$de_comparisons)
#> [1] "treatment1_VS_control" "treatment2_VS_control"
## DESEq result object(s)
print(res$dsr)
#> $treatment1_VS_control
#> log2 fold change (MLE): sample_groups treatment1 vs control
#> Wald test p-value: sample groups treatment1 vs control
#> DataFrame with 4034 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat
#> <numeric> <numeric> <numeric> <numeric>
#> ENSG00000154342:WNT3A 715.7962 -0.6014175 0.0931795 -6.454399
#> ENSG00000196415:PRTN3 4242.8862 -0.0713742 0.0748438 -0.953643
#> ENSG00000173020:GRK2 671.9682 -0.2560099 0.1139957 -2.245787
#> ENSG00000140598:EFL1 80.7054 -0.1764337 0.2040219 -0.864778
#> ENSG00000139263:LRIG3 1279.0478 -0.1378660 0.1788328 -0.770921
#> ... ... ... ... ...
#> ENSG00000198046:ZNF667 99.3996 -0.3200579 0.1852360 -1.727839
#> ENSG00000231051:USP17L28 78.8482 -0.0354036 0.2034285 -0.174035
#> ENSG00000188783:PRELP 1106.4538 0.1615409 0.0828499 1.949802
#> ENSG00000213822:CEACAM18 834.6362 -0.2330325 0.0727700 -3.202317
#> ENSG00000167588:GPD1 2225.9670 0.3957288 0.0767754 5.154371
#> pvalue padj
#> <numeric> <numeric>
#> ENSG00000154342:WNT3A 1.08649e-10 6.58175e-09
#> ENSG00000196415:PRTN3 3.40264e-01 5.52032e-01
#> ENSG00000173020:GRK2 2.47177e-02 9.02358e-02
#> ENSG00000140598:EFL1 3.87161e-01 5.92821e-01
#> ENSG00000139263:LRIG3 4.40754e-01 6.38483e-01
#> ... ... ...
#> ENSG00000198046:ZNF667 8.40172e-02 2.18175e-01
#> ENSG00000231051:USP17L28 8.61838e-01 9.35689e-01
#> ENSG00000188783:PRELP 5.11997e-02 1.53283e-01
#> ENSG00000213822:CEACAM18 1.36327e-03 9.68022e-03
#> ENSG00000167588:GPD1 2.54484e-07 7.41830e-06
#>
#> $treatment2_VS_control
#> log2 fold change (MLE): sample_groups treatment2 vs control
#> Wald test p-value: sample groups treatment2 vs control
#> DataFrame with 4034 rows and 6 columns
#> baseMean log2FoldChange lfcSE stat
#> <numeric> <numeric> <numeric> <numeric>
#> ENSG00000154342:WNT3A 715.7962 0.5571318 0.0895747 6.219743
#> ENSG00000196415:PRTN3 4242.8862 0.7288109 0.0744285 9.792097
#> ENSG00000173020:GRK2 671.9682 -0.5424934 0.1157404 -4.687156
#> ENSG00000140598:EFL1 80.7054 0.0145278 0.2046425 0.070991
#> ENSG00000139263:LRIG3 1279.0478 -1.1836335 0.1804990 -6.557562
#> ... ... ... ... ...
#> ENSG00000198046:ZNF667 99.3996 -0.02731340 0.1844465 -0.1480831
#> ENSG00000231051:USP17L28 78.8482 -0.58915168 0.2158321 -2.7296765
#> ENSG00000188783:PRELP 1106.4538 0.27690150 0.0831212 3.3312999
#> ENSG00000213822:CEACAM18 834.6362 -0.54878553 0.0750902 -7.3083473
#> ENSG00000167588:GPD1 2225.9670 -0.00404444 0.0776078 -0.0521138
#> pvalue padj
#> <numeric> <numeric>
#> ENSG00000154342:WNT3A 4.97971e-10 1.55722e-09
#> ENSG00000196415:PRTN3 1.21745e-22 8.13110e-22
#> ENSG00000173020:GRK2 2.77028e-06 6.29949e-06
#> ENSG00000140598:EFL1 9.43405e-01 9.51662e-01
#> ENSG00000139263:LRIG3 5.46947e-11 1.84019e-10
#> ... ... ...
#> ENSG00000198046:ZNF667 8.82277e-01 8.98311e-01
#> ENSG00000231051:USP17L28 6.33965e-03 9.94329e-03
#> ENSG00000188783:PRELP 8.64414e-04 1.51348e-03
#> ENSG00000213822:CEACAM18 2.70448e-13 1.06334e-12
#> ENSG00000167588:GPD1 9.58438e-01 9.64655e-01
#>
## DESEq result data frame
print(res$dsr_tibble)
#> $treatment1_VS_control
#> # A tibble: 4,034 × 7
#> gene_id baseMean log2FoldCh…¹ lfcSE stat pvalue padj
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ENSG00000154342:WNT3A 716. -0.601 0.0932 -6.45 1.09e-10 6.58e-9
#> 2 ENSG00000196415:PRTN3 4243. -0.0714 0.0748 -0.954 3.40e- 1 5.52e-1
#> 3 ENSG00000173020:GRK2 672. -0.256 0.114 -2.25 2.47e- 2 9.02e-2
#> 4 ENSG00000140598:EFL1 80.7 -0.176 0.204 -0.865 3.87e- 1 5.93e-1
#> 5 ENSG00000139263:LRIG3 1279. -0.138 0.179 -0.771 4.41e- 1 6.38e-1
#> 6 ENSG00000165898:ISCA2 4072. -0.359 0.181 -1.99 4.69e- 2 1.44e-1
#> 7 ENSG00000179454:KLHL28 13435. -0.0627 0.0989 -0.634 5.26e- 1 7.11e-1
#> 8 ENSG00000188868:ZNF563 529. -0.326 0.111 -2.93 3.36e- 3 2.01e-2
#> 9 ENSG00000249158:PCDHA11 2.38 -2.21 1.40 -1.58 1.14e- 1 NA
#> 10 ENSG00000110852:CLEC2B 7.51 -0.698 0.672 -1.04 2.99e- 1 NA
#> # … with 4,024 more rows, and abbreviated variable name ¹log2FoldChange
#>
#> $treatment2_VS_control
#> # A tibble: 4,034 × 7
#> gene_id baseMean log2FoldC…¹ lfcSE stat pvalue padj
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ENSG00000154342:WNT3A 716. 0.557 0.0896 6.22 4.98e-10 1.56e- 9
#> 2 ENSG00000196415:PRTN3 4243. 0.729 0.0744 9.79 1.22e-22 8.13e-22
#> 3 ENSG00000173020:GRK2 672. -0.542 0.116 -4.69 2.77e- 6 6.30e- 6
#> 4 ENSG00000140598:EFL1 80.7 0.0145 0.205 0.0710 9.43e- 1 9.52e- 1
#> 5 ENSG00000139263:LRIG3 1279. -1.18 0.180 -6.56 5.47e-11 1.84e-10
#> 6 ENSG00000165898:ISCA2 4072. 0.193 0.181 1.07 2.86e- 1 3.32e- 1
#> 7 ENSG00000179454:KLHL28 13435. 0.0257 0.0990 0.260 7.95e- 1 8.21e- 1
#> 8 ENSG00000188868:ZNF563 529. 0.00983 0.111 0.0890 9.29e- 1 9.40e- 1
#> 9 ENSG00000249158:PCDHA11 2.38 -1.31 1.31 -1.00 3.17e- 1 3.65e- 1
#> 10 ENSG00000110852:CLEC2B 7.51 -1.39 0.731 -1.90 5.69e- 2 7.65e- 2
#> # … with 4,024 more rows, and abbreviated variable name ¹log2FoldChange
#>
## DESEq result data frame DEG assigned, look at the columns 'signif' and 'regul'
print(res$dsr_tibble_deg)
#> $treatment1_VS_control
#> # A tibble: 4,034 × 9
#> gene_id baseM…¹ log2F…² lfcSE stat pvalue padj signif regul
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 ENSG00000154342… 7.16e2 -0.601 0.0932 -6.45 1.09e-10 6.58e-9 p-val… other
#> 2 ENSG00000196415… 4.24e3 -0.0714 0.0748 -0.954 3.40e- 1 5.52e-1 NS other
#> 3 ENSG00000173020… 6.72e2 -0.256 0.114 -2.25 2.47e- 2 9.02e-2 p-val… other
#> 4 ENSG00000140598… 8.07e1 -0.176 0.204 -0.865 3.87e- 1 5.93e-1 NS other
#> 5 ENSG00000139263… 1.28e3 -0.138 0.179 -0.771 4.41e- 1 6.38e-1 NS other
#> 6 ENSG00000165898… 4.07e3 -0.359 0.181 -1.99 4.69e- 2 1.44e-1 p-val… other
#> 7 ENSG00000179454… 1.34e4 -0.0627 0.0989 -0.634 5.26e- 1 7.11e-1 NS other
#> 8 ENSG00000188868… 5.29e2 -0.326 0.111 -2.93 3.36e- 3 2.01e-2 p-val… other
#> 9 ENSG00000249158… 2.38e0 -2.21 1.40 -1.58 1.14e- 1 NA log2FC other
#> 10 ENSG00000110852… 7.51e0 -0.698 0.672 -1.04 2.99e- 1 NA NS other
#> # … with 4,024 more rows, and abbreviated variable names ¹baseMean,
#> # ²log2FoldChange
#>
#> $treatment2_VS_control
#> # A tibble: 4,034 × 9
#> gene_id baseM…¹ log2Fo…² lfcSE stat pvalue padj signif regul
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 ENSG000001543… 7.16e2 0.557 0.0896 6.22 4.98e-10 1.56e- 9 p-val… other
#> 2 ENSG000001964… 4.24e3 0.729 0.0744 9.79 1.22e-22 8.13e-22 p-val… other
#> 3 ENSG000001730… 6.72e2 -0.542 0.116 -4.69 2.77e- 6 6.30e- 6 p-val… other
#> 4 ENSG000001405… 8.07e1 0.0145 0.205 0.0710 9.43e- 1 9.52e- 1 NS other
#> 5 ENSG000001392… 1.28e3 -1.18 0.180 -6.56 5.47e-11 1.84e-10 p-val… Down
#> 6 ENSG000001658… 4.07e3 0.193 0.181 1.07 2.86e- 1 3.32e- 1 NS other
#> 7 ENSG000001794… 1.34e4 0.0257 0.0990 0.260 7.95e- 1 8.21e- 1 NS other
#> 8 ENSG000001888… 5.29e2 0.00983 0.111 0.0890 9.29e- 1 9.40e- 1 NS other
#> 9 ENSG000002491… 2.38e0 -1.31 1.31 -1.00 3.17e- 1 3.65e- 1 log2FC other
#> 10 ENSG000001108… 7.51e0 -1.39 0.731 -1.90 5.69e- 2 7.65e- 2 log2FC other
#> # … with 4,024 more rows, and abbreviated variable names ¹baseMean,
#> # ²log2FoldChange
#>
## DEG summary
print(res$deg_summmary)
#> $treatment1_VS_control
#> # A tibble: 3 × 2
#> regul n
#> <chr> <int>
#> 1 Down 28
#> 2 Up 13
#> 3 other 3993
#>
#> $treatment2_VS_control
#> # A tibble: 3 × 2
#> regul n
#> <chr> <int>
#> 1 Down 502
#> 2 Up 333
#> 3 other 3199
#>