Run differential expression (DE) analysis for several comparisosns at a time. Internally it uses DESeq2::DESeq() to perform DE analysis.

DESeq2 is a popular method to perform DE analysis for RNA-seq data. Pre and post DESeq run, however, involves several data wrangling steps. For example, prior to run DESeq genes may need to be filtered out based on number of reads mapped to genes across replicates. Similarly, post DESeq run user needs to set cutoffs (log2fc, pvalue and padj) to define up and down regulated genes. For large RNA-seq experiments involving several DE comparison such as sub-setting and different cutoffs creates messy and less readable code. This function helps to make things bit tidy and increase the code readability. Besides that, output of this function can subsequently used for several other functions of this package.

Usage

run_deseq_analysis(
  counts,
  column_geneid,
  sample_info,
  group_numerator,
  group_denominator,
  delim = "\t",
  comment_char = "#",
  min_counts = 10,
  min_replicates = 1,
  cutoff_lfc = 1,
  cutoff_pval = 0.05,
  cutoff_padj = 0.01,
  regul_based_upon = 1,
  print_rows_all_zero = FALSE
)

Arguments

counts

a character string of the path to a count file or an object of dataframe having raw counts for each gene. See details below to know more about format of the count file and count dataframe.

column_geneid

a character string denoting a column of geneid in counts

sample_info

a character string denoting a name of sample information file or a dataframe. A file or a dataframe both must have at least two columns WITHOUT column names. First column denotes to samples names and second column denotes group name for each sample in first column. For e.g.


control_rep_1	Control
control_rep_2	Control
control_rep_3	Control
treat_1_rep_1	Treatment_1
treat_1_rep_2	Treatment_1
treat_1_rep_3	Treatment_1
treat_2_rep_1	Treatment_2
treat_2_rep_2	Treatment_2
treat_2_rep_3	Treatment_2

group_numerator

a character vector denoting sample groups to use in numerator to calculate fold change.

group_denominator

a character vector denoting sample groups to use in denominator to calculate fold change.

delim

a character denoting deliminator for count file. Only valid if count is a file path.

comment_char

a character denoting comments line in count file. Only valid if count is a file path.

min_counts

a numeric value, default 10, denoting minimum counts for a gene to be used to consider a gene for differential expression analysis.

min_replicates

a numeric value, default 1, denoting minimum samples within a sample group must have minimum_counts. Value provided must not be higher than number of samples in each group. For example for given values min_replicates = 2 and minimum_counts = 10 the genes which have minimum counts 10 in at least 2 samples within a groups will be accounted for DEG. Rest will be filtered out prior to run DESeq.

cutoff_lfc

minimal threshold for log2fold change, default 1 (2 fold).

cutoff_pval

minimal threshold for pvalue, default 0.05. P-value threshold will be applied only when regul_based_upon is either 1 or 3.

cutoff_padj

minimal threshold for Padj, default 0.01. Padj threshold will be applied only when regul_based_upon is either 2 or 3.

regul_based_upon

one of the numeric choices 1, 2, or 3.

if 1 ...

Up : log2fc >= cutoff_lfc & pvalue <= cutoff_pval
Down : log2fc <= (-1) * cutoff_lfc & pvalue <= cutoff_pval
Other : remaining genes

if 2 ...

Up : log2fc >= cutoff_lfc & padj <= cutoff_padj
Down : log2fc <= (-1) * cutoff_lfc & padj <= cutoff_padj
Other : remaining genes

if 3 ...

Up : log2fc >= cutoff_lfc & pvalue <= cutoff_pval & padj <= cutoff_padj
Down : log2fc <= (-1) * cutoff_lfc pvalue <= cutoff_pval & padj <= cutoff_padj
Other : remaining genes

print_rows_all_zero

logical, default FALSE, denoting whether to print genes with value 0 in all columns.

Value

Return object is a dataframe (tibble) having each row denoting a unique differential comparison. There are total 8 columns as explained below.

de_comparisons : It stores the name of differential comparison for each row.
numerator : It stores name of samples which were used as numerator for the differential comparison in each row.
denominator : It stores name of samples which were used as denominator for the differential comparison in each row.
norm_counts : This is a named-list column. Each row in this column is a list of two containing normalised genes expression values in a dataframe for the samples - numerator and denominator. The first column of the dataframe is gene_id and subsequent columns are gene expression values in replicates of corresponding samples. This normalised gene expression values are obtained using counts slot of a DESeqDataSet object. e.g.: counts(dds,normalized=TRUE)
dsr : This is a named-list column stores an object of class DESeq2::DESeqResults() for the differential comparison in each row.
dsr_tibble: This is a named-listcolumn stores and an output of DESeq2::DESeqResults() in the dataframe format for the differential comparison in each row.
dsr_tibble_deg: The data in this column is same as in the column dsr_tibble except it contains two extra columns signif and regul. Values in the signif specifies statistical and fold change significance of the gene while values in the regul denotes whether the gene is up or down regulated.
deg_summmary : This is a named-list column. Each element of the list is a dataframe summarizing number of differential expressed gene for the differential comparison for each row.

Details

: For the argument count user can either provide a character string denoting a file or an object of dataframe. In each case required format is explained below.

Count file: Count file is a table of row read counts usually derived from .bam file for different genomic features (e.g. genes, transcripts etc.). Data in the count file must be in a tabular format with a valid column deliminator (e.g., tab, comma etc.). First row and first column will be considered as column names and row names respectively. Values for column names and row names are usually character string or combination of character, numbers and special characters such as _, or .. Both row names and column names must have unique values.
Count dataframe: Count data in a dataframe format having same requirement of row names and column names explained for count file.

Examples


count_file <- system.file("extdata","toy_counts.txt" , package = "parcutils")
count_data <- readr::read_delim(count_file, delim = "\t",show_col_types = FALSE)

sample_info <- count_data %>% colnames() %>% .[-1]  %>%
 tibble::tibble(samples = . , groups = rep(c("control" ,"treatment1" , "treatment2"), each = 3) )


res <- run_deseq_analysis(counts = count_data ,
                         sample_info = sample_info,
                         column_geneid = "gene_id" ,
                         cutoff_lfc = 1 ,
                         cutoff_pval = 0.05,
                         group_numerator = c("treatment1", "treatment2") ,
                         group_denominator = c("control"))
#> ℹ Running DESeq2 ...
#> converting counts to integer mode
#> Warning: some variables in design formula are characters, converting to factors
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
#> ✔ Done.
#> ℹ Summarizing DEG ...
#> ✔ Done.

res
#> ┌────────────────────────────┐
#> │                            │
#> │   Summary of DE analysis   │
#> │                            │
#> └────────────────────────────┘
#> 
#> Total number of comparisons: 2
#> Total number of genes used for DE analysis: 4034
#> 
#> DE gene counts by comparison...
#> 
#> treatment1_VS_control 
#> • number of up genes   : 13.
#> • number of down genes : 28.
#> ──────────────────────────────
#> 
#> treatment2_VS_control 
#> • number of up genes   : 333.
#> • number of down genes : 502.
#> ──────────────────────────────
#> 

## all comparisons

print(res$de_comparisons)
#> [1] "treatment1_VS_control" "treatment2_VS_control"

## DESEq result object(s)

print(res$dsr)
#> $treatment1_VS_control
#> log2 fold change (MLE): sample_groups treatment1 vs control 
#> Wald test p-value: sample groups treatment1 vs control 
#> DataFrame with 4034 rows and 6 columns
#>                           baseMean log2FoldChange     lfcSE      stat
#>                          <numeric>      <numeric> <numeric> <numeric>
#> ENSG00000154342:WNT3A     715.7962     -0.6014175 0.0931795 -6.454399
#> ENSG00000196415:PRTN3    4242.8862     -0.0713742 0.0748438 -0.953643
#> ENSG00000173020:GRK2      671.9682     -0.2560099 0.1139957 -2.245787
#> ENSG00000140598:EFL1       80.7054     -0.1764337 0.2040219 -0.864778
#> ENSG00000139263:LRIG3    1279.0478     -0.1378660 0.1788328 -0.770921
#> ...                            ...            ...       ...       ...
#> ENSG00000198046:ZNF667     99.3996     -0.3200579 0.1852360 -1.727839
#> ENSG00000231051:USP17L28   78.8482     -0.0354036 0.2034285 -0.174035
#> ENSG00000188783:PRELP    1106.4538      0.1615409 0.0828499  1.949802
#> ENSG00000213822:CEACAM18  834.6362     -0.2330325 0.0727700 -3.202317
#> ENSG00000167588:GPD1     2225.9670      0.3957288 0.0767754  5.154371
#>                               pvalue        padj
#>                            <numeric>   <numeric>
#> ENSG00000154342:WNT3A    1.08649e-10 6.58175e-09
#> ENSG00000196415:PRTN3    3.40264e-01 5.52032e-01
#> ENSG00000173020:GRK2     2.47177e-02 9.02358e-02
#> ENSG00000140598:EFL1     3.87161e-01 5.92821e-01
#> ENSG00000139263:LRIG3    4.40754e-01 6.38483e-01
#> ...                              ...         ...
#> ENSG00000198046:ZNF667   8.40172e-02 2.18175e-01
#> ENSG00000231051:USP17L28 8.61838e-01 9.35689e-01
#> ENSG00000188783:PRELP    5.11997e-02 1.53283e-01
#> ENSG00000213822:CEACAM18 1.36327e-03 9.68022e-03
#> ENSG00000167588:GPD1     2.54484e-07 7.41830e-06
#> 
#> $treatment2_VS_control
#> log2 fold change (MLE): sample_groups treatment2 vs control 
#> Wald test p-value: sample groups treatment2 vs control 
#> DataFrame with 4034 rows and 6 columns
#>                           baseMean log2FoldChange     lfcSE       stat
#>                          <numeric>      <numeric> <numeric>  <numeric>
#> ENSG00000154342:WNT3A     715.7962      0.5571318 0.0895747   6.219743
#> ENSG00000196415:PRTN3    4242.8862      0.7288109 0.0744285   9.792097
#> ENSG00000173020:GRK2      671.9682     -0.5424934 0.1157404  -4.687156
#> ENSG00000140598:EFL1       80.7054      0.0145278 0.2046425   0.070991
#> ENSG00000139263:LRIG3    1279.0478     -1.1836335 0.1804990  -6.557562
#> ...                            ...            ...       ...        ...
#> ENSG00000198046:ZNF667     99.3996    -0.02731340 0.1844465 -0.1480831
#> ENSG00000231051:USP17L28   78.8482    -0.58915168 0.2158321 -2.7296765
#> ENSG00000188783:PRELP    1106.4538     0.27690150 0.0831212  3.3312999
#> ENSG00000213822:CEACAM18  834.6362    -0.54878553 0.0750902 -7.3083473
#> ENSG00000167588:GPD1     2225.9670    -0.00404444 0.0776078 -0.0521138
#>                               pvalue        padj
#>                            <numeric>   <numeric>
#> ENSG00000154342:WNT3A    4.97971e-10 1.55722e-09
#> ENSG00000196415:PRTN3    1.21745e-22 8.13110e-22
#> ENSG00000173020:GRK2     2.77028e-06 6.29949e-06
#> ENSG00000140598:EFL1     9.43405e-01 9.51662e-01
#> ENSG00000139263:LRIG3    5.46947e-11 1.84019e-10
#> ...                              ...         ...
#> ENSG00000198046:ZNF667   8.82277e-01 8.98311e-01
#> ENSG00000231051:USP17L28 6.33965e-03 9.94329e-03
#> ENSG00000188783:PRELP    8.64414e-04 1.51348e-03
#> ENSG00000213822:CEACAM18 2.70448e-13 1.06334e-12
#> ENSG00000167588:GPD1     9.58438e-01 9.64655e-01
#> 

## DESEq result data frame

print(res$dsr_tibble)
#> $treatment1_VS_control
#> # A tibble: 4,034 × 7
#>    gene_id                 baseMean log2FoldCh…¹  lfcSE   stat   pvalue     padj
#>    <chr>                      <dbl>        <dbl>  <dbl>  <dbl>    <dbl>    <dbl>
#>  1 ENSG00000154342:WNT3A     716.        -0.601  0.0932 -6.45  1.09e-10  6.58e-9
#>  2 ENSG00000196415:PRTN3    4243.        -0.0714 0.0748 -0.954 3.40e- 1  5.52e-1
#>  3 ENSG00000173020:GRK2      672.        -0.256  0.114  -2.25  2.47e- 2  9.02e-2
#>  4 ENSG00000140598:EFL1       80.7       -0.176  0.204  -0.865 3.87e- 1  5.93e-1
#>  5 ENSG00000139263:LRIG3    1279.        -0.138  0.179  -0.771 4.41e- 1  6.38e-1
#>  6 ENSG00000165898:ISCA2    4072.        -0.359  0.181  -1.99  4.69e- 2  1.44e-1
#>  7 ENSG00000179454:KLHL28  13435.        -0.0627 0.0989 -0.634 5.26e- 1  7.11e-1
#>  8 ENSG00000188868:ZNF563    529.        -0.326  0.111  -2.93  3.36e- 3  2.01e-2
#>  9 ENSG00000249158:PCDHA11     2.38      -2.21   1.40   -1.58  1.14e- 1 NA      
#> 10 ENSG00000110852:CLEC2B      7.51      -0.698  0.672  -1.04  2.99e- 1 NA      
#> # … with 4,024 more rows, and abbreviated variable name ¹log2FoldChange
#> 
#> $treatment2_VS_control
#> # A tibble: 4,034 × 7
#>    gene_id                 baseMean log2FoldC…¹  lfcSE    stat   pvalue     padj
#>    <chr>                      <dbl>       <dbl>  <dbl>   <dbl>    <dbl>    <dbl>
#>  1 ENSG00000154342:WNT3A     716.       0.557   0.0896  6.22   4.98e-10 1.56e- 9
#>  2 ENSG00000196415:PRTN3    4243.       0.729   0.0744  9.79   1.22e-22 8.13e-22
#>  3 ENSG00000173020:GRK2      672.      -0.542   0.116  -4.69   2.77e- 6 6.30e- 6
#>  4 ENSG00000140598:EFL1       80.7      0.0145  0.205   0.0710 9.43e- 1 9.52e- 1
#>  5 ENSG00000139263:LRIG3    1279.      -1.18    0.180  -6.56   5.47e-11 1.84e-10
#>  6 ENSG00000165898:ISCA2    4072.       0.193   0.181   1.07   2.86e- 1 3.32e- 1
#>  7 ENSG00000179454:KLHL28  13435.       0.0257  0.0990  0.260  7.95e- 1 8.21e- 1
#>  8 ENSG00000188868:ZNF563    529.       0.00983 0.111   0.0890 9.29e- 1 9.40e- 1
#>  9 ENSG00000249158:PCDHA11     2.38    -1.31    1.31   -1.00   3.17e- 1 3.65e- 1
#> 10 ENSG00000110852:CLEC2B      7.51    -1.39    0.731  -1.90   5.69e- 2 7.65e- 2
#> # … with 4,024 more rows, and abbreviated variable name ¹log2FoldChange
#> 

## DESEq result data frame  DEG assigned, look at the columns 'signif' and 'regul'

print(res$dsr_tibble_deg)
#> $treatment1_VS_control
#> # A tibble: 4,034 × 9
#>    gene_id          baseM…¹ log2F…²  lfcSE   stat   pvalue     padj signif regul
#>    <chr>              <dbl>   <dbl>  <dbl>  <dbl>    <dbl>    <dbl> <chr>  <chr>
#>  1 ENSG00000154342…  7.16e2 -0.601  0.0932 -6.45  1.09e-10  6.58e-9 p-val… other
#>  2 ENSG00000196415…  4.24e3 -0.0714 0.0748 -0.954 3.40e- 1  5.52e-1 NS     other
#>  3 ENSG00000173020…  6.72e2 -0.256  0.114  -2.25  2.47e- 2  9.02e-2 p-val… other
#>  4 ENSG00000140598…  8.07e1 -0.176  0.204  -0.865 3.87e- 1  5.93e-1 NS     other
#>  5 ENSG00000139263…  1.28e3 -0.138  0.179  -0.771 4.41e- 1  6.38e-1 NS     other
#>  6 ENSG00000165898…  4.07e3 -0.359  0.181  -1.99  4.69e- 2  1.44e-1 p-val… other
#>  7 ENSG00000179454…  1.34e4 -0.0627 0.0989 -0.634 5.26e- 1  7.11e-1 NS     other
#>  8 ENSG00000188868…  5.29e2 -0.326  0.111  -2.93  3.36e- 3  2.01e-2 p-val… other
#>  9 ENSG00000249158…  2.38e0 -2.21   1.40   -1.58  1.14e- 1 NA       log2FC other
#> 10 ENSG00000110852…  7.51e0 -0.698  0.672  -1.04  2.99e- 1 NA       NS     other
#> # … with 4,024 more rows, and abbreviated variable names ¹baseMean,
#> #   ²log2FoldChange
#> 
#> $treatment2_VS_control
#> # A tibble: 4,034 × 9
#>    gene_id        baseM…¹ log2Fo…²  lfcSE    stat   pvalue     padj signif regul
#>    <chr>            <dbl>    <dbl>  <dbl>   <dbl>    <dbl>    <dbl> <chr>  <chr>
#>  1 ENSG000001543…  7.16e2  0.557   0.0896  6.22   4.98e-10 1.56e- 9 p-val… other
#>  2 ENSG000001964…  4.24e3  0.729   0.0744  9.79   1.22e-22 8.13e-22 p-val… other
#>  3 ENSG000001730…  6.72e2 -0.542   0.116  -4.69   2.77e- 6 6.30e- 6 p-val… other
#>  4 ENSG000001405…  8.07e1  0.0145  0.205   0.0710 9.43e- 1 9.52e- 1 NS     other
#>  5 ENSG000001392…  1.28e3 -1.18    0.180  -6.56   5.47e-11 1.84e-10 p-val… Down 
#>  6 ENSG000001658…  4.07e3  0.193   0.181   1.07   2.86e- 1 3.32e- 1 NS     other
#>  7 ENSG000001794…  1.34e4  0.0257  0.0990  0.260  7.95e- 1 8.21e- 1 NS     other
#>  8 ENSG000001888…  5.29e2  0.00983 0.111   0.0890 9.29e- 1 9.40e- 1 NS     other
#>  9 ENSG000002491…  2.38e0 -1.31    1.31   -1.00   3.17e- 1 3.65e- 1 log2FC other
#> 10 ENSG000001108…  7.51e0 -1.39    0.731  -1.90   5.69e- 2 7.65e- 2 log2FC other
#> # … with 4,024 more rows, and abbreviated variable names ¹baseMean,
#> #   ²log2FoldChange
#> 

## DEG summary

print(res$deg_summmary)
#> $treatment1_VS_control
#> # A tibble: 3 × 2
#>   regul     n
#>   <chr> <int>
#> 1 Down     28
#> 2 Up       13
#> 3 other  3993
#> 
#> $treatment2_VS_control
#> # A tibble: 3 × 2
#>   regul     n
#>   <chr> <int>
#> 1 Down    502
#> 2 Up      333
#> 3 other  3199
#>