Add ncbi taxonomy levels to ncbi protein accession — add_taxonomy

Given a tbl with a column of valid NCBI protein accessions, this function assigns the requested NCBI taxonomy level to each accession.

add_taxonomy_columns(
  tbl,
  ncbi_accession_colname = "ncbi_accession",
  ncbi_acc_key = NULL,
  taxonomy_level = "kingdom",
  map_superkindom = FALSE,
  batch_size = 20
)

Arguments

tbl	an object of class tbl
ncbi_accession_colname	a string (default: "ncbi_accession") giving the name of the column that holds the NCBI accessions.
ncbi_acc_key	a user-specific Entrez API key (default: NULL). Get one via `taxize::use_entrez()`.
taxonomy_level	a string (default: "kingdom") indicating the level of NCBI taxonomy to assign to each NCBI protein accession. Must be one of: "superkingdom", "kingdom", "phylum", "subphylum", "class", "subclass", "infraclass", "cohort", "order", "suborder", "infraorder", "superfamily", "family", "subfamily", "genus", "species", "tribe", "no rank".
map_superkindom	logical (default: FALSE). If TRUE, assign the superkingdom when no kingdom is found. Used only when taxonomy_level == "kingdom".
batch_size	the number of queries to submit at a time (default: 20).

Value

a tbl: the input tbl with the requested taxonomy column(s) added and the original columns unchanged.

Details

This function assigns the requested level of NCBI taxonomy to each NCBI protein accession. It requires a tibble with at least one column of NCBI protein accessions. The taxonomy columns are added to the input tibble, and the original columns are kept as they were.

Internally, the function first finds the NCBI taxonomy id for each accession and then maps that id to the requested taxonomy level. Finding the taxonomy ids may take time, depending on the number of input accessions. If the input tibble already contains a taxonomy id column ('taxid'), whether supplied on the first run or carried over from a previous one, less time is spent finding taxonomy ids and the taxonomy level is assigned directly from the given ids.

To map taxonomy levels for a large number of accessions, consider the parallel processing approach shown in the example.
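
Because run time depends on the number of accessions, it can help to estimate how many batches a call would submit for a given batch_size. The sketch below assumes accessions are chunked into consecutive groups of at most batch_size; this illustrates the arithmetic only and is not necessarily how add_taxonomy_columns() batches its queries internally. The accession strings are made up.

```r
# Hypothetical accessions, for illustration only
accessions <- sprintf("ACC%03d.1", 1:50)
batch_size <- 20

# Assumption: consecutive chunks of at most batch_size elements
batches <- split(accessions, ceiling(seq_along(accessions) / batch_size))

length(batches)   # number of batches
lengths(batches)  # number of accessions in each batch
```

With 50 accessions and batch_size = 20, the last batch holds the remaining 10 accessions.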

Examples

if (FALSE) {
## read the example BLAST output (outfmt 7) shipped with the package
f <- system.file("extdata", "blast_output_01.txt", package = "phyloR")
d <- readr::read_delim(f, delim = "\t", col_names = FALSE, comment = "#")
colnames(d) <- phyloR::get_blast_outformat_7_colnames()

## add kingdom
with_kingdom <- d %>%
        dplyr::slice(1:50) %>%
        add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver")

## add species
with_kingdom_and_species <- with_kingdom %>%
        add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver",
                             taxonomy_level = "species")
dplyr::glimpse(with_kingdom_and_species)

#------------------------------------
## using a parallel processing approach

library(furrr)
num_of_splits <- 10
d <- d %>% dplyr::slice(1:100)
## assign rows to splits in round-robin order
split_vec <- rep(1:num_of_splits, length.out = nrow(d))
qq_split <- d %>%
        dplyr::mutate(split_vec = split_vec) %>%
        dplyr::group_by(split_vec) %>%
        dplyr::group_split()
## run the workers in background R sessions
future::plan(future::multisession)
out <- qq_split[1:num_of_splits] %>%
        future_map(~ phyloR::add_taxonomy_columns(
                           tbl = ..1,
                           taxonomy_level = "species",
                           map_superkindom = FALSE,
                           ncbi_accession_colname = "subject_acc_ver",
                           batch_size = 20,
                           ## API key read from the ENTREZ_KEY environment variable
                           ncbi_acc_key = Sys.getenv("ENTREZ_KEY")),
                   .progress = TRUE) %>%
        dplyr::bind_rows()
}
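
The round-robin split used in the parallel example can be checked on its own, without the package or network access. This is a minimal base R sketch; n_rows stands in for nrow(d), and row_groups is a name introduced here for illustration only.

```r
n_rows <- 100          # stand-in for nrow(d)
num_of_splits <- 10

# Recycle 1..num_of_splits along the rows: row i goes to split ((i - 1) %% 10) + 1
split_vec <- rep(1:num_of_splits, length.out = n_rows)

# Row indices that end up in each split
row_groups <- split(seq_len(n_rows), split_vec)

lengths(row_groups)    # rows per split
row_groups[[1]]        # rows assigned to the first split
```

Splitting in round-robin order keeps the groups the same size, or within one row of each other when the row count is not a multiple of num_of_splits, so the workers receive similar amounts of work.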