R/blast_result_parser.R
add_taxonomy_columns.Rd
Given a tbl with a column of valid NCBI protein accessions, this function assigns the requested NCBI taxonomy level to each accession.
add_taxonomy_columns(
  tbl,
  ncbi_accession_colname = "ncbi_accession",
  ncbi_acc_key = NULL,
  taxonomy_level = "kingdom",
  map_superkindom = FALSE,
  batch_size = 20
)
tbl | an object of class tbl |
---|---|
ncbi_accession_colname | a string (default: "ncbi_accession") giving the name of the column that holds the NCBI accessions. |
ncbi_acc_key | user specific ENTREZ api key. Get one via |
taxonomy_level | a string (default: "kingdom") giving the level of NCBI taxonomy to assign to each NCBI protein accession. The input must be one of the following
|
map_superkindom | logical (default FALSE). If TRUE, assign the superkingdom when no kingdom is found. Applies only when taxonomy_level == "kingdom". |
batch_size | the number of queries to submit at a time (default: 20). |
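To make the `batch_size` argument concrete, here is a minimal sketch of how a vector of accessions could be cut into batches of at most 20 before being submitted. The helper `make_batches()` and the accession strings are hypothetical; this page does not show how the package batches queries internally.

```r
# Hypothetical helper (not taken from the phyloR source): split a vector of
# accessions into consecutive batches of at most `batch_size` elements.
make_batches <- function(accessions, batch_size = 20) {
  stopifnot(batch_size >= 1)
  if (length(accessions) == 0) {
    return(list())
  }
  # Elements 1..batch_size get id 1, the next batch_size get id 2, and so on.
  batch_id <- ceiling(seq_along(accessions) / batch_size)
  unname(split(accessions, batch_id))
}

accessions <- paste0("ACC", seq_len(45))  # made-up accession strings
batches <- make_batches(accessions, batch_size = 20)
lengths(batches)  # 20 20 5
```

A smaller batch size means more requests, each carrying fewer accessions; a larger one means fewer, bigger requests.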
A tbl: the input with the requested taxonomy columns added.
This function assigns a specific level of NCBI taxonomy to each NCBI (protein) accession. It requires a tibble with at least one column of NCBI protein accessions. The taxonomy columns are added to the input tibble, and the original columns are kept as they were. Internally, the function first finds the NCBI taxonomy id for each accession and then maps that id to the requested taxonomy level. Finding the taxonomy ids may take time, depending on the number of input accessions. If the input tibble already contains a taxonomy id column ('taxid'), whether from an earlier run or supplied up front, the function skips the id lookup and maps the taxonomy level directly from the given ids. To map taxonomy levels for a large number of accessions, consider the parallel processing approach shown in the example.
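The two-step flow described above (accession to taxonomy id, then taxonomy id to rank) can be sketched with small in-memory lookup tables standing in for the NCBI queries. Everything here is an assumption for illustration: `add_rank_sketch()`, the lookup tables, and all accession and taxonomy values are made up, and the way the fallback value is reported may differ from what the package does.

```r
# Mock lookup tables standing in for the NCBI queries.
# All values are placeholders, not real NCBI records.
acc_to_taxid <- c(ACC_A = "101", ACC_B = "102", ACC_C = "101")
taxid_to_rank <- data.frame(
  taxid        = c("101", "102"),
  kingdom      = c("kingdom_x", NA),
  superkingdom = c("superkingdom_x", "superkingdom_y"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the real function, for illustration only.
add_rank_sketch <- function(tbl, acc_col, rank = "kingdom", fallback = NULL) {
  # Step 1: find taxonomy ids, unless the caller already supplied them.
  if (!"taxid" %in% names(tbl)) {
    tbl$taxid <- unname(acc_to_taxid[tbl[[acc_col]]])
  }
  # Step 2: map each taxonomy id to the requested rank.
  idx <- match(tbl$taxid, taxid_to_rank$taxid)
  value <- taxid_to_rank[[rank]][idx]
  # Optional fallback, similar in spirit to map_superkindom.
  if (!is.null(fallback)) {
    is_missing <- is.na(value)
    value[is_missing] <- taxid_to_rank[[fallback]][idx[is_missing]]
  }
  tbl[[rank]] <- value  # original columns are left untouched
  tbl
}

hits <- data.frame(
  subject_acc_ver = c("ACC_A", "ACC_B", "ACC_C"),
  evalue = c(1e-10, 1e-5, 1e-8),
  stringsAsFactors = FALSE
)

# First run: no 'taxid' column yet, so both steps are performed.
first_run <- add_rank_sketch(hits, "subject_acc_ver", fallback = "superkingdom")
# Second run: 'taxid' is already present, so step 1 is skipped.
second_run <- add_rank_sketch(first_run, "subject_acc_ver", rank = "superkingdom")
```

Because `first_run` carries the `taxid` column, the second call goes straight to the rank mapping, which is the time saving the paragraph above describes.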
if (FALSE) {
  library(magrittr)  # provides %>%

  f <- system.file("extdata", "blast_output_01.txt", package = "phyloR")
  d <- readr::read_delim(f, delim = "\t", col_names = FALSE, comment = "#")
  colnames(d) <- phyloR::get_blast_outformat_7_colnames()

  ## add kingdom
  with_kingdom <- d %>%
    dplyr::slice(1:50) %>%
    add_taxonomy_columns(ncbi_accession_colname = "subject_acc_ver")

  ## add species
  with_kingdom_and_species <- with_kingdom %>%
    add_taxonomy_columns(
      ncbi_accession_colname = "subject_acc_ver",
      taxonomy_level = "species"
    )
  dplyr::glimpse(with_kingdom_and_species)

  #------------------------------------
  ## using parallel processing approach
  library(furrr)
  num_of_splits <- 10
  d <- d %>% dplyr::slice(1:100)

  ## deal rows round-robin into num_of_splits chunks
  split_vec <- rep(1:num_of_splits, length.out = nrow(d))
  qq_split <- d %>%
    dplyr::mutate(split_vec = split_vec) %>%
    dplyr::group_by(split_vec) %>%
    dplyr::group_split()

  ## "multiprocess" is deprecated in recent versions of future
  future::plan("multisession")
  out <- qq_split[1:num_of_splits] %>%
    future_map(
      ~ phyloR::add_taxonomy_columns(
        tbl = ..1,
        taxonomy_level = "species",
        map_superkindom = FALSE,
        ncbi_accession_colname = "subject_acc_ver",
        batch_size = 20,
        ncbi_acc_key = "64c65ab9c52e0312bbcf4c32d3056cbcaa09"  # replace with your own key
      ),
      .progress = TRUE
    ) %>%
    dplyr::bind_rows()
}
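One detail of the parallel example above is worth seeing on toy data: rows are dealt round-robin into chunks and re-bound chunk by chunk, so the combined result is ordered by chunk, not by original row. The sketch below uses base R only and a made-up data frame; the `row_id` column is an assumption added here so the original order can be restored.

```r
# Toy data; 'row_id' records the original row order (an addition for this sketch).
d <- data.frame(
  row_id = 1:7,
  acc = paste0("ACC", 1:7),
  stringsAsFactors = FALSE
)

num_of_splits <- 3
# Round-robin assignment: 1, 2, 3, 1, 2, 3, 1
split_vec <- rep(seq_len(num_of_splits), length.out = nrow(d))
parts <- split(d, split_vec)

# Chunk sizes differ by at most one row.
chunk_sizes <- vapply(parts, nrow, integer(1))

# Re-binding chunk by chunk changes the row order ...
combined <- do.call(rbind, unname(parts))
combined$row_id  # 1 4 7 2 5 3 6

# ... which can be undone by sorting on the recorded row id.
restored <- combined[order(combined$row_id), ]
```

If downstream code depends on the original row order, carrying an id column like this through the parallel step is one simple way to recover it.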