refseqR
workshop

Jose V. Die

2024-11-27

About me

  • 2009 PhD in Plant Genetics (UCO)


  • 2012 - 2017 US Department of Agriculture
  • 2012 Dept. Genetics (UCO)
  • 2018 Visiting Bioinformaticians Program (NIH, NCBI)


  • Broad interests: the intersection of genomics and data science, molecular breeding, and R.

The lecture that changed Biology

“On protein synthesis”, delivered by Francis Crick at University College London for the Society for Experimental Biology (1957).

The Central Dogma of Molecular Biology

Our implementation: refseqR

  • Builds upon rentrez

  • Imports IRanges, Biostrings

  • Input can be any character string (passed as the first argument)

  • Returns output (e.g. DNAStringSet, AAStringSet and IRanges objects) that interoperates and integrates with Bioconductor

The refseqR package

Dependencies

library(refseqR)

GeneID accessions

refseq_fromGene(GeneID = "353", sequence = "transcript")
[1] "NM_001030018" "NM_000485"   


refseq_fromGene(GeneID = "353", sequence = "protein")
[1] "NP_001025189" "NP_000476"   


refseq_description(id = "353")
[1] "adenine phosphoribosyltransferase"

mRNA accessions

refseq_GeneID(accession = "NM_001030018", db = "nuccore")
[1] "353"


refseq_description(id = "NM_001030018")
[1] "adenine phosphoribosyltransferase"


refseq_CDSseq(transcript = "NM_001030018")
DNAStringSet object of length 1:
    width seq                                               names               
[1]   405 ATGGCCGACTCCGAGCTGCAGCT...ATCTGCTGGCCACTGGTGTATGA NM_001030018.2

refseq_CDScoords(transcript = "NM_001030018")
IRanges object with 1 range and 0 metadata columns:
                     start       end     width
                 <integer> <integer> <integer>
  NM_001030018.2        30       434       405


refseq_mRNAfeat(transcript = "NM_001030018", feat = c("caption", "moltype", "sourcedb", "slen"))
# A tibble: 1 × 4
  caption      moltype sourcedb slen 
  <chr>        <chr>   <chr>    <chr>
1 NM_001030018 rna     refseq   667  


refseq_RNA2protein(transcript = "NM_001030018")
[1] "NP_001025189"
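
The outputs above can be cross-checked with a little arithmetic. The following base-R sketch copies the CDS coordinates and the transcript length (slen) from the output shown for NM_001030018. It assumes 1-based, inclusive coordinates and a CDS that ends in a stop codon; the toy sequence and all variable names are illustrative and not part of refseqR.

```r
# Values copied from the outputs shown above for NM_001030018
cds_start <- 30L
cds_end   <- 434L
mrna_len  <- 667L   # 'slen' from refseq_mRNAfeat(), converted from character

# Inclusive, 1-based coordinates: width = end - start + 1
cds_width <- cds_end - cds_start + 1L

# Assumption: the CDS is a whole number of codons and the last one is a stop
n_codons <- cds_width %/% 3L
aa_len   <- n_codons - 1L

# Untranslated regions on either side of the CDS
utr5_len <- cds_start - 1L
utr3_len <- mrna_len - cds_end

# Toy illustration (made-up sequence) of slicing a CDS out of an mRNA string
toy_mrna <- "GGATGAAATAGCC"
toy_cds  <- substr(toy_mrna, 3, 11)

cat("CDS width:", cds_width, " codons:", n_codons, " residues:", aa_len, "\n")
cat("5' UTR:", utr5_len, " 3' UTR:", utr3_len, "\n")
```

The 134 residues obtained this way agree with the length that refseq_AAlen() reports for NP_001025189.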

Protein accessions

refseq_description(id = "NP_001025189")
[1] "adenine phosphoribosyltransferase"


refseq_AAlen(protein = "NP_001025189")
[1] 134


refseq_AAmol_wt(protein = "NP_001025189")
[1] 14426

refseq_AAseq(accession = "NP_001025189")
AAStringSet object of length 1:
    width seq                                               names               
[1]   134 MADSELQLVEQRIRSFPDFPTPG...IQKDALEPGQRVVVVDDLLATGV NP_001025189


refseq_protein2RNA(protein = "NP_001025189")
[1] "NM_001030018"


refseq_GeneID(accession = "NP_001025189", db = "protein")
[1] "353"

Vectorization

transcript = c("NM_001030018", "NM_001388492", "NM_000492")
feat = c("caption", "moltype", "slen", "title")
refseq_mRNAfeat(transcript, feat)
# A tibble: 3 × 4
  caption      moltype slen  title                                              
  <chr>        <chr>   <chr> <chr>                                              
1 NM_001030018 rna     667   Homo sapiens adenine phosphoribosyltransferase (AP…
2 NM_001388492 rna     13472 Homo sapiens huntingtin (HTT), transcript variant …
3 NM_000492    rna     6070  Homo sapiens CF transmembrane conductance regulato…
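
Note that slen comes back as a character column (<chr>). A minimal sketch, assuming you want to sort or compare transcripts by length: convert the column to integer first. The data frame below is rebuilt by hand from the output above, with a plain data.frame standing in for the tibble, so it runs without network access.

```r
# Rebuilt by hand from the refseq_mRNAfeat() output shown above
feat <- data.frame(
  caption = c("NM_001030018", "NM_001388492", "NM_000492"),
  slen    = c("667", "13472", "6070"),
  stringsAsFactors = FALSE
)

# Character values compare as text, not as numbers, so convert before sorting
feat$slen <- as.integer(feat$slen)

# Longest transcript first
feat_sorted <- feat[order(feat$slen, decreasing = TRUE), ]
print(feat_sorted)
```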

Get the protein IDs from a set of transcript accessions:

transcript = c("NM_001030018", "NM_001388492", "NM_000492")
sapply(transcript, function(x) refseq_RNA2protein(x), USE.NAMES = FALSE)
[1] "NP_001025189" "NP_001375421" "NP_000483"   
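
One possible variation on the call above is vapply(), which checks that every element returns exactly one character string and, by default, keeps the input accessions as names. The sketch below uses a hypothetical offline stand-in, rna2protein_stub(), whose mapping is copied from the output above; with network access, refseq_RNA2protein could be passed in its place.

```r
# Hypothetical offline stand-in for refseq_RNA2protein(); the mapping is
# copied from the output shown above so the sketch runs without network access
lookup <- c(
  NM_001030018 = "NP_001025189",
  NM_001388492 = "NP_001375421",
  NM_000492    = "NP_000483"
)
rna2protein_stub <- function(x) unname(lookup[x])

transcript <- c("NM_001030018", "NM_001388492", "NM_000492")

# FUN.VALUE = character(1) makes vapply() fail loudly if any call returns
# something other than a single string
protein <- vapply(transcript, rna2protein_stub, FUN.VALUE = character(1))
print(protein)
```

Keeping the names gives a transcript-to-protein lookup, e.g. protein[["NM_000492"]].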

Monte Carlo simulations

Fetch the amino acid (AA) sequences for a set of 100 protein IDs:

refseq_AAseq()


Mean time over 5,000 Monte Carlo runs = 67.40673 sec
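
The slides do not show how this timing was obtained. One way such a benchmark could be set up, as a rough sketch only: draw 100 IDs at random, time a single fetch with system.time(), repeat, and average the elapsed times. Everything here is an assumption: fetch_stub() is a hypothetical stand-in for refseq_AAseq(), the ID pool is made of placeholder strings rather than real accessions, and the number of runs is kept small.

```r
set.seed(1)

# Hypothetical stand-in for refseq_AAseq(); does trivial local work only
fetch_stub <- function(ids) vapply(ids, nchar, integer(1))

# Placeholder IDs (not real accessions) to sample from
id_pool <- paste0("id_", seq_len(1000))

# Time one run: sample n_ids IDs without replacement and fetch them
time_one_run <- function(n_ids = 100) {
  ids <- sample(id_pool, n_ids)
  unname(system.time(fetch_stub(ids))["elapsed"])
}

n_runs  <- 50   # far fewer than the 5,000 runs reported on the slide
elapsed <- replicate(n_runs, time_one_run())
mean_elapsed <- mean(elapsed)
cat("mean elapsed time over", n_runs, "runs:", mean_elapsed, "sec\n")
```

With a real network call in place of the stub, the result would also depend on connection speed and server load.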

Functions refseqR

  • to apply on GeneID accessions:

    refseq_fromGene()
    refseq_description()

  • to apply on transcript accessions:

    refseq_GeneID()
    refseq_description()
    refseq_CDScoords()
    refseq_CDSseq()
    refseq_mRNAfeat()
    refseq_RNA2protein()

  • to apply on protein accessions:

    refseq_GeneID()
    refseq_description()
    refseq_AAmol_wt()
    refseq_AAlen()
    refseq_AAseq()
    refseq_protein2RNA()
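
Because the functions are grouped by accession type, it can help to check what kind of identifier you are holding before choosing one. The helper below, accession_type(), is hypothetical and not part of refseqR. It is a sketch that relies on the RefSeq prefix convention (NM_/XM_ for mRNA, NP_/XP_ for protein) and treats all-digit strings as GeneIDs.

```r
# Hypothetical helper (not part of refseqR): classify identifiers by pattern
accession_type <- function(x) {
  x <- sub("\\.[0-9]+$", "", x)   # drop a version suffix such as ".2"
  ifelse(grepl("^[0-9]+$", x), "GeneID",
  ifelse(grepl("^[NX]M_[0-9]+$", x), "transcript",
  ifelse(grepl("^[NX]P_[0-9]+$", x), "protein", NA_character_)))
}

ids   <- c("353", "NM_001030018.2", "NP_001025189", "not_an_id")
types <- accession_type(ids)
print(data.frame(id = ids, type = types))
```

Identifiers that match none of the patterns come back as NA, so they can be filtered out before any query is sent.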

Availability

Acknowledgements

AGR-114 Mejora Genética Vegetal