Genetic Structure as Generative Poetic Form

Part of the long poetic sequence I’m working on is a poem called ‘I’ (or maybe ‘I-I’), whose sections are each devoted to a theme related to a given gene, and structured based on the nucleotide sequence from that gene. In order to create these poetic patterns, we need actual genetic sequence data to which patterning algorithms can be applied.

Here, I’ll lay out the steps used to gather and create these patterns:

library(tidyverse);library(DT); library(tidytext)
library(GO.db);library(hgu133plus2probe);
library(hgu133plus2.db)
genome <- hgu133plus2probe

Step 1: Find Genes that Perform Functions

The genome is vast, and starting at random at a given nucleotide seems unnecessarily naive, especially with the incredible amount of work that has been done in annotating genes based on their functions. This allows one to find specific sequences of nulceotides that are associated with a given biological function (visual perception, for instance) and to use those for relevant portions of the poem.

To query to Gene Ontology (GO) database, I use the GO.db package¹, which queries the AmiGo 2 database. The code below defines a convenience function which takes a given search term string and returns a tibble with the related GOID and GOTERM fields, which can then be manually reviewed for interest. Once one or more terms are identified, these can be copied and manually added to the list in the next step which queries.

go_tbl <- toTable(GOTERM)[-c(1)]

search_go <- function(search_string){
  output <-
    go_tbl %>%
    dplyr::filter(str_detect(Term,search_string)) %>%
    dplyr::select(go_id,Term,Definition,Ontology) %>%
    dplyr::distinct()
    
  return(output)
}

Let’s define search terms for various topics of interest as regular expressions, which we can then pass to the database, and save these in a list called searches. Below are a handful of examples:

searches <- list()
searches$vision_search <- search_go(" vision|^vision|sight| visual|^visual|ocular")
searches$hearing_search <- search_go("hearing| ear |^ear | auditory|^auditory")
searches$skin_search <- search_go(" skin |^skin| melanoma |^touch | touch ")
searches$language_search <- search_go(" language |^langauge| speech |^speech")
searches$spine_search <- search_go(" spine |^spine")

We can then review the resulting tables to find the genes which are of greatest interest for a given section of the poem, like so:

searches$vision_search %>%
  datatable(
    rownames = F, filter = "top",
    caption = 'Summary of Gene Ontology Terms for Search',
    colnames = c('GO ID','GO Term','Definition','BP'),
    extensions = c('Responsive','Buttons'),
    options = list(
      pageLength = 5, lengthMenu = c(5, 15),
      dom = 'C<"clear">lfrtip', buttons = c('colvis')
    )
  )

Show entries

Search:

Summary of Gene Ontology Terms for Search
GO ID	GO Term	Definition	BP

GO:0001744	insect visual primordium formation	Establishment of the optic lobe placode. In Drosophila, for example, the placode appears in the dorsolateral region of the head in late stage 11 embryos and is the precursor to the larval visual system.	BP
GO:0001748	insect visual primordium development	The process whose specific outcome is the progression of the optic placode over time, from its formation to the mature structure. During embryonic stage 12 the placode starts to invaginate, forming a pouch. Cells that will form Bolwig's organ segregate from the ventral lip of this pouch, remaining in the head epidermis. The remainder of the invagination loses contact with the outer surface and becomes the optic lobe. An example of this process is found in Drosophila melanogaster.	BP
GO:0002074	extraocular skeletal muscle development	The process whose specific outcome is the progression of the extraocular skeletal muscle over time, from its formation to the mature structure. The extraocular muscle is derived from cranial mesoderm and controls eye movements. The muscle begins its development with the differentiation of the muscle cells and ends with the mature muscle. An example of this process is found in Mus musculus.	BP
GO:0007601	visual perception	The series of events required for an organism to receive a visual stimulus, convert it to a molecular signal, and recognize and characterize the signal. Visual stimuli are detected in the form of photons and are processed to form an image.	BP
GO:0007632	visual behavior	The behavior of an organism in response to a visual stimulus.	BP

Showing 1 to 5 of 15 entries

Previous1 2 3Next

Step 2: Subset Genome based on Functions of Interest

After reading through and doing some research on the gene ontology terms above, I select a specific term/function and then proceed to pull:

Relevant genes and annotation data from the human genome.²
Probe data which includes the actual nucleotide sequences for each gene.³

Below is a convenience function which takes a list of GOID strings as an argument and retrieves the related genes and functional annotation:

get_genes <- function(go_ids){
  
  output <-
    hgu133plus2.db %>%
    select(
      keytype = "GO", 
      columns = c(
        "GO","ONTOLOGY","EVIDENCE","SYMBOL","GENENAME","MAP","PATH","PROBEID"
      ), 
      keys = go_ids
    ) %>%
    mutate(
      EVIDENCE_TYPE = recode(
        EVIDENCE,
        `IMP` = "inferred from mutant phenotype",
        `IGI` = "inferred from genetic interaction",
        `IPI` = "inferred from physical interaction",
        `ISS` = "inferred from sequence similarity",
        `IDA` = "inferred from direct assay",
        `IEP` = "inferred from expression pattern",
        `IEA` = "inferred from electronic annotation",
        `TAS` = "traceable author statement",
        `NAS` = "non-traceable author statement",
        `ND` = "no biological data available",
        `IC` = "inferred by curator",
        `IBA` = "inferred from biological aspect of ancestor"
      )
    ) %>%
    mutate_if(
      .predicate = is.character,
      .funs = funs(as_factor)
    ) %>%
    inner_join(genome, by = c("PROBEID" = "Probe.Set.Name")) %>%
    inner_join(go_tbl[c(1:2)], by = c("GO" = "go_id")) %>%
    dplyr::select(GENENAME,SYMBOL,sequence,MAP,PATH,PROBEID,x,y,GO,GOTERM = Term,ONTOLOGY,EVIDENCE,EVIDENCE_TYPE) %>%
    distinct() 
  
  return(output)
  
}

We can then filter the output by the strength of evidence or any other variable to select the specific gene and position that we want to use.⁴

# Use named lists for the purpose of documentation
vision_genes <- 
  get_genes(
    go_ids = c(
      `visual perception` = "GO:0007601", 
      `detection of light` = "GO:0050908",
      `visual equilibrioception` = "GO:0051356"
    )
  )

hearing_genes <-
  get_genes(
    go_ids = c(
      `response to auditory stimulus` = "GO:0010996",
      `auditory behavior` = "GO:0031223",
      `ear morphogenesis` = "GO:0042471",
      `ear development` = "GO:0043583",
      `inner ear cell fate commitment` = "GO:0060120"
    )
  )

skin_genes <- 
  get_genes(
    go_ids = c(
      `establishment of skin barrier` = "GO:0061436", 
      `skin development` = "GO:0043588",
      `skin morphogenesis` = "GO:0043589"
    )
  )

Step 3: Select a Specific Gene

Out of all of the genes identified, we need to continue inward toward the needle in the haystack. This may include:

Filtering by evidence type
Determining which precise gene function is the best fit for the poem (e.g. for the vision_genes, these include visual perception, detection of light stimulus involved in visual perception…)

Based on the final selection of gene, we pass the name of the gene from the SYMBOL field (e.g. "RAX") to a convenience function which will return the poetic patterns available based on the gene’s nucleotide sequence and accompanying base pairs.

get_patterns <- function(gene_id){
  
  # Filter gene data based on selection
  gene_df <-
    hgu133plus2.db %>%
    select(
      keytype = "SYMBOL", 
      columns = c(
        "GO","ONTOLOGY","EVIDENCE","SYMBOL","GENENAME","MAP","PATH","PROBEID"
      ), 
      keys = gene_id
    ) %>%
    mutate(
      EVIDENCE_TYPE = recode(
        EVIDENCE,
        `IMP` = "inferred from mutant phenotype",
        `IGI` = "inferred from genetic interaction",
        `IPI` = "inferred from physical interaction",
        `ISS` = "inferred from sequence similarity",
        `IDA` = "inferred from direct assay",
        `IEP` = "inferred from expression pattern",
        `IEA` = "inferred from electronic annotation",
        `TAS` = "traceable author statement",
        `NAS` = "non-traceable author statement",
        `ND` = "no biological data available",
        `IC` = "inferred by curator",
        `IBA` = "inferred from biological aspect of ancestor"
      )
    ) %>%
    mutate_if(
      .predicate = is.character,
      .funs = list(~as_factor(.))
    ) %>%
    inner_join(genome, by = c("PROBEID" = "Probe.Set.Name")) %>%
    inner_join(go_tbl[c(1:2)], by = c("GO" = "go_id")) %>%
    dplyr::select(GENENAME,SYMBOL,sequence,MAP,PATH,PROBEID,x,y,GO,GOTERM = Term,ONTOLOGY,EVIDENCE,EVIDENCE_TYPE) %>%
    distinct() %>%
    # Arrange by position
    arrange(x) %>%
    dplyr::select(sequence) %>%
    unnest_tokens(
      nucleotide,
      sequence,
      token =  "character_shingles",
      n = 1
    ) %>%
    # Define the other half of the base pair
    mutate(
      pair = recode(
        nucleotide,
        `a` = "t",
        `t` = "a",
        `c` = "g",
        `g` = "c"
      )
    )
  
  # Define patterns
  tercet_purines <-
    gene_df %>%
    mutate(
      # Duplicate if purines
      nucleotide = recode(nucleotide, `a` = "a|a",`g` = "g|g"),
      pair = recode(pair, `a` = "a|a",`g` = "g|g"),
      base_pair = paste0(nucleotide,"|",pair)
    ) %>%
    summarize(pattern = base::paste(base_pair,collapse = "||")) %>%
    c()

  tercet_amino <- 
    gene_df %>%
    mutate(
      base_pair = paste0(nucleotide,pair)
    ) %>%
    summarize(pattern = base::paste(base_pair,collapse = "|")) %>%
    # Add new | for every tercet break
    mutate(pattern = gsub('(.{9})', '\\1|', pattern)) %>%
    c()
  
  couplet_amino <-
    gene_df %>%
    mutate(
      row = row_number(),
      codon = cut(row, nrow(.) %/% 3, labels=FALSE)
    ) %>%
    group_by(codon) %>%
    summarize(
      nucleotide = base::paste(nucleotide,collapse = ""),
      pair = base::paste(pair,collapse = "")
    ) %>%
    mutate(
      pattern = base::paste0(nucleotide,"|",pair)
    ) %>%
    summarize(pattern = base::paste(pattern,collapse = "||")) %>%
    c()
  
  patterns <- 
    list(
      `gene_df` = gene_df,
      `tercet_purines` = tercet_purines,
      `tercet_amino` = tercet_amino,
      `couplet_amino` = couplet_amino
    )
  
  return(patterns)
  
}

Step 4: Convert Gene Sequence to Poetic Pattern

Once we figure out which gene we want to use, we can use its gene_id to convert it into patterns using the get_patterns() function defined above:

vision_patterns <- get_patterns(gene_id = "RAX")

There are a few options to start with (and these will likely grow) for defining patterns for prosody. In each of these, a | denotes a line end and a || denotes a stanza end:

tercet_purines: a pattern in which each base pair is made into a tercet, with purines (adenine a and guanine g) getting 2 lines because they have 2 rings, and pyramidines (thymine t and cytosine c) getting one line because they have 1 ring. Example: The base pair at would correspond to the tercet pattern aat

str(vision_patterns$tercet_purines$pattern)

##  chr "c|g|g||a|a|t||c|g|g||t|a|a||g|g|c||g|g|c||g|g|c||g|g|c||a|a|t||a|a|t||c|g|g||g|g|c||t|a|a||c|g|g||t|a|a||t|a|a|"| __truncated__

tercet_amino: a pattern in which every codon (i.e. 3 base pairs) corresponding to an amino acid gets its own tercet, where the initial nucleotide from the base pair corresponds to the start word of a line, and the other nucleotide to the end word of the line.

str(vision_patterns$tercet_amino$pattern)

##  chr "cg|at|cg||ta|gc|gc||gc|gc|at||at|cg|gc||ta|cg|ta||ta|gc|gc||gc|at|ta||cg|cg|gc||at|cg|at||cg|ta|gc||gc|gc|gc||a"| __truncated__

couplet_amino: a pattern in which every codon (i.e. 3 base pairs) corresponding to an amino acid gets its own couplet, where the initial nucleotide from one side of the base pair gets the first line of the couplet, and the corresponding nucleotides of the base pairs get the second line of the couplet. This creates a pattern where there are three patterned words in each line. Similar to tercet_amino, but turned sideways.

str(vision_patterns$couplet_amino$pattern)

##  chr "cac|gtg||tgg|acc||gga|cct||acg|tgc||tct|aga||tgg|acc||gat|cta||ccg|ggc||aca|tgt||ctg|gac||ggg|ccc||aac|ttg||gtc"| __truncated__

Step 5: Fill the Pattern with Words

Finally, we need to replace the nucleotides (A, T, C, G) with some verbal/written patterns to use in the poem. So, we define patterns of words to stand in for each nucleotide. Some options for the types of word patterns we might use, based on their shared characteristics, include:

letter_proportion: words which have a certain proportion of their characters sharing the same letter
vocal_proportion: a concentration of similar vocal pattern (such as fricatives)
cluster: clustered together by similarity across multiple metrics, using kmeans
rhyme: rhyme based on Soundex or other algorithms
category: words from a given category (e.g. using the Regressive Imagery Dictionary classifications)
regex: matches a regular expression (e.g. any word beginning with w and containing an o or u in the middle before ending with the letter d).
Various other potential methods.

Appendix: Other options for Packages

If you’re a fan of this kind of thing, there are various other options to read in the human genome for use in constructing patterns, each with their own pros and cons. Here are a few:

The org.Hs.eg.db package Genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers. This requires working with the data as a biostrings object, but does include association data which will be helpful in locating the nucleotide sequences for use in specific portions of the poem.
The BSgenome.Hsapiens.UCSC.hg19 Full genome sequences for Homo sapiens (UCSC version hg19)
The biomartr package⁵, which allows for retrieval of all available kingdoms stored in the NCBI RefSeq, NCBI Genbank, ENSEMBL, and ENSEMBLGENOMES sources. It also allows for download as either biostrings or data.table object.
The rentrez package, which allows querying the NCBI Entrez Eutils API to search, download data from, and interact with databases.

Carlson M (2018). GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.6.0. DOI: 10.18129/B9.bioc.GO.db ↩︎
For this, we use the hgu133plus2.db package, which allows us to query by any of the following fields: ACCNUM, ALIAS, ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, ENZYME, EVIDENCE, EVIDENCEALL, GENENAME, GENETYPE, GO, GOALL, IPI, MAP, OMIM, ONTOLOGY, ONTOLOGYALL, PATH, PFAM, PMID, PROBEID, PROSITE, REFSEQ, SYMBOL, UCSCKG, UNIPROT. We’ll be using the GO field to match our GOIDs to additional information. Notes from this site on using GO a.k.a. gene ontology for identifying functional areas is helpful, as is this vignette on the Bioconductor site.↩︎
In the current code, I’m pulling the Affymetrix Human Genome U133 Plus 2.0 Array annotation data using the hgu133plus2.db package.↩︎
For instance, we may want to filter on the strength of evidence. Based on the GO documentation “IEA is a weak association and is based on electronic information, no human curator has examined or confirmed this association… IEA is also the most common evidence code.” Which speaks to the tip-of-the-iceberg nature or our current knowledge.↩︎
Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821 ↩︎