Genetic Structure as Generative Poetic Form
Part of the long poetic sequence I’m working on is a poem called ‘I’ (or maybe ‘I-I’), whose sections are each devoted to a theme related to a given gene, and structured based on the nucleotide sequence from that gene. In order to create these poetic patterns, we need actual genetic sequence data to which patterning algorithms can be applied.
Here, I’ll lay out the steps used to gather and create these patterns:
library(tidyverse);library(DT); library(tidytext)
library(GO.db);library(hgu133plus2probe);
library(hgu133plus2.db)
genome <- hgu133plus2probe
Step 1: Find Genes that Perform Functions
The genome is vast, and starting at random at a given nucleotide seems unnecessarily naive, especially with the incredible amount of work that has been done in annotating genes based on their functions. This allows one to find specific sequences of nulceotides that are associated with a given biological function (visual perception, for instance) and to use those for relevant portions of the poem.
To query to Gene Ontology (GO) database, I use the GO.db
package1, which queries the AmiGo 2 database. The code below defines a convenience function which takes a given search term string and returns a tibble
with the related GOID
and GOTERM
fields, which can then be manually reviewed for interest. Once one or more terms are identified, these can be copied and manually added to the list in the next step which queries.
go_tbl <- toTable(GOTERM)[-c(1)]
search_go <- function(search_string){
output <-
go_tbl %>%
dplyr::filter(str_detect(Term,search_string)) %>%
dplyr::select(go_id,Term,Definition,Ontology) %>%
dplyr::distinct()
return(output)
}
Let’s define search terms for various topics of interest as regular expressions, which we can then pass to the database, and save these in a list called searches
. Below are a handful of examples:
searches <- list()
searches$vision_search <- search_go(" vision|^vision|sight| visual|^visual|ocular")
searches$hearing_search <- search_go("hearing| ear |^ear | auditory|^auditory")
searches$skin_search <- search_go(" skin |^skin| melanoma |^touch | touch ")
searches$language_search <- search_go(" language |^langauge| speech |^speech")
searches$spine_search <- search_go(" spine |^spine")
We can then review the resulting tables to find the genes which are of greatest interest for a given section of the poem, like so:
searches$vision_search %>%
datatable(
rownames = F, filter = "top",
caption = 'Summary of Gene Ontology Terms for Search',
colnames = c('GO ID','GO Term','Definition','BP'),
extensions = c('Responsive','Buttons'),
options = list(
pageLength = 5, lengthMenu = c(5, 15),
dom = 'C<"clear">lfrtip', buttons = c('colvis')
)
)
Step 2: Subset Genome based on Functions of Interest
After reading through and doing some research on the gene ontology terms above, I select a specific term/function and then proceed to pull:
Relevant genes and annotation data from the human genome.2
Probe data which includes the actual nucleotide sequences for each gene.3
Below is a convenience function which takes a list of GOID
strings as an argument and retrieves the related genes and functional annotation:
get_genes <- function(go_ids){
output <-
hgu133plus2.db %>%
select(
keytype = "GO",
columns = c(
"GO","ONTOLOGY","EVIDENCE","SYMBOL","GENENAME","MAP","PATH","PROBEID"
),
keys = go_ids
) %>%
mutate(
EVIDENCE_TYPE = recode(
EVIDENCE,
`IMP` = "inferred from mutant phenotype",
`IGI` = "inferred from genetic interaction",
`IPI` = "inferred from physical interaction",
`ISS` = "inferred from sequence similarity",
`IDA` = "inferred from direct assay",
`IEP` = "inferred from expression pattern",
`IEA` = "inferred from electronic annotation",
`TAS` = "traceable author statement",
`NAS` = "non-traceable author statement",
`ND` = "no biological data available",
`IC` = "inferred by curator",
`IBA` = "inferred from biological aspect of ancestor"
)
) %>%
mutate_if(
.predicate = is.character,
.funs = funs(as_factor)
) %>%
inner_join(genome, by = c("PROBEID" = "Probe.Set.Name")) %>%
inner_join(go_tbl[c(1:2)], by = c("GO" = "go_id")) %>%
dplyr::select(GENENAME,SYMBOL,sequence,MAP,PATH,PROBEID,x,y,GO,GOTERM = Term,ONTOLOGY,EVIDENCE,EVIDENCE_TYPE) %>%
distinct()
return(output)
}
We can then filter the output by the strength of evidence or any other variable to select the specific gene and position that we want to use.4
# Use named lists for the purpose of documentation
vision_genes <-
get_genes(
go_ids = c(
`visual perception` = "GO:0007601",
`detection of light` = "GO:0050908",
`visual equilibrioception` = "GO:0051356"
)
)
hearing_genes <-
get_genes(
go_ids = c(
`response to auditory stimulus` = "GO:0010996",
`auditory behavior` = "GO:0031223",
`ear morphogenesis` = "GO:0042471",
`ear development` = "GO:0043583",
`inner ear cell fate commitment` = "GO:0060120"
)
)
skin_genes <-
get_genes(
go_ids = c(
`establishment of skin barrier` = "GO:0061436",
`skin development` = "GO:0043588",
`skin morphogenesis` = "GO:0043589"
)
)
Step 3: Select a Specific Gene
Out of all of the genes identified, we need to continue inward toward the needle in the haystack. This may include:
- Filtering by evidence type
- Determining which precise gene function is the best fit for the poem (e.g. for the
vision_genes
, these include visual perception, detection of light stimulus involved in visual perception…)
Based on the final selection of gene, we pass the name of the gene from the SYMBOL
field (e.g. "RAX"
) to a convenience function which will return the poetic patterns available based on the gene’s nucleotide sequence and accompanying base pairs.
get_patterns <- function(gene_id){
# Filter gene data based on selection
gene_df <-
hgu133plus2.db %>%
select(
keytype = "SYMBOL",
columns = c(
"GO","ONTOLOGY","EVIDENCE","SYMBOL","GENENAME","MAP","PATH","PROBEID"
),
keys = gene_id
) %>%
mutate(
EVIDENCE_TYPE = recode(
EVIDENCE,
`IMP` = "inferred from mutant phenotype",
`IGI` = "inferred from genetic interaction",
`IPI` = "inferred from physical interaction",
`ISS` = "inferred from sequence similarity",
`IDA` = "inferred from direct assay",
`IEP` = "inferred from expression pattern",
`IEA` = "inferred from electronic annotation",
`TAS` = "traceable author statement",
`NAS` = "non-traceable author statement",
`ND` = "no biological data available",
`IC` = "inferred by curator",
`IBA` = "inferred from biological aspect of ancestor"
)
) %>%
mutate_if(
.predicate = is.character,
.funs = list(~as_factor(.))
) %>%
inner_join(genome, by = c("PROBEID" = "Probe.Set.Name")) %>%
inner_join(go_tbl[c(1:2)], by = c("GO" = "go_id")) %>%
dplyr::select(GENENAME,SYMBOL,sequence,MAP,PATH,PROBEID,x,y,GO,GOTERM = Term,ONTOLOGY,EVIDENCE,EVIDENCE_TYPE) %>%
distinct() %>%
# Arrange by position
arrange(x) %>%
dplyr::select(sequence) %>%
unnest_tokens(
nucleotide,
sequence,
token = "character_shingles",
n = 1
) %>%
# Define the other half of the base pair
mutate(
pair = recode(
nucleotide,
`a` = "t",
`t` = "a",
`c` = "g",
`g` = "c"
)
)
# Define patterns
tercet_purines <-
gene_df %>%
mutate(
# Duplicate if purines
nucleotide = recode(nucleotide, `a` = "a|a",`g` = "g|g"),
pair = recode(pair, `a` = "a|a",`g` = "g|g"),
base_pair = paste0(nucleotide,"|",pair)
) %>%
summarize(pattern = base::paste(base_pair,collapse = "||")) %>%
c()
tercet_amino <-
gene_df %>%
mutate(
base_pair = paste0(nucleotide,pair)
) %>%
summarize(pattern = base::paste(base_pair,collapse = "|")) %>%
# Add new | for every tercet break
mutate(pattern = gsub('(.{9})', '\\1|', pattern)) %>%
c()
couplet_amino <-
gene_df %>%
mutate(
row = row_number(),
codon = cut(row, nrow(.) %/% 3, labels=FALSE)
) %>%
group_by(codon) %>%
summarize(
nucleotide = base::paste(nucleotide,collapse = ""),
pair = base::paste(pair,collapse = "")
) %>%
mutate(
pattern = base::paste0(nucleotide,"|",pair)
) %>%
summarize(pattern = base::paste(pattern,collapse = "||")) %>%
c()
patterns <-
list(
`gene_df` = gene_df,
`tercet_purines` = tercet_purines,
`tercet_amino` = tercet_amino,
`couplet_amino` = couplet_amino
)
return(patterns)
}
Step 4: Convert Gene Sequence to Poetic Pattern
Once we figure out which gene we want to use, we can use its gene_id
to convert it into patterns using the get_patterns()
function defined above:
vision_patterns <- get_patterns(gene_id = "RAX")
There are a few options to start with (and these will likely grow) for defining patterns for prosody. In each of these, a |
denotes a line end and a ||
denotes a stanza end:
tercet_purines
: a pattern in which each base pair is made into a tercet, with purines (adeninea
and guanineg
) getting 2 lines because they have 2 rings, and pyramidines (thyminet
and cytosinec
) getting one line because they have 1 ring. Example: The base pairat
would correspond to the tercet patternaat
str(vision_patterns$tercet_purines$pattern)
## chr "c|g|g||a|a|t||c|g|g||t|a|a||g|g|c||g|g|c||g|g|c||g|g|c||a|a|t||a|a|t||c|g|g||g|g|c||t|a|a||c|g|g||t|a|a||t|a|a|"| __truncated__
tercet_amino
: a pattern in which every codon (i.e. 3 base pairs) corresponding to an amino acid gets its own tercet, where the initial nucleotide from the base pair corresponds to the start word of a line, and the other nucleotide to the end word of the line.
str(vision_patterns$tercet_amino$pattern)
## chr "cg|at|cg||ta|gc|gc||gc|gc|at||at|cg|gc||ta|cg|ta||ta|gc|gc||gc|at|ta||cg|cg|gc||at|cg|at||cg|ta|gc||gc|gc|gc||a"| __truncated__
couplet_amino
: a pattern in which every codon (i.e. 3 base pairs) corresponding to an amino acid gets its own couplet, where the initial nucleotide from one side of the base pair gets the first line of the couplet, and the corresponding nucleotides of the base pairs get the second line of the couplet. This creates a pattern where there are three patterned words in each line. Similar totercet_amino
, but turned sideways.
str(vision_patterns$couplet_amino$pattern)
## chr "cac|gtg||tgg|acc||gga|cct||acg|tgc||tct|aga||tgg|acc||gat|cta||ccg|ggc||aca|tgt||ctg|gac||ggg|ccc||aac|ttg||gtc"| __truncated__
Step 5: Fill the Pattern with Words
Finally, we need to replace the nucleotides (A, T, C, G) with some verbal/written patterns to use in the poem. So, we define patterns of words to stand in for each nucleotide. Some options for the types of word patterns we might use, based on their shared characteristics, include:
letter_proportion
: words which have a certain proportion of their characters sharing the same lettervocal_proportion
: a concentration of similar vocal pattern (such as fricatives)cluster
: clustered together by similarity across multiple metrics, using kmeansrhyme
: rhyme based on Soundex or other algorithmscategory
: words from a given category (e.g. using the Regressive Imagery Dictionary classifications)regex
: matches a regular expression (e.g. any word beginning withw
and containing ano
oru
in the middle before ending with the letterd
).- Various other potential methods.
Appendix: Other options for Packages
If you’re a fan of this kind of thing, there are various other options to read in the human genome for use in constructing patterns, each with their own pros and cons. Here are a few:
- The org.Hs.eg.db package Genome wide annotation for Human, primarily based on mapping using Entrez Gene identifiers. This requires working with the data as a
biostrings
object, but does include association data which will be helpful in locating the nucleotide sequences for use in specific portions of the poem. - The BSgenome.Hsapiens.UCSC.hg19 Full genome sequences for Homo sapiens (UCSC version hg19)
- The
biomartr
package5, which allows for retrieval of all available kingdoms stored in theNCBI RefSeq
,NCBI Genbank
,ENSEMBL
, andENSEMBLGENOMES
sources. It also allows for download as eitherbiostrings
ordata.table
object. - The
rentrez
package, which allows querying the NCBI Entrez Eutils API to search, download data from, and interact with databases.
Carlson M (2018). GO.db: A set of annotation maps describing the entire Gene Ontology. R package version 3.6.0. DOI: 10.18129/B9.bioc.GO.db↩︎
For this, we use the
hgu133plus2.db
package, which allows us to query by any of the following fields: ACCNUM, ALIAS, ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, ENZYME, EVIDENCE, EVIDENCEALL, GENENAME, GENETYPE, GO, GOALL, IPI, MAP, OMIM, ONTOLOGY, ONTOLOGYALL, PATH, PFAM, PMID, PROBEID, PROSITE, REFSEQ, SYMBOL, UCSCKG, UNIPROT. We’ll be using theGO
field to match ourGOID
s to additional information. Notes from this site on using GO a.k.a. gene ontology for identifying functional areas is helpful, as is this vignette on the Bioconductor site.↩︎In the current code, I’m pulling the Affymetrix Human Genome U133 Plus 2.0 Array annotation data using the
hgu133plus2.db
package.↩︎For instance, we may want to filter on the strength of evidence. Based on the GO documentation “IEA is a weak association and is based on electronic information, no human curator has examined or confirmed this association… IEA is also the most common evidence code.” Which speaks to the tip-of-the-iceberg nature or our current knowledge.↩︎
Drost HG, Paszkowski J. Biomartr: genomic data retrieval with R. Bioinformatics (2017) 33(8): 1216-1217. doi:10.1093/bioinformatics/btw821↩︎