This post covers work related to a recent GEM maize competition in which participants were provided SNP data, weather data, and additional supporting data to predict maize yields in an unseen year.

Transformers in Genomic Prediction: Advancing Plant Breeding Through Deep Learning

The Evolution of Genomic Prediction in Plant Breeding

Plant breeding has entered a new era where computational prediction of breeding values has become crucial for scaling selection processes from thousands to hundreds of thousands of individuals. Traditionally, breeders have focused on additive genetic variation - the linear, independently inherited components that reliably pass from parents to offspring. While non-additive effects like epistasis (gene-gene interactions) and genotype-environment interactions can enhance prediction accuracy, they’ve been historically challenging to model effectively.

For years, genomic Best Linear Unbiased Prediction (gBLUP) has been the gold standard for predicting breeding values, excelling at capturing linear genetic effects. However, the landscape is rapidly evolving with the introduction of transformer architectures into plant breeding applications.
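
For context, the mixed model underlying gBLUP is typically written as follows; this is the standard textbook formulation (e.g. VanRaden, 2008), not anything specific to the competition:

```latex
y = X\beta + Zu + \varepsilon, \qquad
u \sim \mathcal{N}(0,\, G\sigma_{u}^{2}), \qquad
\varepsilon \sim \mathcal{N}(0,\, I\sigma_{\varepsilon}^{2})
```

Here y holds the phenotypes, β the fixed effects, u the genomic breeding values, and G is a genomic relationship matrix built from centered marker genotypes. Because u enters the model linearly, gBLUP captures additive effects well but has no built-in mechanism for interactions between markers or between genotype and environment.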

Understanding Transformer Architecture in Genomic Contexts

Since their introduction in the seminal “Attention Is All You Need” paper (Vaswani et al., 2017), transformers have revolutionized machine learning across domains. Their success in natural language processing has catalyzed adaptations for diverse data types, including biological sequences. The key innovation lies in the attention mechanism, which enables the model to discover and weigh interactions between any elements in a sequence, regardless of their distance.

In genomic applications, this means a transformer can identify relationships between distant genetic markers that might influence traits - a capability particularly relevant for capturing epistatic effects that traditional linear models might miss. The transformer processes genetic sequences by converting them into rich, context-aware representations that capture both local and global patterns in the genome.
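
As a concrete, simplified illustration, the sketch below embeds SNP genotypes coded as 0/1/2, adds positional embeddings, and applies a small self-attention encoder before pooling into a single trait prediction. Every name and hyperparameter here is an illustrative assumption, not the competition model.

```python
import torch
import torch.nn as nn

class SNPSelfAttention(nn.Module):
    """Toy self-attention encoder over SNP genotypes coded as 0/1/2."""
    def __init__(self, n_snps: int, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.allele_embed = nn.Embedding(3, d_model)      # one vector per genotype class
        self.pos_embed = nn.Embedding(n_snps, d_model)    # one vector per marker position
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)                 # regression head for the trait

    def forward(self, genotypes: torch.Tensor) -> torch.Tensor:
        # genotypes: (batch, n_snps) integer tensor with values in {0, 1, 2}
        positions = torch.arange(genotypes.size(1), device=genotypes.device)
        x = self.allele_embed(genotypes) + self.pos_embed(positions)
        x = self.encoder(x)                               # attention over all marker pairs
        return self.head(x.mean(dim=1)).squeeze(-1)       # pool tokens, predict a scalar

# Usage: 8 individuals, 256 markers (far fewer than a real panel, for illustration)
model = SNPSelfAttention(n_snps=256)
predicted_yield = model(torch.randint(0, 3, (8, 256)))
```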

Engineering Considerations for Genomic Transformers

Implementing transformers for genomic prediction presents several unique challenges:

  1. Sequence Length Management: Modern breeding programs generate dense marker data, often ranging from 500 to 500,000 SNPs per individual. This far exceeds typical transformer sequence-length limits (often around 1024 tokens for standard full attention). Solutions include the following (a minimal compression sketch appears after this list):
    • Hierarchical compression using convolutional layers, similar to the Enformer architecture
    • Attention-based pooling strategies to maintain information while reducing dimensionality
    • Linear attention variants that scale better with sequence length
  2. Data Efficiency: While transformers are powerful, they typically require larger training datasets than CNNs. In breeding contexts, this necessitates:
    • Careful architecture design to prevent overfitting
    • Implementation of regularization strategies specific to genomic data
    • Potential use of pre-training on larger genomic datasets before fine-tuning on specific breeding populations
  3. Architecture Design: Cross-attention mechanisms offer a powerful way to integrate multiple data streams (genetic, environmental, phenotypic), but they require careful consideration (a cross-attention sketch appears in the Example Code section below):
    • Balance between model expressiveness and computational efficiency
    • Strategic placement of cross-attention layers to maximize information flow
    • Careful ablation studies to validate the utility of each architectural component
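
A minimal sketch of the hierarchical compression idea from point 1, with dropout standing in for the regularization mentioned in point 2: strided convolutions shrink a long marker sequence into a much shorter token sequence before any attention is applied. Channel counts, kernel sizes, and depths are placeholder assumptions, loosely inspired by the Enformer front end rather than copied from it.

```python
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Strided 1D convolutions that compress a long SNP sequence into a short
    sequence of token embeddings suitable for a standard transformer."""
    def __init__(self, in_channels: int = 4, d_model: int = 64, n_blocks: int = 4):
        super().__init__()
        blocks, channels = [], in_channels               # e.g. one channel per allele encoding
        for _ in range(n_blocks):
            blocks += [nn.Conv1d(channels, d_model, kernel_size=5, stride=2, padding=2),
                       nn.GELU(),
                       nn.Dropout(0.1)]                  # light regularization against overfitting
            channels = d_model
        self.net = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, n_snps); each block roughly halves the sequence length
        return self.net(x).transpose(1, 2)               # -> (batch, n_snps / 2**n_blocks, d_model)

# 100,000 markers compressed to 6,250 tokens before attention is applied
tokens = ConvCompressor()(torch.randn(2, 4, 100_000))
print(tokens.shape)                                      # torch.Size([2, 6250, 64])
```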

This approach to genomic prediction represents a significant advance in our ability to model complex genetic architectures and could substantially impact breeding program efficiency. The GEM maize competition provides a practical testbed for these methods, challenging researchers to predict yields using SNP, weather, and auxiliary data in novel environments.

Example Code

The Machine Learning Model

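The full implementation lives in the linked repository; the snippet below is only a rough, self-contained sketch of the general shape described in the Discussion: a transformer encoder over genotype tokens, the weather time series projected into the same embedding space, and cross-attention letting the genotype tokens query the weather sequence. Dimensions, depths, and names are illustrative assumptions, not the exact architecture from the repository or the paper it follows.

```python
import torch
import torch.nn as nn

class GenoWeatherCrossAttention(nn.Module):
    """Sketch of a yield predictor fusing genotype and weather streams
    with cross-attention; all sizes are illustrative."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_weather_feats: int = 16):
        super().__init__()
        # Positional embeddings are omitted to keep the sketch short.
        self.snp_embed = nn.Embedding(3, d_model)                # 0/1/2 genotype codes
        self.weather_proj = nn.Linear(n_weather_feats, d_model)  # daily weather -> tokens
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                               dim_feedforward=128, batch_first=True)
        self.geno_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, 1))

    def forward(self, genotypes: torch.Tensor, weather: torch.Tensor) -> torch.Tensor:
        # genotypes: (batch, n_snps) ints in {0, 1, 2}
        # weather:   (batch, n_days, n_weather_feats)
        g = self.geno_encoder(self.snp_embed(genotypes))  # self-attention over markers
        w = self.weather_proj(weather)                    # weather time steps as tokens
        # Genotype tokens query the weather sequence: a simple G x E interaction pathway
        fused, _ = self.cross_attn(query=g, key=w, value=w)
        return self.head((g + fused).mean(dim=1)).squeeze(-1)

# Usage: 4 hybrids, 512 markers each, one season of 200 daily weather records
model = GenoWeatherCrossAttention()
predicted_yield = model(torch.randint(0, 3, (4, 512)), torch.randn(4, 200, 16))
```
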
Discussion

This project was developed with PyTorch and nbdev. See the documentation here: https://cjgo.github.io/hybridpredictmaize22/

I replicated the performance of the model from a recent paper by reimplementing its architecture, which included transformer layers and cross-attention connections between the weather and genotype inputs. Overall, it was a good lesson in writing PyTorch networks and data pre-processing pipelines from scratch.

Thanks to http://www.genomes2fields.org/