A protein biodiversity viewer with deep learning embeddings

This application will build upon bespokin.com; however, instead of using AI-generated images, we will use real-world proteins. Given the sheer volume of published proteins, we will focus the app on a single pathway whose proteins are known: the nicotine biosynthesis pathway.

Genetic engineering in plants is complicated. As an example, nicotine is a valuable chemical to both humans and plants. It is produced in root tissue and transported to the green tissues above ground. Why wouldn’t the plant simply synthesize nicotine in the tissues it is transported to? There are a couple of good answers. First, the raw materials, or precursors, for making nicotine are found in greater abundance in the cytosol and vacuoles of root cells, which are designed as storage units for the plant’s resources. Second, the proteins involved in nicotine biosynthesis have evolved to function in root cells, which have different ionomic conditions; in the leaves, for example, the proteins would not take on the correct shape and therefore would not work nearly as efficiently.

Therefore, a genetic engineer who wanted to produce nicotine in other cells would have to increase the supply of the sugars and amino acids that act as building blocks in those cells, and redesign the structures of the proteins so that they function in their new locations.

This app will show the space of known proteins, from across the tree of life, that surrounds these root-cell nicotine enzymes. We will use protein BLAST (BLASTp) to find similar proteins based on local sequence alignment. BLAST is a traditional algorithm that compares sequences only, so it will not explicitly find proteins that have the same functions as the proteins that catalyze the reactions forming nicotine.

Alternatively, this pipeline could be modified to BLAST only the functional domains (the parts of the protein that actually interact with chemicals to modify them). Doing so would change the nature of the app: it would help us create a tobacco plant that produces something other than nicotine. Our current approach will instead help engineers make these chemical factories functional in other parts of the plant, which is what we are interested in. Additionally, we could ignore the proteins and focus on their regulatory genomics, examining similar promoters, enhancers, and silencers that surround the physical gene; this would allow us to move the factories rather than make them functional elsewhere.
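As a rough illustration of the domain-only variant, the sketch below cuts a domain out of a full protein sequence and writes it as a FASTA record that could be used as a BLAST query. The function names, the toy sequence, and the coordinates are hypothetical; real domain boundaries would have to come from an annotation source that this plan does not yet name.

```python
def extract_domain(sequence: str, start: int, end: int) -> str:
    """Return residues start..end of a protein (1-based, inclusive)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError(f"domain {start}-{end} is outside a sequence of length {len(sequence)}")
    # Convert 1-based inclusive coordinates to a Python slice.
    return sequence[start - 1:end]


def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    toy_sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # hypothetical sequence, not a real enzyme
    domain = extract_domain(toy_sequence, 5, 12)
    print(to_fasta("toy_protein|domain_5-12", domain), end="")
```

Validating the coordinates before slicing matters here, because Python slicing would otherwise silently return a shortened domain.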


We will use a set of tools similar to that of bespokin.com: the primary tools will be Svelte for the frontend, D3 for the interactive visualizations, and PostgreSQL to store our BLAST results.

Meta’s ESM model, or a similar model, will provide embeddings for each protein, which will be used in the D3 visualization.
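To make the embedding step more concrete, here is a minimal sketch of what could happen after the model has run. It assumes the model returns one vector per residue and that we average those vectors into a single vector per protein; mean pooling is a common choice, but it is an assumption here, not something this plan has settled. The function names and toy numbers are hypothetical, and no model is actually called.

```python
import math


def mean_pool(residue_embeddings: list[list[float]]) -> list[float]:
    """Average per-residue vectors into one fixed-length protein vector."""
    if not residue_embeddings:
        raise ValueError("cannot pool an empty embedding")
    n = len(residue_embeddings)
    dim = len(residue_embeddings[0])
    return [sum(row[j] for row in residue_embeddings) / n for j in range(dim)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings"; real model vectors are far longer.
    protein_a = mean_pool([[1.0, 0.0], [3.0, 2.0]])
    protein_b = mean_pool([[2.0, 1.0], [2.0, 1.0]])
    print(protein_a, protein_b, cosine_similarity(protein_a, protein_b))
```

The pooled vectors would still need to be projected down to two dimensions before D3 can plot them; that projection is left out of this sketch.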

Step 1: Locate the enzymes/proteins involved in nicotine biosynthesis

Step 2: BLAST the proteins found in step 1

Step 3: Parse the UniProt JSON files corresponding to each protein and store them in a database
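Steps 2 and 3 could be sketched roughly as follows. The sketch assumes BLAST results are saved in the standard 12-column tabular format (-outfmt 6) and that each UniProt JSON entry carries the fields primaryAccession, organism.scientificName, sequence.value, and proteinDescription.recommendedName.fullName.value; those field names should be checked against real files. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL so that it runs without a server, and the table names, column names, and sample identifiers are all hypothetical.

```python
import sqlite3

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_blast_tabular(text: str) -> list[dict]:
    """Turn tab-separated BLAST hits into dictionaries with typed scores."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != len(BLAST_COLUMNS):
            raise ValueError(f"expected {len(BLAST_COLUMNS)} columns, got {len(fields)}")
        row = dict(zip(BLAST_COLUMNS, fields))
        hits.append({"query": row["qseqid"], "subject": row["sseqid"],
                     "pident": float(row["pident"]), "evalue": float(row["evalue"]),
                     "bitscore": float(row["bitscore"])})
    return hits


def parse_uniprot_entry(entry: dict) -> dict:
    """Pull the few fields we need out of one UniProt JSON entry (assumed layout)."""
    name = (entry.get("proteinDescription", {}).get("recommendedName", {})
                 .get("fullName", {}).get("value"))
    return {"accession": entry["primaryAccession"], "name": name,
            "organism": entry.get("organism", {}).get("scientificName"),
            "sequence": entry.get("sequence", {}).get("value")}


def store(conn: sqlite3.Connection, proteins: list[dict], hits: list[dict]) -> None:
    """Create the two hypothetical tables if needed and insert the rows."""
    conn.execute("CREATE TABLE IF NOT EXISTS protein ("
                 "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, sequence TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS blast_hit ("
                 "query TEXT, subject TEXT, pident REAL, evalue REAL, bitscore REAL)")
    conn.executemany("INSERT OR REPLACE INTO protein "
                     "VALUES (:accession, :name, :organism, :sequence)", proteins)
    conn.executemany("INSERT INTO blast_hit "
                     "VALUES (:query, :subject, :pident, :evalue, :bitscore)", hits)
    conn.commit()


if __name__ == "__main__":
    # Made-up identifiers and scores, for illustration only.
    sample = "\t".join(["QUERY1", "HIT1", "87.5", "200", "25", "0",
                        "1", "200", "1", "200", "1e-50", "350"])
    entry = {"primaryAccession": "HIT1",
             "organism": {"scientificName": "Nicotiana tabacum"},
             "sequence": {"value": "MKT"}}
    conn = sqlite3.connect(":memory:")
    store(conn, [parse_uniprot_entry(entry)], parse_blast_tabular(sample))
    print(conn.execute("SELECT * FROM blast_hit").fetchall())
```

Keeping proteins and BLAST hits in separate tables means a protein that appears as a hit for several query enzymes is stored once and referenced by its accession.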