Domains
Condensed matter
Statistical physics
Biophysics
Physics of living systems
Type of internship
Theoretical, numerical
Description
A biological sequence (DNA, RNA, or protein) is a chain of covalently linked nucleotides or amino acids. A central paradigm of biology is that the sequence determines the function of the molecule in the organism. However, this mapping is complex and context-dependent. Generative models trained on sequence data can be used to sample novel functional sequences, but the generated sequences merely reproduce the statistics of the training data, combining features found in natural sequences in an uncontrolled manner.
In this internship, we will explore how generative models (such as restricted Boltzmann machines (RBM), variational autoencoders (VAE), diffusion models, …) can be modified to extract disentangled representations of biological sequences, in which properties of interest are mapped to independent latent coordinates. These latent variables can then be adjusted during sampling to control the properties of the designed sequences.
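As a purely illustrative sketch of what controlling a property through a latent coordinate could look like, the toy example below uses a hand-built linear decoder in place of a trained model. Everything in it is an assumption made for illustration: the function names, the RNA alphabet, the choice of GC content as the property, and the fact that latent coordinate 0 is wired to that property by construction rather than learned from data.

```python
import numpy as np

ALPHABET = "ACGU"  # toy RNA alphabet (assumption for this sketch)


def make_toy_decoder(length=30, latent_dim=4, seed=0):
    """Build hypothetical decoder weights of shape (length, letters, latent_dim)."""
    rng = np.random.default_rng(seed)
    weights = 0.3 * rng.standard_normal((length, len(ALPHABET), latent_dim))
    # By construction (not learned): coordinate 0 raises the logits of
    # C and G at every site, so it acts as a disentangled "GC content" knob.
    weights[:, :, 0] = 0.0
    weights[:, 1:3, 0] = 1.5
    return weights


def sample_sequences(weights, n_samples, clamp=None, seed=0):
    """Sample sequences; `clamp` maps latent indices to fixed values."""
    rng = np.random.default_rng(seed)
    length, n_letters, latent_dim = weights.shape
    z = rng.standard_normal((n_samples, latent_dim))
    if clamp is not None:
        for index, value in clamp.items():
            z[:, index] = value  # fix the chosen coordinate, leave the rest free
    # Per-site logits, then a numerically stable softmax over letters.
    logits = np.einsum("lad,nd->nla", weights, z)
    logits -= logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Inverse-CDF sampling of one letter per site.
    u = rng.random((n_samples, length, 1))
    idx = (probs.cumsum(axis=-1) < u).sum(axis=-1)
    idx = np.minimum(idx, n_letters - 1)  # guard against rounding in cumsum
    return ["".join(ALPHABET[i] for i in row) for row in idx]


def gc_content(seqs):
    """Mean fraction of G and C letters over a list of sequences."""
    return float(np.mean([(s.count("G") + s.count("C")) / len(s) for s in seqs]))


if __name__ == "__main__":
    decoder = make_toy_decoder()
    for value in (-3.0, 0.0, 3.0):
        seqs = sample_sequences(decoder, 200, clamp={0: value}, seed=1)
        print(f"z[0] = {value:+.1f}  mean GC content = {gc_content(seqs):.2f}")
```

In this toy setting, clamping coordinate 0 shifts the GC content of the sampled sequences while the remaining coordinates stay random. In an actual model, the difficulty is that such an alignment between a latent coordinate and a property is not given and has to be encouraged during training.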
Contact
Jorge FERNANDEZ DE COSSIO DIAZ