Transitioning to Bio ML:
My Experience Learning and Modifying FoldFlow-2

Table of contents

Introduction: Personal Motivation & the Road Ahead

Transition phases, especially those that shape the direction of our lives, are often really tough! At least in my case, they also tend to be some of the most rewarding episodes I’ve gone through. Currently being in such a phase again and having reached the first milestone, I’m opening my blog and documenting my experience in this first post.

About a year ago, I decided to shift my career path towards Bio ML and the fascinating field of protein drug discovery. Why proteins? — you might ask. With a background in particle physics, a thirst for solving complex puzzles, and expertise in various machine learning areas, I’ve always been driven by the desire to engage in work that is meaningful, intellectually challenging, and beneficial to humanity. Protein design offered precisely that combination. Setting my goal to enter the field, I embarked on a learning journey that turned out to be the most demanding self-initiated research project I’ve undertaken so far.

This post is basically me putting together all the pieces I’ve learned on the way that play an important role in the latest generative techniques for proteins. Here, I’ll present a deep dive, which closely follows my learning steps, into FoldFlow-2, a recent state-of-the-art model for protein structure generation. I’ll walk you through the advanced ML concepts underlying the model, document my process of understanding and modifying its architecture, and reflect on the insights and skills gained along the way.

If you’ve ever wondered how modern ML can be used for creation of new proteins, or if you’re considering redirecting your career similarly to mine, I hope this article offers useful perspectives and practical resources. I’d be happy if it serves as a springboard for a smoother start to your own journey into the world of Bio ML. For those who want to dig deeper, I’ll provide links to upcoming focused posts where I break down each core technique powering my augmented version of FoldFlow-2. So, stay tuned, the links will be added gradually as I continue writing.

The Promise and Challenge of Protein Drug Discovery

Proteins are essential biomolecules responsible for nearly every crucial process in living organisms. These macromolecules fold into complicated three-dimensional structures, determining how they interact with other molecules and shaping their biological roles. For instance, hemoglobin, a protein present in red blood cells, binds and transports oxygen from the lungs to tissues, which is a critical step in respiration 1 .

Hemoglobin is composed of four subunits, each containing a heme group 2 that binds oxygen. This binding is a gradual and cooperative process: once the first heme binds oxygen, the structure of the whole protein is slightly changed, making it easier for the other heme groups to bind. Therefore, while binding the first molecule is relatively difficult, adding three consequent ones is progressively easier. There’s plenty of oxygen in the lungs and it’s easier for hemoglobin to bind it and quickly fill up the remaining subunits. However, as blood circulates through the rest of the body, where the carbon dioxide level is increased, a release of one of the bound oxygen molecules induces a change in the shape of the protein. This prompts the remaining three oxygens to be quickly released, as well. This cooperative mechanism allows hemoglobin to pick up large loads of oxygen from our lungs and deliver them where needed.

Hemoglobin binding oxygen — A change in shape occurs at a subunit of hemoglobin when it binds an oxygen molecule in turquoise, influencing the other subunits and increasing its binding affinity.

Hemoglobin’s example demonstrates the sophisticated nature of protein structure and function relationships, refined over millions of years of evolution. Unfortunately, the task of intentional de novo protein design that aims to create entirely new proteins with desirable functions from scratch is a complex and challenging one. Proteins consist of sequences of 20 standard amino acid residues. Therefore, if we limit the sequence length to, say, 50 residues — even though one can find way longer sequences in nature — the size of the possible design space is 20⁵⁰. While traditional physics-based computational approaches, e.g. molecular dynamics (MD) simulations, could yield potentially promising results, high computational costs and slow speed significantly limit areas of their application. The design space is simply too vast for them to accomplish the task.

The rise of Bio ML and the phenomenal success of AlphaFold-2 marked a transformative moment in protein science. AlphaFold-2 3 showed that deep learning could predict protein structures from amino acid sequences with high accuracy, outperforming all previous computational methods. This accelerated the adoption of machine learning techniques to the protein discovery problem. One of the latest outstanding results is a model 4 that designed a protein binder for receptors of COVID-19.

Then it got even more interesting. Generative models emerged as a powerful tool in protein design, capable of generating entirely new realistic protein sequences. Among these novel approaches, FoldFlow-2 5 caught my attention for a number of reasons. First of all, it leverages several cutting-edge methods that I wanted to learn because some are already used in many other protein discovery models and others offer significant improvements over current baselines. Flow Matching on the $\text{SE}(3)$ group manifold, Optimal Transport theory, protein LLMs, and some AlphaFold-2 innovations like Invariant Point Attention define the FoldFlow-2 architecture. I was genuinely eager to dive into that knowledge. Secondly, despite the model being rather complex, it doesn’t have an extremely large codebase or require weeks of GPU runtime to train. Considering all this, I chose FoldFlow-2, since I was also interested in modifying its architecture and experimenting with some $\text{SE}(3)$-equivariant tensor field Graph Neural Networks (GNNs), which were on my study list too.

In the next chapters, I’ll unpack the essential machine learning innovations that underpin FoldFlow-2, share my experience of dissecting and familiarizing myself with its architecture, and detail my attempts at improving upon its already impressive capabilities.

How FoldFlow-2 Fits Into Generative Protein Modelling

Although most people in the ML community are now familiar with AlphaFold models and their revolutionary success in structure prediction, there’s a new wave of research that focuses on generative models that can design entirely new protein structures. Following that wave, Dreamfold 6 team has developed FoldFlow-2. It’s a recent state-of-the-art $\text{SE}(3)^N$-invariant generative model for protein backbone generation that is additionally conditioned on the sequences of amino acids. As the name suggests, this architecture builds on top of FoldFlow (7 ) and implements a novel mechanism of handling multi-modal data, resulting in a substantial performance gain over the original version.

Several successful attempts to create generative models (RDM 8 , RFDiffusion 9 , FrameDiff 10 ), operating on Riemannian manifolds, had been published before FoldFlow was released in 2024. Some required pretraining on prediction of protein structure (RFDiffusion), others used approximations to compute Riemannian divergence in the objective (RDM) and all of them relied on the Stochastic Differential Equations as the theoretical base for modeling the diffusion process on the manifold, which assumes a non-informative source (prior) distribution that one uses for training. FoldFlow was one of the first models that introduced $\text{SE}(3)$ Conditional Flow Matching 11 for generation of a protein backbone with a possibility to use an informative prior distribution, and it utilized minibatch Optimal Transport 12 to speed up training.

Generation of proteins from scratch is a much harder problem than predicting its 3D structure. A model should create proteins that are designable, different to the ones found in the training set and diverse. It’s not only difficult to build such models, but it’s also not easy to adequately assess their performance (more on this in the following sections). A multi-modal architecture of FoldFlow-2 is definitely a step forward that offers improvements across all three metrics that researchers use for evaluation. To fully grasp FoldFlow-2’s approach, let’s first cover some theoretical preliminaries and talk about the core ML techniques proposed by the authors of the paper.

Overview of Core ML Techniques in FoldFlow-2

The model shares and extends some of the theoretical foundations laid out in the AlphaFold-2 and FrameDiff papers. Each of its techniques is a topic in itself and requires more detailed explanations than I can give here without making this post excessively long. Instead, as I already mentioned in the beginning, I’ll dive deeper into each technique in separate focused posts, offering a shorter description here. Don’t worry if you don’t understand something immediately after reading it. I won’t lie, the math behind the model is surely not easy here and it took me a few months of reading paper after paper to start orientating in it. Let’s kick off with an important concept of protein backbone and its parametrization.

Representations of a Protein Backbone

Glycine and alanine amino acids — Two amino acids are linked together. GLY - glycine, ALA - alanine. The vectors formed by two atomic bonds (shown with arrows) are used in the Gram-Schmidt algorithm to construct the initial frames for each residue. A torsion angle $\psi$ is required for correct oxygen placement.

A backbone consists of repeated N—C$_{\alpha}$—C—O four heavy atoms linked together in a chain, with each set corresponding to one amino acid (residue). C$_{\alpha}$ atom of each residue, except for glycine (GLY), is attached to a side chain that varies for each amino acid and determines its distinct chemical properties. The geometry of the backbone is determined by mapping a set of idealized local coordinates, [N$^{\star}$, C$^{\star}_{\alpha}$, C$^{\star}$, O$^{\star}$] $\in \mathbb{R}^3$ centered at
C$_{\alpha}^{\star}$=(0, 0, 0), to the actual position of each residue. This mapping is performed using a rigid transformation given by an action $x$ of the special Euclidean group $\text{SE}(3)$ defined by 3D rotations $R$ and translations $S$. In other words, an action $x^i$ generates backbone coordinates for a residue $i$:

$$[N, C_{\alpha}, C, O]^i = x^{i} \cdot [N^{\star}, C^{\star}_{\alpha}, C^{\star}, O^{\star}] \tag{1}$$

As shown in Eq. 1, each residue transformation can be decomposed into two components $x^i=(r^i, s^i)$ where $r^i \in \text{SO}(3)$ is a $3 \times 3$ rotation matrix and $s^i \in \mathbb{R}^3$ is a three-dimensional translation vector. Thus, following AlphaFold-2’s approach, the entire structure of a protein with N residues is parameterized by a sequence of N such transformations described by the product group $\text{SE}(3)^N$. This results in a representation of all heavy atoms of the protein given by a tensor $X \in \mathbb{R}^{N \times 4 \times 3}$. Additionally, in order to compute the coordinates of the backbone oxygen in frame $i$, one needs to apply a rotation around C$_{\alpha}$—C bond by a torsion angle $\psi^i$.

The final rotation matrix $r^i$ for each residue is obtained via the Gram-Schmidt algorithm3 . This procedure operates on two vectors built from backbone atom coordinates, enforcing orthonormality to output a valid rotation matrix centered on the C$_{\alpha}$ atom. Further details of this parametrization are well documented in the appendix of the FrameDiff10 paper.

So, one way to model a protein is to associate an element of $\text{SE}(3)$, called a “rigid” for simplicity, to each residue in the chain. This representation is used as the “structure” modality of the model.

The second modality represents a protein as a sequence of 20 possible one-hot encoded amino acids. This is a usual way to tokenize data in protein language models. Then, the whole protein sequence is written as a tensor $A \in \mathbb{R}^{N \times 20}$.

Before moving on, let me briefly write what exactly the $\text{SE}(3)$ group is and why it’s the natural choice for describing protein structures.

$\text{SE}(3)$ Group: Tool for Backbone Structure Parametrization

Each rigid transformation $x^i$ corresponding to residue $i$ in a protein backbone is mathematically described by the $\text{SE}(3)$ group 13 . Simply put, $\text{SE}(3)$ represents all possible rotations and translations in three-dimensional space. Since each residue’s coordinates can be obtained according to Eq. 1, $\text{SE}(3)$ provides an ideal mathematical tool for modeling the spatial positions of amino acids. Essentially, our task turns out to be a prediction of rotations and translations of the idealized coordinates for each residue, which produces the three-dimensional structure of the protein.

A powerful property of $\text{SE}(3)$ is that it forms a Lie group, which is also a differentiable manifold with smooth (differentiable) group operations. Informally, a manifold is a topological space that locally resembles Euclidean space. Since the manifold is differentiable in our case, we can smoothly interpolate between different points (representing different protein structures), which is crucial for generative modeling. Each point on this manifold has an associated tangent space 14 , allowing us to define smooth transitions, or flows, between protein structures. The tangent space at the identity element of $\text{SE}(3)$ is called its Lie algebra $\mathfrak{se}(3)$, which contains skew-symmetric 15 rotation matrices and translation vectors. $\text{SE}(3)$ is a matrix Lie group, i.e. its elements are represented with matrices. Additionally, $\text{SE}(3)$ can be decomposed into two simpler groups: the rotation group $\text{SO}(3)$ and the translation group $\mathbb{R}^3$.

Next, I’ll discuss how this group formalism enables the creation of flows on the manifold, which form the core of the protein generative process in FoldFlow-2.

Conditional Flow Matching on the $\text{SE}(3)$ Manifold

In the previous subsection, I mentioned that Lie groups are smooth manifolds, a property that allows them to be equipped with a Riemannian metric 16 , which makes it possible to define distances, angles and geodesics 17 on the manifold. For $\text{SE}(3)$, the metric decomposes into separate metrics for its constituent subgroups: $\text{SO}(3)$ and $\mathbb{R}^3$ 7 . The decomposition of the $\text{SE}(3)$ group into its subgroups allows to construct independent flows for rotations and translations, which can then be combined to create a unified flow on $\text{SE}(3)$. Flow Matching11 techniques for Euclidean spaces like $\mathbb{R}^3$ are well-studied and you can find an excellent introduction in this post 18 . So, I’ll focus on the key aspects of Flow Matching specifically for the rotation group $\text{SO}(3)$.

Metric and Distance on $\text{SE}(3)$

Before diving into the Flow Matching framework, let me establish the notion of a metric and show the distance for the $\text{SE}(3)$ group, which we can conveniently split into two components, $\text{SO}(3)$ and $\mathbb{R}^3$. A usual choice for the metric on $\text{SO}(3)$ is:

$$\langle \mathfrak{r_1}, \mathfrak{r_2} \rangle_\text{SO}(3) = \frac{1}{2} \text{tr}(\mathfrak{r_1}^T \mathfrak{r_2}) \tag{2},$$

where $\mathfrak{r_1}$ and $\mathfrak{r_2}$ are the elements of the Lie algebra $\mathfrak{so}(3)$.

Using eq. 2, the distance on $\text{SE}(3)$ can be defined as follows:

$$d_{\text{SE}(3)}(x_1, x_2) = \sqrt{d_{\text{SO}(3)}(r_1, r_2)^2 + d_{\mathbb{R}^3}(s_1, s_2)^2} = \sqrt{\left\| \log(r_1^T r_2) \right\|_F^2 + d_{\mathbb{R}^3}(s_1, s_2)^2} \tag{3}$$

Hence, the distance on $\text{SO}(3)$ is calculated as the Frobenius matrix norm of the logarithmic map (read section 4.3.3) of the relative rotation between $r_1$ and $r_2$ and $d_{\mathbb{R}^3}$ is the usual Euclidean distance. Although the formula for the $\text{SO}(3)$ distance looks complex, in practice and in the code it’s been calculated in a much simpler way by finding a rotation angle needed to get from the first rotation matrix, $r_1$, to the second, $r_2$.

This distance formulation will be crucial when I discuss the optimization objective and Optimal Transport in the next sections.

Probability Path on $\text{SO}(3)$

Another concept I need to introduce before revealing the optimization objective of the model is probability paths on $\text{SO}(3)$. Imagine that we have two probability densities $\rho_0$, $\rho_1 \in \text{SO}(3)$ where $\rho_0$ corresponds to our target data distribution and $\rho_1$ is an easy-to-sample source distribution. We can smoothly interpolate between these two densities in the probability space by following a probability path denoted $\rho_t: [0, 1] \to \mathbb{P}(\text{SO}(3))$ that depends on one parameter $t$ that we can think of as time. This transition is generated by a flow, a mapping $\psi_t$ that takes every starting point $r$ in $\rho_0$, given by a rotation matrix on $\text{SO}(3)$, and moves it to a new location on the manifold, $r_t = \psi_t(r)$, at time $t$. Thus, the entire distribution $\rho_t$ is formed by applying this map to the initial distribution $\rho_0$.

The map $\psi_t$ is the solution to the ordinary differential equation (ODE), eq. 4, with the initial condition $\psi_0(r) = r$.

$$\frac{d \psi_t}{dt} = u_t(\psi_t(r)) \tag{4}$$ The dynamics of this flow are governed by a velocity field, $u_t: [0,1] \times \text{SO}(3) \to T_{r_t}\text{SO}(3)$, that lies in the tangent space $T_{r_t}\text{SO}(3)$ at point $r_t = \psi_t(r)$. This means the velocity field $u_t$ assigns a tangent vector to each point on the manifold. Therefore, for any rotation $r$, the velocity $u_t(r_t) \in T_{r_t}$ is a vector in the tangent space at that point, describing the instantaneous direction and magnitude of the flow. In simpler terms, you can view this vector field as the guide that provides precise instructions for every single point at every single moment in time of how to move in order to morph the whole initial probability density $\rho_0$ into the target one, $\rho_1$.

Now that I explained how probability paths and flows work on the $\text{SO}(3)$ manifold, let’s see how FoldFlow-2 leverages these concepts to formulate its training objective.

From Conditional Flow Matching to the Optimization Objective

The main task of the model is to generate realistic and novel proteins, which are parametrized by the product group $\text{SE}(3)^N$. One way to train such a model is by using the Conditional Flow Matching11 technique. Focusing on the rotation ($\text{SO}(3)$) component of the objective, let me shed some light upon the main idea of this approach.

The idea of Conditional Flow Matching is to fit a conditional velocity field $u_t(r_t| r_0, r_1)$ in the tangent space $T_{r_t}\text{SO}(3)$ associated with the flow $\psi_t$ that smoothly transports the data distribution $r_0 \sim \rho_0$ to the source distribution $r_1 \sim \rho_1$. The unconditional vector field, the marginal velocity field over all possible endpoint pairs, is intractable to compute directly. Therefore, the model learns a conditional vector field $u_t(r_t| r_0, r_1)$, which is conditioned on the specific start ($r_0$) and end ($r_1$) points of the trajectory. Once we have access to this vector field, we can sample from $\rho_1$ and use a simple ODE solver to run the reverse process, generating a protein that resembles those found in the data distribution $\rho_0$. This generative technique can trace its roots back to the influential paper on Neural ODEs 19 where the authors laid the groundwork for continuous normalizing flows. I recommend reading it to those unfamiliar with the topic, since it provides foundational concepts that simplify the understanding of Flow Matching.

The authors of FoldFlow-2 follow the strategy developed in the previous version of the model7 that constructs a flow $\psi_t$, connecting $r_0$ and $r_1$, by utilizing the geodesic between these points. A geodesic that connects two points is the shortest path between them on a manifold. For a general manifold, including $\text{SO}(3)$, the geodesic interpolant between $r_0$ and $r_1$, indexed by $t$, is given by the following equation:

$$\psi_t = r_t = \exp_{r_0} (t \, \log_{r_0}(r_1)) \tag{5}$$

A geodesic interpolant, eq. 5 between two points $r_0$ and $r_1$ on a manifold is the generalization of linear interpolation to curved spaces. In Euclidean space, interpolating between two points is simply a straight line that can be written as:

$$\psi_t = x_t = (1 - t)x_0 + tx_1 \tag{6}$$

However, on manifolds, straight lines are generalized by geodesics, which are curves that locally minimize distance and have zero acceleration. This interpolant includes two concepts important for manifold operations: exponential and logarithmic maps.

The exponential map (fig. 3a) $\exp_{r_0}: T_{r_0}\text{SO}(3) \to \text{SO}(3)$ takes a tangent vector $v \in T_{r_0}\text{SO}(3)$ at point $r_0$ and maps it to the point reached by following the unique geodesic $\gamma(t)$, which satisfies $\gamma(0) = r_0$ and $\dot{\gamma}(0) = v$, in the direction specified by that vector, producing a new point on the manifold, corresponding to a rotation matrix $r_1$:

$$r_1 = \exp_{r_0}(v) = \gamma(1) \in \text{SO}(3) \tag{7}$$

Effectively, one travels a unit of time along the geodesic $\gamma(t)$ for a distance equal to $\left| \left| v \right| \right| = \sqrt{g_{\text{SO}(3)}(v, v)}$ and ends up at a new point on the manifold. The distance is computed according to a chosen metric $g_{\text{SO}(3)}$ on that manifold ($\text{SO}(3)$ in our case). It can be viewed as an analogue to addition in Euclidean space: for a point $p$ and a tangent vector $v$ the exponential map is simply $\exp_p(v) = p + v$. Basically, knowing $v$ and the starting point $p$, I can answer the question where I would be in a unit of time if I go there with constant velocity $v$ from point $p$.

Conversely, the logarithmic map (fig. 3b) is the local inverse of the exponential map, $\log_{r_0}: \text{SO}(3) \to T_{r_0}\text{SO}(3)$. It provides the tangent vector at $r_0$ pointing in the direction of $r_1$, eq. 8. Here, it’s the opposite: if I know my current location $r_1$ and the starting location $r_0$, the log map tells me in what direction and how “fast” I should travel to reach my $r_1$, starting at $r_0$ in a unit of time.

$$v = \log_{r_0}(r_1) \in T_{r_0}\text{SO}(3) \tag{8} $$

In Euclidean space, it is analogous to vector subtraction between two points: for points $p$ and $q$, the log map $\log_p(q)$ returns the vector $q-p$.

(a) Exponential map between $r_0$ and $r_1$. (b) Logarithmic map between $r_0$ and $r_1$.

One of the key innovations of the paper is an efficient computation of the logarithmic and exponential maps required for the geodesic interpolant eq. 5, avoiding their standard definitions as infinite matrix series. This method leverages the Lie group structure of $\text{SO}(3)$. To compute $\log_{r_0}(r_1)$, a relative rotation $r_{rel} = r_1^T r_0$ is first calculated. Then $r_{rel}$ is converted to its axis-angle representation 20 and the hat operator is applied to it. The hat operator maps a three-dimensional vector to a skew-symmetric matrix. Since the output of the previous step is a skew-symmetric matrix, this whole procedure yields $\mathfrak{r_1} \in \mathfrak{so}(3)$ that belongs to the Lie algebra of $\text{SO}(3)$ and, by definition, lives at the tangent space at the identity element of the group. It’s possible to apply left translation to $\mathfrak{r_1}$ to move it to the tangent space of $r_0$. This is achieved, using left matrix multiplication by $r_0$, which produces the desired logarithmic map $\log_{r_0}(r_1)$. Similarly, the exponential map can be computed in closed form for skew-symmetric matrices, the elements of $\mathfrak{so}(3)$, using Rodrigues’ formula. You can see the visualization of the steps involved in the log map computation in fig. 4.

$Visualization of the computation of the logarithmic map $\log_{r_0}(r_1)$$

(a) Two rotations on $\text{SO}(3)$ group: $r_0$ and $r_1$. (b) Relative rotation $r_{rel}$ that, by definition, has its starting point at the identity element $Id$ located at the center of the sphere. (c) The log map is efficiently computed in the axis-angle representation at the identity element, yielding a tangent vector of the Lie algebra, $\log_{Id}(r_{rel}) \in \mathfrak{so}(3)$. (d) The tangent vector is left-translated to $r_0$, producing $\log_{r_0}(r_1)$.

Having established feasible ways to work with the geodesic interpolant $r_t$, the authors describe how to get the conditional velocity field $u_t(r_t| r_0, r_1)$, which is given by the ODE associated with the conditional flow and is the time derivative of the geodesic interpolant in eq. 5:

$$\frac{d\psi_t(r|r_0, r_1)}{dt} = \dot{r_t} = u_t(r_t| r_0, r_1) \tag{9}$$

The computation of the conditional vector field leverages the group structure rather than directly taking the derivative of the interpolant. From eq. 9, we see that computing the vector field $u_t \in T_{r_t}\text{SO}(3)$ requires taking the time derivative of $r_t$ at time $t$ along the geodesic. However, the vector field has a simple closed-form expression written in the paper as $u_t = \log_{r_t}(r_0) / t$, where the logarithmic map is computed using the efficient procedure described above. In my opinion, there’s a minus sign missing in that formula, but it doesn’t really matter for training because our neural network learns to “undo” that mistake adjusting its weights accordingly.

Finally, I’m ready to write down the full loss of FoldFlow-2, where the translational part is derived similarly to the rotational component. You can check the suggested post 18 for further details.

$$ \mathcal{L} = \mathcal{L}_{\text{SO}(3)} + \mathcal{L}_{\mathbb{R}^3} = \mathbb{E}_{t \sim \mathcal{U}[0,1], q(x_0, x_1), \rho_t(x_t|x_0, x_1, \bar{a})} \left\| v_\theta(t, r_t, \bar{a}) - \log_{r_t}(r_0) / t \right\|^2_{\text{SO}(3)} + \left\| v_\theta(t, s_t, \bar{a}) - \frac{s_t - s_0}{t} \right\| ^2_2 \tag{10}$$

where $t$ is sampled uniformly from $\mathcal{U}[0,1]$, the neural network’s prediction $v_{\theta} \in T_{r_t}\text{SO}(3)$ is in the tangent space at $r_t$, the norm is induced by the metric on $\text{SO}(3)$ and $q(r_0, r_1)$ is any coupling between samples from the data and source distributions. On top of the rotations $r_t$ and translations $s_t$ the model’s input includes the sequence of amino acids $\bar{a}$, which is masked 50% of the time during training. This masking allows for unconditional generation of proteins when the sequence is not known or assumed.

As we see from eq. 10, the neural network directly predicts a vector field, which we then regress on the target vector field corresponding to the geodesic (eq. 5) and Euclidean (eq. 6) interpolants.

Full Training Objective

While the Conditional Flow Matching objective is central to the training, the complete loss function incorporates two additional auxiliary terms to improve predictions. These losses, inspired by the FrameDiff10 model, operate directly on the predicted 3D atomic coordinates.

One of the auxiliary losses is the backbone atom loss $\mathcal{L}_{bb}$ that penalizes the squared difference between the predicted backbone atom coordinates and the ground truth positions. The second one is the pairwise distance loss, $\mathcal{L}_{2D}$, between four heavy atoms of the residues within a local neighborhood (distance $< 6 \mathring{A}$ ). The losses are computed as follows:

$$\mathcal{L}_{bb} = \frac{1}{4N} \sum \left|\left| A - \hat{A} \right|\right|^2, \quad \mathcal{L}_{2D} = \frac{\left|\left| \mathbb{1} \{ D < 6 \mathring{A} \} (D - \hat{D}) \right|\right|^2}{\sum \mathbb{1}_{D < 6 \mathring{A}} - N} \tag{11},$$

where $\mathbb{1}$ is the indicator function, $A^{N \times 4 \times 3}$ is the tensor of predicted backbone atom positions, $\hat{A}$ is the ground truth, and $D^{N \times N \times 4 \times 4}$ is the tensor containing all pairwise distances between four heavy atoms $a$, $b$ of the residues $i$, $j$, i.e. $D_{ijab} = \left|\left|A_{ia} - A_{jb} \right|\right|$.

The intuition behind inclusion of these auxiliary losses is that it helps the model to produce more physically plausible proteins and decrease the number of chain breaks, as well as the amount of steric clashes. Essentially, this mechanism improves the fine-grained characteristics of protein geometry. These auxiliary losses, weighted by $\lambda_{aux} = 0.25$, are only active during the later phase of the flow, specifically when the time variable $t < 0.25$ and the fine-grained characteristics emerge (our data distribution has $t=0$).

The complete loss function is therefore a weighted sum of the primary flow-matching objective and these conditional auxiliary losses:

$$\mathcal{L} = \mathcal{L}_{\text{SO}(3)} + \mathcal{L}_{\mathbb{R}^3} + \mathbb{1} \{ t < 0.25 \} \lambda_{aux} ( \mathcal{L}_{bb} + \mathcal{L}_{2D}) \tag{12}$$

One important remark must be stated here. FoldFlow-2 introduces Reinforced FineTuning method that modifies the final loss, however, I omit it here. The reason for that is simple: the actual training code of FoldFlow-2 is still not published fully and the reinforcement finetuning part is not available. Therefore, I trained the model, using the just the loss from eq. 12 and skipping the RL part altogether.

There is another vital detail hiding in the CFM loss formula (eq. 10). It’s the way the samples $r_0$ and $r_1$ are coupled. An optimal choice, developed in the paper, is to set $q(r_0, r_1) = \pi(r_0, r_1)$, which is a solution of the Riemannian Optimal Transport7 problem. Let me say a couple of words about this crucial aspect.

Optimal Transport on $\text{SE}(3)$

While it’s possible to use any coupling between the data, $\rho_0$, and the source, $\rho_1$, distributions, e.g. $q(\rho_0, \rho_1) = \rho_0 \rho_1$, it’s not guaranteed that the probability path, generated by the conditional vector field $u_t(x_t|x_0, x_1)$, would be the shortest when measured under an appropriate metric on $\text{SE}(3)$. Shorter, more optimal paths are desirable, as they lead to faster, more stable training and decreasing variance in the training objective 12 . To achieve this, FoldFlow-2 uses a mathematical approach called Optimal Transport (OT). There’s a great introductory lecture on Optimal Transport 21 recorded two years ago, which would be very beneficial for deeper understanding of the subject. A python package POT 22 provides an excellent starting point into OT, as well.

The goal of OT, according to transportation theory 23 , is to find the most efficient way to “transport” one distribution to another, minimizing the total effort. For example, to learn how much iron ore to ship from each mine to each factory so that every factory gets exactly what it needs, every mine sends out all its iron ore, and the total shipping cost is as small as possible. That was an important problem during the World War II.

Formally, Optimal Transport finds the best “transport map” $\Psi$ that minimizes the overall cost of moving all the points. This is captured by the following formula:

$$\text{OT}(\rho_0, \rho_1) = \underset{\Psi: \Psi_{\text{#} \rho_0=\rho_1} }{\text{inf}} \int_{\text{SE}(3)^N_0} \frac{1}{2} c(x, \Psi(x))^2 d\rho_0(x) \tag{13}$$

In other words, eq. 13 defines the minimum possible cost, denoted $\text{OT}(\rho_0, \rho_1)$, to transform an entire data distribution of frames, corresponding to the protein residues, $\rho_0$, into the distribution $\rho_1$, following the best possible transport map, $\Psi$. It’s a pushforward map. The notation $\Psi_{\text{#}}\rho_0$ represents the new distribution formed by applying the map $\Psi$ to all points in the original distribution $\rho_0$. The constraint, $\Psi_{\text{#}}\rho_0 = \rho_1$, dictates that this new, transformed distribution must be identical to the target distribution $\rho_1$. It guarantees that we reshape the entire “cloud” of starting points into the target cloud, rather than just moving points arbitrarily (fig. 5). The cost for moving a single point $x$ to its destination $\Psi(x)$ is determined by the cost function $c(x, \Psi(x))$, which is computed according to the geodesic distance (eq. 3) induced by the metric on $\text{SE}(3)$. The cost formulation heavily penalizes long, inefficient paths and strongly encourages the model to find the most direct transformation route on the manifold.

Optimal Transport map for two continuous distributions $\rho_0$ and $\rho_1$.

Searching for the perfect transport map $\Psi$ for thousands of points is computationally very difficult. The paper employs a well-known shortcut. Instead of solving for $\Psi$ directly, it solves a related, more manageable problem (the Kantorovich formulation of OT) to find an optimal transport plan $\pi$, which is a joint probability distribution minimizing the cost of transporting $\rho_0$ to $\rho_1$. This plan does not define the full map but provides an efficient way to sample corresponding pairs of points $x_0$ and $x_1$. By training on these matched pairs, $(x_0, x_1)$ sampled from $\pi$, FoldFlow-2 ensures a much more efficient and stable learning process. The visual intuition of this pair sampling is presented in Fig. 6. There are several ways how to compute the OT plan $\pi$, but that’s beyond the scope of this post. I can just say that the code uses the POT22 library to achieve this.

Sampling the points from $\rho_0$ and $\rho_1$, using the Optimal Transport plan that minimizes the transportation cost induced by the metric on $\text{SE}(3)$. (a) Sampling without OT. (b) Sampling according to the Optimal Transport plan.

Summary of the Core ML Techniques of FoldFlow-2

In this chapter, we explored the sophisticated machine learning framework that powers FoldFlow-2’s ability to generate protein structures. The process begins by representing protein backbones as a sequence of rigid transformations on the special Euclidean group, $\text{SE}(3)$. This geometric foundation is crucial, as it allows the model to operate on a smooth manifold where distances and paths are well-defined. Alongside that geometric parametrization, FoldFlow-2 uses sequences of amino acids to represent proteins.

The core generative mechanism is Conditional Flow Matching11 , a technique where the model learns a time-dependent velocity field. This field defines a flow that smoothly transports points from the complex data distribution of valid protein structures to the easy-to-sample source distribution.

To further enhance training stability and speed, FoldFlow-2 incorporates Riemannian Optimal Transport7 (OT). Instead of arbitrarily pairing initial and target structures, OT is used to find the most efficient coupling, or “transport plan,” between the two distributions. By training on these optimally matched pairs, the model learns straighter, more direct transformation paths, which reduces the variance of the training objective and leads to a more robust learning process.

With this theoretical foundation in place, I will now examine how these concepts are implemented in FoldFlow-2’s actual architecture.

Model Architecture

The main innovation of FoldFlow-2 in comparison to the original version is the addition of a powerful sequence encoder. At a high level, FoldFlow-2 consists of three main stages that follow a typical Encoder-Processor-Decoder (fig. 7) deep learning paradigm:

Input structure and sequence are passed to the encoder.
Encoded representations are combined and processed in a multi-modal trunk.
Processed representations are sent to the geometric decoder, which outputs a vector field that lies in the tangent space of the $\text{SE}(3)$ group.

FoldFlow-2 architecture. Sequence and structure are processed separately, then combined in the fusion trunk and, whose output is decoded in the geometric decoder module, producing the final vector field in the tangent space of $\text{SE}(3)$.

Structure & Sequence Encoder

Structure encoding is performed with a module based on the Invariant Point Attention (IPA) and the protein backbone update algorithms that have been designed for AlphaFold-2. IPA modifies the standard attention mechanism 24 by making the attention weights depend on distances between key and query points that are two sets of N three-dimensional points where N is a hyperparameter. These points are obtained through a linear projection layer applied to residue features, similar to how standard keys and queries are produced. In order to compute a meaningful distance between them, rigid transformations $x_i$ are applied to the points. The baked-in invariance of IPA and the way the backbone is updated make the module $\text{SE}(3)$-equivariant. You can find more details about the algorithms in the supplementary material 25 of AlphaFold-2. The block’s output is divided into three types of representations that follow the naming convention of AlphaFold-2: single, pair and rigid. Without going too deep into what those embeddings are, I’d like to point out that single representations are essentially transformed residue features, pair representations are computed for each pair of residues, using their features and relative distances, and rigids are elements of $\text{SE}(3)$ group I introduced above that describe each residue in terms of rotations and translations.

The core component of the sequence encoder is a pre-trained frozen ESM-2 26 model with 650M parameters. This protein language model was trained on masked sequences of amino acids and creates high-quality features with strong generalization properties, making them well-suited for downstream tasks. The model extracts embeddings from each transformer layer and the final prediction head, yielding 34 total 1280-dimensional feature vectors per residue as single representations. Additionally, attention weights between all pairs of residues from each layer are stacked together to form the final pair representations. The sequence-to-trunk module then constructs a learned linear combination of those 34 embeddings (called “Learnable Pooling” in the figure) before transforming the result with an MLP. Meanwhile, the pair representations are projected into a lower-dimensional space via an MLP and combined with embedded pairwise distances (fig. 8).

Both modalities are mixed and processed in the multi-modal fusion trunk that consists of two main parts: the combiner module and the trunk blocks.

The combiner (fig. 9) uses dedicated MLPs to project each type of single and pair embedding into a shared latent space with half the original dimensionality. The resulting feature vectors from sequence and structure encodings are then concatenated to create unified single and pair joint representations. These are fed further to the trunk module that is made up of two Triangular Self Attention blocks, which are used as the core units of the Evoformer block in AlphaFold-2. Therefore, the whole component is a compact version of the Evoformer architecture with additional shallow MLP mixing of the input embeddings of two different modalities.

Geometric Decoder

Finally, the structure decoder leverages the IPA transformer once more and decodes its input into $\text{SE}(3)_0^N$ vector fields. $\text{SE}(3)_0^N$ is a translation-invariant version of $\text{SE}(3)^N$ that is constructed by switching to a reference frame centered at the center of mass of all C_α backbone atoms. This module takes as input the single and pair embeddings from the trunk, along with the rigids from the structure encoder. The authors found that adding a skip-connection between the decoder and encoder was crucial for model performance, since it preserved temporal information, which would otherwise be lost within the Evoformer block.

Model Summary

To wrap up this chapter, let me summarize the key aspects of FoldFlow-2’s architecture that I’ve covered. The model follows the standard Encoder-Processor-Decoder approach, with multi-modality supported via fusing sequence and structure representations. Many of its components are inspired by the original AlphaFold-2 algorithms, including IPA, Evoformer, and Backbone Update modules.

My Modification: Rationale, Approach & Implementation

Now, after I’ve talked about the theory and the architecture of FoldFlow-2, I’m ready to walk you through my hands-on experience of modifying and extending the model to deepen my understanding of it even further. I always preferred learning by doing!

So, rather than only studying the model inside out, I wanted to modify its components with two key objectives. First, to potentially improve performance through architectural changes, and second, to explore in more detail equivariant Graph Neural Networks (GNNs) that are ubiquitous in the Bio ML field. I worked with GNNs in the past and I even teach a class on them. However, I didn’t have a lot of practical experience with the geometric equivariant GNN architectures, which are usually implemented with the e3nn 27 python library. Therefore, I was particularly interested in going deeper into current geometric SOTA models. So I thought of an $\text{SE}(3)$-equivariant GNN addition that could be integrated as a modular component, allowing me to easily toggle it on and off, introducing minimal disruption to the base architecture.

I discussed in the previous chapter that FoldFlow-2 uses two encoders, the structure and the sequence one. The structure encoder has the IPA algorithm at its core, which modifies the standard attention by adding the dependence on distances between residues. Even though the whole backbone update rule, as well as the full encoder block, is $\text{SE}(3)$-equivariant, the distance-based method of influencing attention weights is $\text{SE}(3)$-invariant, since distance is just a scalar that doesn’t change under group actions. Essentially, attention weights between residues $i$ and $j$ are larger if the residues get closer to each other. Unfortunately, this technique offers limited expressivity when compared to other architectures, which leverage 3D-positional information in a more “flexible” way, by constructing features that transform under actions of $\text{SE}(3)$ equivariantly.

My approach was to integrate a sophisticated $\text{SE}(3)$-equivariant GNN encoder that operates directly on atomic coordinates, working alongside the existing structure and sequence encoders. I borrowed an idea from the self-conditioning technique 28 that has shown good results in generative modeling. I sought to enhance FoldFlow-2’s existing self-conditioning mechanism. While the original implementation simply added a distogram of binned predicted relative positions to the pair representations 50% of the time, I wanted to try out a more advanced option, hoping to get better expressivity and overall results.

Since the model’s predicted rigids (elements of $\text{SE}(3)$) contain backbone atom coordinates as their translation components, I designed a system where the model’s own structural predictions from the previous step are fed to the separate GNN encoder during half of the training iterations (Fig. 10). To maintain architectural simplicity, the GNN encoder outputs only single representations that are processed through the combiner module together with the embeddings from the other two encoders. I found this approach the least invasive, though reasonable, way to incorporate an additional encoder while preserving the core architecture. By conditioning on its own predictions, the model can iteratively refine its structural outputs. This strategy has been shown to significantly improve the quality of generated samples28 .

Augmented architecture of FoldFlow-2 with an additional MACE-based structure encoder conditioned on the model's own predictions 50% of time.

With the overall strategy defined, the next critical choice was the specific architecture for this new $\text{SE}(3)$-equivariant encoder. For this role, I needed a model that could capture complex, interatomic interactions more expressively than the simple distance-based mechanisms found in the original structure encoder. Next, let me introduce the selected model, the Multi-Atomic Cluster Expansion (MACE) network, and the architectural details that make it particularly well-suited for this task.

$\text{SE}(3)$-equivariant MACE Encoder

The MACE 29 architecture is a great example of a geometric equivariant tensor field network. It operates with internal features that are not just scalars but objects that transform equivariantly under group actions of $\text{SE}(3)$. Strictly speaking, MACE was built for actions of the $\text{O}(3)$ group, but since we’re working with relative positions, which guarantees translational invariance, the whole model is not only $\text{SE}(3)$-equivariant, but is equivariant to 3D reflections, as well. This property is achieved by representing features according to the irreducible representations of the group $\text{O}(3)$, ensuring equivariance and aiming at more accurate modeling of physical properties of atomic environments. The foundation of this type of architecture was first developed in the seminal paper on Tensor Field Networks 30 , which I encourage you to read if you’ve never encountered this concept before.

The main ingredient to construct MACE features is to make use of spherical harmonics 31 , $Y_m^l: \mathbb{S}^2 \to \mathbb{R}$. These are functions defined on a sphere $\mathbb{S}^2$ which form an orthonormal basis, making it possible to decompose any function on a sphere into a linear combination of spherical harmonics. The second important quality of spherical harmonics is that they transform predictably (equivariantly) under rotations or, more formally, according to an action of irreducible representations of the $\text{SO}(3)$ group called Wigner D-matrices 32 (eq. 14).

$$ (\hat{\mathcal{R}}Y_m^l)(\mathbf{x}) = \sum_{m'=-l}^{l} Y_{m'}^l(\mathbf{x}) D^l_{m'm}(R), \tag{14}$$

where $\hat{\mathcal{R}}$ is the operator that acts on functions when the coordinate system is rotated by R and $D^l_{m’, m}$ are the elements of the Wigner D-matrix, the $(2l+1) \times (2l+1)$ irreducible matrix representation of order $l$ of the rotation $R$. Therefore, building features, using spherical harmonics, would facilitate desired equivariance. Unfortunately, I can’t go into more detail of this fascinating and complex topic here without deviating too much from our original focus, so I’ll leave it for the future in-depth post.

MACE’s primary innovation, however, lies in its efficient construction of higher body order messages. Unlike traditional Message Passing Neural Networks (MPNNs), which are limited to pairwise (two-body) interactions at each layer, MACE creates messages that explicitly incorporate correlations between multiple nodes simultaneously, e.g. three- and four-body interactions. This is accomplished through a clever use of tensor products, which bypasses the typical exponential computational cost associated with higher-order terms. This ability to model interactions between several neighboring atoms simultaneously has demonstrated significant gains in sample efficiency and accuracy on a number of atomic benchmark datasets29 . The MACE variant of many-body message passing is shown below:

$$m_i = \sum_{j} \mathbf{u_1}(x_i; x_j) + \sum_{j_1, j_2} \mathbf{u_2}(x_i; x_{j_1}, x_{j_2}) + ... + \sum_{j_1, ..., j_{\nu}} \mathbf{u_{\nu}}(x_i; x_{j_1}, ..., x_{j_{\nu}}), \tag{15}$$

where $m_i$ is the message of node $i$, $x$ are the features of the nodes, $\mathbf{u}$ are learnable functions, the summation happens over the neighbors of node $i$, and $\nu$ is a hyper-parameter corresponding to the maximum body order minus one, i.e. the maximum number of neighbors used for construction of the message for node $i$29 .

After this brief introduction to the theoretical foundations of MACE, I will cover the next crucial encoder implementation step. I needed to decide how to construct an appropriate graph representation of protein structures.

Graph Construction

For the MACE encoder to process protein structures effectively, I needed to transform the sequential backbone representation into an appropriate graph format. The graph construction strategy involved several design decisions that balanced computational efficiency with chemical realism.

I chose to represent each protein residue as a single node positioned at its C$_{\alpha}$ atom. This choice is well-motivated from a structural biology perspective and is a common option in protein modeling. Being the central atom of each amino acid (fig. 2), they form the backbone of the protein and capture the overall fold geometry. Also, they serve as origins of the frames of the backbone parametrization used in FoldFlow-2. This abstraction reduced computational complexity, by decreasing the number of nodes substantially, while preserving the essential geometric information.

To capture both local and non-local interactions crucial for protein folding, I used two complementary methods to connect the C$_{\alpha}$ atoms. First, I connected all nodes within a radius of 5 $\mathring{A}$. This distance threshold captures direct physical interactions and is often used as it covers typical ranges for hydrogen bonds and van der Waals contacts. To ensure no nodes remained disconnected, which could happen for residues in extended or disordered regions, I connected the remaining nodes, following the standard kNN approach. This guaranteed full graph connectivity and added potentially helpful long-distance edges, offering the way to model non-local interactions.

To prevent computational explosion for large proteins, I capped the maximum number of edges using a constant $E_{max}$ calculated as the fraction of the number of edges of a complete graph (equal to $N^2$ for $N$ residues).

Summary

Overall, integrating the MACE-based encoder proved to be a valuable experiment. First, it provided a practical experience in working with a state-of-the-art equivariant GNN and, second, I was able to get theoretical knowledge of math foundations of similar models. In the next chapter, I will detail the training process and demonstrate a comprehensive evaluation of the final models, bringing this post closer to its conclusion.

Code Availability

The complete implementation of FoldFlow-MACE, including all modifications and training scripts, is available on GitHub: https://github.com/stanislav-chekmenev/foldflow-mace.

Results and Insights

After months of going over and over the papers that cover the theoretical part (I really wanted to understand everything), I was ready to implement my architectural modifications. It took me a solid 1-1.5 months to study and fully grasp the codebase. It was especially tedious to find my way around the initial data preparation part, since it has many innovations introduced in AlphaFold-2, such as the backbone parametrization, and, to be honest, it’s not an easy code to read. For the most part, the code is well-structured, pretty clean, and has helpful comments in most of the places where a comment should be.

Initially, I just ran FoldFlow-1 on one protein structure and went through it step by step in the debugger to see what was going on. Then I switched to FoldFlow-2 and soon enough was able to start modifying its architecture. When I finished implementing my additional MACE-based encoder, I ran a few preliminary debugging runs on a tiny batch of two short proteins, consisting of 60 amino acids each. The runs indicated good training convergence for both models, the base FoldFlow-2 and my augmented version, which I’ll call FoldFlow-MACE for simplicity. However, they couldn’t demonstrate any difference in the results, so I was eager to start a full-scale training. Let me write some information about the setup, which will be important when I come to the interpretation of the results.

Training Details

I was lucky to get access to a small cluster that had 8 H100 80Gb GPUs, but only for a limited period of time. Therefore, due to these constraints I had to find a feasible compromise between the settings that would result in the best model’s performance and the ones that would fit training and experimentation into my available computational budget.

Simplified Setup

Since I wanted to test my hypothesis that the MACE-based encoder would improve the overall quality of the generated structures compared to the benchmark FoldFlow-2 model, I simplified the setup by removing some components and running training with and without my additional encoder. This approach would still produce a valid comparison between the two methods, though it would likely fail to reproduce the original paper’s results.

Coupling

With this goal in mind, I turned off the Optimal Transport coupling and used random pairing between samples from the data distribution $\rho_0$ and the uniform source distribution $\rho_1$.

Trunk Module

The base FoldFlow-2 model had the Evoformer, composed of two transformer blocks, in place of its trunk module. I used an identity transformation instead, essentially eliminating the main trunk. While being fully aware that this would lead to a drop in performance, I took this step to significantly decrease the training time. Moreover, the first version of FoldFlow didn’t have this component either, and yet was comparable to the state-of-the-art architectures.

Data

Both models were trained exclusively on a subset of PDB structures ranging from 200 to 300 amino acid residues in length. I adopted the batching strategy from the original implementation, where each batch contained data points sampled uniformly over clusters of proteins with 30% similarity. Consequently, each batch contained different time steps (different points along the geodesic between $\rho_0$ and $\rho_1$) from one protein from a cluster. For details, see the original FoldFlow-1 paper7 . Since the GPUs I had access to possessed twice as much memory as those used by the FoldFlow-2 authors, I doubled the effective batch size, which was determined by the square of the maximum number of protein residues. This setup could run smoothly on two H100 80Gb GPU cards with around 80-90% of utilization.

All other settings followed the default ones provided by the original implementation. Let me briefly describe the configurations I used for the MACE encoder.

MACE Configuration

First and foremost, I was utterly surprised by how much GPU memory the model required. The bottleneck was in the computation of the equivariant tensor product. This process increased the memory consumption by a factor of three, so I was forced to run the training on 6 H100 GPUs.

Model

Despite the fact that I was using 6 powerful GPUs, the memory constraints still prevented me from using a large number of wide hidden layers. I ended up with two 64-dimensional hidden layers that operated with equivariant features of degree two, in other words, features that transform as degree-two spherical harmonics. The correlation order, responsible for the number of neighbors used for message construction, was set to three. A final 128-dimensional equivariant linear projection layer served as the model’s head, converting hidden equivariant features into invariant scalars (features of degree zero). This step followed a standard geometric deep learning blueprint 33 , where equivariance is maintained between the layers of a model, but the output is invariant. The architecture of my encoder is presented in fig. 11.

MACE encoder architecture. Each atom is initially embedded in $\mathcal{R}^{64}$. Equivariant edge features are constructed via spherical harmonics computed for each pair of connected nodes, using their relative displacements. The features are processed via the higher order equivariant message passing. Final node embeddings are 128-dimensional vectors of scalars. Note, that radial edge embeddings are omitted here for simplicity.

Graph density

Another crucial aspect was the maximum number of edges in the graph. As I mentioned earlier, the graph was constructed dynamically, and I had to choose an appropriate upper bound on the edge count, since going for a complete graph wasn’t an option here. A complete graph wouldn’t fit into memory during the tensor product operation, and the amount of long-range interactions would be too large, as well. Thus, I capped the total number of edges to be around 15% of the total number of nodes squared. This limit was applied only to the radius-based graph construction step and was ignored by the kNN algorithm that ensured each node had at least one connected neighbor.

With both the baseline FoldFlow-2 model and my augmented variant configured and ready, I proceeded to train both architectures. Finally, I can compare the models and draw my conclusions.

Results

Both models were trained for 200K steps, and once I observed the evaluation metrics reaching a plateau, the training was terminated. To evaluate model performance, I tracked several metrics during training, running evaluation every 5K steps.

Training Loss Analysis

FoldFlow-MACE took approximately twice as long to train. Unfortunately, I couldn’t see any significant difference in the training loss plotted in fig. 12. The objective in eq. 12 is very noisy, making it difficult to interpret raw values. For clarity, I present the exponential moving average of the full loss instead.

The exponential moving average (EMA) of the sum of rotation and translation losses on the train set for FoldFlow-2 and FoldFlow-MACE.

Evaluation Metrics

The measure of model quality can be seen through the structural evaluation metrics. My primary criterion was the TM-score that measures similarity between two protein backbones $x_0$, $x_1 \in \text{SE}(3)^N$ according to eq. 16:

$$\text{TM-score}(x_0, x_1) = \text{max} \left[ \frac{1}{N_{\text{target}}} \sum_i^{N_{\text{common}}} \frac{1}{1 + \left( \frac{d_i}{d_0 N_{\text{target}}} \right)^2} \right], \tag{16}$$

where $N_{\text{target}}$ is the length of the target protein, $N_{\text{common}}$ is the length of the common sequence after 3D structural alignment, $d_i$ is the distance between the C$_\alpha$ atoms of the aligned structures, and $d_0 = 1.24 (N -15)^{1/3} - 1.8$ is a scaling constant that normalizes across protein lengths. The maximum is taken over all possible structural alignments. The TM-score ranges from 0 to 1, where values close to 1 indicate perfectly aligned backbones. Scores above 0.5 suggest structural similarity, while scores below 0.2 indicate unrelated proteins.

Although the full loss seemed to stop improving after 100K steps, the TM-score continued to grow and stabilized around 200K steps. This behavior underlines the importance of validation metrics. Below, I present four plots (fig. 13) showing a small subset of evaluation metrics, including the TM-score alongside other important structural characteristics.

Additionally, I paid a lot of attention to two other quantities that characterize protein quality: the average proportion of $\alpha$-helices and $\beta$-strands. Both secondary structures are observed in natural proteins, and this diversity should also be reflected in newly designed structures. Another important metric I tracked was the average ratio of valid distances between consecutive C$_\alpha$ atoms in the backbone. This distance is approximately constant in natural proteins at 3.8 $\mathring{A}$. Therefore, generated structures should also satisfy this geometric constraint.

Evaluation metrics for generated protein structures. a) TM-score between generated and ground truth backbones. b) Average proportion of $\alpha$-helices. c) Average proportion of $\beta$-strands. d) Average fraction of valid distances between consecutive C$_\alpha$ atoms in the backbone ($\approx$ 3.8 $\mathring{A}$).

Final Evaluation via Refolding Procedure

Looking at the plots shown in fig. 13, I wasn’t able to conclude which model was better, but it was already clear to me that I couldn’t count on a big improvement. Lastly, I ran the final evaluation, following the standard way to assess the quality of designed proteins, which is widely accepted in the literature despite its shortcomings. Let me briefly introduce you to it. It is based on the refolding procedure pictured in fig. 14:

First, the model generates protein structures across various lengths. I generated 50 proteins for each target length: 200, 225, 250, 275, and 300 amino acids.
Secondly, the ProteinMPNN model is used for inverse folding. It takes each generated structure and turns it into 8 possible amino acid sequences.
Thirdly, ESMFold refolds each of these sequences back into a 3D structure.
Finally, the root mean squared deviation is computed between the originally generated backbone and its refolded counterparts. This metric is called self-consistency RMSD and is given in eq. 17.

Self-consistency RMSD — Computation of self-consistency RMSD.

$$\text{RMSD}(x_0, x_1) = \sqrt{\sum_{i=1}^L \frac{d_i^2}{L}}, \tag{17}$$

where $d_i$, exactly as in the TM-score formula (eq. 16), is the distance between four heavy atoms of the $i^{th}$ residues.

The self-consistency RMSD metric forms the basis for computing three performance metrics: designability, novelty, and diversity.

Designability is computed as the fraction of generated proteins with scRMSD $< 2.0 \mathring{A}$. Diversity represents the structural variation, calculated as the average pairwise TM-score of all designable structures (lower is better). Novelty measures how different the generated proteins are from known structures, defined as the average fraction of designable proteins with TM-score $< 0.5$.

For both models, I present the average scRMSD alongside these three derived metrics in table 1. All averages are computed across the five protein lengths mentioned above (200, 225, 250, 275, and 300 amino acids). Since diversity is caclulated as the average pair-wise distance of designable proteins, each estimate of the mean is correlated and produces a wrong estimate of the standard error. This is the reason it’s not shown in the table.

	FoldFlow-2	FoldFlow-MACE
Designability ($\uparrow$)	0.240 ± 0.093	0.104 ± 0.026
Diversity ($\downarrow$)	0.541	0.492
Novelty ($\uparrow$)	0.0	0.0
scRMSD ($\downarrow, \mathring{A}$)	5.114 ± 0.653	7.685 ± 0.779

Comparison of protein design evaluation metrics between FoldFlow-2 and FoldFlow-MACE.

Interpretations & Conclusions

As anticipated, the FoldFlow‑2 model’s performance did not match the original paper’s results for several reasons. My training set consisted only of 3689 data points, around 6 times smaller than the size of the dataset used in the FoldFlow‑1 paper7 . In addition, FoldFlow‑2 included filtered AlphaFold‑2 structures from the SwissProt dataset, bringing the total to approximately 160K data points5 . Given the absence of the Evoformer, Optimal Transport, and RL finetuning in my setup, I was satisfied with the baseline metrics. I also attribute the zero novelty score to these same factors.

Although FoldFlow-MACE’s performance on the validation metrics shown in fig. 13 was on par with the baseline, the final results presented in table 1 tell a different story. The explanation for this discrepancy likely arises from an important architectural detail interacting with the nature of the generative process and the different sensitivities of the evaluation metrics.

My primary hypothesis is that the FoldFlow-MACE model underperformed due to the lack of a skip connection between the MACE encoder and the decoder module. The authors of the original paper highlighted the importance of this connection for the IPA-based encoder, and its absence for the MACE branch probably plays a big role.

This architectural flaw was exacerbated by two key conditions. First, the model’s C$_\alpha$ coordinate predictions that were fed into MACE were very noisy during the early stages of generation when time $t$ is close to one. Second, the MACE-based encoder was further hindered by operating on a highly sparse graph, using only 15% of the potential edges. An encoder learning complex local physics from such an incomplete view of the atomic neighborhood is bound to produce a less reliable signal. Consequently, when MACE processed this noisy input, its feature embeddings likely contributed more “structured noise” than helpful refinement. Without a direct skip connection, this noisy signal was ineffectively integrated, while the decoder remained primarily guided by the globally-aware IPA embeddings arriving via its clean path.

This explains the divergent results. The TM-score, a more forgiving metric of global topology and can be robust to this subtle noise. As long as the IPA branch ensures the overall fold is correct, the TM-score remains high. In contrast, the scRMSD pipeline acts as a sensitive biophysical test. It failed because the structured noise brought by the MACE encoder created just enough geometric and physical implausibility to render the final structure undesignable. It’s also worth noting that the scRMSD metric itself is imperfect, as it relies on the predictions of two other models, ProteinMPNN and ESMFold, which have their own inherent limitations.

In conclusion, a promising strategy for improving FoldFlow-MACE would be to add the missing skip connection between the encoder and decoder and restrict MACE to the later stages of generation, similar to how auxiliary losses are applied ($t < 0.25$). However, I decided against pursuing these improvements due to the considerable GPU memory overhead that the MACE addition introduced.

Final Remarks

Thank you for reading through this deep dive into protein design with $\text{SE}(3)$ Flow Matching. I hope this exploration of FoldFlow-2 and my experiments with a MACE-based encoder were helpful for you and, perhaps, could be a good place to start your own personal journey into this fascinating field.

Acknowledgements

I’d like to say a big thank you to the authors7 of the FoldFlow model family for open-sourcing their code and providing helpful notebooks with toy examples of Flow Matching on the $\text{SO}(3)$ manifold. These resources were a great starting point for diving into their codebase.

I’m also grateful to the authors10 of FrameDiff, whose code became the foundation for FoldFlow and whose data preprocessing pipeline made it possible for me to get the training data I needed.

Special thanks to Antonio Rueda-Toicen, Mario Tormo and the whole KI-Servicezentrum team at Hasso Plattner Institut for helping me to get access to the computational resources that made this research possible.

Finally, I should mention that this article was written with help from generative AI tools like GPT-4.1, Claude Sonnet 4, and Gemini 2.5 Pro, which helped me structure and polish the text. But all the ideas and responsibility for the content are mine.

References & Useful Links

In the order of appearance in the text:

Goodsell, Dutta, Molecule of the month, 2003. ↩
Heme group, Wikipedia. ↩
Jumper et al., Highly accurate protein structure prediction with AlphaFold. Nature, 2021 ↩
Gainza et. al., De novo design of protein interactions with learned surface fingerprints. Nature, 2023 ↩
Huguet et. al., Sequence-augmented $\text{SE}(3)$-flow matching for conditional protein backbone generation. NeurIPS, 2024 ↩
Dreamfold. ↩
Bose et. al., $\text{SE}(3)$-Stochastic flow matching for protein backbone generation. ICLR, 2024 ↩
Huang et. al., Riemannian diffusion models. NIPS, 2022 ↩
Watson et. al., De novo design of protein structure and function with RFdiffusion. Nature, 2023 ↩
Yim et. al., $\text{SE}(3)$ diffusion model with application to protein backbone generation. PMLR, 2023 ↩
Lipman et. al., Flow matching for generative modeling. ICLR, 2023 ↩
Tong et. al., Improving and generalizing flow-based generative models with minibatch optimal transport. TMLR, 2024 ↩
Hall, Lie groups, Lie algebras, and representations. Springfield, 2013 ↩
Tangent space, Wikipedia. ↩
Skew-symmetric matrix, Wikipedia. ↩
Riemannian manifold, Wikipedia. ↩
Geodesic, Wikipedia. ↩
Fjelde, Mathieu, Dutordoir, An introduction to flow matching. 2024 ↩
Chen et. al., Neural Ordinary Differential Equations. NIPS, 2018 ↩
Axis-angle representation, Wikipedia. ↩
A primer on optimal transport theory and algorithms, Youtube. ↩
POT package. ↩
Transportation theory, Wikipedia. ↩
Vaswani et. al., Attention is all you need.. NIPS, 2017 ↩
Jumper et. al., Supplementary information for AlphaFold-2. Nature, 2021 ↩
Lin et. al., Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 2023↩
e3nn package. ↩
Chen et. al., Analog bits: generating discrete data using Diffusion models with Self-Conditioning. ICLR, 2023 ↩
Batatia et. al., MACE: higher order equivariant message passing neural networks for fast and accurate force fields.
NIPS, 2022 ↩
Thomas et. al., Tensor field networks: rotation- and translation-equivariant neural networks for 3D point clouds. NIPS, 2018 ↩
Spherical harmonics, Wikipedia. ↩
Wigner D-matrix, Wikipedia. ↩
M. Bronstein et. al.,Geometric deep learning: grids, groups, graphs, geodesics and gauges.2021↩

Transitioning to Bio ML: My Experience Learning and Modifying FoldFlow-2