This is a blog post on ongoing research that formulates deep learning as a fluid dynamics problem.
In this exploratory note, the link between deep learning and fluid dynamics is developed. It allows us to study the limit of a deep learning system when both the number of neurons and the number of layers approach infinity. Recently, there has been interest in formulating deep learning problems in terms of dynamical systems (E, 2017). In fact, dynamical systems have been used in many neural network architectures, such as recurrent neural networks, echo state machines, and spiking neurons, to name a few. For the case of spiking neurons, the continuum limit yields a hyperbolic partial differential equation, and the resulting field of study is known as neural field theory. In those architectures, dynamical systems are used to compute the neural network’s activation, while the recent focus is on using dynamical systems as a substitute for the whole architecture (E, Han, & Li, 2019). Furthermore, the connection between training neural networks and optimal control has been an ongoing research topic. It is natural to think of optimal control as a method for training neural networks: both problems aim to find parameters that make a system produce a certain output when presented with specific inputs. There has also been interest in using a more general form of optimal control, namely stochastic optimal control via mean field games. What is very interesting about this approach is that it provides a pathway for formulating stochastic control problems as fluid dynamics problems. For the deterministic optimal control case, it was shown that the control vector field does, indeed, satisfy the incompressible Euler fluid equation (Bloch, Holm, Crouch, & Marsden, 2000).
Here, training deep learning architectures is presented as an incompressible Euler fluid equation, albeit with an extra potential term. This is achieved by first translating the deep learning network into a dynamical system; training is then cast as a mean field optimal control problem. In addition, the weights of a radial basis function network, constructed in an elaborate way, are shown to satisfy an integrable system of equations known as the Camassa-Holm equation (Camassa & Holm, 1993).
The key to casting deep learning problems as optimal control problems is to consider residual neural networks as an Euler discretisation of a dynamical system (E, 2017). A network with $N$ layers is then written as
where $t = 0, \dots, N-1$. The initial condition $x(0) = x_0 \in \mathbb{R}^d$ is, in neural network language, the input layer and $x(N) = x_T \in \mathbb{R}^d$ is the output layer. The variable $\theta_t$ is the trainable parameter and $f$ is the activation function. Taking the limit as $N \to \infty$, one recovers the continuous-time dynamical system for ResNet
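To make the discrete-to-continuous correspondence concrete, here is a minimal NumPy sketch (the tanh activation and random parameters are illustrative choices, not the author's) of the residual update $x_{t+1} = x_t + f(x_t, \theta_t)$ and its refinement towards the ODE limit as the number of layers grows:

```python
import numpy as np

def f(x, theta):
    """Layer vector field: a generic parametrised activation (illustrative choice)."""
    W, b = theta
    return np.tanh(W @ x + b)

def resnet_forward(x0, thetas, dt):
    """Explicit Euler steps x <- x + dt * f(x, theta_t); dt = 1 gives the usual
    residual block, while dt = T/N with large N approximates the ODE flow."""
    x = x0
    for theta in thetas:
        x = x + dt * f(x, theta)
    return x

d, N, T = 4, 100, 1.0
rng = np.random.default_rng(0)
thetas = [(rng.normal(size=(d, d)) / d, rng.normal(size=d)) for _ in range(N)]
x0 = rng.normal(size=d)
xT = resnet_forward(x0, thetas, dt=T / N)   # output layer x(T) in the continuum view
print(xT)
```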
The solution of the discrete-time system $x_{t+1}$ at time $t = T$ is the output layer. For each output $x_{i,T}$, the goal is to be as close as possible to the target $y_i$ with respect to a specified metric. To control how the vector field $f$ achieves this desideratum, the parameter $\theta_t$ is tuned so that the said metric is at a minimum. This is, in essence, an optimal control problem. First, let us clarify what is meant by optimal control; the reader is encouraged to check the following resources for a detailed exposition on the subject: (Bullo & Lewis, 2004), (Agrachev & Sachkov, 2013), and (Jurdjevic, 1996).
The continuous system associated with the ResNet neural network $\Sigma = (\mathcal{X}, f, \mathcal{U})$ is
This formulation is still general, but here are some of the important specific cases
Another way of looking at continuous deep neural networks, which we will further study later, is when the whole vector field $f$ is considered the control parameter.
Before delving into the rest of the article, a few notations are introduced.
We define what was meant earlier by the metric, which is called the objective function in the language of control theory.
The controlled trajectory solves the differential equation that describes the neural network. The aim is to minimise the objective function while steering the controlled trajectory from $x_0 \in \mathcal{X}$ to $x_T \in \mathcal{X}$. We then define the neural network training problem as follows
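A standard way to state this problem, in the Bolza form used in optimal control (the specific running cost $L$ and terminal cost $\Phi$ are the author's choices, introduced later), is:

$$
\min_{\theta(\cdot) \in \mathcal{U}} \; \int_0^T L\bigl(x(t), \theta(t)\bigr)\, dt \;+\; \Phi\bigl(x(T), y\bigr)
\qquad \text{subject to} \qquad \dot{x}(t) = f\bigl(x(t), \theta(t)\bigr), \quad x(0) = x_0 .
$$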
One method that generalises the calculus of variations is Pontryagin’s Maximum Principle, which provides necessary conditions for optimality. In general, there are two routes one can take to solve an optimal control problem: Pontryagin’s Maximum Principle or the dynamic programming principle. Each approach has its own pros and cons; however, for stochastic optimal control, dynamic programming is the de facto method, as it extends simply from the deterministic case. When the dynamic programming principle is applied, the continuous limit of the result is the celebrated stochastic Hamilton-Jacobi-Bellman equation. In the deterministic case, the Hamilton-Jacobi-Bellman equation is considered a generalisation of the Hamilton-Jacobi equation, and in the stochastic case it is essentially a deformation of the Hamilton-Jacobi equation. That being said, the Pontryagin Maximum Principle approach is very difficult to apply to stochastic systems. Here, we still consider it for stochastic optimal control problems, but instead of the stochastic system as the constraint, the probability density is substituted for the system.
Because it is indispensable here, we recite Pontryagin’s Maximum Principle without proof. The proof of its statement in a geometric setting can be found in (Agrachev & Sachkov, 2013), (Sussmann, n.d.), and the supplementary chapter of (Bullo & Lewis, 2004). At the core of Pontryagin’s Maximum Principle is the construction of a Hamiltonian system whose solutions solve the optimal control problem. To state the principle, we first need to define the Hamiltonian and its associated family.
In other words, $\lambda$ is an integral curve of the Hamiltonian vector field associated with $H_{\Sigma}^{max}$. Along with that, it also satisfies the following transversality conditions: $\lambda(0) \in ann(T_{q(t_0)}q_0)$ and $\lambda(1) \in ann(T_{q(t_1)}q_T)$; we either have $\lambda_0 = 1$ or $\lambda(0) \neq 0$; and, finally, there exists a positive constant $C \in \mathbb{R}$ such that the maximum Hamiltonian is constant, i.e. $H_{\Sigma}^{max}(\lambda(t)) = C$.
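In coordinates, the Hamiltonian family referred to in the statement, its pointwise maximum, and the Hamiltonian system an extremal must satisfy take the standard form (written here in the usual normalisation with abnormal multiplier $\lambda_0$):

$$
H_{\Sigma}(\lambda, x, u) = \left\langle \lambda, f(x, u) \right\rangle - \lambda_0\, L(x, u),
\qquad
H_{\Sigma}^{max}(\lambda, x) = \max_{u \in \mathcal{U}} H_{\Sigma}(\lambda, x, u),
$$

$$
\dot{x} = \frac{\partial H_{\Sigma}^{max}}{\partial \lambda},
\qquad
\dot{\lambda} = -\frac{\partial H_{\Sigma}^{max}}{\partial x} .
$$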
The proof of this theorem is very technical and requires time, so it is left out of these notes; the reader may wish to consult (Agrachev & Sachkov, 2013) for a detailed exposition of Pontryagin’s Maximum Principle. Special attention is needed for the last statement, $h(\lambda)$, which implies that it is a constant of motion along the integral curve of the covector $\lambda^{\ast}$ for the dynamics governed by the Hamiltonian vector field associated with $H_{\Sigma}^{max}$. Basically, the solution of the Hamiltonian system
Consider the stochastic control system given by
where $u: \mathcal{X} \to \mathbb{R}^N$ is the drift, $\sigma_k : \mathcal{X} \to \mathbb{R}^N$ are the diffusion vectors, and both $u$ and $\sigma_k$ are assumed to satisfy the regularity and linear growth conditions that guarantee the existence and uniqueness of the solution $X(t)$. $W_t$ are independent and identically distributed Brownian motions adapted to a filtration $\mathcal{F}_t, t\geq 0$. The stochastic process $X(t)$ is a Markov process and its transition probability $P(X(t_0), t_0| X(t_1), t_1)$, i.e. the probability of transitioning from the state $X(t_0)$ at time $t_0$ to the state $X(t_1)$ at time $t_1$, satisfies:
Consider a function $f \in C^2(\mathbb{R}^N)$; its rate of change along a stochastic path is obtained using Itô’s lemma, and then taking the expected value with $P(X(t_0), t_0| X(t_1), t_1)$ as the probability measure. Integration by parts then yields the Fokker-Planck equation, also known as the Kolmogorov forward equation, given by
where $\rho$ is the probability density function. Going back to the optimal control problem: instead of using Pontryagin’s Maximum Principle with the differential equation for $X(t)$, the Fokker-Planck equation is used as the constraint. The cost function here is chosen to be the expected value of the cost Lagrangian.
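For reference, with the drift $u$ and diffusion vectors $\sigma_k$ above, the Fokker-Planck equation takes the standard Itô form, and the expected cost Lagrangian then becomes an integral against $\rho$ (both written here under the stated regularity assumptions):

$$
\frac{\partial \rho}{\partial t} = -\nabla \cdot \left( u\, \rho \right) + \frac{1}{2} \sum_{k} \partial_i \partial_j \left( \sigma_k^i \sigma_k^j\, \rho \right),
\qquad
J[u] = \mathbb{E}\!\left[ \int_0^T L\bigl(X(t), u\bigr)\, dt \right] = \int_0^T \!\! \int L(x, u)\, \rho(t, x)\, dx\, dt .
$$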
A similar construction, albeit for deterministic systems, was introduced by Benamou and Brenier in (Benamou & Brenier, 2000), where they studied the Monge-Kantorovich problem of finding a map $f$ that transports the density $\rho_0$ to $\rho_T$ while at the same time minimising the 2-Wasserstein metric
Both densities satisfy $\int \rho_0(x) dx = 1$ and $\int \rho_T(x) dx = 1$. The solution $f(x)$, which preserves the boundedness of the densities, is a pullback of the density, i.e. $f^{\ast}( \rho_T(x)) = \rho_0(x)$. Furthermore, the map $f$ can be written as the gradient of a convex function $\Psi$, which turns the pullback relation into
which is the Monge-Ampère equation. This equation is very difficult to solve, even numerically, and for this reason (Benamou & Brenier, 2000) reformulated it as a fluid mechanics problem. The Monge-Kantorovich problem is equivalent to minimising the following functional
which is constrained by the transport equation
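For reference, the Benamou-Brenier reformulation minimises the kinetic-energy functional over density-velocity pairs subject to the continuity (transport) equation; this is the standard form from (Benamou & Brenier, 2000):

$$
\inf_{(\rho, v)} \int_0^T \!\! \int \tfrac{1}{2}\, \rho(t,x)\, |v(t,x)|^2 \, dx\, dt,
\qquad
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, v) = 0,
\quad \rho(0,\cdot) = \rho_0, \;\; \rho(T,\cdot) = \rho_T .
$$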
The adjoint equation, which describes the evolution of the Lagrange multiplier, is the Hamilton-Jacobi-Bellman equation
where the optimality condition is $v(t,x) = \nabla \lambda$. Taking the gradient of the Hamilton-Jacobi-Bellman equation, we obtain the pressureless Euler equation
In (E, Han, & Li, 2019), the residual neural network, ResNet, is considered as part of the control tuple $\Sigma = (\mathcal{X}, f, \mathcal{U})$, and this motivated our investigation. Instead of ResNet, we consider a different type of neural network known as a radial basis function network, RBFN for short (Moody & Darken, 1989). A radial basis function network is a feedforward network with only three layers: input, hidden, and output. For the activation function, as the name suggests, it uses radial basis functions. That said, research on deep radial basis function networks has not been very active, and care is taken with this subject here. As with any radial basis function, the function’s argument is the distance between two points; here, the distance is between the input, $x_i$, and the centre of a neuron, $c_i$. The smaller the distance between the input and the centre, the larger the output of the function, and vice versa. The activation functions of the hidden neurons are then combined to yield the final output
where $\omega_i$ are the weights of the hidden neurons; they determine the influence of each neuron on the output.
Training this type of neural network is done in two stages: the first uses a clustering algorithm, such as $k$-means clustering, to determine the centres $c_i$; the second determines the weights $\omega_i$.
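A minimal sketch of that two-stage procedure, assuming scikit-learn's KMeans for the centres and an ordinary least-squares solve for the weights (the Gaussian width `gamma` and the number of centres are illustrative choices, not values from the text):

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, centres, gamma=1.0):
    """Gaussian activations exp(-gamma * |x - c|^2) for every input/centre pair."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def train_rbfn(X, y, n_centres=10, gamma=1.0):
    # Stage 1: place the centres c_i with k-means clustering.
    centres = KMeans(n_clusters=n_centres, n_init=10).fit(X).cluster_centers_
    # Stage 2: solve for the output weights omega_i by linear least squares.
    Phi = rbf_features(X, centres, gamma)
    omega, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return centres, omega

def rbfn_predict(X, centres, omega, gamma=1.0):
    return rbf_features(X, centres, gamma) @ omega
```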
Here we use the radial basis function network in a slightly different way. The first difference is that the number of outputs is the same as the number of inputs. The second difference is that the network is not implemented in a purely feedforward architecture; instead, the outputs are fed back as inputs. In mathematical terms, that is
and the continuous-time dynamical system is
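A one-dimensional sketch makes the feedback recursion explicit: every state is advected by the Gaussian RBF field generated collectively by all current states (the states, weights, and step size below are illustrative, not taken from the text):

```python
import numpy as np

def rbf_field(x, centres, omega):
    """u(x) = sum_i omega_i * exp(-(x - c_i)^2), a Gaussian RBF field in one dimension."""
    return np.sum(omega * np.exp(-(x - centres) ** 2))

def feedback_step(xs, omega, dt=0.1):
    """Outputs fed back as inputs: each state x_j moves along the field generated
    by all current states, x_j <- x_j + dt * sum_i omega_i exp(-(x_j - x_i)^2)."""
    return xs + dt * np.array([rbf_field(x, xs, omega) for x in xs])

xs = np.array([-1.0, 0.0, 1.5])        # scalar states, one per training point
omega = np.array([0.5, -0.2, 0.3])     # trainable weights
for _ in range(50):
    xs = feedback_step(xs, omega)
print(xs)
```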
Here we formulate the training of the radial basis function network as an optimal control problem. Consider the training data $(\mathbf{x_{j}}(0), \mathbf{y_j})$, where $\mathbf{x_{j}}(0) \in \mathbb{R}^d$ is the $j^{th}$ entry of the training data set. Consider the control system $\Sigma = ( \mathcal{X}, f, \mathcal{U})$ with $f(x, t) = \sum_i \omega_i(t)e^{- | x(t) - x_i(t) |^2}$. The goal is to find $\omega_i$ that steers the control system $\Sigma$ from $\mathbf{x_{j}}(0)$ at time $t = 0$ to a state $\mathbf{x_{j}}(T)$ at time $t = T$ that is close to $\mathbf{y_j}$. The objective function is defined by
where $\Phi$ is the terminal condition. The cost Lagrangian here is not chosen as the $L^2$ norm, but instead as a norm with the metric $g_{ij} = e^{-| x_i - x_j |^2}$. In explicit terms, that is
Pontryagin’s Hamiltonian is then
and Hamilton’s equations are then
This system is interesting in itself: when the radial basis function is replaced with $e^{- \lvert x_i(t) - x_j(t)\rvert}$, the Green’s function of the differential operator $D = \mathrm{Id} - \partial_{xx}$, the resulting field is a weak solution of the dispersionless Camassa-Holm equation
which is in turn the Euler-Poincaré equation on the group of diffeomorphisms with the Sobolev $H^1$ norm (Camassa & Holm, 1993), (Holm, Schmah, Stoica, & Ellis, 2009). The Camassa-Holm equation is an infinite-dimensional integrable system that models waves in shallow water. Furthermore, Hamilton’s equations for $(x,p)$ form an integrable system. This type of system, whose Pontryagin Hamiltonian yields an integrable system, is studied in detail in (Jurdjevic, 2016).
*Figure: The solution of the Camassa-Holm equation with the initial condition chosen to be a Gaussian function and periodic boundary conditions. One of the features of this equation is its wave-breaking mechanism, and as a result peaked solitary waves, known as peakons, emerge.*
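On the finite-dimensional side, the Hamiltonian system for $(x, p)$ with the exponential kernel is the classical peakon system of the Camassa-Holm equation, and it can be integrated directly; a small sketch with two peakons and forward-Euler time stepping (the initial positions, momenta, and step size are illustrative):

```python
import numpy as np

def peakon_rhs(x, p):
    """Hamilton's equations for H = 1/2 * sum_ij p_i p_j exp(-|x_i - x_j|):
    dx_i/dt = sum_j p_j e^{-|x_i-x_j|},  dp_i/dt = p_i sum_j p_j sgn(x_i-x_j) e^{-|x_i-x_j|}."""
    dx = x[:, None] - x[None, :]
    K = np.exp(-np.abs(dx))
    xdot = K @ p
    pdot = p * ((np.sign(dx) * K) @ p)
    return xdot, pdot

x = np.array([-5.0, 0.0])   # peakon positions
p = np.array([2.0, 1.0])    # peakon momenta (the faster one starts behind)
dt = 0.01
for _ in range(2000):
    xdot, pdot = peakon_rhs(x, p)
    x, p = x + dt * xdot, p + dt * pdot
print(x, p)  # after the interaction the peakons have effectively exchanged speeds
```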
In recent years there has been an interest in the mean field games view of optimal control, because it uses deterministic quantities to derive optimal control laws for stochastic systems. One of the first attempts at this type of control was in (Brockett & Khaneja, 2000), where a control for an ensemble of stochastic systems was sought. That is not the only reason for interest: the problem is very rich mathematically, and it provides many interesting connections with other established fields. The primary reason for our interest in such a framework is that it allows Pontryagin’s Maximum Principle to be applied to stochastic optimal control problems, which opens the way to using differential geometry when dealing with nonlinear stochastic control systems. Although the dynamic programming approach has been successful for the stochastic problem, it presents a few technical difficulties: the only way to obtain the optimal control law is by solving the stochastic Hamilton-Jacobi-Bellman equation, which is a nonlinear backward partial differential equation, and solving it is challenging in itself. Furthermore, Pontryagin’s Maximum Principle is tempting as it provides a direct way to formulate the problem in the language of multisymplectic geometry. Pontryagin’s Maximum Principle for deterministic finite-dimensional systems is equivalent to solving a Hamiltonian system, so it allows the tools of symplectic geometry to be used to solve the system in question and even to construct robust numerical algorithms using symplectic integrators. Furthermore, in some cases optimal control problems are related to fluid mechanics problems, and here we study the dynamical systems that arise from such problems (Bloch, Holm, Crouch, & Marsden, 2000), (Gay-Balmaz, Holm, & Ratiu, 2009).
Mean field games theory was developed by Lasry and Lions (Lasry & Lions, 2006), (Lasry & Lions, 2006) and by Caines et al. (Huang, Malhamé, & Caines, 2006) to model games with an infinite number of players, or agents. Each player wants to maximise or minimise a certain quantity with respect to a certain criterion, for example, to spend the least amount of money, and each player’s behaviour takes into consideration the behaviour of the other players. Examples of such situations are abundant in nature and society. The decision taken by each agent does not affect the overall behaviour or the state of the population; the decision making for all members of the population is done in a decentralised manner. One main assumption of mean field games is that all agents are indistinguishable, and they are modelled by a state $X^i_t$, where $1 \leq i \leq N$. A summary of the mathematics behind mean field games theory is given here; the keen reader is advised to consult (Huang, Malhamé, & Caines, 2006), (Lasry & Lions, 2007), and (Bensoussan, Frehse, & Yam, 2013). Francophone readers might also consider reading the first papers on this topic (Lasry & Lions, 2006), (Lasry & Lions, 2006). Consider a population of $N$ agents where each agent has control over the stochastic differential equation
Complying with the notation used in these notes, the drift $u$ is the control, or strategy, the variance is denoted by $\sigma$, and $dW$ are independent Brownian motions adapted to a filtration $\mathcal{F}_t, t\geq 0$. As mentioned in the background section, such a process is a Markov process and its probability density function satisfies the Fokker-Planck equation. The mean field games system is then described as
where $H$ is a convex Hamiltonian function. As each member tries to make the best decision for themselves, the decision making can be viewed as a game theory problem. However, game theory becomes very difficult to solve as the number of players exceeds three; as the size of the population goes to infinity, the solution consists of points of Nash equilibrium.
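For reference, in the standard Lasry-Lions form (the coupling terms $F$ and $g$ are problem-dependent), the mean field games system pairs a backward Hamilton-Jacobi-Bellman equation for the value function $\phi$ with a forward Fokker-Planck equation for the density $\rho$:

$$
-\frac{\partial \phi}{\partial t} - \frac{\sigma^2}{2} \Delta \phi + H(x, \nabla \phi) = F(x, \rho),
\qquad
\frac{\partial \rho}{\partial t} - \frac{\sigma^2}{2} \Delta \rho - \nabla \cdot \bigl( \rho\, \nabla_p H(x, \nabla \phi) \bigr) = 0,
$$

with $\phi(T, x) = g(x, \rho(T))$ and $\rho(0, x) = \rho_0(x)$.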
Such an equilibrium can be obtained with differential games methods and game theory, but it is difficult to arrive at. However, the assumptions made in mean field games theory allow one to obtain a Nash equilibrium by considering an infinite number of players/agents.
Now, a mean field games theory aficionado would probably ask: what can stochastic geometric control theory bring to mean field games? Two things in general. First, the tools of differential geometry can be applied to mean field games theory, so that the agents can be modelled by nonlinear stochastic differential equations with symmetry, and this symmetry can be used to reduce the degrees of freedom. The second is to cast the system of HJB-FP equations in a geometric structure and use this structure to help in inferring properties and even solving for the optimal strategy.
The approach to mean field optimal control considered here relies on the Fokker-Planck equation that describes the evolution of the probability density function of the Markov processes $X_i$ that describe the controlled stochastic system in question. This can be viewed as a generalisation of (Benamou & Brenier, 2000).
In this section, the mean field optimal control law is derived using Pontryagin’s Maximum Principle. The control dynamics are
with
where $X_i \in \mathbb{R}^d$, $u \in \mathcal{U}$, and $W_k$ are independent and identically distributed Brownian motions adapted to the filtration $\mathcal{F}_t, t \geq 0$. The stochastic integral is taken in the Stratonovich sense. The Fokker-Planck equation for this control system is
The symbol $\mathcal{L}$ denotes a Lie derivative, and $\mathcal{L}_{u} \rho$ is the Lie derivative of the density along the vector field $u$; it is given explicitly by
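Treating $\rho$ as a density (a top-degree form), the Lie derivative reduces to a divergence, and the Fokker-Planck equation above can then be read as follows (this is the standard expression under the stated Stratonovich convention):

$$
\mathcal{L}_{u}\, \rho = \nabla \cdot (\rho\, u),
\qquad
\frac{\partial \rho}{\partial t} + \mathcal{L}_{u}\, \rho = \frac{1}{2} \sum_k \mathcal{L}_{\sigma_k} \mathcal{L}_{\sigma_k}\, \rho .
$$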
The control system is now a tuple $\Sigma = (\mathcal{X}, u, \mathcal{U})$ and the objective function is given by
and Pontryagin’s Hamiltonian is given by
where $\lambda$ acts as the Lagrange multiplier. Hamilton’s equations, or more precisely the De Donder-Weyl equations, associated with Pontryagin’s Maximum Principle are given by
At this point, it is clear that when $\sigma_k \to 0$, the fluid mechanics formulation of the Monge-Kantorovich problem mentioned earlier is recovered. However, the double Lie derivatives do not translate into double Lie derivatives for the Euler fluid equation. This is not the end of the road: the idea is to derive the De Donder-Weyl equations associated with Pontryagin’s Maximum Principle using a transformed Fokker-Planck equation. Consider the case when the dynamics of the control problem are simplified to
where $\sigma \in \mathbb{R}$ is a constant. The Fokker-Planck equation then simplifies to
Let $u(x,t) = v(x,t) + \sigma^2\nabla \log \rho$, which can be viewed as akin to a Helmholtz decomposition; the Fokker-Planck equation then becomes
and Pontryagin’s Hamiltonian is
The term $\int \rho |\nabla \log \rho |^2 \, dx$ is the Fisher information functional, and the extra constraint $\left\langle p, 1 - \int \rho \, dx \right\rangle$ ensures that the vector field $v$ remains incompressible. With this, the De Donder-Weyl equations are
Taking the gradient of the second equation, which is the Hamilton-Jacobi equation, we obtain the Madelung equation, also known as the quantum Euler equation
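For reference, the Madelung (quantum Euler) equation has the canonical form below; the constant multiplying the Bohm-potential term depends on the scaling of $\sigma$ adopted above, so it is written here simply as $c(\sigma)$:

$$
\frac{\partial v}{\partial t} + (v \cdot \nabla) v = c(\sigma)\, \nabla \!\left( \frac{\Delta \sqrt{\rho}}{\sqrt{\rho}} \right),
\qquad
\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho\, v) = 0 .
$$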