Vertical mixing parameterizations in ocean models are formulated on the basis of the physical principles that govern turbulent mixing. However, many parameterizations include ad hoc components that are not well constrained by theory or data. One such component is the eddy diffusivity model, where vertical turbulent fluxes of a quantity are parameterized from a variable eddy diffusion coefficient and the mean vertical gradient of the quantity. In this work, we improve a parameterization of vertical mixing in the ocean surface boundary layer by enhancing its eddy diffusivity model using data‐driven methods, specifically neural networks. The neural networks are designed to take extrinsic and intrinsic forcing parameters as input to predict the eddy diffusivity profile and are trained using output data from a second moment closure turbulent mixing scheme. The modified vertical mixing scheme predicts the eddy diffusivity profile through online inference of neural networks and maintains the conservation principles of the standard ocean model equations, which is particularly important for its targeted use in climate simulations. We describe the development and stable implementation of neural networks in an ocean general circulation model and demonstrate that the enhanced scheme outperforms its predecessor by reducing biases in the mixed‐layer depth and upper ocean stratification. Our results demonstrate the potential for data‐driven physics‐aware parameterizations to improve global climate models.

The upper region of the ocean is highly energetic and is responsible for transferring mass, energy and biogeochemical tracers between the atmosphere and the deeper regions of the ocean. This transport takes place because of turbulent swirling motions, which are found to be of varying sizes. Climate models cannot represent all of these motions because smaller‐scale swirls are complex and require additional computational resources. As we cannot neglect those small swirls, we try to approximate their effects on larger‐scale motions using mathematical models. These models have a few ad hoc or empirical assumptions that lead to uncertainty when these climate models are used to project the future climate. To reduce this uncertainty, we augment an existing model of turbulent swirling process with machine learning, which replaces some ad hoc approximations with data‐driven neural networks. Neural networks can learn those missing processes more accurately than a traditional physics‐based model. The neural networks are shown to improve physics in climate simulations. Although we only touch on one component in an ocean climate model, this approach can be replicated to improve any other component that was using ad hoc assumptions and replace them with data‐driven models using techniques from machine learning.

We improve a parameterization of vertical mixing in the ocean surface boundary layer using neural networks

Neural networks are trained to predict the diffusivity of second moment closure and maintain energetic constraints of the original parameterization

The improved scheme reduces biases of mixed layer depth and thermocline in an atmospherically forced ocean model

Vertical mixing parameterizations used in ocean general circulation models (OGCM) represent the effects of unresolved processes on the mean state. These parameterizations have theoretical deficiencies due to the lack of understanding of inadequately represented or missing processes. To overcome this deficiency, parameterizations often require ad hoc/empirical modifications either to approximate the missing processes or to fit data. Vertical mixing schemes can be constructed with various assumptions and different schemes are calibrated differently. These inconsistencies cause the schemes to disagree among themselves (Li et al., 2019) and are a major source of model uncertainty (Fox‐Kemper et al., 2019; Gutjahr et al., 2021; Hawkins & Sutton, 2009; Huber & Zanna, 2017; Todd et al., 2020). Poorly parameterized mixing can result in errors that accumulate over time, leading to biases in the OGCM.

New approaches are emerging to improve various parameterizations in ocean and atmosphere models using machine learning. We have applied neural networks, a type of machine learning, to improve a vertical mixing parameterization of the ocean surface boundary layer (OSBL). OSBL is a vital region of turbulence in the ocean. It acts as an interface between the atmosphere and the deeper ocean and it is important to accurately represent mixing in the OSBL. The atmosphere energizes the ocean through the OSBL. Mass, tracers, and momentum are transferred between the atmosphere and deep ocean via the OSBL, and inaccuracies in vertical mixing parameterizations can give rise to uncertain estimates of heat transport, sea level rise, ocean carbon uptake, etc. Including missing processes in upper ocean vertical mixing schemes impact large‐scale phenomena, for example, accounting for Langmuir turbulence and submesoscale effects in the OSBL improves simulations of the Indian monsoon (Orenstein et al., 2022).

We focus on the energetic Planetary Boundary Layer (ePBL) scheme, a first‐order OSBL turbulent mixing parameterization as described in Reichl and Hallberg (2018) (see Section 2). The variation of the vertical diffusivity profile *κ*_{ϕ} (of arbitrary scalar, *ϕ*) within the OSBL in ePBL and similar first order schemes can be expressed as a diffusivity scale *g*(*σ*) is a dimensionless shape‐function, and *σ* = *z*/*h* is a dimensionless vertical coordinate, where *z* is the vertical coordinate and *h* is the depth of the boundary layer. OSBL parameterizations that follow this approach traditionally assume that *g*(*σ*) is a universal function or has a fixed component such as a cubic polynomial that does not change (Large et al., 1994; O'Brien, 1970), and therefore is ad‐hoc. In the K‐profile‐parameterization (KPP) scheme of Large et al. (1994), there is a cubic polynomial which is multiplied by a vertically varying turbulent velocity that sets the structure of *κ*_{ϕ}. The cubic polynomial is universal, whereas turbulent velocity mostly affects the surface layer defined by the region 0 < *σ* < 0.1, making the cubic structure dominant below the surface layer. In ePBL scheme (Reichl & Hallberg, 2018), *κ*_{ϕ} follows similar design. However, there is no physics‐based justification for a universal or ad‐hoc profile to exist, and it is widely understood that characteristics of boundary layers can vary considerably with forcing conditions (Li et al., 2019). We hypothesize that capturing variations of the shape function that are not considered in first‐order OSBL schemes such as ePBL will improve the overall representation of vertical mixing in ocean models. In the subsequent text, our usage of the term “universal shape function” will include shape functions which involve some ad‐hoc components or approximations such as used in the ePBL scheme (see Section 2).

Second Moment Closure (SMC) is an alternative approach to predict vertical diffusivity profiles within the OSBL (Rodi, 1987; Umlauf & Burchard, 2005). SMC does not require a shape function because it instead predicts the diffusivity from the turbulent kinetic energy (*k*) and the turbulent length scale (*l*_{t}). Various SMC approaches exist to predict *k* and *l*_{t} and a general formulation to infer diffusivity is expressed as:*c*_{ϕ} represents the model stability functions (Umlauf & Burchard, 2005).

SMC predicts a profile of vertical diffusivity based on models of physical processes that drive turbulent fluxes within the OSBL. SMC does not prescribe a shape function a priori. However, since SMC directly evaluates a diffusivity profile, the implied shape function and diffusivity scale can be diagnosed from the output. The implied shape function differs significantly from a universal shape function, as seen in Figure 1. The diagnosed shape‐function and diffusivity scale from SMC can then be used to build a model for use in ePBL. We selected SMC over large eddy simulation as our “truth” because it is inexpensive compared to the latter, leading to effortless creation of training data set spanning a wide range of forcing regimes. This is required for machine learning applications as they are hungry for a large amount of data.

A natural question is why SMC is not directly used instead of ePBL in OGCM. It remains impractical to directly use SMC for vertical mixing in climate simulation due to the sensitivity of their predictions to long time steps and coarse vertical grids often used in climate models (see Reichl & Hallberg, 2018). However, using the framework described in this article, ePBL can yield a closer approximation to the vertical diffusivity from the SMC scheme without sensitivity to the model's vertical resolution and time step. Our neural network approach allows ePBL to consider the physics‐based variation in the shape function seen in SMC due to solving *k* and *l*_{t}. This variability in the shape function will lead to different profiles of vertical mixing within ePBL than using a prescribed universal profile.

Consider a physics‐based parameterization that gives an output Ψ as some functional relationship **x**:

Finding **x**, but the fit might not work for different regimes or might implicitly depend on higher‐order combinations of terms in **x** (nonlinearity) or some other neglected terms. **x** = (*x*_{1}, *x*_{2}, …) is the input vector and **w** are parameters (weights and biases). Machine learning involves determining (learning) the correct values of **w** by tuning the hyperparameters that give the optimal **x**, which has been a significant challenge in traditional physics‐based approaches.

Machine learning is favorable for the development and application of climate models due to the abundance of optimization algorithms and hardware (Balaji et al., 2022; Christensen & Zanna, 2022). Studies show that neural networks can be used in idealized model configurations, and recently, the use of machine learning has emerged in realistic GCMs. Artificial neural networks (ANNs) have been shown to improve sub‐grid momentum transport in atmospheric models (Yuval & O'Gorman, 2023), predict precipitation (Shamekh et al., 2023) and fluxes (Shamekh & Gentine, 2023), while in ocean models they have been used to improve the parameterization of free convection (Ramadhan et al., 2023). Liang et al. (2022) applied deep neural networks to predict temperature and salinity evolution in the OSBL at a weather station (Station Papa). Partee et al. (2022) trained a deep neural network to learn subgrid kinetic energy of oceanic mesoscale eddies from a high resolution OGCM to improve their representation in a lower resolution OGCM. Convolutional neural networks (CNNs) have been used to predict parameterizations of ocean momentum backscatter in a variety of models (Bolton & Zanna, 2019; Guillaumin & Zanna, 2021; Zanna & Bolton, 2020) and have been implemented in an ocean primitive equation model (Zhang et al., 2023). Gregory et al. (2023) recently employed CNNs to learn data assimilation increments for sea‐ice and showed that networks could be used to reduce biases in sea‐ice.

Apart from neural networks, techniques considered part of the machine learning toolbox show potential to improve GCMs. The random forest algorithm has been used to parameterize moist convection (O'Gorman & Dwyer, 2018) and to learn small‐scale processes from a high resolution atmospheric model (Yuval & O'Gorman, 2020). Mansfield and Sheshadri (2022) used Gaussian Process emulator to tune gravity wave parameterization in an intermediate complexity atmospheric GCM. Souza et al. (2020) use a Bayesian technique to fine‐tune the non‐local flux terms of the KPP parameterization of Large et al. (1994).

The aforementioned examples show the potential of enhancing conventional physics‐based schemes using machine learning techniques. This article draws inspiration from these demonstrations, recognizing the promise of machine learning in advancing ocean model parameterizations and prompting further investigation in this area.

ANNs are trained using output from SMC that directly predicts the profile of vertical diffusivity and do not rely on ad hoc shape functions. As neural networks are powerful approximators, they can model the variability in the vertical diffusivity profiles of the SMC, but we formulate the ANNs to fit within the simplified framework of the first‐order ePBL approach. Our procedure has the following advantages:

We use the neural networks to modify the vertical diffusion term within ePBL instead of directly predicting turbulent flux time tendencies (e.g., temperature and salinity), guaranteeing that the scheme conserves physical quantities.

The neural networks are introduced in a manner that does not interfere with the potential energy‐based mixing constraints of the original ePBL scheme, and therefore ePBL's robust numerical implementation is preserved.

The ANNs predict quantities used to compute the diffusivity: the non‐dimensional structure (shape function) and a turbulent velocity, which simplifies training, implementation, and interpretability versus directly predicting the diffusivities.

ANNs yield strictly positive values of the vertical diffusivity, an important consideration for numerical stability (see Section 3.4.2).

Our ANNs are as small as possible to balance accuracy and computational costs, as they will be used in climate timescale OGCM simulations.

We structure the article as follows. Section 2 describes the ePBL scheme and briefly addresses a calibration/tuning problem. Section 3 gives details of the network structure and describes the data used to train networks with estimates of uncertainty. Section 4.1 provides details on implementing the enhanced ePBL scheme, hereafter called ePBL_NN. The new improvements in ePBL_NN are demonstrated online using free‐running single‐column model experiments (Section 4.2), and their impact on biases in an existing ocean‐ice climate model is assessed in Section 4.3. We conclude with a summary and discussion of the broader implications of this work for applying machine learning to improve parameterizations in ocean climate models.

The ePBL framework, as described by Reichl and Hallberg (2018), is designed for climate applications of OGCMs and emphasizes robust solutions to changes in model time stepping and vertical resolution. The scheme is simple enough to implement efficiently within implicit diffusion solvers often used in OGCMs while maintaining important physical constraints on ocean mixing. The ePBL scheme performs with high skill in idealized models and OGCMs (Li et al., 2019; Reichl & Li, 2019), and has been implemented in NOAA—Geophysical Fluid Dynamics Laboratory (GFDL)'s MOM6‐based climate models: OM4, CM4, and ESM4 (Adcroft et al., 2019; Dunne et al., 2020; Held et al., 2019).

ePBL builds on the paradigm of bulk mixed layer models (Kraus & Turner, 1967; Niiler, 1977), which constrain the boundary layer depth (*h*) via energetic implications of vertical mixing. The scheme therefore constrains the mixing based on parameterizing the rate by which turbulent kinetic energy is converted to potential energy within the OSBL:

Here, *h* is the (positive) depth of the boundary layer as defined in Reichl and Li (2019), *f*, surface friction velocity *u*_{*}, surface buoyancy flux *h*, and integrated release of potential energy by convective buoyancy fluxes. Reichl and Hallberg (2018) find

ePBL extends the bulk mixed layer formulation to resolve vertical structure within the OSBL by applying a down‐gradient flux profile using the vertical diffusivity given by*κ*_{ϕ} is the variable diffusivity of a scalar *ϕ*. The diffusivity varies with depth and is given in the following form:*L* and *v*_{0} are length and velocity scales. In the present implementation of ePBL (Reichl & Hallberg, 2018), the turbulent Prandtl number is assumed to be one and hence the diffusivity and viscosity are identically modified. Both *L* and *v*_{0} are expressed as functions of position *σ* within the boundary layer. The length scale in Equation 7 is set as (following O'Brien, 1970; Large et al., 1994):

By assuming a fixed constant for *γ*, the expressions given by Equations 7 and 8 may be expressed in the same form as Equation 1, which reveals the role of the shape function as *g*(*σ*) = *σ*(1 − *σ*)^{γ}. *γ* should not be a fixed constant. Constructing *γ* as a data‐driven function is challenging and the form *σ*(1 − *σ*)^{γ} does not have a physical basis. The velocity scale *k* − *ϵ* (see Equations 43–45 in Reichl & Hallberg, 2018).

Although the integrated mixing in ePBL is constrained via the function *γ* and *v*_{0} that set the diffusivity profile within the boundary layer. Differences in the diffusivity profile mean that even when the energetic constraints are accurate, inconsistent OSBL evolution and stratification can emerge when comparing ePBL with SMC such as *k* − *ϵ* (see Figures 6 and 7). In this article, we enhance the physics‐based ePBL approach by improving these velocity‐ and length‐scale formulations with ANNs (see Section 3).

ANNs are one of the most widely used forms of machine learning models. ANNs are universal approximators and can find hidden nonlinear relations between quantities (Cybenko, 1989; Hornik, 1991; Hornik et al., 1989). In this section, we describe the fundamentals of ANNs and provide details describing the training procedures for the neural networks used to supplement ePBL's eddy diffusivity model.

ANNs consist of nodes arranged in layers. Nodes are elements of a vector **x** that constitute a layer. See Figure 2 for a schematic. Each vector is connected to its adjacent vector via a transformation that involves multiplying with coefficients, called weights **w**, and adding an offset, called biases **b**. After transforming the vector with weights and biases, a nonlinear operation yields the next vector (or layer). The nonlinear operator is an activation function **x**_{1}, one hidden layer **x**_{2} and an output layer **y**, ANN can be written as_{2} and an output layer **y**, ANN can be written as

Deeper networks are expanded versions of the above Equation set 9 and are obtained by adding additional layers. ANNs can capture nonlinear relationships within certain tolerances and can interpolate with high accuracy within the range of training data. We employ ANNs to learn the nonlinear relationship between chosen input parameters, described below, and the vertical diffusivity profile predicted using SMC.

To train the ANN model to predict the diffusivity profile, we use the sigma coordinate defined as *σ* = *z*/*h*. Therefore, at the surface *σ* = 0, and at *h*, *σ* = 1. We define the diffusivity in the sigma coordinate in terms of a velocity scale, *v*_{0}, the boundary layer depth, *h*, and the nondimensional shape function, *g*:*g*(*σ*) is defined to give values between [0,1]. We could have introduced vertical structure in few or all of the terms on the right hand side in Equation 10. Instead we use only *g*(*σ*) to provide vertical structure as we found out that it was convenient to train one profile than two or more. The benefit of adopting the sigma coordinate is in removing the dependence on the vertical coordinate (e.g., grid spacing in *z*) that varies in different ocean models. This allows us to train and infer (feed‐forward) without depending on the model's vertical grid, which makes it practical to implement in an ocean model with an adaptive vertical grid (e.g., Bleck, 2002).

The velocity scale *v*_{0} in Equation 10 does not vary with *σ*. The entire vertical structure of *κ*_{ϕ} is captured by *g*(*σ*) alone. This is in contrast to Equation 7 where both the length scale and the velocity scale vary in vertical direction and contribute to the vertical structure of *κ*_{ϕ}. We made this choice to simplify the approach so that only one neural network is needed to capture the vertical structure of *κ*_{ϕ}.

We choose to obtain the shape function and velocity scale using two separate neural networks:*f*, *B*_{0}, *u*_{*}, and *h*, while *f*, *B*_{0}, and *u*_{*}. We chose this strategy rather than combining the two outputs into one ANN for a couple of reasons. First, it is straightforward to cleanly diagnose *g*(*σ*) and *v*_{0} from the data, as will be explained in Sections 3.4 and 3.5. Second, we anticipate that having separate networks will make the individual networks easier to interpret, which allows us to better understand physical processes modeled by the network.

Both neural networks

The SMC data used to train the networks is generated using the single column model framework implemented in the General Ocean Turbulence Model (GOTM; Umlauf & Burchard, 2005; Umlauf et al., 2014). GOTM provides numerous SMC options to predict the fluxes of turbulence and the vertical diffusivity. We employ a two‐equation model: *k* − *ϵ*, with stability function closure following Schumann and Gerz (1995). The choice of this specific SMC parameterization is made to be consistent with Reichl and Hallberg (2018). Vertical mixing parameterizations remain an active research topic, and currently used schemes, including SMC, can exhibit biases in different forcing regimes and regions (Damerell et al., 2020; Li et al., 2019; Peters & Baumert, 2007; Sane et al., 2021; Tirodkar et al., 2022). Any biases in the training data are inherited by the neural networks. However, our neural networks can be trained using the output of different mixing schemes, including the improved schemes developed in future research.

The GOTM column model consists of a vertical grid with forcing applied at the surface grid point. It is applicable for flows with horizontal homogeneity, that is, horizontal fluxes are zero or constant. GOTM simulations are performed by changing the following parameters: latitude (Coriolis), surface wind stress (surface friction velocity), and surface heat flux (surface buoyancy flux). Salinity is kept constant and temperature is the only active tracer, though the results are general for any combination of buoyancy fields and forcing. Our initial analysis indicates that the diffusivity of *k* − *ϵ*, *κ*_{kϵ}, depends on the Coriolis parameter *f*, the surface buoyancy flux *B*_{0}, the surface friction velocity *u*_{*} and the depth of the boundary layer, *h*. We can only specify *f*, *B*_{0}, and *u*_{*} in single column simulations, and *h* is diagnosed from the time evolution simulated by GOTM.

Each GOTM case runs with a set of constant forcings. The time step is set at 60 s, and the vertical grid spacing is 1 m. The depth of the column is 800 m. The simulation results for the *k* − *ϵ* model are converged at this time step and resolution (see Figure 1 in Reichl & Hallberg, 2018). The initial conditions consist of zero horizontal velocity, the surface temperature is set at 20°C, and the initial temperature stratification is set at 0.005°C/m. Data was saved at hourly intervals. For every *f*, *B*_{0}, and *u*_{*}, we included one hundred instantaneous profiles of diffusivity at each hour from day 2 to day 6 in the training data set.

We found (*f*, *B*_{0}, *u*_{*}, *h*) to strongly affect diffusivity compared to the background stratification established by the initial conditions. Stratification acts as a barrier to the deepening of the mixed layer, and therefore it is challenging to obtain deeper layers with strong stratification at the bottom of the mixed layer, and this limits the generation of training data spanning a wide range of *h*. Therefore, we choose a weak initial stratification. The effects of stratification on diffusivity in *k* − *ϵ* most directly impact the rate of deepening of the boundary layer (which is already captured by the energetic constraint of the ePBL), compared to the shape function itself.

Table 1 shows the range of forcing parameters of the training data. Forcing range is different for ^{2}, capping prevents the wind stress from going beyond 1.2 N/m^{2} as the shape function does not vary significantly beyond 1.2 N/m^{2}. A similar argument can be made about the surface heat flux. The range selected to perform the sweep has been informed using the observed forcings in the JRA atmospheric reanalysis data set (Tsujino et al., 2018). The range shown in Table 1 covers most of the forcing space certainty as explained in Appendix B. For *h*, maximum variations for *g*(*σ*) were observed between 20 and 300 m and beyond 300 m g(*σ*) is found to vary marginally. Randomizing the training data and splitting it into two sets (train and validation) could result in very similar elements from similar experiments being present in both the train and the validation data. This is undesirable since a fully independent validation data set is required to monitor overfitting when training a neural network. To prevent this issue, a validation data set is independently generated having 10% the size of the training data using a fully independent set of forcing parameters. No single element between (*f*, *B*_{0}, *u*_{*}) is common between the training data set and the validation data set to strictly ensure the independence between the training and validation sets.

*Note*. We have added additional details about GOTM runs to Supporting Information S1.

The parameters *f*, *B*_{0}, *u*_{*}, and *h* are inputs to *g*(*σ*) at 16 evenly distributed nodes, as shown in Figure 2. For each set of forcing (i.e., latitude, heat flux, and surface stress), the GOTM output consists of the evolution of the initial conditions into a developed boundary layer. The boundary layer deepens and variations in *κ*_{kϵ} emerge. Ignoring the initial ≈2 days of data, *h* is diagnosed for each model output with a frequency of 60 min by analyzing the profile of the vertical buoyancy flux. Here, *h* is defined by the depth at which *κ*_{kϵ}(*σ*), is normalized by its maximum to find the shape function:

The neural network cannot learn a continuous profile in *σ*, but instead we train it to learn on a subsampled *g*(*σ*) grid that consists of 18 equally spaced *σ* points (0, 1/17, 2/17, …, 16/17, 1). *σ* at 0 and 1 is ignored in the training because *g*(*σ*) = 0 at the surface (*σ* = 0). At *σ* = 1, *g*(*σ*) → 0 and hence is assumed to be zero for training purposes. Therefore, the network predicts *g*(*σ*) at the 16 interior locations. Subsampling *g*(*σ*) to 18 evenly distributed points was found to be sufficient to capture essential features of *g*(*σ*) while maintaining a small enough network to later implement in an OGCM.

ANNs show high prediction skill when input is within the range of training data. GOTM experiments can cover a wide range of data points that span latitude, heat flux, and surface wind stress, such as those historically observed in the real ocean. However, this is not true for *h* as it evolves prognostically and we cannot set its range for each run. We have chosen *h* to vary from 20 to 300 m (see Table 1) for training purposes, but for some surface heating conditions the boundary layer depth will saturate toward the Monin‐Obukhov length *L*_{MO}, which might be less than 300 m. As *h* is an input to the network and if for a particular case *L*_{MO} < *h* < 300 m, then profiles will not exist and the network will have to predict outside the range of the training data set. The network might end up predicting spurious profiles.

We address this issue by supplementing the training with synthetic data. For a particular case, if *h* saturates to, for example, 200 m, then an additional 10 profiles are added to cover the missing range of 200–300 m in the training data. The shape function for these synthetic profiles is assumed to be the same as when *h* = 200 m, that is, *g*(*σ*) for *h* ∈ (200, 300) will have the same values as *g*(*σ*) for *h* = 200 m. This assumption is reasonable, since *g*(*σ*) was found to vary little for deeper boundary layers with surface heating.

Strong convection can cause a related issue due to quick deepening of the boundary layer within the spin‐up phase of the turbulent OSBL. This gap is filled in the same way as described for deep boundary layer gaps. If the lowest value of *h* is, for example, 100 m, then 10 profiles are added that cover 20–100 m. The shape function for these 10 profiles is assumed to be the same as that when *h* = 100 m. This fill‐up of gaps in *h* is necessary to stabilize ANNs trained with our existing data sets. Knowing the exact bounding box of the training data set is imperative for a successful and stable implementation in a GCM.

The network *g*(*σ*), is a vector of length 16. If *g*(*σ*) predicts a negative value of the shape function for any *σ*, it would lead to negative diffusivity values. We prevent this by training on the logarithm of *g*(*σ*). *g*(*σ*)) and, while inferring, the exponential function is used. This ensures that the shape function is strictly positive.

The four inputs to the network (*f*, *B*_{0}, *u*_{*}, *h*) are normalized by their respective mean and standard deviation of the training data. For the 16 output nodes, each output was normalized by its own mean and standard deviation. For output node *i*, log(*g*(*σ*_{i})) was transformed into

To train

To measure the network's performance, linear correlation coefficient between the validation data and its prediction was calculated (see Figure 3a). The linear correlations for the 16 nodes were weight averaged with the mean value of *g*(*σ*) at the corresponding node. The weight‐averaged correlation is a better estimate of the network's skill for the given set of hidden nodes, as it reduces the influence of noisy values at the bottom of the boundary layer. The noisy values might be due to interpolation of the shape function profiles from the GOTM data. Based on hyperparameter sweep, we chose 2 hidden layers with 32 hidden nodes for

Figure 3b shows the loss curves for training the network. Training loss (magenta) and validation loss (green) decrease with the training epoch. The validation loss is higher than the training loss, but both eventually plateau. The difference between validation and training loss, shown in blue, remains constant in later epochs, signifying when training should be stopped. The validation loss does not increase, ensuring that the network is not overfitting the training data. The performance of the network is further tested by comparing it with the validation data. Strong agreement with validation data can be seen through the average correlation scores in Figure 3a and the error statistics in column (d) of Figure 4.

Figure 4 displays the performance of *g*(*σ*). The medians are superposed over the mean *g*(*σ*) profile of the entire data set. This helps to visualize the skill of *σ* value. Nodes 11 and 12 have a high error variance compared to other nodes. The error variances in column (a) are different from those in column (b) because the data have different variances along the nodes. The last node 16 has a high variance in (a) but because the *g*(*σ*) values at that node are small, poor performance at that node does not penalize the overall performance of

For the neural network

*all* of the forcing regimes: surface heating, neutral, and convection. Perhaps, this adds a limitation to the network, which falls short of having very high skill for all the regimes. In our training experiments (not described in this article), training and predicting separately on the stable and unstable regimes gave higher skill than training on all regimes at once. Having two networks to predict *g*(*σ*) alone could lead to higher skill without increasing the number of hyperparameters. This might be a cost‐effective way to increase the overall accuracy of ePBL_NN without expanding the size of the network. Increasing the number of hidden nodes in the hidden layers increases the cost of forward computation, while switching between the networks based on the forcing regime has a similar cost to using a single network. For simplicity, in this work we prefer to train all data using a single neural network and have not pursued this any further.

We used the L1 loss function (mean absolute error) for training, as it gave better training performance than the L2 (root mean square error [RMSE]). We also increased the convergence of the network parameters (weight and biases) by tweaking the loss function. The loss values at nodes 8 to 13 were amplified by a factor of 100. This made the loss gradients steeper at the nodes that show the highest variance (seen in columns 1 and 2). This forces the network to put more weight on reducing errors on the nodes that are otherwise difficult to learn. The ADAM optimization algorithm (Kingma & Ba, 2014) has been used to train the weights and biases of the network with a learning rate of 10^{−3}.

The second neural network *v*_{0}. Velocity is diagnosed from the training data using the following *jugaad*:*v*_{0} for a set (*f*, *B*_{0}, *u*_{*}). The spread of max(*κ*_{kϵ}(*σ*))/*h* for a constant (*f*, *B*_{0}, *u*_{*}) is small, and averaging assists the neural network in training to predict the mean value (see Appendix A).

Similarly to *v*_{0} and exponential function is used while inferring to ensure that the predicted *v*_{0} is strictly positive. The data is divided into 80%–20% to train and test the performance of the network. As seen in Table 1, the training data cover a wide range compared to that of

Training, testing, and validation data provide one method for testing the network and its ability to reproduce training data. However, to fully test the network's potential for OGCM experiments the neural networks must also be implemented in free‐running, prognostic models. Our implementation does not cause simulation to fail due to any spurious effects or instabilities which is a known problem with implementing neural networks in a GCM (see Brenowitz et al., 2020 and references therein). Stability might result because we implement neural networks as a component within the existing ePBL framework. We demonstrate the success of our implementation using both free running column model experiments and forced ice‐ocean global OGCM climate model experiments.

We now describe the implementation of our networks in the MOM6 ocean model. The weights and biases of the network are generated offline and stored in NetCDF files. Feedforward (inference) calculation of the network involves matrix multiplications and activation functions. These have been coded as subroutines in MOM6's vertical mixing module (ePBL). A flag activates the neural networks to predict *g*(*σ*) and *v*_{0}. All inputs to the network are readily available within the ePBL module. The neural networks require the depth of the boundary layer *h*, which is provided by the ePBL scheme as described in Reichl and Hallberg (2018). The neural networks function alongside the algorithm by which ePBL derives *h* and therefore they do not interfere with any energy constraints set by the original scheme. Additionally, in MOM6, the diffusivity derived from ePBL and the neural network subroutines is passed to a main diabatic mixing module which combines diffusivities from various mixing parameterizations (such as Jackson et al., 2008) within MOM6. More details can be found in Reichl and Hallberg (2018). *g*(*σ*) is obtained at 16 points between the surface and *h*. At the surface, *g*(*σ* = 0) = 0 to satisfy zero diffusivity. At *h*, *g*(*σ* = 1) is set as a small number by assuming *g*(*σ* = 1) = *c* · *g*(*σ* = 16/17), where *c* is a small positive constant set as 0.1. GCM and single column runs were found to be insensitive to small and non‐zero values of *c*.

Shape function on *σ* is converted to the model's vertical grid by linear interpolation. The use of the sigma coordinate makes our scheme grid independent of the vertical coordinate. The shape function on the model grid is multiplied by *v*_{0} · *h* according to Equation 10 to recover the diffusivity profile of the *k* − *ϵ* model. The subroutines pass on the diffusivity profile to the ePBL module. In MOM6, there are other parameterizations active along with ePBL to incorporate strong shearing regions found at the equator and also that handle background diffusivity. Both networks

The inputs to the neural network are also capped inside the subroutine to ensure the networks do not make predictions outside their training range. For *f*, *B*_{0}, *u*_{*}, *h*) are outside the known range, then the subroutine limits the inputs and changes them to the nearest point in the four‐dimensional hypercube formed by the four inputs. Our training data covers a reasonable space of the forcing parameter regime as observed among realistic present conditions (as it will be applied in this study). Data points outside the range are less probable, allowing the network to perform effectively for nearly all of the tested forcing conditions (see Appendix B). Capping the inputs prevents the network's output from being unphysical. If the network is applied for simulations in substantially different climate regimes (e.g., paleoclimate or for other planetary bodies) the training data could be enhanced. If the network receives inputs outside the known range, the shape function can have spurious values with irregular vertical structure. Capping the inputs ensures that this spurious behavior is prevented. The training on logarithm and using exponential function while inferring described in the earlier sections prevents non‐positive behavior for both

We compare three schemes to examine the performance of the network in single column model: GOTM *k* − *ϵ*, ePBL, and ePBL_NN. MOM6 in single column configuration (as in Reichl & Hallberg, 2018) is used to run ePBL and ePBL_NN, while GOTM is used for the *k* − *ϵ* experiments. The column models are forced at the surface grid interface with constant buoyancy forcing (surface heating of 50 W/m^{2}) and constant wind surface stress (0.2 N/m^{2}). Stratification is constant throughout the column in the initial conditions. To have the same entrainment in all the three cases, the m* value is diagnosed from the *k* − *ϵ* output and imposed in MOM6. The quantity m* is the non‐dimensional integral of the entrainment flux and is given by *G* as in Equation 5. Instead of using the parameterized m* from Reichl and Hallberg (2018) and Reichl and Li (2019), we use a diagnosed and time varying m* from *k* − *ϵ* to perform a controlled comparison with identical forcing conditions. This prevents deficiencies in the parameterized m* from causing any disagreements between MOM6 and GOTM. By matching the surface forcing and integral of the entrainment flux, the differences between all the three cases can only be due to diffusivities.

Two latitudes are compared: Latitude 40° (Figure 6) and 1° (Figure 7). The figures show the time series of diffusivity and temperature stratification. For both latitudes, the diffusivity and stratification in ePBL_NN are in closer agreement with the *k* − *ϵ* model than the original ePBL model, showing the ability of the neural networks to match *k* − *ϵ*. ePBL_NN has a diffusivity profile closer to *k* − *ϵ* than ePBL throughout the OSBL. In *k* − *ϵ* (SG), the turbulent diffusivity is computed from the simulated turbulent kinetic energy and turbulent length scale, using stability functions that relate the Prandtl number to the Richardson number (Schumann & Gerz, 1995). The neural networks have “learned” those relationships (without direct knowledge of either parameter) that set the structure of diffusivity and hence show high skill in predicting the profile.

The upper ≈20% of the diffusivity profile is able to learn traditional constraints, such as the law of the wall scaling, since they are features of the training data. The bottom ≈40% of the OSBL shows more variability and is an important region for the entrainment process. In deepening of the boundary layer, the entrainment process mixes the higher density water masses (usually cold) from below the mixed layer with the lower density mixed layer above it (usually warmer). Outside of the polar regions, this process cools the mixed layer along with the sea surface temperature (and warms the interior) and has implications for ocean‐atmosphere energy exchange and feedbacks.

We next tested the ePBL_NN scheme using the GFDL's OM4.0 ocean/sea ice model. The model has a nominal 1/4° resolution and is forced using the JRA forcing product (Tsujino et al., 2018). JRA forced simulations constrain the atmospheric fields that force the ocean model with the observed/reanalysis atmospheric data. This is different from the atmosphere‐ocean coupled model as there is no feedback from the ocean response to the atmosphere. However, this approach is beneficial for testing parameterizations since two experiments can be more carefully compared without considering the complications of those feedbacks. Future work will examine the performance of these schemes in fully coupled climate models.

Two sets of OGCM experiments have been performed: one using the ePBL scheme as a control run (e.g., as described by Adcroft et al. (2019)) and the second with the neural networks active to replace the shape function and velocity scale in ePBL. The simulations were performed for a period of 1958–2017.

In this study, we compare the two runs with observations to analyze the impact on: (a) Ocean heat uptake, (b) Sea surface temperature, (c) Mixed layer depth, and (d) Upper ocean temperature stratification in the Tropical Pacific. For sea surface temperature, data from the World Ocean Atlas (Boyer et al., 2018; Locarnini et al., 2019) has been used to compare the two schemes. For the subsurface comparison: mixed layer depth and stratification, ARGO float measurements have been utilized (Argo, 2023).

Figure 8 shows the global ocean heat content for the three runs: one with ePBL_NN shown in red‐solid line, and the other two with ePBL by setting *γ* from Equation 8 as 1 and 3 shown as blue‐dashed line and green‐dotted line respectively. ePBL_NN shows more heat uptake than the original scheme, and rate of warming is between ePBL runs with *γ* set as 1 or 3. This highlights the sensitivity of the total ocean heat content to the shape function and to boundary layer mixing schemes.

Figure 9 shows the sea surface temperature (SST) bias averaged over the years 2003–2017 for each 1° grid point. SST biases are similar in the two runs with minimal differences, which is expected since the atmospheric fields are prescribed and not coupled. SST around the eastern Pacific and Atlantic equatorial regions shows a slightly warmer bias for the ePBL_NN run than for the ePBL. In the Indian Ocean, the bias is slightly colder. The SST bias in the Gulf Stream and Kuroshio current is slightly warmer in ePBL_NN by about 0.5°C. The response of the SST to ePBL_NN in the boundary current regions indicates that changes in the vertical viscosity or diffusivity also impacts the circulation in certain regions.

Changes in the patterns of SST can be due to changes in the mixed layer depth and the surface heat fluxes. The heat fluxes are computed as a function of SST, surface ocean velocity and ice cover as stated in Adcroft et al. (2019) and Griffies et al. (2016).

Summer and winter mixed layer depths (MLD) are compared, a metric usually used to indicate the depth at which atmospheric influences are directly felt in the ocean. Here, winter (summer) mixed layer depth is the maximum (minimum) of the monthly averaged MLDs for each grid point over the period 2003–2017. The MLD depends on the definition, and we evaluate it using two criteria: Reichl et al. (2022) and de Boyer Montégut et al. (2004). The criterion from de Boyer Montégut et al. (2004) uses a threshold potential density of 0.03 kg/m^{3} whereas Reichl et al. (2022) uses a threshold potential energy anamoly of 25 J/m^{2} to define the MLD. Figures 10 and 11 show the MLD using the potential energy anomaly criterion and the potential density respectively.

Figures 10 and 11 show summer time MLD. The summer time MLD bias has reduced significantly in ePBL_NN as compared to ePBL. The average bias reduced from 7.22 to 5.73 m as seen in Figure 11. Between −20° and 20° latitude, the average RMSE for MLD bias in ePBL was about 7.94 m. In ePBL_NN, the bias was reduced to 5.18 m. We have shown the latitude dependency of RMSE between model and observations in Supporting Information S1 (see Figures S1 and S2 in Supporting Information S1). The ePBL_NN scheme performs better under stable surface heating conditions than the ePBL scheme. The shallow MLD bias reduction has implications for equatorial oceanic regions and its effect on large‐scale ocean‐atmosphere feedbacks (Adcroft et al., 2019). Winter MLD biases (Figures 12 and 13) are very similar for both runs. The ePBL_NN predicts diffusivity close to a second moment scheme but does not significantly impact the winter time bias simulated by the model with the original ePBL scheme. This is likely because other model physics and factors can dominate in setting the deep convective mixed layers and water properties.

Although ePBL_NN has been trained on all the regimes including surface cooling, a different scheme or process might be compensating the effects of improved diffusivity. This could also be due to higher sensitivity of shallow mixed layers to changes in surface forcing than deep mixed layers. For shallow mixed layer depth, any perturbations in the atmospheric forcing will reach the base of the boundary layer quicker than it would reach in deeper layers. In Reichl and Hallberg (2018), the rate of conversion of turbulent kinetic energy to potential energy within the boundary layer (left hand side in Equation 7) uses a parameterization that depends on *h*. Changing the diffusivity can alter *h* which in turn modulates the rate of energy conversion. This can lead to changes in the MLD.

The winter time MLD biases are similar for both runs. The summer time MLD bias shows a further reduction when evaluated using de Boyer Montégut et al. (2004) than with Reichl et al. (2022). It is not unusual to get different values of MLD using different definitions. For both definitions, winter biases in ePBL_NN are not worsened. Qualitative agreement of the reduction in summer bias in ePBL_NN using two different criteria provides strong evidence of ePBL_NN performing better than ePBL in terms of MLD bias reduction under fixed atmospheric forcing conditions.

The final comparison we use to assess the impact of the neural network diffusivities on the model result is the upper ocean temperature stratification (*∂*Θ/*∂z*, where Θ is the potential temperature) in the Equatorial Pacific region. The thermocline in the Equatorial Pacific region plays an important role in El Niño Southern Oscillation dynamics with implications for the Earth's climate system (for example) (Jin & An, 1999). The temperature stratification is shown for a vertical cross section along the equator spanning −220° to −80°E. Figure 14c shows the observational data from ARGO floats (Roemmich & Gilson, 2009). Figure 14b is the *∂*Θ/*∂z* from the original ePBL and shows lower stratification as compared to ARGO observations. The *∂*Θ/*∂z* from ePBL_NN, as seen in Figure 14a, shows significant improvements in the stratification of the upper 50 m of the ocean. Stratification in ePBL_NN is closer to ARGO data in the equatorial region of the Pacific Ocean. The neural network predicted diffusivities help to increase the stratification and make it closer to observations than the simulation with the original ePBL with the ad hoc shape function for diffusivity.

Stratification acts as a barrier to mixing, this warrants further investigation into how ePBL_NN changes transport pathways of heat through the OSBL in different regions of the world's oceans. Overall, the MLD bias is reduced, and stratification has improved for the upper ≈50 m suggesting that ePBL_NN works to fix these two biases in conjunction.

In this study, we apply neural networks to improve the parameterization of the vertical diffusivity in the OSBL. The data used to train the neural networks is obtained using second‐moment closure simulations by running single‐column model under various forcing scenarios, spanning the possible range of present‐day and future conditions. The neural networks are implemented within the existing physics‐based parameterization from the ePBL framework of Reichl and Hallberg (2018). The neural networks augment the method to determine the vertical diffusivity in the ePBL scheme with data‐driven relations but maintain the physically motivated energetic constraints on mixing from the original scheme. A benefit of our approach is that it yields a stable implementation in the OGCM (MOM6). This enables us to perform decade‐scale simulations spanning 1958 to 2017.

Atmospherically forced Ice‐Ocean experiments using the GFDL's MOM6 1/4° model suggest an overall improved performance due to the enhancements in ePBL_NN relative to the original scheme. There is a reduction in biases of summer‐time MLD and no exacerbation of the winter‐time biases compared to ePBL. The stratification of the upper ocean in the tropical pacific shows improvements in the thermocline compared to the ARGO float observations. This analysis indicates that the resulting scheme is suitable for implementation in future OGCM configurations and experiments and is expected to reduce biases in climate simulations. Further analysis using a wider range of diagnostics in additional model configurations will be particularly beneficial.

The ePBL framework is already optimized for GCMs, providing larger time stepping capabilities (≈*O*(1) hr) and ePBL_NN leverages these advances with improved diffusivity profiles. It is computationally expensive to run a second‐moment scheme in a GCM due to time‐stepping restrictions (≈*O*(10 − 100) s), but ePBL_NN can yield eddy diffusivity profiles more consistent with a SMC within the original ePBL framework. This is significant for GCMs as we are achieving closer results to a model having a second‐order turbulence closure scheme, but able to maintain coarse resolutions and long time steps needed for climate scale simulations. We note that the longer implicit time step used in the numerics of ePBL (see Reichl & Hallberg, 2018) can lead to a smoothing effect which can complicate resolving small‐scale structure, but we observe that the large‐scale evolution is tracked accurately.

While the results of this work are promising, numerous aspects remain important for future work. For example:

ePBL, ePBL_NN, and the SMC considered here assume downgradient diffusion and hence have no nonlocal flux terms. The representation of nonlocal fluxes could improve the scheme further and potentially affect convective regions and Langmuir turbulence (e.g., Chor et al., 2021). In this application we do not explicitly consider the impact of Langmuir turbulence within ePBL (though it is part of setting the energy available for entrainment, see Reichl & Li, 2019).

The neural networks can be made larger to capture more complex relationships in the data by increasing the number of hyper‐parameters (hidden nodes). In this work we chose small networks for initial investigation. The successful use of small neural networks as efficient surrogate models of SMCs proves that we can replicate the behavior of complex models with high fidelity. Increasing the network size will be explored in the future and will likely require using GPUs for implementation in OGCM (Zhang et al., 2023).

The performance of the modified vertical mixing scheme in a coupled model (atmosphere‐ocean‐ice) may not show the same impact on model bias as observed in this forced ocean‐ice model. The atmosphere‐ocean feedbacks will require exploration in future work.

Improving the representation of the diffusivity profile has implications for many quantities that have gradients within the boundary layer. For example, changing the diffusivity of nutrients within the euphotic layer has implications for biogeochemical processes such as primary production. The implications of improved diffusivity for ecological modeling will be explored in future work.

Finally, we have trained on one SMC, the *k* − *ϵ* model with stability functions following Schumann and Gerz (1995). This parameterization was chosen for consistency with Reichl and Hallberg (2018), but alternative SMC models may yield different results. In future work this process will be repeated with different SMC schemes to understand the influence of SMC diffusivities on the performance of GCMs. One disadvantage is that SMCs have been assumed to be the “truth” but it might lack realism and hence future work will focus on including data from LES studies and observations.

One key achievement of this work is that it establishes a relationship between the shape function of upper ocean vertical mixing and the forcing parameters. Previous work in similar first‐order upper ocean mixing parameterizations assumes that the shape function is fixed, or was set by ad‐hoc approximations. This work further suggests that models that consider this variation in the shape function are more skillful at simulating upper ocean stratification and ocean mixed layers. The physics‐informed function (network) developed in this work for determining the shape function from the forcing parameters is applied here in ePBL as an example. However, the function is not specific to ePBL and can also be used within other first order OSBL parameterizations (such as KPP, Large et al., 1994; Van Roekel et al., 2018).

It is also important to consider that the neural network based model used in this work is not the only approach to find a relationship between the forcing terms and the vertical mixing profile. The neural network is able to establish the existence of a relationship between its input and outputs, which is learned during the training process. While the neural network can be applied in ocean models as‐is to improve simulations, we also desire an in‐depth understanding of the patterns in the inputs that the network used to make its skillful predictions. In future work, we seek to relate the network's findings to the processes that govern the OSBL's behavior (e.g., with equation discovery). This may ultimately lead to a simpler, interpretable and computationally low‐cost physics based model for the shape function that can be learned from the neural network and applied in ocean models.

A second implication of this work is demonstrating the potential for neural networks to improve parameterizations in ocean models. This implication is in agreement with several similar previous studies in earth system modeling (e.g., O'Gorman & Dwyer, 2018; Yuval & O'Gorman, 2020). As neural networks are not limited to individual processes, future avenues of research on ocean parameterizations will benefit from their usage. For example, neural networks can be applied to incorporate different mixed layer processes such as non‐local fluxes during convection, entrainment, Langmuir turbulence, symmetric instability, surface wave effects, etc. into a single neural network model. Further improvements can be made which incorporate time history to improve predictions under transient forcings. Many existing ocean/atmosphere parameterizations have a physics based parent scheme with a few ad‐hoc components or approximations. These components can be replaced or re‐tuned using our approach or other emerging approaches such as Ensemble Kalman methods, posteriori criteria matching, etc. (e.g., Frezat et al., 2022; Lopez‐Gomez et al., 2022, and references therein). Parameterizations in the form of weights and biases are advantageous because they can be re‐tuned and further optimized to train as additional data, observations, and processes are presented.

The successful application of neural networks in an OGCM simulation unlocks the potential to test the importance of improving a certain process/parameterization in the model. For example, consider a case where the process studies' data exist, but a physics‐based parameterization might be challenging to develop. Neural networks can parameterize that process and its impacts in an OGCM can be explored before going into a detailed parameterization development, which can be resource‐consuming.

One of the major sources of uncertainty in climate models arise from parameterizations due to their inadequate representation of sub‐grid physics. Perhaps, high resolution or shorter time‐steps can attenuate the effects of structural uncertainties in sub‐grid parameterizations. Computational limitations often impose constraints on factors such as resolution, ensemble size, and integration time scales within models. These limitations underscore the need for improving the current generation of climate models, while steering away from relying on higher resolution models or shorter time steps. Combining traditional process‐oriented studies with the emerging field of machine learning offers the potential for synergistic advancements, leading to the refinement of sub‐grid models. We have established a pipeline whereby an existing parameterization is augmented to harness the capabilities of neural networks. The successful integration of neural network within the ePBL, and its application to an ocean model, introduces opportunities for enhancing parameterizations that govern upper ocean mixing in climate models.

In general, turbulent velocity scales are related to turbulent kinetic energy and depend on boundary forcing, *u*_{*} and *B*_{0}. Here, in addition to *u*_{*} and *B*_{0}, we find a dependency of the bulk turbulent velocity scale *v*_{0} on *f*. The bulk velocity is diagnosed by using diffusivity and boundary layer depth from the training data as per Equation 14. To predict *v*_{0} using *f* has been used because we found the model improves in its ability to predict variations in *v*_{0} in the training data. This is evident from Figure A1.

Figure A1 shows the variation of *v*_{0} with respect to latitudes and *h* under surface heating and cooling conditions. This indicates that *f* is a useful input for accurately predicting *v*_{0}. Physically, the inclusion of *f* is related to the role of rotation in limiting the wind‐input of energy and the shear production of turbulence in the boundary layer through Ekman effects. The variation due to *h* is smaller than due to *f*, around 5% of the mean value of *v*_{0} for any particular set of forcing (*f*, *B*_{0}, *u*_{*}). Since the implementation and generalization is significantly easier if the network only depends on external forcing parameters, we choose to include *f* as an input to the network and neglect *h*.

Table 1 gives the range of the forcing parameters covered in the training data set. A natural question is how much of the variability observed in GCM simulations is covered in the training data. We can estimate this using Shannon entropy (Shannon, 1948) which measures the amount of uncertainty and variability in a variable (Carcassi et al., 2021; Sane et al., 2020, 2021).

Shannon entropy of an event *x*_{i} is given by *information* or surprise related to the event. We only use discrete probability distributions. Low probability events have high Shannon entropy because they cause more surprise compared to high probability events. It is a non‐parametric measure and does not make any assumption about the distribution. *u*_{*} and *B*_{0} are non‐Gaussian (Figure B1).

For *u*_{*}: H(*u*_{*} > 0.03 m/s) ≈ 95.5% and for *B*_{0}: H(|*B*_{0}| > 2.1 × 10^{−7} m^{2}/s^{3}) ≈ 86%. This can be interpreted as the values *u*_{*} > 0.03 have 95.5% uncertainty associated with them. So leaving out values of *u*_{*} for which *u*_{*} > 0.03 removes 95.5% uncertainty from the training data. This is a simplistic estimate and assumes *u*_{*} and *B*_{0} are independent. These estimates show that our training data covers 95.5% variability for *u*_{*} and 86% of *B*_{0} as observed under realistic conditions in a GCM.

The training data points are uniform and although they cover most of the range seen in realistic conditions, the training data does not follow the same marginal probability distribution of *u*_{*} and *B*_{0} as well as the joint probability distribution between *u*_{*}, *B*_{0}. For machine learning application of parameterization development the consequence of sampling from joint distribution of variables from realistic conditions versus having uniformly spaced forcing is unknown as of now and will be left for future study.

See Table C1.

AS, AA and LZ received M^{2}LInES research funding through the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program. AA was also supported by award NA18OAR4320123, from the National Oceanic and Atmospheric Administration (NOAA), U.S. Department of Commerce. The statements, findings, conclusions, and recommendations are those of the author(s) and do not necessarily reflect the views of the National Oceanic and Atmospheric Administration, or the U.S. Department of Commerce. We were intellectually supported by various other members of the M^{2}LInES project. We used the Stellar computational resources provided by Princeton University and the National Oceanic and Atmospheric Administration (NOAA) Geophysical Fluid Dynamics Laboratory (GFDL). We thank Dr. Enrico Zorzetto and Dr. Robert Hallberg for providing feedback for this article. We also thank Dr. Jun‐Hong Liang and two anonymous reviewers for reviewing and providing profound insights, which led us to improve this manuscript. The authors thank the international Argo project and the various associated national programs for collecting and freely distributing the data set. AS thanks his wife for showing remarkable grit in helping to improve the plain language summary.

The code and the data can be obtained from