This is an open access article under the terms of the Creative Commons Attribution‐NonCommercial‐NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non‐commercial and no modifications or adaptations are made.

This paper derives a criterion for deciding conditional independence that is consistent with small‐sample corrections of Akaike's information criterion but is easier to apply to such problems as selecting variables in canonical correlation analysis and selecting graphical models. The criterion reduces to mutual information when the assumed distribution equals the true distribution; hence, it is called mutual information criterion (MIC). Although small‐sample Kullback–Leibler criteria for these selection problems have been proposed previously, some of which are not widely known, MIC is strikingly more direct to derive and apply.

Conditional independence is fundamental to statistical inference (Dawid, 1979). Many tests for conditional independence have been proposed, including tests based on partial correlation, conditional characteristic functions (Su & White, 2007), Hellinger distance (Su & White, 2008), maximal nonlinear conditional correlation (Huang, 2010), and projection‐based distance covariance (Fan et al., 2020). These and other criteria are developed from a hypothesis test framework, which has well‐known limitations in multiple testing situations (Burnham & Anderson, 2002). An alternative approach to deciding conditional independence is based on Kullback–Leibler (KL) criteria, such as Akaike's information criterion (AIC). AIC is an attractive alternative because it can be applied to multiple testing problems, it does not require specifying an arbitrary significance level, it accounts for out‐of‐sample variability, and it is derived from a proper score for selecting probability density functions (PDFs) (Akaike, 1973). Nevertheless, applying AIC to decide conditional independence generally requires maximizing a likelihood function subject to the constraint indicated by conditional independence. Such constrained optimization problems can be difficult to solve, which has hindered the application of AIC to such problems.
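To make the contrast with hypothesis testing concrete, the following minimal sketch (synthetic data and model names are assumed for illustration) selects regression variables by fitting each candidate model and comparing Gaussian AIC values; no significance level or constrained optimization is involved:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)            # irrelevant candidate predictor
y = 2.0 * x1 + rng.normal(size=N)

def aic(y, predictors):
    """Gaussian AIC up to an additive constant: N log(sigma^2) + 2k."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)   # MLE of the noise variance
    return len(y) * np.log(sigma2) + 2 * X.shape[1]

candidates = {"intercept": [], "x1": [x1], "x1+x2": [x1, x2]}
scores = {name: aic(y, cols) for name, cols in candidates.items()}
best = min(scores, key=scores.get)          # model with smallest AIC
```

Each candidate is scored by an unconstrained fit; the selected model is simply the one with the smallest AIC.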

A clue to a simpler approach comes from regression model selection. In regression model selection, AIC is relatively easy to apply because one simply includes the variables that appear in the regression model and excludes the others. In particular, the relevant AIC does not require solving a constrained optimization problem. For selection problems that cannot be reduced to regression model selection, the question arises as to whether there exists a criterion similar to AIC that does not require solving constrained maximum likelihood problems, and yet can be evaluated by excluding the variables that are conditionally independent of the retained variables. The purpose of this paper is to derive such a criterion.

We begin by seeking a criterion whose differences equal the differences in KL divergences in the case of selecting explanatory variables of a regression model. This ensures that the criterion recovers regression model selection. Then, we add one more condition, namely, that the criterion should be symmetric, in the sense that the criterion does not depend on which variables are labelled response and explanatory. Remarkably, only one quantity satisfies these conditions. This quantity reduces to mutual information when the model PDF equals the true PDF. Accordingly, we call this quantity mutual information criterion ($\mathbb{MIC}$). This paper demonstrates that $\mathbb{MIC}$ is our desired criterion.

Sample estimates of $\mathbb{MIC}$ can be derived based on AIC. Naturally, the resulting estimates share the same limitations as AIC. One well‐known limitation of AIC is that it tends to select overfitted models. A standard fix to this problem is to use a small‐sample corrected version called AICc (Hurvich & Tsai, 1989). Unfortunately, AICc implicitly assumes that the explanatory variables are the same between training and verification samples (DelSole & Tippett, 2021; Rosset & Tibshirani, 2020; Tian et al., 2020). We show that this assumption implies that AICc is not guaranteed to make consistent decisions about conditional independence. Therefore, AICc is not appropriate for estimating $\mathbb{MIC}$. The appropriate small‐sample correction to AIC that accounts for independent training and verification samples has been derived recently by DelSole and Tippett (2021) and Tian et al. (2020) (here called AICr). This criterion is used to derive an estimate of $\mathbb{MIC}$, called MIC. Because MIC is based on the newly derived AICr rather than AIC or AICc, it improves upon previous criteria even for the extensively studied case of regression model selection.

The problem of selecting both response and explanatory variables is more formidable. However, MIC provides a very reasonable small‐sample criterion for selecting explanatory and response variables and is well suited for selecting variables for canonical correlation analysis (CCA). Another application of MIC is to select graphical models. Graphical models provide a visual summary of various conditional independencies among variables. Conditional independence implies that an associated conditional mutual information vanishes. We derive an analogous criterion called conditional MIC that provides a small‐sample criterion for selecting graphical models.

In a series of papers, Yasunori Fujikoshi derived small‐sample criteria for many of the above selection problems by explicitly maximizing the likelihood function under the appropriate hypothesis of conditional independence (Fujikoshi, 1982, 1985; Fujikoshi et al., 2010). We show that differences in MIC are equivalent to each of these criteria derived by Fujikoshi (after accounting for slight differences in formulation). Despite these earlier derivations, the derivation presented here is of considerable value because of its greater simplicity compared to previous derivations. The basis of this simplification is that KL divergences satisfy certain identities called chain rules. These chain rules can be used to convert certain constrained maximum likelihood problems into unconstrained problems. As a result, small‐sample criteria for conditional independence can be derived from these chain rules, thereby avoiding direct maximization of the likelihood function, which often requires intricate matrix manipulation.

Let **x** and **y** be random vectors with a joint PDF *p*(**x**, **y**). In practice, the true PDF is unknown. Our goal is to identify an approximate PDF by deciding if the PDF has *structure* and then to estimate the PDF under this constraint. Let *q*(**x**, **y**) denote a candidate PDF without structure, and let *q*_{1}(**x**, **y**), *q*_{2}(**x**, **y**), … denote candidate PDFs with different structures. Our criterion for choosing a particular structure is that it minimizes the KL divergence between *p*(**x**, **y**) and the candidate PDF or, equivalently, minimizes the cross entropy. ${\mathbb{H}}_{i}(XY)$ is called the *cross entropy* between *p* and *q*_{i} (ignoring an irrelevant factor of 2). $\mathbb{H}(XY)$ (with no subscript) denotes the cross entropy between *p* and *q*. When $p=q$, cross entropy equals (twice) the entropy of *p*(**x**, **y**).

The criterion for selecting structure is well developed in the special case of selecting regression models. To be precise, the selection of regression models will be called *X*‐selection and defined as follows.

*X*‐selection

A regression model (also called a prediction model) is effectively a conditional PDF *q*_{i}(**y**|**x**), where the first and second variables are called response and explanatory, respectively. The prediction model is related to the joint PDF as

The *X*‐selection problem is to select one prediction model from a set of candidate models *q*_{1}(**y**|**x**_{1}), *q*_{2}(**y**|**x**_{2}), …. The candidate PDFs are restricted to ones in which the prediction models differ in their explanatory variables **x**_{1}, **x**_{2}, …, each of which is a subset of **x**, but have the same response variable **y**. It is assumed that each prediction model equals the unconstrained PDF conditioned on the appropriate subset of explanatory variables:

The first equality states that certain *X* variables may be omitted from *q*_{i}(**y**|**x**) without changing the prediction, and the second equality states that the resulting prediction model equals the unconstrained PDF *q*(**y**|**x**_{i}). Aside from this, no further structure is imposed. In particular, no structure is imposed on *q*_{i}(**x**):

It follows from (2)–(4) that

This identity shows that the joint PDF can be written as a product of PDFs where structure is imposed by omitting *X* variables in the conditional PDF. Note that the joint PDF *q*_{i}(**x**, **y**) depends on the full **x** even when the prediction model *q*(**y**|**x**_{i}) depends only on a proper subset of **x**. Variables that can be omitted from conditionals are said to be *redundant*.
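Redundancy can be illustrated in a small Gaussian population example (all covariance values below are assumed for illustration): when cov(**y**, **x**₂ | **x**₁) = 0, the population regression coefficient on **x**₂ vanishes, so **x**₂ may be omitted from the conditional PDF even though the joint PDF still involves it:

```python
import numpy as np

# Toy Gaussian population: x1 ~ N(0,1), x2 = x1 + e, y = 2*x1 + e',
# so cov(y, x2 | x1) = 0 by construction.
Sxx = np.array([[1.0, 1.0],
                [1.0, 2.0]])      # cov of (x1, x2)
Syx = np.array([[2.0, 2.0]])      # cov of y with (x1, x2)

# Population regression coefficients of y on (x1, x2): B = Syx Sxx^{-1}
B = Syx @ np.linalg.inv(Sxx)
# B = [[2, 0]]: x2 drops out of q(y | x1, x2), i.e., x2 is redundant,
# even though the joint PDF of (x1, x2, y) still depends on x2.
```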

It follows from (1) and (2) that cross entropy satisfies the chain rule for *X*‐selection,

Lemma 1 implies that under *X*‐selection,

Because only differences in cross entropy affect selection, this identity shows that, under *X*‐selection, selecting prediction models based on $\mathbb{H}(Y|{X}_{i})$ is equivalent to selecting structured PDFs based on ${\mathbb{H}}_{i}(XY)$. Importantly, the left‐hand side of (8) involves structured PDFs while the right‐hand side involves only unstructured PDFs. This fact will become important later when we derive estimates of cross entropy—the left‐hand side will require solving constrained maximum likelihood problems, whereas the right‐hand side will not.

Not all selection problems can be reduced to *X*‐selection. For instance, in CCA, both *X* and *Y* variables are selected. We call this *simultaneous selection*. $\mathbb{H}(Y|X)$ is not a meaningful criterion for simultaneous selection because *Y* differs between models. For instance, $\mathbb{H}(Y|X)$ is a proxy for prediction error, and comparing prediction errors of different quantities with different units is not meaningful. In such cases, the natural approach is to define the structure in *q*_{i}(**x**, **y**) associated with the selection problem and then compute the corresponding cross entropy ${\mathbb{H}}_{i}(XY)$. However, this approach inevitably leads to solving a constrained maximum likelihood problem, which can be difficult. We seek an alternative approach that avoids solving a constrained maximum likelihood problem, similar to the way regression model selection avoids this problem. More precisely, we seek a criterion that can be computed by omitting redundant *X* and *Y* variables from the calculation, just as $\mathbb{H}(Y|X)$ can be computed by omitting redundant *X* variables from the prediction model. Let this new criterion be denoted $\mathbb{MIC}(X;Y)$, where explanatory and response variables are separated by a semicolon. The first natural requirement is that it should be consistent with cross entropy for *X*‐selection.

$\mathbb{MIC}(X;Y)$ is said to be consistent with cross entropy for *X*‐selection if for all *q*(**y**, **x**_{1}, **x**_{2}) and *p*(**y**, **x**_{1}, **x**_{2}),

A second requirement is that it should be suitable for simultaneous selection, particularly for selecting variables in CCA. Importantly, CCA does not distinguish response and explanatory variables. Therefore, we seek a criterion that satisfies the following property.

A selection criterion is said to be *symmetric* if it does not depend on which variables are labelled response and explanatory.

Clearly, $\mathbb{H}(Y|X)$ is not symmetric, since $\mathbb{H}(Y|X)=-2{\mathbb{E}}_{XY}[\mathrm{log}{q}_{Y|X}(\mathbf{y}|\mathbf{x})]\ne -2{\mathbb{E}}_{XY}[\mathrm{log}{q}_{X|Y}(\mathbf{x}|\mathbf{y})]=\mathbb{H}(X|Y)$. On the other hand, $\mathbb{H}(XY)$ is symmetric, but it is not consistent with cross entropy since $\mathbb{H}({X}_{1}Y)-\mathbb{H}({X}_{2}Y)=\mathbb{H}(Y|{X}_{1})-\mathbb{H}(Y|{X}_{2})+\mathbb{H}({X}_{1})-\mathbb{H}({X}_{2})$. In general, $\mathbb{H}({X}_{1})-\mathbb{H}({X}_{2})\ne 0$. The criterion that is both symmetric and consistent with cross entropy is given in the following proposition.
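The asymmetry of $\mathbb{H}(Y|X)$ is easy to see numerically in the Gaussian case, where conditional entropies are monotone in the log conditional variances (parameter values below are assumed for illustration):

```python
import numpy as np

# Bivariate normal with unequal variances (values assumed for illustration)
sx, sy, rho = 1.0, 3.0, 0.6

# Conditional variances: var(y|x) = sy^2 (1 - rho^2), var(x|y) = sx^2 (1 - rho^2)
h_y_given_x = np.log(sy**2 * (1 - rho**2))   # tracks H(Y|X) up to constants
h_x_given_y = np.log(sx**2 * (1 - rho**2))   # tracks H(X|Y) up to constants

# Their difference is log(sy^2 / sx^2), nonzero whenever sx != sy,
# whereas mutual information depends only on rho and is symmetric:
mi = -0.5 * np.log(1 - rho**2)
```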

To within an additive constant, the only criterion that is both symmetric and consistent with cross entropy for *X*‐selection is

Let $\mathbb{H}(Y|{X}_{1}{X}_{2})$ and $\mathbb{H}(Y|{X}_{1})$ denote cross entropies for *q*(**y**|**x**_{1}, **x**_{2}) and *q*(**y**|**x**_{1}), respectively. By assumption, $\mathbb{MIC}(X;Y)$ is consistent with cross entropy for *X*‐selection; hence,

Rearranging this equation gives

The absence of *X*_{2} on the right implies that the right‐hand side is a functional of the distribution of **x**_{1} and **y** only. It follows that the left‐hand side has this same dependence. Repeating the above argument but with the roles of **x**_{1} and **x**_{2} swapped leads to the conclusion that the left‐hand side is a functional of the joint distribution of **x**_{2} and **y** only. These two properties hold for arbitrary *p*(**x**_{1}, **x**_{2}, **y**) only if the left‐hand side is a functional of the distribution of **y** only. That is,

where *f*(*Y*) is some functional of *q*(**y**) and *p*(**y**). Similar arguments, but swapping the roles of X and Y, give

where *g*(*X*) is some functional of *q*(*X*) and *p*(*X*). By assumption, $\mathbb{MIC}$ is symmetric; hence, $\mathbb{MIC}(X;Y)=\mathbb{MIC}(Y;X)$. Therefore, $\mathbb{MIC}$ may be eliminated from (15) and (16) to give

Substituting the chain rule (6) into (17) and rearranging terms gives

The left‐hand side does not depend on the distribution of *Y*, and the right‐hand side does not depend on the distribution of *X*. The only way that this identity can hold for arbitrary distributions is that the two sides must equal a constant. Therefore,

where *α* is a constant. Since only differences in $\mathbb{MIC}$ are important, the constant *α* may be set to zero without loss of generality. Solving for *f*(*Y*) and substituting into (15) determines $\mathbb{MIC}$ uniquely and yields (11). Equations (10) and (12) follow from (11) by the chain rule (6).

To our knowledge, $\mathbb{MIC}$ has not appeared in the literature. If $p(\mathbf{x},\mathbf{y})=q(\mathbf{x},\mathbf{y})$, then $\mathbb{MIC}(X;Y)=-2\mathbb{I}(X;Y)$, where $\mathbb{I}(X;Y)$ is the *mutual information* between **x** and **y**. Just as $\mathbb{H}$ is cross entropy (times 2), $\mathbb{MIC}$ may be called *cross mutual information* (times −2). Anticipating its application to variable selection, we call $\mathbb{MIC}$ mutual information criterion. The explicit dependence of $\mathbb{MIC}(X;Y)$ on the model PDF is
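The reduction to mutual information can be checked for a bivariate normal, for which $\mathbb{I}(X;Y)=-\tfrac{1}{2}\mathrm{log}(1-{\rho}^{2})$; the sketch below (correlation value assumed) recovers this from the Gaussian entropy formula:

```python
import numpy as np

def gauss_entropy(cov):
    """Differential entropy of N(0, cov): 0.5 * (d log(2 pi e) + log|cov|)."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

rho = 0.7
joint = np.array([[1.0, rho], [rho, 1.0]])

# I(X;Y) = H(X) + H(Y) - H(X,Y); for this bivariate normal it equals
# the closed form -0.5 * log(1 - rho^2)
mi = gauss_entropy(1.0) + gauss_entropy(1.0) - gauss_entropy(joint)
closed_form = -0.5 * np.log(1 - rho**2)
```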

Conditional $\mathbb{MIC}$ can be defined analogously to conditional mutual information:

$\mathbb{MIC}$ satisfies chain rules analogous to mutual information; for example,

Although MIC is consistent with cross entropy for *X*‐selection, this does not guarantee that it is a sensible criterion for simultaneous selection. To show the latter, we first clarify the structure associated with *X*‐selection.

*X* variables will be partitioned as $\mathbf{x}={({\mathbf{x}}_{K}^{T}\phantom{\rule{0.1em}{0ex}}{\mathbf{x}}_{R}^{T})}^{T}$, where **x**_{K} denotes the *M*_{K} variables to keep and **x**_{R} denotes *M*_{R} variables either to remove or retain. Similarly, *Y* variables will be partitioned as $\mathbf{y}={({\mathbf{y}}_{K}^{T},{\mathbf{y}}_{R}^{T})}^{T}$, where **y**_{K} and **y**_{R} have dimensions *P*_{K} and *P*_{R}.

Under *X*‐selection, the decision to remove **x**_{R} from the prediction model depends on the cross entropies of *p*(**y**|**x**_{K}, **x**_{R}) and *p*(**y**|**x**_{K}). The structure relevant to this problem is (3), which is expressed below in the notation of Definition 4, where *ω* denotes the appropriate structural constraint on the PDF. Under (22), ${\mathbb{H}}_{\omega}(Y|{X}_{K}{X}_{R})=\mathbb{H}(Y|{X}_{K})$, and therefore, deciding to remove **x**_{R} is indistinguishable from deciding that the model PDF satisfies (22). The first equality in (22) asserts that, under the model PDF, **y** and **x**_{R} are *conditionally independent* given **x**_{K}. We denote this condition as

Conditional independence defines a particular structure on a PDF. Importantly, conditional independence can be expressed through *q*(·) in different ways. For instance, by repeated application of the probability law (2), the structure (22) can be expressed equivalently as

These expressions are equivalent statements that the model PDF satisfies *Y* ⊥ *X*_{R} | *X*_{K}. This equivalence allows us to prove the following.

Under conditional independence *ω* : *Y* ⊥ *X*_{R} | *X*_{K}, differences in $\mathbb{MIC}$ equal the corresponding differences in cross entropy of the constrained PDF *q*_{ω}(**x**_{K}, **x**_{R}, **y**) defined in (27).

Computing the cross entropy of (27) yields ${\mathbb{H}}_{\omega}(Y{X}_{R}{X}_{K})=\mathbb{H}(Y{X}_{K})+\mathbb{H}({X}_{R}{X}_{K})-\mathbb{H}({X}_{K})$, and therefore,

Proposition 2 shows that $\mathbb{MIC}$ is consistent with cross entropy for deciding conditional independence (23). By analogy, we anticipate that simultaneous selection corresponds to selecting some form of conditional independence. To define this form, note that simultaneous selection asks whether (**x**_{R}; **y**_{R}) should be included with (**x**_{K}; **y**_{K}). By analogy with *X*‐selection, the criterion for simultaneous selection should be based on comparing MIC with and without the potentially redundant variables (**x**_{R}; **y**_{R}), that is, based on comparing $\mathbb{MIC}({X}_{K}{X}_{R};{Y}_{K}{Y}_{R})$ to $\mathbb{MIC}({X}_{K};{Y}_{K})$. The structure required for this difference in $\mathbb{MIC}$ to equal the difference in cross entropies between unstructured and structured PDFs is given next.

The constraint *ψ* is

For clarity, we note that (30) can be expressed in other equivalent forms using logical equivalences (an example is (73) below; see Dawid, 1979).

Expanding $\mathbb{MIC}$ using definition (10) and rearranging terms gives

Comparison with (29) implies

For the expectation to vanish for any true PDF *p*(**x**_{K}, **x**_{R}, **y**_{K}, **y**_{R}), the argument of the log must equal one. The first parenthesized term is the only one that depends on **y**_{R}. By familiar arguments in separation of variables, this term must equal a constant, and that constant must be one to ensure that the model PDFs integrate to one. Given this result, the second parenthesized term is the only one that depends on **y**_{K}; hence, by similar arguments, it too must equal one. Given these two results, the last parenthesized term must equal one, implying that *ψ* does not impose structure on *q*(**x**). It follows that

which are the constraints in (30). The corresponding constrained joint PDF is

This proves the “only if” part. To prove the “if” part, note that *ψ* in (30) implies (35), which implies (32), which if substituted in (29) yields (31).

To clarify the reasonableness of (23) and (30), the following proposition describes their consequences in terms of CCA.

Under (23) or (30), restricting CCA to **x**_{K} and **y**_{K} does not alter the canonical correlations.

Consider CCA of **x** and **y**, which yields a projection vector pair **u** and **v** such that the correlation between **u**^{T}**x** and **v**^{T}**y** equals the canonical correlation. Following Definition 4, partition $\mathbf{u}={({\mathbf{u}}_{K}^{T}{\mathbf{u}}_{R}^{T})}^{T}$ and $\mathbf{v}={({\mathbf{v}}_{K}^{T}{\mathbf{v}}_{R}^{T})}^{T}$. If (23) is true, then ${\mathbf{u}}_{R}=\mathbf{0}$ for all canonical correlations. If (30) is true, then ${\mathbf{u}}_{R}=\mathbf{0}$ and ${\mathbf{v}}_{R}=\mathbf{0}$ for all canonical correlations. In either case, the canonical correlations for (**x**_{K}; **y**_{K}) are identical to those of (**x**; **y**).

The constraint *ψ* in (30) can be written in terms of *q*_{ψ}(·) as in (33) and (34), which in turn can be written, respectively, as

Let ${\text{cov}}_{\psi}[{\mathbf{y}}_{R},\mathbf{x}|{\mathbf{y}}_{K}]$ denote the conditional covariance matrix between **y**_{R} and **x** given **y**_{K} under model PDF *q*_{ψ}(**x**_{K}, **x**_{R}, **y**_{K}, **y**_{R}). Then (36) implies

Under both covariance constraints in (37), Fujikoshi (1982) showed that ${\mathbf{u}}_{R}=\mathbf{0}$ and ${\mathbf{v}}_{R}=\mathbf{0}$. Under the second identity in (37), Fujikoshi et al. (2010) showed that ${\mathbf{u}}_{R}=\mathbf{0}$. In both cases, the canonical correlations for (**x**_{K}; **y**_{K}) are identical to those of (**x**; **y**). This completes the proof.
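This invariance of the canonical correlations can be illustrated with a small population example (covariance values assumed): when **x**_{R} is independent of (**x**_{K}, **y**), the canonical correlations computed from the full covariance blocks equal those computed after dropping **x**_{R}:

```python
import numpy as np

def canonical_corrs(Sxx, Sxy, Syy):
    """Population canonical correlations from covariance blocks."""
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    evals = np.sort(np.linalg.eigvals(M).real)[::-1]
    return np.sqrt(np.clip(evals, 0.0, None))

# Assumed toy covariance: x = (x_K, x_R) with x_R independent of (x_K, y),
# corr(x_K, y) = 0.8, all unit variances.
Sxx = np.eye(2)
Sxy = np.array([[0.8], [0.0]])
Syy = np.array([[1.0]])

full = canonical_corrs(Sxx, Sxy, Syy)               # CCA of (x_K, x_R; y)
reduced = canonical_corrs(np.eye(1), Sxy[:1], Syy)  # CCA of (x_K; y)
# Leading canonical correlation is 0.8 in both cases.
```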

The above considerations have ignored the fact that model PDFs generally involve parameters that are unknown and must be estimated from finite samples. This estimation can lead to overfitting and must be taken into account. Let $q(\mathbf{Y}|\mathbf{X};{\mathit{\theta}}_{Y|X})$ denote the PDF model for predicting **Y** given **X** with parameters ${\mathit{\theta}}_{Y|X}$. We follow Akaike (1973) by using maximum likelihood estimates (MLEs) for the parameters. Accordingly, let ${\widehat{\mathit{\theta}}}_{Y|X}$ denote the MLE of ${\mathit{\theta}}_{Y|X}$ derived from the sample $(\widehat{\mathbf{X}},\widehat{\mathbf{Y}})$. A fundamental principle in model selection is to judge model performance based on how well the model predicts an *independent* sample (**X**_{0}, **Y**_{0}). Following Akaike (1973), we average the cross entropy for $q({\mathbf{Y}}_{0}|{\mathbf{X}}_{0};{\widehat{\mathit{\theta}}}_{Y|X})$ over $(\widehat{\mathbf{X}},\widehat{\mathbf{Y}})$ and (**X**_{0}, **Y**_{0}), which have identical distributions but are independent of each other. The result is $\mathbb{AIC}$:

Under normality, the PDF model satisfies $q(\mathbf{X},\mathbf{Y};{\widehat{\mathit{\theta}}}_{XY})=q(\mathbf{X};{\widehat{\mathit{\theta}}}_{X})q(\mathbf{Y}|\mathbf{X};{\widehat{\mathit{\theta}}}_{Y|X})$, where ${\widehat{\mathit{\theta}}}_{XY},{\widehat{\mathit{\theta}}}_{X},{\widehat{\mathit{\theta}}}_{Y|X}$ are MLEs of the parameters in the respective PDF models (this identity does not hold in general; Barndorff‐Nielsen, 1976). As a result of this identity, $\mathbb{AIC}$ satisfies the chain rule

The Akaike‐type extension of $\mathbb{MIC}$ is defined as

To within an additive constant, the only criterion that is both symmetric and whose differences equal the corresponding differences in $\mathbb{AIC}$ is $\mathbb{MIC}\mathrm{a}$.

In the proof for Proposition 1, replace $\mathbb{H}$ everywhere by $\mathbb{AIC}$. Then, the proof follows the same steps. In particular, the analogous expression for (14) has the right‐hand side $\mathbb{MIC}\mathrm{a}({X}_{K};Y)-\mathbb{AIC}(Y|{X}_{K})$, which still is a functional of the distribution of **x**_{K} and **y** only, because $q(\mathbf{y}|{\mathbf{x}}_{K};{\widehat{\mathit{\theta}}}_{Y|{X}_{K}})$ does not depend on **x**_{R}. Also, $\mathbb{AIC}$ satisfies the chain rule (39), so the step from (17) to (18) is essentially the same as for $\mathbb{H}$.

Proposition 5 implies that estimates of $\mathbb{MIC}\mathrm{a}$ follow from estimates of $\mathbb{AIC}$, and so we consider in some detail unbiased and consistent estimation of $\mathbb{AIC}$. For normal distributions, such estimates can be derived from the regression model (41), in which **Y** and **X** are identified as response and explanatory variables, respectively, **B** and ${\mathit{\mu}}_{Y}$ contain regression coefficients, **j** is a vector of ones to account for the intercept, and **E**_{Y} is a random matrix. Each row of **E**_{Y} is independently distributed as a multivariate normal with zero mean and covariance matrix ${\mathit{\Sigma}}_{Y|X}$. The dimensions are

Let ${\widehat{\mathit{\Sigma}}}_{YY},{\widehat{\mathit{\Sigma}}}_{XX},{\widehat{\mathit{\Sigma}}}_{(XY)}$ be the MLEs of the covariance matrices of **E**_{Y}, **E**_{X}, **E**_{XY}, respectively. These matrices are related through standard identities
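One standard identity of this kind is the Schur‐complement factorization $|{\mathit{\Sigma}}_{XY}|=|{\mathit{\Sigma}}_{XX}|\phantom{\rule{0.1em}{0ex}}|{\mathit{\Sigma}}_{Y|X}|$, which underlies the chain rules used throughout. A numerical check on a randomly generated covariance (dimensions assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Random positive-definite joint covariance for (x, y), with dims 3 + 2
A = rng.normal(size=(5, 5))
S = A @ A.T + 5 * np.eye(5)
Sxx, Sxy = S[:3, :3], S[:3, 3:]
Syx, Syy = S[3:, :3], S[3:, 3:]

# Schur complement: conditional covariance of y given x
Sy_x = Syy - Syx @ np.linalg.solve(Sxx, Sxy)

# log|S| = log|Sxx| + log|Sy_x|, the identity behind the chain rule
lhs = np.linalg.slogdet(S)[1]
rhs = np.linalg.slogdet(Sxx)[1] + np.linalg.slogdet(Sy_x)[1]
```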

Evaluating AICc for each model in (43), noting that each model has only $M=1$ explanatory variable (i.e., the intercept), and using the identity $NP+N(2MP+P(P+1))/(N-M-P-1)=PN(N+M)/(N-M-P-1)$, we obtain the criteria
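The algebraic identity invoked above can be verified numerically over a grid of (*N*, *M*, *P*) values:

```python
from itertools import product

def lhs(N, M, P):
    # NP + N(2MP + P(P+1)) / (N - M - P - 1)
    return N * P + N * (2 * M * P + P * (P + 1)) / (N - M - P - 1)

def rhs(N, M, P):
    # PN(N + M) / (N - M - P - 1)
    return P * N * (N + M) / (N - M - P - 1)

# Check the identity over a grid of sample sizes and dimensions
ok = all(abs(lhs(N, M, P) - rhs(N, M, P)) < 1e-9
         for N, M, P in product([20, 50, 100], [1, 2, 5], [1, 2, 3]))
```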

Conditional independence can be expressed in many different ways. A criterion for conditional independence should make consistent decisions for equivalent formulations. While such consistency is guaranteed for population quantities like cross entropy, it is not guaranteed for sample criteria. The following proposition gives the condition for a sample criterion to give consistent decisions about conditional independence.

Let $\mathcal{A}\mathcal{I}\mathcal{C}(Y|X)=N\mathrm{log}|{\widehat{\mathit{\Sigma}}}_{Y|X}|+\mathcal{P}$ be a sample criterion (note that AICc is of this form). Define the associated chain rule to be $\mathcal{A}\mathcal{I}\mathcal{C}(XY)=\mathcal{A}\mathcal{I}\mathcal{C}(X)+\mathcal{A}\mathcal{I}\mathcal{C}(Y|X)$. If $\mathcal{A}\mathcal{I}\mathcal{C}(Y|X)$ satisfies the chain rule, then it makes consistent decisions about *Y* ⊥ *X*_{R} | *X*_{K}. If it violates the chain rule, then there exists a sample for which it makes contradictory decisions about *Y* ⊥ *X*_{R} | *X*_{K}.

Let *ω* denote the constraint *Y* ⊥ *X*_{R} | *X*_{K}. Therefore, the associated candidate PDF *q*_{ω}(·) satisfies (24)–(27). Based on these identities, the positivity of each of the following quantities is an equally valid criterion for deciding *ω*:

If $\mathcal{A}\mathcal{I}\mathcal{C}$ satisfies the chain rule, then a little algebra shows ${\widehat{\delta}}_{1}={\widehat{\delta}}_{2}={\widehat{\delta}}_{3}={\widehat{\delta}}_{4}$; hence, $\mathcal{A}\mathcal{I}\mathcal{C}$ gives consistent decisions about *Y* ⊥ *X*_{R} | *X*_{K}. Note that ${\widehat{\delta}}_{i}$ is of the form

where $\delta {\mathcal{P}}_{i}$ is a positive, deterministic term that depends only on (*N*, *M*_{K}, *M*_{R}, *P*) and Λ_{K} is independent of *i* because by (45)

In fact, Λ_{K} is a likelihood ratio because ${\widehat{\delta}}_{1},{\widehat{\delta}}_{2},{\widehat{\delta}}_{3},{\widehat{\delta}}_{4}$ are nested comparisons. Therefore, Λ_{K} is a random variable on (0, 1]. Suppose $\mathcal{A}\mathcal{I}\mathcal{C}$ violates the chain rule; hence, for some sample, ${\widehat{\delta}}_{i}\ne {\widehat{\delta}}_{j}$. Then because Λ_{K} does not depend on *i*, $\delta {\mathcal{P}}_{i}\ne \delta {\mathcal{P}}_{j}$ for the parameters (*N*, *M*_{K}, *M*_{R}, *P*) of that sample. Because $-\mathrm{log}{\mathrm{\Lambda}}_{K}$ is a continuous random variable with positive support on [0, *∞*), there is nonzero probability that it lies between $\delta {\mathcal{P}}_{i}$ and $\delta {\mathcal{P}}_{j}$. When this occurs, ${\widehat{\delta}}_{i}$ and ${\widehat{\delta}}_{j}$ have opposite signs and therefore $\mathcal{A}\mathcal{I}\mathcal{C}$ gives contradictory decisions about *Y* ⊥ *X*_{R} | *X*_{K}.

Unfortunately, AICc does *not* satisfy the chain rule; that is, $\text{AICc}(XY)\ne \text{AICc}(X)+\text{AICc}(Y|X)$. The reason AICc violates the chain rule is because its derivation implicitly assumes ${\mathbf{X}}_{0}=\widehat{\mathbf{X}}$ (DelSole & Tippett, 2021; Tian et al., 2020), which contradicts the assumption in (38) that $(\widehat{\mathbf{X}},\widehat{\mathbf{Y}})$ and (**X**_{0}, **Y**_{0}) are independent. Following Rosset and Tibshirani (2020), we define the following.

**X**_{0} and $\widehat{\mathbf{X}}$ are said to be Same‐X if ${\mathbf{X}}_{0}=\widehat{\mathbf{X}}$.

**X**_{0} and $\widehat{\mathbf{X}}$ are said to be Random‐X if the rows of **X**_{0} and $\widehat{\mathbf{X}}$ are independently and identically distributed as a joint normal distribution.

AICc is an unbiased estimate of $\mathbb{AIC}$ for Same‐X. An important special case of Same‐X is the intercept‐only models (43). In this case, $\text{AICc}(Y),\text{AICc}(X),\text{AICc}(XY)$ still are the correct unbiased estimates of $\mathbb{AIC}(Y),\mathbb{AIC}(X),\mathbb{AIC}(XY)$, because the only explanatory variable in each model is the intercept, which is Same‐X, and therefore consistent with the derivation of Hurvich and Tsai (1989). However, under Random‐X, $\text{AICc}(Y|X)$ violates the chain rule, and therefore, by Proposition 6, AICc can make contradictory decisions about *Y* ⊥ *X*_{R} | *X*_{K}. For these reasons, AICc is unsuitable for selecting models under Random‐X. The appropriate sample criterion for Random‐X is given in the next proposition.

Assuming the candidate model (41) includes the true model, an unbiased estimate of $\mathbb{AIC}(Y|X)$ under Random‐X is

Under Random‐X, $\mathbb{AIC}$ satisfies the chain rule (39); therefore, $\mathbb{AIC}(Y|X)$ can be estimated as $\mathbb{AIC}(XY)-\mathbb{AIC}(X)$. Unbiased estimates of the latter two quantities are (48) and (47), respectively. Taking the difference $\text{AICc}(XY)-\text{AICc}(X)$ yields (54). Alternatively, $\text{AICr}(Y|X)$ can be derived by exact integration, as shown in DelSole and Tippett (2021) (see also Fujikoshi, 1985; Tian et al., 2020). AICr is written in the form (54), rather than in other forms in DelSole and Tippett (2021), to facilitate comparisons discussed below.

AICr satisfies the chain rule $\text{AICr}(XY)=\text{AICr}(X)+\text{AICr}(Y|X)$, and hence by Proposition 6, it gives consistent decisions for equivalent selection problems. Since AICr also is an unbiased estimate of $\mathbb{AIC}$ for Random‐X, it is the natural basis for estimating Akaike's extension of $\mathbb{MIC}$.

Assuming the candidate PDF (41) includes the true PDF, an unbiased estimate of $\mathbb{MIC}\mathrm{a}(X;Y)$ under Random‐X is

In terms of the regression model (41),

Equation (55) follows from Proposition 5 after replacing $\mathbb{AIC}$ with the estimate AICr. Equations (56) and (57) follow from (55) because AICr satisfies the chain rule. Equation (58) follows from (55) and (54).

*X*‐selection

Under *ω* : *Y* ⊥ *X*_{R} | *X*_{K}, (27) implies that the criterion for deciding *ω* is Δ_{X} < 0, where

Under $\psi :{Y}_{K}\perp {X}_{R}|{X}_{K}\text{and}\phantom{\rule{0.1em}{0ex}}{Y}_{R}\perp {X}_{K}{X}_{R}|{Y}_{K}$, (35) implies that the criterion for deciding *ψ* is Δ_{XY} < 0, where

Partition the matrices in (41) as $\mathbf{X}=[{\mathbf{X}}_{K}\phantom{\rule{0.1em}{0ex}}{\mathbf{X}}_{R}]$ and $\mathbf{Y}=[{\mathbf{Y}}_{K}\phantom{\rule{0.1em}{0ex}}{\mathbf{Y}}_{R}]$, where **X**_{K}, **X**_{R}, **Y**_{K}, **Y**_{R} are each full column rank matrices of rank *M*_{K}, *M*_{R}, *P*_{K}, *P*_{R}, respectively, with $M={M}_{K}+{M}_{R}$ and $P={P}_{K}+{P}_{R}$. Then ${\mathrm{\Delta}}_{X}=N\mathrm{log}{\mathrm{\Lambda}}_{K}+\mathcal{P}(N,{M}_{K}+{M}_{R},P)-\mathcal{P}(N,{M}_{K},P)$, or

Similarly, ${\mathrm{\Delta}}_{XY}=N\mathrm{log}|{\widehat{\mathit{\Sigma}}}_{({Y}_{R}{X}_{R})|({Y}_{K}{X}_{K})}|-N\mathrm{log}|{\widehat{\mathit{\Sigma}}}_{{X}_{R}|{X}_{K}}|-N\mathrm{log}|{\widehat{\mathit{\Sigma}}}_{{Y}_{R}|{Y}_{K}}|+\mathcal{P}(N,{M}_{K}+{M}_{R},{P}_{K}+{P}_{R})-\mathcal{P}(N,{M}_{K},{P}_{K})$, or equivalently

Many standard texts recommend using AICc for *X*‐selection (e.g., Burnham & Anderson, 2002). We argue that AICc is not suitable for deciding conditional independence because it gives inconsistent decisions for equivalent formulations of conditional independence. Another issue can be seen by comparing Δ_{X} to ${\mathrm{\Delta}}_{X}^{\prime}=\text{AICc}(Y|{X}_{K}{X}_{R})-\text{AICc}(Y|{X}_{K})$. The latter criterion imposes a smaller penalty per extra predictor than does (61). The reason for this is that AICc assumes Same‐X while AICr assumes Random‐X (as discussed earlier in this section). As a result, AICc neglects a source of uncertainty and therefore underestimates the cross entropy.

Under normality, deciding *Y* ⊥ *X*_{R} | *X*_{K} is equivalent to deciding

The likelihood ratio test (LRT; Johnson & Wichern, 2002) for this hypothesis is to decide ${\mathbf{B}}_{R}=\mathbf{0}$ when ${\mathrm{\Delta}}_{\text{LRT}}=\mathrm{log}{\mathrm{\Lambda}}_{K}-\mathrm{log}{\mathrm{\Lambda}}_{C}>0$, where Λ_{K} is defined in (53), and Λ_{C} is the critical value from Wilks' lambda distribution with parameters (*P*, *M*_{R}, *N* − *M*). Both ${\mathrm{\Delta}}_{\text{LRT}}$ and Δ_{X} depend on sample values only through the likelihood ratio and therefore differ only in the critical value. However, the LRT is limited to nested models.

Conditional independence *ω* : *Y* ⊥ *X*_{R} | *X*_{K} also can be expressed as (26), which under normal distributions is equivalent to a hypothesis whose criterion again reduces to Δ_{X}; this indicates that a separate derivation is unnecessary.

Conditional independence *ω* : *Y* ⊥ *X*_{R} | *X*_{K} also can be expressed as (25), which under normal distributions is equivalent to the hypothesis ${\mathbf{B}}_{Y}=\mathbf{0}$ in the corresponding model. The criterion for deciding *ω* also is the same, as can be seen from the following identity:

Turning to an apparently different selection problem, Fujikoshi (1989) proposed a criterion for selecting *Y* variables on the basis that **y**_{R}, after removing the effects of **y**_{K}, does not depend on **x**. This criterion can be framed as the hypothesis

Under normality, the selection problem (66) is equivalent to deciding

which is merely (23), except with *X* and *Y* labels switched. We call this *Y‐selection*. Thus, all of the above results for *X*‐selection can be applied immediately to *Y*‐selection after swapping variable labels. In particular, the criterion for *Y*‐selection is

This small‐sample criterion is asymptotically equivalent to the criterion derived by Fujikoshi (1989). Because MIC is symmetric, the criterion is identical to that for regression model selection but with the usual roles of *X* and *Y* swapped; namely, *X* is the response and *Y* is explanatory. In this sense, selecting response variables is fundamentally equivalent to selecting explanatory variables: once a criterion for *X*‐selection exists, one can swap the *X* and *Y* labels and apply it to select response variables, so a separate derivation of a criterion for *Y*‐selection is unnecessary.

It should be recognized that the criteria stated in Propositions 9 and 10 were obtained merely by evaluating MIC. In particular, no constrained maximum likelihood problem needed to be solved. Nevertheless, (61) and (63) assert that the criteria are equivalent to the AICr of the joint PDFs constrained by the relevant form of conditional independence. These assertions can be verified because the associated constrained optimization problems have in fact been solved in the literature, though this fact seems not to be widely recognized. First, Fujikoshi (1985) derived the corrected AIC criterion for *X*‐selection under Random‐X. The result is his eq. 5.17, which is identical to our −Δ_{X}. This serves as a check on our derivation of (61). Also, this equivalence implies that Fujikoshi (1985) derived the small‐sample correction to AIC under Random‐X nearly 40 years ago!

Regarding Proposition 10, the verification is somewhat more complicated because the small‐sample corrected AIC for simultaneous selection does not appear in the literature. However, Fujikoshi et al. (2010) derived a criterion based on the distance information criterion, which is closely related to AIC (see sec. 10.6.1 of Fujikoshi et al., 2010). The small‐sample corrected version of this criterion is called CDIC and appears in sec. 11.5.2 of Fujikoshi et al. (2010). To remove a slight inconsistency with AIC, we adjust CDIC as follows: replace the overall factor of “*n*” by “*N*,” and replace “*n*” in the numerator of each penalty term by “*N* + 1,” which yields the following modified criterion, CDIC*:

Comparison with (64) shows that ${\text{CDIC}}^{*}$ and $-{\mathrm{\Delta}}_{XY}$ agree, except for additive terms that depend only on *N* and *P* + *M*. It is not clear why these terms differ, but Fujikoshi et al. (2010) applied their criterion to situations in which *N* and *P* + *M* are constant across candidate models; hence, these terms do not affect model selection. We interpret this agreement as confirming that both Δ_{X} and Δ_{XY} are the correct small‐sample criteria for conditional independence.

Importantly, Fujikoshi (1985) and Fujikoshi et al. (2010) derived the above criteria by explicitly maximizing the likelihood function subject to a constraint associated with conditional independence. Solving such constrained optimization problems requires intricate matrix manipulations. In contrast, the criteria in Proposition 11 were obtained simply by taking differences in MIC. The simplicity of the latter approach derives from the fact that certain forms of conditional independence allow structured PDFs to be expressed in terms of criteria for unstructured PDFs. Specifically, the left‐hand sides of (60) and (62) require solving a constrained ML problem, whereas the right‐hand sides require solving only unconstrained ML problems. Remarkably, MIC gives this decomposition directly, simply by computing differences in MIC over appropriate variable subsets.

MIC is a natural criterion for CCA because, in addition to the above reasons, it depends on sample values *only* through the canonical correlations.

Let the canonical correlations between **X** and **Y** in (41) be ${\widehat{\rho}}_{1},{\widehat{\rho}}_{2},\dots$. Then

Recall that canonical correlations are derived from the eigenvalues of

For normal distributions, a sample estimate of mutual information is (Soofi et al., 2010),

Thus, minimizing MIC strikes a balance between maximizing mutual information while minimizing the number of parameters being estimated.
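This balance can be made concrete. Under normality, mutual information depends on the distribution only through the canonical correlations via the standard Gaussian identity $\mathbb{I}=-\tfrac{1}{2}{\sum}_{i}\mathrm{log}(1-{\rho}_{i}^{2})$. A minimal sketch of the plug‐in estimate (illustrative code; the QR/SVD route to canonical correlations is a standard numerical choice, not taken from the paper):

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between the columns of X and Y (rows = samples)."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))   # orthonormal basis for each block
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def gaussian_mi_estimate(X, Y):
    """Plug-in Gaussian mutual information: -(1/2) * sum_i log(1 - rho_i^2)."""
    rho = canonical_correlations(X, Y)
    return -0.5 * np.sum(np.log1p(-rho ** 2))
```

For independent blocks the estimate is near zero; dependence between the blocks inflates it.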

To illustrate the application of MIC for selecting variables in CCA, consider data generated by the following model. When all *X* and *Y* variables are included, population CCA yields one nonzero canonical correlation, namely, ${\rho}_{1}=0.7$. However, only the first four *X* and first four *Y* variables are relevant; variables beyond these add no information about the *X*–*Y* relation and are therefore redundant. We consider a selection problem in which the candidate variables are included in a sequentially nested fashion; that is, the candidate model with *M*_{K} *X* variables consists of ${X}_{1},{X}_{2},\dots ,{X}_{{M}_{K}}$, and the candidate model with *M*_{Y} *Y* variables consists of ${Y}_{1},{Y}_{2},\dots ,{Y}_{{M}_{Y}}$.

Figure 1 shows MIC for a particular realization of samples for $N=50$. The minimum MIC occurs when three *X* and three *Y* variables are used. Repeating this procedure 100 times and counting the number of times a particular model is selected leads to the top left panel of Figure 2. For reference, the population mutual information is indicated by the shading. The most common selection is three *X* and three *Y* variables. For comparison, we define an “uncorrected MIC” using (58) but with the uncorrected penalty ${\mathrm{lim}}_{N\to \infty}\mathcal{P}(N,{M}_{K},P)=2{M}_{K}P$. Selections based on uncorrected MIC, shown in the bottom left panel, show a much larger tendency to overfit, which illustrates the importance of using the corrected criterion. For a larger sample size, $N=200$ (right column), MIC overwhelmingly selects four *X* and four *Y* variables, the correct choice for large *N*. The uncorrected MIC still shows a larger tendency to overfit. Even for $N=$ 20,000 (not shown), MIC overwhelmingly selects four *X* and four *Y* variables and shows little tendency to overfit.
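The nested search in this experiment is straightforward to sketch. The code below is schematic: it *assumes* the uncorrected criterion takes the form $N{\sum}_{i}\mathrm{log}(1-{\widehat{\rho}}_{i}^{2})+2{M}_{K}{P}_{K}$, combining (58) with the limiting penalty quoted above; the corrected criterion would replace the last term with the finite‐$N$ penalty $\mathcal{P}$.

```python
import numpy as np

def cancorrs(X, Y):
    """Canonical correlations via QR of the centered blocks."""
    Qx, _ = np.linalg.qr(X - X.mean(axis=0))
    Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
    return np.clip(np.linalg.svd(Qx.T @ Qy, compute_uv=False), 0.0, 1.0)

def uncorrected_mic(X, Y):
    """ASSUMED uncorrected form: N * sum log(1 - rho_i^2) + 2 * M_K * P_K."""
    N, M = X.shape
    P = Y.shape[1]
    rho = cancorrs(X, Y)
    return N * np.sum(np.log1p(-rho ** 2)) + 2.0 * M * P

def select_nested(X, Y):
    """Search over sequentially nested candidate subsets; return best (M_K, P_K)."""
    best = min((uncorrected_mic(X[:, :mk], Y[:, :pk]), mk, pk)
               for mk in range(1, X.shape[1] + 1)
               for pk in range(1, Y.shape[1] + 1))
    return best[1], best[2]
```

As the experiment above shows, this uncorrected form tends to overfit at small $N$; only the penalty term changes under the corrected criterion.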

We now consider using MIC to select graphical models. Graphical models express conditional dependencies by a graph comprising nodes and edges, where the absence of an edge between two nodes indicates that those two variables are conditionally independent given all other variables. More precisely, the two nodes *Z*_{1} and *Z*_{2} have no edge if ${Z}_{1}\perp {Z}_{2}|{Z}_{/12}$, where *Z*_{/12} means “all Z‐variables except *Z*_{1} and *Z*_{2}.” Graphical models corresponding to *X*‐selection, *Y*‐selection, and simultaneous selection are illustrated in Figure 3. The graph for simultaneous selection follows from the fact that

(which follows from the converse of Lemma 4.3 in Dawid, 1979). The associated structures have a simple expression in terms of the precision matrix (i.e., the inverse of the covariance matrix). Specifically, (72) implies that the (*Z*_{1}, *Z*_{2}) element of the precision matrix vanishes. Accordingly, the precision matrices corresponding to *X*‐selection, *Y*‐selection, and simultaneous selection have, respectively, the following forms. If *ω*_{12} is true, then conditional mutual information vanishes; that is, $\mathbb{I}({Z}_{1};{Z}_{2}|{Z}_{/12})=0$. As remarked in (20), a conditional $\mathbb{M}\mathbb{I}\mathbb{C}$ may be defined that behaves analogously to conditional mutual information, except that it varies in the opposite way (i.e., large $\mathbb{M}\mathbb{I}\mathbb{C}$ corresponds to weak conditional dependence). By suitable redefinition of variable labels in previous sections, conditional $\mathbb{M}\mathbb{I}\mathbb{C}$ is

The Akaike‐based sample estimate of conditional $\mathbb{M}\mathbb{I}\mathbb{C}$ is denoted $\text{MIC}({Z}_{1};{Z}_{2}|{Z}_{/12})$; the criterion is to decide *ω*_{12} if $\text{MIC}({Z}_{1};{Z}_{2}|{Z}_{/12})>0$. This criterion can be evaluated for any *Z*_{1} and *Z*_{2}, even if the graph is nondecomposable. For completeness, we note that the analogous criterion for deciding *Z*_{1} ⊥ *Z*_{2} is $\mathbb{M}\mathbb{I}\mathbb{C}({Z}_{1};{Z}_{2})=\mathbb{H}({Z}_{1}|{Z}_{2})-\mathbb{H}({Z}_{1})>0$.
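The link between conditional independence and a vanishing precision‐matrix element, stated above, is easy to verify numerically. The following sketch uses a hypothetical chain model ${Z}_{1}\to {Z}_{3}\to {Z}_{2}$ with unit coefficients (chosen purely for illustration), in which ${Z}_{1}\perp {Z}_{2}|{Z}_{3}$:

```python
import numpy as np

# Hypothetical chain Z1 -> Z3 -> Z2 with unit coefficients and unit noise:
#   Z3 = Z1 + E3,  Z2 = Z3 + E2,  so Z1 and Z2 are independent given Z3.
# Population covariance in the order (Z1, Z2, Z3):
Sigma = np.array([[1.0, 1.0, 1.0],
                  [1.0, 3.0, 2.0],
                  [1.0, 2.0, 2.0]])

Omega = np.linalg.inv(Sigma)  # precision matrix

# The (Z1, Z2) entry vanishes, reflecting Z1 ⊥ Z2 | Z3 ...
print(abs(Omega[0, 1]) < 1e-12)   # → True
# ... while entries linking conditionally dependent pairs do not.
print(abs(Omega[0, 2]) > 0.1)     # → True
```

The zero pattern of the precision matrix thus encodes exactly the missing edges of the graph.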

For scalar *Z*_{1} and *Z*_{2}, the conditional MIC is given by (75), where *D* is the total number of Z‐variables and ${\widehat{\rho}}_{12|{Z}_{/12}}$ is the partial correlation between *Z*_{1} and *Z*_{2} after regressing out *Z*_{/12}.
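The partial correlation ${\widehat{\rho}}_{12|{Z}_{/12}}$ can be computed by the textbook residual route: regress *Z*_{1} and *Z*_{2} on *Z*_{/12} and correlate the residuals. A sketch (an illustrative helper, not the paper's code), which agrees with the equivalent precision‐matrix formula ${\widehat{\rho}}_{12|{Z}_{/12}}=-{\mathrm{\Omega}}_{12}/\sqrt{{\mathrm{\Omega}}_{11}{\mathrm{\Omega}}_{22}}$:

```python
import numpy as np

def partial_correlation(Z, i, j):
    """Partial correlation of columns i, j of Z after regressing out the rest."""
    n, d = Z.shape
    rest = [k for k in range(d) if k not in (i, j)]
    A = np.column_stack([np.ones(n), Z[:, rest]])   # include an intercept
    ri = Z[:, i] - A @ np.linalg.lstsq(A, Z[:, i], rcond=None)[0]
    rj = Z[:, j] - A @ np.linalg.lstsq(A, Z[:, j], rcond=None)[0]
    return float(ri @ rj / np.sqrt((ri @ ri) * (rj @ rj)))
```

The residual route and the precision‐matrix route are algebraically identical, so either may be used in (75).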

One of the most popular algorithms for identifying graphical models is the PC algorithm (Spirtes et al., 2001). This algorithm requires a criterion for deciding conditional independence. A standard criterion is based on the statistical significance of the partial correlation. However, significance depends on the arbitrary significance level *α* and is not guaranteed to be a proper score. In contrast, the criterion $\text{MIC}({Z}_{1};{Z}_{2}|{Z}_{/12})>0$ does not depend on an arbitrary *α* and is a proper score. To illustrate its application, we consider a simple four‐variable model, (76), in which *E*_{1}, *E*_{2}, *E*_{3}, *E*_{4} are independently drawn from $\mathcal{N}(\mathrm{0,1})$. The PC algorithm was applied to *N* samples from (76); this whole procedure (including resampling *A*, *B*, *C*) was repeated 1000 times, and the number of times the PC algorithm identified the correct graph was recorded. We performed two experiments: one that decides conditional independence based on the significance of the partial correlation using $\alpha =5\%$, and one based on MIC in (75). The results are shown in Figure 4. The figure shows that, for this choice of *α* and for small sample sizes (*N* < 100), the PC algorithm selects the correct graph more frequently using MIC than using the significance of the partial correlation. This result should not be interpreted as general, since it depends on the choice of *α*, which is a tuning parameter in the PC algorithm. In contrast, the criterion $\text{MIC}({Z}_{1};{Z}_{2}|{Z}_{/12})>0$ involves no tunable parameters. The parameter *α* could be tuned to produce better results, but such tuning is not generally possible when the true graph is unknown. We emphasize that the criterion for conditional independence (74) is not restricted to univariate *Z*_{1} and *Z*_{2}; hence, this criterion may open new approaches to graphical model selection.
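For reference, the significance‐based decision used in standard PC implementations applies Fisher's z‐transform to the sample partial correlation; the significance level *α* enters exactly as the tuning parameter discussed above. A minimal sketch of this textbook test (not code from the paper):

```python
import math
from statistics import NormalDist

def fisher_z_independent(r, n, k, alpha=0.05):
    """Decide conditional independence from a sample partial correlation r,
    with n samples and k conditioning variables (Fisher z-transform test)."""
    z = 0.5 * math.log((1.0 + r) / (1.0 - r))        # Fisher transform
    stat = math.sqrt(n - k - 3.0) * abs(z)           # approximately standard normal
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # two-sided critical value
    return stat < crit   # True -> treat the pair as conditionally independent
```

For example, with $n=1000$ and two conditioning variables, a partial correlation of 0.01 is accepted as conditional independence while 0.5 is rejected; changing *α* moves this boundary, which is precisely the arbitrariness the MIC criterion avoids.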

This research was supported primarily by the National Science Foundation (AGS‐1822221). Additional support was provided by the National Science Foundation (AGS‐1338427), the National Aeronautics and Space Administration (NNX14AM19G), and the National Oceanic and Atmospheric Administration (NA14OAR4310160). The views expressed herein do not necessarily reflect the views of these agencies.

No original data were generated through this work.