canonicalCorrelation

Given an input array of deviate values, generates a canonical correlation array.

Synopsis

canonicalCorrelation (devt)

Required Arguments

float devt[[]] (Input): An array of length nseq × nvar of deviate values containing nseq row elements for each of nvar variables (columns).

Return Value

An array of length nvar × nvar containing the canonical correlation array.

Description

Function canonicalCorrelation generates a canonical correlation matrix from an arbitrarily distributed multivariate deviate sequence devt with nvar deviate variables, nseq elements in each deviate sequence, and a Gaussian Copula dependence structure.

Function canonicalCorrelation first maps each of the j = 0, …, nvar‑1 input deviate sequences devt[k][j], k = 0, …, nseq‑1, into a corresponding sequence of variates $V_{kj}$. Here a variate is a value of the empirical cumulative distribution function, $CDF(x)$, defined as the probability that the random deviate variable $X \leq x$. Each variate $V_{kj}$ is then mapped to a standard normal N(0,1) distributed deviate $z_{kj}$ using the inverse standard normal CDF, $z_{kj}$ = normalInverseCdf($V_{kj}$). Finally, the standard covariance estimator

$C_{ij} = \tfrac{1}{m} \sum_{k=1}^{m} z_{ki} z_{kj}$

(where m = nseq and i and j have values between 1 and nvar) is used to calculate the canonical correlation matrix corr, where $C_{i j}$ = corr[i-1][j-1] = the return value canonical correlation array.
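The mapping described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: the helper names normal_scores and canonical_correlation_sketch are hypothetical, the empirical CDF convention rank/(m+1) is an assumption chosen to keep every variate strictly inside (0, 1), and ties are simply broken by position.

```python
from statistics import NormalDist


def normal_scores(column):
    """Map one deviate sequence to N(0,1) deviates via its empirical CDF."""
    m = len(column)
    # Rank each element (1 = smallest); ties are broken by position (assumption).
    order = sorted(range(m), key=lambda k: column[k])
    ranks = [0] * m
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank
    std_normal = NormalDist()
    # Assumed empirical CDF convention: V_k = rank_k / (m + 1), never 0 or 1.
    return [std_normal.inv_cdf(r / (m + 1)) for r in ranks]


def canonical_correlation_sketch(devt):
    """Return the nvar x nvar matrix C_ij = (1/m) * sum_k z_ki * z_kj."""
    m = len(devt)
    nvar = len(devt[0])
    # z[j] holds the normal scores of column j of devt.
    z = [normal_scores([row[j] for row in devt]) for j in range(nvar)]
    return [[sum(z[i][k] * z[j][k] for k in range(m)) / m
             for j in range(nvar)] for i in range(nvar)]


# Column 1 is a monotone (cubic) transform of column 0; column 2 is column 0 negated.
data = [[x, x ** 3, -x] for x in [0.3, -1.2, 2.5, 0.9, -0.4, 1.7, -2.1]]
corr = canonical_correlation_sketch(data)
for row in corr:
    print(" ".join("%9.5f" % value for value in row))
```

Because only ranks enter the calculation, the monotone transform in column 1 yields the same entries as column 0, and the negated column yields the same magnitude with opposite sign. With the assumed rank/(m+1) convention and the 1/m normalization, the diagonal entries fall below 1 for small m and approach 1 as m grows.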

If a multivariate distribution has Gaussian marginal distributions, then the standard “empirical” correlation matrix given above is “unbiased”, i.e. an accurate measure of dependence among the variables. But when the marginal distributions depart significantly from Gaussian, i.e. are skewed or flattened, then the empirical correlation may become biased. One way to remove such bias from dependence measures is to map the non-Gaussian-distributed marginal deviates to N(0,1) deviates (by mapping the non-Gaussian marginal deviates to empirically derived marginal CDF variate values, then inverting the variates to N(0,1) deviates as described above), and calculating the standard empirical correlation matrix from these N(0,1) deviates as in the equation above. The resulting “canonical correlation” matrix thereby avoids the bias that would occur if the empirical correlation matrix were extracted from the non-Gaussian marginal distributions directly.
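The bias discussed above can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the library's method: the helpers pearson and normal_scores are hypothetical names, the empirical CDF is taken as rank/(n+1), and the normal scores are compared with a centered, normalized Pearson estimator instead of the plain 1/m sum of products. One Gaussian marginal is replaced by a strongly skewed lognormal marginal through a monotone transform, which leaves the underlying dependence unchanged.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Standard empirical (Pearson) correlation of two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def normal_scores(x):
    """Empirical CDF (assumed rank / (n + 1)), then the inverse N(0,1) CDF."""
    n = len(x)
    std_normal = NormalDist()
    order = sorted(range(n), key=lambda k: x[k])
    scores = [0.0] * n
    for rank, k in enumerate(order, start=1):
        scores[k] = std_normal.inv_cdf(rank / (n + 1))
    return scores


rng = random.Random(12345)  # fixed seed so the run is repeatable
rho, n = 0.8, 2000
z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
# z2 has correlation rho with z1 by construction.
z2 = [rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0) for a in z1]

# Monotone transform of z1 into a strongly skewed (lognormal) marginal.
skewed = [math.exp(2.0 * a) for a in z1]

raw_corr = pearson(skewed, z2)  # computed directly on the skewed deviates
score_corr = pearson(normal_scores(skewed), normal_scores(z2))  # after mapping to N(0,1)

print("target rho                  = %8.4f" % rho)
print("correlation of raw deviates = %8.4f" % raw_corr)
print("correlation of N(0,1) scores = %8.4f" % score_corr)
```

In this setup the correlation computed on the raw deviates is pulled well away from the target value by the skewed marginal, while the correlation computed after the normal-score mapping stays close to it.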

The canonical correlation matrix may be of value in such applications as Markowitz portfolio optimization, where an unbiased measure of dependence is required to evaluate portfolio risk, defined in terms of the portfolio variance which is in turn defined in terms of the correlation among the component portfolio instruments.
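To make the link between correlation and portfolio risk concrete, the portfolio variance can be written as $\sigma_p^2 = \sum_i \sum_j w_i w_j \sigma_i \sigma_j \rho_{ij}$, where $w_i$ are the instrument weights, $\sigma_i$ their standard deviations, and $\rho_{ij}$ the correlation matrix entries. The sketch below evaluates this formula; the function name portfolio_variance and the weights and volatilities are hypothetical values chosen only for illustration.

```python
def portfolio_variance(weights, vols, corr):
    """Evaluate sum_i sum_j w_i * w_j * sigma_i * sigma_j * rho_ij."""
    n = len(weights)
    return sum(weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j]
               for i in range(n) for j in range(n))


# Hypothetical two-instrument portfolio (made-up weights and volatilities).
weights = [0.5, 0.5]
vols = [0.2, 0.1]

variances = {}
for rho in (-0.5, 0.0, 0.5):
    corr = [[1.0, rho], [rho, 1.0]]
    variances[rho] = portfolio_variance(weights, vols, corr)
    print("rho = %5.2f  portfolio variance = %8.5f" % (rho, variances[rho]))
```

The variance grows with the off-diagonal correlation, so a biased correlation estimate feeds directly into a biased risk estimate.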

The utility of the canonical correlation derives from the observation that a “copula” multivariate distribution with uniformly distributed deviates (corresponding to the CDF probabilities associated with the marginal deviates) may be mapped to arbitrarily distributed marginals, so that an unbiased dependence estimator derived from one set of marginals (N(0,1) distributed marginals) can be used to represent the dependence associated with arbitrarily distributed marginals. The “Gaussian Copula” (whose variate arguments are derived from N(0,1) marginal deviates) is a particularly useful structure for representing multivariate dependence.
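The copula mapping described above can be sketched for two variables in plain Python. This is a minimal illustration under stated assumptions, not the library's generator: gaussian_copula_pair and exponential_inverse_cdf are hypothetical helper names, and the exponential marginal is an arbitrary choice. Correlated N(0,1) deviates are turned into uniform variates with the standard normal CDF, and each variate is then inverted with the CDF of the desired marginal.

```python
import math
import random
from statistics import NormalDist

std_normal = NormalDist()


def gaussian_copula_pair(rho, rng):
    """Draw one pair of uniform variates with Gaussian copula dependence rho."""
    e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    # Lower Cholesky factor of [[1, rho], [rho, 1]] applied to (e1, e2).
    z1 = e1
    z2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
    # The standard normal CDF maps each N(0,1) deviate to a uniform variate.
    return std_normal.cdf(z1), std_normal.cdf(z2)


def exponential_inverse_cdf(u, rate=1.0):
    """Inverse CDF of the exponential distribution (hypothetical marginal)."""
    return -math.log(1.0 - u) / rate


rng = random.Random(2024)  # fixed seed so the run is repeatable
variates = [gaussian_copula_pair(-0.9, rng) for _ in range(1000)]

# Imprint the same dependence on two different marginals:
# exponential for the first variable, N(0,1) for the second.
deviates = [(exponential_inverse_cdf(u1), std_normal.inv_cdf(u2))
            for u1, u2 in variates]

print("first five deviate pairs:")
for x, y in deviates[:5]:
    print(" %10.6f %10.6f" % (x, y))
```

Because each marginal inversion is monotone, the ranks of the deviates match the ranks of the underlying N(0,1) deviates, which is why a rank-based mapping back to N(0,1) can recover the dependence.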

Example: Using Gaussian Copulas to Imprint and Extract Correlation Information

This example uses function randomMvarTCopula (a Student's t copula generator, called here with df = 5 degrees of freedom) to generate a multivariate sequence tcdevt whose marginal distributions are user-defined and imprinted with a user-specified input correlation matrix corrin. It then uses function canonicalCorrelation to extract an output canonical correlation matrix corrout from this multivariate random sequence.

This example illustrates two useful copula-related procedures. The first procedure generates a random multivariate sequence with arbitrary user-defined marginal deviates whose dependence is specified by a user-defined correlation matrix. The second procedure is the inverse of the first: an arbitrary multivariate deviate input sequence is first mapped to a corresponding sequence of empirically derived variates, i.e. cumulative distribution function values representing the probability that each random variable has a value less than or equal to the input deviate. The variates are then inverted, using the inverse standard normal CDF function, to N(0,1) deviates. Finally, a canonical covariance matrix is extracted from the multivariate N(0,1) sequence using the standard sum of products.

This example demonstrates that function randomMvarTCopula embeds the user-defined correlation information into sequences with arbitrary marginal distributions: the canonical correlation matrix extracted from these sequences differs from the original correlation matrix by a small relative error, which generally decreases as the number of multivariate sequence vectors increases.

from __future__ import print_function
from numpy import zeros, double
from pyimsl.math.linSolPosdef import linSolPosdef
from pyimsl.stat.randomOption import randomOption
from pyimsl.stat.randomSeedSet import randomSeedSet
from pyimsl.stat.randomMvarTCopula import randomMvarTCopula
from pyimsl.stat.chiSquaredInverseCdf import chiSquaredInverseCdf
from pyimsl.stat.fInverseCdf import fInverseCdf
from pyimsl.stat.normalInverseCdf import normalInverseCdf
from pyimsl.stat.canonicalCorrelation import canonicalCorrelation

nvar = 3
lmax = 15000
df = 5.0
arg1 = 10.0
arg2 = 15.0
corrin = [[1.0, -0.9486832, 0.8164965],
          [-0.9486832, 1.0, -0.6454972],
          [0.8164965, -0.6454972, 1.0]]

print("Off-diagonal elements of Input Correlation Matrix:\n")
for i in range(nvar):
    for j in range(i):
        print(" CorrIn(%d,%d) = %10.6f" % (i, j, corrin[i][j]))
print("\n Degrees of freedom df = %6.2f" % df)
print("\n Imprinted random sequences distributions:")
print("\n 1: Chi, 2: F, 3: Normal;")
print("\nOff-diagonal elements of Output Correlation Matrices")
print("calculated from Student's t Copula imprinted")
print("multivariate sequence:")

#
# Compute the Cholesky factorization of corrin
#
# Use IMSL function linSolPosdef to generate
# the nvar by nvar upper triangular matrix chol from
# the Cholesky decomposition R*RT of input correlation
# matrix corrin:
#
chol = []
linSolPosdef(corrin, None, factor=chol, factorOnly=True)

kmax = lmax // 100  # integer number of vectors in the first sequence
for kk in range(1, 4):
    tcdevt = zeros((int(kmax), nvar), dtype=double)
    print("\n# of vectors in multivariate sequence: %7d\n\n" % kmax)
    # use Congruential RN generator, with multiplier 16807
    randomOption(1)
    # set RN generator seed to be 123457
    randomSeedSet(123457)

    for k in range(int(kmax)):
        #
        # generate a NVAR-length random Student's t Copula
        # variate output vector tcvart which is uniformly
        # distributed on the interval [0,1] and imprinted
        # with correlation information from input Cholesky
        # matrix chol:
        tcvart = randomMvarTCopula(df, chol)
        for j in range(nvar):
            #
            # invert Student's t Copula probabilities to
            # deviates using variable-specific
            # inversions: j = 0: Chi Square; j = 1: F;
            # j = 2: Normal(0,1); will end up with deviate
            # sequences ready for mapping to canonical
            # correlation matrix:
            #
            if (j == 0):
                # convert probs into ChiSquare(df=10) deviates
                tcdevt[k, j] = chiSquaredInverseCdf(tcvart[j], arg1)
            elif (j == 1):
                # convert probs into F(dfn=15,dfd=10) deviates
                tcdevt[k, j] = fInverseCdf(tcvart[j], arg2, arg1)
            else:
                # convert probs into Normal(mean=0,variance=1) deviates:
                tcdevt[k, j] = normalInverseCdf(tcvart[j])
    #
    # extract Canonical Correlation matrix from arbitrarily
    # distributed deviate sequences tcdevt (k=1..kmax, j=1..NVAR)
    # which have been imprinted with corrin (i=1..NVAR, j=1..NVAR)
    # above:
    corrout = canonicalCorrelation(tcdevt)
    for i in range(nvar):
        for j in range(i):
            rs00 = corrin[i][j]
            rs = corrout[i][j]
            relerr = abs((rs - rs00) / rs00)
            print(" CorrOut(%d,%d) = %10.6f; relerr = %10.6f" %
                  (i, j, corrout[i][j], relerr))
    kmax *= 10

Output

Off-diagonal elements of Input Correlation Matrix:

 CorrIn(1,0) =  -0.948683
 CorrIn(2,0) =   0.816496
 CorrIn(2,1) =  -0.645497

 Degrees of freedom df =   5.00

 Imprinted random sequences distributions:

 1: Chi, 2: F, 3: Normal;

Off-diagonal elements of Output Correlation Matrices
calculated from Student's t Copula imprinted
multivariate sequence:

# of vectors in multivariate sequence:     150


 CorrOut(1,0) =  -0.953573; relerr =   0.005154
 CorrOut(2,0) =   0.774720; relerr =   0.051166
 CorrOut(2,1) =  -0.621419; relerr =   0.037302

# of vectors in multivariate sequence:    1500


 CorrOut(1,0) =  -0.944316; relerr =   0.004603
 CorrOut(2,0) =   0.810163; relerr =   0.007757
 CorrOut(2,1) =  -0.636348; relerr =   0.014174

# of vectors in multivariate sequence:   15000


 CorrOut(1,0) =  -0.946770; relerr =   0.002017
 CorrOut(2,0) =   0.808562; relerr =   0.009718
 CorrOut(2,1) =  -0.636322; relerr =   0.014215