Condividi tramite


Estimating Correlation and Variance/Covariance Matrices

Important

This content is being retired and may not be updated in the future. The support for Machine Learning Server will end on July 1, 2022. For more information, see What's happening to Machine Learning Server?

The rxCovCor function in RevoScaleR calculates the covariance, correlation, or sum of squares/cross-product matrix for a set of variables in a .xdf file or data frame. The size of these matrices is determined by the number of variables rather than the number of observations, so typically the results can easily fit into memory in R. A broad category of analyses can be computed from some form of a cross-product matrix, for example, factor analysis and principal components.

A cross-product matrix is a matrix of the form X'X, where X represents an arbitrary set of raw or standardized variables. More generally, this matrix is of the form X'WX, where W is a diagonal weighting matrix.

Computing Cross-Product Matrices

While rxCovCor is the primary tool for computing covariance, correlation, and other cross-product matrices, you will seldom call it directly. Instead, it is generally simpler to use one of the following convenience functions:

  • rxCov: Use rxCov to return the covariance matrix
  • rxCor: Use rxCor to return the correlation matrix
  • rxSSCP: Use rxSSCP to return the augmented cross-product matrix, that is, we first add a column of 1’s (if no weights are specified) or a column equaling the square root of the weights to the data matrix, and then compute the cross-product.

Computing a Correlation Matrix for Use in Factor Analysis

The 5% sample of the U.S. 2000 census has over 14 million observations. In this example, we compute the correlation matrix for 16 variables derived from variables in the data set for individuals over the age of 20. This correlation matrix is then used as input into the standard R factor analysis function, factanal.

First, we specify the name and location of the data set:

#  Computing a Correlation Matrix for Use in Factor Analysis

bigDataDir <- "C:/MRS/Data"
bigCensusData <- file.path(bigDataDir, "Census5PCT2000.xdf"

Next, we can take a quick look at some basic demographic and socio-economic variables by calling rxSummary. (For more information on variables in the census data, see http://usa.ipums.org/usa-action/variables/group.) Throughout the analysis we use the probability weighting variable, perwt, and restrict the analysis to people over the age of 20.

The blocksPerRead argument is ignored if run locally using R Client. Learn more...

rxSummary(~phone + speakeng + wkswork1 + incwelfr + incss + educrec + metro +
	ownershd + marst + lingisol + nfams + yrsusa1 + movedin + racwht + age,
	data = bigCensusData, blocksPerRead = 5, pweights = "perwt", 
	rowSelection = age > 20)

This call provides summary information about the variables in this weighted subsample, including cell counts for factor variables:

Call:
rxSummary(formula = ~phone + speakeng + wkswork1 + incwelfr + 
	incss + educrec + metro + ownershd + marst + lingisol + nfams + 
	yrsusa1 + movedin + racwht + age, data = censusData, pweights = "perwt", 
	rowSelection = age > 20, blocksPerRead = 5)

Summary Statistics Results for: ~phone + speakeng + wkswork1 + incwelfr
	+ incss + educrec + metro + ownershd + marst + lingisol + nfams +
	yrsusa1 + movedin + racwht + age
File name: C:\MRS\Data\Census5PCT2000.xdf
Probability weights: perwt
Number of valid observations: 9822124 
	
	Name     Mean        StdDev       Min Max   SumOfWeights MissingWeights
	wkswork1   32.068473   23.2438663  0     52 196971131    0             
	incwelfr   61.155293  711.0955602  0  25500 196971131    0             
	incss    1614.604835 3915.7717233  0  26800 196971131    0             
	nfams       1.163434    0.5375238  1     48 196971131    0             
	yrsusa1     2.868573    9.0098343  0     90 196971131    0             
	age        46.813005   17.1797905 21     93 196971131    0             

Category Counts for phone
Number of categories: 3

	phone                  Counts   
	N/A                      5611380
	No, no phone available   3957030
	Yes, phone available   187402721

Category Counts for speakeng
Number of categories: 10

	speakeng                 Counts   
	N/A (Blank)                      0
	Does not speak English     2956934
	Yes, speaks English...           0
	Yes, speaks only English 162425091
	Yes, speaks very well     17664738
	Yes, speaks well           7713303
	Yes, but not well          6211065
	Unknown                          0
	Illegible                        0
	Blank                            0

Category Counts for educrec
Number of categories: 10

	educrec                 Counts  
	N/A  or No schooling           0
	None or preschool        2757363
	Grade 1, 2, 3, or 4      1439820
	Grade 5, 6, 7, or 8     10180870
	Grade 9                  4862980
	Grade 10                 5957922
	Grade 11                 5763456
	Grade 12                63691961
	1 to 3 years of college 55834997
	4+ years of college     46481762

Category Counts for metro
Number of categories: 5

	metro                                Counts  
	Not applicable                       13829398
	Not in metro area                    32533836
	In metro area, central city          32080416
	In metro, area, outside central city 60836302
	Central city status unknown          57691179

Category Counts for ownershd
Number of categories: 8

	ownershd                    Counts  
	N/A                          5611380
	Owned or being bought              0
	Check mark (owns?)                 0
	Owned free and clear        40546259
	Owned with mortgage or loan 94626060
	Rents                              0
	No cash rent                 3169987
	With cash rent              53017445

Category Counts for marst
Number of categories: 6

	marst                      Counts   
	Married, spouse present    112784037
	Married, spouse absent       5896245
	Separated                    4686951
	Divorced                    21474299
	Widowed                     14605829
	Never married/single (N/A)  37523770

Category Counts for lingisol
Number of categories: 3

	lingisol                    Counts   
	N/A (group quarters/vacant)   5611380
	Not linguistically isolated 182633786
	Linguistically isolated       8725965

Category Counts for movedin
Number of categories: 7

	movedin                Counts  
	NA                     92708540
	This year or last year 20107246
	2-5 years ago          30328210
	6-10 years ago         16959897
	11-20 years ago        16406155
	21-30 years ago        10339278
	31+ years ago          10121805

Category Counts for racwht
Number of categories: 2

	racwht Counts   
	No      40684944
	Yes    156286187

Next we call the rxCor function, a convenience function for rxCovCor that returns just the Pearson’s correlation matrix for the variables specified. We make heavy use of the transforms argument to create a series of logical (or dummy) variables from factor variables to be used in the creation of the correlation matrix.

censusCor <- rxCor(formula=~poverty + noPhone + noEnglish  + onSocialSecurity + 
	onWelfare + working + incearn + noHighSchool + inCity + renter + 
	noSpouse + langIsolated + multFamilies + newArrival + recentMove + 
	white + sei + older, 
	data = bigCensusData, pweightsb= "perwt", blocksPerRead = 5, 
	rowSelection = age > 20,
	transforms= list(
		noPhone = phone == "No, no phone available",
		noEnglish = speakeng == "Does not speak English",
		working = wkswork1 > 20,
		onWelfare = incwelfr > 0,
		onSocialSecurity = incss > 0,
		noHighSchool = 
			!(educrec %in% 
			c("Grade 12", "1 to 3 years of college", "4+ years of college")),
		inCity = metro == "In metro area, central city",
		renter = ownershd %in% c("No cash rent", "With cash rent"),
		noSpouse = marst != "Married, spouse present",
		langIsolated = lingisol == "Linguistically isolated",
		multFamilies = nfams > 2,
		newArrival = yrsusa2 == "0-5 years",
		recentMove = movedin == "This year or last year",
		white = racwht == "Yes",
		older = age > 64	
		))

The resulting correlation matrix is used as input into the factor analysis function provided by the stats package in R. In interpreting the results, the variable poverty represents family income as a percentage of a poverty threshold, so increases as family income increases. First, specify two factors:

censusFa <- factanal(covmat = censusCor, factors=2)
print(censusFa, digits=2, cutoff = .2, sort= TRUE)

Results in:

Call:
factanal(factors = 2, covmat = censusCor)

Uniquenesses:
			poverty          noPhone        noEnglish onSocialSecurity 
			0.53             0.96             0.93             0.23 
		onWelfare          working          incearn     noHighSchool 
			0.96             0.51             0.68             0.82 
			inCity           renter         noSpouse     langIsolated 
			0.97             0.79             0.90             0.90 
	multFamilies       newArrival       recentMove            white 
			0.96             0.93             0.97             0.88 
				sei            older 
			0.57             0.24 

Loadings:
					Factor1 Factor2
onSocialSecurity  0.83   -0.29  
working          -0.68          
sei              -0.59   -0.29  
older             0.82   -0.29  
poverty          -0.28   -0.63  
noPhone                         
noEnglish                 0.26  
onWelfare                       
incearn          -0.46   -0.34  
noHighSchool      0.31    0.29  
inCity                          
renter                    0.45  
noSpouse                  0.28  
langIsolated              0.31  
multFamilies                    
newArrival                0.26  
recentMove                      
white                    -0.34  

				Factor1 Factor2
SS loadings       2.60    1.67
Proportion Var    0.14    0.09
Cumulative Var    0.14    0.24

The degrees of freedom for the model is 118 and the fit was 0.6019  

We can use the same correlation matrix to estimate three factors:

censusFa <- factanal(covmat = censusCor, factors=3)
print(censusFa, digits=2, cutoff = .2, sort= TRUE)

	Call:
	factanal(factors = 3, covmat = censusCor)
	
	Uniquenesses:
			poverty          noPhone        noEnglish onSocialSecurity 
				0.49             0.96             0.72             0.24 
			onWelfare          working          incearn     noHighSchool 
				0.96             0.50             0.62             0.80 
			inCity           renter         noSpouse     langIsolated 
				0.96             0.80             0.90             0.59 
		multFamilies       newArrival       recentMove            white 
				0.96             0.79             0.97             0.88 
				sei            older 
				0.56             0.22 
	
	Loadings:
					Factor1 Factor2 Factor3
	onSocialSecurity  0.87                  
	working          -0.62    0.34          
	sei              -0.51    0.42          
	older             0.88                  
	poverty                   0.68          
	incearn          -0.36    0.51          
	noEnglish                         0.53  
	langIsolated                      0.64  
	noPhone                                 
	onWelfare                -0.21          
	noHighSchool      0.25   -0.26    0.26  
	inCity                                  
	renter                   -0.36    0.24  
	noSpouse                 -0.30          
	multFamilies                            
	newArrival                        0.46  
	recentMove                              
	white                     0.24   -0.24  
	
					Factor1 Factor2 Factor3
	SS loadings       2.41    1.50    1.18
	Proportion Var    0.13    0.08    0.07
	Cumulative Var    0.13    0.22    0.28

The degrees of freedom for the model is 102 and the fit was 0.343

Computing A Covariance Matrix for Principal Components Analysis

Principal components analysis, or PCA, is a technique closely related to factor analysis. PCA seeks to find a set of orthogonal axes such that the first axis, or first principal component, accounts for as much variability as possible, and subsequent axes or components are chosen to maximize variance while maintaining orthogonality with previous axes. Principal components are typically computed either by a singular value decomposition of the data matrix or an eigenvalue decomposition of a covariance or correlation matrix; the latter permits us to use rxCovCor and its relatives with the standard R function princomp.

As an example, we use the rxCov function to calculate a covariance matrix for the log of the classic iris data, and pass the matrix to the princomp function (reproduced from Modern Applied Statistics with S):

#  Computing A Covariance Matrix for Principal Components Analysis
	
irisLog <- as.data.frame(lapply(iris[,1:4], log))
irisSpecies <- iris[,5]
irisCov <- rxCov(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
	data=irisLog)
irisPca <- princomp(covmat=irisCov, cor=TRUE)
summary(irisPca)

Yields the following:

Importance of components:
							Comp.1    Comp.2     Comp.3    Comp.4
Standard deviation     1.7124583 0.9523797 0.36470294 0.1656840
Proportion of Variance 0.7331284 0.2267568 0.03325206 0.0068628
Cumulative Proportion  0.7331284 0.9598851 0.99313720 1.0000000

The default plot method for objects of class princomp is a scree plot, which is a barplot of the variances of the principal components. We can obtain the plot as usual by calling plot with our principal components object:

plot(irisPca)

Yields the following plot:

irisPca plot

Another useful bit of output is given by the loadings function, which returns a set of columns showing the linear combinations for each principal component:

loadings(irisPca)

	Loadings:
				Comp.1 Comp.2 Comp.3 Comp.4
	Sepal.Length  0.504 -0.455  0.709  0.191
	Sepal.Width  -0.302 -0.889 -0.331       
	Petal.Length  0.577        -0.219 -0.786
	Petal.Width   0.567        -0.583  0.580
	
					Comp.1 Comp.2 Comp.3 Comp.4
	SS loadings      1.00   1.00   1.00   1.00
	Proportion Var   0.25   0.25   0.25   0.25
	Cumulative Var   0.25   0.50   0.75   1.00

You may have noticed that we supplied the flag cor=TRUE in the call to princomp; this flag tells princomp to use the correlation matrix rather than the covariance matrix to compute the principal components. We can obtain the same results by omitting the flag but submitting the correlation matrix as returned by rxCor instead:

irisCor <- rxCor(~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
	data=irisLog)
irisPca2 <- princomp(covmat=irisCor)
summary(irisPca2)
loadings(irisPca2)
plot(irisPca2)

A Large Data Principal Components Analysis

Stock market data for open, high, low, close, and adjusted close from 1962 to 2010 is available at https://github.com/thebigjc/HackReduce/blob/master/datasets/nyse/daily_prices/NYSE_daily_prices_subset.csv. The full data set includes 9.2 million observations of daily open-high-low-close data for some 2800 stocks. As you might expect, these data are highly correlated, and principal components analysis can be used for data reduction. We read the original data into a .xdf file, NYSE_daily_prices.xdf, using the same process we used in the Tutorial: Analyzing loan data with RevoScaleR to read our mortgage data (set revoDataDir to the full path to the NYSE directory containing the .csv files when you unpack the download):

#  A Large Data Principal Components Analysis

if (bHasNYSE){  

bigDataDir <- "C:/MRS/Data"
nyseCsvFiles <- file.path(bigDataDir, "NYSE_daily_prices","NYSE",
	"NYSE_daily_prices_")

nyseXdf <- "NYSE_daily_prices.xdf"
append <- "none"
for (i in LETTERS)
{
	importFile <- paste(nyseCsvFiles, i, ".csv", sep="")
	rxImport(importFile, nyseXdf, append=append)
	append <- "rows"
}

Once we have our .xdf file, we proceed as before:

stockCor <- rxCor(~ stock_price_open + stock_price_high + 
	stock_price_low + stock_price_close + 
	stock_price_adj_close, data="NYSE_daily_prices.xdf")
stockPca <- princomp(covmat=stockCor)
summary(stockPca)
loadings(stockPca)
plot(stockPca)

Yields the following output:

> summary(stockPca)
Importance of components:
							Comp.1    Comp.2      Comp.3       Comp.4
Standard deviation     2.0756631 0.8063270 0.197632281 0.0454173922
Proportion of Variance 0.8616755 0.1300327 0.007811704 0.0004125479
Cumulative Proportion  0.8616755 0.9917081 0.999519853 0.9999324005
								Comp.5
Standard deviation     1.838470e-02
Proportion of Variance 6.759946e-05
Cumulative Proportion  1.000000e+00
> loadings(stockPca)

Loadings:
						Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
stock_price_open      -0.470 -0.166  0.867              
stock_price_high      -0.477 -0.151 -0.276  0.410 -0.711
stock_price_low       -0.477 -0.153 -0.282  0.417  0.704
stock_price_close     -0.477 -0.149 -0.305 -0.811       
stock_price_adj_close -0.309  0.951                     

				Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
SS loadings       1.0    1.0    1.0    1.0    1.0
Proportion Var    0.2    0.2    0.2    0.2    0.2
Cumulative Var    0.2    0.4    0.6    0.8    1.0

The scree plot is shown as follows:

stockPca plot

Between them, the first two principal components explain 99% of the variance; we can therefore replace the five original variables by these two principal components with no appreciable loss of information.

Ridge Regression

Another application of correlation matrices is to calculate ridge regression, a type of regression that can help deal with multicollinearity and is part of a broader class of models called Penalized Regression Models.

Where the ordinary least squares regression minimizes the sum of squared residuals

math 1

ridge regression minimizes the slightly modified sum

math 2

The solution to the ridge regression is

math 3

where X is the model matrix. This matrix is similar to the ordinary least squares regression solution with a “ridge” added along the diagonal.

Since the model matrix is embedded in the correlation matrix, the following function allows us to compute the ridge regression solution:

#  Ridge regression
rxRidgeReg <- function(formula, data, lambda, ...) {
  myTerms <- all.vars(formula)
  newForm <- as.formula(paste("~", paste(myTerms, collapse = "+")))
  myCor <- rxCovCor(newForm, data = data, type = "Cor", ...)
  n <- myCor$valid.obs
  k <- nrow(myCor$CovCor) - 1
  bridgeprime <- do.call(rbind, lapply(lambda, 
        function(l) qr.solve(myCor$CovCor[-1,-1] + l*diag(k), 
                             myCor$CovCor[-1,1])))
  bridge <-  myCor$StdDevs[1] * sweep(bridgeprime, 2, 
        myCor$StdDevs[-1], "/")
  bridge <- cbind(t(myCor$Means[1] - 
        tcrossprod(myCor$Means[-1], bridge)), bridge)
  rownames(bridge) <- format(lambda)
  return(bridge)
}

The following example shows how ridge regression dramatically improves the solution in a regression involving heavy multicollinearity:

set.seed(14)
x <- rnorm(100)
y <- rnorm(100, mean=x, sd=.01)
z <- rnorm(100, mean=2 + x +y)
data <- data.frame(x=x, y=y, z=z)
lm(z ~ x + y)

 Call:
 lm(formula = z ~ x + y)

 Coefficients:
 (Intercept)            x            y  
	  1.827        4.359       -2.584  
rxRidgeReg(z ~ x + y, data=data, lambda=0.02)
					  x         y
 0.02 1.827674 0.8917924 0.8723334

You can supply a vector of lambdas as a quick way to compare various choices:

rxRidgeReg(z ~ x + y, data=data, lambda=c(0, 0.02, 0.2, 2, 20))
                        x           y
  0.00 1.827112 4.35917238 -2.58387778
  0.02 1.827674 0.89179239  0.87233344
  0.20 1.833899 0.81020130  0.80959911
  2.00 1.865339 0.44512940  0.44575060
 20.00 1.896779 0.08092296  0.08105322

For λ = 0, the ridge regression is identical to ordinary least squares, while as λ → ∞, the coefficients of x and y approach 0.