Source: Manly, Bryan F.J. Multivariate Statistical Methods: A Primer, Third Edition, CRC Press,
07/2004.
Tutorial Solutions – Week 5 (PCA)
Question 1:
a) Can the second PC explain more variation than the first PC when performing PCA on the
correlation matrix?
Solution:
No. PCA fits the first PC to explain the maximum possible variance, and each subsequent PC is
fitted to the variance that remains unexplained, so no later PC can explain more than an
earlier one. Each PC explains a different component of the overall variance, so the variances
are additive.
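The ordering can be checked numerically. The following is a minimal sketch on simulated data (the data and the names X, pca, eig and prop are illustrative assumptions, not part of the tutorial); it only shows that the variances returned by prcomp never increase from one PC to the next.

```r
# Simulated data: 50 observations on 4 partly correlated variables (illustrative only)
set.seed(1)
x1 <- rnorm(50)
X <- cbind(a = x1,
           b = x1 + rnorm(50, sd = 0.5),
           c = rnorm(50),
           d = x1 - rnorm(50, sd = 2))

# PCA on the correlation matrix (scale. = TRUE standardises each variable)
pca <- prcomp(X, scale. = TRUE)

eig  <- pca$sdev^2       # eigenvalues = variance of each PC
prop <- eig / sum(eig)   # proportion of total variance per PC

print(round(prop, 3))
# Successive differences between eigenvalues are never positive
print(all(diff(eig) <= 0))
```

Because the proportions are additive, cumsum(prop) gives the cumulative variance explained and always ends at 1.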
b) When could you use the covariance matrix rather than the correlation matrix?
Solution:
Use the covariance matrix if the original variables are measured on similar scales with
similar units. When variables are measured on different scales the variance of one variable
can overwhelm the analysis. Standardising by the variable standard deviation (using the
correlation matrix) corrects for this.
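As a rough illustration of how one variable can overwhelm a covariance-based analysis, the sketch below uses two hypothetical variables on very different scales (height_m and income are invented for this example and are not from Manly).

```r
set.seed(2)
# Hypothetical variables measured on very different scales
height_m <- rnorm(100, mean = 1.7, sd = 0.1)        # metres
income   <- rnorm(100, mean = 50000, sd = 10000)    # dollars
X <- cbind(height_m, income)

pca_cov <- prcomp(X, scale. = FALSE)  # PCA on the covariance matrix
pca_cor <- prcomp(X, scale. = TRUE)   # PCA on the correlation matrix

# Share of total variance carried by PC1 under each approach
share_cov <- pca_cov$sdev[1]^2 / sum(pca_cov$sdev^2)
share_cor <- pca_cor$sdev[1]^2 / sum(pca_cor$sdev^2)
print(c(covariance = share_cov, correlation = share_cor))

# On the covariance matrix, PC1 is almost entirely the income variable
print(round(abs(pca_cov$rotation[, 1]), 4))
```

With the covariance matrix, PC1 simply reproduces the variable with the largest variance; standardising gives both variables equal weight.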
c) The sum of the eigenvalues should equal what?
Solution:
The sum of the eigenvalues should equal the sum of the variances in the covariance matrix
(its diagonal elements) or, for the correlation matrix, the sum of the diagonal, which is
equal to p, the number of variables. In other words, the total variance of all PCs (sum of
eigenvalues) should equal the sum of the variances of all original variables.
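A quick numerical check of this identity, sketched on simulated data (the matrix X is an assumption made purely for illustration):

```r
set.seed(3)
X <- matrix(rnorm(200), ncol = 5)   # 40 observations, p = 5 variables

eig_cov <- eigen(cov(X))$values     # eigenvalues of the covariance matrix
eig_cor <- eigen(cor(X))$values     # eigenvalues of the correlation matrix

# Sum of eigenvalues next to the sum of the diagonal (the trace)
print(c(sum(eig_cov), sum(diag(cov(X)))))
# For the correlation matrix the trace is p, the number of variables
print(c(sum(eig_cor), ncol(X)))
```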
d) If PCA results based on the correlation matrix of 9 variables find that first 3 PCs explain
82%, 7% and 2.5% of the variance respectively; only the first PC has an eigenvalue>1;
and the scree plot shows a distinct elbow at PC2 and another smaller elbow at PC3, how
many components would you choose?
Solution:
No absolute correct answer – judgement call.
The first 3 PCs explain 91.5% of the total variation, leaving 8.5% to be explained by the
remaining 6 PCs, or on average 1.4% per remaining PC.
Although the second PC does not have an eigenvalue>1, its contribution of 7% to the
variance explained is considerably larger than the 2.5% of the 3rd PC and the contributions
of the remaining PCs. This corresponds to the smaller elbow at PC3 on the scree plot.
I would use only two PCs. Although the 3rd gets total variance over 90%, its contribution of
only 2.5% would make any interpretation a bit vague. Using only 2 PCs makes overall
interpretation much easier (especially graphically), while losing very little explanatory
power.
My decision might change if the variable loadings showed that one variable was very
strongly loaded on PC3 that was not well represented on PC1 or PC2.
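The arithmetic behind this judgement can be laid out in a few lines. This is only a sketch using the percentages quoted in the question; the variable names are illustrative. It relies on the fact that, for a correlation matrix of p variables, an eigenvalue equals the proportion of variance multiplied by p.

```r
p   <- 9                  # number of variables
pct <- c(82, 7, 2.5)      # % variance for PC1 to PC3, as given in the question

eig  <- pct / 100 * p                  # implied eigenvalues
cum  <- cumsum(pct)                    # cumulative % explained
rest <- (100 - sum(pct)) / (p - 3)     # average % for each of the remaining 6 PCs

print(eig)
print(cum)
print(round(rest, 1))
print(sum(eig > 1))       # number of PCs passing the eigenvalue>1 rule
```

The implied eigenvalues are 7.38, 0.63 and 0.225, so only PC1 passes the eigenvalue>1 rule, consistent with the question.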
Question 2:
Using the dataset ‘europeemploy.txt’ (from Manly Table 1.5) perform PCA on the correlation
matrix using the prcomp function. This process should include:
a) Check the original correlation matrix to get an understanding of the data and the linear
relationships between variables.
Solution:
> (cor
       AGR    MIN    MAN     PS    CON    SER    FIN    SPS     TC
AGR  1.000  0.316 -0.254 -0.382 -0.349 -0.605 -0.176 -0.811 -0.487
MIN  0.316  1.000 -0.672 -0.387 -0.129 -0.407 -0.248 -0.316  0.045
MAN -0.254 -0.672  1.000  0.388 -0.034 -0.033 -0.274  0.050  0.243
PS  -0.382 -0.387  0.388  1.000  0.165  0.155  0.094  0.238  0.105
CON -0.349 -0.129 -0.034  0.165  1.000  0.473 -0.018  0.072 -0.055
SER -0.605 -0.407 -0.033  0.155  0.473  1.000  0.379  0.388 -0.085
FIN -0.176 -0.248 -0.274  0.094 -0.018  0.379  1.000  0.166 -0.391
SPS -0.811 -0.316  0.050  0.238  0.072  0.388  0.166  1.000  0.475
TC  -0.487  0.045  0.243  0.105 -0.055 -0.085 -0.391  0.475  1.000
The variables represent different employment industries with original data showing
percentage of workforce employed by industry for 30 countries (see Manly Table 1.5).
There are mostly very low correlations between industries, which isn’t too surprising.
Should we expect the percentage of people employed in agriculture to be strongly linearly
related to the percentage employed in manufacturing? Let’s continue anyway and see if the
PCA helps us understand the data.
b) Produce the PCA output, calculate eigenvalues and % variance explained by each PC (all
calculations in R). Interpret.
Solution:
> (ee.prcomp
Standard deviations:
[1] 1.764159276 1.345078898 1.223200801 1.031234134 0.842767778 0.557977406
[7] 0.541683406 0.451460350 0.002664149
Rotation:
           PC1          PC2         PC3         PC4         PC5         PC6
AGR  0.5114918  0.023474999  0.27859140 -0.01649218  0.02403794 -0.04239691
MIN  0.3749833 -0.000490734 -0.51505210 -0.11360623 -0.34631272  0.19857439
MAN -0.2461613 -0.431752051  0.50205622 -0.05827010  0.23362179 -0.03091715
PS  -0.3161203 -0.109144430  0.29369499 -0.02324549 -0.85444839  0.20647051
CON -0.2215986  0.242470912 -0.07153072 -0.78266601 -0.06215096 -0.50263565
SER -0.3815359  0.408255893 -0.06514938 -0.16903778  0.26667324  0.67269361
FIN -0.1310884  0.552938958  0.09565440  0.48921763 -0.13128795 -0.40593492
SPS -0.4281618 -0.054705874 -0.36015928  0.31724250  0.04571821 -0.15845276
TC  -0.2050706 -0.516649883 -0.41299565  0.04206329  0.02290077 -0.14189804
            PC7         PC8        PC9
AGR -0.16357428 -0.54040909 0.58203611
MIN  0.21259036  0.44859201 0.41881803
MAN  0.23601541  0.43175735 0.44708636
PS  -0.06056504 -0.15512240 0.03025124
CON -0.02028469 -0.03082345 0.12865575
SER  0.17483893 -0.20175280 0.24502068
FIN  0.45764510  0.02726352 0.19075812
SPS -0.62133030  0.04147562 0.41031481
TC   0.49214521 -0.50212355 0.06074315
> #eigen values
> ee.prcomp$sdev^2
[1] 3.112258e+00 1.809237e+00 1.496220e+00 1.063444e+00 7.102575e-01
[6] 3.113388e-01 2.934209e-01 2.038164e-01 7.097692e-06
> ## %variance
> (pervar
[1] 3.458064e+01 2.010264e+01 1.662467e+01 1.181604e+01 7.891750e+00
[6] 3.459320e+00 3.260232e+00 2.264627e+00 7.886324e-05
> (pervar
[1] 34.6 20.1 16.6 11.8 7.9 3.5 3.3 2.3 0.0
There are nine original variables, so nine potential PCs. The first 4 PCs have eigenvalues>1
and together they explain 83.1% of the variation in the original 9 variables. The first PC
explains only 34.6%. This means that after fitting the first linear combination of variables
(PC1) there is still a lot of variation (65.4%) unexplained. This suggests that in
9-dimensional space the 30 countries are fairly scattered (also reflected in the low
correlations).
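The eigenvalues, percentages and cumulative percentages can be recomputed from the standard deviations printed in the prcomp output. The sketch below types those values in by hand, since the data file is not reproduced here; the names sdev, eig, pervar and cumvar are illustrative.

```r
# Standard deviations copied from the prcomp output above
sdev <- c(1.764159276, 1.345078898, 1.223200801, 1.031234134, 0.842767778,
          0.557977406, 0.541683406, 0.451460350, 0.002664149)

eig    <- sdev^2                   # eigenvalues
pervar <- 100 * eig / sum(eig)     # % variance explained by each PC
cumvar <- cumsum(pervar)           # cumulative % variance

print(round(pervar, 1))
print(round(cumvar, 1))
print(sum(eig > 1))                # number of PCs with eigenvalue > 1
```

With a fitted object, the same quantities come from ee.prcomp$sdev^2, and summary(ee.prcomp) reports the proportions and cumulative proportions directly.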
c) Construct a scree plot. Explain your choice of the number of relevant PCs.
Solution:
> screeplot(ee.prcomp, type="lines")
The elbow at PC2 would suggest only using PC1 which does not explain enough overall
variance to be useful. Another elbow at PC6 would suggest using the first 5 PCs, but only
the first 4 had eigenvalues greater than 1. Adding the 5th PC would improve variance from
83.1% on 4 PCs to 91% which could be a useful improvement although more PCs are much
harder to interpret. I will go with 4 PCs.
d) Construct the Z equation for PC3. Interpret.
Solution:
Z_3 = 0.28(AGR) - 0.52(MIN) + 0.50(MAN) + 0.29(PS) - 0.07(CON)
      - 0.07(SER) + 0.10(FIN) - 0.36(SPS) - 0.41(TC)
On component 3, MIN and MAN have the largest loadings, although they are only moderate
(about 0.5), and they have opposite signs. This component most strongly reflects the contrast
between manufacturing and mining industries. AGR, MAN, PS and FIN all load positively to
different degrees, while all other variables load negatively (to different degrees).
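To make the Z equation concrete, the sketch below applies the PC3 loadings from the Rotation output to one set of standardised values. The values in z are hypothetical (a country one standard deviation above average in manufacturing and one below in mining), not taken from the data set.

```r
# PC3 loadings copied from the Rotation output above
pc3 <- c(AGR = 0.27859140, MIN = -0.51505210, MAN = 0.50205622,
         PS  = 0.29369499, CON = -0.07153072, SER = -0.06514938,
         FIN = 0.09565440, SPS = -0.36015928, TC  = -0.41299565)

# Hypothetical standardised values for one country (NOT from the data set)
z <- c(AGR = 0, MIN = -1, MAN = 1, PS = 0, CON = 0,
       SER = 0, FIN = 0, SPS = 0, TC = 0)

score3 <- sum(pc3 * z)       # Z_3 = sum of loading x standardised value

print(round(pc3, 2))         # coefficients as they appear in the Z equation
print(round(sum(pc3^2), 4))  # loadings of a PC have unit length
print(round(score3, 2))
```

Because PCA on the correlation matrix works with standardised variables, the equation is applied to standardised values rather than to raw percentages.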
e) Produce a biplot of the first two PCs. Interpret. Explain any differences between your
ordination plot and Manly Figure 6.2.
Solution:
> biplot(ee.prcomp,cex=c(0.7,0.7))
On PC1, countries with high employment in Agriculture (AGR) and Mining (MIN), such as
Albania, Turkey and Czech, are in contrast to those higher in Social and Personal Services
(SPS) and Power and water supplies (PS). This PC accounts for 34.6% of the overall
variance in the data.
On PC2, countries with high employment in services (SER) and finance (FIN), such as
Gibraltar, contrast with those with high employment in manufacturing (MAN) and transport and
communication (TC), such as Yugoslavia (former) and Malta. This PC accounts for 20.1% of
the overall variance in the data.
The sign of the country scores on PC2 is opposite to that in Manly Figure 6.2. This does not
matter, as their relative positions are maintained (Albania and Romania at opposite extremes
of PC2).
f) Produce a biplot of the first and third PC. How does your R code need to change to
display non-consecutive PCs?
Solution:
> biplot(ee.prcomp,choices=c(1,3), cex=c(0.7,0.7))
When the PCs are not consecutive, the choices argument must list them explicitly:
choices=c(1,3) rather than a range such as choices=3:4. If choices is omitted, biplot
plots the first two PCs.
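A self-contained sketch of the choices argument, using the USArrests data set that ships with base R as a stand-in for the tutorial data (the substitution is an assumption made so that the example runs without the data file):

```r
# USArrests is built into R; it stands in for the tutorial data here
pca <- prcomp(USArrests, scale. = TRUE)

pdf(NULL)   # discard plots in batch mode; drop this line to see them on screen
biplot(pca, cex = c(0.7, 0.7))                      # default: choices = 1:2
biplot(pca, choices = c(1, 3), cex = c(0.7, 0.7))   # PC1 against PC3
dev.off()

# The observation scores drawn in the second plot come from these two columns
print(head(pca$x[, c(1, 3)]))
```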
Question 3:
Complete Exercise 2 at the end of Chapter 6 of Manly using the data file ‘protein.txt’. The
data is the protein consumption (grams/person/day) from 9 different sources for 25
European countries.
a) Base your PCA on the correlation analysis using the prcomp function. You will first need
to isolate the variables to be included in PCA.
Solution:
> pro
> str(pro)
'data.frame': 25 obs. of 11 variables:
$ Country: Factor w/ 25 levels "Albania","Austria",..: 1 2 3 4 5 6 7 8 9 10 …
$ Rmeat : int 10 9 14 8 10 11 8 10 18 10 …
$ Wmeat : int 1 14 9 6 11 11 12 5 10 3 …
$ Eggs : int 1 4 4 2 3 4 4 3 3 3 …
$ Milk : int 9 20 18 8 13 25 11 34 20 18 …
$ Fish : int 0 2 5 1 2 10 5 6 6 6 …
$ Cereals: int 42 28 27 57 34 22 25 26 28 42 …
$ starch : int 1 4 6 1 5 5 7 5 5 2 …
$ nuts : int 6 1 2 4 1 1 1 1 2 8 …
$ FV : int 2 4 4 4 4 2 4 1 7 7 …
$ Total : int 72 86 89 91 83 91 77 91 99 99 …
> pro1
> cor(pro1)
              Rmeat       Wmeat        Eggs       Milk        Fish
Rmeat    1.00000000  0.18850977  0.57532001  0.5440251  0.06491072
Wmeat    0.18850977  1.00000000  0.60095535  0.2974816 -0.19719960
Eggs     0.57532001  0.60095535  1.00000000  0.6130310  0.04780844
Milk     0.54402512  0.29748163  0.61303102  1.0000000  0.16246239
Fish     0.06491072 -0.19719960  0.04780844  0.1624624  1.00000000
Cereals -0.50970337 -0.43941908 -0.70131040 -0.5924925 -0.51714759
starch   0.15383673  0.33456770  0.41266333  0.2144917  0.43868411
nuts    -0.40988882 -0.67214885 -0.59519381 -0.6238357 -0.12226043
FV      -0.06393465 -0.07329308 -0.16392249 -0.3997753  0.22948842
            Cereals     starch       nuts          FV
Rmeat   -0.50970337  0.1538367 -0.4098888 -0.06393465
Wmeat   -0.43941908  0.3345677 -0.6721488 -0.07329308
Eggs    -0.70131040  0.4126633 -0.5951938 -0.16392249
Milk    -0.59249246  0.2144917 -0.6238357 -0.39977527
Fish    -0.51714759  0.4386841 -0.1222604  0.22948842
Cereals  1.00000000 -0.5781345  0.6360595  0.04229293
starch  -0.57813449  1.0000000 -0.4951880  0.06835670
nuts     0.63605948 -0.4951880  1.0000000  0.35133227
FV       0.04229293  0.0683567  0.3513323  1.00000000
The correlations cover a wide range. None is very high, but several exceed 0.6 in absolute
value (for example Eggs and Cereals at -0.70), so the analysis may be worthwhile.
> pro.prcomp
> loadings
> (loadings
          PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8   PC9
Rmeat   -0.31 -0.07 -0.36 -0.60  0.40 -0.38  0.23 -0.05 -0.25
Wmeat   -0.32 -0.21  0.63 -0.04 -0.31 -0.08  0.15 -0.03 -0.58
Eggs    -0.42 -0.10  0.08 -0.26  0.07  0.66  0.04 -0.47  0.28
Milk    -0.38 -0.17 -0.40  0.03 -0.32  0.02 -0.72  0.10 -0.19
Fish    -0.13  0.65 -0.30  0.23 -0.30 -0.04  0.24 -0.44 -0.26
Cereals  0.43 -0.25  0.07  0.02  0.19 -0.19 -0.34 -0.72 -0.19
starch  -0.30  0.39  0.28  0.31  0.67  0.02 -0.33  0.08 -0.15
nuts     0.42  0.13 -0.14 -0.25  0.09  0.59 -0.03  0.22 -0.57
FV       0.12  0.50  0.34 -0.60 -0.23 -0.16 -0.36  0.01  0.21
> pro.prcomp$sdev^2
[1] 4.0955365 1.6249031 1.0853237 0.9050170 0.4267377 0.3469402
[7] 0.2695240 0.1345226 0.1114953
> (pervar
[1] 45.505961 18.054478 12.059152 10.055744 4.741530 3.854891
[7] 2.994711 1.494695 1.238837
> (pervar
[1] 45.5 18.1 12.1 10.1 4.7 3.9 3.0 1.5 1.2
> screeplot(pro.prcomp, type="lines")
b) How many PCs should be considered based on the scree plot, eigenvalues and total
variance methods?
Solution:
The first 3 PCs have eigenvalues >1 and together explain 75.6% of the variance. Four PCs
would explain 85.7%, and 5 PCs would be needed to explain just over 90% (90.4%). Five PCs
out of 9 isn’t a bad reduction in dimensionality but is still difficult to interpret. The
scree plot shows elbows at 2, 3 and 5, which suggest using 1, 2 or 4 PCs respectively. The
first PC alone does not explain enough variance (45.5%).
I would choose 4 PCs.
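The three criteria can be set side by side in one small table. The sketch below uses the eigenvalues printed above, typed in by hand; the data frame tab and its column names are illustrative. Cumulative percentages computed from the unrounded eigenvalues can differ by 0.1 from sums of the already rounded percentages.

```r
# Eigenvalues copied from pro.prcomp$sdev^2 above
eig <- c(4.0955365, 1.6249031, 1.0853237, 0.9050170, 0.4267377,
         0.3469402, 0.2695240, 0.1345226, 0.1114953)

tab <- data.frame(
  PC         = seq_along(eig),
  eigenvalue = round(eig, 2),
  pct        = round(100 * eig / sum(eig), 1),           # % variance per PC
  cum_pct    = round(100 * cumsum(eig) / sum(eig), 1),   # cumulative % variance
  kaiser     = eig > 1                                   # eigenvalue > 1 rule
)
print(tab)
```

Reading down the table, the eigenvalue>1 rule keeps 3 PCs, while a cumulative target of roughly 85% to 90% needs 4 or 5.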
c) Explain the relationships between Albania and Ireland and between Portugal and
Bulgaria from the first 2 PCs. Try using biplots. What is a limitation inherent in your
interpretation?
Solution:
> biplot(pro.prcomp,cex=c(0.7,0.7)) #ordination plot labels = row numbers
> biplot(pro.prcomp,cex=c(0.7,0.7),xlabs=pro$Country) #ordination plot labels = country names
Look back at the original data to see that Albania=1, Ireland=12, Portugal=17 and Bulgaria=4.
Also look at the values for each original variable.
> (pro2
    Country Rmeat Wmeat Eggs Milk Fish Cereals starch nuts FV
1   Albania    10     1    1    9    0      42      1    6  2
12  Ireland    14    10    5   26    2      24      6    2  3
17 Portugal     6     4    1    5   14      27      6    5  8
4  Bulgaria     8     6    2    8    1      57      1    4  4
Albania v Ireland: on PC1 Albania is high in nuts and cereals and low in milk and eggs, while
Ireland is the opposite. Both are similarly low in fish on PC2.
Portugal v Bulgaria: on PC2 Portugal is very high in fish and FV while Bulgaria is low in
both. On PC1 both are high in nuts and cereals and low in milk, eggs etc.
Relative to the other countries, Albanians get more of their protein from nuts and cereals,
while the Irish get more of theirs from milk and eggs. Both populations have similarly low
consumption of fish.
Relative to the other countries, the Portuguese get an unusually large share of their protein
from fish, while Bulgarians rely heavily on cereals.
The interpretation above is limited by the fact that the first 2 PCs explain only 63.6%
(45.5% + 18.1%) of the variation in the data.
