Page 1 of 6
Tutorial Solutions – Week 2 (Plot)
The data file ‘usair.dat’ gives a number of measurements of air pollution from 41 cities
in the USA. The data consists of seven variables, with values representing the means from
1969 to 1971 for each city:
• SO2: SO2 content in micrograms per cubic metre;
• temp: average annual temperature in degrees F;
• manuf: number of manufacturing enterprises employing 20 or more workers;
• pop: the population in thousands, in the 1970 census;
• wind.speed: the average annual wind speed in miles per hour;
• annual.precip: the average annual precipitation in inches;
• days.precip: the average number of days with precipitation each year.
a) Produce and interpret a draftsman display of the first 4 variables. What are the y-axis
and x-axis of the plot on the bottom left (4th row, 1st column) and the plot on the 3rd row
and 2nd column? Do any variable comparisons suggest a strong linear relationship?
> us
> # Draftsman’s display of the first four variables only
> plot( us[, 1:4] )
Plot(4,1): x-axis=SO2 and y-axis=pop
Plot(3,2): x-axis=temp and y-axis=manuf
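The two answers above follow from how a scatterplot matrix is laid out: the panel in row i and column j has variable i on the y-axis and variable j on the x-axis. A minimal sketch of that rule is given here; the helper name `panel_axes` is hypothetical and is not part of R.

```r
# Variables in the order they appear in us[, 1:4]
vars <- c("SO2", "temp", "manuf", "pop")

# Hypothetical helper: axes of the panel in a given row and column
# of a scatterplot matrix (row variable on y, column variable on x)
panel_axes <- function(row, col, vars) {
  c(x = vars[col], y = vars[row])
}

panel_axes(4, 1, vars)  # x = "SO2",  y = "pop"
panel_axes(3, 2, vars)  # x = "temp", y = "manuf"
```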
The strongest linear relationship (positive) appears to be between pop and manuf.
b) Create a 3D scatterplot between temp, pop and manuf. Discuss any interesting features.
> library(scatterplot3d)
> scatterplot3d(temp, manuf, pop, main="3D Scatterplot")
Looks like there are just a few very large population sizes that may influence analysis.
c) Create 5 plots showing the relationship between temp and each of the other variables.
Display these plots in the same window in 2 rows and 3 columns.
> par(mfrow = c(2,3))
> plot(temp~annual.precip )
> par(mfrow = c(1,1))
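The listing above shows only one of the five plot calls. One way to produce all five is a loop, sketched here under some assumptions: the data frame is called `us`, its column names match the variable list, and the helper name `plot_temp_grid` is hypothetical. Which five variables to use is also an assumption, since the question does not name them.

```r
# Hypothetical helper: plot temp against each variable in `others`,
# filling a 2 x 3 grid row by row, then restore the previous layout.
plot_temp_grid <- function(us, others) {
  old <- par(mfrow = c(2, 3))
  on.exit(par(old))
  for (v in others) {
    # Same orientation as plot(temp ~ annual.precip): temp on the y-axis
    plot(us[[v]], us$temp, xlab = v, ylab = "temp")
  }
  invisible(length(others))
}

# Assumed choice of five variables; swap in whichever five the exercise intends.
# plot_temp_grid(us, c("SO2", "manuf", "pop", "wind.speed", "annual.precip"))
```

Restoring the layout with `on.exit` has the same effect as the final `par(mfrow = c(1,1))` call when the starting layout is a single panel.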
d) Create a scatterplot matrix for the variables: temp, manuf, pop, wind.speed and
Both pop and manuf are skewed right (a few large outliers). The relationships between the
environmental variables and each of the other variables show a lot of variation and,
although a linear relationship can be fitted, it is not obvious. The linear relationship between
pop and manuf still seems quite strong.
e) Identify the cities (rows) of data with the highest populations causing the skew in that
variable. Use syntax to remove these observations from the plotting. Interpret any
changes to the plot.
  SO2 temp manuf pop wind.speed annual.precip days.precip
1  10 70.3   213 582        6.0          7.05          36
2  13 61.0    91 132        8.2         48.52         100
3  12 56.7   453 716        8.7         20.66          67
4  17 51.9   454 515        9.0         12.95          86
View all data to identify the rows to be removed
> (store_outliers
   temp manuf  pop wind.speed annual.precip
11 50.6  3344 3369       10.4         34.44
18 49.9  1064 1513       10.1         30.96
29 54.6  1692 1950        9.6         39.93
35 68.9   721 1233       10.8         48.19
Rows 11, 18 and 29 are outliers on both manuf and pop. I also removed row 35
because its value for pop was still quite a bit higher than other values (the next highest pop
value is 905 for row 17). This is a subjective choice. There could be an argument that an
outlier on one variable is not a good enough reason to remove a case. If you choose to
remove only cases 11, 18 and 29, that is not incorrect, as long as you describe your
reasoning or justification.
Remove rows 11, 18, 29 and 35 from the analysis
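The removal step itself is not shown in the listing. A minimal sketch of one way to do it follows. To stay self-contained it rebuilds only the eight rows printed in this solution (manuf and pop columns), so the name `us_shown` and the reduced data frame are assumptions made for illustration; the real data frame has 41 rows and seven columns.

```r
# Only the eight rows printed in this solution (the full file has 41 cities);
# row names keep the original row numbers.
us_shown <- data.frame(
  manuf = c(213, 91, 453, 454, 3344, 1064, 1692, 721),
  pop   = c(582, 132, 716, 515, 3369, 1513, 1950, 1233),
  row.names = as.character(c(1, 2, 3, 4, 11, 18, 29, 35))
)

# Sort by population, largest first, to see which rows stand out
by_pop <- us_shown[order(us_shown$pop, decreasing = TRUE), ]
head(by_pop, 4)

# Drop the chosen rows by row name
drop_rows <- c("11", "18", "29", "35")
us_trimmed <- us_shown[!(rownames(us_shown) %in% drop_rows), ]

# Replot without the four cities:
# plot(us_trimmed)

# With the full 41-row data frame, where row names equal row positions,
# negative indexing does the same job: us[-c(11, 18, 29, 35), ]
```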
The relationship between pop and manuf is much less clear. The appearance of a strong
linear relationship was driven by a few cities with both high pop and high manuf. It is
debatable whether these cities should be removed from future analysis of this data. It
depends strongly on the research question.
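The effect described here, where a few points that are extreme on both variables create the appearance of a strong linear relationship, can be reproduced with a small made-up example. The numbers below are invented for illustration and are not taken from usair.dat.

```r
# Made-up illustration (not the usair data): ten points with no trend...
x <- 1:10
y <- c(5, 3, 6, 2, 7, 4, 6, 3, 5, 4)

# ...plus two points that are extreme on both variables
x_all <- c(x, 100, 120)
y_all <- c(y, 100, 115)

r_with    <- cor(x_all, y_all)  # dominated by the two extreme points
r_without <- cor(x, y)          # close to zero once they are removed
round(c(with = r_with, without = r_without), 2)
```

The same comparison on the real data would be the correlation between pop and manuf computed with and without rows 11, 18, 29 and 35.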
The data set ‘emdecade.dat’ contains climate data from Emerald Qld for the decades 1890s
to the 1990s. The data consists of the variables:
• rain: the average monthly rainfall by decade (in mm);
• maxt: the average decade maximum daily temperature (in degrees Celsius);
• mint: the average decade minimum daily temperature (in degrees Celsius);
• radn: radiation (in MJ/m²);
• pan: pan evaporation (in mm);
• vpd: maximum vapour pressure deficit (in hPa).
a) Create a star plot for the variables rain, maxt and mint. Interpret.
> em
> ##starplot
> stars( em[,2:4], draw.segments=TRUE, labels=(em[,1]),
+ key.loc=c(7,2), main="Emerald weather by Decade" )
Rain (black) was relatively high in 1890, 1950 and 1970 but very low in other decades,
particularly 1930 and 1960.
1920 had very high max temps and low min temps. 1950 had low min temps and low max
temps. There is no clear relationship between low rain and high max temps (except 1920).
b) Create a profile plot of variable means by decade. Do not connect points by lines.
> emmeans_melt
> attach(emmeans_melt)
> ggplot(emmeans_melt, aes(variable, value, colour=as.factor(decade))) + geo
The larger measurement scale of rain obscures detail on the y-axis for the other variables.
Average rain has varied quite a bit over the decades.
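The profile plot needs the data in long format, with one row per decade and variable. The listing does not show how `emmeans_melt` was built, so the following is only a sketch in base R of one way to get that layout. The names `em_demo` and `em_long` are hypothetical, the values are placeholders rather than the real Emerald data, and the plotting step assumes ggplot2 is installed.

```r
# Stand-in for the Emerald data: real decade labels and variable names,
# but PLACEHOLDER values (the real numbers are in emdecade.dat).
vars <- c("rain", "maxt", "mint", "radn", "pan", "vpd")
em_demo <- data.frame(decade = seq(1890, 1990, by = 10))
for (v in vars) em_demo[[v]] <- seq_len(nrow(em_demo))  # placeholder values

# Wide to long: one row per (decade, variable) pair, variables stacked in order
em_long <- data.frame(
  decade   = rep(em_demo$decade, times = length(vars)),
  variable = factor(rep(vars, each = nrow(em_demo)), levels = vars),
  value    = unlist(em_demo[vars], use.names = FALSE)
)
dim(em_long)  # 66 rows (11 decades x 6 variables), 3 columns

# Profile plot with points only (no connecting lines), if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(em_long,
         ggplot2::aes(variable, value, colour = as.factor(decade))) +
       ggplot2::geom_point(size = 3)
}
```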
c) Redo the profile plot excluding the variable ‘rain’. Interpret.
> ggplot(emmeans_melt[12:66,], aes(variable, value, colour=as.factor(decade))) +
+ geom_point(size=3)
Very little variation in ‘pan’ over the decades.
No consistent decade for high and low across the variables.
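The row range 12:66 in part c) works because the long-format data has 11 rows (one per decade) for each of the 6 variables, with rain stacked first, so rows 1 to 11 are the rain values. Selecting by variable name avoids depending on that row order. A minimal sketch follows, using a placeholder data frame (`long_demo`, a hypothetical name with dummy values) that has the same layout.

```r
# Placeholder long-format frame with the same layout as the stacked data:
# 11 decades for each of 6 variables, rain first (values are dummies)
vars <- c("rain", "maxt", "mint", "radn", "pan", "vpd")
long_demo <- data.frame(
  decade   = rep(seq(1890, 1990, by = 10), times = length(vars)),
  variable = factor(rep(vars, each = 11), levels = vars),
  value    = 0
)

by_index <- long_demo[12:66, ]                         # relies on row order
by_name  <- long_demo[long_demo$variable != "rain", ]  # relies on the label only

# Both approaches select the same rows for this layout
identical(rownames(by_index), rownames(by_name))
```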
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J. and Ostrowski, E. (1994). A
Handbook of Small Data Sets. Chapman and Hall, London.