STM1001: Assignment 3
Science/Health Stream Students Only
Academic Integrity Information
In submitting your work, you are consenting that it may be copied and transmitted by the University for the detection of plagiarism. If you are unsure of your academic integrity responsibilities, please check the information provided in the Assessment Overview tile on the LMS. Please start with the following statement of originality, which you must sign and date:
Statement of Originality:
“This is my own work. I have not copied any of it from anyone else.”
| Name | Student ID | Signature | Date | 
Assignment Submission Details
- This assignment is due by 11.59pm Thursday 24th October (Week 12).
- This assignment is worth 15% of your final mark and is out of 50 marks. The available marks for each question are displayed in the question.
- You must submit your assignment electronically as a single Word or PDF file via the STM1001 LMS Assignment 3 Turnitin submission link. Zip files are not accepted.
- To avoid incurring late penalties, please ensure your work is correctly submitted and that the orientation of your file is correct (i.e. not sideways or upside down).
- Where questions require the use of jamovi, you must include all relevant computer output, code and plots in your answer in order to gain full marks.
- Round answers to three decimal places where relevant.
Unless otherwise specified, assume a significance level of
where relevant.
Where your answers to any question refer to a p-value, you must explicitly state the p-value. If the p-value is less than 0.001, express the p-value as “p < 0.001”.
Tips for your Assignment Submission
You are welcome to submit your work in a Word document format. You can find an example Assignment 3 Template on the LMS in the Assignments tile.
1 Question 1
(8 marks total)
In this question, you will perform a k-means cluster analysis on annual data on various health, socioeconomic and behavioural characteristics of nations around the world. This data has been collected and collated by Omondi et al. (2022a, 2022b), with some data originally sourced from Kaggle, and from the World Health Organization’s and the United Nations’ databases.
1.1
A subset of the Omondi et al. (2022a) data set has been prepared for you, and is available in the Assignment 3 section of the Assignments tile on LMS, in the STM1001_S2_A3_Life_Expectancy_Data.csv file.
Download this file and load the data into jamovi. The variables in this data set include:
- Country – Country Name.
- Region – Global regional location (7 regions of the world identified, i.e., East Asia & Pacific, Europe & Central Asia, Latin America & Caribbean, Middle East & North Africa, North America, South Asia, and Sub-Saharan Africa).
- IncomeG – Income Group (this is a factor variable describing a country’s classification in terms of its social class based on the income levels of the majority of its population, i.e., Low Income, Middle Income, and High Income).
- Year – The calendar year of interest.
- LifeExp – Life expectancy in years.
- Alcohol – The recorded per capita alcohol consumption (in litres) for a country in a specific year.
- HepsB – The percentage of Hepatitis B immunization coverage among 1 year olds.
- Measles – The number of reported measles cases per 1000 population.
- BMI – The average body mass index of the entire population.
- GovHealthExp – A country’s government expenditure on health as a percentage of total government expenditure.
- GDP – A country’s gross domestic product per capita in dollars.
- Pop – A country’s population.
- Schooling – The average number of years spent in school.
N.B.: The continuous numeric variables in your data object have been scaled, to help with subsequent analyses, so some values will appear strange if you interpret them at face value.
(0 marks)
1.2
Perform k-means clustering on your STM1001_S2_A3_Life_Expectancy_Data.csv data, to see if we can accurately cluster countries into the 7 global regional locations. For your clustering:
- Use all variables except Country, Region, IncomeG and Year.
- Use the default algorithm setting. If either of the following error messages appears, do not worry, it is safe to ignore it:
  - more cluster centers than distinct data points
  - number of cluster centers must lie between 1 and nrow(x)
- Use a k value of 7.
After carrying out your k-means clustering analysis, complete the following:
- Include the Clustering Table in your answers
- List the number of countries in each cluster
- Create a plot of the means across clusters and include this in your answers
(4 marks)
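For students working in R rather than the jamovi menus, the steps above can be sketched roughly as follows. This is a minimal illustration only: it uses R's built-in USArrests data and an arbitrary k of 4 as stand-ins, not the assignment data or the required k, and the object names (x, km, cluster_sizes, cluster_means) are hypothetical.

```r
# Illustrative only: USArrests is a built-in R data set, not the assignment data.
x <- scale(USArrests)                      # standardise each numeric column
set.seed(123)                              # k-means starts from random centres
km <- kmeans(x, centers = 4, nstart = 25)  # k = 4 is an arbitrary stand-in

# Number of observations assigned to each cluster
cluster_sizes <- table(km$cluster)
print(cluster_sizes)

# Mean of each (scaled) variable within each cluster
cluster_means <- aggregate(as.data.frame(x),
                           by = list(cluster = km$cluster), FUN = mean)
print(cluster_means)

# Plot of means across clusters: one line per cluster, one x position per variable
matplot(t(cluster_means[, -1]), type = "b", pch = 1:4, lty = 1, xaxt = "n",
        xlab = "Variable", ylab = "Cluster mean (scaled)")
axis(1, at = seq_len(ncol(x)), labels = colnames(x))
```

Setting a seed and using several random starts (nstart) makes the result reproducible and less sensitive to the initial centres; whether the jamovi defaults match these settings is not assumed here.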
1.3
Create a cluster plot to visualise the k-means clusters identified in 1.2.
Comment on the plot, noting key features. Do you notice anything interesting or strange?
Hint: Recall that you can create cluster plots using the fviz_cluster function.
(4 marks)
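As a rough sketch of the hint, fviz_cluster() from the factoextra package takes a fitted k-means object plus the data it was fitted on. The example below again uses the built-in USArrests data and k = 4 purely as stand-ins, and it assumes factoextra may or may not be installed, falling back to a base-R plot of the first two principal components if it is not.

```r
# Illustrative only: stand-in data and an arbitrary k, not the assignment settings.
x <- scale(USArrests)
set.seed(123)
km <- kmeans(x, centers = 4, nstart = 25)

if (requireNamespace("factoextra", quietly = TRUE)) {
  # With more than two variables, the plot shows the first two principal components
  cluster_plot <- factoextra::fviz_cluster(km, data = x)
  print(cluster_plot)
} else {
  # Base-R fallback: plot the first two principal components, coloured by cluster
  pc <- prcomp(x)
  plot(pc$x[, 1:2], col = km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
}
```

When reading such a plot, overlap between clusters and points far from their cluster centre are the kinds of features worth commenting on.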
2 Question 2
(10 marks total)
Suppose you have computed a set of p-values as part of a multiple hypothesis testing procedure for genomic data. In your chosen statistical software (jamovi or R) run the code below to store these p-values in the object base_p_values, before answering the following questions.
base_p_values <- c(0.0516, 0.0240, 0.0182, 0.2356, 0.0336, 0.0809, 0.1217, 0.2295, 0.0817, 0.0850)
2.1
Perform Bonferroni correction on these p-values to control the FWER.
(1 mark)
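As a reminder of the mechanics, Bonferroni correction multiplies each p-value by the number of tests and caps the result at 1. The sketch below uses a small set of hypothetical p-values (p_raw is not the assignment vector) and checks the built-in p.adjust() result against the manual calculation.

```r
# Hypothetical p-values for illustration (not the assignment values)
p_raw <- c(0.001, 0.020, 0.040, 0.300)
m <- length(p_raw)                      # number of tests

# Built-in adjustment
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Manual version: multiply by m, then cap at 1
p_manual <- pmin(1, p_raw * m)

print(round(p_bonf, 3))                 # 0.004 0.080 0.160 1.000
print(isTRUE(all.equal(p_bonf, p_manual)))
```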
2.2
Using an
threshold, answer the following:
- How many p-values were significant after Bonferroni correction?
- How many p-values were significant prior to Bonferroni correction?
(2 marks)
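Counting significant results comes down to comparing each p-value with the threshold and summing the TRUE values. A minimal sketch, using hypothetical p-values and an assumed threshold of 0.05 chosen only for this illustration:

```r
# Hypothetical p-values and an assumed threshold (illustration only)
p_raw <- c(0.001, 0.020, 0.040, 0.300)
alpha <- 0.05

p_bonf <- p.adjust(p_raw, method = "bonferroni")

n_sig_raw  <- sum(p_raw  < alpha)   # significant before correction: 3
n_sig_bonf <- sum(p_bonf < alpha)   # significant after correction: 1

print(c(before = n_sig_raw, after = n_sig_bonf))
```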
2.3
A recent study (Sharma et al. 2021) assessed gene expression differences between Annapurna (heat tolerant variety) and IR64 (heat sensitive variety) rice cultivar seedlings when exposed to heat stress (37 degrees Celsius for 1 hour).
In this question we will analyse a subset of their data, with the aim of identifying statistically significantly differentially expressed genes between the two varieties of rice, under heat stress.
This data is stored in the STM1001_2024_S2_A3_Rice_Data.csv file on LMS. Due to its size, this data can be considered Big Data. Download this file now.
Open the STM1001_2024_S2_A3_Rice_Data.csv data file in jamovi. Then run (i.e. execute) the following code in the Rj Editor. This code will:
- load the data (line 1 code) and then
- store the base p-values from the data set in the rice_initial_pvalues object (line 2 code)
rice_data <- data
rice_initial_pvalues <- rice_data$P.Value
Note for jamovi users: Do not worry if you run your code and nothing happens. Remember that by default, the Rj editor doesn’t show code that has been run, only output. You can change this in the settings by clicking on the cog symbol and selecting Show code and output, as shown in the screenshot below:
(0 marks)
2.4
We have learnt about several p-value adjustment methods in this subject (Bonferroni, Holm, Hochberg, Hommel, FDR).
Provide a short explanation on which method you think is most appropriate for this data.
(2 marks)
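All five methods listed above are available through R's p.adjust() function, so their behaviour can be compared side by side. The sketch below uses a handful of hypothetical p-values and an assumed threshold of 0.05; it illustrates how the methods differ in strictness and is not a recommendation for the rice data. In p.adjust(), "fdr" is an alias for the Benjamini-Hochberg method.

```r
# Hypothetical p-values for illustration (not the rice data)
p_raw <- c(0.001, 0.008, 0.012, 0.030, 0.045, 0.200)

# Method names as accepted by p.adjust(); "fdr" is an alias for "BH"
methods <- c("bonferroni", "holm", "hochberg", "hommel", "fdr")

# One column of adjusted p-values per method
adjusted <- sapply(methods, function(m) p.adjust(p_raw, method = m))
print(round(adjusted, 3))

# Number of p-values below an assumed 0.05 threshold under each method
alpha <- 0.05
print(colSums(adjusted < alpha))
```

The first four methods control the family-wise error rate, whereas the FDR method controls the expected proportion of false discoveries, which is why it is typically the least conservative of the five.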
2.5
Using the p-value adjustment method you chose in part 2.4, carry out a correction on the p-values stored in the rice_initial_pvalues data object. Store the adjusted p-values in a new object called rice_padjusted.
Note: You do not need to show the rice_padjusted output – it would take up several pages!
(1 mark)
2.6
Using a p-value threshold of
, how many statistically significant p-values are present in rice_initial_pvalues?
Hint: Check Question 4 of Computer Lab 8B if you are not sure how to proceed.
(2 marks)
2.7
Using an adjusted p-value threshold of
, how many statistically significant p-values are present in your rice_padjusted data object? Comment on the impact the p-adjustment method has had on the number of genes identified as being statistically significant.
(2 marks)
3 Question 3
(17 marks total)
For this question, you will need to analyse data found in the data file anxiety_data.csv (Kassambara 2019). Download the data set from LMS and open it in jamovi.
The data set contains the following variables:
- id : the ID of the individual (ranges from 1 to 45: we are considering a random sample of 30 individuals from the original 45 in the data set)
- t1 : anxiety score at time-point 1
- t2 : anxiety score at time-point 2
- t3 : anxiety score at time-point 3
When answering the following questions, you may assume that all necessary hypothesis test assumptions have been met. Remember to round to three decimal places where relevant.
3.1
What is the mean anxiety score for each of the three time-points?
(2 marks)
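For anyone checking their jamovi descriptives in R, column means can be obtained with colMeans(). The data frame below is a tiny hypothetical stand-in with the same column names as anxiety_data.csv; the values are made up for illustration.

```r
# Hypothetical scores with the same column names as anxiety_data.csv
anxiety <- data.frame(
  id = 1:4,
  t1 = c(15, 17, 16, 18),
  t2 = c(14, 16, 16, 17),
  t3 = c(13, 15, 14, 16)
)

# Mean score at each time-point, rounded to three decimal places
time_means <- round(colMeans(anxiety[, c("t1", "t2", "t3")]), 3)
print(time_means)   # t1 = 16.5, t2 = 15.75, t3 = 14.5
```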
3.2
Using an appropriate ANOVA analysis, we wish to test for a difference in average anxiety scores across time-points. Write down the null and alternative hypotheses, ensuring you define any parameters mentioned (e.g.
etc.).
(4 marks)
3.3
In this example, what is the dependent (response) variable?
(1 mark)
3.4
Carry out an appropriate ANOVA analysis to test for a difference in average anxiety scores across time-points. Provide the ANOVA table obtained from jamovi.
(3 marks)
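Because the same individuals are measured at each time-point, one way to sketch this kind of analysis in R is a repeated-measures ANOVA fitted with aov() and an Error() term. The example below assumes that design and uses simulated scores, not the assignment data; the names wide, long and fit are hypothetical, and jamovi's output layout will differ from R's.

```r
# Simulated data for illustration only (not anxiety_data.csv)
set.seed(42)
n <- 12
wide <- data.frame(
  id = factor(seq_len(n)),
  t1 = rnorm(n, mean = 17, sd = 1.5),
  t2 = rnorm(n, mean = 16, sd = 1.5),
  t3 = rnorm(n, mean = 15, sd = 1.5)
)

# Reshape from wide (one row per person) to long (one row per person per time-point)
long <- reshape(wide, direction = "long",
                varying = c("t1", "t2", "t3"), v.names = "score",
                timevar = "time", times = c("t1", "t2", "t3"), idvar = "id")
long$time <- factor(long$time)

# Repeated-measures ANOVA: time is the within-subject factor, id identifies subjects
fit <- aov(score ~ time + Error(id / time), data = long)
print(summary(fit))
```

The Error(id / time) term tells aov() that observations are grouped within individuals, so the effect of time is tested against within-subject variability rather than the total variability.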
3.5
Do we have enough evidence to conclude that there is a statistically significant difference in anxiety scores across time? Justify your answer.
(2 marks)
3.6
Write a one-sentence summary of your ANOVA results using the format taught in this subject.
(5 marks)
4 Question 4
(10 marks total)
In this question, we will be revisiting the research paper from Assignment 1, Multivitamin Supplementation Improves Memory in Older Adults: A Randomized Clinical Trial. You can access a limited version of the paper here.
Read the Abstract (Background, Objectives, Methods, Results, Conclusions) and Section snippets (COSMOS trial, Baseline characteristics, Discussion) sections of the paper, and answer the following questions.
4.1
In terms of the Hierarchies of Evidence framework discussed in this subject, what type of evidence is provided in the study? Name one type of evidence that would be stronger than this type of evidence.
(2 marks)
4.2
Recall that the results of t-tests have been provided in the “Results” sub-section. No mention has been made regarding checking the Normality assumption. However, based on the information available to us, there is most likely no cause for concern regarding the Normality assumption. Explain why this is the case.
(4 marks)
4.3
What has been done in this study design to manage the Hawthorne effect? Explain.
(4 marks)
5 Question 5
(5 marks total)
For this question, your task is to reflect upon what you have learnt in STM1001, and to write a self-reflective piece (a short paragraph).
While your response must relate to STM1001, what you choose to write about here is to a large extent up to you, and will be specific to your chosen tertiary qualification(s). Some examples (not exhaustive) are provided below:
- A reflection on your favourite/least favourite STM1001 content covered
- How what you have learnt in STM1001 could tie in with your current/future career
- An area of statistics you would like to learn more about, based on what you have learnt in STM1001
- A difficulty you encountered in learning the content, and what you did to address this issue
Regardless of your selected area of discussion, your response must demonstrate that you have reflected upon and thought about your learning. Marks will be awarded based on this criterion.
