MAST90007 2021 Major assignment

MAST90007: Statistics for Research Workers 2021

1,500 word assignment

Due: 5 pm, Friday 30 July 2021

Submission

Submit an electronic copy of the assignment via the LMS.

IMPORTANT: All students in this subject are required to complete the online plagiarism

declaration form for the subject as a whole, covering all work. You will find a link to the

form on Canvas. If you do not complete the online plagiarism form, your

assignment will not be accepted.

This assignment contains three (3) questions worth a total of 20 marks. There is some

general advice on the assignment at the end of this document, on page 8.

The overall requirement for this assignment is to carry out and report on data analyses that

address three questions about the data from the Framingham heart study.

You may know about this study from your general knowledge; it is one of the most famous

studies in epidemiology. You can learn about the study from information on Wikipedia

(https://en.wikipedia.org/wiki/Framingham_Heart_Study), but also through these references:

Levy, D., National Heart Lung and Blood Institute., et al. (1999). 50 years of

discovery: medical milestones from the National Heart, Lung, and Blood Institute's

Framingham Heart Study. Hackensack, N.J., Center for Bio-Medical Communication

Inc.

Mahmood, S. S., Levy, D., Vasan, R. S., & Wang, T. J. (2014). The Framingham Heart

Study and the epidemiology of cardiovascular disease: a historical perspective. The

Lancet, 383(9921), 999-1008.

Oppenheimer, G. M. (2005). Becoming the Framingham study 1947–1950. American

Journal of Public Health, 95(4), 602-610.

You may also find your own useful references. You are not required to read these references

for the purposes of the assignment.

The data file contains some information from long term follow up as well as baseline

measures. The file contains records for 5,209 people – all the participants in the original

cohort of the study. The participants were followed up every 2 years. The data file includes

information from baseline, the 2nd examination (one variable), and the 16th examination (30

years after baseline).

The data file includes:

Age at baseline (years)

Height at baseline (inches)

Weight at baseline (pounds)

Body Mass Index at baseline (kg/m²)

Sex: Female / Male

Diastolic blood pressure at baseline (mmHg)

Systolic blood pressure at baseline (mmHg)

Serum cholesterol (mg/100ml) examination 1: Serum cholesterol (mg/100ml) at baseline; this variable has 2,037 missing values.

Serum cholesterol (mg/100ml) examination 2: Serum cholesterol (mg/100ml) at the 2nd examination; this variable has 626 missing values.

Serum cholesterol (mg/100ml) baseline: Baseline serum cholesterol at examination 1, or, when missing at examination 1, the serum cholesterol at the second examination.

Metropolitan Relative Weight at baseline: A measure of the percentage of actual weight to desirable weight; a measure very similar to BMI.

Smoker at baseline: Smoker / Non-smoker

Number cigarettes smoked per day at baseline

Last examination number: Number of the last examination that the person participated in.

Survived at last examination: 0 = alive at 16th examination; 1 = died prior to 16th examination

Cause of death:
    0 = still alive
    1 = sudden death from coronary heart disease (CHD)
    2 = other coronary heart disease
    3 = stroke (cerebrovascular accident, CVA)
    4 = other cerebral vascular disease
    5 = cancer
    6 = other causes of death
    9 = cause unknown

Examination at which CHD diagnosed, if applicable
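
The variable list mixes imperial units (height in inches, weight in pounds) with a metric Body Mass Index (kg/m²). The following is a minimal Python sketch of how the three quantities relate, assuming the BMI column follows the standard definition (weight in kilograms divided by the square of height in metres). The function name and the example participant are hypothetical and are not taken from the data file; the assignment itself is carried out in Minitab.

```python
LB_TO_KG = 0.45359237  # exact definition of the international pound
IN_TO_M = 0.0254       # exact definition of the inch


def bmi_from_imperial(weight_lb, height_in):
    """Return BMI in kg/m^2 from weight in pounds and height in inches."""
    weight_kg = weight_lb * LB_TO_KG
    height_m = height_in * IN_TO_M
    return weight_kg / height_m ** 2


# Hypothetical participant, not a record from the data file.
example_bmi = bmi_from_imperial(weight_lb=150, height_in=65)
print(round(example_bmi, 1))
```

A check of this kind can be useful for spotting data-entry errors, since a recorded BMI that differs markedly from the value implied by height and weight suggests a problem in one of the three columns.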

The data were accessed from:

http://courses.washington.edu/b513/datasets/datasets.php?class=513

The data file is Framingham.xlsx. You can drag and drop this file into Minitab.

When you do this, some of the variable names will be truncated; you will need to replace

them with shorter names that are still clear.

There are some references to column numbers in the assignment. These numbers will be

correct if you simply drag and drop the Excel file into Minitab; obviously, if you insert

columns yourself in the Minitab file, your column numbers may differ from those given

here.

Question 1 – Baseline data [6 marks]

This question focuses on baseline characteristics and data.

(a) Briefly describe the design of the study to provide context for the analyses you report.

(b) Produce a summary table to describe the following characteristics of the study

participants: age at baseline, height at baseline, weight at baseline and sex.

(c) Consider systolic and diastolic blood pressure at baseline. Produce suitable visual

display(s) to allow a comparison of the distributions of these according to whether or

not an individual was a smoker at baseline. You can exclude those with missing

information about smoking from visual displays using Data Options > Group options.

(d) Carry out appropriate analyses to compare those who were smokers at baseline with

those who were not, for systolic and diastolic blood pressure. Provide one or more

suitable tables that include the summary statistics and inferential statistics.

(e) Discuss and justify any assumptions underlying your choice of analysis.

(f) Write a summary of the analyses you have carried out explaining the results of all the

comparisons you have made. Write the summary for a doctor interested in the

practical application of the study results.

(g) Consider predicting systolic blood pressure at baseline from age and Metropolitan

relative weight at baseline. Provide graphical display(s) to illustrate the distributions

of the explanatory variables. Explain if you would recommend rescaling these

variables for this analysis. If appropriate, rescale the variables. Fit the model and

obtain the parameter estimates for each of the explanatory variables. Explain the

meaning of the parameter estimates for each of these explanatory variables, according

to whether you have recommended rescaling or not. (You do not need to report other

details of the analysis.)

(h) A colleague is also working with the same data file, and says: “This is great! The

sample size is so big, everything is really, really significant; this whole study gives so

many meaningful findings.” Respond to this comment.

Question 2 – Serum cholesterol at baseline [8 marks]

Serum cholesterol (mg/100ml) at baseline (column 10 in the data file) is defined as serum

cholesterol at examination 1 (the true baseline), or, when missing at examination 1, the

serum cholesterol at the second examination. For many people in the study, serum

cholesterol at both examinations 1 and 2 was available.
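
The definition above amounts to a "first non-missing value" rule. The following is a minimal Python sketch of that rule, given only to make the definition concrete; it is not part of the Minitab workflow. The function name is hypothetical, missing values are represented by None as an assumption, and the example values are invented rather than taken from the data file.

```python
def baseline_cholesterol(exam1, exam2):
    """Return the exam 1 value if recorded, otherwise the exam 2 value.

    None stands for a missing measurement (an assumption of this sketch).
    """
    return exam1 if exam1 is not None else exam2


# Invented (exam 1, exam 2) pairs covering each pattern of missingness.
records = [(210, 225), (None, 240), (198, None), (None, None)]
baseline = [baseline_cholesterol(e1, e2) for e1, e2 in records]
print(baseline)
```

Note that the derived variable is still missing when both examinations are missing, and that it silently mixes measurements taken about two years apart.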

(a) Produce an appropriate graph showing the relationship between Serum cholesterol

(mg/100ml) examination 1 and Serum cholesterol (mg/100ml) examination 2.

(b) Describe the relationship between the two variables, and give a suitable summary

statistic.

(c) Fit a linear regression predicting Serum cholesterol (mg/100ml) examination 1 from

Serum cholesterol (mg/100ml) examination 2. Provide an appropriate summary table

and give a plain language explanation of the estimates of the parameters of the model.

(d) Find a 95% prediction interval for Serum cholesterol (mg/100ml) examination 1 when

Serum cholesterol (mg/100ml) examination 2 is 300 (mg/100ml). Explain its meaning.

(e) A colleague asks if using the Serum cholesterol (mg/100ml) examination 2 value itself

as the estimate of Serum cholesterol (mg/100ml) examination 1 is a good idea; for

example, if Serum cholesterol (mg/100ml) examination 2 = 275, predict that Serum

cholesterol (mg/100ml) examination 1 = 275. (This is, in fact, what was done.) Does

this under-estimate, or over-estimate Serum cholesterol (mg/100ml) examination 1,

using the data available? Provide a graph that will help answer this question. (Hint:

Consider adding a Calculated line to show y = x.) Provide an explanation in writing.

(f) Consider improving the prediction of Serum cholesterol (mg/100ml) examination 1.

Explain, in principle, a possible approach. You do not need to implement the

approach.

(g) A key research question is about the relationship of smoking status at baseline and sex

to Serum cholesterol (mg/100ml) baseline (column 10). Describe a suitable statistical

model for answering this question, and explain the effects that will be considered in the

model.

(h) Use Minitab to fit the model that you have specified in part (g). Provide a summary

table of the Analysis of variance, and give a plain language explanation of the meaning

of the P-values associated with each of the explanatory variables. Use concrete terms in

relation to the Framingham study, rather than in abstract form.

(i) State one assumption required for analysing the data using the model you have

suggested. State if the assumption is reasonable and provide relevant evidence.

(j) Provide an appropriate graphical display to summarise the findings in relation to the

model you have fitted in (h).

(k) Find 95% confidence intervals for the effects of sex and smoking status on serum

cholesterol at baseline; use Fisher intervals and provide those that best describe the

results. Provide a suitable report of these confidence intervals, including a plain

language explanation in concrete terms.

Question 3 – Survival at last examination [6 marks]

Consider Survived at last examination; this is in column 15.

(a) Produce a graph of the data that allows a comparison of Survived at last examination

in terms of sex.

(b) Comment on any differences for sex, based on the graph.

(c) Estimate the difference in proportions (for sex) surviving at the last examination, and

the 95% confidence interval for this difference. Write a plain language explanation of the

results, using concrete terms in relation to the Framingham study.

(d) Carry out a logistic regression analysis of “Survived at last examination” using sex as a

predictor. Write a summary of the results, again suitable for a doctor interested in the

findings.

(e) Subset the Minitab worksheet to exclude those who have survived at examination 16,

so that you have the subset of subjects who died prior to examination 16.

Explore the relationship between cause of death and sex, using a suitable graphical

display. You may consider combining causes of death, if you think this is appropriate.

(Hint: Data > Recode). Provide a suitable graph with a brief written description of the

patterns in the graph.

(f) A colleague wants to consider predicting Survived at last examination from Serum

cholesterol (mg/100ml) examination 1 (column 8). She notes that some of the values are

missing. Your colleague says, “I don’t think we need to worry about that as

there will still be plenty of data to carry out an analysis”. Provide a response to this,

explaining any assumptions involved, and include a summary table to describe the

amount of missing data for Serum cholesterol (mg/100ml) examination 1.

(g) At the time the Framingham study began, diastolic blood pressure was believed to be a

superior measure of blood pressure compared with systolic blood pressure. High levels

of systolic blood pressure were not believed to be important in terms of health

outcomes. Examine the relationship between these two measures of blood pressure at

baseline visually. Provide a plot that represents this relationship.

Consider the summary tables providing the results of three logistic regression models

predicting Survived at last examination.

Model  Explanatory variable(s)      Odds ratio  95% confidence interval for odds ratio  P-value
1      Systolic blood pressure/10   1.34        1.31, 1.38                              < 0.001
2      Diastolic blood pressure/10  1.53        1.46, 1.60                              < 0.001
3      Systolic blood pressure/10   1.32        1.27, 1.38                              < 0.001
       Diastolic blood pressure/10  1.04        0.96, 1.12                              0.341

Based on these analyses and your examination of the explanatory variables, comment on the

belief about the “superior” blood pressure measure in predicting survival at the last

examination. Formal analyses are not required to answer this question.
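
When reading the table, it may help to recall how the scale of a predictor affects an odds ratio. The following is a minimal Python sketch, assuming the "/10" in the variable names means that blood pressure was divided by 10 before fitting, so that each odds ratio refers to a 10 mmHg difference. Because logistic regression is linear on the log-odds scale, an odds ratio can be converted to a different step size by rescaling its logarithm. The function name is hypothetical; the value 1.34 is the Model 1 estimate from the table.

```python
import math


def rescale_odds_ratio(odds_ratio, from_units, to_units):
    """Convert an odds ratio per `from_units` into one per `to_units`."""
    log_or_per_unit = math.log(odds_ratio) / from_units
    return math.exp(log_or_per_unit * to_units)


or_per_10 = 1.34  # Model 1: systolic blood pressure/10
or_per_1 = rescale_odds_ratio(or_per_10, from_units=10, to_units=1)
or_per_20 = rescale_odds_ratio(or_per_10, from_units=10, to_units=20)
print(round(or_per_1, 3), round(or_per_20, 2))
```

The conversion changes only the units in which the estimate is expressed; it does not change the fitted model or the P-value.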

Advice

Here is some advice to follow when preparing your assignment.

• The purpose of the assignment is to relate the statistical theory and practice learned

in Statistics for Research Workers to real world data. The essential feature is that you

must demonstrate understanding and application of statistical ideas covered in SRW

to real world practice.

• The presentation of results should be consistent with the principles for presenting

graphics and tables discussed in the course.

• In general, you are not required to provide Minitab output in the assignment, with

the exception of graphs.

• The word limit for the assignment is 1,500 words. From our point of view, this is an

upper limit for the assignment and you should aim to submit between 1,400 and

1,500 words. The word count does not include graphs and tables. University policy

allows for a 10% deduction of marks once a written assignment exceeds the specified

word limit by more than 10%. As the 1,500-word assignment is worth 20% of your final mark,

you could lose 2% from your final mark if your assignment was, for example, 1,670

words.

• Your answers should be on no more than twelve (12) A4 pages of standard sized

writing. This includes any graphs. Twelve pages is a generous limit for the

assignment; this document is on eight pages, with a lot of white space, and it

contains around 2,000 words.

• You do not need to reproduce the questions in your assignment.
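
The word-limit arithmetic in the advice above can be sketched as follows. This is a minimal Python illustration only, assuming a single flat 10% deduction that applies once the word count is more than 10% over the limit; the function name and parameter names are hypothetical, and the actual application of the policy is at the University's discretion.

```python
def final_mark_penalty(word_count, word_limit=1500,
                       assignment_weight_pct=20, deduction_pct=10):
    """Percentage points lost from the final mark under the assumed rule."""
    # Integer comparison avoids floating-point error in 1.1 * word_limit.
    over_threshold = word_count * 10 > word_limit * 11
    if not over_threshold:
        return 0.0
    # A 10% deduction on an assignment worth 20% of the final mark.
    return assignment_weight_pct * deduction_pct / 100


print(final_mark_penalty(1670))  # the 1,670-word example from the advice
print(final_mark_penalty(1500))  # at the limit, no deduction
```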

