ENEN90032 - Environmental Analysis Tools
Assignment 1
Dongryeol Ryu, Manish K. Patel, and Jie Jian
26 August 2020
Submit an electronic copy (in PDF format) via the Turnitin menu of the subject
LMS by 12pm (NOON) on Monday 14 September 2020. Make sure you meet the
Infrastructure Engineering submission requirements (include the coversheet with
the signatures of all team members). Include appropriate graphs and tables in your
solution. The report should contain no more than 2500 words, excluding figures.
Submit your Matlab code via your assignment group folder. Give your code files
self-explanatory names (e.g., Q1 1.m, Q3 1.m) and comment the code properly.
Compress your code and report into one archive named with your group number
(e.g., Assignment 01 Group 01.zip).
For the Hypothesis Test problems, you must explain the rationale behind the choice
of the null hypothesis, test statistic and alternative hypothesis, and behind your
conclusions. 30% of the marks for each section will be given based on the quality of
the report and of the figures and tables, whenever they are required. Figures should
be properly labeled and self-explanatory. For each question, please briefly describe
the contribution of each individual member. You are expected to work TOGETHER on
all questions; no assistance will be given for issues arising from splitting the
assignment questions between the group members.
1 Exploratory Data Analysis - Meteorological Datasets (20 marks)
Go to the Climate Data Online¹ of the Bureau of Meteorology and choose a weather
station in Perth, Western Australia (or surrounding area, with a six-digit station
number starting with 009XXX), one in Brisbane, Queensland (or surrounding area,
with 04XXXX) and one in Melbourne, Victoria (station number starting with
08XXXX). Download the daily rainfall data of the stations collected in a year between
2010 and 2019 inclusive (the selected year for the Brisbane and the Melbourne
stations should be identical). There should be fewer than 10 missing values in the
selected year. For the rainfall data analysis, we will be using wet-day daily rainfall
data, which excludes zero-rainfall events and values lower than the detection limit
(assume that the detection limit is 0.25 mm).
¹ http://www.bom.gov.au/climate/data/
1. Make a table that summarizes the location (sample mean, median and trimean),
spread (sample standard deviation, IQR and median absolute deviation) and
symmetry (sample skewness and Yule-Kendall index) of the datasets in the
cities. Can you infer skewness of the datasets by comparing the mean with
the median? Based on the shape of the distribution (refer to the figures pro-
duced in the next question), discuss the robustness of the summary statistics
calculated above.
2. For the wet-day daily rainfall data, fit i) a Gaussian, ii) a gamma, and iii) a
Weibull² distribution to the dataset and compare the fitted distribution
models with the data distribution. For the graphical representation of the
probability density of the data, use Gaussian kernel estimates, which produce
a smoothed curve for the probability density (you may use the Matlab
function ksdensity to generate kernel estimates). Also, compare the empirical
cumulative distribution with the fitted CDFs (Gaussian, gamma, and Weibull)
and make a Q-Q plot for evaluation. Judge which model fits your data best
based on the graphical examinations.
3. Suggest a probability density model that closely fits the wet-day daily rainfall
data (use all datasets from Perth, Brisbane and Melbourne) and compare the
performance of your choice with the results in the previous question. You can
use any existing models with 1-to-3 parameters. You may refer to published
research articles, e.g., Ye et al. (2018)3.
4. For the above fits to the rainfall data (including your suggested model),
calculate the log-likelihood values of the fits and quantitatively prove your
judgement above.
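The summary statistics requested in Question 1.1 can be sketched as follows. This is a minimal illustration in Python using only the standard library (the assignment itself requires Matlab); the function names, the sample values, and the choice of the moment-based skewness and of interpolated quartiles are assumptions made for the example, not part of the assignment.

```python
import statistics

DETECTION_LIMIT_MM = 0.25  # detection limit stated in the assignment


def wet_days(daily_mm):
    """Drop missing values, zero-rain days and values below the detection limit."""
    return [v for v in daily_mm if v is not None and v >= DETECTION_LIMIT_MM]


def summary(x):
    """Location, spread and symmetry measures for a list of wet-day totals."""
    n = len(x)
    # Quartiles by linear interpolation between order statistics (one of several conventions).
    q1, q2, q3 = statistics.quantiles(x, n=4, method="inclusive")
    mean = statistics.fmean(x)
    iqr = q3 - q1
    m2 = sum((v - mean) ** 2 for v in x) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in x) / n  # third central moment
    return {
        "mean": mean,
        "median": q2,
        "trimean": (q1 + 2 * q2 + q3) / 4,
        "std": statistics.stdev(x),
        "iqr": iqr,
        "mad": statistics.median(abs(v - q2) for v in x),
        "skewness": m3 / m2 ** 1.5,                 # moment-based, not robust
        "yule_kendall": (q1 - 2 * q2 + q3) / iqr,   # quartile-based, robust
    }


# Made-up daily totals (mm); None marks a missing observation.
daily = [0.0, 0.2, None, 0.4, 1.0, 1.2, 2.0, 3.4, 5.0, 8.2, 14.6, 40.0]
for name, value in summary(wet_days(daily)).items():
    print(f"{name:>12s}: {value:.3f}")
```

In this made-up sample the mean lies well above the median and both symmetry measures are positive, which is the pattern the question asks you to look for in the real data.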
² https://en.wikipedia.org/wiki/Weibull_distribution; you may use the Matlab functions
weibcdf, weibfit, weibpdf for the Weibull distribution (named wblcdf, wblfit and
wblpdf in recent Statistics and Machine Learning Toolbox releases).
³ https://doi.org/10.5194/hess-22-6519-2018
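For the log-likelihood comparison in Question 1.4, the idea can be sketched with two models whose maximum-likelihood estimates have closed forms. The lognormal model here is only a stand-in for "your suggested model"; the gamma and Weibull fits need an iterative solver and are left out. The sample values and function names are hypothetical, and Python is used purely for illustration.

```python
import math


def gaussian_loglik(x):
    """Log-likelihood of a Gaussian fitted by maximum likelihood (variance uses 1/n)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var) for v in x
    )


def lognormal_loglik(x):
    """Log-likelihood of a lognormal fit: Gaussian fit to log(x) plus the Jacobian term."""
    logs = [math.log(v) for v in x]
    return gaussian_loglik(logs) - sum(logs)


# Made-up wet-day totals (mm), all above the 0.25 mm detection limit.
wet = [0.4, 1.0, 1.2, 2.0, 3.4, 5.0, 8.2, 14.6, 40.0]
print("Gaussian  log-likelihood:", round(gaussian_loglik(wet), 2))
print("Lognormal log-likelihood:", round(lognormal_loglik(wet), 2))
```

A larger (less negative) log-likelihood indicates a better fit to the same data. When the candidate models differ in their number of parameters, the raw values favour the more flexible model, so state how you account for that in your comparison.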
2 Newcomb-Michelson Velocity of Light Experiments (10 marks)
Simon Newcomb of the Nautical Almanac Office (NAO), U.S., published the velocity
of light [Newcomb, 1883]⁴ based on a series of experiments he conducted with
Albert Michelson until 1882. The dataset ‘NewcombLight.txt’ contains 66 samples
(time in seconds taken for light to travel 7442 meters at sea level) that Newcomb
collected in 1882. Conduct one-sample tests based on the t-test and on the bootstrap,
and provide an estimate of the population mean of the velocity of light (in m/s) with
a confidence interval at a confidence level of your choice. Do the interval estimates
include the widely known speed of light given here⁵? Do the estimates from the
t-test and the bootstrap show any systematic difference? If so, provide possible
reasons based on the sampling distributions used by the two approaches.
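The bootstrap half of this question can be sketched as a percentile interval for the mean. The travel times below are synthetic (drawn from a Gaussian with made-up parameters), not the contents of ‘NewcombLight.txt’, and the helper name is hypothetical; Python is used only for illustration.

```python
import random
import statistics

DISTANCE_M = 7442.0  # path length stated in the assignment


def bootstrap_mean_ci(x, level=0.95, n_boot=5000, seed=0):
    """Percentile bootstrap confidence interval for the mean of x."""
    rng = random.Random(seed)
    n = len(x)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(x, k=n)) for _ in range(n_boot))
    alpha = 1.0 - level
    lower = means[int(alpha / 2 * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic travel times in seconds (assumed values, for illustration only).
rng = random.Random(1)
times_s = [rng.gauss(2.4828e-5, 5e-9) for _ in range(66)]
speeds = [DISTANCE_M / t for t in times_s]  # convert each sample to m/s

low, high = bootstrap_mean_ci(speeds)
print(f"mean speed: {statistics.fmean(speeds):.0f} m/s")
print(f"95% bootstrap interval: [{low:.0f}, {high:.0f}] m/s")
```

The percentile interval makes no assumption about the shape of the sampling distribution, whereas the t interval assumes it is symmetric; that contrast is the starting point for the comparison the question asks for.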
3 Space Shuttle O-Ring Failures (10 marks)
On 27 January 1986, the night before the space shuttle Challenger exploded,
engineers at the company that built the shuttle warned NASA scientists that the
shuttle should not be launched because of the predicted cold weather. Fuel seal
problems, which had been encountered in earlier flights, were suspected of being
associated with low temperatures. It was argued, however, that the evidence was
inconclusive. The decision was made to launch, even though the temperature at
launch time was 29 °F (≈ −1.67 °C).
The dataset ‘O Ring Data.XLS’ summarizes the number of O-ring incidents on
24 space shuttle flights prior to the Challenger disaster. Launch temperature was
below 65 °F for data labeled ‘COOL’ and above 65 °F for data labeled ‘WARM’.
Conduct a permutation test of whether the number of O-ring incidents was associated
with the temperature at the 99% confidence level, with your choice of a one-sided or
two-sided test. Use 10,000 permutations to draw your conclusion. Justify your
choice and show your null distribution as a histogram with the test statistic marked
on it. Make your final recommendation about the launch of the space shuttle on the
day of the accident, supported by quantitative evidence.
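A permutation test of a difference in group means can be sketched as below. The incident counts are placeholders invented for the example (they are not read from ‘O Ring Data.XLS’), the one-sided direction is an assumption you would need to justify, and the "+1" in the p-value is one common convention. Python is used only for illustration.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_perm=10_000, seed=0):
    """One-sided permutation test of mean(group_a) - mean(group_b) > 0.

    Returns the observed statistic, the null distribution and the p-value.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    null = []
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the flights at random
        null.append(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
    # Count the observed labelling as one of the permutations.
    p_value = (1 + sum(d >= observed for d in null)) / (n_perm + 1)
    return observed, null, p_value


# Placeholder incident counts per flight (hypothetical, 4 cool and 20 warm launches).
cool = [2, 1, 0, 3]
warm = [0] * 15 + [1, 0, 1, 0, 2]

obs, null_dist, p = permutation_test(cool, warm)
print(f"observed difference in means: {obs:.2f}")
print(f"one-sided p-value: {p:.4f} (reject at the 0.01 level if p < 0.01)")
```

The list `null_dist` is what the histogram in the question should show, with `obs` marked on it.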
⁴ http://vigo.ime.unicamp.br/~fismat/newcomb.pdf
⁵ https://en.wikipedia.org/wiki/Speed_of_light
4 Cloud Seeding Experiment (15 marks)
The dataset ‘Cloud Seeding Case Study.XLS’ contains data collected in southern
Florida between 1968 and 1972 to test the hypothesis that massive injection of
silver iodide into cumulus clouds can lead to increased rainfall (J. Simpson and
J. Eden, “A Bayesian Analysis of a Multiplicative Treatment Effect in Weather
Modification,” Technometrics 17 (1975)). An airplane flew on 52 days in total;
however, silver iodide was injected on only 26 randomly chosen days. To prevent
bias, the pilot was not aware of whether the seeding material was loaded on any
particular day. The rainfall was measured by radar as the total rain volume falling
from the cloud base following the airplane seeding.
1. Using a parametric method, test whether the cloud seeding had a
significant impact on rainfall at both the 95% and 99% confidence levels.
Choose between one-sided and two-sided tests and justify your choice.
2. Repeat the above test, now using a permutation test. Use 10,000
permutations to draw your conclusion and show your resampled data in a
histogram. Compare your results with those from the parametric test above
and explain the differences identified based on the pros and cons of the two
methods.
3. You may have noticed that the rainfall data in the cloud seeding experiment
are highly skewed. Transform the rainfall values using a logarithm
function and repeat the parametric test under the same conditions used for
Question 4.1 above. Does the transformation change your conclusion? If so,
discuss the difference and the implications of the results for the need
for data transformation when the data are highly asymmetric.
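The effect of the log transformation in Question 4.3 can be explored with a two-sample t statistic computed on the raw and on the transformed values. The sketch below uses Welch's unequal-variance form, which is one possible choice of parametric test rather than the prescribed one, and synthetic lognormal rain volumes with made-up parameters. Converting the statistic to a p-value needs the t distribution (for example Matlab's tcdf), which is outside the Python standard library used here.

```python
import math
import random
import statistics


def welch_t(a, b):
    """Welch two-sample t statistic and its approximate degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)  # squared standard error of mean(b)
    t = (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Synthetic, highly skewed rain volumes for 26 seeded and 26 unseeded days
# (hypothetical values, not the contents of the case-study file).
rng = random.Random(42)
seeded = [rng.lognormvariate(5.0, 1.5) for _ in range(26)]
unseeded = [rng.lognormvariate(4.0, 1.5) for _ in range(26)]

t_raw, df_raw = welch_t(seeded, unseeded)
t_log, df_log = welch_t([math.log(v) for v in seeded],
                        [math.log(v) for v in unseeded])
print(f"raw data: t = {t_raw:.2f}, df = {df_raw:.1f}")
print(f"log data: t = {t_log:.2f}, df = {df_log:.1f}")
```

On the log scale a difference in means corresponds to a multiplicative effect on the original scale, which is worth keeping in mind when interpreting the transformed test.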
5 Atmospheric CO2 Concentration during Global Forced Confinement by COVID-19 (12 marks)
Global forced lockdowns caused by the fast-spreading COVID-19⁶ since late January
2020 reportedly reduced global CO2 emissions. A recent report⁷ estimates the
reduction in CO2 emissions to be as high as 17%. In this section, we examine whether
the atmospheric CO2 concentration during the peak forced confinement period, April
and May 2020, was lower than the level it would have reached without COVID-19.
⁶ https://ourworldindata.org/grapher/covid-stringency-index?year=2020-08-24
⁷ https://www.nature.com/articles/s41558-020-0797-x
To examine the atmospheric CO2 in April and May 2020, we use the monthly
CO2 data⁸ maintained by the Scripps Institution of Oceanography at the University
of California, San Diego in the US. The Mauna Loa CO2 monitoring station in
Hawaii provides the longest continuous record of atmospheric CO2 concentration,
starting in 1958, and is ideally located to measure globally representative CO2 values.
Download the monthly CO2 data here⁹ and use the unadjusted values in the 5th
column.
1. The monthly time series of atmospheric CO2 shows a steady increase from the
beginning of monitoring, with strong seasonal fluctuations. The anomaly of
the concentration in 2020 should therefore be examined after removing
the background trend and the cyclic fluctuations. The seasonal fluctuation can
be removed by sampling only April values to test concentrations in April
and only May values to test concentrations in May. Construct separate
batches of CO2 values in April and May for 1958-2020. To remove the
long-term trend from the time series, fit a quadratic function (2nd-order
polynomial) to the CO2 concentration values in 1958-2019 for April and
May separately. Once the trend is removed, your samples can be viewed as
residuals (deviations) from the expected long-term trend of the atmospheric
CO2. Conduct a residual analysis of the detrended April and May samples
and provide your assessment of the residuals following the steps suggested
in the Linear Regression section.
2. Now conduct a hypothesis test using confidence interval(s) and Student’s
t distribution. Provide the null and alternative hypotheses for the test.
If you choose the t statistic as the test statistic and a 95% confidence level
to decide between acceptance and rejection, what are the critical values for the
null distributions in April and May? What is the result of your test?
3. What is your assessment of the atmospheric CO2 concentration in April and
May of 2020 based on the test statistics in the previous question? Are they
within the range of your intuitive expectations? There exist a large number
of academic and general articles about the expectations and interpretations
around the observed atmospheric CO2 published online this year. Provide
your assessment and interpretation of the test statistic supported by your
choice of relevant articles.
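The detrending step in Question 5.1 and a standardized 2020 anomaly can be sketched as follows. This is an illustration in standard-library Python (Matlab's polyfit and polyval do the same job); the CO2 values are synthetic, the helper names are hypothetical, and dividing the anomaly by the residual standard deviation is a simplification that ignores the extra uncertainty of extrapolating the fitted trend to 2020.

```python
import math
import random


def solve_linear(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_quadratic(t, y):
    """Least-squares coefficients (c0, c1, c2) of y = c0 + c1*t + c2*t**2."""
    s = [sum(v ** k for v in t) for k in range(5)]
    normal = [[s[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(yi * ti ** i for ti, yi in zip(t, y)) for i in range(3)]
    return solve_linear(normal, rhs)


def detrended_anomaly(years, values, target_year, target_value, ref_year=1989):
    """Residuals of the quadratic fit and the standardized anomaly of one extra year."""
    t = [yr - ref_year for yr in years]  # centre the years for numerical stability
    c0, c1, c2 = fit_quadratic(t, values)

    def trend(yr):
        return c0 + c1 * (yr - ref_year) + c2 * (yr - ref_year) ** 2

    resid = [v - trend(yr) for yr, v in zip(years, values)]
    s = math.sqrt(sum(r * r for r in resid) / (len(resid) - 3))  # 3 fitted parameters
    anomaly = target_value - trend(target_year)
    return resid, anomaly, anomaly / s


# Synthetic April values for 1958-2019 (ppm) and a made-up 2020 observation.
rng = random.Random(0)
years = list(range(1958, 2020))
april = [315 + 0.8 * (y - 1958) + 0.012 * (y - 1958) ** 2 + rng.gauss(0, 0.4)
         for y in years]
resid, anomaly, t_stat = detrended_anomaly(years, april, 2020, 411.0)
print(f"2020 anomaly: {anomaly:.2f} ppm, standardized: {t_stat:.2f}")
```

The list `resid` is the input to the residual analysis (normality, constant variance, serial correlation), and `t_stat` is the quantity to compare with the critical value of the t distribution in Question 5.2.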
⁸ https://scrippsco2.ucsd.edu/data/atmospheric_co2/primary_mlo_co2_record.html
⁹ https://scrippsco2.ucsd.edu/assets/data/atmospheric/stations/in_situ_co2/monthly/monthly_in_situ_co2_mlo.csv
6 Was July 2019 the Hottest Month in Recorded History in the Northern Hemisphere? (13 marks)
There was a claim that July 2019 may have been the hottest month in recorded history
in the northern hemisphere¹⁰. In this question, you will test whether the claim is
correct using daily air temperature measured in or near a selection of major cities
in the northern hemisphere. Go to the Global Historical Climatology Network
(GHCN)¹¹ site of the National Oceanic and Atmospheric Administration (NOAA) of
the US and download the daily temperature data for July for two cities chosen from
Paris (France), Moscow (Russia), Berlin (Germany), Beijing (China), Tokyo (Japan),
and New York (USA). Your chosen weather stations should have records going back
to the year 1951 or earlier. Exclude any month that contains less than 50% of the
daily measurements.
1. Suggest a test statistic that can be used to examine the extremity of monthly
temperature using input from daily temperature values downloaded. You will
choose one or more input variables from daily maximum, daily minimum, and
daily average temperature values. Justify your choice of the test statistic
based on existing works in science/engineering publications (e.g., articles,
reports, online material). Examine if your sample statistic shows that July
2019 was the hottest month in your chosen locations.
2. We want to check how extreme the July temperatures in the most recent 5 years
have been within the recorded period of temperature in the cities you chose. Use
the test statistic chosen in the previous question, averaged over July in 2015-
2019, and quantitatively assess the extremity of the July temperature of 2015-
2019 against all July temperatures (defined by your monthly test statistic)
since the beginning of the record. You can use either a parametric or a non-
parametric method for the quantitative assessment. Justify your choice of
the method based on your analysis of the data distribution.
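One possible shape for the monthly statistic and a simple non-parametric assessment is sketched below. Using the mean of the daily maxima as the statistic, the empirical fraction of years below the 2015-2019 average as the measure of extremity, and every temperature value shown are assumptions made for the example; Python is used only for illustration.

```python
import statistics


def monthly_statistic(daily_values, min_fraction=0.5):
    """Mean of the available daily values, or None if under min_fraction are present."""
    present = [v for v in daily_values if v is not None]
    if len(present) < min_fraction * len(daily_values):
        return None  # month excluded by the completeness rule
    return statistics.fmean(present)


def fraction_below(history, value):
    """Empirical fraction of historical statistics that lie below the given value."""
    return sum(h < value for h in history) / len(history)


# Hypothetical July daily maxima (deg C) per year; None marks a missing day.
july_tmax = {
    1951: [24.0] * 31,
    1952: [25.0] * 20 + [None] * 11,
    1953: [23.0] * 10 + [None] * 21,   # too sparse, will be excluded
    2015: [26.5] * 31,
    2016: [26.0] * 31,
    2017: [25.5] * 31,
    2018: [27.5] * 31,
    2019: [28.0] * 31,
}

stats = {yr: monthly_statistic(v) for yr, v in july_tmax.items()}
stats = {yr: s for yr, s in stats.items() if s is not None}
recent = statistics.fmean(stats[yr] for yr in range(2015, 2020))
print("hottest July in this toy record:", max(stats, key=stats.get))
print("fraction of Julys below the 2015-2019 mean:",
      round(fraction_below(list(stats.values()), recent), 2))
```

With a real record, the same ranking idea can be compared against a parametric alternative once the distribution of the monthly statistic has been examined.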
¹⁰ https://doi.org/10.1029/2019EO130843
¹¹ https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn
7 Exploratory Data Analysis and Linear Regression (20 marks)
Relationships between nutrient/sediment concentrations and stream discharge have
been widely used as a clue to explore the hydro-chemical processes that control runoff
chemistry. Here we examine the relationship between nutrient concentration and
stream discharge using linear regression. This question investigates the correlation
between instantaneous streamflow and the Total Kjeldahl Nitrogen concentration (TKN)
collected from the site “222101, Curdies River at Curdie” located in the Otway Coast of
Victoria. The dataset ‘Q TKN data.csv’ contains three columns: the monitoring date
(column 1), catchment-averaged streamflow (mm/d, column 2) and TKN (mg/L,
column 3).
1. Calculate the Pearson correlation coefficient and Spearman’s rank correlation
coefficient for the paired Q vs. TKN data. Then, calculate the same
correlation coefficients for the paired natural log(Q) and natural log(TKN),
i.e. logarithms with base e. What do these values tell you about the
relationship between TKN and Q? Suggest which paired data (raw data pair
vs. log(data) pair) are more suitable for constructing a linear regression, and
justify your selection. In the subsequent questions, Q vs. TKN means their
relationship based on your chosen pair.
2. Based on your selection of the paired data, plot Q vs. TKN concentrations
and fit a simple linear regression: i) report the regression parameters and
the goodness-of-fit of the regression; ii) using the linear regression developed,
predict the TKN concentration expected when discharge reaches 2 mm/d.
3. Calculate the 95% confidence intervals for i) the conditional mean and ii) the
prediction. Construct a figure showing the linear model and the confidence
intervals with the observed data values, and discuss the difference between
the confidence intervals for the conditional mean and for the prediction (e.g.,
how should the intervals be interpreted and used?).
4. What is the pattern of the residuals of your developed Q∼TKN model?
Provide your assessment of the residuals. Assessment of autocorrelation
(serial correlation) in residuals is needed.
5. Based on the plot you created in Question 7.3, check what fraction
of the observed data values actually falls within the 95% prediction
confidence interval (e.g., you can create a for loop in Matlab to check whether
each individual observation falls within the 95% CI). According to the results and
the pattern/distribution of the residuals, do you recommend the application of this
linear model for predicting further TKN concentrations? Did you find any
specific range of Q where your model struggles to predict TKN (for this last
question, compare the predicted TKN with the observed TKN values in the
original (raw data) space in case you built your model in the log-transformed
space)?
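The correlation, regression and serial-correlation steps in this section can be sketched as follows. The paired values are invented for the example (they are not taken from ‘Q TKN data.csv’), the helper names are hypothetical, and the log-log model is only one of the two candidate pairs. Python's standard library is used for illustration; Matlab's corr, polyfit and fitlm cover the same ground.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(v):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ols(x, y):
    """Intercept and slope of the least-squares line y = a + b*x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def lag1_autocorrelation(residuals):
    """Correlation between consecutive residuals (ordered by monitoring date)."""
    return pearson(residuals[:-1], residuals[1:])


# Invented flow (mm/d) and TKN (mg/L) pairs in date order.
q = [0.05, 0.10, 0.30, 0.80, 1.50, 3.00, 0.60, 0.20]
tkn = [0.60, 0.70, 0.90, 1.40, 1.60, 2.30, 1.10, 0.85]
log_q, log_tkn = [math.log(v) for v in q], [math.log(v) for v in tkn]

print("raw: Pearson", round(pearson(q, tkn), 3), "Spearman", round(spearman(q, tkn), 3))
print("log: Pearson", round(pearson(log_q, log_tkn), 3))
a, b = ols(log_q, log_tkn)
res = [y - (a + b * x) for x, y in zip(log_q, log_tkn)]
print("predicted TKN at Q = 2 mm/d:", round(math.exp(a + b * math.log(2.0)), 3))
print("lag-1 autocorrelation of residuals:", round(lag1_autocorrelation(res), 3))
```

Spearman's coefficient is unchanged by the log transformation because ranks are preserved, so any change between the raw and log results shows up only in Pearson's coefficient.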