Similar to how the sample mean, X̄, is an unbiased estimator of the population mean, µ, with σX̄ = σ/√n, the sample proportion, p, is an unbiased estimator of the population proportion, π: µp = π. The standard error of the proportion, σp, is given by σp = √(π(1−π)/n).

When you replace X̄, µ and σX̄ in the formula for Z in the sampling distribution of the mean with p, π and σp, you get the Z-value for the sampling distribution of the proportion:

Z = (p − π)/σp = (p − π)/√(π(1−π)/n)

You can use the normal distribution to approximate the sampling distribution of the proportion when nπ and n(1 − π) are each at least 5.

Example. If the true proportion of voters who support a GST increase is π = 0.4, what is the probability that a sample of size 200 yields a sample proportion between 0.40 and 0.45?

σp = √(π(1−π)/n) = √(0.40(1−0.40)/200) = 0.03464

Converting to the standardised normal,

P(0.40 ≤ p ≤ 0.45) = P((0.40−0.40)/0.03464 ≤ Z ≤ (0.45−0.40)/0.03464) = P(0 ≤ Z ≤ 1.44) = 0.4251

Sampling from finite populations

The finite population correction (fpc) factor is used to adjust the standard error of both the sample mean and the sample proportion. It needs to be applied:
● When n > 5% of N, and
● When sampling occurs without replacement.
The fpc is always < 1, so it always reduces the standard error, resulting in more precise estimates of population parameters.

Downloaded by zhu meng ([email protected]) lOMoARcPSD|5184647

Week 8 - Confidence Intervals

Confidence interval for the population mean where σ is known

In an example from last week, we found that for a population with µ = 368 and σ = 15, the interval 368 ± (1.96)(15)/√25 = 362.12 to 373.88 contains 95% of the possible sample means when n = 25.

Suppose that now, you don't know µ. You have taken a sample with mean X̄ = 362.3.
You develop a 95% confidence interval for this sample to estimate µ: 362.3 ± (1.96)(15)/√25, or 356.42 ≤ µ ≤ 368.18.
● This means: "If I repeatedly sample, each time generating a 95% confidence interval, then 95% of the generated intervals would contain the true µ."
● This DOES NOT MEAN: "There is a 95% probability that µ lies within 356.42 and 368.18."
○ Because you generate a new confidence interval for each sample, you cannot make absolute probability statements about any one given interval.

We prefer confidence intervals over X̄ or p alone because they account for variation from sample to sample, based on observations from only one sample.

The general formula for a confidence interval is:

Point Estimate ± (Critical Value)(Standard Error)

● Point Estimate is X̄ or p, estimating µ or π.
● Standard Error is σX̄ or σp.
● Critical Value is a value based on our desired confidence level.

Our confidence level, (1 − α) × 100%, is the probability that the interval will contain the unknown population parameter.
● If confidence level = 95%, 1 − α = 0.95 (therefore α = 0.05).
● The region outside the interval has cumulative probability α.
○ For a two-tailed test, α is split between α/2 in the upper tail and α/2 in the lower tail.

The upper and lower limits of the confidence interval for µ, if σ is known, are X̄ ± Zα/2 σ/√n.
● We assume that the population is normally distributed, or that the Central Limit Theorem applies.
● Zα/2 is the value corresponding to an upper-tail probability of α/2 from the standardised normal distribution (i.e. a cumulative area of 1 − α/2).
● To calculate the Z-value in Excel, use =NORM.S.INV([α/2]). (This returns the lower-tail value, −Zα/2; use =NORM.S.INV(1−α/2) for the positive upper-tail value.)

Confidence interval for the population mean where σ is not known

In most business situations, σ is not known exactly. (If you in fact knew σ, you would know µ, as per above.) If the population standard deviation σ is unknown, we substitute the sample standard deviation S.
While Z = (X̄ − µ)/(σ/√n) has a normal distribution with mean 0 and variance 1,

t = (X̄ − µ)/(S/√n)

has a specific distribution - a (Student's) t distribution - with n − 1 degrees of freedom.
● To find the critical value of t for the appropriate degrees of freedom, you use a table of the t distribution.
○ The table contains t values (number of standard errors away from the mean), NOT probabilities.
● As n increases, S → σ, and so t values → Z values.
○ As n increases, the t distribution approaches the normal distribution.
● Degrees of freedom (d.f.) relates to how many of the sample values are free to vary.

The upper and lower limits of the confidence interval for µ, if σ is unknown, are X̄ ± tα/2 S/√n.
● tα/2 is the value corresponding to an upper-tail probability of α/2 from the t distribution (i.e. a cumulative area of 1 − α/2).
● To calculate tα/2 in Excel, use =TINV([α], [d.f.]). (=TINV is two-tailed, hence α instead of α/2.)

Example. A random sample of n = 25 has X̄ = 50 and S = 8. Form a 95% confidence interval for µ.
n = 25, so d.f. = 25 − 1 = 24
α = 1 − 0.95 = 0.05, so α/2 = 0.025
t0.025 with d.f. 24 = 2.0639
The 95% confidence interval for this sample is 50 ± (2.0639)(8)/√25, i.e. 46.698 ≤ µ ≤ 53.302.

Confidence interval for the population proportion

Recall that the distribution of the sample proportion is approximately normal if nπ and n(1 − π) are each ≥ 5, with σp = √(π(1−π)/n). We estimate π with p.

The upper and lower limits of the confidence interval for π are p ± Zα/2 √(p(1−p)/n), if np and n(1 − p) are each ≥ 5.

Determining sample size

The sampling error / margin of error, e, is the amount added to or subtracted from the sample statistic to get a confidence interval. Work out the appropriate sample size by solving for n (always round up).
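The rearrangement e = Zα/2 σ/√n → n = Zα/2² σ²/e² is easy to check numerically. A minimal Python sketch (the confidence level, σ and margin-of-error targets below are illustrative, not from the notes):

```python
import math

def n_for_mean(z, sigma, e):
    """Sample size for estimating a mean: solve e = z*sigma/sqrt(n)
    for n, i.e. n = z^2 * sigma^2 / e^2, always rounded up."""
    return math.ceil((z * sigma / e) ** 2)

def n_for_proportion(z, e, pi=0.5):
    """Sample size for estimating a proportion: n = z^2 * pi*(1-pi) / e^2.
    With pi unknown, pi = 0.5 maximises pi*(1-pi), giving the safest (largest) n."""
    return math.ceil(z ** 2 * pi * (1 - pi) / e ** 2)

# Illustrative targets at 95% confidence (Z = 1.96)
print(n_for_mean(1.96, sigma=15, e=5))    # 35
print(n_for_proportion(1.96, e=0.03))     # 1068
```

Note that rounding up (math.ceil) is what the "always round up" rule means: rounding down would give a margin of error slightly larger than e.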
X̄ ± Zα/2 σ/√n → e = Zα/2 σ/√n → n = Zα/2² σ² / e²

X̄ ± Zα/2 S/√n → e = Zα/2 S/√n → n = Zα/2² S² / e²

p ± Zα/2 √(π(1−π)/n) → e = Zα/2 √(π(1−π)/n) → n = Zα/2² π(1−π) / e²

Note: π can be estimated by p, or by setting π = 0.5, because π = 0.5 yields the largest value for n.

Week 9 - Hypothesis Testing (One Sample)

A hypothesis is a claim, often about a population parameter. Examples of hypotheses are:
'The mean monthly phone bill for all households is µ = $42.'
'Telstra's market share proportion in mobile phone customers, π, is greater than 0.5.'
How can we assess whether a hypothesis is true or not?

The null hypothesis (H0) states a status quo assertion about a population parameter. E.g.: 'The average purchase made on Amazon Prime is worth $50 (H0: µ = 50).' Tests usually begin by assuming H0 is true. H0 can't be proven in the test, but it can be rejected.

The alternative hypothesis (H1) opposes the status quo H0 in some way. E.g.: 'The average purchase made on Amazon Prime is not worth $50 (H1: µ ≠ 50).' H1 is normally the hypothesis that the researcher is trying to find evidence for (or against). It is usually formed first.

The hypothesis testing process (using the above example of Amazon Prime)
1. Test H1 (µ ≠ 50). First, sample the population and find the sample mean.
2. Assess the likelihood of H1.
a. If we assume H0 (that µ = 50), and we return a sample mean close to the stated µ, then we conclude that H0 is not rejected.
b.
If we assume H0 and we return a sample mean far from the stated µ, then we conclude that H0 is rejected (i.e. our assumption that µ = 50 is wrong).

Possible errors in hypothesis test decision-making

ACTUAL SITUATION →      H0 True                                          H0 False
Do Not Reject H0        Correct decision (P = 1 − α)                     Type II Error (P = β = P(do not reject H0 | H0 false))
Reject H0               Type I Error (P = α = P(reject H0 | H0 true))    Correct decision (P = 1 − β)

● The confidence coefficient, 1 − α, is the probability of not rejecting H0 when it is true.
○ "Our confidence in the test to accept the true."
● The power of a statistical test, 1 − β, is the probability of rejecting H0 when it is false.
○ "The power of the test to identify the false."

Importantly, we need to set critical values that tell us how 'close' is close enough to uphold H0, or how 'far' is far enough to reject H0.
● In two-tailed tests (i.e. where H1 specifies '≠'), the critical values cut off an area of α/2 in each tail.

Two-tailed hypothesis test for the mean where σ is known: Z-value approach
1. State H0 and H1.
2. Choose α and n.
3. Determine the critical Z values for the specified α/2.
4. Collect data. Convert the sample statistic, X̄, to a test statistic: ZSTAT = (X̄ − µ)/(σ/√n).
5. Decision rule: if ZSTAT falls within the rejection region, reject H0. Otherwise, do not reject H0.

Example. Test the claim that the true mean diameter of a bolt is 30mm, given σ = 0.8.
H0: µ = 30
H1: µ ≠ 30
Set α = 0.05 and n = 100.
Since α/2 = 0.025, the critical Z values are ±1.96.
Suppose that X̄ = 29.84.

ZSTAT = (X̄ − µ)/(σ/√n) = (29.84 − 30)/(0.8/√100) = −0.16/0.08 = −2

Since ZSTAT < −1.96, ZSTAT is within the rejection region, so we reject the null hypothesis and conclude there is sufficient evidence, at the 5% level of significance, that the mean diameter of the manufactured bolts is not equal to 30mm.
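The critical-value decision rule above takes only a few lines of code. A minimal Python sketch of the bolt example (the helper name is mine, not from the notes):

```python
import math

def z_test_two_tailed(xbar, mu0, sigma, n, z_crit=1.96):
    """Two-tailed Z-test for the mean with sigma known.
    Returns (Z_STAT, reject H0?) at the level implied by z_crit
    (1.96 corresponds to alpha = 0.05)."""
    z_stat = (xbar - mu0) / (sigma / math.sqrt(n))
    return z_stat, abs(z_stat) > z_crit

# Bolt example: H0: mu = 30, sigma = 0.8, n = 100, observed X-bar = 29.84
z, reject = z_test_two_tailed(29.84, 30, 0.8, 100)
print(round(z, 2), reject)   # -2.0 True
```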
Two-tailed hypothesis test for the mean where σ is known: p-value approach

The p-value / observed level of significance is the probability of obtaining a value for the test statistic equal to or more extreme than the observed sample value, given H0 is true.
● 'More extreme' is defined by H1.
1. State H0 and H1.
2. Choose α and n.
3. Collect data and compute the test statistic. Determine the associated p-value.
4. Decision rule: if p < α, reject H0. Otherwise, do not reject H0.
"If the p-value is too low, then H0 must go!"

Example. Test the claim that the true mean diameter of a bolt is 30mm, given σ = 0.8.
H0: µ = 30
H1: µ ≠ 30
Set α = 0.05 and n = 100. Suppose that X̄ = 29.84.

ZSTAT = (X̄ − µ)/(σ/√n) = (29.84 − 30)/(0.8/√100) = −0.16/0.08 = −2

To calculate the p-value, ask: how likely is it to get a ZSTAT of −2 (or further from the mean (0), in either direction) if H0 is true? Sum up both the upper- and lower-tail probabilities.
Since the p-value = 0.0456 is less than α = 0.05, we reject the null hypothesis and conclude there is sufficient evidence, at the 5% level of significance, that the mean diameter of the manufactured bolts is not equal to 30mm.

Two-tailed hypothesis test for the mean where σ is unknown: t-value approach
1. State H0 and H1.
2. Choose α and n.
3. Determine the critical t values for the specified α/2.
4. Collect data. Convert the sample statistic, X̄, to a test statistic: tSTAT = (X̄ − µ)/(S/√n).
5. Decision rule: if tSTAT falls within the rejection region, reject H0. Otherwise, do not reject H0.

Example. The average cost of a hotel room in New York is said to be $168 per night. To determine if this is accurate, a random sample of 25 hotels is taken, resulting in X̄ = $172.50 and S = $15.40. Test the appropriate hypotheses at α = 0.05.
H0: µ = 168
H1: µ ≠ 168
α/2 = 0.025, n = 25, d.f.
= 24

tSTAT = (X̄ − µ)/(S/√n) = (172.5 − 168)/(15.4/√25) = 1.46

Critical values: t24, 0.025 = ±2.0639
Since tSTAT does not fall within the rejection region, we do not reject the null hypothesis and conclude there is insufficient evidence, at the 5% level of significance, that the mean cost of NY hotel rooms is not $168.

Two-tailed hypothesis test for the mean where σ is unknown: p-value approach
1. State H0 and H1.
2. Choose α and n.
3. Collect data and compute the test statistic. Determine the associated p-value.
4. Decision rule: if p < α, reject H0. Otherwise, do not reject H0.

Example. The average cost of a hotel room in New York is said to be $168 per night. To determine if this is accurate, a random sample of 25 hotels is taken, resulting in X̄ = $172.50 and S = $15.40. Test the appropriate hypotheses at α = 0.05.
H0: µ = 168
H1: µ ≠ 168
α = 0.05, n = 25, d.f. = 24

tSTAT = (X̄ − µ)/(S/√n) = (172.5 − 168)/(15.4/√25) = 1.46

To calculate the p-value, ask: how likely is it to get a tSTAT of 1.46 (or further from the mean (0), in either direction) if H0 is true? Sum up both the upper- and lower-tail probabilities.
Since the p-value = 0.157 is more than α = 0.05, we do not reject the null hypothesis and conclude there is insufficient evidence, at the 5% level of significance, that the mean cost of NY hotel rooms is not $168.

Connection between confidence intervals and two-tail tests: the length of the non-rejection region is also the length of the confidence interval, for the same α.

One-tail tests - in many cases, H1 focuses on a particular direction.
E.g. H1: µ > 50 - this is an upper-tail test (focused on the upper tail above the value of 50). The rejection region is only in the upper tail.
E.g. H1: π < 0.1 - this is a lower-tail test (focused on the lower tail below the value of 0.1). The rejection region is only in the lower tail.

1. Find the critical value for t.
2.
Reject H0 if the test statistic is in the rejection region.
OR
1. Compute the test statistic tSTAT.
2. Knowing tSTAT and d.f., calculate the p-value.
3. Reject H0 if p-value < α.

In Excel:
=T.DIST([tSTAT],[d.f.],TRUE): returns the left-tailed Student's t probability (the p-value for a lower-tail test)
=T.DIST.2T([tSTAT],[d.f.]): returns the two-tailed Student's t probability (the p-value for a two-tail test; tSTAT must be positive)
=T.DIST.RT([tSTAT],[d.f.]): returns the right-tailed Student's t probability (the p-value for an upper-tail test)

Remember the definition of the p-value: the probability of observing a test statistic equal to or more extreme than that observed, in the direction of the alternative hypothesis, if the null hypothesis is true.

Also note that, with one-tailed tests, any critical Z or t value will be based on α rather than α/2. Why? Because the rejection region, which by definition has area α, is all in one tail. (If in doubt, look at the tables and their sample diagrams.)

Hypothesis tests for the proportion

Z = (p − π)/√(π(1−π)/n)

Example. A marketing company claims that it receives responses from 8% of those surveyed. To test this claim, a random sample of 500 were surveyed, with 25 responses. Test at the α = 0.05 significance level.
H0: π = 0.08
H1: π ≠ 0.08
Set α = 0.05 and n = 500.
p = 25/500 = 0.05

ZSTAT = (0.05 − 0.08)/√(0.08(1−0.08)/500) = −2.47

Since α/2 = 0.025, the critical Z values are ±1.96.
Since ZSTAT < −1.96, ZSTAT is within the rejection region, so we reject the null hypothesis and conclude there is sufficient evidence, at the 5% level of significance, that the company does not maintain an 8% response rate.

p-value test:
α = 0.05
ZSTAT, as calculated above, is −2.47.
The corresponding p-value is 2(0.0068) = 0.0136.
Reject H0, since p-value < α.

One-tailed hypothesis tests for the proportion

Remember, the only difference is that we work with α instead of α/2, because the rejection region (of area α) is all in one tail.

Example.
Coca Cola wants at least 10% of customers to purchase a new cola. To test this claim, a random sample of 100 were surveyed, with 8 positive responses. Test at the α = 0.05 significance level.
H0: π ≥ 0.10 (this is a lower-tail test)
H1: π < 0.10
Set α = 0.05 and n = 100.
p = 8/100 = 0.08

ZSTAT = (0.08 − 0.10)/√(0.10(1−0.10)/100) = −0.67

Since α = 0.05, the critical Z value is −1.645.
Since ZSTAT ≥ −1.645, ZSTAT is not within the rejection region, so we do not reject the null hypothesis and rule that Coca-Cola's claim is plausible.

p-value approach:
α = 0.05
ZSTAT, as calculated above, is −0.67.
The corresponding p-value is P(Z ≤ −0.67) = 0.5 − 0.2486 = 0.2514.
Do not reject H0, since p-value > α.

To calculate the p-value in Excel, use =NORM.S.DIST([ZSTAT], TRUE) to find the probability from Z = −∞ to Z = ZSTAT.

Ethical considerations
● Use randomly collected data (probability samples) to reduce selection bias and non-sampling error AND to allow sampling distribution theory to be used!
● Choose the level of significance, α, and the type of test (one-tail or two-tail) before data collection.
● Do not employ "data snooping" to choose between one-tail and two-tail tests, or to determine α.
● Do not practice "data cleansing" to hide observations that do not support a stated hypothesis.
● Report all pertinent findings, including both statistical significance and practical importance.

Week 10 - Two Sample Tests

(Key question: how do we compare two data sets?)

Businesses compare data sets all the time.
● E.g. eBay wants to test whether pop-up ads or sidebar ads generate more sales.
● E.g. Google wants to test whether showing 10 or 25 search results attracts more traffic.
● E.g. Woolworths wants to compare the revenue generated from 'special' v non-special products.

Our goal is to test hypotheses or form a confidence interval for the difference between two population means / proportions.
● The point estimate for the difference of two population means is X̄1 − X̄2.
● The point estimate for the difference of two population proportions is p1 − p2.

Two-sample test comparing population means using independent samples

(Key question: what is the relationship between µ1 and µ2?)
● Samples from two populations are independent if a sample selected from either population has no effect on the sample from the other population.
● The CLT says: linear combinations of independent random variables are approximately normally distributed as the sample size gets larger. Thus the CLT applies when the two groups are independent and randomly sampled, as long as each group is EITHER normally distributed OR has a large enough sample size.

(do not memorise) X̄1 − X̄2 = (1/n1) Σi=1..n1 Xi1 − (1/n2) Σi=1..n2 Xi2

The pooled variance t-Test is used if σ1 and σ2 are unknown and assumed equal. (Q: why might we assume σ1 and σ2 are equal?)
● We assume that samples are randomly and independently drawn.
● We assume that the populations are normally distributed, OR both samples are symmetric, OR both sample sizes are at least 30, for the CLT.

The pooled variance estimate is:

Sp² = [Σi=1..n1 (Xi1 − X̄1)² + Σi=1..n2 (Xi2 − X̄2)²] / (n1 − 1 + n2 − 1) = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)

● n1, n2 are the sample sizes of sample 1 and sample 2 respectively.
● X̄1, X̄2 are the sample means of sample 1 and sample 2 respectively.
● S1², S2² are the sample variances of sample 1 and sample 2 respectively.

The tSTAT, for µ1 − µ2 with σ1 and σ2 unknown and assumed equal, is:

tSTAT = [(X̄1 − X̄2) − (µ1 − µ2)] / √(Sp²(1/n1 + 1/n2))

(If you want to try to derive this formula, don't.)
● tSTAT has (n1 + n2 − 2) degrees of freedom.
● Generally, H0: µ1 = µ2, thus µ1 − µ2 = 0.

The confidence interval, for µ1 − µ2 with σ1 and σ2 unknown and assumed equal, is:

(X̄1 − X̄2) ± tα/2 √(Sp²(1/n1 + 1/n2))

where tα/2 has (n1 + n2 − 2) degrees of freedom.
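The pooled-variance formulas translate directly from summary statistics. A minimal Python sketch (the sample summaries used below are made up for illustration, not from the notes):

```python
import math

def pooled_t(x1bar, x2bar, s1, s2, n1, n2):
    """Pooled-variance t statistic for H0: mu1 - mu2 = 0, with sigma1 and
    sigma2 unknown but assumed equal. s1, s2 are sample standard deviations.
    Returns (t_STAT, degrees of freedom)."""
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    t = (x1bar - x2bar) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical summaries for two independent samples of sizes 21 and 25
t, df = pooled_t(3.27, 2.53, 1.30, 1.16, 21, 25)
print(round(t, 2), df)   # 2.04 44
```

Compare t against the critical value tα/2 with the returned d.f. to make the decision.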
Excel can do all the work for you: Data → Data Analysis → t-Test: Two-Sample Assuming Equal Variances → input data and raw α → done.

The separate variance t-Test is used if σ1 and σ2 are unknown and NOT assumed to be equal.
● We assume that samples are randomly and independently drawn.
● We assume that the populations are normally distributed, OR both samples are symmetric, OR both sample sizes are at least 30, for the CLT.

The tSTAT, for µ1 − µ2 with σ1 and σ2 unknown and not assumed equal, is:

tSTAT = [(X̄1 − X̄2) − (µ1 − µ2)] / √(S1²/n1 + S2²/n2)

(Generally, H0: µ1 = µ2, thus µ1 − µ2 = 0.)

The critical t-value has degrees of freedom given by a separate formula, rounded to the nearest integer (I can't even transcribe it myself anymore).

The confidence interval, for µ1 − µ2 with σ1 and σ2 unknown and not assumed equal, is:

(X̄1 − X̄2) ± tα/2,d.f. √(S1²/n1 + S2²/n2)

Excel can do all the work for you: Data → Data Analysis → t-Test: Two-Sample Assuming Unequal Variances → input data and raw α → done.

Example. You have been asked to compare the mean value of a Sydney-Melbourne flight on Jetstar with the mean value on Qantas, to see if there is a difference in the means of the samples. Using α = 0.05,
(a) What are your hypotheses?
(b) What is tSTAT?
(c) How many degrees of freedom does the critical value have?
(d) What is the critical value? What can you conclude?
(e) What is the p-value? What can you conclude?
(f) Form a 95% confidence interval for the difference of the population means (i.e. µ1 − µ2).
(g) What assumptions are you making?

(a) H0: µ1 = µ2, i.e. µ1 − µ2 = 0; H1: µ1 ≠ µ2

Parts (b), (c), (d), (e) and (f) can be solved in Excel by going to the Data tab → Data Analysis → t-Test: Two-Sample Assuming Unequal Variances → enter the two arrays, α and the hypothesised mean difference (= 0) → OK. The output appears in a new sheet. However, for the purposes of this exercise, I want to go through things step by step.
(The spreadsheet screenshots showing the formulas and the resulting numbers are not reproduced here; the cell references below refer to them.)

(b) tSTAT = −4.096 (B31)
(c) d.f. = 36 (E33)
(d) This is a two-tail test. If tSTAT is more extreme than the critical values, reject H0.
Critical values = ±2.028 (E34)
Since tSTAT < −2.028, reject H0 (E36). There is a significant difference between the mean value of Sydney-Melbourne flights on Jetstar and Qantas.
(e) The p-value is the probability of getting a tSTAT equal to or more extreme than the sample result if there is no difference in the two population means (i.e. if H0 holds).
p-value = 0.000227 (E35)
Since p-value < α, reject H0 (E37). There is a significant difference between the mean value of Sydney-Melbourne flights on Jetstar and Qantas.
(f) Confidence interval: −14.009 < µ1 − µ2 < −4.731 (B34, B35). You can conclude with 95% confidence that the difference between the population means falls inside this interval.
(g) (i) Since the sample sizes are both less than 30, it must be assumed that both sampled populations are approximately normal. (ii) Observing a difference this large or larger in the two sample means is less likely if you assume equal population variances than if you assume unequal variances, but the null hypothesis would be rejected either way.

Two-sample test comparing population means using related samples

The paired difference test uses the difference between matched values (e.g. sales data 'before' and 'after' a marketing campaign; a student's test scores in Quiz 1 and Quiz 2).
● We assume that both populations are normally distributed, or the sample is large enough for the CLT.
● The difference between matched values: Di = Xi1 − Xi2.
● The point estimate for the paired difference population mean µD: D̄ = (1/n) Σi=1..n Di
● The sample standard deviation: SD = √[Σi=1..n (Di − D̄)² / (n − 1)]
● tSTAT for µD, where d.f.
= n − 1:

tSTAT = (D̄ − µD)/(SD/√n)

● Confidence interval for µD: D̄ ± tα/2 SD/√n

Two-sample test comparing population proportions

(Key question: what is the relationship between two population proportions, π1 and π2?)

To compute the test statistic, we assume the null hypothesis is true, so we assume π1 = π2 and pool the two sample estimates. The pooled estimate for the overall proportion is:

p̄ = (X1 + X2)/(n1 + n2), where X1 and X2 are the number of items of interest in samples 1 and 2.

ZSTAT for π1 − π2:

ZSTAT = [(p1 − p2) − (π1 − π2)] / √(p̄(1−p̄)(1/n1 + 1/n2))

(once again, remember we assume π1 = π2).

Confidence interval for π1 − π2:

(p1 − p2) ± Zα/2 √(p1(1−p1)/n1 + p2(1−p2)/n2)

Covariance and correlation

http://ci.columbia.edu/ci/premba_test/c0331/s7/s7_5.html

The sample covariance indicates how two samples are related.

cov(X, Y) = Σi=1..n (Xi − X̄)(Yi − Ȳ) / (n − 1)

If cov(X,Y) > 0 → X and Y tend to move in the same direction (positively related).
If cov(X,Y) < 0 → X and Y tend to move in opposite directions (negatively related).
If cov(X,Y) = 0 → X and Y have no linear relationship. (Independent variables have zero covariance, but zero covariance does not by itself guarantee independence.)

Covariance cannot measure the degree to which variables move together, because it does not use one standard unit of measurement. To measure the strength of the linear relationship between two numerical variables, you must use the sample coefficient of correlation:

r = cov(X, Y) / (SX SY), where SX and SY are the standard deviations of X and Y.

The sample coefficient of correlation ranges between −1 and 1. The closer to −1, the stronger the negative linear relationship. The closer to 1, the stronger the positive linear relationship. The closer to 0, the weaker the linear relationship - the data can be randomly scattered, or follow a non-linear pattern (like a quadratic).
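Both definitions translate directly into code, since SX = √cov(X,X). A minimal Python sketch (the data points below are made up for illustration):

```python
import math

def sample_cov(x, y):
    """Sample covariance: sum((xi - xbar)(yi - ybar)) / (n - 1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample coefficient of correlation r = cov(X, Y) / (Sx * Sy).
    cov(X, X) is the sample variance, so Sx = sqrt(cov(X, X))."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

# Hypothetical data: y broadly rises with x, so r should be strongly positive
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 6]
print(round(sample_corr(x, y), 3))   # 0.853
```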
Week 11, 12, 13 - Simple and Multiple Linear Regression

Regression analysis leads us to construct a model which allows us to see how one or more independent variables can predict the value of a dependent variable.

In simple linear regression,
● there is only one independent variable, X.
● the relationship between X and the dependent variable, Y, is described by a linear function.
● changes in X are assumed to cause changes in Y.

Two things determine a line: its y-intercept and its slope. Therefore, a simple linear regression model looks like:

Yi = β0 + β1Xi + εi

Yi = dependent variable
β0 = population y-intercept
β1 = population slope coefficient
Xi = independent variable
β0 + β1Xi = linear component = E(Yi | X = Xi)
εi = random error component

The estimated SLR equation (also called the prediction line) provides an estimate of the population regression line:

Ŷi = b0 + b1Xi

Ŷi = predicted Y for observation i
b0 = estimate of the y-intercept
b1 = estimate of the slope
Xi = value of X for observation i

(In the lecture slides, this line is drawn in orange through the scatter plot.) The residual, ei, is the difference between Yi and Ŷi.

So, how do we determine b0 and b1 (i.e. the y-intercept and the slope)? How do we know what line to draw through our scatter plot? Well, we want the spread of our residuals ei (= Yi − Ŷi) around the line to be as small as possible, right? We use the least squares method, which minimises the sum of squared differences between the observed values Yi and the predicted values Ŷi.
This sum of squared differences is equal to Σi=1..n (Yi − Ŷi)² = Σi=1..n [Yi − (b0 + b1Xi)]².

We minimise the sum of squared differences by differentiating with respect to b0 and b1 and setting the derivatives equal to 0:

∂/∂b0 Σ (Yi − b0 − b1Xi)² = −2 Σ (Yi − b0 − b1Xi) = 0
→ Σ b0 = Σ (Yi − b1Xi)
→ b0 = Ȳ − b1X̄

∂/∂b1 Σ (Yi − b0 − b1Xi)² = −2 Σ (Yi − b0 − b1Xi)Xi = 0
Substituting b0 = Ȳ − b1X̄:
Σ (Yi − Ȳ + b1X̄ − b1Xi)Xi = 0
→ b1 Σ (Xi − X̄)Xi = Σ (Yi − Ȳ)Xi
→ b1 = Σ (Yi − Ȳ)(Xi − X̄) / Σ (Xi − X̄)²

(The last step uses Σ(Xi − X̄)Xi = Σ(Xi − X̄)² and Σ(Yi − Ȳ)Xi = Σ(Yi − Ȳ)(Xi − X̄), which hold because Σ(Xi − X̄) = Σ(Yi − Ȳ) = 0.)

You're not expected to remember this exact process. What is important is that you understand the concept: we've got a quadratic and we optimise it by setting its derivative to 0. (Note: these equations are quite complicated. If you're unsure of the differentiation, I recommend you refresh your 2-unit mathematics or take a maths help class at uni.)

Example. Find the estimated SLR equation for the relationship between the selling price of a home and its size (measured in square feet), given the data.

In Excel, go to Data → Data Analysis → Regression → input the Xs and Ys. (Include the labels 'House Price in $1000s' and 'Square Feet'.)

The regression output (not reproduced here) gives b0 = 98.24833 and b1 = 0.10977.

b0 (= 98.24833) is the estimated mean value of Y when the value of X is 0. But since a house cannot have a square footage of 0, b0 in this case has no practical application.

b1 (= 0.10977) tells us that the mean value of a house increases by 0.10977 × ($1000) = $109.77, on average, for each additional square foot of size.

When using a regression model, you should (usually) only interpolate within the X-range of the existing observations, not extrapolate. This is because when a new observation with an X outside the range occurs, it will change b0 and b1.
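The closed-form results b1 = Σ(Xi − X̄)(Yi − Ȳ)/Σ(Xi − X̄)² and b0 = Ȳ − b1X̄ are easy to implement directly. A minimal Python sketch, using toy data chosen to lie exactly on a line so the fit recovers it (not the house-price data from the notes):

```python
def least_squares(x, y):
    """Least-squares estimates for the prediction line y-hat = b0 + b1*x:
    b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2), b0 = ybar - b1*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

# Toy data on the exact line y = 2 + 3x
b0, b1 = least_squares([0, 1, 2, 3], [2, 5, 8, 11])
print(b0, b1)   # 2.0 3.0
```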
b0 = Ê(Y | X = 0), which estimates β0 = E(Y | X = 0)
b1 = Ê(Y | X + 1) − Ê(Y | X), which estimates β1 = E(Y | X + 1) − E(Y | X)

Measures of variation

SST (total variation) measures the variation of the observed Yi values around their mean Ȳ. This is related to the sample variance of Y.

SSE (unexplained variation) measures variation attributable to factors other than X. The whole discussion of Least Squares was about minimising SSE.

SSR (explained variation) measures variation attributable to the relationship between X and Y.
● It's the difference between how we would predict Y if we didn't use X (SST) and how we would predict Y if we did use X (SSE).
● Since SST = SSR + SSE, if SSE gets smaller, SSR gets larger.

The coefficient of determination is the proportion of the linear variation in Y that is explained by the linear variation in X:

r² = SSR/SST = Σ(Ŷi − Ȳ)² / Σ(Yi − Ȳ)²

where r is the coefficient of correlation.

"r² × 100% of the variation in Y is explained by variation in X, via the linear model."

The standard error of the estimate is the standard deviation of the observations around the regression line, and is measured in the same units as Y:

SYX = √(SSE/(n − 2)) = √(Σ(Yi − Ŷi)²/(n − 2))

We have estimated two parameters for the mean (the intercept and the slope), thus we divide by n − 2.

Residual analysis to verify assumptions

The residual ei is the difference between the observed and predicted values: ei = Yi − Ŷi.

Assumptions about the errors ("LINE"):
● Linearity: the relationship between X and Y is linear.
● Independence of Errors: error values are statistically independent.
● Normality of Error: error values are normally distributed (needed only in small samples or when doing prediction intervals).
● Equal variance (homoskedasticity): the probability distribution of the errors has constant variance.

We can plot residuals vs. X to test these assumptions.
We don't want to see any pattern in the residual plot. If we do, that means our regression model has not completely captured the variation in Y as a function of X, and we will need to alter it.

Tests that the linear relationship is not due to chance

Central limit theorem: the Least Squares estimates are linear combinations of the observations Yi, so the CLT applies to them. Therefore, in large samples, Ȳ, b0 and b1 are approximately normally distributed.

The standard error of b1, the regression slope coefficient, is estimated by:

Sb1 = SYX/√SSX = SYX/√(Σ(Xi − X̄)²)

t-test for a population slope: is there a linear relationship between X and Y?
H0: β1 = 0 (there is no linear relationship)
H1: β1 ≠ 0 (a linear relationship exists)

tSTAT = (b1 − β1)/Sb1, with d.f. = n − 2.

Example. ^house-price = 98.24833 + 0.10977(square feet). Is there a relationship between the square footage of the house and its sales price?

Note that if α is not specified, assume α = 0.05.
● tSTAT = 3.329378
Critical values: t0.025,8 = ±2.3060
Decision: reject H0. There is sufficient evidence that square footage significantly affects house price.
● p-value = 0.01039
α = 0.05
Decision: reject H0. There is sufficient evidence that square footage significantly affects house price.
● Confidence interval for the estimate of the slope: (0.03374, 0.18580)
We are 95% confident that the average impact on sales price is between $33.74 and $185.80 per square foot of house size. This 95% confidence interval does not include 0 (i.e. there is not 'no impact').
Decision: there is a significant relationship between house price and square feet at the 5% level of significance.

t-test for a correlation coefficient:
H0: ρ = 0 (there is no correlation between X and Y)
H1: ρ ≠ 0 (correlation exists)
ρ is the population correlation coefficient.

tSTAT = (r − ρ)/√((1 − r²)/(n − 2)), with d.f. = n − 2.

Example.
Is there evidence of a linear relationship between square feet and house price at the 0.05 level of significance? (Look familiar?)

tSTAT = (0.762 − 0)/√((1 − 0.762²)/(10 − 2)) = 3.329

Since tSTAT falls outside the critical values, reject H0. There is evidence of a linear association at the 5% level of significance.

NB: the t-test for a population slope and the t-test for a correlation coefficient result in the same thing.

Estimating mean values and predicting individual values

Whereas a confidence interval is for the mean response, a prediction interval is for an individual value of Y.

Multiple linear regression

In real life, when we're trying to predict a variable Y, there are often several variables X which may be acting upon Y.
● E.g. if Y is sales, X1 might be advertising, X2 might be price, X3 might be competitors' sales ...

So we need to use something called a multiple regression model with k independent variables:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi

Yi = observed value of Y
β0 = y-intercept
β1, …, βk = population slopes
X1i, …, Xki = observed values of the X variables for observation i
εi = random error

The coefficients of the multiple regression model are again estimated using sample data:

Ŷi = b0 + b1X1i + b2X2i + … + bkXki

Ŷi = predicted value of Y
b0 = estimated y-intercept
b1, …, bk = estimated slope coefficients

If we represent Ŷ graphically (with two X variables), we get a plane.

Example. A distributor of pies wants to evaluate factors thought to influence demand.
Dependent variable: pie sales (units per week)
Independent variables: price of pies ($); expenditure on advertising ($100s)

In Excel,
1. Input the data, with the dependent variable in the leftmost column. Include headings.
2. Data → Data Analysis → Regression. Input the dependent variable in the Y range, and all the independent variables in the X range.
Therefore the prediction equation for sales is: predicted sales = 306.526 − 24.975(Price) + 74.131(Advertising).
● Sales will decrease on average by 24.975 pies per week for each $1 increase in selling price, if the level of advertising is held constant.
● Sales will increase on average by 74.131 pies per week for each $100 increase in advertising, if the selling price is held constant.

Testing for the fit of a multiple regression model

When we have two X variables, we want to know that both of them are important, so we want to compare the fit of the model using both X variables with the fit of a model using only one of them. Recall that r² = SSR/SST = regression sum of squares / total sum of squares. If we add a third variable, no matter what it is, r² can never decrease. This undermines the usefulness of r² as a comparative measure of the fit of different models. What do we do? We use the adjusted r², asking ‘how much have we reduced our SYX below SY?’:
r²adj = 1 − (1 − r²)(n − 1)/(n − k − 1)
The closer r²adj is to 1, the better our model fits. This is not an intuitive formula … but that’s what it is.

Testing for linear significance: are any of the X variables linearly related to Y, or are none?
Hypotheses:
H0: β1 = β2 = … = βk = 0 (no linear relationship)
H1: at least one βi ≠ 0 (at least one independent variable affects Y)
We can’t use a separate t-test for each βi, since they’re all correlated. So what do we do? What if we take the sum of all the squares of the bi (i = 1, 2, …, k)? If the sum is close enough to 0, the only way that could happen is if all the tSTAT were close to 0; the only way that could happen is if all the bi were close to 0; and the only way that could happen is if all the βi were close to 0. That is roughly what we do when we use the F Test for overall significance of the model.
FSTAT = MSR/MSE, where k = number of independent variables in the regression model. (“E12”)
F1,n−2 ≡ t²n−2; i.e.
FSTAT is a quadratic function of the tSTAT values.
MSR = SSR/k (“D12”)
MSE = SSE/(n − k − 1) (“D13”)
FSTAT follows an F distribution with k (numerator) and n − k − 1 (denominator) degrees of freedom. (“B12”, “B13”)
Decision rule: reject H0 at the α level of significance if FSTAT > Fα; otherwise, do not reject H0. (“F12”)

Residual analysis in multiple regression

Residuals from the regression model: ei = (Yi − Ŷi)
Assumptions about ε:
● The errors are normally distributed (needed if we have a small sample size or are constructing prediction intervals)
● They have a constant variance
● They are independent
These residual plots are used in multiple regression:
(a) Residuals vs. Ŷi (predicted values)
(b) Residuals vs. X1i
(c) Residuals vs. X2i
(d) Residuals vs. time (if time series data)
NB: if the residuals are independent of both X1i and X2i, then they’ll be independent of Ŷi. So we normally just plot residuals vs. Ŷi.
Use the residual plots to check for violations of the regression assumptions.

Testing whether individual variables are significant

t-tests / p-values of individual variable slopes show whether there is a linear relationship between the variable Xj and Y, holding constant all other X variables.
H0: βj = 0 (no linear relationship)
H1: βj ≠ 0 (a linear relationship does exist between Xj and Y)
tSTAT = (bj − 0) / Sbj
Confidence interval estimate for the population slope βj: bj ± tα/2 Sbj, where t has (n − k − 1) df.
If the interval does not contain 0, reject the null hypothesis.

Dummy variables

A dummy variable is how we put categorical variables into regression models. We code binary dummy variables as 0 (characteristic did not occur / does not exist) or 1 (it did).
Example. For our pie example, let:
Y = pie sales
X1 = price
X2 = holiday: X2 = 1 if a holiday occurred during the week; X2 = 0 if there was no holiday that week.
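As a sketch of the holiday-dummy model (again with made-up numbers; only the 0/1 coding scheme comes from the notes), the dummy column is fitted exactly like any other X variable:

```python
import numpy as np

# Made-up weekly data: price ($), holiday dummy (1 = holiday week), pie sales.
price   = np.array([5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0])
holiday = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
sales   = np.array([460.0, 350, 350, 430, 470, 380, 430, 490, 450, 490])

X = np.column_stack([np.ones_like(price), price, holiday])
(b0, b1, b2), _, _, _ = np.linalg.lstsq(X, sales, rcond=None)

# The fit gives two parallel lines, differing only by the holiday effect b2:
#   holiday weeks:     sales_hat = (b0 + b2) + b1 * price
#   non-holiday weeks: sales_hat =  b0       + b1 * price
```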
You see that with a dummy variable in the mix, we get TWO regression lines! The only difference between them is the impact of the holiday.
If Y = pie sales and X1 = price, and the flavour of the pie (which has three ‘levels’: apple, strawberry, chocolate) is thought to matter, let:
X2 = 1 if apple, 0 otherwise
X3 = 1 if strawberry, 0 otherwise
For apple pie, Ŷ = b0 + b1X1 + b2X2
For strawberry pie, Ŷ = b0 + b1X1 + b3X3
For chocolate pie, Ŷ = b0 + b1X1
Therefore, the number of dummy variables is one less than the number of levels.

Autocorrelation

Autocorrelation is correlation of the errors over time. If the residuals show a repeating pattern, that is a sign of positive autocorrelation. Autocorrelation violates the regression assumption that residuals are random and independent.
This plot of residuals against time seems to display autocorrelation:
● ‘spikes’ in the residuals around May and October of every year, as highlighted in orange.
● ‘troughs’ around June and December/January of every year, as highlighted in crimson.
● Therefore, there seems to be an annually cyclical trend in the residuals.
To be sure, we calculate the Durbin-Watson statistic to test for autocorrelation.
H0: residuals are not autocorrelated
H1: autocorrelation is present
The critical values dL and dU can be found from a Durbin-Watson table, for sample size n and number of independent variables k.
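The statistic itself is simple to compute: D = Σ(et − et−1)² / Σet², which lies between 0 and 4 and is near 2 when there is no autocorrelation. A minimal sketch (the residual values below are made up):

```python
import numpy as np

def durbin_watson(residuals):
    """D = sum((e_t - e_{t-1})^2) / sum(e_t^2).
    D near 2: no autocorrelation; D well below 2: positive autocorrelation;
    D well above 2: negative autocorrelation. Compare D with dL/dU from a table."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Smoothly drifting residuals (positive autocorrelation) give a small D;
# alternating residuals (negative autocorrelation) give a D well above 2.
d_pos = durbin_watson([1.0, 1.2, 1.1, 0.9, 0.8, 1.0])
d_neg = durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
```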
Testing for positive autocorrelation: reject H0 if D < dL; do not reject H0 if D > dU; the test is inconclusive if dL ≤ D ≤ dU.
Testing for negative autocorrelation: reject H0 if D > 4 − dL; do not reject H0 if D < 4 − dU; the test is inconclusive if 4 − dU ≤ D ≤ 4 − dL.

Pitfalls of regression

● Lacking an awareness of the assumptions underlying least-squares regression
● Not knowing how to evaluate the assumptions
● Not knowing the alternatives to least-squares regression if a particular assumption is violated
● Using a regression model without knowledge of the subject matter
● Extrapolating outside the relevant range
To avoid these pitfalls:
● Start with a scatter plot of Y vs. X to observe any possible relationship
● Perform graphical residual analysis to check the assumptions (“LINE”)
● If any assumption is violated, use alternative methods or models
● If there is no evidence of assumption violation, then test for the significance of the regression coefficients and construct confidence intervals and prediction intervals
● Avoid making predictions or forecasts outside the relevant range
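To tie the pieces together, here is a minimal end-to-end sketch of the recommended workflow on made-up simple-regression data: fit, slope standard error, t-test, F test, and a 95% confidence interval for the slope. (With one predictor, FSTAT = tSTAT², as the notes observe.)

```python
import numpy as np

# Made-up (x, y) data for illustration
x = np.array([14.0, 15.2, 17.0, 18.5, 20.0, 22.3, 24.0, 25.5, 27.0, 30.0])
y = np.array([245.0, 262, 279, 298, 299, 319, 345, 324, 380, 355])
n = len(x)

b1, b0 = np.polyfit(x, y, 1)                   # least-squares slope, intercept
resid = y - (b0 + b1 * x)

s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))   # standard error of the estimate
ssx = np.sum((x - x.mean()) ** 2)
s_b1 = s_yx / np.sqrt(ssx)                     # standard error of the slope

t_stat = (b1 - 0) / s_b1                       # t-test of H0: beta1 = 0, df = n - 2

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum(resid ** 2)
f_stat = ((sst - sse) / 1) / (sse / (n - 2))   # F test; equals t_stat**2 here

ci = (b1 - 2.3060 * s_b1, b1 + 2.3060 * s_b1)  # 95% CI; t_{0.025, 8} = 2.3060
```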