
STA302H1: Methods of Data Analysis I
(Lecture 4)
Mohammad Kaviul Anam Khan
Assistant Professor
Department of Statistical Sciences
University of Toronto
ANOVA
• ANOVA is another way of testing the significance of the regression line
• This focuses on variance decomposition
• The total variation in Y is measured by the total sum of squares (SST)
• $\mathrm{SST} = \sum_{i=1}^n (y_i - \bar{y})^2$, which is basically the numerator of $S_y^2$
• The goal is to explain some of the variability in SST with the regression line
ANOVA
• The SST can be decomposed as follows,
  $\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i + \hat{y}_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y})$
• The third term in the equation becomes,
  $\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^n \big(\hat{y}_i(y_i - \hat{y}_i) - \bar{y}(y_i - \hat{y}_i)\big) = \sum_{i=1}^n \hat{y}_i e_i - \bar{y}\sum_{i=1}^n e_i = 0$
• We know that $\sum_{i=1}^n e_i = 0$. Using the second normal equation (Week 1), we can show that $\sum_{i=1}^n x_i e_i = 0$, which implies $\sum_{i=1}^n \hat{y}_i e_i = 0$ (since $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$)
ANOVA
• Thus, the SST can be divided into two parts,
  $\sum_{i=1}^n (y_i - \bar{y})^2 = \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$
• The first term on the right-hand side, $\sum_{i=1}^n (y_i - \hat{y}_i)^2$, is the residual sum of squares, $(n - 2)S^2$
• The second term measures the variation in the fitted values $\hat{y}_i$ from the regression. One can easily show that $\sum_{i=1}^n \hat{y}_i / n = \bar{y}$
• The second term on the right-hand side is called the regression sum of squares (SSreg)
• The total variation in Y has been decomposed into two parts: one explained by the regression line and the other attributable to random error
ANOVA
• What are the degrees of freedom?
• For SST the degrees of freedom are n − 1, since there is one constraint: all the data were used to calculate $\bar{y}$
• The SSreg is determined completely by one parameter estimate, $\hat{\beta}_1$. Thus the degrees of freedom are 1
• For RSS there are two constraints: both $\hat{\beta}_0$ and $\hat{\beta}_1$ need to be estimated. Thus, the degrees of freedom are n − 2
• We will later show (during the lectures on multiple linear regression) that,
  $\dfrac{\mathrm{SSreg}}{\sigma^2} \sim \chi^2_1 \quad (1)$
  $\dfrac{\mathrm{RSS}}{\sigma^2} \sim \chi^2_{n-2} \quad (2)$
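• A quick simulation sketch (simulated data, not from the slides) of result (2): with the true σ = 1, the RSS from repeated fits should average close to n − 2, the mean of a $\chi^2_{n-2}$ variable.

```r
# Simulation sketch: RSS/sigma^2 behaves like a chi^2_{n-2} variable (here sigma = 1)
set.seed(302)
n <- 20
rss <- replicate(5000, {
  x <- rnorm(n)
  y <- 3 + 2 * x + rnorm(n)        # true sigma = 1, so RSS/sigma^2 = RSS
  sum(resid(lm(y ~ x))^2)
})
c(mean(rss), n - 2)                # mean of a chi^2_{n-2} variable is n - 2
```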
ANOVA
• Using (1) and (2), we can see that,
  $F_0 = \dfrac{\mathrm{SSreg}/1}{\mathrm{RSS}/(n-2)}$
• Under $H_0: \beta_1 = 0$, $F_0$ follows the $F_{1,n-2}$ distribution
• Ideally we want SSreg to be as close to SST as possible
• However, with real-life data that is not always possible
• The F-test detects how close SSreg is to SST: the closer it is, the larger the value of $F_0$
• One can also show that $t^2_{n-2} = F_{1,n-2}$
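• A minimal R sketch with simulated data (not the course dataset) checking numerically that the squared slope t statistic equals the F statistic:

```r
# Sketch with simulated data: the squared slope t statistic equals the F statistic
set.seed(1)
x <- rnorm(25)
y <- 1 + 0.8 * x + rnorm(25)
fit <- lm(y ~ x)
t_val <- summary(fit)$coefficients["x", "t value"]
F_val <- anova(fit)["x", "F value"]
c(t_val^2, F_val)                  # identical up to rounding
```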
ANOVA
• Under $H_0$, the F-test relies on the following assumptions:
• the errors are independent of each other
• they have constant variance and mean 0
• and are Normally distributed
• The F-test conducted using this variance decomposition is called the ANalysis Of VAriance (ANOVA) test
ANOVA Table
• The ANOVA test is traditionally presented with the ANOVA table
• For example, for the production data the ANOVA table looks as follows:
  Source of variation   Sum of Squares   DF      Mean Squares          F value
  Regression            SSreg            1       MSreg = SSreg / 1     F0 = MSreg / MRSS
  Residuals             RSS              n - 2   MRSS = RSS / (n - 2)
  Total                 SST              n - 1
• The p-value is calculated as $P(F_{1,n-2} > F_0)$
• Also, we reject $H_0$ at the α level of significance if $F_0 > F_{1-\alpha,\,1,\,n-2}$
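• A sketch of how this table is obtained in R (simulated data here, since the production data file is not shown): anova() on a fitted lm object reports the sums of squares, degrees of freedom, mean squares, $F_0$ and the p-value, and the last lines recompute $F_0$ and $P(F_{1,n-2} > F_0)$ by hand.

```r
# ANOVA table sketch with simulated data (the lecture uses the production data instead)
set.seed(2)
x <- rnorm(30)
y <- 2 + 1.5 * x + rnorm(30)
mod <- lm(y ~ x)
anova(mod)                         # Df, Sum Sq, Mean Sq, F value, Pr(>F)

# Manual reconstruction of the table entries
SST   <- sum((y - mean(y))^2)
RSS   <- sum(resid(mod)^2)
SSreg <- SST - RSS
F0    <- (SSreg / 1) / (RSS / (length(y) - 2))
pf(F0, df1 = 1, df2 = length(y) - 2, lower.tail = FALSE)   # p-value P(F > F0)
```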
The Coefficient of Determination
• Another often used measure to assess whether the regression line explains enough
of the variability in the response is the coefficient of determination, R2
• This summary gives the proportion of the total sample variability in the
response that has been explained by the regression model, so it is naturally
based on the sum of squares.
• It can be calculated in two ways:
• $R^2 = \mathrm{SSreg}/\mathrm{SST}$
• $R^2 = 1 - \mathrm{RSS}/\mathrm{SST}$
• The range is $0 \le R^2 \le 1$.
• $R^2 \approx 1$ implies that the model is a good fit and X is an important predictor of Y
• $R^2 \approx 0$ implies that the model is not a good fit and X is not an important predictor of Y
• It provides an idea of how much variation the regression line explains, but since it
is not a formal test, we cannot say how much is enough.
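• A short sketch with simulated data showing that both formulas agree with the $R^2$ reported by summary():

```r
# R^2 sketch: both formulas match summary()'s value (simulated data)
set.seed(3)
x <- rnorm(30)
y <- 2 + 1.5 * x + rnorm(30)
mod <- lm(y ~ x)
SST <- sum((y - mean(y))^2)
RSS <- sum(resid(mod)^2)
c(summary(mod)$r.squared, (SST - RSS) / SST, 1 - RSS / SST)   # three equal numbers
```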
Categorical Predictors
• So far we have considered the predictor X to be continuous
• However, often the predictor X could be categorical
• For example, let the outcome be Y = blood pressure and predictor X is smoking
status
• Here the predictor is binary (whether the person smokes or not)
• How do we deal with these types of predictors?
Dummy Variables
• Recall the indicator variable defined in the first lecture
• From a categorical predictor we can create indicator variables. These are called dummy variables
• Let's consider the simplest form of a categorical predictor, one with only two values.
• A common use of dummy-variable regression is comparing a response variable between two different groups
Dummy Variables
• We will be working with a new dataset regarding the time it takes for a food
processing center to change from one type of packaging to another.
• The data in the file ‘changeover times.txt’ contains:
• The response: change-over time (in minutes) between two types of packaging
• The categorical predictor: indicating whether the new proposed method of changing
over the packaging type was used, or the old method.
• We have 48 change-over times under the new method, and 72 change-over times
under the old method.
Food Processing and Packaging Dataset
• The comparison can be performed with a two-sample t-test
• However, we may also want to model the relationship between Y and X directly.
• We can use a simple linear regression
  $E(Y \mid X = x) = \beta_0 + \beta_1 x$
• where Y is the change-over time
• x = 1 when the new change-over method is used, and x = 0 when the existing
method is used.
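• A hedged sketch of this fit in R; the column names (Changeover, New) are assumptions about how the data file is laid out and should be adjusted to the actual data.

```r
# Sketch of the dummy-variable fit; the column names (Changeover, New) are assumptions.
ct  <- read.table("changeover times.txt", header = TRUE)
fit <- lm(Changeover ~ New, data = ct)   # New = 1 for the new method, 0 for the existing one
summary(fit)                             # slope = average difference between the two methods
t.test(Changeover ~ New, data = ct, var.equal = TRUE)   # equivalent pooled two-sample t-test
```

• With a pooled (equal-variance) two-sample t-test, the t statistic for the group difference matches the t statistic for the slope in the regression output, which is why the two approaches give the same conclusion.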
Food Processing and Packaging Dataset
• Based on the test, we see that we have a significant decrease in change-over time
from the existing method to the new method.
• What would be the interpretation of the slope β1?
• Since X is no longer continuous, we cannot interpret it as the change per unit increase in X
• Based on these, we can say the slope reflects the average reduction in change-over
time of 3.2 minutes when switching from the existing method to the new method.
• Because we only have two levels of the variable, it is not the average change in response for a unit change in X, but rather the average difference in response between the two methods.
• The slope provides the magnitude of the difference, while the hypothesis test tells
us whether the difference is statistically significant.
Least Squares for Multiple Linear Regression
Least Squares for Multiple Linear Regression
• Again, to obtain the least squares estimates we need to minimize the residual sum of squares $\mathrm{RSS}(\beta_0, \ldots, \beta_p) = \sum_{i=1}^n e_i^2$, i.e.,
  $\mathrm{RSS}(\beta_0, \ldots, \beta_p) = \sum_{i=1}^n \Big( y_i - \sum_{j=0}^{p} \beta_j x_{ij} \Big)^2$
• To obtain the least squares estimates we need to minimize the RSS with respect to the regression parameters. That is,
  $\dfrac{\partial\,\mathrm{RSS}(\beta_0, \ldots, \beta_p)}{\partial \beta_0} = -2 \sum_{i=1}^n \Big( y_i - \sum_{j=0}^{p} \beta_j x_{ij} \Big)$
  $\dfrac{\partial\,\mathrm{RSS}(\beta_0, \ldots, \beta_p)}{\partial \beta_j} = -2 \sum_{i=1}^n \Big( y_i - \sum_{j=0}^{p} \beta_j x_{ij} \Big) x_{ij}$
• Setting these derivatives to zero gives p + 1 normal equations in p + 1 unknowns
• Solving these equations by hand can be tedious (What should we do?)
Matrix Algebra
Multiple Linear Regression
• To solve these p + 1 equations we need to apply matrix algebra
• In the next part of this lecture we are going to focus on how to use matrix (linear)
algebra to obtain least squares estimates for multiple linear regression
• For this, we are going to write the regression in matrix form,
Y = Xβ + ε
where,
1. Y is an n × 1 vector
2. X is an n × (p + 1) matrix. The first column is just a vector of 1's
3. β is a (p + 1) × 1 vector and
4. ε is an n × 1 vector
• Are you comfortable with the notations?
• Let’s review some very basic linear/matrix algebra
Definitions
1. Matrix: An n × p matrix A is a rectangular array of elements in n rows and p
columns
  $A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1p} \\ a_{21} & a_{22} & \cdots & a_{2p} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{np} \end{pmatrix}$
When n = p, then the matrix becomes a square matrix
2. Vector: A matrix with only one row (row vector) or one column (column vector).
For example, $Y = (Y_1, Y_2, \ldots, Y_n)$ is a row vector of dimensions $1 \times n$ and
  $Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix}$ is a column vector with dimensions $n \times 1$
Definitions
3. Transpose of a Matrix: Let $A'$ be the transpose of the matrix A defined previously; then $A'$ is a $p \times n$ matrix,
  $A' = \begin{pmatrix} a_{11} & a_{21} & \cdots & a_{n1} \\ a_{12} & a_{22} & \cdots & a_{n2} \\ \vdots & \vdots & & \vdots \\ a_{1p} & a_{2p} & \cdots & a_{np} \end{pmatrix}$
Here we can see that the rows of A are the columns of A' and vice versa
4. Symmetric Matrix: If A is a square matrix and A = A′ then A is a symmetric
matrix
Definitions
5. Diagonal Matrix: square matrix where all elements are zero except those on the
main diagonal (top left to bottom right). For example,
  $D = \begin{pmatrix} d_{11} & 0 & \cdots & 0 \\ 0 & d_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & d_{nn} \end{pmatrix}$
The diagonal elements can be different in this case
6. Identity Matrix: a diagonal matrix where the elements on the diagonal are all
equal to 1, denoted by I. For example,
  $I = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix}$
Definitions
7. Inverse of a matrix: If A is a square matrix, let B be another matrix such that AB = I. Then B is the inverse of the matrix A and $B = A^{-1}$
8. Orthogonal Vectors: Two vectors u and v are orthogonal if their dot product $u \cdot v = 0$
9. Orthogonal Matrix: If A is a square matrix with $A'A = I$, then A is an orthogonal matrix. That is, $A^{-1} = A'$
10. Idempotent Matrix: Let A be a square matrix. If AA = A, then A is called an idempotent matrix.
11. $1_n$ Vector: A vector is called a $1_n$ vector if all of its n elements are 1.
12. $J_n$ Matrix: An $n \times n$ square matrix with all elements equal to 1. Basically, $J_n = 1_n 1_n'$
13. Rank of a Matrix: The rank of a matrix is the number of linearly independent columns, which equals the number of linearly independent rows. If all the columns or rows are linearly independent then the matrix has full rank
Matrix Operations
• Addition and Subtraction: matrix addition/subtraction is done element-wise. It is only valid when the matrices have the same order (dimensions)
Addition and Subtraction
Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 3 & 4 \\ 1 & 5 \end{pmatrix}$. Then $A + B = \begin{pmatrix} 4 & 6 \\ 4 & 9 \end{pmatrix}$
Addition is commutative, i.e. $A + B = B + A$
Matrix Operations
• Multiplication: multiply each row of the first matrix with each column of the second matrix, where we
• perform element-wise multiplication and then sum the resulting products within each row-column combination.
• only valid if the number of columns of the first matrix equals the number of rows of the second matrix.
Multiplication
Let $A = \begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$ and $B = \begin{pmatrix} 3 & 4 \\ 1 & 5 \end{pmatrix}$. Then $AB = \begin{pmatrix} 5 & 14 \\ 13 & 32 \end{pmatrix}$
Multiplication is not always commutative, i.e. $AB \neq BA$
Matrix Operations
• Transpose of a sum is equal to the sum of the transposed matrices, i.e., $(A + B)' = A' + B'$
• Transpose of a product is equal to the product in reverse order of the transposed matrices, i.e., $(AB)' = B'A'$
• Scalar multiplication involves multiplying each element by the scalar quantity.
Matrix Operations
• Determinant of a square matrix: Let a matrix be $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$. Then the determinant of A is
  $|A| = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{21}a_{12}$
• This gets a little complicated for an n × n matrix, but R calculates determinants automatically
• If the determinant is zero, the matrix is singular (not of full rank); otherwise it is non-singular.
• Some properties of determinants:
1. $|I| = 1$ when I is an identity matrix. For any diagonal matrix the determinant is just the product of the diagonal elements
2. $|A| = |A'|$ and $|A^{-1}| = |A|^{-1}$
3. $|cA| = c^n|A|$ for an $n \times n$ matrix A
4. For square matrices A and B, $|AB| = |A||B|$
5. The inverse of the $2 \times 2$ matrix above is calculated as $A^{-1} = \dfrac{1}{|A|}\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}$
Matrix Operations
• For any two column vectors $x = (x_1, x_2, \ldots, x_n)'$ and $y = (y_1, y_2, \ldots, y_n)'$, the dot product is given by $x \cdot y = \sum_{i=1}^n x_i y_i$
• If the dot product of two column vectors x and y is 0, then x and y are orthogonal. That is, $x \perp y$
• If $x = (x_1, x_2, \ldots, x_n)'$ is a column vector, then $\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$ is the L2 or Euclidean norm
• A vector $x = (x_1, x_2, \ldots, x_n)'$ is called a unit vector if $\sqrt{\sum_{i=1}^n x_i^2} = 1$
• A projection matrix P is a square matrix of order p that is both symmetric ($P' = P$) and idempotent ($P^2 = P$)
• The linear transformation $y = Px$ means y is the projection of x onto the subspace defined by the columns of P
Matrix Operations
• Let A be a $p \times p$ matrix. The trace of the matrix is defined by
  $\mathrm{tr}(A) = \sum_{i=1}^{p} a_{ii}$
• it is a linear mapping: for all square matrices A and B, and all scalars c,
  $\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B), \qquad \mathrm{tr}(cA) = c\,\mathrm{tr}(A)$
• Trace has some nice properties,
1. tr(AB) = tr(BA)
2. tr(ABC) = tr(CAB)
• An important property of an idempotent matrix is that the rank of the matrix is
the trace of that matrix
Matrix Operations
• A few things to remember:
• Addition and subtraction only work for matrices of the same order
• An inverse can be constructed only for a square, non-singular matrix
• For multiplication of matrices, the number of columns of the first matrix has to equal the number of rows of the second matrix
• For large matrices, hand calculation can be very tedious. Thus, we are going to use
R for these calculations
• Let's do a short R demonstration (a sketch follows below)
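• A minimal sketch of that demonstration using base R, reproducing the 2 × 2 examples from the earlier slides:

```r
# Base-R matrix operations on the 2x2 examples from the slides
A <- matrix(c(1, 3, 2, 4), nrow = 2)   # filled column-wise: rows (1, 2) and (3, 4)
B <- matrix(c(3, 1, 4, 5), nrow = 2)
A + B                 # element-wise addition
A %*% B               # matrix multiplication
t(A)                  # transpose
det(A)                # determinant (-2, so A is non-singular)
solve(A)              # inverse of A
solve(A) %*% A        # recovers the 2x2 identity
diag(2)               # 2x2 identity matrix
sum(diag(A %*% B))    # trace of AB, equal to sum(diag(B %*% A))
```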
Expectations and Variances of Vectors
Expectations and Variances of Vectors
• Let $Y = (Y_1, Y_2, \ldots, Y_n)'$ be a random vector; then
  $E(Y) = (E(Y_1), E(Y_2), \ldots, E(Y_n))'$
• The Variance of a random vector is given by a covariance matrix
• The covariance matrix has the variances on the diagonal and the covariances off the diagonal, i.e.,
  $\mathrm{Var}(Y) = \begin{pmatrix} \mathrm{Var}(Y_1) & \mathrm{Cov}(Y_1, Y_2) & \cdots & \mathrm{Cov}(Y_1, Y_n) \\ \mathrm{Cov}(Y_1, Y_2) & \mathrm{Var}(Y_2) & \cdots & \mathrm{Cov}(Y_2, Y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{Cov}(Y_1, Y_n) & \mathrm{Cov}(Y_2, Y_n) & \cdots & \mathrm{Var}(Y_n) \end{pmatrix}$
• Basically the matrix is $\mathrm{Var}(Y) = E\{(Y - E(Y))(Y - E(Y))'\}$
• It is very easy to see that the covariance matrix is, by definition, symmetric
• Let b be a fixed vector and Y be a random vector. Then $\mathrm{Var}(b'Y) = b'\,\mathrm{Var}(Y)\,b$
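• A small simulation sketch of the last identity (MASS::mvrnorm is used here only to draw correlated normal vectors; the numbers chosen are illustrative):

```r
# Check Var(b'Y) = b' Var(Y) b by simulation
set.seed(4)
Sigma <- matrix(c(2, 0.5, 0.5, 1), nrow = 2)       # covariance matrix of Y = (Y1, Y2)'
b     <- c(1, -2)
drop(t(b) %*% Sigma %*% b)                         # theoretical variance of b'Y (= 4)
Y <- MASS::mvrnorm(1e5, mu = c(0, 0), Sigma = Sigma)
var(drop(Y %*% b))                                 # empirical variance, close to 4
```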
LS estimation for MLR
• The residual sum of squares is given by
  $\mathrm{RSS}(\beta) = \varepsilon'\varepsilon = (Y - X\beta)'(Y - X\beta)$
• From the orders of the vectors/matrices we can see that even for multiple linear
regression the RSS is a scalar
• Expanding the RSS we get,
  $\mathrm{RSS} = (Y - X\beta)'(Y - X\beta) = Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta$
• We can write $Y'X\beta = \beta'X'Y$, since each is a scalar and one is the transpose of the other
LS estimation for MLR
• Differentiating the RSS w.r.t. β and setting the derivative to zero, we get
  $\dfrac{\partial\,\mathrm{RSS}}{\partial \beta} = \dfrac{\partial}{\partial \beta}\big(Y'Y - 2\beta'X'Y + \beta'X'X\beta\big) = 0$
  $\Rightarrow 0 - 2X'Y + 2X'X\beta = 0$
  $\Rightarrow X'X\beta = X'Y$
  $\Rightarrow (X'X)^{-1}X'X\beta = (X'X)^{-1}X'Y$
  $\Rightarrow \hat{\beta} = (X'X)^{-1}X'Y$
• This is the least squares estimate of β
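• A sketch with simulated data showing that the matrix formula reproduces the estimates from lm():

```r
# beta-hat = (X'X)^{-1} X'Y, compared with lm() on simulated data
set.seed(5)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
X  <- cbind(1, x1, x2)                         # n x (p + 1) design matrix
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(beta_hat, coef(lm(y ~ x1 + x2)))         # the two columns are identical
```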
Matrix Algebra with SLR
• For simple linear regression we can write,
  $X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}$ and $Y = (y_1, y_2, \cdots, y_n)'$
• Then
  $X'X = \begin{pmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{pmatrix} = n \begin{pmatrix} 1 & \bar{x} \\ \bar{x} & \frac{1}{n}\sum_{i=1}^n x_i^2 \end{pmatrix}$
• and $|X'X| = n^2\Big(\frac{1}{n}\sum_{i=1}^n x_i^2 - \bar{x}^2\Big) = n\,\mathrm{SXX}$
Matrix Algebra with SLR
• Thus we have,
  $(X'X)^{-1} = \begin{pmatrix} \dfrac{\frac{1}{n}\sum_{i=1}^n x_i^2}{\mathrm{SXX}} & -\dfrac{\bar{x}}{\mathrm{SXX}} \\ -\dfrac{\bar{x}}{\mathrm{SXX}} & \dfrac{1}{\mathrm{SXX}} \end{pmatrix}$
• You can see that if you multiply this matrix by $\sigma^2$ you get the variance-covariance matrix of $(\hat{\beta}_0, \hat{\beta}_1)$ (recall Lecture 2)
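• A quick sketch with simulated data checking that $\hat{\sigma}^2 (X'X)^{-1}$ matches the variance-covariance matrix reported by vcov() for a fitted SLR model:

```r
# sigma^2 (X'X)^{-1} matches vcov() of the fitted SLR model (sigma^2 estimated by S^2)
set.seed(6)
x <- rnorm(40)
y <- 1 + 0.5 * x + rnorm(40)
X <- cbind(1, x)
fit <- lm(y ~ x)
S2  <- sum(resid(fit)^2) / (length(y) - 2)     # estimate of sigma^2
S2 * solve(t(X) %*% X)                         # variance-covariance matrix of the estimates
vcov(fit)                                      # same matrix from lm()
```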
LS estimation for MLR
• Thus the projection of Y on X is given by
  $\hat{Y} = X\hat{\beta} = X(X'X)^{-1}X'Y$
• What are the dimensions of the matrix $X(X'X)^{-1}X'$?
• This is often called the projection or hat matrix, $H = X(X'X)^{-1}X'$. The hat matrix maps the vector of observed values to the vector of fitted values.
• The residuals can be calculated as
  $e = Y - \hat{Y} = Y - X(X'X)^{-1}X'Y = (I - X(X'X)^{-1}X')Y = (I - H)Y$
• Again, we can see that the residuals are also a linear combination of Y
LS estimation for MLR
• We can see that $H' = (X(X'X)^{-1}X')' = X(X'X)^{-1}X' = H$
• Thus, H is symmetric
• HH = ?
  $HH = X(X'X)^{-1}X'X(X'X)^{-1}X' = X(X'X)^{-1}X' = H$, since $(X'X)^{-1}X'X = I$
• Thus H is idempotent
• $(I - H)(I - H) = ?$
  $(I - H)(I - H) = I - H - H + HH = I - H$
• So $I - H$ is also idempotent
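• A sketch verifying these hat-matrix properties numerically on a small simulated design matrix (it also checks the HX = X identity used on the next slide):

```r
# Numerical check of the hat matrix properties
set.seed(7)
X <- cbind(1, rnorm(20), rnorm(20))            # 20 x 3 design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)          # 20 x 20 hat matrix
all.equal(H, t(H))                             # symmetric: H' = H
all.equal(H, H %*% H)                          # idempotent: HH = H
all.equal(H %*% X, X)                          # HX = X (next slide)
sum(diag(H))                                   # trace = rank = p + 1 = 3
```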
Partition Matrix
• We have seen that the hat matrix $H = X(X'X)^{-1}X'$
• Suppose the X matrix is column-partitioned into two matrices: $X_1$ of size $n \times k$ and $X_2$ of size $n \times (p + 1 - k)$.
• We can see that $HX = X$ and $X'H = X'$
• This implies that $HX = [HX_1 \;\; HX_2] = X = [X_1 \;\; X_2]$
• That is, $HX_1 = X_1$ and $HX_2 = X_2$
Assumptions of Multiple Linear Regression
Assumptions of MLR
• Recall $E(Y \mid X) = X\beta$
• For multiple linear regression the assumptions are still the same as the simple
linear regression
1. Linearity
2. Homoscedasticity
3. Normality (for testing)
• Thus we assume $\varepsilon \sim N(0, \sigma^2 I)$. Here, $0 = (0, 0, \ldots, 0)'$ and $\sigma^2$ is a scalar
• This implies that $Y \mid X \sim N(X\beta, \sigma^2 I)$
