STAC32 PROJECT - House Asking Price Prediction
Wenyu Huang 1001157017
Yaxin Hui 1001128594
Qiyan Lu 1002310956
Xie Xinyue 1002106151
Qiaoyu Liang 1002920627
Contents
1 Introduction
1.1 Data table description
2 Data Cleaning
3 Model Selection
3.1 Correlation
3.2 AIC Variable Selection
3.3 Model Validation
4 Conclusions
5 Discussion
1 Introduction
When a house is put on the market, the seller lists an asking price. The asking price behaves quite differently
from the list price of an ordinary retail good: sometimes buyers are unwilling to pay the asking price and
negotiate it down, and sometimes buyers compete with each other so vigorously that the sale price is pushed
above the posted list price (see https://www.sciencedirect.com/science/article/pii/S0094119016300018#sec0014).
A house is typically the largest single asset in a household's portfolio, and housing as a whole is a significant
fraction of aggregate wealth (Tracy and Schneider, 2001). It is therefore very important to set an appropriate
asking price when selling a house.
Many different factors affect house prices. When estimating the price of a house, we might consider its
location, the availability of necessities and amenities, its size, and so on. (Much similar research already
exists; one example is https://planningtank.com/real-estate/factors-influencing-house-prices.)
In our study we investigate whether four factors affect the asking price: the type of property (house, condo,
townhouse), the number of bedrooms, the number of bathrooms, and the MLS listing number. The MLS
(multiple listing service) is a suite of services that real estate brokers use to establish contractual offers of
compensation (among brokers) and to accumulate and disseminate information to enable appraisals
(https://en.wikipedia.org/wiki/Multiple_listing_service); the MLS listing number is thus simply an identifier
for the property. In the data set, bedroom counts may include halves: a "den" in an apartment, or an extra
room in a townhouse that could be used as a bedroom, counts as 0.5.
Common sense suggests that the numbers of bedrooms and bathrooms and the type of property will affect the
asking price, while the MLS listing number should not. We first build a full regression model with five
explanatory variables (the number of bedrooms, the number of bathrooms, the MLS listing number, and two
dummy variables, house and apartment, representing the property type) to estimate the asking price, and then
use backward elimination to find a reduced model that predicts the asking price best. After fitting the reduced
model, we validate it by checking the usual regression assumptions.
1.1 Data table description
# Packages used: readxl, tidyverse, mosaic, MASS, car, smmr (pairwise_median_test)
library(readxl); library(tidyverse)
setwd("~/Desktop/STAC32 Project")
houses = read_xlsx("houses.xlsx")
head(houses,10)
## # A tibble: 10 x 5
## mlslisting asking bedrooms bathrooms type
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 3573407 275000 1.50 1.00 apartment
## 2 3580709 434999 3.50 3.00 townhouse
## 3 3582848 675000 5.00 2.00 house
## 4 3585741 385000 3.50 3.00 townhouse
## 5 3599567 529000 3.50 2.00 apartment
## 6 3602510 849888 5.00 2.00 house
## 7 3607807 139900 3.00 2.00 apartment
## 8 3610979 334900 2.00 2.00 apartment
## 9 3611464 249000 3.00 2.00 townhouse
## 10 3617396 259000 5.00 2.00 townhouse
• mlslisting: a unique listing code identifying the house;
• asking: a quantitative variable, the asking price of the house;
• bedrooms: a quantitative variable, the number of bedrooms in the house;
• bathrooms: a quantitative variable, the number of bathrooms in the house;
• type: a categorical variable, the type of property; three types appear.
2 Data Cleaning
# Create k-1 = 3-1 = 2 dummy variables for the 3-level variable type
houses %>% mutate(apartment=ifelse(type=="apartment",1,0),
house = ifelse(type=="house",1,0)) -> houses.1
# Remove the categorical variable
houses.1[5] <-NULL
houses.1
## # A tibble: 54 x 6
## mlslisting asking bedrooms bathrooms apartment house
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3573407 275000 1.50 1.00 1.00 0
## 2 3580709 434999 3.50 3.00 0 0
## 3 3582848 675000 5.00 2.00 0 1.00
## 4 3585741 385000 3.50 3.00 0 0
## 5 3599567 529000 3.50 2.00 1.00 0
## 6 3602510 849888 5.00 2.00 0 1.00
## 7 3607807 139900 3.00 2.00 1.00 0
## 8 3610979 334900 2.00 2.00 1.00 0
## 9 3611464 249000 3.00 2.00 0 0
## 10 3617396 259000 5.00 2.00 0 0
## # ... with 44 more rows
# Normality check of Y
ggplot(houses.1, aes(x=asking)) + geom_histogram(bins = 8)
[Figure: histogram of asking price (8 bins); x-axis asking, y-axis count]
fit_full <- lm(asking ~ bedrooms + bathrooms + apartment + house, data = houses.1)
mplot(fit_full,which = 1)
## [[1]]
[Figure: Residuals vs Fitted for the full model; observations 5, 48 and 49 labelled]
mplot(fit_full,which = 2)
## [[1]]
[Figure: Normal Q-Q plot of standardized residuals for the full model; observations 5, 48 and 49 labelled]
Our data set contains a categorical variable called type, with three levels: apartment, townhouse and house.
To represent an attribute with three distinct levels, we created k-1 = 2 dummy variables, apartment and
house. This can be read as a truth table: when apartment is 1 and house is 0, the property is an apartment;
when apartment is 0 and house is 1, it is a house; and when both are 0, it is a townhouse. Finally, we removed
the original type column from the data set. To check the normality of Y with the two newly added dummy
variables, we created a histogram and residual plots. The histogram shows a distribution that does not look
particularly normal, but we do not need to transform the data because the sample is fairly large, with more
than 50 observations. In the first residual plot there is a vertical blank band in the middle of the plot, caused
by the two dummy variables we created; this does not threaten the analysis, because the points on either side
of it are scattered fairly randomly. The second plot, a Q-Q plot checking the normality of the residuals, shows
some points veering away from the line at both ends, but overall it still looks acceptable to us.
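As a quick sanity check of the dummy coding (a sketch, assuming the tidyverse is loaded as above), the truth
table can be recovered directly from the data:
# Each original type should map to exactly one (apartment, house) pair
houses %>%
mutate(apartment = ifelse(type == "apartment", 1, 0),
house = ifelse(type == "house", 1, 0)) %>%
distinct(type, apartment, house)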
3 Model Selection
3.1 Correlation
# Find correlations among the continuous variables.
round(cor(houses.1[c(2,1,3,4)]),2)
## asking mlslisting bedrooms bathrooms
## asking 1.00 0.17 0.76 0.52
## mlslisting 0.17 1.00 0.00 -0.04
## bedrooms 0.76 0.00 1.00 0.63
## bathrooms 0.52 -0.04 0.63 1.00
# Find if categorical variables have any relationships.
ggplot(houses,aes(y=asking,x=type))+geom_boxplot()
[Figure: side-by-side boxplots of asking price by type (apartment, house, townhouse)]
oneway.test(asking ~ type, data = houses)
##
## One-way analysis of means (not assuming equal variances)
##
## data: asking and type
## F = 138.63, num df = 2.000, denom df = 22.215, p-value = 2.828e-13
pairwise_median_test(houses, asking, type)
## # A tibble: 3 x 4
## g1 g2 p_value adj_p_value
## <chr> <chr> <dbl> <dbl>
## 1 apartment house 0.00000000140 0.00000000420
## 2 apartment townhouse 0.452 1.35
## 3 house townhouse 0.000108 0.000323
To select the model properly, we first check the correlations among the continuous variables; we do not need
to include the two dummy variables apartment and house here, since they simply represent the type of
property. In the correlation matrix, the diagonal entries are all 1, as they must be, because the correlation of
each variable with itself is always one. The correlation between asking and mlslisting is 0.17, a very weak
relationship, which makes sense because mlslisting is just a distinct identifying number for each house. The
correlations of asking with bedrooms and bathrooms are 0.76 and 0.52 respectively, which fits our prediction
that these variables are strongly related to the asking price. In this case we would prefer not to see a
relationship between the two predictors bedrooms and bathrooms themselves. Unfortunately, their correlation
of 0.63 is fairly strong, so we may need to consider removing one of them.
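One way to quantify this concern before selection (a sketch; vif() is from the car package, which we also use
later) is to compute variance inflation factors for the full model:
# VIFs for the full model; values near 1 indicate little multicollinearity
library(car)
vif(lm(asking ~ mlslisting + bedrooms + bathrooms + apartment + house, data = houses.1))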
We also need to determine whether the property type affects the asking price. A side-by-side boxplot of
asking price by type was drawn; it shows some outliers, and the spreads of the boxes are unequal. A one-way
analysis of means (Welch), which does not require equal variances, was therefore performed to test whether
the mean asking prices of the three types actually differ. Its p-value, 2.828e-13 < 0.05, leads us to reject H0
(no differences in mean asking price among the three types) in favour of Ha (at least one type differs).
Pairwise median tests were then performed for each pair. Houses differ significantly from apartments
(adjusted p = 4.20e-09 < 0.05) and from townhouses (adjusted p = 3.23e-04 < 0.05), while there is no
evidence that the asking prices of apartments and townhouses differ (adjusted p > 0.05).
In other words, going into the model selection phase, variables such as house and bathrooms (or bedrooms)
look more important than the others.
3.2 AIC Variable Selection
AIC_model = lm(asking ~ mlslisting + bedrooms + bathrooms + apartment + house, data = houses.1)
AIC_Step = stepAIC(AIC_model, direction = "both")
## Start: AIC=1237.88
## asking ~ mlslisting + bedrooms + bathrooms + apartment + house
##
## Df Sum of Sq RSS AIC
## - apartment 1 1.4416e+08 3.9058e+11 1235.9
## - mlslisting 1 2.5459e+09 3.9298e+11 1236.2
## - bedrooms 1 4.9891e+09 3.9542e+11 1236.6
## <none> 3.9043e+11 1237.9
## - bathrooms 1 3.8676e+10 4.2911e+11 1241.0
## - house 1 6.2207e+11 1.0125e+12 1287.3
##
## Step: AIC=1235.9
## asking ~ mlslisting + bedrooms + bathrooms + house
##
## Df Sum of Sq RSS AIC
## - mlslisting 1 2.5277e+09 3.9311e+11 1234.2
## - bedrooms 1 7.8893e+09 3.9847e+11 1235.0
## <none> 3.9058e+11 1235.9
## + apartment 1 1.4416e+08 3.9043e+11 1237.9
## - bathrooms 1 3.8636e+10 4.2921e+11 1239.0
## - house 1 6.6748e+11 1.0581e+12 1287.7
##
## Step: AIC=1234.25
## asking ~ bedrooms + bathrooms + house
##
## Df Sum of Sq RSS AIC
## - bedrooms 1 1.0907e+10 4.0401e+11 1233.7
## <none> 3.9311e+11 1234.2
## + mlslisting 1 2.5277e+09 3.9058e+11 1235.9
## + apartment 1 1.2594e+08 3.9298e+11 1236.2
## - bathrooms 1 3.8642e+10 4.3175e+11 1237.3
## - house 1 7.4533e+11 1.1384e+12 1289.7
##
## Step: AIC=1233.73
## asking ~ bathrooms + house
##
## Df Sum of Sq RSS AIC
## <none> 4.0401e+11 1233.7
## + bedrooms 1 1.0907e+10 3.9311e+11 1234.2
## + mlslisting 1 5.5458e+09 3.9847e+11 1235.0
## + apartment 1 4.0051e+09 4.0001e+11 1235.2
## - bathrooms 1 9.2233e+10 4.9625e+11 1242.8
## - house 1 1.5705e+12 1.9745e+12 1317.4
AIC_Step$anova
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## asking ~ mlslisting + bedrooms + bathrooms + apartment + house
##
## Final Model:
## asking ~ bathrooms + house
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 48 390434083274 1237.883
## 2 - apartment 1 144157556 49 390578240831 1235.903
## 3 - mlslisting 1 2527679477 50 393105920308 1234.251
## 4 - bedrooms 1 10907429034 51 404013349342 1233.729
The script above selects a linear model by stepwise elimination starting from the full model: the regression of
asking on all of the other variables (mlslisting, bedrooms, bathrooms, apartment and house). At each step,
the variable whose removal gives the lowest AIC is dropped, since its contribution to explaining the response
asking is too weak to justify keeping it, and the model is refitted with the remaining variables. (We used
direction = "both", so a dropped variable could in principle re-enter, although none does here.) The process
repeats until no removal lowers the AIC any further. As the trace shows, only the variables bathrooms and
house are kept in the final model.
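To make the criterion concrete, the first elimination step of the trace can be reproduced by hand (a sketch;
extractAIC() is what stepAIC() uses internally, so its numbers match the trace above):
# AIC of the full model vs. the model without apartment
extractAIC(lm(asking ~ mlslisting + bedrooms + bathrooms + apartment + house,
data = houses.1)) # 6 edf, AIC = 1237.88
extractAIC(lm(asking ~ mlslisting + bedrooms + bathrooms + house,
data = houses.1)) # 5 edf, AIC = 1235.90
# Dropping apartment lowers the AIC, so it is eliminated first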
3.3 Model Validation
# ANOVA comparison of the full and reduced models
fit_reduced = lm(asking~bathrooms + house, data = houses.1)
anova(fit_full,fit_reduced)
## Analysis of Variance Table
##
## Model 1: asking ~ bedrooms + bathrooms + apartment + house
## Model 2: asking ~ bathrooms + house
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 49 3.9298e+11
## 2 51 4.0401e+11 -2 -1.1033e+10 0.6879 0.5074
To validate our model, we use a hypothesis test of whether there is a difference between the model with the
full set of predictors and the reduced model selected in the last step. The null hypothesis H0 is that there is
no difference between the full model and the reduced model; the alternative Ha is that there is a difference.
The ANOVA table above gives the result and lets us check whether the reduced model is really preferable to
the full one. The p-value is 0.51, which is greater than 0.05, so there is no evidence of a difference between
the full and reduced models, and we therefore choose the model with fewer predictor variables.
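The F statistic in the table can also be verified by hand (a sketch using the residual sums of squares of the
two fitted models):
# Partial F-test by hand: compare the extra RSS to the full-model error variance
rss_full <- sum(resid(fit_full)^2) # 4 predictors, 49 residual df
rss_reduced <- sum(resid(fit_reduced)^2) # 2 predictors, 51 residual df
F_stat <- ((rss_reduced - rss_full) / 2) / (rss_full / 49)
c(F = F_stat, p = pf(F_stat, 2, 49, lower.tail = FALSE)) # F = 0.688, p = 0.507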
# Residual plot for y against variables in selected model
fit_reduced = lm(asking ~ bathrooms + house, data = houses.1)
attach(houses.1)
par(mfrow = c(1,2))
plot(fit_reduced$residuals~bathrooms, xlab = "bathrooms", ylab = "Residuals")
abline(h = 0)
plot(fit_reduced$residuals~house, xlab = "House", ylab = "Residuals")
abline(h = 0)
[Figure: residuals of the reduced model plotted against bathrooms (left) and House (right), each with a horizontal line at 0]
Moreover, in the script above we checked the residual plots of y against the two predictor variables
(bathrooms and house). In both plots the points are spread evenly and randomly, which is exactly what we
expected, so no transformation of the predictor variables is needed.
# Residual, normality, leverage and influential-point diagnostics
par(mfrow = c(2,2))
plot(fit_reduced)
[Figure: diagnostic plots for the reduced model - Residuals vs Fitted, Normal Q-Q and Scale-Location (observations 5, 48 and 57 labelled), and Residuals vs Leverage with Cook's distance contours at 0.5 (observations 32, 39 and 48 labelled)]
# Outlying-Y test: studentized deleted residuals vs. Bonferroni critical value
t <- rstudent(fit_reduced)
alpha <- 0.05
n <- length(asking)
p_prime = length(coef(fit_reduced))
t_crit <- qt(1-alpha/(2*n),n-p_prime-1)
round(t,2)
## 1 2 3 4 5 6 7 8 9 10 11 12
## 0.05 0.63 -0.39 0.04 2.43 1.65 -2.21 0.10 -0.88 -0.76 0.27 -0.31
## 13 14 15 16 17 18 19 20 21 22 23 24
## -1.59 0.15 -0.60 0.43 0.46 1.55 1.03 0.04 -0.23 -0.18 0.28 -0.40
## 25 26 27 28 29 30 31 32 33 34 35 36
## 0.54 -0.01 0.75 -0.18 -1.10 0.44 2.05 -1.48 -0.01 1.08 -0.75 0.35
## 37 38 39 40 41 42 43 44 45 46 47 48
## 0.35 0.33 -2.01 -0.85 0.09 1.80 1.13 0.66 -0.16 -0.94 0.40 -2.48
## 49 50 51 52 53 54
## -2.08 -0.70 -0.12 -0.11 0.79 0.53
t_crit
## [1] 3.521574
which(abs(t) > t_crit)
## named integer(0)
# Outlying-X test: hat (leverage) values
Pii <- hatvalues(fit_reduced)
round(Pii, 2)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 0.04 0.09 0.05 0.09 0.03 0.05 0.03 0.03 0.03 0.03 0.03 0.04 0.04 0.03 0.03
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 0.04 0.05 0.03 0.05 0.03 0.04 0.07 0.04 0.04 0.11 0.04 0.07 0.03 0.05 0.11
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 0.03 0.15 0.03 0.03 0.07 0.05 0.05 0.04 0.05 0.15 0.04 0.03 0.11 0.15 0.05
## 46 47 48 49 50 51 52 53 54
## 0.11 0.07 0.03 0.04 0.04 0.04 0.03 0.04 0.03
which(Pii > 0.5)
## named integer(0)
# Multicollinearity
VIF <- vif(fit_reduced)
VIF
## bathrooms house
## 1.179165 1.179165
# Both VIFs are well below 10, so the model is stable
After obtaining the reduced model, we evaluate it. We drew the residual plot, the Q-Q plot and the leverage
plot to check how well the model performs. In the first residual plot the points are scattered quite randomly,
even better than in the corresponding plot for the full model. In the Q-Q plot some points still veer away
from the line at both ends, but not enough to threaten the normality of the model. In the leverage plot no
point falls outside the Cook's distance contour at 0.5, so there are no influential points that would distort
our model, which is further good evidence of its performance. We then tested for outlying Y and X
observations. For outlying Y observations we examined the studentized deleted residual of each observation:
an observation is flagged if this residual exceeds the Bonferroni critical value t(1 − α/(2n); n − p′ − 1) in
absolute value. According to the results above, no observation Yi has a studentized deleted residual this
large. For outlying X observations we examined the hat values Pii, which measure the distance between the
X values of observation i and the centre of the X-space. A large Pii indicates that observation i is far from
the centre of all the X observations, making it a high-leverage point; we used the guideline Pii > 0.5 to flag
leverage points.
From the output, no X observation is outlying, which makes sense to us. Finally, we checked the
multicollinearity of the model: the VIFs for bathrooms and house are both 1.179, which is good because
neither exceeds 10. Therefore multicollinearity is not a problem in our model.
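With only two predictors, the VIF has a closed form, which gives a quick check of the value reported above
(a sketch): both VIFs equal 1/(1 − r²), where r is the correlation between the two predictors.
# VIF by hand for the two-predictor model
r <- with(houses.1, cor(bathrooms, house))
1 / (1 - r^2) # about 1.179, matching vif(fit_reduced) above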
4 Conclusions
summary(fit_reduced)
##
## Call:
## lm(formula = asking ~ bathrooms + house, data = houses.1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -206934 -48639 4256 44615 203066
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 214974 30825 6.974 5.98e-09 ***
## bathrooms 55480 16260 3.412 0.00127 **
## house 383486 27236 14.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89000 on 51 degrees of freedom
## Multiple R-squared: 0.8512, Adjusted R-squared: 0.8453
## F-statistic: 145.8 on 2 and 51 DF, p-value: < 2.2e-16
Following the procedure of data preparation, pre-model checking, backward AIC model selection and
post-model validation, we finally determined that the house asking price follows the linear model
asking-hat = 214974 + 55480 × bathrooms + 383486 × house.
To interpret it: among properties of the same type, the number of bathrooms is the dominant factor for
predicting the price. Across property types, for the same number of bathrooms, the price of a house is
significantly greater than that of the other types, while the prices of apartments and townhouses are similar.
The model also gives an efficient way to predict the asking price from given information. For example,
suppose an apartment with 3.5 bedrooms and 2 bathrooms is ready for sale and we want to set an asking
price. The predicted asking price for this property is
55480 × 2 + 383486 × 0 + 214974 = 325934 CAD,
which makes sense compared with previous records.
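The same prediction can be obtained from the fitted object (a sketch), which also provides a prediction
interval for the asking price:
# Predicted asking price for a 2-bathroom apartment (house = 0)
new_prop <- data.frame(bathrooms = 2, house = 0)
predict(fit_reduced, newdata = new_prop, interval = "prediction")
# fit = 325934, the same value as the hand calculation above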
5 Discussion
Although we successfully fitted a model that looks valid for predicting asking price, there are limitations.
First, we noticed that the distribution of asking price is not really normal; it is in fact bimodal. We
proceeded by assuming approximate normality because the sample size is fairly large, but we could probably
use a transformation to make the response variable look more normal (e.g. a log transformation, or another
Box-Cox transformation). We tried a Box-Cox transformation, but it did not yield an acceptably normal
histogram. We also tried transforming y to the absolute value of (asking price − mean asking price); we
rejected this approach, even though it produces a normal-looking distribution, because taking absolute values
changes the meaning of our response variable (asking). Owing to these limitations, we did not successfully
transform the response variable; this is an area needing further investigation.
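For reference, a minimal sketch of the Box-Cox check we attempted (boxcox() is in the MASS package and
requires a positive response):
# Profile log-likelihood over candidate power transformations of asking
library(MASS)
bc <- boxcox(asking ~ bathrooms + house, data = houses.1,
lambda = seq(-2, 2, 0.05))
bc$x[which.max(bc$y)] # lambda near 0 suggests log(asking); near 1, no transform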
Furthermore, the model fits only our particular data set, and the factors that actually determine an asking
price go far beyond these variables: the asking price will also be affected by interest rates, yard size, money
spent on renovations, and so on. Still, our method gives a preview of how to perform a statistical analysis of
a data set, which is genuinely helpful both academically and in a future career.
