STAC32 PROJECT - House Asking Price Prediction

Wenyu Huang 1001157017; Yaxin Hui 1001128594; Qiyan Lu 1002310956; Xie Xinyue 1002106151; Qiaoyu Liang 1002920627

Contents

1 Introduction 2
1.1 Data table description 2
2 Data Clean 3
3 Model Selection 6
3.1 Correlation 6
3.2 AIC Variable Selection 8
3.3 Model Validation 9
4 Conclusions 12
5 Discussion 13

1 Introduction

When a house is put on the market, the seller lists an asking price. The asking price is quite different from the list prices of ordinary retail goods: sometimes buyers are unwilling to pay the asking price and negotiate it down, and sometimes buyers compete with each other so vigorously that the sale price is pushed beyond the posted list price (see https://www.sciencedirect.com/science/article/pii/S0094119016300018#sec0014). A house is typically the largest single asset in a household's portfolio, and housing as a whole is a significant fraction of aggregate wealth (Tracy and Schneider, 2001). It is therefore very important to set an appropriate asking price when selling a house. Many different factors affect house prices: when estimating the price of a house, we may consider its location, the availability of necessities and amenities, its size, and so on. (A good deal of similar research has already been done; one example is https://planningtank.com/real-estate/factors-influencing-house-prices.)
In our study we try to figure out whether four factors, the type of property (house, condo, townhouse), the number of bedrooms, the number of bathrooms, and the MLS listing number, affect the asking price. The MLS (multiple listing service) is a suite of services that real estate brokers use to establish contractual offers of compensation (among brokers) and to accumulate and disseminate information to enable appraisals (see https://en.wikipedia.org/wiki/Multiple_listing_service); the MLS listing number is simply a number identifying the property. In the data set the number of bedrooms can include halves: a "den" in an apartment, or an extra room in a townhouse that could be used as a bedroom, counts as 0.5. Common sense suggests that the numbers of bedrooms and bathrooms, plus the type of house, will affect the asking price, while the MLS listing number should not. We first build a full regression model with five variables: the number of bedrooms, the number of bathrooms, the MLS listing number, and the type of house (represented by two dummy variables, house and apartment), to estimate the asking price. We then use backward elimination to find the reduced model that predicts the asking price best. After fitting the reduced model, we validate it by checking the usual regression assumptions.

1.1 Data table description

# Packages used throughout the report (smmr provides pairwise_median_test,
# mosaic provides mplot, MASS provides stepAIC, car provides vif).
library(tidyverse)
library(readxl)
library(MASS)
library(mosaic)
library(car)
library(smmr)

setwd("~/Desktop/STAC32 Project")
houses = read_xlsx("houses.xlsx")

## Warning in strptime(x, format, tz = tz): unknown timezone 'zone/tz/2018e.
## 1.0/zoneinfo/America/Toronto'

head(houses, 10)

## # A tibble: 10 x 5
##    mlslisting asking bedrooms bathrooms type
##         <dbl>  <dbl>    <dbl>     <dbl> <chr>
##  1    3573407 275000     1.50      1.00 apartment
##  2    3580709 434999     3.50      3.00 townhouse
##  3    3582848 675000     5.00      2.00 house
##  4    3585741 385000     3.50      3.00 townhouse
##  5    3599567 529000     3.50      2.00 apartment
##  6    3602510 849888     5.00      2.00 house
##  7    3607807 139900     3.00      2.00 apartment
##  8    3610979 334900     2.00      2.00 apartment
##  9    3611464 249000     3.00      2.00 townhouse
## 10    3617396 259000     5.00      2.00 townhouse

• mlslisting: a unique listing code identifying the house;
• asking: a quantitative variable, the asking price of the house;
• bedrooms: a quantitative variable, the number of bedrooms in the house;
• bathrooms: a quantitative variable, the number of bathrooms in the house;
• type: a categorical variable, the type of house; three types are listed.

2 Data Clean

# Create k - 1 = 2 dummy variables for the 3-level categorical variable type.
houses %>% mutate(apartment = ifelse(type == "apartment", 1, 0),
                  house = ifelse(type == "house", 1, 0)) -> houses.1
# Remove the original categorical variable
houses.1[5] <- NULL
houses.1

## # A tibble: 54 x 6
##    mlslisting asking bedrooms bathrooms apartment house
##         <dbl>  <dbl>    <dbl>     <dbl>     <dbl> <dbl>
##  1    3573407 275000     1.50      1.00      1.00     0
##  2    3580709 434999     3.50      3.00      0        0
##  3    3582848 675000     5.00      2.00      0        1.00
##  4    3585741 385000     3.50      3.00      0        0
##  5    3599567 529000     3.50      2.00      1.00     0
##  6    3602510 849888     5.00      2.00      0        1.00
##  7    3607807 139900     3.00      2.00      1.00     0
##  8    3610979 334900     2.00      2.00      1.00     0
##  9    3611464 249000     3.00      2.00      0        0
## 10    3617396 259000     5.00      2.00      0        0
## # ... with 44 more rows

# Normality check of Y
ggplot(houses.1, aes(x = asking)) + geom_histogram(bins = 8)

[Figure: histogram of asking (8 bins), y-axis count.]

fit_full <- lm(asking ~ bedrooms + bathrooms + apartment + house, data = houses.1)
mplot(fit_full, which = 1)

## [[1]]

[Figure: Residuals vs Fitted plot for fit_full; observations 5, 48 and 49 flagged.]

mplot(fit_full, which = 2)

## [[1]]

[Figure: Normal Q-Q plot of standardized residuals for fit_full; observations 5, 48 and 49 flagged.]

In our data set there is a categorical variable called type, which contains three different types of house: apartment, townhouse and house. To represent this attribute with three distinct levels, we created k - 1 = 2 dummy variables, apartment and house. This can be thought of as a truth table: when apartment is 1 and house is 0, the building is an apartment; when apartment is 0 and house is 1, it is a house; and when both are 0, it is a townhouse. Finally, we removed the original attribute type from the data set.

To check the normality of Y, we created a histogram and residual plots. The histogram shows that the distribution does not look very normal, but we do not transform the data, because the data set is fairly large, with more than 50 observations. In the first residual plot there is a blank space without any points in the middle, caused by the two dummy variables we created; this does not affect the normality of the data, because most points are randomly distributed on the two sides of the plot.
On the second plot, a Q-Q plot used to check the normality of the model, some points veer away from the line at both ends, but as an overall view this still looks acceptable to us.

3 Model Selection

3.1 Correlation

# Find correlations among the continuous variables.
round(cor(houses.1[c(2, 1, 3, 4)]), 2)

##            asking mlslisting bedrooms bathrooms
## asking       1.00       0.17     0.76      0.52
## mlslisting   0.17       1.00     0.00     -0.04
## bedrooms     0.76       0.00     1.00      0.63
## bathrooms    0.52      -0.04     0.63      1.00

# See whether the categorical variable is related to the asking price.
ggplot(houses, aes(y = asking, x = type)) + geom_boxplot()

[Figure: side-by-side boxplots of asking by type (apartment, house, townhouse).]

oneway.test(asking ~ type, data = houses)

##
## One-way analysis of means (not assuming equal variances)
##
## data: asking and type
## F = 138.63, num df = 2.000, denom df = 22.215, p-value = 2.828e-13

pairwise_median_test(houses, asking, type)

## # A tibble: 3 x 4
##   g1        g2              p_value   adj_p_value
##   <chr>     <chr>             <dbl>         <dbl>
## 1 apartment house     0.00000000140 0.00000000420
## 2 apartment townhouse 0.452         1.35
## 3 house     townhouse 0.000108      0.000323

To select the model properly, we first check the correlations among the continuous variables; we do not need to include the two attributes apartment and house because they are dummy variables representing the type of house. In the correlation matrix, the diagonal entries are all 1, as expected, since the correlation of each variable with itself is always one. The correlation between asking and mlslisting is 0.17, a very weak relationship, because mlslisting is just a random distinct integer identifying each house. The correlations of asking with bedrooms and bathrooms are 0.76 and 0.52 respectively, which matches our prediction that these relationships are strong.
We would not, however, like to see a relationship between the two predictor variables bedrooms and bathrooms. Unfortunately, as the matrix shows, the correlation between them is a fairly strong 0.63, so we may consider removing one of these two variables. We also need to determine whether the house type affects the asking price. A side-by-side boxplot of asking price by house type was drawn; it shows some outliers, and the boxplots are not evenly distributed. A Welch one-way ANOVA, which does not require the assumption of equal variances, was therefore performed to see whether the asking prices for the three types of house actually differ. The p-value of 2.828e-13 < 0.05, so we reject H0 (there are no differences in average asking price among the three types) and conclude Ha: at least one of the three types has a different average asking price. Pairwise median tests were then performed for each pair. House differs significantly from both apartment and townhouse (p = 1.401556e-09 < 0.05 and p = 1.075112e-04 < 0.05 respectively), while there is no evidence that the average asking prices for apartment and townhouse differ. In other words, in our model selection phase, variables like house and bathrooms (or bedrooms) are more important than the other variables.
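The adj_p_value column in the pairwise output above appears to be a plain Bonferroni adjustment: each raw p-value multiplied by the number of comparisons (three pairs), without capping at 1, which is why the apartment-townhouse value exceeds 1. A quick arithmetic sketch reproduces the column, up to the rounding of the displayed raw p-values:

```r
# Raw pairwise p-values copied from the output above:
# apartment-house, apartment-townhouse, house-townhouse.
raw_p <- c(1.40e-9, 0.452, 1.08e-4)

# Bonferroni adjustment: multiply each raw p-value by the number of pairs.
# (Not capped at 1 here, matching the 1.35 shown above; base R's
# p.adjust(raw_p, "bonferroni") would cap the second value at 1.)
adj_p <- raw_p * length(raw_p)
adj_p  # 4.20e-9, 1.356, 3.24e-4
```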
3.2 AIC Variable Selection

AIC_model = lm(asking ~ mlslisting + bedrooms + bathrooms + apartment + house,
               data = houses.1)
AIC_Step = stepAIC(AIC_model, direction = "both")

## Start: AIC=1237.88
## asking ~ mlslisting + bedrooms + bathrooms + apartment + house
##
##              Df  Sum of Sq        RSS    AIC
## - apartment   1 1.4416e+08 3.9058e+11 1235.9
## - mlslisting  1 2.5459e+09 3.9298e+11 1236.2
## - bedrooms    1 4.9891e+09 3.9542e+11 1236.6
## <none>                     3.9043e+11 1237.9
## - bathrooms   1 3.8676e+10 4.2911e+11 1241.0
## - house       1 6.2207e+11 1.0125e+12 1287.3
##
## Step: AIC=1235.9
## asking ~ mlslisting + bedrooms + bathrooms + house
##
##              Df  Sum of Sq        RSS    AIC
## - mlslisting  1 2.5277e+09 3.9311e+11 1234.2
## - bedrooms    1 7.8893e+09 3.9847e+11 1235.0
## <none>                     3.9058e+11 1235.9
## + apartment   1 1.4416e+08 3.9043e+11 1237.9
## - bathrooms   1 3.8636e+10 4.2921e+11 1239.0
## - house       1 6.6748e+11 1.0581e+12 1287.7
##
## Step: AIC=1234.25
## asking ~ bedrooms + bathrooms + house
##
##              Df  Sum of Sq        RSS    AIC
## - bedrooms    1 1.0907e+10 4.0401e+11 1233.7
## <none>                     3.9311e+11 1234.2
## + mlslisting  1 2.5277e+09 3.9058e+11 1235.9
## + apartment   1 1.2594e+08 3.9298e+11 1236.2
## - bathrooms   1 3.8642e+10 4.3175e+11 1237.3
## - house       1 7.4533e+11 1.1384e+12 1289.7
##
## Step: AIC=1233.73
## asking ~ bathrooms + house
##
##              Df  Sum of Sq        RSS    AIC
## <none>                     4.0401e+11 1233.7
## + bedrooms    1 1.0907e+10 3.9311e+11 1234.2
## + mlslisting  1 5.5458e+09 3.9847e+11 1235.0
## + apartment   1 4.0051e+09 4.0001e+11 1235.2
## - bathrooms   1 9.2233e+10 4.9625e+11 1242.8
## - house       1 1.5705e+12 1.9745e+12 1317.4

AIC_Step$anova

## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## asking ~ mlslisting + bedrooms + bathrooms + apartment + house
##
## Final Model:
## asking ~ bathrooms + house
##
##
##           Step Df    Deviance Resid. Df   Resid. Dev      AIC
## 1                                     48 390434083274 1237.883
## 2  - apartment  1   144157556        49 390578240831 1235.903
## 3 - mlslisting  1  2527679477        50 393105920308 1234.251
## 4   - bedrooms  1 10907429034        51 404013349342 1233.729

In the script above we select a linear model using a backward-elimination approach. The process starts with the full model: a linear model of asking on all the other predictor variables (mlslisting, bedrooms, bathrooms, apartment and house). At each step we drop the variable whose removal gives the lowest AIC, since removing it loses the least information about the response variable asking and indicates it has only a very weak linear relationship with the response; with direction = "both", a dropped variable may also re-enter the model, although here none did. We then refit the model with the remaining variables and repeat until no change lowers the AIC. As we can see, the variables bathrooms and house are kept in the final model.

3.3 Model Validation

# ANOVA comparison of the full model and the reduced model
fit_reduced = lm(asking ~ bathrooms + house, data = houses.1)
anova(fit_full, fit_reduced)

## Analysis of Variance Table
##
## Model 1: asking ~ bedrooms + bathrooms + apartment + house
## Model 2: asking ~ bathrooms + house
##   Res.Df        RSS Df   Sum of Sq      F Pr(>F)
## 1     49 3.9298e+11
## 2     51 4.0401e+11 -2 -1.1033e+10 0.6879 0.5074

To validate our model, we use a hypothesis test of whether there is a difference between the model with all predictor variables and the model selected in the last step. The null hypothesis H0 is that there is no difference between the full model and the reduced model; the alternative Ha is that there is a difference between the two models. We use the ANOVA table above to check whether the reduced model predicts as well as the model with all variables.
From the ANOVA output, the p-value is 0.5074, greater than 0.05, which means there is no significant difference between the full model and the reduced model; we therefore choose the model with fewer predictor variables.

# Residual plots of y against the variables in the selected model
fit_reduced = lm(asking ~ bathrooms + house, data = houses.1)
attach(houses.1)
par(mfrow = c(1, 2))
plot(fit_reduced$residuals ~ bathrooms, xlab = "bathrooms", ylab = "Residuals")
abline(h = 0)
plot(fit_reduced$residuals ~ house, xlab = "House", ylab = "Residuals")
abline(h = 0)

[Figure: residuals of the reduced model plotted against bathrooms and against house.]

In the script above we checked the residual plots of y against the two predictor variables (bathrooms and house). In both plots the points are evenly and randomly distributed, which is exactly what we expected, so we do not need any transformation of the predictor variables.

# Residual, normality, leverage and influential-point checks
par(mfrow = c(2, 2))
plot(fit_reduced)

[Figure: diagnostic plots for fit_reduced: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours at 0.5; observations 5, 7 and 48 flagged in the first three panels, and 32, 39 and 48 in the last.]

# Outlier-in-Y test
t <- rstudent(fit_reduced)
alpha <- 0.05
n <- length(asking)
p_prime = length(coef(fit_reduced))
t_crit <- qt(1 - alpha/(2*n), n - p_prime - 1)
round(t, 2)

##     1     2     3     4     5     6     7     8     9    10    11    12
##  0.05  0.63 -0.39  0.04  2.43  1.65 -2.21  0.10 -0.88 -0.76  0.27 -0.31
##    13    14    15    16    17    18    19    20    21    22    23    24
## -1.59  0.15 -0.60  0.43  0.46  1.55  1.03  0.04 -0.23 -0.18  0.28 -0.40
##    25    26    27    28    29    30    31    32    33    34    35    36
##  0.54 -0.01  0.75 -0.18 -1.10  0.44  2.05 -1.48 -0.01  1.08 -0.75  0.35
##    37    38    39    40    41    42    43    44    45    46    47    48
##  0.35  0.33 -2.01 -0.85  0.09  1.80  1.13  0.66 -0.16 -0.94  0.40 -2.48
##    49    50    51    52    53    54
## -2.08 -0.70 -0.12 -0.11  0.79  0.53

t_crit

## [1] 3.521574

which(abs(t) > t_crit)

## named integer(0)

# Outlier-in-X test
Pii <- hatvalues(fit_reduced)
round(Pii, 2)

##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 0.04 0.09 0.05 0.09 0.03 0.05 0.03 0.03 0.03 0.03 0.03 0.04 0.04 0.03 0.03
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30
## 0.04 0.05 0.03 0.05 0.03 0.04 0.07 0.04 0.04 0.11 0.04 0.07 0.03 0.05 0.11
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45
## 0.03 0.15 0.03 0.03 0.07 0.05 0.05 0.04 0.05 0.15 0.04 0.03 0.11 0.15 0.05
##   46   47   48   49   50   51   52   53   54
## 0.11 0.07 0.03 0.04 0.04 0.04 0.03 0.04 0.03

which(Pii > 0.5)

## named integer(0)

# Multicollinearity
VIF <- vif(fit_reduced)
VIF

## bathrooms     house
##  1.179165  1.179165

# The model is stable since each VIF is smaller than 10.

After obtaining the reduced model, we evaluate it. We drew the residual plot, the Q-Q plot and the leverage plot to check how well the model performs. In the first residual plot, the scatter of points is quite random, even better than in the residual plot of the model with all variables.
From the second plot, although some points still veer away from the line at both ends, this does not seriously affect the normality of the model. From the leverage plot, no points fall outside a Cook's distance of 0.5, which means there are no influential points that would affect our model; this is also good evidence of the model's performance.

We then tested for outlying Y and X observations. To test for outlying Y observations, we checked the studentized deleted residual of each observation: if its absolute value exceeds the critical value t(1 - alpha/(2n); n - p' - 1), we consider that observation an outlying Y observation. According to the studentized deleted residual test above, no observation has a large studentized deleted residual. To test for outlying X observations, we checked Pii, a measure of the distance between the X values of observation i and the centre of the X-space. A large Pii indicates that observation i is far from the centre of all the X observations, which we would consider a leverage point; our guideline for a leverage point is Pii > 0.5. From the output, there are no outlying X observations, which makes sense to us.

Finally, we checked the multicollinearity of the model. The VIF values for bathrooms and house are both 1.179, which is good because neither exceeds 10; therefore we have no multicollinearity problem in our model.

4 Conclusions

summary(fit_reduced)

##
## Call:
## lm(formula = asking ~ bathrooms + house, data = houses.1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -206934  -48639    4256   44615  203066
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   214974      30825   6.974 5.98e-09 ***
## bathrooms      55480      16260   3.412  0.00127 **
## house         383486      27236  14.080  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89000 on 51 degrees of freedom
## Multiple R-squared: 0.8512, Adjusted R-squared: 0.8453
## F-statistic: 145.8 on 2 and 51 DF, p-value: < 2.2e-16

After following the procedure of data cleaning, pre-model checks, AIC backward model selection and post-model validation, we finally determined that the house asking price follows the linear model

asking-hat = 214974 + 55480 x bathrooms + 383486 x house.

For further explanation: among properties of the same type, the number of bathrooms is the dominant factor in predicting price. Across different types of property with the same number of bathrooms, the price of a house is significantly greater than that of the other types, while apartments and townhouses are priced similarly. The model is also an efficient way to predict the asking price from given information. For example, suppose an apartment with 3.5 bedrooms and 2 bathrooms is ready for sale and we want to set an asking price. The predicted asking price for this property would be 214974 + 55480 x 2 + 383486 x 0 = 325934 CAD, which makes sense compared with previous records.

5 Discussion

Although we successfully fit a model that looks valid for asking-price prediction, it has limitations. First, the distribution of the asking price is not really normal enough; it is actually bimodal. We assumed approximate normality directly because the sample size is large, but we could probably use a transformation to make the response variable look more normal (e.g. a log transformation or another Box-Cox transformation). We tried a Box-Cox transformation, but it did not return an acceptably normal histogram.
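The Box-Cox check just mentioned can be sketched as follows. Since the original data are not reproduced here, this is a minimal illustration on synthetic bimodal, strictly positive data standing in for the asking prices (boxcox comes from the MASS package and requires y > 0; the means and standard deviations below are hypothetical):

```r
library(MASS)  # provides boxcox()

# Synthetic stand-in for the asking prices: bimodal and strictly positive
# (hypothetical values; in practice houses.1$asking would be used).
set.seed(32)
y <- c(rnorm(30, mean = 300000, sd = 50000),
       rnorm(24, mean = 700000, sd = 80000))

# Profile the Box-Cox log-likelihood over a grid of lambda values
# and pick the maximizer.
bc <- boxcox(lm(y ~ 1), lambda = seq(-2, 2, 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

Note that a power transformation rescales the response monotonically, so it cannot merge two modes into one; this is consistent with our finding that Box-Cox did not yield an acceptably normal histogram.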
We also tried transforming y as the absolute deviation |asking - mean(asking)|. This method was rejected: although it returns a roughly normal distribution for the asking price, taking the absolute value changes the meaning of our response variable. Because of these limitations, we did not successfully transform the response variable; this is an area needing further investigation. Also, the model fits only our particular data set, and the factors that actually determine the asking price go far beyond these variables: for example, the asking price will also be affected by interest rates, yard size, money spent on decoration, and so on. Nevertheless, our method gives a preview of how to perform a statistical analysis of a data set, which is really helpful both academically and in a future career.