Interactive Study of Second-hand Housing Based on Machine Learning and GAM Model in Beijing

【Author】 Luo Dan (罗丹)

【Advisor】 Li Ming (李明)

【Author information】 Taiyuan University of Technology, Statistics, 2017, Master's thesis

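【Summary】 In recent years China's economy has grown rapidly and living standards have risen steadily. This has also stimulated demand for investment, and real estate has become a major investment target, which in turn has pushed house prices up. Since the 2008 economic crisis in particular, house prices in Beijing have soared, with some homes costing more than 100,000 RMB per square meter, and housing pressure in Beijing is severe. As of May 2016, second-hand homes accounted for as much as 80% of market transactions in Beijing, and second-hand prices had risen several-fold within just a few years. To find a model well suited to studying differences in Beijing second-hand house prices, and to observe how the factors that affect prices produce those differences, this thesis uses data on 16,210 second-hand homes in six urban districts of Beijing from May 2016. K-means clustering is first used to analyze house types. Six methods are then applied: ordinary least squares (OLS) linear regression, log-OLS, K-nearest neighbor (KNN) regression, log-KNN regression, the nonlinear generalized additive model (GAM), and log-GAM. Each is studied both with and without interaction terms among the collected predictors, giving twelve models, and a stability method is used to find the optimal model. Finally, the OLS, log-OLS, GAM, and log-GAM models are fitted for analysis. The results show that the homes fall into four types: location-type, suburban, mass-market, and large homes. In terms of generalization, log-KNN regression is optimal without interactions, the log-GAM model is optimal with interactions, and the log-GAM model is the best of the twelve models. In terms of interpretation, the GAM models reveal complex nonlinear relationships between the continuous predictors and house price, whether or not interaction terms are included and whether or not price is log-transformed. In terms of goodness of fit, the log-GAM model with interactions fits best. Models with interactions predict better than models without them, and interaction effects exist among several predictors. Studying these interaction effects yields useful information: for example, the linear model with interactions shows that the effect of subway access on price is larger in Haidian district than in Xicheng district, which indicates that subway access raises second-hand prices faster in Haidian than in Xicheng. The conclusion is that nonparametric models with interactions are better suited to the study of second-hand housing, that continuous variables affect price nonlinearly, and that interaction effects exist among several variables. This thesis studies cross-sectional price differences at a single point in time. The purpose of building a better model is to give home buyers an objective reference when making decisions: prices derived from a large sample of Beijing second-hand homes are a more objective and reliable reference than a comparison of two or three listings, so the resulting decisions are more rational.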

【Abstract】 China's economy has developed rapidly in recent years and living standards have improved, which has also stimulated investment demand. Real estate has become an important investment target, driving house prices up. House prices in Beijing in particular have risen sharply since the 2008 economic crisis, and some homes now cost more than 100,000 RMB per square meter. By May 2016, second-hand homes accounted for as much as 80% of housing transactions in Beijing, and second-hand prices had risen several-fold in just a few years. To find models suited to studying Beijing second-hand house prices and to identify how the factors that influence prices produce price differences, we study a data set of 16,210 second-hand homes in six urban districts of Beijing from May 2016. First, we use K-means clustering to analyze house types. We then build regression models with six methods, namely ordinary least squares (OLS), log-OLS, K-nearest neighbor (KNN) regression, log-KNN regression, generalized additive models (GAM), and log-GAM, each with and without interaction terms, and search for the optimal model. Clustering divides the homes into four types: location-type, suburb-type, standard homes, and large homes. In terms of prediction accuracy, log-KNN regression is optimal without interactions, and the log-GAM model is optimal with interactions and is also the best of the twelve models. In terms of model interpretation, the GAM models reveal complex nonlinear relationships between the continuous explanatory variables and house price. Models with interactions are better overall than models without them, implying that interaction effects exist among the predictors. For example, we find that the influence of subway access on house price is larger in Haidian district than in Xicheng district; that is, subway access raises second-hand house prices faster in Haidian than in Xicheng. In conclusion, nonparametric models with interactions are more suitable for studying the second-hand housing market: continuous explanatory variables have nonlinear effects on house price, and interaction effects exist among some explanatory variables. We study cross-sectional price differences at a single point in time. The purpose of building a better model is to give home buyers an objective reference when making decisions, because prices derived from a large sample of Beijing second-hand homes are more objective and reliable than a comparison of two or three listings, and so lead to more rational decisions.
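
The role of an interaction term in the log-OLS model can be illustrated with a small sketch. It is not the thesis's code or data: the eight prices, the two districts labeled A and B, and every function name below are assumptions invented for illustration. The sketch fits log price on a district indicator and a subway indicator, once with main effects only and once with their product added, so that the subway premium is allowed to differ by district.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def design(district, subway, interaction):
    """Design row: intercept, district, subway, and optionally their product."""
    row = [1.0, float(district), float(subway)]
    if interaction:
        row.append(float(district * subway))
    return row

# Hypothetical prices in RMB per square meter, invented for illustration only.
# district: 1 = district A, 0 = district B; subway: 1 = near a station.
listings = [
    (0, 0, 80000.0), (0, 0, 82000.0), (0, 1, 84000.0), (0, 1, 86000.0),
    (1, 0, 60000.0), (1, 0, 62000.0), (1, 1, 72000.0), (1, 1, 74000.0),
]
log_price = [math.log(p) for _, _, p in listings]

def fit_and_score(interaction):
    """Fit the log-OLS model and return (coefficients, residual sum of squares)."""
    rows = [design(d, s, interaction) for d, s, _ in listings]
    beta = fit_ols(rows, log_price)
    resid = [t - sum(b * x for b, x in zip(beta, r))
             for r, t in zip(rows, log_price)]
    return beta, sum(e * e for e in resid)

beta_main, sse_main = fit_and_score(False)
beta_int, sse_int = fit_and_score(True)

print("subway premium, district B (log scale):", beta_int[2])
print("extra subway premium in district A:   ", beta_int[3])
print("residual SS without / with interaction:", sse_main, sse_int)
```

In this toy data set the interaction coefficient is positive, meaning the subway premium is larger in the hypothetical district A than in district B, which mirrors the kind of Haidian versus Xicheng comparison the abstract describes. The thesis's nonparametric models (KNN, GAM) would need a dedicated library and are not sketched here.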
