Neural Network Modeling of Imbalanced Missing Data and Its Application (非均衡缺失数据的神经网络建模及其应用)

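【Author】 Zhou Xinmin (周新民)

【Advisor】 Zhu Jianjun (朱建军)

【Author information】 Nanjing University of Aeronautics and Astronautics, Master of Engineering (professional degree), 2018, Master's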

【Abstract】 Classifying imbalanced data sets with missing values has long been a focus of data analysis research. When traditional classification methods are applied to such data, the classification accuracy for the minority class usually stays low, whereas preprocessing the imbalanced missing data and improving the classification model can effectively raise minority-class accuracy. Managing customer default risk at online lending companies involves exactly this kind of classification problem, and predicting, preventing, and controlling that risk has always been a priority in the field. This thesis addresses neural network modeling of imbalanced missing data and its application, and makes the following contributions:

(1) To address the shortcomings of existing missing-data recovery methods, a kNN-DBSCAN method is proposed for recovering missing data. Missing values are commonly filled with the mean or by the nearest-neighbor method, but few methods approach recovery from the perspective of how the data are distributed in high-dimensional space. A recovery method based on density clustering and the nearest-neighbor method is therefore proposed, and experiments demonstrate its effectiveness.

(2) To address the shortcomings of SMOTE (Synthetic Minority Over-sampling Technique) and its improved variants, a K-means-based improvement of SMOTE is proposed. Classification on imbalanced data sets is mainly tackled by preprocessing the data and by improving the classification model. After analyzing existing oversampling methods for imbalanced data, the thesis proposes a K-means-based SMOTE method that prevents samples carrying the characteristics of one class from being assigned to the other class. Experiments on UCI data sets demonstrate its effectiveness.

(3) Based on the characteristics of neural network and XGBOOST classification models, an XGBOOST-based neural network classification model is proposed. Because no single classification algorithm completely outperforms all others in both classification accuracy and stability, the characteristics of the XGBOOST and neural network models are analyzed on UCI data sets, and an XGBOOST-based neural network classification model is proposed. Experiments show that the combined model outperforms either single model in accuracy and generalization performance.

(4) Taking customer credit risk prediction at the online lending company Rong360 (融360) as the motivating application, the thesis studies customer credit risk prediction with imbalanced missing data and applies the proposed algorithms to predict customer credit risk, improving the ability to identify defaulting customers. It examines the practical problems Rong360 faces in customer risk assessment and analyzes the weaknesses of its customer credit evaluation; constructs a customer credit risk assessment index system from the user data Rong360 collects in the course of business; and applies the proposed data preprocessing and combined classification model to the real problem. Starting from key indicators, it analyzes the probability of customer default and, based on the experimental results, builds user profiles describing the typical characteristics of the two classes of users.
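The abstract does not spell out how K-means and SMOTE are combined in contribution (2). As one possible reading, the following is a minimal sketch that assumes the minority samples are first grouped with K-means and that each synthetic sample is interpolated only between a sample and one of its nearest neighbors in the same cluster, so that no synthetic point falls in the gap between two separate minority regions. The function names (`sq_dist`, `kmeans`, `cluster_smote`), the parameters and their defaults, and the farthest-first initialization are illustrative assumptions, not details taken from the thesis.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Assumption: deterministic farthest-first initialization, starting
    # from the first point, so results do not depend on a random seed.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_smote(minority, n_new, k=2, n_neighbors=3, seed=0):
    """Create n_new synthetic minority samples, interpolating within clusters only."""
    rng = random.Random(seed)
    labels = kmeans(minority, k)
    clusters = {}
    for p, lab in zip(minority, labels):
        clusters.setdefault(lab, []).append(p)
    # Only clusters with at least two points can supply an interpolation pair.
    usable = [members for members in clusters.values() if len(members) >= 2]
    if not usable:
        raise ValueError("need at least one cluster with two or more samples")
    synthetic = []
    for _ in range(n_new):
        members = rng.choice(usable)  # clusters are picked uniformly at random
        i = rng.randrange(len(members))
        base = members[i]
        # Candidate partners: the nearest neighbors of `base` inside its own cluster.
        others = sorted(
            (members[j] for j in range(len(members)) if j != i),
            key=lambda q: sq_dist(base, q),
        )
        partner = rng.choice(others[:n_neighbors])
        t = rng.random()  # position along the segment from base to partner
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(base, partner)))
    return synthetic
```

Calling `cluster_smote(minority, n_new=50, k=2)` on a list of minority-class feature tuples returns 50 new tuples, each lying on a segment between two samples of the same cluster. Unrestricted SMOTE would also allow a pair drawn from two different minority regions, which can place synthetic minority points inside majority-class territory.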

  • 【Classification code】TP18; TP311.13
  • 【Citations】3
  • 【Downloads】142