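(This post is not the complete version, and the images failed to load. For the complete version, see 外文翻译:基于大数据的电力用户信用评级评估关键技术研究, "Foreign Literature Translation: Research on Key Technologies for Power User Credit Rating Evaluation Based on Big Data".)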
ABSTRACT
As reform deepens, power supply companies have entered the era of the market economy, and the pressure to operate well keeps growing. However, problems such as malicious default, evasion of electricity bills, and electricity fraud remain unsolved. Effectively avoiding arrearage risk and improving economic returns is key for these enterprises. This paper studies credit rating evaluation of power users based on big data. The proposed system introduces a new algorithm model that combines the traditional k-Nearest-Neighbor (KNN) algorithm with simulated annealing (SA), and uses a big data software framework and data processing technology to rapidly construct user portraits and to carry out user rating, credit rating division, and evaluation. Finally, the results are presented, laying a foundation for future research.
Key words: k-Nearest Neighbor; Simulated annealing algorithm; Credit rating; Big data
With the deepening reform of power enterprises, they have entered the era of the market economy. However, problems such as malicious default, evasion of electricity bills, and electricity fraud still plague them. It is therefore urgent to set up a credit evaluation system suited to the electricity market, which can be applied to customer credit rating and risk control management to prevent transaction risk [1]. In the era of big data, precise marketing and a basis for scientific decision-making can be achieved by accurately segmenting customers' consumption habits and behaviors, extracting effective information from big data, and building portraits (specific labels) of abstract customers [2]. Data on customers' identity attributes (gender, age, profession, residence), behavior characteristics (payment frequency and payment method), and business channels are classified and analyzed to derive several behavior labels for customer payment. These behavior labels, serving as the portraits of different customers, are then used with appropriate weight ratios in user credit rating evaluation by applying big data algorithms.
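A customer portrait turns abstract data into a fictitious persona carrying labeled information that distinguishes it from other portraits. It can therefore represent a customer group, whose demand is judged from the group's preferences, and thus communicate information. The model is analyzed on the basis of customer payment behavior in big data, and the specific objectives are as follows.
(1) Customer attributes: name (replaced by a number), gender, residential area, income (stable or unstable), voltage level, power consumption, and so on.
(2) Customer payment habits: arrears, whether arrears are paid off in time, payment time, and payment method (business hall, manual collection, and so on).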
Building customer portraits requires analyzing and classifying each customer, which relies on big data. Big data technology refers to selecting, integrating, and analyzing large amounts of information for automatic customer segmentation and portrait construction, yielding conclusive information to guide decision-making. With the rollout of the SG 186 power marketing information system, that system has come into wide use in daily marketing [3]. From registration to accounting statistics of electricity charges, and from customer payment to verification and reconciliation by the accounting department, every customer, every charge calculation, and every charge collection can be translated into data and stored in the system. Diverse data, such as customer information, electricity consumption information, and payment information, are interwoven and grow constantly over time. This is called "big data in the power marketing system". In addition to the SG 186 database, the business department can obtain electricity consumption information as a data source through market research, site inspection, and simple statistical analysis of business data.
Before data mining, the original data generally need to be preprocessed. Data in an incorrect format are corrected to prepare for mining; preprocessing includes noise elimination, calculation of default values for missing entries, duplicate elimination, data type conversion, and dimension reduction. The data processing link is as follows.
Figure 1 Data processing link
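The preprocessing steps listed above can be illustrated with a minimal sketch. The field names (`user_no`, `month`, `amount`), the rule that negative amounts count as noise, and the use of the median as the default value are assumptions made for this example, not details given in the paper; dimension reduction is left out.

```python
from statistics import median

def preprocess(records):
    """Clean raw payment records given as dicts with hypothetical fields."""
    cleaned = []
    for rec in records:
        # Type conversion: amounts arrive as strings; unparsable ones become None.
        try:
            amount = float(rec["amount"])
        except (TypeError, ValueError):
            amount = None
        # Noise elimination (assumed rule): a negative charge is treated as noise.
        if amount is not None and amount < 0:
            amount = None
        cleaned.append({
            "user_no": str(rec["user_no"]).strip(),
            "month": rec["month"],
            "amount": amount,
        })

    # Duplicate elimination: keep the first record per (user_no, month).
    seen = set()
    unique = []
    for rec in cleaned:
        key = (rec["user_no"], rec["month"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)

    # Default values: fill missing amounts with the median of the valid ones.
    valid = [r["amount"] for r in unique if r["amount"] is not None]
    fill = median(valid) if valid else 0.0
    for rec in unique:
        if rec["amount"] is None:
            rec["amount"] = fill
    return unique

raw = [
    {"user_no": " 1001", "month": "2018-01", "amount": "120.5"},
    {"user_no": "1001", "month": "2018-01", "amount": "120.5"},  # duplicate
    {"user_no": "1002", "month": "2018-01", "amount": "n/a"},    # bad format
    {"user_no": "1003", "month": "2018-01", "amount": "-5"},     # noise
    {"user_no": "1004", "month": "2018-01", "amount": "80"},
]
result = preprocess(raw)
```

Deduplicating before imputing keeps repeated rows from skewing the fill value; a production pipeline on SG 186 data would likely need richer rules than this toy example.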
Because customer information in the power enterprises' original information system is complete, a user number in the database represents one user during data mining based on customer behavior [4]. During data mining, the collected information should be associated with the user number to enrich the user attributes. The user number is then used for data mining and analysis to obtain the relevant customer relationship management results. Data tables and constraints are built by theme, and the database for data mining is constructed as shown in Figure 2. The data set is drawn from the SG 186 system and questionnaire information, and mainly includes basic customer information, payment habits, and so on.
Figure 2 The process of building a data warehouse
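Associating collected information with the user number amounts to a keyed merge. The sketch below assumes two hypothetical sources, rows exported from SG 186 and questionnaire rows, that share a `user_no` field; the other field names are invented for illustration.

```python
def merge_by_user_no(sg186_rows, survey_rows):
    """Enrich each SG 186 user with questionnaire attributes, keyed by user_no."""
    users = {}
    for row in sg186_rows:
        users.setdefault(row["user_no"], {}).update(row)
    for row in survey_rows:
        # Only enrich users that already exist in the system of record.
        if row["user_no"] in users:
            users[row["user_no"]].update(row)
    return users

sg186 = [
    {"user_no": "1001", "voltage_level": "220V"},
    {"user_no": "1002", "voltage_level": "380V"},
]
survey = [
    {"user_no": "1001", "income": "stable"},
    {"user_no": "1003", "income": "unstable"},  # unknown user, ignored
]
merged = merge_by_user_no(sg186, survey)
```

Questionnaire rows for unknown user numbers are dropped here; whether to keep them is a design choice the paper does not discuss.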
Customer clustering is an important way to analyze customer behavior. It divides a huge number of customers into categories such that customers within a category have similar attributes while attributes differ between categories. Feasible and fine-grained customer clustering greatly benefits an enterprise's business strategy. The KNN algorithm is used here as a self-organizing clustering algorithm for analyzing customer behavior. It makes the data easy to visualize and highlights interesting features, and the number of cluster centers is generated automatically from the data. This paper adopts the KNN algorithm and the SA algorithm to achieve accurate power user segmentation, clustering, and weight analysis of power payment data.
The KNN algorithm classifies data according to the similarity between a sample and other samples [5]. It can be used for regression as well as classification, is mature in theory, and is one of the simplest machine learning algorithms. Its idea is that if most of the k most similar samples (the nearest neighbors in the feature space) of a given sample belong to a certain category, then that sample belongs to the same category, as shown in Figure 3.
Figure 3 Model diagram of the KNN algorithm
Using the traditional vector space model, the KNN algorithm converts each power user's attributes into a weighted feature vector in the feature space (the vector expression is missing from this copy).
The similarity between a user's attributes and those of each known power customer is calculated to find the k most similar ones. The category is then determined from the weighted distances and the user attributes. The similarity between this user and a known customer is calculated as follows.
(1)
In Equation (1), one vector (its symbol is missing from this copy) represents the feature attributes of the known customer, and another represents the jth user center attribute. M is the number of dimensions of the user feature attributes, and Wk is the kth dimension of a vector. Following previous studies, the initial value of k was chosen between 20 and 500 and then adjusted according to experimental results. Calculation and testing showed that clustering was best at k = 347, so k is set to 347 in this paper.
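Since Equation (1) did not survive in this copy, the sketch below assumes the cosine similarity commonly used with the vector space model, sim(a, b) = Σ a_k b_k / (√Σ a_k² · √Σ b_k²) over the M dimensions, and a similarity-weighted vote among the k nearest neighbors. The labels, vectors, and k = 3 are toy values chosen for illustration; they are not the paper's data, and the paper's k = 347 presumes a far larger sample.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length weighted feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(query, labeled, k):
    """Return the label with the largest similarity-weighted vote among the
    k labeled vectors most similar to `query`.

    `labeled` is a list of (vector, label) pairs.
    """
    scored = [(cosine_similarity(query, vec), label) for vec, label in labeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]

# Hypothetical three-dimensional user feature vectors.
known = [
    ([0.9, 0.1, 0.0], "good"),
    ([0.8, 0.2, 0.1], "good"),
    ([0.1, 0.9, 0.7], "risky"),
    ([0.0, 0.8, 0.9], "risky"),
    ([0.2, 0.7, 0.8], "risky"),
]
label = knn_classify([0.85, 0.15, 0.05], known, k=3)
```

Sorting every known customer is fine for a toy set; at the scale of millions of users an index or approximate neighbor search would be the more realistic choice.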
The SA algorithm is a stochastic optimization algorithm based on a Monte Carlo iterative solution strategy, and it rests on the similarity between the annealing of solid materials and general combinatorial optimization problems. In the (k+1)th iteration, the SA algorithm visits user credit rating state j with a probability composed of two independent probability distributions. The probability of leaving the previous arrearage state i of the kth iteration must satisfy the normalization condition, and the previous payment frequency probability is then solved. Here T is the external factor affecting user payment in the kth iteration. The transition probability is expressed as:
(2)
T does not always take a numerical value, so the new solution may not be accepted. The algorithm then stays at solution i with probability:
(3)
Because the solution set is countable, the random process represented by the random variables of the SA algorithm is a Markov chain, and its one-step user arrearage probability is defined by the two equations above. The one-step user arrearage probability is written as:
(4)
The k-step user arrearage probability is:
(5)
where I is the identity matrix and the remaining symbol refers to the external factor affecting user payment in the tth iteration.
The meaning of the matrix element is:
(6)
That is, Pij is the probability that the system, in state i after m iterations, is in user credit rating state j at the (m+k)th iteration.
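Equations (2) to (6) are missing from this copy, so the following is only a sketch of the textbook simulated annealing Markov chain, not a reconstruction of the paper's formulas. It assumes a uniform generation probability over the other states, the Metropolis acceptance rule, a hypothetical cost per state, and a geometric cooling schedule for T; the k-step matrix is the ordered product of the one-step matrices.

```python
import math

def one_step_matrix(costs, T):
    """One-step transition matrix at parameter T (assumed Metropolis rule)."""
    n = len(costs)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            generate = 1.0 / (n - 1)  # normalized over all j != i
            delta = costs[j] - costs[i]
            accept = 1.0 if delta <= 0 else math.exp(-delta / T)
            P[i][j] = generate * accept
        # A rejected proposal leaves the chain at state i.
        P[i][i] = 1.0 - sum(P[i])
    return P

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def k_step_matrix(costs, temperatures):
    """Product of one-step matrices, starting from the identity matrix."""
    n = len(costs)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for T in temperatures:
        result = matmul(result, one_step_matrix(costs, T))
    return result

costs = [3.0, 1.0, 2.0]                         # hypothetical cost of each state
temps = [2.0 * 0.9 ** t for t in range(50)]     # assumed cooling schedule
Pk = k_step_matrix(costs, temps)
```

Entry `Pk[i][j]` is then the probability of being in state j after all iterations when starting from state i; as T falls, the probability mass concentrates on the lowest-cost state.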
The relationship between the number of iterations and precision is as follows.
Figure 4 Simulated relationship between iterations and precision
The process of data mining is shown as follows.
Figure 5 The process of data mining
By analyzing user payment records, normal payment users are divided into those who pay before a penalty is incurred and those who pay after. Arrearage users are divided into those with occasional, large arrears and those with frequent, small arrears.
The credit rules are formed from the above analysis results, as follows.
Credit rating for normal payment users: for users who pay off all electricity charges after the bill is issued, the credit rating is determined by the number of penalties.
Credit rating for arrearage users:
The user rating is determined by the number of arrears in a given period. User ratings are divided into five levels, of which level one is the lowest and level five the highest. User credit ratings are divided into three levels, A, B, and C, according to the arrearage amount and the number of payment notices in a given period; A is the best and C the worst. Five user ratings combined with three credit ratings give 15 ratings in total, which cover all users. Depending on a user's rating, the risk warning system and the VIP mechanism for power users are applied to serve users better and enhance service quality. Each power supply unit can define its own credit rules according to the payment habits of its users.
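The five-by-three rating grid can be sketched as two small lookup functions. The paper leaves the thresholds to each power supply unit, so every cut-off below is a hypothetical placeholder, as is the "level plus grade" string format.

```python
def user_level(arrears_count, limits=(0, 1, 3, 6)):
    """Map the number of arrears in a period to level 5 (best) .. 1 (worst).

    `limits` holds assumed upper bounds for levels 5, 4, 3 and 2.
    """
    for level, limit in zip((5, 4, 3, 2), limits):
        if arrears_count <= limit:
            return level
    return 1

def credit_grade(arrears_amount, notice_count,
                 amount_limits=(100.0, 1000.0), notice_limits=(1, 3)):
    """Map arrearage amount and payment-notice count to grade A, B or C."""
    if arrears_amount <= amount_limits[0] and notice_count <= notice_limits[0]:
        return "A"
    if arrears_amount <= amount_limits[1] and notice_count <= notice_limits[1]:
        return "B"
    return "C"

def combined_rating(arrears_count, arrears_amount, notice_count):
    """Combine both scales into one of the 15 ratings, for example '5A'."""
    return f"{user_level(arrears_count)}{credit_grade(arrears_amount, notice_count)}"

examples = [
    combined_rating(0, 0.0, 0),
    combined_rating(2, 500.0, 2),
    combined_rating(10, 5000.0, 5),
]
```

Passing the limits as parameters mirrors the statement that each unit can define its own rules from local payment habits.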
Figure 6 System structure diagram of user credit rating management and analysis
Based on the model described above, the key steps of the user credit rating management and analysis system under big data are constructed as follows.
Figure 7 Key steps of the user credit rating management system
The system built with the above method has the following advantages.
The system is built on the traditional model, so it retains that model's business interpretability and stability.
By applying big data, the system can preprocess and interpret data, eliminate data noise, and derive default values by calculation, which overcomes the traditional model's sensitivity to data quality.
The system can analyze data along different dimensions to obtain user credit ratings and prevent deception, thus enhancing credibility.
The system can use diverse algorithms to improve data processing capability and speed up processing.
By using a better software framework, the system eases the transition from a traditional machine learning model to an intelligent learning model.
This paper carried out modeling analysis for 4.4 million users in eastern Inner Mongolia. The user ratings and credit ratings determined for eastern Inner Mongolia are as follows.
Figure 8 Classification of the number of users
Under the current economic environment and operating modes, power enterprises inevitably face operational risks. Customer credit risk evaluation should therefore be put into practice and become a routine evaluation mechanism. The credit evaluation index system should be improved gradually, drawing on the advanced experience of banks and other industries. Power supply enterprises should pay more attention to customer credit evaluation, risk warning and prevention, and diversified marketing services. Much work remains for the future, such as constructing a marketing policy model and studying risk warning systems. A customer credit rating evaluation system suited to the Chinese power market needs to be established so that power enterprises can improve performance, avoid risks, and make power customer credit evaluation more scientific and efficient.
Roubini. "A Balance Sheet Approach to Financial Crisis". IMF Working Paper 02/210. 2002.
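With the deepening of reform, power supply companies have entered the era of the market economy, and the pressure to run them well is gradually increasing. However, problems such as malicious default, evasion of electricity bills, and electricity fraud remain unsolved. Effectively avoiding arrearage risk and improving economic returns is key to enterprise development. This paper studies credit rating evaluation of power users based on big data. The system proposes a new algorithm model that combines the traditional k-nearest-neighbor classification algorithm (k-Nearest-Neighbor, KNN) with simulated annealing (Simulated annealing, SA), uses a big data software framework and data processing technology to build user portraits rapidly, and performs user rating, credit rating division, and evaluation. Finally, the research results are given, laying a good foundation for future research.
Keywords: k-nearest-neighbor classification algorithm; simulated annealing algorithm; credit rating; big data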
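With the continued deepening of power enterprise reform, power enterprises have entered the era of the market economy. However, problems such as malicious default, evasion of electricity bills, and electricity fraud still plague them. It is therefore urgent to establish a credit evaluation system suited to the electricity market and to apply it to customer credit rating and risk control management in order to guard against transaction risk [1]. In the era of big data, precise marketing and a basis for scientific decision-making can be achieved by accurately positioning and segmenting consumers' habits and behaviors, extracting effective information from big data, and building abstract consumer portraits (specific labels) [2]. Data on customer identity attributes (gender, age, profession, residence), behavior characteristics (payment frequency and payment method), and business channels are classified and analyzed to derive several behavior labels for customer payment. Big data algorithms then use these behavior labels, as the portraits of different customers, in user credit rating evaluation according to their weight ratios.
A customer portrait converts abstract data into a fictitious persona carrying labeled information that distinguishes it from other portraits. It can therefore represent a customer group whose demand is judged by its preferences, thereby conveying information. The model is analyzed on the basis of customer payment behavior in big data, and the specific objectives are as follows:
(1) Customer attributes: name (replaced by a number), gender, residential area, income (stable or unstable), voltage level, power consumption, and so on.
(2) Customer payment habits: arrears, whether arrears are paid off in time, payment time, and payment method (business hall, manual collection, and so on).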
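Customer portraits require analyzing and classifying each customer, which relies on big data. Big data technology refers to selecting, integrating, and analyzing large amounts of information to segment and portray customers automatically, thereby obtaining conclusive information to guide decision-making. With the advance of SG 186 marketing information system construction, the marketing information system has been widely used in daily marketing [3]. From the registration of electricity charges to accounting statistics, and from customer payment to reconciliation by the accounting department, every customer, every charge calculation, and every charge collection can be converted into data and stored in the system. Customer information, electricity consumption information, payment information, and other data are interwoven and keep growing over time. This is called "big data in the power marketing system". Besides the SG 186 database, the business department can also obtain electricity consumption information as a data source through market research, site inspection, and simple statistical analysis of business data.
Before data mining, the original data usually need to be preprocessed. Data in an incorrect format are corrected to prepare for data mining, including noise elimination, calculation of default values, duplicate elimination, data type conversion, and dimension reduction. The data processing link is as follows.
Figure 2.1 Data processing link
Because customer information in the power enterprises' original information system is complete, a user number in the database represents one user during data mining based on customer behavior [4]. For data mining, the collected information should be associated with the user number to enrich the user attributes. The user number is used for data mining and analysis to obtain the relevant customer relationship management results. Data tables and constraints are then built by theme, and the data mining database is constructed as shown in Figure 2.2. The data set is drawn from the SG 186 system and questionnaire information, and mainly includes basic customer information, payment habits, and so on.
Figure 2.2 The process of building a data warehouse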
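Customer clustering is an important method for analyzing customer behavior. It divides a large number of customers into different categories; customers within a category have similar attributes, while attributes differ between categories. Feasible and fine-grained customer clustering greatly benefits an enterprise's business strategy. The KNN algorithm is a self-organizing clustering algorithm for analyzing customer behavior. It makes the data easy to visualize and highlights interesting features, and the number of cluster centers is generated automatically from the data. This paper uses the KNN algorithm and the SA algorithm to perform accurate power user segmentation, clustering, and weight analysis on power payment data.
The KNN algorithm classifies data according to the similarity between some samples and other samples [5]. It is used for regression as well as classification, so it is mature in theory and one of the simplest machine learning algorithms. Its idea is that if most of the k most similar samples (the nearest in the feature space) of a given sample belong to a certain category, then that sample belongs to the same category, as shown in Figure 2.3.
Figure 2.3 Model diagram of the KNN algorithm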
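Using the traditional vector space model, the KNN algorithm converts each power user's attributes into a weighted feature vector in the feature space (the vector expression is missing from this copy).
The similarity between each user attribute and the attributes of known power customers is calculated to find the k most similar ones. Classification is then carried out according to the weighted distance and the user attributes. The formula for the similarity between this user and a known user is as follows.
Equation (2.1)
Here one vector (its symbol is missing from this copy) represents the feature attributes of the known customer, and another represents the jth user center attribute. M is the number of dimensions of the user feature attributes, and Wk is the kth dimension of a vector. Following previous studies, the initial value of k was chosen between 20 and 500 and then adjusted according to experimental results. Calculation and testing showed that clustering was best at k = 347, so k is set to 347 in this paper.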
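The SA algorithm is a stochastic optimization algorithm based on a Monte Carlo iterative solution strategy, and it rests on the similarity between the annealing of solid materials and general combinatorial optimization problems. In the (k+1)th iteration, the SA algorithm visits user credit rating state j with a probability composed of two independent probability distributions. In the kth iteration, the probability of leaving the previous arrearage state i must satisfy the normalization condition, and the previous payment frequency probability is solved. Here T is the external factor affecting user payment in the kth iteration. The transition probability is expressed as:
Equation (2.2)
This quantity does not always take a numerical value, so the new solution may not be accepted. The algorithm then stays at solution i with probability:
Equation (2.3)
Because the solution set is countable, the random process represented by the random variables of the SA algorithm is a Markov chain, and the one-step user arrearage probability is defined by the two equations above. The one-step user arrearage probability is:
Equation (2.4)
The k-step user arrearage probability is:
Equation (2.5)
where I is the identity matrix and the remaining symbol denotes the external factor affecting user payment in the tth iteration.
The meaning of the matrix element is:
Equation (2.6)
That is, Pij is the probability that the system, in state i after m iterations, is in user credit rating state j at the (m+k)th iteration.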
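The relationship between iterations and precision is as follows.
Figure 2.4 Simulated relationship between iterations and precision
The process of data mining is as follows.
Figure 2.5 The process of data mining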
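By analyzing user payment records, normal payment users are divided into those who pay before a penalty and those who pay after a penalty. Arrearage users are divided into those with occasional, large arrears and those with frequent, small arrears.
The credit rules are formed from the above analysis results, as follows:
Credit rating for normal payment users: for users who pay off all electricity charges after the bill is issued, the credit rating is determined by the number of penalties.
Credit rating for arrearage users:
The user rating is determined by the number of arrears in a given period. User ratings are divided into five levels, of which level one is the lowest and level five the highest. User credit ratings are divided into three levels, A, B, and C, according to the arrearage amount and the number of payment notices in a given period. A is the best and C the worst. Five user ratings combined with three credit ratings give 15 ratings in total, which cover all users. For different user ratings, the risk warning system and VIP mechanism for power users are used to serve users better and improve service quality. Each power supply unit can define its credit rules according to the payment habits of its users.
Figure 3.1 System structure diagram of user credit rating management and analysis
Based on the above model, the key steps of the user credit rating management and analysis system under big data are constructed as follows:
Figure 3.2 Key steps of the user credit rating management system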
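The system is built on the traditional model, so it retains that model's business interpretability and stability.
The system uses big data to preprocess and interpret data, eliminate data noise, and derive default values by calculation, which overcomes the traditional model's sensitivity to data.
The system can analyze data along different dimensions to obtain user credit ratings and prevent deception, thereby improving credibility.
The system can adopt a variety of algorithms to improve data processing capability and speed up processing.
By using a better software framework, the system makes the transition from a traditional machine learning model to an intelligent learning model easier.
This paper carried out modeling analysis for 4.4 million users in eastern Inner Mongolia. The user ratings and credit ratings determined for eastern Inner Mongolia are as follows.
Figure 4.1 Classification of the number of users
Under the current economic environment and operating modes, power enterprises inevitably face operational risks. Customer credit risk evaluation should therefore be put into practice and become a standard evaluation mechanism. The credit evaluation index system should be improved gradually, drawing on the advanced experience of banks and other industries. Power supply enterprises should pay more attention to customer credit evaluation, risk warning and prevention, and diversified marketing services. Much work therefore remains, such as constructing a marketing policy model and studying risk warning systems. To improve enterprise performance, avoid risks, and make power customer credit evaluation more scientific and efficient, a customer credit rating evaluation system suited to the Chinese power market needs to be established for power enterprises.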
Roubini. "A Balance Sheet Approach to Financial Crisis". IMF Working Paper 02/210. 2002.