Predicting the Nominal GDP Using Economic Indicators

This article covers forecasting nominal GDP (Gross Domestic Product) using data available on the web. The task was to gather data from authentic sources, perform an Exploratory Data Analysis (EDA), train a model and predict the nominal GDP of Canada. As an IT person, I had little knowledge of these economics terms, so the first step was to get familiar with the concepts. The next was to learn R and carry out the project in it, getting acquainted with popular R packages such as tidyverse, ggplot2, caret and others. In practice, good use of online resources makes it easy to get accustomed to R and its syntax.

Stepping into the task, I came across various economic statistics, also known as indicators, which are classified as lagging or leading. Economic indicators may have a direct impact on the nominal GDP of a country. Our approach was to use these indicators to predict the nominal GDP of Canada. After researching them in various online blogs, articles and papers, the following indicators were finalized:

- Population
- Refugee Population
- Real Interest Rate
- Number of Domestic Companies
- Travel Services (% of Import Services, BoP)
- Tax Revenue
- Housing Market
- Labor Productivity
- Government Bond Yield 10yr rate
- Personal Remittances (Received & Paid in USD)
- Passengers carried by Railways
- Passengers carried by Air Transports
- Inflation
- Income Growth
- Unemployment Rate
- Government Deficit
- Consumer Price Index
- Commodity Price Index
- EUR to CAD conversion rate
- USD to CAD conversion rate
- Toronto Stock Exchange traded value
- Toronto Stock Exchange traded volume
- Incoming International Tourists

Data was fetched from sources such as Statistics Canada, the World Bank, Statista and others. These indicators were treated as features, and data was collected for them for the years 2009 to 2018. The data set was divided into two parts, a training set and a testing set.

- Training set: 2009 to 2016
- Testing set: 2017 and 2018
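
The split above can be sketched in a few lines of base R. This is only an illustration: the data frame `gdp_data`, its column names and all of its values are hypothetical placeholders, not the data used in the project.

```r
# Hypothetical yearly data set: one row per year, 2009 to 2018.
# All values are random placeholders, not real Canadian statistics.
set.seed(42)
gdp_data <- data.frame(
  year              = 2009:2018,
  unemployment_rate = runif(10, 5, 9),
  inflation         = runif(10, 0, 3),
  nominal_gdp       = 1500 + cumsum(runif(10, 20, 80))
)

# Split by year rather than at random, so the model is always
# evaluated on years that come after the ones it was trained on.
train_set <- subset(gdp_data, year <= 2016)
test_set  <- subset(gdp_data, year >= 2017)

nrow(train_set)  # 8 rows: 2009 to 2016
nrow(test_set)   # 2 rows: 2017 and 2018
```

A chronological split like this leaves only two test observations, which is worth keeping in mind when reading the accuracy figures later on.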

Exploratory Data Analysis

After the data set was assembled, cleaned and neatly arranged, exploratory data analysis began. Data normalization was done using dnorm, the probability density function of the standard normal distribution. To check whether the data is normal, the Shapiro-Wilk normality test was applied to each feature. All but 3 variables had a p-value greater than 0.05, so normality could not be rejected for them. To analyze the correlation between the features, the findCorrelation function of the caret package was used and highly correlated attributes were removed. As there were only 8 data points (2009 to 2016), the dimensionality of the data had to be reduced. This was done by applying Principal Component Analysis (PCA) using the prcomp function. At this point, the EDA was concluded. Let's proceed with modelling now! The code is on GitHub (see the link in the original post).
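
The normality check, correlation filter and PCA described above could look roughly like the following sketch. The feature names and values are random placeholders, the 0.9 correlation cutoff is an assumption (the article does not state the one used), and the base-R branch is only a simplified stand-in for when caret is not installed.

```r
set.seed(1)
# Placeholder features for the 8 training years (random, not real data).
features <- data.frame(
  population   = rnorm(8, 35, 1),
  inflation    = rnorm(8, 2, 0.5),
  unemployment = rnorm(8, 7, 0.6),
  bond_yield   = rnorm(8, 2.2, 0.4)
)
# A deliberately redundant column, so the correlation filter has work to do.
features$population_copy <- features$population * 1.01 + rnorm(8, 0, 0.001)

# 1. Shapiro-Wilk test per feature; p > 0.05 means normality is not rejected.
shapiro_p <- sapply(features, function(x) shapiro.test(x)$p.value)
round(shapiro_p, 3)

# 2. Drop one feature from each highly correlated pair (assumed cutoff: 0.9).
cor_matrix <- cor(features)
if (requireNamespace("caret", quietly = TRUE)) {
  drop_idx <- caret::findCorrelation(cor_matrix, cutoff = 0.9)
} else {
  # Simplified fallback: drop a column if it correlates strongly
  # with any column that appears before it.
  upper <- abs(cor_matrix)
  upper[lower.tri(upper, diag = TRUE)] <- 0
  drop_idx <- which(apply(upper, 2, max) > 0.9)
}
reduced <- if (length(drop_idx) > 0) {
  features[, -drop_idx, drop = FALSE]
} else {
  features
}

# 3. PCA on the centred and scaled remaining features.
pca <- prcomp(reduced, center = TRUE, scale. = TRUE)
summary(pca)$importance["Cumulative Proportion", ]
```

The cumulative proportion of variance is a common way to decide how many principal components to carry forward into modelling.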

Linear Regression Modelling

In this approach, the PCA values were used to train a simple linear regression model using the lm() function. The resulting model was summarized using the summary() function. The key observation was that the p-value was below the significance level (< 0.05), which means I could safely reject the null hypothesis that the coefficient β of the predictor is zero. Furthermore, the F-statistic of the model was 428.6 (higher is better). Hence, it was concluded that the model was statistically significant. To validate the model, I tested it on the values for 2017 and 2018. The overall accuracy of the model was 98.31%. A table at the end of the article provides more details about the experiments.
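
A minimal sketch of this step follows, under several assumptions that the article does not confirm: only the first principal component is used as the predictor, the PCA is fitted on the training years only, and "accuracy" is taken to mean 100 minus the mean absolute percentage error. All data are random placeholders.

```r
set.seed(7)
years <- 2009:2018
# Placeholder indicators and target (random, not real data).
indicators <- data.frame(
  x1 = seq(1, 10) + rnorm(10, 0, 0.3),
  x2 = seq(20, 11) + rnorm(10, 0, 0.3),
  x3 = rnorm(10)
)
nominal_gdp <- 1500 + 60 * seq(1, 10) + rnorm(10, 0, 10)
is_train <- years <= 2016

# Fit PCA on the training years only, then project every year onto it.
pca <- prcomp(indicators[is_train, ], center = TRUE, scale. = TRUE)
scores <- as.data.frame(predict(pca, newdata = indicators))

train_df <- data.frame(gdp = nominal_gdp[is_train],  PC1 = scores$PC1[is_train])
test_df  <- data.frame(gdp = nominal_gdp[!is_train], PC1 = scores$PC1[!is_train])

# Simple linear regression with one predictor (assumed: PC1).
fit <- lm(gdp ~ PC1, data = train_df)
fit_summary <- summary(fit)
fit_summary$coefficients  # estimate, std. error, t value and p-value for beta
fit_summary$fstatistic    # F value with its degrees of freedom

# Accuracy as assumed here: 100 minus the mean absolute percentage error.
predicted <- predict(fit, newdata = test_df)
accuracy <- 100 - mean(abs(predicted - test_df$gdp) / test_df$gdp) * 100
accuracy
```

With a single predictor, the p-value of the slope and the p-value of the F-test coincide, which is why the two checks in the text lead to the same conclusion.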

Random Forest Modelling

For this model, the randomForest package in R makes it easier to select features and create a random forest model. The importance() function computes and displays the importance of each feature based on node impurity. These values were stored in a new data set, which was then sorted by importance. The trick here is not to select only the features with the highest or the lowest importance values, but to pick across the spectrum, i.e. a combination of high, mid and low importance values, to maintain balance. Six features were picked by trial and error:

- Personal Remittances (Received)
- Real Interest Rate
- Travel Services Imports
- Government Bond Yield 10yr rate
- Government Deficit
- Number of Domestic Companies

The model was trained using the randomForest() function. The mtry parameter was set to 6 (the number of features) and the number of trees (ntree) to 1000. The model forecasted the 2017 and 2018 NGDP with an average accuracy of 95.68%.
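
The feature ranking and training steps might be sketched as below, assuming the randomForest package is installed. The ten `indicator_*` columns are random placeholders, and the positions picked from the ranking are purely illustrative; the article chose its six features by trial and error.

```r
library(randomForest)

set.seed(123)
# Placeholder training data: 8 years, 10 candidate indicators (random values).
train_x <- as.data.frame(matrix(rnorm(8 * 10), nrow = 8))
names(train_x) <- paste0("indicator_", 1:10)
train_y <- 1500 + 60 * (1:8) + rnorm(8, 0, 10)

# Step 1: fit on all candidates and rank them by node-impurity importance.
rf_all <- randomForest(x = train_x, y = train_y, ntree = 1000)
imp <- importance(rf_all)  # for regression: one column, "IncNodePurity"
ranked <- imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE]

# Step 2: pick six features from the top, middle and bottom of the ranking.
# These positions are a hypothetical choice for illustration only.
selected <- rownames(ranked)[c(1, 2, 5, 6, 9, 10)]

# Step 3: final model with the settings named in the text.
rf_final <- randomForest(
  x = train_x[, selected], y = train_y,
  mtry = 6, ntree = 1000
)
fitted_values <- predict(rf_final, newdata = train_x[, selected])
```

Setting mtry equal to the number of features means every split considers all six predictors, so the forest behaves like bagged regression trees.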

Support Vector Machine Modelling

This was the last model on our experiment list. To use SVM modelling, we used the e1071 package in R, which provides the svm() function. As our use case was not a classification problem, the regression option had to be applied. For this, the svm() function provides a type parameter that selects the mode. Accordingly, the eps-regression type was chosen and the kernel parameter was set to radial. The average accuracy achieved here was 98.18%. The graph and table below provide more details about the experiment.
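
A minimal sketch of the SVM regression call, assuming the e1071 package is installed. The predictors `x1` and `x2` and all values are random placeholders; every other svm() setting is left at its default because the article does not mention any tuning.

```r
library(e1071)

set.seed(99)
# Placeholder data: 8 training years and 2 test years (random values).
train_df <- data.frame(x1 = rnorm(8), x2 = rnorm(8))
train_df$gdp <- 1500 + 60 * (1:8) + rnorm(8, 0, 10)
test_df <- data.frame(x1 = rnorm(2), x2 = rnorm(2))

# Epsilon-regression with a radial (RBF) kernel, as named in the text.
svm_fit <- svm(
  gdp ~ ., data = train_df,
  type = "eps-regression", kernel = "radial"
)

# Forecast the two held-out years.
predicted <- predict(svm_fit, newdata = test_df)
predicted
```

When the response is numeric, svm() selects regression on its own, but passing the type explicitly keeps the intent visible.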

Chart displaying the actual vs. predicted values

To conclude, despite the small data set, all models predicted with more than 95% accuracy. As the linear model was trained on the PCA values, its output was better than the others. The random forest, in contrast, can be considered the least reliable, as it uses only a few features of the data set. Accuracy and reliability aside, the results at least support the significance of economic indicators in predicting nominal GDP.

PS: This experiment was part of a submission to the Statistics Canada Business Data Scientist Challenge 2019/2020.

Translated from: https://medium.com/@damiandmonte/predicting-the-nominal-gdp-using-economic-indicators-a-data-science-approach-7c56cded782
