Concept
IRR code
Data preparation
Spark code
Performance comparison with Pandas
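IRR stands for internal rate of return. Put simply, the higher the IRR, the smaller the cost invested relative to the return obtained. More precisely, it is the discount rate at which the net present value of a series of cash flows is zero, which is why it serves as the threshold rate when deciding whether to accept a project. Both readings are informal explanations.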
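The definition can be written as one equation. The symbols here are my own notation, not from the original post: C_k is the cash flow in period k (C_0 being the negated principal), n is the last period, and r is the per-period IRR being solved for.

```latex
\sum_{k=0}^{n} \frac{C_k}{(1 + r)^{k}} = 0
```

This is the same sum that the code later in this post accumulates in fValue.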
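IRR is available as a ready-made function in Excel and in the Pandas ecosystem (for example `irr` in numpy-financial). Because the amount of data I need to process is too large for those, this post gives a Spark solution. It only records the implementation and explains a few of the concepts.
Java and Oracle SQL implementations of IRR can be found online. The Java code, taken from the original post, is as follows.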
public static double irr(double[] values, double guess) {
    int maxIterationCount = 20;
    double absoluteAccuracy = 1.0E-7D;
    double x0 = guess; // current estimate of the rate
    int i = 0;
    while (i < maxIterationCount) {
        double fValue = 0.0D;      // net present value at x0
        double fDerivative = 0.0D; // derivative of the net present value at x0
        for (int k = 0; k < values.length; k++) {
            fValue += values[k] / Math.pow(1.0D + x0, k);
            fDerivative += -k * values[k] / Math.pow(1.0D + x0, k + 1);
        }
        double x1 = x0 - fValue / fDerivative; // Newton step
        if (Math.abs(x1 - x0) <= absoluteAccuracy) {
            return x1;
        }
        x0 = x1;
        i++;
    }
    return Double.NaN; // did not converge within maxIterationCount
}
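The loop in this function is Newton's method applied to the net present value. Written out, with v_k standing for values[k] and n for values.length - 1 (notation introduced here for illustration):

```latex
f(x) = \sum_{k=0}^{n} \frac{v_k}{(1+x)^{k}}, \qquad
f'(x) = \sum_{k=0}^{n} \frac{-k\, v_k}{(1+x)^{k+1}}, \qquad
x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}
```

The iteration stops once |x_1 - x_0| is at most absoluteAccuracy, or gives up after maxIterationCount steps.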
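To call the function, pass the full array of cash flows and a guess. From the code, guess is the starting value of the iteration. The call looks like this:
double ret = irr(income, 0.1d) * 12 * 100;
Multiply by 12 and by 100 only if you need to (for example, to turn a monthly rate into an annual percentage).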
Translating the Java code into Scala gives the following.
import scala.math._

// Input is "id,cf0,cf1,..."; returns (id, per-period IRR)
def irr(values: String): (String, Double) = {
  val list = values.split(",")
  val id = list.head
  // Parse the cash flows once instead of on every iteration
  val fees_flatten = list.tail.map(_.toDouble)
  val maxIterationCount = 20
  val absoluteAccuracy = 1.0e-7
  var x0 = 0.00001 // initial guess
  var i = 0
  while (i < maxIterationCount) {
    var fValue = 0.0      // net present value at x0
    var fDerivative = 0.0 // derivative of the net present value at x0
    var k = 0
    while (k < fees_flatten.length) {
      fValue += fees_flatten(k) / pow(1.0 + x0, k)
      fDerivative += -k * fees_flatten(k) / pow(1.0 + x0, k + 1)
      k += 1
    }
    val x1 = x0 - fValue / fDerivative // Newton step
    if (abs(x1 - x0) <= absoluteAccuracy) {
      return (id, x1)
    }
    x0 = x1
    i += 1
  }
  // No convergence: this version returns 0.0 where the Java code returns NaN
  (id, 0.0)
}
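Note that I changed the parameter to a String. Grouped data in an RDD is held in a CompactBuffer, whose static type is Iterable, and I found that awkward to work with inside the Scala function. When I passed it in as an Iterable, iterating over it and reading values kept raising errors, so I simply converted it to a String and split it myself.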
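For comparison, one possible way to consume an Iterable directly is to copy it into an array before indexing. This is only a sketch and not the approach used in this post: `irrFromIterable` and its `guess` parameter are hypothetical names, it returns NaN on non-convergence like the Java version, and it has not been tried against the errors described above.

```scala
import scala.math.{abs, pow}

// Hypothetical helper: Newton's method on cash flows given as an Iterable,
// the static type of grouped values in an RDD.
def irrFromIterable(cashFlows: Iterable[Double], guess: Double = 0.1): Double = {
  val fees = cashFlows.toArray // materialize once so it can be indexed
  val maxIterationCount = 20
  val absoluteAccuracy = 1.0e-7
  var x0 = guess
  var i = 0
  while (i < maxIterationCount) {
    var fValue = 0.0
    var fDerivative = 0.0
    var k = 0
    while (k < fees.length) {
      fValue += fees(k) / pow(1.0 + x0, k)
      fDerivative += -k * fees(k) / pow(1.0 + x0, k + 1)
      k += 1
    }
    val x1 = x0 - fValue / fDerivative // Newton step
    if (abs(x1 - x0) <= absoluteAccuracy) return x1
    x0 = x1
    i += 1
  }
  Double.NaN // no convergence within maxIterationCount
}

// Illustrative input: invest 100, receive 110 one period later.
val rate = irrFromIterable(Seq(-100.0, 110.0))
```

With this shape the id would stay as the RDD key and only the values would be passed in, so no string splitting is needed.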
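Next, prepare the data needed for the IRR calculation. It has two parts:
1- all principal amounts, negated
2- the amount due in each period
Because this was a one-off to speed up the IRR calculation rather than something I will use often, I had Spark read the Hive data directly from HDFS. The two parts only need to be loaded into a single two-column table: table_name (id bigint, fee double or decimal).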
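To make the layout concrete, here is a small sketch in plain Scala collections. The id, the principal, and the installment amounts are made-up illustrative values, and putting the negated principal first followed by the periods in order is my reading of how the irr function indexes its input.

```scala
// Hypothetical example: id 1, principal 1200, repaid in 3 installments of 410.
val id = 1L
val principal = 1200.0
val installments = Seq(410.0, 410.0, 410.0)

// Rows as they would sit in the two-column table: negated principal first,
// then one row per period in repayment order.
val rows: Seq[(Long, Double)] = (id, -principal) +: installments.map(fee => (id, fee))

// The same rows collapsed into the "id,cf0,cf1,..." string that irr expects.
val irrInput = s"$id," + rows.map(_._2).mkString(",")
```

The position of each value in the string becomes the exponent k in the calculation, so the order of the rows matters.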
spark-shell
import scala.math._

// Input is "id,cf0,cf1,..."; returns (id, per-period IRR)
def irr(values: String): (String, Double) = {
  val list = values.split(",")
  val id = list.head
  // Parse the cash flows once instead of on every iteration
  val fees_flatten = list.tail.map(_.toDouble)
  val maxIterationCount = 20
  val absoluteAccuracy = 1.0e-7
  var x0 = 0.1 // initial guess
  var i = 0
  while (i < maxIterationCount) {
    var fValue = 0.0      // net present value at x0
    var fDerivative = 0.0 // derivative of the net present value at x0
    var k = 0
    while (k < fees_flatten.length) {
      fValue += fees_flatten(k) / pow(1.0 + x0, k)
      fDerivative += -k * fees_flatten(k) / pow(1.0 + x0, k + 1)
      k += 1
    }
    val x1 = x0 - fValue / fDerivative // Newton step
    if (abs(x1 - x0) <= absoluteAccuracy) {
      return (id, x1)
    }
    x0 = x1
    i += 1
  }
  // No convergence: this version returns 0.0 where the Java code returns NaN
  (id, 0.0)
}
val lines = sc.textFile("/xx/xx/data.csv")
// Parse each "id,fee" line once
val pairs = lines.map(_.split(",")).map(a => (a(0).toInt, a(1).toDouble))
val even_tup = pairs.filter(x => x._1 % 2 == 0) // even ids
val odd_tup = pairs.filter(x => x._1 % 2 == 1)  // odd ids
val merged = even_tup.cogroup(odd_tup)
// Even ids: the cash flows are in the first Iterable; overwrite the target table
merged.filter(x => x._2._1.nonEmpty).map(x => (x._1,x._2._1)).map{ case (k,v) => s"""$k,${v.mkString(",")}"""}.map( x => irr(x)).map(x => (x._1.toInt,x._2 * 12 * 100)).toDF().insertInto(table_name,true)
// Odd ids: the cash flows are in the second Iterable; append to the same table
merged.filter(x => x._2._2.nonEmpty).map(x => (x._1,x._2._2)).map{ case (k,v) => s"""$k,${v.mkString(",")}"""}.map( x => irr(x)).map(x => (x._1.toInt,x._2 * 12 * 100)).toDF().insertInto(table_name,false)
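/xx/xx/data.csv is the path where my data is stored on HDFS. Define the function in Spark first, then use RDD operators to arrange the data, pass it to the custom function, and store the result in Hive. The code above scales the result by 12 * 100. If you only want to multiply by 100, the merged step becomes: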
// Each id has values on only one side of the cogroup, so concatenate both sides
merged.map(x => (x._1,x._2._1 ++ x._2._2)).map{ case (k,v) => s"""$k,${v.mkString(",")}"""}.map( x => irr(x)).map(x => (x._1,x._2 * 100)).toDF().insertInto(result_table_name)
Pick a few of the results and compare them with Excel to guard against mistakes.
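Besides the Excel comparison, a result can also be checked on its own terms: plugging the per-period rate back into the net present value should give something close to zero. The sketch below assumes that check; `npv` is a hypothetical helper and the cash flows are illustrative values, not data from this post. A stored value that was scaled (for example by 12 * 100) would need to be divided back to the per-period rate first.

```scala
import scala.math.{abs, pow}

// Hypothetical helper: net present value of cash flows at a per-period rate.
def npv(rate: Double, cashFlows: Seq[Double]): Double =
  cashFlows.zipWithIndex.map { case (v, k) => v / pow(1.0 + rate, k) }.sum

// Illustrative values: invest 100, receive 110 after one period, rate 10%.
val flows = Seq(-100.0, 110.0)
val residual = npv(0.10, flows)
val looksRight = abs(residual) < 1e-6
```

A residual far from zero would suggest either a non-converged result or cash flows passed in the wrong order.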
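On my company's data volume (which I cannot disclose), a Python script written by a colleague took 2.5 hours to process the data (poorly written code may be part of the reason). I estimate that the Python code would still need close to thirty minutes even after optimization, whereas the Spark job above finished in ten seconds.
I will optimize the Spark code later if I need it again. This post is only a record.
Reference: https://www.cnblogs.com/Alex-Zeng/p/9334582.html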