http://amplab-extras.github.io/SparkR-pkg/
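Bookmarking this article for now; it may come in handy some day, and I hope it will.
R, Spark, and Hadoop have a bright future, but the road of learning is long and the technology changes very fast. I have read only a few pages of the Hadoop book I just bought, and it already seems to have been overtaken. I need to study harder.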
R on Spark
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR exposes the Spark API through the RDD class and allows users to interactively run jobs from the R shell on a cluster.
Features
RDDs as Distributed Lists
SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD.
sc <- sparkR.init("local")
lines <- textFile(sc, "hdfs://data.txt")
wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })
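The closure passed to lapply is ordinary R, so it can be checked on a local list before it is sent to a cluster. A minimal sketch, assuming two made-up sample lines in place of the HDFS file (the names sampleLines and countWords are hypothetical and not part of SparkR):

```r
# Hypothetical sample input standing in for the lines of hdfs://data.txt.
sampleLines <- list("the quick brown fox", "hello world")

# Same logic as the closure above: split a line on spaces and count the pieces.
countWords <- function(line) {
  length(unlist(strsplit(line, " ")))
}

# Base R lapply over a local list; no Spark is involved in this check.
wordsPerLine <- lapply(sampleLines, countWords)
print(unlist(wordsPerLine))
```

This runs in a plain R session without SparkR loaded, which makes it a quick way to debug the function itself.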
In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
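What a reduce and a per-key reduce compute can be illustrated on local data with base R alone. The sketch below does not call SparkR; it only mirrors the semantics, assuming key-value pairs are represented as two-element lists, and all sample values are made up.

```r
# Local analogue of reduce: fold a list of numbers with a binary function.
counts <- list(4, 2, 5)
total <- Reduce(function(a, b) a + b, counts)

# Local analogue of reduceByKey: each element is a (key, value) pair.
pairs <- list(list("a", 1), list("b", 2), list("a", 3))
keys <- vapply(pairs, function(p) p[[1]], character(1))
values <- vapply(pairs, function(p) p[[2]], numeric(1))

# Group the values by key, then reduce each group with the same kind of binary function.
sumsByKey <- lapply(split(values, keys), function(v) Reduce(`+`, v))
print(total)
print(sumsByKey)
```

On a cluster the grouping and folding happen across partitions, but the function supplied by the user plays the same role as the binary function here.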
Serializing closures
SparkR automatically serializes the variables needed to execute a function on the cluster. For example, if you use global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster. An example that uses a random weight vector to initialize a matrix is shown below.
lines <- textFile(sc, "hdfs://data.txt")
# D (the length of the weight vector) must be defined before this line
initialWeights <- runif(n = D, min = -1, max = 1)
createMatrix <- function(line) {
  as.numeric(unlist(strsplit(line, " "))) %*% t(initialWeights)
}
# initialWeights is automatically serialized
matrixRDD <- lapply(lines, createMatrix)
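The function body can be tried locally to see what each line produces. A minimal sketch, assuming D is 3 and using a made-up input line (both are assumptions, since the snippet above leaves D undefined):

```r
# Assumed value for D; the original snippet does not define it.
D <- 3
set.seed(42)
initialWeights <- runif(n = D, min = -1, max = 1)

createMatrix <- function(line) {
  # The numeric vector is treated as a column, so the product with the
  # 1 x D row t(initialWeights) is an outer product.
  as.numeric(unlist(strsplit(line, " "))) %*% t(initialWeights)
}

# Hypothetical input line with four numbers.
m <- createMatrix("1 2 3 4")
print(dim(m))
```

Each line therefore yields a matrix with one row per number on the line and D columns. The function reads initialWeights from the enclosing environment, which is the variable SparkR has to ship along with it.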
Using existing R packages
SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster. For example, to use the Matrix package in a closure applied on each partition of an RDD, you could run
generateSparse <- function(x) {
  # Use sparseMatrix function from the Matrix package
  sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3), x = c(1, 2, 3))
}
includePackage(sc, Matrix)
sparseMat <- lapplyPartition(rdd, generateSparse)
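Before running this on a cluster, the closure can be checked in a local R session where the Matrix package is installed. A minimal sketch, assuming Matrix is available locally; the NULL argument is only a placeholder, since the function ignores its input:

```r
# Load Matrix locally; on the cluster, includePackage is what requests this.
library(Matrix)

generateSparse <- function(x) {
  # Build a 3 x 3 sparse matrix with 1, 2, 3 on the diagonal.
  sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3), x = c(1, 2, 3))
}

# The argument is unused, so any value works for a local check.
m <- generateSparse(NULL)
print(dim(m))
print(sum(m))
```

If this fails locally with a missing-function error, the same closure would also fail on workers where the package has not been loaded.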
Installing SparkR
SparkR requires Scala 2.10 and Spark version >= 0.9.0 and depends on R packages rJava and testthat (only required for running unit tests).
If you wish to try out SparkR, you can use install_github from the devtools package to directly install the package.
library(devtools)
install_github("amplab-extras/SparkR-pkg", subdir="pkg")
If you wish to clone the repository and build from source, you can use the following script to build the package locally.
./install-dev.sh
Running sparkR
If you have installed it directly from GitHub, you can include the SparkR package and then initialize a SparkContext. For example, to run with a local Spark master you can launch R and then run
library(SparkR)
sc <- sparkR.init(master="local")
If you have cloned and built SparkR, you can start using it by launching the SparkR shell with
./sparkR
SparkR also comes with several sample programs in the examples directory. To run one of them, pass its path and arguments to ./sparkR, for example
./sparkR examples/pi.R local[2]
You can also run the unit tests for SparkR by running
./run-tests.sh