SparkR(1)Naive Bayesian
1. Naive Bayesian
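Bayes' theorem
P(A|B) = P(B|A) P(A) / P(B)
Features - F1, F2, … Fn
Category - C1, C2, … Cm
For a category C and observed features F1 … Fn
P(C|F1F2…Fn) = P(F1F2…Fn|C)P(C) / P(F1F2…Fn)
The "naive" assumption is that the features are conditionally independent given the category, so the numerator factors as
P(F1F2…Fn|C)P(C) = P(F1|C)P(F2|C) … P(Fn|C)P(C)
The denominator P(F1F2…Fn) is the same for every category, so the predicted category is the one that maximizes this product.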
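The formulas above can be checked with a few lines of plain R. This is only a sketch: the category names, the two features, and every number below are hypothetical values chosen for illustration. It multiplies the prior by the per-feature likelihoods and then normalizes to get the posterior for each category.

```r
# Hypothetical priors P(C) for two categories
prior <- c(spam = 0.4, ham = 0.6)

# Hypothetical conditional probabilities P(Fi | C):
# one row per category, one column per observed feature
likelihood <- rbind(
  spam = c(F1 = 0.8, F2 = 0.5),
  ham  = c(F1 = 0.1, F2 = 0.4)
)

# Numerator of the formula: P(F1|C) * P(F2|C) * P(C)
joint <- apply(likelihood, 1, prod) * prior

# P(F1F2) is the sum of the numerators over all categories,
# so dividing by it gives the posterior P(C | F1F2)
posterior <- joint / sum(joint)
print(posterior)

# The predicted category is the one with the largest posterior
predicted <- names(which.max(posterior))
print(predicted)
```

Because the denominator is shared by all categories, comparing the values in `joint` directly would pick the same category; the normalization only matters when the actual probabilities are needed.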
2. Prepare the Environment
spark-1.4.1
I just downloaded the latest version and placed it on my path
http://mirror.nexcess.net/apache//spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz
R-3.2.2
http://sillycat.iteye.com/blog/2240148
>r --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)
RStudio version 0.99.473
http://sillycat.iteye.com/blog/2240148
3. Start Spark with the R Shell
> bin/sparkR --master local[2]
We can paste the code from this example directly into the shell
https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R
4. Execute R script in SparkR
https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd
https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R
https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json
> bin/spark-submit examples/src/main/r/dataframe.R
5. Run the R Code in RStudio
Install JDK 1.6 on my Mac
https://support.apple.com/kb/DL1572?locale=en_US
I downloaded the file from here.
http://supportdownload.apple.com/download.info.apple.com/Apple_Support_Area/Apple_Software_Updates/Mac_OS_X/downloads/031-29055.20150831-0f779fb2-4bf4-11e5-a8d8-/javaforosx.dmg
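Move the Spark binary archive to /opt/spark and unpack it there, so that /opt/spark/R/lib exists
> mv spark-1.4.1-bin-hadoop2.6.tgz /opt/spark/
> tar zxvf /opt/spark/spark-1.4.1-bin-hadoop2.6.tgz -C /opt/spark --strip-components=1
This sample R code can then be run in RStudio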
## install the helper packages used alongside SparkR
mypkgs <- c("dplyr", "ggplot2", "magrittr")
install.packages(mypkgs)
## point R at the JDK (this is the JDK 1.6 path on my Mac)
Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home")
library("rJava")
## SparkR ships inside the Spark distribution under R/lib, so load it from there
## instead of calling install.packages() on the Spark .tgz archive
Sys.setenv(SPARK_HOME="/opt/spark")
library("SparkR", lib.loc="/opt/spark/R/lib")
## start a local Spark context and the SQL / Hive contexts
sc <- sparkR.init(master = "local", appName = "SparkR_demo_RTA",
                  sparkHome = "/opt/spark")
sqlContext <- sparkRSQL.init(sc)
hiveContext <- sparkRHive.init(sc)
## load the sample JSON file that ships with Spark into a DataFrame
path <- file.path(Sys.getenv("SPARK_HOME"),
                  "examples/src/main/resources/people.json")
peopleDF <- jsonFile(sqlContext, path)
printSchema(peopleDF)
head(peopleDF)
6. Further Examples
https://github.com/kiendang/sparkr-naivebayes-example
http://www.slideshare.net/KienDang5/introduction-to-sparkr
Data Types in the R Language
Vector
> c(1,2,3,4)
[1] 1 2 3 4
> 1:4
[1] 1 2 3 4
> c("a","b","c")
[1] "a" "b" "c"
> c(T,F,T)
[1]  TRUE FALSE  TRUE
Matrix
> matrix(c(1,2,3,4),ncol=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> matrix(c(1,2,3,4),ncol=2,byrow=T)
     [,1] [,2]
[1,]    1    2
[2,]    3    4
List
> list(12, "twelve")
[[1]]
[1] 12
[[2]]
[1] "twelve"
> list(1,2,3)
[[1]]
[1] 1
[[2]]
[1] 2
[[3]]
[1] 3
Data frame
> name <- c("A","B","C")
> age <- c(30,17,42)
> male <- c(T,F,F)
> data.frame(name, age, male)
  name age  male
1    A  30  TRUE
2    B  17 FALSE
3    C  42 FALSE
http://sillycat.iteye.com/blog/2240148
http://sillycat.iteye.com/blog/2240395
http://sillycat.iteye.com/blog/2240407
http://sillycat.iteye.com/blog/2240494
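runif(n, min=0, max=1) generates n random numbers uniformly distributed between min and max. Here it adds noise to a straight line before fitting a linear model.
> x <- 1:100
> y <- 1:100 + runif(100,0,20)
> m <- lm(y~x)
> plot(y~x)
> abline(m$coefficients)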
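The fit can be sanity-checked without a plot. Since y is x plus uniform noise on [0, 20], the slope should come out roughly 1 and the intercept roughly 10 (the mean of the noise). The sketch below fixes a seed (an arbitrary choice, only to make the run repeatable) and compares the coefficients from lm with the closed-form least-squares estimates.

```r
set.seed(42)  # arbitrary seed, not part of the original example
x <- 1:100
y <- 1:100 + runif(100, 0, 20)

m <- lm(y ~ x)

# Closed-form least-squares estimates for a single predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Both lines should print the same two numbers
print(coef(m))
print(c(intercept = intercept, slope = slope))
```

Passing the model itself, as in abline(m), draws the same line as abline(m$coefficients).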
R is single-threaded and can only process data sets that fit on a single machine.
SparkR allows users to interactively run jobs from the R shell on a cluster.
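Famous Word Count Example
Start the shell, read README.md as an RDD of lines, and map nchar over it. Note that this snippet returns the number of characters in each line rather than a count of words.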
> bin/sparkR --master local[2]
> rdd <- SparkR:::textFile(sc, 'README.md')
> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)
[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78
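For comparison, a word count in the usual sense splits each line into words and tallies them. The sketch below does this in plain local R, not through the SparkR RDD API, and the input lines are hypothetical stand-ins for the contents of README.md.

```r
# Hypothetical input; in the shell example the lines come from README.md
lines <- c("Apache Spark", "",
           "Spark is a fast and general cluster computing system")

# Character count per line, the same quantity nchar gives in the RDD example
print(nchar(lines))

# Split each line on whitespace and drop empty strings
words <- unlist(strsplit(lines, "\\s+"))
words <- words[words != ""]

# Tally the occurrences of each word, most frequent first
counts <- sort(table(words), decreasing = TRUE)
print(counts)
```

On a cluster the split and the tally would be expressed as distributed map and reduce steps; the local version only shows what is being computed.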
Naive Bayes is a supervised machine learning method. Here it classifies texts based on word frequency.
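A minimal sketch of such a classifier in plain R follows. It is not the code from the linked SparkR example: the training texts, the labels, and the helper names tokenize and classify are all hypothetical, and add-one (Laplace) smoothing is an assumption added here so that unseen word and category combinations do not get probability zero.

```r
# Hypothetical training data: short texts with known labels
train <- data.frame(
  text  = c("cheap pills buy now", "buy cheap watches",
            "meeting at noon", "lunch meeting tomorrow"),
  label = c("spam", "spam", "ham", "ham"),
  stringsAsFactors = FALSE
)

# Lower-case a string and split it into words
tokenize <- function(s) {
  w <- unlist(strsplit(tolower(s), "\\s+"))
  w[w != ""]
}

vocab  <- unique(unlist(lapply(train$text, tokenize)))
labels <- unique(train$label)

# Prior P(C): share of training texts in each category
prior <- sapply(labels, function(l) mean(train$label == l))

# P(word | C) from word frequencies, with add-one smoothing;
# result is a matrix with one row per word and one column per category
word_prob <- sapply(labels, function(l) {
  w <- unlist(lapply(train$text[train$label == l], tokenize))
  n <- as.vector(table(factor(w, levels = vocab)))
  setNames((n + 1) / (sum(n) + length(vocab)), vocab)
})

# Score each category with log P(C) + sum of log P(word | C);
# logs replace the product to avoid numeric underflow.
# Words outside the training vocabulary are ignored.
classify <- function(s) {
  w <- tokenize(s)
  w <- w[w %in% vocab]
  score <- log(prior) + colSums(log(word_prob[w, , drop = FALSE]))
  names(which.max(score))
}

print(classify("buy cheap pills"))
print(classify("lunch meeting"))
```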
References:
http://www.iteblog.com/archives/1385
http://spark.apache.org/docs/latest/sparkr.html
https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd
https://github.com/BIDS/sparkR-demo
http://ampcamp.berkeley.edu/5/exercises/sparkr.html
https://github.com/kiendang/sparkr-naivebayes-example
Naive Bayes:
http://www.cnblogs.com/leoo2sk/archive/2010/09/17/1829190.html
http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html
Algorithms:
http://www.ruanyifeng.com/blog/algorithm/