SparkR(1)Naive Bayesian

1. Naive Bayesian
P(A|B) = P(B|A) P(A) / P(B)

Features - F1, F2, … Fn
Category - C1, C2, … Cm

P(C|F1F2…Fn) = P(F1F2 … Fn|C)P(C) / P(F1F2…Fn)

Assuming the features are conditionally independent given the category (the "naive" assumption):
P(F1F2…Fn|C)P(C) = P(F1|C)P(F2|C) … P(Fn|C)P(C)
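
The formula above can be tried out with a few numbers. This is only a minimal sketch in plain R: the two categories, the two features and all probabilities are made-up values for illustration. Since P(F1F2…Fn) is the same for every category, it is enough to compute P(F1|C)P(F2|C)P(C) for each category and normalize.

```r
# Hypothetical priors P(C) for two categories (made-up numbers)
prior <- c(spam = 0.4, ham = 0.6)

# Hypothetical likelihoods P(Fi|C) for two observed features (made-up numbers)
likelihood <- rbind(
  F1 = c(spam = 0.8, ham = 0.1),
  F2 = c(spam = 0.5, ham = 0.3)
)

# Numerator of Bayes' rule under the naive assumption: P(F1|C) P(F2|C) P(C)
score <- apply(likelihood, 2, prod) * prior

# Dividing by the sum over categories plays the role of P(F1F2)
posterior <- score / sum(score)

# Pick the category with the largest posterior probability
predicted <- names(which.max(posterior))

print(posterior)
print(predicted)
```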

2. Prepare the Environment
spark-1.4.1
I just downloaded the latest version and put it on my PATH
http://mirror.nexcess.net/apache//spark/spark-1.4.1/spark-1.4.1-bin-hadoop2.6.tgz

R-3.2.2
http://sillycat.iteye.com/blog/2240148

>r --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

RStudio version 0.99.473
http://sillycat.iteye.com/blog/2240148

3. Start the SparkR Shell
> bin/sparkR --master local[2]

We can paste code from this example directly into the shell
https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R

4. Execute R script in SparkR
https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd

https://github.com/apache/spark/blob/master/examples/src/main/r/dataframe.R

https://github.com/apache/spark/blob/master/examples/src/main/resources/people.json

> bin/spark-submit examples/src/main/r/dataframe.R

5. Run the R Code in RStudio

Install JDK 1.6 on my Mac
https://support.apple.com/kb/DL1572?locale=en_US

I downloaded the file from here.
http://supportdownload.apple.com/download.info.apple.com/Apple_Support_Area/Apple_Software_Updates/Mac_OS_X/downloads/031-29055.20150831-0f779fb2-4bf4-11e5-a8d8-/javaforosx.dmg

Unpack the Spark binary distribution and move it to /opt/spark
> tar zxvf spark-1.4.1-bin-hadoop2.6.tgz
> mv spark-1.4.1-bin-hadoop2.6 /opt/spark

And this sample R code can be run in RStudio
## install all the related packages
mypkgs <- c("dplyr", "ggplot2", "magrittr")
install.packages(mypkgs)

Sys.setenv(JAVA_HOME="/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home") # my JDK path on Mac
library("rJava")

## SparkR ships inside the Spark distribution under R/lib, so load it from there
## instead of calling install.packages() on the Spark .tgz, which is not an R package
Sys.setenv(SPARK_HOME="/opt/spark")
library("SparkR", lib.loc="/opt/spark/R/lib")

sc <- sparkR.init(master = "local", appName = "SparkR_demo_RTA",
                  sparkHome = "/opt/spark")

sqlContext <- sparkRSQL.init(sc)

hiveContext <- sparkRHive.init(sc)

path <- file.path(Sys.getenv("SPARK_HOME"),
                  "examples/src/main/resources/people.json")

peopleDF <- jsonFile(sqlContext, path)

printSchema(peopleDF)
head(peopleDF)

6. Further Example
https://github.com/kiendang/sparkr-naivebayes-example

http://www.slideshare.net/KienDang5/introduction-to-sparkr

Data Types of R language
Vector
> c(1,2,3,4)
[1] 1 2 3 4
> 1:4
[1] 1 2 3 4
> c("a","b","c")
[1] "a" "b" "c"
> c(T,F,T)
[1]  TRUE FALSE  TRUE

Matrix
> matrix(c(1,2,3,4),ncol=2)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> matrix(c(1,2,3,4),ncol=2,byrow=T)
     [,1] [,2]
[1,]    1    2
[2,]    3    4

List
> list(12, "twelve")
[[1]]
[1] 12

[[2]]
[1] "twelve"

> list(1,2,3)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

Data frame
> name <-c("A","B","C")
> age <- c(30,17,42)
> male <- c(T,F,F)
> data.frame(name, age, male)
  name age  male
1    A  30  TRUE
2    B  17 FALSE
3    C  42 FALSE

http://sillycat.iteye.com/blog/2240148
http://sillycat.iteye.com/blog/2240395

http://sillycat.iteye.com/blog/2240407

http://sillycat.iteye.com/blog/2240494

runif(n, min=0, max=1) generates n random numbers from a uniform distribution between min and max
> x <- 1:100
> y <- 1:100 + runif(100,0,20)
> m <- lm(y~x)
> plot(y~x)
> abline(m$coefficients)
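
A quick sanity check on the fit, as a sketch only: the noise runif(100,0,20) has a mean of 10, so the fitted line should come out close to y = 10 + x. The fixed seed is my own addition to make the run repeatable.

```r
# Assumption: a fixed seed, only so that repeated runs give the same numbers
set.seed(42)

x <- 1:100
y <- x + runif(100, 0, 20)   # uniform noise between 0 and 20, mean 10

m <- lm(y ~ x)

# Intercept should be roughly 10 and the slope roughly 1
print(coef(m))
```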

R is single-threaded and can only process data sets that fit in the memory of a single machine.

SparkR allows users to interactively run jobs from the R shell on a cluster.

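Famous Word Count Example
Start the shell. Note that this first step only maps nchar over the lines, so it counts the characters in each line, not the words.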
> bin/sparkR --master local[2]

> rdd <- SparkR:::textFile(sc, 'README.md')

> counts <- SparkR:::map(rdd, nchar)
> SparkR:::take(counts, 3)

[[1]]
[1] 14
[[2]]
[1] 0
[[3]]
[1] 78
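
For comparison, here is a rough local sketch in plain R (no Spark involved) of both computations: the per-line nchar used above and an actual word count. The sample lines and variable names are made up for illustration and are not the contents of README.md.

```r
# Hypothetical sample lines standing in for a text file
text_lines <- c("to be or not to be", "", "that is the question")

# Same per-line computation as mapping nchar over the RDD
chars_per_line <- nchar(text_lines)

# An actual word count: split on whitespace, drop empty tokens, tally
words <- unlist(strsplit(text_lines, "[[:space:]]+"))
words <- words[words != ""]
word_counts <- table(words)

print(chars_per_line)
print(word_counts)
```

On a cluster the same split-and-tally idea would be expressed as distributed map and reduce steps; the local version is only meant to show what is being computed.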

Naive Bayes is a supervised machine learning method. Here it classifies texts based on word frequency.
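
To make the idea concrete, here is a minimal sketch of a word-frequency Naive Bayes classifier in plain R, without Spark. Everything in it is an assumption for illustration: the toy training texts, the helper names tokenize and classify, the add-one (Laplace) smoothing, and the use of log probabilities to avoid underflow. It is not the code from the linked example.

```r
# Hypothetical toy training set (made-up texts and labels)
train <- data.frame(
  text  = c("cheap pills buy now", "buy cheap watches",
            "meeting at noon", "lunch meeting tomorrow"),
  label = c("spam", "spam", "ham", "ham"),
  stringsAsFactors = FALSE
)

# Split a text into lower-case words, dropping empty tokens
tokenize <- function(s) {
  w <- strsplit(tolower(s), "[^a-z]+")[[1]]
  w[w != ""]
}

vocab   <- unique(unlist(lapply(train$text, tokenize)))
classes <- unique(train$label)

# Word counts per class with add-one smoothing (rows: words, columns: classes)
word_counts <- sapply(classes, function(cl) {
  words <- unlist(lapply(train$text[train$label == cl], tokenize))
  tabulate(match(words, vocab), nbins = length(vocab)) + 1
})
rownames(word_counts) <- vocab

# log P(word|C) and log P(C)
log_lik   <- log(sweep(word_counts, 2, colSums(word_counts), "/"))
log_prior <- log(sapply(classes, function(cl) mean(train$label == cl)))

# Score = log P(C) + sum of log P(word|C); words outside the vocabulary are ignored
classify <- function(s) {
  words <- tokenize(s)
  words <- words[words %in% vocab]
  scores <- log_prior + colSums(log_lik[words, , drop = FALSE])
  names(which.max(scores))
}

print(classify("cheap watches now"))
print(classify("lunch meeting"))
```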

References:
http://www.iteblog.com/archives/1385
http://spark.apache.org/docs/latest/sparkr.html

https://github.com/math-and-data/SparkR/blob/master/Demo_of_SparkR.Rmd

https://github.com/BIDS/sparkR-demo

http://ampcamp.berkeley.edu/5/exercises/sparkr.html

https://github.com/kiendang/sparkr-naivebayes-example

naive bayesian
http://www.cnblogs.com/leoo2sk/archive/2010/09/17/1829190.html
http://www.ruanyifeng.com/blog/2013/12/naive_bayes_classifier.html

algorithm
http://www.ruanyifeng.com/blog/algorithm/
