SparkR是一个R语言包,它提供了轻量级的方式使得可以在R语言中使用Apache Spark。在Spark 1.4中,SparkR实现了分布式的data frame,支持类似查询、过滤以及聚合的操作(类似于R中的data frames:dplyr),但是这个可以操作大规模的数据集。
从SparkContext和SQLContext开始
SparkContext是SparkR的切入点,它使得你的R程序和Spark集群互通。你可以通过sparkR.init来构建SparkContext,然后可以传入类似于应用程序名称的选项给它。如果想使用DataFrames,我们得创建SQLContext,这个可以通过SparkContext来构造。如果你使用SparkR shell, SQLContext 和SparkContext会自动地构建好。
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
创建DataFrames
df <- createDataFrame(sqlContext, faithful)
# Displays the content of the DataFrame to stdout
head(df)
## eruptions waiting
##1 3.600 79
##2 1.800 54
##3 3.333 74
通过Data Sources构造
people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
head(people)
## age name
##1 NA Michael
##2 30 Andy
##3 19 Justin
# SparkR automatically infers the schema from the JSON file
printSchema(people)
# root
# |-- age: integer (nullable = true)
# |-- name: string (nullable = true)
Data sources API还可以将DataFrames保存成多种的文件格式,比如我们可以通过write.df将上面的DataFrame保存成Parquet文件:
write.df(people, path="people.parquet", source="parquet", mode="overwrite")
通过Hive tables构造
# sc is an existing SparkContext.
hiveContext <- sparkRHive.init(sc)
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sql(hiveContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
# Queries can be expressed in HiveQL.
results <- hiveContext.sql("FROM src SELECT key, value")
# results is now a DataFrame
head(results)
## key value
## 1 238 val_238
## 2 86 val_86
## 3 311 val_311
# Create the DataFrame
df <- createDataFrame(sqlContext, faithful)
# Get basic information about the DataFrame
df
## DataFrame[eruptions:double, waiting:double]
# Select only the "eruptions" column
head(select(df, df$eruptions))
## eruptions
##1 3.600
##2 1.800
##3 3.333
# You can also pass in column name as strings
head(select(df, "eruptions"))
# Filter the DataFrame to only retain rows with wait times shorter than 50 mins
head(filter(df, df$waiting < 50))
## eruptions waiting
##1 1.750 47
##2 1.750 47
##3 1.867 48
Grouping和Aggregation
# We use the `n` operator to count the number of times each waiting time appears
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
## waiting count
##1 81 13
##2 60 6
##3 68 1
# We can also sort the output from the aggregation to get the most common waiting times
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
## waiting count
##1 78 15
##2 83 14
##3 81 13
列上面的操作
SparkR提供了大量的函数用于直接对列进行数据处理的操作。
# Convert waiting time from hours to seconds.
# Note that we can assign this to a new column in the same DataFrame
df$waiting_secs <- df$waiting * 60
head(df)
## eruptions waiting waiting_secs
##1 3.600 79 4740
##2 1.800 54 3240
##3 3.333 74 4440
# Load a JSON file
people <- read.df(sqlContext, "./examples/src/main/resources/people.json", "json")
# Register this DataFrame as a table.
registerTempTable(people, "people")
# SQL statements can be run by using the sql method
teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
## name
##1 Justin