Recently, RStudio released the sparklyr package, which lets you connect to Apache Spark from R and provides a complete dplyr backend.
Installation
Install sparklyr via the devtools package:
install.packages("devtools")
devtools::install_github("rstudio/sparklyr")
1. I am not sure whether everyone hits an error here, but when I reached this step I got:
ERROR: dependency 'dplyr' is not available for package 'sparklyr'
I switched between several versions of R and the error persisted, so I downloaded the sparklyr package directly from GitHub and found this note about dplyr in its README.md:
- Connect to [Spark](http://spark.apache.org/) from R. The sparklyr package provides a complete [dplyr](https://github.com/hadley/dplyr) backend.
This means install.packages("dplyr") is not enough; you need the latest development versions:
devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")
Then run
devtools::install_github("rstudio/sparklyr")
again, and this time it completes without errors:
> library(sparklyr)
>
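As a quick sanity check of the dplyr backend the README advertises, here is a minimal sketch; it assumes a local Spark installation is already in place (for example via spark_install(), whose source is shown below):
library(sparklyr)
library(dplyr)
# Connect to a local Spark instance
sc <- spark_connect(master = "local")
# Copy an R data frame into Spark and query it with ordinary dplyr verbs
iris_tbl <- copy_to(sc, iris, "iris")
iris_tbl %>% count(Species)
spark_disconnect(sc)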
2. When trying to view the Spark log or the Spark UI, I got an error: the path for log4j.spark.log does not exist.
Reading the sparklyr source shows that the logging configuration is written into Spark's log4j.properties through the spark_conf_file_set_value function.
A) The installation function
spark_install <- function(version = NULL,
                          hadoop_version = NULL,
                          reset = TRUE,
                          logging = "INFO",
                          verbose = interactive())
{
  # Resolve which Spark/Hadoop build to install and where to put it
  installInfo <- spark_install_find(version, hadoop_version, installedOnly = FALSE, latest = TRUE)

  if (!dir.exists(installInfo$sparkDir)) {
    dir.create(installInfo$sparkDir, recursive = TRUE)
  }

  # Download and unpack the Spark distribution if it is not installed yet
  if (!dir.exists(installInfo$sparkVersionDir)) {
    if (verbose) {
      fmt <- paste(c(
        "Installing Spark %s for Hadoop %s or later.",
        "Downloading from:\n- '%s'",
        "Installing to:\n- '%s'"
      ), collapse = "\n")
      msg <- sprintf(fmt,
                     installInfo$sparkVersion,
                     installInfo$hadoopVersion,
                     installInfo$packageRemotePath,
                     aliased_path(installInfo$sparkVersionDir))
      message(msg)
    }

    download.file(
      installInfo$packageRemotePath,
      destfile = installInfo$packageLocalPath,
      quiet = !verbose
    )

    untar(tarfile = installInfo$packageLocalPath, exdir = installInfo$sparkDir)
    unlink(installInfo$packageLocalPath)

    if (verbose)
      message("Installation complete.")
  } else if (verbose) {
    fmt <- "Spark %s for Hadoop %s or later already installed."
    msg <- sprintf(fmt, installInfo$sparkVersion, installInfo$hadoopVersion)
    message(msg)
  }

  if (!file.exists(installInfo$sparkDir)) {
    stop("Spark version not found.")
  }

  # Write the log4j settings, including the localfile appender that produces
  # log4j.spark.log, into conf/log4j.properties
  if (!identical(logging, NULL)) {
    tryCatch({
      spark_conf_file_set_value(
        installInfo,
        list(
          "log4j.rootCategory" = paste0("log4j.rootCategory=", logging, ",console,localfile"),
          "log4j.appender.localfile" = "log4j.appender.localfile=org.apache.log4j.DailyRollingFileAppender",
          "log4j.appender.localfile.file" = "log4j.appender.localfile.file=log4j.spark.log",
          "log4j.appender.localfile.layout" = "log4j.appender.localfile.layout=org.apache.log4j.PatternLayout",
          "log4j.appender.localfile.layout.ConversionPattern" = "log4j.appender.localfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"),
        reset)
    }, error = function(e) {
      warning("Failed to set logging settings")
    })
  }
}
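The logging argument above is exactly what ends up in log4j.rootCategory. A minimal sketch of calling it (the version strings below are only illustrative, not something this post specifies):
# Install a local Spark build and write an INFO-level log4j configuration,
# including the localfile appender discussed above
spark_install(version = "1.6.2", hadoop_version = "2.6", logging = "INFO")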
B) The log4j configuration function
spark_conf_file_set_value <- function(installInfo, properties, reset) {
  log4jPropertiesPath <- file.path(installInfo$sparkConfDir, "log4j.properties")

  # (Re)create log4j.properties from the bundled template when it is missing
  # or when reset = TRUE
  if (!file.exists(log4jPropertiesPath) || reset) {
    log4jTemplatePath <- file.path(installInfo$sparkConfDir, "log4j.properties.template")
    file.copy(log4jTemplatePath, log4jPropertiesPath, overwrite = TRUE)
  }

  log4jPropertiesFile <- file(log4jPropertiesPath)
  lines <- readr::read_lines(log4jPropertiesFile)
  lines[[length(lines) + 1]] <- ""
  lines[[length(lines) + 1]] <- "# Other settings"

  # Replace a property in place if it already exists, otherwise append it
  lapply(names(properties), function(property) {
    value <- properties[[property]]
    pattern <- paste(property, "=.*", sep = "")
    if (length(grep(pattern, lines)) > 0) {
      lines <<- gsub(pattern, value, lines, perl = TRUE)
    }
    else {
      lines[[length(lines) + 1]] <<- value
    }
  })

  writeLines(lines, log4jPropertiesFile)
  close(log4jPropertiesFile)
}
However, writeLines(lines, log4jPropertiesFile) often fails to write here, and I am not sure why. Changing the second argument of writeLines() from log4jPropertiesFile to file(log4jPropertiesPath), i.e. a fresh connection to the same path, makes the write succeed; a sketch of the workaround follows.
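A minimal sketch of the workaround outside of the package code; the path below is only a placeholder for wherever spark_install() put Spark on your machine:
library(readr)
# Hypothetical path -- adjust to your local Spark installation's conf directory
log4jPropertiesPath <- "~/spark/spark-1.6.2-bin-hadoop2.6/conf/log4j.properties"
lines <- read_lines(log4jPropertiesPath)
# ... patch or append the log4j.appender.localfile.* entries here,
#     as spark_conf_file_set_value() does ...
# Write back through a fresh connection created at the moment of writing,
# rather than reusing the connection object already passed to read_lines()
writeLines(lines, file(log4jPropertiesPath))
With this in place, the configuration is written into log4j.properties, which then looks like this: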
# Set everything to be logged to the console
log4j.rootCategory=INFO, console,localfile
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
# Other settings
log4j.appender.localfile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.localfile.file=log4j.spark.log
log4j.appender.localfile.layout=org.apache.log4j.PatternLayout
log4j.appender.localfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
Changing log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console suppresses the INFO-level messages Spark prints to the console; to see more detail, change INFO to DEBUG instead. Here, log4j.appender.localfile.file is where the log file is written, and it can be pointed at another location such as /home/hadoop/spark.log.
Of course, for convenience you can also just edit Spark's log4j.properties file by hand as described above.
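Once the log file path is valid, the Spark log and the Spark UI from issue 2 should open normally from R; a short check, assuming a local connection:
sc <- spark_connect(master = "local")
# Tail the driver log that the localfile appender writes
spark_log(sc, n = 20)
# Open the Spark UI in the browser
spark_web(sc)
spark_disconnect(sc)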