Several ERRORs encountered while installing the sparklyr package

Recently, RStudio released the sparklyr package, which provides the following features:

  • Connect to Spark from R: the sparklyr package provides a complete dplyr backend (a short usage sketch follows this list)
  • Filter and aggregate Spark datasets, then bring the results into R for analysis and visualization
  • Run distributed machine learning algorithms from R through Spark's MLlib library
  • Create extensions that call the Spark API and provide interfaces to the full set of Spark packages
  • Integrated support for Spark and sparklyr in the RStudio IDE is planned for the future
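
To make the dplyr-backend point concrete, here is a minimal sketch of the typical workflow, assuming a local Spark installation (the iris example and the underscored column names, which copy_to() produces from the original dotted names, are purely illustrative):

library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (a cluster master URL works the same way)
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark and query it with ordinary dplyr verbs
iris_tbl <- copy_to(sc, iris, "iris", overwrite = TRUE)
iris_tbl %>%
  filter(Sepal_Length > 5) %>%
  group_by(Species) %>%
  summarise(avg_petal = mean(Petal_Length)) %>%
  collect()                 # bring the aggregated result back into R

spark_disconnect(sc)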

Installation

Install the sparklyr package via devtools:

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")

1. I am not sure whether others run into trouble here, but at this step the installation failed for me with:

ERROR: dependency 'dplyr' is not available for package 'sparklyr'

Switching between several versions of R did not help, so I downloaded the sparklyr package directly from GitHub and opened its README.md, which says the following about dplyr:

-   Connect to [Spark](http://spark.apache.org/) from R. The sparklyr package provides a complete [dplyr](https://github.com/hadley/dplyr) backend.

This means that install.packages("dplyr") is not sufficient; the latest development version is needed:

devtools::install_github("hadley/lazyeval")
devtools::install_github("hadley/dplyr")

Then run the sparklyr installation again:

devtools::install_github("rstudio/sparklyr")

and this time it succeeds without errors:

> library(sparklyr)
> 

2. When viewing the Spark log or the Spark UI, I got an error: the path log4j.spark.log does not exist.
Reading the sparklyr source shows that the logging configuration is written out via the spark_conf_file_set_value function.
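
For reference, the error shows up when asking sparklyr for the driver log or the Spark UI, roughly like this (a minimal sketch assuming a local connection; spark_log() and spark_web() are the sparklyr helpers involved):

library(sparklyr)
sc <- spark_connect(master = "local")
spark_log(sc, n = 50)   # tail of the driver log; this is where the missing log4j.spark.log path surfaces
spark_web(sc)           # opens the Spark UI in a browser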

A) The installation function

spark_install <- function(version = NULL,
                          hadoop_version = NULL,
                          reset = TRUE,
                          logging = "INFO",
                          verbose = interactive())
{
  installInfo <- spark_install_find(version, hadoop_version, installedOnly = FALSE, latest = TRUE)

  if (!dir.exists(installInfo$sparkDir)) {
    dir.create(installInfo$sparkDir, recursive = TRUE)
  }

  if (!dir.exists(installInfo$sparkVersionDir)) {

    if (verbose) {

      fmt <- paste(c(
        "Installing Spark %s for Hadoop %s or later.",
        "Downloading from:\n- '%s'",
        "Installing to:\n- '%s'"
      ), collapse = "\n")

      msg <- sprintf(fmt,
                     installInfo$sparkVersion,
                     installInfo$hadoopVersion,
                     installInfo$packageRemotePath,
                     aliased_path(installInfo$sparkVersionDir))

      message(msg)
    }

    download.file(
      installInfo$packageRemotePath,
      destfile = installInfo$packageLocalPath,
      quiet = !verbose
    )

    untar(tarfile = installInfo$packageLocalPath, exdir = installInfo$sparkDir)
    unlink(installInfo$packageLocalPath)

    if (verbose)
      message("Installation complete.")

  } else if (verbose) {
    fmt <- "Spark %s for Hadoop %s or later already installed."
    msg <- sprintf(fmt, installInfo$sparkVersion, installInfo$hadoopVersion)
    message(msg)
  }

  if (!file.exists(installInfo$sparkDir)) {
    stop("Spark version not found.")
  }

  if (!identical(logging, NULL)) {
    tryCatch({
      spark_conf_file_set_value(
        installInfo,
        list(
          "log4j.rootCategory" = paste0("log4j.rootCategory=", logging, ",console,localfile"),
          "log4j.appender.localfile" = "log4j.appender.localfile=org.apache.log4j.DailyRollingFileAppender",
          "log4j.appender.localfile.file" = "log4j.appender.localfile.file=log4j.spark.log",
          "log4j.appender.localfile.layout" = "log4j.appender.localfile.layout=org.apache.log4j.PatternLayout",
          "log4j.appender.localfile.layout.ConversionPattern" = "log4j.appender.localfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"),
        reset)
    }, error = function(e) {
      warning("Failed to set logging settings")
    })
  }
}
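
The logging argument is what ends up in log4j.rootCategory above, so an install call that sets it explicitly looks roughly like this (the Spark version is only an illustrative example):

library(sparklyr)
# Downloads a local Spark distribution and writes the log4j settings shown above into its conf/ directory
spark_install(version = "1.6.2", logging = "INFO")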

B) The logging configuration function

spark_conf_file_set_value <- function(installInfo, properties, reset) {
  log4jPropertiesPath <- file.path(installInfo$sparkConfDir, "log4j.properties")
  if (!file.exists(log4jPropertiesPath) || reset) {
    log4jTemplatePath <- file.path(installInfo$sparkConfDir, "log4j.properties.template")
    file.copy(log4jTemplatePath, log4jPropertiesPath, overwrite = TRUE)
  }

  log4jPropertiesFile <- file(log4jPropertiesPath)
  lines <- readr::read_lines(log4jPropertiesFile)

  lines[[length(lines) + 1]] <- ""
  lines[[length(lines) + 1]] <- "# Other settings"

  lapply(names(properties), function(property) {
    value <- properties[[property]]
    pattern <- paste(property, "=.*", sep = "")

    if (length(grep(pattern, lines)) > 0) {
      lines <<- gsub(pattern, value, lines, perl = TRUE)
    }
    else {
      lines[[length(lines) + 1]] <<- value
    }
  })

  writeLines(lines, log4jPropertiesFile)
  close(log4jPropertiesFile)
}

However, writeLines(lines, log4jPropertiesFile) often fails to write anything for me, and I have not figured out why. Replacing log4jPropertiesFile in that call with file(log4jPropertiesPath), i.e. the value that log4jPropertiesFile was created from, does write the configuration into log4j.properties.
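
A minimal sketch of that workaround, based only on the change described above (when given an unopened connection, writeLines opens it for writing and closes it again itself):

# Last two lines of spark_conf_file_set_value(), patched:
#   writeLines(lines, log4jPropertiesFile)   # original call, which often wrote nothing for me
writeLines(lines, file(log4jPropertiesPath)) # hand writeLines a fresh connection to the same path
close(log4jPropertiesFile)                   # still close the connection that was opened for reading

With this change in place, log4j.properties ends up with the following content: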

# Set everything to be logged to the console
log4j.rootCategory=INFO, console,localfile
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Settings to quiet third party logs that are too verbose
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

# Other settings
log4j.appender.localfile=org.apache.log4j.DailyRollingFileAppender
log4j.appender.localfile.file=log4j.spark.log
log4j.appender.localfile.layout=org.apache.log4j.PatternLayout
log4j.appender.localfile.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Changing log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console suppresses the INFO-level messages Spark prints to the console; to see more detailed output, change INFO to DEBUG instead. Here, log4j.appender.localfile.file is where the log file is written, and it can point to another location, such as /home/hadoop/spark.log.
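
For example, a quieter console combined with the custom log location mentioned above would amount to these two lines in log4j.properties:

log4j.rootCategory=WARN, console,localfile
log4j.appender.localfile.file=/home/hadoop/spark.log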

Of course, for convenience you can also just edit Spark's log4j.properties file by hand as described above.
