Reading Data from S3 with Spark

Starting from the Quick Start on the Spark website, with the file source changed to point at S3:
ref: http://spark.apache.org/docs/latest/quick-start.html

package myspark;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class LogAnalyser {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Pass the AWS credentials down to the Hadoop filesystem layer.
        // These property names are specific to the s3n:// connector.
        sc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "YOUR_KEY_ID");
        sc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET");

        // Glob pattern: read every .log object under the bucket.
        String logFile = "s3n://bucket/*.log";

        // Cache the RDD, since it is scanned twice below.
        JavaRDD<String> logData = sc.textFile(logFile).cache();

        long numAs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("a");
            }
        }).count();

        long numBs = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains("b");
            }
        }).count();

        System.out.println("Lines with a: " + numAs + ", lines with b: " + numBs);

        sc.stop();
    }
}
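
Incidentally, the fs.s3n.* credential keys above only apply to the legacy s3n connector. On Hadoop builds that ship the newer s3a connector (Hadoop 2.6 and later, and the scheme the Sparkour recipe linked at the end discusses), the equivalent setup would use different property names and the s3a:// URI scheme. A minimal sketch, assuming the s3a classes are on the classpath:

// Hypothetical s3a equivalents of the s3n settings used above
sc.hadoopConfiguration().set("fs.s3a.access.key", "YOUR_KEY_ID");
sc.hadoopConfiguration().set("fs.s3a.secret.key", "YOUR_SECRET");
String logFile = "s3a://bucket/*.log";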

Package the project as test-0.1.0.jar and submit it to Spark:

$SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--master local[4] build/libs/test-0.1.0.jar

This fails with the following error:

No FileSystem for scheme: s3n

Cause and fix:

This message appears when dependencies are missing from your Apache Spark distribution. If you see this error message, you can use the --packages parameter and Spark will use Maven to locate the missing dependencies and distribute them to the cluster. Alternatively, you can use --jars if you have already downloaded the dependencies manually. These parameters also work with the spark-submit script.

$SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 \
--master local[4] build/libs/test-0.1.0.jar
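
If the jars have already been downloaded by hand, the --jars route mentioned above works as well. A sketch, assuming hadoop-aws 2.7.2 paired with aws-java-sdk 1.7.4 (the matching SDK version for that Hadoop release; the local paths here are placeholders):

$SPARK_HOME/bin/spark-submit --class myspark.LogAnalyser \
--jars /path/to/hadoop-aws-2.7.2.jar,/path/to/aws-java-sdk-1.7.4.jar \
--master local[4] build/libs/test-0.1.0.jar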

Reference for other languages: https://sparkour.urizone.net/recipes/using-s3/
