Spark in Practice (1): Setting Up AWS EMR and a Zeppelin Notebook

SparkContext vs. SparkSession: what is the difference, and how do you obtain each?

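  • SparkContext:
    • The main entry point before Spark 2.0.0
    • Connects to the cluster through a resource manager such as YARN
    • Is created by passing in a SparkConf object
    • To use the SQL, Hive, or Streaming APIs, a separate Context (SQLContext, HiveContext, StreamingContext) must be created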
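    • Example (Scala):
        import org.apache.spark.{SparkConf, SparkContext}

        // Describe the application and the cluster it connects to
        val conf = new SparkConf()
          .setAppName("RetailDataAnalysis")
          .setMaster("spark://master:7077")
          .set("spark.executor.memory", "2g")

        val sc = new SparkContext(conf)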
      
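  • SparkSession:
    • Available since Spark 2.0.0 and the recommended entry point
    • Gives access to all of Spark's functionality and also exposes the DataFrame and Dataset APIs
    • No separate Context is needed for SQL, Hive, or Streaming
    • Runtime properties can be set after the session is initialized; launch-time properties such as executor memory go on the builder
       import org.apache.spark.sql.SparkSession

       // Creating the Spark session:
       val spark = SparkSession
         .builder
         .appName("WorldBankIndex")
         .config("spark.executor.memory", "2g") // launch-time setting
         .getOrCreate()

       // Configuring runtime properties on the live session:
       spark.conf.set("spark.sql.shuffle.partitions", 6)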
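
Because the Zeppelin notebook later in this post uses PySpark, the session setup above can also be sketched in Python. This is a minimal sketch, not the only way to do it: `SESSION_CONF` and `create_session` are hypothetical names introduced here for illustration, and the property values are simply the ones from the Scala snippet.

```python
# Settings from the Scala example, keyed by Spark property name.
SESSION_CONF = {
    "spark.executor.memory": "2g",        # launch-time setting
    "spark.sql.shuffle.partitions": "6",  # can also be changed at runtime
}


def create_session(app_name, conf):
    """Build (or reuse) a SparkSession with the given properties."""
    # Imported lazily so this file still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        # Each property is applied before the session is created.
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a standalone script, `create_session("WorldBankIndex", SESSION_CONF)` would return the session. Inside Zeppelin on EMR the `spark` object already exists, so a helper like this is only needed outside the notebook.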
      

Setting up AWS EMR

# 1. Open the AWS console
# 2. Go to the EMR service
# 3. Create a cluster
# 4. Go to advanced options
# 5. Release: emr-5.11.1
# 6. Hadoop: 2.7.3
# 7. Zeppelin: 0.7.3
# 8. Spark: 2.2.1
# 9. Choose spot pricing for the instances to reduce cost
# 10. Create your key pair, download it, and run chmod 400 on it
# 11. Add inbound security group rules: port 22 for SSH, port 8890 for Zeppelin
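
The console steps above could also be scripted. The sketch below only builds the request for boto3's EMR `run_job_flow` call, using the release label and applications listed in the steps. The cluster name, the `m4.large` instance type, the core node count, and the default role names `EMR_EC2_DefaultRole` and `EMR_DefaultRole` are assumptions, not values from this post. The security group rules from step 11 are not covered here.

```python
def build_emr_request(key_name, instance_type="m4.large", core_count=2):
    """Return keyword arguments for boto3's EMR run_job_flow call."""

    def group(name, role, count):
        # Every instance group requests spot capacity (step 9).
        return {
            "Name": name,
            "InstanceRole": role,
            "Market": "SPOT",
            "InstanceType": instance_type,  # assumed instance type
            "InstanceCount": count,
        }

    return {
        "Name": "spark-zeppelin-demo",  # hypothetical cluster name
        "ReleaseLabel": "emr-5.11.1",
        "Applications": [{"Name": n} for n in ("Hadoop", "Zeppelin", "Spark")],
        "Instances": {
            "InstanceGroups": [
                group("Master", "MASTER", 1),
                group("Core", "CORE", core_count),
            ],
            "Ec2KeyName": key_name,  # the key pair from step 10
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for Zeppelin
        },
        # Assumed default roles; they must already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }
```

With boto3 installed and AWS credentials configured, the request could be submitted with `boto3.client("emr").run_job_flow(**build_emr_request("my-key"))`, where `my-key` stands in for your own key pair name.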

Creating a Zeppelin notebook

# 1. Open http://<master-node-public-DNS>:8890 in a browser
# 2. Create a new note
# 3. Default interpreter: spark
# 4. Begin the paragraph with %pyspark to select the PySpark interpreter
%pyspark
# with the interpreter selected, you can run Python code in Zeppelin
for i in [1, 2, 3]:
    print(i)

# the spark context is already set
sc

# the spark session is already set
spark

# read a file from AWS S3 (MyaccessKey, SecretKey, and bucketname are placeholders)
df = spark.read.csv("s3n://MyaccessKey:SecretKey@bucketname/file.csv")
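
Putting the access key and secret key in the URI works, but it leaves credentials in the notebook. On EMR the cluster nodes normally run under an IAM instance profile, so a possible alternative is to use the `s3://` scheme and omit the keys. That assumes the instance role has read access to the bucket. The helper names `s3_uri` and `read_csv_from_s3` below are hypothetical, and `header=True` assumes the file has a header row.

```python
def s3_uri(bucket, key, scheme="s3"):
    """Build an S3 URI without embedding any credentials."""
    return "{}://{}/{}".format(scheme, bucket, key.lstrip("/"))


def read_csv_from_s3(spark, bucket, key, header=True):
    """Read a CSV object through the given SparkSession."""
    # header=True treats the first row as column names; inferSchema makes
    # Spark scan the data once to guess column types.
    return spark.read.csv(s3_uri(bucket, key), header=header, inferSchema=True)
```

In a `%pyspark` paragraph this would be called as `df = read_csv_from_s3(spark, "bucketname", "file.csv")`, reusing the `spark` session that Zeppelin has already created.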
