Prototype development for big data processing or machine learning
Prepare the environment: JDK and Spark must be installed in advance
Download Anaconda
Avoid versions that are too old; they may not work
Install bzip2
The Anaconda installation will fail if bzip2 is missing
yum install -y bzip2
Upload and extract Anaconda
Linux ships with Python by default; installing Anaconda shadows the original Python, but editing .bashrc lets the two Python versions coexist
Make the two Python versions coexist
vim /root/.bashrc
# Add the following lines; adjust the paths to match your own install location
export PATH="/opt/install/anaconda3/bin:$PATH"
alias pyana="/opt/install/anaconda3/bin/python"
alias python="/bin/python"
source /root/.bashrc
Generate the Jupyter Notebook configuration file
In the current user's home directory, run the following command to generate the config file: jupyter notebook --generate-config
Edit the configuration file, but first complete the following step
Start the pyana interactive shell and run the following code
from notebook.auth import passwd
passwd()
# After you set a password at the prompt, a hashed string is generated; save it, since it will be assigned to the c.NotebookApp.password property below
vi ./.jupyter/jupyter_notebook_config.py
c.NotebookApp.allow_root = True
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
c.NotebookApp.password = 'sha1:*****************'  # paste the hash generated above here
c.NotebookApp.port = 7070  # port used for external access
vim /root/.bashrc
export PYSPARK_PYTHON=/opt/install/anaconda3/bin/python3  # Python interpreter used by PySpark
export PYSPARK_DRIVER_PYTHON=/opt/install/anaconda3/bin/jupyter  # driver frontend: launch Jupyter instead of the plain shell
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
ipython_opts="notebook -pylab inline"
Apply the environment variables: source /root/.bashrc
Note: turn off the firewall (or open port 7070) so the notebook can be reached from outside
At this point the installation is complete
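To verify the setup, running the pyspark command should now open Jupyter on port 7070; a minimal first-cell check (a sketch, not tied to this particular setup):
from pyspark import SparkContext
sc = SparkContext.getOrCreate()           # create (or reuse) the SparkContext
print(sc.version)                         # Spark version the notebook is attached to
print(sc.parallelize(range(10)).sum())    # should print 45 if the workers are reachable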
PySpark
Core Classes (a brief sketch creating a few of these entry points follows the list):
pyspark.SparkContext
pyspark.RDD
pyspark.sql.SQLContext
pyspark.sql.DataFrame
pyspark.streaming
pyspark.streaming.StreamingContext
pyspark.streaming.DStream
pyspark.ml
pyspark.mllib
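A brief sketch of how a few of these entry points are obtained (SQLContext is superseded by SparkSession in newer Spark versions):
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext
sc = SparkContext.getOrCreate()
sqlc = SQLContext(sc)              # DataFrame / SQL entry point
ssc = StreamingContext(sc, 5)      # DStream entry point with a 5-second batch interval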
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
# Not supported in PySpark (Scala-only)
makeRDD()
# Supported (a short sketch follows this list)
parallelize()
textFile()
wholeTextFiles()
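A short sketch of the three supported entry points (the file paths below are hypothetical placeholders):
nums = sc.parallelize([1, 2, 3, 4, 5])                 # RDD from a local collection
lines = sc.textFile("file:///root/example/data.txt")   # one element per line (hypothetical path)
pairs = sc.wholeTextFiles("file:///root/example/")     # (filename, content) pairs (hypothetical path)
print(nums.count())                                    # 5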
# Scala
val a=sc.parallelize(List("dog","tiger","lion","cat","panther","eagle"))
val b=a.map(x=>(x,1))
b.collect
# Python
a = sc.parallelize(("dog", "tiger", "lion", "cat", "panther", "eagle"))
b = a.map(lambda x: (x, 1))
b.collect()
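Continuing the Python example, the (word, 1) pairs can be aggregated, for instance with reduceByKey (a sketch):
counts = b.reduceByKey(lambda x, y: x + y)   # sum the counts per key
counts.collect()                             # each word appears once here, so every count is 1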
addFile(path, recursive=False)  # an addFile sketch follows the addPyFile example below
addPyFile(path)
vi sci.py
Write the two functions below, then save and exit  # sci.py
def square(num):
    return num * num
def circle_area(r):
    return 3.14 * square(r)
# Ship the file with the helper functions to the executors
sc.addPyFile("file:///root/sci.py")
# Import the function defined in that file
from sci import circle_area
# Create an RDD and apply the imported function
sc.parallelize([5, 9, 21]).map(lambda x: circle_area(x)).collect()
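addFile works the same way for plain data files; on the workers the shipped copy is located with SparkFiles.get. A minimal sketch (the file name is hypothetical):
from pyspark import SparkFiles
sc.addFile("file:///root/lookup.txt")   # hypothetical data file
path = SparkFiles.get("lookup.txt")     # local path of the shipped copy
print(open(path).read())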
from pyspark.sql import SparkSession
ss = SparkSession.builder.getOrCreate()
ss.read.format("csv").option("header", "true").load("file:///xxx.csv")
Demo data: sample rows of LifeExpentancy.txt (country, life expectancy, region)
Afghanistan 48.673000 SAs
Albania 76.918000 EuCA
Algeria 73.131000 MENA
Angola 51.093000 SSA
Argentina 75.901000 Amer
Armenia 74.241000 EuCA
Aruba 75.246000 Amer
Australia 81.907000 EAP
Austria 80.854000 EuCA
Azerbaijan 70.739000 EuCA
Bahamas 75.620000 Amer
Bahrain 75.057000 MENA
Bangladesh 68.944000 SAs
Barbados 76.835000 Amer
Belarus 70.349000 EuCA
Belgium 80.009000 EuCA
Belize 76.072000 Amer
Benin 56.081000 SSA
Bhutan 67.185000 SAs
Bolivia 66.618000 Amer
Bosnia_and_Herzegovina 75.670000 EuCA
Botswana 53.183000 SSA
Brazil 73.488000 Amer
Brunei 78.005000 EAP
Bulgaria 73.371000 EuCA
Burkina_Faso 55.439000 SSA
Burundi 50.411000 SSA
Cambodia 63.125000 EAP
Cameroon 51.610000 SSA
Canada 81.012000 Amer
Cape_Verde 74.156000 SSA
Central_African_Rep. 48.398000 SSA
Chad 49.553000 SSA
Channel_Islands 80.055000 EuCA
Chile 79.120000 Amer
China 73.456000 EAP
Colombia 73.703000 Amer
Comoros 61.061000 SSA
Congo_Dem._Rep. 48.397000 SSA
Congo_Rep. 57.379000 SSA
Costa_Rica 79.311000 Amer
Cote_d'Ivoire 55.377000 SSA
Croatia 76.640000 EuCA
Cuba 79.143000 Amer
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType
# Create the SparkSession
ss = SparkSession.builder.getOrCreate()
# Read the local CSV file and name each column
# In PySpark a statement that spans multiple lines needs a trailing backslash
df = ss.read.format("csv").option("delimiter", " ").load("file:///root/example/LifeExpentancy.txt") \
.withColumn("Country", col("_c0")) \
.withColumn("LifeExp", col("_c2").cast(DoubleType())) \
.withColumn("Region", col("_c4")) \
.select(col("Country"), col("LifeExp"), col("Region"))
df.describe("LifeExp").show()
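With the columns named, the usual DataFrame operations apply, e.g. per-region aggregation (a sketch built on the df above):
df.groupBy("Region").avg("LifeExp").show()                            # mean life expectancy per region
df.filter(col("LifeExp") > 80).select("Country", "LifeExp").show()    # countries above 80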
1. Data analysis with Pandas
# Pandas DataFrame to Spark DataFrame
spark.createDataFrame(pandas_df)
# Spark DataFrame to Pandas DataFrame
spark_df.toPandas()
2. Data visualization with Matplotlib
3. Machine learning with Scikit-learn (a sketch appears at the end of this section)
# Pandas DataFrame to Spark DataFrame
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
pandas_df = pd.read_csv("./products.csv", header=None, usecols=[1, 3, 5])
print(pandas_df)
# convert to Spark DataFrame
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()
df2 = spark_df.withColumnRenamed("1", "id").withColumnRenamed("3", "name").withColumnRenamed("5", "remark")
# convert back to Pandas DataFrame
df2.toPandas()
# Reuse the df object created in the demo above
rdd = df.select("LifeExp").rdd.map(lambda x: x[0])
# Split the data into 10 buckets and count the values in each
(countries, bins) = rdd.histogram(10)
print(countries)
print(bins)
# Import the plotting packages
import matplotlib.pyplot as plt
import numpy as np
plt.hist(rdd.collect(), 10) # by default the # of bins is 10
plt.title("Life Expectancy Histogram")
plt.xlabel("Life Expectancy")
plt.ylabel("# of Countries")
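For point 3 above, a minimal Scikit-learn sketch (for illustration): pull the Spark DataFrame to the driver with toPandas and cluster the countries by life expectancy.
from sklearn.cluster import KMeans
pdf = df.toPandas()                                    # df is the life-expectancy DataFrame built above
model = KMeans(n_clusters=3, n_init=10, random_state=0)
pdf["cluster"] = model.fit_predict(pdf[["LifeExp"]])   # three rough bands of life expectancy
print(pdf.groupby("cluster")["LifeExp"].mean())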