OS: CentOS 7
Machines: 3 VMs (master 192.168.1.201, slave1 192.168.1.202, slave2 192.168.1.203)
JDK: 1.8.0_121 (jdk-8u121-linux-x64.tar.gz)
Hadoop: 2.9.2 (http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.9.2/hadoop-2.9.2.tar.gz)
I won't go into fine detail here; plenty of tutorials online cover this better than I could.
1、
After installing the Linux VM in VMware, reboot it.
Then, as root, edit the network config to change the IP address:
vi /etc/sysconfig/network-scripts/ifcfg-ens33
2、
Set BOOTPROTO="static"
and add the IP address, gateway, and DNS:
IPADDR="the IP you want to assign"
GATEWAY="192.168.1.2"
DNS1="8.8.8.8"
3、
Save with :wq, then run:
service network restart
If this errors out, run reboot to restart the VM.
4、
Change the hostname: vi /etc/sysconfig/network
Add the cluster entries to the hosts file:
vi /etc/hosts
Reboot the machine; after the reboot, check the hostname to confirm the change took effect.
5、
Edit the Windows 10 hosts file:
(1) Go to C:\Windows\System32\drivers\etc
(2) Open the hosts file and add the following:
192.168.1.201 hadoop201
192.168.1.202 hadoop202
192.168.1.203 hadoop203
6、
Turn off the firewall, then ping the VMs from the command line.
Basic firewall commands:
firewall-cmd --state (check the firewall status)
service firewalld restart (restart)
service firewalld start (start)
service firewalld stop (stop)
To disable it permanently:
systemctl stop firewalld.service (stop the service)
systemctl disable firewalld.service (keep it from starting on boot)
Check whether Java is already installed: rpm -qa | grep java
If the installed version is older than 1.7, remove it: rpm -e <package-name>
Create two directories under /opt:
mkdir software
mkdir module
From inside /opt/software, extract the JDK:
tar -zxvf jdk-8u121-linux-x64.tar.gz -C /opt/module/
Extract the remaining archives the same way:
tar -xvf <mysql-archive-name>
Note the resulting paths:
/opt/module/jdk1.8.0_121
/opt/module/hadoop-2.9.2
Configure the paths globally:
vi /etc/profile
#JAVA_HOME
export JAVA_HOME=/opt/module/jdk1.8.0_121
export PATH=$PATH:$JAVA_HOME/bin
##HADOOP_HOME
export HADOOP_HOME=/opt/module/hadoop-2.9.2
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
Apply the changes: source /etc/profile
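After sourcing the profile, it's worth a quick sanity check that the variables actually landed in the environment. A minimal helper for that (my own sketch, not part of the official setup):

```python
import os

def missing_env(required, env=os.environ):
    """Return the names from `required` that are not set in `env`."""
    return [name for name in required if name not in env]

# After `source /etc/profile`, this should come back empty:
# missing_env(["JAVA_HOME", "HADOOP_HOME"])
```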
That takes care of the Java and Hadoop environment variables. Next, let's finish configuring Hadoop itself.
core-site.xml
<configuration>
<!-- Address of the NameNode in HDFS -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://hadoop201:9000</value>
</property>
<!-- Directory for files Hadoop generates at runtime -->
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/module/hadoop-2.9.2/tmp</value>
</property>
</configuration>
hadoop-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_121
hdfs-site.xml
<configuration>
<!-- Number of block replicas -->
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<!-- Address of the SecondaryNameNode -->
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>hadoop203:50090</value>
</property>
</configuration>
slaves (lists which hosts run as DataNodes)
yarn-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_121
yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- How reducers fetch data -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<!-- Address of the YARN ResourceManager -->
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop202</value>
</property>
</configuration>
mapred-env.sh
export JAVA_HOME=/opt/module/jdk1.8.0_121
mapred-site.xml
<configuration>
<!-- Run MapReduce on YARN -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Clone the configured VM in VMware to create machines 002 and 003 (remember to change each clone's hostname and IP).
Passwordless SSH login
On every machine, run:
ssh-keygen -t rsa
Then push the authorized_keys file from hadoop201 to the other nodes.
On hadoop201, run this command to generate the authorized_keys file:
ssh-copy-id -i /root/.ssh/id_rsa.pub hadoop201
Send authorized_keys to hadoop202 and hadoop203:
scp /root/.ssh/authorized_keys root@hadoop202:/root/.ssh/
scp /root/.ssh/authorized_keys root@hadoop203:/root/.ssh/
From hadoop201, test passwordless login to hadoop202 and hadoop203:
ssh <hostname>
Start the Hadoop cluster
1. Format the NameNode (only on the master): hdfs namenode -format
2. Start the cluster: on the master, run start-all.sh
If the ResourceManager fails to come up at startup, turn off the firewall and run:
sbin/yarn-daemon.sh start resourcemanager
All set!
The hive, flume, mysql, and sqoop packages, along with install steps, are in this link:
Link: https://pan.baidu.com/s/1C3e4FpeX-RQ-9GVak6rekA
Extraction code: sed6
Once everything is installed:
Flume configuration
Create the file-hdfs.conf file:
# Name the agent's three components (source, sink, channel)
a3.sources = r3
a3.sinks = k3
a3.channels = c3
# Describe/configure the source
a3.sources.r3.type = spooldir
a3.sources.r3.spoolDir = /opt/data_log/
a3.sources.r3.fileSuffix = .log
a3.sources.r3.fileHeader = true
a3.sources.r3.inputCharset = GBK
# Describe the sink
a3.sinks.k3.type = hdfs
a3.sinks.k3.hdfs.path = hdfs://hadoop201:9000/flume/%Y%m%d/%H
# Prefix for the files uploaded to HDFS
a3.sinks.k3.hdfs.filePrefix = wuyou-
# Never roll files based on size or event count
a3.sinks.k3.hdfs.rollSize = 0
a3.sinks.k3.hdfs.rollCount = 0
a3.sinks.k3.hdfs.useLocalTimeStamp = true
a3.sinks.k3.hdfs.fileType = DataStream
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity =30000
a3.channels.c3.transactionCapacity = 30000
# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
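The %Y%m%d/%H escapes in hdfs.path are expanded from the event timestamp (here the local clock, since useLocalTimeStamp is true). A quick sketch of how the target directory is derived, using the base URI from the config above:

```python
from datetime import datetime

def flume_hdfs_dir(ts=None, base="hdfs://hadoop201:9000/flume"):
    """Expand the %Y%m%d/%H pattern the way the HDFS sink does."""
    ts = ts or datetime.now()
    return f"{base}/{ts.strftime('%Y%m%d')}/{ts.strftime('%H')}"

# An event written at 2020-07-14 14:05 lands under .../flume/20200714/14
```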
Open the site in a browser.
Let's look at how the page is built, then work out an approach:
Since we want the distribution of big-data jobs, we can search with those criteria directly and analyze how big-data positions are spread across the country.
My thinking: we need specific fields, and the listing page doesn't carry all of them, so we have to follow each posting's URL and extract the fields from its detail page. Here's what a detail page looks like.
All the fields we need are in there, so we can start writing code.
As noted above, to visit every posting we need an entry point for each one, and that entry point is this link:
Create a new scrapy project:
scrapy startproject qianchengwuyou
Open the project and you'll find it's just empty scaffolding; cd into the project directory and generate a spider:
scrapy genspider qcwy https://search.51job.com/
The URL at the end is the site you want to crawl.
Before writing any code, put the configuration into settings.py:
# Ignore robots.txt
ROBOTSTXT_OBEY = False
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = '51job'
# Collection that will hold the scraped data
MONGODB_DOCNAME = 'jobTable'
# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36'
}
# Item pipeline
ITEM_PIPELINES = {
    'qianchengwuyou.pipelines.QianchengwuyouPipeline': 300,
}
# Download delay
DOWNLOAD_DELAY = 1
Then open pipelines.py and set up saving the scraped items:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
import pymongo

class QianchengwuyouPipeline:
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        self.client = pymongo.MongoClient(host=host, port=port)
        self.db = self.client[settings['MONGODB_DBNAME']]
        self.coll = self.db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        data = dict(item)
        self.coll.insert_one(data)
        return item

    def close_spider(self, spider):
        self.client.close()
With pipelines.py defined, we still need to declare the fields we want to scrape in items.py; they carry the data over to pipelines.py:
import scrapy

class QianchengwuyouItem(scrapy.Item):
    zhiweimingcheng = scrapy.Field()  # job title
    xinzishuipin = scrapy.Field()     # salary
    zhaopindanwei = scrapy.Field()    # employer
    gongzuodidian = scrapy.Field()    # work location
    gongzuojingyan = scrapy.Field()   # experience required
    xueli = scrapy.Field()            # education
    yaoqiu = scrapy.Field()           # requirements
    jineng = scrapy.Field()           # skills
Now we can start writing code in qcwy.py.
Before that, let's analyze the page structure. Open the browser's inspector and work out the XPath for the links we need:
We need these hyperlinks so the framework can follow each one automatically and extract the detail page's contents.
Now look at the document tree structure:
You can see clearly that the whole listing sits under div class='el', and every job posting lives beneath it. So to collect all the URLs, we locate the parent element, take all its children, and read each child's href attribute.
The XPath can therefore be written as:
//*[@id='resultList']/div[@class='el']/p/span/a/@href
Let's print the results to try it out:
import scrapy
from qianchengwuyou.items import QianchengwuyouItem

class QcwySpider(scrapy.Spider):
    name = 'qcwy'
    allowed_domains = ['https://www.51job.com/']
    start_urls = ['https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2520,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        all_urls = response.xpath("//*[@id='resultList']/div[@class='el']/p/span/a/@href").getall()
        for url in all_urls:
            print(url)
Next, write a launcher:
Anywhere in the project, create a main.py (name it whatever you like, as long as it says "launcher" to you) with these two lines:
from scrapy.cmdline import execute
execute("scrapy crawl qcwy".split())
This imports execute from scrapy's cmdline and uses it to run the current project;
The output:
Having the links from one page isn't enough; remember, we need the links from every page. So let's look for the pattern between pages. Watch the URL at the top:
https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2B,2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=
When we click to the next page, watch what changes:
https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2B,2,2.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=
Notice how 1.html in the middle became 2.html;
And how many pages are there in total?
93. So what happens if we substitute 93 in directly?
Sure enough, that's the pattern. So now you know how to reach every page.
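Since only the trailing page number changes, all 93 listing URLs can be generated in a couple of lines (query string omitted here for brevity):

```python
def page_url(page):
    """Build the listing URL for one page; only the page number varies."""
    base = ("https://search.51job.com/list/000000,000000,0130%252C7501%252C7506"
            "%252C7502,01%252C32%252C38,9,99,%2B,2,{}.html")
    return base.format(page)

all_pages = [page_url(n) for n in range(1, 94)]  # pages 1..93
```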
However,
I don't like paging through that way; finding the pattern every time and lugging around a giant URL is a pain, so here's another approach:
Let's see how to grab the next-page URL with XPath:
Looking at that element (and the arrow I drew), you can see the plan: write an XPath that grabs this URL directly. If it exists, hand the page to the parsing function; if not, the function simply ends:
Here's the XPath:
//div[@class='p_in']//li[last()]/a/@href
So all we need is this check:
next_page = response.xpath("//div[@class='p_in']//li[last()]/a/@href").get()
if next_page:
    yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)
If there is a next page, it goes back to parse() to be processed;
Back to the earlier point: once we have the posting URLs on the current page, we need to enter each one, so we need a detail-page function that parses out our fields;
The full code so far:
import scrapy
from qianchengwuyou.items import QianchengwuyouItem

class QcwySpider(scrapy.Spider):
    name = 'qcwy'
    allowed_domains = ['https://www.51job.com/']
    start_urls = ['https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2520,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        all_urls = response.xpath("//*[@id='resultList']/div[@class='el']/p/span/a/@href").getall()
        for url in all_urls:
            yield scrapy.Request(url, callback=self.parse_html, dont_filter=True)
        next_page = response.xpath("//div[@class='p_in']//li[last()]/a/@href").get()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)
We just write a separate parse_html() function to handle the detail pages;
With that, we can reach the postings on all 93 pages;
Click into any posting;
The fields we need are all in there;
Same routine: open the inspector and look at the structure;
Most of the information sits inside this one element, so we can pull everything we need from it. I'll just give the XPath expressions; how to derive them was covered above;
zhiweimingcheng = response.xpath("//div[@class='cn']/h1/text()").getall()[0]
xinzishuipin = response.xpath("//div[@class='cn']//strong/text()").get()
zhaopindanwei = response.xpath("//div[@class='cn']//p[@class='cname']/a[1]/@title").get()
gongzuodidian = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[0]
gongzuojingyan = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[1]
xueli = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[2]
Besides those fields, we also need the job requirements and the skills. Look at the job-description section below;
Everything we need is in that element. But how do we pull out the skill tags? I couldn't think of a clean way either, so I skip extracting skills from the description itself. Open another posting and look again;
See that? There are keywords. We can use those keywords as our skill tags.
A careful reader will ask: what about postings without a keywords row? Won't the missing element throw an error?
That's exactly what try/except is for.
So the full field extraction can be written like this:
def parse_html(self, response):
    item = QianchengwuyouItem()
    try:
        zhiweimingcheng = response.xpath("//div[@class='cn']/h1/text()").getall()[0]
        xinzishuipin = response.xpath("//div[@class='cn']//strong/text()").get()
        zhaopindanwei = response.xpath("//div[@class='cn']//p[@class='cname']/a[1]/@title").get()
        gongzuodidian = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[0]
        gongzuojingyan = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[1]
        xueli = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[2]
        yaoqius = response.xpath("//div[@class='bmsg job_msg inbox']//text()").getall()
        yaoqiu_str = ""
        for yaoqiu in yaoqius:
            yaoqiu_str += yaoqiu.strip()
        jineng = ""
        guanjianzi = response.xpath("//p[@class='fp'][2]/a/text()").getall()
        for i in guanjianzi:
            jineng += i + " "
    except:
        zhiweimingcheng = ""
        xinzishuipin = ""
        zhaopindanwei = ""
        gongzuodidian = ""
        gongzuojingyan = ""
        xueli = ""
        yaoqiu_str = ""
        jineng = ""
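One drawback of the single try/except above is that any one missing field blanks out every field. A small helper (my own sketch, not in the original spider) would give each field its own fallback instead:

```python
def nth_or_default(values, index, default=""):
    """Return values[index] when present, else the default."""
    try:
        return values[index]
    except (IndexError, TypeError):
        return default

# e.g. instead of getall()[2] raising on a short list:
# xueli = nth_or_default(response.xpath(...).getall(), 2)
```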
Of course, once these fields are captured we still need to save them to MongoDB, and that's what the items we defined earlier are for;
So after all that preamble, here at last is the complete spider:
# -*- coding: utf-8 -*-
import scrapy
from qianchengwuyou.items import QianchengwuyouItem

class QcwySpider(scrapy.Spider):
    name = 'qcwy'
    allowed_domains = ['https://www.51job.com/']
    start_urls = ['https://search.51job.com/list/000000,000000,0130%252C7501%252C7506%252C7502,01%252C32%252C38,9,99,%2520,2,1.html?lang=c&stype=&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&providesalary=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=']

    def parse(self, response):
        all_urls = response.xpath("//*[@id='resultList']/div[@class='el']/p/span/a/@href").getall()
        for url in all_urls:
            yield scrapy.Request(url, callback=self.parse_html, dont_filter=True)
        next_page = response.xpath("//div[@class='p_in']//li[last()]/a/@href").get()
        if next_page:
            yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)

    def parse_html(self, response):
        item = QianchengwuyouItem()
        try:
            zhiweimingcheng = response.xpath("//div[@class='cn']/h1/text()").getall()[0]
            xinzishuipin = response.xpath("//div[@class='cn']//strong/text()").get()
            zhaopindanwei = response.xpath("//div[@class='cn']//p[@class='cname']/a[1]/@title").get()
            gongzuodidian = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[0]
            gongzuojingyan = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[1]
            xueli = response.xpath("//div[@class='cn']//p[@class='msg ltype']/text()").getall()[2]
            yaoqius = response.xpath("//div[@class='bmsg job_msg inbox']//text()").getall()
            yaoqiu_str = ""
            for yaoqiu in yaoqius:
                yaoqiu_str += yaoqiu.strip()
            jineng = ""
            guanjianzi = response.xpath("//p[@class='fp'][2]/a/text()").getall()
            for i in guanjianzi:
                jineng += i + " "
        except:
            zhiweimingcheng = ""
            xinzishuipin = ""
            zhaopindanwei = ""
            gongzuodidian = ""
            gongzuojingyan = ""
            xueli = ""
            yaoqiu_str = ""
            jineng = ""
        finally:
            item["zhiweimingcheng"] = zhiweimingcheng
            item["xinzishuipin"] = xinzishuipin
            item["zhaopindanwei"] = zhaopindanwei
            item["gongzuodidian"] = gongzuodidian
            item["gongzuojingyan"] = gongzuojingyan
            item["xueli"] = xueli
            item["yaoqiu"] = yaoqiu_str
            item["jineng"] = jineng
            yield item
Saving the data is the easy part: in pipelines.py we connect to MongoDB and write the records in. A quick search turns up plenty of examples, so I'll just give the code:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
import pymongo

class QianchengwuyouPipeline:
    def __init__(self):
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        self.client = pymongo.MongoClient(host=host, port=port)
        self.db = self.client[settings['MONGODB_DBNAME']]
        self.coll = self.db[settings['MONGODB_DOCNAME']]

    def process_item(self, item, spider):
        data = dict(item)
        self.coll.insert_one(data)
        return item

    def close_spider(self, spider):
        self.client.close()
Now, start the crawl;
Since this spider isn't distributed, it makes one request at a time and is painfully slow. Go eat a meal or watch a movie while it runs;
Finally, let's look at the data we collected;
4,600 records in total; the crawl took well over an hour.
Happily, scrapy provides concurrency settings. (Be warned: high concurrency can outpace your storage and drop data, so store items with multiple threads to match; for however many concurrent requests you open, open a corresponding number of writer threads.)
Change settings.py to the following:
BOT_NAME = 'qianchengwuyou'
SPIDER_MODULES = ['qianchengwuyou.spiders']
NEWSPIDER_MODULE = 'qianchengwuyou.spiders'
# Ignore robots.txt
ROBOTSTXT_OBEY = False
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# Database name
MONGODB_DBNAME = '51job'
# Collection that will hold the scraped data
MONGODB_DOCNAME = 'jobTable'
# Number of concurrent requests
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 0
CONCURRENT_REQUESTS_PER_DOMAIN = 8
CONCURRENT_REQUESTS_PER_IP = 8
# Disable cookies
COOKIES_ENABLED = False
# Downloader middleware
DOWNLOADER_MIDDLEWARES = {
    'qianchengwuyou.middlewares.downloadMiddlewares': 343,
}
# Item pipeline
ITEM_PIPELINES = {
    'qianchengwuyou.pipelines.QianchengwuyouPipeline': 300,
}
# Local HTTP cache
HTTPCACHE_ENABLED = False
I won't include the multithreaded storage code; it's just a few simple lines. Restart the crawl and you'll find it more than twice as fast.
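For what it's worth, the multithreaded storage can be sketched with a queue and a small worker pool; `store_one` below is a stand-in for the real pymongo insert (this sketch is mine, not the original code):

```python
import queue
import threading

def threaded_store(items, store_one, workers=4):
    """Fan item storage out to a pool of writer threads."""
    q = queue.Queue()

    def worker():
        while True:
            item = q.get()
            if item is None:  # sentinel: shut this worker down
                break
            store_one(item)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    q.join()  # wait until every item has been stored
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
```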
With the data in hand, the next step is cleaning it.
Since we're analyzing job postings, we need to tally the salaries and the number of openings for each position.
Our data lives in MongoDB. Loading it straight onto HDFS... I don't know how, so I'll take the clumsy route: export the data to CSV, clean it with Python first, then put the cleaned table onto HDFS.
From the bin directory of the MongoDB install, run:
mongoexport -d 51job -c 51job -f id,zhiweimingcheng,xinzishuipin,zhaopindanwei,gongzuodidian,gongzuojingyan,xueli,yaoqiu,jineng --csv -o c:/job.csv
Let's see what the exported table looks like.
Messy, no argument there. So we'll apply a little Python to tidy the table, converting the whole salary column into numeric form.
Here's the code; it's short and commented.
# Read the CSV, dropping rows whose salary column is empty
data = [i.replace('\n', '').split(',') for i in open('user.csv', 'r', encoding='ANSI').readlines()
        if i.replace('\n', '').split(',')[2]]
title = data[0]
data = data[1:]
flag = True
while flag:
    flag = False
    for k, v in enumerate(data):
        if isinstance(v[2], int) or isinstance(v[2], float):
            continue
        # Drop rows that don't fit the pattern (daily wages, no salary range)
        if '天' in v[2] or '-' not in v[2]:
            del data[k]
            flag = True
            continue
        # Strip stray '?' characters from a few rows
        for j in range(4, 7):
            data[k][j] = v[j].replace('?', '')
        # Convert the salary to a number (upper bound of the range, yuan/month)
        if '万/月' in v[2]:
            data[k][2] = eval(v[2].split('万')[0].split('-')[-1]) * 10000
        elif '千/月' in v[2]:
            data[k][2] = eval(v[2].split('千')[0].split('-')[-1]) * 1000
        elif '万/年' in v[2]:
            data[k][2] = eval(v[2].split('万')[0].split('-')[-1]) * 10000 // 12
for k, v in enumerate(data):
    if isinstance(v[2], int) or isinstance(v[2], float):
        continue
    # Drop any rows whose salary still isn't numeric
    if not (isinstance(v[2], int) or isinstance(v[2], float)):
        del data[k]
        continue
for k, v in enumerate(data):
    data[k][2] = str(int(data[k][2]))
with open('user_1.csv', 'w', encoding='ANSI') as j:
    for i in data:
        j.write(','.join(i) + '\n')
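The salary-conversion rule buried in the loop above is easier to check in isolation. Here is the same logic as a standalone function, taking the upper bound of the range in yuan per month:

```python
def salary_to_monthly_yuan(text):
    """Parse a 51job salary string like '1-1.5万/月'; returns None for
    daily wages or anything without a range, mirroring the cleaning loop."""
    if not text or '天' in text or '-' not in text:
        return None
    if '万/月' in text:          # ten-thousands of yuan per month
        return int(float(text.split('万')[0].split('-')[-1]) * 10000)
    if '千/月' in text:          # thousands of yuan per month
        return int(float(text.split('千')[0].split('-')[-1]) * 1000)
    if '万/年' in text:          # ten-thousands of yuan per year
        return int(float(text.split('万')[0].split('-')[-1]) * 10000 // 12)
    return None

# '1-1.5万/月' -> 15000, '8-12千/月' -> 12000, '12-24万/年' -> 20000
```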
Here's what the cleaned data looks like.
Very nice.
Uploading the file then failed... an encoding problem.
So I exported the data as a txt file instead.
Finally, ship it to HDFS with flume:
bin/flume-ng agent -c conf/ -f job/file-hdfs.conf -n a3 -Dflume.root.logger=INFO,console
That leaves a big pile of files waiting for us... let's merge them:
hadoop fs -cat /flume/20200714/14/* | hadoop fs -put - /flume/20200714/14
Then rename the merged file:
hadoop dfs -mv /flume/20200714/14/- /flume/20200714/14/qcwy
With the data on HDFS, we can move on to hive.
Create a hive table whose columns match our data table.
Also create a smaller table to hold just the fields we need:
create table xwuyou as
select wuyouwai.jobname as jobname,wuyouwai.salary as salary ,wuyouwai.address as address,wuyouwai.release_date as release_date
from wuyouwai
where jobname LIKE '%数据采集%';
insert into table xwuyou
select jobname,salary,address,release_date
from wuyouwai
where jobname ='大数据开发工程师';
insert into table xwuyou
select jobname,salary,address,release_date
from wuyouwai
where jobname = '数据分析';
That wraps up storage in hive. Next, we move the data into mysql.
To keep things simple I created one table per requirement; that makes the pymysql side trivial.
Required fields:
Create a new table; name it whatever you like. I called mine bigdata.
Give it four columns for the job title and the maximum, minimum, and average salary:
create table bigdata(
jobname varchar(30),
avg int,
min int,
max int);
Then create an intermediate table (again, any name) to hold the title, salary, location, and date:
create table caiji as
select xwuyou.jobname as jobname,xwuyou.salary as salary ,xwuyou.address as address,xwuyou.release_date as release_date
from xwuyou
where salary is not null;
Then insert the data into the bigdata table:
insert into table bigdata
select caiji.jobname as jobname,ceiling(avg(salary)),min(salary),max(salary) from caiji where jobname like '数据分析' group by jobname;
insert into table bigdata
select caiji.jobname as jobname,ceiling(avg(salary)),min(salary),max(salary) from caiji where jobname like '大数据开发工程师' group by jobname;
insert into table bigdata
select '数据采集',ceiling(avg(salary)),min(salary),max(salary) from caiji where jobname like '%数据采集%';
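What the ceiling(avg(...)), min(...), max(...) aggregation computes can be mirrored in plain Python, which makes for a quick sanity check of the results:

```python
import math
from statistics import mean

def salary_stats(salaries):
    """(average rounded up, minimum, maximum): the same three numbers
    each insert above derives per job title."""
    return (math.ceil(mean(salaries)), min(salaries), max(salaries))

# salary_stats([10000, 15000, 21000]) -> (15334, 10000, 21000)
```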
Let's look at the inserted data.
Nice. Works like a charm.
That gives us the maximum, minimum, and average salaries.
Same routine for the rest: just create a few more tables.
Create three tables, one per job title, holding the counts of openings; that makes charting easier.
Count the data-analysis openings:
insert into table fenxi
select '成都',count(address) from xwuyou where jobname like '数据分析' and address like '%成都%';
insert into table fenxi
select '北京',count(address) from xwuyou where jobname like '数据分析' and address like '%北京%';
insert into table fenxi
select '上海',count(address) from xwuyou where jobname like '数据分析' and address like '%上海%';
insert into table fenxi
select '广州',count(address) from xwuyou where jobname like '数据分析' and address like '%广州%';
insert into table fenxi
select '深圳',count(address) from xwuyou where jobname like '数据分析' and address like '%深圳%';
insert into table cj
select '成都',count(address) from xwuyou where jobname like '%数据采集%' and address like '%成都%';
insert into table cj
select '北京',count(address) from xwuyou where jobname like '%数据采集%' and address like '%北京%';
insert into table cj
select '上海',count(address) from xwuyou where jobname like '%数据采集%' and address like '%上海%';
insert into table cj
select '广州',count(address) from xwuyou where jobname like '%数据采集%' and address like '%广州%';
insert into table cj
select '深圳',count(address) from xwuyou where jobname like '%数据采集%' and address like '%深圳%';
insert into table big
select '成都',count(address) from xwuyou where jobname like '大数据开发工程师' and address like '%成都%';
insert into table big
select '北京',count(address) from xwuyou where jobname like '大数据开发工程师' and address like '%北京%';
insert into table big
select '上海',count(address) from xwuyou where jobname like '大数据开发工程师' and address like '%上海%';
insert into table big
select '广州',count(address) from xwuyou where jobname like '大数据开发工程师' and address like '%广州%';
insert into table big
select '深圳',count(address) from xwuyou where jobname like '大数据开发工程师' and address like '%深圳%';
Create the tables:
create table jingyan as
select wuyouwai.jobname as jobname,wuyouwai.salary as salary ,wuyouwai.experience as experience
from wuyouwai
where salary is not null and experience like '%经验%' and jobname like '%大数据%';
create table oneth(
jobname varchar(30),
avg int,
min int,
max int);
Once they're created, insert the data:
insert into table oneth
select '大数据相关',ceiling(avg(salary)),min(salary),max(salary) from jingyan
where experience in ('1年经验','2年经验','3-4年经验');
Create one more table to hold the line-chart data:
create table fourbigdata(
release_date date,
gangweishu int
);
insert into table fourbigdata
select release_date,count(jobname) from caiji group by release_date;
insert overwrite [local] directory '/root'       --> export path; drop "local" to export to HDFS instead
row format delimited fields terminated by '\t'   --> export field delimiter
select * from hive_db;                           --> what to export
# Pie chart: data analysis table
insert overwrite directory '/flume/20200714/20'
row format delimited fields terminated by '\t'
select * from fenxi;
# Pie chart: big-data development engineer
insert overwrite directory '/flume/20200714/21'
row format delimited fields terminated by '\t'
select * from big;
# 1-3 years of experience
insert overwrite directory '/flume/20200714/22'
row format delimited fields terminated by '\t'
select * from oneth;
# Pie chart: data collection
insert overwrite directory '/flume/20200714/23'
row format delimited fields terminated by '\t'
select * from cj;
# Salary levels for the three roles
insert overwrite directory '/flume/20200714/24'
row format delimited fields terminated by '\t'
select * from bigdata;
# The table with dates
insert overwrite directory '/flume/20200714/25'
row format delimited fields terminated by '\t'
select * from caiji;
From the bin directory of the sqoop install, run the export commands:
sqoop export --connect jdbc:mysql://127.0.0.1:3306/qianchengwuyou --username root --password 111111 --table caiji --export-dir '/flume/20200714/25' --fields-terminated-by '\t' -m 1
sqoop export --connect jdbc:mysql://127.0.0.1:3306/qianchengwuyou --username root --password 111111 --table bigdata --export-dir '/flume/20200714/24' --fields-terminated-by '\t' -m 1
sqoop export --connect jdbc:mysql://127.0.0.1:3306/qianchengwuyou --username root --password 111111 --table cj --export-dir '/flume/20200714/23' --fields-terminated-by '\t' -m 1
sqoop export --connect jdbc:mysql://127.0.0.1:3306/qianchengwuyou --username root --password 111111 --table oneth --export-dir '/flume/20200714/22' --fields-terminated-by '\t' -m 1
sqoop export --connect jdbc:mysql://127.0.0.1:3306/qianchengwuyou --username root --password 111111 --table big --export-dir '/flume/20200714/21' --fields-terminated-by '\t' -m 1
sqoop export --connect jdbc:mysql://127.0.0.1:3306/qianchengwuyou --username root --password 111111 --table fenxi --export-dir '/flume/20200714/20' --fields-terminated-by '\t' -m 1
With the data in mysql, everything becomes simple.
All that's left is to pull the data out of mysql with pymysql. Straight to the code:
import pymysql
from pyecharts.charts import Bar
from pyecharts import options as opts
db = pymysql.connect(host="192.168.1.201",port=3306,database="wuyou",user='root',password='111111')
cursor = db.cursor()
sql = "select * from bigdata"
cursor.execute(sql)
data = cursor.fetchall()
print(data)
zhiwei = [data[0][0], data[1][0], data[2][0]]
print(zhiwei)
min_list = [data[0][2], data[1][2], data[2][2]]
max_list = [data[0][3], data[1][3], data[2][3]]
average_list = [data[0][1], data[1][1], data[2][1]]
bar = Bar()
bar.add_xaxis(xaxis_data=zhiwei)
# The first argument is the series (legend) name, the second the y-axis data
bar.add_yaxis(series_name="最低工资", y_axis=min_list)
bar.add_yaxis(series_name="最高工资", y_axis=max_list)
bar.add_yaxis(series_name="平均工资", y_axis=average_list)
# Set the chart title
bar.set_global_opts(title_opts=opts.TitleOpts(title='薪资水平图', subtitle='工资单位:元/月'),
                    toolbox_opts=opts.ToolboxOpts())
bar.render("薪资水平图.html")
import pymysql
from pyecharts.charts import Pie
from pyecharts import options as opts
db = pymysql.connect(host="192.168.1.201",port=3306,database="wuyou",user='root',password='111111')
cursor = db.cursor()
sql = "select * from fenxi"
cursor.execute(sql)
data = cursor.fetchall()
print(data)
addr = ["成都","北京","上海","广州","深圳"]
num = [data[0][1],data[1][1],data[2][1],data[3][1],data[4][1]]
data_pair = [list(z) for z in zip(addr, num)]
data_pair.sort(key=lambda x: x[1])
# Draw the pie chart
c = (
Pie()
.add("", [list(z) for z in zip(addr,num)])
.set_global_opts(title_opts=opts.TitleOpts(title="数据分析工程师地区岗位数",subtitle='单位:个数'),toolbox_opts=opts.ToolboxOpts())
.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
).render("数据分析工程师地区岗位数.html")
import pymysql
from pyecharts.charts import Pie
from pyecharts import options as opts
db = pymysql.connect(host="192.168.1.201",port=3306,database="wuyou",user='root',password='111111')
cursor = db.cursor()
sql = "select * from cj"
cursor.execute(sql)
data = cursor.fetchall()
print(data)
addr = ["成都","北京","上海","广州","深圳"]
num = [data[0][1],data[1][1],data[2][1],data[3][1],data[4][1]]
data_pair = [list(z) for z in zip(addr, num)]
data_pair.sort(key=lambda x: x[1])
# Draw the pie chart
c = (
Pie()
.add("", [list(z) for z in zip(addr,num)])
.set_global_opts(title_opts=opts.TitleOpts(title="数据采集工程师地区岗位数",subtitle='单位:个数'),toolbox_opts=opts.ToolboxOpts())
.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
).render("数据采集工程师地区岗位数.html")
import pymysql
from pyecharts.charts import Pie
from pyecharts import options as opts
db = pymysql.connect(host="192.168.1.201",port=3306,database="wuyou",user='root',password='111111')
cursor = db.cursor()
sql = "select * from big"
cursor.execute(sql)
data = cursor.fetchall()
print(data)
addr = ["成都","北京","上海","广州","深圳"]
num = [data[0][1],data[1][1],data[2][1],data[3][1],data[4][1]]
data_pair = [list(z) for z in zip(addr, num)]
data_pair.sort(key=lambda x: x[1])
# Draw the pie chart
c = (
Pie()
.add("", [list(z) for z in zip(addr,num)])
.set_global_opts(title_opts=opts.TitleOpts(title="大数据开发工程师各地区岗位数",subtitle='单位:个数'),toolbox_opts=opts.ToolboxOpts())
.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c}"))
).render("大数据开发工程师地区岗位数.html")
import pymysql
from pyecharts.charts import Bar
from pyecharts import options as opts
db = pymysql.connect(host="192.168.1.201",port=3306,database="wuyou",user='root',password='111111')
cursor = db.cursor()
sql = "select * from oneth"
cursor.execute(sql)
data = cursor.fetchall()
print(data)
zhiwei = [data[0][0]]
print(zhiwei)
min_list = [data[0][2]]
max_list = [data[0][3]]
average_list = [data[0][1]]
bar = Bar()
bar.add_xaxis(xaxis_data=zhiwei)
# The first argument is the series (legend) name, the second the y-axis data
bar.add_yaxis(series_name="最低工资", y_axis=min_list)
bar.add_yaxis(series_name="最高工资", y_axis=max_list)
bar.add_yaxis(series_name="平均工资", y_axis=average_list)
# Set the chart title
bar.set_global_opts(title_opts=opts.TitleOpts(title='1-3年经验', subtitle='工资单位:元/月'),
                    toolbox_opts=opts.ToolboxOpts())
bar.render("1-3年经验.html")
import pymysql
from pyecharts.charts import Line
from pyecharts import options as opts
db = pymysql.connect(host="192.168.1.201",port=3306,database="wuyou",user='root',password='111111')
cursor = db.cursor()
sql = "select * from fourbigdata"
cursor.execute(sql)
data = cursor.fetchall()
time_list = []
renshu = []
for i in data:
time_list.append(str(i[0]))
renshu.append(str(i[1]))
print(time_list)
print(renshu)
data_pair = [list(z) for z in zip(time_list, renshu)]
data_pair.sort(key=lambda x: x[1])
(
Line(init_opts=opts.InitOpts(width="6000px", height="800px"))
.set_global_opts(
tooltip_opts=opts.TooltipOpts(is_show=False),
xaxis_opts=opts.AxisOpts(type_="category"),
yaxis_opts=opts.AxisOpts(
type_="value",
axistick_opts=opts.AxisTickOpts(is_show=True),
splitline_opts=opts.SplitLineOpts(is_show=True),
),
)
.add_xaxis(xaxis_data=time_list)
.add_yaxis(
series_name="大数据岗位需求变化趋势",
y_axis=renshu,
symbol="emptyCircle",
is_symbol_show=True,
label_opts=opts.LabelOpts(is_show=False),
)
.render("需求变化趋势.html")
)
The challenge in this project lies in command of the big-data stack: hands-on work with Hadoop, hive, mysql, sqoop, HDFS, and flume. It makes for a very comprehensive project.