Filtering data with hadoop-streaming

This walkthrough filters data with Python and hadoop-streaming.
1. The test data looks like this
$ cat test.txt
ngry-benz-9d02e5.netlify.com  35.197.55.186
*.heliumelephant.com    heliumelephant.com
0-0-2.11.edge.mrn.m.oml.ru      185.32.56.57
0-0-23-vln.fw1.pop.arcos.de     195.3.215.194
0-0-3522.cr1.mst1.bbinfra.net   77.93.70.50
0-0-62-37.mobileinternet.proximus.be    37.62.0.0
0-0.nu  46.30.215.63
0-1-124-92.pppoe.irtel.ru       92.124.1.0
0-10-9.connect.netcom.no        89.9.10.0
0-100.206-83.static-ip.oleane.fr        83.206.100.0
0-102-221-166.mobile.uscc.net   166.221.102.0
0-104-226-166.mobile.uscc.net   166.226.104.0
0-105.static.highlandsfibernetwork.com  216.9.0.105
0-112-235-166.mobile.uscc.net   166.235.112.0
0-1129580104972765614-step-0.embe-it.de 185.183.156.43
0-115.80-90.static-ip.oleane.fr 90.80.115.0
0-12-187-213.wifi4all.it        213.187.12.0
0-124-222-166.mobile.uscc.net   166.222.124.0
0-125-229-166.mobile.uscc.net   166.229.125.0
2. Python code
mappertest.py
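#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

# Normalize each input line to "domain<TAB>ip".
for line in sys.stdin:
    # split() with no argument splits on any run of spaces or tabs
    data = line.split()
    if len(data) < 2:
        continue  # skip blank or malformed lines
    domainname = data[0]
    ipaddress = data[1]
    # sys.stdout.write behaves the same under Python 2 and Python 3
    sys.stdout.write("%s\t%s\n" % (domainname, ipaddress))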
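The mapper relies on how `str.split()` behaves when called without arguments. The following is a minimal sketch of that behavior on lines shaped like the test data; the helper name `parse_record` is hypothetical and not part of the original scripts. Runs of spaces or tabs collapse into a single separator, and a blank line yields an empty list, which is why a line with fewer than two fields has to be skipped rather than indexed.

```python
def parse_record(line):
    """Return (domain, ip) for a well-formed line, or None otherwise."""
    fields = line.split()  # no argument: split on any run of whitespace
    if len(fields) < 2:
        return None  # blank or malformed line
    return fields[0], fields[1]


# Two spaces, a tab, and an empty line, as they might appear in the input.
print(parse_record("0-0.nu  46.30.215.63\n"))
print(parse_record("0-10-9.connect.netcom.no\t89.9.10.0\n"))
print(parse_record("\n"))
```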
    


reducertest.py
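#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import re

# One group of 1-3 digits followed by a dot; compiled once, outside the loop.
pat = re.compile(r'([0-9]{1,3})\.')

for line in sys.stdin:
    data = line.split()
    if len(data) < 2:
        continue  # skip blank or malformed lines
    domainname = data[0]
    ipaddress = data[1]
    # Appending "." lets the last octet match too, so a dotted quad yields 4 groups.
    r = pat.findall(ipaddress + ".")
    # Keep records whose domain has exactly three labels and whose
    # second field looks like a four-part IPv4 address.
    if len(domainname.split('.')) == 3 and len(r) == 4:
        sys.stdout.write("%s\t%s\n" % (domainname, ipaddress))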
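The regular expression in the reducer only counts groups of digits followed by a dot, so it also accepts values such as 999.1.1.1 or a1.2.3.4. If a stricter check is wanted, one possible alternative is sketched below. This is not the original author's method: the names `is_ipv4` and `lenient_count` are hypothetical, and the sketch assumes only the Python standard library.

```python
import re

# Exactly four dot-separated groups of 1-3 ASCII digits, nothing else.
IPV4_SHAPE = re.compile(r'\A[0-9]{1,3}(?:\.[0-9]{1,3}){3}\Z')


def is_ipv4(text):
    """Stricter check: right shape and every octet in the range 0-255."""
    if not IPV4_SHAPE.match(text):
        return False
    return all(int(part) <= 255 for part in text.split('.'))


def lenient_count(text):
    """The counting approach used in reducertest.py, for comparison."""
    return len(re.findall(r'([0-9]{1,3})\.', text + '.'))


for value in ["35.197.55.186", "heliumelephant.com", "999.1.1.1", "a1.2.3.4"]:
    print("%s\tgroups=%d\tstrict=%s" % (value, lenient_count(value), is_ipv4(value)))
```

Whether the stricter version is worth it depends on how clean the input is; for the sample data in step 1 both approaches select the same records.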
            
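3. Upload the three files test.txt, mappertest.py and reducertest.py to the /user/<username>/input directory on HDFS. If the output directory already exists, delete it first; otherwise the job fails with an error saying that the output directory already exists. Note that the job command below reads /user/clusteruser/input/dnsjson0818txt.txt, so point -input at whichever file was actually uploaded.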
            
$ hdfs dfs -rm -r /user/clusteruser/output

$ hadoop jar ./hadoop-streaming-2.7.1.jar -D stream.non.zero.exit.is.failure=false -files mappertest.py,reducertest.py -mapper mappertest.py -reducer reducertest.py -input /user/clusteruser/input/dnsjson0818txt.txt -output /user/clusteruser/output
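Set stream.non.zero.exit.is.failure to true or false to have streaming tasks that exit with a non-zero status treated as Failure or Success, respectively. By default, a streaming task that exits with a non-zero status is considered a failed task.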
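To make the effect of this setting concrete, here is a minimal sketch of a mapper-style function that reports a non-zero status when it meets a malformed line. The function name `run_mapper` and the choice of status 1 are assumptions made for illustration only; the original scripts do not set an exit status explicitly.

```python
def run_mapper(lines):
    """Return (records, exit_status); status is 1 if any line was malformed."""
    records = []
    status = 0
    for line in lines:
        data = line.split()
        if len(data) < 2:
            status = 1  # remember the problem but keep processing
            continue
        records.append("%s\t%s" % (data[0], data[1]))
    return records, status


records, status = run_mapper(["0-0.nu  46.30.215.63\n", "broken-line\n"])
print(records)
print(status)
# A real streaming script would finish with sys.exit(status).
```

With the default setting, a script ending in a non-zero status like this would cause the task to be counted as failed; with the value false, as in the command above, the task is counted as successful. The trade-off is that false can also hide genuine crashes in the scripts.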
4. Output
ngry-benz-9d02e5.netlify.com    35.197.55.186
0-1129580104972765614-step-0.embe-it.de 185.183.156.43
0-12-187-213.wifi4all.it        213.187.12.0
0-126-13-46.tmcz.cz     46.13.126.0
0-193-ftth.onsbrabantnet.nl     88.159.193.0
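Before submitting a job, the map and reduce logic can be tried locally. The sketch below replays a few lines from the sample in step 1 through in-memory versions of the two scripts. The function names `map_lines` and `reduce_lines` are hypothetical, and `sorted()` is only a rough stand-in for the sort that the framework performs between the map and reduce phases, so the line order may differ from what a real job writes.

```python
import re

# A few lines taken from the sample data in step 1.
SAMPLE = """\
ngry-benz-9d02e5.netlify.com  35.197.55.186
*.heliumelephant.com    heliumelephant.com
0-0.nu  46.30.215.63
0-1129580104972765614-step-0.embe-it.de 185.183.156.43
0-12-187-213.wifi4all.it        213.187.12.0
0-124-222-166.mobile.uscc.net   166.222.124.0
"""

OCTET = re.compile(r'([0-9]{1,3})\.')


def map_lines(lines):
    """Same normalization as mappertest.py: emit domain<TAB>ip."""
    for line in lines:
        data = line.split()
        if len(data) >= 2:
            yield "%s\t%s" % (data[0], data[1])


def reduce_lines(lines):
    """Same filter as reducertest.py: 3 domain labels, 4 digit groups."""
    for line in lines:
        domain, ip = line.split()[:2]
        if len(domain.split('.')) == 3 and len(OCTET.findall(ip + '.')) == 4:
            yield "%s\t%s" % (domain, ip)


mapped = list(map_lines(SAMPLE.splitlines()))
shuffled = sorted(mapped)  # stand-in for the framework's sort step
result = list(reduce_lines(shuffled))
for record in result:
    print(record)
```

The three records that survive are the same three that appear at the top of the output listed above.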
