一个pig例子(REGEX_EXTRACT_ALL, DBStorage)

Pig是hadoop下的一个子项目,简化了hadoop的编程。Pig的语法很干净,高效,言简意赅,5-6行pig语句就能完成五六十行java代码能做的事情,相当地高效。今天写了一个简单例子来学习pig.

Pig语法参考官方文档
http://pig.apache.org/docs/r0.11.0/index.html
或好心人整理的中文教程
http://www.codelast.com/?p=4550

Pig版本 0.11.0
需求: 抓取文本里的时间,并存入mysql数据库
文本片段:
03/17/13 00:00:00 #723988
......
M 00 OP:CELL OOS
     TOTAL CELLS EQUIPPED: 1014
     TOTAL OOS/ISOL CELLS: 0
     TOTAL ABNORMAL CELLS: 39
......


摘自官方文档
REGEX_EXTRACT
Performs regular expression matching and extracts the matched group defined by an index parameter.
Syntax
REGEX_EXTRACT (string, regex, index)

Usage
Use the REGEX_EXTRACT function to perform regular expression matching and to extract the matched group defined by the index parameter (where the index is a 1-based parameter.) The function uses Java regular expression form.

The function returns a string that corresponds to the matched group in the position specified by the index. If there is no matched expression at that position, NULL is returned.

REGEX_EXTRACT函数是Pig的内置函数,用的是java的正则表达式格式。

PIG的强大在于它支持用户自定义的函数UDFs,用户根据需要定制化处理,目前支持Java, Python, JavaScript and Ruby的UDFs.还提供了jar包Piggy Bank,支持用户共享自己的JAVA UDFs.
DBStorage这个函数就来自于Piggy Bank.Pig中引用jar包,需要REGISTER, 与java的import类似。DBStorage貌似只能写入,无法读取,而Sqoop可以在Hadoop和关系型数据库中的数据相互转移。

test.pig代码如下:
-- Register Piggy bank
REGISTER /usr/local/pig-0.11.0/contrib/piggybank/java/piggybank.jar;

-- Register MySQL driver
REGISTER /home/hadoop/mysql-connector-java-5.1.24-bin.jar;

A = LOAD '$INPUT' as (line:chararray);
SRC = FOREACH A GENERATE 
    FLATTEN (
	REGEX_EXTRACT(line, '(\\s*\\d*\\/\\d*\\/\\d*\\s\\d*:\\d*:\\d*)\\s#(\\d*\\s*)', 1)
	) AS (date : chararray);
-- Filter where date is null
FILTER_T1 = FILTER SRC BY NOT date IS NULL;
DUMP FILTER_T1;
-- Store data into DB
STORE FILTER_T1 INTO 'test' using org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'jdbc:mysql://localhost/cnet', 'root', '123456',
	'INSERT into test(string) values (?)');


Pig支持两种运行模式:本地的和MapReduce
这里用了本地模式,运行命令:
pig -x local -f test.pig -param INPUT=test.txt

成功运行后,数据成功load到数据库。

mysql> select * from cnet.test;
+-------------------+
| string            |
+-------------------+
| 03/17/13 00:00:00 |
| 03/17/13 00:00:03 |
| 03/17/13 00:00:03 |
| 03/17/13 00:00:03 |
+-------------------+
4 rows in set (0.00 sec)


你可能感兴趣的:(regex)