Thrift and PHP environment setup
Reading Hadoop HDFS files from PHP
April 29th, 2010
Hadoop's distributed file system, HDFS, has a native Java interface: Java code can create, delete, modify, and read files in HDFS much as it would local files.
For other (non-Java) languages, Hadoop relies on Thrift.
For this approach, the readme at hadoop-0.20.2/src/contrib/thriftfs in the Hadoop tree says the following:
Thrift is a software framework for scalable cross-language services
development. It combines a powerful software stack with a code generation
engine to build services that work efficiently and seamlessly
between C++, Java, Python, PHP, and Ruby.
This project exposes HDFS APIs using the Thrift software stack. This
allows applications written in a myriad of languages to access
HDFS elegantly.
The Application Programming Interface (API)
===========================================
The HDFS API that is exposed through Thrift can be found in if/hadoopfs.thrift.
Compilation
===========
The compilation process creates a server org.apache.hadoop.thriftfs.HadoopThriftServer
that implements the Thrift interface defined in if/hadoopfs.thrift.
The thrift compiler is used to generate API stubs in python, php, ruby,
cocoa, etc. The generated code is checked into the directories gen-*.
The generated java API is checked into lib/hadoopthriftapi.jar.
There is a sample python script hdfs.py in the scripts directory. This python
script, when invoked, creates a HadoopThriftServer in the background, and then
communicates with HDFS using the API. This script is for demonstration purposes
only.
Since that description is rather terse, and I know very little about the Java world, I took quite a few wrong turns. Here is a record of them:
1. Download the Thrift source and install it:
./bootstrap.sh; ./configure --prefix=/usr/local/thrift; make; sudo make install
2. Copy the required files into the Thrift install directory:
cp /path/to/thrift-0.2.0/lib/php/ /usr/local/thrift/lib/ -r
mkdir /usr/local/thrift/lib/php/src/packages/
cp /path/to/hadoop-0.20.2/src/contrib/thriftfs/gen-php/ /usr/local/thrift/lib/php/src/packages/hadoopfs/ -r
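A quick way to confirm the copies landed where the client code below will look for them (an illustrative check only; the paths assume the /usr/local/thrift prefix used above):

<?php
// Check that the Thrift PHP library and the generated hadoopfs stubs
// are in place under the install prefix.
$root = '/usr/local/thrift/lib/php/src';
foreach (array('/Thrift.php', '/packages/hadoopfs/ThriftHadoopFileSystem.php') as $f) {
    echo $root . $f . ': ' . (file_exists($root . $f) ? 'ok' : 'MISSING') . "\n";
}
?>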
3. Install Thrift's PHP extension (this step is PHP-specific):
cd /path/to/thrift-0.2.0/lib/php/src/ext/thrift_protocol; phpize; ./configure; make; make install
Then edit php.ini and add: extension=thrift_protocol.so
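To verify that the PHP binary you will actually run picks the extension up (plain PHP, nothing specific to this setup):

<?php
// Prints bool(true) once php.ini has loaded thrift_protocol.so.
var_dump(extension_loaded('thrift_protocol'));
?>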
4. Build Hadoop:
cd /path/to/hadoop-0.20.2; ant compile
(ant -projecthelp lists the available targets; the compile target builds the core and contrib directories.)
5. Start Hadoop's Thrift proxy:
cd /path/to/hadoop-0.20.2/src/contrib/thriftfs/scripts/; ./start_thrift_server.sh [your-port]
(If you do not pass a port, a random one is used.)
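Before writing any client code it is worth checking that something is actually listening on that port. A minimal probe; 'your-host' and 12345 are placeholders for the host and the port the script printed:

<?php
// Plain TCP probe; replace host/port with the values from start_thrift_server.sh.
$conn = @fsockopen('your-host', 12345, $errno, $errstr, 3);
if ($conn === false) {
    echo "Thrift server not reachable: $errstr ($errno)\n";
} else {
    echo "Thrift server is listening.\n";
    fclose($conn);
}
?>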
6. Run the PHP test code:

<?php
error_reporting(E_ALL);
ini_set('display_errors', 'on');

$GLOBALS['THRIFT_ROOT'] = '/usr/local/thrift/lib/php/src';
define('ETCC_THRIFT_ROOT', $GLOBALS['THRIFT_ROOT']);
require_once(ETCC_THRIFT_ROOT . '/Thrift.php');
require_once(ETCC_THRIFT_ROOT . '/transport/TSocket.php');
require_once(ETCC_THRIFT_ROOT . '/transport/TBufferedTransport.php');
require_once(ETCC_THRIFT_ROOT . '/protocol/TBinaryProtocol.php');
require_once(ETCC_THRIFT_ROOT . '/packages/hadoopfs/ThriftHadoopFileSystem.php');

$host = 'your-host'; // the machine running start_thrift_server.sh
$port = 12345;       // the port the server printed at startup

$socket = new TSocket($host, $port);
$socket->setSendTimeout(10000);
$socket->setRecvTimeout(20000);
$transport = new TBufferedTransport($socket);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHadoopFileSystemClient($protocol);

$transport->open();
try {
    $pathname = new Pathname(array('pathname' => 'your-hdfs-file-name'));
    $fp = $client->open($pathname);
    var_dump($client->stat($pathname));
    var_dump($client->read($fp, 0, 1024));
    $client->close($fp); // release the server-side handle
} catch (Exception $e) {
    print_r($e);
}
$transport->close();
?>
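Reading is only half of it. The same client can write: the interface in if/hadoopfs.thrift also declares create, write, close, and exists, so a minimal write test looks roughly like this (a sketch under that assumption; the bootstrap is identical to the read script above):

<?php
// Same bootstrap as the read example above.
$GLOBALS['THRIFT_ROOT'] = '/usr/local/thrift/lib/php/src';
require_once($GLOBALS['THRIFT_ROOT'] . '/Thrift.php');
require_once($GLOBALS['THRIFT_ROOT'] . '/transport/TSocket.php');
require_once($GLOBALS['THRIFT_ROOT'] . '/transport/TBufferedTransport.php');
require_once($GLOBALS['THRIFT_ROOT'] . '/protocol/TBinaryProtocol.php');
require_once($GLOBALS['THRIFT_ROOT'] . '/packages/hadoopfs/ThriftHadoopFileSystem.php');

$socket    = new TSocket('your-host', 12345); // as before: your server's host and port
$transport = new TBufferedTransport($socket);
$client    = new ThriftHadoopFileSystemClient(new TBinaryProtocol($transport));

$transport->open();
try {
    $pathname = new Pathname(array('pathname' => 'your-hdfs-file-name'));
    $fp = $client->create($pathname);                   // create (or overwrite) the file
    $client->write($fp, "hello from php via thrift\n"); // write a line
    $client->close($fp);                                // release the handle
    var_dump($client->exists($pathname));               // should now be true
} catch (Exception $e) {
    print_r($e);
}
$transport->close();
?>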
Problems you may run into:
1. You can create directories and files, but reading a file returns no content.
Turn up Hadoop's Thrift server logging via log4j (if that is what you use for logging): in /path/to/hadoop/conf/log4j.properties, change:
hadoop.root.logger=ALL,console
After that, every operation against HDFS prints its debug information to the console.
What I saw there was a very strange file id, so I suspected it was overflowing on my 32-bit machine; my attempts to patch it went nowhere. After moving to a 64-bit machine, everything worked.
The problem code is in the readI64 function of /usr/local/thrift/lib/php/src/protocol/TBinaryProtocol.php (TBinaryProtocol being the protocol class my code uses).
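To make the 32-bit theory concrete: PHP integers are platform-sized, so rebuilding an i64 from eight bytes overflows past PHP_INT_MAX on a 32-bit build and silently degrades to a float, which keeps only 53 bits of precision and mangles large handle ids in exactly this way. A toy illustration (this is not the Thrift source, and the bcmath variant is just one possible workaround; it needs the bcmath extension):

<?php
// Naive composition of an (unsigned) i64 from big-endian bytes.
// On 32-bit PHP, $v * 256 overflows PHP_INT_MAX and becomes a float,
// so the low bits of a large id are silently lost.
function i64_naive(array $bytes) {
    $v = 0;
    foreach ($bytes as $b) {
        $v = $v * 256 + $b;
    }
    return $v;
}

// bcmath composition: exact on any build; returns a decimal string.
function i64_bc(array $bytes) {
    $v = '0';
    foreach ($bytes as $b) {
        $v = bcadd(bcmul($v, '256'), (string)$b);
    }
    return $v;
}

$bytes = array(0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0);
var_dump(PHP_INT_SIZE);      // 4 on a 32-bit build, 8 on 64-bit
var_dump(i64_naive($bytes)); // exact only on 64-bit PHP
var_dump(i64_bc($bytes));    // "1311768467463790320" on either build
?>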
2. start_thrift_server.sh fails to start:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/thriftfs/HadoopThriftServer
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.thriftfs.HadoopThriftServer
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:336)
Could not find the main class: org.apache.hadoop.thriftfs.HadoopThriftServer. Program will exit.
Check whether the classpath is correct. In my case it only worked after I added:
CLASSPATH=$CLASSPATH:$TOP/build/contrib/thriftfs/classes/:$TOP/build/classes/:$TOP/conf/
3. Problems installing Thrift:
./bootstrap.sh fails: check that boost is installed and that your version is not too old.
make fails: JDK 1.6 or newer is required.
make fails with "ImportError: No module named java_config_2": a python upgrade probably broke java-config; reinstalling java-config fixes it.
"line 832: X--tag=CXX: command not found": change every "$echo" in the thrift/libtool file to "$ECHO" (this looks like a libtool version issue).
If problems keep cropping up and the symptoms point at mostly outdated software, consider an emerge --sync to update everything.
In the worse case, when emerge keeps getting masked, install the dependencies by hand. Take boost, for example:
First, download the boost package from http://www.boost.org/users/download/ and unpack it under /usr/local (or wherever you want to install it).
Then symlink it into /usr/include: ln -s /usr/local/boost-version/boost /usr/include/boost. That takes care of the header-only part, which needs no compilation.
If you also need the compiled parts, cd into the boost directory, run the bootstrap.sh script to produce bjam, then run ./bjam install.
4. ant complains it cannot find the JDK.
Yet I had pointed JAVA_HOME at the JDK directory in /etc/profile and in every user's .bash_profile, and echo $JAVA_HOME printed the JDK directory.
Adding an echo $JAVA_HOME to the first line of the ant script showed it was empty there. In the end I had no choice but to hard-code JAVA_HOME into the ant script.
5. After unifying the Hadoop versions, the datanodes start fine, but the tasktracker still will not come up. The log shows:
2010-04-30 18:07:31,975 ERROR org.apache.hadoop.mapred.TaskTracker: Shutting down. Incompatible buildVersion.
JobTracker's: 0.20.3-dev from by ms on Thu Apr 29 17:44:22 CST 2010
TaskTracker's: 0.20.3-dev from by root on Fri Apr 30 17:48:14 CST 2010
Fix: copy the master's entire hadoop directory over to the slave!
(I tried rebuilding hadoop with ant under the ms account; no luck. In other words, matching versions are not enough; a mismatched build time also breaks it...)