Video Streaming Service Based on Hadoop
Internet Memory provides a service for browsing archived Web pages, including multimedia content. We use Hadoop, HDFS and HBase to store and index our data, and we pair this storage with a Web server that lets users navigate the archive and retrieve documents. In this post, we focus on videos and detail the solution we adopted to serve true streaming from HDFS storage.
Many video formats are found on the Web, including Windows Media (.wmv), RealMedia (.rm), QuickTime (.mov), MPEG, Adobe Flash (.flv), etc.
In order to display a video, we need a player, which can be embedded in the Web browser.
The player depends on the specific video format, but most browsers are able to detect the format and choose the appropriate player.
Firefox, for instance, supports many plugins, which can be loaded quickly when a page contains a video in a specific format in order to display its content.
There are basically two ways to play a video.
The simplest one is a two-step process: first the whole file is downloaded from the Web server to the user’s computer, and then the player plays the local copy.
Its disadvantage is that the download step may take a while if the file is big (hundreds of megabytes are not uncommon).
The second one uses (true) streaming: the video file is split into fragments which are sent from the Web server to the player, giving the illusion of a continuous stream.
From the user's point of view, it looks as if a window were sliding over the video content, which avoids a full initial download of the whole file.
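The fragment-by-fragment delivery described above can be illustrated with a minimal Python sketch. The function name `iter_fragments` and the 64 KiB default fragment size are assumptions made for illustration only; a real streaming server would typically choose fragment boundaries according to the container format.

```python
import io
from typing import BinaryIO, Iterator


def iter_fragments(stream: BinaryIO, fragment_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive fragments of `stream` until it is exhausted.

    Hypothetical helper: the receiver can start consuming the first
    fragment before the rest of the file has been read or sent.
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    while True:
        fragment = stream.read(fragment_size)
        if not fragment:
            break
        yield fragment


if __name__ == "__main__":
    # An in-memory buffer stands in for a video file here.
    video = io.BytesIO(b"x" * 200_000)
    for number, fragment in enumerate(iter_fragments(video)):
        print(f"fragment {number}: {len(fragment)} bytes")
```

Because the function is a generator, only one fragment is held in memory at a time, which is the property that makes playback possible without a full initial download.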
Obviously, streaming is a more involved method because it requires close coordination between the components that take part in the process, namely the player, the Web server, and the file system from which the video is retrieved.
We examine this technical issue in the context of a Hadoop system where files are stored in HDFS, a file system dedicated to large distributed storage.
As explained above, streaming requires close coordination between the Web server and the file system.
The former issues requests to access chunks of the video file (think of what happens when the user suddenly jumps to a specific part of the video), whereas the latter must be able to seek in the file to position the cursor at a specific location.
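To make the two roles concrete, here is a minimal Python sketch of how a server might translate a chunk request into a seek on the underlying file. It assumes the request arrives as an HTTP `Range` header of the form `bytes=start-end`, which the post does not specify; `parse_range` and `read_range` are hypothetical helper names, and suffix ranges such as `bytes=-500` are deliberately left unsupported.

```python
import io
import re
from typing import BinaryIO, Optional, Tuple

# Matches "bytes=START-" or "bytes=START-END" (single range only).
_RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")


def parse_range(header: str, file_size: int) -> Optional[Tuple[int, int]]:
    """Return the inclusive (start, end) byte positions, or None if unusable."""
    match = _RANGE_RE.match(header.strip())
    if not match:
        return None
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else file_size - 1
    end = min(end, file_size - 1)  # clamp to the last byte of the file
    if start > end:
        return None
    return start, end


def read_range(stream: BinaryIO, start: int, end: int) -> bytes:
    """Position the cursor at `start` and read up to and including `end`."""
    stream.seek(start)
    return stream.read(end - start + 1)


if __name__ == "__main__":
    # An in-memory buffer stands in for the stored video file.
    video = io.BytesIO(b"0123456789")
    byte_range = parse_range("bytes=4-7", file_size=10)
    if byte_range is not None:
        print(read_range(video, *byte_range))
```

The important point is the `seek` call: if the storage layer cannot reposition the cursor cheaply, every jump in the player forces the server to read and discard everything before the requested offset.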
With HDFS, enabling such close cooperation turns out to be a problem because HDFS can in principle only be accessed through a Hadoop client, which the standard Apache server is not.
We investigated two possible solutions: Hoop, the Hadoop web server, and Apache/FUSE.
Hoop (see http://cloudera.github.com/hoop/) is an HTTP-HDFS connector.
It allows the HDFS file system to be accessed via HTTP.
We developed a working local prototype using JW Player and a large video file.
Streaming works, but seeking to an unbuffered part of the video causes playback to stop.
It seems that the Hoop API does not support seeking in a file, so we had to give up this approach.
The second solution is based on HDFS/FUSE. FUSE (Filesystem in Userspace) is an API that captures file system operations and allows them to be implemented by ad-hoc functions running in the user’s process space (which avoids the need to change the operating system kernel, a tricky and dangerous option).
FUSE is provided in Hadoop as a component named “Mountable HDFS” (see http://wiki.apache.org/hadoop/MountableHDFS).
It lets a standard file system user or program see the HDFS namespace as a locally mounted directory.
All file system operations, including directory browsing, file opening and content access, are available on HDFS content through the FUSE interface.
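The practical consequence is that a program needs no Hadoop client code at all: ordinary file system calls are enough. The Python sketch below illustrates this with directory browsing, file opening and positioned reads. The helper names `list_videos` and `read_at` are hypothetical, and the demo uses a temporary directory as a stand-in for the mount point, since the actual mount path is not given in the post.

```python
import os
import tempfile
from typing import List

# Extensions of the two formats discussed in this post.
VIDEO_EXTENSIONS = (".mp4", ".flv")


def list_videos(mount_point: str) -> List[str]:
    """Return sorted paths of video files found under the mounted directory."""
    found = []
    for directory, _subdirs, filenames in os.walk(mount_point):
        for name in filenames:
            if name.lower().endswith(VIDEO_EXTENSIONS):
                found.append(os.path.join(directory, name))
    return sorted(found)


def read_at(path: str, offset: int, length: int) -> bytes:
    """Open a file, seek to `offset`, and read at most `length` bytes."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


if __name__ == "__main__":
    # Assumption: a temporary directory plays the role of the HDFS mount
    # point; on a real system this would be the directory where
    # Mountable HDFS is mounted.
    with tempfile.TemporaryDirectory() as mount_point:
        sample = os.path.join(mount_point, "clip.mp4")
        with open(sample, "wb") as f:
            f.write(b"0123456789")
        print(list_videos(mount_point))
        print(read_at(sample, 4, 3))
```

Nothing in this code is specific to HDFS, which is exactly why a standard Web server and its streaming modules can work on top of the mounted directory.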
What remained was to configure Apache to access the mounted FUSE file system and load content from video files. How this is done depends on the video format.
So far, we have tested and validated .mp4 files and Flash video files.
For the first format, we use the H264 Streaming Module (see http://h264.code-shop.com/trac), an Apache plugin that enables adaptive streaming.
For FLV, we used a pseudo-streaming module for Apache named “mod_flv”. Both behave well and work with Mountable HDFS without problems.
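For readers unfamiliar with pseudo-streaming, the general idea can be sketched in a few lines of Python. This is not the implementation of “mod_flv”, whose internals the post does not describe; it is an illustration of the common technique, under the assumptions that the player requests a byte offset `start` and that this offset falls on an FLV tag boundary (players usually obtain such offsets from the keyframe metadata of the file). The function name `pseudo_stream` is hypothetical.

```python
import io
from typing import BinaryIO, Iterator

# FLV file header as laid out in the FLV format: signature "FLV", version 1,
# flags 0x05 (audio and video present), header length 9, followed by the
# 4-byte PreviousTagSize0 field, which is always zero.
FLV_HEADER = b"FLV" + bytes([1, 5]) + (9).to_bytes(4, "big") + (0).to_bytes(4, "big")


def pseudo_stream(stream: BinaryIO, start: int, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield a playable FLV byte stream beginning at byte offset `start`.

    Assumption: `start` is 0 or points at the beginning of an FLV tag.
    When seeking into the middle of the file, a fresh header is sent first
    so that the player still receives a well-formed FLV stream.
    """
    if start < 0:
        raise ValueError("start must be non-negative")
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if start > 0:
        yield FLV_HEADER
    stream.seek(start)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


if __name__ == "__main__":
    # Fake file: a header followed by three placeholder "tags".
    fake_file = io.BytesIO(FLV_HEADER + b"TAG1TAG2TAG3")
    offset = len(FLV_HEADER) + 4  # pretend the second tag starts here
    print(b"".join(pseudo_stream(fake_file, offset)))
```

As with byte-range serving, the technique depends on an efficient `seek` in the underlying file, which is what the FUSE mount provides.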
The solution based on Apache + Mountable HDFS (FUSE) turned out to be reliable, functionally adequate (seeking is well supported), and efficient.
The architecture is simple and easy to set up, and it makes it possible to combine the benefits of HDFS for very large repositories with standard Web server streaming solutions.
Although we chose Apache plugins for our current service, nothing keeps you from using a more powerful streaming server, since the FUSE approach (virtually) brings all the HDFS content into the scope of the standard file system.
Hoop remains a potential option for the future, but it did not appear mature enough when we tested it, at least for the complex operations (seeking to a specific offset in a file) required by video streaming.