johnsonxp

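[Cloudera Hadoop] CDH 4.0 Quick Start Guide (hands-on, latest CDH 4.0, enterprise Hadoop)

Cloudera has released CDH 4.0, the latest version of its Hadoop distribution.

Before installing CDH 4.0, the system needs some preparation. The required steps are:

1. Use an operating system that CDH 4.0 supports.

2. Install the JDK.

========================= Pre-installation preparation: start ===============================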

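1. Prepare the operating system. Because I want to use Crowbar later to deploy CDH 4.0 and Cloudera Manager automatically, I chose CentOS 6.2 64-bit Server and installed it following the standard CentOS 6.2 installation steps.

2. After the operating system is installed, configure the network first: set a static IP address, the gateway, and DNS so that the system configuration stays stable.

2.1 Edit the IP address configuration file for the network interface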

# vi /etc/sysconfig/network-scripts/ifcfg-eth0

#Set Static IP address

DEVICE="eth0"

NM_CONTROLLED="yes"

ONBOOT=yes

TYPE=Ethernet

BOOTPROTO=static

IPADDR=192.168.26.140

NETMASK=255.255.255.0

NETWORK=192.168.26.0

BROADCAST=192.168.26.255

DEFROUTE=yes

IPV4_FAILURE_FATAL=yes

IPV6INIT=no

NAME="System eth0"

UUID=5fb06bd0-0bb0-7ffb-45f1-d6edd65f3e03

HWADDR=00:0C:29:7A:CE:59

PEERDNS=yes

PEERROUTES=yes
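
The NETWORK and BROADCAST values in ifcfg-eth0 follow from IPADDR and NETMASK, so they can be cross-checked instead of typed by hand. The following is a minimal sketch in POSIX shell; the function names ipv4_network and ipv4_broadcast are made up for this illustration and are not part of CentOS.

```shell
# Hypothetical helpers (not part of CentOS): derive NETWORK and BROADCAST
# from a dotted-quad address and netmask.

# Network address = address AND netmask, octet by octet.
ipv4_network() (
    set -f          # no filename expansion while splitting
    IFS=.           # split the two dotted quads into eight octets
    set -- $1 $2
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

# Broadcast address = address OR inverted netmask, octet by octet.
ipv4_broadcast() (
    set -f
    IFS=.
    set -- $1 $2
    echo "$(( $1 | (255 ^ $5) )).$(( $2 | (255 ^ $6) )).$(( $3 | (255 ^ $7) )).$(( $4 | (255 ^ $8) ))"
)

ipv4_network   192.168.26.140 255.255.255.0   # prints 192.168.26.0
ipv4_broadcast 192.168.26.140 255.255.255.0   # prints 192.168.26.255
```

Both results match the NETWORK and BROADCAST lines in the file above. Each function body runs in a subshell, so the IFS change does not leak into the calling shell.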

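2.2 CentOS: set the gateway

Edit the gateway configuration file:

[root@centos]# vi /etc/sysconfig/network

Change the following settings:

NETWORKING=yes (whether the system uses networking; normally yes. If set to no, networking is unavailable and many system services will fail to start.)

HOSTNAME=CDH4.0-Node-1 (the hostname of this machine; it must match the hostname set in /etc/hosts)

GATEWAY=192.168.26.2 (the IP address of the gateway this machine uses; for example, 10.0.0.2)

2.3 CentOS: set the DNS server

# vi /etc/resolv.conf

Change the following settings: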

# Generated by NetworkManager

domain localdomain

search localdomain

nameserver 192.168.26.2

2.4 The three key CentOS network configuration files

[root@CDH4 ~]# vi /etc/sysconfig/network-scripts/ifcfg-eth0

[root@CDH4 ~]# vi /etc/sysconfig/network

[root@CDH4 ~]# vi /etc/resolv.conf

2.5 Verify the CentOS network configuration:

Gateway reachable:

[root@CDH4 ~]# ping 192.168.26.2

PING 192.168.26.2 (192.168.26.2) 56(84) bytes of data.

64 bytes from 192.168.26.2: icmp_seq=1 ttl=128 time=1.57 ms

64 bytes from 192.168.26.2: icmp_seq=2 ttl=128 time=0.238 ms

--- 192.168.26.2 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1182ms

rtt min/avg/max/mdev = 0.238/0.907/1.577/0.670 ms

Outbound route reachable:

[root@CDH4 ~]# ping 192.168.88.1

PING 192.168.88.1 (192.168.88.1) 56(84) bytes of data.

64 bytes from 192.168.88.1: icmp_seq=1 ttl=128 time=2.64 ms

64 bytes from 192.168.88.1: icmp_seq=2 ttl=128 time=3.20 ms

--- 192.168.88.1 ping statistics ---

2 packets transmitted, 2 received, 0% packet loss, time 1630ms

rtt min/avg/max/mdev = 2.644/2.924/3.204/0.280 ms

Public internet reachable:

[root@CDH4 ~]# ping www.baidu.com

PING www.a.shifen.com (61.135.169.125) 56(84) bytes of data.

64 bytes from 61.135.169.125: icmp_seq=1 ttl=128 time=33.9 ms

64 bytes from 61.135.169.125: icmp_seq=2 ttl=128 time=33.3 ms
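
The three checks above can be repeated in one pass after any later network change. This is only a sketch: check_reachable and the PING_CMD override are names invented here, and the default probe assumes the Linux ping options -c (count) and -W (reply timeout in seconds).

```shell
# Hypothetical helper: report which targets answer a ping.
# PING_CMD is an assumption of this sketch so the probe can be swapped out;
# by default it sends two echo requests with a two-second reply timeout.
check_reachable() {
    for target in "$@"; do
        if ${PING_CMD:-ping -c 2 -W 2} "$target" >/dev/null 2>&1; then
            echo "OK   $target"
        else
            echo "FAIL $target"
        fi
    done
}

# Usage (the same three targets as above: gateway, outbound route, public site):
#   check_reachable 192.168.26.2 192.168.88.1 www.baidu.com
```

A FAIL on the first target points at the interface or gateway settings, while a FAIL only on the hostname points at /etc/resolv.conf.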

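3. Next, switch the yum repositories to a faster mirror and upgrade the operating system to the latest packages.

For download speed, good mirrors in mainland China include USTC (University of Science and Technology of China) and NetEase (mirrors.163.com).

CentOS repository configuration (NetEase, 163.com)

# Back up

# mv /etc/yum.repos.d/CentOS-Base.repo{,.bak}

# Edit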

# vi /etc/yum.repos.d/CentOS-Base.repo

# CentOS-Base.repo

# The mirror system uses the connecting IP address of the client and the

# update status of each mirror to pick mirrors that are updated to and

# geographically close to the client. You should use this for CentOS updates

# unless you are manually picking other mirrors.

# If the mirrorlist= does not work for you, as a fall back you can try the

# remarked out baseurl= line instead.

[base]

name=CentOS-$releasever - Base

#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=os

baseurl=http://mirrors.163.com/centos/$releasever/os/$basearch/

gpgcheck=1

gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

#released updates

[updates]

name=CentOS-$releasever - Updates

#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=updates

baseurl=http://mirrors.163.com/centos/$releasever/updates/$basearch/

gpgcheck=1

gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

#additional packages that may be useful

[extras]

name=CentOS-$releasever - Extras

#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=extras

baseurl=http://mirrors.163.com/centos/$releasever/extras/$basearch/

gpgcheck=1

gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

#additional packages that extend functionality of existing packages

[centosplus]

name=CentOS-$releasever - Plus

#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=centosplus

baseurl=http://mirrors.163.com/centos/$releasever/centosplus/$basearch/

gpgcheck=1

enabled=0

gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

#contrib - packages by Centos Users

[contrib]

name=CentOS-$releasever - Contrib

#mirrorlist=http://mirrorlist.centos.org/?release=$releasever&arch=$basearch&repo=contrib

baseurl=http://mirrors.163.com/centos/$releasever/contrib/$basearch/

gpgcheck=1

enabled=0

gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-6

# yum clean all

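# yum makecache # cache the repository package metadata locally to speed up package searches and installs

# yum install vim* # test that the mirror hostname resolves and the repository works

# yum update # update the system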
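
The hand edit of CentOS-Base.repo could also be scripted from the backup copy. The filter below is a sketch under one assumption that should be checked against your own file: that the stock CentOS-Base.repo has active "mirrorlist=" lines and commented "#baseurl=http://mirror.centos.org/..." lines. The function name use_163_mirror is hypothetical.

```shell
# Hypothetical filter: read a stock CentOS-Base.repo on stdin and write a
# version that uses mirrors.163.com on stdout.
# Assumption: the stock file has active "mirrorlist=" lines and commented
# "#baseurl=http://mirror.centos.org/..." lines.
use_163_mirror() {
    sed -e 's|^mirrorlist=|#mirrorlist=|' \
        -e 's|^#baseurl=http://mirror\.centos\.org/|baseurl=http://mirrors.163.com/|'
}

# Usage, after the backup step above has created CentOS-Base.repo.bak:
#   use_163_mirror < /etc/yum.repos.d/CentOS-Base.repo.bak > /etc/yum.repos.d/CentOS-Base.repo
```

Lines that match neither pattern (section names, gpgcheck, gpgkey, enabled) pass through unchanged, so the result should have the same shape as the file shown above.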

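4. Install the JDK. Hadoop runs on the JVM, so this is required.

4.1 Before installing CDH 4.0 you must install the Oracle JDK and meet the requirements below: version 1.6.0_31 is recommended, every node in the cluster should run the same JDK version, and the JAVA_HOME environment variable must be set.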

Requirements:

CDH4 requires the Oracle JDK 1.6.0_8 at a minimum. Cloudera recommends version 1.6.0_31.

After installing the JDK, and before installing and deploying CDH:

If you are deploying CDH on a cluster, make sure you have the same version of the Oracle JDK on each node.

Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check by using a command such as

$ sudo env | grep JAVA_HOME

It should be set to point to the directory where the JDK is installed, as shown in the example below.

4.2 Install the Oracle JDK.

Download the JDK: http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html

Install the JDK.

Set the environment variables: set JAVA_HOME and add $JAVA_HOME/bin to $PATH.

As the root user, set JAVA_HOME to the directory where the JDK is installed; for example:

# export JAVA_HOME=<jdk-install-dir>

# export PATH=$JAVA_HOME/bin:$PATH

where <jdk-install-dir> might be something like /usr/java/jdk1.6.0_31, depending on the system configuration and where the JDK is actually installed.
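
Once JAVA_HOME is set, the installed version can be compared with the stated minimum (1.6.0_8). The helper below is a sketch that only understands JDK 6 version strings of the form 1.6.0_NN; its name, jdk6_update_at_least, is invented for this example.

```shell
# Hypothetical helper: succeed if a version string names JDK 6 with an update
# number of at least $2 (for example: jdk6_update_at_least 1.6.0_31 8).
jdk6_update_at_least() {
    case $1 in
        1.6.0_*) update=${1#1.6.0_} ;;
        *) return 1 ;;              # not a 1.6.0_NN version string
    esac
    case $update in
        ''|*[!0-9]*) return 1 ;;    # update part must be all digits
    esac
    [ "$update" -ge "$2" ]
}

# Usage: "java -version" prints to stderr, e.g.: java version "1.6.0_31"
#   ver=$("$JAVA_HOME/bin/java" -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
#   jdk6_update_at_least "$ver" 8 && echo "JDK $ver meets the CDH4 minimum"
```

Running the same check on every node is a quick way to confirm that the cluster uses one JDK version throughout.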

5. Create a snapshot named 'OS Ready'. System preparation is complete.

========================= Pre-installation preparation: end ===============================

========================= Cloudera Hadoop installation: start ===============================

Installing CDH4 with MRv1 on a Single Linux Node in Pseudo-distributed mode

1. Download the CDH4 package:

For OS Version Click this Link

Red Hat/CentOS/Oracle 5 Red Hat/CentOS/Oracle 5 link

Red Hat/CentOS 6 (32-bit) Red Hat/CentOS 6 link (32-bit)

Red Hat/CentOS 6 (64-bit) Red Hat/CentOS 6 link (64-bit)

2. Install the RPM:

$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm

3. Install CDH4 Hadoop with MRv1

To install Hadoop with MRv1:

$ sudo yum install hadoop-0.20-conf-pseudo

========================= Cloudera Hadoop installation: end ===============================

Before you install CDH4 on a single node, there are some important steps you need to do to prepare your system:

Verify you are using a supported operating system for CDH4. See the next section: Supported Operating Systems for CDH4.

If you haven't already done so, install the Oracle Java Development Kit (JDK) before deploying CDH4. See the section below: Install the Oracle Java Development Kit.

Important

On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES distribution; Hadoop will not run correctly with that version. Install the Oracle JDK following directions under Install the Oracle Java Development Kit.

Supported Operating Systems for CDH4

CDH4 supports the following operating systems:

For Red Hat-compatible systems, Cloudera provides:

64-bit packages for Red Hat Enterprise Linux 5.7, CentOS 5.7, and Oracle Linux 5.6 with Unbreakable Enterprise Kernel.

32-bit and 64-bit packages for Red Hat Enterprise Linux 6.2 and CentOS 6.2.

For SUSE systems, Cloudera provides 64-bit packages for SUSE Linux Enterprise Server 11 (SLES 11). Service pack 1 or later is required.

For Ubuntu systems, Cloudera provides 64-bit packages for the Long-Term Support (LTS) releases Lucid (10.04) and Precise (12.04).

For Debian systems, Cloudera provides 64-bit packages for Squeeze (6.0.3).

Note

Cloudera has received reports that our RPMs work well on Fedora, but we have not tested this.

Important

For production environments, 64-bit packages are recommended. Except as noted above, CDH4 provides only 64-bit packages.

Note

If you are using an operating system that is not supported by Cloudera's packages, you can also download source tarballs from Downloads.

Install the Oracle Java Development Kit

If you have already installed the Oracle JDK, skip this step and proceed to Installing CDH4 on a Single Linux Node in Pseudo-distributed Mode.

Install the Oracle Java Development Kit (JDK) before deploying CDH4.

To install the JDK, follow the instructions under Oracle JDK Installation. The completed installation must meet the requirements in the box below.

If you have already installed a version of the JDK, make sure your installation meets the requirements in the box below.

Requirements:

CDH4 requires the Oracle JDK 1.6.0_8 at a minimum. Cloudera recommends version 1.6.0_31.

After installing the JDK, and before installing and deploying CDH:

If you are deploying CDH on a cluster, make sure you have the same version of the Oracle JDK on each node.

Make sure the JAVA_HOME environment variable is set for the root user on each node. You can check by using a command such as

$ sudo env | grep JAVA_HOME

It should be set to point to the directory where the JDK is installed, as shown in the example below.

You may be able to install the Oracle JDK with your package manager, depending on your choice of operating system.

Oracle JDK Installation

Important

The Oracle JDK installer is available both as an RPM-based installer (note the "-rpm" modifier before the bin file extension) for RPM-based systems, and as a binary installer for other systems. Make sure you install the jdk-6uXX-linux-x64-rpm.bin file for 64-bit systems, or jdk-6uXX-linux-i586-rpm.bin for 32-bit systems.

On SLES 11 platforms, do not install or try to use the IBM Java version bundled with the SLES distribution; Hadoop will not run correctly with that version. Install the Oracle JDK by following the instructions below.

To install the Oracle JDK:

Download one of the recommended versions of the Oracle JDK from this page, which you can also reach by going to the Java SE Downloads page and clicking on the Previous Releases tab and then on the Java SE 6 link. (These links and directions were correct at the time of writing, but the page is restructured frequently.)

Install the Oracle JDK following the directions on the Java SE Downloads page.

As the root user, set JAVA_HOME to the directory where the JDK is installed; for example:

# export JAVA_HOME=<jdk-install-dir>

# export PATH=$JAVA_HOME/bin:$PATH

where <jdk-install-dir> might be something like /usr/java/jdk1.6.0_31, depending on the system configuration and where the JDK is actually installed.

Contents

Installing CDH4 with MRv1 on a Single Linux Node in Pseudo-distributed mode

Installing CDH4 with YARN on a Single Linux Node in Pseudo-distributed mode

Components That Require Additional Configuration

Next Steps

You can evaluate CDH4 by quickly installing Apache Hadoop and CDH4 components on a single Linux node in pseudo-distributed mode. In pseudo-distributed mode, Hadoop processing is distributed over all of the cores/processors on a single machine. Hadoop writes all files to the Hadoop Distributed File System (HDFS), and all services and daemons communicate over local TCP sockets for inter-process communication.

Note

CDH4 is based on Apache Hadoop 2.0, which (among other features) introduces a new generation of MapReduce — MapReduce 2.0, known as MRv2 or YARN. (See Apache Hadoop NextGen MapReduce (YARN) for more information about YARN.) CDH4 also supports an implementation of the previous version of MapReduce, now referred to as MapReduce version 1 (MRv1).

Important

For installations in pseudo-distributed mode, there are separate conf-pseudo packages for an installation that includes MRv1 (hadoop-0.20-conf-pseudo) or an installation that includes YARN (hadoop-conf-pseudo). Only one conf-pseudo package can be installed at a time: if you want to change from one to the other, you must uninstall the one currently installed.
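
Because the two conf-pseudo packages are mutually exclusive, a setup script might pick the package name in one place. The helper below is a small sketch; the function name conf_pseudo_package and the keywords mrv1 and yarn are invented here, while the two package names come from the text above.

```shell
# Hypothetical helper: map a MapReduce framework name to the conf-pseudo
# package named in the text. Only one of the two may be installed at a time.
conf_pseudo_package() {
    case $1 in
        mrv1) echo hadoop-0.20-conf-pseudo ;;
        yarn) echo hadoop-conf-pseudo ;;
        *)    echo "unknown framework: $1 (expected mrv1 or yarn)" >&2; return 1 ;;
    esac
}

# Usage on a Red Hat-compatible system:
#   sudo yum install "$(conf_pseudo_package mrv1)"
```

Switching frameworks still requires uninstalling the package that is currently installed before installing the other one.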

Installing CDH4 with MRv1 on a Single Linux Node in Pseudo-distributed mode

Important

If you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH4. Follow these instructions.

On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

Download the CDH4 Package

Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

For OS Version Click this Link

Red Hat/CentOS/Oracle 5 Red Hat/CentOS/Oracle 5 link

Red Hat/CentOS 6 (32-bit) Red Hat/CentOS 6 link (32-bit)

Red Hat/CentOS 6 (64-bit) Red Hat/CentOS 6 link (64-bit)

Install the RPM:

$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm

Note

For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.

Install CDH4

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:

For Red Hat/CentOS/Oracle 5 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera

For Red Hat/CentOS 6 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Install Hadoop in pseudo-distributed mode:

To install Hadoop with MRv1:

$ sudo yum install hadoop-0.20-conf-pseudo

Starting Hadoop and Verifying it is Working Properly:

For MRv1, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, jobtracker, secondarynamenode, datanode, and tasktracker.

To verify that the hadoop-0.20-conf-pseudo package is installed on your system, list the files it contains.

To view the files on Red Hat or SUSE systems:

$ rpm -ql hadoop-0.20-conf-pseudo

To view the files on Ubuntu systems:

$ dpkg -L hadoop-0.20-conf-pseudo

The new configuration is self-contained in the /etc/hadoop/conf.pseudo.mr1 directory.

Note

The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.

To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.

$ sudo -u hdfs hdfs namenode -format

Note

Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

Important

In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH4, you must do this explicitly.

Step 2: Start HDFS

$ for service in /etc/init.d/hadoop-hdfs-*

> do

> sudo $service start

> done

To verify services have started, you can check the web console. The NameNode provides a web console http://localhost:50070/ for viewing your Distributed File System (DFS) capacity, number of DataNodes, and logs. In this pseudo-distributed configuration, you should see one live DataNode named localhost.

Step 3: Create the /tmp Directory

Create the /tmp directory and set permissions:

Important

If you do not create /tmp properly, with the right permissions as shown below, you may have problems with CDH components later. Specifically, if you don't create /tmp yourself, another process may create it automatically with restrictive permissions that will prevent your other applications from using it.

Create the /tmp directory after HDFS is up and running, and set its permissions to 1777 (drwxrwxrwt), as follows:

$ sudo -u hdfs hadoop fs -mkdir /tmp

$ sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Note

This is the root of hadoop.tmp.dir (/tmp/hadoop-${user.name} by default) which is used both for the local file system and HDFS.

Step 4: Create the MapReduce system directories:

sudo -u hdfs hadoop fs -mkdir /var

sudo -u hdfs hadoop fs -mkdir /var/lib

sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs

sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache

sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred

sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred

sudo -u hdfs hadoop fs -mkdir /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

sudo -u hdfs hadoop fs -chmod 1777 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

sudo -u hdfs hadoop fs -chown -R mapred /var/lib/hadoop-hdfs/cache/mapred
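
The seven mkdir calls above walk a single path one level at a time. The same sequence can be generated rather than typed; the sketch below does so with a hypothetical helper named path_prefixes, which only prints paths and does not touch HDFS itself.

```shell
# Hypothetical helper: print every ancestor of an absolute path, shortest
# first, ending with the path itself.
path_prefixes() {
    rest=${1#/}      # path without its leading slash
    prefix=
    while [ -n "$rest" ]; do
        prefix="$prefix/${rest%%/*}"   # append the next component
        echo "$prefix"
        case $rest in
            */*) rest=${rest#*/} ;;    # drop the component just handled
            *)   rest= ;;              # that was the last one
        esac
    done
}

# Usage: create each level in turn, as in the listing above.
#   for dir in $(path_prefixes /var/lib/hadoop-hdfs/cache/mapred/mapred/staging); do
#       sudo -u hdfs hadoop fs -mkdir "$dir"
#   done
```

The chmod 1777 on the staging directory and the chown -R mapred on /var/lib/hadoop-hdfs/cache/mapred are still needed afterwards, exactly as shown above.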

Step 5: Verify the HDFS File Structure

$ sudo -u hdfs hadoop fs -ls -R /

You should see:

drwxrwxrwt - hdfs supergroup 0 2012-04-19 15:14 /tmp

drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:26 /user

drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var

drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib

drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs

drwxr-xr-x - hdfs supergroup 0 2012-04-19 15:16 /var/lib/hadoop-hdfs/cache

drwxr-xr-x - mapred supergroup 0 2012-04-19 15:19 /var/lib/hadoop-hdfs/cache/mapred

drwxr-xr-x - mapred supergroup 0 2012-04-19 15:29 /var/lib/hadoop-hdfs/cache/mapred/mapred

drwxrwxrwt - mapred supergroup 0 2012-04-19 15:33 /var/lib/hadoop-hdfs/cache/mapred/mapred/staging

Step 6: Start MapReduce

$ for service in /etc/init.d/hadoop-0.20-mapreduce-*

> do

> sudo $service start

> done

To verify services have started, you can check the web console. The JobTracker provides a web console http://localhost:50030/ for viewing and running completed and failed jobs with logs.

Step 7: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/<user>

$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

sudo -u hdfs hadoop fs -mkdir /user/$USER

sudo -u hdfs hadoop fs -chown $USER /user/$USER
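
For more than a handful of users, the script mentioned above could look like the following sketch. It only prints the two commands per user so that they can be reviewed first; the function name hdfs_home_commands is invented for this example.

```shell
# Hypothetical helper: print the mkdir/chown pair from this step for each
# Linux username given as an argument. Nothing is executed here.
hdfs_home_commands() {
    for user in "$@"; do
        echo "sudo -u hdfs hadoop fs -mkdir /user/$user"
        echo "sudo -u hdfs hadoop fs -chown $user /user/$user"
    done
}

# Usage: review the output, then pipe it to a shell to run it.
#   hdfs_home_commands joe ann
#   hdfs_home_commands joe ann | sh
```

Printing first and executing second keeps a typo in a username from creating a stray directory in HDFS.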

Running an example application with MRv1

Create a home directory on HDFS for the user who will be running the job (for example, joe):

sudo -u hdfs hadoop fs -mkdir /user/joe

sudo -u hdfs hadoop fs -chown joe /user/joe

Do the following steps as the user joe.

Make a directory in HDFS called input and copy some XML files into it by running the following commands:

$ hadoop fs -mkdir input

$ hadoop fs -put /etc/hadoop/conf/*.xml input

$ hadoop fs -ls input

Found 3 items:

-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml

-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml

-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Run an example Hadoop job to grep with a regular expression in your input data.

$ /usr/bin/hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar grep input output 'dfs[a-z.]+'

After the job completes, you can find the output in the HDFS directory named output because you specified that output directory to Hadoop.

$ hadoop fs -ls

Found 2 items

drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input

drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output

You can see that there is a new directory called output.

List the output files.

$ hadoop fs -ls output

Found 3 items

drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output/_logs

-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output/part-00000

-rw-r--r-- 1 joe supergroup 0 2009-02-25 10:33 /user/joe/output/_SUCCESS

Read the results in the output file; for example:

$ hadoop fs -cat output/part-00000 | head

1 dfs.datanode.data.dir

1 dfs.namenode.checkpoint.dir

1 dfs.namenode.name.dir

1 dfs.replication

1 dfs.safemode.extension

1 dfs.safemode.min.datanodes

Installing CDH4 with YARN on a Single Linux Node in Pseudo-distributed mode

Before you start, uninstall MRv1 if necessary

If you have already installed MRv1 following the steps in the previous section, you now need to uninstall hadoop-0.20-conf-pseudo before running YARN. Proceed as follows.

Stop the daemons:

$ for service in /etc/init.d/hadoop-hdfs-* /etc/init.d/hadoop-0.20-mapreduce-*

> do

> sudo $service stop

> done

Remove hadoop-0.20-conf-pseudo:

On Red Hat-compatible systems:

sudo yum remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*

On SUSE systems:

sudo zypper remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*

On Ubuntu or Debian systems:

sudo apt-get remove hadoop-0.20-conf-pseudo hadoop-0.20-mapreduce-*

Note

In this case (after uninstalling hadoop-0.20-conf-pseudo) you can skip the package download steps below.

Important

If you have not already done so, install the Oracle Java Development Kit (JDK) before deploying CDH4. Follow these instructions.

On Red Hat/CentOS/Oracle 5 or Red Hat 6 systems, do the following:

Download the CDH4 Package

Click the entry in the table below that matches your Red Hat or CentOS system, choose Save File, and save the file to a directory to which you have write access (it can be your home directory).

For OS Version Click this Link

Red Hat/CentOS/Oracle 5 Red Hat/CentOS/Oracle 5 link

Red Hat/CentOS 6 (32-bit) Red Hat/CentOS 6 link (32-bit)

Red Hat/CentOS 6 (64-bit) Red Hat/CentOS 6 link (64-bit)

Install the RPM:

$ sudo yum --nogpgcheck localinstall cloudera-cdh-4-0.noarch.rpm

Note

For instructions on how to add a CDH4 yum repository or build your own CDH4 yum repository, see Installing CDH4 On Red Hat-compatible systems.

Install CDH4

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing one of the following commands:

For Red Hat/CentOS/Oracle 5 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/5/x86_64/cdh/RPM-GPG-KEY-cloudera

For Red Hat/CentOS 6 systems:

$ sudo rpm --import http://archive.cloudera.com/cdh4/redhat/6/x86_64/cdh/RPM-GPG-KEY-cloudera

Install Hadoop in pseudo-distributed mode:

To install Hadoop with YARN:

$ sudo yum install hadoop-conf-pseudo

On SUSE systems, do the following:

Download and install the CDH4 package

Click this link, choose Save File, and save it to a directory to which you have write access (it can be your home directory).

Install the RPM:

$ sudo rpm -i cloudera-cdh-4-0.noarch.rpm

Note

For instructions on how to add a CDH4 SUSE repository or build your own CDH4 SUSE repository, see Installing CDH4 On SUSE systems.

Install CDH4

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:

For all SUSE systems:

$ sudo rpm --import http://archive.cloudera.com/cdh4/sles/11/x86_64/cdh/RPM-GPG-KEY-cloudera

Install Hadoop in pseudo-distributed mode:

To install Hadoop with YARN:

$ sudo zypper install hadoop-conf-pseudo

On Ubuntu and other Debian systems, do the following:

Download and install the package

Click one of the following:

this link for a Squeeze system, or

this link for a Lucid system

this link for a Precise system.

Install the package. Do one of the following:

Choose Open with in the download window to use the package manager, or

Choose Save File, save the package to a directory to which you have write access (it can be your home directory) and install it from the command line, for example:

sudo dpkg -i Downloads/cdh4-repository_1.0_all.deb

Note

For instructions on how to add a CDH4 Debian repository or build your own CDH4 Debian repository, see Installing CDH4 On Ubuntu or Debian systems.

Install CDH4

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:

For Ubuntu Lucid systems:

$ curl -s http://archive.cloudera.com/cdh4/ubuntu/lucid/amd64/cdh/archive.key | sudo apt-key add -

For Ubuntu Precise systems:

$ curl -s http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key | sudo apt-key add -

For Debian Squeeze systems:

$ curl -s http://archive.cloudera.com/cdh4/debian/squeeze/amd64/cdh/archive.key | sudo apt-key add -

Install Hadoop in pseudo-distributed mode:

To install Hadoop with YARN:

$ sudo apt-get update

$ sudo apt-get install hadoop-conf-pseudo

Starting Hadoop and Verifying it is Working Properly

For YARN, a pseudo-distributed Hadoop installation consists of one node running all five Hadoop daemons: namenode, secondarynamenode, resourcemanager, datanode, and nodemanager.

To view the files on Red Hat or SUSE systems:

$ rpm -ql hadoop-conf-pseudo

To view the files on Ubuntu systems:

$ dpkg -L hadoop-conf-pseudo

The new configuration is self-contained in the /etc/hadoop/conf.pseudo directory.

Note

The Cloudera packages use the alternatives framework for managing which Hadoop configuration is active. All Hadoop components search for the Hadoop configuration in /etc/hadoop/conf.

To start Hadoop, proceed as follows.

Step 1: Format the NameNode.

Before starting the NameNode for the first time you must format the file system.

$ sudo -u hdfs hdfs namenode -format

Note

Make sure you perform the format of the NameNode as user hdfs. You can do this as part of the command string, using sudo -u hdfs as in the command above.

Important

In earlier releases, the hadoop-conf-pseudo package automatically formatted HDFS on installation. In CDH4, you must do this explicitly.

Step 2: Start HDFS

$ for service in /etc/init.d/hadoop-hdfs-*

> do

> sudo $service start

> done

Step 3: Create the /tmp Directory

Remove the old /tmp if it exists:

sudo -u hdfs hadoop fs -rmr /tmp

Create a new /tmp directory and set permissions:

sudo -u hdfs hadoop fs -mkdir /tmp

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp

Step 4: Create User, Staging, and Log Directories

Create a user directory and set ownership:

sudo -u hdfs hadoop fs -mkdir /user/mydir

sudo -u hdfs hadoop fs -chown myuser:myuser /user/mydir

Create the /var/log/hadoop-yarn directory and set ownership:

sudo -u hdfs hadoop fs -mkdir /var/log/hadoop-yarn

sudo -u hdfs hadoop fs -chown yarn:mapred /var/log/hadoop-yarn

Create the staging directory and set permissions:

sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging

Create the done_intermediate directory under the staging directory and set permissions:

sudo -u hdfs hadoop fs -mkdir /tmp/hadoop-yarn/staging/history/done_intermediate

sudo -u hdfs hadoop fs -chmod -R 1777 /tmp/hadoop-yarn/staging/history/done_intermediate

Change ownership on the staging directory and subdirectory:

sudo -u hdfs hadoop fs -chown -R mapred:mapred /tmp/hadoop-yarn/staging

Step 5: Verify the HDFS File Structure:

Run the following command:

$ sudo -u hdfs hadoop fs -ls -R /

You should see the following directory structure:

drwxrwxrwt - hdfs supergroup 0 2012-05-31 15:31 /tmp

drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /tmp/hadoop-yarn

drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging

drwxr-xr-x - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history

drwxrwxrwt - mapred mapred 0 2012-05-31 15:31 /tmp/hadoop-yarn/staging/history/done_intermediate

drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /user

drwxr-xr-x - myuser myuser 0 2012-05-31 15:30 /user/mydir

drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var

drwxr-xr-x - hdfs supergroup 0 2012-05-31 15:31 /var/log

drwxr-xr-x - yarn mapred 0 2012-05-31 15:31 /var/log/hadoop-yarn

Step 6: Start YARN

sudo /etc/init.d/hadoop-yarn-resourcemanager start

sudo /etc/init.d/hadoop-yarn-nodemanager start

sudo /etc/init.d/hadoop-mapreduce-historyserver start

Step 7: Create User Directories

Create a home directory for each MapReduce user. It is best to do this on the NameNode; for example:

$ sudo -u hdfs hadoop fs -mkdir /user/<user>

$ sudo -u hdfs hadoop fs -chown <user> /user/<user>

where <user> is the Linux username of each user.

Alternatively, you can log in as each Linux user (or write a script to do so) and create the home directory as follows:

sudo -u hdfs hadoop fs -mkdir /user/$USER

sudo -u hdfs hadoop fs -chown $USER /user/$USER

Running an example application with YARN

Create a home directory on HDFS for the user who will be running the job (for example, joe):

sudo -u hdfs hadoop fs -mkdir /user/joe

sudo -u hdfs hadoop fs -chown joe /user/joe

Do the following steps as the user joe.

Make a directory in HDFS called input and copy some XML files into it by running the following commands in pseudo-distributed mode:

$ hadoop fs -mkdir input

$ hadoop fs -put /etc/hadoop/conf/*.xml input

$ hadoop fs -ls input

Found 3 items:

-rw-r--r-- 1 joe supergroup 1348 2012-02-13 12:21 input/core-site.xml

-rw-r--r-- 1 joe supergroup 1913 2012-02-13 12:21 input/hdfs-site.xml

-rw-r--r-- 1 joe supergroup 1001 2012-02-13 12:21 input/mapred-site.xml

Set HADOOP_MAPRED_HOME for user joe:

$ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce

Run an example Hadoop job to grep with a regular expression in your input data.

$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar grep input output23 'dfs[a-z.]+'

After the job completes, you can find the output in the HDFS directory named output23 because you specified that output directory to Hadoop.

$ hadoop fs -ls

Found 2 items

drwxr-xr-x - joe supergroup 0 2009-08-18 18:36 /user/joe/input

drwxr-xr-x - joe supergroup 0 2009-08-18 18:38 /user/joe/output23

You can see that there is a new directory called output23.

List the output files.

$ hadoop fs -ls output23

Found 2 items

drwxr-xr-x - joe supergroup 0 2009-02-25 10:33 /user/joe/output23/_SUCCESS

-rw-r--r-- 1 joe supergroup 1068 2009-02-25 10:33 /user/joe/output23/part-r-00000

Read the results in the output file.

$ hadoop fs -cat output23/part-r-00000 | head

1 dfs.safemode.min.datanodes

1 dfs.safemode.extension

1 dfs.replication

1 dfs.permissions.enabled

1 dfs.namenode.name.dir

1 dfs.namenode.checkpoint.dir

1 dfs.datanode.data.dir

Hadoop windows intelij 跑 MR WordCount piziyang12138
一、软件环境我使用的软件版本如下:IntellijIdea2017.1Maven3.3.9Hadoop分布式环境二、创建maven工程打开Idea,file->new->Project,左侧面板选择maven工程。(如果只跑MapReduce创建java工程即可，不用勾选Creatfromarchetype，如果想创建web工程或者使用骨架可以勾选)image.png设置GroupId和Artif
Hadoop学习第三课（HDFS架构--读、写流程）小小程序员呀~ 数据库 hadoop 架构 big data
1.块概念举例1：一桶水1000ml，瓶子的规格100ml=>需要10个瓶子装完一桶水1010ml，瓶子的规格100ml=>需要11个瓶子装完一桶水1010ml，瓶子的规格200ml=>需要6个瓶子装完块的大小规格，只要是需要存储，哪怕一点点，也是要占用一个块的块大小的参数：dfs.blocksize官方默认的大小为128M官网：https://hadoop.apache.org/docs/r3.
hadoop启动HDFS命令 m0_67401228 java 搜索引擎 linux 后端
启动命令：/hadoop/sbin/start-dfs.sh停止命令：/hadoop/sbin/stop-dfs.sh
【计算机毕设-大数据方向】基于Hadoop的电商交易数据分析可视化系统的设计与实现程序员-石头山大数据实战案例大数据 hadoop 毕业设计毕设
博主介绍：✌全平台粉丝5W+,高级大厂开发程序员，博客之星、掘金/知乎/华为云/阿里云等平台优质作者。【源码获取】关注并且私信我【联系方式】最下边感兴趣的可以先收藏起来，同学门有不懂的毕设选题，项目以及论文编写等相关问题都可以和学长沟通，希望帮助更多同学解决问题前言随着电子商务行业的迅猛发展，电商平台积累了海量的数据资源，这些数据不仅包括用户的基本信息、购物记录，还包括用户的浏览行为、评价反馈等多
分布式离线计算—Spark—基础介绍测试开发abbey 人工智能—大数据
原文作者：饥渴的小苹果原文地址：【Spark】Spark基础教程目录Spark特点Spark相对于Hadoop的优势Spark生态系统Spark基本概念Spark结构设计Spark各种概念之间的关系Executor的优点Spark运行基本流程Spark运行架构的特点Spark的部署模式Spark三种部署方式Hadoop和Spark的统一部署摘要：Spark是基于内存计算的大数据并行计算框架Spar
spark常用命令我是浣熊的微笑 spark
查看报错日志：yarnlogsapplicationIDspark2-submit--masteryarn--classcom.hik.ReadHdfstest-1.0-SNAPSHOT.jar进入$SPARK_HOME目录，输入bin/spark-submit--help可以得到该命令的使用帮助。hadoop@wyy:/app/hadoop/spark100$bin/spark-submit--
spark启动命令学不会又听不懂 spark 大数据分布式
hadoop启动：cd/root/toolssstart-dfs.sh，只需在hadoop01上启动stop-dfs.sh日志查看：cat/root/toolss/hadoop/logs/hadoop-root-datanode-hadoop03.outzookeeper启动：cd/root/toolss/zookeeperbin/zkServer.shstart，三台都要启动bin/zkServ
编程常用命令总结 Yellow0523 Linux BigData 大数据
编程命令大全1.软件环境变量的配置JavaScalaSparkHadoopHive2.大数据软件常用命令Spark基本命令Spark-SQL命令Hive命令HDFS命令YARN命令Zookeeper命令kafka命令Hibench命令MySQL命令3.Linux常用命令Git命令conda命令pip命令查看Linux系统的详细信息查看Linux系统架构(X86还是ARM，两种方法都可)端口号命令L
Hadoop常见面试题整理及解答叶青舟 Linux hdfs 大数据 hadoop linux
Hadoop常见面试题整理及解答一、基础知识篇：1.把数据仓库从传统关系型数据库转到hadoop有什么优势？答：（1）关系型数据库成本高，且存储空间有限。而Hadoop使用较为廉价的机器存储数据，且Hadoop可以将大量机器构建成一个集群，并在集群中使用HDFS文件系统统一管理数据，极大的提高了数据的存储及处理能力。（2）关系型数据库仅支持标准结构化数据格式，Hadoop不仅支持标准结构化数据格式
2025毕业设计指南：如何用Hadoop构建超市进货推荐系统？大数据分析助力精准采购计算机编程指导师 Java实战集 Python实战集大数据实战集课程设计 hadoop 数据分析 spring boot java 进货 python
✍✍计算机编程指导师⭐⭐个人介绍：自己非常喜欢研究技术问题！专业做Java、Python、小程序、安卓、大数据、爬虫、Golang、大屏等实战项目。⛽⛽实战项目：有源码或者技术上的问题欢迎在评论区一起讨论交流！⚡⚡Java实战|SpringBoot/SSMPython实战项目|Django微信小程序/安卓实战项目大数据实战项目⚡⚡文末获取源码文章目录⚡⚡文末获取源码基于hadoop的超市进货推荐系
Hadoop Common 之序列化机制小解猫君之上 #Apache Hadoop
1.JavaSerializable序列化该序列化通过ObjectInputStream的readObject实现序列化，ObjectOutputStream的writeObject实现反序列化。这不过此种序列化虽然跨病态兼容性强，但是因为存储过多的信息，但是传输效率比较低，所以hadoop弃用它。（序列化信息包括这个对象的类，类签名，类的所有静态，费静态成员的值，以及他们父类都要被写入）publ
深入理解hadoop(一)----Common的实现----Configuration maoxiao_jsd 深入理解----hadoop
属本人个人原创，转载请注明,希望对大家有帮助！！一,hadoop的配置管理a,hadoop通过独有的Configuration处理配置信息Configurationconf=newConfiguration();conf.addResource("core-default.xml");conf.addResource("core-site.xml");后者会覆盖前者中未final标记的相同配置项b
hadoop 0.22.0 部署笔记 weixin_33701564 大数据 java 运维
为什么80%的码农都做不了架构师？>>>因为需要使用hbase，所以开始对hbase进行学习。hbase是部署在hadoop平台上的NOSql数据库，因此在部署hbase之前需要先部署hadoop。环境：redhat5、hadoop-0.22.0.tar.gz、jdk-6u13-linux-i586.zipip192.168.1.128hostname：localhost.localdomain（
解决Windows环境下hadoop集群的运行_window运行hadoop,unknown hadoop01(4) 2401_84160087 大数据面试学习
网上学习资料一大堆，但如果学到的知识不成体系，遇到问题时只是浅尝辄止，不再深入研究，那么很难做到真正的技术提升。需要这份系统化资料的朋友，可以戳这里获取一个人可以走的很快，但一群人才能走的更远！不论你是正从事IT行业的老鸟或是对IT行业感兴趣的新人，都欢迎加入我们的的圈子（技术交流、学习资源、职场吐槽、大厂内推、面试辅导），让我们一起学习成长！org.apache.hadoophadoop-com
解决Windows环境下hadoop集群的运行_window运行hadoop,unknown hadoop01(3) 2401_84160087 大数据面试学习
网上学习资料一大堆，但如果学到的知识不成体系，遇到问题时只是浅尝辄止，不再深入研究，那么很难做到真正的技术提升。需要这份系统化资料的朋友，可以戳这里获取一个人可以走的很快，但一群人才能走的更远！不论你是正从事IT行业的老鸟或是对IT行业感兴趣的新人，都欢迎加入我们的的圈子（技术交流、学习资源、职场吐槽、大厂内推、面试辅导），让我们一起学习成长！xmlns:xsi="http://www.w3.or
深入解析HDFS：定义、架构、原理、应用场景及常用命令 CloudJourney hdfs 架构 hadoop
引言Hadoop分布式文件系统（HDFS，HadoopDistributedFileSystem）是Hadoop框架的核心组件之一，它提供了高可靠性、高可用性和高吞吐量的大规模数据存储和管理能力。本文将从HDFS的定义、架构、工作原理、应用场景以及常用命令等多个方面进行详细探讨，帮助读者全面深入地了解HDFS。1.HDFS的定义1.1什么是HDFSHDFS是Hadoop生态系统中的一个分布式文件系
Hadoop的搭建流程 lzhlizihang hadoop 大数据分布式
文章目录一、配置IP二、配置主机名三、配置主机映射四、关闭防火墙五、配置免密六、安装jdk1、第一步：2、第二步：3、第三步：4、第四步：5、第五步：七、安装hadoop1、上传2、解压3、重命名4、开始配置环境变量5、刷新配置文件6、验证hadoop命令是否可以识别八、全分布搭建7、修改配置文件core-site.xml8、修改配置文件hdfs-site.xml9、修改配置文件hadoop-en
hive搭建 -----内嵌模式和本地模式 lzhlizihang hive hadoop
文章目录一、内嵌模式（使用较少）1、上传、解压、重命名2、配置环境变量3、配置conf下的hive-env.sh4、修改conf下的hive-site.xml5、启动hadoop集群6、给hdfs创建文件夹7、修改hive-site.xml中的非法字符8、初始化元数据9、测试是否成功10、内嵌模式的缺点二、本地模式（最常用）1、检查mysql是否正常2、上传、解压、重命名3、配置环境变量4、修改c
Hadoop之mapreduce -- WrodCount案例以及各种概念 lzhlizihang hadoop mapreduce 大数据
文章目录一、MapReduce的优缺点二、MapReduce案例--WordCount1、导包2、Mapper方法3、Partitioner方法（自定义分区器）4、reducer方法5、driver（main方法）6、Writable（手机流量统计案例的实体类）三、关于片和块1、什么是片，什么是块？2、mapreduce启动多少个MapTask任务？四、MapReduce的原理五、Shuffle过
IAAS: IT公司去IOE-Alibaba系统构架解读 wishchin 心理学/职业 BigDataMini Spark PaaS
从Hadoop到自主研发，技术解读阿里去IOE后的系统架构原地址：......................云计算阿里飞天摘要：从IOE时代，到Hadoop与飞天并行，再到飞天单集群5000节点的实现，阿里一直摸索在技术衍变的前沿。这里，我们将从架构、性能、运维等多个方面深入了解阿里基础设施。【导读】互联网的普及，智能终端的增加，大数据时代悄然而至。在这个数据为王的时代，数十倍、数百倍的数据给各
java数字签名三种方式知了ing java jdk
以下3钟数字签名都是基于jdk7的 1，RSA String password="test"; // 1.初始化密钥 KeyPairGenerator keyPairGenerator = KeyPairGenerator.getInstance("RSA"); keyPairGenerator.initialize(51
Hibernate学习笔记 caoyong Hibernate
1>、Hibernate是数据访问层框架，是一个ORM(Object Relation Mapping)框架，作者为:Gavin King 2>、搭建Hibernate的开发环境 a>、添加jar包: aa>、hibernatte开发包中/lib/required/所
设计模式之装饰器模式Decorator（结构型）漂泊一剑客 Decorator
1. 概述若你从事过面向对象开发，实现给一个类或对象增加行为，使用继承机制，这是所有面向对象语言的一个基本特性。如果已经存在的一个类缺少某些方法，或者须要给方法添加更多的功能（魅力），你也许会仅仅继承这个类来产生一个新类—这建立在额外的代码上。
读取磁盘文件txt，并输入String 一炮送你回车库 String
public static void main(String[] args) throws IOException { String fileContent = readFileContent("d:/aaa.txt"); System.out.println(fileContent);
js三级联动下拉框 3213213333332132 三级联动
//三级联动省/直辖市<select id="province"></select> 市/省直辖<select id="city"></select> 县/区 <select id="area"></select>
erlang之parse_transform编译选项的应用 616050468 parse_transform 游戏服务器属性同步 abstract_code
最近使用erlang重构了游戏服务器的所有代码，之前看过C++/lua写的服务器引擎代码，引擎实现了玩家属性自动同步给前端和增量更新玩家数据到数据库的功能，这也是现在很多游戏服务器的优化方向，在引擎层面去解决数据同步和数据持久化，数据发生变化了业务层不需要关心怎么去同步给前端。由于游戏过程中玩家每个业务中玩家数据更改的量其实是很少
JAVA JSON的解析 darkranger java
// { // “Total”：“条数”， // Code: 1, // // “PaymentItems”:[ // { // “PaymentItemID”:”支款单ID”, // “PaymentCode”:”支款单编号”, // “PaymentTime”:”支款日期”, // ”ContractNo”:”合同号”， //
POJ-1273-Drainage Ditches aijuans ACM_POJ
POJ-1273-Drainage Ditches http://poj.org/problem?id=1273 基本的最大流，按LRJ的白书写的 #include<iostream> #include<cstring> #include<queue> using namespace std; #define INF 0x7fffffff int ma
工作流Activiti5表的命名及含义 atongyeye 工作流 Activiti
activiti5 - http://activiti.org/designer/update在线插件安装 activiti5一共23张表 Activiti的表都以ACT_开头。第二部分是表示表的用途的两个字母标识。用途也和服务的API对应。 ACT_RE_*: 'RE'表示repository。这个前缀的表包含了流程定义和流程静态资源（图片，规则，等等）。 A
android的广播机制和广播的简单使用百合不是茶 android 广播机制广播的注册
Android广播机制简介在Android中，有一些操作完成以后，会发送广播，比如说发出一条短信，或打出一个电话，如果某个程序接收了这个广播，就会做相应的处理。这个广播跟我们传统意义中的电台广播有些相似之处。之所以叫做广播，就是因为它只负责“说”而不管你“听不听”，也就是不管你接收方如何处理。另外，广播可以被不只一个应用程序所接收，当然也可能不被任何应
Spring事务传播行为详解 bijian1013 java spring 事务传播行为
在service类前加上@Transactional，声明这个service所有方法需要事务管理。每一个业务方法开始时都会打开一个事务。 Spring默认情况下会对运行期例外(RunTimeException)进行事务回滚。这
eidtplus operate 征客丶 eidtplus
开启列模式: Alt+C 鼠标选择 OR Alt+鼠标左键拖动列模式替换或复制内容(多行): 右键-->格式-->填充所选内容-->选择相应操作 OR Ctrl+Shift+V(复制多行数据,必须行数一致) -------------------------------------------------------
【Kafka一】Kafka入门 bit1129 kafka
这篇文章来自Spark集成Kafka(http://bit1129.iteye.com/blog/2174765)，这里把它单独取出来，作为Kafka的入门吧下载Kafka http://mirror.bit.edu.cn/apache/kafka/0.8.1.1/kafka_2.10-0.8.1.1.tgz 2.10表示Scala的版本，而0.8.1.1表示Kafka
Spring 事务实现机制 BlueSkator spring 代理事务
Spring是以代理的方式实现对事务的管理。我们在Action中所使用的Service对象，其实是代理对象的实例，并不是我们所写的Service对象实例。既然是两个不同的对象，那为什么我们在Action中可以象使用Service对象一样的使用代理对象呢？为了说明问题，假设有个Service类叫AService，它的Spring事务代理类为AProxyService，AService实现了一个接口
bootstrap源码学习与示例：bootstrap-dropdown（转帖） BreakingBad bootstrap dropdown
bootstrap-dropdown组件是个烂东西，我读后的整体感觉。一个下拉开菜单的设计： <ul class="nav pull-right"> <li id="fat-menu" class="dropdown">
读《研磨设计模式》-代码笔记-中介者模式-Mediator bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ /* * 中介者模式（Mediator）：用一个中介对象来封装一系列的对象交互。 * 中介者使各对象不需要显式地相互引用，从而使其耦合松散，而且可以独立地改变它们之间的交互。 * * 在我看来，Mediator模式是把多个对象（
常用代码记录 chenjunt3 UI Excel J#
1、单据设置某行或某字段不能修改 //i是行号,"cash"是字段名称 getBillCardPanelWrapper().getBillCardPanel().getBillModel().setCellEditable(i, "cash", false); //取得单据表体所有项用以上语句做循环就能设置整行了 getBillC
搜索引擎与工作流引擎 comsci 算法工作搜索引擎网络应用
最近在公司做和搜索有关的工作，(只是简单的应用开源工具集成到自己的产品中)工作流系统的进一步设计暂时放在一边了，偶然看到谷歌的研究员吴军写的数学之美系列中的搜索引擎与图论这篇文章中的介绍，我发现这样一个关系(仅仅是猜想) -----搜索引擎和流程引擎的基础--都是图论，至少像在我在JWFD中引擎算法中用到的是自定义的广度优先
oracle Health Monitor daizj oracle Health Monitor
About Health Monitor Beginning with Release 11g, Oracle Database includes a framework called Health Monitor for running diagnostic checks on the database. About Health Monitor Checks Health M
JSON字符串转换为对象 dieslrae java json
作为前言,首先是要吐槽一下公司的脑残编译部署方式,web和core分开部署本来没什么问题,但是这丫居然不把json的包作为基础包而作为web的包,导致了core端不能使用,而且我们的core是可以当web来用的(不要在意这些细节),所以在core中处理json串就是个问题.没办法,跟编译那帮人也扯不清楚,只有自己写json的解析了.
C语言学习八结构体，综合应用，学生管理系统 dcj3sjt126com C语言
实现功能的代码： # include <stdio.h> # include <malloc.h> struct Student { int age; float score; char name[100]; }; int main(void) { int len; struct Student * pArr; int i,
vagrant学习笔记 dcj3sjt126com vagrant
想了解多主机是如何定义和使用的, 所以又学习了一遍vagrant 1. vagrant virtualbox 下载安装 https://www.vagrantup.com/downloads.html https://www.virtualbox.org/wiki/Downloads 查看安装在命令行输入vagrant 2.
14.性能优化-优化-软件配置优化 frank1234 软件配置性能优化
1.Tomcat线程池修改tomcat的server.xml文件： <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" maxThreads="1200" m
一个不错的shell 脚本教程入门级 HarborChung linux shell
一个不错的shell 脚本教程入门级建立一个脚本　　Linux中有好多中不同的shell，但是通常我们使用bash (bourne again shell) 进行shell编程，因为bash是免费的并且很容易使用。所以在本文中笔者所提供的脚本都是使用bash（但是在大多数情况下，这些脚本同样可以在 bash的大姐，bourne shell中运行）。　　如同其他语言一样
Spring4新特性——核心容器的其他改进 jinnianshilongnian spring 动态代理 spring4 依赖注入
Spring4新特性——泛型限定式依赖注入 Spring4新特性——核心容器的其他改进 Spring4新特性——Web开发的增强 Spring4新特性——集成Bean Validation 1.1(JSR-349)到SpringMVC Spring4新特性——Groovy Bean定义DSL Spring4新特性——更好的Java泛型操作API Spring4新
Linux设置tomcat开机启动 liuxingguome tomcat linux 开机自启动
执行命令sudo gedit /etc/init.d/tomcat6 然后把以下英文部分复制过去。（注意第一句#!/bin/sh如果不写，就不是一个shell文件。然后将对应的jdk和tomcat换成你自己的目录就行了。 #!/bin/bash # # /etc/rc.d/init.d/tomcat # init script for tomcat precesses
第13章 Ajax进阶（下） onestopweb Ajax
index.html <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/
Troubleshooting Crystal Reports off BW blueoxygen BO
http://wiki.sdn.sap.com/wiki/display/BOBJ/Troubleshooting+Crystal+Reports+off+BW#TroubleshootingCrystalReportsoffBW-TracingBOE Quite useful, especially this part: SAP BW connectivity For t
Java开发熟手该当心的11个错误 tomcat_oracle java jvm 多线程单元测试
#1、不在属性文件或XML文件中外化配置属性。比如，没有把批处理使用的线程数设置成可在属性文件中配置。你的批处理程序无论在DEV环境中，还是UAT（用户验收测试）环境中，都可以顺畅无阻地运行，但是一旦部署在PROD 上，把它作为多线程程序处理更大的数据集时，就会抛出IOException，原因可能是JDBC驱动版本不同，也可能是#2中讨论的问题。如果线程数目可以在属性文件中配置，那么使它成为
正则表达式大全 yang852220741 html 编程正则表达式
今天向大家分享正则表达式大全，它可以大提高你的工作效率正则表达式也可以被当作是一门语言，当你学习一门新的编程语言的时候，他们是一个小的子语言。初看时觉得它没有任何的意义，但是很多时候，你不得不阅读一些教程，或文章来理解这些简单的描述模式。一、校验数字的表达式数字：^[0-9]*$ n位的数字：^\d{n}$ 至少n位的数字：^\d{n,}$ m-n位的数字：^\d{m,n}$