shuhuai007

Open Source Big Data for the Impatient, Part 1

转载：http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/

There is a lot of excitement about Big Data and a lot of confusion to go with it. This article will provide a working definition of Big Data and then work through a series of examples so you can have a first-hand understanding of some of the capabilities of Hadoop, the leading open source technology in the Big Data domain. Specifically let's focus on the following questions.

What is Big Data, Hadoop, Sqoop, Hive, and Pig, and why is there so much excitement in this space?
How does Hadoop relate to IBM DB2 and Informix? Can these technologies play together?
How can I get started with Big Data? What are some easy examples that run on a single PC?
For the super impatient, if you can already define Hadoop and want to get right to working on the code samples, then do the following.
1. Fire up your Informix or DB2 instance.
2. Download the VMWare image from the Cloudera Website and increase the virtual machine RAM setting to 1.5 GB.
3. Jump to the section that contains the code samples.
4. There is a MySQL instance built into the VMWare image. If you are doing the exercises without network connectivity, use the MySQL examples.

For everyone else, read on...

What is Big Data?

Big Data is large in quantity, is captured at a rapid rate, and is structured or unstructured, or some combination of the above. These factors make Big Data difficult to capture, mine, and manage using traditional methods. There is so much hype in this space that there could be an extended debate just about the definition of big data.

Using Big Data technology is not restricted to large volumes. The examples in this article use small samples to illustrate the capabilities of the technology. As of the year 2012, clusters that are big are in the 100 Petabyte range.

Big Data can be both structured and unstructured. Traditional relational databases, like Informix and DB2, provide proven solutions for structured data. Via extensibility they also manage unstructured data. The Hadoop technology brings new and more accessible programming techniques for working on massive data stores with both structured and unstructured data.

Why all the excitement?

There are many factors contributing to the hype around Big Data, including the following.

Bringing compute and storage together on commodity hardware: The result is blazing speed at low cost.
Price performance: The Hadoop big data technology provides significant cost savings (think a factor of approximately 10) with significant performance improvements (again, think factor of 10). Your mileage may vary. If the existing technology can be so dramatically trounced, it is worth examining if Hadoop can complement or replace aspects of your current architecture.
Linear Scalability: Every parallel technology makes claims about scale up. Hadoop has genuine scalability since the latest release is expanding the limit on the number of nodes to beyond 4,000.
Full access to unstructured data: A highly scalable data store with a good parallel programming model, MapReduce, has been a challenge for the industry for some time. Hadoop's programming model doesn't solve all problems, but it is a strong solution for many tasks.

Hadoop distributions: IBM and Cloudera

One of the points of confusion is, "Where do I get software to work on Big Data?" The examples in this article are based on the free Cloudera distribution of Hadoop called CDH (for Cloudera distribution including Hadoop). This is available as a VMWare image from the Cloudera web site. IBM has recently announced it is porting it's big data platform to run on CDH. More details are found in theResources section.

The term disruptive technology is heavily overused, but in this case may be appropriate.

What is Hadoop?

Following are several definitions of Hadoop, each one targeting a different audience within the enterprise:

For the executives: Hadoop is an Apache open source software project to get value from the incredible volume/velocity/variety of data about your organization. Use the data instead of throwing most of it away.
For the technical managers: An open source suite of software that mines the structured and unstructured BigData about your company. It integrates with your existing Business Intelligence ecosystem.
Legal: An open source suite of software that is packaged and supported by multiple suppliers. Please see the Resourcessection regarding IP indemnification.
Engineering: A massively parallel, shared nothing, Java-based map-reduce execution environment. Think hundreds to thousands of computers working on the same problem, with built-in failure resilience. Projects in the Hadoop ecosystem provide data loading, higher-level languages, automated cloud deployment, and other capabilities.
Security: A Kerberos-secured software suite.

What are the components of Hadoop?

The Apache Hadoop project has two core components, the file store called Hadoop Distributed File System (HDFS), and the programming framework called MapReduce. There are a number of supporting projects that leverage HDFS and MapReduce. This article will provide a summary, and encourages you to get the OReily book "Hadoop The Definitive Guide", 3rd Edition, for more details.

The definitions below are meant to provide just enough background for you to use the code examples that follow. This article is really meant to get you started with hands-on experience with the technology. This is a how-to article more than a what-is or let's-discuss article.

HDFS: If you want 4000+ computers to work on your data, then you'd better spread your data across 4000+ computers. HDFS does this for you. HDFS has a few moving parts. The Datanodes store your data, and the Namenode keeps track of where stuff is stored. There are other pieces, but you have enough to get started.
MapReduce: This is the programming model for Hadoop. There are two phases, not surprisingly called Map and Reduce. To impress your friends tell them there is a shuffle-sort between the Map and Reduce phase. The JobTracker manages the 4000+ components of your MapReduce job. The TaskTrackers take orders from the JobTracker. If you like Java then code in Java. If you like SQL or other non-Java languages you are still in luck, you can use a utility called Hadoop Streaming.
Hadoop Streaming: A utility to enable MapReduce code in any language: C, Perl, Python, C++, Bash, etc. The examples include a Python mapper and an AWK reducer.
Hive and Hue: If you like SQL, you will be delighted to hear that you can write SQL and have Hive convert it to a MapReduce job. No, you don't get a full ANSI-SQL environment, but you do get 4000 notes and multi-Petabyte scalability. Hue gives you a browser-based graphical interface to do your Hive work.
Pig: A higher-level programming environment to do MapReduce coding. The Pig language is called Pig Latin. You may find the naming conventions somewhat unconventional, but you get incredible price-performance and high availability.
Sqoop: Provides bi-directional data transfer between Hadoop and your favorite relational database.
Oozie: Manages Hadoop workflow. This doesn't replace your scheduler or BPM tooling, but it does provide if-then-else branching and control within your Hadoop jobs.
HBase: A super-scalable key-value store. It works very much like a persistent hash-map (for python fans think dictionary). It is not a relational database despite the name HBase.
FlumeNG: A real time loader for streaming your data into Hadoop. It stores data in HDFS and HBase. You'll want to get started with FlumeNG, which improves on the original flume.
Whirr: Cloud provisioning for Hadoop. You can start up a cluster in just a few minutes with a very short configuration file.
Mahout: Machine learning for Hadoop. Used for predictive analytics and other advanced analysis.
Fuse: Makes the HDFS system look like a regular file system so you can use ls, rm, cd, and others on HDFS data
Zookeeper: Used to manage synchronization for the cluster. You won't be working much with Zookeeper, but it is working hard for you. If you think you need to write a program that uses Zookeeper you are either very, very, smart and could be a committee for an Apache project, or you are about to have a very bad day.

Figure 1 shows the key pieces of Hadoop.

Figure 1. Hadoop architecture

Open Source Big Data for the Impatient, Part 1_第1张图片

HDFS, the bottom layer, sits on a cluster of commodity hardware. Simple rack-mounted servers, each with 2-Hex core CPUs, 6 to 12 disks, and 32 gig ram. For a map-reduce job, the mapper layer reads from the disks at very high speed. The mapper emits key value pairs that are sorted and presented to the reducer, and the reducer layer summarizes the key-value pairs. No, you don't have to summarize, you can actually have a map-reduce job that has only mappers. This should become easier to understand when you get to the python-awk example.

How does Hadoop integrate with my Informix or DB2 infrastructure?

Hadoop integrates very well with your Informix and DB2 databases with Sqoop. Sqoop is the leading open-source implementation for moving data between Hadoop and relational databases. It uses JDBC to read and write Informix, DB2, MySQL, Oracle, and other sources. There are optimized adapters for several databases, including Netezza and DB2. Please see the Resources section on how to download these adapters. The examples are all Sqoop specific.

Getting started: How to run simple Hadoop, Hive, Pig, Oozie, and Sqoop examples

You are done with introductions and definitions, now it is time for the good stuff. To continue, you'll need to download the VMWare, virtual box, or other image from the Cloudera Web Site and start doing MapReduce! The virtual image assumes you have a 64bit computer and one of the popular virtualization environments. Most of the virtualization environments have a free download. When you try to boot up a 64bit virtual image you may get complaints about BIOS settings. Figure 2 shows the required change in the BIOS, in this case on a Thinkpad™. Use caution when making changes. Some corporate security packages will require a passcode after a BIOS change before the system will reboot.

Figure 2. BIOS settings for a 64bit virtual guest

Open Source Big Data for the Impatient, Part 1_第2张图片

The big data used here is actually rather small. The point is not to make your laptop catch on fire from grinding on a massive file, but to show you sources of data that are interesting, and map-reduce jobs that answer meaningful questions.

Download the Hadoop virtual image

It is highly recommended that you use the Cloudera image for running these examples. Hadoop is a technology that solves problems. The Cloudera image packaging allows you to focus on the big-data questions. But if you decide to assemble all the parts yourself, Hadoop has become the problem, not the solution.

Download an image. The CDH4 image, the latest offering is available here: CDH4 image. The Prior version, CDH3, is available here:CDH3 image.

You have your choice of virtualization technologies. You can download a free virtualization environment from VMWare and others. For example, go to vmware.com and download the vmware-player. Your laptop is probably running Windows so you'd download the vmware-player for windows. The examples in this article will be using VMWare for these examples, and running Ubuntu Linux using "tar" instead of "winzip" or equivalent.

Once downloaded, untar/unzip as follows: tar -zxvf cloudera-demo-vm-cdh4.0.0-vmware.tar.gz.

Or, if you are using CDH3, then use the following: tar -zxvf cloudera-demo-vm-cdh3u4-vmware.tar.gz

Unzip typically works on tar files. Once unzipped, you can fire up the image as follows:
vmplayer cloudera-demo-vm.vmx.

You'll now have a screen that looks like what is shown in Figure 3.

Figure 3. Cloudera virtual image

Open Source Big Data for the Impatient, Part 1_第3张图片

The vmplayer command dives right in and starts the virtual machine. If you are using CDH3, then you will need to shut down the machine and change the memory settings. Use the power button icon next to the clock at the middle bottom of the screen to power off the virtual machine. You then have edit access to the virtual machine settings.

For CDH3 the next step is to super-charge the virtual image with more RAM. Most settings can only be changed with the virtual machine powered off. Figure 4 shows how to access the setting and increasing the RAM allocated to over 2GB.

Figure 4. Adding RAM to the virtual machine

Open Source Big Data for the Impatient, Part 1_第4张图片

As shown in Figure 5, you can change the network setting to bridged. With this setting the virtual machine will get its own IP address. If this creates issues on your network, then you can optionally use Network Address Translation (NAT). You'll be using the network to connect to the database.

Figure 5. Changing the network settings to bridged

Open Source Big Data for the Impatient, Part 1_第5张图片

You are limited by the RAM on the host system, so don't try to allocate more RAM than what exists on your machine. If you do, the computer will run very slowly.

Now, for the moment you've been waiting for, go ahead and power on the virtual machine. The user cloudera is automatically logged in on startup. If you need it, the Cloudera password is: cloudera.

Install Informix and DB2

You'll need a database to work with. If you don't already have a database, then you can download the Informix Developer edition here, or the free DB2 Express-C Edition.

Another alternative for installing DB2 is to download the VMWare image that already has DB2 installed on a SuSE Linux operating system. Login as root, with the password: password.

Switch to the db2inst1 userid. Working as root is like driving a car without a seatbelt. Please talk to your local friendly DBA about getting the database running. This article won't cover that here. Don't try to install the database inside the Cloudera virtual image because there isn't enough free disk space.

The virtual machine will be connecting to the database using Sqoop which requires a JDBC driver. You will need to have the JDBC driver for your database in the virtual image. You can install the Informix driver here.

The DB2 driver is located here: http://www.ibm.com/services/forms/preLogin.do?source=swg-idsdjs or http://www-01.ibm.com/support/docview.wss?rs=4020&uid=swg21385217.

The Informix JDBC driver (remember, only the driver inside the virtual image, not the database) install is shown in Listing 1.

Listing 1. Informix JDBC driver install

                    
tar -xvf ../JDBC.3.70.JC5DE.tar
followed by 
java -jar setup.jar

Note: Select a subdirectory relative to /home/cloudera so as not to require root permission for the installation.

The DB2 JDBC driver is in zipped format, so just unzip it in the destination directory, as shown in Listing 2.
Listing 2. DB2 JDBC driver install

                     
mkdir db2jdbc
cd db2jdbc
unzip ../ibm_data_server_driver_for_jdbc_sqlj_v10.1.zip

A quick introduction to HDFS and MapReduce

Before you start moving data between your relational database and Hadoop, you need a quick introduction to HDFS and MapReduce. There are a lot of "hello world" style tutorials for Hadoop, so the examples here are meant to give only enough background for the database exercises to make sense to you.

HDFS provides storage across the nodes in your cluster. The first step in using Hadoop is putting data into HDFS. The code shown in Listing 3 gets a copy of a book by Mark Twain and a book by James Fenimore Cooper and copies these texts into HDFS.

Listing 3. Load Mark Twain and James Fenimore Cooper into HDFS

                 
# install wget utility into the virtual image
sudo yum install wget
                
# use wget to download the Twain and Cooper's works
$ wget -U firefox http://www.gutenberg.org/cache/epub/76/pg76.txt
$ wget -U firefox http://www.gutenberg.org/cache/epub/3285/pg3285.txt
                
# load both into the HDFS file system
# first give the files better names
# DS for Deerslayer
# HF for  Huckleberry Finn
$ mv pg3285.txt DS.txt
$ mv pg76.txt HF.txt
                
# this next command will fail if the directory already exists
$ hadoop fs -mkdir /user/cloudera 
                
# now put the text into the directory 
$ hadoop fs -put HF.txt /user/cloudera
                
                
# way too much typing, create aliases for hadoop commands
$ alias hput="hadoop fs -put"
$ alias hcat="hadoop fs -cat"
$ alias hls="hadoop fs -ls"
# for CDH4 
$ alias hrmr="hadoop fs -rm -r"
# for CDH3 
$ alias hrmr="hadoop fs -rmr"
                
# load the other article
# but add some compression because we can 
                
$ gzip DS.txt 
                
# the  .  in the next command references the cloudera home directory
# in hdfs, /user/cloudera 
                
$ hput DS.txt.gz .
                
# now take a look at the files we have in place
$ hls
Found 2 items
-rw-r--r-- 1 cloudera supergroup  459386 2012-08-08 19:34 /user/cloudera/DS.txt.gz
-rw-r--r-- 1 cloudera supergroup  597587 2012-08-08 19:35 /user/cloudera/HF.txt

You now have two files in a directory in HDFS. Please contain your excitement. Seriously, on a single node and with only about 1 megabyte, this is as exciting as watching paint dry. But if this was a 400 node cluster and you had 5 petabytes live, then you really would have trouble containing your excitement.

Many of the Hadoop tutorials use the word count example that is included in the example jar file. It turns out that a lot of the analysis involves counting and aggregating. The example in Listing 4 shows you how to invoke the word counter.

Listing 4. Counting words of Twain and Cooper

                 
# hadoop comes with some examples
# this next line uses the provided java implementation of a 
# word count program
                
# for CDH4:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount HF.txt HF.out

# for CDH3:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount HF.txt HF.out
                
# for CDH4:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar wordcount DS.txt.gz DS.out

# for CDH3:
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount  DS.txt.gz DS.out

The .gz suffix on the DS.txt.gz tells Hadoop to deal with decompression as part of the Map-Reduce processing. Cooper is a bit verbose so well deserves the compaction.

There is quite a stream of messages from running your word count job. Hadoop is happy to provide a lot of detail about the Mapping and Reducing programs running on your behalf. The critical lines you want to look for are shown in Listing 5, including a second listing of a failed job and how to fix one of the most common errors you'll encounter running MapReduce.

Listing 5. MapReduce messages - the "happy path"

                 
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount HF.txt HF.out
12/08/08 19:23:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 19:23:47 WARN snappy.LoadSnappy: Snappy native library is available
12/08/08 19:23:47 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/08/08 19:23:47 INFO snappy.LoadSnappy: Snappy native library loaded
12/08/08 19:23:47 INFO mapred.JobClient: Running job: job_201208081900_0002
12/08/08 19:23:48 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 19:23:54 INFO mapred.JobClient:  map 100% reduce 0%
12/08/08 19:24:01 INFO mapred.JobClient:  map 100% reduce 33%
12/08/08 19:24:03 INFO mapred.JobClient:  map 100% reduce 100%
12/08/08 19:24:04 INFO mapred.JobClient: Job complete: job_201208081900_0002
12/08/08 19:24:04 INFO mapred.JobClient: Counters: 26
12/08/08 19:24:04 INFO mapred.JobClient:   Job Counters 
12/08/08 19:24:04 INFO mapred.JobClient:     Launched reduce tasks=1
12/08/08 19:24:04 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=5959
12/08/08 19:24:04 INFO mapred.JobClient:     Total time spent by all reduces...
12/08/08 19:24:04 INFO mapred.JobClient:     Total time spent by all maps waiting...
12/08/08 19:24:04 INFO mapred.JobClient:     Launched map tasks=1
12/08/08 19:24:04 INFO mapred.JobClient:     Data-local map tasks=1
12/08/08 19:24:04 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=9433
12/08/08 19:24:04 INFO mapred.JobClient:   FileSystemCounters
12/08/08 19:24:04 INFO mapred.JobClient:     FILE_BYTES_READ=192298
12/08/08 19:24:04 INFO mapred.JobClient:     HDFS_BYTES_READ=597700
12/08/08 19:24:04 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=498740
12/08/08 19:24:04 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=138218
12/08/08 19:24:04 INFO mapred.JobClient:   Map-Reduce Framework
12/08/08 19:24:04 INFO mapred.JobClient:     Map input records=11733
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce shuffle bytes=192298
12/08/08 19:24:04 INFO mapred.JobClient:     Spilled Records=27676
12/08/08 19:24:04 INFO mapred.JobClient:     Map output bytes=1033012
12/08/08 19:24:04 INFO mapred.JobClient:     CPU time spent (ms)=2430
12/08/08 19:24:04 INFO mapred.JobClient:     Total committed heap usage (bytes)=183701504
12/08/08 19:24:04 INFO mapred.JobClient:     Combine input records=113365
12/08/08 19:24:04 INFO mapred.JobClient:     SPLIT_RAW_BYTES=113
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce input records=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce input groups=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Combine output records=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Physical memory (bytes) snapshot=256479232
12/08/08 19:24:04 INFO mapred.JobClient:     Reduce output records=13838
12/08/08 19:24:04 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1027047424
12/08/08 19:24:04 INFO mapred.JobClient:     Map output records=113365

What do all the messages mean? Hadoop has done a lot of work and is trying to tell you about it, including the following.

Checked to see if the input file exists.
Checked to see if the output directory exists and if it does, abort the job. Nothing worse than overwriting hours of computation by a simple keyboard error.
Distributed the Java jar file to all the nodes responsible for doing the work. In this case, this is only one node.
Ran the mapper phase of the job. Typically this parses the input file and emits a key value pair. Note the key and value can be objects.
Ran the sort phase, which sorts the mapper output based on the key.
Ran the reduce phase, typically this summarizes the key-value stream and writes output to HDFS.
Created many metrics on the progress.

Figure 6 shows a sample web page of Hadoop job metrics after running the Hive exercise.

Figure 6. Sample web page of Hadoop

Open Source Big Data for the Impatient, Part 1_第6张图片

What did the job do, and where is the output? Both are good questions, and are shown in Listing 6.

Listing 6. Map-Reduce output

                        
# way too much typing, create aliases for hadoop commands
$ alias hput="hadoop fs -put"
$ alias hcat="hadoop fs -cat"
$ alias hls="hadoop fs -ls"
$ alias hrmr="hadoop fs -rmr"
                
# first list the output directory
$ hls /user/cloudera/HF.out
Found 3 items
-rw-r--r-- 1 cloudera supergroup 0 2012-08-08 19:38 /user/cloudera/HF.out/_SUCCESS
drwxr-xr-x - cloudera supergroup 0 2012-08-08 19:38 /user/cloudera/HF.out/_logs
-rw-r--r-- 1 cl... sup... 138218 2012-08-08 19:38 /user/cloudera/HF.out/part-r-00000
                
# now cat the file and pipe it to  the  less command
$ hcat /user/cloudera/HF.out/part-r-00000 | less
                
# here are a few lines from the file, the word elephants only got used twice
elder,  1
eldest  1
elect   1
elected 1
electronic      27
electronically  1
electronically, 1
elegant 1
elegant!--'deed 1
elegant,        1
elephants       2

In the event you run the same job twice and forget to delete the output directory, you'll receive the error messages shown in Listing 7. Fixing this error is as simple as deleting the directory.

Listing 7. MapReduce messages - failure due to output already existing in HDFS

                 
# way too much typing, create aliases for hadoop commands
$ alias hput="hadoop fs -put"
$ alias hcat="hadoop fs -cat"
$ alias hls="hadoop fs -ls"
$ alias hrmr="hadoop fs -rmr"               
                
$ hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount HF.txt HF.out
12/08/08 19:26:23 INFO mapred.JobClient: 
Cleaning up the staging area hdfs://0.0.0.0/var/l...
12/08/08 19:26:23 ERROR security.UserGroupInformation: PriviledgedActionException 
as:cloudera (auth:SIMPLE) 
cause:org.apache.hadoop.mapred.FileAlreadyExistsException: 
Output directory HF.out already exists
org.apache.hadoop.mapred.FileAlreadyExistsException: 
Output directory HF.out already exists
at org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.
checkOutputSpecs(FileOutputFormat.java:132)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:872)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
                
.... lines deleted
                
# the simple fix is to remove the existing output directory
                
$ hrmr HF.out
                
# now you can re-run the job successfully
                
# if you run short of space and the namenode enters safemode
# clean up some file space and then
                
$ hadoop dfsadmin -safemode leave

Hadoop includes a browser interface to inspect the status of HDFS. Figure 7 shows the output of the word count job.

Figure 7. Exploring HDFS with a browser

A more sophisticated console is available for free from the Cloudera web site. It provides a number of capabilities beyond the standard Hadoop web interfaces. Notice that the health status of HDFS in Figure 8 is shown as Bad.

Figure 8. Hadoop services managed by Cloudera Manager

Open Source Big Data for the Impatient, Part 1_第7张图片

Why is it Bad? Because in a single virtual machine, HDFS cannot make three copies of the data blocks. When blocks are under-replicated, then there is a risk of data loss, so the health of the system is bad. Good thing you aren't trying to run production Hadoop jobs on a single node.

You are not limited to Java for your MapReduce jobs. This last example of MapReduce uses Hadoop Streaming to support a mapper written in Python and a reducer using AWK. No, you don't have to be a Java-guru to write Map-Reduce!

Mark Twain was not a big fan of Cooper. In this use case, Hadoop will provide some simple literary criticism comparing Twain and Cooper. The Flesch–Kincaid test calculates the reading level of a particular text. One of the factors in this analysis is the average sentence length. Parsing sentences turns out to be more complicated than just looking for the period character. The openNLP package and the Python NLTK package have excellent sentence parsers. For simplicity, the example shown in Listing 8 will use word length as a surrogate for number of syllables in a word. If you want to take this to the next level, implement the Flesch–Kincaid test in MapReduce, crawl the web, and calculate reading levels for your favorite news sites.

Listing 8. A Python-based mapper literary criticism

                
# here is the mapper we'll connect to the streaming hadoop interface
                
# the mapper is reading the text in the file - not really appreciating Twain's humor
# 
                
# modified from 
# http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
$ cat mapper.py 
#!/usr/bin/env python
import sys
                
# read stdin
for linein in sys.stdin:
# strip blanks
linein = linein.strip()
# split into words
mywords = linein.split()
# loop on mywords, output the length of each word
for word in mywords:
# the reducer just cares about the first column, 
# normally there is a key - value pair
print '%s %s' % (len(word), 0)

The mapper output, for the word "Twain", would be: 5 0. The numerical word lengths are sorted in order and presented to the reducer in sorted order. In the examples shown in Listings 9 and 10, sorting the data isn't required to get the correct output, but the sort is built into the MapReduce infrastructure and will happen anyway.

Listing 9. An AWK reducer for literary criticism

                
# the awk code is modified from http://www.commandlinefu.com
                
# awk is calculating
#  NR - the number of words in total
#  sum/NR - the average word length
# sqrt(mean2/NR) - the standard deviation 
                
$ cat statsreducer.awk 
awk '{delta = $1 - avg; avg += delta / NR; \
mean2 += delta * ($1 - avg); sum=$1+sum } \
END { print NR, sum/NR, sqrt(mean2 / NR); }'

Listing 10. Running a Python mapper and AWK reducer with Hadoop Streaming

                 
# test locally
                
# because we're using Hadoop Streaming, we can test the 
# mapper and reducer with simple pipes
                
# the "sort" phase is a reminder the keys are sorted
# before presentation to the reducer
#in this example it doesn't matter what order the 
# word length values are presented for calculating the std deviation
                
$ zcat ../DS.txt.gz  | ./mapper.py | sort | ./statsreducer.awk 
215107 4.56068 2.50734
                
# now run in hadoop with streaming
                
# CDH4
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input HF.txt -output HFstats -file ./mapper.py -file \
./statsreducer.awk -mapper ./mapper.py -reducer ./statsreducer.awk
		
# CDH3
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input HF.txt -output HFstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk
                
$ hls HFstats
Found 3 items
-rw-r--r--   1 cloudera supergroup   0 2012-08-12 15:38 /user/cloudera/HFstats/_SUCCESS
drwxr-xr-x   - cloudera supergroup   0 2012-08-12 15:37 /user/cloudera/HFstats/_logs
-rw-r--r--   1 cloudera ...  24 2012-08-12 15:37 /user/cloudera/HFstats/part-00000
                
$ hcat /user/cloudera/HFstats/part-00000
113365 4.11227 2.17086
                
# now for cooper
                
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar \
-input DS.txt.gz -output DSstats -file ./mapper.py -file ./statsreducer.awk \
-mapper ./mapper.py -reducer ./statsreducer.awk
                
$ hcat /user/cloudera/DSstats/part-00000
215107 4.56068 2.50734

The Mark Twain fans can happily relax knowing that Hadoop finds Cooper to use longer words, and with a shocking standard deviation (humor intended). That does of course assume that shorter words are better. Let's move on, next up is writing data in HDFS to Informix and DB2.

Using Sqoop to write data from HDFS into Informix, DB2, or MySQL via JDBC

The Sqoop Apache Project is an open-source JDBC-based Hadoop to database data movement utility. Sqoop was originally created in a hackathon at Cloudera and then open sourced.

Moving data from HDFS to a relational database is a common use case. HDFS and Map-Reduce are great at doing the heavy lifting. For simple queries or a back-end store for a web site, caching the Map-Reduce output in a relational store is a good design pattern. You can avoid re-running the Map-Reduce word count by just Sqooping the results into Informix and DB2. You've generated data about Twain and Cooper, now let's move it into a database, as shown in Listing 11.

Listing 11. JDBC driver setup

                 
#Sqoop needs access to the JDBC driver for every
# database that it will access
                
# please copy the driver for each database you plan to use for these exercises
# the MySQL database and driver are already installed in the virtual image
# but you still need to copy the driver to the sqoop/lib directory
                
#one time copy of jdbc driver to sqoop lib directory
$ sudo cp Informix_JDBC_Driver/lib/ifxjdbc*.jar /usr/lib/sqoop/lib/
$ sudo cp db2jdbc/db2jcc*.jar /usr/lib/sqoop/lib/
$ sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/

The examples shown in Listings 12 through 15 are presented for each database. Please skip to the example of interest to you, including Informix, DB2, or MySQL. For the database polyglots, have fun doing every example. If your database of choice is not included here, it won't be a grand challenge to make these samples work elsewhere.

Listing 12. Informix users: Sqoop writing the results of the word count to Informix

                 
# create a target table to put the data
# fire up dbaccess and use this sql 
# create table wordcount ( word char(36) primary key, n int);
                
# now run the sqoop command
# this is best put in a shell script to help avoid typos...
                
$ sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=i7;user=me;password=mypw" \
--table wordcount --export-dir /user/cloudera/HF.out

Listing 13. Informix Users: Sqoop writing the results of the word count to Informix

                 
12/08/08 21:39:42 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/08 21:39:42 INFO tool.CodeGenTool: Beginning code generation
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.* 
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO manager.SqlManager: Executing SQL statement: SELECT t.* 
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:43 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/08 21:39:43 INFO orm.CompilationManager: Found hadoop core jar at: 
/usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/08 21:39:45 INFO orm.CompilationManager: Writing jar file: 
/tmp/sqoop-cloudera/compile/248b77c05740f863a15e0136accf32cf/wordcount.jar
12/08/08 21:39:45 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/08 21:39:45 INFO manager.SqlManager: Executing SQL statement: SELECT t.* 
FROM wordcount AS t WHERE 1=0
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO input.FileInputFormat: Total input paths to process : 1
12/08/08 21:39:46 INFO mapred.JobClient: Running job: job_201208081900_0012
12/08/08 21:39:47 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:39:58 INFO mapred.JobClient:  map 38% reduce 0%
12/08/08 21:40:00 INFO mapred.JobClient:  map 64% reduce 0%
12/08/08 21:40:04 INFO mapred.JobClient:  map 82% reduce 0%
12/08/08 21:40:07 INFO mapred.JobClient:  map 98% reduce 0%
12/08/08 21:40:09 INFO mapred.JobClient: Task Id : 
attempt_201208081900_0012_m_000000_0, Status : FAILED
java.io.IOException: java.sql.SQLException: 
    Encoding or code set not supported.
at ...SqlRecordWriter.close(AsyncSqlRecordWriter.java:187)
at ...$NewDirectOutputCollector.close(MapTask.java:540)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at ....doAs(UserGroupInformation.java:1177)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Caused by: java.sql.SQLException: Encoding or code set not supported.
at com.informix.util.IfxErrMsg.getSQLException(IfxErrMsg.java:413)
at com.informix.jdbc.IfxChar.toIfx(IfxChar.java:135)
at com.informix.jdbc.IfxSqli.a(IfxSqli.java:1304)
at com.informix.jdbc.IfxSqli.d(IfxSqli.java:1605)
at com.informix.jdbc.IfxS
12/08/08 21:40:11 INFO mapred.JobClient:  map 0% reduce 0%
12/08/08 21:40:15 INFO mapred.JobClient: Task Id : 
attempt_201208081900_0012_m_000000_1, Status : FAILED
java.io.IOException: java.sql.SQLException: 
    Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:20 INFO mapred.JobClient: 
Task Id : attempt_201208081900_0012_m_000000_2, Status : FAILED
java.sql.SQLException: Unique constraint (informix.u169_821) violated.
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:223)
at .mapreduce.AsyncSqlRecordWriter.write(AsyncSqlRecordWriter.java:49)
at .mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
at .mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:82)
at com.cloudera.sqoop.mapreduce.TextExportMapper.map(TextExportMapper.java:40)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at .mapreduce.AutoProgressMapper.run(AutoProgressMapper.java:189)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.a
12/08/08 21:40:27 INFO mapred.JobClient: Job complete: job_201208081900_0012
12/08/08 21:40:27 INFO mapred.JobClient: Counters: 7
12/08/08 21:40:27 INFO mapred.JobClient:   Job Counters 
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=38479
12/08/08 21:40:27 INFO mapred.JobClient:     
Total time spent by all reduces waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:     
Total time spent by all maps waiting after reserving slots (ms)=0
12/08/08 21:40:27 INFO mapred.JobClient:     Launched map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     Data-local map tasks=4
12/08/08 21:40:27 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/08 21:40:27 INFO mapred.JobClient:     Failed map tasks=1
12/08/08 21:40:27 INFO mapreduce.ExportJobBase: 
Transferred 0 bytes in 41.5758 seconds (0 bytes/sec)
12/08/08 21:40:27 INFO mapreduce.ExportJobBase: Exported 0 records.
12/08/08 21:40:27 ERROR tool.ExportTool: Error during export: Export job failed!
                
# despite the errors above, rows are inserted into the wordcount table
# one row is missing
# the retry and duplicate key exception are most likely related, but
# troubleshooting will be saved for a later article
                
# check how we did
# nothing like a "here document" shell script
                
$ dbaccess stores_demo - <<eoj
> select count(*) from wordcount;
> eoj
                
Database selected.
(count(*)) 
13837
1 row(s) retrieved.
Database closed.

Listing 14. DB2 users: Sqoop writing the results of the word count to DB2

                 
# here is the db2 syntax
# create a destination table for db2
#
#db2 => connect to sample
#
#   Database Connection Information
#
# Database server        = DB2/LINUXX8664 10.1.0
# SQL authorization ID   = DB2INST1
# Local database alias   = SAMPLE
#
#db2 => create table wordcount ( word char(36) not null primary key , n int)
#DB20000I  The SQL command completed successfully.
#
                
sqoop export -D sqoop.export.records.per.statement=1 \
--fields-terminated-by '\t' \
--driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample"  \
--username db2inst1 --password db2inst1 \
--table wordcount --export-dir /user/cloudera/HF.out 
                
12/08/09 12:32:59 WARN tool.BaseSqoopTool: Setting your password on the 
command-line is insecure. Consider using -P instead.
12/08/09 12:32:59 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 12:32:59 INFO tool.CodeGenTool: Beginning code generation
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:32:59 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 12:32:59 INFO orm.CompilationManager: Found hadoop core jar 
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 12:33:00 INFO orm.CompilationManager: Writing jar 
file: /tmp/sqoop-cloudera/compile/5532984df6e28e5a45884a21bab245ba/wordcount.jar
12/08/09 12:33:00 INFO mapreduce.ExportJobBase: Beginning export of wordcount
12/08/09 12:33:01 INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM wordcount AS t WHERE 1=0
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO input.FileInputFormat: Total input paths to process : 1
12/08/09 12:33:02 INFO mapred.JobClient: Running job: job_201208091208_0002
12/08/09 12:33:03 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 12:33:14 INFO mapred.JobClient:  map 24% reduce 0%
12/08/09 12:33:17 INFO mapred.JobClient:  map 44% reduce 0%
12/08/09 12:33:20 INFO mapred.JobClient:  map 67% reduce 0%
12/08/09 12:33:23 INFO mapred.JobClient:  map 86% reduce 0%
12/08/09 12:33:24 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 12:33:25 INFO mapred.JobClient: Job complete: job_201208091208_0002
12/08/09 12:33:25 INFO mapred.JobClient: Counters: 16
12/08/09 12:33:25 INFO mapred.JobClient:   Job Counters 
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=21648
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all 
reduces waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Total time spent by all 
maps waiting after reserving slots (ms)=0
12/08/09 12:33:25 INFO mapred.JobClient:     Launched map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     Data-local map tasks=1
12/08/09 12:33:25 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 12:33:25 INFO mapred.JobClient:   FileSystemCounters
12/08/09 12:33:25 INFO mapred.JobClient:     HDFS_BYTES_READ=138350
12/08/09 12:33:25 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=69425
12/08/09 12:33:25 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 12:33:25 INFO mapred.JobClient:     Map input records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     Physical memory (bytes) snapshot=105148416
12/08/09 12:33:25 INFO mapred.JobClient:     Spilled Records=0
12/08/09 12:33:25 INFO mapred.JobClient:     CPU time spent (ms)=9250
12/08/09 12:33:25 INFO mapred.JobClient:     Total committed heap usage (bytes)=42008576
12/08/09 12:33:25 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=596447232
12/08/09 12:33:25 INFO mapred.JobClient:     Map output records=13838
12/08/09 12:33:25 INFO mapred.JobClient:     SPLIT_RAW_BYTES=126
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Transferred 135.1074 KB 
in 24.4977 seconds (5.5151 KB/sec)
12/08/09 12:33:25 INFO mapreduce.ExportJobBase: Exported 13838 records.                
                
# check on the results...
#
#db2 => select count(*) from wordcount 
#
#1          
#-----------
#      13838
#
#  1 record(s) selected.
#
#

Listing 15. MySQL users: Sqoop writing the results of the word count to MySQL

                 
# if you don't have Informix or DB2 you can still do this example
# mysql - it is already installed in the VM, here is how to access
                
# one time copy of the JDBC driver
                
sudo cp /usr/lib/hive/lib/mysql-connector-java-5.1.15-bin.jar /usr/lib/sqoop/lib/
                
# now create the database and table
                
$ mysql -u root
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 45
Server version: 5.0.95 Source distribution
                
Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.
                
Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.
                
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
                
mysql> create database mydemo;
Query OK, 1 row affected (0.00 sec)
                
mysql> use mydemo
Database changed
mysql> create table wordcount ( word char(36) not null primary key, n int);
Query OK, 0 rows affected (0.00 sec)
                
mysql> exit
Bye
                
# now export
                
$ sqoop export --connect jdbc:mysql://localhost/mydemo \
--table wordcount --export-dir /user/cloudera/HF.out \
--fields-terminated-by '\t' --username root

Importing data into HDFS from Informix and DB2 with Sqoop

Inserting data into Hadoop HDFS can also be accomplished with Sqoop. The bi-directional functionality is controlled via the import parameter.

The sample databases that come with both products have some simple datasets that you can use for this purpose. Listing 16 shows the syntax and results for Sqooping each server.

For MySQL users, please adapt the syntax from either the Informix or DB2 examples that follow.

Listing 16. Sqoop import from Informix sample database to HDFS

                 
$ sqoop import --driver com.informix.jdbc.IfxDriver \
--connect \
"jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117" \
--table orders \
--username informix --password useyours
                
12/08/09 14:39:18 WARN tool.BaseSqoopTool: Setting your password on the command-line 
is insecure. Consider using -P instead.
12/08/09 14:39:18 INFO manager.SqlManager: Using default fetchSize of 1000
12/08/09 14:39:18 INFO tool.CodeGenTool: Beginning code generation
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:19 INFO orm.CompilationManager: HADOOP_HOME is /usr/lib/hadoop
12/08/09 14:39:19 INFO orm.CompilationManager: Found hadoop core jar 
at: /usr/lib/hadoop/hadoop-0.20.2-cdh3u4-core.jar
12/08/09 14:39:21 INFO orm.CompilationManager: Writing jar 
file: /tmp/sqoop-cloudera/compile/0b59eec7007d3cff1fc0ae446ced3637/orders.jar
12/08/09 14:39:21 INFO mapreduce.ImportJobBase: Beginning import of orders
12/08/09 14:39:21 INFO manager.SqlManager: Executing SQL statement: 
SELECT t.* FROM orders AS t WHERE 1=0
12/08/09 14:39:22 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: 
SELECT MIN(order_num), MAX(order_num) FROM orders
12/08/09 14:39:22 INFO mapred.JobClient: Running job: job_201208091208_0003
12/08/09 14:39:23 INFO mapred.JobClient:  map 0% reduce 0%
12/08/09 14:39:31 INFO mapred.JobClient:  map 25% reduce 0%
12/08/09 14:39:32 INFO mapred.JobClient:  map 50% reduce 0%
12/08/09 14:39:36 INFO mapred.JobClient:  map 100% reduce 0%
12/08/09 14:39:37 INFO mapred.JobClient: Job complete: job_201208091208_0003
12/08/09 14:39:37 INFO mapred.JobClient: Counters: 16
12/08/09 14:39:37 INFO mapred.JobClient:   Job Counters 
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=22529
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all reduces 
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Total time spent by all maps 
waiting after reserving slots (ms)=0
12/08/09 14:39:37 INFO mapred.JobClient:     Launched map tasks=4
12/08/09 14:39:37 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/08/09 14:39:37 INFO mapred.JobClient:   FileSystemCounters
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_READ=457
12/08/09 14:39:37 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=278928
12/08/09 14:39:37 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=2368
12/08/09 14:39:37 INFO mapred.JobClient:   Map-Reduce Framework
12/08/09 14:39:37 INFO mapred.JobClient:     Map input records=23
12/08/09 14:39:37 INFO mapred.JobClient:     Physical memory (bytes) snapshot=291364864
12/08/09 14:39:37 INFO mapred.JobClient:     Spilled Records=0
12/08/09 14:39:37 INFO mapred.JobClient:     CPU time spent (ms)=1610
12/08/09 14:39:37 INFO mapred.JobClient:     Total committed heap usage (bytes)=168034304
12/08/09 14:39:37 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2074587136
12/08/09 14:39:37 INFO mapred.JobClient:     Map output records=23
12/08/09 14:39:37 INFO mapred.JobClient:     SPLIT_RAW_BYTES=457
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Transferred 2.3125 KB in 16.7045 
seconds (141.7585 bytes/sec)
12/08/09 14:39:37 INFO mapreduce.ImportJobBase: Retrieved 23 records.
                
# now look at the results
                
$ hls
Found 4 items
-rw-r--r--   1 cloudera supergroup     459386 2012-08-08 19:34 /user/cloudera/DS.txt.gz
drwxr-xr-x   - cloudera supergroup          0 2012-08-08 19:38 /user/cloudera/HF.out
-rw-r--r--   1 cloudera supergroup     597587 2012-08-08 19:35 /user/cloudera/HF.txt
drwxr-xr-x   - cloudera supergroup          0 2012-08-09 14:39 /user/cloudera/orders
$ hls orders
Found 6 items
-rw-r--r--   1 cloudera supergroup 0 2012-08-09 14:39 /user/cloudera/orders/_SUCCESS
drwxr-xr-x   - cloudera supergroup 0 2012-08-09 14:39 /user/cloudera/orders/_logs
-rw-r--r--   1 cloudera ...roup 630 2012-08-09 14:39 /user/cloudera/orders/part-m-00000
-rw-r--r--   1 cloudera supergroup        
564 2012-08-09 14:39 /user/cloudera/orders/part-m-00001
-rw-r--r--   1 cloudera supergroup        
527 2012-08-09 14:39 /user/cloudera/orders/part-m-00002
-rw-r--r--   1 cloudera supergroup        
647 2012-08-09 14:39 /user/cloudera/orders/part-m-00003
                
# wow  there are four files part-m-0000x
# look inside one 
                
# some of the lines are edited to fit on the screen
$ hcat /user/cloudera/orders/part-m-00002
1013,2008-06-22,104,express ,n,B77930    ,2008-07-10,60.80,12.20,2008-07-31
1014,2008-06-25,106,ring bell,  ,n,8052      ,2008-07-03,40.60,12.30,2008-07-10
1015,2008-06-27,110,        ,n,MA003     ,2008-07-16,20.60,6.30,2008-08-31
1016,2008-06-29,119, St.          ,n,PC6782    ,2008-07-12,35.00,11.80,null
1017,2008-07-09,120,use                 ,n,DM354331  ,2008-07-13,60.00,18.00,null

Why are there four different files each containing only part of the data? Sqoop is a highly parallelized utility. If a 4000 node cluster running Sqoop did a full throttle import from a database, the 4000 connections would look very much like a denial of service attack against the database. Sqoop's default connection limit is four JDBC connections. Each connection generates a data file in HDFS. Thus the four files. Not to worry, you'll see how Hadoop works across these files without any difficulty.

The next step is to import a DB2 table. As shown in Listing 17, by specifying the -m 1 option, a table without a primary key can be imported, and the result is a single file.

Listing 17. Sqoop import from DB2 sample database to HDFS

                 
# very much the same as above, just a different jdbc connection
# and different table name
                
sqoop import --driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.131:50001/sample"  \
--table staff --username db2inst1  \
--password db2inst1 -m 1 

# Here is another example
# in this case set the sqoop default schema to be different from
# the user login schema
		
sqoop import --driver com.ibm.db2.jcc.DB2Driver \
--connect "jdbc:db2://192.168.1.3:50001/sample:currentSchema=DB2INST1;" \
--table helloworld \
--target-dir "/user/cloudera/sqoopin2" \
--username marty \
-P -m 1 
		
# the the schema name is CASE SENSITIVE 
# the -P option prompts for a password that will not be visible in
# a "ps" listing

Using Hive: Joining Informix and DB2 data

There is an interesting use case to join data from Informix to DB2. Not very exciting for two trivial tables, but a huge win for multiple terabytes or petabytes of data.

There are two fundamental approaches for joining different data sources. Leaving the data at rest and using federation technology versus moving the data to a single store to perform the join. The economics and performance of Hadoop make moving the data into HDFS and performing the heavy lifting with MapReduce an easy choice. Network bandwidth limitations create a fundamental barrier if trying to join data at rest with a federation style technology. For more information on federation, please see the Resources section.

Hive provides a subset of SQL for operating on a cluster. It does not provide transaction semantics. It is not a replacement for Informix or DB2. If you have some heavy lifting in the form of table joins, even if you have some smaller tables but need to do nasty Cartesian products, Hadoop is the tool of choice.

To use Hive query language, a subset of SQL called Hiveql table metadata is required. You can define the metadata against existing files in HDFS. Sqoop provides a convenient shortcut with the create-hive-table option.

The MySQL users should feel free to adapt the examples shown in Listing 18. An interesting exercise would be joining MySQL, or any other relational database tables, to large spreadsheets.

Listing 18. Joining the informix.customer table to the db2.staff table

                 
# import the customer table into Hive
$ sqoop import --driver com.informix.jdbc.IfxDriver  \
--connect \
"jdbc:informix-sqli://myhost:54321/stores_demo:informixserver=ifx;user=me;password=you"  \
--table customer
                
# now tell hive where to find the informix data
                
# to get to the hive command prompt just type in hive
                
$ hive
Hive history file=/tmp/cloudera/yada_yada_log123.txt
hive> 
                
# here is the hiveql you need to create the tables
# using a file is easier than typing 
                
create external table customer (
cn int,
fname string,
lname string,
company string,
addr1 string,
addr2 string,
city string,
state string,
zip string,
phone string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/customer'
;
                               
# we already imported the db2 staff table above
                
# now tell hive where to find the db2 data
create external table staff (
id int,
name string,
dept string,
job string,
years string,
salary float,
comm float) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/cloudera/staff'
;
                
# you can put the commands in a file 
# and execute them as follows:
                
$ hive -f hivestaff
Hive history file=/tmp/cloudera/hive_job_log_cloudera_201208101502_2140728119.txt
OK
Time taken: 3.247 seconds
OK
10	Sanders	20	Mgr  	7	98357.5	NULL
20	Pernal	20	Sales	8	78171.25	612.45
30	Marenghi	38	Mgr  	5	77506.75	NULL
40	O'Brien	38	Sales	6	78006.0	846.55
50	Hanes	15	Mgr  	10	80
... lines deleted
                
# now for the join we've all been waiting for :-)
                
# this is a simple case, Hadoop can scale well into the petabyte range!                 
                
$ hive
Hive history file=/tmp/cloudera/hive_job_log_cloudera_201208101548_497937669.txt
hive> select customer.cn, staff.name, 
> customer.addr1, customer.city, customer.phone
> from staff join customer
> on ( staff.id = customer.cn );
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=number
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=number
In order to set a constant number of reducers:
set mapred.reduce.tasks=number
Starting Job = job_201208101425_0005, 
Tracking URL = http://0.0.0.0:50030/jobdetails.jsp?jobid=job_201208101425_0005
Kill Command = /usr/lib/hadoop/bin/hadoop 
job  -Dmapred.job.tracker=0.0.0.0:8021 -kill job_201208101425_0005
2012-08-10 15:49:07,538 Stage-1 map = 0%,  reduce = 0%
2012-08-10 15:49:11,569 Stage-1 map = 50%,  reduce = 0%
2012-08-10 15:49:12,574 Stage-1 map = 100%,  reduce = 0%
2012-08-10 15:49:19,686 Stage-1 map = 100%,  reduce = 33%
2012-08-10 15:49:20,692 Stage-1 map = 100%,  reduce = 100%
Ended Job = job_201208101425_0005
OK
110	Ngan	520 Topaz Way       	Redwood City   	415-743-3611      
120	Naughton	6627 N. 17th Way    	Phoenix        	602-265-8754      
Time taken: 22.764 seconds

It is much prettier when you use Hue for a graphical browser interface, as shown in Figures 9, 10, and 11.

Figure 9. Hue Beeswax GUI for Hive in CDH4, view Hiveql query

Figure 10. Hue Beeswax GUI for Hive, view Hiveql query

Figure 11. Hue Beeswax graphical browser, view Informix-DB2 join result

Using Pig: Joining Informix and DB2 data

Pig is a procedural language. Just like Hive, under the covers it generates MapReduce code. Hadoop ease-of-use will continue to improve as more projects become available. Much as some of us really like the command line, there are several graphical user interfaces that work very well with Hadoop.

Listing 19 shows the Pig code that is used to join the customer table and the staff table from the prior example.

Listing 19. Pig example to join the Informix table to the DB2 table

                 
$ pig
grunt> staffdb2 = load 'staff' using PigStorage(',') 
>> as ( id, name, dept, job, years, salary, comm ); 
grunt> custifx2 = load 'customer' using PigStorage(',') as  
>>  (cn, fname, lname, company, addr1, addr2, city, state, zip, phone)
>> ;
grunt> joined = join custifx2 by cn,  staffdb2 by id;
                
# to make pig generate a result set use the dump command
# no work has happened up till now
                
grunt> dump joined;
2012-08-11 21:24:51,848 [main] INFO  org.apache.pig.tools.pigstats.ScriptState 
- Pig features used in the script: HASH_JOIN
2012-08-11 21:24:51,848 [main] INFO  org.apache.pig.backend.hadoop.executionengine
.HExecutionEngine - pig.usenewlogicalplan is set to true. 
New logical plan will be used.
                
HadoopVersion	PigVersion	UserId	StartedAt	FinishedAt	Features
0.20.2-cdh3u4	0.8.1-cdh3u4	cloudera	2012-08-11 21:24:51	
2012-08-11 21:25:19	HASH_JOIN
                
Success!
                
Job Stats (time in seconds):
JobId	Maps	Reduces	MaxMapTime	MinMapTIme	AvgMapTime	
MaxReduceTime	MinReduceTime	AvgReduceTime	Alias	Feature	Outputs
job_201208111415_0006	2	1	8	8	8	10	10	10
custifx,joined,staffdb2	HASH_JOIN	hdfs://0.0.0.0/tmp/temp1785920264/tmp-388629360,
                
Input(s):
Successfully read 35 records from: "hdfs://0.0.0.0/user/cloudera/staff"
Successfully read 28 records from: "hdfs://0.0.0.0/user/cloudera/customer"
                
Output(s):
Successfully stored 2 records (377 bytes) in: 
"hdfs://0.0.0.0/tmp/temp1785920264/tmp-388629360"
                
Counters:
Total records written : 2
Total bytes written : 377
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
                
Job DAG:
job_201208111415_0006                
                
2012-08-11 21:25:19,145 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2012-08-11 21:25:19,149 [main] INFO  org.apache.hadoop.mapreduce.lib.
input.FileInputFormat - Total input paths to process : 1
2012-08-11 21:25:19,149 [main] INFO  org.apache.pig.backend.hadoop.
executionengine.util.MapRedUtil - Total input paths to process : 1
(110,Roy            ,Jaeger         ,AA Athletics        ,520 Topaz Way       
,null,Redwood City   ,CA,94062,415-743-3611      ,110,Ngan,15,Clerk,5,42508.20,206.60)
(120,Fred           ,Jewell         ,Century Pro Shop    ,6627 N. 17th Way    
,null,Phoenix        ,AZ,85016,602-265-8754      
,120,Naughton,38,Clerk,null,42954.75,180.00)

How do I choose Java, Hive, or Pig?

You have multiple options for programming Hadoop, and it is best to look at the use case to choose the right tool for the job. You are not limited to working on relational data but this article is focused on Informix, DB2, and Hadoop playing well together. Writing hundreds of lines in Java to implement a relational style hash-join is a complete waste of time since this Hadoop MapReduce algorithm is already available. How do you choose? This is a matter of personal preference. Some like coding set operations in SQL. Some prefer procedural code. You should choose the language that will make you the most productive. If you have multiple relational systems and want to combine all the data with great performance at a low price point, Hadoop, MapReduce, Hive, and Pig are ready to help.

Don't delete your data: Rolling a partition from Informix into HDFS

Most modern relational databases can partition data. A common use case is to partition by time period. A fixed window of data is stored, for example a rolling 18 month interval, after which the data is archived. The detach-partition capability is very powerful. But after the partition is detached what does one do with the data?

Tape archive of old data is a very expensive way to discard the old bytes. Once moved to a less accessible medium, the data is very rarely accessed unless there is a legal audit requirement. Hadoop provides a far better alternative.

Moving the archival bytes from the old partition into Hadoop provides high-performance access with much lower cost than keeping the data in the original transactional or datamart/datawarehouse system. The data is too old to be of transactional value, but is still very valuable to the organization for long term analysis. The Sqoop examples shown previously provide the basics for how to move this data from a relational partition to HDFS.

Fuse - Getting to your HDFS files via NFS

The Informix/DB2/flat file data in HDFS can be accessed via NFS, as shown in Listing 20. This provides command line operations without using the "hadoop fs -yadayada" interface. From a technology use-case perspective, NFS is severely limited in a Big Data environment, but the examples are included for developers and not-so-big data.

Listing 20. Setting up Fuse - access your HDFS data via NFS

                 
# this is for CDH4, the CDH3 image doesn't have fuse installed...
$ mkdir fusemnt
$ sudo hadoop-fuse-dfs dfs://localhost:8020 fusemnt/
INFO fuse_options.c:162 Adding FUSE arg fusemnt/
$ ls fusemnt
tmp  user  var
$ ls fusemnt/user
cloudera  hive
$ ls fusemnt/user/cloudera
customer  DS.txt.gz  HF.out  HF.txt  orders  staff
$ cat fusemnt/user/cloudera/orders/part-m-00001 
1007,2008-05-31,117,null,n,278693    ,2008-06-05,125.90,25.20,null
1008,2008-06-07,110,closed Monday    
,y,LZ230     ,2008-07-06,45.60,13.80,2008-07-21
1009,2008-06-14,111,next door to grocery                    
,n,4745      ,2008-06-21,20.40,10.00,2008-08-21
1010,2008-06-17,115,deliver 776 King St. if no answer       
,n,429Q      ,2008-06-29,40.60,12.30,2008-08-22
1011,2008-06-18,104,express                                 
,n,B77897    ,2008-07-03,10.40,5.00,2008-08-29
1012,2008-06-18,117,null,n,278701    ,2008-06-29,70.80,14.20,null

Flume - create a load ready file

Flume next generation, or flume-ng is a high-speed parallel loader. Databases have high-speed loaders, so how do these play well together? The relational use case for Flume-ng is creating a load ready file, locally or remotely, so a relational server can use its high speed loader. Yes, this functionality overlaps Sqoop, but the script shown in Listing 21 was created at the request of a client specifically for this style of database load.

Listing 21. Exporting HDFS data to a flat file for loading by a database

                 
$  sudo yum install flume-ng              
                
$ cat flumeconf/hdfs2dbloadfile.conf 
#
# started with example from flume-ng documentation
# modified to do hdfs source to file sink
#
                
# Define a memory channel called ch1 on agent1
 agent1.channels.ch1.type = memory                
                
# Define an exec source called exec-source1 on agent1 and tell it
# to bind to 0.0.0.0:31313. Connect it to channel ch1.
agent1.sources.exec-source1.channels = ch1
agent1.sources.exec-source1.type = exec
agent1.sources.exec-source1.command =hadoop fs -cat /user/cloudera/orders/part-m-00001
# this also works for all the files in the hdfs directory
# agent1.sources.exec-source1.command =hadoop fs
# -cat /user/cloudera/tsortin/*
agent1.sources.exec-source1.bind = 0.0.0.0
agent1.sources.exec-source1.port = 31313               
                
# Define a logger sink that simply file rolls
# and connect it to the other end of the same channel.
agent1.sinks.fileroll-sink1.channel = ch1
agent1.sinks.fileroll-sink1.type = FILE_ROLL
agent1.sinks.fileroll-sink1.sink.directory =/tmp                
                
# Finally, now that we've defined all of our components, tell
# agent1 which ones we want to activate.
agent1.channels = ch1
agent1.sources = exec-source1
agent1.sinks = fileroll-sink1                
                
# now time to run the script
                
$ flume-ng agent --conf ./flumeconf/ -f ./flumeconf/hdfs2dbloadfile.conf -n 
agent1                
                
# here is the output file
# don't forget to stop flume - it will keep polling by default and generate
# more files
                
$ cat /tmp/1344780561160-1 
1007,2008-05-31,117,null,n,278693    ,2008-06-05,125.90,25.20,null
1008,2008-06-07,110,closed Monday ,y,LZ230     ,2008-07-06,45.60,13.80,2008-07-21
1009,2008-06-14,111,next door to  ,n,4745      ,2008-06-21,20.40,10.00,2008-08-21
1010,2008-06-17,115,deliver 776 King St. if no answer       ,n,429Q      
,2008-06-29,40.60,12.30,2008-08-22
1011,2008-06-18,104,express     ,n,B77897    ,2008-07-03,10.40,5.00,2008-08-29
1012,2008-06-18,117,null,n,278701    ,2008-06-29,70.80,14.20,null
                
# jump over to dbaccess and use the greatest
# data loader in informix: the external table
# external tables were actually developed for 
# informix XPS back in the 1996 timeframe
# and are now available in may servers
                
# 
drop table eorders;
create external table eorders
(on char(10),
mydate char(18),
foo char(18),
bar char(18),
f4 char(18),
f5 char(18),
f6 char(18),
f7 char(18),
f8 char(18),
f9 char(18)
)
using (datafiles ("disk:/tmp/myfoo" ) , delimiter ",");
select * from eorders;

Oozie - adding work flow for multiple jobs

Oozie will chain together multiple Hadoop jobs. There is a nice set of examples included with oozie that are used in the code set shown in Listing 22.

Listing 22. Job control with oozie

                 
# This sample is for CDH3
		
# untar the examples
		
# CDH4
$ tar -zxvf /usr/share/doc/oozie-3.1.3+154/oozie-examples.tar.gz
                
# CDH3
$ tar -zxvf /usr/share/doc/oozie-2.3.2+27.19/oozie-examples.tar.gz
                
# cd to the directory where the examples live 
# you MUST put these jobs into the hdfs store to run them
                
$  hadoop fs -put examples examples
                
# start up the oozie server - you need to be the oozie user
# since the oozie user is a non-login id use the following su trick
                
# CDH4
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-sys.sh start

# CDH3
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-start.sh 
                
# checkthe status
oozie admin -oozie http://localhost:11000/oozie -status
System mode: NORMAL
                
# some jar housekeeping so oozie can find what it needs
                
$ cp /usr/lib/sqoop/sqoop-1.3.0-cdh3u4.jar examples/apps/sqoop/lib/
$ cp /home/cloudera/Informix_JDBC_Driver/lib/ifxjdbc.jar examples/apps/sqoop/lib/
$ cp /home/cloudera/Informix_JDBC_Driver/lib/ifxjdbcx.jar examples/apps/sqoop/lib/
                
# edit the workflow.xml  file to use your relational database:
                
#################################
<command> import 
--driver com.informix.jdbc.IfxDriver 
--connect jdbc:informix-sqli://192.168.1.143:54321/stores_demo:informixserver=ifx117 
--table orders --username informix --password useyours 
--target-dir /user/${wf:user()}/${examplesRoot}/output-data/sqoop --verbose<command>
#################################
                
# from the directory where you un-tarred the examples file do the following:
                
$ hrmr examples;hput examples examples
                
# now you can run your sqoop job by submitting it to oozie
                
$  oozie job -oozie http://localhost:11000/oozie -config  \
    examples/apps/sqoop/job.properties -run
                
job: 0000000-120812115858174-oozie-oozi-W
                
# get the job status from the oozie server
                
$ oozie job -oozie http://localhost:11000/oozie -info 0000000-120812115858174-oozie-oozi-W
Job ID : 0000000-120812115858174-oozie-oozi-W
-----------------------------------------------------------------------
Workflow Name : sqoop-wf
App Path      : hdfs://localhost:8020/user/cloudera/examples/apps/sqoop/workflow.xml
Status        : SUCCEEDED
Run           : 0
User          : cloudera
Group         : users
Created       : 2012-08-12 16:05
Started       : 2012-08-12 16:05
Last Modified : 2012-08-12 16:05
Ended         : 2012-08-12 16:05
                
Actions
----------------------------------------------------------------------
ID       Status    Ext ID                 Ext Status Err Code  
---------------------------------------------------------------------
0000000-120812115858174-oozie-oozi-W@sqoop-node                               OK
job_201208120930_0005  SUCCEEDED  -         
--------------------------------------------------------------------
                
# how to kill a job may come in useful at some point
                
oozie job -oozie http://localhost:11000/oozie -kill 
0000013-120812115858174-oozie-oozi-W                
                
# job output will be in the file tree 
$ hcat /user/cloudera/examples/output-data/sqoop/part-m-00003
1018,2008-07-10,121,SW corner of Biltmore Mall              ,n,S22942    
,2008-07-13,70.50,20.00,2008-08-06
1019,2008-07-11,122,closed till noon Mondays                 ,n,Z55709    
,2008-07-16,90.00,23.00,2008-08-06
1020,2008-07-11,123,express                                 ,n,W2286     
,2008-07-16,14.00,8.50,2008-09-20
1021,2008-07-23,124,ask for Elaine                          ,n,C3288     
,2008-07-25,40.00,12.00,2008-08-22
1022,2008-07-24,126,express                                 ,n,W9925     
,2008-07-30,15.00,13.00,2008-09-02
1023,2008-07-24,127,no deliveries after 3 p.m.              ,n,KF2961    
,2008-07-30,60.00,18.00,2008-08-22               
                
                
# if you run into this error there is a good chance that your
# database lock file is owned by root
$  oozie job -oozie http://localhost:11000/oozie -config \
examples/apps/sqoop/job.properties -run
                
Error: E0607 : E0607: Other error in operation [<openjpa-1.2.1-r752877:753278 
fatal store error> org.apache.openjpa.persistence.RollbackException: 
The transaction has been rolled back.  See the nested exceptions for 
details on the errors that occurred.], {1}
                
# fix this as follows
$ sudo chown oozie:oozie  /var/lib/oozie/oozie-db/db.lck 
                
# and restart the oozie server
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-stop.sh 
$ sudo su - oozie -s /usr/lib/oozie/bin/oozie-start.sh

HBase, a high-performance key-value store

HBase is a high-performance key-value store. If your use case requires scalability and only requires the database equivalent of auto-commit transactions, HBase may well be the technology to ride. HBase is not a database. The name is unfortunate since to some, the term base implies database. It does do an excellent job for high-performance key-value stores. There is some overlap between the functionality of HBase, Informix, DB2 and other relational databases. For ACID transactions, full SQL compliance, and multiple indexes a traditional relational database is the obvious choice.

This last code exercise is to give basic familiarity with HBASE. It is simple by design and in no way represents the scope of HBase's functionality. Please use this example to understand some of the basic capabilities in HBase. "HBase, The Definitive Guide", by Lars George, is mandatory reading if you plan to implement or reject HBase for your particular use case.

This last example, shown in Listings 23 and 24, uses the REST interface provided with HBase to insert key-values into an HBase table. The test harness is curl based.

Listing 23. Create an HBase table and insert a row

                 
# enter the command line shell for hbase
                
$ hbase shell
HBase Shell; enter 'help<RETURN> for list of supported commands.
Type "exit<RETURN> to leave the HBase Shell
Version 0.90.6-cdh3u4, r, Mon May  7 13:14:00 PDT 2012
                
#  create a table with a single column family
                
hbase(main):001:0> create 'mytable', 'mycolfamily'   
                
# if you get errors from hbase you need to fix the 
# network config
                
# here is a sample of the error:
                
ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase 
is able to connect to ZooKeeper but the connection closes immediately. 
This could be a sign that the server has too many connections 
(30 is the default). Consider inspecting your ZK server logs for 
that error and then make sure you are reusing HBaseConfiguration 
as often as you can. See HTable's javadoc for more information.
                
# fix networking:
                
# add the eth0 interface to /etc/hosts with a hostname
                
$ sudo su - 
# ifconfig | grep addr
eth0      Link encap:Ethernet  HWaddr 00:0C:29:8C:C7:70  
inet addr:192.168.1.134  Bcast:192.168.1.255  Mask:255.255.255.0
Interrupt:177 Base address:0x1400 
inet addr:127.0.0.1  Mask:255.0.0.0
[root@myhost ~]# hostname myhost
[root@myhost ~]# echo "192.168.1.134 myhost" >gt; /etc/hosts
[root@myhost ~]# cd /etc/init.d
                
# now that the host and address are defined restart Hadoop
                
[root@myhost init.d]# for i in hadoop*
> do
> ./$i restart
> done
                
# now try table create again:
                
$ hbase shell
HBase Shell; enter 'help<RETURN> for list of supported commands.
Type "exit<RETURN> to leave the HBase Shell
Version 0.90.6-cdh3u4, r, Mon May  7 13:14:00 PDT 2012
                
hbase(main):001:0> create 'mytable' , 'mycolfamily'
0 row(s) in 1.0920 seconds
                
hbase(main):002:0> 
                
# insert a row into the table you created
# use some simple telephone call log data
# Notice that mycolfamily can have multiple cells
# this is very troubling for DBAs at first, but
# you do get used to it
                
hbase(main):001:0>  put 'mytable',  'key123', 'mycolfamily:number','6175551212'
0 row(s) in 0.5180 seconds
hbase(main):002:0>  put 'mytable',  'key123', 'mycolfamily:duration','25'      
                
# now describe and then scan the table
                
hbase(main):005:0> describe 'mytable'
DESCRIPTION                                          ENABLED                    
{NAME => 'mytable', FAMILIES => [{NAME => 'mycolfam true                       
ily', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '                            
0', COMPRESSION => 'NONE', VERSIONS => '3', TTL =>                             
'2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'f                            
alse', BLOCKCACHE => 'true'}]}                                                 
1 row(s) in 0.2250 seconds                
                
#  notice that timestamps are included
                
hbase(main):007:0> scan 'mytable'
ROW                        COLUMN+CELL    
key123                    column=mycolfamily:duration, 
timestamp=1346868499125, value=25  
key123                    column=mycolfamily:number, 
timestamp=1346868540850, value=6175551212  
1 row(s) in 0.0250 seconds

Listing 24. Using the HBase REST interface

                
# HBase includes a REST server
                
$ hbase rest start -p 9393 &
                
# you get a bunch of messages.... 
                
# get the status of the HBase server
                
$ curl http://localhost:9393/status/cluster
                
# lots of output...
# many lines deleted...
                
mytable,,1346866763530.a00f443084f21c0eea4a075bbfdfc292.
stores=1
storefiless=0
storefileSizeMB=0
memstoreSizeMB=0
storefileIndexSizeMB=0
                
# now scan the contents of mytable
                
$ curl http://localhost:9393/mytable/*
                
# lines deleted
12/09/05 15:08:49 DEBUG client.HTable$ClientScanner: 
Finished with scanning at REGION => 
# lines deleted
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<CellSet><Row key="a2V5MTIz">
<Cell timestamp="1346868499125" column="bXljb2xmYW1pbHk6ZHVyYXRpb24=">MjU=</Cell>
<Cell timestamp="1346868540850" column="bXljb2xmYW1pbHk6bnVtYmVy">NjE3NTU1MTIxMg==</Cell>
<Cell timestamp="1346868425844" column="bXljb2xmYW1pbHk6bnVtYmVy">NjE3NTU1MTIxMg==</Cell>
</Row></CellSet>
                
# the values from the REST interface are base64 encoded
$ echo a2V5MTIz | base64 -d
key123
$ echo bXljb2xmYW1pbHk6bnVtYmVy | base64 -d
mycolfamily:number
                
# The table scan above gives the schema needed to insert into the HBase table
                
$ echo RESTinsertedKey | base64
UkVTVGluc2VydGVkS2V5Cg==
                
$ echo 7815551212 | base64
NzgxNTU1MTIxMgo=
                
# add a table entry with a key value of "RESTinsertedKey" and
# a phone number of "7815551212"
                
# note - curl is all on one line
$  curl -H "Content-Type: text/xml" -d '<CellSet>
<Row key="UkVTVGluc2VydGVkS2V5Cg==">
<Cell column="bXljb2xmYW1pbHk6bnVtYmVy">NzgxNTU1MTIxMgo=<Cell>
<Row><CellSet> http://192.168.1.134:9393/mytable/dummykey
                
12/09/05 15:52:34 DEBUG rest.RowResource: POST http://192.168.1.134:9393/mytable/dummykey
12/09/05 15:52:34 DEBUG rest.RowResource: PUT row=RESTinsertedKey\x0A, 
families={(family=mycolfamily, 
keyvalues=(RESTinsertedKey\x0A/mycolfamily:number/9223372036854775807/Put/vlen=11)}
                
# trust, but verify
                
hbase(main):002:0> scan 'mytable'
ROW                  COLUMN+CELL                           
RESTinsertedKey\x0A column=mycolfamily:number,timestamp=1346874754883,value=7815551212\x0A
key123              column=mycolfamily:duration, timestamp=1346868499125, value=25 
key123              column=mycolfamily:number, timestamp=1346868540850, value=6175551212 
2 row(s) in 0.5610 seconds
                
# notice the \x0A at the end of the key and value
# this is the newline generated by the "echo" command
# lets fix that
                
$ printf 8885551212 | base64
ODg4NTU1MTIxMg==
                
$ printf mykey | base64
bXlrZXk=
                
# note - curl statement is all on one line!
curl -H "Content-Type: text/xml" -d '<CellSet><Row key="bXlrZXk=">
<Cell column="bXljb2xmYW1pbHk6bnVtYmVy">ODg4NTU1MTIxMg==<Cell>
<Row><CellSet> 
http://192.168.1.134:9393/mytable/dummykey              
                
# trust but verify
hbase(main):001:0> scan 'mytable'
ROW                   COLUMN+CELL                                   
RESTinsertedKey\x0A column=mycolfamily:number,timestamp=1346875811168,value=7815551212\x0A
key123              column=mycolfamily:duration, timestamp=1346868499125, value=25     
key123              column=mycolfamily:number, timestamp=1346868540850, value=6175551212
mykey               column=mycolfamily:number, timestamp=1346877875638, value=8885551212
3 row(s) in 0.6100 seconds

Conclusion

Wow, you made it to the end, well done! This is just the beginning of understanding Hadoop and how it interacts with Informix and DB2. Here are some suggestions for your next steps.

Take the examples shown previously and adapt them to your servers. You'll want to use small data since there isn't that much space in the virtual image.
Get certified as an Hadoop Administrator. Visit the Cloudera site for courses and testing information.
Get certified as a Hadoop Developer.
Start up a cluster using the free edition of Cloudera Manager.
Get started with IBM Big Sheets running on top of CDH4.

Resources

Learn

Use an RSS feed to request notification for the upcoming articles in this series. (Find out more about RSS feeds of developerWorks content.)
Learn more about Hadoop, Open Source, and Indemnification with free demos of Cloudera Hadoop.
Download demos of Cloudera Hadoop.
Learn more about IBM and Cloudera Partnership by reading the "Integration at the Hadoop level: IBM's BigInsights now supports Cloudera" developerWorks article.
Read the book "Hadoop the Definitive Guide" by Tom White, 3rd edition.
Read the book "Programming Hive" by Edward Capriolo, Dean Wampler, and Jason Rutherglen.
Read the book "HBase the Definitive Guide" by Lars George.
Read the book "Programming Pig" by Alan Gates.
Read other developerWorks articles by Marty Lurie.
Visit the developerWorks Information Management zone to find more resources for DB2 developers and administrators.
Stay current with developerWorks technical events and webcasts focused on a variety of IBM products and IT industry topics.
Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and tools as well as IT industry trends.
Follow developerWorks on Twitter.
Watch developerWorks on-demand demos ranging from product installation and setup demos for beginners, to advanced functionality for experienced developers.

Get products and technologies

Get free downloads of Cloudera Hadoop.
Download the Informix Developer edition.
Download the DB2 Express-C edition.
Install the Informix driver.
Install the DB2 driver from the IBM Software page.
Install the DB2 driver from the IBM Support Portal.
Build your next development project with IBM trial software, available for download directly from developerWorks.
Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Get involved in the My developerWorks community. Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis.

About the author

Marty Lurie started his computer career generating chads while attempting to write Fortran on an IBM 1130. His day job is Hadoop Systems Engineering at Cloudera, but if pressed he will admit he mostly plays with computers. His favorite program is the one he wrote to connect his Nordic Track to his laptop (the laptop lost two pounds, and lowered its cholesterol by 20%). Marty is a Cloudera Certified Hadoop Administrator, Cloudera Certified Hadoop Developer, an IBM-certified Advanced WebSphere Administrator, Informix-certified Professional, Certified DB2 DBA, Certified Business Intelligence Solutions Professional, Linux+ Certified, and has trained his dog to play basketball. You can contact Marty at [email protected].

你可能感兴趣的:(Open Source Big Data for the Impatient, Part 1)

斤斤计较的婚姻到底有多难？白心之岂必有为
很多人私聊我会问到在哪个人群当中斤斤计较的人最多？我都会回答他，一般婚姻出现问题的斤斤计较的人士会非常多，以我多年经验，在婚姻落的一塌糊涂的人当中，斤斤计较的人数占比在20～30%以上，也就是说10个婚姻出现问题的斤斤计较的人有2-3个有多不减。在婚姻出问题当中，有大量的心理不平衡的、尖酸刻薄的怨妇。在婚姻中仅斤斤计较有两种类型：第一种是物质上的，另一种是精神上的。在物质与精神上抠门已经严重的影响
机器学习与深度学习间关系与区别 ℒℴѵℯ心·动ꦿ໊ོ꫞ 人工智能学习深度学习 python
一、机器学习概述定义机器学习（MachineLearning,ML）是一种通过数据驱动的方法，利用统计学和计算算法来训练模型，使计算机能够从数据中学习并自动进行预测或决策。机器学习通过分析大量数据样本，识别其中的模式和规律，从而对新的数据进行判断。其核心在于通过训练过程，让模型不断优化和提升其预测准确性。主要类型1.监督学习（SupervisedLearning）监督学习是指在训练数据集中包含输入
android系统selinux中添加新属性property 辉色投像
1.定位/android/system/sepolicy/private/property_contexts声明属性开头：persist.charge声明属性类型：u:object_r:system_prop:s0图12.定位到android/system/sepolicy/public/domain.te删除neverallow{domain-init}default_prop:property
底层逆袭到底有多难，不甘平凡的你准备好了吗？让吴起给你说说造命者说
底层逆袭到底有多难，不甘平凡的你准备好了吗？让吴起给你说说我叫吴起，生于公元前440年的战国初期，正是群雄并起、天下纷争不断的时候。后人说我是军事家、政治家、改革家，是兵家代表人物。评价我一生历仕鲁、魏、楚三国，通晓兵家、法家、儒家三家思想，在内政军事上都有极高的成就。周安王二十一年（公元前381年），因变法得罪守旧贵族，被人乱箭射死。我出生在卫国一个“家累万金”的富有家庭，从年轻时候起就不甘平凡
2020-01-25 晴岚85
郑海燕坚持分享590天2020.1.24在生活中只存在两个问题。一个问题是：你知道想要达成的目标是什么，但却不知道如何才能达成；另一个问题是：你不知道你的目标是什么。前一个是行动的问题，后一个是结果的问题。通过制定具体的下一步行动，可以解决不知道如何开始行动的问题。而通过去想象结果，对结果做预估，可以解决找不着目标的问题。对于所有吸引我们注意力，想要完成的任务，你可以先想象一下，预期的结果究竟是什
10月|愿你的青春不负梦想-读书笔记-01 Tracy的小书斋
本书的作者是俞敏洪，大家都很熟悉他了吧。俞敏洪老师是我行业的领头羊吧，也是我事业上的偶像。本日摘录他书中第一章中的金句：『一个人如果什么目标都没有，就会浑浑噩噩，感觉生命中缺少能量。能给我们能量的，是对未来的期待。第一件事，我始终为了进步而努力。与其追寻全世界的骏马，不如种植丰美的草原，到时骏马自然会来。第二件事，我始终有阶段性的目标。什么东西能给我能量？答案是对未来的期待。』读到这里的时候，我便
C语言宏函数南林yan C语言 c语言
一、什么是宏函数？通过宏定义的函数是宏函数。如下，编译器在预处理阶段会将Add(x,y)替换为((x)*(y))#defineAdd(x,y)((x)*(y))#defineAdd(x,y)((x)*(y))intmain(){inta=10;intb=20;intd=10;intc=Add(a+d,b)*2;cout<
谢谢你们，爱你们！鹿游儿
昨天家人去泡温泉，二个孩子也带着去，出发前一晚，匆匆下班，赶回家和孩子一起收拾。饭后，我拿出笔和本子（上次去澳门时做手帐的本子）写下了1\2\3\4\5\6\7\8\9,让后让小壹去思考，带什么出发去旅游呢？她在对应的数字旁边画上了，泳衣、泳圈、肖恩、内衣内裤、tapuy、拖鞋……画完后，就让她自己对着这个本子，将要带的，一一带上，没想到这次带的书还是这本《便便工厂》(晚上姑婆发照片过来，妹妹累得
C语言如何定义宏函数？小九格物 c语言
在C语言中，宏函数是通过预处理器定义的，它在编译之前替换代码中的宏调用。宏函数可以模拟函数的行为，但它们不是真正的函数，因为它们在编译时不会进行类型检查，也不会分配存储空间。宏函数的定义通常使用#define指令，后面跟着宏的名称和参数列表，以及宏展开后的代码。宏函数的定义方式：1.基本宏函数：这是最简单的宏函数形式，它直接定义一个表达式。#defineSQUARE(x)((x)*(x))2.带参
微服务下功能权限与数据权限的设计与实现 nbsaas-boot 微服务 java 架构
在微服务架构下，系统的功能权限和数据权限控制显得尤为重要。随着系统规模的扩大和微服务数量的增加，如何保证不同用户和服务之间的访问权限准确、细粒度地控制，成为设计安全策略的关键。本文将讨论如何在微服务体系中设计和实现功能权限与数据权限控制。1.功能权限与数据权限的定义功能权限：指用户或系统角色对特定功能的访问权限。通常是某个用户角色能否执行某个操作，比如查看订单、创建订单、修改用户资料等。数据权限：
理解Gunicorn：Python WSGI服务器的基石范范0825 ipython linux 运维
理解Gunicorn：PythonWSGI服务器的基石介绍Gunicorn，全称GreenUnicorn，是一个为PythonWSGI（WebServerGatewayInterface）应用设计的高效、轻量级HTTP服务器。作为PythonWeb应用部署的常用工具，Gunicorn以其高性能和易用性著称。本文将介绍Gunicorn的基本概念、安装和配置，帮助初学者快速上手。1.什么是Gunico
2021年12月19日，春蕾教育集团团建活动感受——黄晓丹黄错错加油
感受:1.从陌生到熟悉的过程。游戏环节让我们在轻松的氛围中得到了锻炼，也增长了不少知识。2.游戏过程中，我们贡献的是个人力量，展现的是团队的力量。它磨合的往往不止是工作的熟悉，更是观念上契合度的贴近。3.这和工作是一样的道理。在各自的岗位上，每个人摆正自己的位置、各司其职充分发挥才能，并团结一致劲往一处使，才能实现最大的成功。新知:1.团队精神需要不断地创新。过去，人们把创新看作是冒风险，现在人们
Cell Insight | 单细胞测序技术又一新发现，可用于HIV-1和Mtb共感染个体诊断尐尐呅
结核病是艾滋病合并其他疾病中导致患者死亡的主要原因。其中结核病由结核分枝杆菌（Mycobacteriumtuberculosis,Mtb）感染引起，获得性免疫缺陷综合症（艾滋病）由人免疫缺陷病毒（Humanimmunodeficiencyvirustype1,HIV-1）感染引起。国家感染性疾病临床医学研究中心/深圳市第三人民医院张国良团队携手深圳华大生命科学研究院吴靓团队，共同研究得出单细胞测序
c++ 的iostream 和 c++的stdio的区别和联系黄卷青灯77 c++算法开发语言 iostream stdio
在C++中，iostream和C语言的stdio.h都是用于处理输入输出的库，但它们在设计、用法和功能上有许多不同。以下是两者的区别和联系：区别1.编程风格iostream（C++风格）：C++标准库中的输入输出流类库，支持面向对象的输入输出操作。典型用法是cin（输入）和cout（输出），使用>操作符来处理数据。更加类型安全，支持用户自定义类型的输入输出。#includeintmain(){in
《投行人生》读书笔记小蘑菇的树洞
《投行人生》----作者詹姆斯-A-朗德摩根斯坦利副主席40年的职业洞见-很短小精悍的篇幅，比较适合初入职场的新人。第一部分成功的职业生涯需要规划1.情商归为适应能力分享与协作同理心适应能力，更多的是自我意识，你有能力识别自己的情并分辨这些情绪如何影响你的思想和行为。2.对于初入职场的人的建议，细节，截止日期和数据很重要截止日期，一种有效的方法是请老板为你所有的任务进行优先级排序。和老板喝咖啡的好
《策划经理回忆录之二》路基雅虎
话说三年变六年，飘了，飘了……眨眼，2013年5月，老吴回到了他的家乡——油城从新开启他的工作幻想症生涯。很庆幸，这是一家很有追求，同时敢于尝试的，且实力不容低调的新星房企——金源置业(前身泰源置业)更值得庆幸的是第一个盘就是油城十路的标杆之一:金源盛世。2013年5月，到2015年11月，两年的陪伴，迎来了一场大爆发。2000个筹，5万/筹，直接回笼1个亿！！！这……让我开始认真审视这座看似五线
Long类型前后端数据不一致 igotyback 前端
响应给前端的数据浏览器控制台中response中看到的Long类型的数据是正常的到前端数据不一致前后端数据类型不匹配是一个常见问题，尤其是当后端使用Java的Long类型（64位）与前端JavaScript的Number类型（最大安全整数为2^53-1，即16位）进行数据交互时，很容易出现精度丢失的问题。这是因为JavaScript中的Number类型无法安全地表示超过16位的整数。为了解决这个问
swagger访问路径 igotyback swagger
Swagger2.x版本访问地址：http://{ip}:{port}/{context-path}/swagger-ui.html{ip}是你的服务器IP地址。{port}是你的应用服务端口，通常为8080。{context-path}是你的应用上下文路径，如果应用部署在根路径下，则为空。Swagger3.x版本对于Swagger3.x版本（也称为OpenAPI3）访问地址：http://{ip
扫地机类清洁产品之直流无刷电机控制悟空胆好小清洁服务机器人单片机人工智能
扫地机类清洁产品之直流无刷电机控制1.1前言扫地机产品有很多的电机控制，滚刷电机1个，边刷电机1-2个，清水泵电机，风机一个，部分中高端产品支持抹布功能，也就是存在抹布盘电机，还有追觅科沃斯石头等边刷抬升电机，滚刷抬升电机等的，这些电机有直流有刷电机，直接无刷电机，步进电机，电磁阀，挪动泵等不同类型。电机的原理，驱动控制方式也不行。接下来一段时间的几个文章会作个专题分析分享。直流有刷电机会自动持续
Linux下QT开发的动态库界面弹出操作（SDL2） 13jjyao QT类 qt 开发语言 sdl2 linux
需求：操作系统为linux，开发框架为qt，做成需带界面的qt动态库，调用方为java等非qt程序难点：调用方为java等非qt程序，也就是说调用方肯定不带QApplication::exec()，缺少了这个，QTimer等事件和QT创建的窗口将不能弹出(包括opencv也是不能弹出)；这与qt调用本身qt库是有本质的区别的思路：1.调用方缺QApplication::exec()，那么我们在接口
绘本讲师训练营【24期】8/21阅读原创《独生小孩》 1784e22615e0
24016-孟娟《独生小孩》图片发自App今天我想分享一个蛮特别的绘本，讲的是一个特殊的群体，我也是属于这个群体，80后的独生小孩。这是一本中国绘本，作者郭婧，也是一个80厚。全书一百多页，均为铅笔绘制，虽然为黑白色调，但并不显得沉闷。全书没有文字，犹如“默片”，但并不影响读者对该作品的理解，反而显得神秘，梦幻，給读者留下想象的空间。作者在前蝴蝶页这样写到：“我更希望父母和孩子一起分享这本书，使他
店群合一模式下的社区团购新发展——结合链动 2+1 模式、AI 智能名片与 S2B2C 商城小程序源码说私域人工智能小程序
摘要：本文探讨了店群合一的社区团购平台在当今商业环境中的重要性和优势。通过分析店群合一模式如何将互联网社群与线下终端紧密结合，阐述了链动2+1模式、AI智能名片和S2B2C商城小程序源码在这一模式中的应用价值。这些创新元素的结合为社区团购带来了新的机遇，提升了用户信任感、拓展了营销渠道，并实现了线上线下的完美融合。一、引言随着互联网技术的不断发展，社区团购作为一种新兴的商业模式，在满足消费者日常需
我校举行新老教师师徒结对仪式暨名师专业工作室工作交流活动李蕾1229
为促进我校教师专业发展，发挥骨干教师的引领带头作用，11月6日下午，我校举行新老教师师徒结对仪式暨名师专业工作室工作交流活动。图片发自App会议由教师发展处李蕾主任主持，首先，由范校长宣读新老教师结对名单及双方承担职责。随后，两位新调入教师陈玉萍、莫正杰分别和他们的师傅鲍元美、刘召彬老师签订了师徒结对协议书。图片发自App图片发自App师徒拥抱、握手。有了师傅就有了目标有了方向，相信两位新教师在师
向内而求陈陈_19b4
10月27日，阴。阅读书目:《次第花开》。作者:希阿荣博堪布，是当今藏传佛家宁玛派最伟大的上师法王，如意宝晋美彭措仁波切颇具影响力的弟子之一。多年以来，赴海内外各地弘扬佛法，以正式授课、现场开示、发表文章等多种方法指导佛学弟子修行佛法。代表作《寂静之道》、《生命这出戏》、《透过佛法看世界》自出版以来一直是佛教类书籍中的畅销书。图片发自App金句:1.佛陀说，一切痛苦的根源在于我们长期以来对自身及外
2021-08-26 影幽
在生活中，女人与男人的感悟往往有所不同。人生最大的舞台就是生活，大幕随时都可能拉开，关键是你愿不愿意表演都无法躲避。在生活中，遇事不要急躁，不要急于下结论，尤其生气时不要做决断，要学会换位思考，大事化小小事化了，把复杂的事情尽量简单处理，千万不要把简单的事情复杂化。永远不要扭曲，别人善意，无药可救。昨天是张过期的支票，明天是张信用卡，只有今天才是现金，要善加利用！执着的攀登者不必去与别人比较自己的
ArcGIS栅格计算器常见公式（赋值、0和空值的转换、补充栅格空值）研学随笔 arcgis 经验分享
我们在使用ArcGIS时通常经常用到栅格计算器，今天主要给大家介绍我日常中经常用到的几个公式，供大家参考学习。将特定值（-9999）赋值为0，例如-9999.Con("raster"==-9999,0,"raster")2.给空值赋予特定的值（如0）Con(IsNull("raster"),0,"raster")3.将特定的栅格值(如1)赋值为空值，其他保留原值SetNull("raster"==
高级编程--XML+socket练习题 masa010 java 开发语言
1.北京华北2114.8万人上海华东2,500万人广州华南1292.68万人成都华西1417万人（1）使用dom4j将信息存入xml中（2）读取信息，并打印控制台（3）添加一个city节点与子节点（4）使用socketTCP协议编写服务端与客户端，客户端输入城市ID，服务器响应相应城市信息（5）使用socketTCP协议编写服务端与客户端，客户端要求用户输入city对象，服务端接收并使用dom4j
抖音乐买买怎么加入赚钱?赚钱方法是什么测评君高省
你会在抖音买东西吗?如果会，那么一定要免费注册一个乐买买，抖音直播间，橱窗，小视频里的小黄车买东西都可以返佣金!省下来都是自己的，分享还可以赚钱乐买买是好省旗下的抖音返佣平台，乐买买分析社交电商的价值，乐买买属于今年难得的副业项目风口机会，2019年错过做好省的搞钱的黄金时期，那么2022年千万别再错过乐买买至于我为何转到高省呢？当然是高省APP佣金更高，模式更好，终端用户不流失。【高省】是一个自
开心蒋泳频
从无比抗拒来上课到接受，感动，收获～看着波哥成长，晶晶幸福笑容满面。感觉自己做的事情很有意义，很开心！还有3个感召目标就是还有三个有缘人，哈哈。明天感召去明日计划：8：30-11：00小公益11：00-21点上班，感召图片发自App图片发自App图片发自App
2018-07-23-催眠日作业-#不一样的31天#-66小鹿小鹿_33
预言日：人总是在逃避命运的路上，与之不期而遇。心理学上有个著名的名词，叫做自证预言；经济学上也有一个很著名的定律叫做，墨菲定律；在灵修派上，还有一个很著名的法则，叫做吸引力法则。这3个领域的词，虽然看起来不太一样，但是他们都在告诉人们一个现象：你越担心什么，就越有可能会发生什么。同样的道理，你越想得到什么，就应该要积极地去创造什么。无论是自证预言，墨菲定律还是吸引力法则，对人都有正反2个维度的影响
PHP，安卓，UI，java，linux视频教程合集 cocos2d-x小菜 java UI linux PHP android
╔-----------------------------------╗┆
zookeeper admin 笔记 braveCS zookeeper
Required Software 1) JDK>=1.6 2)推荐使用ensemble的ZooKeeper(至少3台)，并run on separate machines 3)在Yahoo!，zk配置在特定的RHEL boxes里，2个cpu，2G内存，80G硬盘数据和日志目录 1)数据目录里的文件是zk节点的持久化备份，包括快照和事务日
Spring配置多个连接池 easterfly spring
项目中需要同时连接多个数据库的时候，如何才能在需要用到哪个数据库就连接哪个数据库呢？ Spring中有关于dataSource的配置： <bean id="dataSource" class="com.mchange.v2.c3p0.ComboPooledDataSource" &nb
Mysql 171815164 mysql
例如，你想myuser使用mypassword从任何主机连接到mysql服务器的话。 GRANT ALL PRIVILEGES ON *.* TO 'myuser'@'%'IDENTIFIED BY 'mypassword' WI TH GRANT OPTION; 如果你想允许用户myuser从ip为192.168.1.6的主机连接到mysql服务器，并使用mypassword作
CommonDAO（公共/基础DAO） g21121 DAO
好久没有更新博客了，最近一段时间工作比较忙，所以请见谅，无论你是爱看呢还是爱看呢还是爱看呢，总之或许对你有些帮助。 DAO(Data Access Object)是一个数据访问（顾名思义就是与数据库打交道）接口，DAO一般在业
直言有讳永夜-极光感悟随笔
1.转载地址:http://blog.csdn.net/jasonblog/article/details/10813313 精华: “直言有讳”是阿里巴巴提倡的一种观念，而我在此之前并没有很深刻的认识。为什么呢？就好比是读书时候做阅读理解，我喜欢我自己的解读，并不喜欢老师给的意思。在这里也是。我自己坚持的原则是互相尊重，我觉得阿里巴巴很多价值观其实是基本的做人
安装CentOS 7 和Win 7后，Win7 引导丢失随便小屋 centos
一般安装双系统的顺序是先装Win7，然后在安装CentOS，这样CentOS可以引导WIN 7启动。但安装CentOS7后，却找不到Win7 的引导，稍微修改一点东西即可。一、首先具有root 的权限。即进入Terminal后输入命令su，然后输入密码即可二、利用vim编辑器打开/boot/grub2/grub.cfg文件进行修改 v
Oracle备份与恢复案例 aijuans oracle
Oracle备份与恢复案例一. 理解什么是数据库恢复当我们使用一个数据库时，总希望数据库的内容是可靠的、正确的，但由于计算机系统的故障（硬件故障、软件故障、网络故障、进程故障和系统故障）影响数据库系统的操作，影响数据库中数据的正确性，甚至破坏数据库，使数据库中全部或部分数据丢失。因此当发生上述故障后，希望能重构这个完整的数据库，该处理称为数据库恢复。恢复过程大致可以分为复原(Restore)与
JavaEE开源快速开发平台G4Studio v5.0发布無為子
我非常高兴地宣布,今天我们最新的JavaEE开源快速开发平台G4Studio_V5.0版本已经正式发布。访问G4Studio网站 http://www.g4it.org 2013-04-06 发布G4Studio_V5.0版本功能新增 (1). 新增了调用Oracle存储过程返回游标，并将游标映射为Java List集合对象的标
Oracle显示根据高考分数模拟录取百合不是茶 PL/SQL编程 oracle例子模拟高考录取学习交流
题目要求: 1,创建student表和result表 2,pl/sql对学生的成绩数据进行处理 3,处理的逻辑是根据每门专业课的最低分线和总分的最低分数线自动的将录取和落选 1,创建student表,和result表学生信息表; create table student( student_id number primary key,--学生id
优秀的领导与差劲的领导 bijian1013 领导管理团队
责任优秀的领导：优秀的领导总是对他所负责的项目担负起责任。如果项目不幸失败了，那么他知道该受责备的人是他自己，并且敢于承认错误。差劲的领导：差劲的领导觉得这不是他的问题，因此他会想方设法证明是他的团队不行，或是将责任归咎于团队中他不喜欢的那几个成员身上。努力工作优秀的领导：团队领导应该是团队成员的榜样。至少，他应该与团队中的其他成员一样努力工作。这仅仅因为他
js函数在浏览器下的兼容 Bill_chen jquery 浏览器 IE DWR ext
做前端开发的工程师，少不了要用FF进行测试，纯js函数在不同浏览器下，名称也可能不同。对于IE6和FF，取得下一结点的函数就不尽相同： IE6：node.nextSibling,对于FF是不能识别的； FF：node.nextElementSibling,对于IE是不能识别的；兼容解决方式：var Div = node.nextSibl
【JVM四】老年代垃圾回收：吞吐量垃圾收集器(Throughput GC) bit1129 垃圾回收
吞吐量与用户线程暂停时间衡量垃圾回收算法优劣的指标有两个：吞吐量越高，则算法越好暂停时间越短，则算法越好首先说明吞吐量和暂停时间的含义。垃圾回收时，JVM会启动几个特定的GC线程来完成垃圾回收的任务，这些GC线程与应用的用户线程产生竞争关系，共同竞争处理器资源以及CPU的执行时间。GC线程不会对用户带来的任何价值，因此，好的GC应该占
J2EE监听器和过滤器基础白糖_ J2EE
Servlet程序由Servlet，Filter和Listener组成，其中监听器用来监听Servlet容器上下文。监听器通常分三类：基于Servlet上下文的ServletContex监听，基于会话的HttpSession监听和基于请求的ServletRequest监听。 ServletContex监听器 ServletContex又叫application
博弈AngularJS讲义(16) - 提供者 boyitech js AngularJS api Angular Provider
Angular框架提供了强大的依赖注入机制，这一切都是有注入器(injector)完成. 注入器会自动实例化服务组件和符合Angular API规则的特殊对象，例如控制器，指令，过滤器动画等。那注入器怎么知道如何去创建这些特殊的对象呢？ Angular提供了5种方式让注入器创建对象，其中最基础的方式就是提供者(provider), 其余四种方式(Value, Fac
java-写一函数f(a,b)，它带有两个字符串参数并返回一串字符，该字符串只包含在两个串中都有的并按照在a中的顺序。 bylijinnan java
public class CommonSubSequence { /** * 题目：写一函数f(a,b)，它带有两个字符串参数并返回一串字符，该字符串只包含在两个串中都有的并按照在a中的顺序。 * 写一个版本算法复杂度O(N^2)和一个O(N) 。 * * O(N^2)：对于a中的每个字符，遍历b中的每个字符，如果相同，则拷贝到新字符串中。 * O(
sqlserver 2000 无法验证产品密钥 Chen.H sql windows SQL Server Microsoft
在 Service Pack 4 (SP 4), 是运行 Microsoft Windows Server 2003、 Microsoft Windows Storage Server 2003 或 Microsoft Windows 2000 服务器上您尝试安装 Microsoft SQL Server 2000 通过卷许可协议 (VLA) 媒体。这样做, 收到以下错误信息CD KEY的 SQ
[新概念武器]气象战争 comsci
气象战争的发动者必须是拥有发射深空航天器能力的国家或者组织.... 原因如下: 地球上的气候变化和大气层中的云层涡旋场有密切的关系,而维持一个在大气层某个层次
oracle 中 rollup、cube、grouping 使用详解 daizj oracle grouping rollup cube
oracle 中 rollup、cube、grouping 使用详解 -- 使用oracle 样例表演示转自namesliu -- 使用oracle 的样列库，演示 rollup, cube, grouping 的用法与使用场景 --- ROLLUP ，为了理解分组的成员数量，我增加了分组的计数 COUNT(SAL)
技术资料汇总分享 Dead_knight 技术资料汇总分享
本人汇总的技术资料，分享出来，希望对大家有用。 http://pan.baidu.com/s/1jGr56uE 资料主要包含： Workflow->工作流相关理论、框架(OSWorkflow、JBPM、Activiti、fireflow...) Security->java安全相关资料(SSL、SSO、SpringSecurity、Shiro、JAAS...) Ser
初一下学期难记忆单词背诵第一课 dcj3sjt126com english word
could 能够 minute 分钟 Tuesday 星期二 February 二月 eighteenth 第十八 listen 听 careful 小心的，仔细的 short 短的 heavy 重的 empty 空的 certainly 当然 carry 携带；搬运 tape 磁带 basket 蓝子 bottle 瓶 juice 汁，果汁 head 头；头部
截取视图的图片, 然后分享出去 dcj3sjt126com OS Objective-C
OS 7 has a new method that allows you to draw a view hierarchy into the current graphics context. This can be used to get an UIImage very fast. I implemented a category method on UIView to get the vi
MySql重置密码 fanxiaolong MySql重置密码
方法一: 在my.ini的[mysqld]字段加入： skip-grant-tables 重启mysql服务，这时的mysql不需要密码即可登录数据库然后进入mysql mysql>use mysql; mysql>更新 user set password=password('新密码') WHERE User='root'; mysq
Ehcache（03）——Ehcache中储存缓存的方式 234390216 ehcache MemoryStore DiskStore 存储驱除策略
Ehcache中储存缓存的方式目录 1 堆内存（MemoryStore） 1.1 指定可用内存 1.2 驱除策略 1.3 元素过期 2 &nbs
spring mvc中的@propertysource jackyrong spring mvc
在spring mvc中，在配置文件中的东西，可以在java代码中通过注解进行读取了： @PropertySource 在spring 3.1中开始引入比如有配置文件 config.properties mongodb.url=1.2.3.4 mongodb.db=hello 则代码中 @PropertySource(&
重学单例模式 lanqiu17 单例 Singleton 模式
最近在重新学习设计模式，感觉对模式理解更加深刻。觉得有必要记下来。第一个学的就是单例模式，单例模式估计是最好理解的模式了。它的作用就是防止外部创建实例，保证只有一个实例。单例模式的常用实现方式有两种，就人们熟知的饱汉式与饥汉式，具体就不多说了。这里说下其他的实现方式静态内部类方式: package test.pattern.singleton.statics; publ
.NET开源核心运行时，且行且珍惜 netcome java .net 开源
背景 2014年11月12日，ASP.NET之父、微软云计算与企业级产品工程部执行副总裁Scott Guthrie，在Connect全球开发者在线会议上宣布，微软将开源全部.NET核心运行时，并将.NET 扩展为可在 Linux 和 Mac OS 平台上运行。.NET核心运行时将基于MIT开源许可协议发布，其中将包括执行.NET代码所需的一切项目——CLR、JIT编译器、垃圾收集器（GC）和核心
使用oscahe缓存技术减少与数据库的频繁交互 Everyday都不同 Web 高并发 oscahe缓存
此前一直不知道缓存的具体实现，只知道是把数据存储在内存中，以便下次直接从内存中读取。对于缓存的使用也没有概念，觉得缓存技术是一个比较”神秘陌生“的领域。但最近要用到缓存技术，发现还是很有必要一探究竟的。缓存技术使用背景：一般来说，对于web项目，如果我们要什么数据直接jdbc查库好了，但是在遇到高并发的情形下，不可能每一次都是去查数据库，因为这样在高并发的情形下显得不太合理——
Spring+Mybatis 手动控制事务 toknowme mybatis
@Override public boolean testDelete(String jobCode) throws Exception { boolean flag = false; &nbs
菜鸟级的android程序员面试时候需要掌握的知识点 xp9802 android
熟悉Android开发架构和API调用掌握APP适应不同型号手机屏幕开发技巧熟悉Android下的数据存储熟练Android Debug Bridge Tool 熟练Eclipse/ADT及相关工具熟悉Android框架原理及Activity生命周期熟练进行Android UI布局熟练使用SQLite数据库；熟悉Android下网络通信机制，S