Analysis is done by the analyzer, which is built of a tokenizer and zero or more filters, and can also have zero or more character mappers. Mastering ElasticSearch
Up until we start using the StringTokenizer towards the bottom of the code, we’re just cleaning up the string. MapReduce Design Patterns
The mapper takes a column and breaks it into tokens (individual words) using StringTokenizer, a class that splits strings on common delimiters such as spaces and commas. Cassandra High Performance Cookbook
Ironically, the next step may be to fall back on the StringTokenizer class, use of which is discouraged in the class’s Javadoc comments in favor of the newer regular expression-based tokenizers. Hadoop in Practice
TokenCounterMapper: A mapper that tokenizes the input value into words (using Java’s StringTokenizer) and emits each word along with a count of one. Hadoop:The Definitive Guide 3rd Edition
Tokenizer: Next, the string is tokenized into individual terms by a tokenizer. Elasticsearch The Definitive Guide
We see that WordCount uses Java’s StringTokenizer in its default setting, which tokenizes based only on whitespace. Hadoop in Action
It has a more complicated API and has more powerful features than StringTokenizer. Java in a Nutshell 5th Edition
Note that better tokenization could be provided by use of regular expressions or the StringTokenizer class, but for our purposes this simple approach will suffice. Hadoop Beginners Guide
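Several of the excerpts above describe whitespace-based tokenization with java.util.StringTokenizer. A minimal sketch, not drawn from any of the quoted books (the sample string is invented):

```java
import java.util.StringTokenizer;

public class TokenizeExample {
    public static void main(String[] args) {
        String line = "the quick brown fox";
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            System.out.println(tokenizer.nextToken());
        }
    }
}
```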
For example, if a low-level IO exception occurs in the midst of a StreamTokenizer or ObjectInputStream operation, there is no sensible recovery action. Concurrent Programming in Java Design Principles and Pattern
By the end of this chapter, you will be able to integrate Apache Nutch with Apache Accumulo as well as with MySQL. Web Crawling and Data Mining with Apache Nutch
Adam Shook is a software engineer at ClearEdge IT Solutions, LLC, working with a number of big data technologies such as Hadoop, Accumulo, Pig, and ZooKeeper. MapReduce Design Patterns
Providing column and table-level authorization works well for situations that require column-based security, but it does not address fine-grained access controls at the cell level, such as what is provided by Apache Accumulo (discussed in Chapter 12) Professional Hadoop Solutions
This book covers unloading/loading to and from HDFS, graph analytics with Giraph, batch data analysis using Hive, Pig, and MapReduce, machine-learning approaches with Mahout, debugging and troubleshooting MapReduce jobs, and columnar storage and retrieval of structured data using Apache Accumulo. Hadoop RealWorld Solutions Cookbook
With sFlow instrumentation of the JVM, data is pushed from the JVMs to gmond, with no polling necessary. Monitoring with Ganglia
There will be multiple instances of your map function running, potentially in multiple Java Virtual Machines (JVMs), and potentially on multiple machines. Pro Hadoop
The number of map and reduce slots determines the number of concurrent map/reduce tasks for a TaskTracker, which forks multiple JVMs to run these tasks. Hadoop Operations and Cluster Management Cookbook
In the case of very small, fast-running map tasks (where the order of execution time is in seconds), Hadoop allows you to enable several tasks to reuse JVMs by specifying the job configuration mapreduce.job.jvm.numtasks. Professional Hadoop Solutions
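A minimal sketch of setting the JVM-reuse property just described, assuming Hadoop's Configuration API; the value 10 is an arbitrary example, and -1 would permit unlimited reuse:

```java
import org.apache.hadoop.conf.Configuration;

public class JvmReuseConfig {
    public static Configuration configure() {
        Configuration conf = new Configuration();
        // Let up to 10 short-running tasks share each forked task JVM.
        conf.setInt("mapreduce.job.jvm.numtasks", 10);
        return conf;
    }
}
```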
However, the licensing and maturity of these other JVMs vary. Cassandra High Performance Cookbook
If multiple reduce tasks are specified, cityA and cityB could be processed independently on separate reduce JVMs. Hadoop RealWorld Solutions Cookbook
Bear in mind, too, that tasktrackers in a mini-cluster launch separate JVMs to run tasks in, which can make debugging more difficult. Hadoop:The Definitive Guide 3rd Edition
Although there is a single TaskTracker per slave node, each TaskTracker can spawn multiple JVMs to handle many map or reduce tasks in parallel. Hadoop in Action
The Hadoop community keeps a list of tested JVMs at http://wiki.apache.org/hadoop/HadoopJavaVersions where users can post their experiences with various Java VMs and versions. Hadoop Operations
Gradle provides an intelligent compromise in that it offers a maxParallelForks that governs the maximum simultaneous JVMs that are spawned. Building and Testing with Gradle
Being able to control the versioning scheme programmatically will become a necessity the more you want to automate your project lifecycle. Gradle in Action
Configuration changes, which we would normally make in Eclipse, can now be done programmatically in a Gradle build file. Gradle Effective Implementation Guide
The ElasticSearch API has some methods to programmatically create JSON documents which we will show. Mastering ElasticSearch
To rename files programmatically, we can pass a closure to the rename method (Example 1-7) Gradle Beyond The Basics
This will generate object allocation stack sizes that are too small to be useful, so instead you’ll programmatically set custom HPROF parameters: Hadoop in Practice
Section 3.1.2 covers Hadoop Java libraries for handling HDFS files programmatically. Hadoop in Action
Programmatically Defining Keyspaces and Column Families: You can create keyspaces and column families through the API as well. Cassandra The Definitive Guide
It is possible to use the JobConf class to programmatically set configuration properties on the executing job. Hadoop Beginners Guide
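A minimal sketch of setting job properties programmatically through JobConf, as the excerpt above describes; the job name and property key here are hypothetical examples:

```java
import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
    public static JobConf buildJob() {
        JobConf job = new JobConf(JobSetup.class);
        job.setJobName("example-job");          // typed setter for a well-known property
        job.setNumReduceTasks(2);
        job.set("my.custom.property", "value"); // arbitrary key/value pair (hypothetical key)
        return job;
    }
}
```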
If you want to embed an ActiveMQ broker inside an application, the message store can also be configured programmatically. ActiveMQ in Action
Gradle also implements a strong domain model that is easily programmatically accessible. Building and Testing with Gradle
If a worker does not receive a ping message after a specified interval, the worker process terminates. Pregel - A System for Large-Scale Graph Processing
We can use this plugin to send a ping to a Hadoop/HBase daemon's RPC port, to check whether the daemon is alive. HBase Administration Cookbook
Imagine a check_ping plug-in that, when executed by the Nagios scheduler, pings a host and then returns the following output: Monitoring with Ganglia
The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response) MapReduce - Simplied Data Processing on Large Clusters
The master node will send the ping requests to other nodes and wait for the response. Mastering ElasticSearch
If an OSD has not heard from a peer recently, an explicit ping is sent. Ceph A Scalable, High-Performance Distributed File System
Open a shell or Terminal window on each machine, and use the ping command to see whether it can find the other. Beginning CouchDB
Creating a Rule Let’s create a rule to ping an arbitrary server and store the results of the ping. Gradle Beyond The Basics
Sessions are kept alive by the client sending ping requests (also known as heartbeats) whenever the session is idle for longer than a certain period. Hadoop:The Definitive Guide 3rd Edition
When a RPC client detects a tcp-socket timeout, instead of declaring a RPC timeout, it sends a ping to the RPC server. Apache Hadoop Goes Realtime at Facebook
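The ping-style liveness checks described above can be approximated in plain Java with InetAddress.isReachable(); a minimal sketch (the hostname is hypothetical, and the call may use ICMP or a TCP echo depending on privileges and platform):

```java
import java.net.InetAddress;

public class ReachabilityCheck {
    public static void main(String[] args) throws Exception {
        InetAddress host = InetAddress.getByName("node1.example.com"); // hypothetical host
        boolean alive = host.isReachable(3000); // 3-second timeout
        System.out.println(host + (alive ? " is alive" : " did not respond"));
    }
}
```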
For example, the following command uses sflowtool to convert the binary sFlow HTTP operation data into ASCII CLF so that the operations can be visually inspected or exported to a web log analyzer such as Webalizer: Monitoring with Ganglia
If they are the same, the _rev values are compared in ASCII sort order, and the highest wins. CouchDB The Definitive Guide
Lowercase filter: it makes all the tokens lowercase. ASCII folding filter: it removes non-ASCII parts from tokens. Mastering ElasticSearch
Use assume so the CLI formats the columns as ASCII text, and then fetch all the columns of the 'mynewcar' row: Cassandra High Performance Cookbook
If both documents have the same number of revisions, then CouchDB simply does an ASCII comparison of the revision number and the document with the highest sort order wins. Scaling CouchDB
The data is stored using a line-oriented ASCII format, in which each line is a record. Hadoop:The Definitive Guide 3rd Edition
Without this clause, Hive defaults to the newline character as the row delimiter and an ASCII value of 001 (Control-A) as the field delimiter. Hadoop in Action
This sorts by directly comparing the bytes, validating that the input can be parsed as US-ASCII. Cassandra The Definitive Guide
Our file is tab separated, but Hive, by default, expects its input files to have fields separated by the ASCII code 001 (Control-A) Hadoop Beginners Guide
Finally the frame is ended by the ASCII null character (^@) ActiveMQ in Action
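A minimal sketch of the two ASCII operations the excerpts above mention — validating that input parses as US-ASCII, and comparing strings in a way that, for pure ASCII input, matches byte-wise (ASCII sort order) comparison; the sample values are invented:

```java
import java.nio.charset.StandardCharsets;

public class AsciiCheck {
    public static void main(String[] args) {
        String value = "1-abc";
        // canEncode() is false if the string contains any non-ASCII character.
        boolean isAscii = StandardCharsets.US_ASCII.newEncoder().canEncode(value);
        // compareTo() compares char by char; for pure ASCII strings this is
        // equivalent to comparing the underlying bytes.
        int order = value.compareTo("2-abc"); // negative: "1-abc" sorts first
        System.out.println(isAscii + " " + order);
    }
}
```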
You may be bound by other regulatory limitations on access to data such as Payment Card Industry (PCI—http://en.wikipedia. Apache Flume Distributed Log Collection for Hadoop
The Wikipedia article on B-trees is a good starting point for further investigations. CouchDB The Definitive Guide
For more information about SPNEGO, please refer to its wiki page at http://en.wikipedia.org/wiki/SPNEGO. Hadoop Operations and Cluster Management Cookbook
From the client’s point of view, a document is a JSON object (see more about the JSON format at http://en.wikipedia.org/wiki/JSON) Mastering ElasticSearch
Suppose we want to add StackOverflow links to each Wikipedia page that is referenced in a StackOverflow comment. MapReduce Design Patterns
A more in-depth explanation can be found in Wikipedia at http://en.wikipedia.org/wiki/Z-order_curve. Hadoop RealWorld Solutions Cookbook
You can find more information about the Bayesian classifier at http://en.wikipedia.org/wiki/Naive_Bayes_classifier and Bayes’ theorem at http://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/ Hadoop MapReduce Cookbook
Wikipedia provides some good background articles on XML and JSON, if needed. Hadoop in Practice
The reader is configured by setting job configuration properties to tell it the patterns for the start and end tags (see the class documentation for details). To take an example, Wikipedia provides dumps of its content in XML form, which are appropriate for processing in parallel with MapReduce using this approach. Hadoop:The Definitive Guide 3rd Edition
The original Google paper and the Wikipedia entry use the CamelCase version MapReduce. Hadoop in Action
When you use the switch statement with enumerated constants, you need not supply the name of the enumeration in each label—it is deduced from the switch value. Core Java Volume I Fundamentals 9th Edition
When the argument is create (or an equivalent enumeration value), then the index operation will fail if a document with a given identifier already exists. Mastering ElasticSearch
The interface provides a public enumeration type, named ReturnCode, that is used by the filterKeyValue() method to indicate what the execution framework should do next. HBase:The Definitive Guide
Obviously, this is not what we want to happen when dealing with identification codes, or any kind of precise enumeration. Elasticsearch The Definitive Guide
Enumeration entries() returns an Enumeration object that enumerates the ZipEntry objects that describe the entries of the ZipFile. Core Java Volume II Advanced Features 9th Edition
The basic idea behind Java’s enum types is simple: they are classes that export one instance for each enumeration constant via a public static final field. Effective Java 2nd Edition
The getStatus operation is not an enumeration with some set list of possible statuses. Cassandra The Definitive Guide
What just happened? The first thing we need to do in order to add new counters is to create a standard Java enumeration that will hold them. Hadoop Beginners Guide
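A minimal sketch of the counter enumeration just described, using Hadoop's new-style Mapper API; the class, enum, and counter names are hypothetical:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    // A standard Java enumeration holding the custom counters.
    public enum RecordCounters { VALID, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        if (value.getLength() == 0) {
            context.getCounter(RecordCounters.MALFORMED).increment(1);
        } else {
            context.getCounter(RecordCounters.VALID).increment(1);
        }
    }
}
```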
The getPropertyNames() method returns an Enumeration of all the properties on a given message to easily iterate through all of them. ActiveMQ in Action
Another possibility is to use the sleep() method of an element of the TimeUnit enumeration. Java 7 Concurrency Cookbook
Two of the three (Percona Server and MariaDB) are drop-in replacements, so the effort involved in trying them out is not large. High Performance MySQL 3rd Edition
In fact, for some systems, such as MariaDB or Percona, this may be your only option. For example, to check the status of the cluster, open the MySQL client and check the status of the various parameters: high-availability-guide
Each calendar receiving the message will process the invitation in its own transaction which updates the invitee's state and deletes the message. Megastore - Providing Scalable, Highly Available Storage for Interactive Services
The merge cost is calculated in such a way that merges reclaiming more deletes and having a smaller size are favored. Mastering ElasticSearch
FlumeJava automatically deletes any temporary intermediate files it creates when they are no longer needed by later operations in the execution plan. FlumeJava - Easy, Efficient Data-Parallel Pipelines
Deletes in Cassandra are implemented as special writes known as tombstones. Cassandra High Performance Cookbook
Description: Deletes documents marked as deleted and old revisions of documents. Beginning CouchDB
Whenever CouchDB deletes a document, it creates a new revision with _deleted field set to true. Scaling CouchDB
After some interval (a minute, say), the master deletes ephemeral files that have no open file handles. Chubby - The lock service for loosely-coupled distributed systems
Each segment is almost a self-sufficient index by itself, immutable up to deletes. elasticsearch. The Road to a Distributed, (Near) Real Time, Search Engine
It is still a common design requirement to store IDs related to other entities in your tables, but operations such as cascading deletes are not available. Cassandra The Definitive Guide
Irrelevant details were elided to keep the example short. The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor. Bigtable - A Distributed Storage System for Structured Data
A Callable is similar to a Runnable, but it returns a value. Core Java Volume I Fundamentals 9th Edition
In practice, a command named runMyAwesomeTask would be callable with the abbreviation rMAT. Gradle in Action
We can also use a closure or an instance of the Callable interface to define a list of files: Gradle Effective Implementation Guide
Functions in Pig are callable units of work that can be referenced in the context of specific operators. Hadoop in Practice
Endpoints are dynamic extensions to the RPC protocol, adding callable remote procedures. HBase:The Definitive Guide
In our example here, we are submitting a Callable object to the ExecutorService instance. Getting Started with Google Guava
There are two kinds of tasks: Runnable and its close cousin, Callable (which is like Runnable, except that it returns a value) Effective Java 2nd Edition
Another important advantage of the Executor framework is the Callable interface. Java 7 Concurrency Cookbook
In the same way that interface Runnable describes any pure action, a Callable interface can be used to describe any service method that accepts an Object argument, returns an Object result, and may throw an Exception: Concurrent Programming in Java Design Principles and Pattern
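A minimal sketch contrasting Callable with Runnable as the excerpts above describe: call() returns a value and may throw a checked exception, and the result comes back through a Future (the task itself is invented):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // Unlike Runnable.run(), call() returns a value and may throw a checked exception.
        Callable<Integer> task = () -> 6 * 7;
        Future<Integer> result = executor.submit(task);
        System.out.println(result.get()); // blocks until the value is available
        executor.shutdown();
    }
}
```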
The CallableStatement in the following code executes this stored procedure that is contained within the database, passing the necessary parameters. Java 7 Recipes A Problem-Solution Approach
solrconfig.xml: This is the main configuration file of your Solr install. Scaling Big Data with Hadoop and Solr Chapter-02 Understanding Solr
Solr's solrconfig.xml file contains lots of parameters that can be tweaked. Solr 3 Enterprise Search Server
I'm talking about files like schema.xml, solrconfig.xml, and so on. Solr 3.1 Cookbook
We'll learn more about the main Solr configuration file for a core, named solrconfig.xml, in chapter 4 Solr in action
I'm talking about files such as schema.xml, solrconfig.xml, solr.xml, and so on. Apache Solr 4 Cookbook
In SBT, a resolver is mapped to a URL that hosts dependencies (like Maven repositories) Scala in Action
This resolver handles certain standard Spring MVC exceptions by setting a specific response status code. Pro Spring 3
NameResolver’s act method can be rewritten to use loop as shown in Listing 32.4 Programming in Scala 2nd Edition
If you must parse from an input stream, supply an entity resolver (see the following note) Core Java Volume II Advanced Features 9th Edition
The DefaultHandler class provides an empty implementation of all the methods of the ContentHandler, ErrorHandler, DTDHandler, and EntityResolver interfaces. Java in a Nutshell 5th Edition
Hive uses Java Reflection to select the correct evaluation method, and if you require a finer-grained selection, you can develop your own utility class that implements the UDFMethodResolver interface. Hadoop Beginners Guide
Hostname canonicalization is the process of finding the complete, official, hostname according to the resolution system, in this case, the host’s resolver library. Hadoop Operations
As an example, we will look at a new UDF, MetroResolver. Programming Pig
One way to improve the performance is to load the entities into memory using an entity resolver. Java Performance
You can catch these exceptions and handle the conflicts manually using a SyncResolver object. Java 7 Recipes A Problem-Solution Approach
He has worked with Perl/PHP/Python, primarily for coding, and MySQL/Oracle as the backend, for several years prior to moving into Hadoop full-time. Apache Flume Distributed Log Collection for Hadoop
The prospect of firing up rsync or kludging together some Perl is altogether different when 20,000 hosts are involved. Monitoring with Ganglia
The multiple language code generation allows programs written in C++ or Perl to call the same methods as a Java client. Cassandra High Performance Cookbook
Most of these packages (Perl, gcc, GNU make, and so on) are preinstalled on most Linux distributions. Beginning CouchDB
The results, shown in Table 1, demonstrate Sawzall is significantly faster than Python, Ruby, or Perl, at least for these microbenchmarks. Interpreting the Data Parallel Analysis with Sawzall
Hadoop has a contrib project (contributed in JIRA ticket HADOOP-3754) that contains a Thrift server and bindings for various client languages including Python, Ruby, and Perl. Hadoop in Practice
We wrote a naive single-node Perl hash-based program as an example of a typical quick solution a sysadmin may create. Hadoop in Action
It’s increasingly popular and is used by projects such as Android, Fedora, Ruby on Rails, Perl, and many Cassandra clients (as we’ll see in Chapter 8) Cassandra The Definitive Guide
If you know Perl, Python, Ruby, JavaScript, or even shell script, you can learn Pig’s syntax in the morning and be running MapReduce jobs by lunchtime. Hadoop Operations
Although ActiveMQ is written in Java, APIs for many languages other than Java are provided, including C/C++, .NET, Perl, PHP, Python, Ruby, and many more. ActiveMQ in Action
The type of object returned will vary, depending on the MIME content type of the remote resource and the available content handlers (classes responsible for processing and retrieving objects from a URLConnection) Java Network Programming and Distributed Computing
ContentHandler objects are typically used indirectly through the getContent method of URL and URLConnection. Java EE 7 Specification
The HttpURLConnection class that uses this method is abstract, so the actual behavior of this class may be overridden in a particular implementation. Java Security 2nd Edition
The URL and URLConnection classes encapsulate much of the complexity of retrieving information from a remote site. Core Java Volume II Advanced Features 9th Edition
At a slightly more complex level, a URLConnection object can be obtained from a given URL object. Java in a Nutshell 5th Edition
Chapter 7, URLConnections, shows you how to use the URLConnection and HttpURLConnection classes not just to download data from web servers, but to upload documents and configure connections. Java Network Programming 4th Edition
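A minimal sketch of the pattern the excerpts above describe — obtaining a URLConnection from a URL object and reading the remote content (the URL is a placeholder):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/");  // placeholder URL
        URLConnection conn = url.openConnection(); // obtained from the URL object
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```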
Use the CLI to see that the data is stored in a single column: [default@parking] assume parking validator as ascii; Cassandra High Performance Cookbook
java:comp/ValidatorFactory for ValidatorFactory, or by requesting the injection of an object of the appropriate type via the Resource annotation. Java EE 7 Specification
This command runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to check whether the sort is accurate. Hadoop:The Definitive Guide 3rd Edition
This is achieved by way of a nested f:converter and f:validator, just like with any h:inputText, and can be done as shown: Java EE 7 Essentials
Several HTML validity checkers are available for download and you can validate HTML online [W3C-validator] Effective Java 2nd Edition
Obtain a Validator from the Schema, and then use the Validator to validate your XML document. Java in a Nutshell 5th Edition
Once the rows have been added to the validator, if the column family needs to be validated, it will create the Merkle tree and broadcast it to the neighboring nodes. Cassandra The Definitive Guide
The example has two UserValidator objects that return a random boolean value. Java 7 Concurrency Cookbook
The following code snippet shows how to use a validator to validate a document using SAX. Java Performance
You can use the Schema instance to retrieve a Validator instance that understands this grammar. Java 7 Recipes A Problem-Solution Approach
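A minimal sketch of the javax.xml.validation flow several excerpts above outline — obtain a Schema, get a Validator from it, and validate a document; the file names are hypothetical:

```java
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class XmlValidation {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("books.xsd"));    // hypothetical schema file
        Validator validator = schema.newValidator();                 // retrieved from the Schema
        validator.validate(new StreamSource(new File("books.xml"))); // throws SAXException on failure
    }
}
```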
The hypervisor acts as a virtual switch, connecting virtual machines to each other and to the physical network. Monitoring with Ganglia
This refers to the time during which something was runnable on the virtual machine, but the hypervisor chose to run something else instead. High Performance MySQL 3rd Edition
The nova-compute process is primarily a worker daemon that creates and terminates virtual machine instances via the hypervisor's APIs (XenAPI for XenServer/XCP, libvirt for KVM or QEMU, VMwareAPI for VMware, etc.) bk-compute-adminguide-grizzly
A worker daemon that creates and terminates virtual machine instances through hypervisor APIs. bk-admin-guide-cloud-latest
The Vagrant file simply defines the hypervisor-level configuration and provisioning; the remaining provisioning is done through Puppet and is defined at two levels. Storm Real-Time Processing Cookbook
The piece of software that enables multiple guests, for example operating systems, to run on a single system is called a hypervisor. Oracle JRockit The Definitive Guide
Virtualization works by running a hypervisor either in a host OS or directly on bare metal, replacing the host OS entirely. Hadoop Operations
To view the list of valid compute hosts, use the nova hypervisor-list command, as follows: admin-user-guide-trunk
The actual device name might differ due to hypervisor settings. user-guide
Probably the most important factor in your choice of hypervisor is your current usage or experience. OpenStack Operations Guide
Imagine you want to have a codec that decodes bytes to some kind of message (a POJO) and encodes the message back to bytes. Netty in Action MEAP V08
You’ll create a new Java class called ToDoItem, a plain old Java object (POJO) acting as a model. Gradle in Action
This is a simple POJO (plain old Java object) with three properties—nothing much to it. Scala in Action
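A minimal sketch of such a POJO, echoing the ToDoItem example above with three invented properties and conventional getters and setters:

```java
// A hypothetical plain old Java object (POJO): no framework types, just state and accessors.
public class ToDoItem {
    private Long id;
    private String name;
    private boolean completed;

    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public boolean isCompleted() { return completed; }
    public void setCompleted(boolean completed) { this.completed = completed; }
}
```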
There is little else to configure: just add a dependency to the test framework (or frameworks) your tests are using, and Maven will automatically look for and execute the JUnit, TestNG, or even Plain Old Java Objects (POJO) tests contained in this directory structure. Jenkins The Definitive Guide
Enables a WebSocket client and server endpoint to be defined declaratively via annotations on a POJO, or programmatically via interface implementation. Java EE 7 Essentials
A configurable POJO is associated with the ActiveMQ connection that you can tune to set different policies. ActiveMQ in Action
Arcane interfaces for using persistent entities have been removed, creating a simplified Plain Old Java Object (POJO) based persistence model. Java Performance
I created a constraint object called MyCellConstraint which is a plain old Java object (POJO) used when calling the add(JComponent comp, Object constraint) method of a JPanel component. Java 7 Recipes A Problem-Solution Approach
Because the objects were regular Java objects, this persistence model came to be known as Plain Old Java Object (POJO) persistence. Pro JPA2
Formally, each of these kinds of fields is known as a property in Gradle, since we are saving state in POJO objects under the hood. Building and Testing with Gradle
When you add many components into a window, you need to give some thought to the traversal order. Core Java Volume I Fundamentals 9th Edition
There is also a version of parallelDo() that allows multiple output PCollections to be produced simultaneously from a single traversal of the input PCollection. FlumeJava - Easy, Efficient Data-Parallel Pipelines
For example, Cascalog adds a logic-based query capability inspired by Datalog, while Scalding adds math libraries that are useful for graph-traversal problems and many machine-learning algorithms. Professional Hadoop Solutions
For example, owner and mode are required for the security check during path traversal but rarely change, requiring very few states, while the file lock reflects a wider range of client access modes as it controls an MDS’s ability to issue client capabilities. Ceph A Scalable, High-Performance Distributed File System
Because of the limited set of methods available in the Iterator interface, it is hard to use the Iterator for anything more than a simple traversal through the Collection. Data Structures and Algorithm Analysis in Java 3rd Edition
Computer scientists also call this postorder traversal because the search process visits the children before visiting the parents. Core Java Volume II Advanced Features 9th Edition
The serialization logic has no knowledge of the topology of the object graph, so it must go through an expensive graph traversal. Effective Java 2nd Edition
It performs a depth first traversal, starting at a specific root directory. Java 7 New Features Cookbook
The most fundamental graph traversal algorithms are depth-first search (DFS) and breadth-first search (BFS). The difference between the algorithms is the ordering in which a node is processed in relationship to its neighbors. Hadoop Beginners Guide
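A minimal BFS sketch illustrating the frontier/visited-set structure behind the breadth-first ordering just described; the adjacency-list representation is an assumption:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class Bfs {
    // Breadth-first traversal: every node at distance d is processed
    // before any node at distance d + 1.
    static void bfs(Map<String, List<String>> graph, String start) {
        Queue<String> frontier = new ArrayDeque<>();
        Set<String> visited = new HashSet<>();
        frontier.add(start);
        visited.add(start);
        while (!frontier.isEmpty()) {
            String node = frontier.remove();
            System.out.println(node);
            for (String neighbor : graph.getOrDefault(node, List.of())) {
                if (visited.add(neighbor)) { // add() returns false if already visited
                    frontier.add(neighbor);
                }
            }
        }
    }
}
```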
However, this strategy doesn't work for another common usage of collections, traversal. Concurrent Programming in Java Design Principles and Pattern
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub. Apache Flume Distributed Log Collection for Hadoop
Such a problem often occurs when Hadoop is used to store and retrieve large items, such as PDF files, large data samples, images, movies, or other multimedia data. Professional Hadoop Solutions
Aside from his work at Core, Joe has worked on several successful web development projects, including Learn French by Podcast, a language-learning web site that offers its members free high-quality MP3 podcasts and allows them to purchase accompanying professional PDF guides for a small fee. Beginning CouchDB
This example uses PNG, but gnuplot supports many other terminals like SCREEN, PDF, EPS, and so on. Hadoop MapReduce Cookbook
Nutch provides parsers for documents in many popular formats, such as HTML, PDF, Open Office and Microsoft Office, RSS, and others. Hadoop:The Definitive Guide 3rd Edition
See Eric Brewer’s original paper on this topic and the follow-up post by Coda Hale, as well as this PDF by Gilbert and Lynch. HBase:The Definitive Guide
The MIME standard defines common data formats such as "image/jpeg" and "application/pdf" Core Java Volume II Advanced Features 9th Edition
Because the Times had stored its older articles as scanned TIFF images, they needed image processing to combine different pieces of each article together into a single file in the desired PDF format. Hadoop in Action
You can read the original Phi Accrual Failure Detection paper on which Cassandra’s design is based at http://ddg.jaist.ac.jp/pub/HDY+04.pdf. Cassandra The Definitive Guide
This book was written and typeset by the author using Adobe FrameMaker, Acrobat, and Distiller on Macintosh and PC platforms, and supplied to the publisher and printer as an Adobe Portable Document Format (PDF) file. Fundamental Networking in Java
When two values are combined with a binary operator (such as n + f where n is an integer and f is a floating-point value), both operands are converted to a common type before the operation is carried out. Core Java Volume I Fundamentals 9th Edition
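A one-line illustration of the binary numeric promotion just described (the values are invented):

```java
public class Promotion {
    public static void main(String[] args) {
        int n = 3;
        float f = 2.5f;
        // n is promoted to float before the addition, so the result is a float.
        float sum = n + f;
        System.out.println(sum); // 5.5
    }
}
```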
Italic is used for Java Virtual Machine "assembly language", its opcodes and operands, as well as items in the Java Virtual Machine's run-time data areas. The Java Virtual Machine Specification Java SE 7 Edition
The most common use for operators that modify their operands is to generate the side effect, but you should keep in mind that the value produced is available for your use, just as in operators without side effects. Thinking in Java 4th Edition
The leaves of an expression tree are operands, such as constants or variable names, and the other nodes contain operators. Data Structures and Algorithm Analysis in Java 3rd Edition
Numeric promotions are used to convert the operands of a numeric operator to a common type where an operation can be performed. The Java Language Specification Java SE 7 Edition
All bytecode operations, with few exceptions, are computed on an evaluation stack by popping operands from the stack, executing the operation and pushing the result back to the stack. Oracle JRockit The Definitive Guide
In a bitwise AND operation, each bit of one operand is ANDed with the corresponding bit of the other. Java 7 New Features Cookbook
An operator is a key part of Java syntax: it performs a computation on one or two operands to produce a new value. Java in a Nutshell 5th Edition
Other constraints that check the type and number of arguments for other instructions require dynamic analysis of the code to determine which operands will be present on the expression stack during execution. Java Performance
If the operands are integer types, the quotient of the division is returned. Programming Hive
These metrics are then polled by Ganglia Meta Daemon (Gmetad) servers, where the metrics are computed and saved in round-robin, time-series databases using the round-robin database tool (RRDtool) HBase Administration Cookbook
I should mention here that Federico Sacerdoti was heavily involved in the implementation of gmetad and wrote a nice academic paper highlighting the strength of its design. Monitoring with Ganglia
Ganglia is composed of two servers: the gmetad server, which provides historical data and collects current data, and the gmond server, which collects and serves current statistics. Pro Hadoop
Add all the hostnames in the cluster to the gmetad configuration /etc/ganglia/ gmetad.conf file, for example, this file should contain the following: Hadoop Operations and Cluster Management Cookbook
You will also need the Ganglia meta daemon (gmetad) running on at least one node, and another node running the Ganglia web frontend. Hadoop RealWorld Solutions Cookbook
Ganglia meta daemon (gmetad) The meta daemon is installed on a central node and acts as the federation node to the entire cluster. HBase:The Definitive Guide
Each gmond process relays data to a central gmetad process that records data in a series of RRD, or round-robin database files, which are fixed-size files that efficiently store time series data. Hadoop Operations
In this way, the hash table is just a mapping from hashes to backend database URIs. CouchDB The Definitive Guide
The Random Partitioner hashes the user supplied key and uses the result to place this data on nodes in the cluster. Cassandra High Performance Cookbook
When the high write levels are detected, Ceph hashes the shared directory and relaxes the directory’s mtime coherence to distribute the workload across all MDS nodes. Ceph A Scalable, High-Performance Distributed File System
Redis supports many other data structures, including hashes, lists, and sorted sets. Hadoop RealWorld Solutions Cookbook
Gradle relies heavily on hashes to optimize download performance; when resolving a dependency, it will first attempt to download the much-smaller SHA from the repo, skipping the download if it determines that it already has that content in the artifact cache. Gradle Beyond The Basics
The hashes are the same as those you generated with the write. Hadoop in Practice
The default partitioner is HashPartitioner, which hashes a record’s key to determine which partition the record belongs in. Hadoop:The Definitive Guide 3rd Edition
The last native JSON datatype that we need to discuss is the object—known in other languages as hashes, hashmaps, dictionaries or associative arrays. Elasticsearch The Definitive Guide
By default, Hadoop will use a strategy that hashes the output key to perform the partitioning. Hadoop Beginners Guide
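A minimal sketch of the hash-based partitioning strategy described above; the sign-bit mask and modulo mirror what Hadoop's default HashPartitioner does (the keys are invented):

```java
public class HashPartitioning {
    // Mask off the sign bit so the result is non-negative,
    // then take the remainder modulo the number of partitions.
    static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("cityA", 4));
        System.out.println(partitionFor("cityB", 4)); // may land on a different reducer
    }
}
```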
Note that the field has many distinct values, leading to large internal hash-tables; computing the hashes themselves on possibly large strings is already computationally quite expensive. PowerDrill - Processing a Trillion Cells per Mouse Click
The libvirt project has created a common set of tools for managing virtualization resources on different virtualization platforms, currently including: Xen, QEMU, KVM, LXC, OpenVZ, User Mode Linux, VirtualBox, and VMware ESX and GSX. Monitoring with Ganglia
In contrast, a SAN can be great when you don’t demand a lot of high-performance random I/O, but you are interested in features such as snapshots, storage consolidation, data deduplication, and virtualization. High Performance MySQL 3rd Edition
Server virtualization is the partitioning of a physical server into smaller virtual servers to help maximize the server resources. Gradle in Action
Virtualization: Virtualization is a great way to get cheap hardware and be able to expand fast. MongoDB The Definitive Guide 2nd Edition
It may seem strange, but think about virtualization and the modern world, where the operating system cannot determine on which physical machine it works. Mastering ElasticSearch
Some virtualization types are better than others, but none come without an impact on performance. HBase in Action
Billing for resources consumed, including the cost of hardware, virtualization, network provision, and even the costs of software development and operations (since these are a simply logical extension of the system we see here) Graph Databases
Since 2008, he has worked for Oracle on a fast virtualization platform. Oracle JRockit The Definitive Guide
Blades, SANs, and Virtualization: The large-scale data storage and processing industry moves in cycles. Hadoop Operations
It is also a good practice in the setup and configuration of a benchmark to consider leveraging operating system capabilities such as processor sets, or partitioning off portions of a system so they are viewed as distinct systems, such as virtualization or Oracle Solaris zones. Java Performance
This includes putting data directly into an HBase table and writing to HFile format files on Hadoop Distributed File System (HDFS) HBase Administration Cookbook
A storage file format, HFile, is under development and due for HBase version 0.20.0, and is expected to solve many of the performance and reliability issues. Pro Hadoop
The implementation of HFile in a region server is responsible for reading and writing HFiles to and from HDFS. Professional Hadoop Solutions
Its contents are flushed to disk to form an HFile when the MemStore fills up. HBase in Action
All columns in a column family are stored together in the same lowlevel storage file, called an HFile. HBase:The Definitive Guide
When the MemStore reaches a certain threshold it is written out as an HFile. Apache Hadoop Goes Realtime at Facebook
The HFile writer has a straightforward write path, with four elements: Hadoop in Action
While JMX is fine for casual browsing of metric values, the number of eyeballs looking at JConsole doesn't scale when you have hundreds or even thousands of servers sending data all over the place. Apache Flume Distributed Log Collection for Hadoop
It is also possible to change logging levels in a running program by using the jconsole program. Core Java Volume I Fundamentals 9th Edition
Use the following command to start the monitor user interface: jconsole. Hadoop Operations and Cluster Management Cookbook
Also described are the Java profiling tools such as JConsole. Cassandra High Performance Cookbook
The JDK comes with a tool called JConsole for viewing MBeans in a running JVM. Hadoop:The Definitive Guide 3rd Edition
JConsole Java ships with a helper application called JConsole, which can be used to connect to local and remote Java processes. HBase:The Definitive Guide
The jconsole application, which is available as part of the JDK, provides an alternative technique for determining which MXBeans are available. Java 7 New Features Cookbook
You may start jconsole with no local or remote process specified and use its Connection menu to establish connections. Java in a Nutshell 5th Edition
To do this, open a console, start the jconsole application, and connect to Cassandra via JMX. Cassandra The Definitive Guide
It shows how to configure ActiveMQ for JMX monitoring and demonstrates this using JConsole. ActiveMQ in Action
You can also check the detailed statistics of your tasks per job by opening the Tasktracker component of Apache Hadoop. Web Crawling and Data Mining with Apache Nutch
Jobtracker and tasktracker tunable parameters: The JobTracker is the server that handles the management of the queued and executing jobs. Pro Hadoop
Another two EC2 small instances are used with roles hadoop-datanode and hadoop-tasktracker in each instance. Hadoop MapReduce Cookbook
There are two types of client: external clients (run by users) and internal clients (run on the cluster as a part of the system, e.g., datanode and tasktracker daemons) Hadoop:The Definitive Guide 3rd Edition
Tasktracker: The process responsible for running map or reduce tasks from the jobtracker. Cassandra The Definitive Guide
Daemons: There are two major daemons in Hadoop MapReduce: the jobtracker and the tasktracker. Hadoop Operations
Core Instance Group: The nodes in the core instance group have the same function as Hadoop slave nodes that run both the datanode and tasktracker daemons. Programming Hive
This option, in turn, defaults to a subdirectory named Ganglia in the directory where libganglia will be installed. Monitoring with Ganglia
Cluster Administration Tools: The Hadoop Core installation provides a number of scripts in the bin subdirectory of the installation that are used to stop and start the entire cluster or various pieces of the cluster. Pro Hadoop
In addition, you can specify a directory and CouchApp will load all the files in this directory and any subdirectory. CouchDB The Definitive Guide
Because I don’t want to make this example too complex to grasp, you’ll just copy it to the subdirectory build/backup. Gradle in Action
I will now show how to change to the source subdirectory inside this new directory and change the permissions of a few files to make sure they are ready for the build process: Beginning CouchDB
Whirr generates a script that we can use to start this proxy, under a subdirectory named after your Hadoop cluster inside the ~/.whirr directory. Hadoop MapReduce Cookbook
The extra trailing /foo subdirectory is needed on the destination, because now the contents of the source directory are copied to the contents of the destination directory. Hadoop:The Definitive Guide 3rd Edition
A task can have more than one attempt if the first one fails. These user log files are under the /logs/userlogs subdirectory. Hadoop in Action
Now we have a subdirectory named after the Git project so that we can build the project and start using it. Cassandra The Definitive Guide
If you include in a JAR file class files from a subdirectory, the class may not be stored with the path you expect. Hadoop Beginners Guide
A user request comes in, and the web server takes care of the networking and parses the HTTP request. CouchDB The Definitive Guide
The compiler parses the Groovy source files and generates stubs for them. Gradle in Action
It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. MapReduce - Simplied Data Processing on Large Clusters
ElasticSearch parses the input as lines, where every pair of lines contains information about the target index along with additional parameters and a query itself. Mastering ElasticSearch
The program parses the input record and then does a relatively sophisticated emit statement. Interpreting the Data Parallel Analysis with Sawzall
It parses out all the links in every Web page and stores important information about them in an anchors file. The anatomy of a large-scale hypertextual Web search engine
The query server parses the query, applies optimizations and sends the plan to the master for execution. Tenzing - A SQL Implementation On The MapReduce Framework
As we know there is only one file in the cache, we feel safe in using the first index in this array, and pass that to a utility method that parses the file and uses the contents to populate the state abbreviation lookup map. Hadoop Beginners Guide
We can do this using the receive() method, which reads frames from the TCP socket and parses them. ActiveMQ in Action
In the background, the “engine” provides an abstraction layer for the UI based on SQL: the user constructs charts via drag’n’drop operations, they get translated to group-by SQL queries, which the engine parses and processes. PowerDrill - Processing a Trillion Cells per Mouse Click
To recursively list files in the home directory, we can use the command hadoop fs -lsr ... Hadoop Operations and Cluster Management Cookbook
The procedure is invoked recursively on nested records (Line 18) Dremel - Interactive Analysis of WebScale Datasets
There are convenience methods for converting paths into project-relative java.io.File objects, making collections of files, and recursively turning directory trees into file collections. Gradle Beyond The Basics
The -R option recursively sets the replication factor for files and directories within a directory. Hadoop MapReduce Cookbook
PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the Web. The anatomy of a large-scale hypertextual Web search engine
Both files and directories (which are scanned recursively for LZOP files) are supported: Hadoop in Practice
In practice we don’t need to check on all files recursively, and we may restrict ourselves to what’s in our own working directory. Hadoop in Action
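A minimal sketch of a recursive (depth-first) directory traversal using java.nio.file.Files.walk, in the spirit of the recursive listing commands above:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class RecursiveList {
    public static void main(String[] args) throws IOException {
        Path root = Paths.get("."); // current working directory
        // Files.walk performs a depth-first traversal of the directory tree.
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(Files::isRegularFile)
                 .forEach(System.out::println);
        }
    }
}
```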
This can be done by using the -setrep command, which takes a replication factor and an optional flag (-R) to indicate it should operate recursively (see Example 2-5) Hadoop Operations
For example, STOCKS.> means the entry applies to all destinations in the STOCKS path recursively. ActiveMQ in Action
This rewrite can be applied recursively, to support deeper trees. PowerDrill - Processing a Trillion Cells per Mouse Click
However, in the subflow, when selecting the customer, the information about the order should be available because it contains some filter criteria (for example, an international order can be placed only by customers living in certain countries) Pro Spring 3
The other states are decision state, action state, subflow state, and end state. Pro Spring MVC
The start state is a marker state that designates another state (view, action, decision, end, subflow) defined in the flow as being the flow’s starting point. Spring in Practice
The boxes represent various states (the View state, Action state, Decision state, Subflow state, and others), and the arrows represent transitions. Pro Spring Security
Subflow: A subflow state starts a new flow in the context of a flow that is already underway. Spring in Action 4th Edition
The globals section configures the general characteristics of the daemon itself. Monitoring with Ganglia
Configures each subproject by project path; actual configuration happens in the closure. Gradle in Action
In addition, the master node is a critical failure point in a default cluster configuration, which configures only one master node. Hadoop Operations and Cluster Management Cookbook
The build script configures the Project object, and finally the set of tasks to be executed is determined and executed. Gradle Effective Implementation Guide
After validating the authenticity of the lease [R7], the NM configures the environment for the container, including initializing its monitoring subsystem with the resource constraints specified in the lease. Apache Hadoop YARN Yet Another Resource Negotiator
To recover from a failed namenode in this situation, an administrator starts a new primary namenode with one of the filesystem metadata replicas and configures datanodes and clients to use this new namenode. Hadoop:The Definitive Guide 3rd Edition
When Elasticsearch detects a new string field in your documents, it automatically configures it as a full text string field and analyzes it with the standard analyzer. Elasticsearch The Definitive Guide
The driver instantiates, configures, and passes a JobConf object named job to JobClient. Hadoop in Action
Since the tasktracker runs as an unprivileged user (user mapred, by default, in the case of CDH and whatever user the administrator configures in Apache Hadoop), it isn’t possible for it to launch task JVMs as a different user. Hadoop Operations
As you can see, listing 7.2 instantiates the BrokerService and configures the brokerName and dataDirectory properties. ActiveMQ in Action
Whenever you use a class that is not defined in the basic java.lang package, you need to use an import directive. Core Java Volume I Fundamentals 9th Edition
The assumptions are that the hadoopprobook and commons-lang JARs are in the current working directory. Pro Hadoop
Your Java source code has a dependency on a class from the external library Apache Commons Lang. Gradle in Action
In the preceding example, ElasticSearch will take the value of the lang field and will use it as the analyzer for that document. Mastering ElasticSearch
The following snippet shows how we could define and use a trim UDF that uses the Apache Commons Lang StringUtils class. Hadoop:The Definitive Guide 3rd Edition
Minimally, a Java client needs the following JAR files specified in its CLASSPATH, when connecting to HBase: hbase, hadoop-core, zookeeper, log4j, commons-logging, and commons-lang. HBase:The Definitive Guide
For more sophisticated caches, you may need to use java.lang.ref directly. Effective Java 2nd Edition
Using a string reference variable that is assigned a null value will result in a java.lang.NullPointerException. Java 7 New Features Cookbook
Primitive wrapper classes: These classes are conceptually similar to the primitive wrapper classes, such as Integer and Long, found in java.lang. Hadoop Beginners Guide
The lang and country arguments are optional; if omitted, the default locale is used. Programming Hive
There were rumors or promises of many new features, such as online backups and server-level foreign keys for all storage engines, subquery improvements, and thread pooling. High Performance MySQL 3rd Edition
The CLUSTER BY clause in the sub-query specifies the output columns that are hashed on to distribute the data to the reducers, and finally the REDUCE clause specifies the user program to invoke (python wc_reduce.py in this case) on the output columns of the subquery. Hive A Petabyte Scale Data Warehouse Using Hadoop
As mentioned earlier, if the job is to return all the twits for all the users you follow, you can’t rely on a join or subquery to do this, as you can in SQL: HBase in Action
The second subquery calculates the second leg of the route, which comprises the path from the end location to another delivery center elsewhere in the network. Graph Databases
Consider this IN subquery, which finds all the items in the things table that are in the sales table: Hadoop:The Definitive Guide 3rd Edition
A correlated subquery contains a reference to the outer query, and so it must be re-evaluated for each row of the outer query. SQL Antipatterns
A clause is a subquery that can be optional, required, or prohibited. Lucene in Action 2nd Edition
Each subquery of the union query must produce the same number of columns, and for each column, its type must match all the column types in the same position. Programming Hive
A subquery is a complete select query inside a pair of parentheses that is embedded within a conditional expression. Pro JPA2
The optimizer needs quick access to various parts of the WHERE clause, table, field, and key lists, ORDER BY and GROUP BY expressions, subquery structuring, and other data. Understanding MySQL Internals
In the job definition, only one step is defined, which contains one tasklet and chunk process. Pro Spring 3
The chunk element configures the chunk-processing step, in a step element, which is itself in a tasklet element. Spring Batch in Action
The most common way to implement a step is delegating to a tasklet, which defines the activity that must be performed, leaving the proper step implementation to deal with the boilerplate aspects of the execution (maintaining state, sending events, and so on) Spring Integration in Action
Processing and Loading Data from a Database To process and load data from a relational database to HDFS, we need to configure a Spring Batch tasklet with a JdbcItemReader and a HdfsTextItemWriter. Spring Data Modern Data Access for Enterprise Java
A graph mining algorithm might simply output aggregated statistics mined from the graph. Pregel - A System for Large-Scale Graph Processing
The gweb Aggregated Graphs Tab: Aggregate graphs (Figure 4-11) allow you to create composite graphs combining different metrics. Monitoring with Ganglia
That is, an integer; a string; or a small, fixed-size list or object that includes an aggregated value (or values) from the values argument. CouchDB The Definitive Guide
After running gradle build, you’ll find the aggregated HTML test report under the directory build/reports/test. Gradle in Action
Enterprise-level switches often support Link Aggregation Groups (LAG), which bundle multiple interfaces together in an active/active fashion to make a single logical interface that is as fast as the sum of the links aggregated. Cassandra High Performance Cookbook
In all the examples so far, you have aggregated by department. Beginning CouchDB
Solid arcs represent data flowing from the analysis machines to the aggregators; dashed arcs represent the aggregated data being merged, first into one file per aggregation machine and then to a single final, collated output file. Interpreting the Data Parallel Analysis with Sawzall
Good keyword search is often not enough: you need some aggregated data, so you can narrow the results down to what you're interested in. Elasticsearch in Action MEAP V07 cHAPTER 1
However, shared storage services like those described in Section 6.5 require aggregated information as soon as possible during a sudden spike in user activity. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
Read requests are issued on these multiple HFiles in parallel & aggregated for a final result. Apache Hadoop Goes Realtime at Facebook
Make sure that the gora-hbase dependency is available in ivy.xml. Web Crawling and Data Mining with Apache Nutch
It will be helpful to have some knowledge of tools like Ant, Ivy, and Maven; practices like continuous integration and delivery; and concepts like dependency management. Gradle in Action
Maven and Ivy repositories are supported to publish or fetch dependencies. Gradle Effective Implementation Guide
Apache Ivy is a dependency manager with flexibility and configurability. Scala in Action
They also play a key role for non-Maven projects using tools like Ivy and Gradle, both of which rely on standard Maven repositories. Jenkins The Definitive Guide
There are three kinds of repositories currently supported by Gradle: Maven repositories, Ivy repositories, and static directories. Gradle Beyond The Basics
It is straightforward to use another build tool, such as Gradle or Ant with Ivy, as long as you use the same set of dependencies defined here. For building MapReduce jobs you only need to have the hadoop-core dependency, which contains all the Hadoop classes. Hadoop:The Definitive Guide 3rd Edition
The better approach is to let Ivy load the appropriate version from the remote repositories, and then compile Hive normally. HBase:The Definitive Guide
For example, Apache Ivy is used to build Cassandra, and a popular client (Hector) is available via Git. Cassandra The Definitive Guide
Apache Ivy later added badly needed dependency management to Java builds, but still didn’t address the lack of conventions inherent in Ant. Building and Testing with Gradle
Without such an element, significant time could be spent nailing down the right pattern matcher, as it would involve rerunning your test plan several times with various inaccurate expressions, hoping it eventually matches. Performance Testing With JMeter 2.9
HBase solves the problem by using a QueryMatcher in combination with a ColumnTracker, which comes in a few variations: one for explicit matching, for when you specify a list of columns to retrieve, and another that includes all columns. HBase:The Definitive Guide
Then, get a Matcher object from the pattern and call its matches method: Core Java Volume II Advanced Features 9th Edition
Working with strings; specifically, how to perform common operations such as removing parts of a string, matching strings, and more using the CharMatcher and Strings classes. Getting Started with Google Guava
These are defined in the documentation for the Files class' getPathMatcher method. Java 7 New Features Cookbook
Regular expression support is provided by the Pattern and Matcher classes. Java in a Nutshell 5th Edition
By running the output through a pattern matcher to extract the GC times and amount freed and totalling those values, you can identify the cost of GC to your application. Java Performance Tuning 2nd Edition
Use the regular expression Pattern and Matcher classes for a better performing and more versatile matching solution than the String matches() method. Java 7 Recipes A Problem-Solution Approach
In Java 8, you can use the names in the start, end, and group methods of Matcher: Java SE 8 for the Really Impatient
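A minimal sketch of the Pattern/Matcher usage described above, including a named group usable in the group method; the pattern and input string are invented:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatcherDemo {
    public static void main(String[] args) {
        // "(?<word>...)" declares a named group; Java 8 also accepts the name
        // in Matcher's start and end methods.
        Pattern pattern = Pattern.compile("(?<word>\\w+)");
        Matcher matcher = pattern.matcher("find every word here");
        while (matcher.find()) {
            System.out.println(matcher.group("word"));
        }
    }
}
```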
The power of Groovy’s operators can also be leveraged inside AntBuilder calls, such as using the regular expression matcher inside an evaluation of all System properties in Example 3-19 Building and Testing with Gradle
The code that tries to process the information in a file therefore will need to notify the compiler that it can throw some sort of IOException. Core Java Volume I Fundamentals 9th Edition
If no path can be constructed, an IOException will be thrown. Pro Hadoop
You need to create one that matches the signature BufferedReader -> String and that may throw an IOException. Java 8 in Action Lambdas, Streams and Functional-style Programming
Calling seek() with a position that is greater than the length of the file will result in an IOException. Hadoop:The Definitive Guide 3rd Edition
First, the method that is used to retrieve specific values from an HBase table: Result get(Get get) throws IOException. HBase:The Definitive Guide
FileVisitResult visitFileFailed(T path, IOException exc) is called if an exception was thrown in an attempt to get information about the given file. Core Java Volume II Advanced Features 9th Edition
Thus, catching the IOException may be sufficient as illustrated in the following code: Java 7 New Features Cookbook
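The book's original listing is not reproduced here; a minimal sketch of such a broad catch block (the file name is invented):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadWithCatch {
    public static void main(String[] args) {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("data.txt"))) {
            System.out.println(reader.readLine());
        } catch (IOException e) {
            // Catching the broad IOException covers FileNotFoundException
            // and every other I/O failure in one handler.
            System.err.println("I/O failure: " + e.getMessage());
        }
    }
}
```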
An IOException that signals that a specified file cannot be found. Java in a Nutshell 5th Edition
If you try to reset too far back, an IOException is thrown. Java Network Programming 4th Edition
If one thread performs s.close() on an IO object (for example, an InputStream) s, then any other thread attempting to use s (for example, s.read()) will receive an IOException. Concurrent Programming in Java Design Principles and Pattern
The core framework is large, and dozens of portfolio projects extend that core, covering things like security, web flow, SOAP web services (REST web services are part of the core), enterprise integration, batch processing, mobile, various flavors of social (Facebook, LinkedIn, Twitter, GitHub, and so on), various flavors of NoSQL (MongoDB, Neo4j, Riak, and so on), BlazeDS/Flex, AMQP/Rabbit, and many more. Spring in Practice
It’s generally applied to a number of recent nonrelational databases such as Cassandra, Mongo, Neo4J, and Riak. NoSQL Distilled
He has given talks at PyCon India on Storm and Neo4J and has published. Storm Blueprints Patterns for Distributed Real-time Computation
One of the biggest logistics companies in the world uses a graph database in real time to route physical parcels; a major airline is leveraging graphs for its media content metadata; and a top-tier financial services firm has rewritten its entire entitlements infrastructure on Neo4j. Graph Databases
Unlike single node databases like Neo4j or PostgreSQL, Riak does not support ACID transactions. A Little Riak Book
Graph databases such as Neo4J and the set of databases derived at least in part from the design of Google’s Bigtable database (such as MongoDB, HBase, Hypertable, and Redis) all are focused slightly less on Availability and more on ensuring Consistency and Partition Tolerance. Cassandra The Definitive Guide
They were trying to integrate the Neo4j graph database with the Spring Framework and evaluated different approaches. Spring Data Modern Data Access for Enterprise Java
Whether you’re using a document database like MongoDB, a graph database such as Neo4j, or even a traditional relational database, Spring Data offers a simplified programming model for persistence. Spring in Action 4th Edition
Set up cron jobs to prepare static HTML reports at desired intervals. Monitoring with Ganglia
You could use this feature to synchronize database servers within a cluster or between data centers using a job scheduler such as cron, or you could use it to synchronize data with your laptop for offline work as you travel. CouchDB The Definitive Guide
Think of scheduled automation as a time-based job scheduler (in the context of a Unix-based operating system, also known as a cron job) Gradle in Action
It is commonly used to run commands on remote computers typically from non-interactive jobs such as rsync-based backups through cron. Cassandra High Performance Cookbook
One way of doing this would be to set up a cron job that runs at a frequent interval, calling a script that uses curl to replicate your primary database to a database on your backup server. Beginning CouchDB
This can be done with a cron job or through another automated process. Scaling CouchDB
Summary: I showed you one example of the use of the Oozie coordinator, which offers cron-like capabilities to launch periodic Oozie workflows. Hadoop in Practice
For example, you may want all cron jobs to go to a single pool rather than have them run under each individual user’s pool. Hadoop in Action
Obvious tools such as cron may be good enough, but as system complexity increases, you may need to investigate more sophisticated scheduling systems. Hadoop Beginners Guide
AMQ_SCHEDULED_CRON String Use a cron entry to set the schedule. ActiveMQ in Action
A quorum of Chubby locks detects most network partitions and node unavailability. Megastore - Providing Scalable, Highly Available Storage for Interactive Services
However the GFS master is now made fault tolerant using the Chubby[3] abstraction. Cassandra A Decentralized Structured Storage System
Bigtable in turn uses both Chubby[7] (a distributed lock system) and GFS. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
ZooKeeper is the comparable system to Google’s use of Chubby for Bigtable. HBase:The Definitive Guide
Abstract: We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby - The lock service for loosely-coupled distributed systems
In Bigtable there is a master server that controls operations using the Chubby persistent distributed locking mechanism; in Cassandra, all the nodes are on even par with no centralized control, and they communicate using a gossip model. Cassandra The Definitive Guide
Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8] Bigtable - A Distributed Storage System for Structured Data
Running workers write a token into the Chubby lockservice [8] to indicate they belong to the system; other workers can use the existence of this token as a sign that the worker is alive (the token is automatically deleted when the process exits) Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
In plain English, if you flip the parameters of compareTo, the sign (but not necessarily the actual value) of the result must also flip. Core Java Volume I Fundamentals 9th Edition
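The compareTo contract described above is easy to demonstrate: flipping the arguments must flip the sign of the result, though not necessarily its magnitude. A tiny illustration (values are arbitrary):

    public class CompareToContract {
        public static void main(String[] args) {
            Integer a = 3, b = 7;
            int ab = a.compareTo(b);  // negative, because a < b
            int ba = b.compareTo(a);  // positive, because b > a
            // The contract requires sgn(ab) == -sgn(ba); the magnitudes may differ.
            System.out.println(ab + " and " + ba);
            System.out.println(Integer.signum(ab) == -Integer.signum(ba)); // true
        }
    }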
Now if our user wants to narrow down their results to the English-language films category, we need to send the following query: Mastering ElasticSearch
The description provides a useful English description of what the metric actually measures. Beginning CouchDB
You don’t want words that commonly appear in the English language (referred to in text mining circles as stopwords) to be used as features because they’ll appear in every single email. Hadoop in Practice
Then create a new index, specifying that the tweet field should use the english analyzer: Elasticsearch The Definitive Guide
We will close this chapter with a complete example—a retirement calculator with a user interface in English, German, and Chinese. Core Java Volume II Advanced Features 9th Edition
The English language has about one million words, a size that fits comfortably into an iPod, but our word counting program will deal with many unique words not found in any standard English dictionary. Hadoop in Action
It has been said, though it is hard to verify, that the 17th-century English poet John Milton had actually read every published book on the face of the earth. Cassandra The Definitive Guide
Apache Camel uses simple English prepositions, such as from and to, to denote a route. ActiveMQ in Action
The easyB specification file reads like plain English in its execution of web browser driving events and querying of the page responses. Building and Testing with Gradle
You need to extend the Formatter class and override its format method. Core Java Volume I Fundamentals 9th Edition
A StringBuilder and Formatter are ThreadLocal instance variables, making this class thread-safe. Pro Hadoop
You can now represent multiple formatting behaviors by implementing the AppleFormatter interface: Java 8 in Action Lambdas, Streams and Functional-style Programming
Therefore, we will write a Hadoop input formatter to process the e-mail archive. Hadoop MapReduce Cookbook
There are several formatter classes that take these differences into account. Core Java Volume II Advanced Features 9th Edition
As we’ll see below, a Formatter object can send its output to any Appendable object. Java in a Nutshell 5th Edition
The final component in log4j is the layout which acts as a formatter for log events. Hadoop Operations
First of all, you have implemented the MyFormatter class to give a format to the log messages. Java 7 Concurrency Cookbook
Some properties are good candidates for being transient, like cached calculations, or a date formatter that you always instantiate to the same value. Java 7 Recipes A Problem-Solution Approach
When this log level is selected, Gradle automatically provides a richer log formatter, including the timestamp, log level, and logger name of each message. Building and Testing with Gradle
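Several quotes describe extending a Formatter and overriding its format method. A minimal sketch against java.util.logging (the MyFormatter name echoes the Java 7 Concurrency Cookbook quote; the output format itself is arbitrary):

    import java.util.logging.ConsoleHandler;
    import java.util.logging.Formatter;
    import java.util.logging.LogRecord;
    import java.util.logging.Logger;

    // Override format(LogRecord) to control the text emitted per record.
    class MyFormatter extends Formatter {
        @Override
        public String format(LogRecord record) {
            return record.getLevel() + ": " + record.getMessage()
                    + System.lineSeparator();
        }
    }

    public class FormatterDemo {
        public static void main(String[] args) {
            Logger logger = Logger.getLogger("demo");
            logger.setUseParentHandlers(false);      // suppress the default format
            ConsoleHandler handler = new ConsoleHandler();
            handler.setFormatter(new MyFormatter()); // plug in the custom formatter
            logger.addHandler(handler);
            logger.info("hello");                    // prints "INFO: hello"
        }
    }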
Because Lounge allocates a portion of the hash (known as a keyspace) to each node, you can add as many nodes as you like. CouchDB The Definitive Guide
Many cryptographic algorithms suffer from the problem that a small number of keys in the keyspace are considered weak. Pro Spring 3
This randomly distributes the keyspace evenly over the reducers, but still ensures that keys with the same value in different mappers end up at the same reducer. MapReduce Design Patterns
A keyspace is similar to a database in RDBMS where all column families related to the application are stored. NoSQL Distilled
These need to be created so data can be stored in them: [default@unknown] create keyspace testkeyspace. Cassandra High Performance Cookbook
There are more keywords that may be used in the USING clause, including "skewed" (for large datasets with a skewed keyspace) and "merge" (to effect a merge join for inputs that are already sorted on the join key) Hadoop:The Definitive Guide 3rd Edition
Schema-Free: Cassandra requires you to define an outer container, called a keyspace, that contains column families. Cassandra The Definitive Guide
The following is an example of creating an external Hive table that maps to an existing Cassandra keyspace and column family: Programming Hive
The channel names are not related to the Redis keyspace so you don’t have to worry about conflicts with existing keys. Redis Cookbook
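Creating a keyspace, as in the Cassandra CLI quote above, can also be done programmatically. A hedged sketch using the DataStax Java driver (assumes a 2.x/3.x driver on the classpath and a node on localhost; the keyspace name and replication settings are illustrative):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CreateKeyspace {
        public static void main(String[] args) {
            // Cluster and Session are Closeable in driver 2.x and later.
            try (Cluster cluster = Cluster.builder()
                                          .addContactPoint("127.0.0.1")
                                          .build();
                 Session session = cluster.connect()) {
                session.execute(
                    "CREATE KEYSPACE IF NOT EXISTS testkeyspace WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            }
        }
    }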
The Hadoop DFS metrics provide information about data block operations (read, removed, replicated, verify, written), verification failures, bytes read and written, volume failures, and local/remote client reads and writes. Monitoring with Ganglia
The dfs.data.dir property specifies the location of the data directory on the host Linux filesystem. Hadoop Operations and Cluster Management Cookbook
The number of mappers launched is roughly equal to the input size divided by dfs.block.size (the default block size is 64 MB) Hadoop RealWorld Solutions Cookbook
Finally, shut down the HDFS cluster using the following command: >bin/stop-dfs.sh. Hadoop MapReduce Cookbook
Check the value of dfs.block.size to see what it’s set to in your cluster. Hadoop in Practice
We set the second property, dfs.replication, to 1 so that HDFS doesn’t replicate filesystem blocks by the default factor of three. Hadoop:The Definitive Guide 3rd Edition
The dfs stands for Distributed Frequency Search and it tells Elasticsearch to first retrieve the local IDF from each shard in order to calculate the global IDF. Elasticsearch The Definitive Guide
Some older documentation shows file utilities in the form of hadoop dfs -cmd Hadoop in Action
The dfs.replication variable specifies how many times each HDFS block should be replicated. Hadoop Beginners Guide
For this reason, the underlying disks need not be part of a RAID group, although some administrators choose to do so and forego specifying multiple directories in dfs.name.dir (although an NFS mount should still be used, no matter what) Hadoop Operations
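The dfs.* names quoted above are ordinary Hadoop configuration keys. A small sketch of setting them from Java (the values mirror the quotes, a replication factor of 1 and a 64 MB block size; the property names are the pre-Hadoop-2 spellings used in these books):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsConfigDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Don't replicate blocks; use an explicit 64 MB block size.
            conf.set("dfs.replication", "1");
            conf.set("dfs.block.size", String.valueOf(64 * 1024 * 1024));
            FileSystem fs = FileSystem.get(conf);
            System.out.println(fs.exists(new Path("/")));
        }
    }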
In addition, federated NameNodes can isolate namespaces for different applications or situations. Hadoop Operations and Cluster Management Cookbook
With multiple NameNodes, different categories of applications and users can be isolated to different namespaces. Professional Hadoop Solutions
Compared to P2P storage systems that only support flat namespaces, distributed file systems typically support hierarchical namespaces. Cassandra A Decentralized Structured Storage System
Hadoop version 2.x supports HDFS Federation, where HDFS is partitioned into multiple distinct namespaces, with each independently managed by a separate NameNode. Hadoop in Practice
Namespace volumes are independent of each other, which means namenodes do not communicate with one another, and furthermore the failure of one namenode does not affect the availability of the namespaces managed by other namenodes. Hadoop:The Definitive Guide 3rd Edition
Our near-term solution to scalability is to allow multiple namespaces (and NameNodes) to share the physical storage within a cluster. The Hadoop Distributed File System
Data is organized into buckets, which are little more than flat namespaces for logically grouping key-value pairs. Cassandra The Definitive Guide
In Example 5-16, we generate a one gigabyte file and copy it into the two separate filesystem namespaces. Hadoop Operations
You’ll see how to configure ActiveMQ using both pure Spring XML and custom namespaces, as well as the standard ActiveMQ XML via Apache XBean (http://mng.bz/EAfX) ActiveMQ in Action
When you use this feature, Hive puts the key-value pair in the hivevar “namespace” to distinguish these definitions from three other built-in namespaces, hiveconf, system, and env. Programming Hive
The way MongoDB accomplishes this is by keeping a log of operations, or oplog, containing every write that a primary performs. MongoDB The Definitive Guide 2nd Edition
We’ll also look at the oplog and how it fits in with the communication between members of a replication topology. The Definitive Guide to MongoDB
Each member of a replica set logs all its writes to a special capped collection called oplog.rs. MongoDB in Action
You'll see two compressed tar archives, available along with checksum and gpg signature files used to verify the archives. Apache Flume Distributed Log Collection for Hadoop
When it finds a footer pair, it makes some checks: if the first 2k are corrupt (a footer includes a checksum), CouchDB replaces it with the second footer and all is well. CouchDB The Definitive Guide
Gradle detects if an artifact was changed in the repository by comparing its local and remote checksum. Gradle in Action
Current doc status, a pointer into the repository, a doc checksum, and various statistics. The anatomy of a large-scale hypertextual Web search engine short
The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. The anatomy of a large-scale hypertextual Web search engine
Chubby also exposes a 64-bit file-content checksum so clients may tell whether files differ. Chubby - The lock service for loosely-coupled distributed systems
To avoid this problem we had to change the format of the edits log to have a transaction length, transaction id and the checksum for each transaction written to the file. Apache Hadoop Goes Realtime at Facebook
This performs a checksum against the database and compares checksums of peers; if the checksums differ, then the data is exchanged. Cassandra The Definitive Guide
By default, there is a checksum for each 512-byte chunk within the block. Hadoop Beginners Guide
Figure 2: Example usage of the Percolator API to perform basic checksum clustering and eliminate documents with the same content. Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
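The Hadoop Beginners Guide quote notes a checksum per 512-byte chunk within a block. A loose illustration of per-chunk checksumming with java.util.zip.CRC32 (HDFS's actual checksum implementation differs; this only mirrors the chunking idea):

    import java.util.zip.CRC32;

    public class ChunkChecksum {
        // One CRC32 per fixed-size chunk, loosely mirroring how HDFS
        // checksums 512-byte chunks within a block.
        public static long[] checksums(byte[] data, int chunkSize) {
            int n = (data.length + chunkSize - 1) / chunkSize;
            long[] sums = new long[n];
            CRC32 crc = new CRC32();
            for (int i = 0; i < n; i++) {
                crc.reset();
                int off = i * chunkSize;
                crc.update(data, off, Math.min(chunkSize, data.length - off));
                sums[i] = crc.getValue();
            }
            return sums;
        }

        public static void main(String[] args) {
            byte[] data = new byte[1300]; // three chunks: 512, 512, 276 bytes
            for (long s : checksums(data, 512)) {
                System.out.println(Long.toHexString(s));
            }
        }
    }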
After all, what can go wrong? Unfortunately, with modern processors and compilers, there is plenty of room for error. Core Java Volume I Fundamentals 9th Edition
In these times of nearly continuous performance improvements of compilers and JVMs, performance measurements are only of transient value. A Java ForkJoin Framework
It’s typically useful when building compilers or engines for processing business rules. Java 8 in Action Lambdas, Streams and Functional-style Programming
Similar to compilers in traditional databases, the Hive compiler processes HiveQL statements in the following steps: Hive A Petabyte Scale Data Warehouse Using Hadoop
Balancing Symbols: Compilers check your programs for syntax errors, but frequently a lack of one symbol (such as a missing brace or comment starter) will cause the compiler to spill out a hundred lines of diagnostics without identifying the real error. Data Structures and Algorithm Analysis in Java 3rd Edition
Avro compilers come in two flavors: they can either compile a protocol file or a schema file. Hadoop in Practice
As with many modern compilers, the Pig compiler can reorder the execution sequence to optimize performance, as long as the execution plan remains logically equivalent to the original program. Hadoop in Action
Reflection is also appropriate for use in remote procedure call (RPC) systems to eliminate the need for stub compilers. Effective Java 2nd Edition
To achieve this goal, it is essential to use an IDE that integrates under one common interface several tools (compilers and debuggers) that facilitate the development of applications. Java 7 Concurrency Cookbook
This eliminates the possibility of some common errors, and also enables compilers to automatically and safely perform some optimizations that would need to be performed manually in C. Concurrent Programming in Java Design Principles and Pattern
What started as a refactoring branch eventually became the main line of development as Flume 1.X. Apache Flume Distributed Log Collection for Hadoop
At Acme, developers do all their coding within the IDE, from navigating through the source code, implementing new features, and compiling and refactoring code, to running unit and integration tests. Gradle in Action
The HFile refactoring into a reader and writer hierarchy allows for significant improvements in code maintainability. Professional Hadoop Solutions
The API is constantly being enhanced with new features, bug fixes, and occasional refactoring. Hadoop RealWorld Solutions Cookbook
The Example Plug-In: In this chapter, we will create a Gradle plug-in to automate the use of the open-source database refactoring tool, Liquibase. Gradle Beyond The Basics
Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce. Hadoop:The Definitive Guide 3rd Edition
This refactoring better aligns the configuration settings to the subsystem of Hadoop that they control. Hadoop in Action
If a JMS client uses such additional features, this client may not be portable to another JMS provider without a refactoring effort. ActiveMQ in Action
When the representations and behavior of one class can be partitioned into independent, noninteracting, or just non-conflicting subsets, it is almost always worth refactoring the class to use distinct finer-granularity helper objects whose actions are delegated by the host. Concurrent Programming in Java Design Principles and Pattern
Only the modest addition of the useTestNG() call to the test closure is needed in addition to the refactoring of the unit test class. Building and Testing with Gradle
The default configuration provides for only local access to the managed beans (MBeans) Pro Hadoop
From the window, we can check the memory usage, threads, classes, summary of JVM, and details of MBeans. Hadoop Operations and Cluster Management Cookbook
Cassandra provides access to this information through standard JMX MBeans (Java Management eXtension Managed Bean) Cassandra High Performance Cookbook
Hadoop includes several managed beans (MBeans), which expose Hadoop metrics to JMX-aware applications. Hadoop:The Definitive Guide 3rd Edition
There is a loose overlap between the metric context, as provided by the metrics framework, and the MBeans exposed over JMX. HBase:The Definitive Guide
MBeans and the platform MBean server will be discussed in the next chapter. Oracle JRockit The Definitive Guide
Overview of JMX and MBeans: In this section, we explore how Cassandra makes use of Java Management Extensions (JMX) to enable remote management of your servers. Cassandra The Definitive Guide
Internal MBeans in Hadoop rely on a metric plug-in that has an update thread running to collect data from the system. Hadoop Operations
These interfaces consist of management beans, usually called MBeans, which expose resources of your application to external management applications. ActiveMQ in Action
Some applications also build-in monitoring and management capabilities using MBeans via Java SE’s monitoring and management APIs. Java Performance
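The recurring pattern in these quotes is exposing application state as an MBean so JMX-aware tools can inspect it. A minimal sketch using the platform MBean server (the Counter bean and its ObjectName are invented for illustration; the standard MBean convention requires the interface to be named after the class plus "MBean"):

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Interface name = implementation class name + "MBean".
    interface CounterMBean {
        long getCount();
        void increment();
    }

    class Counter implements CounterMBean {
        private long count;
        public synchronized long getCount() { return count; }
        public synchronized void increment() { count++; }
    }

    public class MBeanDemo {
        public static void main(String[] args) throws Exception {
            MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
            // Domain and key naming here is arbitrary.
            ObjectName name = new ObjectName("com.example:type=Counter");
            mbs.registerMBean(new Counter(), name);
            // Keep the JVM alive so a JMX console (e.g., JConsole) can attach.
            Thread.sleep(Long.MAX_VALUE);
        }
    }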
The responsibility of the Namenode is to manage the filesystem metadata and direct clients to the Datanodes that hold the actual data. Web Crawling and Data Mining with Apache Nutch
Since MapReduce servers (task-trackers) share nodes with HDFS datanodes, map output inevitably competes with HDFS reads. HDFS scalability the limits to growth
Namenodes and Datanodes An HDFS cluster has two types of nodes operating in a master-worker pattern: a namenode (the master) and a number of datanodes (workers) Hadoop:The Definitive Guide 3rd Edition
It has locations of data blocks stored in several datanodes and often runs on the same server as the jobtracker in smaller clusters. Cassandra The Definitive Guide
Storage can be added to a cluster by adding more datanodes with additional disk capacity, or even adding disks to existing datanodes. Hadoop Operations
It is also one of the biggest contributors to the Hadoop codebase. Hadoop Operations and Cluster Management Cookbook
To tell the truth, it was much harder not only because of more advanced topics covered in this book, but also because of the constantly introduced changes in the ElasticSearch codebase. Mastering ElasticSearch
Because of high performance, a small codebase, and compact resulting data, there is a wide adoption of Avro not only in the Hadoop community, but also by many other NoSQL implementations (including Cassandra) Professional Hadoop Solutions
Familiarity with the Pregel design will help to explain many components found in the Giraph codebase. Hadoop RealWorld Solutions Cookbook
Due to the number of moving parts and the size and scope of the codebase and deployment, debugging services like universal search (described earlier in Section 1) is very challenging. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
That raises the question: what’s the best approach to integrate sampling support into an existing codebase? One approach would be to add a configurable option that would toggle the use of the sampling InputFormat, similar to the following code: Hadoop in Practice
The shuffle is an area of the codebase where refinements and improvements are continually being made, so the following description necessarily conceals many details (and may change over time; this is for version 0.20) Hadoop:The Definitive Guide 3rd Edition
These benchmarks are available as part of the Hadoop codebase. The Hadoop Distributed File System
Powerset was acquired in July 2008, and after a couple of months during which Powerset employees were disallowed from contributing while Microsoft’s legal department vetted the HBase codebase to see if it impinged on SQLServer patents, we were allowed to resume contributing (I was a Microsoft employee working near full time on an Apache open source project) HBase:The Definitive Guide
By unifying the code paths, the codebase was simplified and files were kept as small as possible. Apache Hadoop Goes Realtime at Facebook
This is not exactly how Craigslist works (because we don’t know how Craigslist works), but it is a close enough approximation to illustrate problems with benchmarking. CouchDB The Definitive Guide
Benchmarking of a Hadoop cluster is the first step to tune the performance of a Hadoop cluster. Hadoop Operations and Cluster Management Cookbook
We demonstrate through benchmarking that our system is effective at transforming logical computations into efficient programs. FlumeJava - Easy, Efficient Data-Parallel Pipelines
Stress is an easy-to-use command-line tool for stress testing and benchmarking Cassandra. Cassandra High Performance Cookbook
Generic benchmarking can be of some use, but a test specifically designed for your system will be more useful. Scaling CouchDB
Benchmarking HDFS: Running benchmarks is a good way to verify whether your HDFS cluster is set up properly and performs as expected. Hadoop MapReduce Cookbook
Benchmarking a Hadoop Cluster: Is the cluster set up correctly? The best way to answer this question is empirically: run some jobs and confirm that you get the expected results. Hadoop:The Definitive Guide 3rd Edition
Only after real-world usage in your environment and some performance benchmarking with the aid of heap analysis tools and observation of your specific application’s behavior should you dive into tuning the more advanced JVM settings. Cassandra The Definitive Guide
The results of benchmarking LLVM vs vector on some typical aggregation queries are shown in table 3 Tenzing - A SQL Implementation On The MapReduce Framework
After benchmarking a particular cluster size, we deleted the data directories on each node, reformatted and reloaded HDFS to ensure uniform data distribution across all nodes. HadoopDB An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
If the system obtains an IP address dynamically, it is possible that DHCP is not completed before the attempt to start gmond, and so gmond fails to run. Monitoring with Ganglia
In an intranet, a network administrator may assign a specific address to your machine, or you may have a dynamically assigned address allocated by a Dynamic Host Configuration Protocol (DHCP) server. Java Network Programming and Distributed Computing
Only the first NIC of the VM will get the default gateway from the DHCP server. bk-compute-adminguide-grizzly
We will learn steps to configure this server, including the configuration of the following two services: DHCP and FTP. Hadoop Operations and Cluster Management Cookbook
FlatDHCP builds upon the Flat manager by providing DHCP services to handle instance addressing and creation of bridges. Deploying OpenStack
Tracks IP address leases and records them in the database by using the dnsmasq dhcp-script facility. bk-admin-guide-cloud-latest
When booting, the kernel will try to use the DHCP protocol, if available, to configure networking. Oracle JRockit The Definitive Guide
High Availability for the DHCP agent is achieved by adopting Pacemaker. high-availability-guide
Others, especially clients on local area networks and wireless connections, receive a different address every time they boot up, often provided by a DHCP server. Java Network Programming 4th Edition
You should be familiar with concepts such as DHCP, Linux bridges, VLANs, and iptables. OpenStack Operations Guide
We recommend limiting this user's privileges to prevent it from creating or altering tables in the metastore database schema. HBase Administration Cookbook
Solution You can take advantage of the built-in Sqoop metastore that allows you to save all parameters for later reuse. Apache Sqoop Cookbook
To support this abstraction, Hive provides a metastore in the form of a relational database that enables you to capture dependencies between actual physical files (and HBase tables) and tables (virtual) used to access this data. Professional Hadoop Solutions
Following this, the script issues a CREATE command with the EXTERNAL keyword, which tells the Hive Metastore that the data is not managed by the Hive warehouse in HDFS. Hadoop RealWorld Solutions Cookbook
Hive also includes a system catalog, the Metastore, that contains schemas and statistics, which are useful in data exploration, query optimization and query compilation. Hive A Petabyte Scale Data Warehouse Using Hadoop
This merely saves the notion of this command as a job in something called the Sqoop metastore. Hadoop in Practice
The Apache HCatalog project (http://incubator.apache.org/hcatalog/) solves this problem by providing a table metadata service, based on Hive’s metastore, so that Pig queries can reference schemas by name, rather than specifying them in full each time. Hadoop:The Definitive Guide 3rd Edition
To support these additional features, a new and important component of Hive is a metastore for storing schema information. Hadoop in Action
Another option is to use Sqoop’s saved job feature, which stores both connection credentials and incremental state information in a directory on the local filesystem of the client machine in what is called the metastore. Hadoop Operations
The Metastore is a separate relational database (usually a MySQL instance) where Hive persists table schemas and other system metadata. Programming Hive
To install gweb on a Debian-based Linux distribution, execute the following command as either root or user with high privilege: Monitoring with Ganglia
The Cloudera folks have Amazon machine images (AMIs), Debian and RPM installer files, and an online configuration tool to generate configuration files. Pro Hadoop
Debian GNU/Linux: You can install the CouchDB package by running: CouchDB The Definitive Guide
If you are using a Linux distribution other than CentOS, such as Debian, please refer to corresponding documentation. Hadoop Operations and Cluster Management Cookbook
At the time of writing this book, the current release is available as a Debian package. Gradle Effective Implementation Guide
Fuse development files can be found in the fuse-devel RPM for Redhat/Fedora and in the libfuse-dev package for Debian/Ubuntu. Hadoop MapReduce Cookbook
The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs interoperability tests on stacks of Hadoop components and provides Linux packages (RPMs and Debian packages) for easy installation. Hadoop:The Definitive Guide 3rd Edition
When installing Elasticsearch in production, you can use the method described above, or the Debian or RPM packages provided on the downloads page. Elasticsearch The Definitive Guide
You can simplify your deployment of Hadoop by using their free Hadoop distribution, in either RPM or Ubuntu/Debian packages. Hadoop in Action
Starting around version 1.0, support for building RPM and Debian packages was added to the build system, and later releases provide these artifacts for download. Hadoop Operations
The next step is to create an instance of the RMI lightbulb service defined by the RMILightBulbImpl class. Java Network Programming and Distributed Computing
Putting it all together, the complete RelationCountImpl is shown next. HBase in Action
To do this globally in an application, it suffices to remap the implementation for file URIs by setting the property fs.file.impl to the value org.apache. Hadoop:The Definitive Guide 3rd Edition
This code returns PartitionPlanImpl, a convenient basic implementation of the PartitionPlan interface. Java EE 7 Essentials
The WarehouseImpl constructor is declared to throw a RemoteException because the superclass constructor can throw that exception. Core Java Volume II Advanced Features 9th Edition
SocketImplFactory objects may be registered to create SocketImpl objects for the Socket and ServerSocket classes. Java in a Nutshell 5th Edition
This depends heavily on the native UDP software in the OS and the native code that interfaces between this and Java’s DatagramSocketImpl class. Java Network Programming 4th Edition
You can almost do this generically using the SocketImplFactory, but if you install your own SocketImplFactory, there is no protocol to allow you to access the default socket implementation, so another way must be used. Java Performance Tuning 2nd Edition
For example, assuming that we have an implementation class, say TankImpl, we can write the following AdaptedTank class. Concurrent Programming in Java Design Principles and Pattern
First, a new JoinRowSet is created by instantiating a JoinRowSetImpl() object: Java 7 Recipes A Problem-Solution Approach
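The Core Java quote explains why WarehouseImpl's constructor declares RemoteException: the UnicastRemoteObject superclass constructor can throw it. A stripped-down sketch of that conventional *Impl pattern (the Warehouse name follows the book's example; the price data is illustrative):

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.server.UnicastRemoteObject;

    // The remote interface: every method must declare RemoteException.
    interface Warehouse extends Remote {
        double getPrice(String description) throws RemoteException;
    }

    // The constructor must declare RemoteException because the
    // UnicastRemoteObject constructor can throw it.
    class WarehouseImpl extends UnicastRemoteObject implements Warehouse {
        WarehouseImpl() throws RemoteException {}

        public double getPrice(String description) throws RemoteException {
            return "toaster".equals(description) ? 24.95 : 0.0; // sample data
        }
    }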
Alternatively, you can also use Redis as a cache, storing recently posted messages in Redis, and older (rarely read) messages in a primarily on-disk storage server (like PostgreSQL, MySQL, Riak, MongoDB, and so on) Redis in Action
Riak is a Dynamo-inspired key/value store that offers fault tolerance out of the box. RabbitMQ in Action
The core framework is large, and dozens of portfolio projects extend that core, covering things like security, web flow, SOAP web services (REST web services are part of the core), enterprise integration, batch processing, mobile, various flavors of social (Facebook, LinkedIn, Twitter, GitHub, and so on), various flavors of NoSQL (MongoDB, Neo4j, Riak, and so on), BlazeDS/Flex, AMQP/Rabbit, and many more. Spring in Practice
It’s generally applied to a number of recent nonrelational databases such as Cassandra, Mongo, Neo4J, and Riak. NoSQL Distilled
Links and Walking The Riak key-value store allows each of its stored values to be augmented with link metadata. Graph Databases
Riak is an open-source, distributed key/value database for high availability, fault-tolerance, and near-linear scalability. A Little Riak Book
This design is observed by Cassandra as well as by straight key/value stores such as Project Voldemort and Riak. Cassandra The Definitive Guide
Although Dynamo is a proprietary system, the ideas used to build it have inspired many systems falling under the NoSQL umbrella, including Cassandra, Project Voldemort, and Riak. MongoDB in Action
What it lacks in complex querying functionality (like that found in Riak or MongoDB), it makes up for in speed and efficiency. Spring Data Modern Data Access for Enterprise Java
Fortunately, JSON doesn’t care about whitespace, and the compact nature of the output keeps the size of the responses down, but isn’t very human-readable. Getting Started with CouchDB
Now, the actual result is not as nicely formatted and doesn’t include any superfluous whitespace or newlines, but this is better for you (and us!) to read and understand. CouchDB The Definitive Guide
For example, the following code will result in a request that checks how the analysis will be performed for the ElasticSearch Servers phrase in the library index using the whitespace tokenizer and the nGram filter: Mastering ElasticSearch
The call to the strip() method removes any leading/trailing whitespace, and then we tokenize it into an array of named variables. Hadoop RealWorld Solutions Cookbook
In the previous function, I am taking the content field of the document, lowercasing it, and replacing any nonalphabetical characters with a space, before tokenizing the document into an array of words, split by a single whitespace token. Beginning CouchDB
We already covered this topic in section 13.2.4, but I want to reemphasize that a high percentage of problems you’ll encounter in your job are due to unexpected input, and can be as simple an issue as leading or trailing whitespace characters that cause parsing issues. Hadoop in Practice
The script must accept a variable number of arguments that are the hostnames or IP addresses to be mapped, and it must emit the corresponding network locations to standard output, separated by whitespace. Hadoop:The Definitive Guide 3rd Edition
A simple tokenizer might split the text up into terms whenever it encounters whitespace or punctuation. Elasticsearch The Definitive Guide
Tokenization is based purely on whitespace characters and not punctuation marks, making States, States., and States: separate words. Hadoop in Action
Unlike a property file, however, whitespace is not permitted between the equals sign and its key or value. Hadoop Operations
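Whitespace-only tokenization, as the Hadoop in Action quote notes, keeps punctuation attached to words. A small illustration, including the trim that guards against the leading/trailing-whitespace bugs mentioned in the Hadoop in Practice quote:

    public class WhitespaceTokenize {
        public static void main(String[] args) {
            String line = "  States States. States:  ";
            // Trim first so leading/trailing whitespace doesn't yield
            // empty tokens.
            String[] tokens = line.trim().split("\\s+");
            for (String t : tokens) {
                System.out.println("[" + t + "]");
            }
            // Splitting on whitespace alone keeps punctuation attached, so
            // "States", "States." and "States:" remain distinct tokens.
        }
    }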
HBase servers, particularly the version using memcached, are memory intensive and generally require at least a gigabyte of real memory per server; any paging will drastically affect performance. Pro Hadoop
Slow Paging (Do Not Use): Don’t use this method! We just show it because it might seem natural to use, and you need to know why it is a bad idea. CouchDB The Definitive Guide
Other alerting methods such as SMS messages and paging, are also available. Hadoop Operations and Cluster Management Cookbook
Paging: When the result list is large, it is necessary to use paging to limit the number of documents fetched in a single request and go through the subsequent ones. Mastering ElasticSearch
Create a scanner over all users whose username is less than I, paging one cell at a time. HBase in Action
Beware of paging too deep or requesting too many results at once. Elasticsearch The Definitive Guide
In extreme cases, such memory leaks can cause disk paging and even program failure with an OutOfMemoryError, but such failures are relatively rare. Effective Java 2nd Edition
Adding RAM will help you if you find the OS doing a lot of paging on reads (in general, it is better to enable the various caches Cassandra has) Cassandra The Definitive Guide
The default queue configuration uses a separate thread for paging messages from the message store into the queue to be dispatched to interested message consumers. ActiveMQ in Action
Improved Cache Heuristics: It is a known problem in disk-cache / paging algorithms. PowerDrill - Processing a Trillion Cells per Mouse Click
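The HBase in Action quote pages through a scan one cell at a time. A hedged sketch against the HBase 0.9x-era client API (the users table is the quote's example; setBatch(1) limits each step of the scanner to a single cell):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PagedScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users"); // the quote's example table
            try {
                Scan scan = new Scan();
                scan.setStopRow(Bytes.toBytes("I"));  // usernames less than "I"
                scan.setBatch(1);                     // page one cell at a time
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                } finally {
                    scanner.close();
                }
            } finally {
                table.close();
            }
        }
    }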
The view definition below shows a new view that emits a row of data based on the ingredient text by iterating over the ingredient list from each recipe document, and emitting each ingredient name. Getting Started with CouchDB
If your module requires a simple string value, referencing the module_params string from the mmodule structure is much more convenient than iterating through an APR array of name/value pairs. Monitoring with Ganglia
Perhaps it is programmed to stop iterating when it sees a particular user’s document or is based on a tally it’s been keeping of some resource allocated in the configuration. CouchDB The Definitive Guide
This is because you are now iterating over all values that have that key, instead of just one at a time. MapReduce Design Patterns
Iterating all the columns of a large key: In some designs, a particular row key may have a large number of associated columns. Cassandra High Performance Cookbook
Running your MapReduce application against the dataset may take hours, and constantly iterating with code refinements and rerunning against it isn’t an optimal workflow. Hadoop in Practice
And in this case, the output types of the reduce function are Text and IntWritable, for a year and its maximum temperature, which we find by iterating through the temperatures and comparing each with a record of the highest found so far. Hadoop:The Definitive Guide 3rd Edition
When iterating over super columns in your mapper, each IColumn would need to be cast to a SuperColumn, and it would contain nested column information. Cassandra The Definitive Guide
In other words, when “synchronously” iterating over all columns, the original rows can be reconstructed. PowerDrill - Processing a Trillion Cells per Mouse Click
Iterating through all dependencies can take simultaneous advantage of the dependencies object and the Groovy each method that accepts a closure. Building and Testing with Gradle
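The Hadoop: The Definitive Guide quote describes the canonical max-temperature reduce function: iterate over all values for a key and keep the highest seen so far. A sketch of that reducer:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // For each year (key), iterate the temperatures (values) and emit
    // the maximum found.
    public class MaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int max = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                max = Math.max(max, value.get());
            }
            context.write(key, new IntWritable(max));
        }
    }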
Another set of properties you may find useful allows for rounding down event times at an hour, minute, or second granularity while still maintaining those elements in file paths. Apache Flume Distributed Log Collection for Hadoop
The granularity of the data blocks can affect the distribution and parallel execution of the tasks. Hadoop Operations and Cluster Management Cookbook
We can also set the maximum and minimum granularity of the log messages with the minGranularity and maxGranularity properties. Gradle Effective Implementation Guide
Ultimately, HoD had too little information to make intelligent decisions about its allocations, its resource granularity was too coarse, and its API forced users to provide misleading constraints to the resource layer. Apache Hadoop YARN Yet Another Resource Negotiator
Although, technically, updates can be implemented as overwrites, the granularity of such an approach (overwrite will work only on the file level) can be cost-prohibitive in most cases. Professional Hadoop Solutions
To achieve efficiency, Incoop relies on memoization, but goes beyond the straightforward task-level application of this technique by performing a stable partitioning of the input and by reducing the granularity of tasks to maximize result reuse. Incoop MapReduce for Incremental Computations
However, they tend to be at the wrong level of granularity for creating sophisticated, highly composable code that can be shared among different developers. Hadoop:The Definitive Guide 3rd Edition
We are currently implementing an object-granularity clone operation to create object copies with copy-on-write behavior for efficient storage utilization, and are extending the RADOS client interface to allow transparent versioning for logical point-in-time copies across sets of objects (i. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
Hadoop is unable to efficiently divide this task at a finer granularity. Hadoop in Action
Users and groups may be granted permission to perform any combination of the above operations at the table, column family, or column qualifier granularity. Hadoop Operations
End-to-end application which makes use of the new clientjar option in the JAX-WS wsimport tool. Gradle Effective Implementation Guide
The jaxws-api.jar file is required for JAX-WS support in Spring remoting, and jta.jar is used for JTA transaction support. Pro Spring 3
For the APIs exposed via the public Internet, you can implement an HBase-based real-time application by using REST APIs and leveraging one of the JAX-RS frameworks (for example, RestEasy or Jersey) Professional Hadoop Solutions
The Java API for XML Web Services (JAX-WS) and the Java API for XML-based RPC (JAX-RPC) both provide support for web service calls using the HTTP protocol. Java EE 7 Specification
Using JAX-RS annotations, we decorate each extension class to indicate to the server which HTTP requests it handles. Graph Databases
Without the Client API (introduced in JAX-RS 2), developers are required to use basic HttpUrlConnection APIs and write all the surrounding code. Java EE 7 Essentials
The second part of this chapter focuses on Web services performance, which starts with a brief look at the implementation of the Java API for XML Web Services (JAX-WS) Reference Implementation. Java Performance
In doing so, you’ll explore various remoting technologies, including RMI, Hessian/ Burlap, and SOAP web services with JAX-WS. Spring in Action 4th Edition
To prove the point: The Java date library is a bit muddled, and a major redesign is underway; see http://jcp.org/en/jsr/detail?id=310. Core Java Volume I Fundamentals 9th Edition
If tryItOut returns without raising an exception, control is transferred to the finally block using a jsr instruction. The Java Virtual Machine Specification Java SE 7 Edition
This document is the public review version of the JSR-133 specification, the Java Memory Model (JMM) and Thread Specification. JSR-133 Java Memory Model and Thread
This scenario was discussed at length in the Servlet expert group (JSR-53), but we were unable to reach consensus on the appropriate solution. Java EE 7 Specification
No one deserves more credit for this version of the Java programming language than he - but any blame for its deficiencies should be directed at myself and the members of the many JSR Expert Groups! The Java Language Specification Java SE 7 Edition
Thanks to my colleagues at Oracle and the different JSR specification leads for explaining the intended use cases of different technologies. Java EE 7 Essentials
For more information about using Java bytecode for dynamic languages, please refer to Java Specification Request (JSR) 292 on the Internet. Oracle JRockit The Definitive Guide
It's important to mention that Guava has only one dependency, JSR-305 Getting Started with Google Guava
Customize Liferay portal using JSR-286 portlets, hooks, themes, layout templates, web plugins, and diverse portlet bridges. Java 7 New Features Cookbook
One of the issues addressed with JSR-133 is the use of the Java keyword volatile. Java Performance
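The Java Performance quote ties JSR-133 to the volatile keyword, whose key guarantee is visibility: a write to a volatile field is visible to any thread that subsequently reads it. A small illustration (the stop-flag pattern is a standard example, not drawn from any one of the books above):

    public class StopFlag {
        // Under the JSR-133 memory model, writes to a volatile field are
        // guaranteed visible to other threads that later read it.
        private volatile boolean running = true;

        public void stop() { running = false; }

        public void run() throws InterruptedException {
            Thread worker = new Thread(() -> {
                while (running) {
                    // do work; without volatile, this loop might never
                    // observe the update made by stop()
                }
            });
            worker.start();
            Thread.sleep(100);
            stop();
            worker.join();
        }

        public static void main(String[] args) throws InterruptedException {
            new StopFlag().run();
        }
    }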
Remember to include standard fields such as timestamps (created, updated), and status so that you can organize and identify the content. Getting Started with CouchDB
Because validation functions are run at replication time as well as during normal client access, we can’t require that timestamps be set close to the server’s system time. CouchDB The Definitive Guide
Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time; old versions of data are subject to configurable garbage-collection policies; and applications can read data at old timestamps. Spanner - Google’s Globally-Distributed Database
When two columns for a row key have the same name, the timestamps of the columns are compared and the value of the column with the highest timestamp is the final value. Cassandra High Performance Cookbook
A write operation in Dynamo also requires a read to be performed for managing the vector timestamps. Cassandra A Decentralized Structured Storage System
In this way, we have a lower and upper bound for the span timestamps on the server side of RPCs. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
This information is stored as metadata in each HFile and queries that ask for a specific timestamp or range of timestamps will check if the request intersects with the ranges of each file, skipping those which do not overlap. Apache Hadoop Goes Realtime at Facebook
Instead of relying on timestamps, causal consistency instead takes a more semantic approach, attempting to determine the cause of events to create some consistency in their order. Cassandra The Definitive Guide
In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1 Bigtable - A Distributed Storage System for Structured Data
The timestamp oracle provides strictly increasing timestamps: a property required for correct operation of the snapshot isolation protocol. Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
Alternatively, you can overlay a timeline to indicate the duration of a particular event. Monitoring with Ganglia
For more information on its timeline, please refer to the roadmap’s dashboard (http://www.gradle.org/roadmap) Gradle in Action
You can now view trend reports and see a timeline of success and failure. Jenkins Continuous Integration Cookbook
From a machine point of view, the most natural format to model time is with a single large number representing a point on a continuous timeline. Java 8 in Action Lambdas, Streams and Functional-style Programming
Sorted by time, it can provide a timeline view on the data. MapReduce Design Patterns
It primarily retrieves events by looking up all logs containing a keyword from a range of time. To make this efficient, the event timeline is denormalized to produce one timeline per keyword. Cassandra High Performance Cookbook
Figure 6.13 shows the reduce task timeline with the units of work and potential areas impacting performance. Hadoop in Practice
Start by summarizing the problem and the end result, and then walk through the timeline of events, calling out anything interesting along the way. Hadoop Operations
If you observe a wide distribution of pause times, you should look at the GC Timeline tab to identify peaks in garbage collection activity. Java Performance
Quite similar to a cartoon flip book, each page represents a frame or picture that will be displayed on the timeline for a period of time. Java 7 Recipes A Problem-Solution Approach
This API and its implementation are critical to supporting external consistency and a variety of powerful features: nonblocking reads in the past, lock-free read-only transactions, and atomic schema changes, across all of Spanner. Spanner - Google’s Globally-Distributed Database
Use Job.isComplete(), a nonblocking job completion check, to constantly poll to see whether all of the jobs are complete. MapReduce Design Patterns
Twisted provides a Deferred object for building chains of nonblocking event handlers. HBase in Action
Thrift also includes an RPC layer with various types of servers, including a nonblocking implementation. Hadoop in Practice
Updates are a nonblocking operation, so a client that loses an update (because another process updated the znode in the meantime) can decide whether to try again or take some other action, and it can do so without blocking the progress of any other process. Hadoop:The Definitive Guide 3rd Edition
From the command-line help you can see that, for example, using the nonblocking server implies the framed transport. HBase:The Definitive Guide
If you do synchronize your class internally, you can use various techniques to achieve high concurrency, such as lock splitting, lock striping, and nonblocking concurrency control. Effective Java 2nd Edition
This flushing is a nonblocking operation; multiple memtables may exist for a single column family, one current and the rest waiting to be flushed. Cassandra The Definitive Guide
The most prominent features of NIO are selectors and nonblocking I/O programming, allowing developers to use the same resources to handle more network clients and generally heavier loads on their servers. ActiveMQ in Action
With event-based programming in mind, you will discover nonblocking or callback behaviors when invoking functions. Java 7 Recipes A Problem-Solution Approach
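The MapReduce Design Patterns quote suggests polling Job.isComplete() as a nonblocking completion check. A sketch of that loop (job configuration is elided; the five-second interval is arbitrary):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class PollJob {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "example");
            // ... configure mapper, reducer, input/output paths here ...
            job.submit();               // returns immediately, unlike waitForCompletion()
            while (!job.isComplete()) { // nonblocking status check
                Thread.sleep(5000);     // poll instead of blocking
            }
            System.out.println(job.isSuccessful() ? "done" : "failed");
        }
    }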
The keyword public is called an access modifier; these modifiers control the level of access other parts of a program have to this code. Core Java Volume I Fundamentals 9th Edition
Other features such as class inheritance and visibility modifiers (the public, private, and protected keywords of Java) make object-oriented languages much easier and safer to work with than older procedural languages. Java Network Programming and Distributed Computing
Term modifiers: In addition to the standard field query with a simple term or clause, Lucene allows us to modify the terms we pass in the query with modifiers. Mastering ElasticSearch
Like the access modifiers that started this discussion, bounds checking is generally thought of in terms other than security: the prime benefit to bounds checking is that it leads to fewer bugs and more robust programs. Java Security 2nd Edition
Documented annotations should be treated just like other modifiers such as protected or static for documentation purposes. Core Java Volume II Advanced Features 9th Edition
In order to make a class cloneable, it may be necessary to remove final modifiers from some fields. Effective Java 2nd Edition
In this release of Java, there are no defined modifiers. Java 7 New Features Cookbook
To return to line 5, the public and static keywords are modifiers. Java in a Nutshell 5th Edition
When using the -O option with javac prior to SDK 1.3, the compiler considered inlining methods defined with any of the following modifiers: private, static, or final. Java Performance Tuning 2nd Edition
In any case, constants are typically defined with the static and final modifiers: Java 7 Recipes A Problem-Solution Approach
It is similar in concept to MapReduce [14], but with a natural graph API and much more efficient support for iterative computations over the graph. Pregel - A System for Large-Scale Graph Processing
A friend has told him about agile development, a time-boxed iterative approach for implementing and releasing software. Gradle in Action
So it is ideal for iterative and responsive Big Data applications. Hadoop Operations and Cluster Management Cookbook
Examples of (ab)uses include forking web servers and gang-scheduled computation of iterative workloads. Apache Hadoop YARN Yet Another Resource Negotiator
Most of these problems require iterative steps and/or knowledge of complex algorithms, which can be cumbersome to implement in MapReduce. Hadoop RealWorld Solutions Cookbook
Finally, although not fully transparent, Haloop [4] provides task-level memoization techniques for memoization in the context of iterative data processing applications. Incoop MapReduce for Incremental Computations
How it works: The Mahout CVB version of LDA implements the Collapsed Variational Bayesian inference algorithm using an iterative MapReduce approach: Hadoop MapReduce Cookbook
A basic implementation uses a sequential iterative process to traverse the entire graph from the starting node, as seen in the algorithm presented in figure 7.7 Hadoop in Practice
Iterative application: If we apply this algorithm once, we will get node 1 marked as Done, several more (its immediate neighbors) as Current, and a few others as Pending. Hadoop Beginners Guide
In order to reassemble the files into a single sorted file, we open some number of files (defined by io.sort.factor), and begin the iterative process of comparing the heads of the streams, picking the lowest key, and writing the result to a new file. Hadoop Operations
Notice that in case of conflicting ACLs, the first match wins. Monitoring with Ganglia
If no ACLs have been configured, we will get an output similar to the following: Hadoop Operations and Cluster Management Cookbook
What is not covered by this LDIF example is the adding of objectClasses and Access Control Lists (ACLs) Jenkins Continuous Integration Cookbook
Service-level authorization is a capability that provides access control lists (ACLs) of which users have permission to access particular services. Professional Hadoop Solutions
By default, all ACLs are set to *, which means that all users have permission to access each service, but on a real cluster you should lock the ACLs down to only those users and groups that should have access. Hadoop:The Definitive Guide 3rd Edition
Each node has various meta-data, including three names of access control lists (ACLs) used to control reading, writing and changing the ACL names for the node. Chubby - The lock service for loosely-coupled distributed systems
We displayed the list of current ACLs only to see what they were, and to eventually verify that the attributes for the file have been changed. Java 7 New Features Cookbook
The metadata server is also responsible for storing ACLs (Access Control Lists) and other security related information about the tables. Tenzing - A SQL Implementation On The MapReduce Framework
When enabling security, this feature should be enabled as well and meaningful ACLs configured. Hadoop Operations
For example, does user X have the necessary permissions to execute program Y on system Z? Such privileges are often referred to as access control lists (ACLs) and determine who or what can access a given resource to perform a given operation. ActiveMQ in Action
In the latest Ganglia versions, some of these metrics now have modules that report their values, including NFS. Monitoring with Ganglia
There are installation methods other than FTP; for example, we can also use NFS and HTTP. Hadoop Operations and Cluster Management Cookbook
Traditional solutions, exemplified by NFS [20], provide a straightforward model in which a server exports a file system hierarchy that clients can map into their local name space. Ceph A Scalable, High-Performance Distributed File System
This time is mostly dominated by disk IO over NFS (since our disks are spread over a number of machines) The anatomy of a large-scale hypertextual Web search engine
Scribe supports multiple data sinks, including HDFS, regular filesystems, and NFS. Hadoop in Practice
Previously, data generated by Scribe was stored in expensive and hard to manage NFS servers. Apache Hadoop Goes Realtime at Facebook
NFS: may not be a good fit for many deployments, as NFS is an external device, costly, with less control over timeouts, etc. Namenode High Availability
As the dfs.name.dir directory on the SNN is unused, we can expose it to the NameNode via the Network File System (NFS) Hadoop in Action
Be it block storage via a Storage Area Network (SAN) or file-based via Network File System (NFS) or similar protocols, these approaches constrain Hadoop by introducing unnecessary bottlenecks and additional shared devices that would have a critical impact on failure. Hadoop Beginners Guide
Currently, an NFS filer is required as the shared storage, although there are plans to remove this dependency.2 As the active namenode writes to the edits log, the standby namenode is constantly replaying transactions to ensure it is up to date and ready to take over in the case of failure. Hadoop Operations
You will pull in the Jenkins RSS feeds using a Firefox add-on. Jenkins Continuous Integration Cookbook
You can also subscribe to RSS feeds for various levels of log messages. Jenkins The Definitive Guide
RabbitMQ development RSS feed—The RabbitMQ Mercurial repository offers an RSS feed with the latest code changes to the server. RabbitMQ in Action
Tip: Other ideas for Puppet report processors include RSS feeds for new reports, IRC, XMPP or instant messaging, or SMS notifications of new reports. Pro Puppet
It’s employed by many standards today such as SOAP and RSS, and used as an open data format for products such as Microsoft Office. Hadoop in Practice
Streamy crawls thousands of RSS feeds and aggregates hundreds of millions of items from them. Hadoop:The Definitive Guide 3rd Edition
To emulate this use case, the book makes use of a freely available data set on the Internet: the Delicious RSS feed. HBase:The Definitive Guide
You can create Atom or RSS feeds in just a few lines that transform database records into the appropriate XML. Beginning Scala
Then, in the early part of this century, RSS feeds, blogs, and wikis started cropping up, adding automation and linking of reference data in new ways that stretched the relational model. Cassandra The Definitive Guide
How It Works If you’ve ever seen a simple RSS reader or the Javadocs, you’ll notice that the screen is divided into sections with dividers that allow the user to adjust. Java 7 Recipes A Problem-Solution Approach
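As a rough illustration of consuming such a feed, here is a small JDK-only sketch that prints the channel and item titles of an RSS document; the feed URL is a placeholder:

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class RssTitles {
        public static void main(String[] args) throws Exception {
            // Placeholder feed URL; substitute any RSS 2.0 feed.
            URL feed = new URL("http://example.com/feed.rss");
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(feed.openStream());
            // Prints the channel title followed by each item title.
            NodeList titles = doc.getElementsByTagName("title");
            for (int i = 0; i < titles.getLength(); i++) {
                System.out.println(titles.item(i).getTextContent());
            }
        }
    }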
Through the membership key, you can block/unblock datagrams from different addresses, drop membership, get the channel and/or multicast group for which this membership key was created, and more. Pro Java 7 NIO.2
Troubleshooting Most problems with sFlow deployments occur because the sFlow datagrams are dropped somewhere between the sFlow agent and gmond. Monitoring with Ganglia
Information is exchanged between two hosts in the form of IP packets, also known as IP datagrams. Java Network Programming and Distributed Computing
In this book, we cover only the Transmission Control Protocol (TCP).The Java platform also supports the User Datagram Protocol (UDP), which can be used to send packets (also called datagrams) with much less overhead than that of TCP. Core Java Volume II Advanced Features 9th Edition
This class implements a communication channel based on network datagrams. Java in a Nutshell 5th Edition
The internet layer fragments the segments into IP datagrams of the necessary size for the local network and passes them to the host-to-network layer for transmission onto the wire. Java Network Programming 4th Edition
The purpose of these two protocols is identical—to send and receive data packets (datagrams) over the network. ActiveMQ in Action
Its sequencing with respect to other datagrams between the same two endpoints is not guaranteed. Fundamental Networking in Java
Problem You would like to broadcast datagrams to zero or more hosts identified by a single address. Java 7 Recipes A Problem-Solution Approach
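A minimal JDK sketch of the send/receive datagram model these excerpts describe; the port number is arbitrary:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    public class DatagramDemo {
        public static void main(String[] args) throws Exception {
            try (DatagramSocket receiver = new DatagramSocket(9876);
                 DatagramSocket sender = new DatagramSocket()) {
                byte[] payload = "hello".getBytes("UTF-8");
                sender.send(new DatagramPacket(payload, payload.length,
                        InetAddress.getLoopbackAddress(), 9876));

                byte[] buf = new byte[512];
                DatagramPacket in = new DatagramPacket(buf, buf.length);
                receiver.receive(in); // blocks until a datagram arrives
                System.out.println(new String(in.getData(), 0, in.getLength(), "UTF-8"));
            }
        }
    }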
In this chapter, we will use VisualVM (http://visualvm.java.net/index.html) as the application monitoring tool. Pro Spring 3
You start it in this manner so you can connect a local JMX console (like JConsole or Java VisualVM) to it. Spring in Practice
You can try to reproduce the problem locally by downloading the file that the task is failing on and running the job locally, possibly using a debugger such as Java’s VisualVM. Hadoop:The Definitive Guide 3rd Edition
This is what we’ll look at in the next section on how to monitor the Course Manager application using tools like VisualVM. Spring Roo in Action
The following graphical tools can be used to monitor the HotSpot VM: JConsole, VisualGC, and VisualVM. Java Performance
Solution 2 Analyze your program’s memory consumption by using VisualVM (an external GNU Open Source Java profiler and visualization tool).To use and install VisualVM, go to http://visualvm.java.net/, download and follow the steps for installation (usually uncompressing in the folder of your choice and then clicking the provided executable) Java 7 Recipes A Problem-Solution Approach
As illustrated in figure 20.1, exporting Spring beans as JMX MBeans makes it possible for a JMX-based management tool such as JConsole or VisualVM to peer inside a running application to view the beans’ properties and invoke their methods. Spring in Action 4th Edition
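The numbers that JConsole or VisualVM display over JMX can also be read in-process through the platform MXBeans; a small sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.ThreadMXBean;

    public class JmxPeek {
        public static void main(String[] args) {
            MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            System.out.println("heap used:    " + mem.getHeapMemoryUsage().getUsed());
            System.out.println("live threads: " + threads.getThreadCount());
        }
    }

To let a remote JConsole or VisualVM attach, the JVM is typically started with the com.sun.management.jmxremote.* system properties.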
Mesos uses resource hoarding to achieve gang scheduling, and can potentially deadlock as a result. Omega - flexible, scalable schedulers for large compute clusters
The system runs over the Mesos cluster manager [17], allowing it to share resources with Hadoop, MPI and other applications. Resilient Distributed Datasets - A Fault-Tolerant Abstraction for In-Memory Cluster Computing
HoD proved itself as a versatile platform, anticipating some qualities of Mesos[17], which would extend. Apache Hadoop YARN Yet Another Resource Negotiator
Paco is an evangelist for the Mesos and Cascading open source projects. Storm Real-Time Processing Cookbook
Abstract We present Mesos, a platform for sharing commodity clusters between multiple diverse cluster computing frameworks, such as Hadoop and MPI. Mesos A Platform for Fine-Grained Resource Sharing in the Data Center
The screenshot in Figure 10-10 shows the word cloud generated from a MapReduce view of CouchDB’s archived tweets. CouchDB The Definitive Guide
You need to be able to provide credentials so that your plugin can send tweets. Jenkins Continuous Integration Cookbook
In order to get the data we needed, we decided to use the Twitter river plugin to index some public tweets from Twitter. Mastering ElasticSearch
Let’s say you have a list of strings representing tweets people are sending to you. Java 8 in Action Lambdas, Streams and Functional-style Programming
By monitoring, collecting, and aggregating data from various Internet sources such as blogs, boards, news feeds, tweets, and social media, companies are using Hadoop to extract and aggregate information about their products, services, and competitors, discovering patterns and revealing upcoming trends important for understanding their business. Professional Hadoop Solutions
HBase stores structured and semistructured data naturally so you can load it with tweets and parsed log files and a catalog of all your products right along with their customer reviews. HBase in Action
Imagine you wanted to join that set of users with all their tweets, which is a much larger dataset than the set of users. Hadoop in Practice
For example, emitting a stream of trend analysis by processing a stream of tweets. Apache Kafka
Something seems to be broken: we have 12 tweets in our indices, and only one. Elasticsearch The Definitive Guide
You can use what’s likely a familiar data model (Twitter’s) and see how users, time lines, and tweets all fit into a simple Cassandra data model. Cassandra The Definitive Guide
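Picking up the list-of-tweets example above, a small Java 8 streams sketch (the tweet contents are made up):

    import java.util.Arrays;
    import java.util.List;
    import java.util.stream.Collectors;

    public class TweetFilter {
        public static void main(String[] args) {
            List<String> tweets = Arrays.asList(
                    "I love Hadoop", "Java 8 streams are neat", "lunch time");
            // Keep only tweets mentioning Java, normalized to upper case.
            List<String> javaTweets = tweets.stream()
                    .filter(t -> t.toLowerCase().contains("java"))
                    .map(String::toUpperCase)
                    .collect(Collectors.toList());
            System.out.println(javaTweets); // [JAVA 8 STREAMS ARE NEAT]
        }
    }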
Each document has to be converted from BSON to a JavaScript object and then run through the "$where" expression. MongoDB The Definitive Guide 2nd Edition
The format of the document in which the information is stored in MongoDB is BSON (binary JSON) Scala in Action
Finally, you don’t need to worry about mapping the Java Article object to MongoDB’s native BSON format,8 because Spring Data MongoDB also handles that for you. Spring in Practice
In addition to standard relational data, Drill supports (using ANSI SQL extensions) hierarchical data, including XML, JavaScript Object Notation (JSON), Binary JSON (BSON), Avro, protocol buffers, and so on. Professional Hadoop Solutions
The database stores and retrieves documents, which can be XML, JSON, BSON, and so on. NoSQL Distilled
MongoDB uses these BSON objects to efficiently serialize, transfer, and store data. Hadoop RealWorld Solutions Cookbook
MongoDB doesn’t actually use JSON to store the data; rather, it uses an open data format developed by the MongoDB team called BSON (pronounced Bee-Son), which is short for Binary-JSON. The Definitive Guide to MongoDB
Chapter 3 introduces the MongoDB drivers and MongoDB’s data format, BSON. MongoDB in Action
MongoDB stores these documents in BSON, a binary derivative of JSON. Spring Data Modern Data Access for Enterprise Java
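A hedged sketch of building such a document with the MongoDB Java driver's org.bson.Document class, assuming a 3.x-or-later driver on the classpath; the driver handles the conversion to binary BSON:

    import java.util.Arrays;
    import org.bson.Document;

    public class BsonSketch {
        public static void main(String[] args) {
            Document article = new Document("title", "BSON in brief")
                    .append("views", 42)
                    .append("tags", Arrays.asList("mongodb", "bson"));
            // toJson() renders the JSON view; the stored/wire form is binary BSON.
            System.out.println(article.toJson());
        }
    }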
The support for Hibernate, MyBatis, JDO, and JPA makes already simple APIs even simpler, thus easing the burden on developers. Pro Spring 3
For example, it provides a consistent and transparent way to work with a variety of different data access frameworks (e.g., JDBC, JPA, Hibernate, iBATIS, and JDO) Pro Spring MVC
A quick note about iBATIS Apache iBATIS was retired in 2010 and has been superseded by MyBatis (mybatis.org) Spring in Practice
There are many other tools for such upgrades, including Liquibase [Liquibase], MyBatis. NoSQL Distilled
As an example, consider the ORM module in the Spring Framework: this module includes support for popular ORM tools such as Hibernate, iBATIS, and OpenJPA. Spring Dynamic Modules in Action
Just configure your Spring application with the appropriate JDBC driver and data source, and then start using the Spring JdbcTemplate API, or MyBatis (a common SQL-mapping API), or even a non-SQL database platform. Spring Roo in Action
It significantly simplified the implementation of data access layers, regardless of whether JDBC, Hibernate, TopLink, JDO, or iBatis was used as persistence technology. Spring Data Modern Data Access for Enterprise Java
Spring doesn’t attempt to implement its own ORM solution but does provide hooks into several popular ORM frameworks, including Hibernate, Java Persistence API, Java Data Objects, and iBATIS SQL Maps. Spring in Action 4th Edition
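As a concrete taste of that simplification, a minimal Spring JdbcTemplate sketch; the table and column names are made up:

    import java.util.List;
    import javax.sql.DataSource;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class UserDao {
        private final JdbcTemplate jdbc;

        public UserDao(DataSource dataSource) {
            this.jdbc = new JdbcTemplate(dataSource);
        }

        // Row mapping as a lambda; no connection or statement handling needed.
        public List<String> findNames() {
            return jdbc.query("SELECT name FROM users",
                    (rs, rowNum) -> rs.getString("name"));
        }
    }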
Iff the predicate classIsTypeSafe is not true, the type checker must throw the exception VerifyError to indicate that the class file is malformed. The Java Virtual Machine Specification Java SE 7 Edition
For every other kind of expression, the expression can throw an exception class E iff one of its immediate subexpressions can throw E. The Java Language Specification Java SE 7 Edition
Apply a predicate p to all elements of this iterable object and return true, iff there is at least one element for which p yields true. Programming Scala
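The contract in that comment is exactly what Java's Stream.anyMatch provides; a minimal equivalent sketch:

    import java.util.Arrays;
    import java.util.List;
    import java.util.function.Predicate;

    public class AnyMatch {
        // Returns true iff p yields true for at least one element.
        static <T> boolean exists(List<T> xs, Predicate<T> p) {
            return xs.stream().anyMatch(p);
        }

        public static void main(String[] args) {
            List<Integer> xs = Arrays.asList(1, 3, 5, 8);
            System.out.println(exists(xs, n -> n % 2 == 0)); // true
            System.out.println(exists(xs, n -> n > 10));     // false
        }
    }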
If it does not find that file, it automatically searches for Employee.java and compiles it. Core Java Volume I Fundamentals 9th Edition
This code compiles and executes on Android without any conspicuous error. Monitoring with Ganglia
The build task compiles your code, runs your tests, and assembles the JAR file, all in the correct order. Gradle in Action
The apiClasses task is the task that compiles the Java source files, and processes the resources into the build directory, so Gradle will first invoke the apiClasses task before the apiJar task. Gradle Effective Implementation Guide
Query Compiler – The component that compiles HiveQL into a directed acyclic graph of map/reduce tasks. Hive A Petabyte Scale Data Warehouse Using Hadoop
For example, the build task in a Java project compiles and tests all of the code. Gradle Beyond The Basics
When we run the Pig script, Pig internally compiles Pig commands to MapReduce jobs in an optimized form and runs it in a MapReduce cluster. Hadoop MapReduce Cookbook
Pig compiles all the statements together and can locate the dependency and redundancy. Hadoop in Action
The command compiles the Java examples and starts up a simple JMS consumer. ActiveMQ in Action
All commands and queries go to the Driver, which compiles the input, optimizes the computation required, and executes the required steps, usually with MapReduce jobs. Programming Hive
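Compilation can also be driven programmatically from the JDK itself; a small javax.tools sketch (the source file name is a placeholder, echoing the Employee.java example above):

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    public class CompileDemo {
        public static void main(String[] args) {
            // Requires a JDK, not a bare JRE (returns null there); 0 means success.
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            int result = compiler.run(null, null, null, "Employee.java");
            System.out.println(result == 0 ? "compiled" : "failed");
        }
    }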
In the past, you had to come up with a solution for this problem, but no more! Flume started as a project out of Cloudera when their integration engineers had to keep writing tools over and over again for their customers to import data automatically. Apache Flume Distributed Log Collection for Hadoop
Hadoop was created by Doug Cutting, who now works as an architect at Cloudera and serves as chair of the Apache Software Foundation. Monitoring with Ganglia
Prerequisites For those of you who are new to Hadoop, I strongly urge you to try Cloudera’s open source Distribution for Hadoop (http://www.cloudera.com/hadoop) Pro Hadoop
The success of MR led to a wide range of third-party implementations (notably open-source Hadoop [15]), and a number of hybrid systems that combine parallel DBMSs with MR, offered by vendors like Aster, Cloudera, Greenplum, and Vertica. Dremel - Interactive Analysis of WebScale Datasets
Prior to the release of MRUnit by Cloudera, even the simplest tests running in local mode would have to read from the disk and take several seconds each to set up and run. Hadoop RealWorld Solutions Cookbook
After tackling Hadoop’s architecture and its weaknesses you’re probably ready to roll up your sleeves and get hands-on with Hadoop, so let’s take a look at how to get the Cloudera Distribution for Hadoop (CDH) up and running on your system, which you can use for all the examples in this book. Hadoop in Practice
There are Hadoop distributions from the large, established enterprise vendors, including EMC, IBM, Microsoft, and Oracle, as well as from specialist Hadoop companies such as Cloudera, Hortonworks, and MapR. Hadoop:The Definitive Guide 3rd Edition
Cloudera is now bringing Hadoop to a broader enterprise audience. Hadoop in Action
Though Doug has now moved on to Cloudera (another prominent startup supporting the Hadoop community) and much of Yahoo's Hadoop team has been spun off into a startup called Hortonworks, Yahoo remains a major Hadoop contributor. Hadoop Beginners Guide
Special thanks to Christophe Bisciglia for giving me an opportunity at Cloudera and for the advice along the way. Hadoop Operations
The libvirt project has created a common set of tools for managing virtualization resources on different virtualization platforms, currently including: Xen, QEMU, KVM, LXC, OpenVZ, User Mode Linux, VirtualBox, and VMware ESX and GSX. Monitoring with Ganglia
The nova-compute process is primarily a worker daemon that creates and terminates virtual machine instances via hypervisor's APIs (XenAPI for XenServer/XCP, libvirt for KVM or QEMU, VMwareAPI for VMware, etc.) bk-compute-adminguide-grizzly
If our system is running within Xen virtualization, those numbers can be closer to 200–300ms per gigabyte of memory used by Redis, depending on the Xen configuration. Redis in Action
While libvirt may be the most common, nova-compute also uses the Xen API, vSphere API, Windows Management Interface, and others to support other virtualization technologies. Deploying OpenStack
However, the majority of OpenStack development is done on systems using KVM and Xen-based hypervisors. bk-admin-guide-cloud-latest
Hardware device drivers can be provided by the hypervisor, either in a special virtual machine for isolation (for example in Oracle VM- and Xen-based solutions) or as part of the hypervisor itself (for example in VMware ESX) Oracle JRockit The Definitive Guide
To boot a Xen image from a volume, the image you launch in must be the same type, fully virtualized or paravirtualized, as the one on the volume. user-guide
However, C++ has a special initializer list syntax, such as. Core Java Volume I Fundamentals 9th Edition
A compiler writer can work around this bug by limiting the maximum size of the generated Java Virtual Machine code for any method, instance initialization method, or static initializer (the size of any code array) to 65534 bytes. The Java Virtual Machine Specification Java SE 7 Edition
The value of the variable Point.origin can never change, so it always refers to the same Point object, the one created by its initializer. The Java Language Specification Java SE 7 Edition
The delegate injection point follows the normal rules for injection and therefore must be an injected field, initializer method parameter, or bean constructor method parameter: Java EE 7 Essentials
This statement causes the driver class to be loaded, thereby executing a static initializer that registers the driver. Core Java Volume II Advanced Features 9th Edition
The version that follows avoids this inefficiency with a static initializer: Effective Java 2nd Edition
A variable declaration can also include an initializer: an expression that specifies an initial value for the variable. Java in a Nutshell 5th Edition
Rather than initialize explicitly in the constructor (or class static initializer), it is left until access time for the variable to be initialized, using a test for null to determine if it has been initialized. Java Performance Tuning 2nd Edition
As of this writing, the JLS does not yet clearly state that the visible initial value read for an initialized final field is the value assigned in its initializer or constructor. Concurrent Programming in Java Design Principles and Pattern
The initGrades2() method combines array creation and initialization in one line using the special array initializer syntax: Java 7 Recipes A Problem-Solution Approach
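To keep the variants mentioned in these excerpts straight, a compact sketch showing the array initializer syntax, a static initializer, and an instance initializer:

    public class InitDemo {
        static final int[] GRADES = {90, 85, 72};   // array initializer syntax

        static int counter;
        static {                                     // static initializer: runs once,
            counter = Integer.parseInt("0");         // when the class is loaded
        }

        private final long createdAt;
        {                                            // instance initializer: runs
            createdAt = System.nanoTime();           // before each constructor body
        }

        public static void main(String[] args) {
            System.out.println(GRADES.length + " " + counter + " "
                    + new InitDemo().createdAt);
        }
    }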
Java has an extensive library of routines for coping with TCP/IP protocols like HTTP and FTP. Core Java Volume I Fundamentals 9th Edition
The backup server could either be accessible over a mounted shared drive or you could transfer the file over FTP. Gradle in Action
We will learn steps to configure this server, including the configuration of the following two services: DHCP and FTP. Hadoop Operations and Cluster Management Cookbook
Once a jar file is created, you can FTP it to the cluster’s edge node and execute it using the command shown in Listing 3-4 Professional Hadoop Solutions
This recipe requires an SSH Server and SSH Client, but any method of transferring binary data such as FTP is sufficient. Cassandra High Performance Cookbook
Using curl curl is a lightweight command-line tool for transferring data to or from a server using various protocols, such as HTTP, HTTPS, and FTP. Beginning CouchDB
These techniques are particularly helpful in situations where the data you’re working with is in a binary or semistructured form, or is only accessible via interfaces such as HTTP or FTP. Hadoop in Practice
It functions much like a GUI-based FTP client in managing remote storage. Hadoop in Action
The act of copying files to or from HDFS is more like SCP or FTP than working with an NFS mounted filesystem, for example. Hadoop Operations
The blob itself is transferred out of bounds, by either FTP or HTTP. ActiveMQ in Action
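A minimal JDK sketch of this kind of URL-based transfer; java.net.URL handles http:, https:, and ftp: schemes out of the box (the URL is a placeholder):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class FetchDemo {
        public static void main(String[] args) throws Exception {
            // Works the same with an ftp:// URL for anonymous FTP.
            URL url = new URL("http://example.com/data.txt");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }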
Part of the transition from a single executable also included splitting out other components such as the Apache Portable Runtime (APR) library, which was previously being statically linked with gmond as well. Monitoring with Ganglia
You may prefer Java for its statically typed nature and library support to implement your business logic. Gradle in Action
Structural conditions — These are statically defined in a Workflow DAG (transitions and fork-join construct) Professional Hadoop Solutions
The mapper class defines a statically initialized position to read the country from each line, as well as the regex pattern to split each line. Hadoop RealWorld Solutions Cookbook
Many US hotels and conference centers are still woefully unprepared for a few hundred software developers, each with two or three devices on the wireless network and a seemingly insatiable appetite for bandwidth. While I strongly prefer dependencies to be managed by my build tool, it might make sense for me to prepare lab materials with all of the dependencies statically located in the project in the style of old Ant builds. For some Java frameworks and APIs, chasing all of these JARs down by hand can be a burden. Gradle Beyond The Basics
In contexts where the Writable is defined by type, such as in SequenceFile keys or values or as input to MapReduce in general, you need to subclass ArrayWritable (or TwoDAr rayWritable, as appropriate) to set the type statically. Hadoop:The Definitive Guide 3rd Edition
In more basic designs, the ID is hashed statically to one of a fixed number of database machines. Hadoop in Action
Keyspaces Keyspaces used to be defined statically in an XML configuration file, but as of 0.7, you can use the API to create keyspaces and column families. Cassandra The Definitive Guide
When you want to configure a network of brokers, the first obvious question is, do you know the exact network address of each broker in the network? If the answer is yes, then you can proceed configuring your network statically and also connect your clients to predefined broker URIs. ActiveMQ in Action
The concrete nature (and ideally, number) of all objects and threads comprising the subsystem are statically known. Concurrent Programming in Java Design Principles and Pattern
It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system) Apache Flume Distributed Log Collection for Hadoop
The examples at http://hadoop.apache.org/zookeeper/docs/current/recipes.html provide ZooKeeper recipes for two-phase commit, leader election, barriers, queues, and locks. Pro Hadoop
HBase uses Zookeeper for its management to ensure that all of its components are up and running. Professional Hadoop Solutions
In Chapter 10, Third-party Libraries and Applications, the recipes Setting up Zookeeper to support Cages for transactional locking and Using Cages to implement an atomic read and set. Cassandra High Performance Cookbook
The shell script in this recipe assumes that Zookeeper is running on the host localhost on port 2181; you can change this to suit your environment needs. Hadoop RealWorld Solutions Cookbook
Cassandra system elects a leader amongst its nodes using a system called Zookeeper[13] Cassandra A Decentralized Structured Storage System
Create two directories to store the HDFS data and Zookeeper data. Hadoop MapReduce Cookbook
Windows is supported only as a development platform, not as a production platform.) Download a stable release of ZooKeeper from the Apache ZooKeeper releases page at http://zookeeper.apache.org/releases.html, and unpack the tarball in a suitable location: Hadoop:The Definitive Guide 3rd Edition
Our plan is to use Zookeeper, Yahoo’s distributed consensus technology to build an automated failover solution. The Hadoop Distributed File System
Zookeeper, also part of the Hadoop project, is a centralized service for maintaining configuration information and distributed synchronization across nodes in a cluster. Cassandra The Definitive Guide
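A hedged sketch of the basic ZooKeeper client calls these excerpts rely on, assuming the ZooKeeper client library on the classpath; the connection string and znode path are placeholders:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkDemo {
        public static void main(String[] args) throws Exception {
            // 15s session timeout; the watcher ignores connection events here.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {});
            zk.create("/demo-config", "v1".getBytes("UTF-8"),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data, "UTF-8"));
            zk.close();
        }
    }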
Use the ulimit command to increase the permitted number of open files or file descriptors for the rrdcached process. Monitoring with Ganglia
File Descriptors Hadoop Core uses large numbers of file descriptors for MapReduce, and the DFSClient uses a large number of file descriptors for communicating with the HDFS NameNode and DataNode server processes. Pro Hadoop
All this takes processing time and enough sockets (or file descriptors) must be available. CouchDB The Definitive Guide
Writing a plugin is surprisingly easy and doesn’t require a lot of additional descriptors. Gradle in Action
But those descriptor files can be misconfigured, and so we may want to override the descriptors ourselves to ensure the dependencies are correct. Gradle Effective Implementation Guide
Because of this, HBase tends to see issues not normally encountered by MapReduce clients: Running out of file descriptors. Hadoop:The Definitive Guide 3rd Edition
Clients open nodes to obtain handles that are analogous to UNIX file descriptors. Chubby - The lock service for loosely-coupled distributed systems
The new intermediate file, under output/test, is missing the quoted field descriptors. Hadoop in Action
Linux supports, via Pluggable Authentication Modules (PAM) system, the ability to control resources such as file descriptors and virtual memory at the user level. Hadoop Operations
This is common when working with a Java EE application server, since each one requires its own custom configuration file and possibly more deployment descriptors. ActiveMQ in Action
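On a Unix JVM, the process's descriptor usage can be observed through the non-standard com.sun.management MXBean; a hedged, HotSpot-specific sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;
    import com.sun.management.UnixOperatingSystemMXBean;

    public class FdUsage {
        public static void main(String[] args) {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            // The Unix-specific subtype is only present on Unix-like platforms.
            if (os instanceof UnixOperatingSystemMXBean) {
                UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
                System.out.println("open fds: " + unix.getOpenFileDescriptorCount());
                System.out.println("fd limit: " + unix.getMaxFileDescriptorCount());
            }
        }
    }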
For each named sink group, you need to specify the sinks it contains using the sinks property consisting of a space-delimited list of sink names: Apache Flume Distributed Log Collection for Hadoop
Because the output is just a text string, that means you can reformat the content as XML, a tab delimited string, even as a UUencoded string if that is what you need. Getting Started with CouchDB
Then add a list of checks that are delimited with a colon. Monitoring with Ganglia
This recipe will use delimited text as an alternative to storing multiple columns under the same key. Cassandra High Performance Cookbook
Note that the expression passed to the expand() method is a Groovy map literal—it is enclosed by square brackets, and a series of key/value pairs are delimited by commas, with the key and the value themselves separated by colons. Gradle Beyond The Basics
It reads records that are delimited by a specific XML begin and end tag. Hadoop in Practice
Popular formats, such as tab-delimited text files, are natively supported. Hadoop in Action
For example, for delimited text format, the user can specify the delimiter, encoding, quoting, escaping, headers, etc. Tenzing - A SQL Implementation On The MapReduce Framework
For simplicity, we keep to our previous validation strategy of looking for the number of fields and discarding lines that don't break into exactly six tab-delimited fields. Hadoop Beginners Guide
The users.properties file defines each user in a line-delimited manner along with its password, as shown: ActiveMQ in Action
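The validation strategy of keeping only lines with exactly six tab-delimited fields, as a minimal Java sketch:

    public class DelimitedFilter {
        // Keep only lines that split into exactly six tab-delimited fields.
        static boolean isValid(String line) {
            // The -1 limit keeps trailing empty fields instead of dropping them.
            return line.split("\t", -1).length == 6;
        }

        public static void main(String[] args) {
            System.out.println(isValid("a\tb\tc\td\te\tf")); // true
            System.out.println(isValid("a\tb\tc"));          // false
        }
    }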
Using specific rowkey filtering, scanning over a subset is faster than scanning over the entire table. Web Crawling and Data Mining with Apache Nutch
Within the enterprise, all applications can write their events to Cassandra with their own columns and the rowkey of the form appname:timestamp. NoSQL Distilled
It’s often described as a sparse, distributed, persistent, multidimensional sorted map, which is indexed by rowkey, column. HBase in Action
Iterate through all values for the rowkey and scan criteria (defined in the main method) Hadoop in Practice
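A hedged sketch of a rowkey-prefix scan using the HBase 1.x-style client API; the table handle, the "appname:" prefix, and the library availability are all assumptions:

    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowkeyScan {
        // Scan only rows whose key starts with "appname:" instead of the whole table.
        static void scanByPrefix(Table table) throws Exception {
            Scan scan = new Scan();
            scan.setFilter(new PrefixFilter(Bytes.toBytes("appname:")));
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }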
Originally targeted at embedded network hardware, sFlow collectors now exist for general-purpose operating systems as well as popular applications such as Tomcat, memcached, and the Apache Web Server. Monitoring with Ganglia
More recent versions of HBase also provide a memcached-based intermediate layer between the user and the MapFiles (http://www.danga.com/memcached/) Pro Hadoop
BigTable, Hadoop, SimpleDB, and memcached restrict object lookups by key for exactly these reasons. CouchDB The Definitive Guide
Cache documents and query results using memcached or another caching system. Scaling CouchDB
Service becomes more popular; too many reads hitting the database: add memcached to cache common queries. Hadoop:The Definitive Guide 3rd Edition
Even projects like memcached are lumped into the NoSQL category, as if anything that is not an RDBMS is automatically NoSQL. HBase:The Definitive Guide
Instead, we have used a combination of large clusters of MySQL databases and caching tiers built using memcached[9] Apache Hadoop Goes Realtime at Facebook
Caching in Guava gives us more power and flexibility than using plain HashMap but is not as robust as EHCache or Memcached. Getting Started with Google Guava
This was true of many tools and techniques for scaling data processing, including caching (for example, memcached), replication, sharding, and, of course, MapReduce/ Hadoop. Hadoop in Action
For larger systems, this might include distributed caches such as memcached, EHCache, Oracle Coherence, or other related products. Cassandra The Definitive Guide
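Following the Guava remark above, a minimal in-process cache sketch (size and TTL are arbitrary); memcached fills the same role out of process and across machines:

    import java.util.concurrent.TimeUnit;
    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class QueryCache {
        public static void main(String[] args) {
            Cache<String, String> cache = CacheBuilder.newBuilder()
                    .maximumSize(10_000)
                    .expireAfterWrite(10, TimeUnit.MINUTES)
                    .build();
            cache.put("user:42", "{\"name\":\"Ada\"}");
            // Returns null on a miss; a real service would then hit the database.
            System.out.println(cache.getIfPresent("user:42"));
        }
    }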
He’s then taught the modern concurrency tools provided directly within the Java SDK before embarking upon an adventure through the exciting realms of STM and actors. Programming Concurrency on the JVM
Instead of working with tables and rows, STM controls the access to shared memory. Scala in Action
This method gets the path to the file, an iterable object over the char sequence, a charset to use for encoding, and options specifying how the file is opened. Pro Java 7 NIO.2
Passing this to Charset.forName( ) produces the Charset object that can be used to decode the string. Thinking in Java 4th Edition
You can use the CHARSET(), COLLATION(), and COERCIBILITY() functions to help debug errors related to character sets and collations. High Performance MySQL 3rd Edition
Each supports the same attributes (other than a charset attribute that's only present on the string transformer), so the primary difference is the type of the payload of the outgoing message. Spring Integration in Action
To obtain a Charset, call the static forName method with either the official name or one of its aliases: Core Java Volume II Advanced Features 9th Edition
The third parameter of this method defines the Charset to use. Java 7 New Features Cookbook
Charset and its associated classes, CharsetEncoder and CharsetDecoder, define methods for encoding strings of characters into sequences of bytes and decoding sequences of bytes into strings of characters. Java in a Nutshell 5th Edition
When you create a String instance from a byte array, you must provide a Charset argument to the String constructor so that the platform knows how to perform the mapping from the legacy encoding to Unicode. Java 7 Recipes A Problem-Solution Approach
In the (hopefully unlikely) case that you need a different encoding, you can supply a Charset argument. Java SE 8 for the Really Impatient
THR_LOCK_charset Protects the variables and data structures relevant to character set operations. Understanding MySQL Internals
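A compact sketch of the encode/decode round trip these excerpts describe:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CharsetDemo {
        public static void main(String[] args) {
            // forName accepts the canonical name or an alias; constants avoid typos.
            Charset utf8 = Charset.forName("UTF-8"); // same as StandardCharsets.UTF_8
            byte[] encoded = "héllo".getBytes(utf8);     // chars -> bytes
            String decoded = new String(encoded, utf8);  // bytes -> chars
            System.out.println(decoded.equals("héllo")); // true
        }
    }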
In addition to extremely large-scale pipelines for Yahoo! Search, scientists optimizing advertising analytics, spam filtering, and content optimization drove many of its early requirements. Apache Hadoop YARN Yet Another Resource Negotiator
See the next recipe, Running the Yahoo! Cloud Serving Benchmark, for a more sophisticated load testing system. Cassandra High Performance Cookbook
Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. The anatomy of a large-scale hypertextual Web search engine
You’ve most likely come across Hadoop as it’s been adopted by technology giants like Yahoo!, Facebook, and Twitter to address their big data needs, and it’s making inroads across all industrial sectors. Hadoop in Practice
Around that time, Yahoo! got interested, and quickly put together a team that I joined. Hadoop:The Definitive Guide 3rd Edition
BookKeeper is in production use at Yahoo! for guaranteed delivery of log messages to Hedwig Servers. Namenode High Availability
Doug Cutting got it all started, and Yahoo had the foresight to support it early on. Hadoop in Action
Pig Pig (http://hadoop.apache.org/pig) is a platform for data analytics developed at Yahoo! Cassandra The Definitive Guide
Thanks, Yahoo Yahoo hired Doug Cutting in 2006 and quickly became one of the most prominent supporters of the Hadoop project. Hadoop Beginners Guide
The team at Yahoo! ran into a number of scalability limitations that were difficult to overcome given Hadoop’s existing architecture and design. Hadoop Operations
CouchDB’s JSON documents are great for programmatic access in most environments. CouchDB The Definitive Guide
The Project instance gives you programmatic access to all Gradle features in your build, like task creation and dependency management. Gradle in Action
Introduction Programmatic access to a cluster of Cassandra servers is done though the Application Programmer Interface. Cassandra High Performance Cookbook
Dapper is used widely at Google, both directly through the Dapper user interface and indirectly through the programmatic APIs or applications built on top of those APIs. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
You have programmatic access to counters, and the job history command will also include the counters: Hadoop in Practice
The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which allows programmatic control over matching: Hadoop:The Definitive Guide 3rd Edition
Programmatic Authentication If you have set up authentication on your keyspace, your client application code will need to log in. Cassandra The Definitive Guide
Using web, commandline, or programmatic interfaces to create objects, which can be everything from text files to images to MP3s, you can store and retrieve your data based on a hierarchical model. Hadoop Beginners Guide
Oozie presents a REST service for programmatic management of workflows and status retrieval. Hadoop Operations
This kind of programmatic modification of that otherwise inaccessible build logic can be very powerful. Building and Testing with Gradle
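A hedged sketch of the listStatus/PathFilter combination, assuming the Hadoop client library; the directory and suffix are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListLogs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // PathFilter gives programmatic control beyond what globbing offers.
            FileStatus[] matches = fs.listStatus(new Path("/logs"),
                    path -> path.getName().endsWith(".gz"));
            for (FileStatus status : matches) {
                System.out.println(status.getPath());
            }
        }
    }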
HBase uses CMS GC as its default garbage collector for the old generation. HBase Administration Cookbook
For instance, if you are using the blog.posts collection in the cms database, the namespace of that collection would be cms.blog.posts. MongoDB The Definitive Guide 2nd Edition
In the XML world he is known as the developer of DB Generator for the Apache Cocoon project, the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration by using Oracle JVM Directory implementation and in the Restlet.org project the Oracle XDB Restlet Adapter, an alternative to write native REST web services inside the database-resident JVM. Mastering ElasticSearch
A content management system (CMS) allows for storing and serving content, as well as managing everything from a central location. HBase in Action
This is mitigated by using the Concurrent Mark-Sweep Collector (CMS), enabled with the latter option shown earlier. HBase:The Definitive Guide
Use the name or id attributes to name these variables as follows: public WebElement heightCMS; public WebElement weightKg; public WebElement Calculate; public WebElement bmi; public WebElement bmi_category; Selenium Testing Tools Cookbook
Use Liferay tools (CMS, WCM, collaborative API and social API) to create your own Web sites and WAP sites with hands-on examples. Java 7 New Features Cookbook
The CMS API (described later in this chapter) defines an API similar to JMS, and since the pyactivemq is just a wrapper around a CMS API implementation, we can expect a JMS-like API for Python. ActiveMQ in Action
Figure 3-8a illustrates how a garbage collection cycle works in CMS. Java Performance
In the XML world, he is known as the developer of the DB Generator for the Apache Cocoon project, the open source projects DBPrism and DBPrism CMS, the Lucene-Oracle integration by using Oracle JVM Directory implementation, and the Restlet.org project – the Oracle XDB Restlet Adapter, an alternative to writing native REST web services inside the database resident JVM. Apache Solr 4 Cookbook
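Which collectors a JVM is actually running (with CMS enabled, HotSpot typically reports ParNew plus ConcurrentMarkSweep) can be listed through the GC MXBeans; a small sketch:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcList {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc :
                    ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + " collections=" + gc.getCollectionCount()
                        + " timeMs=" + gc.getCollectionTime());
            }
        }
    }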
Flume was created to meet this need and create a standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop. Apache Flume Distributed Log Collection for Hadoop
Ganglia 1.0 was not extensible, whereas Ganglia today can publish custom metrics via Python and C modules or a simple command-line tool. Monitoring with Ganglia
The nested data model backs a platform-neutral, extensible mechanism for serializing structured data at Google. Dremel - Interactive Analysis of WebScale Datasets
In fact, the AM can run arbitrary user code, and can be written in any programming language since all communication with the RM and NM is encoded using extensible communication protocols—as an example consider. Apache Hadoop YARN Yet Another Resource Negotiator
Attain a more flexible and extensible framework with the world class Axis2 architecture. Mastering ElasticSearch
It has extensible rules that can permit or deny traffic based on many attributes, including, but not limited to, source IP, destination IP, source port, and destination port. Cassandra High Performance Cookbook
The DDL forms a clear, compact, extensible notation describing the layout of the binary records and naming the fields. Interpreting the Data Parallel Analysis with Sawzall
One of the main design goals driving toward Hadoop’s major 1.0 release is a stable and extensible MapReduce API. Hadoop in Action
First we’ve started the XMPP transport connector on port 61222 to expose the broker to clients via XMPP (the Extensible Messaging and Presence Protocol) ActiveMQ in Action
Ant provided a way to describe a Java build that was extensible and platform independent, which was a large part of its success. Building and Testing with Gradle
If your view of the world is different, Gradle gives you the option of customizing the conventions. Gradle in Action
Next, we will go to a web page for customizing services, for example, configuring the location for the NameNode directory. Hadoop Operations and Cluster Management Cookbook
Customizing the Gradle wrapper We can change the names of the script files that are generated with the scriptFile property of the Wrapper task. Gradle Effective Implementation Guide
Our books and publications share the experiences of your fellow IT professionals in adapting and customizing today's systems, applications, and frameworks. Cassandra High Performance Cookbook
CouchDB’s configuration file includes many options for customizing your CouchDB installation. Beginning CouchDB
While this is the fundamental role of configurations, they also provide methods for querying the declared dependencies and customizing the resolution process. Gradle Beyond The Basics
The site XML files (those with site in their filenames) will grow as you start customizing your Hadoop cluster, and it can quickly become challenging to keep track of what changes you’ve made, and how they relate to the default configuration values. Hadoop in Practice
[Flattened table-of-contents excerpt, listing chapter topics: the inverted index; analysis and analyzers; mapping and core field types; the query DSL (queries vs. filters, combining and validating them); sorting; and relevance scoring.] Elasticsearch The Definitive Guide
In using FileInputFormat you focus on customizing RecordReader, which is responsible for parsing an input split into records and then parsing each record into a key/value pair. Hadoop in Action
To begin customizing the ActiveMQ resource adapter to start up an embedded ActiveMQ instance, you need to make some changes to META-INF/ra.xml as shown in the following steps: ActiveMQ in Action
Hive supports a SQL-like query language called HiveQL (HQL) to access data in its table. HBase Administration Cookbook
Similar to Apache Pig, the Hive runtime engine translates HiveQL statements into a sequence of MapReduce jobs for execution. Hadoop Operations and Cluster Management Cookbook
As will be seen throughout this book in the resemblances sections, Pig and SQL (or HiveQL) can be significantly more terse than the raw Hadoop implementations in Java. MapReduce Design Patterns
HiveQL supports ad-hoc queries, joins, summarization, and so on, by transparently converting them into MapReduce execution. Professional Hadoop Solutions
Hive lets you build analytics and manage data using a declarative, SQL-like dialect known as HiveQL. Hadoop RealWorld Solutions Cookbook
Discussion The following HiveQL creates a table which you’ll partition by date as well as by country. Hadoop in Practice
Its query language, HiveQL, is based on SQL, and anyone who is familiar with SQL would have little trouble writing queries in HiveQL. Hadoop:The Definitive Guide 3rd Edition
Its target users remain data analysts who are comfortable with SQL and who need to do ad hoc queries, summarization, and data analysis on Hadoop-scale data. You interact with Hive by issuing queries in a SQL-like language called HiveQL. Hadoop in Action
In particular, it provides a query language called HiveQL that closely resembles the common Structured Query Language (SQL) standard. Hadoop Beginners Guide
Developers, analysts, and existing third-party packages already know and speak SQL (Hive’s dialect of SQL is called HiveQL and implements only a subset of any of the common standards) Hadoop Operations
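A hedged sketch of issuing HiveQL from Java through the HiveServer2 JDBC driver; the connection URL, credentials, and table are placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://localhost:10000/default", "user", "");
                 Statement stmt = con.createStatement();
                 // Hive compiles this HiveQL into MapReduce (or Tez/Spark) jobs.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT country, COUNT(*) FROM logs GROUP BY country")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }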
Intuitively speaking, wildcards with supertype bounds let you write to a generic object, while wildcards with subtype bounds let you read from a generic object. Core Java Volume I Fundamentals 9th Edition
The most common modifiers, which you are surely familiar with, are wildcards. Mastering ElasticSearch
Java supports wildcards as a flexible mechanism to allow subtyping for generics (more generally referred to as use-site variance) Java 8 in Action Lambdas, Streams and Functional-style Programming
We can specify all the part prefixed subdirectories as input to the dependent job using wildcards. Hadoop MapReduce Cookbook
Bash by default performs globbing, meaning that it’ll expand wildcards like “*” Hadoop in Practice
Fieldnames can be specified with wildcards—any field which matches the wildcard pattern will be included in the search. Elasticsearch The Definitive Guide
If an input parameter is both a producer and a consumer, then wildcard types will do you no good: you need an exact type match, which is what you get without any wildcards. Effective Java 2nd Edition
Lower-bounded wildcards are much less commonly used than upperbounded wildcards, and we discuss them later in the chapter. Java in a Nutshell 5th Edition
Chapter 11 presents advanced features provided by ActiveMQ such as wildcards and composite destinations, advisory messages, virtual topics, some info about ActiveMQ plug-ins, and an introduction to message routing with Apache Camel. ActiveMQ in Action
Wildcards are used to declare parameterized types that are either bounded or unbounded. Java 7 Recipes A Problem-Solution Approach
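The producer/consumer guidance above, often remembered as PECS (producer-extends, consumer-super), in a minimal sketch:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collection;
    import java.util.List;

    public class Pecs {
        // src is a producer (we read from it): ? extends.
        // dst is a consumer (we write to it): ? super.
        static <T> void copy(Collection<? extends T> src, Collection<? super T> dst) {
            for (T item : src) {
                dst.add(item);
            }
        }

        public static void main(String[] args) {
            List<Integer> ints = Arrays.asList(1, 2, 3);
            List<Number> nums = new ArrayList<>();
            copy(ints, nums); // T is inferred as Integer
            System.out.println(nums); // [1, 2, 3]
        }
    }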
And finally, Appendix F is an introduction to Sphinx, a high-performance, full-text indexing system that can complement MySQL’s own abilities. High Performance MySQL 3rd Edition
Indexing and searching is fast in Sphinx Search, and it supports distributed queries as well. SQL Antipatterns
It’s common practice that a version control system (VCS) manages the versioning of the build definition and source code files. Gradle in Action
ElasticSearch tries to reduce these issues and provide additional mechanisms as versioning. Mastering ElasticSearch
Things to think about: major impact on storage design and performance; are these storage- or application-level features? Versioning of collaborative documents. Storage Architecture and Challenges
If you are developing an application that requires revision history, it is highly recommended that you build an additional versioning layer rather than use CouchDB's internal layer. Beginning CouchDB
This is a very efficient way to create a prototype, but of course a properly managed plug-in will need a build of its own, so it can be decorated with all of the appropriate testing, versioning, and release processes that attend mature software development. Gradle Beyond The Basics
Versioning—The ability for the file format to support backward or forward compatibility. Hadoop in Practice
We are currently implementing an object-granularity clone operation to create object copies with copy-on-write behavior for efficient storage utilization, and are extending the RADOS client interface to allow transparent versioning for logical point-in-time copies across sets of objects (i. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
Versioning support The data types are capable of being versioned to support updates to the client API. Cassandra The Definitive Guide
A note on versioning Flume has gone through some major changes in recent times. Hadoop Beginners Guide
Versioning As newer versions of Hadoop are related, the process of upgrading the software on disk can be deferred to the package manager. Hadoop Operations
In order to get your authentication details you need to log in to https://dev. Mastering ElasticSearch
This specification defines the mapping of servlets and JSP pages to the HTTP and HTTPS protocols, and the mapping of EJB components to IIOP and SOAP protocols. Java EE 7 Specification
Using curl curl is a lightweight command-line tool for transferring data to or from a server using various protocols, such as HTTP, HTTPS, and FTP. Beginning CouchDB
You can find more details about Hive commands from https://cwiki. Hadoop MapReduce Cookbook
The NameNode has an embedded Jetty HTTP/HTTPS web server, which is used for the SecondaryNameNode to read images and merge them back. Hadoop in Practice
For that reason, the URL class only works with schemes that the Java library knows how to handle, such as http:, https:, ftp:, the local file system (file:), and JAR files (jar:) Core Java Volume II Advanced Features 9th Edition
As of Java 1.4, secure HTTP is also supported with the https:// protocol. Java in a Nutshell 5th Edition
It is valid to specify the wild card IP 0.0.0.0 to indicate the HTTPS server should listen on all interfaces. Hadoop Operations
Standalone Java applications can indicate the proxy server to use by setting the socksProxyHost and socksProxyPort properties (if you’re using a SOCKS proxy server), or http.proxySet, http.proxyHost, http.proxyPort, https.proxySet, https.proxyHost, https.proxyPort, ftpProxySet, ftpProxyHost, ftpProxyPort, gopherProxySet, gopherProxyHost, and gopherProxyPort system properties (if you’re using protocol-specific proxies) Java Network Programming 4th Edition
Secure HTTP (HTTP over SSL or HTTPS) is also supported by this transport: ActiveMQ in Action
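A minimal JDK sketch of opening an HTTPS connection; certificates are validated against the default trust store, and the URL is a placeholder:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import javax.net.ssl.HttpsURLConnection;

    public class HttpsFetch {
        public static void main(String[] args) throws Exception {
            URL url = new URL("https://example.com/");
            HttpsURLConnection conn = (HttpsURLConnection) url.openConnection();
            System.out.println("status: " + conn.getResponseCode());
            System.out.println("cipher: " + conn.getCipherSuite());
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
                System.out.println(in.readLine()); // first line of the response body
            }
        }
    }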
Databases in CouchDB have a sequence number that gets incremented every time the database is changed. CouchDB The Definitive Guide
They’ll change less often (for example, to indicate a new feature) and have to be incremented manually. Gradle in Action
For each record, numWords is incremented by the number of terms in input.txtField returned by the CountWords function. Dremel - Interactive Analysis of WebScale Datasets
The cluster map also includes a list of down or inactive devices and an epoch number, which is incremented each time the map changes. Ceph A Scalable, High-Performance Distributed File System
Generate a new vector with all the values incremented by one. Hadoop in Practice
In the response, we can see that Elasticsearch has incremented. Elasticsearch The Definitive Guide
For each word, its corresponding entry in a multiset called wordCount is incremented by one. Hadoop in Action
Each time a process experiences an event, its clock is incremented by 1 Cassandra The Definitive Guide
Each reduce task writes to a single file with the filename part-r-nnnnn, where nnnnn starts at 00000 and is incremented. Hadoop Beginners Guide
When a data log is full, a new one will be created, and the log number incremented. ActiveMQ in Action
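The counter-incrementing pattern running through these excerpts, in a small sketch: Map.merge for single-threaded counting, AtomicLong where increments may race:

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicLong;

    public class Counters {
        public static void main(String[] args) {
            Map<String, Integer> wordCount = new HashMap<>();
            for (String word : "to be or not to be".split(" ")) {
                wordCount.merge(word, 1, Integer::sum); // entry incremented by one
            }
            // e.g. {not=1, or=1, to=2, be=2}; HashMap order is unspecified.
            System.out.println(wordCount);

            AtomicLong sequence = new AtomicLong();
            System.out.println(sequence.incrementAndGet()); // 1, thread-safe
        }
    }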
The problem with HDFS and streaming data/logs HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, for example being able to mount it. Apache Flume Distributed Log Collection for Hadoop
The scan and DFSIO jobs are canonical benchmarks used to evaluate HDFS and other distributed filesystems run under Hadoop MapReduce; the results in table 1 are a coarse measure of the effect attributable to HDFS in our experiments. Apache Hadoop YARN Yet Another Resource Negotiator
Its implementation addresses a number of problems that are present in a number of distributed filesystems such as Network File System (NFS) Professional Hadoop Solutions
Scribe supports multiple data sinks, including HDFS, regular filesystems, and NFS. Hadoop in Practice
Chapter 3 looks at Hadoop filesystems, and in particular HDFS, in depth. Hadoop:The Definitive Guide 3rd Edition
Other developments include HBase running on filesystems other than Apache HDFS, such as MapR. HBase:The Definitive Guide
You can store a big data set of (say) 100 TB as a single file in HDFS , something that would overwhelm most other filesystems. Hadoop in Action
Like most filesystems, Hadoop has the concept of a home directory for each user. Hadoop Beginners Guide
This is similar to common Linux filesystems such as ext3 Hadoop Operations
Programming Hive introduces Hive, an essential tool in the Hadoop ecosystem that provides an SQL (Structured Query Language) dialect for querying data stored in the Hadoop Distributed Filesystem (HDFS), other filesystems that integrate with Hadoop, such as MapR-FS and Amazon’s S3 and databases like HBase (the Hadoop database) and Cassandra. Programming Hive
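A hedged sketch of reading a file through the Hadoop FileSystem API, assuming a configured Hadoop client on the classpath; the path is a placeholder:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            // Picks up fs.defaultFS from core-site.xml on the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }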
Chapter 8 is a detailed discussion of the event model of the AWT, the abstract window toolkit. Core Java Volume I Fundamentals 9th Edition
The AWT event-handling model requires application developers to implement a listener interface and have that listener register with each component that needs to be monitored. Java Network Programming and Distributed Computing
A number of classes in the java.awt.geom package, such as Point2D.Double, are not serializable. Core Java Volume II Advanced Features 9th Edition
Prominent examples include the Point and Dimension classes in the java.awt package. Effective Java 2nd Edition
However, they were implemented as part of the private com.sun.awt. Java 7 New Features Cookbook
For the second edition, John Zukowski reviewed my Java 1.1 AWT quick reference material, and George Reese reviewed most of the remaining new material. Java in a Nutshell 5th Edition
In comparison to the AWT or I/O, there have been almost no changes and only a few additions. Java Network Programming 4th Edition
This technique is used extensively in the java.awt and javax.swing packages, which define interfaces and abstract classes associated with different kinds of event-handling methods. Concurrent Programming in Java Design Principles and Pattern
For example, you can’t serialize any Swing components (like JButton, JSpinner), and you can’t serialize File objects, but you can serialize the Color class (awt.color, to be more precise) Java 7 Recipes A Problem-Solution Approach
NOTE: This method uses the Color and Image classes from JavaFX, not from java.awt. Java SE 8 for the Really Impatient
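The serializability contrast above in a small sketch: java.awt.Color implements Serializable and round-trips cleanly, whereas writing a Swing component the same way would throw NotSerializableException:

    import java.awt.Color;
    import java.io.*;

    public class SerializeColor {
        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(Color.RED); // Color implements Serializable
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                System.out.println(in.readObject()); // java.awt.Color[r=255,g=0,b=0]
            }
        }
    }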
Packaged distributions (particularly on Linux and Debian) typically distribute those headers in a dev package and install them into the directory /usr/include. Monitoring with Ganglia
To push your application to a development database, you could use couchapp push dev. CouchDB The Definitive Guide
Special thanks to Spencer Allain, Jonathan Keam, and Robert Wenner for thoroughly reading every chapter of the book and providing me with line-by-line edits and comments at different stages of development; Michael McGarr and Samuel Brown for bouncing around ideas that involved content on continuous delivery and DevOps; and Baruch Sadogursky from JFrog for the technical review of chapter 14 and for promoting the book even before it was released. Gradle in Action
In the sample build file, we only want to sign the artifacts if the uploadArchives task is part of the Gradle task graph to be executed and if the version of the project doesn't end with the String value DEV: Gradle Effective Implementation Guide
In order to get your authentication details you need to log in to https://dev. Mastering ElasticSearch
For those interested in developing or fixing features in Cassandra, it is common to join the dev mailing list and chat room. Cassandra High Performance Cookbook
Fuse development files can be found in the fuse-devel RPM for Redhat/Fedora and in the libfuse-dev package for Debian/Ubuntu. Hadoop MapReduce Cookbook
Replace /dev/md1 with the name of your disk device (consult /etc/fstab for device names): Hadoop in Practice
Download the JDBC driver for your database; for MySQL, we find it at http://dev. Hadoop Beginners Guide
The continued growth in software engineering maturity today can be seen in the DevOps movement, climaxing with the publication last year of Jez Humble and David Farley’s book Continuous Delivery (AddisonWesley) Building and Testing with Gradle
You can easily combine these functions with the orElse and andThen combinators you learned at the beginning of this section: Scala in Action
In the next chapter, we look more closely at functional programming techniques such as higher-order functions, currying, persistent data structures, lazy lists, pattern matching, and combinators. Java 8 in Action Lambdas, Streams and Functional-style Programming
The MapReduce model originates from the map and reduce combinators concept in functional programming languages such as Lisp. Professional Hadoop Solutions
Examples can be found in standard libraries such as parser combinators and actors. Programming in Scala 2nd Edition
We want to start writing combinators that let us avoid explicitly passing along the state. Functional Programming in Scala
External DSLs with Parser Combinators When you write a parser for an external DSL, you can use a parser generator tool like Antlr (see [Antlr]) Programming Scala
Let's explore advanced CSS selectors such as adjacent sibling combinators and pseudoclasses as described in the following sections. Selenium Testing Tools Cookbook
In this chapter, we’re going to explore combinators and Scala’s parser combinatory library. Beginning Scala
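Java 8 ships the same style of combinators on its functional interfaces; a minimal sketch of andThen, compose, and predicate combinators:

    import java.util.function.Function;
    import java.util.function.Predicate;

    public class Combinators {
        public static void main(String[] args) {
            Function<Integer, Integer> doubled = n -> n * 2;
            Function<Integer, Integer> plusOne = n -> n + 1;
            // andThen applies the left function first: (3 * 2) + 1 = 7
            System.out.println(doubled.andThen(plusOne).apply(3));
            // compose applies the right function first: (3 + 1) * 2 = 8
            System.out.println(doubled.compose(plusOne).apply(3));

            Predicate<String> nonEmpty = s -> !s.isEmpty();
            Predicate<String> shortStr = s -> s.length() < 10;
            System.out.println(nonEmpty.and(shortStr).test("hi")); // true
        }
    }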
Object-oriented programming (OOP) is now in the mainstream of programming practice, and Java is an object-oriented programming language. Core Java Volume I Fundamentals 9th Edition
As Java programmers, we’re well versed in object-oriented programming (OOP) Programming Concurrency on the JVM
However, I will be introducing the concepts of object-oriented programming (OOP) and Java’s basic control mechanisms. Thinking in Java 4th Edition
This approach borrowed from an OOP world solves the problem of maintaining jobs sharing a number of identical properties. Jenkins The Definitive Guide
Because in the client you have to create a JSON object, you’ll import the JSONStringer object that lets you create JSON strings in an OOP way. RabbitMQ in Action
An OOP is normally the same size as the machine pointer. Cassandra High Performance Cookbook
Other languages that have integrated some elements of functional programming into OOP include Ruby, Smalltalk, and Python. Programming in Scala 2nd Edition
Smalltalk captured the essence of object-oriented programming (OOP) as that model of programming entered the mainstream. Programming Scala
But in fact, most of the Java classes we define have getters and setters, so there is data exposed in OOP. Beginning Scala
The HotSpot VM internally accesses the instanceKlass using an internal data structure called a klassOop. Java Performance
It also assumes that filenames never change; otherwise, the source would lose its place on restarts as to which files have been sent and which have not. Apache Flume Distributed Log Collection for Hadoop
As of Ganglia 3.3, hostnames are converted to lowercase, RRD files are created with lowercase filenames, and lowercase hostnames should be used in URLs to access the metrics. Monitoring with Ganglia
This section will explain how to configure your project to use custom build filenames. Gradle in Action
The matching pattern style is described as an ANT-style matching pattern because the ANT build tool uses this style to define a syntax for matching filenames in file trees. Gradle Effective Implementation Guide
Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "What we would like to do is, use the BM25 similarity model for the name field and the contents field." Mastering ElasticSearch
A real-world build routinely copies files from place to place, recursing directory trees, pattern-matching filenames, and performing string operations on file content. Gradle Beyond The Basics
By Hadoop convention filenames for these files end in .deflate. Hadoop in Action
Indicates new terms, URLs, email addresses, filenames, and file extensions. Cassandra The Definitive Guide
On a computer filesystem, filenames (keys) allow access to any sort of data, such as text, images, and sound (values) Hadoop Beginners Guide
Italic Indicates new terms, URLs, email addresses, filenames, and file extensions. Building and Testing with Gradle
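The ANT-style filename matching mentioned above has a JDK analogue in NIO's PathMatcher; a small sketch:

    import java.nio.file.FileSystems;
    import java.nio.file.Path;
    import java.nio.file.PathMatcher;
    import java.nio.file.Paths;

    public class GlobDemo {
        public static void main(String[] args) {
            // "glob:" patterns: * matches within a segment, ** crosses directories.
            PathMatcher matcher = FileSystems.getDefault()
                    .getPathMatcher("glob:**/*.java");
            Path hit = Paths.get("src/main/App.java");
            Path miss = Paths.get("README.md");
            System.out.println(matcher.matches(hit));  // true
            System.out.println(matcher.matches(miss)); // false
        }
    }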
However, for those applications that were built on EJB or need to deploy the Spring-based applications in a JEE container and utilize the application server’s enterprise services (e.g., JTA Transaction Manager, data source connection pooling, JMS connection factories, etc.), Spring also provides simplified support for those technologies. Pro Spring 3
There is a transaction manager that handles the transactions, and it even has support for JTA (Java Transaction API), which enables distributed or global transactions. Pro Spring MVC
Additional threads that are created for any purpose should not attempt to start JTA transactions. Java EE 7 Specification
You can use the Java Transaction API (JTA) to make transactions span both databases: the batch tables and the business tables will always be synchronized, but you’ll add overhead because managing multiple transactional resources is more expensive than managing just one. Spring Batch in Action
Spring deals with common resource management “plumbing” code and allows for declarative transaction management in various situations ranging from managed environments with JTA (Java Transaction API, the transaction standard in Java) to native database transaction with plain JDBC. Spring Dynamic Modules in Action
TransactionScoped is a new CDI scope that defines bean instances whose life cycle is scoped to the currently active JTA transaction. Java EE 7 Essentials
It’s transactional, supporting JTA/JTS, two-phase commit, deadlock detection, and MVCC. Cassandra The Definitive Guide
Note that the JTA transaction manager has been added and a reference to it has been added to the listener-container. ActiveMQ in Action
The typical situation when running inside the Java EE container environment is that the standard Java Transaction API (JTA) is used. Pro JPA2
It is written predominantly in Java and leverages a custom storage format and the facilities of the Java Transaction Architecture (JTA) to provide XA transactions. Spring Data Modern Data Access for Enterprise Java
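The excerpts above all describe JTA's role in coordinating transactions that span multiple resources. As a minimal sketch (not tied to any one of the books above), programmatic demarcation through the standard javax.transaction.UserTransaction interface looks roughly like this; the two resource operations are hypothetical placeholders, while "java:comp/UserTransaction" is the standard JNDI name:

    import javax.naming.InitialContext;
    import javax.transaction.UserTransaction;

    public class TransferService {
        public void transfer() throws Exception {
            // In a Java EE container, UserTransaction is obtained via JNDI.
            UserTransaction utx = (UserTransaction)
                    new InitialContext().lookup("java:comp/UserTransaction");
            utx.begin();
            try {
                // Hypothetical operations: under JTA these would enlist two
                // XA resources (e.g., a database and a JMS broker) in the
                // same global transaction.
                debitDatabase();
                publishMessage();
                utx.commit();   // two-phase commit across the enlisted resources
            } catch (Exception e) {
                utx.rollback(); // both resources roll back together
                throw e;
            }
        }

        private void debitDatabase() { /* ... */ }
        private void publishMessage() { /* ... */ }
    }

As the Spring Batch in Action excerpt notes, this coordination keeps the resources synchronized at the cost of managing more than one transactional resource.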
Gradle’s answer to such a scenario is finalizer tasks, which are regular Gradle tasks scheduled to run even if the finalized task fails. Gradle in Action
If you don’t want to return this key, you can use a finalizer to change the final accumulator document into anything, even a nondocument (e.g., a number or string) MongoDB The Definitive Guide 2nd Edition
The particular definition of finalize that can be invoked for an object is called the finalizer of that object. JSR-133 Java Memory Model and Thread
If an object declares a finalizer, the finalizer is executed before the object's storage is reclaimed. The Java Language Specification Java SE 7 Edition
However, since a finalize method can contain arbitrary code, there are several potentially bad scenarios to consider as well, for example, the case that a finalizer resurrects the object, or clones a new object from the dying one, in effect preventing an object from being garbage collected. Oracle JRockit The Definitive Guide
It can take arbitrarily long between the time that an object becomes unreachable and the time that its finalizer is executed. Effective Java 2nd Edition
A finalizer in Java is the opposite of a constructor. Java in a Nutshell 5th Edition
Finalizers postpone garbage collection until the finalizer is run, adding yet more overhead to the GC algorithm. Java Performance Tuning 2nd Edition
These arise from a combination of internal HotSpot VM threads and HotSpot VM library related threads such as the reference handler and finalizer threads. Java Performance
In this case, you use the finalizer to calculate the average number of votes per review. MongoDB in Action
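As several of the excerpts above warn, a Java finalizer runs at an unpredictable time, can resurrect its object, and delays garbage collection. A minimal illustration of the mechanism (the class and field names are invented for the example):

    public class TempResource {
        private long nativeHandle = 42; // hypothetical native resource

        // Invoked by the garbage collector at some arbitrary point after the
        // object becomes unreachable -- possibly never before the VM exits.
        @Override
        protected void finalize() throws Throwable {
            try {
                nativeHandle = 0; // last-chance cleanup
            } finally {
                super.finalize(); // always chain to the superclass finalizer
            }
        }
    }

Because of the timing and resurrection problems the excerpts describe, explicit close() methods (or try-with-resources) are generally preferred over finalizers.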
In fact, in cooperation with the sFlow agent, Ganglia may be used to monitor network gear such as routers and switches (see Chapter 8 for more information) Monitoring with Ganglia
The Maven Reactor lays out the build order: the parent pom, then the subprojects. Netty in Action MEAP V08
You’ll start by deconstructing the existing project structure and break out individual, functional subprojects. Gradle in Action
In this case you have only one project, but multiproject builds will declare all the subprojects here. Scala in Action
Example 3-2 has three builds, the parent project and two subprojects. Gradle Beyond The Basics
Over the course of three years, the Hadoop project has blossomed and spun off half a dozen subprojects. Hadoop:The Definitive Guide 3rd Edition
The most prominent and well-supported ones have officially become subprojects under the umbrella of the Apache Hadoop project.1 These subprojects include. Hadoop in Action
Parts of Hadoop The top-level Hadoop project has many component subprojects, several of which we'll discuss in this book, but the two main ones are Hadoop Distributed File System (HDFS) and MapReduce. Hadoop Beginners Guide
Instead, you’ll find three subprojects called Common, HDFS, and MapReduce shown as tabs along the top of the page. Hadoop Operations
We’ve also seen how NMS and CMS subprojects help create more complex, JMS-like APIs for environments that deserve this kind of support. ActiveMQ in Action
However, for a given build, not all tasks must be present in the top-level build file, since Gradle supports dependent subprojects, or nested builds. Building and Testing with Gradle
The rest of the code block defines the parameters that will be passed to the metric init function and iterates through each metric definition, calls the specified callback function with the metric name as a parameter, and finally prints out the metric value that was gathered. Monitoring with Ganglia
Iterates through all subprojects accessible through the root project; sets a custom build filename for a subproject by using the subprojects’ names. Gradle in Action
The while-loop of the algorithm (Line 5) iterates over all atomic and record-valued fields contained in a given record. Dremel - Interactive Analysis of WebScale Datasets
The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. MapReduce - Simplied Data Processing on Large Clusters
When an active OSD receives an updated cluster map, it iterates over all locally stored placement groups and calculates the CRUSH mapping to determine which ones it is responsible for, either as a primary or replica. Ceph A Scalable, High-Performance Distributed File System
The process iterates until the skip range is within the acceptable size. Hadoop in Action
Our engine iterates on input rows and uses generated procedures that do both. Tenzing - A SQL Implementation On The MapReduce Framework
The example created a Ruby script that opens an Avro datafile, iterates through each datum, and displays it based on explicitly named fields. Hadoop Beginners Guide
Next it iterates through all available queues and prints their current size and number of consumers subscribed to them. ActiveMQ in Action
This Groovy code declares a list of four strings (each of which is the name of a poet), then iterates over that list. Building and Testing with Gradle
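Several of these excerpts describe the same pattern: iterating over grouped or sorted data and applying a function per key. A plain-Java sketch of the reduce-style iteration the MapReduce excerpt describes (the data and names are illustrative):

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class ReduceLoop {
        public static void main(String[] args) {
            // Sorted intermediate data: key -> all values emitted for that key.
            Map<String, List<Integer>> intermediate = new TreeMap<>(Map.of(
                    "apple", List.of(1, 1, 1),
                    "pear", List.of(1)));

            // Iterate over each unique key and pass its values to a "reduce" step.
            for (Map.Entry<String, List<Integer>> entry : intermediate.entrySet()) {
                int sum = 0;
                for (int v : entry.getValue()) {
                    sum += v;
                }
                System.out.println(entry.getKey() + "\t" + sum);
            }
        }
    }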
Two Jetty connectors are defined for HTTP and for the secure TLS traffic. Jenkins Continuous Integration Cookbook
This specification also describes standard ways to extend Java EE with connectors to other non-Java EE application systems, such as mainframe systems and ERP systems. Java EE 7 Specification
Sqoop has the notion of Connectors, which contain the specialized logic to read and write to external systems. Hadoop in Practice
Applications using the Thrift, JDBC, and ODBC connectors need to run a Hive server to communicate with Hive. Hadoop:The Definitive Guide 3rd Edition
To add a user-defined connection, simply click on the Connectors folder and click on the Create Connection button. Oracle JRockit The Definitive Guide
Many of these connectors are open source, while others are free or available from commercial vendors at a cost. Hadoop Operations
It discusses ActiveMQ URIs and all the transport connectors for both client-to-broker and broker-to-broker communications including TCP, NIO, STOMP, failover, SSL, HTTP/S, and much more. ActiveMQ in Action
The server consists of one or more connectors that share a single engine component. Java Performance
Such tools are typically designed to make it easy to load content stored in various systems, and sometimes provide pre-built connectors to common content stores, such as web sites, databases, popular content management systems, filesystems, etc. Lucene in Action 2nd Edition
Many NoSQL databases have implemented Hive connectors using custom adapters. Programming Hive
If it is successful, create a ByteBuf to hold the data. Netty in Action MEAP V08
The default implementation of this method simply calls the engineDigest( ) method and then copies the result into buf. Java Security 2nd Edition
The “buf.type” that appears in the interpreter responses in this and several other examples in this section is a singleton type. Programming in Scala 2nd Edition
Abbreviations, as in buf holding a pointer to a buffer of some kind. The Java Language Specification Java SE 7 Edition
The buf argument is the address of the String buffer that stores the resulting message. Understanding MySQL Internals
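For the Netty excerpt above, a hedged sketch of creating and filling a ByteBuf through Netty's Unpooled helper, assuming Netty 4.x is on the classpath:

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import java.nio.charset.StandardCharsets;

    public class BufExample {
        public static void main(String[] args) {
            // Allocate an unpooled, heap-backed buffer to hold the data.
            ByteBuf buf = Unpooled.buffer(16);
            buf.writeBytes("hello".getBytes(StandardCharsets.UTF_8));

            // Reader and writer indexes are tracked independently.
            byte[] out = new byte[buf.readableBytes()];
            buf.readBytes(out);
            System.out.println(new String(out, StandardCharsets.UTF_8));

            buf.release(); // reference-counted: release when done
        }
    }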
Gradle figures out for you which of the subproject dependencies need to be rebuilt, without having to store a subproject’s artifact in the local cache. Gradle in Action
We apply the Java plugin to each subproject and add a testCompile dependency on the JUnit libraries. Gradle Effective Implementation Guide
This causes Gradle to look explicitly in the individual subproject’s directory for the grammars directory, rather than only at the root project level. Gradle Beyond The Basics
Chukwa is an Apache subproject of Hadoop that also offers a large-scale mechanism to collect and store data in HDFS. Hadoop in Practice
If you need to know which package a class is in, you can easily look it up in Hadoop’s Java API documentation for the relevant subproject, linked to from the Apache Hadoop home page at http://hadoop.apache.org/ Hadoop:The Definitive Guide 3rd Edition
Hadoop started out as a subproject of Nutch, which in turn was a subproject of Apache Lucene. Hadoop in Action
HBase HBase is a clone of Google’s Bigtable, originally created for use with Hadoop (it’s actually a subproject of the Apache Hadoop project) Cassandra The Definitive Guide
Doug started work on the implementations of these Google systems, and Hadoop was soon born, firstly as a subproject of Lucene and soon was its own top-level project within the Apache open source foundation. Hadoop Beginners Guide
The Common subproject was to contain code shared by both HDFS and MapReduce, while HDFS and MapReduce would be home to their respective components. Hadoop Operations
Ant, though the older of the JVM build tools, has been supplemented by the Ivy subproject to accomplish exactly this. Building and Testing with Gradle
We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach – all driven by real-life Google production workloads. Omega - flexible, scalable schedulers for large compute clusters
Other administrative tasks include the management of Hadoop jobs, for example configuring job scheduling policy with schedulers. Hadoop Operations and Cluster Management Cookbook
This makes it possible to tightly enforce global scheduling properties (different schedulers in YARN focus on different global properties, such as capacity or fairness), but it requires the scheduler to obtain an accurate understanding of applications’ resource requirements. Apache Hadoop YARN Yet Another Resource Negotiator
When talking about merge policies and merge schedulers, it would be nice to visualize them. Mastering ElasticSearch
The good news is that Hadoop scheduler is pluggable, and it comes with two other schedulers. Hadoop MapReduce Cookbook
Summary The amount of parallelism available to you is a function of your cluster capacity, the other jobs running at the same time as your job, and any schedulers configured for your environment. Hadoop in Practice
The default in MapReduce 1 is the original FIFO queue-based scheduler, and there are also multiuser schedulers called the Fair Scheduler and the Capacity Scheduler. Hadoop:The Definitive Guide 3rd Edition
Fortunately, starting with version 0.19, Hadoop has a pluggable architecture for the scheduler, and two new schedulers have become available for resolving job contention. Hadoop in Action
Unless we use one of the alternative schedulers that we will discuss in later sections, the FIFO scheduler dedicates the full cluster to the sole currently running job. Hadoop Beginners Guide
In Chapter 7, we’ll cover how each of the schedulers may be configured to suit different environments and use cases. Hadoop Operations
The WAL is used to track all input and output from the channel in an atomically safe way. Apache Flume Distributed Log Collection for Hadoop
Storing all metadata in a single Bigtable row allows us to update it atomically through a single Bigtable transaction. Megastore - Providing Scalable, Highly Available Storage for Interactive Services
The entry data is accessed atomically, and changes are ordered. Pro Hadoop
When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. MapReduce - Simplied Data Processing on Large Clusters
When it has moved all but a nominal amount of the data, it uses a transaction to atomically move that nominal amount and update the metadata for the two Paxos groups. Spanner - Google’s Globally-Distributed Database
It turns out there’s no direct analogue in the relational world! Column families exist in HBase as a way for a single row to contain disjoint sets of columns in a way that’s physically efficient but can be processed atomically. HBase in Action
Files must be atomically moved into the source directory (file moves in both the Linux3 and HDFS filesystems are atomic) Hadoop in Practice
Cell values can be interpreted as counters and updated atomically. HBase:The Definitive Guide
The contents of a file are read atomically and in their entirety. Chubby - The lock service for loosely-coupled distributed systems
In particular, composite read/write operations such as the "++'' operation on volatile variables are not performed atomically. Concurrent Programming in Java Design Principles and Pattern
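The last excerpt's point about "++" on volatile variables is easy to demonstrate: volatile guarantees visibility, not atomicity, so a composite read-modify-write needs an atomic class or a lock. A minimal sketch:

    import java.util.concurrent.atomic.AtomicInteger;

    public class Counters {
        private volatile int unsafeCount = 0;        // visibility only
        private final AtomicInteger safeCount = new AtomicInteger();

        public void incrementUnsafe() {
            unsafeCount++; // read-modify-write: NOT atomic, updates can be lost
        }

        public void incrementSafe() {
            safeCount.incrementAndGet(); // single atomic CAS-based operation
        }
    }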
Launch the console and issue the x command, which clears the classloader cache. Core Java Volume I Fundamentals 9th Edition
If a user-defined classloader prefetches binary representations of classes and interfaces, or loads a group of related classes together, then it must reflect loading errors only at points in the program where they could have arisen without prefetching or group loading. The Java Virtual Machine Specification Java SE 7 Edition
This is generally a good idea, as it can avoid classloader issues related to conflicts with Ant’s own libraries. Jenkins The Definitive Guide
Classes in Java are loaded by a classloader object, which consults the directories and. Java Threads 3rd Edition
It indicates that the dependencies you’re providing will be available to the classloader during the rest of the build script execution, which is exactly what we need. Gradle Beyond The Basics
The JNDI naming context, classloader, and security context is propagated to the proxied object. Java EE 7 Essentials
The option of splitting the classes per classloader is also available in the Type Graph. Oracle JRockit The Definitive Guide
A further extension to this idea is to include your own classloader that opens the ZIP/JAR file itself and reads in all files sequentially, loading them into memory immediately. Java Performance Tuning 2nd Edition
This will cause classloader issues and will prevent ActiveMQ from being deployed correctly. ActiveMQ in Action
Java 7 simply adds a close method to close the classloader. Java SE 8 for the Really Impatient
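Several of these excerpts refer to user-defined classloaders that locate class bytes themselves. A skeletal custom ClassLoader, with the byte-loading step left as an explicit assumption:

    public class MemoryClassLoader extends ClassLoader {
        @Override
        protected Class<?> findClass(String name) throws ClassNotFoundException {
            // Hypothetical lookup: a real loader might read from a ZIP/JAR
            // preloaded into memory, as one excerpt above suggests.
            byte[] bytecode = loadBytesFor(name);
            if (bytecode == null) {
                throw new ClassNotFoundException(name);
            }
            // defineClass turns raw bytecodes into a Class object.
            return defineClass(name, bytecode, 0, bytecode.length);
        }

        private byte[] loadBytesFor(String name) {
            return null; // placeholder -- supply your own source of class bytes
        }
    }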
You’ll configure the Maven deployer to upload both files to a local directory. Gradle in Action
We must configure the repository for our uploadArchives task via a closure argument of the mavenDeployer() method. Gradle Effective Implementation Guide
Because there’s no gold copy of the configuration for a given set of server instances, the deployer has to go onto each server and update the configuration manually. Spring in Practice
The Application Assembler is responsible for providing assembly instructions describing external dependencies of the application that the Deployer must resolve in the deployment process. Java EE 7 Specification
At the heart of the Spring DM web support is its web extender, which contains a WAR deployer, which we’ll take a look at to see how it differs from a standard web deployer. Spring Dynamic Modules in Action
You decide what components to include via the Geronimo deployer. ActiveMQ in Action
For GlassFish, it is recommended that the deployer use commit option C for entity beans that are used only in transactions or used only once or twice (hence no benefit in caching the instances) Java Performance
The deployment process is optimized for these kinds of deployment scenarios to minimize the amount of metadata that a deployer has to specify. Pro JPA2
The file chooser calls the isTraversable method to decide whether to open a directory when a user clicks on it. Core Java Volume I Fundamentals 9th Edition
Then we instantiate a variable x as an anonymous subclass of A, such that type B is stabilized at Traversable[Int] Scala in Depth
Traversable is the parent trait for all the collection types in Scala, and the ++ method is only defined in this trait. Scala in Action
Nodes know about INCOMING and OUTGOING relationships that are traversable both ways. NoSQL Distilled
If the current vertex value is still equal to Integer.MAX_VALUE, we know that the graph never received any incoming messages intended for that vertex, which implies it is not traversable by the source vertex. Hadoop RealWorld Solutions Cookbook
This method selects those elements of a collection that pass a test the user supplies. (A foreach method is defined in trait Traversable, a common supertrait of List, Set, Array, and Map.) Programming in Scala 2nd Edition
But traversable data types are too numerous for us to write specialized methods for each of them. Functional Programming in Scala
The xs parameter can be any collection with the TraversableOnce trait, the most general trait in the Scala collections hierarchy. Scala for the Impatient
As described earlier in Section 2.1, Dapper traces are modelled as trees of trace spans, so the Trace data structure is consequently a simple traversable tree of individual Span structures. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
In addition to receiving confined references along call chains, the method invocations executing within a single thread can access the Thread object representing the thread they are running in, and any further information traversable from there. Concurrent Programming in Java Design Principles and Pattern
In the Java source, to connect to HBase, we first create a Configuration object and then use it with the table name to create an HTable instance. HBase Administration Cookbook
All programmatic data-manipulation access to HBase is done through either the HTableInterface or the HTable class that implements HTableInterface. Professional Hadoop Solutions
The HTable constructor reads the default configuration information to locate HBase, similar to the way the shell did. HBase in Action
Finally, we call close() on our HTable instance to flush out any write buffers not yet cleared. Hadoop:The Definitive Guide 3rd Edition
Here is a summary of the points we just discussed: • Create HTable instances only once, usually when your application. HBase:The Definitive Guide
The HTable class is the main way in Java to interact with HBase. Spring Data Modern Data Access for Enterprise Java
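Tying the HTable excerpts together, a minimal sketch using the classic (pre-1.0) HBase client API these books describe; the table, family, and qualifier names are invented:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HTableExample {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "users"); // hypothetical table name

            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);

            Result result = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

            table.close(); // flushes any buffered writes, as noted above
        }
    }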
You can now start CouchDB using the new user: sudo -i -u couchdb couchdb -b. Getting Started with CouchDB
Finally, to install gmond, type: user@host:# sudo yum install ganglia-gmond. Monitoring with Ganglia
Build and install the CouchDB ebuild by running: sudo emerge couchdb. CouchDB The Definitive Guide
This user should have sudo privileges to install software packages, configure system services, and so on. Hadoop Operations and Cluster Management Cookbook
Please remember to run the jstat command from the same account that ElasticSearch is running under or, if that is not possible, run the jstat command with administrator privileges (for example, using the sudo command on Linux systems) Mastering ElasticSearch
If you get a message informing you that curl is not available, simply use the command sudo apt-get install curl to download and install it automatically. Beginning CouchDB
You’ll need sudo access for your user to run this command (it launches the Hadoop services via init.d scripts): Hadoop in Practice
We also need to change the owner of the Hadoop files to be the hadoop user and group: % sudo chown -R hadoop:hadoop hadoop-x.y.z. Hadoop:The Definitive Guide 3rd Edition
In order to get autoscan, which the Thrift bootstrap needs, you’ll have to run this command: $ sudo apt-get install automake. Cassandra The Definitive Guide
For instance, assuming the HDFS super user is hdfs, it would normally be possible to perform administrative activities using sudo like in Example 6-3 Hadoop Operations
Of course, interpreting bytecodes is necessarily slower than running machine instructions at full speed, so it isn’t clear that this is even a good idea. Core Java Volume I Fundamentals 9th Edition
A class file contains Java Virtual Machine instructions (or bytecodes) and a symbol table, as well as other ancillary information. The Java Virtual Machine Specification Java SE 7 Edition
When a class must be loaded (typically, the first time you want to create an object of that class), the .class file is located, and the bytecodes for that class are brought into memory. Thinking in Java 4th Edition
But if the program came from an unknown source, how do we know that the bytecodes we've received are actually legal? Java Security 2nd Edition
Thus, the JVM is required to turn the bytecodes into native code for the CPU on which the Java application executes. Oracle JRockit The Definitive Guide
That lets you carry out custom checks before you pass the bytecodes to the virtual machine. Core Java Volume II Advanced Features 9th Edition
A long time ago, when Java was Oak and it was being designed as a way to distribute untrusted code into set-top boxes (and later browsers), the rules defining how a program executed and the meaning of the instruction set (bytecodes) were super-important. Beginning Scala
Generating the bytecodes without the extra constructor is perfectly legal. Java Performance Tuning 2nd Edition
People loved the simplicity of the language, the portability of bytecodes, and the safety of garbage collection (versus traditional malloc/free memory management of other systems) Java Performance
This mapping strategy also applies at the level of the CPU when operations are compiled into native code rather than interpreted as bytecodes. Concurrent Programming in Java Design Principles and Pattern
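Since several excerpts describe class files as containers of bytecodes plus a symbol table, here is a small sketch that inspects the documented class-file header: the magic number 0xCAFEBABE followed by the minor and major version numbers (this reads only the fixed header, nothing more):

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class ClassFilePeek {
        public static void main(String[] args) throws IOException {
            try (DataInputStream in =
                     new DataInputStream(new FileInputStream(args[0]))) {
                int magic = in.readInt();            // must be 0xCAFEBABE
                int minor = in.readUnsignedShort();  // minor_version
                int major = in.readUnsignedShort();  // major_version (51 = Java 7)
                System.out.printf("magic=%08X minor=%d major=%d%n",
                        magic, minor, major);
            }
        }
    }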
How all of these components can be used to build complicated data workflows will become clear as the book proceeds. Apache Flume Distributed Log Collection for Hadoop
With its plugin system and rapidly growing community, you’ll need to take into consideration existing JavaScript automation workflows. Gradle in Action
Still worse, instead of losing a single workflow, a JobTracker failure caused an outage that would lose all the running jobs in a cluster and require users to manually recover their workflows. Apache Hadoop YARN Yet Another Resource Negotiator
Hive and Pig provide a clean abstraction layer between your data flow and meaningful queries, and the complex MapReduce workflows they compile to. Hadoop RealWorld Solutions Cookbook
We will also be discussing other advanced Hadoop features such as using DistributedCache for distribute data, using Hadoop Streaming for quick prototyping of Hadoop computations, and using Hadoop counters to report custom metrics for your computation as well as adding job dependencies to manage simple DAG-based workflows of Hadoop MapReduce computations. Hadoop MapReduce Cookbook
For our scenario Oozie has a coordinator engine that can start workflows based on data and time triggers. Hadoop in Practice
Oozie A service for running and scheduling workflows of Hadoop jobs (including MapReduce, Pig, Hive, and Sqoop jobs) Hadoop:The Definitive Guide 3rd Edition
Each page in the ES2 collection is pushed through multiple logical workflows, each consisting of a local analysis phase, a global analysis phase, and an appropriate variant generation strategy. Hadoop in Action
Smart software, dumb hardware If we wish to see a cluster of hardware used in as flexible a way as possible, providing hosting to multiple parallel workflows, the answer is to push the smarts into the software and away from the hardware. Hadoop Beginners Guide
Workflows can be triggered by time or events such as data arriving in a directory, and job failure handling logic can be implemented so that policies are adhered to. Hadoop Operations
Identity (codenamed "Keystone") provides authentication and authorization for all the OpenStack services. bk-compute-adminguide-grizzly
Keystone Provides an authentication and authorization service for other OpenStack services. bk-admin-guide-cloud-latest
As a result of the influence elephants have over many plants and animals, they are often referred to as a keystone species because they are vital to the long-term survival of the ecosystems in which they live. Hadoop:The Definitive Guide 3rd Edition
OpenStack services that are stateless include nova-api, nova-conductor, glance-api, keystone-api, neutron-api and nova-scheduler. high-availability-guide
To see the version of the keystone client, run the following command: admin-user-guide-trunk
The OpenStack Identity Service (Keystone) is the point that provides the authentication decisions and user attribute information, which is then used by the other OpenStack services to perform authorization. OpenStack Operations Guide
In order to inform HBase where Java is installed, we will set JAVA_HOME in hbase-env. HBase Administration Cookbook
Per-process runtime environment: The file hadoop-env.sh is responsible for tailoring the per-process environment. Pro Hadoop
Based on the provided environment value with the key env, Maven will pick the appropriate profile. Gradle in Action
File Description hadoop-env.sh Configures the environment variable used by Hadoop core-site.xml Configures parameters for the whole Hadoop cluster hdfs-site.xml Configures parameters for HDFS and its clients mapred-site. Hadoop Operations and Cluster Management Cookbook
The #!/usr/bin/env python statement is a hint to tell the shell how to execute the script. Hadoop RealWorld Solutions Cookbook
You can check if it's there by using the env command on UNIX-like systems and the set command on Windows. Elasticsearch in Action MEAP V07 cHAPTER 1
If you’ve set up your cluster with LZO compression, you’ll need to create a flume-env.sh file and set the directory that contains the native compression codecs: Hadoop in Practice
Environment Settings In this section, we consider how to set the variables in hadoop-env.sh. Hadoop:The Definitive Guide 3rd Edition
In hadoop-env.sh define the JAVA_HOME environment variable to point to the Java installation directory. Hadoop in Action
Later, we’ll use configuration variables in hadoop-env.sh to specify alternate locations for Hadoop’s log and pid file directories. Hadoop Operations
Finally, while files are being written to the HDFS, a .tmp extension is added. Apache Flume Distributed Log Collection for Hadoop
We also need to create a /tmp directory on HDFS, so that Hive can save its temporary files under it. HBase Administration Cookbook
For instance, here’s a sample command for piping the output of running gmond in debug mode to the file /tmp/gmond.log: Monitoring with Ganglia
In a standard Cygwin installation, the /tmp directory is also the C:\cygwin\tmp directory. Pro Hadoop
Set the sticky bit on HDFS directory to prevent the directories or files from being deleted by unauthorized users with the following command: sudo -u hdfs hadoop fs -chmod 1777 /tmp. Hadoop Operations and Cluster Management Cookbook
You would therefore (incorrectly) assume that the command hadoop fs -ls /tmp/* would work. Hadoop in Practice
Test Drive To start a temporary instance of HBase that uses the /tmp directory on the local filesystem for persistence, type: Hadoop:The Definitive Guide 3rd Edition
The default values for dfs.name.dir and dfs.data.dir point to directories under /tmp, which is intended only for temporary storage in almost all Unix systems. Hadoop in Action
Regardless of the mode, Hadoop by default uses the hadoop.tmp.dir property as the base directory under which all files and data are written. Hadoop Beginners Guide
Many new to Hadoop have quickly set up Hadoop, missed setting dfs.name.dir, and found themselves with an unusable filesystem after rebooting the namenode because /tmp was cleared during boot. Hadoop Operations
You’ve most likely created these connections between nodes by declaring a dependsOn relationship for a task or by leveraging the implicit task dependency interference mechanism. Gradle in Action
It redefines the way that data is managed and processed by leveraging the power of computing resources composed of commodity hardware. Hadoop Operations and Cluster Management Cookbook
We showcase this by leveraging the notion of container reuse and push-based communications provided by REEF. Apache Hadoop YARN Yet Another Resource Negotiator
This chapter will introduce readers to these two frameworks, which are capable of leveraging the distributed power of MapReduce. Hadoop RealWorld Solutions Cookbook
By declaring and modelling dependencies, tooling such as Gradle can automate working with dependencies by leveraging this information. Gradle Beyond The Basics
Finally, by opening Dapper’s trace repositories to internal developers we have enabled the creation of many more analysis tools than the Dapper team alone could have been able to produce in isolation, greatly leveraging the design and implementation effort. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern Web systems. The anatomy of a large-scale hypertextual Web search engine
Among many notable “Aha!” moments I had while working with Cynthia, the biggest one was when she steered me into leveraging visual aids to help explain some of the complex concepts in this book. Hadoop in Practice
We present the design and implementation of RADOS, a reliable object storage service that can scale to many thousands of devices by leveraging the intelligence present in individual storage nodes. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
When considering leveraging an existing investment in Apache Ant, or perhaps using the broad set of tasks that the Ant community has created, Gradle has a great story to tell. Building and Testing with Gradle
If you’re interested in pushing the edge of CouchDB’s incremental reduce functionality, have a look at Google’s paper on Sawzall, which gives examples of some of the more exotic reductions that can be accomplished in a system with similar constraints. CouchDB The Definitive Guide
The handoptimized Sawzall version likewise intermixes logical computation with lower-level implementation details, in an effort to get better performance. FlumeJava - Easy, Efficient Data-Parallel Pipelines
The set of aggregations is limited but the query phase can involve more general computations, which we express in a new interpreted, procedural programming language called Sawzall.1 (An interpreted language is fast enough: most of the programs are small and on large data sets the calculation tends to be I/O bound, as is discussed in the section on performance.) Interpreting the Data Parallel Analysis with Sawzall
HBase has coprocessors that are different from what Sawzall, the scripting language used in Bigtable to filter or aggregate data, or the Bigtable Coprocessor framework,† provides. HBase:The Definitive Guide
Queries are also sometimes written in a scripting language developed at Google called Sawzall. Cassandra The Definitive Guide
An increasing number of analysts were being forced to write custom code for more complex analysis, often directly against the source (such as Sawzall against logs) Tenzing - A SQL Implementation On The MapReduce Framework
The scripts are written in a language developed at Google for processing data called Sawzall [28] Bigtable - A Distributed Storage System for Structured Data
Careful with those delimiters! With so many Nagios plug-ins, written by so many different authors, it’s important to carefully choose your delimiter and avoid using the same one returned by a plug-in. Monitoring with Ganglia
While these values can be crammed into a Text object with some delimiter, it is typically a better practice to create a custom Writable. MapReduce Design Patterns
A standard delimiter “/” is used to split a key name.) Professional Hadoop Solutions
Each line of the input data is broken into a key (text) and value (text) pair using a delimiter character. Hadoop MapReduce Cookbook
The default record delimiter is newline, but it can be overridden with. Hadoop in Practice
Each line is broken into fields using a configurable field delimiter (defaults to a tab character) to be stored in the tuple’s fields. Hadoop:The Definitive Guide 3rd Edition
To ignore standard punctuation marks, we add them to the StringTokenizer’s list of delimiter characters: Hadoop in Action
Instead of representing columns within a super column, the composite key approach means that you use a regular column family with regular columns, and then employ a custom delimiter in your key name and parse it on client retrieval. Cassandra The Definitive Guide
For example, for delimited text format, the user can specify the delimiter, encoding, quoting, escaping, headers, etc. Tenzing - A SQL Implementation On The MapReduce Framework
In order to access the different attributes at run time, the Map and Reduce functions split the record by the delimiter into an array of strings. HadoopDB An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
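As an illustration of the delimiter handling these excerpts describe, a small sketch that splits a record on a configurable field delimiter (a tab by default, as in the Hadoop excerpt) and then tokenizes words while ignoring punctuation; the sample record is invented:

    import java.util.StringTokenizer;

    public class DelimiterDemo {
        public static void main(String[] args) {
            String record = "key1\tsome value, with punctuation!";

            // Break the line into key and value on a configurable delimiter.
            String delimiter = "\t";
            String[] fields = record.split(delimiter, 2);
            System.out.println("key = " + fields[0]);

            // Add punctuation marks to the tokenizer's delimiter list,
            // as the Hadoop in Action excerpt suggests.
            StringTokenizer tok = new StringTokenizer(fields[1], " \t\n.,!?;:");
            while (tok.hasMoreTokens()) {
                System.out.println(tok.nextToken());
            }
        }
    }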
In the real world, Java uses the "sandbox" approach, wherein untrusted code, which includes classes downloaded over a network within a Web browser, is placed within the sandbox and required to meet certain expectations. Java Network Programming and Distributed Computing
If you’re deploying CouchDB on the Web, or any place where untrusted parties can access your server, it behooves you to research the recommended security measures for your operating system and take any additional steps needed. CouchDB The Definitive Guide
When we discuss the methods below, we speak of them in terms of trusted and untrusted classes. Java Security 2nd Edition
By contrast, SETI@home runs a perpetual computation on untrusted machines on the Internet with highly variable connection speeds and no data locality. Hadoop:The Definitive Guide 3rd Edition
If you write a class whose security depends on the immutability of a BigInteger or BigDecimal argument from an untrusted client, you must check to see that the argument is a “real” BigInteger or BigDecimal, rather than an instance of an untrusted subclass. Effective Java 2nd Edition
Chapter 6: Java Security This chapter explains the Java security architecture that allows untrusted code to run in a secure environment from which it cannot do any malicious damage to the host system. Java in a Nutshell 5th Edition
If, however, you are running Hadoop in an untrusted, multitenant environment or where accidental data exposure would be catastrophic, secure mode is the appropriate option. Hadoop Operations
An untrusted applet under the control of the default security manager will only be allowed to get the IP address of the host it came from (its codebase) and possibly the local host. Java Network Programming 4th Edition
This is a generic error all clients will receive when trying to connect to the untrusted broker (without the proper keystore and truststore) ActiveMQ in Action
These systems address concerns that do not arise for Bigtable, such as highly variable bandwidth, untrusted participants, or frequent reconfiguration; decentralized control and Byzantine fault tolerance are not Bigtable goals. Bigtable - A Distributed Storage System for Structured Data
In order to be truly object-oriented, a programming language must also support inheritance and polymorphism. Core Java Volume I Fundamentals 9th Edition
This code calculates the address of the method body, using information stored in the object (this process is covered in great detail in the Polymorphism chapter) Thinking in Java 4th Edition
For such data types, you might also use polymorphism and method overriding as an alternative to if-then-else, but there’s still language-design discussion as to which is more appropriate.[9] We’d say that both are useful tools and you should have both in your armory. Java 8 in Action Lambdas, Streams and Functional-style Programming
Serialization or polymorphism can be used if this is truly a problem, but adds a bit of complexity. MapReduce Design Patterns
The producer methods provide runtime polymorphism where the concrete type of the bean to be injected may vary at runtime, the injected object may not even be a bean, and objects may require custom initialization. Java EE 7 Essentials
In object-oriented polymorphism, two subtypes can be referenced similarly because they implicitly share a common supertype. SQL Antipatterns
Creating Buffers The buffer class hierarchy is based on inheritance but not really on polymorphism, at least not at the top level. Java Network Programming 4th Edition
On the other hand, good object-oriented design actually encourages many small methods and significant polymorphism in the method hierarchy. Java Performance Tuning 2nd Edition
Because of polymorphism, dynamic loading, etc., it is generally impossible to identify them all beforehand; instead, their identities become known only as the action unfolds. Concurrent Programming in Java Design Principles and Pattern
Inheritance is the natural way to express these relationships and allows for polymorphism in the application. Pro JPA2
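A compact sketch of the mechanism these excerpts describe: two subtypes referenced through a common supertype, with the overriding method resolved at run time (all names are invented for the example):

    public class PolymorphismDemo {
        // Common supertype: callers program against this interface.
        interface Shape {
            double area();
        }

        static class Circle implements Shape {
            private final double r;
            Circle(double r) { this.r = r; }
            @Override public double area() { return Math.PI * r * r; }
        }

        static class Square implements Shape {
            private final double side;
            Square(double side) { this.side = side; }
            @Override public double area() { return side * side; }
        }

        public static void main(String[] args) {
            // The same call site dispatches to different method bodies.
            Shape[] shapes = { new Circle(1.0), new Square(2.0) };
            for (Shape s : shapes) {
                System.out.println(s.area());
            }
        }
    }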
For instance, a sum aggregator applied to the out-degree of each vertex yields the. Pregel - A System for Large-Scale Graph Processing
Starting Up the Processes Starting the processes in a specific order is not necessary; however, if the daemons are started in the order recommended here, there won’t be a delay waiting for metadata to be retransmitted to the UDP aggregator and users won’t get error pages or incomplete data from the web server: Monitoring with Ganglia
The framework provides a set of aggregator services (see Table 8-14 for descriptions of the predefined aggregator services) Pro Hadoop
Of course, a node can be configured to be neither a master nor a data node; in such a case, the node will be used only as an aggregator node that handles user queries. Mastering ElasticSearch
Pipes consist of types Each (applying a function or filter), GroupBy (which groups streams on tuple fields), CoGroup (which joins a common set of values), Every (which applies an aggregator or sliding window operation), and SubAssembly (which combines other pipe assemblies) Professional Hadoop Solutions
The keyword table introduces an aggregator type; aggregators are called tables in Sawzall even though they may be singletons. Interpreting the Data Parallel Analysis with Sawzall
Use Case: HBase at Streamy.com Streamy.com is a real-time news aggregator and social sharing platform. Hadoop:The Definitive Guide 3rd Edition
The output string starts with the name of a value aggregator function (from the set of predefined functions available in the Aggregate package) Hadoop in Action
Flume is a relatively new log aggregator developed by Cloudera. Kafka a Distributed Messaging System for Log Processing
At first the value of aggregator is the initial document. MongoDB in Action
Many have already used the filesystem as a handoff point between the application creating the data (for instance, log4j) and the mechanism responsible for moving those files someplace else (for instance, syslog) Apache Flume Distributed Log Collection for Hadoop
Logs gmond and gmetad do not log many things to syslog; however, these logs should always be checked when there is a problem. Monitoring with Ganglia
The script is invoked with the additional arguments of the paths to the stdout, stderr, and syslog output files for the task. Pro Hadoop
As shown in Figure 5-4, log pages contain stdout, stderr, and syslog logs. Professional Hadoop Solutions
This recipe shows how to configure Cassandra’s logging mechanism, Log4J, to send events to its local logfile as well as a remote syslog server. Cassandra High Performance Cookbook
Sending data and running queries by sending JSON objects over HTTP makes it easy for someone to extend anything, from a syslog daemon like rsyslog to a connecting framework like ManifoldCF, to interact with Elasticsearch. Elasticsearch in Action MEAP V07 cHAPTER 1
You’ll want to preserve the original format of your syslog file, so you’ll need to create and edit flume-site.xml and indicate the raw output format. Hadoop in Practice
Flume supports a large variety of sources; some of the more commonly used ones include tail (which pipes data from a local file being written to into Flume, just like Unix tail), syslog, and Apache log4j (allowing Java applications to write events to files in HDFS via Flume) Hadoop:The Definitive Guide 3rd Edition
It supports native connectivity and support for writing directly to HDFS, and simplifies reliable, streaming data delivery from a variety of sources including RPC services, log4j appenders, syslog, and even the output from OS commands. Hadoop Operations
The Unix syslog utility provides a good starting point for designing such a framework. Java Performance Tuning 2nd Edition
This book provides in-depth coverage of several of the more intricate areas of Scala, including advanced aspects of its type system, implicits, composition techniques with traits, collections, actors, functional categories. Scala in Depth
There are many features such as pattern matching, for comprehensions and implicits supported in Scala but not in Java, that we won’t discuss. Java 8 in Action Lambdas, Streams and Functional-style Programming
This chapter shows you how implicits work, and it presents some of the most common ways they are used. Programming in Scala 2nd Edition
Trim the stream to at most a single element; see chapter code We can give eval infix syntax using implicits; see chapter code for details. Functional Programming in Scala
Instead, this syntax is defined with normal definitions of types and methods, combined with a few general-purpose parsing conventions, such as support for implicits. Programming Scala
They are very powerful and very dangerous, and I only invite them into my program’s scope when there is a very good reason.5 Using implicits to convert to a class that has a particular method is a good reason. Beginning Scala
Beware that this power comes with a price, as having too many implicits in scope can render the code not only less understandable but also incorrect. Steps in Scala An Introduction to Object-Functional Programming
They consist of a separate VLAN, volumes, instances, images, keys, and users. bk-compute-adminguide-grizzly
In this mode, nova-network creates a VLAN, a subnet, and a separate bridge for each project. Deploying OpenStack
They consist of an individual VLAN, and volumes, instances, images, keys, and users. bk-admin-guide-cloud-latest
With VLAN networking, this allows access between instances within the same project. admin-user-guide-trunk
VLANs VLAN configuration can be as simple or as complicated as desired. OpenStack Operations Guide
Here, we are using the System.out object and calling its println method. Core Java Volume I Fundamentals 9th Edition
If the specified boolean flag is set to "true," whenever a byte array, println method, or newline character is sent, the underlying buffer will be automatically flushed. Java Network Programming and Distributed Computing
So, we are defining here that we want to add the closure (with the statement println 'Hello world.') to our task with the name helloWorld. Gradle Effective Implementation Guide
This rule definition can be further expanded with actual network reachability code in place of the println() call, with that functionality comfortably factored into methods as we see fit. Gradle Beyond The Basics
To write to a print writer, use the same print, println, and printf methods that you used with System.out. Core Java Volume II Advanced Features 9th Edition
The toString method is automatically invoked when an object is passed to println, printf, the string concatenation operator, or assert, or printed by a debugger. Effective Java 2nd Edition
The object System.out has an instance method named println( ) Java in a Nutshell 5th Edition
For example, it’s relatively rare to use the write() method of PrintStream instead of one of its print() and println() methods. Java Network Programming 4th Edition
Add a breakpoint in the first call to the println() method of the run() method of the Task1 class. Java 7 Concurrency Cookbook
In a real-world build (still relying on println statements in place of actual testing actions), we would be more likely to structure the task as done in Example 2-16 Building and Testing with Gradle
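A short sketch contrasting the print, println, and printf calls the excerpts mention, including a PrintWriter over a file (the filename is illustrative):

    import java.io.IOException;
    import java.io.PrintWriter;

    public class PrintDemo {
        public static void main(String[] args) throws IOException {
            System.out.print("no newline, ");
            System.out.println("newline appended");
            System.out.printf("formatted: %s scored %d%n", "Ada", 42);

            // A PrintWriter exposes the same print/println/printf methods.
            try (PrintWriter out = new PrintWriter("out.txt")) {
                out.println("written to a file");
            }
        }
    }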
Consequently, if there is a truly huge number of RRD files and if the disk storage system does not cope well with a random access workload (e.g., if it is not an SSD or SAN), the startup phase may take many hours or may not even complete at all. Monitoring with Ganglia
There are too many to name, but all of you in Dublin, Portland, Lisbon, London, Zurich, San Francisco, Mountain View, Dortmund, Stockholm, Hamburg, Frankfurt, Salt Lake City, Blacksburg, San Diego, and Amsterdam: you know who you are—thanks! To my family, friends, and coworkers: thank you for your support and your patience with me over the last year. CouchDB The Definitive Guide
In contrast to systems like Lustre [4], which assume one can construct sufficiently reliable OSDs using mechanisms like RAID or fail-over on a SAN, we assume that in a petabyte or exabyte system failure will be the norm rather than the exception, and at any point in time several OSDs are likely to be inoperable. Ceph A Scalable, High-Performance Distributed File System
Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared filesystem, hosted by a Storage Area Network (SAN) Hadoop:The Definitive Guide 3rd Edition
Block pools are analogous to LUNs in a SAN storage system and a namespace with its pool of blocks is analogous as a file system volume. The Hadoop Distributed File System
As in conventional storage systems based on local or network-attached (SAN) disk drives or those embracing the proposed T10 OSD standard, devices passively respond to read and write commands, despite their potential to encapsulate significant intelligence. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
Thanks to my editor, Mike Loukides, for being a charming conversationalist at dinner in San Francisco. Cassandra The Definitive Guide
Be it block storage via a Storage Area Network (SAN) or file-based via Network File System (NFS) or similar protocols, these approaches constrain Hadoop by introducing unnecessary bottlenecks and additional shared devices that would have a critical impact on failure. Hadoop Beginners Guide
In the classic HPC model, data is usually stored on a large shared centralized storage system such as a SAN or NAS. Hadoop Operations
CPUs have become so much faster than disks that a few CPUs in a shared-memory machine can drive enough disk heads to service required loads without the complexity of distributed transactions: the top TPC-E benchmark results today are achieved on large shared-memory machines connected to a SAN. Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
He has more than 7 years of experience in software and middleware (Apache, Tomcat, PostgreSQL, Hadoop eco system) engineering. HBase Administration Cookbook
The use of technologies like CORBA as middleware is very important in organizations with many systems written in many languages. Java Network Programming and Distributed Computing
Because the test suite is run from the browser, not only does it test that CouchDB is functioning properly, it also verifies that your browser’s connection to the database is properly configured, which can be very handy for diagnosing misbehaving proxies or other HTTP middleware. CouchDB The Definitive Guide
Her interests and experiences lie across multiple tiers, from web frontend GUI to middleware, from middleware to backend SQL RDBMS and NoSQL data storage. Cassandra High Performance Cookbook
Before starting the graduate studies, Thilina worked as a Senior Software Engineer at WSO2 Inc., focusing on open source middleware development. Hadoop MapReduce Cookbook
We find that there are many middleware systems which return a result to their caller before all. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
Message-oriented middleware (MOM) allows sending and receiving messages between distributed systems. Java EE 7 Essentials
Instead, it calls on a middleware layer on the server that in turn makes the database queries. Core Java Volume II Advanced Features 9th Edition
Type 3 drivers provide a connection to a middleware product, which in turn connects to a database. Java Performance Tuning 2nd Edition
Commercial message-oriented middleware (MOM) did not help to encourage adoption of the process; most of the MOMs on the market were closed source, were costly to buy and support, and required trained system administrators for proper installation. ActiveMQ in Action
The import command is a basic JRuby feature, and it is not HBase-specific. HBase Administration Cookbook
I’ve written this book for experienced Java programmers who are interested in learning how to manage and make use of concurrency on the JVM, using languages such as Java, Clojure, Groovy, JRuby, and Scala. Programming Concurrency on the JVM
With R, simply handling the training data volumes and getting good integration with the existing Ruby/JRuby infrastructure at Shop It To Me proved very difficult. Mahout in Action
JRuby and JVM languages Those of you unfamiliar with Java may be confused by this JRuby concept. HBase in Action
This will bring up a JRuby IRB interpreter that has had some HBase-specific commands added to it. Hadoop:The Definitive Guide 3rd Edition
Since the shell is JRuby-based, you can mix Ruby with HBase commands, which enables you to do things like this: HBase:The Definitive Guide
He envisages himself using functional languages in projects and the decision to learn Clojure (a little bit of JRuby, Scala, F#, and Dart lately) influences his current self-learning activities. Java 7 New Features Cookbook
By using JRuby, you write ruby code that has access to any Java code, but will generally lose access to Ruby extensions that are implemented in C. Lucene in Action 2nd Edition
All are JVM (Java Virtual Machine) libraries that can be used from programming languages like Java, Clojure, Scala, JRuby, Groovy, and Jython, as opposed to tools with their own languages, like Hive and Pig. Programming Hive
Not only is Java 7 the most modern, statically typed, object-oriented language available for development, it also enhances the ability to integrate different languages that run on the JVM, such as Jython, Groovy, JRuby, and Scala to name a handful. Java 7 Recipes A Problem-Solution Approach
The classic Hashtable class serves the same purpose as the HashMap and has essentially the same interface. Core Java Volume I Fundamentals 9th Edition
Thus, read and write operations have improved performance compared to the synchronized Hashtable alternative. Java 8 in Action Lambdas, Streams and Functional-style Programming
You’ll use the distributed cache4 to copy the small dataset to the nodes running the map tasks, and use the initialization method of each map task to load the small dataset into a hashtable. Hadoop in Practice
The program reads this source file into a Java Hashtable called joinData that’s available throughout the mapper’s lifespan. Hadoop in Action
Failure to do so will result in a violation of the general contract for Object.hashCode, which will prevent your class from functioning properly in conjunction with all hash-based collections, including HashMap, HashSet, and Hashtable. Effective Java 2nd Edition
The Collections Framework was added in Java 1.2, but prior to that release you can use Vector and Hashtable, which are approximately the same as ArrayList and HashMap. Java in a Nutshell 5th Edition
When using regular columns, as we saw earlier, Cassandra looks like a four-dimensional hashtable. Cassandra The Definitive Guide
The java.util.Map interface is the parent of commonly used classes such as HashMap and (through some library backward reengineering) even the original Hashtable. Hadoop Beginners Guide
You should also prefer the URI class if you need to store URLs in a hashtable or other data structure, since its equals() method is not blocking. Java Network Programming 4th Edition
Even though the fully synchronized Hashtable class allows an Inventory class to be defined without worrying about some low-level synchronization details, we still want to place some semantic constraints on the store and retrieve operations. Concurrent Programming in Java Design Principles and Pattern
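To make the Hashtable-versus-HashMap contrast in these excerpts concrete: both offer essentially the same interface, but Hashtable synchronizes every method, while HashMap leaves synchronization to the caller (or to ConcurrentHashMap). A minimal sketch:

    import java.util.HashMap;
    import java.util.Hashtable;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class MapChoices {
        public static void main(String[] args) {
            // Legacy: every method is synchronized, so even reads pay for a lock.
            Map<String, Integer> legacy = new Hashtable<>();

            // Unsynchronized: fastest for single-threaded use.
            Map<String, Integer> plain = new HashMap<>();

            // Modern concurrent alternative with finer-grained locking.
            Map<String, Integer> concurrent = new ConcurrentHashMap<>();

            // All three satisfy the same Map interface.
            for (Map<String, Integer> m : List.of(legacy, plain, concurrent)) {
                m.put("count", 1);
                System.out.println(m.get("count"));
            }
        }
    }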
So this loop need only execute for each nonzero element of the user vector. Mahout in Action
IOException— reads a byte from the input stream, and returns "true" if the byte is nonzero. Java Network Programming and Distributed Computing
Each of the four value sets includes not only the finite nonzero values that are ascribed to it above, but also the five values positive zero, negative zero, positive infinity, negative infinity, and NaN. The Java Virtual Machine Specification Java SE 7 Edition
In C and C++ the result of this assignment will always be true if y is nonzero, and you’ll probably get an infinite loop. Thinking in Java 4th Edition
If any of the executed commands return a nonzero value, the build will fail. Jenkins The Definitive Guide
A compile-time error does not occur if a non-zero floating-point literal has a small value that, on rounded conversion to its internal representation, becomes a nonzero denormalized number. The Java Language Specification Java SE 7 Edition
The test is very fast because it is in-memory, but it has a nonzero probability of false positives, in which case the regular get() method is called. Hadoop:The Definitive Guide 3rd Edition
Most products have nonzero values for only a few of these optional fields. Effective Java 2nd Edition
It sets its exit code to a nonzero value if an extension with the same name is already installed and has a specification version number equal to or greater than the version of the specified file. Java in a Nutshell 5th Edition
However, if the scan rate is nonzero and the trend of reported free memory is decreasing, then swapping is likely occurring. Java Performance
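The Thinking in Java excerpt about assignment in a condition is worth a concrete note: in C, `if (x = y)` compiles and tests whether y is nonzero, while Java rejects it unless the operands are boolean, so the bug surfaces at compile time. A sketch:

    public class NonzeroDemo {
        public static void main(String[] args) {
            int x, y = 5;
            // if (x = y) { }   // compile error in Java: int is not boolean
            x = y;
            if (x != 0) {       // the nonzero test must be written explicitly
                System.out.println("x is nonzero");
            }
        }
    }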
With ElasticSearch 0.90.3 we have the possibility of using the prefix-based suggester which is very handy for implementing the autocomplete functionality and which we will discuss in the Completion suggester section. Mastering ElasticSearch
Query term completion via the Suggester New in Solr 3 is a high-speed approach to implement term completion, called the Suggester. Solr 3 Enterprise Search Server
Using the Suggester component Nowadays, it's common for web pages to give a search suggestion (or autocomplete as I tend to call it), just like many "big" search engines do—just like Google, Microsoft, and others. Solr 3.1 Cookbook
The completion suggester in Elasticsearch takes a completely different approach. Elasticsearch The Definitive Guide
The source, an exec source, will tail the /var/log/app.log file and follow any rotations outside applications may perform on that logfile. Apache Flume Distributed Log Collection for Hadoop
The check_log plugin greps a particular query word incrementally in a specified logfile. HBase Administration Cookbook
The HTTP operation records contain a superset of the attributes in the widely supported Combined Logfile Format (CLF) commonly used in web server logging. Monitoring with Ganglia
To troubleshoot the HDFS startup problem, we can open a new terminal and monitor the NameNode logfile on the master node with the following command: Hadoop Operations and Cluster Management Cookbook
This recipe shows how to configure Cassandra’s logging mechanism, Log4J, to send events to its local logfile as well as a remote syslog server. Cassandra High Performance Cookbook
Benchmark writes the results to the console and appends the results to a logfile similarly to the write benchmark. Hadoop MapReduce Cookbook
SequenceFile Imagine a logfile where each log record is a new line of text. Hadoop:The Definitive Guide 3rd Edition
Whenever you are not sure what a byte array contains you should use this method to print its content, for example, to the console, or into a logfile. HBase:The Definitive Guide
Watching the logfile is a good way to determine that the node is finished bootstrapping, but to watch for progress while it’s happening, use nodetool streams. Cassandra The Definitive Guide
Also, when starting a session, we’ll omit the line about the logfile. Programming Hive
The time for initializing the cluster, generating the test graphs in-memory, and verifying results is not included in the measurements. Pregel - A System for Large-Scale Graph Processing
The framework calls the configure() method upon initializing a task, and it calls the close() method when the task has finished processing its input split: Pro Hadoop
You’ll also add two more targets for initializing and cleaning up the required output directories. Gradle in Action
After validating the authenticity of the lease [R7], the NM configures the environment for the container, including initializing its monitoring subsystem with the resource constraints specified in the lease. Apache Hadoop YARN Yet Another Resource Negotiator
Initializing our river After we are done with installation, we can initialize our river. Mastering ElasticSearch
We start by initializing the output result for each input group. MapReduce Design Patterns
There's more Initializing connections in this manner does not account for server fail-over or retries. Cassandra High Performance Cookbook
Before a task starts, it shows its status as “initializing,” and then once it starts reading records, it shows the split information for the split it is reading as a filename with a byte offset and length. Hadoop:The Definitive Guide 3rd Edition
Backups provide both disaster recovery and a means for initializing the database of a newly replaced replica without placing load on replicas that are in service. Chubby - The lock service for loosely-coupled distributed systems
The JMS API can be tedious, as it requires you to write a lot of code for initializing all the necessary JMS objects such as connections, sessions, producers, consumers, and so forth. ActiveMQ in Action
More precisely, this method will establish an association between the socket and a local address, which is usually explicitly specified as an InetSocketAddress instance (this class represents a socket address with IP (or hostname) and port, and extends the abstract SocketAddress class) Pro Java 7 NIO.2
The InetAddress class is used to represent IP addresses within a Java networking application. Java Network Programming and Distributed Computing
The list of attributes the record must have is defined by the objectClass inetOrgPerson. Jenkins Continuous Integration Cookbook
This variable has a single use: it must be set by the security manager before the security manager calls most methods of the InetAddress class. Java Security 2nd Edition
This is used to force the built-in SNMP agent to bind to the given inetAddress. Oracle JRockit The Definitive Guide
However, you can use the InetAddress class if you need to convert between host names and Internet addresses. Core Java Volume II Advanced Features 9th Edition
In Java 5.0 the InetAddress class has an isReachable( ) method that attempts to determine whether the host is reachable. Java in a Nutshell 5th Edition
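Usage of that method is straightforward; a small sketch follows (host name and timeout are illustrative), keeping in mind that isReachable() can return false even for live hosts when ICMP is blocked.

```java
import java.net.InetAddress;

public class ReachableCheck {
    public static void main(String[] args) throws Exception {
        InetAddress host = InetAddress.getByName("www.example.com");
        // Attempts an ICMP echo if the JVM has permission, otherwise a TCP
        // connection to port 7 (echo), giving up after 3,000 ms.
        System.out.println(host.getHostAddress()
                + " reachable: " + host.isReachable(3000));
    }
}
```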
It shows how Java programs interact with the Domain Name System through the InetAddress class, the one class that’s needed by essentially all network programs. Java Network Programming 4th Edition
A session can cache the IP address explicitly after the first lookup, but this needs to be done at the application level by holding on to the InetAddress object. Java Performance Tuning 2nd Edition
The client is then bound to the port using a new InetSocketAddress instance. Java 7 Recipes A Problem-Solution Approach
If you want bidirectional replication, you need to trigger two replications with source and target swapped. CouchDB The Definitive Guide
The key to making this work is called joint compilation, which allows you to freely mix Java and Groovy source code with bidirectional dependencies on each other. Gradle in Action
Replication CouchDB is designed to allow bidirectional replication in an efficient and reliable manner. Beginning CouchDB
Finally, having assigned the bidirectional mapping, the UID is returned. HBase in Action
For example, you would use a directed graph to model relationships between users in a social network because relationships are not always bidirectional. Hadoop in Practice
We can see that one of the edges is bidirectional. Cassandra The Definitive Guide
Depending on the type of graph, the edges may be bidirectional or unidirectional and may have weights associated with them. Hadoop Beginners Guide
Sqoop is a tool that facilitates bidirectional exchange of data between HDFS and relational databases. Hadoop Operations
In some situations, you may want to create a bidirectional communication channel between brokers—a channel that communicates not only outward to the brokers on the other side of the connection, but also receives messages from other brokers on that same channel. ActiveMQ in Action
A Java channel represents an open data path, possibly bidirectional, to an external data source or sink such as a file or a socket. Fundamental Networking in Java
GridFS, a protocol for storing large files, uses subcollections to store file metadata separately from content chunks (see Chapter 6 for more information about GridFS) MongoDB The Definitive Guide 2nd Edition
As we move through the book, we cover essential topics such as how to store large files using the GridFS feature and how to administer and optimize your MongoDB installation. The Definitive Guide to MongoDB
Appendix C shows how to work with binary data in MongoDB and how to use GridFS, a spec implemented by all the drivers, to store especially large files in the database. MongoDB in Action
Caution The paths and URIs for DistributedCache items are stored as comma-separated lists of strings in the configuration. Pro Hadoop
Everything in CouchDB has an address, a URI, and you use the different HTTP methods to operate on these URIs. CouchDB The Definitive Guide
This EL function is commonly used to pass the URIs of data set instances that will be consumed by a Workflow job triggered by a Coordinator action. Professional Hadoop Solutions
The setup() method in the Mapper is told to pull the complete list of cache file URIs currently in the distributed cache. Hadoop RealWorld Solutions Cookbook
The subsequent rewrite rule then proxies all requests to URIs starting with /api to the equivalent URI on http://couch-master.example.com:5984 (again, only if the previous rewrite condition has been met): Scaling CouchDB
You’ll notice that all of the directory names are HDFS URIs. Hadoop in Practice
This is managed in configuration using ViewFileSystem and the viewfs:// URIs. Hadoop:The Definitive Guide 3rd Edition
Similar to the URIs used in a web browser, the file: scheme addresses local files. HBase:The Definitive Guide
The JobTracker will take this list of URIs and create a local copy of the files in all the TaskTrackers when it starts the job. Hadoop in Action
It discusses ActiveMQ URIs and all the transport connectors for both client-to-broker and broker-to-broker communications including TCP, NIO, STOMP, failover, SSL, HTTP/S, and much more. ActiveMQ in Action
As shown in table 1.1, Redis allows us to store keys that map to any one of five different data structure types: STRINGs, LISTs, SETs, HASHes, and ZSETs. Redis in Action
Gets the intersection of a given number of ZSETS and store the result in a new ZSET. Redis Cookbook
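The Redis Cookbook line describes the ZINTERSTORE command; a sketch of the same operation from Java, assuming the Jedis client is on the classpath and Redis is running on localhost (key names are made up):

```java
import redis.clients.jedis.Jedis;

public class ZInterDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.zadd("scores:day1", 10, "alice");
            jedis.zadd("scores:day1", 7, "bob");
            jedis.zadd("scores:day2", 5, "alice");
            // ZINTERSTORE keeps members present in every input ZSET; by
            // default the destination score is the sum of the input scores.
            jedis.zinterstore("scores:both", "scores:day1", "scores:day2");
            System.out.println(jedis.zscore("scores:both", "alice")); // 15.0
        }
    }
}
```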
Cryptographic techniques ensure that such a certificate cannot be forged, and that any effort to tamper with the signed file will be detected. Core Java Volume I Fundamentals 9th Edition
With the addition of new cryptographic codecs, a new token-based authentication mechanism that supports more authentication protocols, a unified authorization system that supports Attribute Based Access Control (ABAC) and supports policy enforcement using open standards and XACML, and changes to HBase to allow cell-level authorization, Hadoop will be able to move from isolated cluster environments with perimeter-level security to very secure deployments that can meet the requirements of highly secure environments. Professional Hadoop Solutions
They’re both faster than cryptographic hashers such as SHA-1 and produce good distributions. Hadoop in Practice
More significantly, objects in PAST are immutable, facilitating cryptographic protection and simplifying consistency and caching, but limiting the system's usefulness as a general storage service. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
Code signing, whereby code authors can use standard cryptographic algorithms to authenticate Java code. Core Java Volume II Advanced Features 9th Edition
Cryptographic hash functions While a full description of a cryptographic hash function is beyond the scope of this book, we can say that cryptographic hash functions are used for information security. Getting Started with Google Guava
These include support to protect against cipher block chaining attacks and new cryptographic algorithms. Java 7 New Features Cookbook
This package and its subpackages support cryptographic message digests and digital signatures. Java in a Nutshell 5th Edition
When a block is first stored, there is also a hidden file written to the same HDFS directory containing cryptographic checksums for the block. Hadoop Beginners Guide
The Handshake Protocol allows the server and client to authenticate each other and to negotiate an encryption algorithm and secret cryptographic keys before the application protocol transmits or receives its first byte of data. Fundamental Networking in Java
As such, this chapter is a great opportunity to showcase an example of natively serving up a dynamic web application using nothing more than CouchDB’s integrated web server, something you may wish to do with your own applications. CouchDB The Definitive Guide
The use of SequenceFile provides a storage mechanism natively supporting key/value structure, thus making implementations using this data layout much simpler. Professional Hadoop Solutions
For browsers that do not support JSON natively, the eval() function can be used to translate a string containing JSON code into JavaScript objects. Beginning CouchDB
There are few dependency management scenarios it can’t handle natively or be adapted to through a few easy customizations. Gradle Beyond The Basics
But when you’re working with big data volumes you’ll need serialization formats with compact serialized forms that natively support partitioning and have schema evolution features. Hadoop in Practice
It is, however, possible to disable checksums, typically when the underlying filesystem supports checksums natively. Hadoop:The Definitive Guide 3rd Edition
Keys and values can take on types beyond the basic ones which Hadoop natively supports. Hadoop in Action
You can do the same thing if you’re using Windows, but Windows doesn’t include a tail program natively. Cassandra The Definitive Guide
With its support for nested complex types, Avro can natively support a representation of a node that is much closer to the runtime object. Hadoop Beginners Guide
Nevertheless, if you know it, or already have a monitoring system that is natively JMX-aware, it’s a perfectly valid option for integration. Hadoop Operations
Timeouts and workers Finally, there are two miscellaneous properties related to timeouts and two for worker pools that you can change: Apache Flume Distributed Log Collection for Hadoop
Prior to Hadoop 0.18, blocking operations and fixed timeouts were used for the RPC calls. Pro Hadoop
It turns out that most of these overwrites came from client retries due to errors or timeouts. GFS - The Google File System
By default, both types of cached artifacts expire after 24 hours, but both timeouts can be set to arbitrary periods (including zero) using the resolutionStrategy block. Gradle Beyond The Basics
But setting the value too high would mean that the client’s interaction with HBase would have longer pauses, and this could result in timeouts on HBase’s side. HBase in Action
I felt like we’d need to precisely control how things like connections, timeouts and buffers are handled, and RMI gives you little control over those. Hadoop:The Definitive Guide 3rd Edition
TCP’s back off policies pay no attention to higher-level timeouts such as Chubby leases, so TCP-based KeepAlives led to many lost sessions at times of high network congestion. Chubby - The lock service for loosely-coupled distributed systems
NFS: may not be a good fit for many deployments, as NFS is an external device that is costly and offers less control over timeouts, etc. Namenode High Availability
Time for action – causing task failure Let's cause a task to fail; before we do, we will need to modify the default timeouts: Hadoop Beginners Guide
In general, timeouts should be set slightly too long rather than slightly too short. Fundamental Networking in Java
You can export the preferences of a subtree (or, less commonly, a single node) by calling the methods. Core Java Volume I Fundamentals 9th Edition
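The excerpt truncates before naming the methods; in the JDK's java.util.prefs API they are exportSubtree() (the node plus all of its descendants) and exportNode() (the single node). A minimal sketch with an illustrative node path:

```java
import java.io.FileOutputStream;
import java.util.prefs.Preferences;

public class ExportPrefs {
    public static void main(String[] args) throws Exception {
        Preferences node = Preferences.userRoot().node("/com/example/app");
        node.put("theme", "dark");
        // exportSubtree writes this node and every descendant as an XML document.
        try (FileOutputStream out = new FileOutputStream("prefs.xml")) {
            node.exportSubtree(out);
        }
    }
}
```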
The interactive facility allows simple subtree and summation views of the grid state XML tree. Monitoring with Ganglia
Ceph [9] has a cluster of namespace servers (MDS) and uses a dynamic subtree partitioning algorithm in order to map the namespace tree to MDSes evenly. HDFS scalability the limits to growth
The bin/hadoop script and all of the Cygwin utilities see a file system that is a subtree of the Windows file system, with the Windows drives mapped to the /cygdrive directory. Pro Hadoop
Ceph utilizes a novel metadata cluster architecture based on Dynamic Subtree Partitioning [30] that adaptively and intelligently distributes responsibility for managing the file system directory hierarchy among tens or even hundreds of MDSs. Ceph A Scalable, High-Performance Distributed File System
The problem only gets more complicated as we allow more complex changes, where for example the order of records may be permuted; such changes can be common, for instance, if a crawler uses a depth-first strategy to crawl the web, and a single link changing can move an entire subtree’s position in the input file. Incoop MapReduce for Incremental Computations
If we changed a file in the /foo subtree on the first cluster from the previous example, we could synchronize the change with the second cluster by running: Hadoop:The Definitive Guide 3rd Edition
Ceph [17] has a cluster of namespace servers (MDS) and uses a dynamic subtree partitioning algorithm in order to map the namespace tree to MDSs evenly. The Hadoop Distributed File System
A special cell, named global, contains a subtree /ls/global/master that is mirrored to the. Chubby - The lock service for loosely-coupled distributed systems
The line sticking out from the handle icon points to the right when the subtree is collapsed and down when the subtree is expanded (see Figure 6.20) Core Java Volume II Advanced Features 9th Edition
Alice in Wonderland has 5,909 unique words, including the copyright notice at the beginning. The words appear in random order. Core Java Volume I Fundamentals 9th Edition
To continue with our previous examples, we may wish to assign the "alice" user the "compute-user" role in the "acme" tenant: bk-compute-adminguide-grizzly
Suppose that Alice, an engineer at Google, comes up with a novel idea for extracting new kinds of signals from web pages. Dremel - Interactive Analysis of WebScale Datasets
You might assign the computeuser role to the alice user in the acme tenant: bk-admin-guide-cloud-latest
Bob uses the algorithm specification to generate his own public and private keys; he sends the public key to Alice. Java Security 2nd Edition
Based on our sample data, the answer is Alice and Zach. Graph Databases
Suppose Alice wants to send Bob a message, and Bob wants to know this message came from Alice and not an impostor. Core Java Volume II Advanced Features 9th Edition
The sender of a message, Alice, can compute a digest for a message and then encrypt that digest with her private key. Java in a Nutshell 5th Edition
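In the JDK, the digest-then-encrypt step that quote describes is packaged as the Signature class; a sketch (algorithm choice and message are illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

public class SignDemo {
    public static void main(String[] args) throws Exception {
        KeyPair alice = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        byte[] message = "Meet at noon. -- Alice".getBytes(StandardCharsets.UTF_8);

        // Alice: digest the message and sign the digest with her private key.
        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(alice.getPrivate());
        signer.update(message);
        byte[] sig = signer.sign();

        // Bob: recompute the digest and check it against Alice's public key.
        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(alice.getPublic());
        verifier.update(message);
        System.out.println("came from Alice: " + verifier.verify(sig));
    }
}
```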
So any commands executed by a user called alice will create files with an owner named alice and will only be able to read or write files to which this user has the correct access. Hadoop Beginners Guide
Consider the example of an over-supervised girl (Alice) trying to write to her boyfriend (Bob) when her outgoing mail is scrutinized by her parents. Fundamental Networking in Java
Indeed, it isn’t efficient to generate a new string that holds the concatenation of "Hel" and "p!" Core Java Volume I Fundamentals 9th Edition
The combiner can be used to do some concatenation prior to the reduce phase. MapReduce Design Patterns
When obtaining execution context, the class overwrites the output directory with a new value — concatenation of the base output with the output name. Professional Hadoop Solutions
This includes string concatenation, external table mapping, simple table joins, custom functions, and dependency distribution across the cluster. Hadoop RealWorld Solutions Cookbook
It consists of the concatenation of the compressed HTML of every page, preceded by a small header. The anatomy of a large-scale hypertextual Web search engine
One of the oldest rules in Java has always been that string concatenation using the plus operator should be avoided at all cost, which the following code example shows: Hadoop in Practice
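The usual fix that rule points at, sketched minimally: accumulate into one StringBuilder instead of creating a fresh String for every + in a loop.

```java
public class ConcatDemo {
    public static void main(String[] args) {
        String[] words = {"Hel", "p!"};
        // Each += on a String would allocate a brand-new String object;
        // StringBuilder appends into a single resizable buffer instead.
        StringBuilder sb = new StringBuilder();
        for (String word : words) {
            sb.append(word);
        }
        System.out.println(sb.toString()); // Help!
    }
}
```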
However, there is no easy way to combine the files (by concatenation, for example, in the case of plain-text files) to produce a globally sorted file. Hadoop:The Definitive Guide 3rd Edition
The row keys in the catalog tables are the region names, which are a concatenation of the region’s table name, its start row, and an ID (usually the current time in milliseconds) HBase:The Definitive Guide
In addition, the number of fields in the output relation is the sum of the number of fields in the input relations, and the output relation’s schema is a concatenation of the input relations’ schemas. Hadoop in Action
The key is essentially the concatenation of four dimension keys (row, column family, column qualifier, and timestamp) Cassandra The Definitive Guide
Client connections (dump): lists all the sessions and ephemeral znodes for the ensemble. Hadoop:The Definitive Guide 3rd Edition
It offers filesystem-like access with directories and files (called znodes) that. HBase:The Definitive Guide
ZooKeeper (http://zookeeper.apache.org) allows distributed processes coordinating with each other through a shared hierarchical name space of data registers (znodes), much like a file system. Apache Kafka
Like files in a traditional filesystem, ZNodes can store arbitrary bytes, have access control lists associated with them, and can be organized hierarchically. Hadoop Operations
Recipes include ZooKeeper operations that manipulate small data nodes, called znodes, that are organized hierarchically as a tree, just like in a file system. ZooKeeper Distributed Process Coordination
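A minimal sketch of creating and reading a znode with the ZooKeeper Java client (connection string and path are illustrative; the latch is a crude way to wait for the session to come up):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15_000,
                event -> connected.countDown());
        connected.await();
        // A persistent znode: a small data node at a path in the hierarchy,
        // much like a tiny file in a filesystem tree.
        zk.create("/demo", "hello".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/demo", false, null)));
        zk.close();
    }
}
```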
Currently, there can only be one network, and this setup would use the max IPs available in a /24 bk-compute-adminguide-grizzly
Nova defines two different types of IP addresses for an instance: Fixed IPs and Floating IPs. Deploying OpenStack
HAProxy uses a single configuration file to define everything from the frontend IPs being advertised to the servers behind them. RabbitMQ in Action
The templates enable you to create most OpenStack resource types, such as instances, floating IPs, volumes, security groups, users, and so on. bk-admin-guide-cloud-latest
To calculate page views, we need to calculate distinct IPs by page. Hadoop RealWorld Solutions Cookbook
You’ll then use sorting to order them by the number of IPs that viewed that movie category, thus giving you the most popular categories first. Hadoop in Practice
This snitch helps Cassandra know for certain if two IPs are in the same data center or on the same rack—because you tell it that they are. Cassandra The Definitive Guide
Global Network Usage tab to view duration and creation requests for networks, subnets, routers, ports, and floating IPs, per tenant (project) admin-user-guide-trunk
The templates allow creation of most OpenStack resource types such as instances, floating ips, volumes, security groups, users, etc. user-guide
Fixed IPs are assigned to instances on boot, whereas Floating IP addresses OpenStack Operations Guide
In the Java space, you can choose from a wide range of open source and commercial solutions, such as Checkstyle, PMD, Cobertura, FindBugs, and Sonar. Gradle in Action
This recipe shows you how to do this for checkstyle. Jenkins Continuous Integration Cookbook
There are tools already available for Java and Groovy projects to analyze and check source code, such as Checkstyle, JDepend, PMD, FindBugs, CodeNarc, and Sonar. Gradle Effective Implementation Guide
Modern IDEs have great support for many code quality tools—Checkstyle, PMD, and FindBugs all have plugins for Eclipse, NetBeans, and IntelliJ, which provide rapid feedback for developers on code quality issues. Jenkins The Definitive Guide
Earlier we discussed the jscheckstyle npm package, which coincidentally can output its report in Checkstyle format—and wouldn’t you know it, there is a Jenkins plug-in for that format! Even though you may be checking complexity numbers pre-commit, it is still a great idea to also generate those numbers as part of your build, for two reasons. Testable JavaScript
Checkstyle Checkstyle ensures that all source code adheres to coding standards. Sonar Code Quality Testing Essentials
These tools, such as Checkstyle, PMD, or FindBugs have a set of predefined rules of good practices and parse the source code looking for violations of those rules. Java 7 Concurrency Cookbook
For the first example, the Checkstyle Ant task will be loaded from a combination of a properties file and a set of local directory JARs. Building and Testing with Gradle
Figure 2.2 Comparing build script size and readability between Maven and Gradle. Gradle in Action
Some of the major benefits of using these higher-level of abstractions include readability, maintainability, development time, and automatic optimization. MapReduce Design Patterns
Adding a custom action node with a well-defined and meaningful business name can significantly improve the overall readability of the process. Professional Hadoop Solutions
One basic example might be the min() operation of the Reducer class that specifically formats output in a certain way for readability. Hadoop RealWorld Solutions Cookbook
The following code shows the implementations of the Serialized interface in the RegexSerDe class, with some unimportant code removed to help with readability: Hadoop in Practice
Statements that have to be terminated with a semicolon can be split across multiple lines for readability: Hadoop:The Definitive Guide 3rd Edition
For easier readability, the related details were broken up into groups using blank lines. HBase:The Definitive Guide
In the interest of readability, we’re ignoring the fact that this doesn’t yield identically formatted output; the data is the same. Hadoop Operations
Although the output has been truncated for readability, the command starts up a simple JMS producer and you can see from the output that it. ActiveMQ in Action
This technique increases readability by breaking up a long complex WHERE clause into two or more UNION queries. Programming Hive
If your input is comprised of tailed logfiles, the array is most likely a UTF-8 encoded String containing a line of text. Apache Flume Distributed Log Collection for Hadoop
The UTF-16 encoding represents all Unicode code points in a variable-length code. Core Java Volume I Fundamentals 9th Edition
In this case, the output is assumed by CouchDB to be HTML (i.e., with a content type of text/html) and UTF-8 formatting. Getting Started with CouchDB
The input bytes are considered to be in the UTF-8 character set. Pro Hadoop
This recipe is only applicable when your data is ASCII or UTF-8 Cassandra High Performance Cookbook
Your implementation stored them as strings, and used the DataOutput’s writeUTF method to write them out: Hadoop in Practice
Because of its emphasis on using standard UTF-8, there are some differences between Text and the Java String class. Hadoop:The Definitive Guide 3rd Edition
The writeUTF method writes string data using a modified version of 8-bit Unicode Transformation Format. Core Java Volume II Advanced Features 9th Edition
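A round trip through that format, as a sketch: DataOutputStream.writeUTF prefixes a two-byte length and then the modified UTF-8 bytes, and readUTF reverses it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class ModifiedUtfDemo {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            // Two-byte length prefix, then modified UTF-8 bytes, which caps
            // a single writeUTF string at 65,535 encoded bytes.
            out.writeUTF("café");
        }
        try (DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()))) {
            System.out.println(in.readUTF()); // café
        }
    }
}
```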
BytesType is the default for a reason: it provides the correct sorting for most types of data (UTF-8 and ASCII included) Cassandra The Definitive Guide
Regardless of operating system, you should save the file in the UTF-8 encoding if your editor allows. Java 7 Recipes A Problem-Solution Approach
Interactive queries are performed via a text protocol (similar to SMTP or HTTP) Monitoring with Ganglia
Finally, we describe interfacing directly with CouchDB from Erlang, which can be a useful technique if you want to integrate CouchDB storage with a server for non-HTTP protocols, like SMTP (email) or XMPP (chat) CouchDB The Definitive Guide
The way you want to send an email (for example, via SMTP or a third-party email service provider) and the list of recipients are very specific to your build. Gradle in Action
The output tells us that the local machine has the following services running: ftp, ssh, smtp, rpcbind (service for remote procedure calls), and jpp (service for Java packaging) Hadoop Operations and Cluster Management Cookbook
The JavaMail API specification does not require any protocol providers, but the JavaMail reference implementation includes an IMAP message store provider, a POP3 message store provider, and an SMTP. Java EE 7 Specification
However, our core data model is not restricted to our particular RPC framework; we also trace activities such as SMTP sessions in Gmail, HTTP requests from the outside world, and outbound queries to SQL servers. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
The Exchange environment is independent by necessity, but one class of Postfix machines acts as an added layer of protection and uses SMTP to transfer messages between mailboxes hosted in each environment. Hadoop:The Definitive Guide 3rd Edition
In the past, it was simple to write a program that sends e-mail by making a socket connection to port 25, the SMTP port. Core Java Volume II Advanced Features 9th Edition
You’ll also get a behind-the-scenes look at HTTP, SMTP, TCP/IP, and the other protocols that support the Internet and the Web. Java Network Programming 4th Edition
In the example that follows, application-wide parameters are set up to store an organization’s e-mail address and SMTP host for sending mail. Java 7 Recipes A Problem-Solution Approach
Methods that change instance fields are called mutator methods, and those that only access instance fields without modifying them are called accessor methods. Core Java Volume I Fundamentals 9th Edition
While you can write accessor functions to obtain the value of a variable and/or modify it, an easier alternative is to define an attribute. Java Network Programming and Distributed Computing
But it isn’t always possible to use a method reference as a classification function, because you may wish to classify using something more complex than a simple property accessor. Java 8 in Action Lambdas, Streams and Functional-style Programming
A large number of accessor methods give you this information. Core Java Volume II Advanced Features 9th Edition
Note that a nonzero-length array is always mutable, so it is wrong for a class to have a public static final array field, or an accessor that returns such a field. Effective Java 2nd Edition
Certain important or frequently used fields can also be public, but it is common practice to make fields non-public and encapsulate them with public accessor methods. Java in a Nutshell 5th Edition
It is easy to use a method in a loop that has side effects such as making copies, or an accessor that returns a copy of some object you need only once. Java Performance Tuning 2nd Edition
The HotSpot VM’s TemplateTable defines all the templates and provides accessor functions to get the template for a given bytecode. Java Performance
We can again apply our default rules, which in this case lead to the creation of little synchronized get and set methods (also known as accessor and assignment methods) for particles, otherwise avoiding direct access of the particles variable itself. Concurrent Programming in Java Design Principles and Pattern
The following code demonstrates the declaration of a private field, followed by accessor and mutator methods that can be used to obtain or set the value of that field from an outside class: Java 7 Recipes A Problem-Solution Approach
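The code itself did not survive the excerpt; a minimal reconstruction in the same spirit (the class and field names are placeholders):

```java
public class Employee {
    // Non-public field, encapsulated per common practice.
    private String name;

    // Accessor (getter): obtains the value without modifying it.
    public String getName() {
        return name;
    }

    // Mutator (setter): the only way an outside class can change the field.
    public void setName(String name) {
        this.name = name;
    }
}
```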
We also define a logout method that takes a UserSession object and invalidates it and performs any cleanup that may be needed. Scala in Depth
To log out at the end of the transaction, use the method logout, passing to it the GUID. Jenkins Continuous Integration Cookbook
The logout() method sets any value that might be stored in the ThreadLocal to null. Pro Spring 3
Finally, we will change the logout so it is handled by Spring Security as well. Pro Spring MVC
To implement it, you’ll need to configure Spring Security, configure web.xml, and add login and logout links to the app. Spring in Practice
LogoutFilter no-op interaction when the logout URL is not requested. Pro Spring Security
From an authentication standpoint, you’ll use the form-login D and logout E elements to enable a login page, with a login failure URL and a security logout page, respectively, for your web application. Spring Roo in Action
Invokes the logout method on the managers in the JAAS configuration file. Core Java Volume II Advanced Features 9th Edition
This is one of the most important classes in the JAAS API for application programmers: it defines the login( ) method (and the corresponding logout( ) method) that allows an application to authenticate a user. Java in a Nutshell 5th Edition
As it turns out, logout capability is already enabled by your configuration without you having to do anything else. Spring in Action 4th Edition
Views are functions that take your document data and produce searchable lists of information based on the document’s contents. Getting Started with CouchDB
The package must live in the directory indicated by its name, which must be a directory that is searchable starting from the CLASSPATH. Thinking in Java 4th Edition
Why Subscribe? Fully searchable across every book published by Packt; copy and paste, print, and bookmark content; on demand and accessible via web browser. Hadoop Operations and Cluster Management Cookbook
Lucene operates on searchable documents, which are collections of fields, each having a value. Professional Hadoop Solutions
Why Subscribe? Fully searchable across every book published by Packt. Cassandra High Performance Cookbook
Subscribers have access to thousands of books, training videos, and prepublication manuscripts in one fully searchable database from publishers like O’Reilly Media, Prentice Hall Professional, Addison-Wesley. Java EE 7 Essentials
Not only does it store them, it also indexes the contents of each document in order to make them searchable. Elasticsearch The Definitive Guide
If this is not the case, you may need to break objects and searchable data into fields that are individually compressed. Java Performance Tuning 2nd Edition
Organizations would like to build this living knowledge repository of Hive assets that is easily searchable by users. Programming Hive
Safari Books Online is a digital library providing searchable, on-demand access to thousands of technology, digital media, and professional development books and videos from leading publishers. Java SE 8 for the Really Impatient
Over time, the hdfs.path directory will get very full so you will want to add some kind of time element into the path to partition the files into subdirectories. Apache Flume Distributed Log Collection for Hadoop
The attributes group, name, and version directly map to subdirectories in the file system. Gradle in Action
The views directory is structured into subdirectories, each of which is the name of the view. Beginning CouchDB
The simplest use of fileTree() simply points it at a parent directory, allowing it to recurse through all subdirectories and add all of the files it finds into the resulting file collection. Gradle Beyond The Basics
We can specify all the part-prefixed subdirectories as input to the dependent job using wildcards. Hadoop MapReduce Cookbook
I also sometimes have data that’s specific to a chapter, which exists within chapter-specific subdirectories under the same GitHub location. Hadoop in Practice
If you want to see all the subdirectories, in a way similar to Unix’s ls with the -r option, you can use Hadoop’s lsr command. Hadoop in Action
Double-click on the ufodata directory to open it and within it create two subdirectories called ufo and states. Hadoop Beginners Guide
The AMQ message store directory contains subdirectories for all the brokers that are running on the machine. ActiveMQ in Action
The master project goes in a top-level directory, with subprojects arranged in subdirectories underneath. Building and Testing with Gradle
Tools (such as Eclipse PDE, Bnd, and Bundlor) can help you easily and efficiently manage the manifest file, and they can easily be integrated with building tools like Ant or Maven. Spring Dynamic Modules in Action
On the other hand, using snappy or lzo results in faster compression performance, but results in a lower compression ratio. Apache Flume Distributed Log Collection for Hadoop
This step requires the HBase cluster to support LZO compression. HBase Administration Cookbook
The framework supports the gzip, zip, sometimes LZO, and bzip2 compression codecs. Pro Hadoop
Also covered is how to set the block size and replication, and enable LZO compression. Hadoop RealWorld Solutions Cookbook
Various compression codecs are available to be used with HBase, including LZO, Snappy, and GZIP. HBase in Action
Twitter LZO-compresses all of its data, and uses Protocol Buffers for serialization purposes, all of which are geared to optimizing the use of its storage and computing resources. Hadoop in Practice
The LzopCodec is compatible with the lzop tool, which is essentially the LZO format with extra headers, and is the one you normally want. Hadoop:The Definitive Guide 3rd Edition
Before Snappy was made available in 2011, the recommended algorithm was LZO, even if it did not have the best compression ratio. HBase:The Definitive Guide
As a result of these experiments we chose a variant of LZO for production, since it gave about a 10% better compression ratio and was up to twice as fast when decompressing compared to Zippy. PowerDrill - Processing a Trillion Cells per Mouse Click
However, BZip2 and LZO provide block-level compression, where each block has complete records, so Hadoop can split these files on block boundaries. Programming Hive
For example, if you specify the codec as follows, all files written will have a .gzip extension, so you don't need to specify a hdfs.fileSuffix property in this case: Apache Flume Distributed Log Collection for Hadoop
The most notable things to do for performance are keeping data small (GZIP compression, high JPG compression) and avoiding requests all together (HTTP-level caching in the browser) CouchDB The Definitive Guide
We suggest downloading a .gzip archived file with the filename ending with tar.gz. Hadoop Operations and Cluster Management Cookbook
To create a TAR archive with the optional gzip or bzip2 compression, we must use the tarFiles task. Gradle Effective Implementation Guide
Since files compressed using the bzip2, gzip, and DEFLATE codecs cannot be split, the whole file must be given as a single input split to the mapper. Hadoop RealWorld Solutions Cookbook
Deflate Deflate is similar to zlib, which is the same compression algorithm that gzip uses without the gzip headers. Hadoop in Practice
Note that the gzip file format is DEFLATE with extra headers and a footer.) The .deflate filename extension is a Hadoop convention. Hadoop:The Definitive Guide 3rd Edition
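That relationship is easy to see from java.util.zip, as in this sketch: the same bytes written through DeflaterOutputStream and GZIPOutputStream differ only by the gzip framing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPOutputStream;

public class GzipVsDeflate {
    public static void main(String[] args) throws Exception {
        byte[] data = "the same text, two containers".getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream deflated = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(deflated)) {
            out.write(data); // DEFLATE body in a thin zlib wrapper
        }
        ByteArrayOutputStream gzipped = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(gzipped)) {
            out.write(data); // same DEFLATE body plus gzip header and CRC footer
        }
        // The gzip copy is a few bytes larger, all of it framing.
        System.out.println("deflate: " + deflated.size()
                + " bytes, gzip: " + gzipped.size() + " bytes");
    }
}
```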
This problem was solved by GZIP compressing the intermediate map output. Apache Hadoop Goes Realtime at Facebook
For example, to use GZIP compression, you can set the configuration object: Hadoop in Action
On Linux, GZip extraction utilities should be preinstalled; on Windows, you’ll need to get a program such as WinZip, which is commercial, or something like 7-Zip, which is freeware. Cassandra The Definitive Guide
As compared to Ganglia, which stores its data in RRDtool, OpenTSDB leverages HBase's scalability to monitor it at a larger scale. HBase Administration Cookbook
Ganglia 1.0 had no time-series support, whereas Ganglia today leverages the power. Monitoring with Ganglia
The ElasticSearch point of view On top of all this is ElasticSearch which leverages Apache Lucene and thankfully allows us to change the default scoring algorithm (more about this can be found in the Altering Apache Lucene scoring section, Chapter 3, Low-level Index Control) Mastering ElasticSearch
Sqoop leverages the database to describe the schema for the imported/exported data, and MapReduce for parallelization and fault tolerance. Professional Hadoop Solutions
To control granularity we propose a new Contraction phase that leverages Combiner functions, normally used to reduce network traffic by anticipating a small part of the processing done by Reduce tasks. Incoop MapReduce for Incremental Computations
To support this throughput HDFS leverages unusually large (for a filesystem) block sizes and data locality optimizations to reduce network input/output (I/O) Hadoop in Practice
Instead of implementing a separate RPC server, it leverages the Accept header of HTTP to send and receive the data encoded in Protocol Buffers. HBase:The Definitive Guide
Although both consider only independent failures, RADOS leverages CRUSH to mitigate correlated failure risk with failure domains. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
HBase in turn leverages HDFS for scalable and fault tolerant storage and ZooKeeper for distributed consensus. Apache Hadoop Goes Realtime at Facebook
One of these methods leverages the Bloom filter, a data structure that’s useful in many data processing tasks. Hadoop in Action
Actually, the OpenOption is an interface from the java.nio.file package and it has two implementations: the LinkOption class (remember the well-known NOFOLLOW_LINKS enum constant) and the StandardOpenOption class, which defines the following enums: Pro Java 7 NIO.2
For instance, enums are compared by the order in which the symbol is defined and not by the value of the symbol string. Hadoop:The Definitive Guide 3rd Edition
The property values are defined on CacheRetrieveMode and CacheStoreMode enums and explained in Table 13-5 Java EE 7 Essentials
Typically, most of your string fields will be exact-value not_analyzed fields such as tags or enums, plus a handful of full-text fields which will use some default analyzer like standard or english or some other language. Elasticsearch The Definitive Guide
The language supports four kinds of types: interfaces (including annotations), classes (including enums), arrays, and primitives. Effective Java 2nd Edition
Arrays, enumerated types (or “enums”) and annotation types are the other three. Java in a Nutshell 5th Edition
Avro supports six kinds of complex types: records, enums, arrays, maps, unions, and fixed. Cassandra The Definitive Guide
Avro also supports primitive types such as integers, floats, and booleans as well as other structured types such as arrays and enums. Hadoop Beginners Guide
I could have used strings or int constants to specify the kind of record to search for and the database to search in, but because there are only a small number of valid values, lookUpNames() defines enums with a fixed number of members instead. Java Network Programming 4th Edition
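A sketch of that pattern with hypothetical members (the real lookUpNames() is not reproduced in the excerpt): the compiler rejects anything but the declared members, a guarantee int constants cannot give.

```java
public class EnumDemo {
    // Hypothetical members standing in for "kind of record" and "database".
    enum RecordKind { PERSON, HOST }
    enum Database { LOCAL, GLOBAL }

    static String lookUpNames(String query, RecordKind kind, Database db) {
        // No range checks needed: only declared members can reach here.
        return query + " via " + kind + " in " + db;
    }

    public static void main(String[] args) {
        System.out.println(lookUpNames("smith", RecordKind.PERSON, Database.LOCAL));
    }
}
```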
By calling conversion methods against these enums, long values representing the duration between two dates can be converted. Java 7 Recipes A Problem-Solution Approach
Using onlyIf predicates Every task has a method onlyIf that accepts a closure as an argument. Gradle Effective Implementation Guide
In addition, coprocessors can be used for implementation of new features — for example, secondary indexing, complex filtering (push down predicates), and access control. Professional Hadoop Solutions
Predicate pushdown – Predicates are pushed down to the scan if possible so that rows can be filtered early in the processing. Hive A Petabyte Scale Data Warehouse Using Hadoop
Filters are sometimes called push-down predicates, allowing you to push data-filtering criteria down to the server (see figure 4.16) HBase in Action
In addition, it provides push-down predicates, that is, filters, reducing data transferred over the network. HBase:The Definitive Guide
A second important and growing use case for graph databases is the Semantic Web, where predicates are given equal status to subjects and objects in a triple. Cassandra The Definitive Guide
The execution engine will ignore the file after processing the header if it can determine that the file does not contain any records of interest, based on the predicates defined for that table in the query. Tenzing - A SQL Implementation On The MapReduce Framework
As with traditional relational databases, Hive allows tables to be partitioned based on the values of virtual columns and for these values to then be used in query predicates later. Hadoop Beginners Guide
When we add predicates to WHERE clauses that filter on partition values, these predicates are called partition filters. Programming Hive
Many formalisms are available for precisely stating predicates describing requirements (as discussed in most of the texts on concurrent design methods listed in the Further Readings) Concurrent Programming in Java Design Principles and Pattern
The synchronicity of this model makes it easier to reason about program semantics when implementing algorithms, and ensures that Pregel programs are inherently free of deadlocks and data races common in asynchronous systems. Pregel - A System for Large-Scale Graph Processing
If no other thread calls signal again, then the system deadlocks. Core Java Volume I Fundamentals 9th Edition
To the scheduler they appeared as map-only jobs with radically different resource curves, thwarting the assumptions built into the platform and causing poor utilization, potential deadlocks, and instability. Apache Hadoop YARN Yet Another Resource Negotiator
Reads within read-write transactions use wound-wait [33] to avoid deadlocks. Spanner - Google’s Globally-Distributed Database
Locking, waits, and deadlocks It is a known fact that complex transactional processing, like two-phase commits, can increase the possibility of multiple clients waiting for a resource to become available. HBase:The Definitive Guide
Complete non-blocking, event-I/O-based communication (no blocking threads on sockets, no deadlocks, scalable with a large number of shards/replicas). The Road to a Distributed, (Near) Real Time, Search Engine
Depending on what an alien method does, calling it from a synchronized region can cause exceptions, deadlocks, or data corruption. Effective Java 2nd Edition
Java 7 has improved the use of class loaders by modifying the locking mechanism to avoid deadlocks. Java 7 New Features Cookbook
More Info You have to be very careful with the use of Locks to avoid deadlocks. Java 7 Concurrency Cookbook
In addition, you will find detailed coverage of all aspects of concurrent programming, including such topics as confinement and synchronization, deadlocks and conflicts, state-dependent action control, asynchronous message passing and control flow, coordinated interaction, and structuring web-based and computational services. Concurrent Programming in Java Design Principles and Pattern
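The classic shape behind several of the quotes above is the lock-ordering deadlock; a deliberately broken sketch (do not write this):

```java
public class DeadlockSketch {
    private static final Object A = new Object();
    private static final Object B = new Object();

    public static void main(String[] args) {
        // Each thread takes its first lock, then blocks forever waiting
        // for the other thread's lock: a deadlock by opposite lock order.
        new Thread(() -> { synchronized (A) { pause(); synchronized (B) { } } }).start();
        new Thread(() -> { synchronized (B) { pause(); synchronized (A) { } } }).start();
        // The standard fix: every thread acquires locks in one global order
        // (always A before B), so the wait cycle cannot form.
    }

    private static void pause() {
        try { Thread.sleep(100); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```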
In this new criteria API, the criteria being passed into the query is based on the mapped entity classes’ metamodel. Pro Spring 3
Most of the time, online store applications don’t have a static structure for the products in their catalog: they use a metamodel configured with the structure of the products and a generic engine that uses this metamodel to display products dynamically. Spring Batch in Action
Criteria API The Criteria API is an object-based, type-safe API and operates on a metamodel of the entities. Java EE 7 Essentials
This reflects additions to support many modeling constructs natural to Java developers, expansion of the standardized mapping options, an object-based criteria query API, a metamodel API, support for automatic validation, support for pessimistic locking, and much more. Pro JPA2
Introduction to Querydsl When working with Querydsl, you will usually start by deriving a metamodel from your domain classes. Spring Data Modern Data Access for Enterprise Java
Note: Using well-supported third-party frameworks, such as JAAS and JDBC implementations, you can connect to a much wider set of services than mentioned in the table. Jenkins Continuous Integration Cookbook
Tomcat, for example, allows you to authenticate users against a relational database (using direct JDBC or a DataSource), JNDI, JAAS, or an XML configuration file. Jenkins The Definitive Guide
Standards such as Java Authentication and Authorization Service (JAAS) or Java EE Security do offer some ways of performing some of the same authentication and authorization functions, but Spring Security is a winner because it packages up everything you need to implement a top-to-bottom application security solution in a concise and sensible way. Spring Security 3.1
On the other hand, if you expected SASL to work, please fix your JAAS configuration. Professional Hadoop Solutions
So it overlaps mainly with the JAAS Java API, although they can be used together, as you will see later in the book. Pro Spring Security
Chapter 15 This chapter discusses how the Java Authentication and Authorization Service (JAAS) enables applications to authenticate users and grant them particular permissions based on their login ID or other credentials. Java Security 2nd Edition
Specifies the name of a Java Authentication and Authorization Service (JAAS) login configuration entry to use when the JMX agent authenticates users. Oracle JRockit The Definitive Guide
The string parameter "Login1" in the LoginContext constructor refers to an entry with the same name in the JAAS configuration file. Core Java Volume II Advanced Features 9th Edition
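A sketch of that flow with the JAAS classes from javax.security.auth.login (the configuration entry name mirrors the quote; the module shown in the comment is one possibility, not a requirement):

```java
import javax.security.auth.login.LoginContext;

public class JaasLogin {
    public static void main(String[] args) throws Exception {
        // "Login1" must name an entry in the JAAS configuration file, e.g.:
        //   Login1 { com.sun.security.auth.module.Krb5LoginModule required; };
        LoginContext context = new LoginContext("Login1");
        context.login();                          // runs the configured login modules
        System.out.println(context.getSubject()); // authenticated principals
        context.logout();
    }
}
```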
With the addition of the JAAS API to the core Java platform in Java 1.4, policytool allows the specification of a Principal to whom a set of permissions is granted. Java in a Nutshell 5th Edition
ActiveMQ also supports its own simple style of authentication and authorization using properties files as well as standard JAAS login modules. ActiveMQ in Action
This durability is provided by a combination of a Write Ahead Log (WAL) and one or more file storage directories. Apache Flume Distributed Log Collection for Hadoop
So, HBase can guarantee that when a region server dies, data can be recovered and replayed on other region servers using its Write-Ahead Log (WAL) HBase Administration Cookbook
It also means that if writing the record to the WAL fails, the whole operation must be considered a failure. Professional Hadoop Solutions
When a write is made, by default, it goes into two places: the write-ahead log (WAL), also referred to as the HLog, and the MemStore (figure 2.1) HBase in Action
When data is updated it is first written to a commit log, called a write-ahead log (WAL) in HBase, and then stored in the in-memory memstore. HBase:The Definitive Guide
A common practice for high-throughput data stores, use of a write ahead log (WAL) such as the edits file reduces I/O operations to sequential, append-only operations (in the context of the namenode, since it serves directly from RAM), which avoids costly seek operations and yields better overall performance. Hadoop Operations
But in case you have to know, it initializes to null several other internal elements of the mmodule structure and fills in a little bit of static information. Monitoring with Ganglia
At the start of a simulation, the lightweight simulator initializes cluster state using task-size data extracted from the. Omega - flexible, scalable schedulers for large compute clusters
More subtly, because the JobTracker needs to allocate tracking structures for every job it initializes, its admission control logic includes safeguards to protect its own availability; it may delay allocating fallow cluster resources to jobs because the overhead of tracking them could overwhelm the JobTracker process. Apache Hadoop YARN Yet Another Resource Negotiator
For each input record, Sawzall initializes the pre-defined variable input to the uninterpreted byte string of the input record. Interpreting the Data Parallel Analysis with Sawzall
The only feature that sets them apart from regular POJOs4 (or POGOs) is that they must have a property called name and a constructor that accepts a String and initializes the name property with it. Gradle Beyond The Basics
When we run the code, it initializes the K-means MapReduce job and executes it using the MapReduce framework. Hadoop MapReduce Cookbook
The first time we run this request, the upsert value is indexed as a new document, which initializes the views field to 1 Elasticsearch The Definitive Guide
When a task initializes, it calls configure(), which you override to retrieve and store your properties. Hadoop in Action
This tends to be a side effect of how the jobtracker initializes jobs submitted to the cluster and how it handles memory. Hadoop Operations
This simple class starts a minimal ActiveMQ broker configuration and initializes the Spring application context to start the JMS clients. ActiveMQ in Action
In order to extract one or more fields, you start by specifying the regex property with group matching parentheses. Apache Flume Distributed Log Collection for Hadoop
If you’re interested in something specific, you can specify a system name, or a regex or type-glob to combine various metrics from various hosts to create a custom graph of exactly what you want to see. Monitoring with Ganglia
ElasticSearch allows us to use two types of field data loading filtering: by term frequency, by using regex, or by combining both of them. Mastering ElasticSearch
We use the setup function to retrieve the map regex from the job configuration. MapReduce Design Patterns
The mapper class defines a statically initialized position to read the country from each line, as well as the regex pattern to split each line. Hadoop RealWorld Solutions Cookbook
To rename files using regular expressions, we can simply provide a source regex and a destination filename. Gradle Beyond The Basics
Keep in mind that this is a regex over the un-interpreted bytes in the rowkey. HBase in Action
Solution Use the RegexSerDe bundled with Hive and define a regular expression that can be used to parse the contents of Apache log files. Hadoop in Practice
RegexMapper A mapper that finds matches of a regular expression in the input value and emits the matches along with a count of one. Hadoop:The Definitive Guide 3rd Edition
Returns whether the string s matches the Java regular expression regex. Hadoop in Action
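On the Java side, the distinction worth remembering is between whole-string and substring matching, sketched here: String.matches() anchors the pattern to the entire input.

```java
import java.util.regex.Pattern;

public class RegexMatch {
    public static void main(String[] args) {
        // matches() succeeds only if the pattern covers the whole string...
        System.out.println("hadoop-1.0.3".matches("hadoop-\\d+\\.\\d+\\.\\d+")); // true
        System.out.println("hadoop-1.0.3".matches("\\d+\\.\\d+"));               // false
        // ...while Matcher.find() succeeds on any matching substring.
        System.out.println(Pattern.compile("\\d+\\.\\d+")
                .matcher("hadoop-1.0.3").find());                                // true
    }
}
```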
However, Spring can also integrate with other popular web frameworks such as Struts, JSF, Google Web Toolkit (GWT), and so on. Pro Spring 3
Google’s GWT compiler can also output source maps, and work is being done for the CoffeeScript compiler as well, so as far as you can tell you are debugging your original native code while the browser is actually running the compiled version. Testable JavaScript
Google Web Toolkit (GWT) integration There are a number of ways to set up a GWT-based application to communicate with the server. Spring Security 3.1
We’ll show you how to use servlet and JSP technologies with Spring DM, and how to use web frameworks like Spring MVC, JSF, and GWT to implement real web applications. Spring Dynamic Modules in Action
Using a similar approach, we can also build support for other widgets in jQuery UI or other UI frameworks such as Yahoo UI, Doojo, and GWT. Selenium Testing Tools Cookbook
You can use this technique when doing rapid prototyping; perhaps somebody has created a set of domain object models, and you want to try using them from various front ends such as Spring MVC, JSF, or GWT. Spring Roo in Action
If you are working with GWT and would like to take advantage of Guava in your code, there is also a GWT compatible version that can be downloaded by clicking on the guava-gwt-14.jar link on the same page. Getting Started with Google Guava
IoC is a technique that externalizes the creation and management of component dependencies. Pro Spring 3
To be able to do the things mentioned here, we will explore the Inversion of Control (IoC) container; the application context. Pro Spring MVC
Spring’s Core Container provides the inversion of control (IoC) and DI capabilities on which all the other modules are built. Spring in Practice
Dependency Injection The basic idea of DI, a type of Inversion of Control (IoC), is simply that instead of having an object instantiate its needed dependencies, the dependencies are somehow given to the object. Pro Spring Security
Dependency Injection in Scala: The Cake Pattern Dependency injection (DI), a form of inversion of control (IoC), is a powerful technique for resolving dependencies between “components” in larger applications. Programming Scala
Spring application context management in OSGi environments is an example of the inversion of control (IoC) pattern: Spring DM–powered bundles wait for the extender to inspect them and create their application context; no application code is needed for this task. Spring Dynamic Modules in Action
The theme of inversion of control (IoC) is central to this investigation because it’s a significant part of the Spring philosophy.3 For the purpose of this discussion, we consider IoC in broad terms. Spring Integration in Action
Using the Spring IoC container, these may be configured as normal Spring beans and injected into GemFire components. Spring Data Modern Data Access for Enterprise Java
Part 3 completes the content with methods to help you scale Redis using memory-saving techniques, horizontal sharding, and Lua scripting. Redis in Action
Redis 2.6 includes a built-in Lua interpreter which developers can leverage to write more advanced queries to be. The Little Redis Book
Integration of Lua and R has been proposed, and work is in progress. Tenzing - A SQL Implementation On The MapReduce Framework
This has been achieved by integrating the Redis server with the Lua C API. Redis Cookbook
The framework for coprocessors is very flexible, and allows implementing two basic coprocessor types: Professional Hadoop Solutions
Writing an observer coprocessor is tricky to get right, and such a coprocessor can be extremely difficult to debug when running at scale. HBase in Action
With the coprocessor feature in HBase, you can even move part of the computation to where the data lives. HBase:The Definitive Guide
HBase security is implemented as a coprocessor—a plug-in that, in many ways, works like a database trigger, firing authorization checking code before each operation is performed. Hadoop Operations
How can we fulfill the need for computational speed? Moore’s law takes us some of the way, but multicore is the real future. Programming Concurrency on the JVM
Stepping back from extremes of multicore CPU deployments, Ganglia has also proven itself to be truly adaptable and versatile in the face of dramatic change. Monitoring with Ganglia
Now, with Moore’s Law running out of steam (at least for conventional chips), speed improvements are appearing in the form of multicore processors rather than faster chips. Thinking in Java 4th Edition
It is effectively a single point of contention for the whole server, and it can cause severe stalls on multicore servers. High Performance MySQL 3rd Edition
Given that today’s computers have blazingly fast multicore processors, you should use their computing powers to their fullest. Gradle in Action
There’s also a hardware influence: commodity CPUs have become multicore—the processor in your laptop or desktop machine probably has four or more CPU cores within it. Java 8 in Action Lambdas, Streams and Functional-style Programming
For production use, it is typical that you use multicore processors. Quad-core processors are state of the art and affordable, while hexa-core processors are also becoming more popular. HBase:The Definitive Guide
It is inherent in much of what we do, and a requirement if you are to obtain good performance from multicore processors, which are now commonplace. Effective Java 2nd Edition
Being able to make efficient use of CPU cycles can be challenging for multithreaded applications running on multiprocessor and multicore. Java Performance
By making concurrent programs, you take advantage of today’s multicore CPUs. Java 7 Recipes A Problem-Solution Approach
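The common Java idiom for putting those cores to work is sizing a thread pool from Runtime.availableProcessors(); a minimal sketch (the summed range is arbitrary):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class CoreCountSketch {
        public static void main(String[] args) throws Exception {
            final int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores); // one worker per core
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i < cores; i++) {
                final int offset = i;
                parts.add(pool.submit(() -> {
                    long sum = 0; // each task sums an interleaved slice of the range
                    for (long n = offset; n < 10_000_000L; n += cores) sum += n;
                    return sum;
                }));
            }
            long total = 0;
            for (Future<Long> part : parts) total += part.get(); // combine partial sums
            pool.shutdown();
            System.out.println(total);
        }
    }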
During that incubation year, work had already begun to refactor Flume under the Star Trek-themed tag, Flume-NG (Flume the Next Generation) Apache Flume Distributed Log Collection for Hadoop
We refactor the TimerTest example and extract a TalkingClock class. Core Java Volume I Fundamentals 9th Edition
It’s easy to refactor the existing project structure into the identified modules. Gradle in Action
At Stage 4 of the cycle (where MapReduce job should be invoked from an Oozie action), it is possible to either use an existing driver as is by using a java action, or refactor it for use as an Oozie map-reduce action. Professional Hadoop Solutions
As the plug-in takes shape, we will slowly refactor it into a distributable plug-in project with a lifecycle of its own. Gradle Beyond The Basics
We can refactor the MapReduce program to track the count explicitly. Hadoop in Action
Once you do this, you can refactor the method to give it a single return: Effective Java 2nd Edition
Lastly, we revisited the portfolio example from chapter 3 in order to refactor it to use Spring JMS. ActiveMQ in Action
When locking presents liveness or performance problems for a given class or program, usually the best solution is to refactor the design to use one or more of the other approaches presented in this chapter. Concurrent Programming in Java Design Principles and Pattern
While you are always free to solve problems in a locally optimized way by writing Groovy code inside your build, the most maintainable builds will eventually refactor this code into organized plug-ins that expose new build language idioms to the build files that use them. Building and Testing with Gradle
There is also a Java Runtime Environment (JRE) that contains the virtual machine but not the compiler. Core Java Volume I Fundamentals 9th Edition
OSNaming JNDI service provider is a part of the Java SE 7 SDK and JRE. Java EE 7 Specification
HBase needs the Java Runtime Environment (JRE) to be installed and available on the system. HBase in Action
JAVA_HOME must be set to point to a JDK, not to a JRE. Hadoop MapReduce Cookbook
All of the following assumes you have the Java Runtime Environment (JRE) installed. HBase:The Definitive Guide
You can run into grief if you drop a JAR file into the jre/lib/ext directory and one of its classes needs to load a class that is not a system or extension class. Core Java Volume II Advanced Features 9th Edition
The deployJava.js is part of the Java Deployment Toolkit and is used to detect the presence of a JRE, install one if necessary, and then run an applet. Java 7 New Features Cookbook
Ivy requires Ant, and building from source requires the complete JDK, version 1.6.0_20 or better, not just the JRE. Cassandra The Definitive Guide
Although the JRE (Java Runtime Environment) is all you need to run Hive, you will need the full JDK (Java Development Kit) to build examples in this book that demonstrate how to extend Hive with Java code. Programming Hive
The Java platform consists of a hierarchy of components, starting with the Java Development Kit (JDK), which is composed of the Java Runtime Environment (JRE), the Java programming language, and platform tools that are necessary to develop and run Java applications. Java 7 Recipes A Problem-Solution Approach
You do this by instantiating implementations of AppleFormatter and giving them as arguments to prettyPrintApple: Java 8 in Action Lambdas, Streams and Functional-style Programming
To do that, you must first connect to the Zookeeper instance keeping track of Accumulo by instantiating a ZookeeperInstance class, which will return a connector. Professional Hadoop Solutions
Using an HTablePool is more common in practice than instantiating HTables directly: HBase in Action
Instantiating a Configuration object and obtaining a FileSystem handle within a Hadoop environment will point it to the HDFS NameNode of that environment. Hadoop MapReduce Cookbook
There is a cost to instantiating an HTable, so if you do this for each insert, you may have a negative impact on performance, hence our setup of HTable in the configure() step. Hadoop:The Definitive Guide 3rd Edition
You must provide a row key when instantiating an Increment, which sets the row containing all the counters that the subsequent call to increment() should modify. HBase:The Definitive Guide
Another difference is that Pig programs are written in Pig Latin, whereas Cascading works more like a Java framework in which you create a data processing flow through instantiating various Java classes (Each, Every, etc.) Hadoop in Action
For example, the following uses a raw type when instantiating the identifier, list: Java 7 New Features Cookbook
We introduce them here, however, because they are the next higher level of syntax after methods, and because the rest of this chapter requires a basic familiarity with the concept of class and the basic syntax for defining a class, instantiating it, and using the resulting object. Java in a Nutshell 5th Edition
The method with this signature can be invoked without instantiating its containing class because it is marked as static. Java 7 Recipes A Problem-Solution Approach
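The Hadoop-flavored quotes above share one shape: instantiate a Configuration, then ask for a handle. A minimal sketch; inside a configured Hadoop environment the handle points at that cluster's NameNode, and the path checked is just an example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsHandleSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration(); // picks up core-site.xml from the classpath
            FileSystem fs = FileSystem.get(conf);     // bound to whatever fs.defaultFS names
            System.out.println(fs.exists(new Path("/user"))); // sample path, for illustration
        }
    }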
For example, in a 32-bit Intel-compatible Linux host, the default is usually at /usr/lib/ganglia. Monitoring with Ganglia
The argument -reducer "/usr/bin/perl -w wordCount.pl" causes the Perl program wordCount.pl to be used to perform the reduce. Pro Hadoop
Be sure to read through the Debian-specific system documentation that can be found under /usr/share/couchdb. CouchDB The Definitive Guide
Suppose we have both OpenJDK and Oracle Java installed in /usr/openjdk and /usr/jdk respectively. Hadoop Operations and Cluster Management Cookbook
One variant uses a shared /lib for moderate sharing, while the other shares /usr/include, which is very heavily read. Ceph A Scalable, High-Performance Distributed File System
In this example, we connected to the MySQL server using hdp_usr. Hadoop RealWorld Solutions Cookbook
This will create a new directory called otp_src_R13B01 under /usr/local/erlang. Beginning CouchDB
Next you’ll need to untar the tarball that the build created under /usr/local: Hadoop in Practice
Python script to filter out poor-quality weather records #!/usr/bin/env python. Hadoop:The Definitive Guide 3rd Edition
This directory can be made to be read only when configured correctly and usually lives in /usr/local, /opt, or /usr when Hadoop is installed via packages. Hadoop Operations
He has more than 7 years of experience in software and middleware (Apache, Tomcat, PostgreSQL, Hadoop eco system) engineering. HBase Administration Cookbook
The command will automatically install the PostgreSQL database, which is required by Ambari. Hadoop Operations and Cluster Management Cookbook
Using HDFS in a Greenplum external table Greenplum is a parallel database that distributes data and queries to one or more PostgreSQL instances. Hadoop RealWorld Solutions Cookbook
You could either use one of the free databases, such as MySQL or PostgreSQL, or a pure key/value store like Berkeley DB. HBase in Action
He has more than seven years' experience in software and middleware (Apache, Tomcat, PostgreSQL, and Hadoop eco system) engineering. Hadoop MapReduce Cookbook
What are the disadvantages of fast connectors? First, only MySQL and PostgreSQL are currently supported. Hadoop in Practice
For a majority of small- to medium-volume applications, there is no substitute for the ease of use, flexibility, maturity, and powerful feature set of available open source RDBMS solutions such as MySQL and PostgreSQL. Hadoop:The Definitive Guide 3rd Edition
Today, Sqoop includes native connectors (called direct support) for MySQL and PostgreSQL. Hadoop Operations
The first iteration of the data warehouse was built using an open source system with a Java SQL optimizer and a set of underlying PostgreSQL databases. Programming Hive
Hence, we use PostgreSQL as the database layer and Hadoop as the communication layer, Hive as the translation layer, and all code we add we release as open source [2] HadoopDB An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
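Connecting to PostgreSQL from Java is a one-liner with JDBC; the host, database, and credentials below are placeholders, and the PostgreSQL driver JAR must be on the classpath.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PgSketch {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://localhost:5432/metastore"; // placeholder database
            try (Connection conn = DriverManager.getConnection(url, "hive", "secret");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT version()")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1)); // server version string
                }
            }
        }
    }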
Programming with mutability in pure Java is like working with the mother-in-law who’s just waiting for you to fail. Programming Concurrency on the JVM
Another aspect of functional programming is that it doesn’t have side effects or mutability. Scala in Action
The old java.util.Date class and all other classes used to model date and time in Java before Java 8 have many inconsistencies and design flaws, including their mutability and some poorly chosen offsets, defaults, and naming. Java 8 in Action Lambdas, Streams and Functional-style Programming
For sets and maps, Scala models mutability in the class hierarchy. Programming in Scala 2nd Edition
These patterns focus on writing code that avoids mutability and favors a declarative style, which helps us write simpler, more maintainable code. Functional Programming Patterns in Scala and Clojure
If you remove mutability, then the problems essentially go away. Programming Scala
Given that mutability leads to unexpected defects, minimizing mutability in code minimizes mutability-related defects. Beginning Scala
If a class cannot be made immutable, limit its mutability as much as possible. Effective Java 2nd Edition
Its mutability makes it suitable for processing text in place, which is not possible with the immutable String class. Java in a Nutshell 5th Edition
Criteria Objects and Mutability Typical usage of the criteria API will result in many different objects being created. Pro JPA2
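The advice to minimize mutability boils down to a handful of mechanics in Java: a final class, final fields, no setters, and "mutators" that return new instances. A minimal sketch:

    public final class Point {          // final class: no subclass can add mutability
        private final int x;            // final fields: state is fixed at construction
        private final int y;

        public Point(int x, int y) { this.x = x; this.y = y; }

        public int x() { return x; }
        public int y() { return y; }

        // Instead of mutating, return a new value; the old Point stays valid forever.
        public Point translate(int dx, int dy) { return new Point(x + dx, y + dy); }
    }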
Thymeleaf: It is a modern templating framework that provides an excellent alternative to JSPs. Spring Security 3.1
Chapter 6 will explore various view technologies that can be used with Spring, including JSP, Apache Tiles, and Thymeleaf. Spring in Action 4th Edition
The Google File System (GFS [14]) is one such distributed storage layer widely used in the company. Dremel - Interactive Analysis of WebScale Datasets
A GFS cluster consists of a single master and multiple chunkservers. GFS - The Google File System
We conserve network bandwidth by taking advantage of the fact that the input data (managed by GFS [8]) is stored on the local disks of the machines that make up our cluster. MapReduce - Simplied Data Processing on Large Clusters
The Google File System (GFS)[9] is another distributed file system built for hosting the state of Google’s internal applications. Cassandra A Decentralized Structured Storage System
The data sets are often stored in GFS, the Google File System [9] Interpreting the Data Parallel Analysis with Sawzall
Bigtable in turn uses both Chubby[7] (a distributed lock system) and GFS. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
Another storage data structure that is often required at scale is a FIFO queue, like that provided by GFS [7] RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
In addition, both GFS and Bigtable use Chubby as a well-known and available location to store a small amount of meta-data; in effect they use Chubby as the root of their distributed data structures. Chubby - The lock service for loosely-coupled distributed systems
Bigtable runs on top of the Google File System (GFS) Cassandra The Definitive Guide
Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. Bigtable - A Distributed Storage System for Structured Data
You can update or remove tag cloud, autocomplete, country code, and calendar widgets from reuters.js. Web Crawling and Data Mining with Apache Nutch
Additionally, a suggester with an autocomplete feature starts providing users with options as they type the search query, enhancing the overall experience. Scaling Big Data with Hadoop and Solr Chapter-02 Understanding Solr
Figure 4-9 shows how gweb search autocomplete allows you to find metrics across your entire deployment. Monitoring with Ganglia
We’ll begin by building autocomplete functions to quickly find users in short and long lists of items. Redis in Action
But the book doesn't only pay attention to low-level aspects of ElasticSearch; it includes tips for improving the user search experience, such as dealing with spelling mistakes, building a highly effective autocomplete feature, and a tutorial on how you can deal with query-related improvements. Mastering ElasticSearch
The name-to-UID rows enable support for autocomplete of tag names. HBase in Action
You can customize the completion tokens by creating a file named autocomplete and placing it on Pig’s classpath (such as in the conf directory in Pig’s install directory) or in the directory you invoked Grunt from. Hadoop:The Definitive Guide 3rd Edition
We can do this by using the autocomplete analyzer at index time and the standard analyzer at search time. Elasticsearch The Definitive Guide
If you start typing and hit the Tab key, the CLI will autocomplete possible keywords and function names. Programming Hive
Appendix, Real Life Situations, describes how to handle real-life situations such as implementing different autocomplete functionalities, using near real-time search, or improving query relevance. Apache Solr 4 Cookbook
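Stripped of the search-engine machinery above, autocomplete is a prefix range query over sorted terms. A minimal in-memory sketch with a TreeSet (the term list is invented):

    import java.util.Collections;
    import java.util.NavigableSet;
    import java.util.TreeSet;

    public class AutocompleteSketch {
        public static void main(String[] args) {
            NavigableSet<String> terms = new TreeSet<>();
            Collections.addAll(terms, "paxos", "pig", "postgresql", "puppet", "redis");
            String prefix = "p";
            // Everything starting with the prefix is a contiguous range in sorted order.
            for (String suggestion : terms.subSet(prefix, true, prefix + Character.MAX_VALUE, false)) {
                System.out.println(suggestion); // paxos, pig, postgresql, puppet
            }
        }
    }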
A novel extension to Paxos, detailed in Section 4.4.1, allows local reads at any up-to-date replica. Megastore - Providing Scalable, Highly Available Storage for Interactive Services
CouchDB differs from others by accepting eventual consistency, as opposed to putting absolute consistency ahead of raw availability, like RDBMS or Paxos. CouchDB The Definitive Guide
At the highest level of abstraction, it is a database that shards data across many sets of Paxos [21] state machines in datacenters spread all over the world. Spanner - Google’s Globally-Distributed Database
I suggest that you should look at Paxos and Cassandra's use of that algorithm to effect conditional updates at the following URLs: Storm Blueprints Patterns for Distributed Real-time Computation
Improved primitives: distributed cross-group transactions; synchronous replication groups (Paxos); automatic failover of client requests. Storage Architecture and Challenges
The cluster, which is based in part on the Paxos part-time parliament algorithm [14], is designed to favor consistency and durability over availability and update latency. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
One might argue that we should have built a library embodying Paxos, rather than a library that accesses a centralized lock service, even a highly reliable one. Chubby - The lock service for loosely-coupled distributed systems
Traditional protocols to implement replicated state machines, like Paxos [2], do not enable such a feature directly, however. Zab High-performance broadcast for primary-backup systems
Because this clause produces mutual exclusion, a common name for such a mechanism is mutex. Thinking in Java 4th Edition
A mutex is a data structure that guarantees mutually exclusive access to some resource, preventing concurrent access from causing problems. High Performance MySQL 3rd Edition
The volatile keyword accomplishes this: it forces reads and writes to the variable to go all the way out to main memory (where they’re visible to all threads), but there’s no mutex. Spring in Practice
A mutex lock is also known as a mutually exclusive lock. Java Threads 3rd Edition
Notice in the preceding code that we used a mutex to sequence the updates to scores, thereby eliminating the race condition mentioned earlier. Storm Blueprints Patterns for Distributed Real-time Computation
The Monitor class functions as a Mutex, ensuring serial access to the defined areas in our code, much like the synchronized keyword but with much easier semantics and some useful additional features. Getting Started with Google Guava
This allowed for operating system tools such as Solaris mpstat to easily monitor a Java application for lock contention by observing the “smtx” (spin on mutex) column along with observing system or kernel CPU utilization. Java Performance
A Mutex (short for mutual exclusion lock) class can be defined as (omitting implementation code): Concurrent Programming in Java Design Principles and Pattern
Most global variables have an associated mutex that other threads will lock before accessing it. Understanding MySQL Internals
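In Java the mutex idea appears either as the synchronized keyword or, more explicitly, as ReentrantLock; a sketch of the lock/try/finally shape the quotes above allude to (field names are invented):

    import java.util.concurrent.locks.ReentrantLock;

    public class MutexSketch {
        private final ReentrantLock mutex = new ReentrantLock();
        private int score; // shared state the mutex protects

        public void addPoints(int points) {
            mutex.lock();        // at most one thread gets past this line at a time
            try {
                score += points; // the read-modify-write race, now serialized
            } finally {
                mutex.unlock();  // always release, even if the body throws
            }
        }
    }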
The methods in turn use JNI to actually interact with an embedded Java Virtual Machine (JVM) which actually interacts with HDFS. Pro Hadoop
The only way to write a true ping application in Java would be to use the Java Native Interface (JNI) to access native code; such a discussion is beyond the scope of this book. Java Network Programming and Distributed Computing
For example, a Java version of MapReduce is implemented as a JNI veneer on top of the C++ version of MapReduce. FlumeJava - Easy, Efficient Data-Parallel Pipelines
Using JNI: JNI is a programming framework that allows Java code running in a JVM to call and to be called by native applications. Professional Hadoop Solutions
You also have to include the JNI header files of your Java installation. Hadoop MapReduce Cookbook
It works using the Java Native Interface (JNI) to call a Java filesystem client. Hadoop:The Definitive Guide 3rd Edition
Refer to the source code of the JNI library to find out what is available to you. HBase:The Definitive Guide
The usual Java mechanism for accessing non-native libraries is JNI [15], but it is regarded as slow and cumbersome. Chubby - The lock service for loosely-coupled distributed systems
The Java platform has an API for interoperating with native C code called the Java Native Interface (JNI) Core Java Volume II Advanced Features 9th Edition
In Java 1.1, the -jni option specified that javah should generate new-style files. Java in a Nutshell 5th Edition
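The Java half of a JNI binding is small; the matching C function (named Java_PingSketch_nativePing by JNI convention) lives in a native library. The library name and method here are hypothetical.

    public class PingSketch {
        static {
            System.loadLibrary("ping"); // resolves libping.so / ping.dll via java.library.path
        }

        // No body in Java: the implementation is supplied by the native library.
        private static native boolean nativePing(String host, int timeoutMillis);

        public static void main(String[] args) {
            System.out.println(nativePing("example.com", 1000));
        }
    }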
Further, it is possible to mix and match the deaf/mute and default topologies to create a system architecture that better suits your environment. Monitoring with Ganglia
Preventing infinite loops is important for some of the more useful replication topologies, such as master-master replication.5 High Performance MySQL 3rd Edition
As shown in Figure 9-19, Storm applications (or topologies) are defined in the form of input streams (called spouts) and a connected graph of processing nodes (called bolts) Professional Hadoop Solutions
Next in the Storm cluster, jobs are typically referred to as topologies; the only difference is that these topologies run forever. Apache Kafka
A table lookup may make more sense for more complex cluster topologies. Hadoop in Action
Network Usage in Hadoop: A Review. Hadoop was developed to exist and thrive in real-world network topologies. Hadoop Operations
This is known as a network of brokers and can support many different topologies. ActiveMQ in Action
In this chapter, we will cover the following replication topologies that MongoDB supports: The Definitive Guide to MongoDB
Approaches that assume particular architectures, topologies, processor capabilities, or other fixed environmental constraints are ill suited to commonly available JVM implementations. Concurrent Programming in Java Design Principles and Pattern
Replica sets provide other enhancements too, such as easier recovery and more sophisticated deployment topologies. MongoDB in Action
With replicas spread over a wide area, communication latencies limit overall throughput. Megastore - Providing Scalable, Highly Available Storage for Interactive Services
Cluster allocation latency was so high, users would often share long-awaited clusters with colleagues, holding on to nodes for longer than anticipated, raising latencies still further. Apache Hadoop YARN Yet Another Resource Negotiator
Our implementation of Paxos is pipelined, so as to improve Spanner’s throughput in the presence of WAN latencies; but writes are applied by Paxos in order (a fact on which we will depend in Section 4) Spanner - Google’s Globally-Distributed Database
Using nodetool cfhistograms to see the distribution of query latencies. Cassandra High Performance Cookbook
Larger clusters have imperfect load distributions, resulting in lower average per-MDS throughput (but, of course, much higher total throughput) and slightly higher latencies. Ceph A Scalable, High-Performance Distributed File System
Since users are served from data centers that are geographically distributed, being able to replicate data across data centers was key to keep search latencies down. Cassandra A Decentralized Structured Storage System
So, in this example, we can see that there’s a roughly log normal distribution of latencies for the selected execution pattern. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
After launching to employees, we noticed that our put and sync latencies were very high. Apache Hadoop Goes Realtime at Facebook
This allows us to track down all clusters, discover how big they are, see which versions of our software they are running, how much traffic they are receiving, and whether or not there are any problems such as unexpectedly large latencies. Bigtable - A Distributed Storage System for Structured Data
Here and in all further experiments we do not show the corresponding latencies, since they do not change significantly (the main goal is to reduce the memory footprint) PowerDrill - Processing a Trillion Cells per Mouse Click
This registration is performed by calling an addXXXListener() method in the event-firing component, in which "XXX" represents the type of event listened for. Thinking in Java 4th Edition
Note that lack of _source on the list of returned fields causes the sourceXXX() methods not to work (as discussed in the next point) Mastering ElasticSearch
The super column version of these commands uses an extra ['xxx'] to represent the extra level of the map called the sub-column. Cassandra High Performance Cookbook
A signature file (XXX.SF, where XXX is the name of the entity that signed the archive) that contains signature information. Java Security 2nd Edition
The resource method parameters, fields, or bean properties are bound by way of xxxParam annotations added during object creation time. Java EE 7 Essentials
Xxx getXxx(int columnNumber) and Xxx getXxx(String columnLabel), where Xxx is a type such as int, double, String, Date, etc. Core Java Volume II Advanced Features 9th Edition
To single out one record, look it up with "xxx", where xxx is one of the records displayed above. Java Network Programming 4th Edition
If you printed out the class name of the Connection object or other objects that you are using, you would probably see something like XXX Connection , XXX Statement , XXX ConnectionImpl , or XXX StatementImpl , where XXX is the name of the database you are using (Oracle, for example) Java Performance Tuning 2nd Edition
Lucene has a sandbox component, spellchecker, covered in section 8.XXX, that you can use. Lucene in Action 2nd Edition
In addition, you can call one of the ResultSet object’s updateXXX() methods to update any value within the ResultSet. Java 7 Recipes A Problem-Solution Approach
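A concrete instance of the getXxx pattern, using an in-memory H2 database purely so the snippet is self-contained (H2 is an assumption, chosen only to avoid external setup):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class GetXxxSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery("SELECT 42 AS answer, 'deep thought' AS source")) {
                while (rs.next()) {
                    int answer = rs.getInt("answer");  // getXxx(String columnLabel)
                    String source = rs.getString(2);   // getXxx(int columnNumber)
                    System.out.println(source + ": " + answer);
                }
            }
        }
    }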
The reason for this is the log4j configuration file would be included on the classpath. Apache Flume Distributed Log Collection for Hadoop
A dependency is generally an external, reusable library in the form of a JAR file (for example, Log4J for logging support) Gradle in Action
The following table shows all the logging levels provided by Log4j: Hadoop Operations and Cluster Management Cookbook
Forwarding Log4j logs to a central server: The faster a problem can be diagnosed and corrected, the better. Cassandra High Performance Cookbook
Within each of the directories in the previous code there are three files, corresponding to standard out, standard error, and the system log (output from both the infrastructure task code, as well as any of your own log4j logging): Hadoop in Practice
Flume supports a large variety of sources; some of the more commonly used ones include tail (which pipes data from a local file being written to into Flume, just like Unix tail), syslog, and Apache log4j (allowing Java applications to write events to files in HDFS via Flume) Hadoop:The Definitive Guide 3rd Edition
There are three basic functions: the storage-conf.xml file allows you to create your data store by configuring your keyspace and column families; there are files related to setting up authentication; and finally, the log4j properties let you change the logging levels to suit your needs. Cassandra The Definitive Guide
More precisely, this is the default behavior as specified within the log4j property file within our conf directory. Hadoop Beginners Guide
It supports native connectivity and support for writing directly to HDFS, and simplifies reliable, streaming data delivery from a variety of sources including RPC services, log4j appenders, syslog, and even the output from OS commands. Hadoop Operations
By adding the custom Log4J RepositorySelector to the application, you’ll avoid this error. ActiveMQ in Action
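Typical Log4j 1.x usage from application code; which of these lines reaches a file or console is decided entirely by the log4j configuration on the classpath, as several of the quotes above note.

    import org.apache.log4j.Logger;

    public class LogSketch {
        private static final Logger LOG = Logger.getLogger(LogSketch.class);

        public static void main(String[] args) {
            LOG.debug("dropped unless the configured level is DEBUG or lower");
            LOG.info("starting up");
            LOG.error("something went wrong", new IllegalStateException("example"));
        }
    }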
The following example shows how to use the --hive-overwrite in conjunction with enabling LZOP compression. Hadoop in Practice
The LzopCodec is compatible with the lzop tool, which is essentially the LZO format with extra headers, and is the one you normally want. Hadoop:The Definitive Guide 3rd Edition
Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs) Ceph A Scalable, High-Performance Distributed File System
Emerging clustered storage architectures constructed from storage bricks or object storage devices (OSDs) seek to distribute low-level block allocation. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
In the Java space, you can choose from a wide range of open source and commercial solutions, such as Checkstyle, PMD, Cobertura, FindBugs, and Sonar. Gradle in Action
You will also find recipes on static code review through PMD and FindBugs. Jenkins Continuous Integration Cookbook
There are tools already available for Java and Groovy projects to analyze and check source code, such as Checkstyle, JDepend, PMD, FindBugs, CodeNarc, and Sonar. Gradle Effective Implementation Guide
Modern IDEs have great support for many code quality tools—Checkstyle, PMD, and FindBugs all have plugins for Eclipse, NetBeans, and IntelliJ, which provide rapid feedback for developers on code quality issues. Jenkins The Definitive Guide
Although JSLint does a better/more thorough job, it is relatively easy to add new rule sets to PMD for more custom checking in your environment. Testable JavaScript
Head on to PMD's home page for a more comprehensive list. Sonar Code Quality Testing Essentials
These tools, such as Checkstyle, PMD, or FindBugs have a set of predefined rules of good practices and parse the source code looking for violations of those rules. Java 7 Concurrency Cookbook
In the dependencies closure, we once again address the myPmd grouping and indicate that it should download the PMD JAR as needed. Building and Testing with Gradle
JavaMail supports MIME attachments, and makes it easy to compose and transmit a multipart message with a file attachment. Java Network Programming and Distributed Computing
In addition to checking the option to make a post request multipart, you will need to specify the absolute path of the file, in cases where the file you are uploading is not within JMeter's bin directory, or the relative path in cases where the file resides within JMeter's bin directory. Performance Testing With JMeter 2.9
Instead of integration with Apache Commons File Upload, we will discuss how we can use Spring MVC with the Servlet 3.0 container’s built-in multipart support for file upload. Pro Spring 3
Next, the incoming request is checked for whether it is a multipart HTTP request (this is used when doing file uploads) Pro Spring MVC
There are two flavors of a put operation — a single upload and a multipart upload. Professional Hadoop Solutions
The getParts method provides a Collection of parts for this multipart request. Java EE 7 Essentials
Link walking always returns a multipart/mixed, since a single key can contain any number of links, meaning any number of objects returned. A Little Riak Book
An XOP package is created by placing a serialization of the XML infoset inside of an extensible packaging format (such as MIME Multipart/Related) Java Performance
Solution: Creating a message that contains different parts (a multipart message) is what allows you to send attachments such as files and images. Java 7 Recipes A Problem-Solution Approach
In this chapter, you’ll learn how to customize Spring MVC configuration, handle multipart file uploads, deal with exceptions that may occur in a controller, and pass data between requests with flash attributes. Spring in Action 4th Edition
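A sketch of the Servlet 3.0 built-in multipart support mentioned above: @MultipartConfig turns the parsing on, and getParts() hands back each uploaded piece. The URL pattern and target directory are placeholders, and getSubmittedFileName requires Servlet 3.1.

    import java.io.IOException;
    import javax.servlet.ServletException;
    import javax.servlet.annotation.MultipartConfig;
    import javax.servlet.annotation.WebServlet;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.Part;

    @WebServlet("/upload")
    @MultipartConfig
    public class UploadServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws ServletException, IOException {
            for (Part part : req.getParts()) {                     // one Part per field or file
                part.write("/tmp/" + part.getSubmittedFileName()); // write the upload to disk
            }
            resp.setStatus(HttpServletResponse.SC_NO_CONTENT);
        }
    }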
Here, we print a stack trace by using the printStackTrace method of the Throwable class. Core Java Volume I Fundamentals 9th Edition
Throwable: In case the servers return an exception for the operation, it is returned to the client as-is. HBase:The Definitive Guide
In this code, the onComplete method is invoked when the request processing is finished, after a response is processed and is sent back to the client, or when an unmapped throwable has been propagated to the hosting I/O container. Java EE 7 Essentials
This exception chain is in addition to the “cause” chain of Throwable objects that every exception has. Core Java Volume II Advanced Features 9th Edition
In the event of a failure, we are getting the error message from the Throwable object. Getting Started with Google Guava
There is some confusion among programmers as to when it is appropriate to use each kind of throwable. Effective Java 2nd Edition
The first is the message to be associated with the AssertionError and the second is a Throwable cause. Java 7 New Features Cookbook
The Throwable class and all its subclasses include a String field that stores a human-readable error message that describes the exceptional condition. Java in a Nutshell 5th Edition
All exceptions inherit from Throwable, as shown in Figure 6-1 Java 7 Recipes A Problem-Solution Approach
NOTE: The classes Throwable, Exception, RuntimeException, and Error have constructors with an option for disabling suppressed exceptions and for disabling stack traces. Java SE 8 for the Really Impatient
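The cause chain those quotes refer to, end to end: wrap the low-level exception, then walk getCause() back down. printStackTrace renders the same chain as "Caused by:" sections.

    import java.io.IOException;

    public class CauseChainSketch {
        public static void main(String[] args) {
            try {
                try {
                    throw new IOException("disk unplugged");             // low-level failure
                } catch (IOException e) {
                    throw new RuntimeException("could not save doc", e); // e becomes the cause
                }
            } catch (RuntimeException e) {
                for (Throwable t = e; t != null; t = t.getCause()) {
                    System.out.println(t); // the wrapper first, then the IOException
                }
            }
        }
    }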
If you define a method inside a class, then it is, automatically, an inline method. Core Java Volume I Fundamentals 9th Edition
In this case, we’re using them to import a template and some metadata about the blog (as JSON data), as well as to include link and template rendering functions as inline code. CouchDB The Definitive Guide
The special symbol @ specifies that the configuration is pointing to a file rather than inline. Hadoop Operations and Cluster Management Cookbook
Script location must be set to File instead of Inline. Gradle Effective Implementation Guide
This recipe will outline a shorthand technique for inline table creation when the query is executed. Hadoop RealWorld Solutions Cookbook
The shell provides you with tab-completion of your commands and inline access to command documentation: HBase in Action
For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required: Hadoop:The Definitive Guide 3rd Edition
The C++ version of the JNIEnv class has inline member functions that take care of the function pointer lookup for you. Core Java Volume II Advanced Features 9th Edition
Views are simply named SELECT statements which are expanded inline during compilation. Tenzing - A SQL Implementation On The MapReduce Framework
The result file is large and printing the complete results inline will kill too many trees. Programming Hive
However, Java’s syntax isn’t a natural fit for producing DSLs. Gradle in Action
How can you simplify MapReduce development using domain-specific languages (DSLs)? Professional Hadoop Solutions
And second, they support writing domain-specific languages (DSLs) within the language. Programming in Scala 2nd Edition
The Gradle APIs are rich, the possibilities for DSLs matching your domain are abundant, and the path towards finally having a build system that conforms to your product is clear. Gradle Beyond The Basics
On the other hand, we’ve got internal DSLs, also known as embedded languages. Functional Programming Patterns in Scala and Clojure
Domain-Specific Languages: Domain-Specific Languages, or DSLs, provide a convenient syntactical means for expressing goals in a given problem domain. Programming Scala
First off, developers can create domain-specific languages (DSLs) using their favorite language, such as Groovy, JRuby, Jython, Scala, and others (see the project site for examples) Hadoop:The Definitive Guide 3rd Edition
Cascading to Hive and Pig offers a Java API, as opposed to the domain-specific languages (DSLs) provided by the others. HBase:The Definitive Guide
Although we refer to the Query DSL, in reality there are two DSLs: the Query DSL and the Filter DSL. Elasticsearch The Definitive Guide
Choosing implicit conversions and designing domain-specific languages (DSLs) takes time, thought, and deliberation. Beginning Scala
The description starts with zero or more modifiers (shift, control, ctrl, meta, alt, altGraph) and ends with either the string typed, followed by a one-character string (for example, "typed a"), or an optional event specifier (pressed or released, with pressed being the default), followed by a key code. Core Java Volume I Fundamentals 9th Edition
If you started in a console, then hitting Ctrl-C and rerunning the same command you used to start it is the simplest way. CouchDB The Definitive Guide
Windows: Open the task manager with the keyboard shortcut Ctrl+Shift+Esc and click the Processes tab. Gradle in Action
To stop the Jetty web container, we press Ctrl + C at the command line, to return to our prompt. Gradle Effective Implementation Guide
Not to worry, just open another Terminal tab by hitting Shift+Ctrl+T. Beginning CouchDB
This lets you see what it's doing, and you can easily stop it with CTRL+C. Elasticsearch in Action MEAP V07 Chapter 1
To run the program, we specify the name of the file to write output to (pairs.avro) and send input pairs over standard in, marking the end of file by typing Ctrl-D: Hadoop:The Definitive Guide 3rd Edition
Stopping the server requires a Ctrl-C to abort the start script. HBase:The Definitive Guide
We type a word on each line and hit Enter, then kill the session with a Ctrl + C command. Hadoop Beginners Guide
You can press CTRL-C in the second terminal to shut down the Consumer at this point. ActiveMQ in Action
Abstract We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. Resilient Distributed Datasets - A Fault-Tolerant Abstraction for In-Memory Cluster Computing
To prevent conflicts from causing starvation, Omega schedulers typically choose to use incremental transactions, which accept all but the conflicting changes (i.e., the transaction provides atomicity but not independence) Omega - flexible, scalable schedulers for large compute clusters
This 5-tuple key provides atomicity, locality, uniqueness, access control, and versioning. Professional Hadoop Solutions
Future work involves adding compression, the ability to support atomicity across keys, and secondary index support. Cassandra A Decentralized Structured Storage System
The CouchDB layout and committing system adheres to the ACID properties (atomicity, consistency, isolation, and durability), which guarantee the reliable execution of database transactions. Beginning CouchDB
For those who don’t know (or don’t remember), ACID is an acronym standing for atomicity, consistency, isolation, and durability. HBase in Action
Designing the schema in a way to completely avoid explicit locking, combined with row-level atomicity, gives you the ability to scale your system without any notable effect on read or write performance. HBase:The Definitive Guide
It also provides row-level atomicity guarantees, but no native cross-row transactional support. Apache Hadoop Goes Realtime at Facebook
Record-Level Atomicity on Writes Cassandra automatically gives you record-level atomicity on every write operation. Cassandra The Definitive Guide
The performance gains and atomicity of the service co-located with its broker can be an attractive architectural change, too. ActiveMQ in Action
In other words, synchronized is not equivalent to atomic, but synchronization can be used to achieve atomicity. Concurrent Programming in Java Design Principles and Pattern
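That last distinction in code: both counters below give an atomic increment, one by excluding other threads with a lock, the other with a single atomic operation. The class and field names are invented.

    import java.util.concurrent.atomic.AtomicInteger;

    public class CounterSketch {
        private int locked;
        private final AtomicInteger lockFree = new AtomicInteger();

        public synchronized void incrementLocked() {
            locked++; // read-modify-write, made atomic by holding the monitor
        }

        public void incrementLockFree() {
            lockFree.incrementAndGet(); // atomic without blocking other threads
        }
    }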
Use instanceof to check before casting from a superclass to a subclass. Core Java Volume I Fundamentals 9th Edition
Even if an HttpURLConnection object is expected, it is good programming practice to test the class type using the instanceof keyword. Java Network Programming and Distributed Computing
Adding Scala-style pattern matching in full generality to Java seems quite a big job, but following the recent generalization to switch to allow Strings, you can imagine a more-modest syntax extension, which allows switch to operate on objects, using the instanceof syntax. Java 8 in Action Lambdas, Streams and Functional-style Programming
As a result, the typecast will generate a warning, and a corresponding instanceof test is illegal. Data Structures and Algorithm Analysis in Java 3rd Edition
You have two options available: you can either determine the type (using instanceof) in your exec method, or you can specify in your UDF type-specific alternative implementations, which is neater and removes ugly type-switching code. Hadoop in Practice
You can use the instanceof operator to find out whether a particular remote object implements an interface. Core Java Volume II Advanced Features 9th Edition
If o is a normal Point, do a color-blind comparison if (!(o instanceof ColorPoint)) Effective Java 2nd Edition
The instanceof operator requires an object or array value as its left operand and the name of a reference type as its right operand. Java in a Nutshell 5th Edition
In the application layer where Java programs reside, you simply don’t need to know this (and even if you do need to know, it’s quicker to check the size of the byte array returned by getAddress() than to use instanceof to test which subclass you have) Java Network Programming 4th Edition
Or you could focus on particular classes within a more general monitoring class by filtering interesting hierarchies using instanceof. Java Performance Tuning 2nd Edition
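The check-before-cast idiom the quotes recommend, in its plainest form:

    public class CastSketch {
        public static void main(String[] args) {
            Object[] values = { "forty-two", 42 };
            for (Object value : values) {
                if (value instanceof String) {   // narrow only after the test succeeds
                    String s = (String) value;
                    System.out.println(s.toUpperCase());
                } else {
                    System.out.println("not a string: " + value);
                }
            }
        }
    }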
The nodetool program communicates with the JMX port to confirm that the server is alive. Cassandra High Performance Cookbook
It also has scripts to run the nodetool, which is a utility for inspecting a cluster to determine whether it is properly configured, and to perform a variety of maintenance operations. Cassandra The Definitive Guide
Nowadays, the covariant return type feature lets you specify the correct return type. Core Java Volume I Fundamentals 9th Edition
Covariant return types: Java SE5 adds covariant return types, which means that an overridden method in a derived class can return a type derived from the type returned by the base-class method: Thinking in Java 4th Edition
The Maybe class is declared as covariant on type A. Scala in Action
Or, since it just has one type parameter, you could say simply that Queues are covariant. Programming in Scala 2nd Edition
In Java, as we mentioned in Section 1.4.4, arrays are covariant. Data Structures and Algorithm Analysis in Java 3rd Edition
This rule allows for covariant return types - refining the return type of a method when overriding it. The Java Language Specification Java SE 7 Edition
This is especially true for library writers, who will want to understand when to use parameterized types versus abstract types, which type parameters should be covariant, contravariant, or invariant under subtyping, and so forth. Programming Scala
We’ll discuss why this happened and the very complex topic of invariant, covariant, and contravariant types later in this chapter. Beginning Scala
Covariant returns to allow a subclass to override a superclass method and narrow the return type of the method. Java in a Nutshell 5th Edition
If List was immutable, as it is in a functional programming language, then the problem would disappear, and one could have a covariant list. Java SE 8 for the Really Impatient
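A compact Java illustration of covariant return types: the override narrows Shape to Circle, so callers of the subclass need no cast. The class names are invented.

    class Shape {
        Shape copy() { return new Shape(); }
    }

    class Circle extends Shape {
        @Override
        Circle copy() { return new Circle(); } // legal: Circle is a subtype of Shape
    }

    public class CovariantSketch {
        public static void main(String[] args) {
            Circle original = new Circle();
            Circle duplicate = original.copy(); // no cast needed thanks to covariance
            System.out.println(duplicate != original); // true: a fresh instance
        }
    }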
We will start two EC2 instances; one is a DNS/NTP server, and the other one is the client: HBase Administration Cookbook
Enable the NTP server with the following command: sudo service ntpd start. Hadoop Operations and Cluster Management Cookbook
Network Time Protocol (NTP) is a distributed hierarchical system used to keep system clocks in sync. Cassandra High Performance Cookbook
As we described earlier, this can be achieved using NTP. HBase in Action
We synchronized the two masters’ clocks using NTP and measured the mean time. Mesos A Platform for Fine-Grained Resource Sharing in the Data Center
Run NTP on your cluster, or an equivalent application, to synchronize the time on all servers. HBase:The Definitive Guide
One good strategy here is the use of a Network Time Protocol (NTP) server. Cassandra The Definitive Guide
This contrasts with the various NTP and Time Server protocols, which do not return data that can be easily read by humans. Java Network Programming 4th Edition
Clock synchronization can be achieved through the use of Network Time Protocol (NTP) or using the rdate utility available on Oracle Solaris and Linux systems. Java Performance
You need to ensure that each of your servers uses NTP (Network Time Protocol) to keep their clocks synchronized. MongoDB in Action
The recommended solution is the current beta project for a CouchDB installer. Getting Started with CouchDB
The Cloudera folks have Amazon machine images (AMIs), Debian and RPM installer files, and an online configuration tool to generate configuration files. Pro Hadoop
CouchDB intends to provide an official Windows installer at some point in the future, so this may change. CouchDB The Definitive Guide
Download the distribution for your OS (at the time of writing, this is version 7.3) and run the installer. Gradle in Action
We simply run the installer software to install Jenkins on our computer. Gradle Effective Implementation Guide
This recipe also requires Python and the Python installer pip. Cassandra High Performance Cookbook
To build the file, download the Jython java installer, run the installer, and select Standalone from the installation menu. Hadoop RealWorld Solutions Cookbook
The Xcode Tools Installer will now load, and you can click the Continue button to begin the install. Beginning CouchDB
RedHat uses packages called RPMs for installation, and Yum as a package installer that can fetch RPMs from remote Yum repositories. Hadoop in Practice
Crucially, they also provide hooks to run scripts at the end of the process, which are invaluable for doing the final system tweaks and customization that are not covered by the standard installer. Hadoop:The Definitive Guide 3rd Edition
Any existing errata can be viewed by selecting your title from http://www.packtpub. Apache Flume Distributed Log Collection for Hadoop
If you purchased this book elsewhere, you can visit http://www.packtpub. HBase Administration Cookbook
If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Hadoop Operations and Cluster Management Cookbook
If you find any errata, please report them by visiting http://www.packtpub. Gradle Effective Implementation Guide
Downloading the example code You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. Mastering ElasticSearch
If you find any errata, please report them by visiting http://www.packtpub.com/support, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Cassandra High Performance Cookbook
Downloading the example code for this book You can download the example code files for all the Packt books you have purchased from your account at http://www.packtpub.com. Hadoop RealWorld Solutions Cookbook
Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support. Hadoop MapReduce Cookbook
If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. Hadoop Beginners Guide
Get in touch with us at service@packtpub.com for more details. Java 7 Concurrency Cookbook
I’ll cover the basics of using Puppet manifests for configuration management and techniques for executing and managing those configurations with MCollective and Facter. Managing Infrastructure with Puppet
It shows you how to create Puppet recipes, extend Puppet, and use Facter to gather configuration data from your servers. Pro Puppet
For each GPU discovered on the system, the NVML modules expose the maximum and current clock speeds, utilization information for the GPU memory and SM, temperature, fan speeds, power draw, ECC mode, used and total GPU memory, performance state, and identifiers such as the PCI bus ID, the GPU UUID, and the brand. Monitoring with Ganglia
Values generated by UUID() have different characteristics from those generated by a cryptographic hash function such as SHA1(): the UUID values are unevenly distributed and are somewhat sequential. High Performance MySQL 3rd Edition
Make sure to leave the document ID blank, and CouchDB will generate a UUID for you. CouchDB The Definitive Guide
Similarly, if you're sharding on a coarsely-grained key like log level, add a second field to your shard key with a lot of granularity, such as an MD5 hash or UUID. MongoDB The Definitive Guide 2nd Edition
As shown in Listing 9-1, this table will have a key that contains a concatenation of the user's Universally Unique Identifier (UUID) and a timestamp for a picture that shows when the picture was taken. Professional Hadoop Solutions
This unique ID can be assigned by the user or application, or it can use a universally unique identifier (UUID)—a random number generated by CouchDB that greatly reduces the chance of duplicate IDs being used. Beginning CouchDB
This is a UUID generated when HBase starts with an empty storage directory. HBase:The Definitive Guide
This class is an immutable representation of 128-bit Universal Unique Identifier, or UUID, which serves as an identifier that is (with very high probability) globally unique. Java in a Nutshell 5th Edition
The scheme Cassandra uses is a version one UUID, which means that it is generated based on conflating the computer’s MAC address and the number of 100-nanosecond intervals since the beginning of the Gregorian calendar. Cassandra The Definitive Guide
Although the correlation ID in this case uses a random UUID, just about any ID generator can be used. ActiveMQ in Action
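In Java the java.util.UUID class covers the common cases above; randomUUID() produces the version 4 (random) flavor.

    import java.util.UUID;

    public class UuidSketch {
        public static void main(String[] args) {
            UUID id = UUID.randomUUID();            // 128 random bits; collisions are negligible
            System.out.println(id);                 // textual form: 8-4-4-4-12 hex digits
            UUID parsed = UUID.fromString(id.toString()); // round-trips through the text form
            System.out.println(parsed.version());   // prints 4 for randomUUID()
        }
    }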
This file is organized into different contexts: jvm, rpc, hdfs, mapred, and hbase. Monitoring with Ganglia
File descriptions: hadoop-env.sh configures the environment variables used by Hadoop; core-site.xml configures parameters for the whole Hadoop cluster; hdfs-site.xml configures parameters for HDFS and its clients; mapred-site.xml configures parameters for MapReduce. Hadoop Operations and Cluster Management Cookbook
The MapReduce community has made sure that applications written against 1.x can run on top of YARN in a fully binary compatible manner (mapred APIs) or just by recompiling (source compatibility for mapreduce APIs) Apache Hadoop YARN Yet Another Resource Negotiator
Each Hadoop component uses information in core-site.xml and either mapred-site.xml or hdfs-site.xml. Cassandra High Performance Cookbook
For example, if your cluster has 10 nodes running a task tracker and the mapred. Hadoop RealWorld Solutions Cookbook
Inside each node's HADOOP_HOME/conf directory, add the following to the core-site.xml, hdfs-site.xml and mapred-site.xml. Hadoop MapReduce Cookbook
Solution Working with compressed files in MapReduce involves updating the MapReduce configuration file mapred-site.xml and registering the compression codec you are using. Hadoop in Practice
Clicking on the link brings one to a Hadoop Map/Reduce Tutorial (http://hadoop.apache.org/core/docs/current/mapred_tutorial.html) explaining the Map/Reduce framework. Hadoop in Action
There are many configuration files, but the ones we need to modify are core-site.xml, hdfs-site.xml and mapred-site.xml. Hadoop Beginners Guide
Child tasks run as the mapred user; in secure mode, as the user that submitted the job. Hadoop Operations
Tools (such as Eclipse PDE, Bnd, and Bundlor) can help you easily and efficiently manage the manifest file, and they can easily be integrated with building tools like Ant or Maven. Spring Dynamic Modules in Action
This becomes important in more complex add-ons, because the Roo shell will take BND’s defaults and export all packages found in the source code. Spring Roo in Action
You can then use this to define an operation, which captures internal iteration as a combinator. Java 8 in Action Lambdas, Streams and Functional-style Programming
To figure out what to do, you need to know first what the individual parsers in the combinator frameworks return as a result (provided they succeed in parsing the input) Programming in Scala 2nd Edition
What we need is a new combinator, map2, that can combine two RNG actions into one using a binary rather than unary function. Functional Programming in Scala
However, the Scala library includes a powerful parser combinator library that can be used for parsing most external DSLs that have a context-free grammar. Programming Scala
Chapter 8 rolls the pieces together with an exploration of Scala’s parser combinator library. Beginning Scala
If we want to specify repetition, we need to use the rep parser combinator. Steps in Scala An Introduction to Object-Functional Programming
Jetty (http://www.mortbay.org/jetty/) is an embeddable servlet container, which serves a function similar to that of Tomcat or Resin. Mahout in Action
G2: Low memory footprint Tika should be embeddable within Java applications at low memory cost so that it’s as easy to use Tika in a desktop-class environment with capacious network and memory as it is within a mobile PDA with limited resources on which to operate. Tika in Action
Underneath, Storm makes use of zeromq (0mq, zeromq), an advanced, embeddable networking library that provides wonderful features that make Storm possible. Getting Started with Storm
Embeddable API The Embeddable EJB API allows client code and its corresponding enterprise beans to run within the same JVM and class loader. Java EE 7 Essentials
But even complex systems, such as an embeddable spreadsheet application, can function as individual beans. Java in a Nutshell 5th Edition
Berkeley XML DB is embeddable, so it can run as a JAR within your application. Cassandra The Definitive Guide
Not only is it written in 100% Java, but it’s also designed to be embeddable. ActiveMQ in Action
An embedded type is marked as such by adding the @Embeddable annotation to the class definition. Pro JPA2
In terms of persistence mapping, the EmailAddress class is an @Embeddable, which will cause the persistence provider to flatten out all properties of it into the table of the surrounding class. Spring Data Modern Data Access for Enterprise Java
The async mode allows the client to batch small random messages into larger data chunks before sending it over the network. Building LinkedIn’s Real-time Activity Data1
The first takes the String argument which can be one of the following values: sync, async, or default. Mastering ElasticSearch
To achieve this you first have to turn the getPrice method into a getPriceAsync method and change its return value: Java 8 in Action Lambdas, Streams and Functional-style Programming
Async allows you to build parallel data-processing pipelines by chaining successive actions onto asynchronous computations. HBase in Action
The createTable() method, for example, wraps the createTableAsync() method, while adding a loop that waits for the table to be created on the remote servers before yielding control back to the caller. HBase:The Definitive Guide
There is also AsyncListener, which is used to manage async events such as completed, timed out, or an error. Java EE 7 Essentials
If you set replication to async, then it will return success to the client as soon as the request has been executed on the primary shard. Elasticsearch The Definitive Guide
The SettableFuture, AsyncFunction, and FutureFallback classes are useful utility classes we can use when working with Future instances and doing asynchronous transformation of objects. Getting Started with Google Guava
But you can also get the same performance gain for persistent messages by setting the useAsyncSend property on the message producer’s connection factory, as shown next. ActiveMQ in Action
However, the doInOrderAsync returns right away and does the work in a separate thread, so it is not possible to have the method rethrow the exception. Java SE 8 for the Really Impatient
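The getPriceAsync shape from the quote above, sketched with CompletableFuture: the caller gets a future immediately and blocks only when the value is finally needed. The product name and price are invented.

    import java.util.concurrent.CompletableFuture;

    public class AsyncSketch {
        static CompletableFuture<Double> getPriceAsync(String product) {
            return CompletableFuture.supplyAsync(() -> {
                sleepQuietly(500); // stand-in for a slow remote call
                return 101.99;     // hypothetical price for the product
            });
        }

        public static void main(String[] args) {
            CompletableFuture<Double> price = getPriceAsync("widget");
            System.out.println("request sent; doing other work...");
            System.out.println("price: " + price.join()); // block only here
        }

        private static void sleepQuietly(long millis) {
            try {
                Thread.sleep(millis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }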
We bring that method into our Account class using include, JRuby's mixin facility. Programming Concurrency on the JVM
Chapter 4 covers new issues arising in object-oriented design due to Scala’s mixin inheritance. Scala in Depth
Mixins The term mixin seems to have acquired numerous meanings over time, but the fundamental concept is that of mixing in capabilities from multiple classes in order to produce a resulting class that represents all the types of the mixins. Thinking in Java 4th Edition
Modular mixin composition—This feature of Scala has traits in common with both Java interfaces and abstract classes. Scala in Action
Objects are constructed by mixin composition, which takes the members of a class and adds the members of a number of traits to them. Programming in Scala 2nd Edition
We’ll then use Scala’s self-type annotation and mixin inheritance to specify wiring in a typesafe manner. Functional Programming Patterns in Scala and Clojure
Scala improves upon Java’s support for OOP with the addition of traits, a clean way of implementing classes using mixin composition. Programming Scala
Ruby has a singleton mixin that provides the singleton pattern in Ruby programs. Beginning Scala
For example, Comparable is a mixin interface that allows a class to declare that its instances are ordered with respect to other mutually comparable objects. Effective Java 2nd Edition
Remember that all GUI related classes are actually wrappers around Java’s JFC/Swing classes, thus, trait SuperMixin is used to redirect certain calls from the peer to the wrapper and back. Steps in Scala An Introduction to Object-Functional Programming
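Java has no Scala-style mixins, but interfaces with default methods give a rough approximation of the idea: a class acquires reusable behavior by listing capabilities rather than by inheriting a chain of classes. The names are invented.

    interface Swimmer {
        default String swim() { return "swimming"; }
    }

    interface Flyer {
        default String fly() { return "flying"; }
    }

    class Duck implements Swimmer, Flyer { } // "mixes in" both capabilities

    public class MixinSketch {
        public static void main(String[] args) {
            Duck duck = new Duck();
            System.out.println(duck.swim() + " and " + duck.fly());
        }
    }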
These and many other tricks are part of a popular weighting method called TF-IDF (term frequency–inverse document frequency) Mahout in Action
In this section, we'll look at the default Apache Lucene scoring mechanism: the TF/IDF (term frequency/ inverse document frequency) algorithm and how it affects the returned document. Mastering ElasticSearch
He has been a software engineer in the Intelligence Corps, Israel Defense Force (IDF), 2005-2008, working on a variety of military IT systems, and later a software engineer and team leader at a web-based startup named AnyClip, making the dream of "Find any moment from any film, instantly" come true. Cassandra High Performance Cookbook
You will get scores that are of lesser quality due to sub-optimal document frequency values, a component of the IDF part of the score. Solr 3 Enterprise Search Server
Introduction This chapter will present the implementation of a very well-known data processing algorithm, Term Frequency–Inverse Document Frequency (TF-IDF), using Storm's Trident API. Storm Real-Time Processing Cookbook
The Term frequency-inverse document frequency (TF-IDF) model solves this issue by utilizing the inverted document frequencies (IDF) to scale the term frequencies (TF) Hadoop MapReduce Cookbook
Calculate the sum of TF/IDF values for each word, for each category, Hadoop in Practice
The standard similarity algorithm used in Elasticsearch is known as TF/IDF, or Term Frequency/Inverse Document Frequency, which takes the following factors into account. Elasticsearch The Definitive Guide
At a high level, the important concepts are demonstrated by the high-level formula – namely, Term Frequency (tf), Inverse Document Frequency (idf), Term Boosts (t.getBoost), the Field Normalization (norm), the Coordination Factor (coord), and the Query Normalization (queryNorm) Solr in action
Very common terms have a low idf; very rare terms have a high idf. Lucene in Action 2nd Edition
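The quotes agree on the shape of the weighting, so it is worth writing out. In its simplest textbook form (the exact damping and normalization differ between Mahout, Lucene, and other implementations):

    w_{t,d} = \mathrm{tf}(t,d) \times \mathrm{idf}(t),
    \qquad
    \mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}

where N is the total number of documents and df(t) is the number of documents containing term t. Rare terms thus get a high idf and very common terms a low one, exactly as the Lucene in Action quote states.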
Instead of the default UDP multicast communication method we use between Gmonds, we have configured it to use unicast messages. HBase Administration Cookbook
Clusters may use a unicast address instead, which is more compatible with various types of network hardware, and has performance benefits, at the cost of additional per-host configuration. Monitoring with Ganglia
Hadoop supports network communication through both unicast (with normal IP addresses, which is the one we use here) and multicast, which uses multicast addresses, such as 239.2.11.71 Hadoop Operations and Cluster Management Cookbook
The bootstrap process When the ElasticSearch node starts, it uses multicast (or unicast, if configured) to find the other nodes in the same cluster (the key here is the cluster name defined in the configuration) and connect to them. Mastering ElasticSearch
Multicast Versus Unicast While the default communication method between monitoring daemons (gmond) is UDP multicast messages, you may encounter environments where multicast is either not possible or a limiting factor. HBase:The Definitive Guide
The term “unicast” refers to the fact that the remote object is located by making a call to a single IP address and port. Core Java Volume II Advanced Features 9th Edition
A unicast event is a rare kind of event for which there can be only a single registered listener object. Java in a Nutshell 5th Edition
The sockets in the previous chapters are unicast: they provide point-to-point communication. Java Network Programming 4th Edition
For this latter scenario, a unicast-based model of message queuing, organized into a hub-and-spoke model, is more efficient than multicast. Java Performance Tuning 2nd Edition
Instead of sending out unicast requests by cycling through a range of addresses where the service might be, the client can send out a single broadcast or multicast request. Fundamental Networking in Java
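To make the unicast/multicast contrast concrete, here is a minimal Java multicast receiver; the group address is the example one from the Hadoop quote above, and the port is arbitrary:

    import java.net.DatagramPacket;
    import java.net.InetAddress;
    import java.net.MulticastSocket;

    public class MulticastReceiver {
        public static void main(String[] args) throws Exception {
            InetAddress group = InetAddress.getByName("239.2.11.71");
            try (MulticastSocket socket = new MulticastSocket(4446)) {
                socket.joinGroup(group);   // one datagram reaches every member
                byte[] buf = new byte[1024];
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);    // blocks until a datagram arrives
                System.out.println(new String(packet.getData(), 0, packet.getLength()));
                socket.leaveGroup(group);
            }
        }
    }

A unicast receiver would differ only in using a plain DatagramSocket bound to a single address, which is exactly the point-to-point behavior the Java Network Programming quote describes.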
Commit: Use Paxos to achieve consensus for appending that entry to the log. Megastore - Providing Scalable, Highly Available Storage for Interactive Services
This allows different processes to access old versions while the new version is being inserted, and also means that updating the information is really just a case of appending the new data, not reading, updating, and writing back a new version. Getting Started with CouchDB
Log file rotation is a function that periodically rotates a log file to a different name, for example, by appending the date to the filename, so that the original logfile name can be used as an empty file. Hadoop Operations and Cluster Management Cookbook
Otherwise, we merely adjust the record structure without appending any value—which needs to be done if empty records are present. Dremel - Interactive Analysis of WebScale Datasets
The function getReverseTime() achieves this by appending a reverse timestamp for the given item to the already calculated Z-order curve, separated by an underscore token. Hadoop RealWorld Solutions Cookbook
Within the group, appending the postdate orders the twits chronologically. HBase in Action
Pig supports a repartition map-side join by appending USING 'replicated' to the end of the JOIN statement. Hadoop in Practice
Configuration properties apply to a context name and are specified by appending the property name to the context name (separated by a dot) Hadoop:The Definitive Guide 3rd Edition
Commit Logs You can set the value for how large the commit log is allowed to grow before it stops appending new writes to a file and creates a new one. Cassandra The Definitive Guide
Appending a task’s actions one at a time: task hello. Building and Testing with Gradle
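In the file-system sense, appending is an explicit open mode rather than a read-modify-write cycle, which is why the CouchDB quote above can treat updates as cheap. A minimal Java sketch, with a hypothetical app.log:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class AppendDemo {
        public static void main(String[] args) throws Exception {
            Path log = Paths.get("app.log");
            // CREATE + APPEND: add to the end, creating the file if needed,
            // without reading, updating, and writing back a new version.
            Files.write(log, "one more record\n".getBytes(StandardCharsets.UTF_8),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
    }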
The args parameters can be used to supply formatting arguments, as described in the next section. Core Java Volume I Fundamentals 9th Edition
The command-line arguments accepted in args are listed in Table 8-15 Pro Hadoop
While you’re at it, you’ll also fine-tune the JVM memory settings by calling the method jvmArgs to avoid potential OutOfMemoryErrors. Gradle in Action
The paths for the posts and comments data are provided via the command line, and the program retrieves them from the args array. MapReduce Design Patterns
For security reasons, be sure to change this magic cookie value to a new value (using the same new value on each node) in each node’s vm.args file. Scaling CouchDB
Args: peerid (ID of the peer used for verification; must match the one given for replication); tablename (name of the table to verify). HBase in Action
Some older documentation shows file utilities in the form of hadoop dfs -cmd <args> Hadoop in Action
You will see the following output: usage: sqoop COMMAND [ARGS] Hadoop Beginners Guide
The Consumer is executed using the maven-exec-plugin (http://mng.bz/bf7g) by passing it some system properties as arguments using the exec.args property. ActiveMQ in Action
The args parameter expects a List, so we use Groovy list literal syntax to wrap the poet variable. Building and Testing with Gradle
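The common thread above is that args is simply the String array handed to main. A minimal sketch of the MapReduce Design Patterns usage, where two input paths arrive on the command line (the path names are hypothetical):

    public class DriverArgs {
        public static void main(String[] args) {
            if (args.length != 2) {
                System.err.println("usage: DriverArgs <posts-path> <comments-path>");
                System.exit(1);
            }
            String postsPath = args[0];     // first command-line argument
            String commentsPath = args[1];  // second command-line argument
            System.out.println("posts=" + postsPath + ", comments=" + commentsPath);
        }
    }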
In step 4, we added the HBase jar to our classpath and then ran the javac command to compile the Java code. HBase Administration Cookbook
You should get a display such as this one: javac 1.7.0_02 Core Java Volume I Fundamentals 9th Edition
Look for the string '/bin/javac$', using the egrep program, which searches for simple regular expressions in its input stream: Pro Hadoop
The following code snippet shows the usage of the javac Ant task for compiling Java source code within your build script: Gradle in Action
Internally, the core Java plug-in does this with compile-time dependencies when executing the javac compiler (Example 1-21) Gradle Beyond The Basics
You can specify the character encoding of your source files with the -encoding flag, for example, javac -encoding Big5 Myfile.java. Core Java Volume II Advanced Features 9th Edition
First, check what's already available by opening up a terminal and typing the following: $ javac. Hadoop Beginners Guide
This code can be compiled and run using javac dns.java followed by java dns. Hadoop Operations
Alternately, the javac tool provides an option for specifying resources that need to be loaded for an application. Java 7 Recipes A Problem-Solution Approach
When I started writing Java code in 1996, there weren’t many options for compiling and building a project outside of javac and a custom build script. Building and Testing with Gradle
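Besides invoking javac from a shell or a build script, the JDK exposes the compiler programmatically, which is how build tools can compile without forking a process. A minimal sketch using the standard javax.tools API (Myfile.java is a placeholder; note that getSystemJavaCompiler returns null when running on a JRE without the compiler):

    import javax.tools.JavaCompiler;
    import javax.tools.ToolProvider;

    public class CompileDemo {
        public static void main(String[] args) {
            JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
            // Same arguments command-line javac accepts, including -encoding;
            // returns 0 on success, nonzero on compilation errors.
            int result = javac.run(null, null, null,
                                   "-encoding", "UTF-8", "Myfile.java");
            System.out.println(result == 0 ? "compiled" : "failed");
        }
    }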
All @param tags for one method must be kept together. Core Java Volume I Fundamentals 9th Edition
Multiple parameters can be passed to the module’s initialization function by including one or more param sections. Monitoring with Ganglia
Parameters can be specified when launching Pig, using the -param option, one for each parameter: Hadoop:The Definitive Guide 3rd Edition
As explained earlier, @WebInitParam is used to specify init params for a servlet: Java EE 7 Essentials
The third parameter can contain an ImageWriteParam object to set image writing details such as tiling and compression; use null for default values. Core Java Volume II Advanced Features 9th Edition
If you run this script using the pig command, you specify the parameters using the -param name=value argument. Hadoop in Action
Also, preconditions can be specified along with the affected parameters in their @param tags. Effective Java 2nd Edition
For example, the @param tag lets you specify the name and meaning of a single parameter for a method. Java in a Nutshell 5th Edition
The <param> clause can be used to pass one or more name/value pairs as parameters to the included resource. Java Performance
The @WebInitParam annotation accepts a name/value pair as an initialization parameter. Java 7 Recipes A Problem-Solution Approach
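Pulling the Javadoc conventions above together: @param tags kept together, one per parameter, with preconditions stated alongside, as Effective Java suggests. A small illustrative example:

    public class Messages {
        /**
         * Formats a message with positional arguments.
         *
         * @param pattern the format pattern; must not be null
         * @param args    the formatting arguments, one per placeholder
         * @return the formatted message
         */
        public static String format(String pattern, Object... args) {
            return String.format(pattern, args);
        }
    }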
In fact, the opposite is true: Subclasses have more functionality than their superclasses. Core Java Volume I Fundamentals 9th Edition
This callback interceptor method may be defined only on interceptor classes and/or superclasses of interceptor classes and cannot be defined on the target class. Java EE 7 Essentials
This method saves the class of the object, the signature of the class, and the values of any nonstatic, nontransient fields of the class and its superclasses. Core Java Volume II Advanced Features 9th Edition
This is a very powerful concept because interfaces, superclasses, and interfaces implemented by superclasses are included, meaning we can easily make our event handlers as coarse- or fine-grained as we want, simply by changing the type accepted by the event-handling method. Getting Started with Google Guava
If implementing the Cloneable interface is to have any effect on a class, the class and all of its superclasses must obey a fairly complex, unenforceable, and thinly documented protocol. Effective Java 2nd Edition
In discussing nonstatic member classes, we saw that a member class can access any members inherited from superclasses and any members defined by its containing classes. Java in a Nutshell 5th Edition
In this mirror image hierarchy, two abstract superclasses define the basic API for reading and writing characters. Java Network Programming 4th Edition
The simplest way to handle all this is to have a flag on which objects are conditionally passed to the monitor from the Object constructor and to have this flag in a simple class with no superclasses so that classloading does not impose extra calls to superclasses. Java Performance Tuning 2nd Edition
Loading a class requires loading all Java superclasses and all Java superinterfaces. Java Performance
For example, self-calls in "superclasses" are not automatically bound to the versions of methods that have been "overridden" in delegation-based "subclasses" Concurrent Programming in Java Design Principles and Pattern
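Several of these quotes rely on the superclass chain being inspectable at runtime. A minimal sketch that walks the chain with reflection, mirroring what class loading and serialization traverse:

    public class SuperclassWalk {
        public static void main(String[] args) {
            // Prints ArrayList, AbstractList, AbstractCollection, Object.
            for (Class<?> c = java.util.ArrayList.class;
                 c != null; c = c.getSuperclass()) {
                System.out.println(c.getName());
            }
        }
    }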
The standard name for a Kerberos realm is your domain name in uppercase. HBase Administration Cookbook
If you get a complaint about welcome (with a lowercase w), then you should reissue the java Welcome command with an uppercase W. Core Java Volume I Fundamentals 9th Edition
Often the host appears twice, with the name in uppercase and lowercase, or some other differing variations of the hostname. Monitoring with Ganglia
Bamboo also expects a short identifier in uppercase characters, as the project key, in the Project Key field. Gradle Effective Implementation Guide
This recipe shows how to use a map-only job to locate any key with a specific column and convert the value of the column to uppercase. Cassandra High Performance Cookbook
Prior experience has told you that the combination of words, and the fact that they’re uppercase, means that this email is most likely spam. Hadoop in Practice
The Pig Latin functions are case-insensitive, though commonly they are written in uppercase. HBase:The Definitive Guide
For example, there’s an UPPER function in PiggyBank that transforms a string to uppercase: Hadoop in Action
By convention, SQL statements use uppercase for SQL language keywords and we will generally follow this when using HiveQL within files, as shown later. Hadoop Beginners Guide
Throughout the book, we will follow the SQL convention of showing Hive keywords in uppercase (e.g., CREATE, TABLE, SELECT and FROM), even though case is ignored by Hive, following SQL conventions. Programming Hive
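One practical detail behind the uppercasing quotes: String.toUpperCase is locale-sensitive, so data pipelines that uppercase column values (as in the Cassandra recipe above) usually pin the locale. A tiny sketch:

    import java.util.Locale;

    public class UpperDemo {
        public static void main(String[] args) {
            // Locale.ROOT avoids surprises such as the Turkish dotless i
            // that the platform default locale can introduce.
            System.out.println("select".toUpperCase(Locale.ROOT));  // SELECT
        }
    }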
If your input is comprised of tailed logfiles, the array is most likely a UTF-8 encoded String containing a line of text. Apache Flume Distributed Log Collection for Hadoop
Hadoop logging file naming conventions Hadoop logfiles are kept under the directory $HADOOP_HOME/logs. Hadoop Operations and Cluster Management Cookbook
In environments with only a few systems, connection to the server over SSH and using command-line tools to examine logfiles is usually sufficient. Cassandra High Performance Cookbook
We can use the debug scripts to copy all the logfiles to a single location, e-mail them to a single e-mail account, or perform some analysis. Hadoop MapReduce Cookbook
A web server log is a good example of a set of records that is not normalized (for example, the client hostnames are specified in full each time, even though the same client may appear many times), and this is one reason that logfiles of all kinds are particularly well-suited to analysis with MapReduce. Hadoop:The Definitive Guide 3rd Edition
The logging framework used by HBase is creating the directory and logfiles dynamically. HBase:The Definitive Guide
Like regular logfiles, commit logs have a configurable rollover threshold, and once this file size threshold is reached, the log will roll over, carrying with it any extant dirty bit flags. Cassandra The Definitive Guide
The reason is that the local standalone mode prints information about each individual task execution to the screen, whereas in the other modes this information is written only to logfiles on the running hosts. Hadoop Beginners Guide
Most of the real work is done on the data devices, while the OS device usually only has to contend with logfiles in /var/log. Hadoop Operations
Suppose our Extract, Transform, and Load (ETL) process ingests and aggregates logfiles in our environment, converting each log message to a tab-delimited record and also decomposing the timestamp into separate year, month, and day fields, and a combined hms field for the remaining hour, minute, and second parts of the timestamp, for reasons that will become clear in a moment. Programming Hive
Carla Schroer and her teams of compatibility testers in Cupertino, California, and Novosibirsk, Russia (with special thanks to Leonid Arbouzov and Alexei Kaigorodov) painstakingly wrote compatibility tests for each testable assertion in the First Edition. The Java Virtual Machine Specification Java SE 7 Edition
Because you’re dealing with an actual class, any method is fully testable through unit tests. Gradle in Action
Chapter 10 focuses on testing Scala applications and how you can use patterns to make your code more testable. Scala in Action
Testable JavaScript, the image of a Doctor fish, and related trade dress are trademarks of O’Reilly Media, Inc. Testable JavaScript
The usage of the Tool interface also makes driver implementation more testable by allowing you to inject arbitrary configurations using Configured’s setConf() method. Professional Hadoop Solutions
It assumes that first is a valid word, which is an easily testable condition prior to the call. Data Structures and Algorithm Analysis in Java 3rd Edition
In particular, the schema-free nature of the graph data model, coupled with the testable nature of a graph database’s application programming interface (API) and query language, empower us to evolve an application in a controlled manner. Graph Databases
This makes the application more testable and conforms to accessibility standards. Selenium Testing Tools Cookbook
This change in thinking has led to lower defect rates, more modular code, and more testable code. Beginning Scala
The first technique we will demonstrate to make this class testable is to introduce a subclass that eliminates entity manager calls. Pro JPA2
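The Pro JPA2 quote describes a classic testability seam: subclass the class under test and stub out the collaborator call. A minimal sketch, with a hypothetical AccountService whose entity-manager lookup is overridden in the test subclass:

    public class AccountService {
        public double balanceWithBonus(long accountId) {
            return findBalance(accountId) * 1.05;
        }
        // Seam: the real version would query the entity manager.
        protected double findBalance(long accountId) {
            throw new UnsupportedOperationException("requires a database");
        }
    }

    // In the test sources: same logic under test, no entity manager needed.
    class TestableAccountService extends AccountService {
        @Override protected double findBalance(long accountId) { return 100.0; }
    }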
Moreover, there are many file formats that support text, such as PDF, Microsoft Word, Excel, and PowerPoint, simple text files, XML, HTML, XHTML, and so forth. Pro Java 7 NIO.2
The first TestSuite.xhtml defines the suite having HTML links to the tests. Jenkins Continuous Integration Cookbook
Similarly, NotFound is a case class representing HTTP status code 404; when calling the xhtml method, it implicitly gets converted to a Scalaz Response object with an HTTP header value for a content-type of "application/xhtml+xml." This is a good example of how you can use higher-order functions and combine functions to create nice APIs like the preceding example. Scala in Action
I am very grateful to Dmitry Kirsanov and Alina Kirsanova who turned my manuscript from XHTML into a beautiful book, allowing me to concentrate on the content instead of fussing with the format. Scala for the Impatient
By convention, web pages built with XHTML have a .xhtml extension. Java EE 7 Essentials
However, most pages authored by the W3C are written in XHTML (an HTML dialect that is proper XML) Core Java Volume II Advanced Features 9th Edition
Currently, no browser natively supports XForms, though plug-ins are available for Firefox, and libraries exist to allow serverside translation of XForms into XHTML. Cassandra The Definitive Guide
Also, we show how to produce XHTML content with Scala. Steps in Scala An Introduction to Object-Functional Programming
Tika uses the XHTML (Extensible Hypertext Markup Language) standard to model all documents, regardless of their original format. Lucene in Action 2nd Edition
Dmitry Kirsanov and Alina Kirsanova once again turned an XHTML manuscript into an attractive book with amazing speed and attention to detail. Java SE 8 for the Really Impatient
Through using a large set of metadata about artists, releases, and tracks courtesy of the MusicBrainz.org project, you will have a testing ground for Solr, and will learn how to import this data in various ways. Solr 3 Enterprise Search Server
Pressing the space bar when the focus is in the checkbox also toggles the checkmark. Core Java Volume I Fundamentals 9th Edition
Handling checkboxes Capturing checkbox submission is similar to that of capturing textbox submissions, which we encountered earlier in Chapter 2, Recording Your First Test. Performance Testing With JMeter 2.9
Make sure the checkbox next to the plugin name is checked and press the Install button to initiate the installation process. Gradle in Action
When configuring a Job, click on Discard Old Builds, and then the Advanced checkbox, define the Max # of builds to keep with the artifacts. Jenkins Continuous Integration Cookbook
If we want to use a locally installed instance of Gradle, we must uncheck the Install automatically checkbox. Gradle Effective Implementation Guide
To choose this option, just tick the Install automatically checkbox. Jenkins The Definitive Guide
If you want to present a set of choices to a user, and a radio button or checkbox set consumes too much space, you can use a combo box or a list. Core Java Volume II Advanced Features 9th Edition
Wait until the status, initially Starting, changes to Complete; then click its checkbox to see details of the job flow, as shown in the following screenshot: Hadoop Beginners Guide
This interface has a text field to enter the name to be searched for and a checkbox to determine whether the match should be exact or partial. Java Network Programming 4th Edition
For example, perhaps your documents have Country as a field, and your search interface presents a checkbox allowing the user to pick and choose which countries to include in the search. Lucene in Action 2nd Edition
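Tying the GUI-side quotes together, a minimal Swing sketch: a checkbox whose state toggles an exact-match flag, in the spirit of the Java Network Programming example above:

    import javax.swing.JCheckBox;
    import javax.swing.JFrame;

    public class CheckboxDemo {
        public static void main(String[] args) {
            JFrame frame = new JFrame("Search options");
            JCheckBox exact = new JCheckBox("Exact match");
            // Fires on every toggle, including the space bar when focused.
            exact.addItemListener(
                e -> System.out.println("exact = " + exact.isSelected()));
            frame.add(exact);
            frame.pack();
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        }
    }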
This will ensure that the mvn command is always available. Mahout in Action
Starting it up via mvn is as easy as shown in Listing 12.8 Netty in Action MEAP V08
The install command will run all the tests by default; we can ignore the tests to speed up the installation process with command sudo mvn -DskipTests install. Hadoop Operations and Cluster Management Cookbook
Goals bundle phases, for example, for the site goal there are four phases: pre-site, site, post-site, and site-deploy, all of which will be called in order by mvn site or directly by using the syntax mvn site:phase. Jenkins Continuous Integration Cookbook
For example, in the following pom.xml file, the project URL will point to the Jenkins build job that ran the mvn site build: Jenkins The Definitive Guide
For example, if you want to compile a project, you run Maven's mvn command with the compile goal as a parameter. Sonar Code Quality Testing Essentials
Change into the trunk folder and run the build: $ mvn compile. Hadoop RealWorld Solutions Cookbook
To compile all the source, run the following command: $ mvn clean install This command will compile and package the source so that it’s ready to be run. ActiveMQ in Action
Then run "mvn install" from within the Tika source directory you unpacked above. Lucene in Action 2nd Edition
This install process, equivalent to mvn install, is guaranteed to be compatible with Maven because it actually uses the Maven Ant Tasks produced by the Maven team and hosted at Apache. Building and Testing with Gradle
This approach allows Ceph to more effectively leverage the intelligence (CPU and memory) present on each OSD to achieve reliable, highly available object storage with linear scaling. Ceph A Scalable, High-Performance Distributed File System
As in conventional storage systems based on local or network-attached (SAN) disk drives or those embracing the proposed T10 OSD standard, devices passively respond to read and write commands, despite their potential to encapsulate significant intelligence. RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
By the end of this chapter, you’ll write tests with JUnit, TestNG, and Spock and execute them as part of the build lifecycle. Gradle in Action
JUnit (http://www.junit.org/) is a popular unit test framework for Java; TestNG (http://testng.org/doc/index.html) is another. Jenkins Continuous Integration Cookbook
Using TestNG for testing We have written a test with the JUnit test framework. Gradle Effective Implementation Guide
This format is also used by many other Java testing tools, such as TestNG, Spock and Easyb. Jenkins The Definitive Guide
Some basic familiarity with the TestNG testing API will help make sense of the unit tests. Hadoop RealWorld Solutions Cookbook
You have many options for unit testing in Scala, from established Java tools, such as JUnit and TestNG, to new tools written in Scala, such as ScalaTest, specs, and ScalaCheck. Programming in Scala 2nd Edition
JUnit and TestNG are both good starting points for unit testing Scala. Programming Scala Tackle Multi-Core Complexity on the Java Virtual Machine
You can drive your tests using the built-in Runner or use the provided integration with JUnit or TestNG. Programming Scala
There are test goals for JUnit, TestNG, and so on. Beginning Scala
This facilitates compiled tests being able to reference JUnit or TestNG annotations or base classes. Building and Testing with Gradle
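For concreteness, the smallest JUnit 4 test of the kind these quotes assume; the TestNG equivalent differs mainly in the annotation's package:

    import org.junit.Test;
    import static org.junit.Assert.assertEquals;

    public class StringUtilTest {
        @Test
        public void upperCasesAsciiInput() {
            assertEquals("HADOOP", "hadoop".toUpperCase());
        }
    }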
If you do want to write a method to change numeric parameters, you can use one of the holder types defined in the org.omg.CORBA package: IntHolder, BooleanHolder, and so on. Core Java Volume I Fundamentals 9th Edition
With its choice of two distributed systems technologies (RMI and CORBA), developers can create systems that span many computers. Java Network Programming and Distributed Computing
Java EE applications can use RMI-IIOP protocol support to access CORBA services that are compatible with the platform. Java EE 7 Specification
This mechanism can only be used to replace certain standard Java libraries (such as those for XML and CORBA support) with newer versions. Core Java Volume II Advanced Features 9th Edition
As you can see, CORBA has more overhead than RMI, which in turn has more overhead than the proprietary system. Java Performance Tuning 2nd Edition
Technologies such as those discussed in chapter 2 (COM, CORBA, DCE, and EJB) using RPC are considered to be tightly coupled. ActiveMQ in Action
Request objects are used in distributed object support systems such as those in java.rmi and org.omg.corba. Concurrent Programming in Java Design Principles and Pattern
Finally, entity beans were modeled as remote objects that used RMI and CORBA, introducing network overhead and restrictions that should never have been added to a persistent object to begin with. Pro JPA2
Before RMI, the only remoting options available to Java programmers were CORBA (which at the time required the purchase of a third-party object request broker [ORB]) and handwritten socket programming. Spring in Action 4th Edition
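The holder types from the Core Java quote merit two lines of code: Java passes everything by value, so an int parameter cannot be changed in place, but a holder's public value field can. A minimal sketch (org.omg.CORBA was removed from the JDK in Java 11, so this compiles only on older JDKs or with a standalone ORB):

    import org.omg.CORBA.IntHolder;

    public class HolderDemo {
        static void increment(IntHolder h) {
            h.value++;  // the caller observes this change
        }
        public static void main(String[] args) {
            IntHolder counter = new IntHolder(41);
            increment(counter);
            System.out.println(counter.value);  // 42
        }
    }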
However, great books such as Joshua Bloch’s Effective Java [Blo08] have advised us to promote immutability. Programming Concurrency on the JVM
We first examine the concepts behind functional programming such as side effects, immutability, declarative programming, and referential transparency and relate these to Java 8 Java 8 in Action Lambdas, Streams and Functional-style Programming
Final fields must be used correctly to provide a guarantee of immutability. JSR-133 Java Memory Model and Thread
However, changing modules explicitly disclaim immutability, so cached copies of those modules must be expired after some time. Gradle Beyond The Basics
Immutability One of the most surprising problems in the original Java Memory Model was that objects that were declared final were sometimes not in fact final (immutable) at all. Oracle JRockit The Definitive Guide
The functional approach may appear unnatural if you’re not familiar with it, but it enables immutability, which has many advantages. Effective Java 2nd Edition
Due to the immutability of files in HDFS, it’s uncommon to place a quota on them. Hadoop Operations
You can take advantage of immutability in your own classes. Java Network Programming 4th Edition
Finally, the immutability of SSTables enables us to split tablets quickly. Bigtable - A Distributed Storage System for Structured Data
However, selective use of immutability is a basic tool in concurrent OO programming. Concurrent Programming in Java Design Principles and Pattern
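Distilled into code, the advice in these quotes is: final class, final fields, no setters, and "mutators" that return new instances. A minimal sketch:

    public final class Point {
        private final int x;
        private final int y;

        public Point(int x, int y) { this.x = x; this.y = y; }

        public int x() { return x; }
        public int y() { return y; }

        // Returns a new instance instead of changing state, so
        // instances can be shared freely across threads.
        public Point translate(int dx, int dy) {
            return new Point(x + dx, y + dy);
        }
    }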
Further information about sFlow interoperability is provided in Chapter 8 Monitoring with Ganglia
However, the CouchDB team recommends sticking with JavaScript as it provides the highest level of support and interoperability, though other options are available. CouchDB The Definitive Guide
Cross-language interoperability is achieved using a standard binary on-the-wire representation of records, in which field values are laid out sequentially as they occur in the record. Dremel - Interactive Analysis of WebScale Datasets
BigTop — This is a formal process and framework for packaging and interoperability testing of Hadoop’s sub-projects and related components. Professional Hadoop Solutions
The Protocol Buffers format has been used heavily by Google for interoperability. Hadoop in Practice
The Apache Bigtop project (http://incubator.apache.org/bigtop/) runs interoperability tests on stacks of Hadoop components and provides Linux packages (RPMs and Debian packages) for easy installation. Hadoop:The Definitive Guide 3rd Edition
The long-term goal of HCatalog is to enable interoperability between tools such as Apache Hive and Pig so that they can share dataset metadata information. Hadoop Operations
You create data files with a wide choice of formats for convenience, compactness, or interoperability with different Apache Hadoop components. Cloudera Impala
But the listed constraints serve to eliminate large classes of safety and liveness problems while also satisfying common reusability and performance goals: unidirectional flow avoids deadlock, connectivity management avoids unwanted interleavings across different flows, transfer protocols avoid safety problems due to inadvertent sharing without the need for extensive dynamic synchronization, and interface conformance assures type safety while still permitting interoperability among components. Concurrent Programming in Java Design Principles and Pattern
Importing an Entire Ant Build File Up to this point, we’ve dealt with interoperability of Ant and Gradle on a fine-grained level, either calling existing Ant tasks, creating new Ant tasks on the fly, or manipulating the results of Ant task calls with Groovy. Building and Testing with Gradle
Java even has built-in support for advanced programming tasks, such as network programming, database connectivity, and multithreading. Core Java Volume I Fundamentals 9th Edition
This book's coverage of advanced topics such as input/output streaming and multithreading allows even the most experienced Java developers to sharpen their skills. Java Network Programming and Distributed Computing
This allows the application developer to focus on the business logic and not worry about lowlevel transaction and state management details, remoting, concurrency, multithreading, connection pooling, or other complex low-level APIs. Java EE 7 Essentials
On the server side, columns are immutable in order to prevent multithreading issues. Cassandra The Definitive Guide
Chapter 3, Threads, explores multithreading and synchronization, with a special emphasis on how they can be used for asynchronous I/O and network servers. Java Network Programming 4th Edition
In Java, the key to making an application responsive is multithreading. Java Performance Tuning 2nd Edition
A note on multithreading in JMS applications The JMS spec specifically defines concurrency for various objects in the JMS API and requires that only a few objects support concurrent access. ActiveMQ in Action
The introduction of the SPARC T-series processor brought chip multiprocessing and chip multithreading to Oracle’s offering of processors. Java Performance
Both simultaneous multithreading (SMT) and chip multiprocessor (CMP) architectures target thread-level parallelism and should improve the performance of many of our servers. Web search for a planet the google cluster architecture
With his experience in multithreading and Java Swing he has managed to create Java clients that can consume tens of thousands of market events per second. Java 7 Recipes A Problem-Solution Approach
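As a minimal counterpart to the multithreading quotes, the executor idiom that most of the cited books converge on, sketched in plain Java:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class PoolDemo {
        public static void main(String[] args) throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(4);
            for (int i = 0; i < 10; i++) {
                final int task = i;
                pool.submit(() -> System.out.println(
                    "task " + task + " on " + Thread.currentThread().getName()));
            }
            pool.shutdown();                        // accept no new tasks
            pool.awaitTermination(10, TimeUnit.SECONDS);
        }
    }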
Chapter 5, Monitoring and Diagnosis: In this chapter, we will describe how to monitor and diagnose HBase cluster with Ganglia, OpenTSDB, Nagios, and other tools. HBase Administration Cookbook
StumbleUpon uses OpenTSDB to monitor all of its infrastructure and software, including its HBase clusters. HBase in Action
Use Case: OpenTSDB The OpenTSDB‡ project provides a time series database used to store metrics about servers and services, gathered by external collection agents. HBase:The Definitive Guide
Since HBase 0.94, the Import utility can generate the HBase internal files (HFiles) for bulk load. HBase Administration Cookbook
This is the number of blocks of StoreFiles (HFiles) in the cache. Monitoring with Ganglia
The implementation of HFile in a region server is responsible for reading and writing HFiles to and from HDFS. Professional Hadoop Solutions
A column family can have multiple HFiles. HBase in Action
The importtsv tool supports loading data to HBase directly using the Put operations as well as by generating the HBase internal HFiles. Hadoop MapReduce Cookbook
The data is stored in store files, called HFiles, which are persistent and ordered immutable maps from keys to values. HBase:The Definitive Guide
HFiles are immutable HDFS files containing key/value pairs in sorted order. Apache Hadoop Goes Realtime at Facebook
As most HFiles are hosted on HDFS, it would be impossible anyway because HDFS files are immutable as well. Hadoop in Action
The Simple Build Tool (http://mng.bz/2f7Q) provides the cc target for continuous compilation. Scala in Depth
If you’re having difficulty understanding what’s going on with function definition or argument passing, the Python language tutorial offers a good introduction to what’s going on, and you can jump right to the particular section by visiting this shortened URL: http://mng.bz/KM5x. Redis in Action
Make sure you explore it in the Akka documentation( http://mng.bz/wc7D) Scala in Action
In the great book Java Concurrency in Practice (http://mng.bz/979c), Brian Goetz and coauthors give some advice to find the optimal size for a thread pool. Java 8 in Action Lambdas, Streams and Functional-style Programming
You can read more about the built-in PropertyEditors at http://mng.bz/7CO9 Spring in Practice
This versioning scheme is roughly outlined in the Apache Release Management guide: http://mng.bz/6uvM. HBase in Action
I created a sample log file in Apache Common Log Format (see http://mng.bz/L4S3) with some fake Class E IP addresses and some dummy resources and response codes. Hadoop in Practice
That means the channel adapter not only handles the communication via the particular transport and protocol being used but also must provide a Messaging Mapper (http://mng.bz/Fl0P) so that whatever data representation is used by the external system is converted to and from simple Spring Integration messages. Spring Integration in Action
The Consumer is executed using the maven-exec-plugin (http://mng.bz/bf7g) by passing it some system properties as arguments using the exec.args property. ActiveMQ in Action
For an up-to-date list of MongoDB production deployments, see http://mng.bz/z2CH. MongoDB in Action
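The thread-pool sizing advice cited above from Java Concurrency in Practice is usually summarized by one formula; quoted here from memory, so treat it as a rule of thumb rather than the book's exact wording:

    N_{\mathrm{threads}} = N_{\mathrm{cpu}} \times U_{\mathrm{cpu}} \times \left(1 + \frac{W}{C}\right)

where N_cpu is the number of cores, U_cpu the target CPU utilization (between 0 and 1), and W/C the ratio of wait time to compute time per task; heavily I/O-bound tasks therefore justify far more threads than cores.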
This is because the score depends on multiple factors, not only the boosts and query structure, but also on how many terms were matched, in which fields, and the type of matching that was used on query normalization, and so on. Mastering ElasticSearch
By default, a field in Lucene is indexed with its norm, a product of the document’s boost, the field’s boost, and the field’s length normalization factor. Professional Hadoop Solutions
As of today, HBase doesn’t provide the features that make normalization easy to do. HBase in Action
Normalization poses problems for MapReduce because it makes reading a record a nonlocal operation, and one of the central assumptions that MapReduce makes is that it is possible to perform (high-speed) streaming reads and writes. Hadoop:The Definitive Guide 3rd Edition
The usual database normalization rules do not apply directly to HBase, and therefore the number of tables is usually very low. HBase:The Definitive Guide
This process of tokenization and normalization is called analysis, which we discuss in the next section. Elasticsearch The Definitive Guide
Denormalization In relational database design, we are often taught the importance of normalization. Cassandra The Definitive Guide
These are useful for tasks such as normalization of data and doing complex computation involving groups of rows. Tenzing - A SQL Implementation On The MapReduce Framework
In such cases, it's common to have a precursor MapReduce job that applies a normalization or clean-up strategy to the data set. Hadoop Beginners Guide
Unique Keys and Normalization Relational databases typically use unique keys, indexes, and normalization to store data sets that fit into memory or mostly into memory. Programming Hive
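The Professional Hadoop Solutions quote spells the classic Lucene norm out as a product; written out, with the default length normalization of the classic TF/IDF similarity (given from memory, so verify against your Lucene version):

    \mathrm{norm}(t,d) = \mathrm{boost}(d) \times \mathrm{boost}(\mathrm{field})
                         \times \frac{1}{\sqrt{\mathrm{numTerms}(\mathrm{field})}}

so longer fields score lower for the same term match, which is the length normalization the Elasticsearch quotes fold into analysis and scoring.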
It is easy to take the effects of modern websites for granted, such as the functionality of pop-up lists during searches, customization, and the in-page experience (traditionally referred to as AJAX) of a dynamic website. Getting Started with CouchDB
This is a lot of configuration that the Gradle plugins provide out of the box without any additional customization from your side. Gradle in Action
A rule set file contains several rules and allows customization of the rules. Gradle Effective Implementation Guide
The most common use case for this is when a build needs to access a custom version of some other module, especially when that customization is itself under development and is not ready to publish to the outside world. Gradle Beyond The Basics
We focused on understanding how Hive manages data in internal and external tables, and also looked at advanced topics such as join customization and user-defined functions. Hadoop in Practice
Crucially, they also provide hooks to run scripts at the end of the process, which are invaluable for doing the final system tweaks and customization that are not covered by the standard installer. Hadoop:The Definitive Guide 3rd Edition
But when performing this customization, quite often the administrators don’t edit the corresponding HTML meta headers. Hadoop in Action
Do not expect to understand all of these now, but do spend a little time browsing to get a flavor for the type of customization allowed by Hadoop. Hadoop Beginners Guide
This allows for easy configuration and customization via the corresponding element of the ActiveMQ XML configuration file. ActiveMQ in Action
Let’s take a look at the different ways of defining a task, the two key aspects of task definitions, and the task API we can use to perform our own customization. Building and Testing with Gradle
Currently, it provides various row/columnar store file formats, such as CSVFile, RowFile, RCFile, and Trevni (still unstable) Introduction to Tajo
In this paper, we present a big data placement structure called RCFile (Record Columnar File) and its implementation in the Hadoop system. RCFile A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
Apart from the two formats mentioned above, Hive also provides an RCFileInputFormat which stores the data in a column oriented manner. Hive A Petabyte Scale Data Warehouse Using Hadoop
Hive provides another binary storage format called RCFile, short for Record Columnar File. Hadoop:The Definitive Guide 3rd Edition
Some types of storage formats store their data by fields instead of by records (for example, Hive’s RCFile) Programming Pig
Use any of the Hive data loading techniques, especially for tables using the Avro, SequenceFile, or RCFile formats. Cloudera Impala
You can replace TEXTFILE with one of the other built-in file formats supported by Hive, including SEQUENCEFILE and RCFILE, both of which optimize disk space usage and I/O bandwidth performance using binary encoding and optional compression. Programming Hive
Note A principal named "apress" is used in the examples in this section, but this principal will not be available on your machine. Pro Java 7 NIO.2
Apress and friends of ED books may be purchased in bulk for academic, corporate, or promotional use. Pro Hadoop
If you need to study the basics of database design, we suggest Clare Churcher’s book Beginning Database Design (Apress) High Performance MySQL 3rd Edition
For a detail description of RESTful web services, we recommend the book Ajax and REST Recipes, A Problem-Solution Approach (Apress, 2006) Pro Spring 3
For information on translations, please e-mail info@apress.com, or visit http://www.apress.com. Beginning CouchDB
Tip If you’re interested in Git, we strongly recommend Scott Chacon’s excellent book Pro Git – also published by Apress. Pro Puppet
Although every precaution has been taken in the preparation of this work, neither the author(s) nor Apress shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in this work. Beginning Scala
I would also like to express my gratitude towards all the people at Apress for all the work they have done to get this book out. The Definitive Guide to MongoDB
For your convenience Apress has placed some of the front matter material after the index. Java 7 Recipes A Problem-Solution Approach
For more information, reference our Special Bulk Sales–eBook Licensing web page at http://www.apress.com/info/bulksales. Pro JPA2
In Flume 0.9, the input thread also did the writing to the output (except for failover retries) Apache Flume Distributed Log Collection for Hadoop
This property configures the maximum number of reducer retries to fetch map outputs in case of fetch failure. Hadoop Operations and Cluster Management Cookbook
It turns out that most of these overwrites came from client retries due to errors or timeouts. GFS - The Google File System
The number of these retries can be defined by this method. Mastering ElasticSearch
There's more Initializing connections in this manner does not account for server fail-over or retries. Cassandra High Performance Cookbook
When the task fails, the tasktracker retries the task, skipping the records that caused the failure. Hadoop:The Definitive Guide 3rd Edition
For example, when dealing with errors, it favors retries or wait over fast failures. Apache Hadoop Goes Realtime at Facebook
Should a task fail, the JobTracker will automatically relaunch the task, possibly on a different node, up to a predefined limit of retries. Hadoop in Action
When a task fails, Hadoop retries it on the same block but attempts to work around the invalid records. Hadoop Beginners Guide
MapReduce is dead simple for developers: no socket programming, no threading or fancy synchronization logic, no management of retries, no special techniques to deal with enormous amounts of data. Hadoop Operations
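Most of the retry behavior described above reduces to the same loop: bounded attempts, backoff between them, and only the final failure treated as fatal. A generic sketch (the backoff constant is arbitrary):

    import java.util.concurrent.Callable;

    public class Retry {
        static <T> T withRetries(Callable<T> task, int maxRetries) throws Exception {
            Exception last = null;
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    return task.call();
                } catch (Exception e) {
                    last = e;                            // remember and retry
                    Thread.sleep(100L * (attempt + 1));  // crude linear backoff
                }
            }
            throw last;  // out of retries: surface the last failure
        }
    }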
In the Java space, you can choose from a wide range of open source and commercial solutions, such as Checkstyle, PMD, Cobertura, FindBugs, and Sonar. Gradle in Action
This recipe uses Cobertura (http://cobertura.sourceforge.net/) to find the code that is not covered by unit tests. Jenkins Continuous Integration Cookbook
We can define the location of the Clover or Cobertura coverage XML result file and much more. Gradle Effective Implementation Guide
To see how the plugins work, we are going to integrate code coverage metrics using the Cobertura plugin. Jenkins The Definitive Guide
Now that we have the aggregated lcov.info file, we need to convert it to the Cobertura XML format. Testable JavaScript
Cobertura and Clover Cobertura, based on the jcoverage Java library, is used to calculate the percentage of code accessed by tests and identify which parts of your source code lack test coverage. Sonar Code Quality Testing Essentials
Setting up a code-coverage tool like Clover, EMMA, or Cobertura can be a big help with managing this issue. Spring in Practice
Install the Maven Reporting plug-in, configuring the Cobertura code coverage report as well as the Surefire Testing report. Spring Roo in Action
As of Ganglia 3.4.0, the web interface is a separate distribution tarball maintained in a separate source code repository. Monitoring with Ganglia
If you prefer, a TAR file is available as well: http://github.com/couchapp/couchapp/tarball/master. CouchDB The Definitive Guide
The locations of installed files will be different from the tarball method, and we can check the file layout with the following command: Hadoop Operations and Cluster Management Cookbook
The following commands will download and unpack the tarball for CouchDB 0.10.0 Beginning CouchDB
Appendix A includes the installation instructions for both the CDH tarball and the Apache Hadoop tarball. Hadoop in Practice
Download and extract the tarball into a directory that we call HIVE_HOME. Hadoop in Action
Just click the link on the home page to download the latest release version as a gzipped tarball. Cassandra The Definitive Guide
What just happened? Setting up Avro is a little involved; it is a much newer project than the other Apache tools we'll be using, so it requires more than a single download of a tarball. Hadoop Beginners Guide
For tarball-based installs, this leaves quite a bit of flexibility but also an equal amount of ambiguity. Hadoop Operations
Click on the link to the 5.4.1 release and you’ll find both tarball and zip formats available (the tarball is for Linux and Unix; the zip is for Windows) ActiveMQ in Action
Using interfaces for the callbacks (see the next chapter as well) leads to code that runs faster and is a lot more maintainable. Core Java Volume I Fundamentals 9th Edition
The build tool of choice shouldn’t stand in the way of this effort; rather, it should provide you with a flexible and maintainable way to model your automation needs. Gradle in Action
Build logic is described with a powerful DSL and empowers developers to create reusable and maintainable build logic. Gradle Effective Implementation Guide
With MapReduce we’re not only battling with clean and maintainable code, but also with the performance of a job that will be distributed across hundreds of nodes to compute over terabytes and even petabytes of data. MapReduce Design Patterns
Dealing with Imperative Rule Code Our current rule definition relies on 18 lines of doLast code, which is probably too much imperative code for a well-factored, maintainable build. Gradle Beyond The Basics
I was at that point in late 2007 when I was facing the task of storing millions of documents in a system that needed to be fault-tolerant and scalable while still being maintainable by just me. HBase:The Definitive Guide
With careful design6 and a relatively moderate amount of data, such a system is flexible and maintainable. Hadoop in Action
It is about writing programs that are clear, correct, usable, robust, flexible, and maintainable. Effective Java 2nd Edition
How we build this complex network in a maintainable fashion is the focus of this chapter. Hadoop Beginners Guide
While you are always free to solve problems in a locally optimized way by writing Groovy code inside your build, the most maintainable builds will eventually refactor this code into organized plug-ins that expose new build language idioms to the build files that use them. Building and Testing with Gradle
Temporary tables created to hold the results of subqueries don’t have indexes, so this seems plausible. High Performance MySQL 3rd Edition
The language supports nested subqueries, inter and intra-record aggregation, top-k, joins, user-defined functions, etc; some of these features are exemplified in the experimental section. Dremel - Interactive Analysis of WebScale Datasets
If necessary, any node can send subqueries in parallel to other nodes and merge the responses to return the full response to the user. Mastering ElasticSearch
Traditional SQL features like from clause subqueries, various types of joins – inner, left outer, right outer and outer joins, cartesian products, group bys and aggregations, union all, create table as select and many useful functions on primitive and complex types make the language very SQL like. Hive A Petabyte Scale Data Warehouse Using Hadoop
Hive doesn’t support IN subqueries (at the time of this writing), but you can use a LEFT SEMI JOIN to do the same thing. Hadoop:The Definitive Guide 3rd Edition
Use the UNION operation only when the columns in both subqueries are compatible. SQL Antipatterns
Even if you have to use several complex subqueries or join statements, this is usually true. Cassandra The Definitive Guide
We use parentheses to form subqueries, enabling advanced construction of BooleanQuerys: Lucene in Action 2nd Edition
A broad selection of SQL features are directly supported, including subqueries, aggregate queries, update and delete statements, numerous SQL functions, and more. Pro JPA2
However, it still lacked support for transactions, subqueries, foreign keys, stored procedures, and views. Understanding MySQL Internals
Specifically, a per-cluster ResourceManager (RM) tracks resource usage and node liveness, enforces allocation invariants, and arbitrates contention among tenants. Apache Hadoop YARN Yet Another Resource Negotiator
When creating a representative dataset, we try to reproduce any domain invariants we have identified: the minimum, maximum, and average number of relationships per node, the spread of different relationship types, property value ranges, and so on. Graph Databases
It is now possible to account for cases where the masters are accessed directly and guarantee important system invariants. Dapper a Large-Scale Distributed Systems Tracing Infrastructure
This data uses the sequence file format, first because it’s processed sequentially, and second because we couldn’t satisfy the map file invariants of sorted keys. Hadoop:The Definitive Guide 3rd Edition
Several important prerequisites for code optimization, such as identifying loop invariants and loop bodies, require expensive analysis in a control flow graph. Oracle JRockit The Definitive Guide
Like a constructor, a builder can impose invariants on its parameters. Effective Java 2nd Edition
Both scenarios break some of the invariants assumed by the security model and can inadvertently expose data to malicious applications. Hadoop Operations
You should use this pattern for debug and trace statements and assertion preconditions, postconditions, and invariants. Java Performance Tuning 2nd Edition
One path is first to establish conceptual-level invariants, for example the rule that water tank volumes must always be between zero and their capacities. Concurrent Programming in Java Design Principles and Pattern
The indexing system starts by crawling every page on the web and processing them while maintaining a set of invariants on the index. Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
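Effective Java's point that "a builder can impose invariants on its parameters" looks like this in miniature; the water-tank rule from the Concurrent Programming quote makes a convenient invariant:

    public final class WaterTank {
        private final double capacity;
        private final double volume;

        private WaterTank(Builder b) {
            this.capacity = b.capacity;
            this.volume = b.volume;
        }

        public static final class Builder {
            private double capacity;
            private double volume;

            public Builder capacity(double c) { this.capacity = c; return this; }
            public Builder volume(double v)   { this.volume = v;   return this; }

            public WaterTank build() {
                // Invariant checked once, centrally: 0 <= volume <= capacity.
                if (capacity < 0 || volume < 0 || volume > capacity)
                    throw new IllegalStateException(
                        "volume must be between zero and capacity");
                return new WaterTank(this);
            }
        }
    }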
Pointcuts: A pointcut is a collection of joinpoints that you use to define when advice should be executed. Pro Spring 3
Consult the Spring Security reference documentation for information on using security pointcuts and AOP to secure Java methods. Spring in Practice
Note that the new pointcut we added uses some more advanced AspectJ syntax, illustrating Boolean logic and the other supported pointcuts, args, that can be used to specify the type declaration of arguments. Spring Security 3.1
The AspectJ expression language for pointcuts offers a lot of flexibility to determine which methods to match against. Pro Spring Security
An aspect encapsulates pointcuts and advices, much the way a class encapsulates member fields and methods. Programming Scala
Aspects are often described in terms of advice, pointcuts, and join points. Spring in Action 4th Edition
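In AspectJ's annotation style (plain Java, as used by Spring AOP), the pointcut/advice split described above looks roughly like this; the package in the expression is hypothetical:

    import org.aspectj.lang.annotation.Aspect;
    import org.aspectj.lang.annotation.Before;
    import org.aspectj.lang.annotation.Pointcut;

    @Aspect
    public class LoggingAspect {
        // Pointcut: the collection of join points we care about,
        // here every public method in a hypothetical service package.
        @Pointcut("execution(public * com.example.service..*(..))")
        public void serviceMethods() {}

        // Advice: what runs when a matched join point executes.
        @Before("serviceMethods()")
        public void logEntry() {
            System.out.println("entering a service method");
        }
    }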
Upon instantiation, the class expects you to provide the account credentials as well as the API’s URL, format, and version. Gradle in Action
Using an in-mapper combiner is more efficient than using actual combiners — there is no additional overhead of reading and instantiating key/value pairs. Professional Hadoop Solutions
It is reset and reused every time the vertex needs to send a message, and helps to avoid unnecessary instantiation. Hadoop RealWorld Solutions Cookbook
It is instantiated in parallel on many machines, with each instantiation processing one file or perhaps GFS chunk. Interpreting the Data Parallel Analysis with Sawzall
You can set the block size during table instantiation like this: HBase in Action
Instantiation of Generic Types It is illegal to create an instance of a generic type. Data Structures and Algorithm Analysis in Java 3rd Edition
You can create an instance of FilterList while providing various parameters at instantiation time, using one of these constructors: HBase:The Definitive Guide
A Supplier interface also gives you the flexibility to use lazy instantiation by not constructing an instance until the get method is called. Getting Started with Google Guava
Its default instantiation is based on the resource configuration for your HDFS system. Hadoop in Action
This approach is functionally equivalent to the public field approach, except that it is more concise, provides the serialization machinery for free, and provides an ironclad guarantee against multiple instantiation, even in the face of sophisticated serialization or reflection attacks. Effective Java 2nd Edition
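The "ironclad guarantee against multiple instantiation" in the Effective Java quote is the single-element enum; the whole pattern fits in a few lines:

    public enum Elvis {
        INSTANCE;  // the JVM guarantees exactly one instance, even against
                   // sophisticated serialization or reflection attacks

        public void leaveTheBuilding() {
            System.out.println("Elvis has left the building.");
        }
    }

    // Usage: Elvis.INSTANCE.leaveTheBuilding();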
You will need to include the flume-ng-sdk JAR in the classpath of your Java application at runtime to use Flume's Log4J Appender. Apache Flume Distributed Log Collection for Hadoop
With the master plan in place, you’ll get started by writing some tasks to interact with the CloudBees client SDK. Gradle in Action
We can also use the options property to define a lot of properties we know from the Java SDK javadoc tool. Gradle Effective Implementation Guide
The examples presented here use the Java SDK to show how to implement S3 data manipulation. Professional Hadoop Solutions
OSNaming JNDI service provider is a part of the Java SE 7 SDK and JRE. Java EE 7 Specification
Flume: highly reliable, configurable streaming data collection; ZooKeeper: coordination service for distributed applications; Hue: user interface framework and SDK for visual Hadoop applications; Whirr: library for running Hadoop, and HBase, in the cloud. In regard to HBase, CDH solves the issue of running a truly reliable cluster setup, as it has all the required HDFS patches to enable durability. HBase:The Definitive Guide
In the Java SDK documentation, you may see it referred to as Java 1.7 Java 7 New Features Cookbook
Details on downloading and using the SDK are available in the Amazon EMR documentation. Programming Hive
As of this writing, a similar class is scheduled to be supported in an upcoming SDK release. Concurrent Programming in Java Design Principles and Pattern
Once you have launched the JavaFX SDK setup executable you will see the start of the wizard in Figure 15-2 Java 7 Recipes A Problem-Solution Approach
The libvirt project has created a common set of tools for managing virtualization resources on different virtualization platforms, currently including: Xen, QEMU, KVM, LXC, OpenVZ, User Mode Linux, VirtualBox, and VMware ESX and GSX. Monitoring with Ganglia
The nova-compute process is primarily a worker daemon that creates and terminates virtual machine instances via hypervisor's APIs (XenAPI for XenServer/XCP, libvirt for KVM or QEMU, VMwareAPI for VMware, etc.) bk-compute-adminguide-grizzly
Real hardware, VMWare virtualization, or KVM virtualization will generally allow us to create a fork of a Redis process at roughly 10–20ms per gigabyte of memory. Redis in Action
An example of this would be nova-compute accepting a message from the queue to create a new instance and then using the libvirt library to start a new KVM instance. Deploying OpenStack
The process by which it does so is fairly complex but the basics are simple: Accept actions from the queue and perform a series of system commands, like launching a KVM instance, to carry them out while updating state in the database. bk-admin-guide-cloud-latest
For performance or scalability testing, you would use real hardware in a cluster configuration.) You run the VM in VMWare, KVM, or VirtualBox, start the Impala service through the Cloudera Manager web interface, and then interact with Impala through the impala-shell interpreter or the ODBC and JDBC interfaces. Cloudera Impala
This deployment ran central services on a set of servers running KVM. OpenStack Operations Guide
Something to look at is the number of dnsmasq processes that are running. bk-compute-adminguide-grizzly
Tracks IP address leases and records them in the database by using the dnsmasq dhcp-script facility. bk-admin-guide-cloud-latest
Debugging DHCP Issues One common networking problem is that an instance boots successfully but is not reachable because it failed to obtain an IP address from dnsmasq, which is the DHCP server that is launched by the nova-network service. OpenStack Operations Guide
This script comes with OpenTSDB; it creates two tables—tsdb and tsdb-uid. HBase Administration Cookbook
The serialVersionUID field is used to determine whether a compatible class definition was found during deserialization that matches that of a serialized object. Java Network Programming and Distributed Computing
A Solaris subject that contains a principal with a UID of 0, for example, implies a Solaris principal with any other UID. Java Security 2nd Edition
The tsdb-uid table maintains an index of globally unique values for use as metric tags. HBase in Action
By default, the _uid field is stored (can be retrieved) and indexed (searchable) Elasticsearch The Definitive Guide
(Figure: the graphical version of the serialver program.) All later versions of the class must define the serialVersionUID constant to the same fingerprint as the original. Core Java Volume II Advanced Features 9th Edition
If you do not specify this number explicitly by declaring a static final long field named serialVersionUID, the system automatically generates it at runtime by applying a complex procedure to the class. Effective Java 2nd Edition
If the specified class declares a long serialVersionUID constant, the value of that field is displayed. Java in a Nutshell 5th Edition
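To make the serialVersionUID advice above concrete, a minimal sketch; the UserRecord class is hypothetical:

    import java.io.Serializable;

    // Declaring serialVersionUID explicitly, so the runtime does not have to
    // generate one from the class structure, and compatible later versions of
    // the class can keep deserializing previously serialized instances.
    public class UserRecord implements Serializable {
        private static final long serialVersionUID = 1L; // change only on incompatible edits

        private final String name;

        public UserRecord(String name) {
            this.name = name;
        }
    }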
The easiest way to accomplish this is to ensure that the uid of the user the namenode process runs as is the same on all namenodes. Hadoop Operations
The “UID” column reports the user id responsible for performing the disk access. Java Performance
Note that the revision ID (rev) is prefixed with an underscore. Getting Started with CouchDB
As we’ve seen, design documents are normal JSON documents, denoted by the fact that their DocID is prefixed with _design/ CouchDB The Definitive Guide
The properties we just mentioned should be prefixed with index.merge. Mastering ElasticSearch
Note Up until this point, all UNIX commands have been prefixed with the prompt symbol ($) Beginning CouchDB
Quantifier variables are declared like regular variables, but the base type (usually int) is prefixed by a keyword specifying the form of quantifier. Interpreting the Data Parallel Analysis with Sawzall
We can specify all the part-prefixed subdirectories as input to the dependent job using wildcards. Hadoop MapReduce Cookbook
Properties in the file are prefixed with whirr., but if they are passed as arguments on the command line, the prefix is dropped. Hadoop:The Definitive Guide 3rd Edition
A job ID is a string prefixed with job_, followed by the cluster ID (which is a timestamp of when the cluster was started), followed by an auto-incremented job number. Hadoop in Action
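Given the job_ naming scheme just described, a small sketch of pulling the cluster timestamp and job number back out of a job ID string (the ID value here is made up):

    // Splits a Hadoop job ID of the form job_<clusterTimestamp>_<jobNumber>.
    public class JobIdParser {
        public static void main(String[] args) {
            String jobId = "job_201301011234_0042"; // illustrative value only
            String[] parts = jobId.split("_");
            System.out.println("cluster started: " + parts[1]);
            System.out.println("job number: " + Integer.parseInt(parts[2]));
        }
    }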
Instead, use the Thrift API calls prefixed with system_ to configure your keyspaces and column families, as shown previously. Cassandra The Definitive Guide
Type the following commands; the output should look as shown next, where the commands are included on the lines prefixed by $: Hadoop Beginners Guide
Some languages give you portability, garbage collection, and the like, but they don’t have much of a library, forcing you to roll your own if you want fancy graphics or networking or database access. Core Java Volume I Fundamentals 9th Edition
Ganglia will not allow IPv6=>IPv4 mapping (for portability and security reasons) Monitoring with Ganglia
In addition to avoiding the Java API's portability limits and JVM commitment, the REST API lets you connect to different major ElasticSearch versions if you are aware of the differences in REST endpoints and the responses returned by ElasticSearch. Mastering ElasticSearch
Portability — Your application may not require all the extra features that an RDBMS provides (for example, typed columns, secondary indexes, transactions, advanced query languages, and so on) Professional Hadoop Solutions
You learned about the advantages that a 100 percent client-side application can offer in terms of flexibility and portability. Beginning CouchDB
The project was created by Doug Cutting (the creator of Hadoop) to address the major downside of Hadoop Writables: lack of language portability. Hadoop:The Definitive Guide 3rd Edition
Java encourages portability of entire applications at the expense of incremental adoption by making it somewhat irksome to link against other languages. Chubby - The lock service for loosely-coupled distributed systems
The JMS API minimizes the amount of enterprise messaging knowledge a Java programmer is required to possess in order to develop complex messaging applications, while still maintaining a certain amount of portability across JMS provider implementations. ActiveMQ in Action
For script portability, it’s typical to omit the authority, only specifying it when referring to another distributed filesystem instance (including S3 buckets) Programming Hive
This enhances portability, at the expense of needing to accommodate differences in the quality of implementation of these mappings. Concurrent Programming in Java Design Principles and Pattern
Usually, each mapper is assigned a single block of a file as input (unless you have used certain compression codecs) Apache Flume Distributed Log Collection for Hadoop
The framework supports the gzip, zip, sometimes LZO, and bzip2 compression codecs. Pro Hadoop
Using codecs: One of the most significant changes introduced by Apache Lucene 4.0 was the ability to alter how index files are written. Mastering ElasticSearch
With the addition of new cryptographic codecs, a new token-based authentication mechanism that supports more authentication protocols, a unified authorization system that supports Attribute Based Access Control (ABAC) and supports policy enforcement using open standards and XACML, and changes to HBase to allow cell-level authorization, Hadoop will be able to move from isolated cluster environments with perimeter-level security to very secure deployments that can meet the requirements of highly secure environments. Professional Hadoop Solutions
This means that once a file is compressed using any of the codecs that Hadoop provides, the file cannot be decompressed without the whole file being read. Hadoop RealWorld Solutions Cookbook
If you’ve set up your cluster with LZO compression, you’ll need to create a flume-env.sh file and set the directory that contains the native compression codecs: Hadoop in Practice
Table 4-2 lists the codecs that are available for Hadoop. Hadoop:The Definitive Guide 3rd Edition
Hadoop supports a number of compression codecs (see table 6.4) Hadoop in Action
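A compact illustration of selecting one of these codecs at runtime, following the StreamCompressor pattern from Hadoop:The Definitive Guide; the codec class is named on the command line, e.g. org.apache.hadoop.io.compress.GzipCodec:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.util.ReflectionUtils;

    // Compresses stdin to stdout with whichever codec class is named in args[0].
    public class StreamCompressor {
        public static void main(String[] args) throws Exception {
            Class<?> codecClass = Class.forName(args[0]);
            Configuration conf = new Configuration();
            CompressionCodec codec =
                (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            CompressionOutputStream out = codec.createOutputStream(System.out);
            IOUtils.copyBytes(System.in, out, 4096, false);
            out.finish(); // flush the codec's internal state without closing stdout
        }
    }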
This leaves administrators and developers to select more or less aggressive compression codecs based on the resource consumption and state of the cluster. Hadoop Operations
The code itself declares a single class called Encoder, which exposes a method that calls both the Metaphone and Base64 codecs on a string argument. Building and Testing with Gradle
If you are certain the spikes are not normal, you can remove them from the RRDtool database using the contributed script removespikes.pl, which is usually shipped in the Ganglia release tarball under contrib/ Monitoring with Ganglia
I once had a cluster that was using the Filesystem in Userspace (FUSE) contrib package to export HDFS as a read-only file system on a machine, which re-exported that file system via the Common Internet File System (CIFS) to a Windows server machine. Pro Hadoop
The standard Lucene distribution contains several Directory implementations, including filesystembased and memory-based, Berkeley DB-based (in the Lucene contrib module), and several others. Professional Hadoop Solutions
A tool to handle this is inside a contrib subproject. Cassandra High Performance Cookbook
They’re so common that Hadoop ships with a contrib JAR called hadoop-datajoin to make things easier. HBase in Action
The contrib package does all of the heavy lifting and only requires a handful of methods to be implemented. Hadoop in Practice
Hadoop’s Fuse-DFS contrib module allows any Hadoop filesystem (but typically HDFS) to be mounted as a standard filesystem. Hadoop:The Definitive Guide 3rd Edition
Hadoop has a contrib package called datajoin that works as a generic framework for data joining in Hadoop. Hadoop in Action
If you tune your JVM options and see some success using a load-testing tool or something like the Python stress test in contrib, don’t get too excited. Cassandra The Definitive Guide
Enabling alternative schedulers: Each of the alternative schedulers is provided as a JAR file in the capacityScheduler and fairScheduler directories within the contrib directory in the Hadoop installation. Hadoop Beginners Guide
The multiple Modifier and Reader tasks provide contention on the Accumulator object. Thinking in Java 4th Edition
This same accumulator will be used for each member of a given group, so any changes made to it can be persisted. MongoDB The Definitive Guide 2nd Edition
As you can see, you don’t need to implement how to filter using the attributes of a dish (for example, using an iterator and an accumulator) Java 8 in Action Lambdas, Streams and Functional-style Programming
We first created a UDF that extended the EvalFunc abstract class and implemented the Accumulator interface. Hadoop RealWorld Solutions Cookbook
If a function can’t be algebraic but can operate on subsets of the data, it’s recommended to implement the Accumulator interface, which can be seen in figure 11.6 Hadoop in Practice
Pig’s Algebraic or Accumulator interfaces for more efficient processing of the bag in chunks. Hadoop:The Definitive Guide 3rd Edition
Invokers do not use the Accumulator or Algebraic interfaces, and are thus likely to be much slower and to use much more memory than UDFs written specifically for Pig. Programming Pig
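A hedged sketch of the EvalFunc-plus-Accumulator pattern the quotes above describe: a hypothetical Pig UDF that counts the tuples in a bag, letting Pig feed the bag in chunks instead of materializing it all at once:

    import java.io.IOException;
    import org.apache.pig.Accumulator;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    // Counts tuples in a bag; accumulate() is called repeatedly with chunks.
    public class IncrementalCount extends EvalFunc<Long> implements Accumulator<Long> {
        private long count = 0;

        @Override
        public Long exec(Tuple input) throws IOException { // non-accumulating path
            DataBag bag = (DataBag) input.get(0);
            return bag.size();
        }

        @Override
        public void accumulate(Tuple input) throws IOException {
            DataBag bag = (DataBag) input.get(0);
            for (Tuple t : bag) {
                count++;
            }
        }

        @Override
        public Long getValue() { return count; }

        @Override
        public void cleanup() { count = 0; } // reset between groups
    }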
To accomplish aggregation in an external application, declare an accumulator before the loop that reads from the input stream and output the sum after the completion of the input: Programming Hive
The LongAccumulator generalizes this idea to an arbitrary accumulation operation. Java SE 8 for the Really Impatient
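For the java.util.concurrent.atomic.LongAccumulator just mentioned, a short sketch using max as the accumulation operation:

    import java.util.concurrent.atomic.LongAccumulator;

    // LongAccumulator generalizes LongAdder to any associative operation;
    // here the operation is max and the identity is Long.MIN_VALUE.
    public class MaxTracker {
        public static void main(String[] args) {
            LongAccumulator max = new LongAccumulator(Long::max, Long.MIN_VALUE);
            max.accumulate(42);
            max.accumulate(7);
            System.out.println(max.get()); // prints 42
        }
    }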
An in-memory table can be used to store temporary results when executing a complex set of queries; as a fast data accumulator that gets periodically flushed to disk; as a fast cache for a portion of the data from some large disk-based table; and in many other ways. Understanding MySQL Internals
Under the covers, Akka uses JBoss Netty and Google Protocol Buffers libraries to make the remoting happen seamlessly. Programming Concurrency on the JVM
He has provided consultations and training to corporations around the world on various open source technologies such as Mule ESB, Activiti BPM, JBoss jBPM and Drools, Liferay Portal, Alfresco ECM, JBoss SOA, and cloud computing. Web Crawling and Data Mining with Apache Nutch
If you can take an extra dependency, JBoss Marshalling25 is the way to go. Netty in Action MEAP V08
This plugin can deploy across a range of server types and version ranges including Tomcat, GlassFish, and JBoss. Jenkins Continuous Integration Cookbook
You can run it either as a stand-alone application, or deployed on a conventional Java application server such as Tomcat or JBoss. Jenkins The Definitive Guide
He also knew the next company he started would be open source and build on the model just proved successful by JBoss and MySQL. RabbitMQ in Action
MBeans can be managed remotely by a wide variety of management consoles such as the JConsole GUI that comes with Java and the webbased JMX Console that comes with the JBoss application server. Solr 3 Enterprise Search Server
In these applications, JMX is simply one way of managing the container; JBoss Application Server, on the other hand, uses JMX as the primary way of interacting with the container. Cassandra The Definitive Guide
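To show the MBean mechanism the last two quotes refer to, a self-contained sketch (all names hypothetical) that registers a standard MBean with the platform MBean server, where a console such as JConsole can find it:

    import java.lang.management.ManagementFactory;
    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Standard MBean pattern: the interface must be named <ClassName>MBean.
    public class JmxDemo {
        public interface CounterMBean { int getCount(); }

        public static class Counter implements CounterMBean {
            public int getCount() { return 42; } // illustrative fixed value
        }

        public static void main(String[] args) throws Exception {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(new Counter(), new ObjectName("demo:type=Counter"));
            Thread.sleep(60_000); // keep the JVM alive long enough to attach JConsole
        }
    }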
Jeff Bleiel, our development editor, for his work liaising with Manning—without Jeff’s input, the book would not be what it is today; Gary Tully for his tireless technical reviews of the entire book; the Manning staff for their arduous work on the book; Filip Hanik for his assistance with Tomcat; Jan Bartel and Greg Wilkins for their assistance with Jetty; David Jencks and Kevan Miller for their assistance with Geronimo; and Jaikiran Pai for his assistance with JBoss. ActiveMQ in Action
NetBeans Profiler supports profiling of many popular application servers and Web servers right out of the box such as GlassFish, Tomcat, Weblogic, and JBoss. Java Performance
The Spittr application has two essential domain concepts: spitters (the users of the application) and spittles (the brief status updates that users publish) Spring in Action 4th Edition
Cloud-config format: Cloud-init supports a YAML-based config format that allows the user to configure a large number of options on a system. bk-compute-adminguide-grizzly
Over time, some developers became grumpy about this: it requires a lot of explicit bean wiring, and XML compares unfavorably to terser formats like JSON, YAML, and perhaps even Java. Spring in Practice
The listing below provides a minimal storm.yaml file with entries that you must override: Storm Blueprints Patterns for Distributed Real-time Computation
By changing options in the cassandra.yaml configuration file, you can control the specific directories that Cassandra creates and uses. Cassandra High Performance Cookbook
YAML, whose name is a recursive acronym for YAML Ain't Markup Language, is also supported for the same purpose, and you can activate it by adding the format=yaml parameter to the HTTP request. Elasticsearch in Action MEAP V07 cHAPTER 1
It is often used as a configuration file format; for example, the database configuration file used in Ruby on Rails applications, database.yml, is a YAML file. Pro Puppet
Modify the configuration file to add your Storm cluster configuration (you can see all default configurations on defaults.yaml in the Storm repository) Getting Started with Storm
The output from explain can be difficult to read in JSON, but it is easier when it is formatted as YAML. Elasticsearch The Definitive Guide
For this example, we’ll define the schema in YAML and then load it, although you could also use client code to define it. Cassandra The Definitive Guide
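As a small illustration of consuming YAML from Java, a sketch using the SnakeYAML library (an assumption; none of the quoted books prescribes this particular parser, and the keys below are merely reminiscent of cassandra.yaml):

    import java.util.Map;
    import org.yaml.snakeyaml.Yaml;

    // Parses a YAML fragment into a Map; key names are illustrative only.
    public class YamlConfigDemo {
        public static void main(String[] args) {
            String doc = "cluster_name: Test Cluster\nnum_tokens: 256\n";
            Map<?, ?> config = (Map<?, ?>) new Yaml().load(doc);
            System.out.println(config.get("cluster_name")); // Test Cluster
        }
    }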
What just happened? We have just got the Selenium Grid to load our original Selenium Grid item by passing in the original YAML file. Selenium 2 Testing Tools Beginner's Guide
Managing compactions: An HBase table has the following physical storage structure: HBase Administration Cookbook
These tombstones, along with the dead values, are cleaned up on major compactions. Professional Hadoop Solutions
Lowering this setting to three days allows compactions to reclaim the disk space sooner. Cassandra High Performance Cookbook
Future versions of CouchDB may include more advanced compaction features such as queuing compactions and processing multiple compactions with one API call. Beginning CouchDB
Minor compactions are designed to be minimally detrimental to HBase performance, so there is an upper limit on the number of HFiles involved. HBase in Action
There are two types of compaction: minor compactions and major compactions. HBase:The Definitive Guide
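Since several quotes here discuss triggering and tuning compactions, a hedged sketch against the 0.9x-era HBase client API (matching the vintage of the books quoted) that requests a major compaction by hand; the table name is made up:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    // Asks the cluster to major-compact a table; the call is asynchronous,
    // so it returns before the compaction actually finishes.
    public class CompactTable {
        public static void main(String[] args) throws Exception {
            HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
            try {
                admin.majorCompact("mytable"); // hypothetical table name
            } finally {
                admin.close();
            }
        }
    }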
To address this, we made compactions interruptible to favor responsiveness over completion. Apache Hadoop Goes Realtime at Facebook
Cassandra can use a lot of memory, and garbage collection and major compactions can have a big impact on performance. Cassandra The Definitive Guide
Incoming read and write operations can continue while compactions occur. Bigtable - A Distributed Storage System for Structured Data
In this experiment, we configured the benchmark with increasing numbers of customers and measured both the achieved performance and the number of cores used by all parts of the system including cores used for background maintenance such as Bigtable compactions. Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
Number of works cited: 167
1、A Java ForkJoin Framework
2、A Little Riak Book
3、ActiveMQ in Action
4、Apache Flume Distributed Log Collection for Hadoop
5、Apache Hadoop Goes Realtime at Facebook
6、Apache Hadoop YARN Yet Another Resource Negotiator
7、Apache Kafka
8、Apache Solr 4 Cookbook
9、Apache Sqoop Cookbook
10、Beginning CouchDB
11、Beginning Scala
12、Bigtable - A Distributed Storage System for Structured Data
13、Building LinkedIn’s Real-time Activity Data Pipeline
14、Building and Testing with Gradle
15、Cassandra A Decentralized Structured Storage System
16、Cassandra High Performance Cookbook
17、Cassandra The Definitive Guide
18、Ceph A Scalable, High-Performance Distributed File System
19、Chubby - The lock service for loosely-coupled distributed systems
20、Cloudera Impala
21、Concurrent Programming in Java Design Principles and Pattern
22、Core Java Volume I Fundamentals 9th Edition
23、Core Java Volume II Advanced Features 9th Edition
24、CouchDB The Definitive Guide
25、Dapper a Large-Scale Distributed Systems Tracing Infrastructure
26、Data Structures and Algorithm Analysis in Java 3rd Edition
27、Deploying OpenStack
28、Dremel - Interactive Analysis of WebScale Datasets
29、Effective Java 2nd Edition
30、Elasticsearch The Definitive Guide
31、Elasticsearch in Action MEAP V07 cHAPTER 1
32、FlumeJava - Easy, Efficient Data-Parallel Pipelines
33、Functional Programming Patterns in Scala and Clojure
34、Functional Programming in Scala
35、Fundamental Networking in Java
36、GFS - The Google File System
37、Getting Started with CouchDB
38、Getting Started with Google Guava
39、Getting Started with Storm
40、Gradle Beyond The Basics
41、Gradle Effective Implementation Guide
42、Gradle in Action
43、Graph Databases
44、HBase Administration Cookbook
45、HBase in Action
46、HBase:The Definitive Guide
47、HDFS scalability the limits to growth
48、Hadoop Beginners Guide
49、Hadoop MapReduce Cookbook
50、Hadoop Operations
51、Hadoop Operations and Cluster Management Cookbook
52、Hadoop RealWorld Solutions Cookbook
53、Hadoop in Action
54、Hadoop in Practice
55、HadoopDB An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads
56、Hadoop:The Definitive Guide 3rd Edition
57、High Performance MySQL 3rd Edition
58、Hive A Petabyte Scale Data Warehouse Using Hadoop
59、Incoop MapReduce for Incremental Computations
60、Interpreting the Data Parallel Analysis with Sawzall
61、Introduction to Tajo
62、JSR-133 Java Memory Model and Thread
63、Java 7 Concurrency Cookbook
64、Java 7 New Features Cookbook
65、Java 7 Recipes A Problem-Solution Approach
66、Java 8 in Action Lambdas, Streams and Functional-style Programming
67、Java EE 7 Essentials
68、Java EE 7 Specification
69、Java Network Programming 4th Edition
70、Java Network Programming and Distributed Computing
71、Java Performance
72、Java Performance Tuning 2nd Edition
73、Java SE 8 for the Really Impatient
74、Java Security 2nd Edition
75、Java Threads 3rd Edition
76、Java in a Nutshell 5th Edition
77、Jenkins Continuous Integration Cookbook
78、Jenkins The Definitive Guide
79、Kafka a Distributed Messaging System for Log Processing
80、Lucene in Action 2nd Edition
81、Mahout in Action
82、Managing Infrastructure with Puppet
83、MapReduce - Simplied Data Processing on Large Clusters
84、MapReduce Design Patterns
85、Mastering ElasticSearch
86、Megastore - Providing Scalable, Highly Available Storage for Interactive Services
87、Mesos A Platform for Fine-Grained Resource Sharing in the Data Center
88、MongoDB The Definitive Guide 2nd Edition
89、MongoDB in Action
90、Monitoring with Ganglia
91、Namenode High Availability
92、Netty in Action MEAP V08
93、NoSQL Distilled
94、Omega - flexible, scalable schedulers for large compute clusters
95、OpenStack Operations Guide
96、Oracle JRockit The Definitive Guide
97、Percolator - Large-scale Incremental Processing Using Distributed Transactions and Notifications
98、Performance Testing With JMeter 2.9
99、PowerDrill - Processing a Trillion Cells per Mouse Click
100、Pregel - A System for Large-Scale Graph Processing
101、Pro Hadoop
102、Pro JPA2
103、Pro Java 7 NIO.2
104、Pro Puppet
105、Pro Spring 3
106、Pro Spring MVC
107、Pro Spring Security
108、Professional Hadoop Solutions
109、Programming Concurrency on the JVM
110、Programming Hive
111、Programming Pig
112、Programming Scala
113、Programming Scala Tackle Multi-Core Complexity on the Java Virtual Machine
114、Programming in Scala 2nd Edition
115、RADOS A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters.
116、RCFile A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
117、RabbitMQ in Action
118、Redis Cookbook
119、Redis in Action
120、Resilient Distributed Datasets - A Fault-Tolerant Abstraction for In-Memory Cluster Computing
121、SQL Antipatterns
122、Scala for the Impatient
123、Scala in Action
124、Scala in Depth
125、Scaling Big Data with Hadoop and Solr Chapter-02 Understanding Solr
126、Scaling CouchDB
127、Selenium 2 Testing Tools Beginner's Guide
128、Selenium Testing Tools Cookbook
129、Solr 3 Enterprise Search Server
130、Solr 3.1 Cookbook
131、Solr in action
132、Sonar Code Quality Testing Essentials
133、Spanner - Google’s Globally-Distributed Database
134、Spring Batch in Action
135、Spring Data Modern Data Access for Enterprise Java
136、Spring Dynamic Modules in Action
137、Spring Integration in Action
138、Spring Roo in Action
139、Spring Security 3.1
140、Spring in Action 4th Edition
141、Spring in Practice
142、Steps in Scala An Introduction to Object-Functional Programming
143、Storage Architecture and Challenges
144、Storm Blueprints Patterns for Distributed Real-time Computation
145、Storm Real-Time Processing Cookbook
146、Tenzing - A SQL Implementation On The MapReduce Framework
147、Testable JavaScript
148、The Definitive Guide to MongoDB
149、The Hadoop Distributed File System
150、The Java Language Specification Java SE 7 Edition
151、The Java Virtual Machine Specification Java SE 7 Edition
152、The Little Redis Book
153、The anatomy of a large-scale hypertextual Web search engine
154、The anatomy of a large-scale hypertextual Web search engine short
155、Thinking in Java 4th Edition
156、Tika in Action
157、Understanding MySQL Internals
158、Web Crawling and Data Mining with Apache Nutch
159、Web search for a planet the google cluster architecture
160、Zab High-performance broadcast for primary-backup systems
161、ZooKeeper Distributed Process Coordination
162、admin-user-guide-trunk
163、bk-admin-guide-cloud-latest
164、bk-compute-adminguide-grizzly
165、elasticsearch. The Road to a Distributed, (Near) Real Time, Search Engine
166、high-availability-guide
167、user-guide