Excerpted from Alex Holmes, *Hadoop in Practice*: brief descriptions of popular components such as Flume, Sqoop, Oozie, Hive, HBase, Avro, Thrift, Pig, R, and Mahout.
Flume is a log collection and distribution system that can collect log files from a large number of hosts and transport them into HDFS. It’s an Apache project in incubator status, originally developed and currently maintained and supported by Cloudera.
Oozie is an Apache project which started life inside Yahoo. It’s a Hadoop workflow engine that manages data processing activities.
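As a rough sketch of what an Oozie workflow looks like, the following hypothetical `workflow.xml` chains a single MapReduce action between start and end states (all names, paths, and parameters here are illustrative, not from the book):

```xml
<!-- Hypothetical Oozie workflow: one MapReduce action with success/failure transitions. -->
<workflow-app xmlns="uri:oozie:workflow:0.2" name="log-processing-wf">
  <start to="process-logs"/>
  <action name="process-logs">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Job failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Oozie resolves the `${...}` parameters from a job properties file at submission time, so the same workflow definition can be reused across environments.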
Sqoop is a tool for importing data from relational databases into Hadoop, and vice versa. It supports any JDBC-compliant database, and also has native connectors for efficient data transport to and from MySQL and PostgreSQL.
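A typical invocation looks something like the following hypothetical import of a `users` table from MySQL into HDFS (the connection string, credentials, table, and target directory are placeholders):

```shell
# Hypothetical Sqoop import; -P prompts for the database password.
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table users \
  --target-dir /data/users
```

Sqoop turns this into a MapReduce job that reads the table over JDBC in parallel and writes the rows into the target directory.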
HBase is a real-time key/value distributed column-based database modeled after Google’s BigTable.
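To give a feel for the key/value, column-family data model, here is a minimal session in the HBase shell (table, row, and column names are illustrative):

```shell
# Create a table with one column family, write a cell, and read it back.
hbase> create 'users', 'info'
hbase> put 'users', 'row1', 'info:name', 'alice'
hbase> get 'users', 'row1'
```

Each cell is addressed by row key, column family, and column qualifier, and reads and writes of individual rows happen in real time rather than via batch jobs.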
Avro is a data serialization system that provides features such as compression, schema evolution, and code generation. It can be viewed as a more sophisticated version of a SequenceFile.
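Avro schemas are written in JSON; the following hypothetical record schema (names are illustrative) shows the kind of change schema evolution tolerates — the nullable `message` field has a default, so readers with the new schema can still decode data written before the field existed:

```json
{
  "type": "record",
  "name": "LogRecord",
  "namespace": "com.example",
  "fields": [
    {"name": "host", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "message", "type": ["null", "string"], "default": null}
  ]
}
```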
A.7. Protocol Buffers
Protocol Buffers is Google’s data serialization and Remote Procedure Call (RPC) library, which is used extensively at Google. In this book we’ll use it in conjunction with Elephant Bird and Rhipe. Elephant Bird requires version 2.3.0 of Protocol Buffers (and won’t work with any other version), and Rhipe only works with Protocol Buffers version 2.4.0 and newer.
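Messages are defined in a `.proto` file and compiled with `protoc` into language-specific classes. A hypothetical definition (field names are illustrative) in the proto2 syntax, which is what the 2.3.x/2.4.x versions discussed here use:

```proto
// Hypothetical proto2 message; protoc generates the builder/access classes.
message LogRecord {
  required string host = 1;
  required int64 timestamp = 2;
  optional string message = 3;
}
```

The numeric tags identify fields on the wire, which is what lets old and new versions of a message coexist.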
A.8. Apache Thrift
Apache Thrift is essentially Facebook’s version of Protocol Buffers. It offers very similar data serialization and RPC capabilities. We’ll use it with Elephant Bird to support Thrift in MapReduce. Elephant Bird only works with Thrift version 0.5.
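Thrift's IDL plays the same role as a `.proto` file; the Thrift compiler generates the serialization code from a definition like this hypothetical one (names are illustrative):

```thrift
// Hypothetical Thrift struct, roughly equivalent to the Protocol Buffers example.
struct LogRecord {
  1: required string host,
  2: required i64 timestamp,
  3: optional string message
}
```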
A.9. Snappy
Snappy is a native compression codec developed by Google, which offers fast compression and decompression times. Unlike LZOP, Snappy-compressed files can’t be split. In the book code examples where we don’t need splittable compression, we’ll use Snappy because of its time efficiency. In this section we’ll cover how to build and set up your cluster to work with Snappy.
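Once the native libraries are in place, enabling Snappy amounts to registering the codec and (optionally) using it for intermediate map output. A sketch of the relevant properties, using the MR1-era names (exact names and values depend on your Hadoop version):

```xml
<!-- core-site.xml: register the Snappy codec alongside the default. -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- mapred-site.xml: compress intermediate map output with Snappy. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Map output is a good fit for Snappy precisely because it never needs to be split — each reducer reads its partition whole.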
A.10. LZOP
LZOP is a compression codec that can be used to support splittable compression in MapReduce. Chapter 5 has a section dedicated to working with LZOP. In this section we’ll cover how to build and set up your cluster to work with LZOP.
A.11. Elephant Bird
Elephant Bird is a project that provides utilities for working with LZOP-compressed data. It also provides a container format that supports working with Protocol Buffers and Thrift in MapReduce.
Hoop is an HTTP/S server which provides access to all the HDFS operations.
Apache Hive is a data warehouse project that provides a simplified and SQL-like abstraction on top of Hadoop.
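As a sketch of that abstraction, the following hypothetical HiveQL (table and column names are illustrative) defines a table over delimited files and runs an aggregation; Hive compiles the query into one or more MapReduce jobs:

```sql
-- Hypothetical HiveQL: a table over tab-delimited log files, plus a query.
CREATE TABLE logs (host STRING, ts BIGINT, message STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

SELECT host, COUNT(*) AS hits
FROM logs
GROUP BY host;
```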
Apache Pig is a MapReduce pipeline project that provides a simplified abstraction on top of Hadoop: you write data-flow scripts in a language called Pig Latin, which Pig compiles into MapReduce jobs.
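The following hypothetical Pig Latin script (paths and field names are illustrative) expresses the same host-count aggregation as a pipeline of relational operators:

```pig
-- Hypothetical Pig Latin pipeline; each operator becomes part of a MapReduce plan.
logs    = LOAD '/data/logs' AS (host:chararray, ts:long, message:chararray);
by_host = GROUP logs BY host;
counts  = FOREACH by_host GENERATE group AS host, COUNT(logs) AS hits;
STORE counts INTO '/data/host_counts';
```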
Crunch is a pure Java library that lets you write code that’s executed in MapReduce without having to use MapReduce-specific constructs.
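The flavor of that API is captured by the canonical Crunch word count, sketched below under the assumption that the Apache Crunch and Hadoop jars are on the classpath (package names follow the Apache Crunch releases). Note there is no `Mapper` or `Reducer` in sight — transformations are ordinary methods on collections:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // A pipeline backed by MapReduce; Crunch plans the actual jobs.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words; this DoFn runs inside map tasks.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() produces a table of word -> occurrence count.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```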
R is an open source tool for statistical programming and graphics.
RHIPE is a library that improves the integration between R and Hadoop.