Introduction
Hadoop is a cutting-edge tool, and everyone in the software industry wants to know about it. First we learn that the large volumes of data coming from Web 2.0 applications and social media are rich in valuable raw information, and then the quest for the best processing tools begins.
The NoSQL movement is closely tied to big data technologies, and its evolution has been remarkable. Hundreds of new persistence solutions and frameworks have been released, some offering high quality and some merely being well advertised. In short, all of them promise the same advantages: easy scalability, fast random access, and more intuitive data structures that take less effort to map programmatically.
The world's leading technology companies have helped drive the development of these technologies. One of the most influential programming models to emerge is MapReduce, and Hadoop has become one of its mainstream implementations.
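To fix ideas before the review proper: the canonical MapReduce example is word counting, where the map phase emits a count of one per word and the reduce phase sums the counts for each distinct word. A minimal sketch in Java (my own illustration, not taken from the book):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each distinct word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same two-phase shape underlies most of the recipes reviewed below.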
OK, we have learned the basics, and now we have production applications to implement and maintain. Naturally, we will use different data sources, including text files, relational database servers, and NoSQL clusters, and there is a large variety of useful tools out there to choose from. To begin with, we need to decide which tools to learn first, which are the most appropriate for our case, and how exactly to solve the problems that come up.
Hadoop Real-World Solutions Cookbook, by Jonathan R. Owens, Jon Lentz, and Brian Femiano, is a book that does what it promises: it offers recipes for real-world, working solutions that use Hadoop alone or in combination with supplementary open source tools. The recipes are organized into 10 chapters, and every chapter is divided into several sections. Each section follows the format of a “how to” article, with preparation steps, execution steps, and explanatory text. The code examples are extensive, and they are available for download along with the sample datasets used throughout the book (after registration on the Packt Publishing support site).
Chapters 1-2
We learn that there are command-line tools for importing our data files into the Hadoop Distributed File System (HDFS), and that if we have data in a relational database server, we can export and import it using an open source tool called Sqoop, which works over JDBC. The most advanced recipes cover real-time access to HDFS files from Greenplum as external tables, and importing data into HDFS from streaming sources using Flume. Next, we learn how to compress our data in HDFS and how to use different serialization formats (Avro, Thrift, and Protocol Buffers).
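As a taste of the kind of task these chapters cover, here is a minimal sketch using Hadoop's Java FileSystem API to copy a local file into HDFS (my own illustration rather than a recipe from the book; the NameNode address and the paths are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsImport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address: point this at your own NameNode.
    conf.set("fs.defaultFS", "hdfs://namenode:8020");

    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS, the programmatic equivalent of
    // `hadoop fs -put weblogs.txt /data/weblogs/`.
    fs.copyFromLocalFile(new Path("weblogs.txt"),
                         new Path("/data/weblogs/weblogs.txt"));

    fs.close();
  }
}
```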
Chapters 3-4
Valuable traffic statistics and analytics can be produced using MapReduce in the Hadoop processing environment. The recipes in these two chapters explain how Apache web server log files can be processed, mainly using Pig and Hive, in order to extract useful information such as user sessions, page-view counts, and geographic event data. Moreover, there are recipes that explain how log files can be mapped as external tables, and recipes for making effective use of other external data sources, such as news archives.
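The book's log-processing recipes use Pig and Hive; to make the idea concrete in plain Java, here is a sketch of my own: a mapper that parses the Apache common log format with a regular expression and emits one count per requested URL, so that a summing reducer like the one in the word-count sketch above yields page-view totals.

```java
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (requested URL, 1) for each line of an Apache access log in
// common log format, e.g.:
// 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
public class PageViewMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Capture group 1 is the path in the "METHOD path PROTOCOL" request.
  private static final Pattern LOG_PATTERN =
      Pattern.compile("^\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"\\S+ (\\S+) [^\"]*\" \\d+ \\S+");

  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Matcher m = LOG_PATTERN.matcher(value.toString());
    if (m.find()) {
      url.set(m.group(1));
      context.write(url, ONE);   // one page view for this URL
    }
    // Malformed lines are silently skipped in this sketch.
  }
}
```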
Chapter 5
A whole chapter is dedicated to the concept of joining datasets. There are recipes demonstrating the replicated join, the merge join, and the skewed join, mainly using Pig. Additionally, more advanced techniques are presented, such as full outer joins and performance improvements that use Hive and the Redis key-value store.
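The replicated join deserves a quick illustration: when one dataset is small enough to fit in memory, it is shipped to every mapper and the join happens on the map side, with no shuffle at all. A minimal Java sketch of my own (the book does this in Pig; the file name "users.txt" and the tab-separated layout are assumptions for illustration):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side (replicated) join: the small dataset (userId -> country)
// is distributed to every mapper via the distributed cache and loaded
// into memory; the large dataset streams through map() as usual.
public class ReplicatedJoinMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> userToCountry = new HashMap<>();

  @Override
  protected void setup(Context context) throws IOException {
    // "users.txt" was added with job.addCacheFile(...) and appears in
    // the task's working directory; format: userId<TAB>country.
    try (BufferedReader reader = new BufferedReader(new FileReader("users.txt"))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          userToCountry.put(parts[0], parts[1]);
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Large dataset record: userId<TAB>rest-of-record.
    String[] parts = value.toString().split("\t", 2);
    if (parts.length < 2) {
      return;                                    // skip malformed records
    }
    String country = userToCountry.get(parts[0]);
    if (country != null) {                       // inner-join semantics
      context.write(new Text(parts[0]), new Text(country + "\t" + parts[1]));
    }
  }
}
```

On the driver side, the small file would be registered with job.addCacheFile(...), and job.setNumReduceTasks(0) keeps the job map-only.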
Chapters 6-7
In these two chapters, big data analysis is the main concern of the recipes. Initially, simple MapReduce recipes using Pig and Hive are presented for processing large amounts of data and deriving non-trivial information: time-ordered aggregates, the distinct values of a variable in very large sets, similarity between data records, and outlier detection in time series. For tackling harder problems of this kind, the authors suggest graph processing technologies and machine learning algorithms, so Chapter 7 presents recipes using Apache Giraph and Mahout in collaboration with Hadoop.
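The distinct-values problem has a particularly tidy MapReduce shape, worth sketching (my own plain-Java illustration; the book approaches it through Pig): the mapper emits each observed value as a key with an empty payload, duplicates collapse in the shuffle, and the reducer sees each value exactly once.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Distinct {

  // Emit each value as a key; duplicates collapse during the shuffle.
  public static class DistinctMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      context.write(value, NullWritable.get());
    }
  }

  // Each distinct value reaches reduce() exactly once; just write it out.
  public static class DistinctReducer
      extends Reducer<Text, NullWritable, Text, NullWritable> {
    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
      context.write(key, NullWritable.get());
    }
  }
}
```

Setting DistinctReducer as the combiner as well shrinks the shuffle, since its input and output types match.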
Chapter 8
Chapter 8 is dedicated to debugging. Naturally, a lot of testing should be performed on every aspect of any distributed MapReduce solution. For this purpose, Hadoop offers the Counters mechanism, which exposes the internals of the map and reduce phases of every job in a practical, user-friendly format.
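A minimal sketch of my own showing the typical use of counters inside a mapper (the group and counter names are made up for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ValidatingMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    if (fields.length < 2) {
      // Counters are aggregated across all tasks and reported alongside
      // the job's built-in counters when it finishes.
      context.getCounter("DataQuality", "MALFORMED_RECORDS").increment(1);
      return;
    }
    context.getCounter("DataQuality", "VALID_RECORDS").increment(1);
    context.write(new Text(fields[0]), new Text(fields[1]));
  }
}
```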
Furthermore, the book presents MRUnit, a unit-testing framework that offers the familiar features of a testing framework but targets the map and reduce phases. Going one step further, the authors present a recipe for generating test data with a very powerful Pig operator called illustrate. Finally, a recipe addresses running MapReduce in local mode for development purposes, enabling local debuggers from within the IDE.
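An MRUnit test is short enough to sketch here. This example of mine assumes the WordCount classes from the introduction; MRUnit's drivers feed single records through the mapper or reducer in isolation and assert on the exact output:

```java
import java.util.Arrays;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountTest {

  private MapDriver<Object, Text, Text, IntWritable> mapDriver;
  private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;

  @Before
  public void setUp() {
    mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
    reduceDriver = ReduceDriver.newReduceDriver(new WordCount.IntSumReducer());
  }

  @Test
  public void mapperSplitsLineIntoWords() throws Exception {
    mapDriver.withInput(new LongWritable(0), new Text("hadoop hadoop"))
             .withOutput(new Text("hadoop"), new IntWritable(1))
             .withOutput(new Text("hadoop"), new IntWritable(1))
             .runTest();
  }

  @Test
  public void reducerSumsCounts() throws Exception {
    reduceDriver.withInput(new Text("hadoop"),
                           Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("hadoop"), new IntWritable(2))
                .runTest();
  }
}
```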
Chapter 9
Chapter 9 is dedicated to administrative tasks. These recipes explain how distributed mode is configured in Hadoop systems, how to add or remove nodes in a cluster, and how to monitor the health of the cluster; finally, some tuning tips are provided.
Chapter 10
In the final chapter, the authors suggest Apache Accumulo for the persistence layer. Inspired by Google's BigTable, Apache Accumulo has several distinctive features, such as iterators, combiners, scan authorizations, and constraints. In combination with MapReduce, example recipes demonstrate loading data into and reading data from Accumulo tables.
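For a flavor of the Accumulo client API, here is a minimal sketch of my own (the instance name, ZooKeeper address, credentials, and table name are placeholders, and the calls assume an Accumulo 1.x client): it writes one cell protected by a visibility label and scans it back with a matching authorization.

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class AccumuloExample {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details.
    Connector conn = new ZooKeeperInstance("instance", "zookeeper:2181")
        .getConnector("user", new PasswordToken("secret"));

    // Write one cell; the visibility label restricts who can read it.
    BatchWriter writer = conn.createBatchWriter("mytable", new BatchWriterConfig());
    Mutation m = new Mutation("row1");
    m.put("colfam", "colqual", new ColumnVisibility("public"),
          new Value("hello".getBytes()));
    writer.addMutation(m);
    writer.close();

    // Read it back: the scan returns the cell only because the "public"
    // authorization matches the cell's visibility label.
    Scanner scanner = conn.createScanner("mytable", new Authorizations("public"));
    for (Entry<Key, Value> entry : scanner) {
      System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
  }
}
```

This cell-level security, expressed through visibility labels and scan authorizations, is one of the features the chapter highlights.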
Conclusion
Overall, this is a recipe-based cookbook, and as such it contains task-driven sections and chapters. It is not a book to be read from beginning to end; it is better used as a reference. In other words, not all of the recipes are appropriate for every reader. A sufficiently experienced reader can execute the recipes (which sometimes involve downloading tool source code from GitHub) and use this cookbook to select specific tools and solutions. Finally, I would like to note that the recipes address the full range of IT professionals: developers, DevOps engineers, and architects, and I think the best way to use the book is as a shared asset of a development team, or as a guide for experienced developers planning “one man show” startups.