Appendix
Useful Reading
However hard the authors tried, a single book cannot cover the entire Hadoop ecosystem. This appendix provides additional reading recommendations that you might find useful. They are organized by the main topics covered in the book.
“Apache HBase Book.” http://hbase.apache.org/book.html.
“Bloom Filter.” http://en.wikipedia.org/wiki/Bloom_filter.
“BloomMapFile — Fail-Fast Version of MapFile for Sparsely Populated Key Space.” https://issues.apache.org/jira/browse/HADOOP-3063.
Borthakur, Dhruba. “Hadoop AvatarNode High Availability.” http://hadoopblog.blogspot.com/2010/02/hadoop-namenode-high-availability.html.
Chang, Fay; Dean, Jeffrey; Ghemawat, Sanjay; Hsieh, Wilson C.; Wallach, Deborah A.; Burrows, Mike; Chandra, Tushar; Fikes, Andrew; and Gruber, Robert E. “Bigtable: A Distributed Storage System for Structured Data.” http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf.
Chen, Yanpei; Ganapathi, Archana Sulochana; and Katz, Randy H. “To Compress or not to Compress — Compute vs. I/O Tradeoffs for MapReduce Energy Efficiency.” http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-36.pdf.
Dikant, Peter. “Storing Log Messages in Hadoop.” http://blog.mgm-tp.com/2010/04/hadoop-log-management-part2/.
Dimiduk, Nick, and Khurana, Amandeep. HBase in Action (Shelter Island, NY: Manning Publications, 2012). http://www.amazon.com/HBase-Action-Nick-Dimiduk/dp/1617290521/.
George, Lars. HBase: The Definitive Guide (Sebastopol, CA: O’Reilly Media, 2011). http://www.amazon.com/HBase-Definitive-Guide-Lars-George/dp/1449396100.
Ghemawat, Sanjay; Gobioff, Howard; and Leung, Shun-Tak. “The Google File System.” http://www.cs.brown.edu/courses/cs295-11/2006/gfs.pdf.
“HDFS Architecture Guide.” http://hadoop.apache.org/docs/stable/hdfs_design.html.
“HDFS High Availability with NFS.” http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html.
“HDFS High Availability Using the Quorum Journal Manager.” http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithQJM.html.
Radia, Sanjay. “HA Namenode for HDFS with Hadoop 1.0.” http://hortonworks.com/blog/ha-namenode-for-hdfs-with-hadoop-1-0-part-1/.
“Simple Example to Read and Write Files from Hadoop DFS.” http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample.
Srinivas, Suresh. “An Introduction to HDFS Federation.” http://hortonworks.com/blog/an-introduction-to-hdfs-federation/.
“The Hadoop Distributed File System.” http://developer.yahoo.com/hadoop/tutorial/module2.html.
White, Tom. Hadoop: The Definitive Guide (Sebastopol, CA: O’Reilly Media, 2012). http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520/.
White, Tom. “HDFS Reliability.” http://www.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf.
Zuanich, Jon. “Hadoop I/O: Sequence, Map, Set, Array, BloomMap Files.” http://www.cloudera.com/blog/2011/01/hadoop-io-sequence-map-set-array-bloommap-files/.
Adjiman, Philippe. “Hadoop Tutorial Series, Issue #4: To Use or not to Use a Combiner.” http://philippeadjiman.com/blog/2010/01/14/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner/.
“Apache Hadoop NextGen MapReduce (YARN).” http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html.
Blomo, Jim. “Exploring Hadoop OutputFormat.” http://www.infoq.com/articles/HadoopOutputFormat.
Brumitt, Barry. “MapReduce Design Patterns.” http://www.cs.washington.edu/education/courses/cse490h/11wi/CSE490H_files/mapr-design.pdf.
“C++ Word Count.” http://wiki.apache.org/hadoop/C%2B%2BWordCount.
Cohen, Jonathan. “Graph Twiddling in a MapReduce World.” http://www.adjoint-functors.net/su/web/354/references/graph-processing-w-mapreduce.pdf.
“Configuring Eclipse for Hadoop Development (a Screencast).” http://www.cloudera.com/blog/2009/04/configuring-eclipse-for-hadoop-development-a-screencast/.
Dean, Jeffrey, and Ghemawat, Sanjay. “MapReduce: Simplified Data Processing on Large Clusters.” http://www.usenix.org/event/osdi04/tech/full_papers/dean/dean.pdf.
Ghosh, Pranab. “Map Reduce Secondary Sort Does it All.” http://pkghosh.wordpress.com/2011/04/13/map-reduce-secondary-sort-does-it-all/.
Grigorik, Ilya. “Easy Map-Reduce with Hadoop Streaming.” http://www.igvita.com/2009/06/01/easy-map-reduce-with-hadoop-streaming/.
“Hadoop MapReduce Next Generation — Writing YARN Applications.” http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html.
“Hadoop Tutorial.” http://archive.cloudera.com/cdh/3/hadoop/mapred_tutorial.html#Partitioner.
“How to Include Third-Party Libraries in Your Map-Reduce Job.” http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/.
Katsov, Ilya. “MapReduce Patterns, Algorithms, and Use Cases.” http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/.
Lin, Jimmy, and Dyer, Chris. Data-Intensive Text Processing with MapReduce (San Francisco: Morgan & Claypool, 2010). http://www.amazon.com/Data-Intensive-Processing-MapReduce-Synthesis-Technologies/dp/1608453421.
Mamtani, Vinod. “Design Patterns in Map-Reduce.” http://nimbledais.com/?p=66.
MapReduce website. http://www.mapreduce.org/.
Mathew, Ashwin J. “Design Patterns in the Wild.” http://courses.ischool.berkeley.edu/i290-1/s08/presentations/Day6.pdf.
Murthy, Arun C. “Apache Hadoop: Best Practices and Anti-Patterns.” http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/.
Murthy, Arun C.; Douglas, Chris; Konar, Mahadev; O’Malley, Owen; Radia, Sanjay; Agarwal, Sharad; and Vinod K V. “Architecture of Next Generation Apache Hadoop MapReduce Framework.” https://issues.apache.org/jira/secure/attachment/12486023/MapReduce_NextGen_Architecture.pdf.
Noll, Michael G. “Writing an Hadoop MapReduce Program in Python.” http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/.
Owen, Sean; Anil, Robin; Dunning, Ted; and Friedman, Ellen. Mahout in Action (Shelter Island, NY: Manning Publications, 2011). http://www.amazon.com/Mahout-Action-Sean-Owen/dp/1935182684/ref=sr_1_1?s=books&ie=UTF8&qid=1327246973&sr=1-1.
Rehman, Shuja. “XML Processing in Hadoop.” http://xmlandhadoop.blogspot.com/.
Riccomini, Chris. “Tutorial: Sort Reducer Input Values in Hadoop.” http://riccomini.name/posts/hadoop/2009-11-13-sort-reducer-input-value-hadoop/.
Shewchuk, Richard. “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain.” http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf.
“Splunk App for HadoopOps.” http://www.splunk.com/web_assets/pdfs/secure/Splunk_for_HadoopOps.pdf.
Thiebaut, Dominique. “Hadoop Tutorial 2.2 — Running C++ Programs on Hadoop.” http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_2.2_--_Running_C%2B%2B_Programs_on_Hadoop.
“When to Use a Combiner.” http://lucene.472066.n3.nabble.com/When-to-use-a-combiner-td3685452.html.
Winkels, Maarten. “Thinking MapReduce with Hadoop.” http://blog.xebia.com/2009/07/02/thinking-mapreduce-with-hadoop/.
“Working with Hadoop under Eclipse.” http://wiki.apache.org/hadoop/EclipseEnvironment.
“Hadoop Streaming with Ruby and Wukong.” http://labs.paradigmatecnologico.com/2011/04/29/howto-hadoop-streaming-with-ruby-and-wukong/.
“Yahoo! Hadoop Tutorial.” http://developer.yahoo.com/hadoop/tutorial/.
Zaharia, Matei; Borthakur, Dhruba; Sarma, Joydeep Sen; Elmeleegy, Khaled; Shenker, Scott; and Stoica, Ion. “Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling.” http://www.cs.berkeley.edu/~matei/papers/2010/eurosys_delay_scheduling.pdf.
“Oozie Bundle Specification.” http://oozie.apache.org/docs/3.1.3-incubating/BundleFunctionalSpec.html.
“Oozie Client javadocs.” http://archive.cloudera.com/cdh/3/oozie/client/apidocs/index.html.
“Oozie Command Line Utility.” http://rvs.github.io/oozie/releases/1.6.0/DG_CommandLineTool.html.
“Oozie Coordinator Specification.” http://archive.cloudera.com/cdh/3/oozie/CoordinatorFunctionalSpec.html.
“Oozie Custom Action Nodes.” http://oozie.apache.org/docs/3.3.0/DG_CustomActionExecutor.html.
“Oozie Source Code.” https://github.com/apache/oozie.
“Oozie Specification, a Hadoop Workflow System.” http://oozie.apache.org/.
“Oozie Web Services APIs.” http://archive.cloudera.com/cdh4/cdh/4/oozie/WebServicesAPI.html.
“xjc Binding Compiler.” http://docs.oracle.com/javase/6/docs/technotes/tools/share/xjc.html.
“Actors Model.” http://c2.com/cgi/wiki?ActorsModel.
“Add Search to HBASE.” https://issues.apache.org/jira/browse/HBASE-3529.
“Apache Solr.” http://lucene.apache.org/solr/.
Bienvenido, David, III. “Twitter Storm: Open Source Real-Time Hadoop.” http://www.infoq.com/news/2011/09/twitter-storm-real-time-hadoop.
Borthakur, Dhruba; Muthukkaruppan, Kannan; Ranganathan, Karthik; Rash, Samuel; Sarma, Joydeep Sen; Spiegelberg, Nicolas; Molkov, Dmytro; Schmidt, Rodrigo; Gray, Jonathan; Kuang, Hairong; Menon, Aravind; and Aiyer, Amitanand. “Apache Hadoop Goes Realtime at Facebook.” http://borthakur.com/ftp/RealtimeHadoopSigmod2011.pdf.
“Cassandra.” http://cassandra.apache.org/.
Haller, Mike. “Spatial Search with Lucene.” http://www.mhaller.de/archives/156-Spatial-search-with-Lucene.html.
“HBase Avro Server.” http://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/avro/AvroServer.HBaseImpl.html.
“HBasene.” https://github.com/akkumar/hbasene.
“HBasePS.” https://github.com/sentric/HBasePS.
“HStreaming.” http://www.hstreaming.com/.
Ingersoll, Grant. “Location-Aware Search with Apache Lucene and Solr.” http://www.ibm.com/developerworks/opensource/library/j-spatial/.
Kumar, Animesh. “Apache Lucene and Cassandra.” http://anismiles.wordpress.com/2010/05/19/apache-lucene-and-cassandra/.
Kumar, Animesh. “Lucandra — An Inside Story!” http://anismiles.wordpress.com/2010/05/27/lucandra-an-inside-story/.
Lawson, Loraine. “Exploring Hadoop’s Real-Time Potential.” http://www.itbusinessedge.com/cm/blogs/lawson/exploring-hadoops-real-time-potential/?cs=49692.
“Local Lucene Geographical Search.” http://www.nsshutdown.com/projects/lucene/whitepaper/locallucene_v2.html.
“Lucandra.” https://github.com/tjake/Lucandra.
Marz, Nathan. “A Storm Is Coming: More Details and Plans for Release.” http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html.
Marz, Nathan. “Preview of Storm: The Hadoop of Realtime Processing.” https://www.memonic.com/user/pneff/folder/queue/id/1qSgf.
McCandless, Michael; Hatcher, Erik; and Gospodnetic, Otis. Lucene in Action, Second Edition (Shelter Island, NY: Manning Publications, 2010). http://www.amazon.com/Lucene-Action-Second-Covers-Apache/dp/1933988177/ref=sr_1_1?ie=UTF8&qid=1292717735&sr=8-1.
“OpenTSDB.” http://opentsdb.net/.
“Powered by Lucene.” http://wiki.apache.org/lucene-java/PoweredBy.
“Stargate.” http://wiki.apache.org/hadoop/Hbase/Stargate.
“Thrift APIs.” http://wiki.apache.org/hadoop/Hbase/ThriftApi.
“Amazon CloudWatch.” http://aws.amazon.com/cloudwatch/.
“Amazon Elastic MapReduce.” http://aws.amazon.com/elasticmapreduce/.
“Amazon Simple Storage Service.” http://aws.amazon.com/s3/.
“Amazon Simple Workflow Service.” http://aws.amazon.com/swf/.
“Apache Whirr.” http://whirr.apache.org/.
“AWS Data Pipeline.” http://aws.amazon.com/datapipeline/.
“How-to: Set Up an Apache Hadoop/Apache HBase Cluster on EC2.” http://blog.cloudera.com/blog/2012/10/set-up-a-hadoophbase-cluster-on-ec2-in-about-an-hour/.
Linton, Rob. Amazon Web Services: Migrating Your .NET Enterprise Application (Olton, Birmingham, United Kingdom: Packt Publishing, 2011). http://www.amazon.com/Amazon-Web-Services-Enterprise-Application/dp/1849681945.
“What Are the Advantages of Amazon EMR, Vs. Your Own EC2 Instances, Vs. Running Hadoop Locally?” http://www.quora.com/What-are-the-advantages-of-Amazon-EMR-vs-your-own-EC2-instances-vs-running-Hadoop-locally (Quora account required).
“Apache Hama.” http://hama.apache.org/.
Capriolo, Edward; Wampler, Dean; and Rutherglen, Jason. Programming Hive (Sebastopol, CA: O’Reilly Media, 2012). http://www.amazon.com/Programming-Hive-Edward-Capriolo/dp/1449319335/ref=sr_1_1?s=books&ie=UTF8&qid=1368408335&sr=1-1&keywords=hive.
“Cascading/CoPA.” https://github.com/Cascading/CoPA.
“Cascading Lingual.” http://www.cascading.org/lingual/.
“Cascading Pattern.” http://www.cascading.org/pattern/.
Cascading website. http://www.cascading.org/.
Cascalog website. https://github.com/nathanmarz/cascalog.
Crunch website. https://github.com/cloudera/crunch/tree/master/scrunch.
Czajkowski, Grzegorz. “Large-Scale Graph Computing at Google.” http://googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html.
“Domain Specific Language.” http://c2.com/cgi/wiki?DomainSpecificLanguage.
Gates, Alan. Programming Pig (Sebastopol, CA: O’Reilly Media, 2011). http://www.amazon.com/Programming-Pig-Alan-Gates/dp/1449302645/ref=sr_1_1?ie=UTF8&qid=1375109835&sr=8-1&keywords=Gates%2C+Alan.+Programming+Pig.
“Introduction to Apache Crunch.” http://crunch.apache.org/intro.html.
Fowler, Martin. Domain-Specific Languages (Boston: Addison-Wesley, 2010). http://www.amazon.com/Domain-Specific-Languages-Addison-Wesley-Signature-Fowler/dp/0321712943.
Scalding website. https://github.com/twitter/scalding.
“Welcome to Apache Giraph!” http://giraph.apache.org/.
“What Are the Differences between Crunch and Cascading?” http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading.
Wills, Josh. “Apache Crunch: A Java Library for Easier MapReduce Programming.” http://www.infoq.com/articles/ApacheCrunch.
“Accumulo User Manual — Security.” http://accumulo.apache.org/1.4/user_manual/Security.html.
“Apache Accumulo.” http://accumulo.apache.org/.
“Authentication for Hadoop Web-Based Consoles.” http://hadoop.apache.org/docs/stable/HttpAuthentication.html.
Becherer, Andrew. “Hadoop Security Design – Just Add Kerberos? Really?” https://media.blackhat.com/bh-us-10/whitepapers/Becherer/BlackHat-USA-2010-Becherer-Andrew-Hadoop-Security-wp.pdf.
Dwork, Cynthia. “Differential Privacy.” In 33rd International Colloquium on Automata, Languages, and Programming, Part II (ICALP 2006) (Springer Verlag, 2007). http://research.microsoft.com/apps/pubs/default.aspx?id=64346.
“Hadoop Service Level Authorization Guide.” http://hadoop.apache.org/docs/stable/service_level_auth.html.
“HDFS Permissions Guide.” http://hadoop.apache.org/docs/stable/hdfs_permissions_guide.html.
IETF. “Simple Authentication and Security Layer (SASL).” http://www.ietf.org/rfc/rfc2222.txt.
IETF. “The Kerberos Version 5 Generic Service Application Program Interface (GSS-API) Mechanism: Version 2.” http://tools.ietf.org/html/rfc4121.
IETF. “The Simple and Protected GSS-API Negotiation (SPNEGO) Mechanism.” http://tools.ietf.org/html/rfc4178.
“Kerberos: The Network Authentication Protocol.” http://web.mit.edu/kerberos/.
Narayanan, Arvind, and Shmatikov, Vitaly. “Robust De-Anonymization of Large Sparse Datasets.” http://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf.
O’Malley, Owen; Zhang, Kan; Radia, Sanjay; Marti, Ram; and Harrell, Christopher. “Hadoop Security Design.” October 2009. https://issues.apache.org/jira/secure/attachment/12428537/security-design.pdf.
“Project Rhino.” https://github.com/intel-hadoop/project-rhino/.
“Security Features for Hadoop.” JIRA HADOOP-4487. https://issues.apache.org/jira/browse/HADOOP-4487.
Williams, Alex. “Intel Releases Hadoop Distribution and Project Rhino — An Effort to Bring Better Security to Big Data.” http://techcrunch.com/2013/02/26/intel-launches-hadoop-distribution-and-project-rhino-an-effort-to-bring-better-security-to-big-data/.