Lucene in Action, Second Edition delivers details, best practices, caveats, tips, and tricks for using the best open-source search engine available.
This book assumes the reader is familiar with basic Java programming. Lucene’s core itself is a single Java Archive (JAR) file, less than 1MB and with no dependencies, and integrates into the simplest Java stand-alone console program as well as the most sophisticated enterprise application.
We organized part 1 of this book to cover the core Lucene Application Programming Interface (API) in the order you’re likely to encounter it as you integrate Lucene into your applications:
Part 2 goes beyond Lucene’s built-in facilities and shows you what can be done around and above Lucene:
Part 3 (chapters 12, 13, and 14) brings all the technical details of Lucene back into focus with case studies contributed by those who have built interesting, fast, and scalable applications with Lucene at their core.
Much has changed in Lucene in the 5 years since this book was originally published. As is often the case with a successful open-source project with a strong technical architecture, a robust community of users and developers has thrived over time, and from all that energy has emerged a number of amazing improvements. Here’s a sampling of the changes:
Entirely new case studies have been added in chapters 12, 13, and 14. A new chapter (11) has been added to cover the administrative aspects of Lucene. Chapter 7, which previously described a custom framework for parsing different document types, has been rewritten entirely based on Tika. In addition, all code samples have been updated to Lucene’s 3.0.1 APIs. And of course lots of great feedback from our readers has been folded in (thank you, and please keep it coming!).
Developers who need powerful search capabilities embedded in their applications should read this book. Lucene in Action, Second Edition is also suitable for developers who are curious about Lucene or indexing and search techniques, but who may not have an immediate need to use it. Adding Lucene know-how to your toolbox is valuable for future projects—search is a hot topic and will continue to be in the future.
This book primarily uses the Java version of Lucene (from Apache), and the majority of the code examples use the Java language. Readers familiar with Java will be right at home. Java expertise will be helpful; however, Lucene has been ported to a number of other languages including C++, C#, Python, and Perl. The concepts, techniques, and even the API itself are comparable between the Java and other language versions of Lucene.
The source code for this book is available from Manning’s website at http://www.manning.com/LuceneinActionSecondEdition or http://www.manning.com/hatcher3. Instructions for using this code are provided in the README file included with the source-code package.
The majority of the code shown in this book was written by us and is included in the source-code package, licensed under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0). Some code (particularly the case-study code, and the examples from Lucene’s ports to other programming languages) isn’t provided in our source-code package; the code snippets shown there are owned by the contributors and are donated as is. In a couple of cases, we have included a small snippet of code from Lucene’s codebase, which is also licensed under the Apache License, Version 2.0.
Code examples don’t include package and import statements, to conserve space; refer to the actual source code for these details. Likewise, in the name of brevity and keeping examples focused on Lucene’s code, there are numerous places where we simply declare throws Exception, while for production code you should declare and catch only specific exceptions and implement proper handling when exceptions occur. In some cases there are fragments of code, inlined in the text, that are not full standalone examples; these cases are included in source files named Fragments.java, under each subdirectory.
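To make the contrast concrete, here is a minimal sketch in plain Java (no Lucene classes involved) of the difference between a blanket throws Exception and the narrower handling suggested for production code. The class name ExceptionStyleSketch, the method names, and the choice to return -1 for a missing file are all assumptions made for this illustration; they do not come from the book's source-code package.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical class, for illustration only.
class ExceptionStyleSketch {

    // Listing style: brief, but the signature hides which failures can occur.
    static long sizeBrief(Path file) throws Exception {
        return Files.size(file);
    }

    // Production style: declare the specific exception type and handle
    // the one case we know how to recover from.
    // Returns -1 when the file does not exist; other I/O errors propagate.
    static long sizeCareful(Path file) throws IOException {
        try {
            return Files.size(file);
        } catch (NoSuchFileException e) {
            return -1L;
        }
    }

    public static void main(String[] args) throws IOException {
        Path missing = Paths.get("no-such-file-for-this-sketch.txt");
        System.out.println("size or -1: " + sizeCareful(missing));
    }
}
```

The second method costs a few more lines, which is the space trade-off the listings in this book avoid by declaring throws Exception.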
We believe code examples in books should be top-notch quality and real-world applicable. The typical “hello world” examples often insult our intelligence and generally do little to help readers see how to adapt the code to their own environment.
We’ve taken a unique approach to the code examples in Lucene in Action, Second Edition. Many of our examples are actual JUnit test cases (http://www.junit.org), version 4.1. JUnit, the de facto Java unit-testing framework, easily allows code to assert that a particular assumption works as expected in a repeatable fashion. It also cleanly separates what we are trying to accomplish, by showing the small test case up front, from how we accomplish it, by showing the source code behind the APIs invoked by the test case. Automating JUnit test cases through an IDE or Ant allows one-step (or no steps with continuous integration) confidence building. We chose to use JUnit in this book because we use it daily in our other projects and want you to see how we really code. Test Driven Development (TDD) is a development practice we strongly espouse.
If you’re unfamiliar with JUnit, please read the JUnit primer section. We also suggest that you read Pragmatic Unit Testing in Java with JUnit by Dave Thomas and Andy Hunt, followed by Manning’s JUnit in Action by Vincent Massol and Ted Husted, a second edition of which is in the works by Petar Tahchiev, Felipe Leme, Vincent Massol, and Gary Gregory.
Source code in listings or in text is in a fixed width font to separate it from ordinary text. Java method names, within text, generally won’t include the full method signature.
In order to accommodate the available page space, code has been formatted with a limited width, including line continuation markers where appropriate.
We don’t include import statements and rarely refer to fully qualified class names; they get in the way and take up valuable space. Refer to Lucene’s Javadocs for this information. All decent IDEs have excellent support for automatically adding import statements; Erik blissfully codes without knowing fully qualified class names using IntelliJ IDEA, while Otis and Mike both use XEmacs. Add the Lucene JAR to your project’s classpath, and you’re all set. Also on the classpath issue (which is a notorious nuisance), we assume that the Lucene JAR and any other necessary JARs are available in the classpath and don’t show it explicitly. The lib directory, included with the source code, contains the JARs that the source code uses. When you run the Ant targets, these JARs are placed on the classpath for you.
We’ve created a lot of examples for this book that are freely available to you. A .zip file of all the code is available from Manning’s web site for Lucene in Action: http://www.manning.com/LuceneinActionSecondEdition. Detailed instructions on running the sample code are provided in the main directory of the expanded archive as a README file.
Most of our book revolves around a common set of example data to provide consistency and avoid having to grok an entirely new set of data for each section. This example data consists of book details. Table 1 shows the data so that you can reference it and make sense of our examples.
Title / Author | Category | Subject |
---|---|---|
A Modern Art of Education Rudolf Steiner | /education/pedagogy | education philosophy psychology practice Waldorf |
Lipitor, Thief of Memory Duane Graveline, Kilmer S. McCully, Jay S. Cohen | /health | cholesterol, statin, lipitor |
Nudge: Improving Decisions About Health, Wealth, and Happiness Richard H. Thaler, Cass R. Sunstein | /health | information architecture, decisions, choices |
Imperial Secrets of Health and Longevity Bob Flaws | /health/alternative/Chinese | diet chinese medicine qi gong health herbs |
Tao Te Ching Stephen Mitchell | /philosophy/eastern | taoism |
Gödel, Escher, Bach: an Eternal Golden Braid Douglas Hofstadter | /technology/computers/ai | artificial intelligence number theory mathematics music |
Mindstorms: Children, Computers, And Powerful Ideas Seymour Papert | /technology/computers/programming/education | children computers powerful ideas LOGO education |
Ant in Action Steve Loughran, Erik Hatcher | /technology/computers/programming | apache ant build tool junit java development |
JUnit in Action, Second Edition Petar Tahchiev, Felipe Leme, Vincent Massol, Gary Gregory | /technology/computers/programming | junit unit testing mock objects |
Lucene in Action, Second Edition Michael McCandless, Erik Hatcher, Otis Gospodnetić | /technology/computers/programming | lucene search java |
Extreme Programming Explained Kent Beck | /technology/computers/programming/methodology | extreme programming agile test driven development methodology |
Tapestry in Action Howard Lewis-Ship | /technology/computers/programming | tapestry web user interface components |
The Pragmatic Programmer Dave Thomas, Andy Hunt | /technology/computers/programming | pragmatic agile methodology developer tools |
The data, besides the fields shown in the table, includes fields for ISBN, URL, and publication month. When you unzip the source code available for download at www.manning.com/hatcher3, the books are represented as *.properties files under the data sub-directory, and the command-line tool at src/lia/common/CreateTestIndex.java is used to create the test index used throughout the book. The fields for category and subject are our own subjective values, but the other information is objectively factual about the books.
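As a rough illustration of this data format, the sketch below parses a single book entry with the standard java.util.Properties class. The key names (title, author, category, subject) and the class name BookPropertiesSketch are assumptions made for this example; check the *.properties files under the data directory for the actual keys, and see CreateTestIndex.java for how the book's code really builds the index.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical class, for illustration only.
class BookPropertiesSketch {

    // Parses properties-format text into key/value pairs.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Splits a category path such as "/philosophy/eastern" into its levels.
    static String[] categoryLevels(String category) {
        String trimmed = category.startsWith("/") ? category.substring(1) : category;
        return trimmed.split("/");
    }

    public static void main(String[] args) throws IOException {
        // Assumed key names; the values are taken from Table 1.
        String entry =
            "title=Tao Te Ching\n" +
            "author=Stephen Mitchell\n" +
            "category=/philosophy/eastern\n" +
            "subject=taoism\n";

        Properties book = parse(entry);
        System.out.println(book.getProperty("title") + " by " + book.getProperty("author"));
        for (String level : categoryLevels(book.getProperty("category"))) {
            System.out.println("  category level: " + level);
        }
    }
}
```

Reading from a string keeps the sketch self-contained; to read one of the real data files, the StringReader could be replaced with a reader opened on that file.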
The purchase of Lucene in Action, Second Edition includes free access to a web forum run by Manning Publications, where you can discuss the book with the authors and other readers. To access the forum and subscribe to it, point your web browser to http://www.manning.com/LuceneinActionSecondEdition. This page provides information on how to get on the forum once you are registered, what kind of help is available, and the rules of conduct on the forum.
By combining introductions, overviews, and how-to examples, the In Action books are designed to help learning and remembering. According to research in cognitive science, the things people remember are things they discover during self-motivated exploration.
Although no one at Manning is a cognitive scientist, we are convinced that for learning to become permanent it must pass through stages of exploration, play, and, interestingly, re-telling of what is being learned. People understand and remember new things, which is to say they master them, only after actively exploring them. Humans learn in action. An essential part of an In Action guide is that it is example-driven. It encourages the reader to try things out, to play with new code, and explore new ideas.
There is another, more mundane, reason for the title of this book: our readers are busy. They use books to do a job or solve a problem. They need books that allow them to jump in and jump out easily and learn just what they want just when they want it. They need books that aid them in action. The books in this series are designed for such readers.
The figure on the cover of Lucene in Action, Second Edition is “An inhabitant of the coast of Syria.” The illustration is taken from a collection of costumes of the Ottoman Empire published on January 1, 1802, by William Miller of Old Bond Street, London. The title page is missing from the collection and we have been unable to track it down to date. The book’s table of contents identifies the figures in both English and French, and each illustration bears the names of two artists who worked on it, both of whom would no doubt be surprised to find their art gracing the front cover of a computer programming book two hundred years later.
The collection was purchased by a Manning editor at an antiquarian flea market in the “Garage” on West 26th Street in Manhattan. The seller was an American based in Ankara, Turkey, and the transaction took place just as he was packing up his stand for the day. The Manning editor did not have on his person the substantial amount of cash that was required for the purchase, and a credit card and check were both politely turned down.
With the seller flying back to Ankara that evening, the situation was getting hopeless. What was the solution? It turned out to be nothing more than an old-fashioned verbal agreement sealed with a handshake. The seller simply proposed that the money be transferred to him by wire, and the editor walked out with the seller’s bank information on a piece of paper and the portfolio of images under his arm. Needless to say, we transferred the funds the next day, and we remain grateful and impressed by this unknown person’s trust in one of us. It recalls something that might have happened a long time ago.
The pictures from the Ottoman collection, like the other illustrations that appear on our covers, bring to life the richness and variety of dress customs of two centuries ago. They recall the sense of isolation and distance of that period—and of every other historic period except our own hyperkinetic present.
Dress codes have changed since then and the diversity by region, so rich at the time, has faded away. It is now often hard to tell the inhabitant of one continent from another. Perhaps, trying to view it optimistically, we have traded a cultural and visual diversity for a more varied personal life. Or a more varied and interesting intellectual and technical life.
We at Manning celebrate the inventiveness, the initiative, and, yes, the fun of the computer business with book covers based on the rich diversity of regional life of two centuries ago—brought back to life by the pictures from this collection.