Web search engines are your friends. Type lucene in your favorite web search engine and you’ll find many interesting Lucene-related projects. Other good places to look are SourceForge, Google Code, and GitHub; a search for lucene on any of those sites displays a number of open source projects written on top of Lucene.
Search Lucene: http://search-lucene.com/
LucidFind: http://search.lucidimagination.com/
Unicode page in Wikipedia: http://en.wikipedia.org/wiki/Unicode
The Unicode Consortium: http://unicode.org
Bray, Tim, “Characters vs. Bytes”: www.tbray.org/ongoing/When/200x/2003/04/26/UTF
Green, Dale, “Trail: Internationalization”: http://java.sun.com/docs/books/tutorial/i18n/index.html
Lindenberg, Norbert, and Masayoshi Okutsu, “Supplementary Characters in the Java Platform”: http://java.sun.com/developer/technicalArticles/Intl/Supplementary/
Peterson, Erik, “Chinese Character Dictionary—Unicode Version”: www.mandarin-tools.com/chardict_u8.html
Spolsky, Joel, “The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)”: www.joelonsoftware.com/articles/Unicode.html
Davis, Mark, “Globalization Gotchas”: http://macchiato.com/slides/GlobalizationGotchas.ppt
Rosette Language Identifier: http://basistech.com/language-identification
Marr, Rich, “Creating a Language Detection API in 30 minutes”: http://richmarr.wordpress.com/2008/10/24/creating-a-language-detection-api-in-30-minutes/
Prager, John M., “Linguini: Language Identification for Multilingual Documents”: ftp://ftp.software.ibm.com/software/globalization/documents/linguini.pdf
Java Text Categorization Library: http://textcat.sourceforge.net/
NGramJ: http://ngramj.sourceforge.net
Google Ajax Language API: http://code.google.com/apis/ajaxlanguage/documentation/
Sematext Language Identifier: www.sematext.com/products/language-identifier/index.html
Language identification on Wikipedia: http://en.wikipedia.org/wiki/Language_identification
Vector Space Model on Wikipedia: http://en.wikipedia.org/wiki/Vector_space_model
Latent Semantic Analysis on Wikipedia: http://en.wikipedia.org/wiki/Latent_semantic_analysis
The Latent Semantic Indexing home page: http://lsa.colorado.edu/
“Latent Semantic Indexing (LSI)”: www.cs.utk.edu/~lsi
Stata, Raymie, Krishna Bharat, and Farzin Maghoul, “The Term Vector Database: Fast Access to Indexing Terms for Web Pages”: www9.org/w9cdrom/159/159.html
CLucene: www.sourceforge.net/projects/clucene/
Lucene.Net: http://incubator.apache.org/lucene.net/
KinoSearch: www.rectangular.com/kinosearch
Apache Lucy: http://lucene.apache.org/lucy/
PyLucene: http://lucene.apache.org/pylucene/
Ferret: http://ferret.davebalmain.com
PHP (Zend_Search_Lucene, part of Zend Framework): http://framework.zend.com/
Krugle: www.krugle.org/
DERI, SIREn: http://siren.sindice.com/
LinkedIn, Bobo-Browse: http://snaprojects.jira.com/browse/BOBO/
LinkedIn, Zoie: http://snaprojects.jira.com/browse/ZOIE
Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval (Cambridge University Press, 2008). See www-nlp.stanford.edu/IR-book/.
Calishain, Tara, and Rael Dornfest, Google Hacks (O’Reilly, 2003).
Gilleland, Michael, “Levenshtein Distance, in Three Flavors”: www.merriampark.com/ld.htm
GNU Compiler for the Java Programming Language: http://gcc.gnu.org/java/
Google search results for Lucene: www.google.com/search?q=lucene
Apache Lucene Java: http://lucene.apache.org/java
Lucene Sandbox: http://lucene.apache.org/java/3_0_1/lucene-contrib/index.html
Suffix trees on Wikipedia: http://en.wikipedia.org/wiki/Suffix_tree
dmoz results for information retrieval: http://dmoz.org/Computers/Software/Information_Retrieval/
Egothor: www.egothor.org/
Minion: https://minion.dev.java.net/
Google Directory results for information retrieval: http://directory.google.com/Top/Computers/Software/Information_Retrieval/
ht://Dig: www.htdig.org
Managing Gigabytes for Java (MG4J): http://mg4j.dsi.unimi.it
Terrier: http://ir.dcs.gla.ac.uk/terrier
Namazu: www.namazu.org
Hounder: http://hounder.org
Search Tools for Web Sites and Intranets: www.searchtools.com
SWISH++: http://swishplusplus.sourceforge.net/
SWISH-E: http://swish-e.org/
Autonomy: www.autonomy.com
Aperture: http://aperture.sourceforge.net/
WebGlimpse: http://webglimpse.net
Xapian: www.xapian.org
The Lemur Toolkit: www.lemurproject.org
Doug’s official list of publications, from which the following list was derived, is available at http://lucene.sourceforge.net/publications.html.
“An Interpreter for Phonological Rules,” coauthored with J. Harrington, Proceedings of Institute of Acoustics Autumn Conference, November 1986
“Information Theater versus Information Refinery,” coauthored with J. Pedersen, P.-K. Halvorsen, and M. Withgott, AAAI Spring Symposium on Text-Based Intelligent Systems, March 1990
“Optimizations for Dynamic Inverted Index Maintenance,” coauthored with J. Pedersen, Proceedings of SIGIR ’90, September 1990
“An Object-Oriented Architecture for Text Retrieval,” coauthored with J. O. Pedersen and P.-K. Halvorsen, Proceedings of RIAO ’91, April 1991
“Snippet Search: A Single Phrase Approach to Text Access,” coauthored with J. O. Pedersen and J. W. Tukey, Proceedings of the 1991 Joint Statistical Meetings, August 1991
“A Practical Part-of-Speech Tagger,” coauthored with J. Kupiec, J. Pedersen, and P. Sibun, Proceedings of the Third Conference on Applied Natural Language Processing, April 1992
“Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections,” coauthored with D. Karger, J. Pedersen, and J. Tukey, Proceedings of SIGIR ’92, June 1992
“Constant Interaction-Time Scatter/Gather Browsing of Very Large Document Collections,” coauthored with D. Karger and J. Pedersen, Proceedings of SIGIR ’93, June 1993
“Porting a Part-of-Speech Tagger to Swedish,” Nordic Datalingvistik Dagen 1993, Stockholm, June 1993
“Space Optimizations for Total Ranking,” coauthored with J. Pedersen, Proceedings of RIAO ’97, Montreal, Quebec, June 1997
5,278,980: “Iterative technique for phrase query formation and an information retrieval system employing same,” with J. Pedersen, P.-K. Halvorsen, J. Tukey, E. Bier, and D. Bobrow, filed August 1991
5,442,778: “Scatter-gather: a cluster-based method and apparatus for browsing large document collections,” with J. Pedersen, D. Karger, and J. Tukey, filed November 1991
5,390,259: “Methods and apparatus for selecting semantically significant images in a document image without decoding image content,” with M. Withgott, S. Bagley, D. Bloomberg, D. Huttenlocher, R. Kaplan, T. Cass, P.-K. Halvorsen, and R. Rao, filed November 1991
5,625,554: “Finite-state transduction of related word forms for text indexing and retrieval,” with P.-K. Halvorsen, R.M. Kaplan, L. Karttunen, M. Kay, and J. Pedersen, filed July 1992
5,483,650: “Method of Constant Interaction-Time Clustering Applied to Document Browsing,” with J. Pedersen and D. Karger, filed November 1992
5,384,703: “Method and apparatus for summarizing documents according to theme,” with M. Withgott, filed July 1993
5,838,323: “Document summary computer system user interface,” with D. Rose, J. Bornstein, and J. Hatton, filed September 1995
5,867,164: “Interactive document summarization,” with D. Rose, J. Bornstein, and J. Hatton, filed September 1995
5,870,740: “System and method for improving the ranking of information retrieval results for short queries,” with D. Rose, filed September 1996