Search (information retrieval) is a big part of NLP. And it’s extremely important that we get it right so that our AI (and corporate) overlords can’t manipulate us through the information they feed our brains. If you want to learn how to retrieve your own information by building your own search engine, these resources will help.
Efficient indexing is critical to any natural language search application. Here are a few open source full-text indexing options. However, these “search engines” don’t crawl the web, so you need to provide them with the corpus you want them to index and search:
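All of these packages share the same core data structure: an inverted index, which maps each term to the documents that contain it, so a query never has to scan the full corpus. Here is a minimal pure-Python sketch of the idea; the example documents and the `search` helper are invented for illustration, and real indexers add tokenization, stemming, and scoring on top of this:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = [
    "full text search with an inverted index",
    "distributed search engines have no central server",
    "latent semantic analysis maps words to topics",
]
index = build_inverted_index(docs)
# search(index, "search index") finds only the document containing both terms
```

Because lookups touch only the postings for the query terms, this is why you must hand these engines a corpus up front: the index has to be built before the first query.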
The search engines most of us use aren’t optimized solely to help you find what you need, but rather to ensure that you click links that generate revenue for the company that built them. Google’s innovative second-price sealed-bid auction ensures that advertisers don’t overpay for their ads,[1] but it doesn’t prevent search users from overpaying when they click disguised advertisements. This kind of manipulation isn’t unique to Google. It happens in any search engine that ranks results according to any “objective function” other than your satisfaction with the search results. But here they are, if you want to compare and experiment:
Cornell University Networks Course case study, “Google AdWords Auction - A Second Price Sealed-Bid Auction,” (https://blogs.cornell.edu/info2040/2012/10/27/google-adwords-auction-a-second-price-sealed-bid-auction).
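The mechanism that case study describes is simple to state: the highest bidder wins the ad slot, but pays only the second-highest bid. A toy sketch (the bidder names and amounts here are invented):

```python
def second_price_auction(bids):
    """Highest bidder wins but pays the second-highest bid (a Vickrey auction)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1]  # the price paid is the runner-up's bid, not the winner's
    return winner, price

winner, price = second_price_auction({"ad_a": 2.50, "ad_b": 1.75, "ad_c": 2.10})
# ad_a wins the slot but pays only 2.10, ad_c's bid
```

Because the winner’s own bid never sets the price, bidding your true value is a dominant strategy, which is why advertisers “don’t overpay.” Nothing in the mechanism, however, protects the searcher who clicks the ad.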
To gauge how commercial and manipulative a search engine is, I queried several engines with phrases like “open source search engine.” I then counted the ad-words purchasers and click-bait sites among the top 10 search results. The following engines kept that count to one or two, and their top results were often the most objective and useful sites, such as Wikipedia, Stack Exchange, or reputable news articles and blogs:
See the web page titled “Try These 15 Search Engines Instead of Google For Better Search Results,” (https://www.lifehack.org/374487/try-these-15-search-engines-instead-google-for-better-search-results).
Distributed search engines[3] are perhaps the least manipulative and most “objective,” because they have no central server that can influence the ranking of the search results. However, current distributed search implementations rely on TF-IDF word frequencies to rank pages, because semantic search NLP algorithms are difficult to scale and distribute. That said, semantic indexing approaches such as latent semantic analysis (LSA) and locality-sensitive hashing (LSH) have been distributed successfully with nearly linear scaling (as good as you can get). It’s just a matter of time before someone contributes semantic search code to an open source project like YaCy or builds a new distributed search engine capable of LSA:
See the web pages titled “Distributed search engine,” (https://en.wikipedia.org/wiki/Distributed_search_engine) and “Distributed Search Engines,” (https://wiki.p2pfoundation.net/Distributed_Search_Engines).
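The TF-IDF ranking those distributed engines rely on fits in a few lines of pure Python. This is a sketch with invented documents and a brute-force scoring loop; real engines use an inverted index and smoothed IDF variants rather than the raw `log(n / df)` shown here:

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Score each document against the query with raw TF-IDF; return docs best-first."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                       # document frequency: docs containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    scored = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # term frequency weighted by how rare the term is across the corpus
                score += (tf[term] / len(tokens)) * math.log(n / df[term])
        scored.append((score, i))
    scored.sort(reverse=True)
    return [docs[i] for _, i in scored]

docs = [
    "the cat sat on the mat",
    "open source search engine software",
    "a distributed search engine has no central server",
]
ranked = tfidf_rank("distributed search engine", docs)
# ranked[0] is the distributed-search document: "distributed" appears in only
# one document, so it carries the most IDF weight
```

Note how purely mechanical the ranking is: every node in a peer-to-peer network can compute the same scores from shared term statistics, which is exactly why TF-IDF distributes so easily while heavier semantic models do not.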