Search engines

Search (information retrieval) is a big part of NLP. And it’s extremely important that we get it right so that our AI (and corporate) overlords can’t manipulate us through the information they feed our brains. If you want to learn how to retrieve your own information, by building your own search engines, these are some resources that will help.

Search algorithms

Open source search engines

Open source full-text indexers

Efficient indexing is critical to any natural language search application. Here are a few open source full-text indexing options. However, these “search engines” don’t crawl the web, so you need to provide them with the corpus you want them to index and search:
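To give a feel for what a full-text indexer does under the hood, here is a minimal sketch of an inverted index in plain Python. It isn't how any particular open source indexer is implemented. The `tokenize`, `build_index`, and `search` names and the three-document corpus are made up for illustration, and real indexers add stemming, ranking, and on-disk storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(corpus):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


# Toy corpus, invented for this example only.
corpus = {
    "doc1": "Open source search engines index text.",
    "doc2": "Full-text indexing makes search fast.",
    "doc3": "Distributed engines share the index.",
}
index = build_index(corpus)
print(sorted(search(index, "search index")))  # ['doc1']
```

Because the index maps tokens to documents rather than documents to tokens, a query only touches the postings for its own terms instead of scanning the whole corpus.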

Manipulative search engines

The search engines most of us use aren’t optimized solely to help you find what you need, but rather to ensure that you click links that generate revenue for the companies that built them. Google’s innovative second-price sealed-bid auction ensures that advertisers don’t overpay for their ads,[1] but it doesn’t prevent search users from overpaying when they click disguised advertisements. This manipulative search isn’t unique to Google. It happens in any search engine that ranks results according to an “objective function” other than your satisfaction with the search results. But here they are, if you want to compare and experiment:

[1] Cornell University Networks Course case study, “Google AdWords Auction - A Second Price Sealed-Bid Auction,” https://blogs.cornell.edu/info2040/2012/10/27/google-adwords-auction-a-second-price-sealed-bid-auction.

  • Google
  • Bing
  • Baidu
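If the second-price sealed-bid auction mentioned above is unfamiliar, this simplified sketch shows the core rule: the highest bidder wins but pays the second-highest bid. The `second_price_auction` function, the advertiser names, and the bid amounts are hypothetical, and it assumes a single ad slot. Real ad auctions involve more factors that aren't modeled here.

```python
def second_price_auction(bids):
    """Return (winner, price) for a single-item sealed-bid auction.

    The highest bidder wins but pays the second-highest bid.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bids")
    # Sort bidders from highest bid to lowest.
    ranked = sorted(bids.items(), key=lambda item: item[1], reverse=True)
    winner, _ = ranked[0]
    _, price = ranked[1]
    return winner, price


# Invented bids, in dollars per click.
bids = {"advertiser_a": 2.50, "advertiser_b": 1.75, "advertiser_c": 0.90}
winner, price = second_price_auction(bids)
print(winner, price)  # advertiser_a 1.75
```

Since the winner's payment doesn't depend on their own bid, bidding above what the click is really worth can't lower the price they pay, which is why advertisers are protected from overpaying.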

Less manipulative search engines

To determine how commercial and manipulative a search engine was, I queried several engines with things like “open source search engine.” I then counted the number of AdWords purchasers and click-bait sites among the top 10 search results. The following sites kept that count to no more than one or two. And the top search results were often the most objective and useful sites, such as Wikipedia, Stack Exchange, or reputable news articles and blogs:
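If you want to repeat this kind of count yourself, the tally can be sketched as follows. This is only an illustration. The `count_flagged` helper, the result URLs, and the set of flagged domains are all invented placeholders, and deciding which domains count as ad purchasers or click-bait is still a manual judgment call.

```python
from urllib.parse import urlparse


def count_flagged(result_urls, flagged_domains, top_n=10):
    """Count how many of the top_n result URLs belong to a flagged domain."""
    count = 0
    for url in result_urls[:top_n]:
        host = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same site.
        if host.startswith("www."):
            host = host[4:]
        if host in flagged_domains:
            count += 1
    return count


# Placeholder data: these URLs and labels are invented for illustration.
results = [
    "https://www.example-ads.com/best-search",
    "https://en.wikipedia.org/wiki/Search_engine",
    "https://clickbait.example.net/top-10-engines",
    "https://stackexchange.com/questions/1",
]
flagged = {"example-ads.com", "clickbait.example.net"}
print(count_flagged(results, flagged))  # 2
```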

Distributed search engines

Distributed search engines[3] are perhaps the least manipulative and most “objective” because they have no central server to influence the ranking of the search results. However, current distributed search implementations rely on TF-IDF word frequencies to rank pages, because semantic search NLP algorithms are difficult to scale and distribute. That said, semantic indexing approaches such as latent semantic analysis (LSA) and locality sensitive hashing have been successfully distributed with nearly linear scaling (as good as you can get). It’s just a matter of time before someone decides to contribute code for semantic search into an open source project like YaCy or builds a new distributed search engine capable of LSA:
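To make the TF-IDF ranking mentioned above concrete, here is a minimal single-machine sketch. It isn't the code of any distributed search engine. The `rank` function, the unsmoothed `log(N / df)` weighting, and the toy corpus are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(corpus, query):
    """Rank document ids by the summed TF-IDF weight of the query terms."""
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    n_docs = len(docs)
    scores = {}
    for doc_id, counts in docs.items():
        total = sum(counts.values())
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain the term.
            df = sum(1 for c in docs.values() if term in c)
            if df == 0 or total == 0:
                continue
            tf = counts[term] / total
            idf = math.log(n_docs / df)
            score += tf * idf
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


# Toy corpus, invented for this example only.
corpus = {
    "a": "distributed search engines rank pages",
    "b": "semantic search uses latent semantic analysis",
    "c": "cats sleep all day",
}
print(rank(corpus, "semantic search"))  # ['b', 'a', 'c']
```

This scoring only needs word counts, which are cheap to compute and merge across many machines. It also only matches exact words, which is the limitation that semantic approaches like LSA are meant to address.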

3
