Search (information retrieval) is a big part of NLP. And it’s extremely important that we get it right so that our AI (and corporate) overlords can’t manipulate us through the information they feed our brains. If you want to learn how to retrieve your own information by building your own search engine, these resources will help.
Efficient indexing is critical to any natural language search application. Here are a few open source full-text indexing options. However, these “search engines” don’t crawl the web, so you need to provide them with the corpus you want them to index and search:
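All of these packages share the same core data structure: an inverted index, which maps each term to the documents that contain it, so a query never has to scan the full corpus. Here is a minimal pure-Python sketch of the idea; the example documents and the `search` helper are invented for illustration, and real indexers add tokenization, stemming, and scoring on top of this:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

docs = [
    "full text search with an inverted index",
    "distributed search engines have no central server",
    "latent semantic analysis maps words to topics",
]
index = build_inverted_index(docs)
# search(index, "search index") finds only the document containing both terms
```

Because lookups touch only the postings for the query terms, this is why you must hand these engines a corpus up front: the index has to be built before the first query.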
The search engines most of us use aren’t optimized solely to help you find what you need, but rather to ensure that you click links that generate revenue for the company that built them. Google’s innovative second-price sealed-bid auction ensures that advertisers don’t overpay for their ads,[1] but it doesn’t prevent search users from overpaying when they click disguised advertisements. This kind of manipulation isn’t unique to Google. It happens in any search engine that ranks results according to any “objective function” other than your satisfaction with the search results. But here they are, if you want to compare and experiment:
Cornell University Networks Course case study, “Google AdWords Auction - A Second Price Sealed-Bid Auction,” (https://blogs.cornell.edu/info2040/2012/10/27/google-adwords-auction-a-second-price-sealed-bid-auction).
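The mechanism that case study describes is simple to state: the highest bidder wins the ad slot, but pays only the second-highest bid. A toy sketch (the bidder names and amounts here are invented):

```python
def second_price_auction(bids):
    """Highest bidder wins but pays the second-highest bid (a Vickrey auction)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1]  # the price paid is the runner-up's bid, not the winner's
    return winner, price

winner, price = second_price_auction({"ad_a": 2.50, "ad_b": 1.75, "ad_c": 2.10})
# ad_a wins the slot but pays only 2.10, ad_c's bid
```

Because the winner’s own bid never sets the price, bidding your true value is a dominant strategy, which is why advertisers “don’t overpay.” Nothing in the mechanism, however, protects the searcher who clicks the ad.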
To gauge how commercial and manipulative a search engine is, I queried several engines with phrases like “open source search engine.” I then counted the ad-words purchasers and click-bait sites among the top 10 search results. The following engines kept that count to one or two, and their top results were often the most objective and useful sites, such as Wikipedia, Stack Exchange, or reputable news articles and blogs:
See the web page titled “Try These 15 Search Engines Instead of Google For Better Search Results,” (https://www.lifehack.org/374487/try-these-15-search-engines-instead-google-for-better-search-results).
Distributed search engines[3] are perhaps the least manipulative and most “objective,” because they have no central server that can influence the ranking of the search results. However, current distributed search implementations rely on TF-IDF word frequencies to rank pages, because semantic search NLP algorithms are difficult to scale and distribute. That said, semantic indexing approaches such as latent semantic analysis (LSA) and locality-sensitive hashing (LSH) have been distributed successfully with nearly linear scaling (as good as you can get). It’s just a matter of time before someone contributes semantic search code to an open source project like YaCy or builds a new distributed search engine capable of LSA:
See the web pages titled “Distributed search engine,” (https://en.wikipedia.org/wiki/Distributed_search_engine) and “Distributed Search Engines,” (https://wiki.p2pfoundation.net/Distributed_Search_Engines).
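The TF-IDF ranking those distributed engines rely on fits in a few lines of pure Python. This is a sketch with invented documents and a brute-force scoring loop; real engines use an inverted index and smoothed IDF variants rather than the raw `log(n / df)` shown here:

```python
import math
from collections import Counter

def tfidf_rank(query, docs):
    """Score each document against the query with raw TF-IDF; return docs best-first."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                       # document frequency: docs containing each term
    for tokens in tokenized:
        df.update(set(tokens))
    scored = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                # term frequency weighted by how rare the term is across the corpus
                score += (tf[term] / len(tokens)) * math.log(n / df[term])
        scored.append((score, i))
    scored.sort(reverse=True)
    return [docs[i] for _, i in scored]

docs = [
    "the cat sat on the mat",
    "open source search engine software",
    "a distributed search engine has no central server",
]
ranked = tfidf_rank("distributed search engine", docs)
# ranked[0] is the distributed-search document: "distributed" appears in only
# one document, so it carries the most IDF weight
```

Note how purely mechanical the ranking is: every node in a peer-to-peer network can compute the same scores from shared term statistics, which is exactly why TF-IDF distributes so easily while heavier semantic models do not.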