This book tackles three core areas of interest in today's search environment: desktop clients, web search, and web crawling. You'll gain practical experience with these sorts of applications by following along with theme projects included throughout the book. So if you've ever aspired to build your own search engine akin to Google or Yahoo! He has extensive experience in developing enterprise systems in e-commerce, web, and search domains on the LAMP, Java, and .NET platforms. Jon has previously contributed to books and industry publications as a technical reviewer and coauthor, respectively.
|Published (Last):||8 May 2013|
While Google would certainly offer better search results for most of the queries that we were interested in, they no longer offer a cheap and convenient way of creating custom search engines. This need, along with the desire to own and manage my own data, spurred me to set about finding a workflow for retrieving decent results for search queries made against a predefined list of websites. That workflow is described here, providing what I hope will serve as a useful reference for how to go about setting up a small search engine using free and open-source tools.
Commercial search engines are immensely complex feats of engineering. However, at the end of the day, most web search engines perform three basic functions: crawling the web, indexing the pages they crawl, and searching for documents in their index. We will walk through how to set up the various tools that provide this functionality. We also take a brief look at how to go about learning a better ranking function.

Figure: The various components of a search engine.
Setting up our Crawler

Reference: Nutch Tutorial

A crawler mostly does what its name suggests. It visits pages, consumes their resources, proceeds to visit all the pages that they link to, and then repeats the cycle until a specified crawl depth is reached.
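To make the visit, follow, repeat cycle concrete, here is a minimal sketch of a depth-limited crawl. It is not how Nutch is implemented: the "web" is a hypothetical in-memory dictionary (`toy_web`) that maps each page to the pages it links to, standing in for fetching and parsing real pages.

```python
from collections import deque


def crawl(link_graph, seeds, max_depth):
    """Breadth-first crawl over an in-memory link graph.

    link_graph maps a URL to the list of URLs it links to (a stand-in for
    fetching and parsing a real page). Seeds sit at depth 0, and links are
    only followed from pages shallower than max_depth.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque((url, 0) for url in seeds)
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # crawl depth reached: do not follow this page's links
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical toy web used only for illustration.
toy_web = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["d.example"],
    "c.example": ["a.example"],
    "d.example": ["e.example"],
}

if __name__ == "__main__":
    print(crawl(toy_web, ["a.example"], max_depth=2))
```

Raising `max_depth` by one lets the crawl reach pages one more link away from the seeds, which is why the crawl depth has such a large effect on how much content ends up in the index.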
Apache Nutch is one of the more mature open-source crawlers currently available. There are two major release lines of Apache Nutch, versioned as 1.x and 2.x respectively.
The latter is more modern; however, it breaks compatibility with some of the features this tutorial uses. It is therefore advised that readers stick to 1.x. After crawling, we will want to index the contents of the pages we crawl. We will use Apache Solr for this purpose, and at the time of writing the latest version of Nutch that is compatible with Solr is v1.
See this page for the most recent Solr-Nutch compatibility table.

Your crawler needs a name. This name will be used to identify your crawler and will end up in a lot of log files, so give the name some thought. Another important addition is the indexer-solr plugin.
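The original configuration listing is not reproduced here, so the following is only a sketch of what such a configuration could look like. It assumes the overrides live in Nutch's `nutch-site.xml` and use the `http.agent.name` and `plugin.includes` properties; the crawler name and the exact plugin list are illustrative values, not the ones the article used.

```python
import xml.etree.ElementTree as ET


def build_nutch_site(properties):
    """Render name/value pairs as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


site_xml = build_nutch_site({
    # Hypothetical crawler name: pick something that identifies you.
    "http.agent.name": "MyMiniSearchBot",
    # Illustrative plugin list; the point is that indexer-solr is included.
    "plugin.includes": (
        "protocol-http|urlfilter-regex|parse-(html|tika)|"
        "index-(basic|anchor)|indexer-solr|scoring-opic|"
        "urlnormalizer-(pass|regex|basic)"
    ),
})

if __name__ == "__main__":
    print(site_xml)
```

This sketch covers only the agent name and the plugin list. It is not the lastModified modification described next.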
This modification instructs Solr to index the lastModified field, allowing us to later rank results based on their recency.

The crawler will need a list of seed URLs to start its crawl from. Creating this should be easy: it is a plain-text file with one URL per line.
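As a rough sketch of creating the seed list, the snippet below writes one URL per line to a text file. The directory and file names (`urls/seed.txt`) and the example URLs are assumptions for illustration; use whatever location you later hand to the crawler.

```python
from pathlib import Path


def seed_file_text(urls):
    """Return seed-list contents: one URL per line, blanks and repeats dropped."""
    cleaned = (u.strip() for u in urls)
    unique = list(dict.fromkeys(u for u in cleaned if u))
    return "\n".join(unique) + "\n"


def write_seed_list(directory, urls, filename="seed.txt"):
    """Write the seed list under `directory`, creating it if needed."""
    seed_dir = Path(directory)
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / filename
    seed_file.write_text(seed_file_text(urls), encoding="utf-8")
    return seed_file


if __name__ == "__main__":
    # Hypothetical seeds; replace with the sites you actually want indexed.
    print(write_seed_list("urls", ["https://example.org/", "https://example.com/docs"]))
```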
By carefully tailoring a list of seed URLs and limiting your crawl depth, you can control what content makes it into your index. See this page for more details.

Setting up Solr

Reference: Nutch tutorial

Apache Solr is responsible for more than just maintaining a full-text index of the content that our crawler scrapes up.
It also handles search queries, supporting a broad range of fairly sophisticated query parsers. Last of all, it is responsible for reordering the retrieved search results so that the most relevant results show up first. The official Solr documentation should serve as a far better guide to setting up Solr, but for now we will only need to carry out the following:
Create resources for a new Solr core. A core is a single index with its associated logs and configuration files. The rest of this article assumes you named your core nutch, but you can name it whatever you like.

Copy the nutch schema. The managed-schema file is automatically updated when changes are made to the configuration via the Schema API, and its presence may cause Solr to ignore our schema. More details here.

With the required software all set up, we can finally crawl our list of seed URLs and index their contents into Solr.
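The core-preparation steps above can be sketched as plain file operations. This is an assumption-laden illustration, not the article's original commands: every path is hypothetical and passed in by the caller, and removing `managed-schema` is inferred from the remark that its presence may cause Solr to ignore our schema.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_core_config(template_conf, nutch_schema, core_conf):
    """Copy a template conf directory, install Nutch's schema, drop managed-schema.

    All three paths are supplied by the caller; nothing here assumes a
    particular Solr or Nutch directory layout.
    """
    template_conf, nutch_schema, core_conf = map(
        Path, (template_conf, nutch_schema, core_conf)
    )
    shutil.copytree(template_conf, core_conf)
    shutil.copy(nutch_schema, core_conf / "schema.xml")
    managed = core_conf / "managed-schema"
    if managed.exists():
        managed.unlink()  # otherwise Solr may ignore the copied schema
    return core_conf


if __name__ == "__main__":
    # Dry run against a throwaway directory with placeholder files.
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "template").mkdir()
        (tmp / "template" / "managed-schema").write_text("<schema/>")
        (tmp / "nutch-schema.xml").write_text("<schema name='nutch'/>")
        conf = prepare_core_config(tmp / "template", tmp / "nutch-schema.xml", tmp / "nutch")
        print(sorted(p.name for p in conf.iterdir()))
```

Creating the core itself from this configuration is still done with Solr's own tooling; see the official Solr documentation for the exact command for your version.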
It will take a while to finish the two rounds of crawling and indexing, after which you should be able to issue a search request to your Solr instance. You can see it for yourself here. This should prod us to ask how we can go about improving the quality of our search results. Commercial search engines such as Google do this by reranking their search results using a number of heuristics.
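The search request mentioned above can be sketched as follows. The host, port, core name, and field list are assumptions (a local Solr on port 8983 with a core named nutch); the snippet builds the URL for Solr's `select` handler, and the separate `fetch_results` helper only runs if you call it against a live instance.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen


def solr_select_url(base_url, core, query, rows=10, fields=("title", "url", "score")):
    """Build a URL for Solr's select handler on the given core."""
    params = {"q": query, "rows": rows, "fl": ",".join(fields), "wt": "json"}
    return f"{base_url.rstrip('/')}/solr/{core}/select?{urlencode(params)}"


def fetch_results(url):
    """Fetch and decode the matching documents (requires a running Solr)."""
    with urlopen(url) as response:
        return json.load(response)["response"]["docs"]


if __name__ == "__main__":
    # Assumed local instance and core name; adjust to your setup.
    print(solr_select_url("http://localhost:8983", "nutch", "search engine"))
```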
Though we will only be using a number of simple features in this article, this list of features put together by Microsoft Research for a shared ranking task should serve as a good reference for appropriate feature selection.
This is known as learning to rank.

You can manually extract features for a certain query by making a curl request. Note that you will need to include the query both in its default position and as a parameter passed on to the feature generator. This is because some of our features require the parameter query for their calculation.
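The article's own feature definitions and curl command are not reproduced here, so the following is a sketch under assumptions. It uses the feature classes from Solr's Learning To Rank module and its `[features]` document transformer; the feature names, the `title` and `content` fields, and the quoting of multi-word queries are illustrative choices that should be checked against your Solr version.

```python
import json

# Hypothetical feature set: Solr's own score plus two query-dependent matches.
# "${query}" is filled in from the efi.query value passed at request time.
FEATURES = [
    {
        "name": "originalScore",
        "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
        "params": {},
    },
    {
        "name": "titleMatch",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q": "{!field f=title}${query}"},
    },
    {
        "name": "contentMatch",
        "class": "org.apache.solr.ltr.feature.SolrFeature",
        "params": {"q": "{!field f=content}${query}"},
    },
]


def feature_store_payload():
    """JSON body that would be uploaded to the core's feature store."""
    return json.dumps(FEATURES, indent=2)


def feature_extraction_params(query, rows=10):
    """Request parameters that return each document's feature values.

    The query appears twice: once as q, and once as efi.query so that
    query-dependent features can be computed. Queries containing a single
    quote are not escaped in this sketch.
    """
    return {
        "q": query,
        "rows": rows,
        "fl": f"id,score,[features efi.query='{query}']",
    }


if __name__ == "__main__":
    print(feature_extraction_params("open source"))
```

Passing the query a second time as `efi.query` is what lets features such as the hypothetical `titleMatch` be evaluated for the specific query being issued.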
This is typically done by eliciting user feedback via a rating system, or by inferring the preferred ranking from the links users end up clicking on.

You can also use -i filename to read newline-delimited queries in from a text file and -o filename to save the output to a file.
Make sure to save the final list of results with their rankings and features in training.

Training our Ranker

Learning to rank is a growing field, and there are a lot of high-quality ranking algorithms to choose from.
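Before any ranker can be trained, the rated results and their features have to be written in a format it can read. As a sketch, the snippet below assumes the LETOR-style text format accepted by many learning-to-rank tools (`<rating> qid:<id> <index>:<value> ...`) and assumes features arrive as a comma-separated `name=value` string. The feature names and their order are hypothetical and must match whatever feature set you defined.

```python
# Hypothetical feature order; column i+1 in the output is FEATURE_ORDER[i].
FEATURE_ORDER = ["originalScore", "titleMatch", "contentMatch"]


def parse_feature_string(text):
    """Turn 'name=value,name=value' into a dict of floats."""
    pairs = (item.split("=", 1) for item in text.split(",") if item)
    return {name: float(value) for name, value in pairs}


def to_letor_line(rating, query_id, features, doc_id=""):
    """Format one rated result as a LETOR-style training line.

    Features missing from the dict are written as 0.
    """
    columns = [
        f"{index}:{features.get(name, 0.0):g}"
        for index, name in enumerate(FEATURE_ORDER, start=1)
    ]
    line = " ".join([str(rating), f"qid:{query_id}", *columns])
    return f"{line} # {doc_id}" if doc_id else line


if __name__ == "__main__":
    feats = parse_feature_string("originalScore=7.5,titleMatch=1.0,contentMatch=0.0")
    print(to_letor_line(2, 1, feats, "https://example.org/"))
```

Each query gets its own `qid`, so the ranker learns to order documents relative to others retrieved for the same query rather than across unrelated queries.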
Alternatively (and more conveniently), download the appropriate pre-compiled binary from the project website. Note that it assumes that you use the feature set we defined above. Invoking it as demonstrated below will generate a model file model.

Muuo Wambua. Build yourself a Mini Search Engine. Aug 28,
Crawl: Where to store contents of crawled pages.

This score is a measure of how similar the document is to the query. This is typically calculated using the BM25 ranking function.
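For readers curious about what BM25 computes, here is a small self-contained sketch of the textbook formula. It is an illustration only: the tokenised toy documents are hypothetical, the defaults `k1=1.2` and `b=0.75` are common choices, and the exact variant Solr uses may differ in its details.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenised document against the query with textbook BM25.

    documents is a non-empty list of token lists. Rarer terms get a higher
    IDF, and term frequency is damped by k1 and normalised by document
    length relative to the average (controlled by b).
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        ["open", "source", "search", "engine"],
        ["search", "results", "ranking"],
        ["cooking", "recipes"],
    ]
    print(bm25_scores(["search", "engine"], docs))
```

A document that matches more of the query's terms, and rarer ones, scores higher, while a document sharing no terms with the query scores zero.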