Have you ever wanted to build your own search engine, or just enhance the current search on your website? Today I'm going to teach you how to build better search for your site using Apache Solr for searching and Apache Nutch for crawling.
I'm using an Ubuntu machine for this tutorial. If you would like to follow along, you could also get an Amazon account and use their cloud servers for testing.
The first thing you need to do is download Solr and Nutch using the sudo wget url-to-download command. From there you can extract the archives on Linux using the tar -zxvf yourfile.tar.gz command. Do this for both Nutch 2 and Solr 1.4.1; it's best to extract them into a temp folder in your home directory.
Which Cup Of Java To Choose
We now need to make sure the correct Java path is set for Solr and Nutch to work correctly. If you don't have Java installed on your machine, I suggest you install it first. After you have set up Java, we need to make sure that the correct path is set with the following commands:
which java will show you the path that is currently set.
Use the ls -l command to follow the symlink until you reach the real Java installation. When you have found the correct path, you can set it with JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ followed by export JAVA_HOME. That should get you working from here on.
Are You Nutch?
The next thing you need to do is edit the file NUTCH_ROOT/conf/nutch-default.xml and set the value of http.agent.name in that file. Now create a folder NUTCH_ROOT/crawl and a file NUTCH_ROOT/urls/nutch.
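The property in nutch-default.xml ends up looking something like this ("MyTestCrawler" is just a placeholder; use a name that identifies your bot to the sites you crawl):

```xml
<property>
  <name>http.agent.name</name>
  <!-- placeholder value: pick a name that identifies your crawler -->
  <value>MyTestCrawler</value>
</property>
```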
Edit the nutch file you just created and add the URLs you would like to crawl, one per line. Remember to include the http:// prefix and a trailing slash on each domain.
Edit the file NUTCH_ROOT/conf/crawl-urlfilter.txt and, beneath the line # accept hosts in MY.DOMAIN.NAME, replace MY.DOMAIN.NAME with the domain of the first URL you wish to crawl, adding a new line for each additional domain, for example +^http://([a-z0-9]*\.)*domaintocrawl.com/
I Give You My Solr
Next we need to set up Solr to work correctly with Nutch. Copy all files from NUTCH_ROOT/conf into SOLR_ROOT/example/solr/conf. Create two files, stopwords.txt and protwords.txt, and add them to the SOLR_ROOT/example/solr/conf directory.
In schema.xml, change line 71 from:
<field name="content" type="text" stored="false" indexed="true"/>
to:
<field name="content" type="text" stored="true" indexed="true"/>
In solrconfig.xml, add the following above the first <requestHandler> tag:
<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">content^0.5 anchor^1.0 title^1.2</str>
    <str name="pf">content^0.5 anchor^1.5 title^1.2 site^1.5</str>
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
    <str name="hl.fl">title url content</str>
  </lst>
</requestHandler>
Gentlemen, Start Your Engines
Now we are at a point where we can start Solr. Go to the example directory with cd SOLR_ROOT/example and type the following command to start Solr: java -jar start.jar. If Solr started correctly, you can go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel; from there you can run searches and other funky search stuff. This is just easier than using the command line.
Solr for WordPress
Once you have set up your Solr installation and have it running, you can easily integrate Solr with your WordPress installation by installing a plugin called Solr for WordPress.
This plugin has just a few settings you need to configure before everything is set. You might want to style the results page afterwards, as it's kind of ugly out of the box.
Start Crawling With Nutch
To get Nutch to start crawling the sites you've listed in the nutch file, you need to run the nutch crawl command.
The crawl command has the following options:
- -dir names the directory to put the crawl in.
- -threads determines the number of threads that will fetch in parallel.
- -depth indicates the link depth from the root page that should be crawled.
- -topN determines the maximum number of pages that will be retrieved at each level up to the depth.
Here is the command you need to run:
bin/nutch crawl urls -dir crawl -depth 2 -topN 1000
This command will tell Nutch to crawl a maximum of about 1000 pages at a link depth of 2.
When crawling is done, you can run the following command on Linux:
bin/nutch solrindex http://localhost:8983/solr/core0 crawl/crawldb crawl/linkdb crawl/segments/*
This will index the crawled data into the Solr core you specify.
Now you should understand a few basic and some more advanced ways to set up Solr. This article is just intended to get you going. I've attached a Word doc, Nutch Solr Setup Doc, that I use as my own setup guide for Solr and Nutch.
If you feel lost, or you think you have a better way to set up and use Solr and Nutch, please feel free to comment, as it will help other users too. I will update this article with your comments as a reference.