Search On Steroids: Build A Better Search Engine For Your Site

Have you ever wanted to build your own search engine, or just improve the search on your current website? Today I'm going to teach you how to build better search for your site using Apache Solr for search and Apache Nutch for crawling.


I'm using an Ubuntu machine for this tutorial. If you would like to follow along, you could also get an Amazon account and use their cloud servers for testing.

The first thing you need to do is download Solr and Nutch using the wget url-to-download command. From there you can extract the archives on Linux with the tar -zxvf yourfile.tar.gz command. Do this for both Nutch and Solr 1.4.1; it's best to extract them into a temp folder in your home directory.
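For example, the whole download-and-extract step might look like this. The mirror URL and Nutch version below are placeholders; grab the current links from the Apache download pages:

mkdir ~/temp && cd ~/temp
# placeholder URLs - use the real links from the Apache download pages
wget http://www.example-mirror.org/apache/nutch/apache-nutch-1.2-bin.tar.gz
wget http://www.example-mirror.org/apache/lucene/solr/1.4.1/apache-solr-1.4.1.tgz
tar -zxvf apache-nutch-1.2-bin.tar.gz
tar -zxvf apache-solr-1.4.1.tgz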

Which Cup Of Java To Choose


We now need to make sure the correct Java path is set for Solr and Nutch to work correctly. If you don't have Java installed on your machine, I suggest you do that first. After you have set up Java, make sure the correct path is set with the following commands:

which java will show you the path that is currently set.

Use the ls -l command to follow the symlink until you have the correct path. When you've found the correct path, you can set it with export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/ and that should get you working from now on.
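As a quick sketch, on a typical Ubuntu box with OpenJDK 6 the chain looks something like this (the exact paths will differ depending on your Java version and distro):

which java
# /usr/bin/java
ls -l /usr/bin/java
# -> /etc/alternatives/java
ls -l /etc/alternatives/java
# -> /usr/lib/jvm/java-6-openjdk/jre/bin/java
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk/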

Are You Nutch?

The next thing you need to do is edit NUTCH_ROOT/conf/nutch-default.xml and set a value for http.agent.name in that file. Then create a folder NUTCH_ROOT/crawl and a file NUTCH_ROOT/urls/nutch.
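The property you're after looks like this; the MySiteCrawler value is just a placeholder, so use any name that identifies your crawler:

<property>
  <name>http.agent.name</name>
  <value>MySiteCrawler</value>
</property>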

Edit the nutch file you just created and add the URLs you would like to crawl, one per line. Remember to include the http:// prefix and a trailing slash on each domain.
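For example, a urls/nutch file covering two (placeholder) sites would look like:

http://www.example.com/
http://blog.example.com/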

Edit the file NUTCH_ROOT/conf/crawl-urlfilter.txt. Beneath # accept hosts in MY.DOMAIN.NAME, replace MY.DOMAIN.NAME with the domain of the first URL you wish to crawl, and add a new line for each additional domain, for example +^http://([a-z0-9]*\.)*domaintocrawl.com/
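With two domains (placeholders again), that section would end up looking like this:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*example.com/
+^http://([a-z0-9]*\.)*example.org/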

I Give You My Solr

Next we need to set up Solr to work correctly with Nutch. Copy all files from NUTCH_ROOT/conf into SOLR_ROOT/example/solr/conf. Then create two files, stopwords.txt and protwords.txt, in the SOLR_ROOT/example/solr/conf directory.
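With the same NUTCH_ROOT and SOLR_ROOT placeholders as before, that boils down to:

cp NUTCH_ROOT/conf/* SOLR_ROOT/example/solr/conf/
touch SOLR_ROOT/example/solr/conf/stopwords.txt
touch SOLR_ROOT/example/solr/conf/protwords.txt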

Edit SOLR_ROOT/example/solr/conf/schema.xml

Change line 71 from:

<field name="content" type="text" stored="false" indexed="true"/>

to:

<field name="content" type="text" stored="true" indexed="true"/>

Edit SOLR_ROOT/example/solr/conf/solrconfig.xml

Add the following above the first <requestHandler> tag:

<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>

Gentlemen, Start Your Engines

Now we are at a point where we can start Solr up. Go to the example directory with cd SOLR_ROOT/example and type the following command to start Solr: java -jar start.jar. If Solr started correctly, you can go to http://HOST_ADDRESS:8983/solr/admin for the default Solr admin panel. From there you can run searches and other funky search stuff, which is just easier than using the command line.
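As a sketch, here are those two commands on their own lines; the nohup variant is optional, but it keeps Solr running after you close your SSH session:

cd SOLR_ROOT/example
java -jar start.jar
# or, to keep Solr running in the background after you log out:
nohup java -jar start.jar &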

Solr for WordPress

Once you have your Solr installation set up and running, you can easily integrate Solr with your WordPress installation by installing a plugin called Solr for WordPress.

The plugin has just a few settings you need to fill in before everything is set. You'll probably want to style the results page afterwards, as the default look is kind of ugly.

Start Crawling With Nutch

To get Nutch to start crawling the sites you've listed in the nutch file, you need to run the nutch crawl command.

The crawl command has the following options:

▪ -dir names the directory to put the crawl in.

▪ -threads determines the number of threads that will fetch in parallel.

▪ -depth indicates the link depth from the root page that should be crawled.

▪ -topN determines the maximum number of pages that will be retrieved at each level, up to the depth.

Here is the command you need to run:

bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

This command will start Nutch crawling with a link depth of 2 and a maximum of 1000 pages fetched at each level.

When crawling is done, you can run the following command:

bin/nutch solrindex http://localhost:8983/solr/core0 crawl/crawldb crawl/linkdb crawl/segments/*

This will index the crawled data into the Solr core you specify.
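To check that documents actually made it in, you can hit the /nutch handler we configured earlier. The host, core name, and search term below are placeholders for your own setup; if you're running the default single core, drop the core0 segment from the path:

http://localhost:8983/solr/core0/nutch?q=your+search+term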

Conclusion

Now you should understand a few basics and some more advanced ways to set up Solr. This article is just intended to get you going. I've attached a Word doc, Nutch Solr Setup Doc, that I use as my setup guide for Solr and Nutch.

If you feel lost, or you think you have a better way to set up and use Solr and Nutch, please feel free to comment as it will help other users too. I will update this article with your comment as a reference.

