June 27, 2007

Adding The Non-required Term Patch To Nutch

At my new job I’ve been put on the search engine team (I say team because, with me added, it’s now a team of two). One of the new requirements for search was support for OR-based searches rather than just Nutch’s implicit AND. After looking around a bit and asking on the Nutch mailing list, I discovered a patch that gives Nutch this support (NUTCH-479). I've spent a little time getting it all running smoothly, so I thought it might be useful to blog about it. Note that all examples are written for a Linux environment.

  1. First step is getting the Nutch source bundle - the file will be named something like nutch-0.9.tar.gz - and the NUTCH-479 patch.

  2. Unpack the Nutch source bundle, navigate to the Java source path and apply the patch (the commands below assume it was saved as or.patch in that directory).

    # tar -zxvf nutch-0.9.tar.gz
    # cd nutch-0.9/src/java/
    # patch -p 0 < or.patch

  3. Now that we have the code changed, we need to compile Nutch. There are two steps to this: rebuilding the javacc syntax file and then running the Ant build to produce the patched Nutch jar.

    1. To build the javacc syntax files, get hold of javacc and run the following in the nutch-0.9/src/java/org/apache/nutch/analysis/ directory:

      # javacc NutchAnalysis.jj

    2. Running the Ant build is almost as simple. From the nutch-0.9 directory, run:

      # ant
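If you end up rebuilding more than once, the steps above can be rolled into a single function. This is only a rough sketch: `run`, `build_patched_nutch` and the `DRY_RUN` switch are names I made up for illustration, and it assumes the tarball sits in the current directory, unpacks into a directory named after itself, and that the patch is given as an absolute path.

```shell
# Hypothetical helpers, not part of Nutch.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: build_patched_nutch nutch-0.9.tar.gz /absolute/path/to/or.patch
# The body runs in a subshell so the caller's working directory is untouched.
build_patched_nutch() (
  tarball=$1
  patch_file=$2
  # Assumption: nutch-0.9.tar.gz unpacks into ./nutch-0.9
  src_root="$PWD/$(basename "$tarball" .tar.gz)"

  run tar -zxf "$tarball" &&
  run cd "$src_root/src/java" &&
  run patch -p0 -i "$patch_file" &&
  run cd "$src_root/src/java/org/apache/nutch/analysis" &&
  run javacc NutchAnalysis.jj &&
  run cd "$src_root" &&
  run ant
)
```

Running it with `DRY_RUN=1` first is a cheap way to check the paths before anything is unpacked or patched.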

This seemed to work well, but there is a problem with it. The patch adds the keyword OR, which you have to add into your queries like OR"term" (which converts to the lucene query "term"). This works fine until you need to OR a fielded query: OR"field:term" converts to "field:term" rather than field:"term" or field:term, which is what I would have expected. To get around this I figured out javacc and modified the patch slightly. The modified patch works as shown below. I have submitted it to the patch page, so if you need it you can get it from there.

nutch query => lucene query
OR field:term => field:term
field:term OR field2:term2 => +field:term field2:term2
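
To make the rule behind that mapping explicit: every term is required (prefixed with +) unless it directly follows OR, in which case it is left optional. The function below is a toy model of that rule only, written to match the two rows above. It is not the patch's javacc grammar, `nutch_to_lucene` is a made-up name, and it does not handle quoted phrases.

```shell
# Toy model of the mapping above; NOT the actual parser from the patch.
# A term that directly follows OR is optional, every other term gets a "+".
nutch_to_lucene() (
  set -f        # disable globbing so terms such as "te*" are left alone
  optional=0
  out=""
  for word in $1; do
    if [ "$word" = "OR" ]; then
      optional=1
      continue
    fi
    if [ "$optional" -eq 1 ]; then
      out="$out $word"
    else
      out="$out +$word"
    fi
    optional=0
  done
  printf '%s\n' "${out# }"
)
```

For example, `nutch_to_lucene 'field:term OR field2:term2'` reproduces the second row of the mapping.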
