The Thought Lab

June 27, 2007

Adding The Non-required Term Patch To Nutch

At my new job I’ve been put on the search engine team (I say team since, with me added, it’s now a team of two). One of the new requirements for the search was support for OR-based searches rather than just Nutch’s implicit AND. After looking around a bit and asking on the Nutch mailing list I discovered a patch which gives Nutch support for this (NUTCH-479). I've spent a little time getting it all running smoothly so I thought it might be useful to blog about it. Note that all examples are written for the Linux environment.

First step is getting the Nutch source bundle - the file will be named something like nutch-0.9.tar.gz - and the NUTCH-479 patch.

Unpack the Nutch source bundle, navigate to the java source path and apply the patch.
# tar -zxvf nutch-0.9.tar.gz
# cd nutch-0.9/src/java/
# patch -p 0 < or.patch

Now that we have the code changed we need to compile Nutch. There are two steps to this: rebuilding the javacc syntax file and then running the Ant build to produce the patched Nutch jar.
1. To build the javacc syntax files, get hold of javacc and run the following in the nutch-0.9/src/java/org/apache/nutch/analysis/ directory
  # javacc NutchAnalysis.jj
2. Running the Ant build is almost as simple. From the directory nutch-0.9 run
  # ant

This seemed to work well, but there is a problem with it. The patch adds the keyword OR, which you have to add to your queries like OR"term" (which converts to the Lucene query "term"). This works fine until you need to OR a fielded query: OR"field:term" converts to "field:term" rather than the field:"term" or field:term which I would have expected. To get around this I figured out javacc and modded the patch slightly. The modded patch works as shown below. I have submitted the patch to the patch page so if you need it you can get it from there.

nutch query => lucene query
OR field:term => field:term
field:term OR field2:term2 => +field:term field2:term2
etc...
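To make the mapping a bit more concrete, here is a toy PHP sketch which reproduces the two conversions listed above. It is only an illustration of the rule as I understand it (a term preceded by OR becomes optional, anything else stays required with a +). It is not the Nutch parser or the javacc grammar from the patch, and toLuceneSketch is a made-up name.

```php
<?php
/**
 * Toy illustration only: mimics the query mapping listed above.
 * This is NOT the Nutch parser, toLuceneSketch() is a hypothetical helper.
 */
function toLuceneSketch( $query )
{
    $tokens   = preg_split( '/\s+/', trim( $query ), -1, PREG_SPLIT_NO_EMPTY );
    $clauses  = array();
    $optional = false;

    foreach( $tokens as $token ) {
        if( $token === 'OR' ) {
            // the next term is optional, so it gets no '+' prefix
            $optional = true;
            continue;
        }
        // terms without a preceding OR stay required (implicit AND)
        $clauses[] = ( $optional ? '' : '+' ) . $token;
        $optional  = false;
    }

    return implode( ' ', $clauses );
}

echo toLuceneSketch( 'OR field:term' ) . "\n";
echo toLuceneSketch( 'field:term OR field2:term2' ) . "\n";
```

The sketch splits on whitespace only, so quoted phrases are not handled. The real patch deals with those in the grammar.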

May 29, 2007

Foodr The Fruit and Veg Search

Looks like, with my Symfony idea earlier today, I slightly misunderstood what kind of projects are done at Hack Day. Oh well, I guess I'll save that one for another time. I have, however, come up with another idea which I think would fit in very well. I'll call it Foodr. Basically it's a mashup of Flickr (via retrievr or directly and then managed with imgSeek), Wikipedia and Google Base Recipe Search. A user takes a photo of a fruit or vegetable and sends the image to the application, which searches for it, then retrieves details of it from Wikipedia and recipes which use it from Google Base. Obviously retrievr won't work directly as the base is too wide, so I was thinking of getting people to tag fruit and veg images with 'fruitr' and the name of the fruit / vegetable, then the image library can be restricted to just relevant images.

The Collaborative Cook

I've been thinking about project ideas for HackDay over this past weekend. I've managed to come up with two ideas so far, one very useful but also very boring and one mostly for fun. I'm not sure if I'm going to take either of them any further at HackDay as I really don't know what the event's going to be like. I'm tempted to turn up and go along with other people's projects to get a feel for the whole thing. Either way, I think the projects could be quite good so I'm recording them here for future reference.

Project One: Object Cache Item Removal Plugin for Symfony

The name could be improved, yes, but I really can't think of anything better which gets across the intent as well. The point of this plugin would be to automatically remove items from the object cache (as opposed to the config cache etc) when the Propel objects on which they depend change. At the moment if you use the caching functionality in Symfony you have to remember to drop all the relevant cache items when changing a particular Propel object. Although this gives you the most flexibility it's also quite a lot of work and opens up the possibility for some pretty nasty 'stale data' bugs.

The basic structure would include something which watches objects coming out of Propel and determines what an action / partial / fragment / whatever is dependent on, something which manages these dependencies, and then something which drops all dependent cache items when one of the dependencies is changed.

After thinking about the idea for a while I've come to realize that it's going to be quite tricky (although I don't think impossible). The problem is the range of rules for removal. Below I have listed the ones I have come up with so far. I have used a recipe site scenario for examples. The recipe site consists of recipes which have one creator, multiple ingredients (in a many-to-many relationship) and multiple method steps (in a one-to-many relationship).

Extracted objects: A cache item is dependent on all Propel objects which are accessible within its scope. For example, a 'recipe details' action would be dependent on the recipe, the recipe's user, all the ingredients related to it and all the method steps related to it. This is the most basic rule.
Relationships: A cache item is dependent on table references between Propel objects. For example, a 'recipe details' action would be dependent on all recipe ingredients and method steps which relate to the recipe concerned. This rule handles new recipe ingredients and method steps being added while not affecting existing ones or the recipe.
Whole class: A cache item is dropped when any object of a particular Propel class is saved. For example, a 'most popular recipe' action would be dependent on every recipe rating object, so if any of them changes, the item must be dropped. There could be a number of variations on this rule such as only kicking in on newly created objects (for a 'newest recipes' action for example) or only relating to a particular field on the object.

I am sure there are other rules which I have not thought of so I would want to write the plugin with an extensible rules architecture. I think this would be an incredibly difficult project to get exactly right but I also think it could be a very useful one, if it is done right.
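To give a feel for the 'something which manages these dependencies' part, here is a minimal sketch of a dependency registry. Everything in it is hypothetical: the class name, the method names and the 'Recipe:42' style identifiers are made up for illustration and are not part of Symfony or Propel. The watching of objects and the actual cache removal are left out.

```php
<?php
/**
 * Hypothetical sketch: none of these names come from Symfony or Propel.
 * Maps dependency identifiers to the cache items which rely on them.
 */
class CacheDependencyRegistry
{
    // dependency id => set of cache keys (stored as array keys to avoid duplicates)
    private $dependents = array();

    /**
     * Record that a cache item depends on each of the given dependencies
     */
    public function depend( $cacheKey, array $dependencyIds )
    {
        foreach( $dependencyIds as $id ) {
            $this->dependents[$id][$cacheKey] = true;
        }
    }

    /**
     * Return the cache keys to drop when the given dependency changes
     */
    public function invalidate( $dependencyId )
    {
        if( !isset( $this->dependents[$dependencyId] ) ) {
            return array();
        }
        return array_keys( $this->dependents[$dependencyId] );
    }
}

$registry = new CacheDependencyRegistry();

// 'Extracted objects' rule: the action depends on specific objects
$registry->depend( 'recipe/details/42', array( 'Recipe:42', 'User:7', 'Ingredient:3' ) );
// 'Whole class' rule: the action depends on every object of a class
$registry->depend( 'recipe/mostPopular', array( 'RecipeRating:*' ) );

print_r( $registry->invalidate( 'Recipe:42' ) );
print_r( $registry->invalidate( 'RecipeRating:*' ) );
```

Each rule would then only need to decide which identifiers to register and which to invalidate on save, which is roughly where I imagine the extensible part living.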

The Collaborative Cook: A Recipe Orientated Wiki

This is the more fun idea, the title says it all really. There are two main angles to look at here, collaboration to correct mistakes and add information on a recipe which is pretty much the same as regular wikis, and then there's the idea of collaboration as a creative process to create new recipes together. The second angle is one which I think is much more interesting. It would need to handle branching very well so that recipes could be developed in different directions. It would also benefit from a way of merging two branches together if it turns out they are going in the same direction. I think lessons could be learnt from the collaborative storytelling sphere for this side of things.

May 25, 2007

I'm going to Hack Day!

Hooray, I'm going to Hack Day! Now all I have to do is think of something to do there.

April 04, 2007

Non-blocking I/O With PHP-MIO

A couple of weeks ago I was thinking about non-blocking I/O in PHP, specifically about how clunky PHP's select implementation is. I say clunky because it's not bad, it's just not as easy to use as it could be. It's not as easy, for example, as the implementation found in Java's NIO package which is beautifully simple to use. The main issue I have with PHP's implementation is that I am responsible for keeping track of everything. I have to remember which streams I'm interested in writing to, which streams I'm interested in reading from and, when I get to accepting connections, which streams are server sockets that I'm interested in accepting connections on. I'm lazy, I don't want to have to do that, I want a library to handle all that for me. At this point I decided to implement something similar to Java's non-blocking I/O in PHP5. This is now finished and up on SourceForge (under the name of phpmio). In this article I hope to give you enough information to get up and running with the package.
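To make that bookkeeping concrete, here is a minimal sketch using PHP's own stream_select() rather than PHP MIO. It assumes a Linux box (it uses a Unix socket pair in place of a real network connection so that it runs on its own), and the variable names are just for illustration.

```php
<?php
// A connected pair of local sockets stands in for a network connection
list( $a, $b ) = stream_socket_pair( STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP );
stream_set_blocking( $a, 0 );
stream_set_blocking( $b, 0 );

fwrite( $a, 'hello' );

// The bookkeeping: we have to maintain these interest lists ourselves
$read   = array( $b );  // streams we want to read from
$write  = array( $a );  // streams we want to write to
$except = null;

// stream_select() modifies the arrays in place, leaving only the ready
// streams, so the lists have to be rebuilt before every call
$ready = stream_select( $read, $write, $except, 1 );

$received = '';
foreach( $read as $stream ) {
    $received .= fread( $stream, 1024 );
}

echo "Ready streams: $ready\n";
echo "Received: $received\n";
```

With two streams this is manageable, but with a server socket and a few hundred clients the three arrays quickly become the bulk of the program.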

But What Is Multiplexed I/O?

Before I go any further I suppose I should explain exactly what multiplexed (or non-blocking) I/O actually is. When reading from or writing to a stream PHP usually blocks until the operation is complete. However, a stream's blocking mode can be set such that operations on the stream don't block and instead return immediately. Used correctly this technique can vastly improve performance in networked applications. This comes at the price of increased complexity and, some would argue, a more confusing program flow. For this reason I wouldn't suggest it for trivial applications. Let's take a look at this in action. In the example below we open a stream to Amazon and try to read some data from it, then display how long each operation took and how much data was read. If this is run with the stream's blocking mode set to 0 (non-blocking) you will notice that the read takes very little time and not all of the bytes are read. If, on the other hand, the stream's blocking mode is set to 1 (blocking) you will notice that the read takes much longer and all 2048 bytes are read.

<?php
$start = microtime( true );

$fp    = fopen( 'http://www.amazon.co.uk/', 'r' );
$open  = microtime( true );

stream_set_blocking( $fp, 1 );
$block = microtime( true );

$data  = fread( $fp, 2048 );
$read  = microtime( true );

// the time taken to open the url
echo "Open:  " . ($open-$start) . "\n";
// the time taken to set the stream to blocking
echo "Block: " . ($block-$open) . "\n";
// the time taken to read from the stream
echo "Read:  " . ($read-$block)  . "\n";
// the amount of data read
echo "Data Read: " . strlen( $data ) . "\n";
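The timings in the Amazon example depend on the network, so they will vary from run to run. As a self-contained variation, the sketch below shows the same non-blocking behaviour on a local Unix socket pair (so it assumes Linux). The variable names are my own and nothing here is part of PHP MIO.

```php
<?php
list( $near, $far ) = stream_socket_pair( STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP );

// In non-blocking mode a read with nothing available returns straight away
stream_set_blocking( $near, 0 );
$start = microtime( true );
$empty = fread( $near, 2048 );
$took  = microtime( true ) - $start;

echo "Bytes read before any data was sent: " . strlen( $empty ) . "\n";
echo "Time taken: " . $took . "\n";

// Once the other end has written something, the same call returns the data
fwrite( $far, 'some data' );
$data = fread( $near, 2048 );
echo "Bytes read after data was sent: " . strlen( $data ) . "\n";
```

Had the first read been done in blocking mode it would have sat there waiting for the other end to send something, which is exactly what we want to avoid when juggling several streams at once.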

OK, So What Is PHP MIO?

So, how does multiplexed I/O work with the PHP MIO package? Within the package there are three key classes: MioStream, MioSelector and MioSelectionKey. There is also a factory class to provide a convenient way of creating different types of streams, MioStreamFactory, and a few Exception classes. Before diving into the core of the package let us take a quick look at MioStreamFactory to get it out of the way. Below are three examples of how it can be used. Each method creates an instance of the MioStream class which wraps a PHP stream. The first method, createSocketStream, creates (as you might expect) a socket stream (as would be created with fsockopen); the second a server socket (stream_socket_server); and the third a file stream (fopen). One thing which should be noted is that creating an MioStream with MioStreamFactory implicitly sets its blocking flag to 0.

<?php
$factory = new MioStreamFactory();

// Create a client socket stream
$socket = $factory->createSocketStream( '127.0.0.1', 8888 );

// Create a server socket stream
$server = $factory->createServerStream( '127.0.0.1:8888' );

// Create a local file stream
$file   = $factory->createFileStream( '/etc/hosts' );

An MioStream object allows us to do the basic things we would want to do with a stream such as reading and writing. Although MioStream only handles basic functions itself, it does give you access to its underlying stream resource in case you need to do anything more advanced. In the code below we write some data to the socket stream we made in the last example and then read some data from it. We then attempt to accept a new stream on the server socket stream (note that the accept method will return an MioStream object). On this new stream we check that it's open, close it and then check that it's no longer open. Finally, we get hold of the socket stream's internal resource and append a stream filter to it.

<?php
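// write some data to the socket stream
$socket->write( 'Put some data' );

// read up to 1024 bytes back from it
$socket->read( 1024 );

// accept a new connection on the server stream,
// accept() returns an MioStream object
$stream = $server->accept();

// the newly accepted stream should be open
if( !$stream->isOpen() ) {
    trigger_error( "The new stream should be open", E_USER_ERROR );
}

// close it and check that it is no longer open
$stream->close();

if( $stream->isOpen() ) {
    trigger_error( "The new stream should now be closed", E_USER_ERROR );
}

// get hold of the underlying PHP stream resource
// and append a stream filter to it
stream_filter_append( $socket->getStream(), 'string.toupper' );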

Downloading The AMP Stack

Now that we know how to create MioStream objects and interact with them let's take a look at how they are used with the MioSelector. A selector is an object used for managing and selecting streams which are available for different types of work (reading, writing or accepting connections). This is done by registering MioStream objects with the selector. The relationship between the selector and each stream is encapsulated in an MioSelectionKey object. When we register a stream with a selector we also provide what we're interested in for this stream (this can be reading, writing or accepting connections) and optionally an object which we want associated with the stream (so we know what to do with it). Once we have registered our streams with the selector we can call the select method to get all registered streams which are ready for any of the operations we are interested in. To get an idea of how this works let's take a look at a simple example, downloading three files.

In this example we open a stream to each remote file we want to download and one for each local file we want to write to. We then register each remote stream with the selector, attaching the stream's respective local stream for writing to. Once we have registered them we loop over the selector's select method. The select method returns the number of ready streams (streams which are available for one of the actions we have registered an interest in) or false if there are no streams registered with the selector. An important note to make here is that streams are automatically unregistered from the selector when they are closed so in this case we don't have to explicitly unregister them. Now we can loop over all the streams which have been selected and perform our action on them. In this case we read from the remote stream and write the data to its associated local stream.

<?php
$selector = new MioSelector();
$factory  = new MioStreamFactory();

// Create and register streams to download the PHP 5.2.1 source
$reader = $factory->createFileStream( 'http://uk.php.net/get/php-5.2.1.tar.bz2/from/this/mirror', 'r' );
$writer = $factory->createFileStream( 'php-5.2.1.tar.bz2', 'w+' );
$selector->register( $reader, MioSelectionKey::OP_READ, $writer );

// Create and register streams to download the MySQL 5.11.15 binary
$reader = $factory->createFileStream( 'http://dev.mysql.com/get/Downloads/MySQL-5.1/mysql-5.11.15-beta-linux-i686-glibc23.tar.gz/from/http://mirrors.dedipower.com/www.mysql.com/', 'r' );
$writer = $factory->createFileStream( 'mysql-5.11.15-beta-linux-i686-glibc23.tar.gz', 'w+' );
$selector->register( $reader, MioSelectionKey::OP_READ, $writer );

// Create and register streams to download the Apache 2.2.4 source
$reader = $factory->createFileStream( 'http://www.mirrorservice.org/sites/ftp.apache.org/httpd/httpd-2.2.4.tar.bz2', 'r' );
$writer = $factory->createFileStream( 'httpd-2.2.4.tar.bz2', 'w+' );
$selector->register( $reader, MioSelectionKey::OP_READ, $writer );

while( true ) {
    // Loop over select until we have some streams to act on
    while( !$count = $selector->select() ) {
        if( $count === false ) {
            $selector->close();
            break 2;
        }
    }

    // Loop over all streams which are available for 
    // something we're interested in
    foreach( $selector->selected_keys as $key ) {
        if( $key->isReadable() ) {
            $key->attachment->write(
                $key->stream->read( 16384 )
            );
        }
    }
}

Serving Up Echoes

I think we have a good understanding of how PHP MIO works now so let's take a look at a server example. To keep it simple I'm going to do an echo server. This example will accept connections on port 7, read data in and then send it straight back. First off, we're going to need a class to encapsulate the echoing.

<?php
/**
 * A class to echo data.
 * This is essentially just a FIFO queue. Data can be
 * added onto the end of the buffer and at a later date
 * it can be read (and implicitly removed) from the
 * beginning of the buffer.
 */
class Echoer
{
    /**
     * Holds the data until it needs to
     * be echoed back
     */
    private $buffer='';

    /**
     * Add some data to the buffer
     *
     * @param string $data
     * @return void
     */
    public function put( $data )
    {
        $this->buffer .= $data;
    }

    /**
     * Read and remove a chunk of data from
     * the start of the buffer
     *
     * @param int $size The amount of data to read
     * @return string
     */
    public function get( $size = 4096 )
    {
        $data = substr( $this->buffer, 0, $size );
        $this->buffer = substr( $this->buffer, $size );
        return $data;
    }
}

Now we need to set up our server and get it working. What we're going to do is accept connections and then register these with the selector with an interest in reading. These streams will then appear in later selects, where we can read a chunk of data off, put it in the echoer and reset the selection key's interest to writing so that we can echo the data back down the line.

<?php
// Create our base objects
$selector = new MioSelector();
$factory  = new MioStreamFactory();
 
// Register a server stream with the selector
$selector->register(
    // the server stream is listening on 127.0.0.1 port 7
    $factory->createServerStream( '127.0.0.1:7' ),
    // we are interested in accepting connections
    MioSelectionKey::OP_ACCEPT
);
 
// loop forever, this is going to be a server
while( true ) {
    // keep selecting until there's something to do
    while( !$count = $selector->select() ) { }
 
    // when there's something to do loop over the ready set
    foreach( $selector->selected_keys as $key ) {
        // do different actions for different ready ops
        if( $key->isAcceptable() ) {
            // if the stream has connections ready to
            // accept then accept them until there's no more
            while( $stream = $key->stream->accept() ) {
                // register the newly accepted connection with the
                // selector so that it is handled in subsequent operations
                $selector->register(
                    $stream,
                    // we are interested in reading from the stream
                    MioSelectionKey::OP_READ,
                    // attach an instance of the echoer to manage echoing
                    new Echoer()
                );
            }
        } elseif( $key->isReadable() ) {
            // if the stream is ready for reading then
            // read a chunk of data off it and add it to
            // the echoer
            $key->attachment->put(
                $key->stream->read( 4096 )
            );
            // now we're interested in writing back down the pipe
            $key->setInterestOps( MioSelectionKey::OP_WRITE );
        } elseif( $key->isWritable() ) {
            // if the stream is ready for writing then
            // get some data from the echoer
            $data = $key->attachment->get();
            if( $data ) {
                // if there's data there then send it back
                $key->stream->write(
                    $data
                );
            } else {
                // if there's none then remove the key
                $selector->removeKey( $key );
            }
        }
    }
}

So, now we've done a multiplexed downloader and a multiplexed server, handling several PHP streams at once without blocking on any single one of them. PHP may not be the first choice for writing high performance networking applications but for knocking up, in a matter of minutes, something which performs pretty damned well, I think this could do the trick.

March 07, 2007

XMI 2 SQL in No Time

I've just discovered how easy XMI (as used by Umbrello) is to parse. I spent most of yesterday putting together an entity relationship diagram of the DB structure for integrating a data warehouse with our catalogue. Today, faced with the prospect of having to hand craft about 30 tables I decided to take a look at the XMI format. It's really simple. I managed to knock up a quick parser to build SQL from the XMI file and hey presto! All my tables are built.

I don't have any private hosting at the moment but I'll get it up when I do.

February 10, 2007

Discovering Kontact

This morning I made an amazing discovery: Kontact. I don't really know how I managed to miss this for so long; it's even installed by default in my version of Kubuntu so I really have no excuse. It is basically a hub for a number of standard KDE applications: KMail (mail), KAddressBook (addresses), KOrganizer (calendar, todo and journal), KNotes (notes) and Akregator (feeds). Some of these I have been using for quite a while but some are completely new to me, namely KOrganizer. I came across Kontact while I was looking for a calendar program to put in some upcoming foody events around the South of England. Now, I've used a few calendar programs before but I've always become annoyed with them for one reason or another, usually related to how difficult they become to integrate with everything else. I don't think that's going to be a problem any more. Just one example of how nicely it all fits together is how birthdays from my address book can be automatically inserted into my calendar and kept up to date. Brilliant!