June 27, 2007

Adding The Non-required Term Patch To Nutch

At my new job I've been put on the search engine team (I say team; with me added it's now a team of two). One of the new requirements for the search was support for OR-based searches rather than just Nutch's implicit AND. After looking around a bit and asking on the Nutch mailing list I discovered a patch which gives Nutch support for this (NUTCH-479). I've spent a little time getting it all running smoothly, so I thought it might be useful to blog about it. Note that all examples are written for the Linux environment.


  1. The first step is getting the Nutch source bundle - the file will be named something like nutch-0.9.tar.gz - and the NUTCH-479 patch.

  2. Unpack the Nutch source bundle, navigate to the java source path and apply the patch.

    # tar -zxvf nutch-0.9.tar.gz
    # cd nutch-0.9/src/java/
    # patch -p 0 < or.patch

  3. Now that we have the code changed we need to compile Nutch. There are two steps to this: rebuilding the JavaCC syntax file and then running the Ant build to produce the patched Nutch jar.

    1. To build the JavaCC syntax files, get hold of JavaCC and run the following in the nutch-0.9/src/java/org/apache/nutch/analysis/ directory:

      # javacc NutchAnalysis.jj


    2. Running the Ant build is almost as simple; from the nutch-0.9 directory run:

      # ant

This seemed to work well, but there is a problem with it. The patch adds the keyword OR, which you have to add into your queries like OR"term" (which converts to the Lucene query "term"). This works fine until you need to OR a fielded query: OR"field:term" converts to "field:term" rather than field:"term" or field:term, which is what I would have expected. To get around this I figured out JavaCC and modified the patch slightly. The modified patch works as shown below. I have submitted it to the patch page, so if you need it you can get it from there.


Nutch query => Lucene query
OR field:term => field:term
field:term OR field2:term2 => +field:term field2:term2
etc...
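
As a quick sanity check of the patched build, you can run a query through the NutchBean class from the command line. This assumes you already have a crawl directory for it to search; the query itself is just an illustration.

    # bin/nutch org.apache.nutch.searcher.NutchBean 'apple OR"banana"'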

May 29, 2007

Foodr: The Fruit and Veg Search

Looks like I slightly misunderstood what kind of projects were done at Hack Day earlier today with my Symfony idea. Oh well, I guess I'll save that one for another time. I have, however, come up with another idea which I think would fit in very well. I'll call it Foodr. Basically it's a mashup of Flickr (via retrievr or directly, and then managed with imgSeek), Wikipedia and Google Base Recipe Search. A user takes a photo of a fruit or vegetable and sends the image to the application, which searches for it, retrieves details about it from Wikipedia and recipes using it from Google Base. Obviously retrievr won't work directly as the base is too wide, so I was thinking of getting people to tag fruit and veg images with 'fruitr' and the name of the fruit / vegetable; the image library can then be restricted to just relevant images.

The Collaborative Cook

I've been thinking about project ideas for HackDay over this past weekend. I've managed to come up with two ideas so far: one very useful but also very boring, and one mostly for fun. I'm not sure if I'm going to take either of them any further at HackDay as I really don't know what the event's going to be like. I'm tempted to turn up and go along with other people's projects to get a feel for the whole thing. Either way, I think the projects could be quite good so I'm recording them here for future reference.

Project One: Object Cache Item Removal Plugin for Symfony

The name could be improved, yes, but I really can't think of anything better which gets the intent across as well. The point of this plugin would be to automatically remove items from the object cache (as opposed to the config cache etc.) when the Propel objects on which they depend change. At the moment, if you use the caching functionality in Symfony you have to remember to drop all the relevant cache items when changing a particular Propel object. Although this gives you the most flexibility, it's also quite a lot of work and opens up the possibility of some pretty nasty 'stale data' bugs.

The basic structure would include something which watches objects coming out of Propel and determines what an action / partial / fragment / whatever is dependent on, something which manages these dependencies, and then something which drops all dependent cache items when one of the dependencies is changed.

After thinking about the idea for a while I've come to realize that it's going to be quite tricky (although I don't think impossible). The problem is the range of rules for removal. Below I have listed the ones I have come up with so far. I have used a recipe site scenario for examples. The recipe site consists of recipes which have one creator, multiple ingredients (in a many-to-many relationship) and multiple method steps (in a one-to-many relationship).

Extracted objects
A cache item is dependent on all Propel objects which are accessible within its scope. For example, a 'recipe details' action would be dependent on the recipe, the recipe's user, all the ingredients related to it and all the method steps related to it. This is the most basic rule.
Relationships
A cache item is dependent on table references between Propel objects. For example, a 'recipe details' action would be dependent on all recipe ingredients and method steps which relate to the recipe concerned. This rule handles new recipe ingredients and method steps being added while not affecting existing ones or the recipe.
Whole class
A cache item is dropped when any object of a particular Propel class is saved. For example, a 'most popular recipe' action would be dependent on any recipe rating object, so if any of them change, the item must be dropped. There could be a number of variations on this rule, such as only kicking in on newly created objects (for a 'newest recipes' action, for example) or only relating to a particular field on the object.

I am sure there are other rules which I have not thought of, so I would want to write the plugin with an extensible rules architecture. I think this would be an incredibly difficult project to get exactly right, but I also think it could be a very useful one if it is done right.
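
To make this a little more concrete, here's a very rough sketch of what an extensible rule architecture might look like. Every name in it is invented for illustration; this isn't real Symfony or Propel API, just the general shape of the thing.

<?php
// Illustrative sketch only: all class and method names are invented.

// A rule decides which cache items depend on a saved Propel object
interface sfCacheDependencyRule
{
    // return the cache keys which should be dropped when $object is saved
    public function getStaleKeys( $object );
}

// The 'whole class' rule: drop the given items whenever any
// object of a particular class is saved
class sfWholeClassRule implements sfCacheDependencyRule
{
    private $class;
    private $keys;

    public function __construct( $class, array $keys )
    {
        $this->class = $class;
        $this->keys = $keys;
    }

    public function getStaleKeys( $object )
    {
        $class = $this->class;
        return ( $object instanceof $class ) ? $this->keys : array();
    }
}

// The manager would be hooked into Propel's save() and would ask
// every registered rule which cache items have become stale
class sfCacheDependencyManager
{
    private $rules = array();

    public function addRule( sfCacheDependencyRule $rule )
    {
        $this->rules[] = $rule;
    }

    public function objectSaved( $object, $cache )
    {
        foreach( $this->rules as $rule ) {
            foreach( $rule->getStaleKeys( $object ) as $key ) {
                $cache->remove( $key );
            }
        }
    }
}

The 'extracted objects' and 'relationships' rules would just be further implementations of the same interface, which is what makes the architecture extensible.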

The Collaborative Cook: A Recipe Orientated Wiki

This is the more fun idea, the title says it all really. There are two main angles to look at here, collaboration to correct mistakes and add information on a recipe which is pretty much the same as regular wikis, and then there's the idea of collaboration as a creative process to create new recipes together. The second angle is one which I think is much more interesting. It would need to handle branching very well so that recipes could be developed in different directions. It would also benefit from a way of merging two branches together if it turns out they are going in the same direction. I think lessons could be learnt from the collaborative storytelling sphere for this side of things.

May 25, 2007

I'm going to Hack Day!

Hooray, I'm going to Hack Day! Now all I have to do is think of something to do there.
Hack Day: London, June 16/17 2007

April 04, 2007

Non-blocking I/O With PHP-MIO

A couple of weeks ago I was thinking about non-blocking I/O in PHP, specifically about how clunky PHP's select implementation is. I say clunky because it's not bad, it's just not as easy to use as it could be. It's not as easy, for example, as the implementation found in Java's NIO package, which is beautifully simple to use. The main issue I have with PHP's implementation is that I am responsible for keeping track of everything. I have to remember which streams I'm interested in writing to, which streams I'm interested in reading from and, when I get to accepting connections, which streams are server sockets that I'm interested in accepting connections on. I'm lazy, I don't want to have to do that; I want a library to handle all that for me. At this point I decided to implement something similar to Java's non-blocking I/O in PHP5. This is now finished and up on SourceForge (under the name phpmio). In this article I hope to give you enough information to get up and running with the package.


But What Is Multiplexed I/O?

Before I go any further I suppose I should explain exactly what multiplexed (or non-blocking) I/O actually is. When reading from or writing to a stream, PHP usually blocks until the operation is complete; however, a stream's blocking mode can be set so that operations on it don't block and instead return immediately. Used correctly, this technique can vastly improve performance in networked applications. It comes at the price of increased complexity and, some would argue, a more confusing program flow, so I wouldn't suggest it for trivial applications. Let's take a look at this in action. In the example below we open a stream to Amazon and try to read some data from it, then display how long each operation took and how much data was read. If this is run with the stream's blocking mode set to 0 (non-blocking) you will notice that the read takes very little time and not all of the bytes are read. If, on the other hand, the stream's blocking mode is set to 1 (blocking) you will notice that the read takes much longer and all 2048 bytes are read.


<?php
$start = microtime( true );

$fp = fopen( 'http://www.amazon.co.uk/', 'r' );
$open = microtime( true );

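// set this to 0 instead to reproduce the non-blocking case described above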
stream_set_blocking( $fp, 1 );
$block = microtime( true );

$data = fread( $fp, 2048 );
$read = microtime( true );

// the time taken to open the url
echo "Open: " . ($open-$start) . "\n";
// the time taken to set the stream to blocking
echo "Block: " . ($block-$open) . "\n";
// the time taken to read from the stream
echo "Read: " . ($read-$block) . "\n";
// the amount of data read
echo "Data Read: " . strlen( $data ) . "\n";

OK, So What Is PHP MIO?

So, how does multiplexed I/O work with the PHP MIO package? Within the package there are three key classes: MioStream, MioSelector and MioSelectionKey. There is also a factory class, MioStreamFactory, which provides a convenient way of creating different types of streams, and a few Exception classes. Before diving into the core of the package, let's take a quick look at MioStreamFactory to get it out of the way. Below are three examples of how it can be used. Each method creates an instance of the MioStream class, which wraps a PHP stream. The first method, createSocketStream, creates (as you might expect) a socket stream (as would be created with fsockopen); the second a server socket (stream_socket_server); and the third a file stream (fopen). One thing which should be noted is that creating an MioStream with MioStreamFactory implicitly sets its blocking flag to 0.


<?php
$factory = new MioStreamFactory();

// Create a client socket stream
$socket = $factory->createSocketStream( '127.0.0.1', 8888 );

// Create a server socket stream
$server = $factory->createServerStream( '127.0.0.1:8888' );

// Create a local file stream
$file = $factory->createFileStream( '/etc/hosts' );

An MioStream object allows us to do the basic things we would want to do with a stream, such as reading and writing. Although MioStream only handles basic functions itself, it does give you access to its underlying stream resource in case you need to do anything more advanced. In the code below we write some data to the socket stream we made in the last example and then read some data from it. We then accept a new stream on the server socket stream (note that the accept method will return an MioStream object). On this new stream we check that it's open, close it and then check that it's no longer open. Finally, we get hold of the socket stream's internal resource and append a stream filter to it.


<?php
$socket->write( 'Put some data' );

$socket->read( 1024 );

$stream = $server->accept();

if( !$stream->isOpen() ) {
    trigger_error( "The new stream should be open", E_USER_ERROR );
}

$stream->close();

if( $stream->isOpen() ) {
    trigger_error( "The new stream should now be closed", E_USER_ERROR );
}

stream_filter_append( $socket->getStream(), 'string.toupper' );

Downloading The AMP Stack

Now that we know how to create MioStream objects and interact with them, let's take a look at how they are used with the MioSelector. A selector is an object used for managing and selecting streams which are available for different types of work (reading, writing or accepting connections). This is done by registering MioStream objects with the selector; the relationship between the selector and each stream is encapsulated in an MioSelectionKey object. When we register a stream with a selector we also state what we're interested in for this stream (reading, writing or accepting connections) and optionally provide an object which we want associated with the stream (so we know what to do with it). Once we have registered our streams with the selector we can call the select method to get all registered streams which are ready for any of the operations we are interested in. To get an idea of how this works, let's take a look at a simple example: downloading three files.

In this example we open a stream to each remote file we want to download and one for each local file we want to write to. We then register each remote stream with the selector, attaching its respective local stream as the object to write to. Once we have registered them we loop over the selector's select method, which returns the number of ready streams (streams which are available for one of the actions we have registered an interest in) or false if there are no streams registered with the selector. An important note here is that streams are automatically unregistered from the selector when they are closed, so in this case we don't have to explicitly unregister them. Now we can loop over all the streams which have been selected and perform our action on them. In this case we read from the remote stream and write the data to its associated local stream.


<?php
$selector = new MioSelector();
$factory = new MioStreamFactory();

// Create and register streams to download the PHP 5.2.1 source
$reader = $factory->createFileStream( 'http://uk.php.net/get/php-5.2.1.tar.bz2/from/this/mirror', 'r' );
$writer = $factory->createFileStream( 'php-5.2.1.tar.bz2', 'w+' );
$selector->register( $reader, MioSelectionKey::OP_READ, $writer );

// Create and register streams to download the MySQL 5.11.15 binary
$reader = $factory->createFileStream( 'http://dev.mysql.com/get/Downloads/MySQL-5.1/mysql-5.11.15-beta-linux-i686-glibc23.tar.gz/from/http://mirrors.dedipower.com/www.mysql.com/', 'r' );
$writer = $factory->createFileStream( 'mysql-5.11.15-beta-linux-i686-glibc23.tar.gz', 'w+' );
$selector->register( $reader, MioSelectionKey::OP_READ, $writer );

// Create and register streams to download the Apache 2.2.4 source
$reader = $factory->createFileStream( 'http://www.mirrorservice.org/sites/ftp.apache.org/httpd/httpd-2.2.4.tar.bz2', 'r' );
$writer = $factory->createFileStream( 'httpd-2.2.4.tar.bz2', 'w+' );
$selector->register( $reader, MioSelectionKey::OP_READ, $writer );

while( true ) {
    // Loop over select until we have some streams to act on
    while( !($count = $selector->select()) ) {
        if( $count === false ) {
            $selector->close();
            break 2;
        }
    }

    // Loop over all streams which are available for
    // something we're interested in
    foreach( $selector->selected_keys as $key ) {
        if( $key->isReadable() ) {
            $key->attachment->write(
                $key->stream->read( 16384 )
            );
        }
    }
}

Serving Up Echoes

I think we have a good understanding of how PHP MIO works now, so let's take a look at a server example. To keep it simple I'm going to do an echo server. This example will accept connections on port 7, read data in and then send it straight back. First off, we're going to need a class to encapsulate the echoing.


<?php
/**
 * A class to echo data.
 * This is essentially just a FIFO queue. Data can be
 * added onto the end of the buffer and at a later date
 * it can be read (and implicitly removed) from the
 * beginning of the buffer.
 */
class Echoer
{
    /**
     * Holds the data until it needs to
     * be echoed back
     */
    private $buffer = '';

    /**
     * Add some data to the buffer
     *
     * @param string $data
     * @return void
     */
    public function put( $data )
    {
        $this->buffer .= $data;
    }

    /**
     * Read and remove a chunk of data from
     * the start of the buffer
     *
     * @param int $size The amount of data to read
     * @return string
     */
    public function get( $size = 4096 )
    {
        $data = substr( $this->buffer, 0, $size );
        $this->buffer = substr( $this->buffer, $size );
        return $data;
    }
}

Now we need to set up our server and get it working. What we're going to do is accept connections and then register them with the selector with an interest in reading. These streams will then appear in later selects, where we can read a chunk of data off, put it in the echoer and reset the selection key's interest to writing so that we can echo the data back down the line.


<?php
// Create our base objects
$selector = new MioSelector();
$factory = new MioStreamFactory();

// Register a server stream with the selector
$selector->register(
    // the server stream is listening on 127.0.0.1 port 7
    $factory->createServerStream( '127.0.0.1:7' ),
    // we are interested in accepting connections
    MioSelectionKey::OP_ACCEPT
);

// loop forever, this is going to be a server
while( true ) {
    // keep selecting until there's something to do
    while( !($count = $selector->select()) ) { }

    // when there's something to do loop over the ready set
    foreach( $selector->selected_keys as $key ) {
        // do different actions for different ready ops
        if( $key->isAcceptable() ) {
            // if the stream has connections ready to
            // accept then accept them until there are no more
            while( $stream = $key->stream->accept() ) {
                // register the newly accepted connection with the
                // selector so that it is handled in subsequent operations
                $selector->register(
                    $stream,
                    // we are interested in reading from the stream
                    MioSelectionKey::OP_READ,
                    // attach an instance of the echoer to manage echoing
                    new Echoer()
                );
            }
        } elseif( $key->isReadable() ) {
            // if the stream is ready for reading then
            // read a chunk of data off it and add it to
            // the echoer
            $key->attachment->put(
                $key->stream->read( 4096 )
            );
            // now we're interested in writing back down the pipe
            $key->setInterestOps( MioSelectionKey::OP_WRITE );
        } elseif( $key->isWritable() ) {
            // if the stream is ready for writing then
            // get some data from the echoer
            $data = $key->attachment->get();
            if( $data ) {
                // if there's data there then send it back
                $key->stream->write( $data );
            } else {
                // if there's none then remove the key
                $selector->removeKey( $key );
            }
        }
    }
}

So, now we've done a multiplexed downloader and a multiplexed server, and we've processed PHP streams in a high-performance, very efficient manner. PHP may not be the first choice for writing high-performance networking applications, but for knocking up, in a matter of minutes, something which performs pretty damned well, I think this could do the trick.

March 07, 2007

XMI 2 SQL in No Time

I've just discovered how easy XMI (as used by Umbrello) is to parse. I spent most of yesterday putting together an entity relationship diagram of the DB structure for integrating a data warehouse with our catalogue. Today, faced with the prospect of having to hand-craft about 30 tables, I decided to take a look at the XMI format. It's really simple. I managed to knock up a quick parser to build SQL from the XMI file and hey presto! All my tables are built.


I don't have any private hosting at the moment but I'll get it up when I do.
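
In the meantime, here's a minimal sketch of the idea. This is not the parser I actually wrote; the namespace URI and element names below are assumptions which vary between XMI versions, so check them against your own file first.

<?php
// Illustrative sketch only: walk an Umbrello-style XMI file with
// SimpleXML and print a CREATE TABLE statement per UML class.
$xmi = simplexml_load_file( 'diagram.xmi' );
$xmi->registerXPathNamespace( 'UML', 'org.omg.xmi.namespace.UML' );

foreach( $xmi->xpath( '//UML:Class' ) as $class ) {
    $class->registerXPathNamespace( 'UML', 'org.omg.xmi.namespace.UML' );
    $columns = array();
    foreach( $class->xpath( './/UML:Attribute' ) as $attr ) {
        // assumes the attribute's type is stored inline; Umbrello may
        // instead store a type id which needs resolving elsewhere
        $columns[] = '  ' . $attr['name'] . ' ' . $attr['type'];
    }
    echo 'CREATE TABLE ' . $class['name'] . " (\n"
       . implode( ",\n", $columns ) . "\n);\n\n";
}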

February 10, 2007

Discovering Kontact

This morning I made an amazing discovery: Kontact. I don't really know how I managed to miss this for so long; it's even installed by default on my version of Kubuntu, so I really have no excuse. It is basically a hub for a number of standard KDE applications; KMail (mail), KAddressBook (addresses), KOrganizer (calendar, todo and journal), KNotes (notes) and Akregator (feeds), some of which I have been using for quite a while but some of which are completely new to me, namely KOrganizer. I came across Kontact while I was looking for a calendar program to put in some upcoming foody events around the South of England. Now, I've used a few calendar programs before but I've always become annoyed with them for one reason or another, usually related to how difficult they are to integrate with everything else; I don't think that's going to be a problem any more. Just one example of how nicely it all fits together is how birthdays from my address book can be automatically inserted into my calendar and kept up to date. Brilliant!

February 08, 2007

Signing Up To Technorati

It's 8am and I can't be bothered to think up anything interesting to say, so I'll just post the Technorati blog claiming link: Technorati Profile

February 07, 2007

Selectable Streams In PHP

I've been toying with the idea of implementing selectable streams (like those in Java's NIO package) in PHP recently. I have had a good look around and there doesn't seem to be a simple, flexible non-blocking IO implementation out there. I have come across one very nice socket daemon library, which lets you build a server in minutes, which is very cool. However, it only solves building servers and forces the developer to build their server in a particular way. Hopefully selectable streams will be a useful solution.

January 25, 2007

A Night Out On The Town

Last night I went to the theatre for the first time in far too long. Thank you, Alex, for suggesting and organising the idea, it was a great night out. We went to see Love Song at the New Ambassadors Theatre just off Charing Cross Road and it was brilliant. I'll let those more qualified give you the breakdown and just say that I had a thoroughly enjoyable time.

January 22, 2007

Accessing Sharedance With PHPDance

Well, I've come a little late to this blogging thing, partly because I haven't had a huge amount to say which I think would be of much use to others. I still may not, but let's see.

This post, apart from starting with a short introduction, is about PHPDance, a PHP interface I wrote a couple of months ago for the Sharedance cache server. Sharedance is a distributed object cache much like Danga's Memcached, except that while Memcached only saves data to memory, Sharedance also writes it to the hard disk.


Just what is a distributed object cache?

Well, an object cache is a tool which allows you to store arbitrary data referenced by a key, and then retrieve it by that key at a later date. The distributed bit refers to how and where that data is stored. In a local object cache such as APC the data is cached directly on the local machine, whereas a distributed cache spreads its data across multiple machines. There are two main reasons for wanting to do this.


  1. Performance and making efficient use of the available resources. Imagine you have four webservers, each with 4Gb of memory (2Gb of which is reserved for the cache), and that you are using an object cache. Unless you have divided your application across the four servers (i.e. users.example.com, shop.example.com, review.example.com, payment.example.com), which is usually not possible, all four of your servers will be serving up roughly the same data. The effect on the cache is that pretty much the same data is being cached on each of the four servers, giving you effectively 2Gb of memory for your application. However, if you could spread the cache across the four servers you could quadruple the amount of available cache space, making 8Gb of cache available. This side of things is covered very well by Memcached, with its incredible speed.

  2. Avoiding the database altogether, which requires resilience. Most of the time you'll be caching data which can be found in another source such as the database. However, there are situations where you don't want to have to store data in the database: when there are likely to be a large number of writes involved and there won't be much benefit gained. Session data is a very good example of this. It would be dangerous to store session data in Memcached only, as if the cache gets full old session data will fall off the cache. In this situation something like Sharedance is more appropriate, as data is also saved to the hard disk so old session data is not lost (unless it is given an expiry or is explicitly deleted).

Enter PHPDance

PHPDance provides a clean, object oriented, PHP5 interface to Sharedance. Let's take a look at an example for a setup similar to the one mentioned earlier: four webservers, each running an instance of the cache server (in this case Sharedance).


<?php
require 'sharedance.class.php';
$cache = new Sharedance();
$cache->addServer( new SharedanceServer( 'web1.example.com' ) );
$cache->addServer( new SharedanceServer( 'web2.example.com' ) );
$cache->addServer( new SharedanceServer( 'web3.example.com' ) );
$cache->addServer( new SharedanceServer( 'web4.example.com' ) );

$key = 'mykey';
$data_in = 'some data which needs to be cached';
$cache->set( $key, $data_in );

$data_out = $cache->get( $key );

Here we create a new instance of the Sharedance object, which acts as our gateway to the cache as a whole, and then add an instance of SharedanceServer for each of the machines which we want involved in this cache. Note that this setup must be exactly the same everywhere the cache is being used. We then write some data to the cache and get it back again with the set and get methods. Under the hood, the Sharedance object determines which server the data should be cached on and then caches it there (this is why even the order of the servers must be the same everywhere this is used).
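
To see why the server list has to match everywhere, it helps to know that key-to-server selection in caches like this is typically just a deterministic hash of the key. The sketch below is purely illustrative (it is not PHPDance's actual algorithm, which also has to deal with weightings), but it shows why every client with the same server list agrees on where a key lives.

<?php
// Illustrative sketch only: a deterministic key-to-server mapping.
function pickServer( $key, array $servers )
{
    // the same key always hashes to the same integer on every client
    $hash = abs( crc32( $key ) );
    // reduce the hash onto the server list; reordering the list
    // changes the answer, which is why order matters too
    return $servers[ $hash % count( $servers ) ];
}

$servers = array( 'web1.example.com', 'web2.example.com',
                  'web3.example.com', 'web4.example.com' );
// prints the same server for 'mykey', whichever client runs it
echo pickServer( 'mykey', $servers );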


Building In Redundancy

This is all well and good, and the fact that the cache is written to the disk as well is great because it means we don't have to worry about overflowing the cache. However, it's still not very resilient. What happens when one of the machines goes down? We effectively lose all the cached data on it for the period that it's down. This is OK when the data is actually stored in the database, but when the cache is the only place it's stored it's a bit more of a problem. PHPDance addresses this issue with redundant writes: if you ask it to be redundant it will write to two servers every time you set, and then, if the first one is down when you get, it will try the second. It's painfully easy to switch this on; just set the first and only parameter of the Sharedance constructor to true.


$cache = new Sharedance( true );

You should note that redundancy can only work if you have enough servers. If, for example, you only have one server, you obviously cannot enable redundancy. Also, if you have more than one server but one of them has a weighting greater than the total number of servers, it will not work either.


Extend And Improve

Earlier I mentioned that a typical example of where something like this is useful is session management. PHP provides a function for setting custom session handlers (session_set_save_handler()), and this has been used in SharedanceSession to create a distributed, redundant, Sharedance-backed session handler for PHP5.
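
I won't reproduce SharedanceSession here, but as a rough sketch of how a handler like this gets wired up (the constructor and method names below are assumptions based on PHP's standard handler callbacks, not necessarily SharedanceSession's real API):

<?php
require 'sharedance.class.php';

// build a redundant cache as before
$cache = new Sharedance( true );
$cache->addServer( new SharedanceServer( 'web1.example.com' ) );
$cache->addServer( new SharedanceServer( 'web2.example.com' ) );

// hand PHP's six session callbacks over to the handler;
// the class and method names here are assumed for illustration
$session = new SharedanceSession( $cache );
session_set_save_handler(
    array( $session, 'open' ),
    array( $session, 'close' ),
    array( $session, 'read' ),
    array( $session, 'write' ),
    array( $session, 'destroy' ),
    array( $session, 'gc' )
);

session_start();
$_SESSION['user'] = 'bob'; // now stored in Sharedance rather than local files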

Check it out at http://sourceforge.net/projects/phpdance/