Woo! I have a working PyLucene! (Err, and some other stuff about full text indexing).

I've been after a Python library for poking at Lucene indexes for a while. We use and write a fair few things at work that have Lucene at the backend, and just occasionally it would be very useful to be able to fiddle with the indexes (read: search from the command line) without going through the "write Java, compile, test, recompile because you got something wrong, retest, and again" loop for simple things. Using X11 forwarding over ssh, over a 2M line shared with the rest of the office, to run Luke on the indexes is just not fun at all! So, now that I've got PyLucene compiled and know the right options (ish), and it builds on Debian Sarge (both i386 and amd64), I might just build a package for installing on our servers, so that there's an "easy" option for us sysadmin types who just need something we can use, quickly, without messing about!

I'm semi-planning to write a Luke-alike using PyLucene and the ncurses library, to make it easier to diagnose and look into indexes on remote servers, but that's not going to happen this week!

Quite impressed with the speed though - it looks fast - but then, I suppose if you compile the Java with gcj to get native code, you can expect it to be a bit quicker than running in the VM, right? At least, that's what I'm attributing it to at the moment; that, and Python startup time is likely to be a little quicker than Java startup time.
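The startup-time half of that guess is easy enough to check rather than assume. A rough sketch, where `startup_seconds` is a helper name made up for this post and the Java line assumes a `java` binary on the PATH (so it's left commented out):

```python
import subprocess
import sys
import time


def startup_seconds(command, runs=5):
    """Return the fastest wall-clock time, in seconds, to run `command` to completion."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


# Time an interpreter that does nothing but start up and exit.
python_startup = startup_seconds([sys.executable, "-c", "pass"])
print("python startup: %.3fs" % python_startup)

# Assumption: a `java` binary is on the PATH. Uncomment to compare.
# java_startup = startup_seconds(["java", "-version"])
# print("java startup: %.3fs" % java_startup)
```

Taking the fastest of a few runs keeps disk-cache warm-up out of the number. It only measures process startup, though, so it says nothing about how quick the actual searching is.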

Also, in my travels, I found Divmod Xapwrap, a Python library wrapped round Xapian, which looks to be another (interesting) full text indexing engine. I haven't played with it yet, but it could be worth a look in the future. The core library is written in C++, so it should be reasonably quick. The bit that I like the look of, though, is:

Xapian can search across several databases as easily as searching across a single one. Simply call Xapian::Database::add_database() for each database that you wish to search through.

Which sounds very useful if you start thinking about distributing indexes across a number of machines. I've not read enough of the Lucene documentation to know whether you can do a similar thing there.
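To make the idea a bit more concrete, here's a toy sketch in plain Python of what "search several indexes as if they were one" boils down to. `ToyIndex` and `MultiIndex` are made-up names for illustration only; this is not the Xapian or Lucene API, and the in-memory dictionaries just stand in for real on-disk indexes:

```python
from collections import defaultdict


class ToyIndex:
    """A tiny in-memory inverted index standing in for one real index."""

    def __init__(self, name):
        self.name = name
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


class MultiIndex:
    """Fans one query out to several indexes and merges the hits."""

    def __init__(self):
        self.indexes = []

    def add_index(self, index):
        self.indexes.append(index)

    def search(self, term):
        hits = []
        for index in self.indexes:
            # Tag each hit with its index name, since document ids are
            # only unique within a single index.
            hits.extend((index.name, doc_id) for doc_id in index.search(term))
        return hits


mail = ToyIndex("mail")
mail.add(1, "Lucene index rebuilt overnight")
mail.add(2, "Lunch on Friday")

wiki = ToyIndex("wiki")
wiki.add(1, "How to search a Lucene index from Python")

combined = MultiIndex()
combined.add_index(mail)
combined.add_index(wiki)
print(combined.search("lucene"))  # [('mail', 1), ('wiki', 1)]
```

A real engine would also have to merge relevance scores across the indexes, not just concatenate the hits, which is the hard part that this sketch skips entirely.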

Right - enough of that - time to put things back together in the office and go home (again!). Hopefully tomorrow morning I won't be rudely awoken at 5.30am by someone asking a question that could have been better asked via e-mail (where "better asked" means "wouldn't have woken me up and annoyed the hell out of me because I couldn't get back to sleep", though - I did then play with some things and watched American Pie: The Wedding before heading in to the office the first time round...).

Posted: 2006-11-05 14:06 in Tech