Latest Publications

Solr: Indexing speed 1111 docs/sec

By popular demand, I’m going to try and describe what I did to get the 1111 docs/sec indexing (link) on a 1,200,000-document index. Please don’t be surprised by how little I had to do with the entire story, and how much is thanks to Solr’s great code.

The Machine

First and foremost, a few words about the machine I ran the tests on, hostnamed ‘searcht’ (for convenience). These values are what I could figure out through ssh. I suppose they’re all that’s relevant:

CPU: GenuineIntel E7330 @ 2.40GHz
RAM: 4GB (Don’t know how to figure out DDR2/3, memory speed, etc…)
Running Redhat linux 5

The script

For indexing I used the post.sh script that ships with Solr 1.4, slightly modified:

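#!/bin/sh
# Post each file given on the command line to Solr, then commit.
URL=http://localhost:8999/solr/update

for f in "$@"; do
  echo "Posting file $f to $URL $(date)"
  # stream.file makes Solr read the file from its own local disk instead of
  # receiving the content over HTTP (the stock variant is kept commented out).
  curl "$URL" -F "stream.file=$f" #--data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
  echo
done

# Send the commit command to make sure all the changes are flushed and visible
curl "$URL" --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
echo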

The numbers

Indexing a 1,205,164-document index on searcht. (Values are the standard output of the time command, plus the QTime, in milliseconds, as returned by Solr.)

After each test I manually ran an optimize:

curl http://localhost:8999/solr/update --data-binary '<optimize/>' -H 'Content-type:text/xml;charset=utf-8'
Solr running on 1024M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
 Real  18m26.515s
 User   0m 0.015s
 Sys    0m 0.045s
 Qtime: 1 105 253 (ADD)
              830 (COMMIT)
           47 787 (OPTIMIZE)

Solr running on 1024M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
 Real  18m53.599s
 User   0m 0.022s
 Sys    0m 0.037s
 Qtime: 1 132 498 (ADD)
              752 (COMMIT)
           48 646 (OPTIMIZE)

Solr running on 2048M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
 Real  18m44.525s
 User   0m 0.013s
 Sys    0m 0.015s
 Qtime: 1 123 135 (ADD)
              849 (COMMIT)
           48 493 (OPTIMIZE)

I repeated the test with 2048M of RAM to try and speed up the indexing, but the difference is very small and not demonstrated by these tests. HDD speed might actually be more of a bottleneck than memory.
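As a rough cross-check, the wall-clock time of a run can be turned into documents per second. The sketch below is only an illustration: the function name docs_per_sec is made up here, and it is fed the document count and the Real time of the first 1024M run from the numbers above.

```shell
#!/bin/sh
# Hypothetical helper (not part of post.sh): average indexing throughput.
# Usage: docs_per_sec <documents> <minutes> <seconds>
docs_per_sec() {
  awk -v docs="$1" -v min="$2" -v sec="$3" \
    'BEGIN { printf "%.0f\n", docs / (min * 60 + sec) }'
}

# First 1024M run: 1,205,164 documents in 18m26.515s of wall-clock time.
docs_per_sec 1205164 18 26.515
```

For that run this prints 1089, which is in the same ballpark as the 1111 docs/sec quoted in the title. The measured time covers the adds and the commit done by post.sh; the optimize was run separately afterwards.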

I hope this describes what people wished to know. Any questions will be answered as soon as possible, and development continues (the next stop will probably be using EmbeddedSolrServer for indexing and complete parsing).

Solr: Finding out all values in a field

Solr and Lucene are truly amazing things, capable of fast indexing and querying vast amounts of data.

However, when coming from a conventional database structure, it’s quite hard to get into the way of thinking Lucene uses compared to, say, SQL.

SQL: select fieldname from database

The equivalent of select fieldname from database, as known in SQL databases, is one of those fun ones. One would think it would translate simply to something like:

q=fieldname%3A*&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard

Basically you’re querying for any value in fieldname. However, Lucene/Solr doesn’t support this! Instead, query for everything and facet on the field:

q=*%3A*&start=0&rows=10&fl=*%2Cscore&qt=standard&wt=standard&facet=true&facet.field=fieldname

Yes, simply faceting on the field gives you all its values, along with the number of documents that have each value.
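The facet query can also be sketched as a curl call. This is an illustration under assumptions: the host and port are borrowed from the indexing post, the /solr/select path and the helper name facet_url are assumptions, and rows=0, facet.limit=-1 and facet.mincount=1 are standard Solr parameters added here so the response holds only the facet counts, for every value that occurs at least once.

```shell
#!/bin/sh
# Hypothetical helper: build a Solr URL that lists all values of one field.
# $1 = URL of the select handler (assumed), $2 = field name.
facet_url() {
  # q=*:* matches every document (":" is URL-encoded as %3A);
  # rows=0 returns no documents, only the facet counts;
  # facet.limit=-1 lifts the default cap on the number of values returned;
  # facet.mincount=1 skips values with a count of zero.
  printf '%s?q=*%%3A*&rows=0&facet=true&facet.field=%s&facet.limit=-1&facet.mincount=1\n' "$1" "$2"
}

# Example (needs a running Solr instance, so it is left commented out):
# curl "$(facet_url http://localhost:8999/solr/select fieldname)"
```

Keeping the URL construction in a function makes it easy to reuse for other fields; the values and their counts come back in the facet section of the response.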

Howto: Hide flash

When using Flash together with a lightbox (and lookalikes), a common issue is that the Flash movie that should sit under the lightbox shows through it (often in weird places and not completely).

So, an elegant way to solve this in JavaScript is to move the Flash movie or make it disappear.

Hide it (display none)

Using simple JavaScript one can hide the Flash movie (or the div it resides in) and re-display it when necessary.

document.getElementById('div_flash').style.display = 'none' // hide
document.getElementById('div_flash').style.display = 'block' // show

And the same but in jQuery:

$('#div_flash').css('display', 'none') // hide
$('#div_flash').css('display', 'block') // show

However, for certain movies this will reload the movie (as I encountered with the amCharts Flash movie). In that case, hiding and showing the movie this way is not preferred.

Move it (margin-top or similar)

The other method, which prevents the Flash movie from reloading in those cases, is to move it out of view.

Take into account whether or not it’s OK for a scrollbar to appear. Mostly this is not wanted, so I’ll just use negative margins, which seem to prevent the scrollbar from appearing (at least in Firefox):

document.getElementById('div_flash').style.marginTop = '-99999px' // hide
document.getElementById('div_flash').style.marginTop = '0' // show

And the jQuery equivalent:

$('#div_flash').css('margin-top', '-99999px') // hide
$('#div_flash').css('margin-top', '0') // show

It’s not required to work with margin-top, but it’s best to work with either margin-top or margin-left, because margin-bottom and margin-right (with negative values) will probably generate a scrollbar.

Internship: An introduction

My internship: Global search @ ugent.be

Lagging a bit behind, I’m going to describe my internship.

In a nutshell, my internship is about search, a whole lot of search. Since the portal site of the university at Ghent (http://www.ugent.be) is moving to Plone as its main CMS and a whole lot of data is going into this portal site, people need to be able to search all that data.

So, the challenge is to implement a search system taking care of as much as possible, preferably in an open source environment! The idea was to use Apache Solr, which uses Lucene, as the main search engine. The fun part is that we stay within the field of open source, and still provide a strong search engine.

That being said, I’m currently working towards understanding Solr. Since I have no experience whatsoever with Solr (and only very little experience with Plone), this is going to challenge me, and I like challenges!

So, now that you know what the assignment is, let’s break it down into steps, because we’re in a western world and we enjoy a clear path of steps. Obviously that’s needed, since just starting head-on will end in failure.

First: Learn Solr

First I’m going to discover as much as possible about Solr. It so happens that the university’s library has an implementation of Solr running for their search. I’ve also been given the book Solr 1.4 Enterprise Search Server by David Smiley and Eric Pugh to study. Using that book I hope to get a clear vision of Solr.

Next: Case study

As mentioned in the first step, the university’s library is running a Solr implementation. It just so happens that this implementation is currently barely documented. We proposed to document it for them, since they’ve been (and probably still will be) a great help in getting us up to speed with Solr.

I’m basically going to put their techniques in documentation, case study style. That way I will have seen a Solr implementation running (all parts of it) and I’ll have a better view of how to implement it on the scale we want.

After that: Plone + Solr

When that’s done, the goal is to have a Plone plugin that integrates Solr as easily as possible into any Plone setup. Since we found some evidence that people in the Plone community have already started such an effort (or once started it), it’s possible that we will work with them to finish the module.

Extra extra

Every internship has its main goals and some extras, in case the main goals are achieved early (or just in time to do more work). I have some extras too.

The extra is the expansion of the global search. Since the portal site of UGent does not hold all the content hosted by the university, and we want as much as possible to be searchable, we may look for ways to make the external UGent websites searchable too, possibly with a crawler (Nutch maybe).

Internship category

Since it’s best to blog my progress (and findings, if any) and possibly help people along the way, I created this small subsection on my site. For people interested in Solr and/or Plone (possibly the improvement or development of a Plone module for Solr, although it seems such an effort is already under way), I hope to help as much as possible!

[howto] Install bazaar explorer in debian

Since I’m quite the Python fan, and also indirectly a fan of Bazaar, I’m really liking the Bazaar Explorer, especially on Windows. Since I wanted the same application on Linux, I tried to install the explorer there.

Somehow, somewhere, something went wrong. Obviously not a good thing. After installing the Bazaar Explorer (first by downloading the tarball here, secondly by executing “bzr branch lp:bzr-explorer explorer” in the ~/.bazaar/plugins folder) I stumbled upon an error:

Unable to load plugin 'explorer' from '/usr/lib/python2.5/site-packages/bzrlib/plugins'

and later the same from my ~/.bazaar/plugins folder…

The solution

Actually deceptively simple. The source installation page of Bazaar Explorer says:

If this fails to start, ensure that you have compatible versions of dependent products installed, namely: QBzr, PyQt, Qt and bzr.

Making sure you’ve got these requirements fixes the whole ordeal.
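A quick way to see which of those requirements are missing might look like the sketch below. It is only an illustration: the helper name missing_commands is made up here, and checking for a bzr executable plus a qbzr entry in the output of bzr plugins is an assumption about how the dependencies show up on a given system (PyQt and Qt would need their own checks).

```shell
#!/bin/sh
# Hypothetical helper: print every given command name that is not on the PATH.
missing_commands() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# Report missing executables (only bzr is checked here).
missing_commands bzr

# If bzr itself is present, check whether the qbzr plugin shows up.
if command -v bzr >/dev/null 2>&1; then
  bzr plugins 2>/dev/null | grep -qi qbzr || echo "qbzr plugin not found"
fi
```

If nothing is printed, bzr and qbzr are at least visible, and the remaining suspects are the PyQt and Qt versions.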

Qbzr on debian

My solution was simply to install qbzr on debian, since I already had bzr, qt and pyqt installed.

To install qbzr on debian, one needs to add the following to /etc/apt/sources.list (following these guidelines):

deb http://ppa.launchpad.net/qbzr-dev/ppa/ubuntu jaunty main

Update apt and install qbzr (as root of course, or with sudo):

# aptitude update
# aptitude install qbzr

Now, enjoy your bazaar explorer!
PS: if you get an error about the key, here’s an explanation of how to add it to apt

Haiku OS: Alpha 1 released

The Haiku OS project site has published the first ISO, so I immediately went to the download page of the Haiku OS site, since I’m a big fan of the Haiku idea. Downloading the ISO (from a SourceForge mirror) only took 10 minutes, but it isn’t nice of them to zip it (as if an ISO in itself isn’t good enough).

Preparing Virtualbox with pretty much everything standard (for an OS not known by Virtualbox that is):

The ISO mounted in Virtualbox booted up really fast. The installer could use some work, especially since exiting the installer / setup (square box, top left) leaves the whole system doing nothing at all (it should reboot, or fail safe to the desktop imho). Total setup time came down to about 10 minutes, extremely fast!

In the end, the installer deserves a lot of credit for being so straightforward. If one reads the installer instructions as they pop up during the process, one can install with ease. Most files seemed to be html files, which is quite funny ^^

Also, having an OS with 16835 files is seriously nice. Don’t really know if I should say this, but at the moment, it’s the smallest OS I’ve got in my entire Virtualbox setup. And I will most certainly play around with this one a _lot_ considering I hope it will become what I always wished for in the Desktop Linux experience.

What is also quite amazing imho is the fact that Python, Perl, and more are installed by default. Because I believe scripting will become more and more common in the real world, this is heading the right way!

Having played around with this some, taking some (loaded) screenshots, etc… I think Haiku OS has come a long way, but still has a long way to go. Positive points (for now):

  • Extremely fast boot-time, and a no-nonsense approach to the whole deal
  • Building on the BeOS ideas instead of trying to re-invent things seems to pay off

Negative points (points to work on):

  • alt-tab feature, or something more obvious than the Deskbar to switch between applications

These points and conclusions are first impressions, and I like what I see at the moment! Although the project still has a long way to go, it seems to gain momentum every time I try it, and I hope to start developing for it soon myself (although I prefer developing for servers :P)

Share this post, help Haiku gain even more momentum, try things out and comment here (or directly on the haiku site)

[pySM] Updates

This person has been very busy, and still little work seems to have been done.

In a nutshell, we’ve been busy getting the main website for pySM online. pySM is Tim’s and my system management project (there are some posts about it on this blog).

However, soon after getting quite a lot of URLs and basic apps online, I decided to change the URL structure to make it a little bit simpler. Fact is that we were trying to overachieve, something that seems to happen to us a lot :(

Anyhow, pySM is currently online at 2 links (yes, 2 URLs, instead of the dozen or so earlier):

  • pysm.be << Main site + Blog, makes sense that people interested in the project will directly see the progress. However we will probably in the future try to maintain a proper website with introduction, links, screenshots, documentation, world domination plans, etc…
  • trac.pysm.be << Trac site, contains documentation (raw material basically) and a webview for the bazaar repository. Tickets can be created over there too, although I don’t really see the use for the moment.

I hope that having this simpler structure will enable us to work more on pySM to get our first _decent_ version out. After that, we can try to overachieve again :)

Best of regards,
cpf

[pySM] Sites

Today all pysm sites (which used to be only trac) have been moved.

The “old” url (www.pysm.be) has been moved to the server hosting codercpf.be (Yes, this site)

We’ve split the old trac-only setup into separate sites:

Update!

I put some URLs here last time, but those changed. For the sake of consistency, and to avoid sending people to the wrong places, here are the updated ones :) (And the post announcing it on this blog)

  • pysm.be << Main site + Blog, makes sense that people interested in the project will directly see the progress. However we will probably in the future try to maintain a proper website with introduction, links, screenshots, documentation, world domination plans, etc…
  • trac.pysm.be << Trac site, contains documentation (raw material basically) and a webview for the bazaar repository. Tickets can be created over there too, although I don’t really see the use for the moment.

While we were at it, we changed the previous Subversion repository to a bazaar repo. Makes it a lot easier for us developers, although on first glance there’ll be little difference between the current bazaar repo and the Subversion way of doing things.

This is only a small step in our attempt to rule the world using pysm!

Logo theft

Update: The site has removed the logo. Arch is victorious (linky)

As it appears, someone (in particular a company named Ace International Tutoring Pty Ltd) has been using the Arch Linux logo as their own. As outrageous as it might seem, it is true.

As for the original article, it explains how the devs and official archlinux crew have come to the conclusion that this logo is indeed originally theirs, and that it has been “stolen” by someone thinking he can get away with it…

DIGG THE ARTICLE :)

Migrate from thunderbird to icedove in debian

Long title for a short post. I searched on Google and found nothing decent, so here it is.

To migrate from thunderbird to icedove in debian linux, one only needs to rename the config directory of thunderbird:

mv .thunderbird .mozilla-thunderbird

I do think this naming is stupid, though (I’d have preferred to see .icedove instead of .mozilla-thunderbird).
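A slightly more careful version of that rename, as a sketch: the function name migrate_profile is made up here, and the guards (do nothing when there is no .thunderbird, refuse to overwrite an existing .mozilla-thunderbird) are additions of this example, not something Debian requires.

```shell
#!/bin/sh
# Hypothetical helper: rename the Thunderbird profile directory to the
# name this post says Icedove uses. $1 = home directory (defaults to $HOME).
migrate_profile() {
  home="${1:-$HOME}"
  if [ ! -d "$home/.thunderbird" ]; then
    echo "no .thunderbird directory in $home" >&2
    return 1
  fi
  if [ -e "$home/.mozilla-thunderbird" ]; then
    echo ".mozilla-thunderbird already exists, not overwriting" >&2
    return 1
  fi
  mv "$home/.thunderbird" "$home/.mozilla-thunderbird"
}

# Usage: migrate_profile "$HOME"
```

The function returns a non-zero status when it refuses to act, so it can be chained with other commands safely.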

Enjoy your path towards ice* enlightenment…