Solr: Indexing speed 1111 docs/sec

By popular demand, I’m going to try to describe what I did to get the 1111 docs/sec indexing (link) on an index of 1,200,000 documents. Please don’t be surprised at how little I had to do with the entire story, and how much is thanks to Solr’s great code.

The Machine

First and foremost, a few words about the machine I ran the tests on, hostnamed ‘searcht’ (for convenience). These values are what I could figure out through ssh; I suppose they’re all that’s relevant:

CPU: GenuineIntel E7330 @ 2.40GHz
RAM: 4GB (Don’t know how to figure out DDR2/3, memory speed, etc…)
OS: Red Hat Linux 5

The script

For indexing I used the post.sh script that ships with Solr 1.4, slightly modified:

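URL=http://localhost:8999/solr/update

# Post every file given on the command line ("$@" keeps paths with spaces intact)
for f in "$@"; do
 echo "Posting file $f to $URL $(date)"
 # stream.file asks Solr to read the file from its own local disk;
 # the stock upload of the file contents is left commented out
 curl "$URL" -F "stream.file=$f" #--data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
 echo
done

#send the commit command to make sure all the changes are flushed and visible
curl "$URL" --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
echo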

The numbers

Indexing a 1,205,164-document index on searcht. (The values are the standard output of the time command, plus the QTime as returned by Solr.)

Each test was followed by a manual optimize:

curl http://localhost:8999/solr/update --data-binary '<optimize/>' -H 'Content-type:text/xml; charset=utf-8'

Solr running on 1024M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
 Real  18m26.515s
 User   0m 0.015s
 Sys    0m 0.045s
 QTime (ms): 1 105 253 (ADD)
                   830 (COMMIT)
                47 787 (OPTIMIZE)

Solr running on 1024M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
 Real  18m53.599s
 User   0m 0.022s
 Sys    0m 0.037s
 QTime (ms): 1 132 498 (ADD)
                   752 (COMMIT)
                48 646 (OPTIMIZE)

Solr running on 2048M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
 Real  18m44.525s
 User   0m 0.013s
 Sys    0m 0.015s
 QTime (ms): 1 123 135 (ADD)
                   849 (COMMIT)
                48 493 (OPTIMIZE)

I ran the same test with 2048M RAM to try to speed up the indexing, but the difference is very small and the tests do not show a real gain. The HDD speed might actually be more of a bottleneck than the memory.
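For anyone who wants to turn the timings above into docs/sec, here is a minimal sketch. It assumes the ADD QTime is in milliseconds (as Solr reports it) and counts only the ADD step, not the commit or optimize; the docs_per_sec helper is a name made up for this example and is not part of post.sh.

```shell
# docs_per_sec DOCS QTIME_MS
# Hypothetical helper: prints documents per second, rounded to a whole number.
docs_per_sec() {
  awk -v docs="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", docs / (ms / 1000) }'
}

docs_per_sec 1205164 1105253   # first 1024M run  -> 1090
docs_per_sec 1205164 1132498   # second 1024M run -> 1064
docs_per_sec 1205164 1123135   # 2048M run        -> 1073
```

By this measure the 2048M run lands between the two 1024M runs, which fits the observation that the extra memory made no measurable difference.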

I hope this describes what people wished to know. Any questions will be answered as soon as possible, and development continues (the next stop will probably be using EmbeddedSolrServer for indexing and complete parsing).
