Solr: Indexing speed 1111 docs/sec
By popular demand, I’m going to try to describe what I did to reach 1111 docs/sec indexing (link) on a 1.200.000-document index. Please don’t be surprised by how little I had to do with the whole story, and how much is thanks to Solr’s great code.
The machine
First, a few words about the machine I ran the tests on. Its hostname is ‘searcht’ (used below for convenience). These values are what I could figure out through ssh; I believe they are all that is relevant:
CPU: GenuineIntel E7330 @ 2.40GHz
RAM: 4GB (Don’t know how to figure out DDR2/3, memory speed, etc…)
OS: Red Hat Linux 5
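For anyone who wants to collect the same kind of values over ssh, here is a minimal sketch, assuming a Linux machine that exposes /proc/cpuinfo and /proc/meminfo. The helper names cpu_model and mem_total_mb are hypothetical and were not part of the tests; they only parse the text they are given on stdin.

```shell
#!/bin/sh
# Hypothetical helpers (not from the original tests) that read
# /proc-style text on stdin and print one value each.

# Print the first "model name" entry from /proc/cpuinfo-style input.
cpu_model() {
  sed -n 's/^model name[[:space:]]*:[[:space:]]*//p' | head -n 1
}

# Print total RAM in MB from /proc/meminfo-style input (MemTotal is in kB).
mem_total_mb() {
  awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 }'
}

# Only read the real files when they exist (i.e. on Linux).
if [ -r /proc/cpuinfo ]; then cpu_model < /proc/cpuinfo; fi
if [ -r /proc/meminfo ]; then mem_total_mb < /proc/meminfo; fi

# Memory type and speed (DDR2/DDR3, MHz) are not in /proc; on most systems
# "dmidecode --type memory" reports them, but it needs root privileges.
```

The parsing is kept separate from the file reads so the helpers can be tried on saved output from another machine.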
The script
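For indexing I used the post.sh script that ships with Solr 1.4, slightly modified: it prints the date for each file, and it hands Solr the file path through stream.file instead of uploading the file with --data-binary (the original option is left commented out):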
FILES=$*
URL=http://localhost:8999/solr/update

for f in $FILES; do
  echo Posting file $f to $URL `date`
  curl $URL -F stream.file=$f #--data-binary @$f -H 'Content-type:text/xml; charset=utf-8'
  echo
done

#send the commit command to make sure all the changes are flushed and visible
curl $URL --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'
echo
The numbers
Indexing a 1.205.164-document index on searcht. (The values are the standard output of the time command, plus the QTime returned by Solr.)
Each test is followed by a manual optimize:
curl http://localhost:8999/solr/update --data-binary '<optimize/>' -H 'Content-type:text/xml;charset=utf-8'
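Solr reports QTime (in milliseconds) in the response header of every update request. As a rough sketch of how that number can be pulled out of the XML response on the command line, assuming the default XML response format; the function name extract_qtime is hypothetical and was not part of the tests:

```shell
#!/bin/sh
# Hypothetical helper: read a Solr XML update response on stdin and
# print the QTime value (milliseconds) found in its responseHeader.
extract_qtime() {
  sed -n 's/.*<int name="QTime">\([0-9][0-9]*\)<\/int>.*/\1/p'
}

# Example with a canned response of the shape Solr returns:
sample='<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">830</int></lst></response>'
printf '%s\n' "$sample" | extract_qtime

# Against a running Solr this could be piped straight from curl, e.g.:
#   curl -s http://localhost:8999/solr/update --data-binary '<commit/>' \
#     -H 'Content-type:text/xml; charset=utf-8' | extract_qtime
```

The curl line is left as a comment because it needs a running Solr instance on port 8999.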
Solr running on 1024M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
Real 18m26.515s
User 0m 0.015s
Sys 0m 0.045s
QTime ADD: 1 105 253 ms
QTime COMMIT: 830 ms
QTime OPTIMIZE: 47 787 ms
Solr running on 1024M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
Real 18m53.599s
User 0m 0.022s
Sys 0m 0.037s
QTime ADD: 1 132 498 ms
QTime COMMIT: 752 ms
QTime OPTIMIZE: 48 646 ms
Solr running on 2048M RAM
time ./post.sh /vol/indexes/testing/rug01.20100324.harvest.proc
Real 18m44.525s
User 0m 0.013s
Sys 0m 0.015s
QTime ADD: 1 123 135 ms
QTime COMMIT: 849 ms
QTime OPTIMIZE: 48 493 ms
I ran the same test with 2048M of RAM to try to speed up the indexing, but the tests show no real difference: the 2048M run (18m44s) falls between the two 1024M runs (18m26s and 18m53s). HDD speed may well be more of a bottleneck than memory.
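To turn the ADD QTime values above into a throughput figure, divide the document count by the QTime in seconds. A small sketch of that arithmetic, assuming QTime is in milliseconds (as Solr reports it); the function name docs_per_sec is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: docs_per_sec DOCS QTIME_MS
# Prints the average number of documents indexed per second.
docs_per_sec() {
  awk -v docs="$1" -v ms="$2" 'BEGIN { printf "%.1f\n", docs / (ms / 1000) }'
}

docs_per_sec 1205164 1105253   # run 1, 1024M -> 1090.4
docs_per_sec 1205164 1132498   # run 2, 1024M -> 1064.2
docs_per_sec 1205164 1123135   # run 3, 2048M -> 1073.0
```

For these three runs that works out to roughly 1064 to 1090 docs/sec on the ADD step alone, without the commit and optimize.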
I hope this covers what people wanted to know. I will answer any questions as soon as possible. Development continues; the next step will probably be using EmbeddedSolrServer for indexing and complete parsing.





