Drupal 7 tip: Concurrent indexing with Search API. Concurrent queues using the concurrency capabilities of Drush.
We were using Search API to build our search pages. The module is great for creating search pages, but it struggles when indexing a lot of nodes. A patch is available at http://drupal.org/node/1137734 to avoid the memory limit issue, but even with it, indexing with Search API is still very slow.
The solution?
We want to index our nodes concurrently, of course. This makes full use of our machine and speeds up indexing.
Two years earlier this very problem of concurrent indexing was solved with a trick described in this article: http://dominiquedecooman.com/blog/doing-big-imports-and-apache-solr-inde.... In that article the concurrency issue was solved with a cron run that spawned processes depending on how high the load of the machine got. This worked great, but now there is a much easier way to do the same thing. Two years ago Drush was not that advanced and Drupal 7 didn't even exist, so there were no queues in core.
How did we index our nodes concurrently?
Search API implements a cron queue (http://dominiquedecooman.com/blog/drupal-7-tip-cron-queues). We can exploit this to have our nodes indexed concurrently. The only thing we need to do is fill a queue with the items to be indexed and then write a special command that uses the "drush_backend_invoke_concurrent()" function to process that queue concurrently. The function runs several queue-processing commands at the same time, as many as the concurrency level you set. Drush takes care of everything: it cleans up the processes when they are done and returns their output, if any, to the console.
In the code we have two commands. One to fill the queue and one to process it.
<?php
/**
 * Implements hook_drush_command().
 */
function dfc_drush_command() {
  $items = array();
  $items['search-api-add-to-queue'] = array(
    'description' => 'Fetch all items to be indexed, and add them to the queue.',
    // Pass the constant itself, not its name as a string.
    'bootstrap' => DRUSH_BOOTSTRAP_DRUPAL_FULL,
    'callback' => 'search_api_cron',
    'aliases' => array('sapi-aq'),
  );
  $items['queue-run-concurrent'] = array(
    'description' => 'Run a specific queue by name',
    'arguments' => array(
      'queue_name' => 'The name of the queue to run, as defined in either hook_queue_info or hook_cron_queue_info.',
      'concurrency_level' => 'The number of background processes to spin off.',
    ),
    'required-arguments' => TRUE,
  );
  return $items;
}

/**
 * Command callback for drush queue-run-concurrent.
 *
 * Queue runner that is compatible with queues declared using both
 * hook_queue_info() and hook_cron_queue_info().
 *
 * @param $queue_name
 *   Arbitrary string. The name of the queue to work with.
 * @param $concurrency_level
 *   Number of background processes to spin off.
 */
function drush_dfc_queue_run_concurrent($queue_name, $concurrency_level) {
  // Get all queues.
  $queues = drush_queue_get_queues();
  if (isset($queues[$queue_name])) {
    $queue = DrupalQueue::get($queue_name);
    // Start from an empty list so an empty queue does not leave
    // $invocations undefined, and count the items only once.
    $invocations = array();
    $item_count = $queue->numberOfItems();
    for ($i = 0; $i < $item_count; $i++) {
      $invocations[] = array('command' => 'queue-run ' . $queue_name, 'site' => '@self');
    }
    // The concurrency level is passed as a backend option (third argument).
    $backend_options = array(
      'concurrency' => $concurrency_level,
    );
    drush_backend_invoke_concurrent($invocations, array(), $backend_options);
  }
}
?>
To make it easy, we created a command that adds all items to the queue simply by calling the search_api_cron() function, which adds every item that needs indexing to the queue.
#In terminal
drush sapi-aq
Then we call our other command, which invokes the "drush_dfc_queue_run_concurrent()" function, like this:
#In terminal
drush queue-run-concurrent search_api_indexing_queue 8
Now watch the command spawn 8 other processes beside itself by monitoring the shell you are in. In another shell, type the following, replacing [id] with the id of the first shell:
watch ps -s[id]
To find out the id of the shell you are in, type:
ps -ef|grep $$|grep -v grep
(It's the third number on the second line.) For more, see http://stackoverflow.com/questions/3327013/how-to-determine-the-current-...
Indexing now goes 8 times faster. If you have a big machine you can raise the concurrency level and it will go even faster. Do monitor your search backend (Apache Solr in our case) to make sure it can cope with the load.
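Because the concurrency level arrives from the command line as a string, it may be worth sanity-checking it before handing it to Drush. What follows is only a minimal sketch: the helper name dfc_normalize_concurrency() and the upper bound of 16 are assumptions for illustration, not part of the original module.

```php
<?php
/**
 * Hypothetical helper: turn the raw command-line argument into a usable
 * concurrency level. Not part of the original module.
 *
 * @param mixed $value
 *   The argument as typed on the command line, e.g. "8".
 * @param int $max
 *   Upper bound on the number of processes (assumed default: 16).
 *
 * @return int
 *   A whole number between 1 and $max.
 */
function dfc_normalize_concurrency($value, $max = 16) {
  // Non-numeric input such as "abc" falls back to a single process.
  $level = is_numeric($value) ? (int) $value : 1;
  // Clamp the result to the range 1..$max.
  return max(1, min($level, $max));
}
```

Inside the command callback, the returned value could then be used as the 'concurrency' option instead of the raw argument.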
Conclusion
This function/trick is reusable for any queue in the system. The process is always the same: fill up a queue with items and let Drush process the queue concurrently. You can do it for any task: importing content, feeds, indexing, and so on.
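To illustrate reusing the pattern for your own task, here is a minimal sketch of a Drupal 7 module that declares a cron queue and fills it. The module name mymodule, the queue name mymodule_import_queue and the three functions are hypothetical. hook_cron_queue_info(), DrupalQueue and watchdog() are Drupal 7 core APIs.

```php
<?php
/**
 * Implements hook_cron_queue_info().
 *
 * "mymodule" and "mymodule_import_queue" are hypothetical names.
 */
function mymodule_cron_queue_info() {
  $queues = array();
  $queues['mymodule_import_queue'] = array(
    // Called once for every item claimed from the queue.
    'worker callback' => 'mymodule_import_worker',
    // Seconds a single cron run may spend on this queue.
    'time' => 60,
  );
  return $queues;
}

/**
 * Adds one queue item per id and returns the resulting queue size.
 */
function mymodule_fill_queue(array $ids) {
  $queue = DrupalQueue::get('mymodule_import_queue');
  $queue->createQueue();
  foreach ($ids as $id) {
    $queue->createItem(array('id' => $id));
  }
  return $queue->numberOfItems();
}

/**
 * Worker callback: receives the data that was passed to createItem().
 */
function mymodule_import_worker($item) {
  // Replace this with the real work: import a record, index a node, ...
  watchdog('mymodule', 'Processed item @id', array('@id' => $item['id']));
}
```

Once the queue is filled, it could be processed with the command from this post, for example: drush queue-run-concurrent mymodule_import_queue 4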
Mention
This little function was developed together with Sander Van Dooren. Check out his blog, he posts nice tips regularly: http://www.sandervandooren.be/