Doing BIG imports and Apache Solr indexing using Drush, the Batch API and cron

Posted by: Dominique De Cooman

By big imports I mean creating millions of nodes and indexing them for Apache Solr search. We needed to import one million records from file, and for each record we had to create three nodes, one in each of our three languages. So this gave us the task of creating scripts for creating, updating and deleting three million nodes. These nodes also had three taxonomy vocabularies and around 30 CCK fields, and we also needed to join another table containing spatial coordinates for our geo field.
So saving one node like that is an expensive operation.
We got the time down to 72 hours to import all of these records. The process was fully automated, and we used a manually configured setup to use the server at maximum capacity.

We received two main files (and another set of five files with key-value pairs as mappings for some of our CCK fields): one containing the records with the data for our other CCK fields and taxonomy, and a second file with the coordinates.
First we loaded all the files into the database. We gave the first table an auto-incremented id so we could use it later to track progress (we used a little script to do that). The second table we imported using phpMyAdmin, as well as the lookup tables. This process goes really fast: in under two hours we had the files on the server and in the database.
Why didn't we just read from the files? We had to look up the VAT id of each company in the second file to find the coordinates for that company, and that is a task a database does a lot better than a file-reading process that has to scan the file line by line.

We also created indexes on our tables: a primary key on the auto-increment field and a unique index on the VAT id in the first table, and a primary key on the VAT id in the second table. On the other tables with lookups we gave the key column a primary key. We also created an index on the VAT id in our CCK table, since we use it to check whether a node needs to be created or updated. Indexes are very important as they speed up lookups in the database (http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html), so always analyse your database tables to see whether an index is appropriate when doing imports.
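
As an illustration, the index creation could look something like this. This is only a sketch: the table and column names are assumptions, and in practice we ran the equivalent SQL through phpMyAdmin.

<?php
// Sketch only: table and column names are assumptions, adjust them to your
// own import tables.
// Unique index on the VAT id in the records table, used to decide between
// creating and updating a node.
db_query("ALTER TABLE {our_table_with_records} ADD UNIQUE INDEX idx_vat_id (vat_id)");
// Primary key on the VAT id in the coordinates table, used for the join.
db_query("ALTER TABLE {our_table_with_coordinates} ADD PRIMARY KEY (vat_id)");
// Primary key on the key column of each lookup table.
db_query("ALTER TABLE {our_lookup_table} ADD PRIMARY KEY (lookup_key)");
// Index on the VAT id column of the CCK table so the import can quickly check
// whether a node already exists for a record.
db_query("ALTER TABLE {content_type_bedrijf} ADD INDEX idx_field_vat (field_vat_value)");
?>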

Next we created our import script, which basically reads a row from the database, builds a node object and does a node_save() to get it into Drupal. Nothing special about that, except that if you just ran it from your browser the script would time out when your PHP max_execution_time was reached or when PHP ran out of memory. Even if you set memory_limit to the maximum on your server and max_execution_time to unlimited, the script would still fail, since the machine's memory would be consumed after a while. This is also no way to follow up on the progress of your import: you can't restart when something fails.
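
Just to make the shape of that per-record work concrete, here is a minimal sketch of building and saving one node for one record in one language. The content type, field names and coordinates table are assumptions, not the actual import code (the real script set around 30 CCK fields and the taxonomy terms as well).

<?php
/**
 * Sketch only: build and save one node for one record in one language.
 */
function import_create_node_from_record($record, $language) {
  $node = new stdClass();
  $node->type = 'bedrijf';
  $node->language = $language;
  $node->uid = 1;
  $node->status = 1;
  $node->title = $record['name'];
  // CCK fields use the field_name[0]['value'] structure.
  $node->field_vat[0]['value'] = $record['vat_id'];
  // Join the second table on the VAT id to fetch the coordinates for the
  // geo field.
  $coords = db_fetch_array(db_query("SELECT latitude, longitude FROM {our_table_with_coordinates} WHERE vat_id = '%s'", $record['vat_id']));
  if ($coords) {
    $node->field_geo[0]['value'] = $coords['latitude'] . ',' . $coords['longitude'];
  }
  node_save($node);
  return $node->nid;
}
?>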

So naturally, in Drupal we think the Batch API will save us. Here is the handbook page on how to create a batch (http://drupal.org/node/180528). Now you don't have to worry about your scripts timing out, and the Batch API will make sure you don't use that much memory. There are two ways to construct a batch. The first is the one explained in the handbook page. The second is a bit different, since it only uses one function in the operations array; we'll take the apachesolr reindex batch as an example. Why is this useful? The Batch API saves its batch in the batch table, so when you construct a batch containing a million operations and that array is serialized and put into the database, nasty things will happen. Depending on your MySQL settings and server capacity this will fail; in my case it failed at around 30k records.
You can still use the first method: you split the batch up into pieces and import them one by one. That is what we did, but not because of a too-big serialized array. More on that in a moment.

Check out this snippet to see how to write a batch that only uses one function in the operations array. In our case the only thing you need is a variable that keeps track of where you are.

<?php
/**
 * Batch reindex functions.
 */

/**
 * Submit a batch job to index the remaining, unindexed content.
 */
function apachesolr_batch_index_remaining() {
  $batch = array(
    'operations' => array(
      array('apachesolr_batch_index_nodes', array()),
    ),
    'finished' => 'apachesolr_batch_index_finished',
    'title' => t('Indexing'),
    'init_message' => t('Preparing to submit content to Solr for indexing...'),
    'progress_message' => t('Submitting content to Solr...'),
    'error_message' => t('Solr indexing has encountered an error.'),
    'file' => drupal_get_path('module', 'apachesolr') . '/apachesolr.admin.inc',
  );
  batch_set($batch);
}

/**
 * Batch Operation Callback
 */
function apachesolr_batch_index_nodes(&$context) {
  if (empty($context['sandbox'])) {
    try {
      // Get the $solr object.
      $solr = apachesolr_get_solr();
      // If there is no server available, don't continue.
      if (!$solr->ping()) {
        throw new Exception(t('No Solr instance available during indexing.'));
      }
    }
    catch (Exception $e) {
      watchdog('Apache Solr', $e->getMessage(), NULL, WATCHDOG_ERROR);
      return FALSE;
    }

    $status = module_invoke('apachesolr_search', 'search', 'status');
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = $status['remaining'];
  }

  // We can safely process the apachesolr_cron_limit nodes at a time without a
  // timeout or out of memory error.
  $limit = variable_get('apachesolr_cron_limit', 50);

  // With each pass through the callback, retrieve the next group of nids.
  $rows = apachesolr_get_nodes_to_index('apachesolr_search', $limit);
  apachesolr_index_nodes($rows, 'apachesolr_search');

  $context['sandbox']['progress'] += count($rows);
  $context['message'] = t('Indexed @current of @total nodes', array('@current' => $context['sandbox']['progress'], '@total' => $context['sandbox']['max']));

  // Inform the batch engine that we are not finished, and provide an
  // estimation of the completion level we reached.
  $context['finished'] = empty($context['sandbox']['max']) ? 1 : $context['sandbox']['progress'] / $context['sandbox']['max'];

  // Put the total into the results section when we're finished so we can
  // show it to the admin.
  if ($context['finished']) {
    $context['results']['count'] = $context['sandbox']['progress'];
  }
}

/**
 * Batch 'finished' callback
 */
function apachesolr_batch_index_finished($success, $results, $operations) {
  $message = format_plural($results['count'], '1 item processed successfully.', '@count items successfully processed.');
  if ($success) {
    $type = 'status';
  }
  else {
    // An error occurred, $operations contains the operations that remained
    // unprocessed.
    $error_operation = reset($operations);
    $message .= ' ' . t('An error occurred while processing @num with arguments :', array('@num' => $error_operation[0])) . print_r($error_operation[0], TRUE);
    $type = 'error';
  }
  drupal_set_message($message, $type);
}
?>
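
Applied to our own import, the same single-operation pattern would look roughly like the sketch below. The chunk size, the languages and the import_create_node_from_record() helper from the earlier sketch are assumptions; the point is that the sandbox is the variable that keeps track of where you are.

<?php
/**
 * Sketch only: single-operation batch callback for the custom import.
 */
function import_batch_import_records(&$context) {
  if (empty($context['sandbox'])) {
    $context['sandbox']['current_id'] = 0;
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['max'] = db_result(db_query("SELECT COUNT(id) FROM {our_table_with_records}"));
  }

  // Process a small chunk per pass, resuming after the last imported id.
  $limit = 50;
  $result = db_query_range("SELECT * FROM {our_table_with_records} WHERE id > %d ORDER BY id", $context['sandbox']['current_id'], 0, $limit);
  while ($record = db_fetch_array($result)) {
    foreach (array('nl', 'fr', 'en') as $language) {
      import_create_node_from_record($record, $language);
    }
    $context['sandbox']['current_id'] = $record['id'];
    $context['sandbox']['progress']++;
  }

  $context['message'] = t('Imported @current of @total records', array('@current' => $context['sandbox']['progress'], '@total' => $context['sandbox']['max']));
  $context['finished'] = empty($context['sandbox']['max']) ? 1 : $context['sandbox']['progress'] / $context['sandbox']['max'];
}
?>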

Alright, so now we could import our nodes without worrying about timeouts or our batch failing. But we would still have to monitor the process in case our internet connection got cut off: we would have to refresh the page to make the batch continue.
So it would be great to have a way of just launching a command and not having to worry about anything. To achieve this we use http://drupal.org/project/drush and the cron functionality on the server.
You could write your own drush command that launches a batch the way the script we already had did. However, that won't work: your memory will get exhausted. But don't worry, there is a solution. You call a drush command that is able to do batches; this drush command makes sure your memory doesn't get exhausted while doing batches. You can see how it is used in the updatedb drush function. Here is my snippet showing how I implemented it, using a custom drush command that calls the function that would be called by the "drush batch-process [batch-id]" command.

<?php
/**
 * Implementation of hook_drush_command().
 */
function import_drush_command() {
  $items = array();

  $items['import'] = array(
    'callback' => 'import_drush_import',
    'description' => dt('Import'),
    'arguments' => array(
      'start' => "start",
      'stop'  => "stop",
    ),
  );

  return $items;
}

function import_drush_import($start, $stop) {
  $result = db_query("SELECT * FROM {our_table_with_records} WHERE id > %d AND id < %d", $start, $stop);

  // Build the batch: one operation per record in the given range.
  $batch = array('operations' => array());
  import_drush_import_operations($batch, $result);

  batch_set($batch);

  $batch =& batch_get();

  $batch['progressive'] = FALSE;

  drush_backend_batch_process();
}

/**
 * Creates operations for importing bedrijf nodes.
 */
function import_drush_import_operations(&$batch, &$result) {
  while ($fields = db_fetch_array($result)) {
    // Drop the incremented id; only the record values are passed along.
    array_shift($fields);
    $fields_out = array();
    foreach ($fields as $field) {
      $fields_out[] = $field;
    }

    $batch['operations'][] = array('import_create_bedrijf_nodes', array($fields_out, TRUE, 17));
  }
}
?>

OK, so now in the terminal we type something like drush import 1 1000 and it will create a batch, fire it, import the first 1000 records and create nodes for them.
You could have this function called by cron so you don't even need to have a terminal open. But as said earlier, we are still creating an operation for every record. Why? When doing the one-operation trick I noticed only 30% of the CPU was used (check this by typing "top" in another terminal window). So I figured we could spawn multiple shells and make them all do work. I did that and found I could launch six shells with the drush import command. On the seventh the server CPU spiked to 300% and made the server crash, so six shells was the limit. It is probably possible to measure resources and launch commands according to that, but for now I figured the server was using all of its resources and the import was going as fast as possible, despite this being a manual process.
The final thing I did to automate the process was setting up an importcron.php in the Drupal root installation containing this:

<?php
// Set the path correctly so Drupal knows how to include its files during
// bootstrap.
$path = '/var/www/html/your_drupal_site';
chdir($path);

// Bootstrap Drupal.
include_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// Launch your function.
import('importer_import_progress');

// IMPORT
function import($value) {
  // Number of batches currently in the batch table.
  $amount = db_result(db_query("SELECT COUNT(bid) FROM {batch}"));
  $start = (string) variable_get($value, 0);

  // Only fire when fewer than six batches are running and there are records
  // left to import.
  if ($amount < 6 && $start < 985558) {
    // For some reason you need to cast everything to strings explicitly,
    // otherwise it wouldn't launch the command.
    $stop = (string) ($start + 1000);
    variable_set($value, $stop);

    $command = "/var/www/html/your_drupal_site/sites/all/modules/contrib/drush/drush import ";
    $start = (string) ($start + 1);
    $command .= $start . " " . $stop;

    exec("$command", $output, $return);
  }
}
?>

Then type crontab -e to edit the cron listing, press i to insert and add this command:

* * * * * /usr/bin/php /var/www/html/your_drupal_site/importcron.php

Use the full path to PHP and the full path to your file. This will execute the command once every minute. In the script a drush command is executed with the next 1000 records to import. To prevent the command from firing too often, it always checks whether the previous batches are finished; our limit was six batches at the same time. If your server is more powerful you can increase the 1000 items and the six batches. It would be nice to have this process controlled by a function that calculates server resources and launches batches accordingly, but I would have to do some research on how to do that.
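
As a very rough starting point for that idea (just a sketch, not something we implemented), PHP's sys_getloadavg() could gate the exec() call in the cron script:

<?php
// Sketch only: only allow another batch to be launched when the one-minute
// load average leaves headroom. The threshold is an arbitrary assumption.
function import_server_has_capacity($max_load = 6.0) {
  $load = sys_getloadavg();
  return $load[0] < $max_load;
}
?>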

Conclusion
The script ran for about 70 hours and the site contained 3 million nodes. The same principle was used to do the indexing, which took about 50 hours to index all nodes. For the indexing we modified some other things to make it go faster, but that's for another blog post.
