Batch Processing

CellProfiler is designed to analyze images in a high-throughput manner. Once a pipeline has been established for a set of images, CellProfiler can export files that enable batches of images to be analyzed on a computing cluster with the pipeline.

It is possible to process tens or even hundreds of thousands of images for one analysis in this manner. We do this by breaking the entire set of images into separate batches, then submitting each of these batches as an individual job to a cluster. Each batch can be analyzed independently of the rest.

The following describes the workflow for running your pipeline on a cluster that’s physically located at your local institution; for running in a cloud-based cluster using Amazon Web Services, please see our blog post on Distributed CellProfiler, a tool designed to streamline that process.

Submitting files for batch processing

Below is a basic workflow for submitting your image batches to a cluster.

  1. Create a folder for your project on your cluster. For high-throughput analysis, it is recommended to create a separate project folder for each run.

  2. Within this project folder, create the following folders (both of which must be connected to the cluster computing network):

    • Create an input folder, then transfer all of your images to this folder. The input folder must be readable by everyone (or at least your cluster) because each of the separate cluster computers will read input files from this folder.
    • Create an output folder where all your output data will be stored. The output folder must be writeable by everyone (or at least your cluster) because each of the separate cluster computers will write output files to this folder.

    If you cannot create folders and set read/write permissions to these folders (or do not know how), ask your Information Technology (IT) department for help.

  3. Press the “View output settings” button. In the panel that appears, set the Default Input and Default Output Folders to the input and output folders created above, respectively. The Default Input Folder setting will only appear if a legacy pipeline is being run.

  4. Create a pipeline for your image set. You should test it on a few example images from your image set (if you are unfamiliar with the concept of an image set, please see the help for the Input modules). The module settings selected for your pipeline will be applied to all your images, but the results may vary depending on the image quality, so it is critical to ensure your settings are robust against your “worst-case” images. For instance, some images may contain no cells. If this happens, the automatic thresholding algorithms will incorrectly choose a very low threshold, and therefore “find” spurious objects. This can be overcome by setting a lower limit on the threshold in the IdentifyPrimaryObjects module. The Test mode in CellProfiler may be used for previewing the results of your settings on images of your choice. Please refer to Help > Testing Your Pipeline for more details on how to use this utility.

  5. Add the CreateBatchFiles module to the end of your pipeline. This module is needed to resolve the pathnames to your files with respect to your local machine and the cluster computers. If you are processing large batches of images, you may also consider adding ExportToDatabase to your pipeline, after your measurement modules but before the CreateBatchFiles module. This module will export your data either directly to a MySQL/SQLite database or into a set of comma-separated value (CSV) files along with a script to import your data into a MySQL database. Please refer to the help for these modules in order to learn more about which settings are appropriate.

  6. Run the pipeline to create a batch file. Click the Analyze images button and the analysis will begin processing locally. Do not be surprised if this initial step takes a while: CellProfiler must first create the entire image set list based on your settings in the Input modules (this process can be sped up by creating your list of images as a CSV and using the LoadData module to load it). With the CreateBatchFiles module in place, the pipeline will not process all the images, but instead will create a batch file (a file called Batch_data.h5) and save it in the Default Output Folder (set in Step 3). The advantage of using CreateBatchFiles from the researcher’s perspective is that the Batch_data.h5 file generated by the module captures all of the data needed to run the analysis. You are now ready to submit this batch file to the cluster to run each of the batches of images on different computers on the cluster.

  7. Submit your batches to the cluster. Log on to your cluster, and navigate to the directory where you have installed CellProfiler on the cluster. A single batch can be submitted with the following command:

    ./python -m cellprofiler -p <Default_Output_Folder_path>/Batch_data.h5 \
    -c -r -b \
    -f <first_image_set_number> \
    -l <last_image_set_number>

    This command submits the batch file to CellProfiler and specifies that CellProfiler run in a batch mode without its user interface to process the pipeline. This run can be modified by using additional options to CellProfiler that specify the following:

    • -p <Default_Output_Folder_path>/Batch_data.h5: The location of the batch file, where <Default_Output_Folder_path> is the output folder path as seen by the cluster computer.
    • -c: Run “headless”, i.e., without the GUI.
    • -r: Run the pipeline specified on startup, which is contained in the batch file.
    • -b: Do not build extensions, since by this point, they should already be built.
    • -f <first_image_set_number>: Start processing with the image set specified, <first_image_set_number>.
    • -l <last_image_set_number>: Finish processing with the image set specified, <last_image_set_number>.

    Typically, a user will break a long image set list into pieces and execute each of these pieces using the command line switches -f and -l to specify the first and last image sets in each job. Processing a full image set would then need a script that calls CellProfiler with these options and sequential image set numbers (e.g., 1-50, 51-100, etc.), submitting each range as an individual job.

    If you need help in producing the batch commands for submitting your jobs, use the --get-batch-commands switch along with the -p switch to specify the Batch_data.h5 file output by the CreateBatchFiles module. When specified, CellProfiler will output one line to the terminal per job to be run. This output should be further processed to generate a script that can invoke the jobs in a cluster-computing context.

    The above notes assume that you are running CellProfiler using our source code (see “Developer’s Guide” under Help for more details). If you are using the compiled version, you would replace ./python -m cellprofiler with the CellProfiler executable file itself and run it from the installation folder.
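The splitting into -f/-l ranges described in step 7 can be sketched in Python as follows. This is a minimal illustration, not part of CellProfiler: the helper names batch_ranges and batch_command, the batch size of 50, the total of 120 image sets, and the example path are all assumptions chosen for the example. Only the switches (-p, -c, -r, -b, -f, -l) come from the text above. The script prints one command line per batch, which you would then wrap in whatever job-submission command your cluster's scheduler uses.

```python
import shlex


def batch_ranges(total_image_sets, batch_size):
    """Yield (first, last) image set numbers, 1-based and inclusive."""
    if total_image_sets < 1 or batch_size < 1:
        raise ValueError("total_image_sets and batch_size must be positive")
    for first in range(1, total_image_sets + 1, batch_size):
        # The final batch may be shorter than batch_size.
        last = min(first + batch_size - 1, total_image_sets)
        yield first, last


def batch_command(batch_file, first, last,
                  launcher=("./python", "-m", "cellprofiler")):
    """Build the command line for one batch as a shell-quoted string.

    `launcher` is the source-code invocation from step 7; replace it with
    the path to the CellProfiler executable if you use the compiled version.
    """
    args = list(launcher) + [
        "-p", batch_file,   # batch file as seen by the cluster computer
        "-c", "-r", "-b",   # headless, run on startup, do not build extensions
        "-f", str(first),
        "-l", str(last),
    ]
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    # Hypothetical values: 120 image sets, 50 per job, example output path.
    for first, last in batch_ranges(120, 50):
        print(batch_command("/cluster/project/output/Batch_data.h5", first, last))
```

With the hypothetical values above, the script emits three commands covering image sets 1-50, 51-100, and 101-120.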

Once all the jobs are submitted, the cluster will run each batch individually and output any measurements or images specified in the pipeline. Specifying the output filename using the -o switch when calling CellProfiler will also produce an output file containing the measurements for that batch of images in the output folder. Check the output from the batch processes to make sure all batches complete. Batches that fail for transient reasons can be resubmitted.
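One way to check that all batches completed is to compare the ranges you submitted against some per-batch evidence of success. The sketch below assumes a convention that is not described in this document: your own job script creates an empty marker file named batch_<first>_<last>.done in a shared folder after CellProfiler exits successfully. The function names and the marker naming scheme are hypothetical; adapt them to whatever your job scripts actually leave behind.

```python
from pathlib import Path


def marker_name(first, last):
    """Hypothetical marker file name written by your job script on success."""
    return f"batch_{first}_{last}.done"


def unfinished_batches(expected, marker_dir):
    """Return the (first, last) ranges that have no marker file yet.

    `expected` is the list of ranges that were submitted; `marker_dir` is
    the folder where the job scripts are assumed to write their markers.
    """
    marker_dir = Path(marker_dir)
    return [
        (first, last)
        for first, last in expected
        if not (marker_dir / marker_name(first, last)).exists()
    ]
```

The ranges returned by unfinished_batches are the candidates for resubmission with the same -f and -l values as before.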

To see a listing and documentation for all available arguments to CellProfiler, type cellprofiler --help.

For additional help on batch processing, refer to our wiki if installing CellProfiler on a Unix system, our wiki on adapting CellProfiler to a LIMS environment, or post your questions on the CellProfiler CPCluster forum.