# Data Tools

Data Tools allow you to plot, view, export, or perform specialized analyses on your measurements. They can be run as modules in a pipeline, or after analysis is complete by using “Data Tools” in CellProfiler’s main menu.

## CalculateMath

CalculateMath takes measurements produced by previous modules and performs basic arithmetic operations.

The arithmetic operations available in this module include addition, subtraction, multiplication, and division. The result can be log-transformed or raised to a power and can be used in further calculations if another CalculateMath module is added to the pipeline.

The module can make its calculations on a per-image basis (for example, multiplying the area occupied by a stain in the image by the total intensity in the image) or on an object-by-object basis (for example, dividing the intensity in the nucleus by the intensity in the cytoplasm for each cell).

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | YES | NO |

### Measurements made by this module

• Image measurements: If both input measurements are whole-image measurements, then the result will also be a whole-image measurement.
• Object measurements: Object measurements can be produced in two ways:
    • If both input measurements are individual object measurements, then the result will also be an object measurement. In these cases, the measurement will be associated with both objects that were involved in the measurement.
    • If one measurement is object-based and the other is image-based, then the result will be an object measurement.

The result of these calculations is a new measurement in the “Math” category.
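The rules above for combining per-image and per-object measurements can be sketched in plain Python. This is only an illustration, not CellProfiler's implementation: the function name `calculate_math`, the use of a float for an image measurement and a list for object measurements, the order of post-processing (power first, then log), and the choice of base-10 logarithm are all assumptions.

```python
import math


def calculate_math(first, second, operation, exponent=1.0, log_transform=False):
    """Combine two measurements.

    A float stands for a whole-image measurement and a list of floats for
    per-object measurements (an assumed representation for this sketch).
    """
    ops = {
        "add": lambda a, b: a + b,
        "subtract": lambda a, b: a - b,
        "multiply": lambda a, b: a * b,
        # Division by zero yields NaN instead of raising an error.
        "divide": lambda a, b: a / b if b != 0 else float("nan"),
    }
    op = ops[operation]

    def finish(value):
        # Assumed order: raise to a power, then optionally take log10.
        value = value ** exponent
        if log_transform:
            value = math.log10(value) if not value <= 0 else float("nan")
        return value

    first_is_obj = isinstance(first, list)
    second_is_obj = isinstance(second, list)
    if first_is_obj and second_is_obj:
        # Object with object: element-wise, result is per-object.
        if len(first) != len(second):
            raise ValueError("object measurements must have equal length")
        return [finish(op(a, b)) for a, b in zip(first, second)]
    if first_is_obj:
        # Object with image: the image value is applied to every object.
        return [finish(op(a, second)) for a in first]
    if second_is_obj:
        return [finish(op(first, b)) for b in second]
    # Image with image: result is a single per-image value.
    return finish(op(first, second))


# Nuclear intensity divided by cytoplasmic intensity, cell by cell.
ratios = calculate_math([10.0, 20.0], [2.0, 4.0], "divide")
```

Mixing a list with a float returns a list, which mirrors the rule that an object measurement combined with an image measurement yields an object measurement.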

## CalculateStatistics

CalculateStatistics calculates measures of assay quality (V and Z’ factors) and dose-response data (EC50) for all features measured from images.

The V and Z’ factors are statistical measures of assay quality and are calculated for each per-image measurement and for each average per-object measurement that you have made in the pipeline. Placing this module at the end of a pipeline in order to calculate these values allows you to identify which measured features are most powerful for distinguishing positive and negative control samples (Z’ factor), or for accurately quantifying the assay’s response to dose (V factor). These measurements will be calculated for all measured values (Intensity, AreaShape, Texture, etc.) upstream in the pipeline. The statistics calculated by this module can be exported as the “Experiment” set of data.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

### What do I need as input?

Example format for a file to be loaded by LoadData for this module:

LoadData loads information from a CSV file. The first line of this file is a header that names the items. Each subsequent line represents data for one image cycle, so your file should have the header line plus one line per image to be processed. You can also make a file for LoadData to load that contains the positive/negative control and dose designations plus the image file names to be processed, which is a good way to guarantee that images are matched with the correct data. The control and dose information can be designated in one of two ways:

• As metadata (so that the column header is prefixed with the “Metadata_” tag). “Metadata” is the category and the name after the underscore is the measurement.
• As some other type of data, in which case the header needs to be of the form <prefix>_<measurement>. Select <prefix> as the category and <measurement> as the measurement.

Here is an example file:

    Image_FileName_CY3,Image_PathName_CY3,Data_Control,Data_Dose
    "Plate1_A01.tif","/images",-1,0
    "Plate1_A02.tif","/images",1,1E10
    "Plate1_A03.tif","/images",0,3E4
    "Plate1_A04.tif","/images",0,5E5
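As a rough illustration of how such a file is laid out, the sketch below reads the example with Python's standard `csv` module and splits a header into its category and measurement parts. The helper `split_header` and the variable names are hypothetical; this is not how LoadData itself is implemented.

```python
import csv
import io

# The example file from above, one row per image cycle.
EXAMPLE = '''Image_FileName_CY3,Image_PathName_CY3,Data_Control,Data_Dose
"Plate1_A01.tif","/images",-1,0
"Plate1_A02.tif","/images",1,1E10
"Plate1_A03.tif","/images",0,3E4
"Plate1_A04.tif","/images",0,5E5
'''


def split_header(name):
    """Split '<prefix>_<measurement>' at the first underscore."""
    category, _, measurement = name.partition("_")
    return category, measurement


rows = list(csv.DictReader(io.StringIO(EXAMPLE)))

# Control and dose designations, one value per image cycle.
controls = [int(row["Data_Control"]) for row in rows]
doses = [float(row["Data_Dose"]) for row in rows]

# "Data" is the category to select and "Dose" is the measurement.
category, measurement = split_header("Data_Dose")
```

A header such as `Metadata_Dose` would split the same way, giving `Metadata` as the category.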

### Measurements made by this module

• Experiment features: Whereas most CellProfiler measurements are calculated for each object (per-object) or for each image (per-image), this module produces per-experiment values; for example, one Z’ factor is calculated for each measurement, across the entire analysis run.
• Zfactor: The Z’-factor indicates how well separated the positive and negative controls are. A Z’-factor > 0 is potentially screenable; a Z’-factor > 0.5 is considered an excellent assay. The formula is 1 - 3 × (σp + σn)/|μp - μn| where σp and σn are the standard deviations of the positive and negative controls, and μp and μn are the means of the positive and negative controls.
• Vfactor: The V-factor is a generalization of the Z’-factor, and is calculated as 1 - 6 × mean(σ)/|μp - μn| where σ are the standard deviations of the data, and μp and μn are defined as above.
• EC50: The half maximal effective concentration (EC50) is the concentration of a treatment required to induce a response that is 50% of the maximal response.
• OneTailedZfactor: This measure is an attempt to overcome a limitation of the original Z’-factor formulation (it assumes a Gaussian distribution) and is informative for populations with moderate or high amounts of skewness. In these cases, long tails opposite to the mid-range point lead to a high standard deviation for either population, which results in a low Z’ factor even though the population means and samples between the means may be well-separated. Therefore, the one-tailed Z’ factor is calculated with the same formula but using only those samples that lie between the positive/negative population means. This is not yet a well established measure of assay robustness, and should be considered experimental.

For both Z’ and V factors, the highest possible value (best assay quality) is 1, and they can range into negative values (for assays where distinguishing between positive and negative controls is difficult or impossible). The Z’ factor is based only on positive and negative controls. The V factor is based on an entire dose-response curve rather than on the minimum and maximum responses. When there are only two doses in the assay (positive and negative controls only), the V factor will equal the Z’ factor.

Note that if the standard deviation of a measured feature is zero for a particular set of samples (e.g., all the positive controls), the Z’ and V factors will equal 1 despite the fact that the assay quality is poor. This can occur when there is only one sample at each dose. This also occurs for some non-informative measured features, like the number of cytoplasm compartments per cell, which is always equal to 1.
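The Z’ and V factor formulas given above can be tried out with a few lines of Python. This is a minimal sketch under stated assumptions, not the code used by the module: it uses the population standard deviation (`statistics.pstdev`), and for the V factor it takes μp and μn to be the largest and smallest per-dose means. The function names and sample values are made up for illustration.

```python
from statistics import mean, pstdev


def z_factor(positive, negative):
    """Z' = 1 - 3 * (sd_p + sd_n) / |mean_p - mean_n|."""
    spread = 3 * (pstdev(positive) + pstdev(negative))
    # Raises ZeroDivisionError if the two control means are identical.
    return 1 - spread / abs(mean(positive) - mean(negative))


def v_factor(groups):
    """V = 1 - 6 * mean(sd) / |mean_p - mean_n|.

    `groups` holds one list of values per dose. Assumption: the extreme
    per-dose means play the role of the positive and negative controls.
    """
    means = [mean(g) for g in groups]
    sigmas = [pstdev(g) for g in groups]
    return 1 - 6 * mean(sigmas) / abs(max(means) - min(means))


positive = [10.0, 11.0, 9.0, 10.0]
negative = [2.0, 3.0, 1.0, 2.0]

z = z_factor(positive, negative)
# With only two doses, the V factor reduces to the Z' factor.
v = v_factor([negative, positive])

# One sample per dose gives zero standard deviation, so the factor is 1
# even though nothing is known about the assay's variability.
degenerate = z_factor([5.0], [1.0])
```

The last line reproduces the caveat in the note above: a zero standard deviation forces the result to 1 regardless of assay quality.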

This module can create MATLAB scripts that display the EC50 curves for each measurement. These scripts will require MATLAB and the statistics toolbox in order to run. See Create dose-response plots? below.

### References

• Z’ factor: Zhang JH, Chung TD, et al. (1999) “A simple statistical parameter for use in evaluation and validation of high throughput screening assays” J Biomolecular Screening 4(2): 67-73.
• V factor: Ravkin I (2004): Poster #P12024 - Quality Measures for Imaging-based Cellular Assays. Society for Biomolecular Screening Annual Meeting Abstracts.
• Code for the calculation of Z’ and V factors was kindly donated by Ilya Ravkin. Carlos Evangelista donated his copyrighted dose-response-related code.

## DisplayDataOnImage

DisplayDataOnImage produces an image with measured data on top of identified objects.

This module displays either a single image measurement on an image of your choosing, or one object measurement per object on top of every object in an image. The display itself is an image which you can save to a file using SaveImages.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | YES |

## DisplayDensityPlot

DisplayDensityPlot plots measurements as a two-dimensional density plot.

A density plot displays the relationship between two measurements (that is, features) but instead of showing each data point as a dot, as in a scatter plot, the data points are binned into an equally-spaced grid of points, where the color of each point in the grid represents the tabulated frequency of the measurements within that region of the grid. A density plot is also known as a 2-D histogram; in a conventional histogram the height of a bar indicates how many data points fall in that region. By contrast, in a density plot (2-D histogram), the color of a portion of the plot indicates the number of data points in that region.
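The binning step behind a density plot can be sketched without any plotting library. The function below is an illustrative assumption (its name, its arguments, and the choice to ignore out-of-range points are not taken from CellProfiler); it only shows how paired measurements are tabulated into an equally spaced grid of counts, which a plot would then map to colors.

```python
def bin_2d(xs, ys, n_bins, x_range, y_range):
    """Count (x, y) points in an n_bins x n_bins equally spaced grid.

    Returns counts[row][col], where rows follow y and columns follow x.
    """
    counts = [[0] * n_bins for _ in range(n_bins)]
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    for x, y in zip(xs, ys):
        # Points outside the plotted range are ignored in this sketch.
        if not (x_lo <= x <= x_hi and y_lo <= y <= y_hi):
            continue
        # Scale to a bin index; the upper edge falls into the last bin.
        col = min(int((x - x_lo) / (x_hi - x_lo) * n_bins), n_bins - 1)
        row = min(int((y - y_lo) / (y_hi - y_lo) * n_bins), n_bins - 1)
        counts[row][col] += 1
    return counts


grid = bin_2d(
    xs=[0.1, 0.2, 0.9, 1.0],
    ys=[0.1, 0.15, 0.9, 1.0],
    n_bins=2,
    x_range=(0.0, 1.0),
    y_range=(0.0, 1.0),
)
```

Each entry of `grid` is the tabulated frequency for one cell of the plot; a conventional histogram is the one-dimensional version of the same counting.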

The module shows the values generated for the current cycle. However, this module can also be run as a Data Tool, in which case you will first be asked for the output file produced by the analysis run. The resulting plot is created from all the measurements collected during the run.

At this time, the display produced when DisplayDensityPlot is run as a module cannot be saved in the pipeline (e.g., by using SaveImages). The display can be saved manually by selecting the window produced by the module and clicking the Save icon in its menu bar or by choosing File > Save from CellProfiler’s main menu bar.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

## DisplayHistogram

DisplayHistogram plots a histogram of the desired measurement.

A histogram is a bar plot depicting frequencies of items in each data range. Here, each bar’s value is created by binning measurement data for a set of objects. A two-dimensional histogram can be created using the DisplayDensityPlot module.

The module shows the values generated for the current cycle. However, this module can also be run as a Data Tool, in which case you will first be asked for the output file produced by the analysis run. The resulting plot is created from all the measurements collected during the run.

At this time, the display produced when DisplayHistogram is run as a module cannot be saved in the pipeline (e.g., by using SaveImages). The display can be saved manually by selecting the window produced by the module and clicking the Save icon in its menu bar or by choosing File > Save from CellProfiler’s main menu bar.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

## DisplayPlatemap

DisplayPlatemap displays a desired measurement in a plate map view.

DisplayPlatemap is a tool for browsing image-based data laid out on multi-well plates common to high-throughput biological screens. The display window for this module shows a plate map with each well color-coded according to the measurement chosen.

As the pipeline runs, the measurement information displayed is updated, so the value shown for each well is current up to the image cycle currently being processed; wells that have no corresponding measurements as yet are shown as blank.
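The idea of a plate map that fills in as cycles complete can be sketched as a grid keyed by well name. Everything here is an assumption made for illustration: the function name, the 8 × 12 (96-well) default layout, and the well naming scheme of a row letter followed by a column number (for example `A01`).

```python
import string


def build_platemap(well_values, n_rows=8, n_cols=12):
    """Arrange per-well values on a plate grid.

    Wells without a measurement yet stay as None (shown as blank).
    """
    grid = [[None] * n_cols for _ in range(n_rows)]
    for well, value in well_values.items():
        # 'B03' -> row index 1 (letter B), column index 2 (number 3).
        row = string.ascii_uppercase.index(well[0].upper())
        col = int(well[1:]) - 1
        grid[row][col] = value
    return grid


# Only two wells have been processed so far.
plate = build_platemap({"A01": 0.5, "B03": 1.2})
```

Calling the function again with more wells after each cycle gives the "current up to this cycle" behavior described above; a display would then color each non-empty cell by its value.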

At this time, the display produced when DisplayPlatemap is run as a module cannot be saved in the pipeline (e.g., by using SaveImages). The display can be saved manually by selecting the window produced by the module and clicking the Save icon in its menu bar or by choosing File > Save from CellProfiler’s main menu bar.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

## DisplayScatterPlot

DisplayScatterPlot plots the values for two measurements.

A scatter plot displays the relationship between two measurements (that is, features) as a collection of points. If there are too many data points on the plot, you should consider using DisplayDensityPlot instead.

The module shows a plot of the values generated for the current cycle. However, this module can also be run as a Data Tool, in which case you will first be asked for the output file produced by the analysis run. The resulting plot is created from all the measurements collected during the run.

At this time, the display produced when DisplayScatterPlot is run as a module cannot be saved in the pipeline (e.g., by using SaveImages). The display can be saved manually by selecting the window produced by the module and clicking the Save icon in its menu bar or by choosing File > Save from CellProfiler’s main menu bar.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

## ExportToDatabase

ExportToDatabase exports data directly to a database or in a database-readable format, including a CellProfiler Analyst properties file, if desired.

This module exports measurements directly to a database or to a SQL-compatible format. It allows you to create MySQL and associated data files and import them into a database, and gives you the option of creating a properties file for use with CellProfiler Analyst. Optionally, you can create an SQLite database file if you do not have a server on which to run MySQL itself. This module must be run at the end of a pipeline, or second to last if you are using the CreateBatchFiles module. If you forget this module, you can also run the ExportToDatabase data tool (accessed from CellProfiler’s main menu) after processing is complete; its functionality is the same.

The database is set up with two primary tables. These tables are the Per_Image table and the Per_Object table (which may have a prefix if you specify):

• The Per_Image table consists of all the per-image measurements made during the pipeline, plus per-image population statistics (such as mean, median, and standard deviation) of the object measurements. There is one per_image row for every “cycle” that CellProfiler processes (a cycle is usually a single field of view, and a single cycle usually contains several image files, each representing a different channel of the same field of view).
• The Per_Object table contains all the measurements for individual objects. There is one row of object measurements per object identified. The two tables are connected with the primary key column ImageNumber, which indicates the image to which each object belongs. The Per_Object table has a second key column called ObjectNumber, which is unique within each image.
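The relationship between the two tables can be tried out with Python's built-in `sqlite3` module. This is a toy schema for illustration only: the measurement columns `Image_Count_Nuclei` and `Nuclei_AreaShape_Area` and the inserted values are made up, and the tables that ExportToDatabase actually writes contain many more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Per_Image (
    ImageNumber INTEGER PRIMARY KEY,
    Image_Count_Nuclei INTEGER
);
CREATE TABLE Per_Object (
    ImageNumber INTEGER,
    ObjectNumber INTEGER,
    Nuclei_AreaShape_Area REAL,
    -- ObjectNumber restarts for each image, so the pair identifies a row.
    PRIMARY KEY (ImageNumber, ObjectNumber)
);
""")

# One Per_Image row per cycle, one Per_Object row per identified object.
conn.executemany("INSERT INTO Per_Image VALUES (?, ?)", [(1, 2), (2, 1)])
conn.executemany(
    "INSERT INTO Per_Object VALUES (?, ?, ?)",
    [(1, 1, 120.0), (1, 2, 95.0), (2, 1, 130.0)],
)

# Join on ImageNumber to summarize the objects belonging to each image.
rows = conn.execute("""
SELECT i.ImageNumber, i.Image_Count_Nuclei, AVG(o.Nuclei_AreaShape_Area)
FROM Per_Image AS i
JOIN Per_Object AS o ON o.ImageNumber = i.ImageNumber
GROUP BY i.ImageNumber
ORDER BY i.ImageNumber
""").fetchall()
```

The join recomputes a per-image mean from the object rows, the same kind of population statistic that the Per_Image table stores alongside the per-image measurements.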

Typically, if multiple types of objects are identified and measured in a pipeline, the numbers of those objects are equal to each other. For example, in most pipelines, each nucleus has exactly one cytoplasm, so the first row of the Per_Object table contains all of the information about object #1, including both nucleus- and cytoplasm-related measurements. If this one-to-one correspondence is not the case for all objects in the pipeline (for example, if dozens of speckles are identified and measured for each nucleus), then you must configure ExportToDatabase to export only objects that maintain the one-to-one correspondence (for example, export only Nucleus and Cytoplasm, but omit Speckles).

If you have extracted “Plate” and “Well” metadata from image filenames or loaded “Plate” and “Well” metadata via the Metadata or LoadData modules, you can ask CellProfiler to create a “Per_Well” table, which aggregates object measurements across wells. This option will output a SQL file (regardless of whether you choose to write directly to the database) that can be used to create the Per_Well table. Note that the “Per_Well” mean/median/stdev values are only usable for database type MySQL (and CSV/MySQL), not SQLite.

To import these CellProfiler measurements into your database, type the following at the secure shell where you normally log in to MySQL, replacing hostname, username, databasename, and pathtoimages/perwellsetupfile.SQL with the values for your own database and files:

mysql -h hostname -u username -p databasename < pathtoimages/perwellsetupfile.SQL

The commands written by CellProfiler to create the Per_Well table will be executed. Oracle is not fully supported at present; you can create your own Oracle DB using the .csv output option and writing a simple script to upload to the database.

For details on the nomenclature used by CellProfiler for the exported measurements, see Help > General Help > How Measurements Are Named.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

## ExportToSpreadsheet

ExportToSpreadsheet exports measurements into one or more files that can be opened in Excel or other spreadsheet programs.

This module will convert the measurements to a comma-, tab-, or other character-delimited text format and save them to the hard drive in one or several files, as requested.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

### Using metadata tags for output

ExportToSpreadsheet can write out separate files for groups of images based on their metadata tags. This is controlled by the directory and file names that you enter. For instance, you might have applied two treatments to each of your samples and labeled them with the metadata names “Treatment1” and “Treatment2”, and you might want to create separate files for each combination of treatments, storing all measurements with a given “Treatment1” in separate directories. You can do this by specifying metadata tags for the folder name and file name:

• Choose “Elsewhere…” or “Default Input/Output Folder sub-folder” for the output file location. Note that regardless of your choice, the Experiment.csv is saved to the Default Input/Output Folder and not to individual subfolders. All other per-image and per-object .csv files are saved to the appropriate subfolders. See GitHub issue #1110 for details.

• Insert the metadata tag of choice into the output path. You can insert a previously defined metadata tag using any of the following:

• The insert key
• A right mouse button click inside the control
• In Windows, the Context menu key, which is between the Windows key and Ctrl key

The inserted metadata tag will appear in green. To change a previously inserted metadata tag, navigate the cursor to just before the tag and either:

• Use the up and down arrows to cycle through possible values.
• Right-click on the tag to display and select the available values.

In this instance, you would select the metadata tag “Treatment1”.

• Uncheck “Export all measurements?”

• Uncheck “Use the object name for the file name?”

• Using the same approach as above, select the metadata tag “Treatment2”, and complete the filename by appending the text “.csv”.

Here’s an example table of the files that would be generated:

| Treatment1 | Treatment2 | Path |
| --- | --- | --- |
| 1M_NaCl | 20uM_DMSO | 1M_NaCl/20uM_DMSO.csv |
| 1M_NaCl | 40uM_DMSO | 1M_NaCl/40uM_DMSO.csv |
| 2M_NaCl | 20uM_DMSO | 2M_NaCl/20uM_DMSO.csv |
| 2M_NaCl | 40uM_DMSO | 2M_NaCl/40uM_DMSO.csv |
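The way metadata values turn into output paths can be mimicked with ordinary string formatting. The sketch below is only an analogy: the `{Treatment1}/{Treatment2}.csv` template uses Python's `str.format` placeholders, which is not the tag syntax CellProfiler uses in its settings, and the function name and image numbers are hypothetical.

```python
from collections import defaultdict


def group_by_output_path(image_metadata, template="{Treatment1}/{Treatment2}.csv"):
    """Map each output path to the image numbers written to it."""
    groups = defaultdict(list)
    for image_number, tags in image_metadata.items():
        # Substitute this image's metadata values into the path template.
        groups[template.format(**tags)].append(image_number)
    return dict(groups)


image_metadata = {
    1: {"Treatment1": "1M_NaCl", "Treatment2": "20uM_DMSO"},
    2: {"Treatment1": "1M_NaCl", "Treatment2": "20uM_DMSO"},
    3: {"Treatment1": "2M_NaCl", "Treatment2": "40uM_DMSO"},
}
paths = group_by_output_path(image_metadata)
```

Images that share both treatment values land in the same file, and every distinct “Treatment1” value gets its own folder.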

### Measurements made by this module

For details on the nomenclature used by CellProfiler for the exported measurements, see Help > General Help > How Measurements Are Named.

## FlagImage

FlagImage allows you to flag an image based on properties that you specify, for example, quality control measurements.

This module allows you to assign a flag if an image meets certain measurement criteria that you specify (for example, if the image fails a quality control measurement). The value of the flag is 1 if the image meets the selected criteria (for example, if it fails QC), and 0 if it does not meet the criteria (if it passes QC).

The flag can be used in post-processing to filter out images you do not want to analyze, e.g., in CellProfiler Analyst. In addition, you can use ExportToSpreadsheet to generate a file that includes the flag as a metadata measurement associated with the images. The Metadata module can then use this flag to put images that pass QC into one group and images that fail into another.

A flag can be based on one or more measurements. If you create a flag based on more than one measurement, you can choose between setting the flag if all measurements are outside the bounds or if one of the measurements is outside of the bounds. This module must be placed in the pipeline after the relevant measurement modules upon which the flags are based.
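The two ways of combining measurements into one flag can be written out as a small function. This is a sketch under assumptions: the function name, the `combine` argument, and the measurement names `FocusScore` and `PercentSaturated` are invented for the example and do not correspond to settings or features in CellProfiler.

```python
def flag_image(measurements, bounds, combine="any"):
    """Return 1 if the image should be flagged, otherwise 0.

    bounds maps a measurement name to its allowed (low, high) range.
    combine="any": flag if any measurement is outside its bounds.
    combine="all": flag only if every measurement is outside its bounds.
    """
    outside = []
    for name, (low, high) in bounds.items():
        value = measurements[name]
        outside.append(value < low or value > high)
    hit = all(outside) if combine == "all" else any(outside)
    return 1 if hit else 0


measurements = {"FocusScore": 0.2, "PercentSaturated": 0.01}
bounds = {"FocusScore": (0.5, 1.0), "PercentSaturated": (0.0, 0.05)}

flag_any = flag_image(measurements, bounds, combine="any")
flag_all = flag_image(measurements, bounds, combine="all")
```

Here only the focus measurement is out of range, so the image is flagged under the "any" rule but not under the "all" rule.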

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |

## MergeOutputFiles

MergeOutputFiles merges several output .mat files into one.

This data tool lets you collect the output .mat files from several runs, for instance, as might be created by running CellProfiler in batch mode. To save .mat files, click the View output settings at the lower left of CellProfiler’s main menu and follow the instructions there to save MATLAB output files.

MergeOutputFiles is a pure data tool; you cannot use it as a module, and it will generate an error if you try to do so. To use it as a data tool, choose it from the Data Tools menu to bring up the MergeOutputFiles dialog.

The dialog has the following parts:

• Destination file: This is the name of the file that will be created. The file will contain all merged input data files in MATLAB format.
• File list: The file list is the box with the columns, “Folder” and “File”. It will be empty until you add files using the “Add…” button. Measurement files are written out to the destination file in the order they appear in this list. You can select multiple files in this box to move them up or down or to remove them.
• Add button: Brings up a file chooser when you press it. You can select multiple files from the file chooser and they will be added in alphabetical order to the bottom of the current list of files.
• Remove button: Removes all currently selected files from the list.
• Up button: Moves the currently selected files up in the list.
• Down button: Moves the currently selected files down in the list.
• OK button: Accepts the file list and writes it to the output.
• Cancel button: Closes the dialog without performing any operation.

Once merged, this output file will be compatible with other data tools. Output files can be quite large, so before merging, make sure that the merged output file will be small enough to open on your computer (based on the amount of memory available). It may be preferable instead to import data from individual output files directly into a database using ExportToDatabase as a data tool.

| Supports 2D? | Supports 3D? | Respects masks? |
| --- | --- | --- |
| YES | NO | NO |