Metadata

The Metadata module connects information about the images (i.e., metadata) to your list of images for processing in CellProfiler.

The Metadata module allows you to extract and associate metadata with your images. The metadata can be extracted from the image file itself, from a part of the file name or location, and/or from a text file you provide.

What is "metadata"?

The term metadata refers to "data about data." For many assays, metadata is important in the context of tagging images with various attributes, which can include (but is not limited to) items such as the following:

The row and column of the microtiter plate that the image was acquired from.
The experimental treatment applied to the well that the image was acquired from.
The number of timepoints or channels contained in the image file.
The image type, i.e., RGB, indexed or separate channels.
The height and width of an image, in pixels.
Etc.

It can be helpful to inform CellProfiler about certain metadata in order to define a specific relationship between the images and the associated metadata. For instance:

You want images with a common tag to be matched together so they are processed together during the pipeline run. E.g., the filenames for fluorescent DAPI and GFP images contain different tags indicating the wavelength but share '_s1' in the filename if they were acquired from site #1, '_s2' from site #2, and so on.
You want certain information attached to the output measurements and filenames for annotation or sample-tracking purposes. E.g., some images are to be identified as acquired from DMSO treated wells, whereas others were collected from wells treated with Compound 1, 2,... and so forth.

The underlying assumption in matching metadata values to image sets is that there is an exact pairing (i.e., a one-to-one match) for a given combination of metadata tags. A common example is that for a two-channel microtiter plate assay, the values of the plate, well, and site tags from one channel get matched uniquely to the plate, well, and site tag values from the other channel.
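The one-to-one pairing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CellProfiler's own matching code; the file names, the underscore-separated layout, and the helper name tags are assumptions made for the example.

```python
# Hypothetical file names for a two-channel assay (w1 = DAPI, w2 = GFP).
dapi_files = ["plate1_A01_s1_w1.tif", "plate1_A01_s2_w1.tif"]
gfp_files = ["plate1_A01_s2_w2.tif", "plate1_A01_s1_w2.tif"]


def tags(filename):
    """Return the (plate, well, site) key parsed from a hypothetical file name."""
    plate, well, site, _channel = filename.rsplit(".", 1)[0].split("_")
    return plate, well, site


# Index one channel by its key; a repeated key would break the one-to-one pairing.
gfp_by_key = {}
for name in gfp_files:
    key = tags(name)
    if key in gfp_by_key:
        raise ValueError(f"duplicate metadata key {key}")
    gfp_by_key[key] = name

# Pair each DAPI image with the GFP image that carries the same key.
image_sets = [(name, gfp_by_key[tags(name)]) for name in dapi_files]
print(image_sets)
```

Each DAPI file ends up paired with exactly one GFP file, regardless of the order in which the files were listed.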

What are the inputs?

If you do not have metadata that is relevant to your analysis, you can leave this module in the default setting and continue on to the NamesAndTypes module. If you do have relevant metadata, the Metadata module receives the file list produced by the Images module. It then associates information with each file in the file list. This information can be obtained from several sources:

From the image file name or location (e.g., as assigned by a microscope). In this case, you will provide the text search pattern to obtain this information.
In a text file created and filled out by you or a laboratory information management system. In this case, you will point the module to the location of this file.
In the image file itself.

What do the settings mean?

See below for help on the individual settings. In general, the settings control the various forms of metadata extraction. You can extract metadata from all images from the Images module or from a subset of them by using rules to filter the list.

What do I get as output?

The final product of the Metadata module is a list of files from the Images module, accompanied by the associated metadata retrieved from the source(s) provided and matched to the desired images.

As you are extracting metadata from your various sources, you can click the "Update" button below the divider to display a table of results using the current settings. Each row corresponds to an image file from the Images module, and the columns display the metadata obtained for each tag specified. You can press this button as many times as needed to display the most current metadata obtained.

Some downstream use cases for metadata include the following:

If the metadata establishes how channels are related to one another, you can use it in the NamesAndTypes module to aid in creating an image set.
If the images need to be further sub-divided into groups of images that share a common metadata value, the Groups module can be used to specify which metadata is needed for this purpose.
You can also reference metadata values in later modules. Since the metadata is stored as an image measurement and can be assigned as an integer or floating-point number, any module which allows measurements as input can make use of it.
Several modules are also capable of using metadata for more specific purposes. Refer to the module setting help for additional information on how to use them in the context of the specific module.

If the metadata originates from an external source such as a CSV, there are some caveats in the cases when metadata is either missing or duplicated for the referenced images; see the NamesAndTypes module for more details.

Available measurements

Metadata: The prefix of each metadata tag in the per-image table.

Settings:

Extract metadata?

Select Yes if your file or path names or file headers contain information (i.e., metadata) you would like to extract and store along with your measurements. See the main module help for more details.

Metadata data type

Metadata can be stored as either a text or numeric value:

Text: Save all metadata items as text.
Choose for each: Choose the data type separately for each metadata entry. This approach is necessary when, for example, a whole file name is captured as metadata but the file name is numeric, e.g., "0001101". If the file name needs to be used for an arithmetic calculation or as an index, it must be converted to a number, and you would select "Integer" as the data type. On the other hand, if it is important that the leading zeroes be retained, setting the type to an integer would strip them upon conversion to a number. In this case, storing the metadata values as "Text" would be more appropriate.
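The trade-off between the two data types can be seen with plain Python conversions. This is a sketch of the general behavior of text versus integer values, not of how CellProfiler stores measurements internally.

```python
raw_value = "0001101"  # metadata captured from a numeric file name

as_text = raw_value          # "Text": stored unchanged, leading zeroes kept
as_integer = int(raw_value)  # "Integer": usable in arithmetic

print(as_text)
print(as_integer + 1)                # arithmetic works on the integer form
print(str(as_integer) == raw_value)  # the zeroes cannot be recovered
```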

Metadata types

(Used only when Choose for each is selected for the metadata data type)
This setting determines the data type of each metadata field when stored as a measurement.

Text: Save the metadata as text.
Integer: Save the metadata as an integer.
Float: Save the metadata as a decimal number.
None: Do not save the metadata as a measurement.

Metadata extraction method

Metadata can be stored in either or both of two ways:

Internally: The metadata is encoded in the file naming, the directory structure, or the file header information.
Externally: The metadata is kept in an external index, such as a spreadsheet or database of some kind.

The Metadata module can extract internal or external metadata from the images in any of three ways:

Extract from file/folder names: This approach retrieves information based on the file nomenclature and/or location. A special syntax called "regular expressions" is used to match text patterns in the file name or path, and then assign this text as metadata for the images you specify. The tag for each metadata is assigned a name that is meaningful to you.

When would you want to use this option? If you want to take advantage of the fact that acquisition software often automatically assigns a regular nomenclature to the filenames or the containing folders. Alternately, the researcher acquiring the images may also have a specific nomenclature they adhere to for bookkeeping purposes.
Import from file: This option retrieves metadata from a comma-delimited file (known as a CSV file, for comma-separated values) of information; you will be prompted to specify the location of the CSV file. You can create such a file using a spreadsheet program such as Microsoft Excel.

When would you want to use this option? You have information curated in software that allows for export to a spreadsheet. This is commonly the case for laboratories that use data management systems that track samples and acquisition.
Extract from image file headers: This option retrieves information from the internal structure of the file format itself. Typically, image metadata is embedded in the image file as header information; this information includes the dimensions and color depth among other things. If you select this method, press the "Update metadata" button to extract the metadata. Note that this extraction process can take a while for assays with lots of images since each one needs to be read for extraction. Since the metadata is often image-format specific, this option will extract information that is common to most image types:
- Series: The series index of the image. This value is set to "None" if not applicable. Some image formats can store more than one stack in a single file; for those, the Series value for each stack in the file will be different.
- Frame: The frame index of the image. This value is set to "None" if not applicable. For stack frames and movies, this is the frame number for an individual 2-D image slice.
- ColorFormat: Set to "Monochrome" for grayscale images, "RGB" for color.
- SizeZ: The number of image slices. Typically has a value > 1 for confocal stacks and the like.
- SizeT: The number of image frames. Typically has a value > 1 for movies.
- SizeC: The number of color channels. Typically has a value > 1 for non-grayscale images and for confocal stacks containing channel images acquired using different filters and illumination sources.
When would you want to use this option? You want to analyze images that are contained as file stacks, i.e., the images that are related to each other in some way, such as by time (temporal), space (spatial), or color (spectral).

Specifics on the metadata extraction options are described below. Any or all of these options may be used at a time; press the "Add another extraction method" button to add more.

Metadata source

You can extract the metadata from the image's file name or from its folder name.

Regular expression

(Used only if you want to extract metadata from the file name)
The regular expression to extract the metadata from the file name is entered here. Note that this field is available whether you have selected Text-Regular expressions to load the files or not. Please see the general module help for more information on construction of a regular expression.

Clicking the magnifying glass icon to the right will bring up a tool for checking the accuracy of your regular expression. The regular expression syntax can be used to name different parts of your expression. The syntax (?P<fieldname>expr) will extract whatever matches expr and assign it to the measurement, fieldname, for the image.

For instance, a researcher uses plate names composed of a string of letters and numbers, followed by an underscore, then the well, followed by another underscore, followed by an "s" and a digit representing the site taken within the well (e.g., TE12345_A05_s1.tif). The following regular expression will capture the plate, well, and site in the fields "Plate", "Well", and "Site":

^(?P<Plate>.*)_(?P<Well>[A-P][0-9]{1,2})_s(?P<Site>[0-9])

^ Start only at beginning of the file name

(?P<Plate> Name the captured field Plate

.* Capture as many characters as follow

_ Discard the underbar separating plate from well

(?P<Well> Name the captured field Well

[A-P] Capture exactly one letter between A and P

[0-9]{1,2} Capture one or two digits that follow

_s Discard the underbar followed by s separating well from site

(?P<Site> Name the captured field Site

[0-9] Capture one digit following

The regular expression can be typed in the upper text box, with a sample file name given in the lower text box. Provided the syntax is correct, the corresponding fields will be highlighted in the same color in the two boxes. Press Submit to enter the typed regular expression.
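The named-group syntax shown above is the same one used by Python's re module, so the example expression can be checked outside CellProfiler. The snippet below is a minimal sketch using the sample file name from the text.

```python
import re

# The file-name expression from the example above.
pattern = re.compile(r"^(?P<Plate>.*)_(?P<Well>[A-P][0-9]{1,2})_s(?P<Site>[0-9])")

match = pattern.search("TE12345_A05_s1.tif")
fields = match.groupdict() if match else {}
print(fields)
```

Each named group becomes one entry in the resulting dictionary, keyed by the field name.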

You can create metadata tags for any portion of the filename or path, but if you are specifying metadata for multiple images, an image cycle can only have one set of values for each metadata tag. This means that you can only specify the metadata tags which have the same value across all images listed in the module. For example, in the example above, you might load two wavelengths of data, one named TE12345_A05_s1_w1.tif and the other TE12345_A05_s1_w2.tif, where the number following the w is the wavelength. In this case, a "Wavelength" tag should not be included in the regular expression because while the "Plate", "Well" and "Site" metadata is identical for both images, the wavelength metadata is not.

Note that if you use the special fieldnames <WellColumn> and <WellRow> together, CellProfiler will automatically create a <Well> metadata field by joining the two fieldname values together. For example, if <WellRow> is "A" and <WellColumn> is "01", a field <Well> will be "A01". This is useful if your well row and column names are separated from each other in the filename, but you want to retain the standard well nomenclature.
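The joining of the two fields can be illustrated as follows. The file name layout (with "_r" and "_c" markers) and the regular expression are hypothetical, and the explicit concatenation only mimics the behavior described above rather than reproducing CellProfiler's code.

```python
import re

# Hypothetical file name in which the well row and column are separated.
pattern = re.compile(
    r"^(?P<Plate>.*)_r(?P<WellRow>[A-P])_c(?P<WellColumn>[0-9]{2})"
)
fields = pattern.search("TE12345_rA_c01_s1.tif").groupdict()

# Join row and column into the standard well name, as the text describes.
if "WellRow" in fields and "WellColumn" in fields:
    fields["Well"] = fields["WellRow"] + fields["WellColumn"]

print(fields["Well"])
```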

Regular expression

(Used only if you want to extract metadata from the path)
Enter the regular expression for extracting the metadata from the path. Note that this field is available whether you have selected Text-Regular expressions to load the files or not.

Clicking the magnifying glass icon to the right will bring up a tool that will allow you to check the accuracy of your regular expression. The regular expression syntax can be used to name different parts of your expression. The syntax (?P<fieldname>expr) will extract whatever matches expr and assign it to the image's fieldname measurement.

For instance, a researcher uses folder names with the date and subfolders containing the images with the run ID (e.g., ./2009_10_02/1234/). The following regular expression will capture the date and run ID in the fields Date and Run:

.*[\\/](?P<Date>.*)[\\/](?P<Run>.*)$

.*[\\/] Skip characters at the beginning of the pathname until either a slash (/) or backslash (\) is encountered (depending on the operating system)

(?P<Date> Name the captured field Date

.* Capture as many characters as follow

[\\/] Discard the slash/backslash character

(?P<Run> Name the captured field Run

.* Capture as many characters as follow

$ The Run field must be at the end of the path string, i.e., the last folder on the path. This also means that the Date field contains the parent folder of the Run folder.
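The path expression can be tried with Python's re module on both slash styles. The paths below are illustrative; the second one is a hypothetical Windows location. Note the assumption that the path string has no trailing separator: if it ends in a slash, the Run group matches an empty string and the last folder is captured as Date instead.

```python
import re

# The path expression from the example above.
pattern = re.compile(r".*[\\/](?P<Date>.*)[\\/](?P<Run>.*)$")

results = {}
for path in ("./2009_10_02/1234", r"C:\images\2009_10_02\1234"):
    results[path] = pattern.search(path).groupdict()
    print(path, results[path])

# With a trailing slash, Run is empty and Date holds the last folder name.
trailing = pattern.search("./2009_10_02/1234/").groupdict()
print(trailing)
```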

Extract metadata from

Select whether you want to extract metadata from all of the images chosen by the Images module or a subset of the images.

This setting controls how different image types (e.g., an image of the GFP stain and a brightfield image) have different metadata extracted. There are two choices:

All images: Extract metadata from all images specified in Images. This is the simplest choice and the appropriate one if you have only one kind of image (or only one image). CellProfiler will extract metadata from all images using the same method per iteration.
Images matching a rule: Extract metadata depending on specific file attributes. This is the appropriate choice if more than one image was taken of each imaging site. You can specify distinctive criteria for each image subset with matching metadata.

Select the filtering criteria

Select Yes to display and use rules to select files for metadata extraction.

Clicking the rule menus shows you all the file attributes, operators and conditions you can specify to narrow down the image list.

For each rule, first select the attribute that the rule is to be based on. For example, you can select "File" to define a rule that will filter files on the basis of their filename.
The operator drop-down is then updated with operators applicable to the attribute you selected. For example, if you select "File" as the attribute, the operator menu includes text operators such as Contain or Starts with. On the other hand, if you select "Extension" as the attribute, you can choose the logical operators "Is" or "Is not" from the menu.
In the operator drop-down menu, select the operator you want to use. For example, if you want to match data exactly, you may want the "Exactly match" or the "Is" operator. If you want the condition to be looser, select an operator such as "Contains".
Use the condition box to type the condition you want to match. The more you type, the more specific the condition is.
- As an example, if you create a new filter and select File as the attribute, then select "Does" and "Contain" as the operators, and type "Channel" as the condition, the filter finds all files that include the text "Channel", such as "Channel1.tif", "Channel2.jpg", "1-Channel-A01.BMP" and so on.
- If you select "Does" and "Start with" as the operators and type "Channel1" in the Condition box, the rule will include such files as "Channel1.tif", "Channel1-A01.png", and so on.

To add another rule, click the plus buttons to the right of each rule. Subtract an existing rule by clicking the minus button.

You can also link a set of rules by choosing the logical expression All or Any. If you use the All logical expression, all the rules must be true for a file to be included in the file list. If you use the Any option, only one of the conditions has to be met for a file to be included.

If you want to create more complex rules (e.g., some criteria matching all rules and others matching any), you can create sets of rules by clicking the ellipsis button (to the right of the plus button). Repeat the above steps to add more rules to the filter until you have all the conditions you want to include.
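The All and Any options behave like Python's built-in all() and any() applied to a list of rules. The sketch below models each rule as a small predicate; the helper names and the file list are hypothetical, and the exact matching behavior inside CellProfiler (for example, case handling of extensions) may differ.

```python
import os

files = ["Channel1.tif", "Channel2.jpg", "1-Channel-A01.BMP", "notes.txt"]


def file_contains(text):
    """Rule: File / Does / Contain."""
    return lambda name: text in name


def extension_is(ext):
    """Rule: Extension / Is (exact, case-sensitive comparison assumed)."""
    return lambda name: os.path.splitext(name)[1] == "." + ext


rules = [file_contains("Channel"), extension_is("tif")]

# "All": every rule must hold. "Any": one rule is enough.
match_all = [f for f in files if all(rule(f) for rule in rules)]
match_any = [f for f in files if any(rule(f) for rule in rules)]

print(match_all)
print(match_any)
```

A nested set of rules would correspond to one predicate that itself combines several sub-rules with all() or any().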

Details on regular expressions

A regular expression is a general term referring to a method of searching for pattern matches in text. Regular expressions have a steep learning curve, but they are quite powerful once you understand the basics.

Patterns are specified using combinations of metacharacters and literal characters. There are a few classes of metacharacters, partially listed below. Some helpful links follow:

A more extensive explanation of regular expressions can be found here
A helpful quick reference can be found here
Pythex provides a quick way to test your regular expressions. Here is an example to capture information from a common microscope nomenclature.

The following metacharacters match exactly one character from its respective set of characters:

Metacharacter Meaning

. Any character

[] Any character contained within the brackets

[^] Any character not contained within the brackets

\w A word character [a-z_A-Z0-9]

\W Not a word character [^a-z_A-Z0-9]

\d A digit [0-9]

\D Not a digit [^0-9]

\s Whitespace [ \t\r\n\f\v]

\S Not whitespace [^ \t\r\n\f\v]

The following metacharacters are used to logically group subexpressions or to specify context for a position in the match. These metacharacters do not match any characters in the string:

Metacharacter Meaning

( ) Group subexpression

| Match subexpression before or after the |

^ Match expression at the start of string

$ Match expression at the end of string

\< Match expression at the start of a word

\> Match expression at the end of a word

The following metacharacters specify the number of times the previous metacharacter or grouped subexpression may be matched:

Metacharacter Meaning

* Match zero or more occurrences

+ Match one or more occurrences

? Match zero or one occurrence

{n,m} Match between n and m occurrences

Characters that are not special metacharacters are all treated literally in a match. To match a character that is a special metacharacter, escape that character with a '\'. For example '.' matches any character, so to match a '.' specifically, use '\.' in your pattern. Examples:

[trm]ail matches 'tail' or 'rail' or 'mail'.
[0-9] matches any digit from 0 to 9.
[^Q-S] matches any character other than 'Q' or 'R' or 'S'.
[\[\]A-Z] matches any uppercase letter or either square bracket.
[ag-i-9] matches characters 'a' or 'g' or 'h' or 'i' or '-' or '9'.
[a-p]* matches '' or 'a' or 'aab' or 'p' etc.
[a-p]+ matches 'a' or 'abc' or 'p' etc.
[^0-9] matches any single character that is not a digit.
^[0-9]*$ matches either a blank string or a natural number.
^-[0-9]+$|^\+?[0-9]+$ matches any integer.
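Several of the examples above can be verified with Python's re module. The snippet is a small self-check, using re.fullmatch so that the whole test string must match; the helper name full and the chosen test strings are illustrative.

```python
import re


def full(pattern, text):
    """Return True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


integer = r"^-[0-9]+$|^\+?[0-9]+$"

# (pattern, text, expected result)
checks = [
    (r"[trm]ail", "mail", True),
    (r"[trm]ail", "sail", False),
    (r"[^Q-S]", "R", False),
    (r"[^Q-S]", "T", True),
    (r"[a-p]*", "", True),
    (r"[a-p]+", "", False),
    (r"^[0-9]*$", "", True),
    (integer, "-42", True),
    (integer, "+7", True),
    (integer, "4.2", False),
    (r"\.", "a", False),  # escaped dot matches only a literal '.'
    (r".", "a", True),    # unescaped dot matches any character
]

results = [full(pattern, text) for pattern, text, _ in checks]
print(results == [expected for _, _, expected in checks])
```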

Metadata file location

The file containing the metadata must be a comma-delimited file (CSV). You can create or edit such a file using a spreadsheet program such as Microsoft Excel.

The CSV file needs to conform to the following format:

Each column describes one type of metadata.
Each row describes the metadata for one image site.
The column headers are uniquely named. You can optionally prepend "Metadata_" to the header name in order to ensure that it is interpreted correctly.
The CSV must be plain text, i.e., without hidden file encoding information. If using Excel on a Mac to edit the file, choose to save the file as "Windows CSV" or "Windows Comma Separated".
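A file that follows these rules can be read with Python's standard csv module. The column names and values below are hypothetical examples (the treatments echo those mentioned earlier in this help), and the file content is held in a string so the sketch is self-contained.

```python
import csv
import io

# Hypothetical CSV content: one column per metadata type, one row per site.
csv_text = """Metadata_Well,Metadata_Treatment,Metadata_Dose
A01,DMSO,0
A02,Compound 1,10
"""

reader = csv.DictReader(io.StringIO(csv_text))
headers = reader.fieldnames

# Column headers must be uniquely named.
headers_unique = len(headers) == len(set(headers))

rows = list(reader)
print(headers, headers_unique)
print(rows[1]["Metadata_Treatment"])
```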

Match file and image metadata

Match columns in your CSV file to image metadata items. If you are using a CSV in conjunction with the filename/path metadata matching, you might want to capture the metadata in common with both sources. For example, you might be extracting the well tag from the image filename while your CSV contains treatment dosage information paired with each well. Therefore, you would want to let CellProfiler know that the well tag extracted from the image filename and the well tag noted in the CSV are in fact one and the same.

This setting controls how rows in your CSV file are matched to different images. Set the drop-downs to pair the metadata tags of the images and the CSV, such that each row contains the corresponding tags. This can be done for as many metadata correspondences as you may have for each source; press to add more rows.
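Conceptually, pairing a metadata tag from the images with a CSV column works like a key lookup. The following sketch joins hypothetical image metadata with hypothetical CSV rows on a shared Well value; it illustrates the idea only and is not CellProfiler's matching code.

```python
# Metadata extracted from file names (hypothetical values).
image_metadata = [
    {"FileName": "TE12345_A05_s1.tif", "Well": "A05"},
    {"FileName": "TE12345_B07_s1.tif", "Well": "B07"},
]

# Rows read from the CSV (hypothetical values).
csv_rows = [
    {"Well": "A05", "Dose": "10"},
    {"Well": "B07", "Dose": "0"},
]

# Index the CSV rows by the shared tag, then attach them to each image.
csv_by_well = {row["Well"]: row for row in csv_rows}
merged = [
    {**image, **csv_by_well.get(image["Well"], {})} for image in image_metadata
]

for record in merged:
    print(record)
```

An image whose Well value has no CSV row simply keeps its own metadata, since the lookup falls back to an empty dictionary.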

Use case insensitive matching?

This setting controls whether row matching takes the metadata case into account when matching. If you note that your CSV metadata is not being applied, your choice on this setting may be the culprit.

Select No so that metadata entries that only differ by case (for instance, "A01" and "a01") will not match.

Select Yes to match metadata entries that only differ by case.
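The effect of this setting can be pictured as a choice between two string comparisons. The function name keys_match is hypothetical, and the use of casefold() is an assumption about how a case-insensitive comparison might be done.

```python
def keys_match(image_value, csv_value, case_insensitive):
    """Compare two metadata values with or without regard to case."""
    if case_insensitive:
        return image_value.casefold() == csv_value.casefold()
    return image_value == csv_value


print(keys_match("A01", "a01", case_insensitive=False))
print(keys_match("A01", "a01", case_insensitive=True))
```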
