Parameter Files

Phosphomatics uses a yaml-style parameter file to configure initial data processing and track the creation of phosphorylation site data groups. The parameter file can be opened and edited in any plain text editor, or, alternatively, in python using the yaml library. An example parameter data file can be downloaded here and the text is copied at the end of this page.

Structure

The minimal parameter file that can be used for initial processing of data consists of five blocks; sampleAlias, columnAssignments, sampleMaps, comparisons and processing. The order of these blocks is not important.

Sample Alias Block

The sampleAlias block allows you to specify an alias for your sample names. This can be useful since many proteomics search software label these columns with the raw mass spectrometry data file name which is frequently long and verbose. Here, we alias columns in the original input files (‘QUANT_CTRL’, ‘QUANT_CTRL_1’…) to ‘CTRL’, ‘CTRL_1’ which removes the unnecessary ‘QUANT_’ prefix:

sampleAlias:
  QUANT_CTRL: CTRL
  QUANT_CTRL_1: CTRL_1
  QUANT_CTRL_2: CTRL_2
  QUANT_THZ1: THZ1
  QUANT_THZ1_1: THZ1_1
  QUANT_THZ1_2: THZ1_2
  QUANT_U0126: U0126
  QUANT_U0126_1: U0126_1
  QUANT_U0126_2: U0126_2

Note

After a sample alias has been set, the alias must be used at each subsequent point in the parameter file.

Column Assignments Block

The column assignments block contains information about how individual columns from a larger data file are to be utilised for analysis. This section consists of 7 sub sections.

We must specify which columns are to be used for protein identifiers, phosphorylation site and residue specification as well as quantitative data. An example of these sections is below:

upidColumn: <COLUMN_FOR_PROTEIN_UNIPROT_ID>
residueColumn: <COLUMN_FOR_PHOSPHORYLATED_RESIDUE_(S/T/Y)>
siteColumn: <COLUMN_FOR_PHOSPHORYLATION_POSITION_IN_PROTEIN>
quantColumns:
- <COLUMN_FOR_SAMPLE_1_QUANT_DATA>
- <COLUMN_FOR_SAMPLE_2_QUANT_DATA>
  ...
- <COLUMN_FOR_SAMPLE_X_QUANT_DATA>

For example, using the phosphomatics example data, this columnAssignments block would be:

columnAssignments
  upidColumn: ID
  residueColumn: Residue
  siteColumn: Position
  quantColumns:
  - CTRL
  - CTRL_1
  - CTRL_2
  - THZ1
  - THZ1_1
  - THZ1_2
  - U0126
  - U0126_1
  - U0126_2

Sample Map Block

The next section allows us to specify which samples correspond to which treatment groups. Here, keys are sample aliases and values are treatment groups. We’ve created three treatment groups called CTRL, THZ1 and U0126:

sampleGroupMap:
  CTRL: CTRL
  CTRL_1: CTRL
  CTRL_2: CTRL
  THZ1: THZ1
  THZ1_1: THZ1
  THZ1_2: THZ1
  U0126: U0126
  U0126_1: U0126
  U0126_2: U0126

The last section allows us to control the order in which data is presented. For example, with time series data, we usually want to plot/tabulate data in order of increasing time post-treatment. In the block below, indices can be entered beside individual files (1,2,3…) and the data will then be displayed with the specified order. The sample indexed 1 will be presented left most and the highest index will be presented right-most.:

sampleIndexMap:
  CTRL: 1
  CTRL_1: 2
  CTRL_2: 3
  THZ1: 4
  THZ1_1: 5
  THZ1_2: 6
  U0126: 7
  U0126_1: 8
  U0126_2: 9

The final sampleMaps block would be:

sampleMaps:
  sampleGroupMap:
    CTRL: CTRL
    CTRL_1: CTRL
    CTRL_2: CTRL
    THZ1: THZ1
    THZ1_1: THZ1
    THZ1_2: THZ1
    U0126: U0126
    U0126_1: U0126
    U0126_2: U0126
  sampleIndexMap:
    CTRL: 1
    CTRL_1: 2
    CTRL_2: 3
    THZ1: 4
    THZ1_1: 5
    THZ1_2: 6
    U0126: 7
    U0126_1: 8
    U0126_2: 9

Comparisons Block

The comparison block allows you to pre-define group comparisons. For each comparison, phosphomatics will conduct t-tests and phosphorylation sites that meet specified fold-change and p-value thresholds will be placed into new data groups.

To define a group comparison:

comparisons:
- foldChangeThreshold: '1'
  group1: CTRL
  group2: THZ1
  name: CTRL_THZ1
  pvalThreshold: '2'
- foldChangeThreshold: '1'
  group1: CTRL
  group2: U0126
  name: CTRL_U0126
  pvalThreshold: '2'

Here, we’ve defined two separate group comparisons: THZ1 is compared to CTRL and U0126 is compared to CTRL. The name parameter is used to set the name of the data group into which differentially abundant phosphorylation sites will be placed. The foldChangeThreshold and pvalThreshold are used to set the fold change and p value cutoffs, respectively.

Processing Block

The processing block defines the quantitative data filtering and pre-processing steps that will be conducted prior to statistical analysis.

Filtering

The filtering block allows you to remove phosphorylation sites with too great a proportion of missing values and those phosphorylation sites that are assigned to undesired proteins such as decoys or contaminants.

An example of a filtering block is given below:

filtering:
  doFiltering: 'true'
  filterTerms:
  - REV_
  - CON_
  minValues: '2'
  minValuesIn: group

Here, we activate the filtering process by setting doFiltering to 'true'. Setting this parameter to any other value will bypass filtering.

Values added to the filterTerms list will be used to remove phosphorylation sites mapping to proteins containing these terms anywhere in the test of their upidColumn. For example, here we remove phosphorylation sites that contain REV_ or CON_ in their protein identifier.

The minValues parameter allows you to specify the minimum number of non-zero values that must be present for a phosphorylation site to be included in subsequent analysis. The minValuesIn parameter restricts how these missing values can be distributed. Valid options are:

  • total : The minValues setting must be reached by any combination of samples.

  • group : The minValues setting must be reached within samples of a treatment group.

Imputation

The imputation block allows you to specify a strategy by which missing values that remain after filtering are replaced.

An example of an imputation block is given below:

imputation:
  doImputation: 'true'
  imputeCategory: group
  imputeType: median

Here, we activate imputation by setting doImputation to 'true'. Setting this parameter to any other value will bypass imputation.

The imputeCategory values sets which valid (non-zero) values are used to calculate the new replacement values.

  • group : The replacement value is calculated using the valid values for a phosphorylation site within the same treatment group as the missing value.

  • site : The replacement value is calculated using the valid values for a phosphorylation site the phosphorylation site, i.e. regardless of group.

The imputeCategory values sets the mathematical function that will be used to calculate the replacement value.

  • min : The minimum of valid values will be used.

  • median : The median of the valid values will be used.

  • mean : The mean of the valid values will be used

normalisation

Normalisation of quantification values can be used to correct for global differences in phosphorylation site abundances caused by small random errors in protein loading.

  • none : No normalisation applied

  • median : Samples normalised by median intensity

  • tic : Samples normalised by total intensity of all quantified phosphorylation sites

  • quantile : Samples normalised such that distributions of quantification values of quantification values are the same.

transform

The transformation parameter allows you to apply log2 transformation to quantitative data so that an approximate normal distribution is obtained. Valid options are:

  • none : Transformation is bypassed

  • log2 : Data are log2 transformed.

An example of a completed and valid processing block is below:

processing:
  filtering:
    doFiltering: 'true'
    filterTerms:
    - REV_
    - CON_
    minValues: '2'
    minValuesIn: group
  imputation:
    doImputation: 'false'
    imputeCategory: group
    imputeType: median
  normalisation: median
  transform: log2

Example Parameter File

An example parameter data file can be downloaded here and the text is copied below:

columnAssignments:
  quantColumns:
  - CTRL
  - CTRL_1
  - CTRL_2
  - THZ1
  - THZ1_1
  - THZ1_2
  - U0126
  - U0126_1
  - U0126_2
  residueColumn: Residue
  siteColumn: Position
  upidColumn: ID
comparisons:
- foldChangeThreshold: '1'
  group1: THZ1
  group2: CTRL
  name: THZ1_CTRL
  pvalThreshold: '2'
- foldChangeThreshold: '1'
  group1: U0126
  group2: CTRL
  name: U0126_CTRL
  pvalThreshold: '2'
processing:
  filtering:
    doFiltering: 'true'
    filterTerms:
    - REV_
    - CON_
    minValues: '2'
    minValuesIn: group
  imputation:
    doImputation: 'true'
    imputeCategory: group
    imputeType: median
  normalisation: median
  transform: log2
sampleAlias:
  QUANT_CTRL: CTRL
  QUANT_CTRL_1: CTRL_1
  QUANT_CTRL_2: CTRL_2
  QUANT_THZ1: THZ1
  QUANT_THZ1_1: THZ1_1
  QUANT_THZ1_2: THZ1_2
  QUANT_U0126: U0126
  QUANT_U0126_1: U0126_1
  QUANT_U0126_2: U0126_2
sampleMaps:
  sampleGroupMap:
    CTRL: CTRL
    CTRL_1: CTRL
    CTRL_2: CTRL
    THZ1: THZ1
    THZ1_1: THZ1
    THZ1_2: THZ1
    U0126: U0126
    U0126_1: U0126
    U0126_2: U0126
  sampleIndexMap:
    CTRL: 1
    CTRL_1: 2
    CTRL_2: 3
    THZ1: 4
    THZ1_1: 5
    THZ1_2: 6
    U0126: 7
    U0126_1: 8
    U0126_2: 9