Fragment Index Construction
The database section of the configuration file determines how the internal fragment index data structure is generated.
{
"bucket_size": 8192,
"enzyme": {
"missed_cleavages": 2,
"min_len": 7,
"max_len": 50,
"cleave_at": "KR",
"restrict": "P",
"c_terminal": true,
"semi_enzymatic": false
},
"peptide_min_mass": 500.0,
"peptide_max_mass": 5000.0,
"ion_kinds": [
"b",
"y"
],
"min_ion_index": 2,
"max_variable_mods": 2,
"static_mods": {
"C": 57.0214
},
"variable_mods": {
"M": 15.9949
},
"decoy_tag": "rev_",
"generate_decoys": true,
"fasta": "s3://sage-benchmarking/fasta/human_contam.fasta"
}
Bucket size
This parameter only affects search speed and will not change your results.
MS2 resolution | Suggested Setting |
---|---|
Low | 65536 |
High | 8192 |
This parameter can be used to tune performance of Sage. This value sets the number of fragment ions within each "bucket" in the internal index datastructure. This value will always be set to the next largest power of 2.
A smaller number (8192
is the minimum) is suitable for high resolution MS/MS spectra, since not many buckets will need to be searched. Low resolution MS/MS spectra
will need to search more buckets, so increasing the size of the bucket will lower the total number of internal buckets.
A good starting point is to use 65536
for ion-trap data, but the optimal value for your search parameters and files might require empirical tuning.
Enzyme
The enzyme section contains parameters related to the enzyme used for digestion. The default enzyme is trypsin, with the parameters specified below.
-
missed_cleavages
: Integer. The number of missed cleavages for tryptic digest (default: 1). -
min_len
: Integer. The minimum amino acid (AA) length of peptides to search (default: 5). -
max_len
: Integer. The maximum AA length of peptides to search (default: 50). -
cleave_at
: String. Amino acids to cleave at (default: 'KR').The
cleave_at
parameter can also be used to specify alternative digestion schemes:- Non-enzymatic:
cleave_at = ""
- All potential peptides betweenmin_len
andmax_len
will be generated from the sequence - No digestion:
cleave_at = "$"
- FASTA entries will be used as-is, subject tomin_len
andmax_len
options
- Non-enzymatic:
-
restrict
: Single character string. Do not cleave if this amino acid follows the cleavage site (default: 'P'). -
c_terminal
: Boolean. Cleave at the C-terminus of matching amino acids (default:true). -
semi_enzymatic
: Boolean. Perform a semi-enzymatic digest (default:false).
Fragment Settings
peptide_min_mass
: Float. The minimum monoisotopic mass of peptides to fragment in silico (default: 500.0).peptide_max_mass
: Float. The maximum monoisotopic mass of peptides to fragment in silico (default: 5000.0).ion_kinds
: List of strings. Which fragment ions to produce? Allowed values: "a", "b", "c", "x", "y", "z". (default: ["b", "y"])min_ion_index
: Integer. Do not generate b1..bN or y1..yN ions for preliminary searching ifmin_ion_index = N
. Does not affect full scoring of PSMs (default: 2).
Modifications
Static Modifications
static_mods
Dictionary with characters as keys and floats as values. Represents static modifications applied to amino acids or termini (default: ). Static modifications are applied after variable modifications
Example: Apply a static modification of 304.207 to the N-terminus of the peptide and lysine, and 57.0215 to cysteine.
{
"static_mods": {
"^": 304.207,
"K": 304.207,
"C": 57.0215
}
}
Variable Modifications
max_variable_mods
: Integer. Limit k-combinations of variable modifications (default: 2).variable_mods
: Dictionary with characters as keys and list of floats (or single floats) as values. Represents variable modifications applied to amino acids or termini (default: ).
Example: Apply a variable modification of 15.9949 to methionine, 49.2022 to the C-terminus of the peptide, 42.0 to the N-terminus of the protein, and 111.0 to the C-terminus of the protein, in addition to pyro-glutamine/pyro-glutamic acid. Allow only up to 3 variable modifications in total.
{
"max_variable_mods": 3,
"variable_mods": {
"M": [15.9949],
"^Q": [-17.026549],
"^E": [-18.010565],
"$": [49.2022],
"[": 42.0,
"]": 111.0
}
}
Modification Syntax:
"^X"
: Modification to be applied to amino acid X if it appears at the N-terminus of a peptide"$X"
: Modification to be applied to amino acid X if it appears at the C-terminus of a peptide"[X"
: Modification to be applied to amino acid X if it appears at the N-terminus of a protein"]X"
: Modification to be applied to amino acid X if it appears at the C-terminus of a protein
Fasta Database and Decoy Generation
For best results, let Sage generate decoy sequences.
decoy_tag
: String. The tag used to identify decoy entries in the FASTA database (default: "rev_").generate_decoys
: Boolean. If true, ignore decoys in the FASTA database matchingdecoy_tag
, and generate internally reversed peptides (default: false).fasta
: String. The path to the FASTA file, either a local path or s3 object URI.
Target-decoy competition is key to controlling the false discovery rate in proteomics experiments. Sage can use decoy sequences included in the supplied FASTA file, or it can generate internal sequences (recommended). Sage reverses tryptic peptides (not proteins), so that the picked-peptide (opens in a new tab) approach to FDR can be used.
If generate_decoys
is set to true (or unspecified), then decoy sequences in the FASTA database matching decoy_tag
will be ignored,
and Sage will internally generate decoys.
It is critical that you ensure you use the proper decoy_tag
if you are using a FASTA database containing decoys
and have internal decoy generation turned on - otherwise Sage will treat the supplied decoys as hits!
Internally generated decoys will have protein accessions matching "{decoy_tag}{accession}"
, e.g. if decoy_tag
is "rev_" then a protein accession like "rev_sp|P01234|HUMAN" will be listed in the output file.