Documentation
Configuring Sage

Configuring Sage

Sage is primarily configured via a single file, but there are several parameters that can be set or overrode by command-line arguments or environment variables:

Command Line Arguments

  • --batch-size N: Process N files in parallel. Setting this value to the number of logical CPU cores will maximize performance at the cost of higher memory use
  • --write-pin: Write percolator/mokapot compatible output file in addition to any other output files
  • --parquet: Write parquet-formatted files instead of tab-separated files. Do not use this if you need to open your results in Excel.
  • --fasta path: Override or set path to FASTA database
  • --output_directory: Override or set path to output directory or S3 location
  • --annotate-matches: Record all experimental-theoretical fragment ion matches for use in generating spectral libraries or rescoring PSMs
  • --disable-telemetry-i-dont-want-to-improve-sage: Turn off Sage's basic telemetry (which reports # of CPU cores, memory usage, run time, and size of fragment ion index)
  • [mzml_paths]: Override or set mzML files to search
Usage: sage [OPTIONS] <parameters> [mzml_paths]...
 
🔮 Sage 🧙 - Proteomics searching so fast it feels like magic!
 
Arguments:
  <parameters>     Path to configuration parameters (JSON file)
  [mzml_paths]...  Paths to mzML files to process. Overrides mzML files listed in the configuration file.
 
Options:
  -f, --fasta <fasta>
          Path to FASTA database. Overrides the FASTA file specified in the configuration file.
  -o, --output_directory <output_directory>
          Path where search and quant results will be written. Overrides the directory specified in the configuration file.
      --batch-size <batch-size>
          Number of files to search in parallel (default = number of CPUs/2)
      --parquet
          Write parquet files instead of tab-separated files
      --annotate-matches
          Write matched fragments output file.
      --write-pin
          Write percolator-compatible `.pin` output files
      --disable-telemetry-i-dont-want-to-improve-sage
          Disable sending telemetry data
  -h, --help
          Print help information
  -V, --version
          Print version information

Environment Variables

There are two shell environment variables that can be used to change behavior of Sage:

# Sage will print additional information to the terminal
export SAGE_LOG=trace
# Sage will only use N threads, default is to use the number of logical CPU cores
export RAYON_NUM_THREADS=2
 
sage config.json

or in inline-style:

SAGE_LOG=trace RAYON_NUM_THREADS=2 sage config.json

JSON File Schema

Sage uses a json file for configuration. This enables easy programmatic manipulation of the configuration file. The order of options in the configuration file does not matter.

A full example is shown below, but it's worth noting that many of the configuration options can be omitted the first time you run Sage.

One of the nice features of Sage is that it will write out a results.json file listing all parameters that were actually used during the run (i.e. any defaults that you did not specify), which you can then modify. An even nicer feature is that you can reproduce your results by running the results.json file!

# Reproduce the same results!
sage results.json

⬅️ Please checkout the rest of the docs for detailed descriptions of the parameters

🚫

Note that json files do not allow comments - they are provided here for explanation, but need to be removed in a real configuration file. Trailing commas will also cause an error to be thrown. You can check out some example files without comments in the sidebar (Example pages)

config.json
{
  "database": {
    // How many fragments are in each internal mass bucket
    // Use a lower value (8192) for high-res MS/MS, and higher values for low-res MS/MS
    "bucket_size": 32768,
    // This section is optional. Default is trypsin, using the parameters below
    "enzyme": {
      // Optional[int] {default=1}, Number of missed cleavages to allow
      "missed_cleavages": 2,
      // Optional[int] {default=5}, Minimum AA length of peptides to search
      "min_len": 5,
      // Optional[int] {default=50}, Maximum AA length of peptides to search
      "max_len": 50,
      // Optional[str] {default='KR'}. Amino acids to cleave at
      "cleave_at": "KR",
      // Optional[char/single AA] {default='P'}. Do not cleave if this AA follows the cleavage site
      "restrict": "P",
      // Optional[bool] {default=true}. Cleave at c terminus of matching amino acid
      "c_terminal": true
    },
    // Optional[float] {default=150.0}, Minimum mass of fragments to search
    "fragment_min_mz": 200.0,
    // Optional[float] {default=2000.0}, Maximum mass of fragments to search 
    "fragment_max_mz": 2000.0,
    // Optional[float] {default=500.0}, Minimum monoisotopic mass of peptides to fragment
    "peptide_min_mass": 500.0,
    // Optional[float] {default=5000.0}, Maximum monoisotopic mass of peptides to fragment
    "peptide_max_mass": 5000.0,
    // Optional[List[str]] {default=["b","y"]} Which fragment ions to generate and search?
    "ion_kinds": ["b", "y"],
    // Optional[int] {default=2}, Do not generate b1/b2/y1/y2 ions for preliminary searching.
    // Does not affect full scoring of PSMs!
    "min_ion_index": 2,
    // Optional[Dict[char, float]] {default=null}, static modifications
    "static_mods": {
      // Apply static modification to N-terminus of peptide
      "^": 304.207,
      // Apply static modification to lysine
      "K": 304.207,
      // Apply static modification to cysteine
      "C": 57.0215
    },
    // Optional[Dict[char, list[float]]] {default=null}, variable modifications
    "variable_mods": {
      // Variable mods are applied *before* static mod
      "M": [15.9949],
      "^Q": [-17.026549],
      // Applied to N-terminal glutamic acid
      "^E": [-18.010565],
      // Applied to peptide C-terminus
      "$": [49.2, 22.9],
      // Applied to protein N-terminus
      "[": 42.0,
      // Applied to protein C-terminus
      "]": 111.0
    },
    // Optional[int] {default=2} Limit k-combinations of variable modifications
    "max_variable_mods": 2,
    // Optional[str] {default="rev_"}: See notes above
    "decoy_tag": "rev_",
    // Optional[bool] {default="true"}: Ignore decoys in FASTA database matching `decoy_tag`
    "generate_decoys": false,
    // str: mandatory path to FASTA file
    "fasta": "dual.fasta"
  },
  // Optional - specify only if TMT or LFQ
  "quant": {
    // Optional[str] {default=null}, one of "Tmt6", "Tmt10", "Tmt11", "Tmt16", or "Tmt18"
    "tmt": "Tmt16",
    "tmt_settings": {
      // Optional[int] {default=3}, MS-level to perform TMT quantification on
      "level": 3,
      // Optional[bool] {default=false}, use Signal/Noise instead of intensity for TMT quant
      // Requires noise values in mzML
      "sn": false
    },
    // Optional[bool] {default=null}, perform label-free quantification
    "lfq": true,
    "lfq_settings": {
      // See documentation for details - recommend that you do not change this setting
      "peak_scoring": "Hybrid",
      // Optional["Sum" | "Apex"] {default="Sum"}, use sum or peak of MS1 traces in peak
      "integration": "Sum",
      // Optional[float] {default = 0.7}, normalized spectral angle (vs. theoretical isotopic envelope)
      // cutoff for calling an MS1 peak
      "spectral_angle": 0.7,
      // Optional[float] {default = 5.0}, tolerance for DICE window around calculated precursor mass
      "ppm_tolerance": 5.0
    }
  },
  // Tolerance can be either "ppm" or "da"
  "precursor_tol": {
    "da": [
      // This value is substracted from the experimental precursor to match theoretical peptides
      -500,
      // This value is added to the experimental precursor to match theoretical peptides
      100
    ]
  },
  "fragment_tol": {
    "ppm": [
     // This value is subtracted from the experimental fragment to match theoretical fragments 
     -10,
     // This value is added to the experimental fragment to match theoretical fragments 
     10
    ]
  },
  // Optional[Tuple[int, int]] {default=[2, 4]}
  // If charge states are not annotated in the mzML, or if `wide_window` mode is turned on, then consider
  // all precursors at z=2, z=3, z=4
  "precursor_charge": [2, 4],
  // Optional[Tuple[int, int]] {default=[0,0]}: C13 isotopic envelope to consider for precursor
  "isotope_errors": [
    // Consider -1 C13 isotope
    -1,
    // Consider up to +3 C13 isotope (-1/0/1/2/3) 
    3
  ],
  // Optional[bool] {default=false}: perform deisotoping and charge state deconvolution on MS2 spectra
  "deisotope": false,
  // Optional[bool] {default=false}: search for chimeric/co-fragmenting PSMs
  "chimera": false,
  // Optional[bool] {default=false}: _ignore_ `precursor_tol` and search in wide-window/DIA mode
  "wide_window": false,
  // Optional[bool] {default=true}: use retention time prediction model as an feature for LDA
  "predict_rt": false,
  // Optional[int] {default=15}: only process MS2 spectra with at least N peaks
  "min_peaks": 15,
  // Optional[int] {default=150}: take the top N most intense MS2 peaks to search,
  "max_peaks": 150,
  // Optional[int] {default=4}: minimum # of matched b+y ions to use for reporting PSMs
  "min_matched_peaks": 6,
  // Optional[int] {default=null}: maximum fragment ion charge states to consider,
  "max_fragment_charge": 1,
  // Optional[int] {default=1}: number of PSMs to report for each spectra. Higher values might disrupt PSM rescoring.
  "report_psms": 1,
  // Optional[str] {default=`.`}: Place output files in a given directory or S3 bucket/prefix
  "output_directory": "s3://bucket/prefix",
  // List[str]: representing paths to mzML (or gzipped-mzML) files for search
  "mzml_paths": [
    "local/path.mzML",
    "s3://bucket/PXD0000001/foo.mzML.gz"
  ]       
}