Interpreting Sage Results

Sage results are primarily meant to be analyzed using standard programming/data-science tools (Pandas (opens in a new tab), Polars (opens in a new tab), data.table (opens in a new tab), Rstudio (opens in a new tab), etc), and I highly encourage you to learn enough of one of these tools to filter, merge, pivot, and groupby. It is a truly invaluable skill to have for anyone who needs to analyze proteomics datasets (or data in general!). That being said, TSV files are provided such that end-users can perform some filtering and analyzes in Excel.

with_polars.py

import polars as pl
# Sum TMT intensitities for all confident matches
df = (
    pl.read_csv("results.sage.tsv", separator="\t")
    .filter((pl.col("peptide_q") <= 0.01) & (pl.col("protein_q") <= 0.01) & (pl.col("label") == 1))
    .join(pl.read_csv("tmt.tsv", separator='\t'), on=["filename", "scannr"])
    .groupby(["filename", "peptide", "proteins"]).agg(pl.col("tmt*").sum())
)

Tab-Separated Format

By default, Sage will write several files to the output_directory configuration option (or, the current directory if not set):

results.json
results.sage.tsv
results.sage.pin
tmt.tsv
lfq.tsv

results.json contains a record of the search parameters and files used for the analysis. It is possible reproduce the analysis from this file:

reproduce

sage results.json

results.sage.tsv contains all PSMs (including decoys and non-confident matches) identified in the analysis.
results.sage.pin is a percolator/Mokapot compatible file, produced if --write-pin is passed as a command-line argument.
tmt.tsv contains all reporter ions quantified
lfq.tsv contains MS1 intensities for each peptide quantified

Parquet File Format

Parquet is a file format for data analysis and distributed queries. Parquet files are natively compressed, and yield smaller file sizes and faster query times than TSV files. Parquet files can be directly read by Pandas, Polars, and more!

Sage will write parquet files instead of tsv files if you pass --parquet as a command-line argument.

⚠️

Serializing results to parquet files is still experimental, and the column names are liable to change!

Notable differences from the standard output are that TMT reporter ion intensities are pre-merged onto search results, and can be found in the reporter_ion_intensity column. The other columns are largely the same as the TSV format

results.json
results.sage.parquet
lfq.parquet

Results schema

results.sage.parquet

message schema {
    required byte_array filename (utf8);
    required byte_array scannr (utf8);
    required byte_array peptide (utf8);
    required byte_array stripped_peptide (utf8);
    required byte_array proteins (utf8);
    required int32 num_proteins;
    required int32 rank;
    required boolean is_decoy;
    required float expmass;
    required float calcmass;
    required int32 charge;
    required int32 peptide_len;
    required int32 missed_cleavages;
    required float isotope_error;
    required float precursor_ppm;
    required float fragment_ppm;
    required float hyperscore;
    required float delta_next;
    required float delta_best;
    required float rt;
    required float aligned_rt;
    required float predicted_rt;
    required float delta_rt_model;
    required int32 matched_peaks;
    required int32 longest_b;
    required int32 longest_y;
    required float longest_y_pct;
    required float matched_intensity_pct;
    required int32 scored_candidates;
    required float poisson;
    required float sage_discriminant_score;
    required float posterior_error;
    required float spectrum_q;
    required float peptide_q;
    required float protein_q;
    optional group reporter_ion_intensity (LIST) {
        repeated group list {
            optional float element;
        }
    }
}

LFQ Schema

Notably, the lfq output in parquet format is in long-form

lfq.parquet

message schema {
    required byte_array peptide (utf8);
    required byte_array stripped_peptide (utf8);
    required byte_array proteins (utf8);
    required boolean is_decoy;
    required float q_value;
    required byte_array filename (utf8);
    required float intensity;
}

Bruker File Processing Search Results