LLM guide: complete operating instructions

This page is a single, self-contained reference written for an LLM assistant to help an end user run every MHC-TP use case correctly. It restates the install, inputs, commands, flags, recipes, outputs, and the matching method so no other page is required.

Raw text: a machine-readable copy of this page is served at /llm.txt.

What MHC-TP does

mhc-tp is a command-line tool. Given a GibbsCluster output folder, it correlates each peptide cluster's position-specific scoring matrix (PSSM) against a reference library of HLA/MHC class I + II binding motifs (human and mouse) and reports, per cluster, the best-matching allotype(s). It writes a ranked CSV and a standalone interactive HTML report.

Package / import name: mhc-tp / mhc_tp. Console command: mhc-tp.
Requires Python 3.9–3.11. Internet access is needed only once, to fetch the reference data; searches themselves run offline.

Install

pip install mhc-tp

Editable from source (for development):

git clone https://github.com/PurcellLab/MHC-TP.git
cd MHC-TP
pip install -e .

Reference data (required, fetched once)

Reference motif Parquet files are not bundled with the package; they are downloaded from the GitHub release.

mhc-tp fetch -s human     # human  |  mouse  |  all

fetch options: -s/--species {human,mouse,all} (default all), -d/--dest DIR (override the data dir; otherwise a per-user data dir, also settable via $MHC_TP_DATA_DIR). After fetching, mhc-tp search finds the parquet automatically — you only pass -r/--reference to use a custom file.

Input: the GibbsCluster folder

The positional gibbs_folder argument is a GibbsCluster run directory. It must contain a matrices/ subdirectory with the per-cluster matrix files (gibbs.<g>of<N>.mat). If an images/gibbs.KLDvsClusters.tab file is present, the report also shows each cluster's KLD (information content). GibbsCluster's own logo PNGs, if present, are reused in the report.
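
The folder requirements above can be checked before running a search. The following is a minimal pre-flight sketch, not mhc-tp's own validation: the function name `check_gibbs_folder` and the shape of its return value are hypothetical, and only the paths and the file-name pattern come from the description above.

```python
import re
from pathlib import Path

# File-name pattern from the text above: gibbs.<g>of<N>.mat
MATRIX_NAME = re.compile(r"^gibbs\.(\d+)of(\d+)\.mat$")


def check_gibbs_folder(folder):
    """Summarise a GibbsCluster run directory (illustrative helper).

    Raises FileNotFoundError if matrices/ is missing or contains no
    gibbs.<g>of<N>.mat files.
    """
    root = Path(folder)
    matrices = root / "matrices"
    if not matrices.is_dir():
        raise FileNotFoundError(f"{matrices} not found")

    clusters = []
    for path in matrices.iterdir():
        match = MATRIX_NAME.match(path.name)
        if match:
            g, n = int(match.group(1)), int(match.group(2))
            clusters.append((n, g))  # (clusters in solution, cluster index)
    if not clusters:
        raise FileNotFoundError(f"no gibbs.<g>of<N>.mat files in {matrices}")

    return {
        "clusters": sorted(clusters),
        # Optional file: enables the per-cluster KLD display in the report.
        "has_kld": (root / "images" / "gibbs.KLDvsClusters.tab").is_file(),
    }
```

Calling `check_gibbs_folder("runs/sampleA")` before `mhc-tp search` gives an early, readable error when the wrong directory is passed.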

Command: `mhc-tp search`

mhc-tp search [-h] [-r REFERENCE] [-s SPECIES] [-c {I,II,all}]
              [-t THRESHOLD] [--topNHits TOPNHITS] [--always-top-n]
              [-o OUTPUT] [--no-html] [-l]
              [--log-level {debug,info,warning,error,critical}]
              [--threads THREADS]
              gibbs_folder

| flag | meaning | default |
| --- | --- | --- |
| `gibbs_folder` | GibbsCluster run dir (must have `matrices/`) | required |
| `-s, --species` | `human` or `mouse` | `human` |
| `-c, --class` | restrict reference to MHC class `I`, `II`, or `all` | `all` |
| `-r, --reference` | path to a `<species>.parquet` (else the fetched one) | auto |
| `-t, --threshold` | minimum Pearson correlation (PCC) to report | `0.70` |
| `--topNHits` | allotype matches to keep per cluster | `3` |
| `--always-top-n` | keep each cluster's top-N even below threshold (flagged "below cutoff") | off |
| `-o, --output` | output directory | `output` |
| `--no-html` | write only the CSV, skip the HTML report | off |
| `-l, --log` | also save the coloured session log to the output dir | off |
| `--log-level` | `debug`/`info`/`warning`/`error`/`critical` | `info` |
| `--threads` | max CPU threads (also `$MHC_TP_THREADS`) | `4` |
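
When the search is driven from a script, the flags in the table above can be assembled programmatically. The sketch below is not part of mhc-tp: the function `build_search_command` and its parameter names are hypothetical, and only the flag spellings and defaults are taken from the table.

```python
import shlex


def build_search_command(gibbs_folder, species="human", mhc_class="all",
                         threshold=0.70, top_n=3, always_top_n=False,
                         output="output", no_html=False, threads=None,
                         reference=None):
    """Return the argv list for `mhc-tp search` (illustrative helper)."""
    cmd = [
        "mhc-tp", "search", str(gibbs_folder),
        "-s", species,
        "-c", mhc_class,
        "-t", str(threshold),
        "--topNHits", str(top_n),
        "-o", str(output),
    ]
    if always_top_n:
        cmd.append("--always-top-n")
    if no_html:
        cmd.append("--no-html")
    if threads is not None:
        cmd += ["--threads", str(threads)]
    if reference is not None:
        cmd += ["-r", str(reference)]
    return cmd


# Class I only, top 5 matches per cluster.
cmd = build_search_command("runs/sampleA", mhc_class="I", top_n=5)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a single string avoids shell-quoting problems with unusual folder names.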

Use-case recipes (copy-paste)

# 1. Basic human search (class I + II reference)
mhc-tp search runs/sampleA -s human -o results/

# 2. Mouse sample
mhc-tp fetch -s mouse
mhc-tp search runs/mouseA -s mouse -o results_mouse/

# 3. Restrict to MHC class I only (faster, class-I immunopeptidome)
mhc-tp search runs/sampleA -s human -c I -o results/

# 4. Class II only
mhc-tp search runs/sampleA -s human -c II -o results/

# 5. Keep the top 5 matches per cluster instead of 3
mhc-tp search runs/sampleA -s human --topNHits 5 -o results/

# 6. Guarantee a top-3 for EVERY cluster even if below threshold
#    (weak matches are tagged "below cutoff" in the report)
mhc-tp search runs/sampleA -s human --always-top-n -o results/

# 7. Stricter confidence cutoff (default 0.70; pass a lower value, e.g. -t 0.60, to loosen it)
mhc-tp search runs/sampleA -s human -t 0.80 -o results/

# 8. CSV only (no HTML report)
mhc-tp search runs/sampleA -s human --no-html -o results/

# 9. Use a custom / local reference parquet
mhc-tp search runs/sampleA -s human -r path/to/human.parquet -o results/

# 10. Set the maximum number of CPU threads (default 4) and save a log
mhc-tp search runs/sampleA -s human --threads 8 -l -o results/

[!IMPORTANT] Threshold vs top-N selection order: per cluster, all allotypes are ranked by PCC, then the threshold is applied, then the top-N are taken. So by default a cluster can return fewer than --topNHits rows (or none). With --always-top-n, the top-N are taken regardless of threshold and the threshold only annotates confidence — every cluster keeps its best N.
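
The selection order described above can be sketched in a few lines of Python. This illustrates the stated rule only and is not mhc-tp's source: the function name `select_hits` is hypothetical, the scores are made up for the example, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def select_hits(scores, threshold=0.70, top_n=3, always_top_n=False):
    """Pick hits for one cluster from {allotype: PCC} (illustrative).

    Returns (allotype, pcc, below_cutoff) tuples, best first.
    """
    # 1. Rank every allotype by PCC, highest first.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if always_top_n:
        # Threshold only annotates: keep the best N regardless.
        kept = ranked[:top_n]
    else:
        # 2. Apply the threshold, 3. then take the top N of what is left.
        kept = [item for item in ranked if item[1] >= threshold][:top_n]
    return [(name, pcc, pcc < threshold) for name, pcc in kept]


# Made-up scores for a single cluster.
scores = {"allotype_A": 0.91, "allotype_B": 0.74,
          "allotype_C": 0.55, "allotype_D": 0.30}

print(select_hits(scores))                     # two rows: only A and B pass 0.70
print(select_hits(scores, always_top_n=True))  # three rows: C is flagged below cutoff
```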

Outputs

Written to <output>/clust_result/:

| file | description |
| --- | --- |
| `correlations.csv` | columns: `cluster` (e.g. `gibbs.2of5`), `hla` (canonical display name, e.g. `HLA-B39:124`), `formatted` (raw join key, e.g. `HLAB39124`), `correlation` (PCC) |
| `mhc-tp-result.html` | standalone interactive report (open in any browser) |

The report contains: a motif-comparison carousel (per cluster solution, choose via dropdown), a paginated + searchable Top-N table, and a correlation analysis section (heatmap + force-directed network, with a PCC threshold slider that filters the view only). Rows/motifs below the search threshold (only present with --always-top-n) are tagged below cutoff.
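
For downstream analysis, `correlations.csv` can be read with the standard library. This is a minimal sketch that assumes the file is comma-delimited with a header row naming the four columns listed above; the function name `best_hits_by_cluster` is hypothetical.

```python
import csv
from collections import defaultdict


def best_hits_by_cluster(csv_path):
    """Group correlations.csv rows by cluster, best PCC first (illustrative)."""
    hits = defaultdict(list)
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            hits[row["cluster"]].append((row["hla"], float(row["correlation"])))
    for rows in hits.values():
        rows.sort(key=lambda item: item[1], reverse=True)
    return dict(hits)
```

With the default output directory this would be called as `best_hits_by_cluster("output/clust_result/correlations.csv")`, returning a dict that maps each cluster name (such as `gibbs.2of5`) to its ranked `(hla, correlation)` pairs.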

How matching works (method)

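Each motif is a PSSM (n_positions × 20 amino acids). For a cluster motif g and a reference motif r, the score is the Pearson correlation over V, the set of cells that are informative in the cluster motif (g_k ≠ 0 and not NaN). The means (ḡ, r̄) and population standard deviations (σ_g, σ_r) are computed over V as well: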

PCC(g, r) = Σ_{k∈V}(g_k − ḡ)(r_k − r̄) / ( |V| · σ_g · σ_r )    ∈ [−1, 1]

1.0 = identical motif shape. It is scale/offset invariant (scores the pattern of position preferences, not absolute magnitudes). Allele display names are returned verbatim from the reference (matching the embedded logo titles), e.g. HLA-A25:08.
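
The formula above can be written out directly in Python. This is a sketch of the stated definition, not mhc-tp's implementation: the function name `masked_pcc` is hypothetical, motifs are assumed to be equal-shaped lists of rows, and returning NaN for fewer than two informative cells or zero variance is a choice made for this example.

```python
import math


def masked_pcc(cluster, reference):
    """Pearson correlation over cells where the cluster motif is informative.

    cluster, reference: equal-shaped lists of rows (n_positions x 20).
    """
    # V: cells with a non-zero, non-NaN value in the cluster motif.
    pairs = [
        (g, r)
        for g_row, r_row in zip(cluster, reference)
        for g, r in zip(g_row, r_row)
        if g != 0 and not math.isnan(g)
    ]
    n = len(pairs)
    if n < 2:
        return float("nan")

    g_mean = sum(g for g, _ in pairs) / n
    r_mean = sum(r for _, r in pairs) / n
    cov = sum((g - g_mean) * (r - r_mean) for g, r in pairs)
    # Population standard deviations (divide by |V|), matching the formula.
    g_sd = math.sqrt(sum((g - g_mean) ** 2 for g, _ in pairs) / n)
    r_sd = math.sqrt(sum((r - r_mean) ** 2 for _, r in pairs) / n)
    if g_sd == 0 or r_sd == 0:
        return float("nan")
    return cov / (n * g_sd * r_sd)


# Toy 2 x 3 motifs (real PSSMs have 20 columns). Masked cells are ignored,
# and the reference is a scaled, shifted copy of the cluster on the rest.
g = [[1.0, 2.0, 0.0], [3.0, 4.0, float("nan")]]
r = [[7.0, 9.0, 100.0], [11.0, 13.0, -50.0]]
print(masked_pcc(g, r))
```

Because the reference in the toy example equals 2·g + 5 on the informative cells, the result is 1.0 up to floating-point rounding, which illustrates the scale and offset invariance.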

Other commands

# Export embedded reference logos to PNGs
mhc-tp export-logos -r human.parquet -o logos/ -a "HLA-B39:124,A0201"   # -a optional; default all

Developer-only (end users never need these): mhc-tp build-db and mhc-tp build-ref rebuild the reference parquets from NetMHCpan / NetMHCIIpan packs; embedding Seq2Logo logos (--with-logos) needs a separate Python 2.7 env.

Troubleshooting

"no class II allotypes" / 0 matches with -c II: the sample is likely class I (short 8–11mers); use -c I or -c all.
No matches: lower -t (e.g. -t 0.5) or use --always-top-n to force the best N per cluster.
Reference not found: run mhc-tp fetch -s <species> first, or pass -r path/to/<species>.parquet.
gibbs_folder rejected: ensure it contains a matrices/ subdirectory.

Citation

Munday PR, Krishna SSG, Fehring J, Croft NP, Purcell AW, Li C, Braun A. Immunolyser 2.0: An advanced computational pipeline for comprehensive analysis of immunopeptidomic data. Comput Struct Biotechnol J. 2025;29:296–304. doi:10.1016/j.csbj.2025.10.007. PMID: 41209766.