API reference¶
Auto-generated from the package docstrings.
Search engine¶
mhc_tp.engine.search ¶
Search Gibbs cluster matrices against the reference array.
Method (how a cluster is matched to an allotype)¶
1. Each GibbsCluster motif and each reference allotype is represented as a
   position-specific scoring matrix (PSSM): n_positions x 20 amino-acid
   weights. Gibbs matrices are zero-padded to a common position count so they
   can be batched.
2. For every (cluster, allotype) pair the two PSSMs are compared by Pearson
   correlation of their flattened weights, computed only over the cells
   V = {k : g_k != 0 and g_k not NaN} that are informative in the cluster
   matrix (so padding and empty positions do not dilute the score):

   PCC(g, r) = Σ_{k∈V} (g_k - ḡ)(r_k - r̄) / ( |V| · σ_g · σ_r ) ∈ [-1, 1]

   with means/std taken over V. It is scale- and offset-invariant, so it
   scores motif shape, not absolute magnitudes. Full derivation and numerical
   guards: :func:mhc_tp.engine.kernels.compute_all_correlations.
3. Per cluster the allotypes are ranked by correlation (PCC, -1..1;
1.0 = identical motif). Selection is then either threshold-gated
(default) or pure top-N (always_top_n); see :func:search.
The correlation is a motif-shape similarity: it rewards matching the relative preference pattern across positions, not the absolute weight magnitudes.
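The masked correlation described in the method can be illustrated for a single (cluster, allotype) pair. This is a minimal NumPy sketch, not the package's kernel: `masked_pcc` is a hypothetical helper name, and the random matrices are made-up data chosen only to show that a rescaled, shifted, zero-padded copy of a reference still scores about 1.0.

```python
import numpy as np


def masked_pcc(g, r):
    """Pearson correlation of two PSSMs over the cells informative in g.

    Hypothetical helper for illustration; not part of mhc_tp.
    """
    g = np.asarray(g, dtype=float).ravel()
    r = np.asarray(r, dtype=float).ravel()
    valid = (g != 0) & ~np.isnan(g)          # the set V
    gv, rv = g[valid], r[valid]
    if gv.size == 0:
        return float("nan")
    sg, sr = gv.std(), rv.std()              # population std (ddof=0)
    if sg == 0 or sr == 0:
        return float("nan")
    cov = np.mean((gv - gv.mean()) * (rv - rv.mean()))
    return float(cov / (sg * sr))


rng = np.random.default_rng(0)
reference = rng.normal(size=(12, 20))        # made-up 12-position reference PSSM

# A 9-position cluster motif: rescaled and shifted copy of the first nine
# reference rows, zero-padded to 12 positions.
cluster = np.zeros((12, 20))
cluster[:9] = 3.0 * reference[:9] + 0.5

score = masked_pcc(cluster, reference)
n_valid = int(np.count_nonzero(cluster))     # cells that enter the score
print(round(score, 6), n_valid)
```

The three padded rows never enter the sum, so they cannot pull the score toward zero, and the factor 3.0 and offset 0.5 cancel out in the normalisation.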
search ¶
Return {(gibbs_name, ref_formatted): correlation} for top-N hits.
By default a hit must score >= threshold to be returned, so a cluster
may yield fewer than top_n rows (or none). When always_top_n is set,
every cluster returns its top_n best matches regardless of threshold —
the threshold then only annotates confidence downstream, it never drops a
row.
Source code in src/mhc_tp/engine/search.py
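The two selection modes can be sketched in a few lines of plain Python. This is an illustration of the rule as documented, not the actual body of search: `select_hits` is a hypothetical name, the scores are invented, and the defaults `top_n=3` and `threshold=0.7` are borrowed from the render_report signature on the assumption that search uses the same values.

```python
def select_hits(scores, top_n=3, threshold=0.7, always_top_n=False):
    """Filter {(gibbs_name, ref_name): pcc} down to the reported hits.

    Hypothetical helper; defaults are assumed, not confirmed for search().
    """
    by_cluster = {}
    for (gibbs, ref), pcc in scores.items():
        by_cluster.setdefault(gibbs, []).append((pcc, ref))

    hits = {}
    for gibbs, rows in by_cluster.items():
        rows.sort(key=lambda row: row[0], reverse=True)   # best match first
        for pcc, ref in rows[:top_n]:
            # Threshold-gated mode drops weak rows; always_top_n keeps them.
            if always_top_n or pcc >= threshold:
                hits[(gibbs, ref)] = pcc
    return hits


scores = {
    ("c1", "A"): 0.95, ("c1", "B"): 0.80, ("c1", "C"): 0.40, ("c1", "D"): 0.10,
    ("c2", "A"): 0.30, ("c2", "B"): 0.20,
}
gated = select_hits(scores)                      # c1 keeps A and B, c2 keeps nothing
forced = select_hits(scores, always_top_n=True)  # c1 keeps A, B, C; c2 keeps A, B
```

In the gated call cluster c2 disappears from the result entirely, which is the "fewer than top_n rows (or none)" case described above.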
mhc_tp.engine.kernels ¶
Numba JIT correlation kernel (ported from the proven NumbaSearch engine).
compute_all_correlations ¶
All-pairs flattened Pearson correlation, parallel over Gibbs matrices.
Each PSSM is flattened to a vector. Only the cells that are informative in the Gibbs matrix are scored: the valid set is
V = { k : g_k != 0 and g_k is not NaN }
Restricted to those cells, the score for a (Gibbs g, reference r) pair is the Pearson correlation coefficient
PCC(g, r) = [ (1/|V|) Σ_{k in V} (g_k - ḡ)(r_k - r̄) ] / ( σ_g · σ_r )
where ḡ, r̄ are the means and σ_g, σ_r the population standard deviations taken over V:
ḡ = (1/|V|) Σ g_k , σ_g = sqrt( (1/|V|) Σ (g_k - ḡ)^2 )
r̄ = (1/|V|) Σ r_k , σ_r = sqrt( (1/|V|) Σ (r_k - r̄)^2 )
PCC lies in [-1, 1] (1 = identical motif shape) and is invariant to positive rescaling and constant offsets of either matrix, so it measures the pattern of position preferences rather than absolute weight magnitudes.
Guards: a Gibbs matrix with |V| < 10 or σ_g = 0 is flagged invalid (its row
is skipped); a reference with σ_r = 0 is skipped for that pair. A score is
stored only when PCC >= threshold; otherwise the cell keeps the -1.0
sentinel. Returns (correlations, invalid_flags).
Source code in src/mhc_tp/engine/kernels.py
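The guards and the sentinel convention can be made concrete with a plain NumPy stand-in for the kernel. This sketch is an assumption-laden illustration: it is not the Numba implementation, `all_correlations` is a hypothetical name, the argument layout (already-flattened rows) is assumed, and the input matrices are invented.

```python
import numpy as np

MIN_VALID = 10     # guard from the docstring: |V| < 10 invalidates a Gibbs matrix
SENTINEL = -1.0    # value left in cells that were skipped or scored below threshold


def all_correlations(gibbs, refs, threshold):
    """gibbs: (n_gibbs, L) and refs: (n_refs, L), both flattened PSSMs.

    Hypothetical pure-NumPy stand-in; returns (correlations, invalid_flags).
    """
    gibbs = np.asarray(gibbs, dtype=float)
    refs = np.asarray(refs, dtype=float)
    corr = np.full((gibbs.shape[0], refs.shape[0]), SENTINEL)
    invalid = np.zeros(gibbs.shape[0], dtype=bool)

    for i, g in enumerate(gibbs):
        valid = (g != 0) & ~np.isnan(g)           # the set V for this Gibbs matrix
        gv = g[valid]
        if gv.size < MIN_VALID or gv.std() == 0:
            invalid[i] = True                     # whole row is skipped
            continue
        g_centered, sg = gv - gv.mean(), gv.std()
        for j, r in enumerate(refs):
            rv = r[valid]
            sr = rv.std()
            if sr == 0:
                continue                          # flat reference: skip this pair
            pcc = np.mean(g_centered * (rv - rv.mean())) / (sg * sr)
            if pcc >= threshold:
                corr[i, j] = pcc
    return corr, invalid


rng = np.random.default_rng(1)
refs = rng.normal(size=(3, 40))
gibbs = np.zeros((3, 40))
gibbs[0] = 2.0 * refs[1] + 1.0        # same shape as reference 1
gibbs[1, :5] = [1, 2, 3, 4, 5]        # only 5 informative cells
gibbs[2] = 1.0                        # constant, so sigma_g = 0

corr, invalid = all_correlations(gibbs, refs, threshold=0.7)
```

Because the sentinel is -1.0, a stored score of exactly -1.0 cannot be told apart from "not stored"; with a positive threshold that case never arises.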
Reference data¶
mhc_tp.refdata.fetch ¶
Resolve and fetch prebuilt reference parquets to a per-user data dir.
End users run mhc-tp fetch to download the prebuilt class I+II
reference parquets (with embedded Seq2Logo reference logos) instead of building
them. The download source + checksums live in the packaged
reference_manifest.tsv; the maintainer fills them in on each release.
data_dir ¶
User data dir for reference files. Overridable via MHC_TP_DATA_DIR.
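An environment-variable override of this kind is usually resolved as below. This is only a sketch: `data_dir_sketch` is a hypothetical name, and the fallback location is passed in explicitly because the page does not state which default directory the package uses.

```python
import os
from pathlib import Path


def data_dir_sketch(default):
    """Return MHC_TP_DATA_DIR if set and non-empty, else the given default.

    Hypothetical helper; the real default directory is not documented here.
    """
    override = os.environ.get("MHC_TP_DATA_DIR")
    return Path(override).expanduser() if override else Path(default)


os.environ["MHC_TP_DATA_DIR"] = "/tmp/mhc_tp_refs"      # placeholder path
overridden = data_dir_sketch("fallback_dir")

del os.environ["MHC_TP_DATA_DIR"]
fallback = data_dir_sketch("fallback_dir")
```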
reference_path ¶
resolve_reference ¶
Return the reference parquet path, raising a helpful error if absent.
Source code in src/mhc_tp/refdata/fetch.py
load_manifest ¶
Parse the packaged reference manifest (species/filename/sha256/url).
Source code in src/mhc_tp/refdata/fetch.py
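A manifest with the four documented fields could be read with the standard csv module. The sketch below assumes the TSV has a header row naming the columns species, filename, sha256 and url, which the page does not confirm; `parse_manifest` is a hypothetical name and every value in the example row is a placeholder.

```python
import csv
import io


def parse_manifest(text):
    """Parse tab-separated manifest text into a list of row dicts.

    Hypothetical helper; assumes a header row with the four field names.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


example = (
    "species\tfilename\tsha256\turl\n"
    "human\texample.parquet\t" + "0" * 64 + "\thttps://example.invalid/example.parquet\n"
)
rows = parse_manifest(example)
```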
fetch ¶
Download the reference parquet(s) into the data dir; verify checksums.
Source code in src/mhc_tp/refdata/fetch.py
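Checksum verification after a download typically amounts to a streamed SHA-256 comparison. The following is a sketch under that assumption, using only hashlib; `sha256_of` and `verify` are hypothetical names, and the temporary file stands in for a downloaded parquet.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large parquets are not read into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_sha256):
    """True when the file's digest matches the manifest entry (case-insensitive)."""
    return sha256_of(path) == expected_sha256.strip().lower()


payload = b"stand-in bytes, not a real parquet"
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "example.parquet"
    target.write_bytes(payload)
    matches = verify(target, hashlib.sha256(payload).hexdigest())
    mismatch = verify(target, "0" * 64)
```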
Report¶
mhc_tp.report.render ¶
Assemble the standalone HTML report.
render_report ¶
render_report(
correlation_dict,
reference_df,
gibbs_matrices,
output_dir,
kld_df=None,
version="",
gibbs_dir=None,
logo_map=None,
name_map=None,
top_n=3,
threshold=0.7,
always_top_n=False,
)
Write the standalone HTML report under output_dir.
logo_map ({formatted: png_bytes}) supplies reference logos when the
reference DataFrame was loaded without the heavy logo column.
name_map ({formatted: display}) supplies pretty allele labels.
Source code in src/mhc_tp/report/render.py
mhc_tp.naming ¶
Display formatting for allele names.
The reference parquet stores each allotype in the very string that was burned
into its Seq2Logo motif title (e.g. HLA-A25:08, HLA-A3301,
DRB1_0101, H-2-IAb). To keep the report's text labels consistent with
those embedded reference logo titles, the canonical display name is the source
allotype verbatim.
pretty_allele ¶
Canonical display name for an allotype, matching its reference logo title.
The source allotype is already the name shown on the embedded Seq2Logo
motif (HLA-A25:08, HLA-A3301, DRB1_0101 ...), so it is returned
verbatim apart from surrounding whitespace.
``HLA-A25:08`` -> ``HLA-A25:08``
``HLA-A3301`` -> ``HLA-A3301``
``DRB1_0101`` -> ``DRB1_0101``
``H-2-IAb`` -> ``H-2-IAb``
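The documented behaviour (verbatim apart from surrounding whitespace) is equivalent to a strip. The sketch below uses a hypothetical `pretty_allele_sketch` rather than the package function, and shows how a `{formatted: display}` mapping like the name_map accepted by render_report could be built from it; that usage is an assumption, not taken from the source.

```python
def pretty_allele_sketch(allotype: str) -> str:
    """Return the source allotype unchanged except for surrounding whitespace.

    Hypothetical stand-in for the documented behaviour of pretty_allele.
    """
    return allotype.strip()


allotypes = ["HLA-A25:08", " HLA-A3301 ", "DRB1_0101", "H-2-IAb\n"]

# A {formatted: display} mapping, keyed by the raw string as stored.
name_map = {raw: pretty_allele_sketch(raw) for raw in allotypes}
```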