API reference¶

Auto-generated from the package docstrings.

Search engine¶

mhc_tp.engine.search ¶

Search Gibbs cluster matrices against the reference array.

Method (how a cluster is matched to an allotype)¶

1. Each GibbsCluster motif and each reference allotype is represented as a position-specific scoring matrix (PSSM): n_positions x 20 amino-acid weights. Gibbs matrices are zero-padded to a common position count so they can be batched.
2. For every (cluster, allotype) pair the two PSSMs are compared by Pearson correlation of their flattened weights, computed only over the cells V = {k : g_k != 0 and g_k not NaN} that are informative in the cluster matrix (so padding and empty positions do not dilute the score):

   PCC(g, r) = Σ_{k∈V}(g_k - ḡ)(r_k - r̄) / ( |V| · σ_g · σ_r ) ∈ [-1, 1]

   with means and standard deviations taken over V. The score is invariant to positive rescaling and to constant offsets of either matrix, so it scores motif shape, not absolute magnitudes. Full derivation and numerical guards: see compute_all_correlations in mhc_tp.engine.kernels.
3. Per cluster the allotypes are ranked by correlation (PCC, -1..1; 1.0 = identical motif). Selection is then either threshold-gated (default) or pure top-N (always_top_n); see search.

The correlation is a motif-shape similarity: it rewards matching the relative preference pattern across positions, not the absolute weight magnitudes.
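The masked correlation described above can be illustrated with a few lines of plain NumPy. This is a minimal sketch, not the package's kernel: the helper name `masked_pcc` and the toy matrices are assumptions made for illustration, and returning NaN for an undefined score is this sketch's choice.

```python
import numpy as np


def masked_pcc(g: np.ndarray, r: np.ndarray) -> float:
    """Pearson correlation of two PSSMs over the informative cells of ``g``."""
    g_flat = g.ravel().astype(np.float64)
    r_flat = r.ravel().astype(np.float64)
    # V: cells that are neither zero (padding / empty) nor NaN in the cluster matrix.
    valid = ~(np.isnan(g_flat) | (g_flat == 0.0))
    gv, rv = g_flat[valid], r_flat[valid]
    if gv.size == 0:
        return float("nan")  # nothing to score
    # Population standard deviations (ddof=0), matching the 1/|V| normalisation.
    g_std, r_std = gv.std(), rv.std()
    if g_std == 0.0 or r_std == 0.0:
        return float("nan")  # correlation is undefined for a constant vector
    return float(np.mean((gv - gv.mean()) * (rv - rv.mean())) / (g_std * r_std))


rng = np.random.default_rng(0)
motif = rng.normal(size=(9, 20))          # toy 9-position PSSM
padded = np.zeros((12, 20))
padded[:9] = motif                        # zero-padded to 12 positions
reference = np.zeros((12, 20))
reference[:9] = 3.0 * motif + 5.0         # same shape, rescaled and shifted
reference[9:] = rng.normal(size=(3, 20))  # values under the padding are ignored

score = masked_pcc(padded, reference)     # close to 1.0
flipped = masked_pcc(padded, -reference)  # close to -1.0
```

Because the reference differs from the motif only by a positive scale and an offset, the score is 1 up to rounding, and the noise placed under the padded rows has no effect. Negating the reference flips the sign, which is why the invariance holds only for positive rescaling.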

search ¶

search(
    reference,
    gibbs_matrices,
    threshold=0.7,
    top_n=3,
    hla_filter=None,
    always_top_n=False,
)

Return {(gibbs_name, ref_formatted): correlation} for top-N hits.

By default a hit must score >= threshold to be returned, so a cluster may yield fewer than top_n rows (or none). When always_top_n is set, every cluster returns its top_n best matches regardless of threshold — the threshold then only annotates confidence downstream, it never drops a row.
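The difference between the two selection modes is easiest to see on a single row of scores. The sketch below mirrors the ranking logic on a toy row; `select_hits` is a hypothetical helper written for this illustration, and the scores are made up.

```python
import numpy as np


def select_hits(row, threshold=0.7, top_n=3, always_top_n=False):
    """Return the column indices of the selected hits, best first."""
    order = np.argsort(row)[::-1]  # descending by score
    if always_top_n:
        # Keep the best top_n computed cells, whatever their score;
        # -1.0 marks cells for which no score was stored.
        keep = [int(j) for j in order if row[j] > -1.0]
    else:
        # Threshold-gated: a cluster may end up with fewer than top_n hits.
        keep = [int(j) for j in order if row[j] >= threshold]
    return keep[:top_n]


row = np.array([0.91, 0.42, 0.75, 0.10, -1.0])
gated = select_hits(row)                      # [0, 2]
ranked = select_hits(row, always_top_n=True)  # [0, 2, 1]
```

In gated mode only the two scores at or above 0.7 survive, so the cluster yields two rows instead of three. In top-N mode the third-best score (0.42) is also returned, and judging whether it is a confident match is left to the caller.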

Source code in src/mhc_tp/engine/search.py

def search(
    reference: pd.DataFrame,
    gibbs_matrices: dict[str, np.ndarray],
    threshold: float = 0.70,
    top_n: int = 3,
    hla_filter: list[str] | None = None,
    always_top_n: bool = False,
) -> dict[tuple[str, str], float]:
    """Return ``{(gibbs_name, ref_formatted): correlation}`` for top-N hits.

    By default a hit must score ``>= threshold`` to be returned, so a cluster
    may yield fewer than ``top_n`` rows (or none). When ``always_top_n`` is set,
    every cluster returns its ``top_n`` best matches regardless of threshold —
    the threshold then only annotates confidence downstream, it never drops a
    row.
    """
    ref_arr, max_positions = build_reference_array(reference)
    names = list(gibbs_matrices.keys())

    padded = np.zeros((len(names), max_positions, N_AMINO_ACIDS), dtype=np.float32)
    for i, name in enumerate(names):
        m = gibbs_matrices[name]
        padded[i, : m.shape[0], : m.shape[1]] = m

    mask = np.ones(len(reference), dtype=np.bool_)
    if hla_filter:
        mask = reference["formatted"].isin(hla_filter).to_numpy()

    # In always-top-N mode, store every valid correlation (kernel keeps a -1.0
    # sentinel for cells below its threshold), then rank in Python.
    kernel_threshold = -2.0 if always_top_n else threshold
    corr, _invalid = compute_all_correlations(
        padded, ref_arr.astype(np.float32), mask, kernel_threshold
    )

    formatted = reference["formatted"].to_numpy()
    out: dict[tuple[str, str], float] = {}
    for i, name in enumerate(names):
        row = corr[i, :]
        order = np.argsort(row)[::-1]
        if always_top_n:
            # Top-N among computed (non-sentinel, unmasked) cells, any score.
            hits = [j for j in order if mask[j] and row[j] > -1.0][:top_n]
        else:
            hits = [j for j in order if row[j] >= threshold][:top_n]
        for j in hits:
            out[(name, str(formatted[j]))] = float(row[j])
    return out
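The returned mapping is flat, keyed by (cluster, allotype) pairs. For reporting it is often handier grouped per cluster. A small sketch of one way to regroup it follows; the cluster names, allotype labels and scores are invented placeholders, not output from the package.

```python
from collections import defaultdict

# Hypothetical result in the shape search returns; names and scores are made up.
hits = {
    ("cluster_1", "HLA-A*02:06"): 0.88,
    ("cluster_1", "HLA-A*02:01"): 0.93,
    ("cluster_2", "HLA-B*07:02"): 0.81,
}

by_cluster = defaultdict(list)
for (gibbs_name, allotype), corr in hits.items():
    by_cluster[gibbs_name].append((allotype, corr))

# Sort each cluster's hits best-first.
ranked = {
    name: sorted(pairs, key=lambda pair: pair[1], reverse=True)
    for name, pairs in by_cluster.items()
}
```

A cluster that produced no hit in threshold-gated mode simply has no key in the result, so callers that need one row per cluster should iterate over the input names rather than over the mapping.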

mhc_tp.engine.kernels ¶

Numba JIT correlation kernel (ported from the proven NumbaSearch engine).

compute_all_correlations ¶

compute_all_correlations(
    gibbs_matrices, ref_matrices, hla_mask, threshold
)

All-pairs flattened Pearson correlation, parallel over Gibbs matrices.

Each PSSM is flattened to a vector. Only the cells that are informative in the Gibbs matrix are scored: the valid set is

V = { k : g_k != 0 and g_k is not NaN }

Restricted to those cells, the score for a (Gibbs g, reference r) pair is the Pearson correlation coefficient

          (1/|V|) * Σ_{k in V} (g_k - ḡ)(r_k - r̄)
PCC(g,r) = ----------------------------------------
                        σ_g · σ_r

where ḡ, r̄ are the means and σ_g, σ_r the population standard deviations taken over V:

ḡ   = (1/|V|) Σ g_k ,                σ_g = sqrt( (1/|V|) Σ (g_k - ḡ)^2 )
r̄   = (1/|V|) Σ r_k ,                σ_r = sqrt( (1/|V|) Σ (r_k - r̄)^2 )

PCC lies in [-1, 1] (1 = identical motif shape) and is scale/offset invariant, so it measures the pattern of position preferences rather than absolute weight magnitudes.

Guards: a Gibbs matrix with |V| < 10 or σ_g = 0 is flagged invalid (its row is skipped); a reference with σ_r = 0 is skipped for that pair. A score is stored only when PCC >= threshold; otherwise the cell keeps the -1.0 sentinel. Returns (correlations, invalid_flags).
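To make the guards and the sentinel concrete, here is an unoptimised, pure-NumPy sketch of the same all-pairs computation. It is meant as a readable cross-check under the rules stated above, not as the JIT kernel itself; the function name `correlations_reference`, the `min_cells` parameter and the toy inputs are assumptions of this example.

```python
import numpy as np

SENTINEL = -1.0  # value kept by cells for which no score is stored


def correlations_reference(gibbs, refs, mask, threshold, min_cells=10):
    """Plain-Python sketch of the all-pairs masked PCC with its guards."""
    n_g, n_r = gibbs.shape[0], refs.shape[0]
    out = np.full((n_g, n_r), SENTINEL, dtype=np.float32)
    invalid = np.zeros(n_g, dtype=np.int32)
    for i in range(n_g):
        g = gibbs[i].ravel()
        valid = ~(np.isnan(g) | (g == 0.0))
        gv = g[valid]
        # Guard 1: too few informative cells, or a constant cluster matrix.
        if gv.size < min_cells or gv.std() == 0.0:
            invalid[i] = 1
            continue
        for j in range(n_r):
            if not mask[j]:
                continue  # allotype filtered out by the caller
            rv = refs[j].ravel()[valid]
            # Guard 2: a reference that is constant over V has no defined PCC.
            if rv.std() == 0.0:
                continue
            pcc = np.mean((gv - gv.mean()) * (rv - rv.mean())) / (gv.std() * rv.std())
            if pcc >= threshold:
                out[i, j] = pcc
    return out, invalid


rng = np.random.default_rng(1)
good = rng.normal(size=(9, 20))
sparse = np.zeros((9, 20))
sparse[0, :5] = 1.0  # only 5 informative cells, so this row is flagged invalid
gibbs = np.stack([good, sparse])
refs = np.stack([2.0 * good, np.ones((9, 20)), -good])
mask = np.array([True, True, True])

corr, invalid = correlations_reference(gibbs, refs, mask, threshold=0.7)
```

Here the first cluster scores about 1.0 against the rescaled copy, while the constant reference and the anti-correlated reference both leave the sentinel in place. The last case shows a consequence of the design: a stored value of -1.0 cannot be told apart from a pair that was skipped or fell below the threshold, so the invalid flags are the only reliable signal that a whole row was not scored.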

Source code in src/mhc_tp/engine/kernels.py

@jit(nopython=True, parallel=True, cache=True)
def compute_all_correlations(gibbs_matrices, ref_matrices, hla_mask, threshold):
    r"""All-pairs flattened Pearson correlation, parallel over Gibbs matrices.

    Each PSSM is flattened to a vector. Only the cells that are informative in
    the Gibbs matrix are scored: the valid set is

        V = { k : g_k != 0 and g_k is not NaN }

    Restricted to those cells, the score for a (Gibbs g, reference r) pair is
    the Pearson correlation coefficient

                  (1/|V|) * Σ_{k in V} (g_k - ḡ)(r_k - r̄)
        PCC(g,r) = ----------------------------------------
                                σ_g · σ_r

    where ḡ, r̄ are the means and σ_g, σ_r the population standard deviations
    taken over V:

        ḡ   = (1/|V|) Σ g_k ,                σ_g = sqrt( (1/|V|) Σ (g_k - ḡ)^2 )
        r̄   = (1/|V|) Σ r_k ,                σ_r = sqrt( (1/|V|) Σ (r_k - r̄)^2 )

    PCC lies in [-1, 1] (1 = identical motif shape) and is scale/offset
    invariant, so it measures the *pattern* of position preferences rather than
    absolute weight magnitudes.

    Guards: a Gibbs matrix with |V| < 10 or σ_g = 0 is flagged invalid (its row
    is skipped); a reference with σ_r = 0 is skipped for that pair. A score is
    stored only when PCC >= ``threshold``; otherwise the cell keeps the -1.0
    sentinel. Returns (correlations, invalid_flags).
    """
    n_gibbs = gibbs_matrices.shape[0]
    n_refs = ref_matrices.shape[0]
    correlations = np.full((n_gibbs, n_refs), -1.0, dtype=np.float32)
    invalid_flags = np.zeros(n_gibbs, dtype=np.int32)

    for i in prange(n_gibbs):
        gibbs_flat = gibbs_matrices[i].flatten()
        valid = ~(np.isnan(gibbs_flat) | (gibbs_flat == 0.0))
        gibbs_clean = gibbs_flat[valid]
        if len(gibbs_clean) < 10:
            invalid_flags[i] = 1
            continue
        g_mean = np.mean(gibbs_clean)
        g_std = np.std(gibbs_clean)
        if g_std == 0.0:
            invalid_flags[i] = 1
            continue
        for j in range(n_refs):
            if not hla_mask[j]:
                continue
            ref_clean = ref_matrices[j].flatten()[valid]
            r_mean = np.mean(ref_clean)
            r_std = np.std(ref_clean)
            if r_std == 0.0:
                continue
            num = np.mean((gibbs_clean - g_mean) * (ref_clean - r_mean))
            corr = num / (g_std * r_std)
            if corr >= threshold:
                correlations[i, j] = corr
    return correlations, invalid_flags

Reference data¶

mhc_tp.refdata.fetch ¶

Resolve and fetch prebuilt reference parquets to a per-user data dir.

End users run mhc-tp fetch to download the prebuilt class I+II reference parquets (with embedded Seq2Logo reference logos) instead of building them. The download source and checksums live in the packaged reference_manifest.tsv; the maintainer fills them in on each release.
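Checking a downloaded file against a manifest checksum could look roughly like the sketch below. Everything specific here is an assumption made for illustration: the layout of reference_manifest.tsv is not documented in this section, so the `filename` and `sha256` column names, the choice of SHA-256, and the helper names are hypothetical.

```python
import csv
import hashlib
import io
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large parquets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, manifest_tsv: str) -> bool:
    """Check ``path`` against the row with the same filename in a TSV manifest."""
    # Hypothetical columns: "filename" and "sha256".
    rows = csv.DictReader(io.StringIO(manifest_tsv), delimiter="\t")
    expected = {row["filename"]: row["sha256"] for row in rows}
    return expected.get(path.name) == sha256_of(path)


# Toy round trip with a throwaway file standing in for a downloaded parquet.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "reference.parquet"
    target.write_bytes(b"not a real parquet")
    manifest = "filename\tsha256\n" + f"{target.name}\t{sha256_of(target)}\n"
    ok = verify(target, manifest)           # file matches its manifest entry
    target.write_bytes(b"corrupted")
    tampered_ok = verify(target, manifest)  # contents changed, check fails
```

A file missing from the manifest also fails the check, since the lookup returns None rather than a digest.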