This vignette provides a complete, step-by-step guide to training
custom KIR prediction models using the train command in
PONG2.
Training is useful when:
KIR3DL1 or
KIR2DL1)

| Requirement | Version | Notes |
|---|---|---|
| PLINK2 | ≥ 2.0 | Must be in PATH |
| R | ≥ 4.0 | With PONG2 installed |
| Reference PLINK files | — | chr19 covering the KIR locus |
| KIR allele calls | — | CSV format (see below) |
Reference PLINK files (--bfile): PLINK bed/bim/fam files containing SNPs in the KIR locus (chr19).
The 1000 Genomes Project (1KGP) is the recommended reference panel for training PONG2 models. Choose the panel that matches your genome assembly:
| Assembly | 1KGP Reference Panel | Samples | URL |
|---|---|---|---|
| hg19 | Phase 3 | 2,504 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ |
| hg38 | High Coverage | 3,202 | https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/ |
Download and extract the KIR region:
# ── hg19: 1KGP Phase 3 ──────────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\
ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz
plink2 \
--vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
--chr 19 \
--from-bp 55000000 \
--to-bp 55400000 \
--make-bed \
--out reference_chr19_hg19
# ── hg38: 1KGP High Coverage ─────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/\
1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/\
1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz
plink2 \
--vcf 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \
--chr 19 \
--from-bp 54000000 \
--to-bp 55000000 \
--make-bed \
--out reference_chr19_hg38

KIR allele calls (--kfile): a CSV file with sample IDs and phased KIR allele
calls. Each locus requires two columns, one per haplotype
(_h1 and _h2).
Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,KIR3DL1*005
SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*00302,KIR3DL1*001
| Rule | Detail |
|---|---|
| Header required | First row must be Sample,<locus>_h1,<locus>_h2,... |
| Sample IDs | Must exactly match IDs in the PLINK .fam file |
| Allele names | Use full standard nomenclature (e.g. KIR3DL1*001, KIR2DL1*00201) |
| Null allele | Use <locus>*0000, e.g. KIR2DL1*0000 (not blank, not null) |
| Unresolved alleles | Rows with *new or *unresolved are automatically excluded |
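The rules above can be checked before training with a short shell sketch. It builds a toy kir_calls.csv (hypothetical data, with one blank cell and one *new allele planted on purpose) and reports header problems, blank cells, and unresolved alleles. It only mirrors the rules in the table; PONG2's own validation may differ.

```shell
set -eu

# Toy --kfile in the documented layout (hypothetical data for illustration).
workdir=$(mktemp -d)
cat > "$workdir/kir_calls.csv" <<'EOF'
Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,
SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*new,KIR3DL1*001
EOF

# Check the header pairs, then scan every data row for blanks and
# unresolved alleles.
report=$(awk -F',' '
  NR == 1 {
    if ($1 != "Sample") print "header: first column must be Sample"
    for (i = 2; i <= NF; i += 2) {
      locus = $i
      sub(/_h1$/, "", locus)
      if ($i != (locus "_h1") || $(i + 1) != (locus "_h2"))
        print "header: columns " i "-" (i + 1) " are not a <locus>_h1/<locus>_h2 pair"
    }
    ncol = NF
    next
  }
  {
    if (NF != ncol) print "line " NR ": expected " ncol " fields, found " NF
    for (i = 2; i <= NF; i++) {
      if ($i == "") print "line " NR ": blank allele in column " i
      else if ($i ~ /\*(new|unresolved)$/) print "line " NR ": unresolved allele " $i
    }
  }
' "$workdir/kir_calls.csv")

printf '%s\n' "$report"
```

On the toy file this flags the blank cell on line 3 and the *new allele on line 4. An empty report means the file passed these particular checks.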
pong2 train \
-i reference_chr19_hg38 \
-k kir_calls.csv \
-o models/KIR3DL1 \
-l KIR3DL1 \
-a hg38 \
-t 20

| Flag | Default | Description |
|---|---|---|
| --nclassifier | 100 | Number of ensemble classifiers; more classifiers can improve accuracy but slow training |
| --split | 0.7 | Proportion of samples used for training (remainder used for validation) |
| --kirmaf | 0.00 | Minimum KIR allele frequency; filters rare alleles from training |
| --mac | 3 | Minimum allele count for SNPs; removes very rare variants |
| -r, --region | Optimized default | Custom SNP region (e.g. 55281035-55295784 for hg19) |
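When choosing a --kirmaf value, it can help to see how often each allele occurs in the call file. The sketch below tallies alleles for one locus across its _h1 and _h2 columns and marks those below a chosen threshold. The toy data and the threshold of 0.2 are hypothetical. The frequency definition (allele count divided by the total number of haplotypes) is an assumption, and PONG2 may compute the filter differently.

```shell
set -eu

# Toy call file (hypothetical data for illustration).
workdir=$(mktemp -d)
cat > "$workdir/kir_calls.csv" <<'EOF'
Sample,KIR3DL1_h1,KIR3DL1_h2
S1,KIR3DL1*001,KIR3DL1*002
S2,KIR3DL1*001,KIR3DL1*005
S3,KIR3DL1*001,KIR3DL1*002
S4,KIR3DL1*004,KIR3DL1*001
EOF

locus=KIR3DL1
threshold=0.2   # candidate --kirmaf value (illustrative)

# Find the locus columns from the header, count alleles over both
# haplotype columns, and print allele, count, frequency, and status.
freqs=$(awk -F',' -v locus="$locus" -v thr="$threshold" '
  NR == 1 {
    for (i = 1; i <= NF; i++)
      if ($i == (locus "_h1") || $i == (locus "_h2")) use[i] = 1
    next
  }
  {
    for (i in use) { count[$i]++; total++ }
  }
  END {
    for (a in count) {
      f = count[a] / total
      printf "%s\t%d\t%.3f\t%s\n", a, count[a], f, (f < thr ? "below" : "ok")
    }
  }
' "$workdir/kir_calls.csv" | sort)

printf '%s\n' "$freqs"
```

With the toy data, KIR3DL1*004 and KIR3DL1*005 each appear once in eight haplotypes (0.125) and are marked as below the threshold.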
After training completes, the output directory (-o) will
contain:
| File | Description |
|---|---|
| <locus>_model.RData | Trained prediction model (main output) |
| <locus>_test.RData | Test genotypes (only when --split < 1) |
| <locus>_split.RData | Train/test split object (only when --split < 1) |

Note: Temporary files in tmp/ are automatically removed after training completes.
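A quick way to confirm a run produced the files listed above is to loop over the expected names. In this sketch the output directory is a temporary stand-in with a single placeholder file, so it runs without PONG2. In practice, point outdir at the directory given to -o.

```shell
set -eu

locus=KIR3DL1
outdir=$(mktemp -d)                     # stand-in for the directory passed to -o
touch "$outdir/${locus}_model.RData"    # placeholder so the sketch runs without PONG2

# Report which of the documented output files are present.
missing=""
for suffix in model test split; do
  file="$outdir/${locus}_${suffix}.RData"
  if [ -e "$file" ]; then
    echo "found:   $file"
  else
    echo "missing: $file"
    missing="$missing $suffix"
  fi
done
```

A missing test or split file is expected when no validation set was held out. A missing model file means training did not finish.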
If --split < 1, PONG2 holds out a validation set
during training.
Evaluating the model on this held-out set reports haplotype accuracy, genotype accuracy, call rate, and per-allele sensitivity/specificity, and saves a summary CSV to the model directory.
library(PONG2)

# Load saved objects from the training output directory (-o)
path  <- "models/KIR3DL1"
locus <- "KIR3DL1"
mobj      <- get(load(file.path(path, paste0(locus, "_model.RData"))))
test.geno <- get(load(file.path(path, paste0(locus, "_test.RData"))))
kirtab    <- get(load(file.path(path, paste0(locus, "_split.RData"))))
model <- hlaModelFromObj(mobj)

# Predict on the held-out test set
pred <- kirPredict(model, test.geno, type = "response+prob", verbose = FALSE)

# Compare predictions with the true alleles using hlaCompareAllele
comp <- hlaCompareAllele(kirtab$validation, pred,
                         allele.limit = model, call.threshold = 0.5)

# Overall accuracy
cat(sprintf("Haplotype accuracy: %.1f%%\n", comp$overall$acc.haplo * 100))
cat(sprintf("Genotype accuracy: %.1f%%\n", comp$overall$acc.geno * 100))
cat(sprintf("Call rate: %.1f%%\n", comp$overall$call.rate * 100))
cat(sprintf("Test samples: %d\n", comp$overall$n.samp))

# Per-allele accuracy, least accurate alleles first
if (!is.null(comp$detail)) {
  allele_detail <- comp$detail[order(comp$detail$acc.haplo), ]
  print(allele_detail)
}

# Save a summary CSV in the model directory
eval_summary <- data.frame(
  Locus     = locus,
  N_test    = comp$overall$n.samp,
  Acc_Haplo = round(comp$overall$acc.haplo * 100, 2),
  Acc_Geno  = round(comp$overall$acc.geno * 100, 2),
  Call_Rate = round(comp$overall$call.rate * 100, 2)
)
write.csv(eval_summary,
          file = file.path(path, paste0(locus, "_eval_summary.csv")),
          row.names = FALSE)

Once trained, use your model directly with pong2 impute:
pong2 impute \
-i chr19_target \
-o results/ \
-l KIR3DL1 \
-a hg38 \
-m models/KIR3DL1/KIR3DL1_model.RData

Or load it directly in R:
library(PONG2)
mobj <- get(load("models/KIR3DL1/KIR3DL1_model.RData"))
model <- hlaModelFromObj(mobj)
geno <- hlaBED2Geno("chr19_target.bed", "chr19_target.fam", "chr19_target.bim",
                    import.chr = "19", assembly = "hg38")
pred <- kirPredict(model, geno, type = "response+prob")

| Error | Likely Cause | Fix |
|---|---|---|
| No matching samples | Sample IDs in --kfile do not match .fam | Check ID format; IDs must match exactly |
| Insufficient training samples | Too few overlapping samples (<10) | Verify PLINK and KIR file sample overlap |
| No SNPs found in region | Wrong assembly or region coordinates | Check --assembly and --region |
| No model found for locus | Locus name typo or unsupported locus | Check locus spelling (e.g. KIR3DL1 not KIR3DL) |
| Memory issues | Too many threads or large dataset | Reduce --threads or use HPC with more RAM |
| Slow training | Insufficient threads | Increase --threads or reduce --nclassifier |
| Low accuracy | Too few training samples or rare alleles | Increase sample size or adjust --kirmaf |
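For the two sample-related errors, comparing IDs between the .fam file and the KIR call file usually shows the cause. The sketch below uses toy files (hypothetical data, with one ID differing only in case) and assumes matching is done on the individual ID in column 2 of the .fam file. Whether PONG2 matches on that column is an assumption, not something stated above.

```shell
set -eu

workdir=$(mktemp -d)

# Toy .fam file: FID IID PAT MAT SEX PHENO (hypothetical data).
printf '%s\n' \
  'SAMPLE001 SAMPLE001 0 0 0 -9' \
  'SAMPLE002 SAMPLE002 0 0 0 -9' \
  'sample003 sample003 0 0 0 -9' > "$workdir/reference.fam"

# Toy KIR call file (hypothetical data).
cat > "$workdir/kir_calls.csv" <<'EOF'
Sample,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR3DL1*004,KIR3DL1*005
SAMPLE003,KIR3DL1*00302,KIR3DL1*001
EOF

# Extract and sort both ID lists under one locale so comm can compare them.
awk '{ print $2 }' "$workdir/reference.fam" | LC_ALL=C sort -u > "$workdir/fam_ids.txt"
awk -F',' 'NR > 1 { print $1 }' "$workdir/kir_calls.csv" | LC_ALL=C sort -u > "$workdir/kir_ids.txt"

# IDs present in both files, and IDs found only in the KIR call file.
n_overlap=$(LC_ALL=C comm -12 "$workdir/fam_ids.txt" "$workdir/kir_ids.txt" | grep -c . || true)
only_kir=$(LC_ALL=C comm -13 "$workdir/fam_ids.txt" "$workdir/kir_ids.txt")

echo "overlapping samples: $n_overlap"
echo "only in KIR calls:   $only_kir"
```

Here SAMPLE003 is reported as present only in the KIR calls because the .fam file spells it sample003, the kind of mismatch that exact ID matching will reject.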
Happy KIR model training! 🧬