PONG2 Training: Building Custom KIR Prediction Models

Norman Lab

Overview

This vignette provides a complete, step-by-step guide to training custom KIR prediction models using the train command in PONG2.

Training is useful when:


Prerequisites

Requirement Version Notes
PLINK2 ≥ 2.0 Must be in PATH
R ≥ 4.0 With PONG2 installed
Reference PLINK files chr19 covering the KIR locus
KIR allele calls CSV format (see below)

Step 1: Prepare Input Data

1a. Reference genotypes (--bfile)

PLINK bed/bim/fam files containing SNPs in the KIR locus (chr19).

Using the 1000 Genomes Project (1KGP) as reference panel

The 1KGP is the recommended reference panel for training PONG2 models. Choose the appropriate panel for your genome assembly:

Assembly 1KGP Reference Panel Samples URL
hg19 Phase 3 2,504 https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
hg38 High Coverage 3,202 https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/

Download and extract the KIR region:

# ── hg19: 1KGP Phase 3 ──────────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/\
ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz

plink2 \
  --vcf ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz \
  --chr 19 \
  --from-bp 55000000 \
  --to-bp 55400000 \
  --make-bed \
  --out reference_chr19_hg19

# ── hg38: 1KGP High Coverage ─────────────────────────────────────────────────
wget https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/\
1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/\
1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz

plink2 \
  --vcf 1kGP_high_coverage_Illumina.chr19.filtered.SNV_INDEL_SV_phased_panel.vcf.gz \
  --chr 19 \
  --from-bp 54000000 \
  --to-bp 55000000 \
  --make-bed \
  --out reference_chr19_hg38

Using your own reference dataset

Extract chr19 from your full-genome PLINK files:

plink2 --bfile full_reference --chr 19 --make-bed --out reference_chr19

1b. Known KIR allele calls (--kfile)

A comma-separated CSV file with sample IDs and phased KIR allele calls. Each locus requires two columns representing the two haplotypes (_h1 and _h2).

Format

Sample,KIR2DL1_h1,KIR2DL1_h2,KIR3DL1_h1,KIR3DL1_h2
SAMPLE001,KIR2DL1*00101,KIR2DL1*00302,KIR3DL1*001,KIR3DL1*002
SAMPLE002,KIR2DL1*00104,KIR2DL1*0000,KIR3DL1*004,KIR3DL1*005
SAMPLE003,KIR2DL1*009,KIR2DL1*00601,KIR3DL1*00302,KIR3DL1*001

Rules

Rule Detail
Header required First row must be Sample,<locus>_h1,<locus>_h2,...
Sample IDs Must exactly match IDs in the PLINK .fam file
Allele names Use full standard nomenclature (e.g. KIR3DL1*001, KIR2DL1*00201)
Null allele Use KIR<locus>*0000 (not blank, not null)
Unresolved alleles Rows with *new or *unresolved are automatically excluded

Step 2: Run Training

pong2 train \
  -i reference_chr19 \
  -k kir_calls.csv \
  -o models/KIR3DL1 \
  -l KIR3DL1 \
  -a hg38 \
  -t 20

With optional parameters

pong2 train \
  -i reference_chr19 \
  -k kir_calls.csv \
  -o models/KIR3DL1 \
  -l KIR3DL1 \
  -a hg38 \
  -t 20 \
  --nclassifier 200 \
  --split 0.8 \
  --kirmaf 0.005 \
  --mac 5

Key training parameters

Flag Default Description
--nclassifier 100 Number of ensemble classifiers — higher = more accurate but slower
--split 0.7 Proportion of samples for training (remainder used for validation)
--kirmaf 0.00 Minimum KIR allele frequency — filters rare alleles from training
--mac 3 Minimum allele count for SNPs — removes very rare variants
-r, --region Optimized default Custom SNP region (e.g. 55281035-55295784 for hg19)

Step 3: Training Output

After training completes, the output directory (-o) will contain:

File Description
<locus>_model.RData Trained prediction model — main output
<locus>_test.RData Test genotypes (only when --split < 1)
<locus>_split.RData Train/test split object (only when --split < 1)

Note: Temporary files in tmp/ are automatically removed after training completes.


Step 4: Evaluate Model Performance

If --split < 1, PONG2 holds out a validation set during training.

Option B: Evaluate in R

library(PONG2)

# Load saved objects
path      <- "models/KIR3DL1"
mobj      <- get(load(paste0(path, "_model.RData")))
test.geno <- get(load(paste0(path, "_test.RData")))
kirtab    <- get(load(paste0(path, "_split.RData")))
model     <- hlaModelFromObj(mobj)

# Predict on test set
pred <- kirPredict(model, test.geno, type = "response+prob", verbose = FALSE)

# Evaluate using hlaCompareAllele
comp <- hlaCompareAllele(kirtab$validation, pred,
                         allele.limit = model, call.threshold = 0.5)

# Overall accuracy
cat(sprintf("Haplotype accuracy: %.1f%%\n", comp$overall$acc.haplo * 100))
cat(sprintf("Genotype accuracy:  %.1f%%\n", comp$overall$acc.geno  * 100))
cat(sprintf("Call rate:          %.1f%%\n", comp$overall$call.rate  * 100))
cat(sprintf("Test samples:       %d\n",     comp$overall$n.samp))

# Per-allele accuracy
if (!is.null(comp$detail)) {
  allele_detail <- comp$detail[order(comp$detail$acc.haplo), ]
  print(allele_detail)
}

# Save summary
eval_summary <- data.frame(
  Locus     = "KIR3DL1",
  N_test    = comp$overall$n.samp,
  Acc_Haplo = round(comp$overall$acc.haplo  * 100, 2),
  Acc_Geno  = round(comp$overall$acc.geno   * 100, 2),
  Call_Rate = round(comp$overall$call.rate  * 100, 2)
)
write.csv(eval_summary, file = paste0(path, "_eval_summary.csv"), row.names = FALSE)

Step 5: Use a Custom Model for Imputation

Once trained, use your model directly with pong2 impute:

pong2 impute \
  -i chr19_target \
  -o results/ \
  -l KIR3DL1 \
  -a hg38 \
  -m models/KIR3DL1/KIR3DL1_model.RData

Or load it directly in R:

library(PONG2)

load("models/KIR3DL1/KIR3DL1_model.RData")
model <- hlaModelFromObj(mobj)

geno <- hlaBED2Geno("chr19_target.bed", "chr19_target.fam", "chr19_target.bim",
                    import.chr = "19", assembly = "hg38")
pred <- kirPredict(model, geno, type = "response+prob")

Troubleshooting

Error Likely Cause Fix
No matching samples Sample IDs in --kfile don’t match .fam Check ID format — must match exactly
Insufficient training samples Too few overlapping samples (<10) Verify PLINK and KIR file sample overlap
No SNPs found in region Wrong assembly or region coordinates Check --assembly and --region
No model found for locus Locus name typo or unsupported locus Check locus spelling (e.g. KIR3DL1 not KIR3DL)
Memory issues Too many threads or large dataset Reduce --threads or use HPC with more RAM
Slow training Insufficient threads Increase --threads or reduce --nclassifier
Low accuracy Too few training samples or rare alleles Increase sample size or adjust --kirmaf

Next Steps

Happy KIR model training! 🧬