The Normality Transformation via Optimized Skewness and Kurtosis (OSKT) is a normality method that simultaneously evaluates deviations in skewness and kurtosis of non-normal data.
If the package osktnorm has already been installed, load it into R working environment by using the following command:
library(osktnorm)To handle Excel file formats, the readxl package is used for importing data and writexl for exporting results. The following code chunk checks for their presence and installs them if necessary before loading.
if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl")
if (!requireNamespace("writexl", quietly = TRUE)) install.packages("writexl")
library(readxl)
library(writexl)The target dataset (sativapheno.xlsx) is imported from the package’s internal extdata directory. The code then subsets specific columns of interest and applies a filter to retain only numeric variables, ensuring the data is compatible with the normalization algorithms.
file_path <- system.file("extdata", "sativapheno.xlsx", package = "osktnorm")
pheno <- read_excel(file_path)
# Select some variables from the pheno
phenoset <- as.data.frame(pheno[, c(4,5,6,7,9,10,12,13,14)])
phenoset <- phenoset[, sapply(phenoset, is.numeric)]
head(phenoset)
FTB RTA RTF CULM FLL FLW PNP PHT PLEN
1 81 0.9269547 0.7901235 4.0 28.37500 1.2833333 3.068053 110.9167 20.48182
2 83 1.0783133 0.7951807 7.5 39.00833 1.0000000 4.051785 143.5000 26.83333
3 108 0.8101852 0.6481481 3.5 30.41667 0.8916667 3.697178 153.7500 28.96364
4 101 0.8820132 0.7227723 6.0 36.90833 1.7500000 2.857428 148.3333 30.91667
5 81 0.9012346 0.9259259 6.5 24.42500 1.2125000 2.924505 128.8750 20.20000
6 124 0.8342294 0.6612903 3.0 36.38889 1.1666667 2.899588 112.4444 21.40000This section focuses on evaluating the distributional characteristics of the raw phenotypic data before any transformation. Visual inspection is a critical first step in determining the degree of skewness and kurtosis present in each trait.
A grid of histograms is generated to visualize the distribution of each variable. For each plot, the actual data distribution (black line) is compared against a theoretical normal distribution curve (red dashed line). This comparison highlights which traits deviate significantly from normality.
oldpar <- par(no.readonly = TRUE)
nvar <- ncol(phenoset)
nrow_plot <- ceiling(sqrt(nvar))
ncol_plot <- ceiling(nvar / nrow_plot)
par(mfrow = c(nrow_plot, ncol_plot), mar = c(4, 4, 3, 1))
cols <- rainbow(nvar)
for(i in seq_len(nvar)) {
x <- phenoset[[i]]
x <- x[!is.na(x)]
if(length(x) == 0) next
h_calc <- hist(x, plot = FALSE)
d <- density(x)
max_y <- max(c(h_calc$density, d$y)) * 1.1
hist(x,
breaks = 15,
freq = FALSE,
col = adjustcolor(cols[i], 0.5),
border = "white",
main = colnames(phenoset)[i],
xlab = "Transformed Value",
ylab = "Density",
ylim = c(0, max_y))
lines(d, col = "black", lwd = 2)
curve(dnorm(x, mean = mean(x), sd = sd(x)), add = TRUE, col = "red", lty = 2, lwd = 2)
}par(oldpar)To provide statistical evidence of non-normality, the Shapiro-Wilk test is applied to each trait. A data frame is created to summarize the W statistic and p-values. Traits with a p-value <0.05 are flagged as “NO” (not normal), justifying the need for OSKT normalization.
shapiro_results <- data.frame(
Trait = colnames(phenoset),
W = NA_real_,
pval = NA_real_,
IsNormal = NA_character_,
stringsAsFactors = FALSE
)
for(i in seq_len(ncol(phenoset))) {
x <- phenoset[[i]]
x <- x[!is.na(x)]
if(length(x) >= 3) {
test <- shapiro.test(x)
shapiro_results$W[i] <- test$statistic
shapiro_results$pval[i] <- test$p.value
shapiro_results$IsNormal[i] <- ifelse(test$p.value > 0.05, "YES", "NO")
} else {
shapiro_results$W[i] <- NA
shapiro_results$pval[i] <- NA
shapiro_results$IsNormal[i] <- NA
}
}
shapiro_results
Trait W pval IsNormal
1 FTB 0.9788356 5.092575e-03 NO
2 RTA 0.9334772 9.979590e-08 NO
3 RTF 0.9888670 1.369634e-01 YES
4 CULM 0.9477705 1.707394e-06 NO
5 FLL 0.9816832 1.262374e-02 NO
6 FLW 0.9879264 1.002137e-01 YES
7 PNP 0.9836742 2.422959e-02 NO
8 PHT 0.9743236 1.292045e-03 NO
9 PLEN 0.9887964 1.338064e-01 YESThe core normalization process is executed using the osktnorm function. By setting normtests = FALSE, the function focuses on calculating the optimal transformation parameters and generating the normalized data frame (phenonormal), where each variable is adjusted to achieve a distribution as close to normal as possible.
# Normalization only
norm_res1 <- osktnorm(phenoset, normtests = FALSE)
phenonormal <- norm_res1$normalized
head(phenonormal)
FTB RTA RTF CULM FLL FLW PNP PHT PLEN
1 -0.6614016 0.14436235 0.05676644 -0.09376608 -0.39908842 0.1552799 -0.3379437 -0.2328676 -0.9937679
2 -0.5690568 1.03287842 0.08838504 1.53027503 1.62682225 -1.0849767 2.1548686 1.2956039 0.8530834
3 0.4976650 -0.61989787 -0.87304919 -0.41355860 0.00958448 -1.6948815 1.1690496 1.8150743 1.4679050
4 0.2146980 -0.14122053 -0.37401838 0.89300452 1.20340711 2.0926295 -0.9257250 1.5323389 2.0615666
5 -0.6614016 -0.01779863 0.87197474 1.10524249 -1.36552951 -0.1256009 -0.7279871 0.6251948 -1.0826167
6 1.1021147 -0.45653796 -0.78346201 -0.78348175 1.10451288 -0.3146540 -0.8001148 -0.1534829 -0.7108529Following the transformation, the data is re-visualized to confirm the effectiveness of the OSKT method. This allows for a direct comparison between the initial skewed distributions and the newly normalized results.
Boxplots are utilized to examine the center, spread, and outliers of the normalized traits. This visualization helps verify that the variables now share a similar scale and that the influence of extreme values (outliers) has been effectively stabilized across the entire dataset.
oldpar <- par(no.readonly = TRUE)
par(mar = c(8,4,4,2))
boxplot(
phenonormal,
col = rainbow(ncol(phenonormal)),
border = "black",
outline = TRUE,
las = 2,
main = "OSKT Normalized Values",
ylab = "Values",
cex.axis = 0.8
)par(oldpar)After transformation, a new grid of histograms is generated for the phenonormal dataset. This visualization allows for an empirical evaluation of the OSKT method, confirming that the distributions now exhibit improved symmetry and follow a bell-shaped curve more closely compared to the raw data.
oldpar <- par(no.readonly = TRUE)
nvar <- ncol(phenonormal)
nrow_plot <- ceiling(sqrt(nvar))
ncol_plot <- ceiling(nvar / nrow_plot)
par(mfrow = c(nrow_plot, ncol_plot), mar = c(4,4,3,1))
cols <- rainbow(nvar)
for(i in seq_len(nvar)) {
x <- phenonormal[[i]]
x <- x[!is.na(x)]
if(length(x) == 0) next
d <- density(x)
h_calc <- hist(x, plot = FALSE)
max_y <- max(d$y, h_calc$density) * 1.1
hist(x, freq = FALSE,
col = adjustcolor(cols[i], 0.6),
border = "white",
main = colnames(phenonormal)[i],
xlab = "", ylab = "Density",
ylim = c(0, max_y))
lines(d, col = "black", lwd = 2)
}par(oldpar)This section extracts the internal parameters estimated during the transformation process. The table provides the optimized values for g (skewness parameter) and h (kurtosis parameter), along with the final A^2 value, which represents the minimized objective function for each trait.
norm_res2 <- osktnorm(phenoset, normtests = "all")
phenonormal <- norm_res2$normalized
paramstable <- norm_res2$parameters
paramstable
Trait g h A2
1 FTB -0.13899749 0.00000000 0.6090078
2 RTA -0.12797772 0.00000000 1.6800176
3 RTF -0.09953805 0.00000000 0.3356170
4 CULM -0.42127828 0.08195892 1.0143184
5 FLL -0.19870891 0.10456733 0.1649252
6 FLW -0.14575407 0.09071026 0.2707898
7 PNP -0.15577358 0.09627427 0.3650047
8 PHT -0.23855271 0.09861026 0.3858293
9 PLEN -0.05051491 0.03224488 0.3999539A comprehensive suite of statistical diagnostics is performed on the normalized data. By requesting normtests = “all”, the function computes multiple metrics including skewness, kurtosis, and various p-values (Shapiro-Wilk, Anderson-Darling, etc.). This table serves as the final validation that the traits have successfully achieved or approached normality.
testtable <- norm_res2$normtests
testtable
SKEW KURT SW ZA ZC CVM PPM
FTB 0.003506758 0.15576872 0.1662369210 0.34653465 0.26732673 0.19078092 0.8694301
RTA 0.208256466 1.61948818 0.0003716262 0.01980198 0.00990099 0.00039996 4.0196891
RTF -0.016418109 0.28318917 0.7596107115 0.85148515 0.91089109 0.45815418 1.3005181
CULM -0.125799038 -0.08714985 0.0303000687 0.04950495 0.04950495 0.01059894 4.3512953
FLL -0.044948116 -0.21528412 0.8842530426 0.77227723 0.74257426 0.92770723 0.6538860
FLW -0.203279109 0.39615765 0.3676688039 0.31683168 0.20792079 0.71602840 1.0849741
PNP -0.193742665 -0.01383552 0.2929899519 0.09900990 0.26732673 0.44715528 1.1181347
PHT 0.354595673 1.85781505 0.0015081575 0.00990099 0.00990099 0.69483052 0.3886010
PLEN 0.056966782 -0.35928894 0.3711271166 0.49504950 0.28712871 0.42095790 1.2010363In the final step, the successfully transformed and normalized dataset is exported to an external file. Using the writexl package, the data is saved as an Excel spreadsheet (sativapheno_normalized.xlsx), making it ready for use in subsequent downstream statistical analyses.
writexl::write_xlsx(phenonormal, path = "sativapheno_normalized.xlsx") options(oldopts)