2 Normalizing Data from an Excel File

2.1 Load Required Packages for Excel I/O

To handle Excel file formats, the readxl package is used for importing data and writexl for exporting results. The following code chunk checks for their presence and installs them if necessary before loading.

if (!requireNamespace("readxl", quietly = TRUE)) install.packages("readxl")
if (!requireNamespace("writexl", quietly = TRUE)) install.packages("writexl")

library(readxl)
library(writexl)

2.2 Read Excel File

The target dataset (sativapheno.xlsx) is imported from the package’s internal extdata directory. The code then subsets specific columns of interest and applies a filter to retain only numeric variables, ensuring the data is compatible with the normalization algorithms.

file_path <- system.file("extdata", "sativapheno.xlsx", package = "osktnorm")
pheno <- read_excel(file_path)

# Select some variables from the pheno
phenoset <- as.data.frame(pheno[, c(4,5,6,7,9,10,12,13,14)])
phenoset <- phenoset[, sapply(phenoset, is.numeric)]

head(phenoset)
    FTB       RTA       RTF CULM      FLL       FLW      PNP      PHT     PLEN
  1  81 0.9269547 0.7901235  4.0 28.37500 1.2833333 3.068053 110.9167 20.48182
  2  83 1.0783133 0.7951807  7.5 39.00833 1.0000000 4.051785 143.5000 26.83333
  3 108 0.8101852 0.6481481  3.5 30.41667 0.8916667 3.697178 153.7500 28.96364
  4 101 0.8820132 0.7227723  6.0 36.90833 1.7500000 2.857428 148.3333 30.91667
  5  81 0.9012346 0.9259259  6.5 24.42500 1.2125000 2.924505 128.8750 20.20000
  6 124 0.8342294 0.6612903  3.0 36.38889 1.1666667 2.899588 112.4444 21.40000

2.3 Visualization of Variables

This section focuses on evaluating the distributional characteristics of the raw phenotypic data before any transformation. Visual inspection is a critical first step in determining the degree of skewness and kurtosis present in each trait.

2.3.1 Boxplots

A grid of histograms is generated to visualize the distribution of each variable. For each plot, the actual data distribution (black line) is compared against a theoretical normal distribution curve (red dashed line). This comparison highlights which traits deviate significantly from normality.

oldpar <- par(no.readonly = TRUE)
nvar <- ncol(phenoset)
nrow_plot <- ceiling(sqrt(nvar))
ncol_plot <- ceiling(nvar / nrow_plot)

par(mfrow = c(nrow_plot, ncol_plot), mar = c(4, 4, 3, 1))
cols <- rainbow(nvar)

for(i in seq_len(nvar)) {
  x <- phenoset[[i]]
  x <- x[!is.na(x)]
  
  if(length(x) == 0) next
  h_calc <- hist(x, plot = FALSE) 
  d <- density(x)   
  max_y <- max(c(h_calc$density, d$y)) * 1.1
  hist(x, 
       breaks = 15, 
       freq = FALSE, 
       col = adjustcolor(cols[i], 0.5), 
       border = "white",
       main = colnames(phenoset)[i],
       xlab = "Transformed Value", 
       ylab = "Density",
       ylim = c(0, max_y))
  
  lines(d, col = "black", lwd = 2)
  curve(dnorm(x, mean = mean(x), sd = sd(x)), add = TRUE, col = "red", lty = 2, lwd = 2)
}

par(oldpar)

2.3.2 Normality Tests for Original Values

To provide statistical evidence of non-normality, the Shapiro-Wilk test is applied to each trait. A data frame is created to summarize the W statistic and p-values. Traits with a p-value <0.05 are flagged as “NO” (not normal), justifying the need for OSKT normalization.

shapiro_results <- data.frame(
  Trait = colnames(phenoset),
  W     = NA_real_,
  pval  = NA_real_,
  IsNormal = NA_character_,
  stringsAsFactors = FALSE
)

for(i in seq_len(ncol(phenoset))) {
  x <- phenoset[[i]]
  x <- x[!is.na(x)]
  
  if(length(x) >= 3) {
    test <- shapiro.test(x)
    shapiro_results$W[i] <- test$statistic
    shapiro_results$pval[i] <- test$p.value
    shapiro_results$IsNormal[i] <- ifelse(test$p.value > 0.05, "YES", "NO")
  } else {
    shapiro_results$W[i] <- NA
    shapiro_results$pval[i] <- NA
    shapiro_results$IsNormal[i] <- NA
  }
}
shapiro_results
    Trait         W         pval IsNormal
  1   FTB 0.9788356 5.092575e-03       NO
  2   RTA 0.9334772 9.979590e-08       NO
  3   RTF 0.9888670 1.369634e-01      YES
  4  CULM 0.9477705 1.707394e-06       NO
  5   FLL 0.9816832 1.262374e-02       NO
  6   FLW 0.9879264 1.002137e-01      YES
  7   PNP 0.9836742 2.422959e-02       NO
  8   PHT 0.9743236 1.292045e-03       NO
  9  PLEN 0.9887964 1.338064e-01      YES

2.4 Normalization with OSKT

The core normalization process is executed using the osktnorm function. By setting normtests = FALSE, the function focuses on calculating the optimal transformation parameters and generating the normalized data frame (phenonormal), where each variable is adjusted to achieve a distribution as close to normal as possible.


# Normalization only 

norm_res1 <- osktnorm(phenoset, normtests = FALSE)
phenonormal <- norm_res1$normalized
head(phenonormal)
           FTB         RTA         RTF        CULM         FLL        FLW        PNP        PHT       PLEN
  1 -0.6614016  0.14436235  0.05676644 -0.09376608 -0.39908842  0.1552799 -0.3379437 -0.2328676 -0.9937679
  2 -0.5690568  1.03287842  0.08838504  1.53027503  1.62682225 -1.0849767  2.1548686  1.2956039  0.8530834
  3  0.4976650 -0.61989787 -0.87304919 -0.41355860  0.00958448 -1.6948815  1.1690496  1.8150743  1.4679050
  4  0.2146980 -0.14122053 -0.37401838  0.89300452  1.20340711  2.0926295 -0.9257250  1.5323389  2.0615666
  5 -0.6614016 -0.01779863  0.87197474  1.10524249 -1.36552951 -0.1256009 -0.7279871  0.6251948 -1.0826167
  6  1.1021147 -0.45653796 -0.78346201 -0.78348175  1.10451288 -0.3146540 -0.8001148 -0.1534829 -0.7108529

2.5 Visualizing the Normalized Values

Following the transformation, the data is re-visualized to confirm the effectiveness of the OSKT method. This allows for a direct comparison between the initial skewed distributions and the newly normalized results.

2.5.1 Boxplots

Boxplots are utilized to examine the center, spread, and outliers of the normalized traits. This visualization helps verify that the variables now share a similar scale and that the influence of extreme values (outliers) has been effectively stabilized across the entire dataset.

oldpar <- par(no.readonly = TRUE)
par(mar = c(8,4,4,2))
boxplot(
  phenonormal,
  col = rainbow(ncol(phenonormal)),
  border = "black",
  outline = TRUE,
  las = 2,
  main = "OSKT Normalized Values",
  ylab = "Values",
  cex.axis = 0.8
)

par(oldpar)

2.5.2 Histograms and Density Plots

After transformation, a new grid of histograms is generated for the phenonormal dataset. This visualization allows for an empirical evaluation of the OSKT method, confirming that the distributions now exhibit improved symmetry and follow a bell-shaped curve more closely compared to the raw data.

oldpar <- par(no.readonly = TRUE)
nvar <- ncol(phenonormal)
nrow_plot <- ceiling(sqrt(nvar))
ncol_plot <- ceiling(nvar / nrow_plot)

par(mfrow = c(nrow_plot, ncol_plot), mar = c(4,4,3,1))
cols <- rainbow(nvar)

for(i in seq_len(nvar)) {
  x <- phenonormal[[i]]
  x <- x[!is.na(x)]
  
  if(length(x) == 0) next
  
  d <- density(x)
  h_calc <- hist(x, plot = FALSE) 
  
  max_y <- max(d$y, h_calc$density) * 1.1
  
  hist(x, freq = FALSE, 
       col = adjustcolor(cols[i], 0.6),
       border = "white",
       main = colnames(phenonormal)[i],
       xlab = "", ylab = "Density",
       ylim = c(0, max_y))
  
  lines(d, col = "black", lwd = 2)
}

par(oldpar)

2.6 Optimization Parameters

This section extracts the internal parameters estimated during the transformation process. The table provides the optimized values for g (skewness parameter) and h (kurtosis parameter), along with the final A^2 value, which represents the minimized objective function for each trait.

norm_res2 <- osktnorm(phenoset, normtests = "all")

phenonormal <- norm_res2$normalized
paramstable <- norm_res2$parameters
paramstable
    Trait           g          h        A2
  1   FTB -0.13899749 0.00000000 0.6090078
  2   RTA -0.12797772 0.00000000 1.6800176
  3   RTF -0.09953805 0.00000000 0.3356170
  4  CULM -0.42127828 0.08195892 1.0143184
  5   FLL -0.19870891 0.10456733 0.1649252
  6   FLW -0.14575407 0.09071026 0.2707898
  7   PNP -0.15577358 0.09627427 0.3650047
  8   PHT -0.23855271 0.09861026 0.3858293
  9  PLEN -0.05051491 0.03224488 0.3999539

2.7 Normality Test Results

A comprehensive suite of statistical diagnostics is performed on the normalized data. By requesting normtests = “all”, the function computes multiple metrics including skewness, kurtosis, and various p-values (Shapiro-Wilk, Anderson-Darling, etc.). This table serves as the final validation that the traits have successfully achieved or approached normality.

testtable <- norm_res2$normtests
testtable
               SKEW        KURT           SW         ZA         ZC        CVM       PPM
  FTB   0.003506758  0.15576872 0.1662369210 0.34653465 0.26732673 0.19078092 0.8694301
  RTA   0.208256466  1.61948818 0.0003716262 0.01980198 0.00990099 0.00039996 4.0196891
  RTF  -0.016418109  0.28318917 0.7596107115 0.85148515 0.91089109 0.45815418 1.3005181
  CULM -0.125799038 -0.08714985 0.0303000687 0.04950495 0.04950495 0.01059894 4.3512953
  FLL  -0.044948116 -0.21528412 0.8842530426 0.77227723 0.74257426 0.92770723 0.6538860
  FLW  -0.203279109  0.39615765 0.3676688039 0.31683168 0.20792079 0.71602840 1.0849741
  PNP  -0.193742665 -0.01383552 0.2929899519 0.09900990 0.26732673 0.44715528 1.1181347
  PHT   0.354595673  1.85781505 0.0015081575 0.00990099 0.00990099 0.69483052 0.3886010
  PLEN  0.056966782 -0.35928894 0.3711271166 0.49504950 0.28712871 0.42095790 1.2010363

2.8 Saving the Normalized Data

In the final step, the successfully transformed and normalized dataset is exported to an external file. Using the writexl package, the data is saved as an Excel spreadsheet (sativapheno_normalized.xlsx), making it ready for use in subsequent downstream statistical analyses.

writexl::write_xlsx(phenonormal, path = "sativapheno_normalized.xlsx")

 options(oldopts)

A Practical Example for Working with Datasets

Zeynel Cebeci

2026-2-14

1 Introduction