Functions by Topic

Allison C Fialkowski

2018-06-28

The following gives a description of SimMultiCorrData’s functions by topic. The user should visit the appropriate help page for more information.

Simulation Functions:

  1. nonnormvar1 simulates one non-normal continuous variable using either Fleishman’s (1978) third-order (method = “Fleishman”) or Headrick’s (2002) fifth-order (method = “Polynomial”) approximation. See the Comparison of Simulated Distribution to Theoretical Distribution or Empirical Data vignette for an example.

  2. rcorrvar simulates k_cat ordinal (\(\Large r \ge 2\) categories), k_cont continuous, k_pois Poisson, and/or k_nb Negative Binomial variables with a specified correlation matrix rho using Correlation Method 1. The variables are generated from multivariate normal variables with intermediate correlation matrix Sigma, calculated by findintercorr, and then transformed appropriately. The ordering of the variables in rho must be ordinal, continuous, Poisson, and Negative Binomial (note that it is possible for k_cat, k_cont, k_pois, and/or k_nb to be 0).

  3. rcorrvar2 simulates k_cat ordinal (\(\Large r \ge 2\) categories), k_cont continuous, k_pois Poisson, and/or k_nb Negative Binomial variables with a specified correlation matrix rho using Correlation Method 2. The variables are generated from multivariate normal variables with intermediate correlation matrix Sigma, calculated by findintercorr2, and then transformed appropriately. The ordering of the variables in rho must be ordinal, continuous, Poisson, and Negative Binomial (note that it is possible for k_cat, k_cont, k_pois, and/or k_nb to be 0).

Please see the Comparison of Correlation Method 1 and Correlation Method 2 vignette for more information about the two different simulation pathways.
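The pathway shared by rcorrvar and rcorrvar2 (correlated normal variables first, marginal transformations second) can be sketched for one continuous and one ordinal variable. This Python sketch is only an illustration and not the package's code. The intermediate correlation is supplied directly rather than calculated, the constants are approximate third-order values for skewness 2 and standardized kurtosis 6, and the cumulative probabilities and every function name are hypothetical.

```python
import math
import random

def correlated_normal_pairs(n, sigma12, rng):
    """Draw n pairs of standard normals with correlation sigma12 (2x2 Cholesky)."""
    scale = math.sqrt(1.0 - sigma12 ** 2)
    pairs = []
    for _ in range(n):
        e1 = rng.gauss(0.0, 1.0)
        e2 = rng.gauss(0.0, 1.0)
        pairs.append((e1, sigma12 * e1 + scale * e2))
    return pairs

def third_order(z, consts):
    """Third-order polynomial transformation y = c0 + c1*z + c2*z^2 + c3*z^3."""
    c0, c1, c2, c3 = consts
    return c0 + c1 * z + c2 * z ** 2 + c3 * z ** 3

def to_ordinal(z, cum_probs):
    """Cut a standard normal at the given cumulative probabilities (categories 1..r)."""
    u = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal cdf
    return 1 + sum(1 for p in cum_probs if u > p)

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(n, sigma12, consts, cum_probs, seed=1234):
    """Generate one continuous and one ordinal variable from correlated normals."""
    rng = random.Random(seed)
    pairs = correlated_normal_pairs(n, sigma12, rng)
    y_cont = [third_order(z1, consts) for z1, _ in pairs]
    y_ord = [to_ordinal(z2, cum_probs) for _, z2 in pairs]
    return pairs, y_cont, y_ord

# Assumed inputs, chosen for illustration only.
pairs, y_cont, y_ord = simulate(
    20000, 0.6, (-0.3137, 0.8263, 0.3137, 0.0227), [0.3, 0.7]
)
print("correlation after transformation:", round(pearson(y_cont, y_ord), 3))
```

The correlation after transformation is smaller than the correlation of the underlying normal variables, which is why an intermediate correlation matrix has to be calculated before simulation.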

Power Method Constants Functions:

  1. find_constants calculates the constants used to generate continuous variables via either Fleishman’s third-order (using fleish equations) or Headrick’s fifth-order (using poly equations) polynomial transformation. It attempts to find constants that generate a valid power method pdf. If Headrick’s method is used, no solutions converge or no valid pdf solutions are found, and a vector of sixth cumulant correction values (Six) is provided, the function attempts to find the smallest correction value that generates a valid power method pdf. If no valid pdf solution is found, constants that yield an invalid pdf are returned.

  2. fleish contains Fleishman’s third-order polynomial transformation equations.

  3. poly contains Headrick’s fifth-order polynomial transformation equations.
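As an illustration of the kind of equations a solver such as find_constants works with, the sketch below writes Fleishman's (1978) three moment equations as residuals that are all zero at a solution. It is a Python sketch under stated assumptions, not the package's fleish function: the name fleishman_residuals and the argument order are hypothetical, and the constant c0 = -c2 is implied rather than solved for.

```python
def fleishman_residuals(c1, c2, c3, skew, skurt):
    """Residuals of Fleishman's equations for y = -c2 + c1*z + c2*z^2 + c3*z^3.

    A solution gives unit variance, the target skewness, and the target
    standardized (excess) kurtosis, so all three residuals are zero.
    """
    var_eq = c1 ** 2 + 6 * c1 * c3 + 2 * c2 ** 2 + 15 * c3 ** 2 - 1
    skew_eq = 2 * c2 * (c1 ** 2 + 24 * c1 * c3 + 105 * c3 ** 2 + 2) - skew
    kurt_eq = 24 * (
        c1 * c3
        + c2 ** 2 * (1 + c1 ** 2 + 28 * c1 * c3)
        + c3 ** 2 * (12 + 48 * c1 * c3 + 141 * c2 ** 2 + 225 * c3 ** 2)
    ) - skurt
    return var_eq, skew_eq, kurt_eq

# The standard normal case (c1 = 1, c2 = c3 = 0) solves the equations exactly.
print(fleishman_residuals(1.0, 0.0, 0.0, skew=0.0, skurt=0.0))

# Approximate constants (4 decimals) for skewness 2 and standardized kurtosis 6;
# the residuals are small but not exactly zero because of rounding.
print(fleishman_residuals(0.8263, 0.3137, 0.0227, skew=2.0, skurt=6.0))
```

A root finder applied to these residuals yields candidate constants; whether they also give a valid power method pdf is a separate check.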

Data Description (Summary) Functions:

  1. calc_fisherk uses Fisher’s k-statistics to calculate the mean, standard deviation, skewness, standardized kurtosis, and standardized fifth and sixth cumulants given a vector of data.

  2. calc_moments uses the method of moments to calculate the mean, standard deviation, skewness, standardized kurtosis, and standardized fifth and sixth cumulants given a vector of data.

  3. calc_theory calculates the mean, standard deviation, skewness, standardized kurtosis, and standardized fifth and sixth cumulants given either a distribution name with up to 4 associated parameters or a pdf function fx with lower and upper support bounds. There are 39 available distributions by name. Please see the appropriate help pages for information regarding parameter inputs (in the VGAM (Yee 2018), triangle (Carnell 2016), or stats (R Core Team 2018) packages).

  4. cdf_prob calculates a cumulative probability using the theoretical power method cdf \(\Large F_{p(Z)}(p(z)) = F_{p(Z)}(p(z), F_Z(z))\) up to \(\Large \sigma y + \mu = \delta\), where \(\Large y = p(z)\), after using pdf_check to verify that the given constants produce a valid pdf. If the given constants do not produce a valid power method pdf, a warning is given.

  5. power_norm_corr calculates the correlation between a continuous variable produced using a polynomial transformation and the generating standard normal variable. If the correlation is <= 0, the signs of c1 and c3 (for method = “Fleishman”) or of c1, c3, and c5 (for method = “Polynomial”) should be reversed. These sign changes have no effect on the cumulants of the resulting distribution.

  6. pdf_check determines if a given set of constants generates a valid power method pdf. A valid pdf requires that the resulting continuous variable has a positive correlation with the generating standard normal variable and that the constants satisfy certain constraints that vary by approximation method (see Headrick and Kowalchuk (2007)).

  7. sim_cdf_prob calculates the simulated (empirical) cumulative probability up to a given y-value (delta). It uses Martin Maechler’s stats::ecdf function to find the empirical cdf \(\Large F_n\). \(\Large F_n\) is a step function with jumps \(\Large i/n\) at observation values, where \(\Large i\) is the number of tied observations at that value. Missing values are ignored. For observations \(\Large y = (y_1, y_2, ..., y_n)\), \(\Large F_n(t)\) is the fraction of observations less than or equal to \(\Large t\), i.e., \(\Large F_n(t) = \#[y_i \le t]/n\).

  8. stats_pdf calculates the \(\Large 100\alpha\%\) symmetric trimmed mean (\(\Large 0 < \alpha < 0.50\)), median, mode, and maximum height of a valid power method pdf using the equations given by Headrick and Kowalchuk (2007).
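Two of the summaries above reduce to short formulas: the method of moments quantities and the empirical cumulative probability \(\Large F_n(t) = \#[y_i \le t]/n\). The Python sketch below illustrates both. It is not the package's calc_moments or sim_cdf_prob: the function names are hypothetical, and dividing the central moments by n (rather than n - 1) is an assumption of this sketch.

```python
def moment_summary(x):
    """Mean, sd, skewness, and standardized kurtosis, fifth and sixth cumulants."""
    n = len(x)
    mean = sum(x) / n

    def central(k):
        # k-th central sample moment with divisor n (assumption of this sketch)
        return sum((v - mean) ** k for v in x) / n

    m2 = central(2)
    skew = central(3) / m2 ** 1.5
    skurt = central(4) / m2 ** 2 - 3
    fifth = central(5) / m2 ** 2.5 - 10 * skew
    sixth = central(6) / m2 ** 3 - 15 * skurt - 10 * skew ** 2 - 15
    return {"mean": mean, "sd": m2 ** 0.5, "skew": skew,
            "skurt": skurt, "fifth": fifth, "sixth": sixth}

def ecdf_prob(y, delta):
    """Fraction of non-missing observations less than or equal to delta."""
    obs = [v for v in y if v is not None and v == v]  # drop None and NaN
    return sum(1 for v in obs if v <= delta) / len(obs)

print(moment_summary([1, 2, 3, 4, 5]))
print(ecdf_prob([1, 2, 2, 3, float("nan")], 2))
```

Ties are handled automatically: the two observations equal to 2 each contribute 1/n to the step at that value.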

Lower Kurtosis Boundary Functions:

  1. calc_lower_skurt determines the lower standardized kurtosis boundary for a continuous variable generated using the power method transformation. This boundary depends on skewness (for Fleishman’s third-order method, see Headrick and Sawilowsky (2002)) or skewness and standardized fifth and sixth cumulants (for Headrick’s fifth-order method, see Headrick (2002)).

  2. fleish_Hessian calculates the Fleishman transformation Hessian matrix and its determinant, which are used in finding the lower kurtosis boundary for asymmetric distributions.

  3. fleish_skurt_check contains the Fleishman transformation Lagrangean constraints which are used in finding the lower kurtosis boundary for asymmetric distributions.

  4. poly_skurt_check contains the Headrick transformation Lagrangean constraints which are used in finding the lower kurtosis boundary.

Correlation Validation Functions:

valid_corr (correlation method 1) and valid_corr2 (correlation method 2) determine the feasible correlation bounds for ordinal, continuous, Poisson, and/or Negative Binomial variables. If a target correlation matrix rho is specified, the functions check each pairwise correlation to see if it falls within the bounds. The indices of any variable pair with a target correlation that is outside the bounds are given. If continuous variables are required, the functions return the calculated constants, the required sixth cumulant correction (if a Six vector of possible values was given), and whether each set of constants generates a valid power method pdf.
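The idea of feasible correlation bounds for a pair of marginal distributions can be illustrated with a generic sorting approximation: pairing two sorted samples gives an approximate upper bound, and pairing one sorted sample with the other in reverse order gives an approximate lower bound. The text above does not state how valid_corr computes its bounds, so this Python sketch is an assumption for illustration only; the marginals (an exponential variable and a two-category ordinal variable) and all names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_corr_bounds(sample_x, sample_y):
    """Approximate (lower, upper) correlation bounds by sorting two samples."""
    xs = sorted(sample_x)
    ys = sorted(sample_y)
    upper = pearson(xs, ys)        # both ascending: strongest positive pairing
    lower = pearson(xs, ys[::-1])  # opposite orders: strongest negative pairing
    return lower, upper

def outside_bounds(target, lower, upper):
    """True if a target pairwise correlation is not attainable."""
    return not (lower <= target <= upper)

rng = random.Random(42)
n = 10000
skewed = [rng.expovariate(1.0) for _ in range(n)]
ordinal = [1 if rng.random() < 0.2 else 2 for _ in range(n)]

lower, upper = approx_corr_bounds(skewed, ordinal)
print("approximate bounds:", round(lower, 3), round(upper, 3))
print("target 0.9 outside bounds:", outside_bounds(0.9, lower, upper))
```

The bounds are generally narrower than [-1, 1] and need not be symmetric, so a target correlation that looks reasonable may still be infeasible for a given pair of marginals.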

Intermediate Correlation Functions:

findintercorr (correlation method 1) and findintercorr2 (correlation method 2) are the two main intermediate correlation calculation functions. These functions call the other functions:

  1. chat_nb calculates the upper Frechet-Hoeffding correlation bound for Negative Binomial - Normal variable pairs used to determine the intermediate correlation for Negative Binomial - Continuous variable pairs in method 1.

  2. chat_pois calculates the upper Frechet-Hoeffding correlation bound for Poisson - Normal variable pairs used to determine the intermediate correlation for Poisson - Continuous variable pairs in method 1.

  3. denom_corr_cat is used in intermediate correlation calculations involving ordinal variables (or variables treated as ordinal, as in method 2).

  4. findintercorr_cat_nb calculates the intermediate correlation for ordinal - Negative Binomial variables in method 1.

  5. findintercorr_cat_pois calculates the intermediate correlation for ordinal - Poisson variables in method 1.

  6. findintercorr_cont calculates the intermediate correlation for continuous variables based on either Fleishman’s third-order or Headrick’s fifth-order approximation.

  7. findintercorr_cont_cat calculates the intermediate correlation for continuous - ordinal variables.

  8. findintercorr_cont_nb and findintercorr_cont_nb2 calculate the intermediate correlations for continuous - Negative Binomial variables in method 1 or 2 (respectively).

  9. findintercorr_cont_pois and findintercorr_cont_pois2 calculate the intermediate correlation for continuous - Poisson variables in method 1 or 2 (respectively).

  10. findintercorr_nb calculates the intermediate correlation for Negative Binomial variables in method 1.

  11. findintercorr_pois calculates the intermediate correlation for Poisson variables in method 1.

  12. findintercorr_pois_nb calculates the intermediate correlation for Poisson - Negative Binomial variables in method 1.

  13. intercorr_fleish contains Fleishman’s third-order polynomial transformation intercorrelation equations.

  14. intercorr_poly contains Headrick’s fifth-order polynomial transformation intercorrelation equations.

  15. max_count_support calculates the maximum support value for count variables by extending the method of Barbiero and Ferrari (2015) to include Negative Binomial variables. It is used in method 2.

  16. ordnorm calculates the intermediate correlation for ordinal variables or variables treated as ordinal (as in method 2). It is based on GenOrd::ordcont with some important corrections.

  17. var_cat is used in intermediate correlation calculations involving ordinal variables (or variables treated as ordinal, as in method 2) to calculate the variance.
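For a pair of continuous variables generated by the third-order transformation, the correlation after transformation is a cubic polynomial in the correlation of the underlying normal variables, so the intermediate correlation is the root of that cubic that matches the target. The Python sketch below illustrates this with bisection. It is not the package's intercorr_fleish or findintercorr_cont: the function names are hypothetical, the solver choice is an assumption, and the bisection assumes the cubic is increasing on [-1, 1] for the constants supplied.

```python
def fleishman_intercorr(rho_z, a, b):
    """Correlation of two third-order power method variables.

    rho_z is the correlation of the generating standard normals; a and b are
    the constants (c1, c2, c3) of each variable, with c0 = -c2 implied.
    """
    a1, a2, a3 = a
    b1, b2, b3 = b
    return (rho_z * (a1 * b1 + 3 * a1 * b3 + 3 * a3 * b1 + 9 * a3 * b3)
            + rho_z ** 2 * (2 * a2 * b2)
            + rho_z ** 3 * (6 * a3 * b3))

def solve_intermediate(target, a, b, tol=1e-10):
    """Find rho_z in [-1, 1] whose transformed correlation equals target."""
    lo, hi = -1.0, 1.0
    f_lo = fleishman_intercorr(lo, a, b)
    f_hi = fleishman_intercorr(hi, a, b)
    if not (f_lo <= target <= f_hi):
        raise ValueError("target correlation is not attainable for these constants")
    # Bisection; assumes the cubic is increasing in rho_z on [-1, 1].
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fleishman_intercorr(mid, a, b) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Approximate constants for skewness 2 and standardized kurtosis 6 (illustrative).
skewed = (0.8263, 0.3137, 0.0227)
rho_z = solve_intermediate(0.5, skewed, skewed)
print("intermediate correlation for target 0.5:", round(rho_z, 4))
```

For these constants the intermediate correlation is larger than the target, because the transformation attenuates positive correlations.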

Error Loop Functions:

  1. error_loop is the main error loop function called by rcorrvar or rcorrvar2.

  2. error_vars is used to generate variable pairs within the error loop.

Graphing Functions:

The 8 graphing functions take as input either simulated data or a set of constants (found by find_constants or from simulation). In the first case, the empirical cdf or pdf is found. In the second case, the theoretical cdf or pdf is found using the equations from Headrick and Kowalchuk (2007). The functions that take constants (plot_cdf, plot_pdf_ext, plot_pdf_theory) work only for continuous variable inputs. The other graphing functions work for continuous or count variable inputs.

The graphs display data values, pdfs, or cdfs. For cdfs of continuous variables, the cumulative probability up to a given y-value (delta) can be calculated and displayed on the graph (using cdf_prob for a set of constants or sim_cdf_prob for a vector of simulated data). The empirical cdf can also be graphed for ordinal data. For pdfs or actual data values, the target distribution can be overlaid on the graph. This target distribution can be an empirical data set, a distribution specified by name (Dist plus up to 4 parameters), or a user-supplied pdf fx with support bounds. See plot_sim_pdf_theory for names of Dist inputs.

The graphing functions work for invalid or valid power method pdfs. They return ggplot2 objects, so designated graphing parameters (e.g., line color and type, title) can be specified by the user and the results can be further modified as necessary.

  1. plot_cdf plots the theoretical power method cumulative distribution function \(\Large F_{p(Z)}(p(z)) = F_{p(Z)}(p(z), F_Z(z))\), given a set of constants. If calc_cprob = TRUE, it will also calculate the cumulative probability up to a user-specified delta value, where \(\Large \sigma y + \mu = \delta\) and \(\Large y = p(z)\).

  2. plot_sim_cdf plots the empirical cdf \(\Large F_n\) of simulated continuous, ordinal, or count data (see ggplot2::stat_ecdf). If calc_cprob = TRUE and the variable is continuous, the cumulative probability up to a user-specified y-value (delta) is calculated (see sim_cdf_prob) and the region on the plot is filled, with a dashed horizontal line drawn at \(\Large F_n(\delta)\).

  3. plot_pdf_ext plots the theoretical probability density function \(\Large f_{p(Z)}(p(z)) = f_{p(Z)}(p(z), f_Z(z)/p'(z))\), given a set of constants, and the target pdf calculated from a vector of external data. Unlike in plot_pdf_theory, the vector of external data is required. If the user wants to plot only the theoretical pdf, plot_pdf_theory should be used with overlay = FALSE.

  4. plot_pdf_theory plots the theoretical probability density function \(\Large f_{p(Z)}(p(z)) = f_{p(Z)}(p(z), f_Z(z)/p'(z))\), given a set of constants, and the target pdf (if overlay = TRUE), given either a continuous distribution name and parameters or a user-supplied pdf fx (bounds set equal to bounds of simulated data).

  5. plot_sim_ext plots simulated continuous or count data and overlays external data (both as histograms). Unlike in plot_sim_theory, the vector of external data is required. If the user wants to plot only the simulated data, plot_sim_theory should be used with overlay = FALSE.

  6. plot_sim_pdf_ext plots the pdf of simulated continuous or count data and overlays the target pdf computed from a vector of external data. Unlike in plot_sim_pdf_theory, the vector of external data is required. If the user wants to plot only the pdf of simulated data, plot_sim_pdf_theory should be used with overlay = FALSE.

  7. plot_sim_pdf_theory plots the pdf of simulated continuous or count data and overlays the target pdf (if overlay = TRUE) specified by distribution name and parameters or by pdf fx (bounds set equal to bounds of simulated data).

  8. plot_sim_theory plots simulated continuous or count data and overlays data (if overlay = TRUE) randomly generated from a target distribution specified by name and parameters or by pdf fx (bounds set equal to bounds of simulated data). Both distributions are plotted as histograms. If the target distribution is specified by a function fx, it must be continuous.
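The theoretical curves drawn from a set of constants are parametric in z: each z gives the point p(z) on the horizontal axis, with density \(\Large f_Z(z)/p'(z)\) and cumulative probability \(\Large F_Z(z)\). The Python sketch below computes such points for the third-order polynomial. It is an illustration only and not the package's plotting code: the function name and arguments are hypothetical, the rescaling by mu and sigma is written out as an assumption, and, unlike the package functions described above, it stops with an error when the constants give an invalid pdf.

```python
import math

def power_method_curve(c, mu=0.0, sigma=1.0, z_min=-4.0, z_max=4.0, steps=200):
    """Return (x, pdf, cdf) points for x = sigma * p(z) + mu, third-order p."""
    c0, c1, c2, c3 = c
    points = []
    for i in range(steps + 1):
        z = z_min + (z_max - z_min) * i / steps
        p = c0 + c1 * z + c2 * z ** 2 + c3 * z ** 3
        dp = c1 + 2 * c2 * z + 3 * c3 * z ** 2      # p'(z)
        if dp <= 0:
            # p must be strictly increasing for the pdf to be valid
            raise ValueError("p'(z) <= 0: constants give an invalid pdf on this grid")
        phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # f_Z(z)
        Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # F_Z(z)
        points.append((sigma * p + mu, phi / (sigma * dp), Phi))
    return points

# With c = (0, 1, 0, 0) the curve is the standard normal pdf and cdf.
x, f, F = power_method_curve((0.0, 1.0, 0.0, 0.0))[100]
print(x, round(f, 4), F)
```

The cumulative probability up to a value delta is then the cdf coordinate of the point whose x coordinate equals delta, which is the quantity described for cdf_prob and plot_cdf.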

Additional Helper Functions:

These would not ordinarily be called by the user.

  1. calc_final_corr calculates the final correlation matrix.

  2. separate_rho separates a target correlation matrix by variable type.

References

Barbiero, A, and P A Ferrari. 2015. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. doi:10.1002/asmb.2072.

Carnell, Rob. 2016. Triangle: Provides the Standard Distribution Functions for the Triangle Distribution. https://CRAN.R-project.org/package=triangle.

Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. doi:10.1007/BF02293811.

Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. doi:10.1016/S0167-9473(02)00072-5.

Headrick, T C, and R K Kowalchuk. 2007. “The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data.” Journal of Statistical Computation and Simulation 77: 229–49. doi:10.1080/10629360600605065.

Headrick, T C, and S S Sawilowsky. 2002. “Weighted Simplex Procedures for Determining Boundary Points and Constants for the Univariate and Multivariate Power Methods.” Journal of Educational and Behavioral Statistics 25: 417–36. doi:10.3102/10769986025004417.

R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Yee, T W. 2018. VGAM: Vector Generalized Linear and Additive Models. https://CRAN.R-project.org/package=VGAM.