Data Transformation Techniques

Data transformations can help correct skewness and stabilize variance. MVN supports simple transforms (log, square root, square) via the transform argument, as well as optimal Box–Cox transformation. Additionally, MVN provides support for a broader class of power transformations through the power_family argument. These include the standard Box–Cox, its extension that accommodates non-positive values (Box–Cox with negatives allowed), and the Yeo–Johnson transformation. Each method automatically estimates optimal lambda values for transformation and can be used in conjunction with multivariate normality tests to evaluate improvements in distributional properties.


Example Data and Baseline Assessment

# Load package and example data
library(MVN)
df <- iris[1:50, 1:4]

# Baseline Henze–Zirkler test without transformation
result <- mvn(data = df, mvn_test = "hz")
summary(result, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method          MVN
1 Henze-Zirkler     0.949    0.05 asymptotic ✗ Not normal

The untransformed data are marginally on the threshold of normality (p = 0.05), indicating potential departure from multivariate normality.

1. Basic Transformations

Basic transformations can be used to address issues like skewness and heteroscedasticity before assessing multivariate normality. The mvn() function allows simple transformations using the transform argument. Supported options include natural logarithm (“log”), square root (“sqrt”), and squaring the values (“square”). These methods are quick to apply and can help improve the suitability of data for multivariate analysis, although their effectiveness may vary depending on the distribution of each variable.

1.1 Log Transformation

The log transformation applies the natural logarithm to all variables, which is especially useful when dealing with right-skewed data.

res_log <- mvn(data = df, mvn_test = "hz", transform = "log")
summary(res_log, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method      MVN
1 Henze-Zirkler     0.789   0.379 asymptotic ✓ Normal

In this case, applying the log transformation yields a p-value of 0.379 from the Henze-Zirkler test, indicating that the transformed data conform well to multivariate normality.

1.2 Square-root Transformation

The square-root transformation offers a less aggressive correction than the log and is suitable for moderately skewed data.

res_sqrt <- mvn(data = df, mvn_test = "hz", transform = "sqrt")
summary(res_sqrt, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method      MVN
1 Henze-Zirkler     0.829   0.253 asymptotic ✓ Normal

After applying this transformation, the p-value is 0.253, which still supports the assumption of multivariate normality. This method retains more of the original data structure while improving distributional properties.

1.3 Square Transformation

The square transformation raises each value to the power of two. While mathematically valid, this operation tends to amplify skewness and outliers.

res_sq <- mvn(data = df, mvn_test = "hz", transform = "square")
summary(res_sq, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method          MVN
1 Henze-Zirkler     1.278  <0.001 asymptotic ✗ Not normal

In the given example, squaring the data results in a p-value less than 0.001, indicating strong deviation from normality. As such, squaring is generally not advisable when the goal is to achieve or maintain multivariate normality.


2. Power Transformation

Power transformations are commonly applied in multivariate normality testing to stabilize variance and make the data more normally distributed. The mvn() function supports various power transformation families, allowing automatic selection of optimal lambda values to transform each variable accordingly. Below are three supported methods—Box-Cox, Box-Cox with negatives allowed, and Yeo-Johnson—each applied with Henze-Zirkler’s test to assess multivariate normality.

2.1. Box-Cox Transformation

The Box–Cox transformation is applied first. It estimates the optimal lambda values to transform the variables, assuming all data are strictly positive.

res_bc_sum <- mvn(
  data              = df,
  mvn_test          = "hz",
  power_family = "bcPower"
)
summary(res_bc_sum, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method      MVN
1 Henze-Zirkler     0.803   0.332 asymptotic ✓ Normal

After transformation, Henze-Zirkler’s test yields a p-value of 0.332, suggesting no evidence against multivariate normality.

The optimal lambda values used in the transformation can be accessed for further inspection.

res_bc_sum$power_transform_lambda
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
  0.41655582   1.27294127   0.72864401   0.02443928 

2.2. Box-Cox with Negatives Allowed

The Box–Cox with negatives allowed transformation extends the original Box-Cox method by enabling transformations on variables that may include non-positive values.

res_bc_sum <- mvn(
  data              = df,
  mvn_test          = "hz",
  power_family = "bcnPower"
)
summary(res_bc_sum, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method      MVN
1 Henze-Zirkler     0.802   0.334 asymptotic ✓ Normal

When applied to the same data, it results in a p-value of 0.334, similarly suggesting a good fit to multivariate normality. Like the standard Box-Cox, it also automatically estimates and applies suitable lambda values.


2.3. Yeo-Johnson Transformation

The Yeo-Johnson transformation is suitable for both positive and negative data and does not require shifting the variable range.

res_bc_sum <- mvn(
  data              = df,
  mvn_test          = "hz",
  power_family = "yjPower"
)
summary(res_bc_sum, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
           Test Statistic p.value     Method          MVN
1 Henze-Zirkler     1.925  <0.001 asymptotic ✗ Not normal

In this example, however, it yields a p-value less than 0.001, indicating strong evidence against multivariate normality even after transformation. This suggests that, for this specific dataset, Yeo-Johnson may not be as effective as the Box-Cox-based alternatives in achieving multivariate normality.


References

Korkmaz S, Goksuluk D, Zararsiz G. MVN: An R Package for Assessing Multivariate Normality. The R Journal. 2014;6(2):151–162. URL: https://journal.r-project.org/archive/2014-2/korkmaz-goksuluk-zararsiz.pdf