Handling Missing Data

Handling missing values is a crucial step in data preprocessing, particularly when assessing multivariate normality. The mvn() function provides several options via the impute argument to handle missing data effectively. Supported methods include “none” (default), “mean”, “median”, and “mice” (Multiple Imputation by Chained Equations). Each approach affects how missing values are treated before running multivariate normality tests.

Example Data

# Load the package:
library(MVN)

We begin by introducing missingness into the dataset:

set.seed(123)  # For reproducibility

# Create a copy of the iris dataset with random missing values
iris_na <- iris

# Make 10 random draws of a cell in the first 4 numeric columns and set it to NA
# (draws can land in the same row, so fewer than 10 rows may be affected)
for (i in 1:10) {
  row <- sample(1:nrow(iris_na), 1)
  col <- sample(1:4, 1)
  iris_na[row, col] <- NA
}

# Preview modified data
iris_na[11:20,]

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11          5.4         3.7          1.5         0.2  setosa
12          4.8         3.4          1.6         0.2  setosa
13          4.8         3.0          1.4         0.1  setosa
14          4.3          NA           NA         0.1  setosa
15          5.8         4.0          1.2         0.2  setosa
16          5.7         4.4          1.5         0.4  setosa
17          5.4         3.9          1.3         0.4  setosa
18          5.1         3.5          1.4         0.3  setosa
19          5.7         3.8          1.7         0.3  setosa
20          5.1         3.8          1.5         0.3  setosa
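
Before imputing, it helps to distinguish missing cells from incomplete rows: in the preview above, row 14 has two missing cells but counts as a single incomplete row. The sketch below uses a hypothetical helper, summarise_missing() (not part of MVN), on a small toy data frame with made-up values to show both counts using base R only.

```r
# Hypothetical helper (not part of MVN): summarise missingness in selected columns
summarise_missing <- function(df, cols) {
  na_mask <- is.na(df[, cols, drop = FALSE])      # logical matrix, TRUE where NA
  list(
    cells_per_column = colSums(na_mask),          # missing cells in each column
    incomplete_rows  = sum(rowSums(na_mask) > 0)  # rows with at least one NA
  )
}

# Toy data with made-up values: 3 missing cells spread over 2 rows
toy <- data.frame(
  x = c(1.0, NA, 3.0, 4.0),
  y = c(NA, NA, 0.5, 0.7)
)

toy_summary <- summarise_missing(toy, c("x", "y"))
print(toy_summary)
```

The same helper could be applied to the example data with summarise_missing(iris_na, 1:4).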

Let’s explore how different imputation methods influence the results of the Henze-Zirkler multivariate normality test.

No Imputation

Using impute = "none" removes every row that contains a missing value (listwise deletion). A warning reports how many rows were excluded.

res <- mvn(data = iris_na, subset = "Species", impute = "none")

Warning in mvn(data = iris_na, subset = "Species", impute = "none"): Missing
values detected in 9 rows. These rows will be removed.

summary(res, select = "mvn")

── Multivariate Normality Test Results ─────────────────────────────────────────

       Group          Test Statistic p.value      MVN
1     setosa Henze-Zirkler     0.865   0.153 ✓ Normal
2 versicolor Henze-Zirkler     0.858   0.168 ✓ Normal
3  virginica Henze-Zirkler     0.744   0.541 ✓ Normal

In this case, 9 rows with missing values were removed. All three species groups passed the Henze-Zirkler test with p-values greater than 0.05, so the test found no evidence against multivariate normality in the remaining complete cases.
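
Listwise deletion reduces the sample size in each group, which is worth checking before interpreting the test. The following is a minimal base R sketch of the idea, not MVN's internal code; drop_incomplete() is a hypothetical helper and the toy values are made up.

```r
# Hypothetical helper (not MVN's internal code): listwise deletion on chosen columns
drop_incomplete <- function(df, cols) {
  df[complete.cases(df[, cols, drop = FALSE]), , drop = FALSE]
}

# Toy data with made-up values: rows 2 and 3 each contain one NA
toy <- data.frame(
  x = c(1.0, NA, 3.0, 4.0, 5.0),
  y = c(0.2, 0.4, NA, 0.8, 1.0),
  group = c("a", "a", "b", "b", "b")
)

complete_toy <- drop_incomplete(toy, c("x", "y"))

# Rows retained per group after deletion
n_per_group <- table(complete_toy$group)
print(n_per_group)
```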

Mean Imputation

Setting impute = "mean" replaces each missing value with the mean of its corresponding variable within the group.

res <- mvn(data = iris_na, subset = "Species", impute = "mean")

Missing values detected. Applying 'mean' imputation method.

summary(res, select = "mvn")

── Multivariate Normality Test Results ─────────────────────────────────────────

       Group          Test Statistic p.value          MVN
1     setosa Henze-Zirkler     1.996  <0.001 ✗ Not normal
2 versicolor Henze-Zirkler     0.874   0.147     ✓ Normal
3  virginica Henze-Zirkler     0.745   0.544     ✓ Normal

Mean imputation changed the outcome for one group. Versicolor and virginica still passed, but setosa now failed the normality test (statistic 1.996, p < 0.001, compared with 0.865 and p = 0.153 when incomplete rows were dropped). This suggests that mean imputation may introduce artifacts, especially in small or skewed groups.
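
How much a mean-imputed value disturbs a group depends on which mean is used. The sketch below is a base R illustration with made-up values, not a description of how mvn() computes its means: it contrasts a mean taken over all rows with a mean taken within each group. impute_mean() is a hypothetical helper.

```r
# Hypothetical helper: replace NAs in a numeric vector with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Toy data with made-up values: two well-separated groups, one NA in group "a"
toy <- data.frame(
  value = c(1, 2, NA, 10, 11, 12),
  group = c("a", "a", "a", "b", "b", "b")
)

# Mean over all rows: the gap in group "a" is filled with 7.2
overall <- impute_mean(toy$value)

# Mean within each group: the same gap is filled with 1.5
by_group <- ave(toy$value, toy$group, FUN = impute_mean)

print(rbind(overall, by_group))
```

When groups differ strongly, a value filled in from the pooled mean can sit far from the rest of its group, which is one way a single imputed row can affect a multivariate normality test.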

Median Imputation

The “median” option replaces missing values with the median of each variable.

res <- mvn(data = iris_na, subset = "Species", impute = "median")

Missing values detected. Applying 'median' imputation method.

summary(res, select = "mvn")

── Multivariate Normality Test Results ─────────────────────────────────────────

       Group          Test Statistic p.value          MVN
1     setosa Henze-Zirkler     2.098  <0.001 ✗ Not normal
2 versicolor Henze-Zirkler     0.909   0.091     ✓ Normal
3  virginica Henze-Zirkler     0.745   0.543     ✓ Normal

Results are similar to those from mean imputation: setosa again failed the test (p < 0.001), while versicolor (p = 0.091) and virginica (p = 0.543) passed. Median imputation is more robust to outliers than mean imputation, but it can still distort the distribution, depending on the missingness pattern.
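
Both points can be seen on a single variable. The sketch below uses base R and made-up values (impute_median() is a hypothetical helper, not MVN code): one outlier pulls the mean far more than the median, yet filling the gap with any constant still reduces the sample variance.

```r
# Hypothetical helper: replace NAs in a numeric vector with the median of the observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy vector with made-up values: one outlier (100) and one missing value
x <- c(1, 2, 3, 4, 100, NA)

mean_fill   <- mean(x, na.rm = TRUE)    # 22, pulled up by the outlier
median_fill <- median(x, na.rm = TRUE)  # 3, unaffected by the outlier

filled <- impute_median(x)

# Variance of the observed values versus the variance after imputation
var_observed <- var(x, na.rm = TRUE)
var_filled   <- var(filled)
print(c(observed = var_observed, filled = var_filled))
```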

Multiple Imputation

The “mice” method applies model-based multiple imputation using chained equations to estimate missing values.

res <- mvn(data = iris_na, subset = "Species", impute = "mice")

Missing values detected. Applying 'mice' imputation method.

summary(res, select = "mvn")

── Multivariate Normality Test Results ─────────────────────────────────────────

       Group          Test Statistic p.value      MVN
1     setosa Henze-Zirkler     0.927   0.069 ✓ Normal
2 versicolor Henze-Zirkler     0.876   0.142 ✓ Normal
3  virginica Henze-Zirkler     0.779   0.415 ✓ Normal

With MICE, all three groups passed the test again, with all p-values above 0.05, although the p-value for setosa (0.069) is close to the threshold. The MICE method generally preserves the structure of the data better than simple mean or median imputation and is recommended when the dataset contains substantial or non-random missingness.
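
For readers who want to see chained-equation imputation outside of mvn(), the following sketch calls the mice package directly. It is an illustration under stated assumptions, not the way mvn() invokes mice internally: the choice of m = 5, predictive mean matching ("pmm"), the seed, and the positions of the missing cells are all assumptions made for this example. The block is skipped if mice is not installed.

```r
if (requireNamespace("mice", quietly = TRUE)) {
  # Assumed missingness pattern for illustration (differs from iris_na above)
  iris_demo <- iris
  iris_demo[c(5, 60, 120), "Sepal.Width"] <- NA
  iris_demo[c(14, 75), "Petal.Length"] <- NA

  # Impute the four numeric columns; m, method and seed are illustrative choices
  imp <- mice::mice(iris_demo[, 1:4], m = 5, method = "pmm",
                    seed = 123, printFlag = FALSE)

  # Extract the first of the m completed datasets
  completed <- mice::complete(imp, action = 1)

  n_missing_after <- sum(is.na(completed))
  print(n_missing_after)
}
```

Because the imputed values are drawn at random, results depend on the seed, and analysing only one completed dataset is effectively a single stochastic imputation rather than a pooled multiple-imputation analysis.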

References

Korkmaz S, Goksuluk D, Zararsiz G. MVN: An R Package for Assessing Multivariate Normality. The R Journal. 2014;6(2):151–162. URL: https://journal.r-project.org/archive/2014-2/korkmaz-goksuluk-zararsiz.pdf