# Load the package:
library(MVN)
Handling Missing Data
Handling missing values is a crucial step in data preprocessing, particularly when assessing multivariate normality. The mvn() function provides several options for this via its impute argument. Supported methods are “none” (default), “mean”, “median”, and “mice” (Multiple Imputation by Chained Equations). Each method determines how missing values are treated before the multivariate normality tests are run.
Example Data
We begin by introducing missingness into the dataset:
set.seed(123) # For reproducibility
# Create a copy of the iris dataset with random missing values
iris_na <- iris
# Randomly assign 10 NA values across the first 4 numeric columns
for (i in 1:10) {
  row <- sample(1:nrow(iris_na), 1)
  col <- sample(1:4, 1)
  iris_na[row, col] <- NA
}
# Preview modified data
iris_na[11:20, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 NA NA 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
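Before choosing an imputation method, it can help to count how much is actually missing. The following is a minimal base R sketch that rebuilds iris_na as above and tallies missing values per column and per row. The names na_per_column and n_incomplete are illustrative and are not part of MVN.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
iris_na <- iris
for (i in 1:10) {
  row <- sample(1:nrow(iris_na), 1)
  col <- sample(1:4, 1)
  iris_na[row, col] <- NA
}

# Number of missing values in each numeric column
na_per_column <- colSums(is.na(iris_na[, 1:4]))
print(na_per_column)

# Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(iris_na))
print(n_incomplete)
```

Because a row can be drawn more than once, the number of incomplete rows can be smaller than the number of missing cells, as row 14 in the preview shows.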
Let’s explore how different imputation methods influence the results of the Henze-Zirkler multivariate normality test.
No Imputation
Using impute = "none" removes every row that contains a missing value. A warning reports how many rows were excluded.
res <- mvn(data = iris_na, subset = "Species", impute = "none")
Warning in mvn(data = iris_na, subset = "Species", impute = "none"): Missing
values detected in 9 rows. These rows will be removed.
summary(res, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
Group Test Statistic p.value MVN
1 setosa Henze-Zirkler 0.865 0.153 ✓ Normal
2 versicolor Henze-Zirkler 0.858 0.168 ✓ Normal
3 virginica Henze-Zirkler 0.744 0.541 ✓ Normal
In this case, 9 rows with missing values were removed. All three species groups have Henze-Zirkler p-values greater than 0.05, so the test finds no evidence against multivariate normality in the remaining complete cases.
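Row removal of this kind is often called listwise deletion. As a rough base R sketch, assuming it amounts to keeping only the complete cases (the internal implementation in MVN may differ), the effect on group sizes can be inspected like this. The name complete_rows is illustrative.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
iris_na <- iris
for (i in 1:10) {
  row <- sample(1:nrow(iris_na), 1)
  col <- sample(1:4, 1)
  iris_na[row, col] <- NA
}

# Keep only rows without any missing value (listwise deletion)
complete_rows <- iris_na[complete.cases(iris_na), ]

# Rows remaining per species after dropping incomplete cases
print(table(complete_rows$Species))
```

Checking the per-group counts is worthwhile because deletion reduces the sample size unevenly when the missing values cluster in one group.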
Mean Imputation
Setting impute = "mean" replaces each missing value with the mean of its corresponding variable within the group.
res <- mvn(data = iris_na, subset = "Species", impute = "mean")
Missing values detected. Applying 'mean' imputation method.
summary(res, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
Group Test Statistic p.value MVN
1 setosa Henze-Zirkler 1.996 <0.001 ✗ Not normal
2 versicolor Henze-Zirkler 0.874 0.147 ✓ Normal
3 virginica Henze-Zirkler 0.745 0.544 ✓ Normal
Mean imputation changed only a few values, but it changed the outcome for one group. Versicolor and virginica are still classified as normal, while setosa now fails the normality test (p < 0.001), suggesting that mean imputation may introduce artifacts, especially in small or skewed groups.
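One well-known side effect of mean imputation is that it shrinks the variance of the imputed variable, because every filled-in value sits exactly at the center of the observed data. The sketch below illustrates this for the setosa group in base R. It is not the MVN implementation, the helper impute_mean is hypothetical, and variance shrinkage is only one possible contributor to the changed test result.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
iris_na <- iris
for (i in 1:10) {
  row <- sample(1:nrow(iris_na), 1)
  col <- sample(1:4, 1)
  iris_na[row, col] <- NA
}

# Hypothetical helper: replace NAs with the mean of the observed values
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

# Apply the helper to each numeric column of one group
setosa <- iris_na[iris_na$Species == "setosa", 1:4]
setosa_imp <- as.data.frame(lapply(setosa, impute_mean))

# Compare sample variances before and after imputation
var_before <- sapply(setosa, var, na.rm = TRUE)
var_after <- sapply(setosa_imp, var)
print(round(cbind(var_before, var_after), 4))
```

Columns with no missing values keep the same variance. Columns with imputed values can only end up with a smaller one, since the sum of squared deviations is unchanged while the sample size grows.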
Median Imputation
The “median” option replaces missing values with the median of each variable.
res <- mvn(data = iris_na, subset = "Species", impute = "median")
Missing values detected. Applying 'median' imputation method.
summary(res, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
Group Test Statistic p.value MVN
1 setosa Henze-Zirkler 2.098 <0.001 ✗ Not normal
2 versicolor Henze-Zirkler 0.909 0.091 ✓ Normal
3 virginica Henze-Zirkler 0.745 0.543 ✓ Normal
Results are similar to mean imputation. Setosa again failed the test, while the other two groups maintained multivariate normality. Median imputation is more robust to outliers, but can still shift distribution characteristics depending on the missingness pattern.
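The robustness claim can be seen in a small worked example. The toy vector below is invented for illustration and is not taken from the iris data. A single extreme value pulls the mean far from the bulk of the data, while the median stays put.

```r
# Toy vector with one outlier and one missing value (illustrative only)
x <- c(1, 2, 3, 4, 100, NA)

# Fill values computed from the observed entries
mean_fill <- mean(x, na.rm = TRUE)      # (1 + 2 + 3 + 4 + 100) / 5 = 22
median_fill <- median(x, na.rm = TRUE)  # middle of 1, 2, 3, 4, 100 is 3

# Impute the missing entry with each fill value
x_mean <- replace(x, is.na(x), mean_fill)
x_median <- replace(x, is.na(x), median_fill)

print(x_mean)
print(x_median)
```

With the mean, the missing entry becomes 22, a value unlike any typical observation. With the median, it becomes 3.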
Multiple Imputation
The “mice” method applies model-based multiple imputation using chained equations to estimate missing values.
res <- mvn(data = iris_na, subset = "Species", impute = "mice")
Missing values detected. Applying 'mice' imputation method.
summary(res, select = "mvn")
── Multivariate Normality Test Results ─────────────────────────────────────────
Group Test Statistic p.value MVN
1 setosa Henze-Zirkler 0.927 0.069 ✓ Normal
2 versicolor Henze-Zirkler 0.876 0.142 ✓ Normal
3 virginica Henze-Zirkler 0.779 0.415 ✓ Normal
This approach restored normality across all groups, with all p-values above 0.05. The MICE method generally preserves the structure of the data better than simple mean or median imputation and is recommended when the dataset contains substantial or non-random missingness.
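To make the chained-equations idea more concrete, here is a hedged sketch that calls the mice package directly on the four numeric columns. It assumes mice is installed and uses predictive mean matching (method = "pmm") with five imputations. These settings are illustrative choices, and the text above does not state which settings mvn() uses internally.

```r
# Rebuild the example data so this block runs on its own
set.seed(123)
iris_na <- iris
for (i in 1:10) {
  row <- sample(1:nrow(iris_na), 1)
  col <- sample(1:4, 1)
  iris_na[row, col] <- NA
}

# Run only if the mice package is available
if (requireNamespace("mice", quietly = TRUE)) {
  # Each incomplete column is modeled from the other columns in turn;
  # m = 5 imputed datasets via predictive mean matching (assumed settings)
  imp <- mice::mice(iris_na[, 1:4], m = 5, method = "pmm",
                    seed = 123, printFlag = FALSE)

  # Extract the first of the five completed datasets
  completed <- mice::complete(imp, 1)

  # Confirm that no missing values remain
  print(colSums(is.na(completed)))
}
```

Unlike mean or median imputation, each filled-in value here depends on the other measurements in the same row, which is why the relationships between variables tend to be better preserved.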
References
Korkmaz S, Goksuluk D, Zararsiz G. MVN: An R Package for Assessing Multivariate Normality. The R Journal. 2014;6(2):151–162. URL: https://journal.r-project.org/archive/2014-2/korkmaz-goksuluk-zararsiz.pdf