# How to compute an accuracy measure based on RMSE? Is my large dataset normally distributed?

I have several datasets on the order of thousands of points. The values in each dataset are X,Y,Z referring to a coordinate in space. The Z-value represents a difference in elevation at coordinate pair (x,y).

Typically in my field of GIS, elevation error is reported as RMSE, computed by subtracting the ground-truth elevation from the measured elevation (a LiDAR data point) at each check point. Usually a minimum of 20 ground-truth check points are used. According to the NDEP (National Digital Elevation Program) and FEMA guidelines, a measure of accuracy can then be computed from this RMSE: Accuracy = 1.96*RMSE.

This Accuracy is stated as: “The fundamental vertical accuracy is the value by which vertical accuracy can be equitably assessed and compared among datasets. Fundamental accuracy is calculated at the 95-percent confidence level as a function of vertical RMSE.”

I understand that 95% of the area under a normal distribution curve lies within 1.96 standard deviations of the mean; however, that is a statement about the standard deviation, not about RMSE.

In general, my question is this: using RMSE computed from two datasets, how can I relate RMSE to some sort of accuracy (i.e. 95 percent of my data points are within +/- X cm)? Also, how can I determine whether my dataset is normally distributed, using a test that works well with such a large dataset? What is “good enough” for a normal distribution? Should p<0.05 for all tests, or should it match the shape of a normal distribution?

I found some very good information on this topic in the following paper:

http://paulzandbergen.com/PUBLICATIONS_files/Zandbergen_TGIS_2008.pdf

Using RMSE computed from two datasets, how can I relate RMSE to some sort of accuracy (i.e. 95 percent of my data points are within +/- X cm)?

Take a look at a near duplicate question: Confidence interval of RMSE?
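One way to see the connection: RMSE squared equals the squared mean error plus the (divide-by-n) variance of the errors. So 1.96*RMSE behaves like a 95% bound only when the mean error is close to zero and the errors are roughly normal. Here is a minimal sketch on simulated errors; the names `err`, `acc95_formula` and `acc95_empirical`, and the simulation settings, are illustrative assumptions and do not come from the guidelines.

```r
set.seed(42)
# Simulated elevation errors (measured minus ground truth), in meters.
# Zero mean and sd = 0.15 are assumptions made only for illustration.
err <- rnorm(5000, mean = 0, sd = 0.15)

rmse <- sqrt(mean(err^2))
bias <- mean(err)
# Population (divide-by-n) variance, so the identity below holds exactly
pop_var <- mean((err - bias)^2)

# Check the decomposition RMSE^2 = bias^2 + variance
all.equal(rmse^2, bias^2 + pop_var)

# 1.96 * RMSE versus the empirical 95th percentile of |error|
acc95_formula <- 1.96 * rmse
acc95_empirical <- unname(quantile(abs(err), 0.95))
c(formula = acc95_formula, empirical = acc95_empirical)
```

With unbiased, normal errors the two numbers should be close. If the errors are biased or heavy-tailed they can diverge, in which case the empirical percentile is arguably the more direct statement of where 95 percent of the points fall.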

Is my large dataset normally distributed?

A good start would be to observe the empirical distribution of `z` values. Here is a reproducible example.

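```r
set.seed(1)
# Simulate 2000 elevation differences: mean 2, standard deviation 3
z <- rnorm(2000, 2, 3)
z.difference <- data.frame(z = z)

library(ggplot2)

# Histogram on the density scale, flipped so elevation runs vertically.
# after_stat(density) replaces the deprecated ..density.. (ggplot2 >= 3.3.0).
ggplot(z.difference, aes(x = z)) +
  geom_histogram(binwidth = 1, aes(y = after_stat(density)), fill = "white", color = "black") +
  ylab("Density") + xlab("Elevation differences (meters)") +
  theme_bw() +
  coord_flip()
```

At first glance it looks normal, right? (In fact, we know it is normal because we generated it with `rnorm`.)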

To test small samples drawn from the dataset, there is the Shapiro-Wilk normality test.

```r
# Draw a small sample (n = 40, with replacement) and test it for normality
z_sample <- sample(z.difference$z, 40, replace = TRUE)
shapiro.test(z_sample)  # a high p-value means normality (the null hypothesis) is not rejected

#         Shapiro-Wilk normality test
#
# data:  z_sample
# W = 0.98618, p-value = 0.8984   (normality not rejected)
```

One can also repeat the Shapiro-Wilk test over many different small samples and then look at the distribution of the p-values.
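A minimal sketch of that repeated-sampling idea follows. The number of repetitions (1000) and the sample size (40) are arbitrary choices for illustration, and `p_values` is a name introduced here.

```r
set.seed(1)
z <- rnorm(2000, 2, 3)  # same simulated elevation differences as in the example

# Draw 1000 small samples (n = 40 each) and keep the Shapiro-Wilk p-value of each
p_values <- replicate(1000, shapiro.test(sample(z, 40))$p.value)

# Distribution of the p-values across samples
hist(p_values, breaks = 20, xlab = "p-value",
     main = "Shapiro-Wilk p-values over 1000 samples")

# Share of samples that reject normality at the 5% level
mean(p_values < 0.05)
```

For data that really are normal, the p-values should be spread roughly evenly between 0 and 1, with a rejection share near 5%. A histogram piled up near zero would suggest a departure from normality.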

Be aware that normality tests on large datasets are not very useful, as explained in this answer by Greg Snow:

On the other hand, with really large datasets the central limit theorem comes into play and for common analyses (regression, t-tests, …) you really don’t care if the population is normally distributed or not.

The good rule of thumb is to do a qq-plot and ask, is this normal enough?

So, let’s make a QQ-plot:

```r
# QQ-plot: quantiles of the empirical distribution against quantiles of a
# normal sample with the same mean and standard deviation
mean_z <- mean(z.difference$z)
sd_z <- sd(z.difference$z)
set.seed(77)
normal <- rnorm(length(z.difference$z), mean = mean_z, sd = sd_z)

qqplot(normal, z.difference$z, xlab = "Theoretical", ylab = "Empirical")
```

If the dots fall along the `y = x` line, the empirical distribution matches the theoretical one, which in this case is the normal distribution.
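Base R also provides `qqnorm()` and `qqline()`, which compare the data with exact normal quantiles rather than with a second random sample. This is a minimal sketch of that alternative, not a replacement for the plot above:

```r
set.seed(1)
z <- rnorm(2000, 2, 3)  # same simulated elevation differences as in the example

# Empirical quantiles of z against exact standard-normal quantiles
qqnorm(z, main = "Normal Q-Q plot of elevation differences")

# Reference line through the first and third quartiles
qqline(z, col = "red")
```

Because the horizontal axis here is the standard normal, the reference is not `y = x` but a straight line whose intercept and slope are roughly the mean and standard deviation of the data. Systematic curvature away from that line, especially in the tails, is the thing to look for.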