Sampling methods for visual inspection or quality control of GIS data

Common quality errors for visual inspection
1. Missing features
2. Redundant features
3. Misplaced features
4. Wrong shape (line and area)
5. Miscoded (wrong attributes)

Typical process for visual inspection
  1. Determine sampling method
  2. Create a sample
  3. Inspect each sampled record
  4. Mark as pass or fail
  5. Generate report
  6. Determine acceptability

Common sampling methods
  1. By fixed number - randomly choose a fixed number of features, set by how many samples the team can inspect.
  2. By grid or polygon - create an index grid (or use existing polygons) and sample features within each cell.
  3. By percentage of features - set a sampling percentage of all features. The number of features for inspection/QC comes from a direct percentage count or from applied weights, if any.
  4. By calculation - determine the statistically significant number of samples at a chosen confidence level, plus or minus the acceptable margin of error. (Given the sample size, how many features can fail inspection before the entire database fails?)
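
The first three methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: features are identified by plain IDs, point features are given as (id, x, y) tuples in a projected coordinate system, and the function names and parameters are hypothetical, not part of any GIS tool.

```python
import math
import random
from collections import defaultdict


def sample_fixed(feature_ids, count, seed=None):
    """Randomly pick a fixed number of features (capped at the population size)."""
    ids = list(feature_ids)
    return random.Random(seed).sample(ids, min(count, len(ids)))


def sample_percentage(feature_ids, percent, seed=None):
    """Randomly pick a percentage of all features.

    Assumption: the count is rounded up and at least one feature is sampled
    whenever the population is not empty.
    """
    ids = list(feature_ids)
    count = min(len(ids), max(1, math.ceil(len(ids) * percent / 100)))
    return random.Random(seed).sample(ids, count)


def sample_by_grid(points, cell_size, per_cell, seed=None):
    """Group (id, x, y) points into square grid cells, then sample each cell."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for fid, x, y in points:
        # The cell index is the integer column/row the point falls into.
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].append(fid)
    picked = []
    for key in sorted(cells):
        ids = cells[key]
        # Take up to per_cell features from every cell.
        picked.extend(rng.sample(ids, min(per_cell, len(ids))))
    return picked
```

Passing a seed makes a sample reproducible, which helps when the same records need to be re-inspected or audited later.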
For the calculation method, the sample size is determined based on four factors:
  • The probability (p) of the outcome, that is, the probability that a given feature will “pass” rather than “fail.” With no prior knowledge of the pass rate for features from a given client, passing and failing are treated as equally likely, so the tool uses p = 0.5. This is also the most pessimistic (conservative) choice, because the variance term p(1 - p) is maximized when p = 0.5, which yields the largest sample size.
  • The population size (N).
  • The acceptable margin of error in the confidence interval (m).
  • The z-statistic for the desired confidence level (z). This is used to compare the sample to a normal distribution. The value is supplied by a lookup table (for example, 1.96 for a 95% confidence level).
For an infinite population, the equation for determining the sample size (n) is:
n = (z/m)^2 * p(1 - p)
This value must then be adjusted to conform to the actual (finite) population, which gives the actual sample size (n'):
n' = n(N)/(n + (N - 1))
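
As a worked example, the two equations above can be sketched in Python. The small z lookup table holds standard two-sided normal values; the function name, the default arguments, and the choice to round the result up to a whole feature are assumptions made for illustration, not something the equations prescribe.

```python
import math

# Standard two-sided z values for common confidence levels.
Z_TABLE = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}


def sample_size(population, confidence=0.95, margin=0.05, p=0.5):
    """Return the sample size n' for a finite population of size N."""
    z = Z_TABLE[confidence]
    # Infinite-population sample size: n = (z/m)^2 * p(1 - p)
    n = (z / margin) ** 2 * p * (1 - p)
    # Finite-population adjustment: n' = n(N) / (n + (N - 1))
    n_adjusted = n * population / (n + (population - 1))
    # Assumption: round up so the sample is never smaller than required.
    return math.ceil(n_adjusted)
```

For N = 10,000 features at 95% confidence and a 5% margin of error, n is about 384.16 and n' is about 369.98, so `sample_size(10000)` returns 370.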


Determine acceptability by failure threshold
The failure threshold value is given by the Test of Proportions equation. This equation determines whether the number of failures is significant enough to fail the entire dataset, given a sample size, confidence level, and specified failure ratio. The failure threshold depends on three factors:
  • Sample size (n' from above)
  • The acceptable maximum failure ratio (r)
  • The z-statistic for the desired confidence level (z), which is used to compare the sample to a normal distribution. This value is supplied by a lookup table.
The maximum failure ratio allowable (r') is given by this equation:

r' = z * sqrt(r(1 - r)/n') + r

Since this is a ratio, the resulting value must then be multiplied by the sample size to get the maximum number of failures allowable (f):
f= r'(n')
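
The threshold calculation can be sketched in Python as follows, taking the sample size n' as given. The function names are hypothetical, the default z of 1.960 (95% confidence) is only an example value from a standard normal table, and rounding f down to a whole number of features is an assumption. The pass rule used here is that a dataset fails when the observed failures exceed f.

```python
import math


def max_failures(sample_n, max_fail_ratio, z=1.960):
    """Return f, the maximum number of failures allowed in the sample."""
    r = max_fail_ratio
    # r' = z * sqrt(r(1 - r)/n') + r
    r_adjusted = z * math.sqrt(r * (1 - r) / sample_n) + r
    # f = r'(n'); assumption: round down to a whole number of features.
    return math.floor(r_adjusted * sample_n)


def dataset_passes(failures, sample_n, max_fail_ratio, z=1.960):
    """A dataset passes when observed failures do not exceed f."""
    return failures <= max_failures(sample_n, max_fail_ratio, z)
```

For example, with n' = 370 and r = 0.05, r' is about 0.0722 and f is about 26.7, so up to 26 failed features are tolerated and the 27th fails the dataset.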

Remediation
If a given dataset fails to pass (that is, the number of actual failures exceeds the maximum allowable number of failures), it is not sufficient to fix the failures that were detected, then pass the dataset. If a dataset fails, it means that the sample has revealed a deficiency with the entire dataset, not just the detected failures. The quality of the entire dataset will have to be improved to pass a retest based on a new random sample.
