Data Validation

Data validation is essential for identifying suspicious and invalid cases, data values, and variables in the active dataset. Rapid developments in computing have made it possible for companies to store and process large amounts of data. Surveys and research studies designed in recent years therefore tend to involve more variables and larger sample sizes.

Analyzing and validating more variables and cases means a heavier workload for all data handlers, including coding staff, data editors, and entry clerks. This can adversely affect the quality of the data passed from the data manager to the analysis stage. In addition, pressure to deliver work on time, combined with inadequate training, can make the data questionable. There are situations where surveys have been planned with no step for verifying that the data have been entered correctly.

For example, in education, data are often obtained from surveys run by different sources. Rechecking the coding or the accuracy of the entered data is usually a daunting task. For this reason, validation rules are used to check the consistency and validity of the data before the dataset is used.

Rules of validation

There are three general types of rules for validating a dataset:

  1. Single-variable rules
  2. Cross-variable rules
  3. Multi-case rules

SPSS (PASW) Statistics 17.0 does not include these rules in the base system. Instead, they are part of the optional Data Preparation add-on module. You can still carry out these tasks with common SPSS commands. However, you must have an in-depth understanding of the structure and syntax of the SPSS programming language.

The first two types, single-variable rules and cross-variable rules, require you to be familiar with case selection. Multi-case rules, however, are more complicated. You may have to manipulate the data in several steps, such as creating temporary variables, matching, selecting cases, and aggregating.

SPSS provides the “Identify Duplicate Cases” procedure in the Data menu. This procedure identifies duplicate cases in a data file, which is the most important part of the multi-case rules. Below, our SPSS online experts discuss simple but powerful tools that can be used to identify invalid and improper cases and values.

Single-variable rules

Single-variable rules are a set of checks applied to one variable at a time. They check for out-of-range (invalid) values and missing values. For example, a single-variable rule can check whether values other than male and female have been entered for the variable sex.

The process of validating the data entered for a variable involves three stages. Only after these stages can any invalid values that are identified be edited. The stages are listed below:

  1. Obtain the valid values or ranges from the codebook.
  2. Construct a frequency table. The variable under observation satisfies the single-variable rule if no invalid values appear in the frequency table.
  3. If invalid values do appear in the frequency table, locate the cases that contain them.
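
The three stages above can be sketched outside SPSS in a few lines of plain Python. This is only an illustration of the logic, not SPSS syntax. The codebook entry, the variable names, and the sample cases are all hypothetical, and the sketch assumes that a missing value should be flagged together with out-of-range values.

```python
from collections import Counter

# Stage 1: hypothetical codebook entry listing the valid codes for "sex".
VALID_SEX_CODES = {1: "male", 2: "female"}

# Hypothetical data: one dict per case, keyed by variable name.
cases = [
    {"id": 1, "sex": 1},
    {"id": 2, "sex": 2},
    {"id": 3, "sex": 3},     # out-of-range code
    {"id": 4, "sex": None},  # missing value
    {"id": 5, "sex": 2},
]


def frequency_table(cases, variable):
    """Stage 2: count how often each value of `variable` occurs."""
    return Counter(case[variable] for case in cases)


def invalid_cases(cases, variable, valid_codes):
    """Stage 3: return the cases whose value is not in the codebook."""
    return [case for case in cases if case[variable] not in valid_codes]


freq = frequency_table(cases, "sex")
flagged = invalid_cases(cases, "sex", VALID_SEX_CODES)

# Print the frequency table with a validity marker for each value.
for value, count in freq.items():
    status = "valid" if value in VALID_SEX_CODES else "INVALID"
    print(f"sex={value!r}: {count} case(s) [{status}]")

print("Cases to review:", [case["id"] for case in flagged])
```

If the frequency table shows only valid codes, the list of cases to review is empty and the variable passes the check.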

Cross-variable rules

These rules check for inconsistencies in a variable by comparing it with the values of other variables in the same case. With cross-variable rules, you use cross-tabulations instead of frequency tables to determine whether invalid cases exist. Slightly different rules also apply to the conditional selection of invalid cases.
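
As a rough illustration of a cross-variable check, the plain Python sketch below builds a small cross-tabulation and then selects the cases that break a rule. It is not SPSS syntax. The variables sex and pregnant, the rule itself, and the sample cases are hypothetical and chosen only to show the idea.

```python
from collections import Counter

# Hypothetical data with two related variables per case.
cases = [
    {"id": 1, "sex": "male", "pregnant": "no"},
    {"id": 2, "sex": "female", "pregnant": "yes"},
    {"id": 3, "sex": "male", "pregnant": "yes"},  # inconsistent combination
    {"id": 4, "sex": "female", "pregnant": "no"},
]


def crosstab(cases, row_var, col_var):
    """Count the cases in each (row value, column value) cell."""
    return Counter((case[row_var], case[col_var]) for case in cases)


def violates_rule(case):
    """Hypothetical rule: a case coded male cannot be coded pregnant."""
    return case["sex"] == "male" and case["pregnant"] == "yes"


table = crosstab(cases, "sex", "pregnant")
inconsistent = [case["id"] for case in cases if violates_rule(case)]

# A non-zero count in an impossible cell signals invalid cases.
for (sex, pregnant), count in sorted(table.items()):
    print(f"sex={sex}, pregnant={pregnant}: {count}")

print("Inconsistent cases:", inconsistent)
```

The cross-tabulation shows whether an impossible combination occurs at all, and the conditional selection then identifies which cases need to be corrected.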

Multi-case rules

Multi-case rules are user-defined rules that are usually applied to a single variable or a combination of variables across a group of cases. They are defined by a sequence of logical expressions (a procedure) that flags invalid cases. The fundamental application of multi-case rules is to check for duplicates in the dataset, for example, cases that have been entered more than once.

In SPSS, you can check for duplicate cases and inspect unusual cases using the steps below:

  1. On the main menu bar, click Data.
  2. Choose Identify Duplicate Cases. A new window will appear.
  3. Choose the variables that identify duplicate cases. You can press Ctrl + A to select all of them and then deselect the ones you do not need.
  4. Next, set the options described in the following steps.
  5. Sorting within a matching group – choose variables from those remaining in the list. These will be the key for sorting within each group.
  6. Sort – define the sort order if a key variable for sorting has been selected.
  7. Variables to create – tick the corresponding checkbox if you want a frequency table showing the number of duplicates detected, or if you want the duplicate cases highlighted.
  8. Click OK to proceed.
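
The logic behind these steps (match cases on chosen variables, sort within each matching group, then mark the duplicates) can be sketched in plain Python. This is an illustration only and does not reproduce the SPSS procedure. The function `flag_duplicates`, the field names, and the sample cases are hypothetical, and treating the first case in each sorted group as the primary one is an assumption made for this sketch.

```python
from collections import defaultdict

# Hypothetical data: case 101 was entered three times.
cases = [
    {"id": 101, "name": "Ann", "entered": 3},
    {"id": 102, "name": "Bob", "entered": 1},
    {"id": 101, "name": "Ann", "entered": 1},
    {"id": 103, "name": "Cy", "entered": 2},
    {"id": 101, "name": "Ann", "entered": 2},
]


def flag_duplicates(cases, match_vars, sort_var):
    """Group cases on `match_vars`, sort each group by `sort_var`,
    and mark the first case of each group as the primary one."""
    groups = defaultdict(list)
    for case in cases:
        key = tuple(case[var] for var in match_vars)
        groups[key].append(case)

    flagged = []
    for group in groups.values():
        ordered = sorted(group, key=lambda case: case[sort_var])
        for position, case in enumerate(ordered):
            flagged.append({
                **case,
                "primary": position == 0,     # assumption: first case is kept
                "match_count": len(ordered),  # size of the matching group
            })
    return flagged


flagged = flag_duplicates(cases, match_vars=["id", "name"], sort_var="entered")
duplicates = [case for case in flagged if not case["primary"]]

print("Duplicates detected:", len(duplicates))
for case in duplicates:
    print(case)
```

A case whose matching group has only one member is always primary, so only the repeated entries end up in the list of duplicates.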

Consider taking our data validation assignment help if you need assistance with finding invalid or suspicious cases in your data.