Preparing and cleaning survey data

Why do we clean survey data?

We clean data to make sure our data is accurate, high quality, and complete. Otherwise, our conclusions may be inaccurate or biased, and our results may lose credibility and generalizability. Data cleaning is an important step at the start of the data analysis process, and shouldn’t be skipped.

What we'll cover

How do we go about cleaning data?

  • Prepare the spreadsheet
  • Visualize and count each variable
  • Check for coding errors
  • Revise variables
  • Compute frequencies and descriptives
  • Deal with missing data
  • Check for outliers
  • Test for normality
  • Collapse categories
  • Start the analyses

How do we go about cleaning data?

Prepare the spreadsheet

First of all, the basics: have a look at the spreadsheet. If we’ve downloaded data from an online survey software into Excel we’ll need to code the data first (using a coding frame we have prepared in advance). Then we want to check for typos and tidy up variable names, remove duplicated rows, or rows that contain test or pilot data. We might want to separate numeric and non numeric data (eg from open ended questions). We might need to merge or split columns. We can use find and replace to correct text, and change number formats. We typically do this in Excel and then transfer to SPSS to examine further.

Visualise and count each variable

We start by computing frequencies and percentages for all the variables in our study. Thankfully, if we use a statistical software program, we can quickly conduct these with a few clicks of the mouse. It’s also a good idea to check the box to produce a chart (bar chart or histogram) for each variable, to visualise the data. Have a close look at the output for each variable, and see if you can spot any anomalies or errors. You can then go back to the dataset to correct any.

Check for coding errors

We check the data to see if there are any values entered that are outside the expected range for each variable. For example, if values for a variable should be 1, 2, 3, or 4, and we spot a value of 5, then this is a coding error. These errors frequently indicate missing values. It can also be a mistake we made when we set the response values in the statistical program, which we can quickly correct.

Revise variables

We make revisions to variables to prepare them for the analyses we have planned. If not done when writing the survey questions, we might need to reverse code variables if we want to merge them with variables on an opposite scale. For example, we might have ‘Strongly Agree’ scored as 1 in one question but scored 5 in another, in a related set of questions, because we reversed the order of the responses in the survey.

We can recode variables as necessary, and add any new variables we need for our analyses. Sometimes we create new variables for total scores, averages, or to merge responses (e.g. if we need to create dichotomous nominal variables).

Compute frequencies and descriptives

Then we have a closer look at the data by computing frequencies for all variables and descriptives for continuous variables. By checking the boxes in our statistical program, we can quickly compute the mean, median and mode to measure central tendency, and the standard deviation, as appropriate for each variable according to the data type. We will also look at skewness and kurtosis (to test the normality of the dataset), and minimum and maximum. 

We can also compute a z-score (standard score) for continuous variables. We do this to compare our results to a ‘normal’ population. It measures how far from the mean a data point is (the number of standard deviations from the population mean).

Check for outliers

From the histograms and standard scores, we can identify any outliers. An outlier is a score that lies outside (is far away from) most other values in the dataset. In other words, it’s much smaller or larger than most other values, and deviates a lot from the mean. An outlier can distort our results by creating or hiding statistical significance. We can get quite different results with or without the outlier. So we need to decide what to do with any outliers we find. We can remove them, modify them, or keep them.

First we need to try to figure out why one value is so far from the other values. We need to decide if it is normal to have an outlier, i.e. if one value can be that high or low, or if some anomaly has occurred. There can be unusual reasons for an outlier. In this case, we can remove it. If the outlier is a ‘normal’ variation that can arise in a population, then we should think twice about removing it, as this is changing the data (and the results) without a reason. At the very least we should explain why outliers were removed.

Test for normality

If our survey includes interval and ratio level data scales, and we want to apply parametric statistical tests, including regressions, the dependent (outcome) variable must be normally distributed. The test’s validity depends on it. We should check the distribution of the values of the dependent variable. If the distribution is normal, the mean, median and mode will be the same. We can do this visually with a histogram, stem-and-leaf plot or box plot. This can be supplemented with a normality test, such as the Kolmogorov-Smirnov (K-S) test (which can be computed with SPSS). 

If we do not have a normal distribution (and/or we have nominal or ordinal data), we use nonparametric statistical tests.

Deal with missing data

Data can be missing from either records (where a respondent has missing values) or from a variable (where a variable has missing values). To deal with missing data we:

  • Ignore the missing data (if missing data is random) and analyze the available data
  • Remove the records or the variable with a specified amount of missing data, or 
  • Use a statistical method to deal with missing data.

 

If there are missing values from a variable, the decision to remove it may depend on the importance of that variable to our research question. What action we take depends on whether missing data is random or non random. If there are patterns of missing data, our results may be misleading or ungeneralizable. We’ll talk more about dealing with missing data in another article.

Collapse categories

For some analyses (using parametric tests), when we are comparing groups of survey respondents, we want groups of equal sample size. Unequal sample sizes can lower generalizability. If our sub-samples are unequal, we can collapse categories within a variable. For example, if our simple includes a much smaller number of respondents in one age group, we may think about merging age groups into wider groupings. Alternatively, with unequal groups we can use nonparametric statistical tests.

Start the analyses

The data is now cleaned and prepared for analysis. We start the process of analysis by running frequencies and descriptives again, before deciding which inferential statistics to compute.

Summary: Preparing and cleaning survey data

In this article we considered how to clean quantitative survey data. We covered:

  • Prepare the spreadsheet
  • Visualize and count each variable
  • Check for coding errors
  • Revise variables
  • Compute frequencies and descriptives
  • Deal with missing data
  • Check for outliers
  • Test for normality
  • Collapse categories
  • Start the analyses

Want to talk to us about data cleaning or analysis?