In this article we consider the benefits and concerns of combining data from different sources. We will focus on combining survey data with administrative data.
What we'll cover
- Why combine data from different sources?
- Limitations of using administrative data
- How to combine data from different sources
- Concerns about combining data from different sources
Why combine data from different sources?
We may want to combine administrative data with survey data for several reasons. Combining data can help to improve the efficiency of surveys, supplement data collected from a survey, and improve the quality of survey data.
Improve data quality
Data quality is improved by using the administrative data to adjust for nonresponses, validate responses, fill in missing responses, and replace misreported responses.
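To make the idea concrete, here is a minimal sketch in Python of filling in missing survey responses and validating the rest against administrative data. The field names, the employee numbers, and the `fill_and_flag` helper are all hypothetical, and the sketch assumes both sources share a unique identifier. It only flags disagreements. Whether to replace a survey response with the administrative value is a judgment call left to the analyst.

```python
# Hypothetical data: both sources are keyed by an employee number.
survey = {
    "E001": {"department": "Finance", "age": None},  # age not answered
    "E002": {"department": "Sales", "age": 41},
}
admin = {
    "E001": {"department": "Finance", "age": 35},
    "E002": {"department": "Marketing", "age": 41},
}


def fill_and_flag(survey, admin):
    """Fill missing survey values from admin data and list disagreements."""
    filled = {}
    mismatches = []
    for emp_id, answers in survey.items():
        admin_row = admin.get(emp_id, {})
        row = dict(answers)  # copy, so the original survey data is untouched
        for field, value in answers.items():
            admin_value = admin_row.get(field)
            if admin_value is None:
                continue  # nothing to compare against
            if value is None:
                row[field] = admin_value  # fill in the missing response
            elif value != admin_value:
                # Flag for review rather than silently overwriting.
                mismatches.append((emp_id, field, value, admin_value))
        filled[emp_id] = row
    return filled, mismatches


filled, mismatches = fill_and_flag(survey, admin)
```

Here the missing age for `E001` is taken from the administrative record, and the department reported by `E002` is flagged because it differs from the administrative value.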
Reduce survey length
Combining data can also reduce the number of questions we ask in a survey. Let’s imagine, for example, that we are conducting a survey of employees for an organization. If we make use of the organization’s administrative data, we don’t need to ask employees about the department or location they work in, or their job role. If the organization routinely collects information about employees’ age, gender, ethnicity, languages spoken, and so on, we can combine this information with our survey data, and then we don’t need to ask survey respondents these questions. This shortens the survey, which can help improve the response rate.
More useful data
Alternatively, as we said above, if we include these questions in the survey, we can check responses against the administrative data for accuracy, and fill in missing survey responses. Also, administrative data may contain information not collected in the survey. This gives us additional variables (for example, income levels, disability or health) whose relationships with the survey variables we can explore. Ultimately, we can end up with more data, and more useful data, from a shorter survey, which reduces the burden on survey respondents. This is great for research, but it comes with limitations and concerns. We’ll talk about these next.
Limitations of using administrative data
Of course, other data sources have their own limitations. Administrative data is collected for a particular purpose, and may not fit the needs of the analysis we have planned. It may use categories that differ from those in our survey, or that do not measure the variables of interest to our research.
There can be problems with data quality too. We’ll talk more about data quality in another article. Let’s just say for now that we may not know the quality of the data. We may not know how reliably or completely the data was recorded, or how accurate the responses are.
So, how do we go about combining data from different data sources?
How to combine data from different sources
Record linkage
To combine data from different sources we must link records. (By record, we mean the data held about one individual.) Both data sources must therefore contain the same identifying information so that records can be matched. Typically, the identifier will be a key variable that is unique to a person, such as an employee number, student number, phone number, or email address. If no such identifier is available, we can link records by matching on more than one key variable, such as name, date of birth, ethnicity, gender, and so on.
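As an illustration, the sketch below links records on a combination of key variables when no unique identifier is available. It is written in Python. The records, the field names, and the `composite_key` helper are hypothetical, and the approach shown is simple exact matching after light normalization. It assumes each combination of values occurs only once in the administrative data.

```python
def composite_key(record):
    """Build a linkage key from several identifying fields.

    Whitespace and letter case are normalized so that trivial differences
    in how the data was entered do not prevent a match.
    """
    return (
        record["name"].strip().lower(),
        record["date_of_birth"],
        record["gender"].strip().lower(),
    )


# Hypothetical records from the two sources.
survey_rows = [
    {"name": "Ana Lee ", "date_of_birth": "1990-04-12", "gender": "F", "satisfaction": 4},
    {"name": "Sam Roy", "date_of_birth": "1985-11-02", "gender": "M", "satisfaction": 2},
]
admin_rows = [
    {"name": "ana lee", "date_of_birth": "1990-04-12", "gender": "f", "department": "Finance"},
    {"name": "Kim Cho", "date_of_birth": "1979-01-30", "gender": "F", "department": "Sales"},
]

# Index the administrative records by key (assumes keys are unique there).
admin_index = {composite_key(row): row for row in admin_rows}

links = []     # (survey record, administrative record) pairs
unlinked = []  # survey records with no match
for row in survey_rows:
    match = admin_index.get(composite_key(row))
    if match is None:
        unlinked.append(row)
    else:
        links.append((row, match))
```

In this example one survey record is linked despite differences in spacing and capitalization, and one is left unlinked because no administrative record shares its key.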
Merge data files
After we’ve linked the data we can merge the two datasets using statistical software. In SPSS we merge two data files by using key variables to match cases in the two files.
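The same kind of key-based merge can be sketched outside SPSS. The Python example below is not the SPSS procedure, only an analogous illustration under stated assumptions: a hypothetical `employee_number` key that is unique in the administrative file, and a hypothetical `merge_on_key` helper. Every survey case is kept, and administrative variables are added where a match exists.

```python
def merge_on_key(survey_rows, admin_rows, key):
    """Add administrative variables to each survey case that shares the key.

    All survey cases are kept. A 'linked' flag records whether a matching
    administrative record was found.
    """
    admin_by_key = {row[key]: row for row in admin_rows}
    merged = []
    for row in survey_rows:
        admin_row = admin_by_key.get(row[key], {})
        # If both files contain the same variable, the survey value is kept.
        combined = {**admin_row, **row}
        combined["linked"] = row[key] in admin_by_key
        merged.append(combined)
    return merged


# Hypothetical data files.
survey_rows = [
    {"employee_number": "E001", "satisfaction": 4},
    {"employee_number": "E009", "satisfaction": 3},
]
admin_rows = [
    {"employee_number": "E001", "department": "Finance"},
]

merged = merge_on_key(survey_rows, admin_rows, key="employee_number")
```

Keeping unmatched survey cases, rather than dropping them, makes it easy to count how many records failed to link before any analysis is run.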
Sometimes there is not sufficient information available to link records. This may be because the information is not collected, or cannot be accessed (for example, due to privacy concerns). In this case, we may look at statistical matching techniques for combining data, but we won’t get into this in this article. We’ll now turn to the concerns that have to be addressed before linking data from different sources.
Concerns about combining data from different sources
Linkage errors
Errors in linkage can occur. How often this happens depends to a great extent on the linkage method we use. If we link records using a unique identifier (such as employee number), the risk of errors is small. If we link records using a combination of demographic variables such as date of birth, gender and ethnicity, the risk of errors is much higher. Clearly, there can be more than one person with the same date of birth, gender and ethnicity. This can lead to false links. Errors also arise if identifying information is missing in the administrative data. These errors can distort the results and the relationships between variables.
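One simple check before linking is to count how many records share the same combination of linkage variables. The Python sketch below does this on hypothetical administrative records. The field names and the `ambiguous_keys` helper are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical administrative records.
admin_rows = [
    {"employee_number": "E001", "date_of_birth": "1990-04-12", "gender": "F", "ethnicity": "A"},
    {"employee_number": "E002", "date_of_birth": "1990-04-12", "gender": "F", "ethnicity": "A"},
    {"employee_number": "E003", "date_of_birth": "1985-11-02", "gender": "M", "ethnicity": "B"},
]


def ambiguous_keys(rows, fields):
    """Return each combination of field values shared by more than one record."""
    counts = Counter(tuple(row[field] for field in fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Demographic variables: two people share the same combination.
by_demographics = ambiguous_keys(admin_rows, ["date_of_birth", "gender", "ethnicity"])

# Unique identifier: no combination is shared.
by_employee_number = ambiguous_keys(admin_rows, ["employee_number"])
```

Any combination returned by this check could be linked to the wrong person, so those records need extra linkage variables or manual review.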
Privacy
A major concern with merging survey data with administrative data is privacy. The researcher may have access to information about individuals without their consent. Linking the data can result in a greater amount of information about an individual being brought together, more than we would find in either the administrative dataset or survey data alone. It is possible that the resulting larger dataset will contain information sufficiently detailed for an individual to be identified from it, even if personal identifiers (name, address, date of birth, etc.) are removed. Wherever possible, informed consent should be sought for the use of personal information in research. This is not always possible though, especially when we are talking about data stored by an organization and collected from a large population.
Reducing privacy risks
There are measures we can take to reduce privacy risks.
- Allocate an identifier number to each record in both datasets and use it to link records. Once the records are linked, remove all personal information, including the variable that was used for linkage.
- Link data without storing the two datasets in the same place – not in the same file or folder, or even the same computer.
- Store data securely, including password protecting all data files, and avoid cloud storage. Use encryption if possible.
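The first of these measures can be sketched in Python as follows. This is an illustration under stated assumptions, not a complete de-identification procedure. The list of personal fields, the `study_id` name, and the `pseudonymise` helper are hypothetical. The sketch keeps a random study identifier on each record so that cases can still be told apart, which is one possible reading of the measure. As noted above, individuals may still be identifiable from the remaining variables.

```python
import secrets

# Hypothetical list of fields treated as personal information.
PERSONAL_FIELDS = {"employee_number", "name", "date_of_birth"}


def pseudonymise(linked_rows, id_field="employee_number"):
    """Replace personal information with a random study identifier.

    Returns the cleaned records and a crosswalk from the original
    identifier to the study identifier. The crosswalk should be stored
    separately from the research data, or destroyed.
    """
    crosswalk = {}
    research_rows = []
    for row in linked_rows:
        original_id = row[id_field]
        if original_id not in crosswalk:
            # Random, so the original identifier cannot be derived from it.
            crosswalk[original_id] = secrets.token_hex(8)
        cleaned = {k: v for k, v in row.items() if k not in PERSONAL_FIELDS}
        cleaned["study_id"] = crosswalk[original_id]
        research_rows.append(cleaned)
    return research_rows, crosswalk


# Hypothetical linked records.
linked_rows = [
    {"employee_number": "E001", "name": "Ana Lee", "date_of_birth": "1990-04-12",
     "department": "Finance", "satisfaction": 4},
    {"employee_number": "E002", "name": "Sam Roy", "date_of_birth": "1985-11-02",
     "department": "Sales", "satisfaction": 2},
]

research_rows, crosswalk = pseudonymise(linked_rows)
```

The study identifiers are generated at random rather than computed from the employee number, so that holding the research file alone does not reveal the original identifier.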
Data linkage must comply with relevant data protection / information privacy laws. In British Columbia, the privacy law that applies to employee information is the Personal Information Protection Act. An organization may disclose, without the individual’s consent, personal information for a research purpose, only if certain conditions are met.
Summary: Combining data from different sources
What we've covered in this article
- Why combine data from different sources?
- Limitations of using administrative data
- How to combine data from different sources
- Concerns about combining data from different sources