Once an individual data set has passed through the cleaning and validation process, it may need to be combined (or joined) with other data sets. This process is handled by the {targets} pipeline during the integration steps. There are two phases, the first is performing checks the second is the join operation.
The checks perform tests to ensure that the records in both data sets are compatible and match the expectations of the relationship between the two data sets. Secondly, the integration of the data sets is performed by using an SQL join operation. This is typically either a right-join, a left-join, an inner-join or a full-join.
The type of join operation selected depends on the relationship between the data sets. The critical information here are the primary-key (unique identifier) of the base table and the foreign-key of the table to be joined which is the attribute that should match the primary-key. In addition, the cardinality of the relationship is important to understand the expected result when joining the data. Below is an example of an entity relationship diagram that shows the relationship between two example data sets. Crow’s feet notation is used to illustrate that there is an optional 1:Many relationship with the left table and the right table. There is also a mandatory 1:1 relationship with the right table and the left table.
Also below is the relevant target that performs the joining operation and integrates these data sets together.
tar_target(integrated_mosq_field,
left_join(
x = fs_mosquito_field_semiclean,
y = longitudinal_identification_semiclean,
by = c("Batch_ID" = "batch_id")
)
)
It is critical that data validation steps are correctly performed to ensure the integration of multiple data sets is successful. In the case where there are missing, malformed or duplicate primary key, the expectations around the relationship type will not hold up.
An overview of this integration process is below.
Throughout the data cleaning pipeline, we take in raw data and convert it to some form of clean data. There are several intermediate steps in this process. A standard terminology has been adopted to describe the steps in this process.
Over the course of a long data collection exercise, standards and formats can diverge. This makes the data cleaning steps difficult and will slow down the ability to integrate data as above. Some general strategies can help to mitigate these risks:
janitor::clean_names()
.