Title: One Health Data Cleaning and Quality Checking Package
Description: This package provides useful functions to orchestrate analytics and data cleaning pipelines for One Health projects.
Authors: Collin Schwantes [cre, aut], Johana Teigen [aut], Ernest Guevarra [aut], Dean Marchiori [aut], Melinda Rostal [aut], EcoHealth Alliance [cph, fnd] (https://ror.org/02zv3m156)
Maintainer: Collin Schwantes <[email protected]>
License: MIT + file LICENSE
Version: 0.3.11
Built: 2024-11-22 02:51:15 UTC
Source: https://github.com/ecohealthalliance/ohcleandat
Compares two columns. Where there are differences, it extracts the values and compiles a correctly formatted validation log. This is intended to be used when an automated formatting correction is proposed in the data, but the actual updating of the records must happen via the validation log.
autobot(data, old_col, new_col, key)
data: data.frame or tibble
old_col: The existing column with formatting issues
new_col: The new column with corrections applied
key: Column that uniquely identifies the records in data

Value: tibble formatted as a validation log
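The core idea can be sketched in base R: compare the two columns, keep the rows that differ, and reshape them into log rows. The column names here (`farm_id`, `farm_id_new`, `row_id`) and the log layout are hypothetical illustrations, not the exact format `autobot()` emits.

```r
# Hypothetical input: an original column, a proposed correction, and a key
data <- data.frame(
  row_id      = 1:3,
  farm_id     = c("f-01", "F-02", "f-03"),
  farm_id_new = c("F-01", "F-02", "F-03"),
  stringsAsFactors = FALSE
)

# Keep only the records where the proposed correction differs
changed <- data[data$farm_id != data$farm_id_new, ]

# Reshape into a simple log: one row per proposed change
log <- data.frame(
  key       = changed$row_id,
  field     = "farm_id",
  old_value = changed$farm_id,
  new_value = changed$farm_id_new
)
log  # rows for row_id 1 and 3 only
```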
Returns rows in x without a match in y, returning only the selected columns. It is a thin wrapper around dplyr::anti_join().
check_id_existence(x, y, by, select_cols, ...)
x: data.frame or tibble containing the match id to check for non-existence in y
y: data.frame or tibble to check for non-existence of the match id from x
by: character containing the match id or, if named differently, a named character vector like c("a" = "b")
select_cols: character vector of columns to select in the output. Note that during the join, columns with identical names in both data sets will have a suffix of .x or .y added to disambiguate. These need to be added to ensure the correct column is returned.
...: other arguments passed to dplyr::anti_join

Value: tibble of rows from x without a match in y

See also: dplyr::anti_join
## Not run: 
check_id_existence(
  x, y,
  by = c("Batch_ID" = "batch_id"),
  select_cols = c("Batch_ID", "iDate", "Farm_ID")
)
## End(Not run)
A table that links classes to readr
column types.
Created from csv file of the same name in inst/
class_to_col_type
A data frame with 9 rows and 3 columns:
- Type of column as described in readr
- Class of R object that matches that column type
- Abbreviation for that column type from readr
...
class_to_col_type <- read.csv(file = "inst/class_to_col_type.csv")
usethis::use_data(class_to_col_type, overwrite = TRUE)
Checks whether a validation log already exists and appends new records from the current run.
combine_logs(existing_log, new_log)
existing_log: tibble. The existing validation log
new_log: tibble. The newly generated validation log

Value: tibble. The appended validation log for upload
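Conceptually this is an append with a guard for the first run; a minimal base-R sketch (the real column names and any de-duplication rules belong to the package and are not shown here):

```r
# Hypothetical logs sharing the same columns
existing_log <- data.frame(key = 1, field = "farm_id",   issue = "format")
new_log      <- data.frame(key = 2, field = "animal_id", issue = "format")

# If no existing log was found, use the new log as-is; otherwise append
combined <- if (is.null(existing_log)) new_log else rbind(existing_log, new_log)
nrow(combined)  # 2
```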
Takes a validation log and applies the required changes to the data
correct_data(validation_log, data, primary_key)
validation_log: tibble. A validation log
data: tibble. The original unclean data
primary_key: character. The quoted column name for the unique identifier in data

Value: tibble. The semi-clean data set
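A minimal sketch of the mechanic, assuming hypothetical log columns `key`, `field`, and `new_value`: match each log row to the data by primary key and write the corrected value back.

```r
data <- data.frame(id = 1:3,
                   species = c("catle", "goat", "shep"),
                   stringsAsFactors = FALSE)

# Hypothetical log: which record, which field, what the value should become
log <- data.frame(key = c(1, 3),
                  field = "species",
                  new_value = c("cattle", "sheep"),
                  stringsAsFactors = FALSE)

# Apply each correction by locating the record via the primary key
for (i in seq_len(nrow(log))) {
  row <- match(log$key[i], data$id)
  data[row, log$field[i]] <- log$new_value[i]
}

data$species  # "cattle" "goat" "sheep"
```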
Creates custom validation log for 'other: explain' free text responses that may contain valid multi-choice options.
create_freetext_log(response_data, form_schema, url, lookup)
response_data: data.frame. ODK questionnaire response data
form_schema: data.frame. ODK flattened form schema data
url: The ODK submission URL excluding the uuid identifier
lookup: a tibble formatted as a lookup to match questions with their free text responses. The format must match the output of othertext_lookup().
This function needs to link a survey question with its corresponding free text response. Users can use the othertext_lookup() function to handle this, or provide their own tibble in the same format. See below:

tibble::tribble(
  ~name,        ~other_name,
  "question_1", "question_1_other"
)

Value: data.frame validation log
## Not run: 
# Using the othertext_lookup helper
test_a <- create_freetext_log(
  response_data = animal_owner_semiclean,
  form_schema = animal_owner_schema,
  url = "https://odk.xyz.io/#/projects/5/forms/project/submissions",
  lookup = ohcleandat::othertext_lookup(questionnaire = "animal_owner")
)

# Using a custom lookup table
mylookup <- tibble::tribble(
  ~name,            ~other_name,
  "f2_species_own", "f2a_species_own_oexp"
)

test_b <- create_freetext_log(
  response_data = animal_owner_semiclean,
  form_schema = animal_owner_schema,
  url = "https://odk.xyz.io/#/projects/5/forms/project/submissions",
  lookup = mylookup
)
## End(Not run)
Create Validation Log for Questionnaire data
create_questionnaire_log(data, form_schema, pkey, rule_set, url)
data: data frame. Input data to be validated
form_schema: data frame. The ODK form schema data
pkey: character. A character vector giving the column name of the primary key or unique row identifier in the data
rule_set: a rule set of class validator from the validate package
url: The ODK submission URL excluding the uuid identifier

Value: a data frame formatted as a validation log for human review
Creates a rules file from a template to show general structure of the rule file.
create_rules_from_template(
  name,
  dir = "R",
  open = TRUE,
  showWarnings = FALSE,
  overwrite_file = FALSE
)
name: String. Name of the rule set function, e.g. create_rules_my_dataset
dir: String. Name of the directory where the file should be created. If it doesn't exist, the folder will be created.
open: Logical. Should the file be opened?
showWarnings: Logical. Should dir.create() show warnings?
overwrite_file: Logical. Should a rules file with the same name be overwritten?

Value: String. File path of the newly created file
## Not run: 
# create a ruleset and immediately open it
create_rules_from_template(name = "create_rules_field_data")

# create a ruleset and don't open it
create_rules_from_template(name = "create_rules_lab_data", open = FALSE)

# create a ruleset and store it in a different folder
create_rules_from_template(
  name = "create_rules_lab_data",
  dir = "/path/to/rulesets",
  open = FALSE
)
## End(Not run)
This is the metadata that describes the data themselves. This metadata can be generated then joined to pre-existing metadata via field names.
create_structural_metadata(
  data,
  primary_key = "",
  foreign_key = "",
  additional_elements = tibble::tibble()
)
data: Any named object. Expects a table but will work superficially with lists or named vectors.
primary_key: Character. Name of the field that serves as a primary key
foreign_key: Character. Field or fields that are foreign keys
additional_elements: Empty tibble with structural metadata elements and their types.
The metadata table produced has the following elements:

name: The name of the field. This is taken as-is from data.
description: Description of that field. May be provided by a controlled vocabulary.
units: Units of measure for that field. May or may not apply.
term_uri: Universal Resource Identifier for a term from a controlled vocabulary or schema.
comments: Free text providing additional details about the field.
primary_key: TRUE or FALSE. Uniquely identifies each record in the data.
foreign_key: TRUE or FALSE. Allows for linkages between data sets; uniquely identifies records in a different data set.

Value: dataframe with standard metadata requirements
## Not run: 
df <- data.frame(a = 1:10, b = letters[1:10])
df_metadata <- ohcleandat::create_structural_metadata(df)
write.csv(df_metadata, "df_metadata.csv")

# Additional elements can be added via a tibble
additional_elements <- tibble::tibble(
  table_name = NA_character_,
  created_by = NA_character_,
  updated = NA
)

df_metadata <- ohcleandat::create_structural_metadata(
  df,
  additional_elements = additional_elements
)

# Let's pretend we are using a dataset that already exists in Airtable.
# In Airtable, you can add field descriptions directly in the base. We want
# those exported and properly formatted in our ohcleandat workflow.
base <- "appMyBaseID"
table_name <- "My Table"

airtable_metadata <- airtabler::air_generate_metadata_from_api(
  base = base,
  field_names_to_snake_case = FALSE
) |>
  dplyr::filter(table_name == {table_name}) |>
  dplyr::select(field_name, field_desc, primary_key)

airtable_df <- airtabler::fetch_all(base = base, table_name = table_name)
airtable_df_metadata <- ohcleandat::create_structural_metadata(airtable_df)

metadata_joined <- dplyr::left_join(
  airtable_df_metadata, airtable_metadata,
  by = c("name" = "field_name")
)

metadata_updated <- metadata_joined |>
  dplyr::mutate(
    description = field_desc,
    primary_key = primary_key.y
  ) |>
  dplyr::select(-matches('\\.[xy]|field_desc'))

# ODK: get all choices from an ODK form
dotenv::load_dot_env()

ruODK::ru_setup(
  svc = "https://odk.server.org/v1/projects/5/forms/myproject.svc",
  un = Sys.getenv("ODK_USERNAME"),
  pw = Sys.getenv("ODK_PASSWORD"),
  tz = "GMT",
  odkc_version = "1.1.2"
)

schema <- ruODK::form_schema_ext()

schema$choices_flat <- schema$`choices_english_(en)` |>
  purrr::map_chr(\(x) {
    if ("labels" %in% names(x)) {
      paste(x$labels, collapse = ", ")
    } else {
      ""
    }
  })

# keep only the columns needed for the join below
schema_simple <- schema |> dplyr::select(ruodk_name, choices_flat)

data_odk <- ruODK::odata_submission_get()
data_odk_rect <- ruODK::odata_submission_rectangle(data_odk)

odk_metadata <- ohcleandat::create_structural_metadata(data_odk_rect)

odk_metadata_joined <- dplyr::left_join(
  odk_metadata, schema_simple,
  by = c("name" = "ruodk_name")
)

odk_metadata_choices <- odk_metadata_joined |>
  dplyr::mutate(description = choices_flat) |>
  dplyr::select(-choices_flat)
## End(Not run)
Collates free text responses from 'other' and 'notes' fields in the survey data. Some language detection is performed and placed in the log notes section for possible translation.
create_translation_log(response_data, form_schema, url)
response_data: data.frame of ODK questionnaire responses
form_schema: data.frame of the flattened ODK form schema
url: The ODK submission URL excluding the uuid identifier

Value: data.frame validation log
## Not run: 
create_translation_log(
  response_data = semi_clean_data,
  form_schema = odk_schema_data,
  url = "https://odk.xyz.io/#/projects/project-name/submissions"
)
## End(Not run)
Create Validation Log
create_validation_log(data, pkey, rule_set, ...)
data: data frame. Input data to be validated
pkey: character. A character vector giving the column name of the primary key or unique row identifier in the data
rule_set: a rule set of class validator from the validate package
...: other arguments passed to validate::confront

Value: a data frame formatted as a validation log for human review
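The underlying validate workflow looks roughly like this; the rule set and data are invented for illustration, and the step that turns confront() output into the package's log format is omitted.

```r
library(validate)

# A toy rule set of class `validator`
rules <- validator(
  age >= 0,
  species %in% c("cattle", "goat", "sheep")
)

toy <- data.frame(age = c(4, -1), species = c("cattle", "camel"))

# Confront the data with the rules; each failing cell is a candidate log row
cf <- confront(toy, rules)
summary(cf)[, c("name", "items", "passes", "fails")]
```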
A function that extracts the top guess of the language of a piece of text.
detect_language(text)
text: character. Any text string

Utilizes the stringi package encoding detector as the means to infer language.

Value: character estimate of the language abbreviation
detect_language(text = "buongiorno")
Downloads files from dropbox into a given directory
download_dropbox(dropbox_path, dropbox_filename, download_path, ...)
dropbox_path: character. The formal folder path on dropbox
dropbox_filename: character. The formal file name on dropbox
download_path: character. Local file path to download the file to
...: other arguments passed to rdrop2::drop_download

Value: returns the file path if successful
## Not run: 
download_dropbox(
  dropbox_path = "XYZ/Project-Datasets",
  dropbox_filename = "Project dataset as at 01-02-2024.xlsx",
  download_path = here::here("data"),
  overwrite = TRUE
)
## End(Not run)
For a given Google Drive folder this function will find and download all files matching a given pattern.
download_googledrive_files(
  key_path,
  drive_path,
  search_pattern,
  MIME_type = NULL,
  out_path
)
key_path: character. Path to the Google authentication key
drive_path: character. The Google Drive folder path
search_pattern: character. A search pattern for files in the Google Drive
MIME_type: character. Google Drive file type, file extension, or MIME type
out_path: character. The local directory files will be downloaded to

Note: this relies on the googledrive::drive_ls() function, which uses a search function and is not deterministic when searching recursively. Please pay attention to what is returned.

Value: a character vector of the files downloaded
## Not run: 
download_googledrive_files(
  key_path = here::here("./key.json"),
  drive_path = "https://drive.google.com/drive/u/0/folders/asdjfnasiffas8ef7y7y89rf",
  search_pattern = ".*\\.xlsx",
  out_path = here::here("data/project_data/")
)
## End(Not run)
Uploads a local file to Dropbox and handles authentication. Automatically zips files over 300 MB by default.
dropbox_upload(log, file_path, dropbox_path, compress = TRUE)
log: dataframe. Validation log for OH cleaning pipelines. Will work with any tabular data.
file_path: character. Local file path for upload
dropbox_path: character. Relative Dropbox path
compress: logical. Should files over 300 MB be compressed?

This is a wrapper around rdrop2::drop_upload() which first reads in a local CSV file and then uploads it to a Dropbox path.

Value: performs the Dropbox upload
## Not run: 
dropbox_upload(
  kzn_animal_ship_semiclean,
  file_path = here::here("outputs/data.csv"),
  dropbox_path = "XYZ/Data/semi_clean_data"
)
## End(Not run)
Loops over elements in the structural metadata and adds them to frictionless metadata schema. Will overwrite existing values.
expand_frictionless_metadata(
  structural_metadata,
  resource_name,
  resource_path,
  data_package_path,
  prune_datapackage = TRUE
)
structural_metadata: Dataframe. Structural metadata from create_structural_metadata()
resource_name: Character. Item within the datapackage to be updated
resource_path: Character. Path to the csv file
data_package_path: Character. Path to the datapackage.json file
prune_datapackage: Logical. Should properties not in the structural metadata be removed?

Value: updates the datapackage, returns nothing
## Not run: 
# read in file
data_path <- "my/data.csv"
data <- read.csv(data_path)

# create structural metadata
data_codebook <- create_structural_metadata(data)

# update structural metadata
write.csv(data_codebook, "my/codebook.csv", row.names = FALSE)
data_codebook_updated <- read.csv("my/codebook.csv")

# create a frictionless package - this is done automatically with the
# deposits package
my_package <- create_package() |>
  add_resource(resource_name = "data", data = data_path)

write_package(my_package, "my")

expand_frictionless_metadata(
  structural_metadata = data_codebook_updated,
  resource_name = "data",
  resource_path = data_path,
  data_package_path = "my/datapackage.json"
)
## End(Not run)
Downloads existing validation logs that are stored on dropbox
get_dropbox_val_logs(file_name, folder, path_name)
file_name: character. File name, with extension, of the validation log. Note that the file may have been zipped on upload if it is over 300 MB. The file will be automatically unzipped on download, so provide the extension of the uncompressed file, not the zipped file, e.g. "val_log.csv" even if it is stored on Dropbox as "val_log.zip".
folder: character. The folder the log is saved in on Dropbox. Can be NULL if not in a subfolder.
path_name: character. The default Dropbox path

This function will check whether the log exists and return NULL if not. Otherwise it will download the file locally to the 'dropbox_validations' directory and read it into the session.

Value: tibble. A validation log
## Not run: 
get_dropbox_val_logs(file_name = "log.csv", folder = NULL)
## End(Not run)
This function handles the authentication and pulling of questionnaire form schema information.
get_odk_form_schema(
  url,
  un = Sys.getenv("ODK_USERNAME"),
  pw = Sys.getenv("ODK_PASSWORD"),
  odkc_version = Sys.getenv("ODKC_VERSION")
)
url: character. The survey URL
un: character. The ODK account username
pw: character. The ODK account password
odkc_version: character. The ODK Central version string

This is a wrapper around the ruODK package. It handles the setup and authentication. See https://github.com/ropensci/ruODK

Value: data frame of the survey form schema
## Not run: 
get_odk_form_schema(
  url = "https://odk.xyz.io/v1/projects/5/forms/survey.svc",
  un = Sys.getenv("ODK_USERNAME"),
  pw = Sys.getenv("ODK_PASSWORD"),
  odkc_version = Sys.getenv("ODKC_VERSION")
)
## End(Not run)
This function handles the authentication and pulling of response data for ODK questionnaires. The raw return list is 'rectangularized' into a data frame first. See the ruODK package for more info on how this happens.
get_odk_responses(
  url,
  un = Sys.getenv("ODK_USERNAME"),
  pw = Sys.getenv("ODK_PASSWORD"),
  odkc_version = Sys.getenv("ODKC_VERSION")
)
url: character. The survey URL
un: character. The ODK account username
pw: character. The ODK account password
odkc_version: character. The ODK version

This is a wrapper around the ruODK package. It handles the setup and authentication. See https://github.com/ropensci/ruODK

Value: data.frame of flattened survey responses
## Not run: 
get_odk_responses(
  url = "https://odk.xyz.io/v1/projects/5/forms/survey.svc",
  un = Sys.getenv("ODK_USERNAME"),
  pw = Sys.getenv("ODK_PASSWORD"),
  odkc_version = Sys.getenv("ODKC_VERSION")
)
## End(Not run)
Get Precision
get_precision(x, func = c, ...)
x: Numeric. Vector of gps points
func: Function. Apply some function to the vector of precisions. The default is c so that all values are returned.
...: Additional arguments to pass to func

Value: output of func, likely a vector

Author: Nathan Layman
x <- c(1, 100, 1.11)
get_precision(x, func = min)
This function maps the relationship between animal species and hum_anim_id codes. This is for use in id_checker()
get_species_letter(
  species = c("human", "cattle", "small_mammal", "sheep", "goat")
)
species: character. The species identifier. See argument options

Value: character. The hum_anim_id code
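The mapping can be pictured as a named lookup vector; the letter codes below are purely hypothetical placeholders, not the codes the package actually assigns.

```r
# Hypothetical species-to-code lookup (codes invented for illustration)
species_letter <- c(human = "H", cattle = "C", small_mammal = "M",
                    sheep = "S", goat = "G")

get_species_letter_sketch <- function(species = names(species_letter)) {
  species <- match.arg(species)   # mimic the argument-options behaviour
  unname(species_letter[species])
}

get_species_letter_sketch("cattle")  # "C"
```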
Uses the column class to set the readr column type.
guess_col_type(data, default_col_abv = "c")
data: data.frame. Data whose column types you would like to guess
default_col_abv: string. Column type abbreviation from readr to use as the default

Value: character vector of column abbreviations
data <- data.frame(
  time = Sys.time(),
  char = "hello",
  num = 1,
  log = TRUE,
  date = Sys.Date(),
  list_col = list("hello")
)

guess_col_type(data)

## change the default column abbreviation
guess_col_type(data, default_col_abv = "g")
General function for checking and correcting ID columns.
id_checker(col, type = c("animal", "hum_anim", "site"), ...)
col: The vector of IDs to be checked
type: The ID type; see argument options for allowable settings
...: other arguments passed to get_species_letter

In order to use the autobot process for correcting ID columns, a new 'corrected' column is created by the user using the id_checker() function. It takes an existing vector of IDs and an ID type (animal, mosquito, etc.) and applies the bespoke corrections. This can then be consumed by the autobot log.

Value: vector of corrected IDs
## Not run: 
# with a species identifier
data |>
  mutate(animal_id_new = id_checker(animal_id, type = "animal", species = "cattle"))

data |>
  mutate(farm_id_new = id_checker(farm_id, type = "site"))
## End(Not run)
Several HTML reports are emailed via an automated process. To do this, a secure URL is generated as a download link. This function is intended for use in an opinionated targets pipeline.
make_report_urls(aws_deploy_target, pattern = "")
aws_deploy_target: List. Output from aws_s3_upload
pattern: String. Regex pattern for matching file paths

Value: character. URL for the report

Author: Collin Schwantes
Takes a file path, removes the extension, and replaces it with .zip.
make_zip_path(file_path)
file_path: character. A file path

Value: character. String where the extension is replaced by .zip
file_path <- "hello.csv"
make_zip_path(file_path)

file_path_with_dir <- "foo/bar/hello.csv"
make_zip_path(file_path_with_dir)
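The transformation amounts to a one-line regex substitution; a base-R equivalent (a sketch, not the package's exact implementation):

```r
make_zip_path_sketch <- function(file_path) {
  # drop the final extension and append .zip
  sub("\\.[^.]*$", ".zip", file_path)
}

make_zip_path_sketch("hello.csv")          # "hello.zip"
make_zip_path_sketch("foo/bar/hello.csv")  # "foo/bar/hello.zip"
```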
This function fuzzes gps points by first adding error then rounding to a certain number of digits.
obfuscate_gps(
  x,
  precision = 2,
  fuzz = 0.125,
  type = c("lat", "lon"),
  func = min,
  ...
)

obfuscate_lat(x, precision = 2, fuzz = 0.125)

obfuscate_lon(x, precision = 2, fuzz = 0.125)
x: Numeric. Vector of gps points
precision: Integer. Number of digits to keep. See get_precision()
fuzz: Numeric. Positive number indicating how much error to introduce to the gps measurements. This is used to generate the random uniform distribution.
type: Character. One of "lat" or "lon"
func: Function. Function used in get_precision()
...: Additional arguments for func

Value: Numeric. A vector of fuzzed and rounded GPS points. obfuscate_lat() and obfuscate_lon() each return a numeric vector.
# make data
gps_data <- data.frame(
  lat = c(1.0001, 10.22223, 4.00588),
  lon = c(2.39595, 4.506930, -60.09999901)
)

# Default obfuscation settings correspond to roughly a 27 by 27 km area
gps_data$lat |> obfuscate_gps(type = "lat")

# Obfuscation can be made more or less precise by changing the number of
# decimal points included or modifying the amount of fuzz (error) introduced
gps_data$lon |> obfuscate_gps(precision = 4, fuzz = 0.002, type = "lon")

### working at the poles
gps_data_poles <- data.frame(
  lat = c(89.0001, 89.22223, -89.8881),
  lon = c(2.39595, 4.506930, -60.09999901)
)
gps_data_poles$lat |> obfuscate_gps(fuzz = 1, type = "lat")

### working at the 180th meridian
gps_data_180 <- data.frame(
  lat = c(2, 3, 4),
  lon = c(179.39595, -179.506930, -178.09999901)
)
gps_data_180$lon |> obfuscate_gps(fuzz = 1, type = "lon")

### working with NA GPS data
gps_data_180 <- data.frame(
  lat = c(2, 3, 4),
  lon = c(179.39595, NA, -178.09999901)
)
gps_data_180$lon |> obfuscate_gps(fuzz = 1, type = "lon")

### GPS is on the fritz!
## Not run: 
gps_data_fritz <- data.frame(
  lat = c(91, -91, 90),
  lon = c(181.0001, -181.9877, -178.09999901)
)
gps_data_fritz$lon |> obfuscate_gps(fuzz = 1, type = "lon")
gps_data_fritz$lat |> obfuscate_gps(fuzz = 1, type = "lat")
## End(Not run)
Provides a lookup table matching ODK survey questions with their free text response question.
othertext_lookup(questionnaire = c("animal_owner"))
questionnaire: The ODK questionnaire. Used to ensure the correct lookup table is found.

In many ODK surveys, a multiple choice question can have a response for 'other', where the respondent can add free text as a response. There is no consistent link in the response data to match the captured responses and the other free text collected. This function provides a manual lookup reference so free text responses can be compared to the original questions in the validation workflow.

This function can be expanded by providing a tibble with two columns, name and other_name, which maps the question name in ODK to the question name containing the 'other' or free text response.

Value: tibble
othertext_lookup(questionnaire = c("animal_owner"))
Method to remove properties from the metadata for a dataset in a datapackage.
prune_datapackage(my_data_schema, structural_metadata)
my_data_schema: list. Schema object from frictionless
structural_metadata: dataframe. Structural metadata for a dataset

Value: the pruned data schema
For a given excel file, this will detect all sheets, and iteratively read all sheets and place them in a list.
If primary keys are added, the primary key is the triplet of the file, sheet name, and row number e.g. "file_xlsx_sheet1_1". Row numbering is based on the data ingested into R. R automatically skips empty rows at the beginning of the spreadsheet so id 1 in the primary key will belong to the first row with data.
read_excel_all_sheets(file, add_primary_key_field = FALSE, primary_key = "primary_key")
file |
character. File path to an excel file |
add_primary_key_field |
Logical. Should a primary key field be added? |
primary_key |
character. The column name for the unique identifier to be added to the data. |
list
The primary key method is possible because Excel forces sheet names to be unique.
## Not run:
# Adding primary key field
read_excel_all_sheets(file = "test_pk.xlsx", add_primary_key_field = TRUE)

# Don't add primary key field
read_excel_all_sheets(file = "test_pk.xlsx")
## End(Not run)
For a given sheet id, this handles authentication and reads in a specified sheet, or all sheets.
read_googlesheets(key_path, sheet = "all", ss, add_primary_key_field = FALSE, primary_key = "primary_key", ...)
key_path |
character path to Google authentication key json file |
sheet |
Sheet to read, in the sense of "worksheet" or "tab". |
ss |
Something that identifies a Google Sheet such as drive id or URL |
add_primary_key_field |
Logical. Should a primary key field be added? |
primary_key |
character. The column name for the unique identifier to be added to the data. |
... |
other arguments passed to |
tibble
## Not run:
read_googlesheets(ss = kzn_animal_ship_sheets, sheet = "all")
## End(Not run)
Filters out records matching a given string.
remove_deletions(x, val = "Delete")
x |
input vector |
val |
The value to check for inequality. Defaults to 'Delete' |
To be used within dplyr::filter(). The function returns a logical vector
with TRUE for values that are not equal to the val argument, and it also
protects against NA values.
Used within verbs such as dplyr::if_all(), this can work effectively across
all columns in a data frame. See examples.
logical vector
## Not run:
data |> filter(if_all(everything(), remove_deletions))
## End(Not run)
Unlike setdiff(), this function takes the union of x and y and then removes values that are in the intersection, returning values that are unique to x and values that are unique to y (the symmetric difference).
set_diff(x, y)
x |
a set of values. |
y |
a set of values. |
Unique values from x and y; NULL if there are no unique values.
a <- 1:3
b <- 2:4
set_diff(a, b) # returns 1, 4

x <- 1:3
y <- 1:3
set_diff(x, y) # returns NULL
This function overwrites the descriptive metadata associated with a frictionless datapackage. It does NOT validate the metadata, or check for conflicts with existing descriptive metadata. It is very easy to create invalid metadata.
update_frictionless_metadata(descriptive_metadata, data_package_path)
descriptive_metadata |
List of descriptive metadata terms. |
data_package_path |
Character. Path to datapackage.json file |
invisibly writes datapackage.json
## Not run:
descriptive_metadata <- list(
  title = "Example Dataset",
  description = "This is the abstract but it needs more detail",
  creator = list(
    list(name = "A. Person"),
    list(name = "B. Person"),
    list(name = "C. Person"),
    list(name = "F. Person")
  )
  # , accessRights = "open"
)

update_frictionless_metadata(
  descriptive_metadata = descriptive_metadata,
  data_package_path = "data_examples/datapackage.json"
)
## End(Not run)
Appends rows and/or columns to existing metadata, changes the primary key, and/or adds foreign keys.
update_structural_metadata(data, metadata, primary_key = "", foreign_key = "", additional_elements = tibble::tibble())
data |
Any named object. Expects a table but will work superficially with lists or named vectors. |
metadata |
Data frame. Output from |
primary_key |
Character. OPTIONAL Primary key in the data |
foreign_key |
Character. OPTIONAL Foreign key or keys in the data |
additional_elements |
data frame. OPTIONAL Empty tibble with structural metadata elements and their types. |
data.frame
See vignette on metadata for examples
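A hedged sketch of a typical call, assuming structural metadata was generated earlier in the pipeline. The object names and key name below are illustrative, not from the package.

```r
## Not run:
# Add a column to the data, then update the metadata so it describes
# the new column and records the primary key
my_data$row_id <- seq_len(nrow(my_data))

updated_metadata <- update_structural_metadata(
  data = my_data,
  metadata = existing_metadata,  # structural metadata created previously
  primary_key = "row_id"
)
## End(Not run)
```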
Validation correction checks to be run on data before and after corrections are applied, to test expectations.
validation_checks(validation_log, before_data, after_data, idcol)
validation_log |
tibble Validation log |
before_data |
tibble Data before corrections |
after_data |
tibble Data after corrections |
idcol |
character the primary key for the 'after_data' |
As part of the OH cleaning pipelines, raw data is converted to 'semi-clean' data through a process of upserting records from an external Validation Log. To ensure these corrections were made as expected, some checks are performed in this function.
If no existing log exists, no changes are made to the data:
Same variables
Same rows
No unequal values
If a log exists but no changes are recommended, no changes are made to the data:
Same variables
Same rows
No unequal values
If a log exists and changes are recommended, the number of changes matches the log:
Same variables
Same rows
Number of changing records in the data matches the number of records in the log
Correct fields and records are being updated
Checks that the before and after data have the same variables and rows.
Checks that the variable names and row indexes are the same in the logs and the changed data.
NULL if passed or stops with error
## Not run:
validation_checks(
  validation_log = kzn_animal_ship_existing_log,
  before_data = kzn_animal_ship,
  after_data = kzn_animal_ship_semiclean,
  idcol = "animal_id"
)
## End(Not run)