| Title: | Validate, Share, and Download Data |
|---|---|
| Description: | Designed to enhance data validation and management by providing functions that read a set of rules from a CSV or Excel file and apply them to a dataset. Funded by the National Renewable Energy Laboratory and the Possibility Lab; maintained by the Moore Institute for Plastic Pollution Research. |
| Authors: | Hannah Sherrod [cre, aut] |
| Maintainer: | Hannah Sherrod <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.5 |
| Built: | 2025-03-07 05:20:59 UTC |
| Source: | https://github.com/moore-institute-4-plastic-pollution-res/one4all |
This function creates a data frame with certificate information including the current time, data and rule hashes, package version, and web hash.
certificate_df(x, time = Sys.time())
x | A list containing 'data_formatted' and 'rules' elements. |
time | The time at which the certificate is generated; a value can be supplied, otherwise the current system time is used. |
A data frame with certificate information.
certificate_df(x = list(data_formatted = data.frame(a = 1:3, b = 4:6), rules = validate::validator(a > 0, b > 0)), time = Sys.time())
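The help text does not show how the hashes are computed; as a rough sketch (assuming the 'digest' package and a simplified structure, not the package's actual internals), certificate-style metadata could be assembled like this:
library(digest)
# Hypothetical helper sketching how certificate metadata might be assembled.
make_certificate_sketch <- function(x, time = Sys.time()) {
  data.frame(
    time = time,
    data_hash = digest(x$data_formatted, algo = "sha256"),
    rules_hash = digest(x$rules, algo = "sha256"),
    package_version = as.character(utils::packageVersion("one4all")),
    stringsAsFactors = FALSE
  )
}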
This function checks if a file with a given name exists in a specified zip file.
check_exists_in_zip(zip_path, file_name)
zip_path | A character string representing the path of the zip file. |
file_name | A character string representing the name of the file to check. |
A logical value indicating whether the file exists in the zip file (TRUE) or not (FALSE).
## Not run:
check_exists_in_zip(zip_path = "/path/to/your.zip", file_name = "file/in/zip.csv")
## End(Not run)
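A minimal sketch of the underlying idea, using only base utilities (utils::unzip() with list = TRUE returns the archive listing without extracting anything); the paths are placeholders and this is not necessarily the package's implementation:
# Sketch: list the archive contents and test membership.
exists_in_zip_sketch <- function(zip_path, file_name) {
  listing <- utils::unzip(zip_path, list = TRUE)  # data frame with a Name column
  file_name %in% listing$Name
}
# exists_in_zip_sketch("archive.zip", "data/samples.csv")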
This function checks for the presence of files with extensions known to be associated with malicious activities. The function can be used to screen zip files or individual files for these potentially dangerous file types.
check_for_malicious_files(files)
files | A character vector of file paths. These can be paths to zip files or individual files. |
A logical value indicating if any of the files in the input have a malicious file extension. Returns 'TRUE' if any malicious file is found, otherwise 'FALSE'.
## Not run:
check_for_malicious_files("path'(s)'/to/your/files")
check_for_malicious_files(utils::unzip("path/to/your/file.zip", list = TRUE)$Name)
## End(Not run)
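The package's actual blocklist is not reproduced here; a hedged sketch of the general technique, with an illustrative (not authoritative) set of extensions:
# Sketch: flag files whose extensions appear on an example blocklist.
# The extension list below is illustrative only.
has_malicious_extension <- function(files) {
  blocked <- c("exe", "bat", "cmd", "js", "vbs", "scr", "dll")
  any(tolower(tools::file_ext(files)) %in% blocked)
}
# has_malicious_extension(c("report.csv", "setup.exe"))  # TRUE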
This function checks if the input string contains an image URL (PNG or JPG) and formats it as an HTML img tag with a specified height.
check_images(x)
x | A character string to check for image URLs. |
A character string with the HTML img tag if an image URL is found, otherwise the input string.
check_images("https://example.com/image.png") check_images("https://example.com/image.jpg") check_images("https://example.com/text")
check_images("https://example.com/image.png") check_images("https://example.com/image.jpg") check_images("https://example.com/text")
This function checks if the input string contains a non-image hyperlink and formats it as an HTML anchor tag.
check_other_hyperlinks(x)
x | A character string to check for non-image hyperlinks. |
A character string with the HTML anchor tag if a non-image hyperlink is found, otherwise the input string.
check_other_hyperlinks("https://example.com/page") check_other_hyperlinks("https://example.com/image.png") check_other_hyperlinks("https://example.com/image.jpg")
check_other_hyperlinks("https://example.com/page") check_other_hyperlinks("https://example.com/image.png") check_other_hyperlinks("https://example.com/image.jpg")
This function checks if a given number passes the Luhn algorithm. It is commonly used to validate credit card numbers.
checkLuhn(number)
number | A character string of the number to check against the Luhn algorithm. |
A logical value indicating whether the number passes the Luhn algorithm (TRUE) or not (FALSE).
checkLuhn("4532015112830366") # TRUE checkLuhn("4532015112830367") # FALSE
checkLuhn("4532015112830366") # TRUE checkLuhn("4532015112830367") # FALSE
This function creates an Excel file with conditional formatting and data validation based on the validation rules supplied in a CSV or Excel file. It is currently compatible with Windows and Linux. On macOS the Excel file can still be downloaded, but there are some known bugs in the formatting of the LOOKUP sheet.
create_valid_excel(
  file_rules,
  negStyle = createStyle(fontColour = "#9C0006", bgFill = "#FFC7CE"),
  posStyle = createStyle(fontColour = "#006100", bgFill = "#C6EFCE"),
  row_num = 1000
)
file_rules | A CSV or Excel file containing validation rules. |
negStyle | Style to apply for negative conditions (default is red text on a pink background). |
posStyle | Style to apply for positive conditions (default is green text on a light green background). |
row_num | Number of rows to create in the output file (default is 1000). |
A workbook object containing the formatted Excel file.
data("test_rules") create_valid_excel(file_rules = test_rules)
data("test_rules") create_valid_excel(file_rules = test_rules)
This function allows users to download all data rather than one data set at a time.
download_all(
  file_path = NULL,
  s3_key_id = NULL,
  s3_secret_key = NULL,
  s3_region = NULL,
  s3_bucket = NULL,
  callback = NULL
)
file_path | Location and name of the zip file to create. |
s3_key_id | A character string representing the AWS S3 access key ID. |
s3_secret_key | A character string representing the AWS S3 secret access key. |
s3_region | A character string representing the AWS S3 region. |
s3_bucket | A character string representing the AWS S3 bucket name. |
callback | Prints whether the download was a success. |
Any return objects from the downloads.
## Not run:
download_all_data <- download_all(
  file_path = "your/path/file.zip",
  s3_key_id = "your_s3_key_id",
  s3_secret_key = "your_s3_secret_key",
  s3_region = "your_s3_region",
  s3_bucket = "your_s3_bucket",
  callback = NULL
)
## End(Not run)
This is a list containing three data frames, provided as an example of invalid data.
A list with 3 data frames:
A data frame with 18 variables: MethodologyID, SamplingDevice, AirFiltration, AirFiltrationType, ClothingPolicy, NonplasticPolicy, SealedEnvironment, SealedEnvironmentType, SieveMeshSizes, FilterType, FilterDiameter, FilterPoreSize, VisIDMethod, VisualSoftware, PickingStrategy, VisMagnification, MatIDMethod, MatIDSoftware
A data frame with 8 variables: SampleID, OwnerOrganization, AnalysisOrganization, ReportingOrganization, Latitude, Longitude, CollectionDate, SampleVolume
A data frame with 17 variables: ParticleID, MethodologyID, SampleID, PhotoID, SpectraID, FinalAnalysisDate, Comments, Polymer, Morphology, Color, Length, Width, Height, Mass, SurfaceArea, Volume, Tactile
data("invalid_example")
data("invalid_example")
This function checks if the given object is of class POSIXct. It returns TRUE if the object inherits from the POSIXct class, otherwise FALSE.
is.POSIXct(x)
x | An object to be tested for POSIXct class inheritance. |
A logical value indicating if the input object is of class POSIXct.
x <- as.POSIXct("2021-01-01")
is.POSIXct(x) # TRUE
y <- Sys.Date()
is.POSIXct(y) # FALSE
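The check reduces to class inheritance; a one-line sketch of the idea (not necessarily the package's exact definition):
# Sketch: POSIXct detection via class inheritance.
is_posixct_sketch <- function(x) inherits(x, "POSIXct")
is_posixct_sketch(as.POSIXct("2021-01-01"))  # TRUE
is_posixct_sketch(Sys.Date())                # FALSE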
This function extracts the names of the datasets provided in the input files. If specific data names are provided, they are used; otherwise the function tries to extract the names from the files themselves.
name_data(files_data, data_names = NULL)
files_data | A vector of file paths or list of data frames. |
data_names | A vector of names to be assigned to datasets. |
A vector of dataset names.
name_data(files_data = c("path/to/data1.csv", "path/to/data2.csv"))
name_data(files_data = c("path/to/data.xlsx"), data_names = c("sheet1", "sheet2"))
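For the file-path case, name extraction typically amounts to stripping directories and extensions; a hedged base-R sketch (the package's own logic also covers Excel sheets and lists of data frames):
# Sketch: derive dataset names from file paths when none are supplied.
names_from_paths <- function(files_data, data_names = NULL) {
  if (!is.null(data_names)) return(data_names)
  tools::file_path_sans_ext(basename(files_data))
}
names_from_paths(c("path/to/data1.csv", "path/to/data2.csv"))  # "data1" "data2"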
This function queries a MongoDB database through its API to retrieve a document by its ObjectId. Use the MongoDB Atlas Data API to create an API key.
query_document_by_object_id(apiKey, collection, database, dataSource, objectId)
apiKey | The API key for accessing the MongoDB API. |
collection | The name of the collection in the MongoDB database. |
database | The name of the MongoDB database. |
dataSource | The data source in MongoDB. |
objectId | The object ID of the document to query. |
The queried document.
## Not run:
apiKey <- 'your_mongodb_api_key'
collection <- 'your_mongodb_collection'
database <- 'your_database'
dataSource <- 'your_dataSource'
objectId <- 'example_object_id'
query_document_by_object_id(apiKey, collection, database, dataSource, objectId)
## End(Not run)
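As a hedged illustration of the Data API pattern this function wraps (the endpoint URL and app id below are placeholders for your own Atlas app's Data API URL, and this sketch is not the package's internal code):
library(httr)
library(jsonlite)
# Sketch: findOne by ObjectId via the MongoDB Atlas Data API.
find_by_object_id_sketch <- function(apiKey, collection, database, dataSource, objectId) {
  url <- "https://data.mongodb-api.com/app/<your-app-id>/endpoint/data/v1/action/findOne"  # placeholder
  body <- list(
    dataSource = dataSource,
    database = database,
    collection = collection,
    filter = list(`_id` = list(`$oid` = objectId))
  )
  resp <- POST(url,
               add_headers(`api-key` = apiKey, `Content-Type` = "application/json"),
               body = toJSON(body, auto_unbox = TRUE))
  content(resp, as = "parsed")
}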
Read and format data from CSV or XLSX files.
read_data(files_data, data_names = NULL)
files_data | List of files to be read. |
data_names | Optional vector of names for the data frames. |
A list of data frames
data("valid_example")
read_data(files_data = valid_example, data_names = c("methodology", "particles", "samples"))
This function reads rules from a file or a data frame. The file can be in CSV or XLSX format. The data should have the column names "name", "description", "dataset", "valid example", "severity", and "rule". The function also checks that the rules do not contain sensitive words and that all rule fields are of character type.
read_rules(file_rules)
file_rules | The file containing the rules. Can be a CSV or XLSX file, or a data frame. |
A data frame containing the rules.
## Not run:
read_rules("path/to/rules")
## End(Not run)
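A hedged sketch of the basic shape of such a reader, using the column names documented above (the sensitive-word screening is not reproduced here, and this is not the package's own code):
# Sketch: read a rules CSV and confirm the documented columns exist and are character.
read_rules_sketch <- function(file_rules) {
  rules <- if (is.data.frame(file_rules)) {
    file_rules
  } else {
    read.csv(file_rules, check.names = FALSE, stringsAsFactors = FALSE)
  }
  required <- c("name", "description", "dataset", "valid example", "severity", "rule")
  missing_cols <- setdiff(required, names(rules))
  if (length(missing_cols) > 0) {
    stop("Missing rule columns: ", paste(missing_cols, collapse = ", "))
  }
  if (!all(vapply(rules[required], is.character, logical(1)))) {
    stop("All rule fields must be character.")
  }
  rules
}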
This function handles rule reformatting, dataset handling, and foreign key checks.
reformat_rules(rules, data_formatted, zip_data = NULL)
rules | A data.frame containing rules to be reformatted. |
data_formatted | A named list of data.frames with data. |
zip_data | A file path to a zip folder with additional data to check. |
A data.frame with reformatted rules.
data("test_rules") data("valid_example") reformat_rules(rules = test_rules, data_formatted = valid_example)
data("test_rules") data("valid_example") reformat_rules(rules = test_rules, data_formatted = valid_example)
This function downloads data from remote sources like CKAN, AWS S3, and MongoDB. It retrieves the data based on the hashed_data identifier and assumes the data is stored using the same naming conventions provided in the 'remote_share' function.
remote_download(
  hashed_data = NULL,
  ckan_url,
  ckan_key,
  ckan_package,
  s3_key_id,
  s3_secret_key,
  s3_region,
  s3_bucket,
  mongo_key,
  mongo_collection
)
hashed_data | A character string representing the hashed identifier of the data to be downloaded. |
ckan_url | A character string representing the CKAN base URL. |
ckan_key | A character string representing the CKAN API key. |
ckan_package | A character string representing the CKAN package identifier. |
s3_key_id | A character string representing the AWS S3 access key ID. |
s3_secret_key | A character string representing the AWS S3 secret access key. |
s3_region | A character string representing the AWS S3 region. |
s3_bucket | A character string representing the AWS S3 bucket name. |
mongo_key | A character string representing the mongo key. |
mongo_collection | A character string representing the mongo collection. |
A named list containing the downloaded datasets.
## Not run:
downloaded_data <- remote_download(
  hashed_data = "example_hash",
  ckan_url = "https://example.com",
  ckan_key = "your_ckan_key",
  ckan_package = "your_ckan_package",
  s3_key_id = "your_s3_key_id",
  s3_secret_key = "your_s3_secret_key",
  s3_region = "your_s3_region",
  s3_bucket = "your_s3_bucket",
  mongo_key = "mongo_key",
  mongo_collection = "mongo_collection"
)
## End(Not run)
This function downloads data from remote sources like CKAN and AWS S3. It retrieves the data based on the hashed_data identifier and assumes the data is stored using the same naming conventions provided in the 'remote_share' function.
remote_raw_download(
  hashed_data = NULL,
  file_path = NULL,
  ckan_url = NULL,
  ckan_key = NULL,
  ckan_package = NULL,
  s3_key_id = NULL,
  s3_secret_key = NULL,
  s3_region = NULL,
  s3_bucket = NULL
)
hashed_data | A character string representing the hashed identifier of the data to be downloaded. |
file_path | Location and name of the zip file to create. |
ckan_url | A character string representing the CKAN base URL. |
ckan_key | A character string representing the CKAN API key. |
ckan_package | A character string representing the CKAN package identifier. |
s3_key_id | A character string representing the AWS S3 access key ID. |
s3_secret_key | A character string representing the AWS S3 secret access key. |
s3_region | A character string representing the AWS S3 region. |
s3_bucket | A character string representing the AWS S3 bucket name. |
Any return objects from the downloads.
## Not run:
downloaded_data <- remote_raw_download(
  hashed_data = "example_hash",
  file_path = "your/path/file.zip",
  ckan_url = "https://example.com",
  ckan_key = "your_ckan_key",
  ckan_package = "your_ckan_package",
  s3_key_id = "your_s3_key_id",
  s3_secret_key = "your_s3_secret_key",
  s3_region = "your_s3_region",
  s3_bucket = "your_s3_bucket"
)
## End(Not run)
Get the rows in the data that violate the specified rules.
rows_for_rules(data_formatted, report, broken_rules, rows)
data_formatted | A formatted data frame. |
report | A validation report generated by the 'validate' function. |
broken_rules | A data frame with broken rules information. |
rows | A vector of row indices specifying which of the rules with errors to check for violations. |
A data frame with rows in the data that violate the specified rules.
data("invalid_example") data("test_rules") # Generate a validation report result_invalid <- validate_data(files_data = invalid_example, data_names = c("methodology", "particles", "samples"), file_rules = test_rules) # Find the broken rules broken_rules <- rules_broken(results = result_invalid$results[[1]], show_decision = TRUE) # Get rows for the specified rules violating_rows <- rows_for_rules(data_formatted = result_invalid$data_formatted[[1]], report = result_invalid$report[[1]], broken_rules = broken_rules, rows = 1)
data("invalid_example") data("test_rules") # Generate a validation report result_invalid <- validate_data(files_data = invalid_example, data_names = c("methodology", "particles", "samples"), file_rules = test_rules) # Find the broken rules broken_rules <- rules_broken(results = result_invalid$results[[1]], show_decision = TRUE) # Get rows for the specified rules violating_rows <- rows_for_rules(data_formatted = result_invalid$data_formatted[[1]], report = result_invalid$report[[1]], broken_rules = broken_rules, rows = 1)
Filter the results of validation to show only broken rules, optionally including successful decisions.
rules_broken(results, show_decision)
results | A data frame with validation results. |
show_decision | A logical value to indicate if successful decisions should be included in the output. |
A data frame with the filtered results.
# Sample validation results data frame
sample_results <- data.frame(
  description = c("Rule 1", "Rule 2", "Rule 3"),
  status = c("error", "success", "error"),
  name = c("rule1", "rule2", "rule3"),
  expression = c("col1 > 0", "col2 <= 5", "col3 != 10"),
  stringsAsFactors = FALSE
)
# Show only broken rules
broken_rules <- rules_broken(sample_results, show_decision = FALSE)
This wrapper function starts the user interface of the app of your choice.
run_app(
  path = "system",
  log = TRUE,
  ref = "main",
  test_mode = FALSE,
  app = "validator",
  ...
)
path | Path to store the downloaded app files; defaults to "system". |
log | logical; enables/disables logging to |
ref | Git reference; can be a commit, tag, or branch name. Defaults to "main". Only change this in case of errors. |
test_mode | logical; for internal testing only. |
app | Your app choice. |
... | arguments passed to |
After running this function the Validator, Microplastic Image Explorer, or Data Visualization GUI should open in a separate window or in your computer browser.
This function normally does not return any value; see runGitHub().
Hannah Sherrod, Nick Leong, Hannah Hapich, Fabian Gomez, Win Cowger
## Not run:
run_app(app = "validator")
## End(Not run)
This function checks if the input string contains any profane words.
test_profanity(x)
x | A character string to check for profanity. |
A logical value that is TRUE when the input string contains no profane words and FALSE otherwise.
test_profanity("This is a clean sentence.") test_profanity("This sentence contains a badword.")
test_profanity("This is a clean sentence.") test_profanity("This sentence contains a badword.")
A dataset of validation rules, giving each rule's name, description, associated dataset, a valid example, severity, and the rule expression itself.
A data frame with 6 columns:
Name of the rule (e.g., "MethodologyID_valid")
Description of the rule (e.g., "URL address is valid and can be found on the internet.")
Dataset associated with the rule (e.g., "methodology")
A valid example of the rule (e.g., "https://www.waterboards.ca.gov/drinking_water/certlic/drinkingwater/documents/microplastics/mcrplsts_plcy_drft.pdf")
Severity of the rule (e.g., "error")
The actual rule (e.g., "check_uploadable(MethodologyID) == TRUE")
data("test_rules")
data("test_rules")
This is a list containing three data frames, provided as an example of valid data.
A list with 3 data frames:
A data frame with 15 variables: MethodID, MatIDMethod, Equipment, Magnification, MethodComments, Protocols, Deployment, SamplingDevice, SmallestParticle, TopParticle, FilterType, FilterDiameter, FilterPoreSize, ImageFile, ImageType
A data frame with 131 variables: SampleID, SampleSize, Project, Affiliation, Citation, OwnerContributor, AnalysisContributor, ReportingContributor, SiteName, Location, Compartment, SampleComments, SamplingDepth, SamplingVolume, SamplingWeight, BlankContamination, Latitude, Longitude, Matrix, CollectionStartDateTime, CollectionEndDateTime, SpatialFile, Concentration, ConcentrationUnits, StandardizedConcentration, StandardizedConcentrationUnits, Color_Transparent, Color_Blue, Color_Red, Color_Brown, Color_Green, Color_Orange, Color_White, Color_Yellow, Color_Pink, Color_Black, Color_Other, Material_PEST, Material_PE, Material_PP, Material_PA, Material_PE_PS, Material_PS, Material_CA, Material_PVC, Material_ER, Material_PAM, Material_PET, Material_PlasticAdditive, Material_PBT, Material_PU, Material_PET_PEST, Material_PAN, Material_Silicone, Material_Acrylic, Material_Vinyl, Material_Vinyon, Material_Other, Material_PA_ER, Material_PTT, Material_PE_PP, Material_PPS, Material_Rayon, Material_PAA, Material_PMPS, Material_PI, Material_Olefin, Material_Styrene_Butadiene, Material_PBA, Material_PMMA, Material_Cellophane, Material_SAN, Material_PC, Material_PDMS, Material_PLA, Material_PTFE, Material_SBR, Material_PET_Olefin, Material_PES, Material_ABS, Material_LDPE, Material_PEVA, Material_AR, Material_PVA, Material_PPE, Morphology_Fragment, Morphology_Fiber, Morphology_Nurdle, Morphology_Film, Morphology_Foam, Morphology_Sphere, Morphology_Line, Morphology_Bead, Morphology_Sheet, Morphology_Film_Fragment, Morphology_Rubbery_Fragment, Size_3000um, Size_2_5mm, Size_1_5mm, Size_1_2mm, Size_0.5_1mm, Size_less_than_0.5mm, Size_500um, Size_300_500um, Size_125_300um, Size_100_500um, Size_greater_than_100um, Size_50_150um, Size_50_100um, Size_50um, Size_45_125um, Size_greater_than_25um, Size_20um_5mm, Size_20_100um, Size_20_50um, Size_10_50um, Size_10_45um, Size_10_20um, Size_greater_than_10um, Size_8_316um, Size_5_100um, Size_5_10um, Size_4_10um, Size_1.5_5um, Size_less_than_1.5um, Size_1_100um, Size_1_50um, Size_1_10um, Size_1_5um, Size_110_124nm, Size_0_20um
A data frame with 19 variables: ParticleID, Amount, Color, Polymer, Shape, PhotoID, ParticleComments, PlasticType, Length, Width, Height, Units, Mass, SurfaceArea, SizeDimension, Volume, Tactile, ArrivalDate, AnalysisDate
data("valid_example")
data("valid_example")
Validate data based on specified rules
validate_data(
  files_data,
  data_names = NULL,
  file_rules = NULL,
  zip_data = NULL
)
files_data | A list of file paths for the datasets to be validated. |
data_names | (Optional) A character vector of names for the datasets. If not provided, names will be extracted from the file paths. |
file_rules | A file path for the rules file, either in .csv or .xlsx format. |
zip_data | A file path to a zip folder for validating unstructured data. |
A list containing the following elements:
- data_formatted: A list of data frames with the validated data.
- data_names: A character vector of dataset names.
- report: A list of validation report objects for each dataset.
- results: A list of validation result data frames for each dataset.
- rules: A list of validator objects for each dataset.
- status: A character string indicating the overall validation status ("success" or "error").
- issues: A logical vector indicating if there are any issues in the validation results.
- message: A data.table containing information about any issues encountered.
# Validate data with specified rules
data("valid_example")
data("invalid_example")
data("test_rules")
result_valid <- validate_data(
  files_data = valid_example,
  data_names = c("methodology", "particles", "samples"),
  file_rules = test_rules
)
result_invalid <- validate_data(
  files_data = invalid_example,
  data_names = c("methodology", "particles", "samples"),
  file_rules = test_rules
)
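The returned list can then be inspected directly, building on the objects created above:
# Overall status and, for the invalid data, the rules that failed.
result_valid$status     # "success" when all rules pass
result_invalid$status   # "error" when any rule is violated
broken <- rules_broken(results = result_invalid$results[[1]], show_decision = TRUE)
head(broken)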