API Reference

DeIdDicts(maxdays, shiftyears, dateformat)

Structure containing the dictionaries for project-level mappings:

  • Primary ID -> Research ID
  • Research ID -> DateShift number of days
  • Research ID -> Salt value

ProjectConfig(config_file::String)

Structure containing the project-level settings from the configuration YAML file. This includes an array of FileConfig structures, which hold the dataset-level settings.

build_config(data_dir::String, config_file::String)

Interactively guides the user through writing a configuration YAML file for DeIdentification. The data_dir should contain one of each type of dataset you expect to deidentify (e.g. the data directory ./test/data contains pat.csv, med.csv, and dx.csv). The config builder reads the header of each CSV file and asks, column by column, for the output name and deidentification type. The results are written to config_file.

build_config_from_csv(project_name::String, file::String)

Generates a configuration YAML file from a CSV file that defines the mappings. The CSV file needs at least three named columns:

  • Source Table: the name of the CSV file the data will be read from
  • Field: the name of the field in the data source
  • Method: the method to apply (one of Hash - Research ID, Hash, Hash & Salt, Date Shift, or Drop)

Any column renames and pre- or post-processing will need to be added manually to the file.
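As an illustration of the expected layout, the following sketch writes a small mapping CSV using only Julia's standard library. The three column names and the method names come from the description above; the field names (PatientID, BirthDate, and so on) and the choice to include the .csv extension in Source Table are assumptions made for the example, not taken from the package.

```julia
# Hypothetical mapping rows; only the header and the method names are
# taken from the documentation, the field names are made up.
rows = [
    ("Source Table", "Field", "Method"),
    ("pat.csv", "PatientID", "Hash - Research ID"),
    ("pat.csv", "BirthDate", "Date Shift"),
    ("dx.csv", "EncounterID", "Hash & Salt"),
    ("dx.csv", "Comments", "Drop"),
]

# Write the rows as comma-separated lines to a temporary file.
mapping_path = joinpath(mktempdir(), "mapping.csv")
open(mapping_path, "w") do io
    for row in rows
        println(io, join(row, ","))
    end
end
```

A file like this would then be passed as the file argument of build_config_from_csv.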

deidentify(cfg::ProjectConfig)

This is the constructor for the DeIdentified struct. We use this type to store arrays of DeIdDataFrame variables while sharing a common salt_dict and dateshift_dict between DeIdDataFrames. The salt_dict tracks which salt was applied to which cleartext value, which is needed only for re-identification. The id_dict argument is a dictionary mapping the hash digest of each original primary ID to its new research ID.

deidentify(config_path)

Runs the entire pipeline: processes the configuration YAML file, de-identifies the data, and writes the data to disk. Returns the dictionaries containing the mappings.

FileConfig(name, filename, colmap, rename_cols)

Structure containing configuration information for each dataset in the configuration YAML file. The colmap maps each column name to its deidentification action (e.g. hash, salt, drop).

dateshift_val!(dicts, val, pid)

Shifts fields containing dates. Dates are shifted by at most the maximum number of days specified in the project config. All dates for the same primary key are shifted by the same number of days. Missing values are left missing.
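The behaviour described above can be sketched with the standard library alone. This is not the package's implementation: the names shift_days and shift_date are hypothetical, and drawing the offset uniformly from -maxdays:maxdays is an assumption.

```julia
using Dates
using Random

# Hypothetical stand-in for the date-shift dictionary (ID => number of days).
const shift_days = Dict{Int,Int}()

# Shift `val` by the offset stored for `id`. The offset is drawn once, the
# first time `id` is seen, and reused afterwards. Missing stays missing.
function shift_date(val::Union{Date,Missing}, id::Int; maxdays::Int=30,
                    rng::AbstractRNG=Random.default_rng())
    ismissing(val) && return missing
    offset = get!(() -> rand(rng, -maxdays:maxdays), shift_days, id)
    return val + Day(offset)
end

admitted = shift_date(Date(2020, 1, 15), 1)
discharged = shift_date(Date(2020, 2, 15), 1)
```

Because both dates belong to the same ID, they move by the same offset, so the interval between them is preserved.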

deid_file!(dicts, file_config, project_config, logger)

Reads the raw file and deidentifies it according to the file configuration and project configuration. Writes the deidentified data to a CSV file and updates the global dictionaries tracking identifier mappings.

getcurrentdate()

Returns the current date as a string conforming to ISO8601 basic format.

This is used to generate filenames in a cross-platform compatible way.
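A minimal sketch of such a formatter, using the Dates standard library. The name basic_timestamp is hypothetical, and including the time component is an assumption; the package may format only the date.

```julia
using Dates

# ISO 8601 basic format omits the "-" and ":" separators, so the result
# contains no characters that are reserved in Windows filenames.
basic_timestamp(t::DateTime=now()) = Dates.format(t, "yyyymmddTHHMMSS")

stamp = basic_timestamp(DateTime(2020, 3, 7, 9, 5, 2))
```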

hash_salt_val!(dicts, val, pid)

Salts and hashes fields containing unique identifiers. Hashing is done in place using SHA256 and a 64-bit salt. Missing values are left missing.
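The idea can be sketched as follows, using the SHA and Random standard libraries. The names salts and hash_with_salt are hypothetical, and both the salt encoding (hex) and the way it is combined with the value (appended before hashing) are assumptions rather than the package's documented behaviour.

```julia
using SHA
using Random

# Hypothetical stand-in for the salt dictionary (ID => salt).
const salts = Dict{Int,String}()

# Hash `val` with the salt stored for `id`, generating the salt on first use.
function hash_with_salt(val::Union{AbstractString,Missing}, id::Int)
    ismissing(val) && return missing
    # 8 random bytes = 64 bits, kept as hex so the same salt can be reused.
    salt = get!(() -> bytes2hex(rand(RandomDevice(), UInt8, 8)), salts, id)
    return bytes2hex(sha256(string(val, salt)))
end

digest = hash_with_salt("12345", 1)
```

Storing the salt per ID keeps repeated values consistent within one ID while making digests differ across IDs.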

setrid(val, dicts)

Maps the value passed (a hex string) to a human-readable integer ID. A new ID is generated if the value hasn't been seen before; otherwise the existing ID is returned.
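One way to get this behaviour is a dictionary lookup that inserts on a miss. The sketch below assumes IDs are assigned sequentially starting at 1, which the description does not state; research_ids and assign_rid! are hypothetical names.

```julia
# Hypothetical stand-in for the Primary ID -> Research ID dictionary.
const research_ids = Dict{String,Int}()

# Return the research ID for `digest`, assigning the next integer the
# first time the digest is seen.
assign_rid!(ids::Dict{String,Int}, digest::AbstractString) =
    get!(ids, String(digest), length(ids) + 1)

first_id = assign_rid!(research_ids, "ab12")
second_id = assign_rid!(research_ids, "cd34")
repeat_id = assign_rid!(research_ids, "ab12")
```

Here the repeated digest "ab12" gets back the ID it was given the first time, and no new entry is created.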

write_dicts(deid_dicts)

Writes the DeIdDicts structure to file. The dictionaries are written as JSON. The files are written to the output_path specified in the configuration YAML.

write_yaml(file::String, yml::AbstractDict)

Recursively writes a YAML object to file. A YAML object is a dictionary, which can contain arrays of YAML objects. See YAML.jl for more on the format.
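The recursion can be sketched as below. This is an illustrative writer, not the package's code: write_yaml_sketch is a hypothetical name, it writes to an IO stream rather than a file path, and it does no quoting or escaping of scalar values.

```julia
# Write `yml` to `io`, indenting nested content by two spaces per level.
function write_yaml_sketch(io::IO, yml::AbstractDict, indent::Int=0)
    pad = " "^indent
    for (key, value) in yml
        if value isa AbstractDict
            println(io, pad, key, ":")
            write_yaml_sketch(io, value, indent + 2)
        elseif value isa AbstractVector
            println(io, pad, key, ":")
            for item in value
                if item isa AbstractDict
                    # Render the nested mapping four spaces deeper, then swap
                    # the first line's indentation for a "- " list marker.
                    buf = IOBuffer()
                    write_yaml_sketch(buf, item, indent + 4)
                    text = String(take!(buf))
                    print(io, pad, "  - ", text[indent + 5:end])
                else
                    println(io, pad, "  - ", item)
                end
            end
        else
            println(io, pad, key, ": ", value)
        end
    end
end

io = IOBuffer()
write_yaml_sketch(io, Dict("datasets" => [Dict("name" => "dx")]))
yaml_text = String(take!(io))
```

Note that a plain Dict does not preserve insertion order, so the key order in the output may differ from the order in which keys were added.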
