Use Deidentified Data
Abbreviations Key | |
ATAC | assay for transposase-accessible chromatin |
dbGaP | Database of Genotypes and Phenotypes (NIH) |
DOB | date of birth |
GEO | Gene Expression Omnibus (NCBI) |
HIPAA | Health Insurance Portability and Accountability Act |
HISE | Human Immune System Explorer |
IDE | integrated development environment |
NCBI | National Center for Biotechnology Information |
NIH | National Institutes of Health |
PHI | protected health information |
PII | personally identifiable information |
RNAseq | ribonucleic acid sequencing |
SRA | Sequence Read Archive (NCBI) |
WGS | whole genome sequencing |
At a Glance
HISE supports the delivery of deidentified data from human subjects research. AIFI is careful not to ingest data that contains PHI. Instead, we use random subject IDs that can't be linked to specific information about human subjects. All sensitive data remains housed outside of HISE in third-party systems at collaborating clinical sites.
Description
Open science at AIFI depends on responsible collaboration with members of our partnership collective. When we exchange data, we aim to balance our scientific needs with the privacy rights of the human subjects who participate in our research. We also strive to maintain the trust of IRBs tasked with overseeing such research.
Data Sharing Policy
In addition to basic data about cohorts, samples, specimens, and subject demographics ingested through AIFI's LIMS, HISE accepts clinical questionnaire data, CBC results, and selected subject metadata. The time frame for public release of data is governed by applicable data-sharing agreements, laws, and regulations. Studies that receive even partial NIH funding, for example, are subject to agency regulations. These rules require that deidentified data be made publicly available no more than 12 months after the completion of each longitudinal study cohort. When we share such data in HISE, we follow NIH guidance, which recommends two data access tiers:
Tier 1
Tier 1 includes whole genome or whole exome sequencing data. If there is a risk that a patient could be identified, only deidentified data is shared. It's placed in an NIH-designated managed-access repository.
Tier 2
Tier 2 includes all other kinds of deidentified data. Such data poses little or no risk of exposing the patient's identity, and the data can therefore be shared publicly in HISE.
Definitions
For key terms that pertain to AIFI data sharing, see the following table.
Term | Definition |
information | Techniques and methods, test data, results (including pharmacological, toxicological, and clinical test data and results), analytical and quality control data, and algorithms. |
PHI | A subtype of PII that includes all individually identifiable health information, including demographic data, medical histories, test results, insurance information, and other information used to identify a patient or provide healthcare services or coverage. |
PII | Information that can be used to distinguish or trace an individual’s identity, either alone (direct) or in combination with other personal or identifying information linked to a specific individual (indirect). |
protected | Information covered by the Privacy Rule issued under HIPAA, a 1996 U.S. law that protects patients' privacy rights. |
sample | A biological sample obtained from a human subject. |
Data Masking
Data can be handled in a way that protects personally identifiable information (PII) but keeps the anonymized data available for analysis and testing. This process is called data masking. If you have a Data App, it's important to mask PII fields on your Certificate of Reproducibility (CertPro). You can either mask select fields on a vertex or delete the metadata for an entire vertex (that is, remove the vertex entry from the metadata field). If the metadata has a revision history, you should remove the other revisions.
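The two masking options described above can be sketched in Python. This is only an illustration under stated assumptions: the dictionary layout (a `metadata` mapping and a `revisions` list per vertex), the `MASK` placeholder, and both function names are hypothetical and are not part of the HISE or CertPro interface.

```python
from copy import deepcopy

MASK = "***MASKED***"  # hypothetical placeholder value


def mask_fields(vertex, pii_fields, mask=MASK):
    """Return a copy of one vertex with the named metadata fields masked."""
    masked = deepcopy(vertex)
    metadata = masked.get("metadata", {})
    for field in pii_fields:
        if field in metadata:
            metadata[field] = mask
    # Drop earlier revisions so unmasked values do not survive in the history.
    masked.pop("revisions", None)
    return masked


def drop_vertex_metadata(vertices, vertex_id):
    """Return a copy of the vertex mapping without the entry for vertex_id."""
    return {vid: deepcopy(v) for vid, v in vertices.items() if vid != vertex_id}
```

Both helpers return copies rather than editing in place, so the original record stays available until you confirm that the masked version is what you want to publish.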
Data Release
During the 12 months preceding public release of NIH-associated data, members of the partnership collective have full access to it in HISE. Supporting data for any interim published results is shared in accordance with journal requirements. This policy covers raw data files and analyzed data in the following areas:
Area | Special considerations (if any) |
Human metadata | To facilitate full deidentification, partner organizations that provide samples remove the month of the specified event, such as a blood draw. To allow longitudinal data analysis, they instead preserve a variable that represents, for example, the number of days elapsed since a specified baseline event, such as the number of days elapsed from the initial study sample collection to the current blood draw. |
Plasma proteomics/targeted proteomics | None |
Flow cytometry data | None |
Single cell and bulk ATACseq data | Processed H5 data files are made publicly available within the same time frame as single cell and bulk ATACseq data. Supporting RNAseq data for interim results is placed in an open access data repository, such as SRA or GEO (both hosted by NCBI), and H5 files are made publicly available in HISE. |
WGS | Supporting WGS data for interim results is placed in dbGaP or another NIH-designated controlled-access repository. |
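The elapsed-days variable described in the Human metadata row can be illustrated with a short Python sketch. The function name and the sample dates are made up for illustration; the actual conversion is performed by partner organizations before data reaches HISE.

```python
from datetime import date


def to_relative_days(baseline, event_dates):
    """Replace calendar dates with days elapsed since the baseline event."""
    return [(d - baseline).days for d in event_dates]


# Hypothetical blood draws; the first one is the baseline sample collection.
baseline = date(2021, 3, 15)
draws = [date(2021, 3, 15), date(2021, 4, 14), date(2021, 9, 11)]
print(to_relative_days(baseline, draws))  # [0, 30, 180]
```

Only the integer offsets would be shared. They preserve the spacing between visits for longitudinal analysis without exposing any calendar date.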
Data Processing
When we receive a metadata payload, we compare the data set with a known data dictionary and allowlist only recognized fields. All other data is rejected. We also validate selected fields. For example, the DOB field is validated to ensure that we receive the birth year and month but not the day.
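As a rough sketch of this kind of check, the Python below filters a payload against an allowlist and validates the DOB field. The field names, the `YYYY-MM` format, and the choice to drop individual fields (rather than reject the whole payload) are assumptions made for illustration, not a description of the HISE ingest code.

```python
import re

# Assumed DOB format: four-digit year and two-digit month, no day.
DOB_PATTERN = re.compile(r"\d{4}-(0[1-9]|1[0-2])")


def valid_dob(value):
    """Accept year and month only; a value that includes a day fails."""
    return isinstance(value, str) and DOB_PATTERN.fullmatch(value) is not None


# Hypothetical data dictionary: field name -> validator (None = accept as-is).
DATA_DICTIONARY = {
    "subject_id": None,
    "sex": None,
    "dob": valid_dob,
}


def process_payload(payload):
    """Keep recognized, valid fields; list every other field as rejected."""
    accepted, rejected = {}, []
    for field, value in payload.items():
        if field not in DATA_DICTIONARY:
            rejected.append(field)  # not in the data dictionary
            continue
        validator = DATA_DICTIONARY[field]
        if validator is not None and not validator(value):
            rejected.append(field)  # recognized, but failed validation
            continue
        accepted[field] = value
    return accepted, rejected
```

With this sketch, a payload such as `{"subject_id": "S1", "dob": "1980-05-17", "ssn": "x"}` keeps only `subject_id`: the DOB includes a day, and `ssn` is not in the dictionary.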
Revision history
Processed metadata is stored in a multitenant database. To record any changes in the metadata, we keep a revision history. It documents which entries were changed, when they were changed (for example, during reingest), and what the previous values were.
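A revision entry of this kind could look like the following Python sketch. The record layout (`changed_at`, `reason`, `previous`) and the function name are hypothetical; the source does not describe the actual database schema.

```python
from datetime import datetime, timezone


def record_revision(current, incoming, history, reason="reingest"):
    """Apply incoming values to current and log what changed, when, and why."""
    # Capture the previous value of every field whose value is about to change.
    previous = {k: current.get(k) for k, v in incoming.items() if current.get(k) != v}
    if previous:
        history.append({
            "changed_at": datetime.now(timezone.utc).isoformat(),
            "reason": reason,
            "previous": previous,
        })
    current.update(incoming)
    return current
```

A reingest that changes nothing adds no entry, so the history lists only real changes.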
Legal review
Any data dictionary AIFI uses to process metadata is considered part of the research agreement and is therefore subject to legal review. If you work with one of our partners, be sure you understand your institution's data sharing policy. For detailed information, check with your legal representative.