Genomic research data: open vs. restricted access.
Article Type: Report
Subject: Genetic research (Ethical aspects)
Confidential communications (Laws, regulations and rules)
Genomics (Research)
Author: Resnik, David B.
Pub Date: 01/01/2010
Publication: Name: IRB: Ethics & Human Research Publisher: Hastings Center Audience: Academic Format: Magazine/Journal Subject: Health Copyright: COPYRIGHT 2010 Hastings Center ISSN: 0193-7758
Issue: Date: Jan-Feb, 2010 Source Volume: 32 Source Issue: 1
Organization: Government Agency: United States. National Institutes of Health
Geographic: Geographic Scope: United States Geographic Code: 1USA United States
Accession Number: 239462724
Full Text: The free and open exchange of information is an essential part of scientific methodology and ethics. Openness plays an important role in collaboration, accountability, replication, discovery, and debate in science, and helps to promote the development of well-informed public policy. (1) Many different research organizations have developed policies that promote openness. For example, the National Institutes of Health (NIH) requires investigators seeking $500,000 or more in direct costs in any single year to submit a data sharing plan or state why data sharing is not possible. (2) Many scientific journals require authors to make data available to the public and fulfill reasonable requests for research materials, and most biomedical journals require clinical trials to be registered on a public registry as a condition of publication. (3) And beginning with a conference of researchers and other stakeholders in Bermuda in 1996, (4) several statements and recommendations for data sharing have been issued, the most recent in September 2009 by attendees at the Toronto International Data Release Workshop. (5)

While openness is crucial to scientific progress, maintaining the confidentiality of data is equally important in protecting the privacy of individuals who provide their biospecimens for genomic research. International guidelines, professional standards, and federal research regulations require investigators to safeguard the confidentiality of research data. (6)

Unauthorized disclosure of confidential information related to an individual's participation in research may cause significant harm, such as discrimination, bias, embarrassment, and anxiety. Although there have been only a handful of documented cases of discrimination based on genetic information in the United States, public concern about potential harms from access to genetic information led to passage of the federal Genetic Information Nondiscrimination Act (GINA), which was signed into law in May 2008. (7)

Despite the passage of GINA and the various federal and state privacy laws and regulations that govern the use and disclosure of personal information, many people remain concerned about the potential adverse consequences of the use and disclosure of information about them obtained in the research setting. (8) Thus, institutions, investigators, and oversight committees have developed a variety of strategies for managing the conflict between maintaining the confidentiality of personal information and ensuring the free and open exchange of research data. At opposite ends of the policy spectrum are the restricted access approach and the open access approach. Both approaches have distinct advantages and disadvantages. The restricted access approach offers excellent protection for confidentiality, but it may slow the pace of scientific progress because investigators may find it difficult or burdensome to comply with the requirements for obtaining access to research data. The open access approach helps to stimulate scientific progress by making data freely available, but it may do so at the expense of compromising confidentiality. Recent developments in bioinformatics have called into question the legitimacy of an open access approach that involves making genomic data available on public Web sites.

Examples of Access Policies

Under the restricted access strategy, research data are shared only if the data recipients agree to specific terms and conditions. For example, the Framingham Heart Study--a prospective cohort study that now contains over 15,000 men and women from Framingham, Massachusetts--permits access to the research data only when certain conditions are met. Access is restricted to researchers who comply with requirements for use of the data, sign a data and materials use agreement, and have obtained institutional review board (IRB) approval for their study. (9)

Researchers are not allowed to identify the individuals from whom the data were obtained and may use the data only for research projects described in the signed agreement. They must take security measures to protect electronic data, such as encryption and firewalls, and may not share the data with third parties. A committee reviews requests for access to data to ensure that the proposed research projects are well designed, scientifically important, and comply with the goals and standards of the Framingham Study. (10)

In contrast to a restricted access approach, an open access approach permits the use and sharing of data with no conditions attached. To protect the confidentiality of individuals from whom the data were obtained, the data are stripped of identifying information such as name, social security number, and date of birth. The data may be coded or anonymized. If the data are coded, the researchers retain a code that links the data to specific individuals, but they do not share the code with people who request access to the data. (11) If the data are anonymized, the researchers who collected the data have no method for linking the data to the individuals from whom the data were obtained. For example, the Human Genome Project, a 13-year coordinated effort to identify and sequence the entire human genome, makes annotated DNA sequence data freely available to researchers through a public Web site. The DNA for the study was provided by several individuals chosen from hundreds of potential DNA donors, and their identifying information was removed from the DNA that investigators examined. (12)
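The difference between coded and anonymized data can be sketched in a few lines of Python. This is a minimal illustration, not a description of any study's actual procedure: the field names, the function names, and the use of a random hexadecimal code are all assumptions made for the example.

```python
import secrets

# Hypothetical direct identifiers; the text names name, social security
# number, and date of birth as examples of fields to strip.
IDENTIFIERS = ("name", "ssn", "date_of_birth")


def strip_identifiers(record):
    """Return a copy of the record without direct identifiers."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}


def code_records(records):
    """Coded release: replace identifiers with a random code and keep a key.

    Returns (shared, key). Only `shared` goes to data recipients; the
    collecting researchers retain `key`, which maps each code back to the
    identifiers that were removed.
    """
    shared, key = [], {}
    for record in records:
        code = secrets.token_hex(8)
        key[code] = {k: record[k] for k in IDENTIFIERS if k in record}
        shared.append({"code": code, **strip_identifiers(record)})
    return shared, key


def anonymize_records(records):
    """Anonymized release: strip identifiers and keep no linking key."""
    return [strip_identifiers(record) for record in records]
```

In the coded case the collecting researchers can still relink a record through the retained key; in the anonymized case no such key exists. Neither step alters the genomic data itself.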

In between these two extreme positions are intermediate approaches to sharing research data. Some institutions have developed agreements that allow data recipients to use data indefinitely for many projects, not just those described in the agreement, or that do not mandate that recipients obtain IRB approval for use of the data. Federal regulations governing research with humans do not require IRB approval for research with human data if the data are publicly available or the information is recorded in such a way that the investigator cannot readily identify the individuals from whom the data were obtained. (13) For example, the data use agreement for the State Cancer Profiles data compiled by the National Cancer Institute (NCI) and the Centers for Disease Control and Prevention requires that data users agree to use the data only for statistical analysis and reporting, not to attempt to identify individuals in the database, and that they obtain IRB approval. However, the agreement does not require data users to specify the nature of their research projects. (14) It is also worth noting that some forms of open access are not as open as posting data to a public Web site. For example, some institutions and investigators make data available to qualified scientists only upon request. (15)

Problems with Open Access

In August 2007, the NIH initiated a data-sharing policy for genomewide association studies (GWAS). Data from GWAS are deposited in two main databases, the NCI's Cancer Genetic Markers of Susceptibility (CGEMS) database and Database of Genotypes and Phenotypes (dbGaP). The data can be analyzed to determine relationships between genotypes and disease. (16) According to the 2007 GWAS policy, individual-level data will be available only to investigators and institutions that submit an application to a data access committee and sign a data use agreement. Under the agreement, investigators agree to use the data only for the approved research; to protect the confidentiality of the data; to adopt appropriate data security measures; to follow applicable laws; to not attempt to identify the individuals from whom the data were obtained; to not sell the data; and to not transfer the data to a third party. (17) The 2007 policy said that aggregate data would be openly available to anyone, including the public, via a public Web site. At the time the policy was initiated, the NIH believed that deidentifying genomic data would make it impossible for anyone to trace the data back to the individuals from whom the data were obtained. Other organizations followed the NIH's lead by granting open access to aggregate human genomic data. (18)

Yet doubts about deidentification as a strategy for protecting the confidentiality of data and the privacy of the individuals who provided their biospecimens for research had emerged before the GWAS policy went into effect. (19) For instance, in 2004, Lin and colleagues demonstrated that it is possible to identify an individual from 30-80 statistically independent single nucleotide polymorphisms (SNPs). This finding implied that someone with access to genomic data from an individual could identify that individual or the individual's close relatives (parent, child, or sibling) in a public SNP database. (20) A year after the GWAS policy went into effect, Homer and colleagues demonstrated how to use statistical techniques to identify an individual in a complex DNA mixture consisting of 10,000 to 50,000 SNPs if one already has information about that individual's DNA. The authors also showed that it is possible to identify close relatives of the individual in the database. (21) To identify individuals in deidentified databases using the methods described by Lin or Homer, one must have access to an identified DNA sample from an individual. There are a number of databases containing identified DNA materials, including those used for military, research, forensic, or commercial purposes. People with access to these databases would be able to match the data in them to deidentified genetic data contained in open access research databases. Additionally, DNA can be acquired by other means, such as extracting it from a person's blood, hair, or tissue. (22)
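A back-of-the-envelope calculation suggests why a few dozen independent SNPs can be enough to single out one person. The sketch below is not the method used by Lin and colleagues; it assumes biallelic SNPs in Hardy-Weinberg proportions, full independence between SNPs, the same allele frequency at every SNP, and a hypothetical population of 7 billion unrelated people.

```python
import math


def genotype_match_probability(p):
    """Chance that two unrelated people share a genotype at one biallelic SNP.

    Assumes Hardy-Weinberg proportions for allele frequency p.
    """
    q = 1.0 - p
    genotype_freqs = (p * p, 2.0 * p * q, q * q)
    return sum(f * f for f in genotype_freqs)


def snps_needed(population_size, p=0.5):
    """Smallest number of independent SNPs for which the expected number of
    chance matches in the population falls below one."""
    per_snp = genotype_match_probability(p)
    return math.floor(math.log(population_size) / -math.log(per_snp)) + 1
```

Under these assumptions, `snps_needed(7e9)` returns 24 when every SNP has an allele frequency of 0.5 and 61 when the frequency is 0.1, the same order of magnitude as the 30-80 SNPs cited above. Real genomes violate the independence and unrelatedness assumptions, so the figures are illustrative only.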

Another method for identifying individuals in deidentified genomic databases is to match individuals to nongenomic data. For example, one could match the individual's genetic data and associated phenotypes (such as age, gender, disease status) to data in health care, criminal justice, commercial, or other databases. If the individual is explicitly identified in those databases, the matching would be relatively straightforward. Even if the individual is not explicitly identified it may be possible to use statistical methods to identify the individual and match him/her to a genetic database. A third method is to develop a profile of an individual from phenotypic and genetic data in databases, such as the individual's height, weight, gender, disease status, age, approximate skin pigmentation, eye color, and hair color. (23)
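The first of these nongenomic methods, matching on shared attributes, can be sketched as a simple record linkage. The field names, the choice of quasi-identifiers, and the function name are hypothetical; real linkage attacks would typically use more attributes and probabilistic matching.

```python
from collections import defaultdict

# Hypothetical quasi-identifiers present in both data sets.
QUASI_IDENTIFIERS = ("age", "gender", "disease_status")


def link_records(deidentified, identified, keys=QUASI_IDENTIFIERS):
    """Return {index of deidentified record: name} for unambiguous matches."""
    index = defaultdict(list)
    for person in identified:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = {}
    for i, record in enumerate(deidentified):
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # exactly one candidate: reidentified
            matches[i] = candidates[0]
    return matches
```

A record is reidentified only when its combination of attributes is unique in the identified data set, which is why rare phenotypes and small study populations raise the risk.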

After learning about the findings of Homer et al., the NIH decided in August 2008 to remove aggregate GWAS data from the public Web site. Access to aggregate GWAS data is now granted only to investigators and institutions that submit an application to the data access committee and abide by the terms of the data use agreement. The NIH also urged other organizations to consider the policy implications of the recent developments indicating the ability to reidentify genomic data. (24)
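The concern behind this change can be illustrated with a small simulation in the spirit of the distance statistic described by Homer and colleagues: for each SNP, compare how far a person's genotype lies from a reference population's allele frequency and from the pooled study frequency, then test whether the average difference exceeds zero. Everything below is a simplified sketch on simulated data; the function names, group sizes, and number of SNPs are assumptions, and the published analysis used real genotyping arrays and additional safeguards.

```python
import math
import random
import statistics


def random_genotype(p, rng):
    """Allele dosage in {0.0, 0.5, 1.0} for a SNP with allele frequency p."""
    return ((rng.random() < p) + (rng.random() < p)) / 2.0


def simulate_person(freqs, rng):
    return [random_genotype(p, rng) for p in freqs]


def allele_frequencies(people):
    """Aggregate (summary-level) allele frequency at each SNP."""
    n = len(people)
    return [sum(column) / n for column in zip(*people)]


def membership_statistic(person, reference_freqs, pool_freqs):
    """T statistic over per-SNP distances D = |Y - Ref| - |Y - Pool|.

    Large positive values mean the person's genotypes sit closer to the
    pool than to the reference, as expected if the person is in the pool.
    """
    d = [abs(y - r) - abs(y - m)
         for y, r, m in zip(person, reference_freqs, pool_freqs)]
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))


def demo(num_snps=20000, group_size=50, seed=0):
    """Return (T for a pool member, T for an outsider) on simulated data."""
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.9) for _ in range(num_snps)]
    reference = [simulate_person(freqs, rng) for _ in range(group_size)]
    pool = [simulate_person(freqs, rng) for _ in range(group_size)]
    outsider = simulate_person(freqs, rng)
    ref_freqs = allele_frequencies(reference)
    pool_freqs = allele_frequencies(pool)
    t_member = membership_statistic(pool[0], ref_freqs, pool_freqs)
    t_outsider = membership_statistic(outsider, ref_freqs, pool_freqs)
    return t_member, t_outsider
```

In this setup the statistic for someone in the pool should sit well above zero, while the statistic for an outsider should hover near zero. Only the pool's aggregate frequencies and the person's own genotypes are needed, which is why releasing summary-level data alone does not rule out detecting participation.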

The risks of someone identifying an individual in aggregate, public genomic databases are extremely low at present because people are not likely to put forth the effort required to identify individuals. However, the risks of reidentification are likely to increase as science and technology advance and more identified data become available. Close family members may be at risk as well because it may be possible to identify them and make inferences about their phenotypes once an immediate genetic family member is identified. In the future, it may also be possible to identify more distant family members.

Addressing Problems with Open Access

In response to the potential to reidentify previously deidentified data, commentators have suggested several approaches to protecting the confidentiality of genomic research data. One approach is to limit the amount of genomic data that is publicly released so that reidentification is not possible. There are two problems with this approach. First, there is no general formula for how much data to release, since how much is needed to identify an individual depends on the region of the genome, the rarity of variants, and other factors. (25) Also, what can be safely released may change as science and technology advance. Second, releasing only a minimal amount of data may not provide researchers with what they need for genomic studies, thereby defeating the purpose of public data release. (26)

Another approach is to statistically degrade the data to prevent reidentification of individuals. Like the first approach, this one would also compromise the usefulness of the data. (27)
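The trade-off can be made concrete with a toy example. The text does not say how the data would be degraded; the sketch below assumes one simple possibility, adding Gaussian noise to aggregate allele frequencies, and measures how far the released values drift from the true ones. The function names and noise levels are hypothetical.

```python
import random


def degrade(frequencies, noise_sd, rng):
    """Add Gaussian noise to each frequency and clip to the range [0, 1]."""
    return [min(1.0, max(0.0, f + rng.gauss(0.0, noise_sd)))
            for f in frequencies]


def mean_abs_error(original, degraded):
    """Average absolute difference between true and released values."""
    return sum(abs(a - b) for a, b in zip(original, degraded)) / len(original)
```

More noise makes individual contributions harder to detect but also increases the error that legitimate researchers must work around, which is the loss of usefulness noted above.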

A third approach is for individuals who provide their DNA for research to consent to open release of their genomic data, with the knowledge that they can be or will be identified. Thus, they would relinquish traditional confidentiality protections in order to help promote biomedical research. (28) This is the approach used by the Personal Genome Project, a study that will initially sequence the genomes of 10 volunteers and place their data on the Internet, together with phenotypic data such as height, weight, age, and disease status. The 10 volunteers, including project leader George Church, are all geneticists. The goal of the project is to move beyond the initial pilot phase and expand the study to include 100,000 research participants. (29) Other researchers have previously sequenced the genomes of two identified individuals: James Watson, who won the Nobel Prize for discovering the structure of DNA, and Craig Venter, who led a private effort to sequence the human genome in the 1990s. Both have allowed open release of their genomic data. (30)

While the participants in the Personal Genome Project who have agreed to public release of their genomic data should be commended for their commitment to biomedical research, this strategy has some significant risks. If the participants were the only people whose genomic information would be involved, there would be no serious problems with this strategy, since they would consent to the risks associated with public disclosure of their information. However, the participants in the Personal Genome Project are not the only people who would be affected. As discussed earlier, family members may also be at risk of harm from public access to genomic data. (31) Someone who consents to open release of genomic data could be threatening the privacy of his or her relatives. Relatives may experience discrimination, bias, and embarrassment, and they may inadvertently learn about genetic information that they do not want to know about or are not prepared to know about--such as mistaken paternity--which could cause considerable stress. Although researchers' primary ethical obligations are to the individuals who participate in their studies, they may also have obligations to identifiable third parties, such as the individuals' relatives. (32) One could argue that to protect the relatives of research subjects, investigators should not adopt an open release policy for genomic data, even if the research participants agree to such a policy.

In response to this criticism, proponents of the open release approach could argue that research participants have a right to take these risks, as long as they understand the risks and freely consent to provide their genomic information for research. From this perspective, individual autonomy and the benefits of the research project take precedence over minimizing potential risks to relatives, which are remote. However, this response is unsatisfactory. While individuals may have a right to place themselves or their relatives at risk (as Watson and Venter did), they forfeit this right when they participate in research because investigators have an obligation to protect third parties from harm. For example, a pregnant woman is free to take a medication that places her unborn child at serious risk of harm, but she may not be free to take the same medication when she is a research subject because investigators have obligations to protect unborn children from harm.

Proponents of the open release approach could acknowledge that investigators have an obligation to protect relatives from risks related to research, but they could argue that these risks can be managed if research participants discuss the research with family members before agreeing to permit their genomic data to be publicly accessible. (33) The Personal Genome Project advises participants to talk to their close relatives about the research and requires consent from both identical twins. (34) But this response is not entirely satisfactory, because it does not give relatives (other than twins) the right to veto a family member's participation in research. The best way to protect the relatives' rights to control their confidential information would be to ask close relatives to consent to participation in the research project. However, this approach could be difficult to implement because one would need permission from all of an individual's close relatives in order to enroll the individual in the project. Furthermore, the strategy would offer no protection to more distant relatives who might be at risk some day as a result of anticipated advances in science and technology. For these reasons, IRBs should be deeply skeptical of protocols that call for open release of genomic data.

A fourth approach is the revised NIH policy for GWAS data, which involves moving from the open access model toward some type of restricted access to genomic data. This strategy can help to protect the confidentiality of individuals' genomic data and the confidentiality interests of their relatives. Though a restricted access model may impede or delay research by making it more difficult for some researchers to obtain access to data, these problems can be alleviated, in part, by developing policies that are not excessively restrictive; publicizing the availability of data; linking available genomic data to publications; streamlining the request process; and standardizing data use agreements.

IRB Considerations

IRBs are responsible for deciding whether to approve protocols that include plans for sharing genomic data. Although many protocols that include data-sharing plans--such as minimal-risk research--might qualify for expedited review, (35) given the complexity of the issues and the potential risks, the full IRB should review these protocols. IRB review should focus on two key concerns: confidentiality protections and the informed consent process.

To address confidentiality issues, the IRB should consider some pertinent questions. If the investigators plan to restrict access to the data:

* What types of policies and procedures will they use to control access to data?

* Will they require recipients to put data security measures in place?

* Will they prohibit recipients from transferring data to third parties and from attempting to identify individuals?

* Will they allow data to be used only for a specific project or will they allow use for many types of projects?

* Will the investigators form a committee to review requests for data?

* If the investigators plan to share individually identifiable data, will they require recipients to produce proof of IRB review in their application for data?

If the investigators plan not to restrict access to data:

* How will they maintain the confidentiality of genomic data?

* Will they take appropriate steps to deidentify data?

* Will the data be coded or anonymous?

* Will the investigators take additional steps to make it more difficult to reidentify individuals, such as limiting data release?

* Will data be placed on a public Web site?

* Are the investigators proposing to ask individuals who provide a biospecimen for research to waive traditional confidentiality protections and to consent to open release of their genomic data? If so, how do the investigators propose to protect family members from potential harms associated with the release of their relative's genomic data?

To address informed consent issues, the IRB should consider an additional set of questions. If the investigators propose to use a restricted-access approach:

* Will individuals who provide a biospecimen for research be informed about the policies and procedures for controlling access to their genomic data?

* Will they be informed about data use agreements, data use committees, and security measures?

* Will they be told that their data will be used for a specific research project or for many types of projects?

* Will they be informed about whether there are plans to deidentify their data?

If the investigators propose to use an open-access approach:

* Will individuals who provide a biospecimen for research be informed about plans to protect the confidentiality of their data, such as deidentifying the data?

* Will they be informed about the potential risks of reidentification and the investigator's plans to minimize those risks?

* Will they be able to understand the risks of reidentification?

* If they will be asked to consent to open release of their data, will they be informed about the risks of this approach, including risks to family members?

In addition to addressing these and other questions when they review a protocol that involves sharing genomic data, IRBs may also wish to develop policies pertaining to some of these issues, such as the requirements for data access agreements.

Openness is one of science's fundamental ethical norms, but it should not take precedence over the obligation to protect the confidentiality of individuals' genomic data. Due to recent advances in methods for reidentifying individuals in deidentified genomic databases, the best way in most cases to balance the need for openness against the need to protect the confidentiality of data may be to restrict access to the data.


This research was supported by the intramural program of the National Institute of Environmental Health Sciences (NIEHS), National Institutes of Health (NIH). It does not represent the opinions of the NIEHS or NIH. I am grateful to Laura Beskow and Doug Bell for helpful comments.


(1.) Resnik D. Openness vs. secrecy in scientific research. Episteme 2006;2:135-147.

(2.) National Institutes of Health. Data sharing policy and implementation guidance. data_sharing_guidance.htm.

(3.) Science. General information for authors. /authors/prep/gen_info.dtl; International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. http://www

(4.) Marshall E. Bermuda rules: Community spirit, with teeth. Science 2001;291:1192.

(5.) Toronto International Data Release Workshop. Prepublication data sharing. Nature 2009;461:168-170.

(6.) World Medical Association. Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects; Council for International Organizations of Medical Sciences. International Ethical Guidelines for Biomedical Research Involving Human Subjects; American Medical Association. Code of Ethics. resources/medical-ethics/code-medical-ethics.shtml; Department of Health and Human Services. Protection of Human Subjects. 45 CFR 46.111a7.

(7.) Genetic Information Nondiscrimination Act of 2008 (P.L. 110-233, 122 Stat. 881).

(8.) Hudson K. Prohibiting genetic discrimination. NEJM 2007; 356:2021-2023; Hudson K, Holohan M, Collins F. Keeping pace with the times--the Genetic Information Nondiscrimination Act of 2008. NEJM 2008;358:2661-2663.

(9.) Framingham Heart Study. Proposal application and distribution agreements. /research/proposal.html.

(10.) Framingham Heart Study. Research application overview.

(11.) Weir R, Olick R, Murray J. The Stored Tissue Issue: Biomedical Research, Ethics and Law in the Era of Genomic Medicine. New York: Oxford University Press, 2004.

(12.) Human Genome Project. About the Human Genome Project. Available at: _Genome/project/about.shtml.

(13.) Department of Health and Human Services. Protection of Human Subjects. 45 CFR 46.101b4; Department of Health and Human Services, Office for Human Research Protections, Guidance on Research Involving Coded Private Information or Biological Specimens, October 2008. /ohrp/humansubjects/guidance/cdebiol.htm.

(14.) State Cancer Profiles. Data use agreement. /datause.html.

(15.) See ref. 11, Weir et al. 2004.

(16.) National Institutes of Health. Genome-wide association studies (GWAS).

(17.) National Institutes of Health. Policy for sharing of data obtained in NIH supported or conducted genome-wide association studies (GWAS). /guide/notice-files/NOT-OD-07-088.html.

(18.) Couzin J. Whole-genome data not anonymous, challenging assumptions. Science 2008;321:1278.

(19.) McGuire A, Gibbs R. No longer deidentified. Science 2006;312:370-371.

(20.) Lin Z, Owen A, Altman R. Genomic research and human subject privacy. Science 2004;305:183.

(21.) Homer N, Szelinger S, Redman M, et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genetics 2008;4(8):e1000167:1-9.

(22.) Lowrance W, Collins F. Identifiability in genomic research. Science 2007; 317: 600-602.

(23.) See ref. 20, Lin et al. 2004.

(24.) National Institutes of Health. Modifications to genome-wide association studies (GWAS) access. /grants/gwas/data_sharing_policy_modifications_20080828.pdf.

(25.) See ref. 22, Lowrance and Collins 2007.

(26.) See ref. 22, Lowrance and Collins 2007.

(27.) See ref. 22, Lowrance and Collins 2007.

(28.) Lunshof J, Chadwick R, Vorhaus D, Church G. From genetic privacy to open consent. Nature Reviews Genetics 2008; 9:406-411.

(29.) Church G. The personal genome project. Molecular Systems Biology December 13, 2005. /v1/n1/full/msb4100040.html.

(30.) Harmon A. Taking a peek at the experts' genetic secrets. New York Times, October 19, 2008; Marshall E. Sequencers of a famous genome confront privacy issues. Science 2007;315:1780.

(31.) McGuire A, Cho M, McGuire S, Caulfield T. The future of personal genomics. Science 2007;317:687.

(32.) Resnik D, Sharp R. Protecting third parties in human subjects research. IRB: Ethics & Human Research 2006;28(4):1-7.

(33.) See ref. 28, Lunshof et al. 2008.

(34.) See ref. 29, Church 2005.

(35.) Department of Health and Human Services. Protection of Human Subjects. 45 CFR 46.110b1.

David B. Resnik, "Genomic Research Data: Open vs. Restricted Access," IRB: Ethics & Human Research 32, no. 1 (2010): 1-6.

* David B. Resnik, JD, PhD, is Bioethicist and IRB Chair, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC.
Gale Copyright: Copyright 2010 Gale, Cengage Learning. All rights reserved.