How Well do Animal Teratology Studies Predict Human Hazard? – Setting the Bar for Alternatives

Jarrod Bailey, Physicians Committee for Responsible Medicine

Published: September 3, 2008

About the Author(s)
Following the completion of his PhD in viral genetics in 1998 at Newcastle University, England, Jarrod spent several years as a senior postdoctoral research associate examining the causes of premature birth in humans, using human tissue samples. During this time he developed an interest in the relevance and validity of animal experiments to human disease.

He has authored and co-authored reviews outlining the limitations and hazards of using animals to test for substances that can cause birth defects and cancer, and of using chimpanzees and other nonhuman primates in various forms of medical research. He has authored a report on the redundancy of using genetically modified animals to research diseases such as cystic fibrosis, Alzheimer’s and Parkinson’s, among others, and was a chief author of a petition submitted in late 2007 by a coalition of organisations to the Food and Drug Administration, requesting that it require scientists to use valid non-animal methods in research and testing in place of animal methods. He has recently published a paper examining the role of chimpanzees in the development of AIDS vaccines.

He is an Honorary Research Associate at Newcastle University, a Senior Research Scientist for the Physicians Committee for Responsible Medicine and Science Director of the New England Anti-Vivisection Society and its Project R&R campaign.

Dr. Jarrod Bailey
Senior Research Consultant
Physicians Committee for Responsible Medicine
5100 Wisconsin Ave., N.W.
Ste. 400
Washington, DC 20016

Birth defects induced by maternal exposure to exogenous agents during pregnancy are preventable if the agents themselves can be identified and avoided. Though only around 10% of congenital anomalies are caused in this way (1), representing roughly one in every thousand live births, they compromise the quality of life of millions of individuals worldwide and cost billions of dollars in health care every year. It follows that developmental toxicology, the study of “abnormalities of development caused by exposure to deleterious [teratogenic] agents,” is an enormously important area of toxicology.

Animal-based studies have been used for decades to provide initial guidance on human teratogenesis. In these studies, a range of doses of a test substance is administered to pregnant animals during the period of embryonic organogenesis, and the outcomes are compared with those in untreated control animals. However, it is widely accepted that non-animal alternatives in this area are desirable, given the potential for such alternatives to be more relevant to, and more predictive of, human hazard, and also because of the animal-intensive nature of the current tests.

In discussing alternative approaches, it is important to “set the bar” – to evaluate the performance of the animal tests by assessing their predictive value and correlation to the human situation, so that any proposed alternative methods can be evaluated against them. This evaluation is also in accordance with the recommendations of the National Research Council’s report: “Toxicity Testing in the 21st Century” (2). Prior to the potential replacement of animal testing, this report advocates a two-pronged approach consisting of in vitro screening followed, when necessary, by complementary “targeted testing,” which will largely comprise in vivo approaches, at least in the near future. However, proposing the continued use of animal tests obligates further critical assessment of them.

This essay therefore summarizes some of the salient points of a comprehensive study of this nature done in 2005 (3), with the aim of helping to position this bar, and to foster important discussion and debate. This systematic and non-selective study looked at animal-based test results for almost 1400 substances in 12 different species, using the REPRORISK system (TERIS and REPROTOX databases, MICROMEDEX; (4)), by consulting reference texts such as Schardein (5), and via PubMed literature searches using the following terms: Teratogen, Teratology, Teratogenicity, Concordance, Discordance, Results, Species, Extrapolation, Differences, Human, Animal. The results of this study are summarized below.

Reliability and Concordance of Animal Data
Historical – Across Twelve Species

An analysis of the responses of up to 12 animal species to 11 groups of known human teratogens (grouped by drug class/chemical nature) revealed significant discordance: positive predictability ranged from 75% for the hamster down to 40% for the rabbit, which also exhibited a false negative rate of 40%. The mean positive predictability rate in the six most frequently used species historically (mouse, rat, rabbit, hamster, primate, dog) was less than 55%, and the number of equivocal results remained high across these six species at just under 25%.

Further, there were 139 animal results across different species for 35 individual substances positively linked with human teratogenicity. Just over half (56%) of the animal results were positive. This poor predictability was underlined by an FDA report detailing the responses of mice, rats, rabbits, hamsters and monkeys to 38 known human teratogens, in which the mean percentage of correct positives was only 60% (6). This report also analyzed 165 compounds known to be non-teratogenic in humans. The mean negative predictive value for these species was 54%.

Contemporary – Two Species; Rats and Rabbits

While these data from 12 animal species are of interest and illustrate the impact of species differences in teratology studies, any consideration of the human relevance of contemporary in vivo studies must naturally focus upon the main species used in them, namely the rat and the rabbit.

For groups of known human teratogens (such as androgenic hormones and tetracyclines, for example), results from tests using the rat correlated with human classifications in 64% of cases; for the rabbit, this correlation was 40%. When this analysis was focused on 35 individual substances known to be associated with human teratogenesis, positive predictability for the rat was 61%, with 29% of results falsely negative. The rabbit was positively predictive in 41% of cases, and produced false negative results for 56% of substances. A small number of equivocal results were obtained in both these species (see Table 1 below, depicting our analysis of results and classifications contained in Schardein (5)).
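
As a rough check on these figures, the short Python sketch below derives the share of equivocal results and the two-species mean from the percentages quoted above for the 35 individual substances. It assumes that positive, equivocal and negative results together account for 100% of each species' results, which the text implies but does not state. The variable and function names are illustrative only.

```python
# Percentages quoted in the text for the 35 known human teratogens.
# Assumption: positive + equivocal + negative results sum to 100%.
RESULTS = {
    "rat": {"true_positive": 61, "false_negative": 29},
    "rabbit": {"true_positive": 41, "false_negative": 56},
}


def equivocal_share(species: str) -> int:
    """Percentage of results left once positives and negatives are removed."""
    r = RESULTS[species]
    return 100 - r["true_positive"] - r["false_negative"]


def mean_true_positive() -> float:
    """Unweighted mean of the true-positive percentages across species."""
    values = [r["true_positive"] for r in RESULTS.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for name in RESULTS:
        print(name, "equivocal %:", equivocal_share(name))
    print("mean true positive %:", mean_true_positive())
```

Under that assumption, about 10% of rat results and 3% of rabbit results were equivocal, and the unweighted mean true-positive rate across the two species is 51%.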

[Table 1. Surviving column headings: Total Substances; % True +ve; % False -ve. Cell values not recoverable.]

Table 1: Results of rat & rabbit teratology tests for known human teratogens. Total number of results in the rat and rabbit for the 35 known human teratogens (according to Schardein (5)) are shown, followed by the number of positive (+), equivocal (+/-) and negative (-) conclusions. The final two columns reveal the percentage of rat and rabbit results that represent True Positives and False Negatives for these substances.
* = One of these results was strain-dependent.

The performance of the rat and rabbit in teratology tests was further elucidated by examining results for the 20 chemicals used in the ECVAM validation studies of three non-animal alternative methods for developmental toxicology (see Table 2 below).

Nine of these substances had a human risk classification with which to compare the rat and rabbit results:

  • 4 ‘unlikely’ human teratogens produced 2 negative, 1 equivocal and 1 positive result in the rat; and 3 negative and 1 positive result in the rabbit.
  • 3 ‘minimal to small risk’ human teratogens produced 3 positive results in the rat, and 1 positive in the rabbit (with the other 2 not tested).
  • 2 ‘moderate to high risk’ human teratogens produced positive results in both rat and rabbit.
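
The counts in the three bullets above can be tallied as follows. This is a minimal sketch, not part of the original analysis. It makes three assumptions: a negative animal result is taken to "agree" with an 'unlikely' human classification and a positive result with either risk class; equivocal results count as non-agreeing; and "positive results in both rat and rabbit" is read as two positive results in each species. Substances not tested in a species are simply left out.

```python
# Result counts transcribed from the three bullets above.
# Assumption: "positive results in both rat and rabbit" means 2 positives each.
COUNTS = {
    "unlikely": {
        "rat": {"negative": 2, "equivocal": 1, "positive": 1},
        "rabbit": {"negative": 3, "positive": 1},
    },
    "minimal to small risk": {
        "rat": {"positive": 3},
        "rabbit": {"positive": 1},  # the other 2 substances were not tested
    },
    "moderate to high risk": {
        "rat": {"positive": 2},
        "rabbit": {"positive": 2},
    },
}

# Assumption: which animal outcome counts as agreeing with each human class.
EXPECTED = {
    "unlikely": "negative",
    "minimal to small risk": "positive",
    "moderate to high risk": "positive",
}


def agreement(species: str) -> tuple[int, int]:
    """Return (agreeing results, total results) for one species."""
    agree = 0
    total = 0
    for human_class, by_species in COUNTS.items():
        outcomes = by_species[species]
        agree += outcomes.get(EXPECTED[human_class], 0)
        total += sum(outcomes.values())
    return agree, total


if __name__ == "__main__":
    for name in ("rat", "rabbit"):
        print(name, agreement(name))
```

With only nine substances, and fewer still tested in the rabbit, such a tally is descriptive rather than a measure of predictive performance.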

[Table 2. Surviving column headings: Human Teratogenic Potential; Minimal-Small Risk; Moderate-High Risk. Cell values not recoverable.]

Table 2: Results of rat & rabbit teratology tests for chemicals used in ECVAM validation studies of non-animal alternative methods that had human risk classifications. Results for those nine substances, where they existed for the rat and the rabbit, are provided for chemicals classified as posing ‘unlikely’, ‘minimal to small’ and ‘moderate to high’ human teratogenic risk.

It can therefore be argued, based on these data, that the modest predictivity of the rat- and rabbit-based tests, together with their high rate of false negatives, raises concerns about their human relevance and applicability. This lack of predictive power is also underlined by the statistic that of 3301 substances tested prior to 1993, 37% were classified as definitely, probably or possibly teratogenic in animals, but fewer than 2.3% of these substances were linked to human birth defects (5).
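
To put the last statistic in absolute terms, the sketch below converts the quoted percentages into approximate substance counts. It assumes that "these substances" refers to the substances flagged in animals rather than to all 3301 tested; the names are illustrative.

```python
TESTED = 3301             # substances tested prior to 1993 (from the text)
ANIMAL_FLAGGED = 0.37     # share classified as teratogenic in animals
HUMAN_LINKED_MAX = 0.023  # upper bound on the share of flagged substances
                          # linked to human birth defects (assumed reading)


def flagged_count() -> int:
    """Approximate number of substances flagged in animal studies."""
    return round(TESTED * ANIMAL_FLAGGED)


def human_linked_upper_bound() -> int:
    """Largest whole number of flagged substances below the 2.3% bound."""
    return int(flagged_count() * HUMAN_LINKED_MAX)


if __name__ == "__main__":
    print("flagged in animals:", flagged_count())
    print("linked to human birth defects (at most):", human_linked_upper_bound())
```

On this reading, roughly 1221 substances were flagged in animals, of which no more than about 28 were linked to human birth defects.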

We must therefore ask if alternatives to these tests can do better – or at least match the performance of the in vivo tests. Can alternatives provide useful data, faster, on more substances and with more efficient use of resources?

In Vitro Alternatives to Animal-Based Teratology

A number of alternatives to animal testing exist or are in development, with potential to improve the field of developmental toxicology in terms of time, cost and human predictivity. Thus, it is anticipated that alternatives to current animal-based methods will greatly increase the number of substances that may be evaluated for potential developmental toxicity, at lower cost and in a shorter timeframe. It is estimated that there are more than 100,000 man-made chemicals to which humans may be exposed on a regular basis (7), and it is therefore generally accepted that in vivo developmental toxicology could not possibly be used to assess all new (or even existing) chemical substances due to the scale of its demand upon time and resources (8).

Some of the current alternatives include:

1. Information

Some data from teratology studies are not freely available in the public domain because of proprietary considerations of commercial significance, or simply as a result of access problems and/or difficulties with the construction and maintenance of appropriate databases. Unrestricted and easier access to this information would avoid repetition of completed animal-based work, and improve data comparison in the search for alternative methods.

2. Computer-based systems & Physico-chemical techniques

Computer-based methods, such as physiologically based pharmacokinetic (PBPK) modeling and structure-activity relationship (SAR) analyses, have already been responsible for the elimination of many animal tests in the pre-screening of candidate drug compounds. PBPK modeling predicts effects by integrating cross-species physiological parameters with specific information about chemicals and their metabolites. SAR uses information concerning the molecular structure and properties of chemicals to predict biological responses to them. Further, EPA launched the ToxCast™ program in 2006. Under this initiative, EPA is developing chemical “fingerprints” using a multitude of in vitro methods and one in vivo approach (zebrafish embryos), comprising hundreds of endpoints. These assays employ high throughput screening techniques to assist in the prediction of potential toxicities of chemicals, including reproductive as well as other endpoints, using biochemical assays of protein function, cell-based transcriptional reporter assays, multi-cell interaction assays, transcriptomics on primary cell cultures, and developmental assays.

The proof-of-concept phase for ToxCast is expected to be completed in 2008 and EPA will make the results available to the public. More information on this program may be found at

3. Human studies

While not a prospective alternative per se, human studies and databases are clearly of high value. Virtually every substance or dietary deficiency currently recognized as being teratogenic in humans was initially identified as a result of case reports and clinical series (9), including infamous examples such as thalidomide, diethylstilbestrol (DES), mercury, fetal alcohol syndrome, fetal hydantoin syndrome, spina bifida, fetal rubella syndrome, and folic acid deficiency. Central birth defect registries have been established in many countries (for example Canada and Australia) to expedite the identification of potential teratogens through retrospective epidemiology methods (5).

4. Use of lower organisms, embryo stages and cell, tissue and organ cultures

These methods constitute the main basis of the in vitro systems currently being used, developed and researched, and also form the core methods involved in the ToxCast program described under #2 above. At the 17th meeting of the European Centre for the Validation of Alternative Methods Scientific Advisory Committee in 2001, three methods were endorsed as scientifically validated and ready for consideration for regulatory acceptance and application: the embryonic stem-cell test (EST), the micromass test (MM) and whole embryo culture (WEC). The best of these three validated methods is considered to be the EST.

The Embryonic Stem Cell Test

This test uses two permanent murine cell lines to screen for teratogenic potential (10-14). Inhibition of cellular differentiation and growth following exposure to a range of concentrations of the test substance is used, via mathematical manipulations, to classify the substance as ‘non-, weakly or strongly embryotoxic.’ When applied to a range of test compounds in the formal validation study and compared to the equivalent in vivo classifications of embryotoxic potential in animals, the EST scored highly on predictability, precision and accuracy. The predictability for non- and weakly embryotoxic substances was classed by ECVAM as sufficient, approaching good, and for strongly embryotoxic substances it was 100%. Precision for non-embryotoxic substances was sufficient, but good and approaching excellent for both weakly and strongly embryotoxic chemicals. Overall accuracy was scored at 78%, or good.
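
The sketch below illustrates how per-class and overall scores of this kind can be computed from a three-class comparison of in vitro predictions against in vivo classifications. The counts are invented for illustration and are not the validation data. It is also an assumption that "predictability" corresponds to the share of substances predicted in a class that truly belong to it, and "precision" to the share of substances truly in a class that were predicted as such.

```python
CLASSES = ["non", "weak", "strong"]

# Hypothetical counts, NOT the ECVAM validation results.
# Rows: in vivo class; columns: in vitro prediction (same order as CLASSES).
MATRIX = [
    [7, 2, 1],
    [2, 8, 0],
    [0, 1, 9],
]


def accuracy(m: list[list[int]]) -> float:
    """Share of all substances placed in the correct class."""
    correct = sum(m[i][i] for i in range(len(m)))
    total = sum(sum(row) for row in m)
    return correct / total


def per_true_class(m: list[list[int]]) -> list[float]:
    """For each in vivo class, the share predicted correctly."""
    return [m[i][i] / sum(m[i]) for i in range(len(m))]


def per_predicted_class(m: list[list[int]]) -> list[float]:
    """For each predicted class, the share that truly belongs to it."""
    return [m[i][i] / sum(row[i] for row in m) for i in range(len(m))]


if __name__ == "__main__":
    print("overall accuracy:", accuracy(MATRIX))
    for name, t, p in zip(CLASSES, per_true_class(MATRIX), per_predicted_class(MATRIX)):
        print(f"{name}: per-true-class {t:.2f}, per-predicted-class {p:.2f}")
```

The same three functions apply whether the reference classes come from animal studies, as in the validation study, or from human classifications.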

Clearly, a comparison of its performance against human teratogenic classifications, as opposed to rat and/or rabbit results, is necessary to derive a full assessment of its relevance, utility and human-predictive performance. In this regard, efforts are underway to further improve its performance, including the use of alternative endpoints of a molecular nature (as opposed to the interpretive aspects of the current microscopic methods). These include: 1) monitoring the expression of specific marker genes; 2) quantifying the presence of tissue-specific marker proteins by flow cytometry; 3) developing protocols for differentiation of test cell lines into other cell types such as neural cells; and 4) avoiding species-specific confounding factors by using human stem cells (15, 16).

Summary and Conclusion

Examination of substantive data from decades of animal-based teratology revealed significant variability in positive and negative predictability. Historically, there were high rates of false positives and false negatives across the 12 different species used, and almost a quarter of all outcomes in the six main species used (mouse, rat, rabbit, hamster, primate and dog) were equivocal. An analysis of individual substances linked with human teratogenicity showed just 56% of animal results to be positive; in the FDA report, mean positive predictability was 60% and negative predictability just 54%. Further, fewer than 1 in 40 of the substances designated as potential teratogens from animal studies have been conclusively linked to human birth defects.

When attention was focused on the main contemporary species used in developmental toxicity tests (the rat and the rabbit), the data revealed a mean positive predictive rate of 51%; the rabbit incorrectly gave a negative result for 56% of known human teratogens, while the false negative rate for the more commonly used rat was not insignificant, at 29%.

Advocates of in vivo developmental toxicity testing highlight that every human teratogen has produced a positive test result in at least one species of animal. However, the impact of this statement is questionable: firstly, it seems logical that if enough animal species are tested with a compound already known to be teratogenic in humans, at least one will prove positive; secondly, it seems imprudent to place confidence in an animal model as being positively predictive for human teratogens simply because every agent that elicits teratogenesis in humans does likewise in some animal species (17-19). It is the value of extrapolation from animal to human that is the salient point and that must be scientifically examined. Retrospective “confirmatory” results from animal models, obtained after events in humans have already been documented, are not evidence of their predictive nature.

As for the best of the proposed alternatives, the EST is already considered to be more reproducible and to provide easier end-points. It presents no problems with respect to ‘route of exposure,’ placental transfer and metabolic differences, and it is devoid of confounding factors associated with animal tests such as intra-species variability, environmental factors, differences in metabolism, placental and other anatomies, absorption, sensitivity, metabolic activation, routes of administration, dose levels and strategies. It provides a means to establish vital mechanistic models of teratogenic action (via gene expression analysis, for example), will decrease the cost and increase the number of chemicals evaluated for developmental toxicity, could reduce the human impact of the false positive and false negative results generated by animal models, and could also greatly reduce the numbers of animals used (3,15). As human cell culture and other technologies improve, new protocols will evolve that will enable an even closer in vitro approximation of in vivo human teratogenesis.

There is, therefore, a scientific basis for developmental toxicology to evolve from an animal-based approach, through the adoption of already validated in vitro methods, towards the development of new, cheaper, easier, more reliable, more predictive in vitro screening techniques.


This essay, and the original study on which it is based, were funded by the Physicians Committee for Responsible Medicine (PCRM), Washington DC.

©2008 Jarrod Bailey

  1. Brent, R.L. (1995). The application of the principles of toxicology and teratology in evaluating the risks of new drugs for treatment of drug addiction in women of reproductive age. NIDA Res. Monogr. 149, 130-184.
  2. National Research Council; Committee on Toxicity and Assessment of Environmental Agents. (2007). Toxicity Testing in the 21st Century: A Vision and a Strategy. National Academies Press; 146.
  3. Bailey, J., Knight, A. & Balcombe, J. (2005). The Future of Teratology Research is In Vitro. Biogenic Amines – Stress and Neuroprotection. 19, 97-145.
  4. Klasco, R.K. & Heitland, G. REPRORISK® System.
  5. Schardein, J.L. (1993). Chemically Induced Birth Defects. New York: Dekker; xiv. 902 p.
  6. United States Food and Drug Administration. (1980). Caffeine: Deletion of GRAS Status, Proposed Declaration That No Prior Sanction Exists, and Use on an Interim Basis Pending Additional Study. Federal Register. 45, 69817.
  7. European Parliament Memo. (2006). Q & A on the New Chemicals Policy, REACH. MEMO/06/488.
  8. Kotwani, A., Mehta, V.L. & Iyengar, B. (1995). Aspirin By Virtue of Its Acidic Property May Act as Teratogen in Early Chick Embryo. Indian J. Physiol. Pharmacol. 39, 131-134.
  9. Polifka, J.E. & Friedman, J.M. (1999). Clinical Teratology: Identifying Teratogenic Risks in Humans. Clin. Genet. 56, 409-420.
  10. Genschow, E., Scholz, G., Brown, N., Piersma, A., et al. (2000). Development of Prediction Models for Three In Vitro Embryotoxicity Tests in an ECVAM Validation Study. In Vitro Mol. Toxicol. 13, 51-66.
  11. Genschow, E., Scholz, G., Brown, N.A., Piersma, A.H., et al. (1999). [Development of Prediction Models for Three In Vitro Embryotoxicity Tests Which are Evaluated in an ECVAM Validation Study]. ALTEX. 16, 73-83.
  12. Scholz, G., Pohl, I., Genschow, E., Klemm, M., et al. (1999). Embryotoxicity Screening Using Embryonic Stem Cells In Vitro: Correlation to In Vivo Teratogenicity. Cells Tissues Organs. 165, 203-211.
  13. Scholz, G., Genschow, E., Pohl, I., Bremer, S., et al. (1999). Prevalidation of the Embryonic Stem Cell Test (EST)-A New In Vitro Embryotoxicity Test. Toxicol. In Vitro. 13, 675-681.
  14. Spielmann, H., Pohl, I., Döring, B., Liebsch, M., et al. (1997). The Embryonic Stem Cell Test (EST), an In Vitro Embryotoxicity Test Using Two Permanent Mouse Cell Lines: 3T3 Fibroblasts and Embryonic Stem Cells. In Vitro Toxicol. 10, 119-127.
  15. Adler, S., Pellizzer, C., Hareng, L., Hartung, T., et al. (2008). First Steps in Establishing a Developmental Toxicity Test Method Based on Human Embryonic Stem Cells. Toxicol. In Vitro. 22, 200-211.
  16. Seiler, A., Buesen, R., Hayess, K., Schlechter, K., et al. (2006). Current status of the embryonic stem cell test: the use of recent advances in the field of stem cell technology and gene expression analysis. ALTEX. 23, Suppl. 393-399.
  17. Heikkila, A.M., Erkkola, R.U. & Nummi, S.E. (1994). Use of Medication During Pregnancy–A Prospective Cohort Study on Use and Policy of Prescribing. Ann. Chir. Gynaecol. Suppl. 208, 80-83.
  18. McElhatton, P. (2002). Teratogenic Drugs – Part 1. Adv. Drug React. Bull. 213, 815-818.
  19. Peters, P. (1998). General Commentary to Drug Therapy and Drug Risks in Pregnancy. In: CH S, editor. Drugs During Pregnancy and Lactation. Amsterdam: Elsevier; pp. 1-13.