Reproducibility of behavioral phenotypes in mouse models: a short history with critical and practical notes

Progress in pre-clinical research is built on reproducible findings, yet reproducibility has different dimensions and even meanings. Indeed, the terms reproducibility, repeatability, and replicability are often used interchangeably, although each has a distinct definition. Moreover, reproducibility can be discussed at the level of methods, analysis, results, or conclusions (1,2). Despite these differences in definitions and dimensions, the main aim for an individual research group is the ability to develop new studies and hypotheses based on firm and reliable findings from previous experiments. In practice this wish is often difficult to accomplish. In this review, issues affecting reproducibility in the field of mouse behavioral phenotyping are discussed.

Crisis in reproducibility.
Over the last ten years, the "reproducibility crisis" has often appeared in the headlines of scientific journals (3-6). Several factors have been identified as major causes of irreproducibility, including p-hacking, cherry-picking, low statistical power, publication bias, and hypothesizing after the results are known (7). However, these issues mostly arise after the animal experiments are done. Many more items need to be considered during the planning and running of an experiment: good experimental design includes randomization, blinding, details of housing, husbandry and animal care, the definition of the experimental unit, inclusion and exclusion criteria, and the choice of animal subjects (the source, health status, strain, sex and age of the animal), among others (8-10). Guidelines and recommendations (e.g. ARRIVE, PREPARE) are available for addressing these factors (11,12). However, although the ARRIVE guidelines have existed for ten years and are endorsed by more than 1000 journals, the awareness of researchers and the quality of publications have not improved sufficiently (13-15). To facilitate and enhance implementation of the ARRIVE guidelines, a revised version with extensive explanation and elaboration was recently published (16,17).
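To make two of the design items above concrete, the following is a minimal sketch of block randomization combined with blinded animal codes. It is an illustration only, not a procedure taken from the guidelines cited here; the function name `allocate`, the group labels, the animal identifiers and the code format are all hypothetical.

```python
import random

def allocate(animal_ids, groups, seed):
    """Assign animals to treatment groups in balanced random blocks
    and give every animal a blinded code (hypothetical scheme)."""
    rng = random.Random(seed)       # a recorded seed makes the allocation auditable
    ids = list(animal_ids)
    rng.shuffle(ids)                # random order of animals

    allocation = {}
    block_size = len(groups)
    for start in range(0, len(ids), block_size):
        block = ids[start:start + block_size]
        labels = list(groups)
        rng.shuffle(labels)         # random order of groups within each block
        for animal, group in zip(block, labels):
            allocation[animal] = group

    # Blinded codes: the experimenter works with these codes only;
    # the key linking codes to groups is kept by someone else.
    codes = rng.sample(range(100, 1000), len(ids))
    blinded = {animal: f"A{code}" for animal, code in zip(ids, codes)}
    return allocation, blinded

# Hypothetical example: 12 animals, two treatment groups.
animals = [f"mouse_{i:02d}" for i in range(1, 13)]
allocation, blinded = allocate(animals, ["vehicle", "drug"], seed=2024)
for animal in animals:
    print(blinded[animal], allocation[animal])
```

Because the blocks have the same size as the number of groups, each group receives the same number of animals whenever the total is a multiple of the number of groups. In a real study the allocation would usually also be stratified by factors such as sex, litter or cage, which this sketch leaves out.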

Paradigm shift.
Mice and rats are the most widely used model animals in basic biomedicine. However, there has been a drastic change in the relative use of these two rodent species over time (Figure 1). Historically, the rat was the model of choice for behavioral studies, but from the beginning of the 1990s a sharp shift from rats to mice took place. This was due to rapid technological development in genetic engineering and the ability to create genetically modified mice (i.e. transgenic or targeted mutants). Initially these mice were only available in the most advanced laboratories (18,19), but within ten years the use of genetically modified mouse models was widespread. Importantly, as these models became a routine tool for almost every team in biomedical research, there was likely a prevailing impression that behavioral assessment was the easiest part of discovering the function(s) of each gene. Moreover, it was supposed that rat paradigms could be easily translated and applied to mice. Yet it quickly became clear that mice are not little rats, and that extensive work with mice poses serious challenges (20-22). Another caveat is that while rat behavior in the laboratory has been studied for decades with the clear goal of understanding the mechanisms of behavior, the mouse is in the majority of cases studied only in the context of genetic modification (phenotyping the effects of gene targeting). This means that we may still be missing a lot of important basic information and knowledge about mouse behavior in laboratory conditions (note the trends in Figure 1).
A mouse is not a mouse. The first mutant mice were made using embryonic stem cells from the 129 mouse strain. However, it was known that these mice harbored several peculiarities complicating their use for neurobehavioral research, including poor breeding performance, hypo-activity, impaired learning, absent corpus callosum, and genetic contamination (23,24). Therefore, the mice were crossed with another inbred strain, C57BL/6, which had proved to be a reasonable strain for various research topics, possessing intermediate phenotypes in many readouts that allowed identification of both gain- and loss-of-function (25). Subsequently, phenotypes of the mutant (knockout) and control (wild type) mice could be compared in the F2-hybrid generation. Yet the phenotypes of these F2 mice may be confounded by the presence of unusual background genes, especially by flanking (passenger) genes from the 129 strain (26,27). Thus, the recommendation was to continue backcrossing with parental strains until congenic and co-isogenic lines were established (28). Following this recommendation, researchers would have had in hand two distinct genetic backgrounds with the possibility of making an F1-hybrid, which would have allowed for a powerful experimental design. In reality, however, backcrossing has usually been done only to the C57BL/6J strain (considered a "gold standard" strain, after its genome was sequenced in 2001). In order to overcome the problems associated with a mixed genetic background and to facilitate the production of mutant mice, embryonic stem cell lines from the C57BL/6N strain were established (29). The International Mouse Phenotyping Consortium (IMPC) is currently creating mutant mice for large-scale phenotyping in a C57BL/6N background (30,31).
However, many researchers are unaware of the genetic and phenotypic differences between these two sub-strains (C57BL/6J and C57BL/6N), and the issue is further complicated by poor reporting of animal characteristics (the full and correct strain name is often missing) (32). Unfortunately, this is a major limitation for the external validity and applicability of research with mutant mice. The use of inbred strains is well justified for reducing variability and increasing the precision of measurements (targeting genes in a known background). However, good design (and applicability) would require using more than one strain (33,34), and there are numerous examples of the same mutation producing different phenotypes depending on the background strain (35).
The phenotype is a result of gene-environment interplay. If a pre-clinical research group has created a mutant mouse line, then eventually the phenotypic characterization of live animals will be conducted. The early literature of such studies is full of controversies. Therefore, the behavioral neuroscience community was aware of the problems with (ir)reproducibility well before the last decade, and in a way has been better prepared for the "crisis" (36-38). There has also been a major conflict between "molecular biologists" and "behaviorists", where the former could not understand or accept different results obtained by different laboratories. In order to tackle the discrepancies between laboratories, it was recommended to apply extensive standardization of procedures and environment. Such a solution was tested in a seminal study published in 1999 by Crabbe, Wahlsten and Dudek (39). They found that despite rigorous standardization of almost everything in three different laboratories, some of the results were idiosyncratic to a particular laboratory. Later, it was shown that among the different factors contributing to the variability, the experimenter is the most prominent (40). However, there may be cases where standardization is required or desired. For instance, the IMPC has invested considerably in this type of effort, although the success has been variable (41). Exploring and tracking the causes of different outcomes between facilities may be a bumpy and painful process (42-44). All these findings further reinforced the suspicions that behavioral studies are unreliable, an opinion especially expressed by researchers not working in the field of neurobehavioral research (43). However, an opposing theory was presented by Hanno Würbel, suggesting that extreme standardization is a cause of rather than a cure for poor reproducibility (45,46).
Moreover, in addition to the principle of the 3Rs (47), researchers working with animal models should adopt thinking in terms of the 3Vs: construct validity, internal validity and external validity (applicability, generalizability) (48). A comprehensive review of the current standing and future perspectives for embracing biological variation to enhance reproducibility was recently published (49). Phenotyping efforts that do not consider the impact of environmental and developmental factors can be misleading.
Core facilities. Nowadays, making a knock-out mouse is a routine and standard procedure. The real challenge is in comprehensive phenotyping (50). In 2000, Jacqueline Crawley published a book, "What's Wrong With My Mouse" (second edition in 2007) (51), where she warned readers coming from molecular genetics that the aim was not to write a "how to" manual. Given that behavioral analysis is too complex to be treated as a "cookbook discipline", "descriptions of the methods are intentionally superficial", to give an overview of what is available. Inexperienced readers were advised to seek collaboration with experienced behavioral neuroscientists before setting up behavioral procedures. The real "pain and beauty" of mouse behavioral testing is comprehensively discussed in a book by Douglas Wahlsten (52). One solution for effectively tackling complexity in behavioral analysis has been the establishment of "core facilities" for behavioral assessment. By now, this is not surprising at all, because modern science is multidisciplinary, requiring special equipment (and more importantly, expertise) for adequately dealing with research questions. Thus, core facilities should help enhance the replicability and reliability of behavioral testing (53). The strengths and challenges of core facilities have been recently discussed (54,55).
The need for establishing a core facility must be based on demands from the research community, the potential users of the facility. In Helsinki, the Transgenic Unit was formed at the Laboratory Animal Center in 1996 by the Institute of Biotechnology. Two years later, an initiative for behavioral analysis of mutant mice was launched and I was recruited for this purpose. The laboratory in Helsinki was developed from the beginning with the idea of being open and offering as broad support as possible to everyone interested in behavioral phenotyping: testing of basic sensory and motor functions, followed by more complex tasks for coping with stress, learning and memory, and testing approach / avoidance behavior (e.g. exploration-curiosity, fear-anxiety).
However, as often happens, I started with an instance of failure, which taught me a lot. The first transgenic line to be tested did not seem to learn the spatial navigation task in the water maze (at that time considered a gold standard test for learning). I then found out that the mice were on an FVB/N background (the standard strain for transgenic mice at that time), a strain that carries a mutation causing retinal degeneration and blindness (56). Curiously enough, there were papers published showing spatial learning in this strain (57). This was a lesson that warned me that meaningful work with mutant mice requires parallel studies with inbred strains: know thy mouse (58-60) and methods (54)! The end of the 1990s was a very active period in the field of behavioral phenotyping, a time of new openings and shifting paradigms. We learned a lot about differences between mouse strains (25) and about strategies to set up test batteries (as opposed to the tradition of conducting only one test per animal) (61,62). Excellent international training courses and workshops were organized, and I personally had the opportunity to attend courses arranged by Cold Spring Harbor Laboratory, EMBO / FENS, IBRO, IBANGS, and the EUMORPHIA program. Inspiring interaction with fellow students and respected faculty (e.g. Jacqueline Crawley, Richard Paylor, Howard Eichenbaum, Seth Grant, Richard Morris, Hans-Peter Lipp, David Wolfer, Wim Crusio and many others) created a solid network and offered many good ideas for proceeding. My personal impression is that during the last 15 years there has been a decline in such high-quality interactive courses. We have tried to fill this gap by organizing Baltic summer schools on the topic of rodent behavioral analysis (63). Indeed, learning, teaching and challenging existing paradigms are best achieved in communication and interaction between established and starting researchers; for impressions from recent FENS courses, see (64).
Quality monitoring! The main purpose of the core facility is to serve the research community (55). The organization of the facility and collaboration with users can have different forms, from full to minimum service. The responsibility of the facility is to maintain and take good care of the equipment and space, including monitoring of performance, necessary calibrations, and timely repairs and replacements. An essential part of the responsibility is training users and supporting them in all steps of the project (planning, conducting, analyzing, and reporting). This is one part of quality. Another part, as mentioned above, is to be up-to-date with developments in theory and methodology in behavioral neuroscience and laboratory animal science. In addition, internal validity and consistency should be verified with some standards and calibrations for animal behavior.
Users frequently ask what normal mouse behavior is, while thinking only in terms of their particular disease model. Moreover, it is often thought that the control, "wild-type" for gene-targeted mice, represents "normal", immediately implying that gene targeting will result in "abnormal" animals. As described previously, it is difficult or impossible to answer the question of what "normal" behavior is, given the many inbred strains available, along with the impact of environmental conditions. Each inbred strain has some peculiarities due to inbreeding: retinal degeneration, deafness, anatomical differences, or susceptibility or resistance to certain conditions that can develop (e.g. diabetes) (65). We all may have heard the sayings "your genetics is only as good as your phenotype" and "this mutant does not have a phenotype" (66). First, these ideas may cause bias towards the hypothesis, and second, having no phenotype is impossible. The only valid conclusion in this case would be that the given mice in the given situation did not display a phenotypic difference from the other study groups. Therefore, each model needs to be placed in the broader context of mouse behavior, with attention to the environment. Another question that may be asked is which is the best test for memory (or anxiety, or any other domain). Researchers with different backgrounds may not be aware of different memory systems or different types of anxiety. This is further complicated by the fact that there are hundreds of tests available (67). Navigating this landscape is one of the tasks of experts working in core facilities. Of course, the facility also learns from its users: a good facility is open and flexible in adapting and developing new methods.
Quality monitoring can be done by regular testing of inbred strains with known phenotypes. Although we cannot speak of animals as tools, we hope that the phenotypes of inbred strains are stable over time (68). Therefore, the testing of such animals could reveal if the conditions in the laboratory are stable. The C57BL/6 and DBA/2 strains are the oldest strains available and much information has been collected about physiology, anatomy and behavior in these mice. In our studies, we have found consistent differences between these two strains in several conventional tests (open field, light-dark box, and forced swim test) throughout the years when our laboratory was located in three different buildings (60,69-71).
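One way such monitoring could be put into practice is sketched below: for each test session, the difference between two reference strains is expressed as a standardized effect size, and sessions where the expected difference reverses or shrinks are flagged for a closer look. This is a minimal illustration under stated assumptions and not the procedure used in our laboratory. The function names, the session labels, the threshold `min_abs_d` and all numbers are invented for the example.

```python
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardized mean difference between two groups (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5

def flag_drift(sessions, reference_sign, min_abs_d=0.5):
    """Return the sessions in which the strain difference has the wrong
    direction or is smaller than min_abs_d (an arbitrary, assumed threshold)."""
    flagged = []
    for name, (strain_a, strain_b) in sessions.items():
        d = cohens_d(strain_a, strain_b)
        if d * reference_sign <= 0 or abs(d) < min_abs_d:
            flagged.append(name)
    return flagged

# Invented readouts (e.g. distance travelled, arbitrary units) for two
# reference strains tested in three monitoring sessions.
sessions = {
    "session_1": ([42.0, 45.5, 39.8, 44.1, 41.2], [30.1, 28.4, 33.0, 29.5, 31.7]),
    "session_2": ([40.3, 43.9, 41.0, 38.7, 44.6], [29.0, 32.2, 27.8, 30.9, 31.1]),
    "session_3": ([33.0, 30.5, 34.2, 31.8, 32.4], [32.1, 33.5, 30.9, 34.0, 31.6]),
}

# reference_sign=+1 encodes the expectation that strain A scores higher.
print(flag_drift(sessions, reference_sign=1))
```

With these invented numbers only the third session is flagged, because the difference between the strains has essentially disappeared there. A flag of this kind is a prompt to check housing, equipment, season or experimenter, not evidence in itself that something went wrong.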
Another important issue that is frequently discussed is the use of male and/or female mice (72). Indeed, it may still be difficult to convince researchers that including female mice in their studies does not ruin them, despite the evidence that female mice are no more variable than males (73,74). Including both sexes is mandatory for sound design and enhanced external validity (75). Despite many years of recommendations to consider sex as a biological variable, the change is taking place very slowly (76,77).
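Whether one sex is more variable than the other in a given readout can be checked directly in a laboratory's own data, for example by comparing the coefficient of variation (standard deviation divided by the mean) between males and females. The sketch below shows the calculation only; the record layout, the trait name and the values are invented, and the cited studies may have used different measures of variability.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Standard deviation relative to the mean (unitless)."""
    return stdev(values) / mean(values)

def cv_by_sex(records, trait):
    """Group records by sex and return the coefficient of variation per sex."""
    by_sex = {}
    for rec in records:
        by_sex.setdefault(rec["sex"], []).append(rec[trait])
    return {sex: coefficient_of_variation(vals) for sex, vals in by_sex.items()}

# Invented example records; "latency_s" is a hypothetical readout.
records = [
    {"sex": "F", "latency_s": 21.0}, {"sex": "F", "latency_s": 24.5},
    {"sex": "F", "latency_s": 19.8}, {"sex": "F", "latency_s": 23.1},
    {"sex": "M", "latency_s": 20.2}, {"sex": "M", "latency_s": 26.0},
    {"sex": "M", "latency_s": 18.4}, {"sex": "M", "latency_s": 24.9},
]

for sex, cv in cv_by_sex(records, "latency_s").items():
    print(sex, round(cv, 3))
```

A comparison like this is descriptive. With small groups the coefficients themselves are imprecise, so similar values do not prove equal variability and different values do not prove a sex difference.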
Last but not least, the human factor (experimenter) in animal experiments cannot be neglected (40,78). Therefore, handling techniques need to be trained and refined (79), in addition to improving the understanding of the behavior that is measured and recorded, even if the process is automated (80,81). Testing animal behavior can be challenging: "Despite our best efforts, the mice will continue to win some innings" (82). For enhanced validity, I believe we need to put even more emphasis on recording and mining the behavior of animals in their home-cage environment (83). It is not yet possible to fully replace conventional testing, but at the same time the role of the experimenter in the process of data collection needs to be critically evaluated (84).

Summary.
In this short review, I have tried to highlight the factors that I consider important or essential in running a meaningful (mouse) behavioral phenotyping program. I would like to conclude with the words of Michael Festing and Ulrich Dirnagl: 'We are not born knowing how to design and analyze scientific experiments (85).' 'We should be [moving to the world] where biological thinking rules, and sound data production is emphasized through careful planning, design, execution, and reporting of our studies. A world where methods and results are transparently described so that effects and inferences can be independently confirmed (86).' Thus, I encourage everyone to be open to new ideas while being skeptical of the phrase "we've always done it like this".