The Practice of Archiving Model Code of Agent-Based Models

To evaluate the concern over the reproducibility of computational science we reviewed 2367 journal articles on agent-based models published between 1990 and 2014 and documented the public availability of source code. The percentage of publications that make the model code available is about 10%. The percentages are similar for publications that are reportedly dependent on public funding. There are big differences among journals in the public availability of model code and software used. This suggests that the varying social norms and practical convenience around sharing code may explain some of the differences among different sectors of the scientific community.

Keywords: Bibliometrics, Replication, Open Science, Computational Science

Marco A. Janssen
School of Sustainability
Center for Behavior, Institutions and the Environment
Table 1: The locations where source code was stored, as referred to in the journal articles.

Location name   Description                                   Number of publications
Journal         As supplementary information                  72
Personal        Websites of researchers or research groups    71

Figure 1 axes: year (1996 to 2014) against number of publications (0 to 450), with the series "Code not publicly available" and "Code publicly available". Figure 2 axis: percentage of publications, 0% to 16%.


Introduction
There is increasing concern over the repeatability and reproducibility of computational science (Barnes, 2010; Joppa et al., 2013; Morin et al. 2012; Peng, 2011; Easterbrook, 2014). If computational scientific enterprises want to be accumulative, more transparency is required, including the archiving of computer code in public repositories. A recent study reported that around 50% of findings published in the Association for Computing Machinery (ACM) conference proceedings and journal articles could not be compiled into valid executables by computer science students, even after authors were requested to provide source code and build instructions (Collberg and Proebsting, 2014). Various code repositories have been created (Stodden et al., 2012, 2015; McLennan et al. 2010; DeRoure et al. 2009), but their use is limited.
In this paper we document the practice of archiving model code for agent-based models, an increasingly popular methodology in the social and life sciences. Recent years have seen the emergence of standard platforms such as Cormas (Bousquet et al., 1998), NetLogo (Wilensky, 1999), Repast (Collier, 2003), and MASON (Luke et al. 2005), as well as textbooks (Railsback and Grimm, 2012; Wilensky and Rand, 2015), conferences and summer schools. As such, agent-based modeling has become a recognized method in the life and social sciences.
Since what we now call "agent-based modeling" did not originate in a particular discipline or application, we may expect its applications to be spread widely across various disciplines. Part of this exercise is to map the use of the method in different fields and to document whether there are different practices in sharing model code and model documentation.
In the rest of this paper we first describe the methodology used to derive a sample of 2367 publications presenting the results of agent-based models and the protocol we used to collect metadata on the availability of model code, the software used, and the way models are documented. We then report the descriptive statistics of the data and perform a network analysis of the publications citing each other. We conclude with a discussion on the implications of our findings.

Methodology
In order to derive a sample of relevant publications, we used the search term "agent-based model*" on the ISI Web of Science database in the spring of 2015 for publications up to 2014. The term "agent-based model*" could appear in the title, abstract or keywords. This resulted in 2855 publications. All publications were evaluated in order to verify that each was about an agent-based model. Reviews, conference abstracts and publications presenting only conceptual models were discarded. This resulted in 2367 publications that report a model and results of model simulations.
For each publication we checked whether the model code was made available through a provided URL to a website or as an appendix. We also checked whether the URL was still available. Hence our criterion for public availability of the model code depends on valid information being provided in the article. We recognize that the model code could be published online but not mentioned in the article, or could have been provided by the authors had we requested it. As such, our estimate of the public availability of model code is an underestimate of what might be available with more investigation. Furthermore, we listed which programming platform was used and which sponsors funded the research. Finally, we recorded how the model was described in the articles and appendices. Based on Müller et al. (2014) we distinguished the following items:
- Narrative. How was the model description organized? Did it use a standard protocol called Overview-Design-Details (ODD) (Grimm et al. 2006), or did it use a non-prescriptive narrative?
- Visualized Relationships. How were the relationships visualized? Did the description include flow charts, a Unified Modelling Language (UML) diagram, or an explicit depiction of an ontology that describes entities and their structural interrelationships?
- Code and formal description. How were the algorithmic procedures documented? Did the authors provide the source code? Did they describe the model in pseudocode or use mathematical equations to describe (parts of) the model?
The downloaded information from ISI Web of Science included references for each article. This information was entered into a database and unique identifiers were provided for the publications in order to perform a network analysis. The resulting database can be found at https://osf.io/8n663/.
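The availability criterion described above can be illustrated with a small record type. This is a minimal sketch and not the coding instrument used in the study: the class `PublicationRecord`, its field names and the four example records are hypothetical, and an appendix is treated here as a location that always resolves.

```python
from dataclasses import dataclass


@dataclass
class PublicationRecord:
    """One coded publication. All field names are hypothetical."""
    pub_id: str
    code_location_stated: bool           # article gives a URL or appendix for the code
    link_resolves: bool = False          # the stated location still existed when checked
    password_protected: bool = False     # the location exists but is not openly readable
    available_on_request: bool = False   # "contact the authors" statements


def code_publicly_available(rec: PublicationRecord) -> bool:
    """Count code as public only if the article points to it and the
    pointer leads to openly accessible code. Availability on request
    does not count."""
    return (
        rec.code_location_stated
        and rec.link_resolves
        and not rec.password_protected
    )


# Hypothetical examples covering the cases discussed in the text.
records = [
    PublicationRecord("A", code_location_stated=True, link_resolves=True),
    PublicationRecord("B", code_location_stated=True, link_resolves=False),
    PublicationRecord("C", code_location_stated=False, available_on_request=True),
    PublicationRecord("D", code_location_stated=True, link_resolves=True,
                      password_protected=True),
]
available = [r.pub_id for r in records if code_publicly_available(r)]
print(available)
```

In this sketch only record A counts as publicly available; B and D correspond to dead or password-protected links, and C to code that is only offered on request.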

Results
Of the 2367 articles, 236 (10.0%) contained information on the availability of the source code, often via a link to an online database. Excluded from the count were 69 articles which provided a link to online databases, but for which either the website no longer existed or the link was password protected. Although authors may be able to provide the code if one requests it, as sometimes stated in the publication, we only consider model code publicly available if the actual code is made publicly available. In some cases code might have been made available without this being mentioned in the publication, but this would be unknown to us since we rely only on the information in the publication.
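The headline percentage can be checked directly from the counts above. The sketch below also derives two figures that are not reported in the text and are shown only as an illustration; they rest on the assumption that the 69 excluded articles are disjoint from the 236 counted ones, which the wording "excluded from the count" suggests. The helper `pct` is a name introduced here.

```python
TOTAL = 2367      # publications reporting a model and simulation results
AVAILABLE = 236   # articles whose stated code location was accessible
BROKEN = 69       # articles whose link was dead or password protected


def pct(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Share of publications with publicly available model code (as reported).
share_available = pct(AVAILABLE, TOTAL)

# Assumption: the 69 broken links are in addition to the 236 working ones.
share_pointing = pct(AVAILABLE + BROKEN, TOTAL)   # articles that pointed to code at all
share_broken = pct(BROKEN, AVAILABLE + BROKEN)    # of those, links that no longer work

print(share_available, share_pointing, share_broken)
```

Under that assumption, counting the inaccessible links would raise the share of articles that point to code from 10.0% to 12.9%, and roughly one in five of the stated code locations (22.6%) could no longer be reached.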
Figure 1 shows the number of publications on agent-based models over time. Each publication is a new or updated agent-based model for which computer code is used to generate the published results. We see an exponential increase in the number of publications. Figure 2 shows that the percentage of publications that make the model code publicly available is below 10% until 2012 and increases to about 15% in 2014. With the rapid increase of the absolute number of publications, this means a very sharp increase in the amount of model code made publicly available. Nevertheless, for 90% of the publications the model code is not publicly available, which will hinder replication of the results and the accumulation of knowledge.

What facilitated the increase of archiving model code? To investigate this we traced where the code was made available (Table 1). The most common option is to have the code available on the journal publisher's website. The next most common option is to have the code available on the personal website of the author or of the research group. In some cases authors made their code available via a Dropbox link or ResearchGate post. There are various public archives for computer code such as GitHub, SourceForge, CCPForge, Bitbucket, Dataverse and Google Code, but the most commonly used archive is the specialized Computational Model Archive at OpenABM.org, with code of 55 publications from our data set. Finally, we consider platform-specific repositories such as those of NetLogo and Cormas.

Since most research is sponsored by tax money, sponsors often explicitly require that the data, including software code, be made publicly available. About 55% of the publications list the sponsors of their research. In some cases there are multiple sponsors. In Table 4 we list the 10 most common sponsors mentioned and provide the percentage of their publications for which model code is made publicly available. From this table it is clear that publicly funded research does not produce a higher percentage of publications with publicly available model code. The numbers suggest that the sponsors' requirements on public data availability are not enforced.

Which software platforms were used? Not every manuscript provides information on which software is used. In fact, 1223 of the 2367 publications (52%) do not provide information on the software implementation. Among those that do, we find more than 100 different platforms and computer languages. Some publications use combinations of platforms and languages. In Table 5 we list the 10 most commonly used platforms and languages as mentioned in the publications. NetLogo and Repast are the most common, and both are platforms specific to agent-based modeling.

How are models described in journal publications? Table 6 reports the various ways in which models are documented. A verbal narrative is the most frequent description. A more precise narrative is the ODD protocol (Grimm et al. 2006), which provides a structured description of the different components and mechanisms of the model. A mathematical description is also commonly used, but note that this does not mean that all those publications provide a complete mathematical description. In many cases only some key equations are provided, which are essential to understand the model together with the verbal narrative.

Do model publications build on each other? Among the 2367 publications there are 2704 citations, which is an average of 2.3 connections per paper. We map the network of connections between the articles in Figure 4 using the ForceAtlas 2 algorithm in the network visualization tool Gephi. We focus here on the largest set of connected papers in the network. Based on an evaluation of the paper topics in the various clusters of the network, we indicate different topic areas. The densest topic area in terms of interactions (meaning citations) is land use change modeling. This is an application area of agent-based modeling that has many users. Figure 4 also demonstrates that the lack of archiving model code is widespread among all research domains. Figure 5 depicts the software that is used, as mentioned in the publication. Here, too, we see that the 10 most commonly used languages and platforms are used across all topic areas.
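The construction of the citation network can be sketched as follows. This is an illustrative reconstruction rather than the procedure actually used: the function names, the toy identifiers and the reference lists are hypothetical. The reported average of 2.3 is consistent with counting each citation once for the citing and once for the cited paper; that reading is an assumption.

```python
def in_sample_edges(references, sample_ids):
    """Return the set of (citing, cited) pairs in which both publications
    belong to the sample. Self-citations of a paper to itself and
    duplicate reference entries are dropped."""
    sample = set(sample_ids)
    edges = set()
    for citing, cited_list in references.items():
        if citing not in sample:
            continue
        for cited in cited_list:
            if cited in sample and cited != citing:
                edges.add((citing, cited))
    return edges


def mean_connections(n_publications, n_citations):
    """Average number of connections per paper, counting every citation
    at both of its endpoints (assumed interpretation)."""
    return 2 * n_citations / n_publications


# Toy data: "x9" stands for a reference outside the sample.
references = {
    "p1": ["p2", "x9"],
    "p2": [],
    "p3": ["p1", "p2"],
    "p4": [],
}
sample_ids = ["p1", "p2", "p3", "p4"]

edges = in_sample_edges(references, sample_ids)
connected = {node for edge in edges for node in edge}  # p4 is isolated and not drawn

print(sorted(edges))
print(mean_connections(len(sample_ids), len(edges)))
print(round(mean_connections(2367, 2704), 1))
```

With the counts reported above, 2 × 2704 / 2367 rounds to 2.3, matching the stated average. In the toy data, the publication without any in-sample connection is left out of the set of connected nodes, in the same way that Figure 4 only depicts publications that have a connection with another publication.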

Conclusions
In this article we provided a brief report on the practice of making agent-based model code publicly available. We relied on information in the publications. We found that about 10% of the publications provide model code, and that this percentage is increasing. We noticed major differences between journals and platforms. The increasing use of some common, easy-to-use platforms like NetLogo and R makes it more convenient to share model code, but journals need to facilitate this. Most journals do not provide any information on requirements for computational studies. Only recently have some high-profile journals started to encourage transparency of scientific research by improving the standards of reproducibility (McNutt, 2014). So far the focus is on biomedical and behavioral research, but computational research is expected to follow (Alberts et al., 2015).
The results provided are the initial results of a broader project to map the field of computational modeling. Since many of the publications are recent, due to the exponential increase of agent-based model publications, the impact of model availability on citations cannot yet be evaluated in a reliable way. A further extension of the database will include a broader range of agent-based simulation models (including those that use different terms like multi-agent simulation, agent-based simulation and agent-based computational economics), as well as more recent publications. The resulting database will enable us to derive a better understanding of the practices in the rather fragmented scholarly landscape of computational modeling.
In conclusion, the sharing of the model code of agent-based models is low, but slowly improving. The technical facilities to archive model code are available. However, to increase the actual sharing of model code, and enhance knowledge accumulation, journals need to improve their standards and sponsors need to enforce their policies.

Figure 1. Number of publications over time.

Figure 2. Percentage of publications for which model code is publicly available.

Figure 4: Network of model publications connected with other model publications among the 2367 publications in the dataset. Green nodes indicate publications for which the model code is publicly available. Red nodes indicate publications for which the model code is not publicly available. Note that only publications that have a connection with another publication are depicted.

Figure 5: The network of publications with connections to other model publications colored according to the known use of the computer language or platform. White nodes indicate that the software used is unknown or is a less frequently used platform.

Table 1: The locations where source code was stored, as referred to in the journal articles.

Figure 3 shows the use of different locations where code is archived over time. This demonstrates the increase in the use of open source archives, especially OpenABM. Figure 3 also demonstrates that model code that was available for publications from about 10 years ago is often no longer accessible. This demonstrates the importance of storing model code and documentation in public archives to preserve the scientific output for future generations.

Figure 3: The percentage of model publications split up in different categories where the source code of the model is available.

The 2367 publications appeared in 722 different journals, which demonstrates the spread and scope of the use of agent-based models. The 10 most popular journals are listed in Table 2. This table shows the wide diversity of standards and practices of the journals. The popular journal JASSS has a high percentage of publications that make the model code available. Its guidelines state: "Authors are strongly encouraged to include sufficient information to enable readers to replicate reported simulation experiments." Although it is not a requirement, the journal encourages authors to share model code. On the other hand, a popular journal like Physica A has no articles for which model code is made available. The short articles in this journal typically describe models mathematically and present results of computer simulations.

Table 2: Model code availability of the 10 most popular journals in the database.

Table 4: Model code availability for the 10 most common sponsors.

Table 5: Model code availability for the most common platforms or programming languages.

Table 6: Relative frequencies in which models are described in the publication.