ODD+2D: An ODD Based Protocol for Mapping Data to Empirical ABMs

Abstract: The quantity of data and processes used in modeling projects has been increasing dramatically in recent years due to progress in computation capability and the popularity of new approaches such as open data. Modelers face increasing difficulty in analyzing and modeling complex systems that consist of many heterogeneous entities. Adapting existing models is relevant to avoid the complexity of writing and studying a new model from scratch. The ODD (Overview, Design concepts, Details) protocol has emerged as a solution to document Agent-Based Models (ABMs). It appears to be a convenient solution to address significant problems such as comprehension, replication, and dissemination. However, it lacks a standard that formalizes the use of data in empirical models. This paper tackles this issue by proposing a set of rules that outline the use of empirical data inside an ABM. We call this new protocol ODD+2D (ODD + Decision + Data). ODD+2D integrates a mapping diagram called DAMap (Data to Agent Mapping). This mapping model formalizes how data are processed and mapped to agent-based models. In this paper, we focus on the architecture of ODD+2D, and we illustrate it with a residential mobility model in Marrakesh.


Introduction
The quantity of data and processes used in modeling projects has been increasing dramatically in recent years due to progress in computation capability and the popularity of new approaches such as open data. Modelers face increasing difficulty in analyzing and modeling complex systems that consist of many heterogeneous entities.
Today, a large number of models exist to represent various complex phenomena. Adapting existing models would be relevant in order to avoid the complexity of writing and studying a new model from scratch, but reusing them faces a major difficulty: the lack of transparency in their descriptions (Richiardi et al.; Hinkelmann et al.; Müller et al.; Donkin et al.).
The need for a standard protocol to describe and share agent-based models has emerged from the community. Such descriptions make it easier for researchers from various disciplines to understand and replicate models. Grimm et al. ( ) proposed the ODD protocol to describe the entities, equations, rules, and schedules contained in ABMs. They advise always keeping the same sequence to ease reading, understanding, and replication, and ultimately to make ODD a standard protocol. After evaluating previous uses of ODD, Grimm et al. ( ) were convinced that the protocol needed improvement to fix limitations and ambiguities. In their paper, they promoted a slightly modified structure of the protocol to guarantee greater clarity and efficiency.
These upgrades contributed to the extensive use of ODD ( citations for the first version, for the second one; Google Scholar, December ). For the complex systems community, ODD also appears as an excellent tool to disseminate, evaluate, and assess various multi-agent models (Polhill et al.; Le et al.; Lammoglia; Balbi et al.).
In fact, creating textual descriptions and models together by following methodologies (e.g., ODD, UML) improves the modeling process and exchanges between scientific disciplines. It opens new perspectives such as dissemination, comprehension, assessment, replication, comparison, theory building, and code generation (Müller et al.). However, major modeling projects focus on model design from the theory in the literature. Data are not taken into account in the model design; they are integrated separately in an ad-hoc manner (Truong et al.; Wang; Holm et al.; Groeneveld et al.). That is why developed models are often very theoretical and tend to be far from reality (Bykovsky; Filatova).
Using data to design and parameterize models is still a methodological challenge (Geller) to which protocols such as ODD seem to respond. Also, some relevant issues have been raised by the ODD creators (Grimm et al.), who advised sharing the input files and the source code of the model. But sometimes this is not possible or suitable, due to legacy and privacy constraints. This practice is nowadays still rare (O'Sullivan et al.) but growing (Janssen).

ODD+D: an extension of ODD
Some researchers noticed that ODD does not give a sufficient response to their needs. They extended the original structure and adapted it to their specific cases. For instance, Nguyen et al. ( ) created a common representation (called CoODD) to specify collaboration/participation rules, while Hinkelmann et al. ( ) added an algebraic specification to describe ABMs with algebraic structure.
ODD was designed for ecological modeling and is less suitable for socioeconomic models (Müller et al.). ODD shows its limits when intelligent and social entities (such as humans) are integrated into the model. The protocol does not support human behaviors (decisions, adaptation, and learning) well.
To overcome this limitation, the authors presented a new extension called ODD+D (ODD + Decision) that aims at introducing human decision-making. For that purpose, Müller et al. ( ) added some blocks to ODD without modifying the original form. This new extension was used and discussed in many research studies on modeling human societies to describe agent-based models with decision aspects (Filatova; Klabunde et al.).
However, ODD+D still has the same limits as ODD regarding the integration of data. The authors of ODD+D (Müller et al.) tried to include some aims of this work in their protocol, under the element "Theoretical and empirical background". But it does not satisfy all modeling needs for formalizing the use of empirical data in the model.

In what follows, we fill this gap by extending ODD+D and introducing data-oriented directives in the "Input data" block. This improvement emphasizes the link between data and model. This accurate description facilitates using and re-using a model with other data sets, such as raw data from surveys or unprocessed databases (Figure ).
Developing models in a purely theoretical context (toy models) increases the gap between the model and reality. It may call into question the contribution of such models (Lammoglia; Klabunde & Willekens) for the case study, especially when realism is the goal of the modeling process (Smajgl & Barreteau). Therefore, empirical knowledge has to be integrated into modeling practice through specific strategies and methods (Boero & Squazzoni; Filatova), since it can directly address essential modeling needs, such as spatiality, temporal resolution, and behavioral rules (Altaweel et al.).
Hence, data are not used just at the end of the modeling process to obtain results. They are a crucial resource throughout the modeling process to produce knowledge, configure agents, determine behaviors, validate, select scenarios (Geller), and reveal the relevant data sources and how they were implemented (Barreteau & Smajgl). Such data-driven models can contribute to obtaining simulation results that fit the observations of the corresponding target (Hassan et al.).
Creating a model from empirical data strengthens the confidence of end-users because it reproduces an observed phenomenon and can be validated thanks to real data (Hassan et al.). Hence, consistent use of empirical data increases the trust of various stakeholders in a model (Filatova). Nevertheless, the use of real data must be justified by the research goal.
Introducing data has an impact on the model's complexity, both in its dynamics and in the complicatedness of its structure (Grimm et al.). Thus, such a model becomes more difficult to understand, explain, share, and disseminate (Hassan et al.). Modelers are faced with a trade-off between designing theoretically-grounded models (with global assumptions) and empirical models (with contextualized assumptions) (Boero & Squazzoni; O'Sullivan et al.; Sun et al.).
It is difficult to associate disciplinary theory with empirical data, but it is necessary to provide actual answers for decision-making. What is the suitable amount of data that should be introduced in a model? It is a real debate within the agent-based community (Sansores et al.; Bruch & Atwell; Sun et al.). Filatova ( ) outlines some challenges such as:
• Maintaining a link between empirical data and theory;
• The necessity to collect case-specific data to match the design of an ABM;
• The difficulty of replicating and generalizing the results;
• The translation of qualitative data into formal rules when coding.
Advances in the domain should propose new approaches, concepts, and tools to introduce data into the modeling process (Filatova). This tends toward the development of mid-level models (O'Sullivan et al.) that associate theory with empiricism, science with case studies, and researchers with stakeholders.
Providing methods and languages that describe data, the model, and the relationships between them may ease the development and promotion of an environment for decision-making. This is the research we are developing by combining a natural descriptive model (ODD) and implementation models (Geller; Sun et al.) with a data-model mapping description.

Shortcomings of existing descriptions
In this subsection, we will show the most critical shortcomings of the current descriptions, and how describing data and data-model connections can improve the empirical foundations of agent-based models.
For example, Klabunde et al. ( ) and Filatova ( ) present an agent-based decision model of migration and an empirical agent-based land market, respectively. These two models (as major ODD+D descriptions) each describe data in their own manner. This shows an inconsistency in the descriptions and the lack of a unified and comprehensible approach. They do not provide sufficient information about the empirical data, or about how it is structured and used to design and develop the model. The connection between data and model entities is quite confusing: (i) agents' state variables are hard to locate in the data; (ii) the relationships between the data structure and the model architecture are omitted. Also, the little information given about data is dispersed throughout the ODD+D document, which hinders reading and understanding.
This insufficiency makes it very difficult to reproduce the data-model connections of the two previous examples. Hence, the gap between data and model persists and needs more description to be bridged. We tried to solve the problem by extending the two descriptions with our proposal (Appendices B and C).
Despite the popularity of agent-based models, there is still no accepted methodological standard for these models (Richiardi et al.; Hinkelmann et al.), especially in empirical research (Bruch & Atwell). Popper & Pichler ( ) argue that the ODD protocol covers foremost the aspect of model definition, but it does not document: (i) the process of modeling; (ii) the theoretical knowledge involved; (iii) the development of the model; and (iv) the analysis of results. A model description should also make transparent where and which data have been used for its creation, development, calibration, and validation (Barreteau & Smajgl; Groeneveld et al.). By answering questions about the choice of methods and parameterization, the reader can not only reproduce the model but also get a holistic view of its context. Such a view may reduce misunderstandings and misinterpretations about the model and help to successfully replicate its structure and dynamics (Donkin et al.).
How to solve the empirical challenge?
To solve the empirical challenge of agent-based models, methods should take into account the following points:

Relation to related data - a formal link between the data structure and the model entities must be established. Depending on the collection approach and the size of the sample, data have a limited validity domain. The modeler has to consider this throughout the modeling process. Thus, metadata and the initial context are vital for assessing the relevance of these data to the research question (Altaweel et al.).
Note that data are often used for unforeseen projects intended to address various questions. Preprocessing must be done to understand the data and to format them for the new question. This leads to filtering away unnecessary data complexity (Hassan et al.; Geller) and contributes to creating new knowledge through the expertise employed. This scientific work should be formalized and capitalized for further research (Siebers & Aickelin).
Relation to involved participants - experts give meaning to data. Their knowledge helps to identify the structure (agents, environment) and dynamics (behaviors) of the model (Hassan et al.; Filatova). Indeed, from interdisciplinary collaboration (experts, modelers, stakeholders), new points of view will emerge. That is why this progressive knowledge should be formalized throughout the modeling process (Barreteau et al.; Barreteau & Smajgl).
Transparency - it allows the modeling process to be redone by any scientist outside the initial team. A transparent approach implies documenting scientific choices, analysis methods, tools, and source code. Transparency is essential for the capitalization of scientific knowledge (Janssen).
For example, a transparent model that reproduces urban dynamics based on urban data can be used again for another city. The initial preprocessing could also be discussed and applied strictly to analyze the data of the new city. Such reuse is enabled by the readability of the description and code, and by the accuracy of the analysis details that result from a modeling project. Transparency also eases the trust given by other disciplines and the exchanges about the expected model (Anh et al.).
Structuring - the structure is a guideline for users, which permits asking the right modeling questions, obtaining correct answers, and formalizing them comprehensively. Therefore it must ensure transparency, relation to related data, and relation to involved participants.
A data-modeling method must invite involved participants to follow structures and outlines, to produce accurate and understandable descriptions. It is also a manual for anyone to read these descriptions, to analyze, and to replicate the process (Barreteau & Smajgl). Such a method is based on suitable languages, protocols, and a tool suite that: (i) eases the modeling process and data mapping; (ii) allows multidisciplinary exchanges; and (iii) supports simulation.

ODD+2D: Extending ODD+D for Describing Data in ABMs
ODD can be overdone for straightforward models (Grimm et al.). In such a case the documentation may be done using continuous text instead of separate document subsections for each ODD element (Popper & Pichler). Also, any structured method that organizes data to be directly applicable to modeling projects can facilitate model creation (Altaweel et al.). However, ODD is more efficient in disseminating, understanding, and structuring the design of models (Polhill et al.; Wolf et al.) in comparison to other frameworks such as the MR POTATOHEAD framework (Parker et al.), the Dahlem ABM Documentation Guidelines (Wolf et al.), and the Characterization and Parameterization (CAP) framework (Smajgl & Barreteau).

Figure : The DAMap approach separates the modeling process into three layers: Code, Description, and View. The modeler modifies the model and updates data (Code), while interacting with the domain expert who tests and validates outputs (View). In parallel, the two actors participate in commenting on and documenting (Description) the model.

Therefore, an extension of the ODD protocol would be suitable and much easier to understand and use than another dedicated method for describing data, such as Delineate, Structure, and Gather (DSG) proposed by Altaweel et al. ( ). The emphasis on data to design models can be reflected by soft adaptations of the ODD protocol (Geller).
The next subsection presents a collaborative approach, called DAMap, to developing and describing empirical agent-based models. It is based on a diagram that maps data to the model. This diagram makes it possible to generate a GAML (GAma Modeling Language) implementation model and a textual model extending the ODD protocol.

DAMap (Data to Agent Mapping) approach
Experience shows that elaborating both the textual model and the design model in the same modeling process favors taking participants' interests (researchers and stakeholders) into account (Sargent). Exchanges between participants from various domains should be engaged to collect input data, conceptualize, and generalize the model with scientific knowledge and hypotheses. They are the key to building a usable model in line with the local case study and scientific advances. A short iterative cycle between the establishment of these two models is a reliable approach to ensure the development of a usable model.

We propose to make describing the model part of the development cycle and to keep the connection between the three parts: model code, description, and view. We call this scheme the DAMap (Data to Agent Mapping) approach, as shown in Figure . The modeler elaborates the model code and collaborates with the domain expert at the same time, to produce the model description interactively. The two actors test and validate the simulation results and outputs (View), and update the other components: Code and Description. This architecture simplifies model comprehension and encourages the reusability of each part's outcomes. Domain experts, stakeholders, and modelers are actors of the modeling process and take roles according to their skills (Jones et al.) (Figure ).

To make this approach possible, we are developing in parallel a graphical user interface (GUI) called DAMap (Laatabi et al.). This interface allows the user to design a diagram mapping data elements to components of the agent model. Thanks to dedicated tools, the user is guided to generate a natural model (ODD+2D description) and an implementation model (GAMA code).
As Groeneveld et al. ( ) argued, the model description should be conducted in collaboration with someone who has not implemented the model. This actor may identify redundant, confusing, and forgotten details: people outside the classical modeling process are not burdened with the technical difficulties of the project.

Note that this paper focuses on the ODD+2D extension, so details of the DAMap method (Laatabi et al.) are not given here. Nevertheless, the DAMap diagram is provided because it is included as a part of the ODD+2D description. The ODD+2D protocol extends ODD+D and allows specifying the usage of data inside a model. It gives new ways to understand and consider data, for a better integration into agent models. By using this protocol, we aim to favor model reuse for other case studies. The data-model specification determines the application domains of a model and the data that could be used with it. It also helps to feed a model with new data from a new case study.
ODD-based protocols are also tools for modelers to check whether all the information necessary for understanding and replicating the model is available (Groeneveld et al.). ODD+2D adds this aspect for data description.
Such a protocol can also be seen as a medium that synthesizes available data in an understandable form. Such a description is vital in the modeling process to collect data and to make an efficient contextualized analysis of them, in order to extract information and feed the modeling process.
ODD+2D reuses the ODD+D architecture and adds four new blocks inside the Input Data part (Figure ): (i) data overview; (ii) data structure; (iii) data mapping; (iv) data patterns. The overview permits disseminating the data context. Structure is about the data scheme and hierarchy. Mapping allows projecting the structure onto the model. Patterns describe the mapping and model dynamics. These four parts are detailed below.
This add-on was designed in order to: (i) be synthetic and precise; (ii) keep track of sources and the usage made of data; (iii) give enough information to use the model with other data; (iv) inform readers about the validity domain of the model; (v) facilitate exchanges between disciplines. As such, ODD+2D combines both textual and graphical descriptions, to be understood by a large community.

Data overview
Questions to answer: Where do the data come from? How are they collected? What is the level of the available data? How are they structured? How are data tables built from the survey? These are the central questions that users should answer to qualify the data they introduce into the model.

Contents:
The databases used are titled and associated with a short description. Authors must not forget to explain the role of each database in the modeling project and which parts are used. Giving a complete overview of data is beneficial to keep the available data in mind throughout the modeling process (Hassan et al.) and to know briefly which kinds of data are required by the model. Finally, authors should not forget to associate a hypertext link with each database to allow readers to look for more information.

Data structure
Questions to answer: What are the variables, entities, and classes available in the data? What do they represent? What is their format? What are their properties? How are they linked?

Contents:
This block describes the structure of the dataset, and specifies the different classes that can become agents in the model. Users are free to explain the data structure with plain text, tables, or diagrams, but the clarity and accuracy of the description are the key to understanding. Thus, we advise describing each database with a formalized and unified language such as a basic table or a UML class diagram; one diagram per database is required. Additional plain text can help give more information about the schema for a better understanding and convey expertise about the data.
Describing the data structure plays an essential role in the DAMap method because it provides an excellent overview of the available data and then facilitates conceptualization of the model (Hassan et al.). Additionally, all data of a database may be described in the schema (used and unused data) to avoid hidden links that can make the data-model relationship confusing (Groeneveld et al.).

Data mapping
The mapping between data and the model is described with a DAMap diagram (Figure ) that gives a synthetic and accurate model of these links. The diagram shows the overall mapping between the database schema model (left) and the agent-based model (right), and the processing required to transform empirical data into agent characteristics. It shows, for example, how a few columns of database tables are aggregated to determine one characteristic of an agent thanks to a mapping pattern of type aggregation.
DAMap proposes two categories of patterns: mapping patterns and assumption patterns.
Mapping patterns determine the category of the link between data and agent, as well as the mapping/transformation processing. We distinguish the following patterns:
• «mapped to» - this pattern tells which agent is linked to which data entity. It means that the state variables of agents are linked by name to data attributes. It allows reducing the displayed links and prevents overfull models.
• «aggregation» - this element explains how new variables are built from separate attributes. It thus defines transformations using expressions composed of operators such as "sum" or "mean".
• «transtyping» - indicates casting rules between a data attribute and a state variable of an agent, for example, to convert income into a social category (social_class).
• «dependence» - indicates how data attributes are associated to explain behaviors.
Assumption patterns provide additional information about agents, their attributes, and behaviors. We distinguish three patterns:
• «constraint» - determines a constraint on a state variable, to prevent the unexpected use of model variables. It permits keeping the integrity of the model (e.g., between [0,1], in {0,1,2}).
• «knowledge» - expresses knowledge about the phenomenon from the literature. It allows justifying current choices by previous studies, research, and results.
• «domainExpert» - outlines knowledge that comes from the experience and practice of the domain expert. This knowledge might not often be found in the scientific literature.
The enumeration above is not exhaustive; users can add new pattern categories in response to their problems. In that case, plain text should briefly describe the choice of the new pattern. An accurate presentation of these patterns may be given in the Data Patterns block.
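As an illustration, the patterns above can be sketched in code. The following Python fragment is a minimal, hypothetical rendering of the «aggregation», «transtyping», and «constraint» patterns; the column names, income thresholds, and limits are illustrative assumptions, not values taken from the survey codebook.

```python
# Hypothetical sketch of DAMap-style patterns; field names (income,
# social_class, hh_size, moves5y) follow the Marrakesh example later
# in the paper, and all thresholds are illustrative only.

def aggregation(record, columns, op=sum):
    """«aggregation»: build one agent variable from several data columns."""
    return op(record[c] for c in columns)

def transtyping_income(income):
    """«transtyping»: cast a raw income value into a social_class label.
    The cutoff values are assumptions, not survey values."""
    if income < 3000:
        return "poor"
    elif income < 10000:
        return "middle"
    return "rich"

def constraint_between(value, low, high):
    """«constraint»: keep a state variable inside [low, high]."""
    return min(max(value, low), high)

# A raw survey record (column names are illustrative)
record = {"income": 4500, "adults": 2, "children": 3}

agent = {
    "social_class": transtyping_income(record["income"]),    # «transtyping»
    "hh_size": aggregation(record, ["adults", "children"]),  # «aggregation»
    "moves5y": constraint_between(7, 0, 5),                  # «constraint»
}
print(agent)  # {'social_class': 'middle', 'hh_size': 5, 'moves5y': 5}
```

Each function corresponds to one pattern stereotype; in the DAMap diagram these would appear as labeled links between the database schema and the agent model rather than as code.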
Note that the DAMap meta-model is explained in more detail in Laatabi et al. ( ). It was developed in accordance with recommendations that argue for the incorporation of UML diagrams into agent-based model documentation (Amouroux; Bersini; Bruch & Atwell; Sun et al.), and with the emphasis on the importance of using a graphical model for the better design of attractive, readable, and reproducible models (Groeneveld et al.).

Data patterns
Questions to answer: What relationships and patterns exist in the data? Are they translated into actions and behaviors in the model? And how do some attribute variations affect other variables and thus agent behaviors?

Contents:
This block gives a list of patterns and formalizes the relations between the database and the agents. It provides an accurate description of the transformation rules that convert data into agent characteristics. As a result, modelers are advised to specify rules accurately by writing equations, formal predicates, or algorithms. Plain text documentation can accompany these specifications for a better understanding.
Users should not compromise on the accuracy of these pattern specifications, because the transparency, readability, and understanding of the data analysis depend on it. Thanks to this, modelers who read the ODD+2D description can redo the data analysis and apply or modify it for another case study.
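A transformation rule of the kind recommended above can be written as a formal predicate. The example below is a hypothetical data pattern linking dwelling size and household size to a moving decision; the 15 m² threshold and the rule itself are illustrative assumptions, not rules from the Marrakesh model.

```python
# Hypothetical data pattern expressed as a formal predicate: a household
# becomes a moving candidate when the surface available per person falls
# below a threshold. The 15 m^2 threshold is an illustrative assumption.

def wants_to_move(surface, hh_size, min_surface_per_person=15.0):
    """Predicate form of a rule relating dwelling size to household size."""
    return surface / hh_size < min_surface_per_person

print(wants_to_move(surface=60.0, hh_size=5))  # True  (12 m^2 per person)
print(wants_to_move(surface=60.0, hh_size=3))  # False (20 m^2 per person)
```

Writing the rule as an executable predicate, rather than prose, is what lets a reader redo the analysis or swap in another case study's data.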

Describing Residential Mobility in Marrakesh with ODD+D
This section follows the ODD+D protocol to describe an agent-based model that reproduces the residential mobility observed in Marrakesh.

Marrakesh has undergone profound structural changes to move toward a more sustainable city. Urban programs are now redrawing transportation, housing, and services: (i) a new urban transportation system based on ecological vehicles is planned; (ii) new districts are under construction; and (iii) economic services are evolving. Consequently, daily mobility, residential mobility, and citizen habits are profoundly changing and affecting the relevance of structural decisions. A consensus should be found to take into account environmental objectives, economic developments, and inhabitants' wishes.
We developed a model of residential dynamics to understand the impact of such decisions on urban dynamics. The model focuses on the main factors that make people decide to change their place of living, such as income (Jordan et al.), household size (Clark), and properties of the dwelling such as size (expressed by the surface area or by the number of bedrooms) and standing.
Urban decision makers consider residential migration as a process of push and pull between an origin and a destination (Lee). Households try to adjust their dwelling to the evolution of their needs over time. This is driven by many socioeconomic, housing, and environmental factors that can be analyzed to study its consequences.
To conduct this study, we collected various types of data: (i) a survey on residential mobility that we performed; (ii) reports produced by the administration of the municipality; (iii) exchanges with local administrations, especially the housing observatory. Data were compiled in a few spreadsheets and a GIS, each with its own structure. For example, the dataset storing the survey is organized into categories regrouping attributes about the household structure, the current dwelling, and the next preference that the household wishes to have. Most of the values are coded in numerical form as depicted in Figure , and a codebook is provided with the data to explain the meaning of each value (the corresponding file data.xlsx is provided in Appendix A).
We now present the model we are developing, following the structure of the ODD+D protocol. In this paper, this model is considered as a case study to outline the limits of ODD+D in describing data and to show how ODD+2D gives a response.

Overview
Purpose
The purpose of the model is to simulate the residential mobility of Marrakesh over time to understand how different factors (demographic, socioeconomic, housing, environmental) affect this phenomenon. The model is designed for urban researchers who can use the simulation to test their hypotheses and scenarios, to help decision makers. Urban dynamics are modeled by agents (Districts, Dwellings, and Households) and the interactions between them, such as moving decisions and relocation.

Entities, state variables, and scales
This model focuses on residential dynamics at the town scale during years. Stakeholders usually measure this mobility year by year. To get the same output data and enough accuracy in the simulation, the time step represents one month. The model is composed of three entities: household, dwelling, and district.
Household - models a group of people that belong to the same family. Four state variables qualify it: (i) social_class (income) with three classes: poor, middle, and rich; (ii) hh_size for the number of inhabitants in the family; (iii) tenure (housing tenure) with two classes (owner or renter); and (iv) moves5y for the number of relocations during the last five years.
Dwelling - represents the habitat unit, an accommodation that can be shared by one or many households. Depending on the number of bedrooms, a house or an apartment in Marrakesh is sometimes shared by a few families. The dwelling is characterized by: (i) surface, the internal area; (ii) dw_age, the dwelling age in years; and (iii) standing, the level of housing standing (low, medium, or high).
District - models an area of the city containing a set of dwellings. The model focuses on Marrakesh. It extracts data from a GIS (shapefile) to configure the simulation at start-up. This file references six districts; each of them is described by an identifier (cid), a label (label), and a space occupation. In addition, they are qualified by a set of parameters about dwellings and households, namely: (i) the initial numbers of dwellings (dw2004) and households (hh2004); (ii) the dwelling (dw_mean_size) and household (hh_mean_size) mean sizes; and (iii) the standing rates of dwellings (l_standing, m_standing, h_standing).
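The three entities and their state variables listed above can be mirrored in a few type declarations. The model itself is written in GAML; the Python dataclasses below are only an illustrative restatement of the state variables, using the attribute names given in the text.

```python
# Illustrative Python dataclasses mirroring the three model entities and
# their state variables (the actual model is implemented in GAML).
from dataclasses import dataclass, field
from typing import List

@dataclass
class Household:
    social_class: str   # "poor", "middle" or "rich"
    hh_size: int        # number of inhabitants in the family
    tenure: str         # "owner" or "renter"
    moves5y: int        # relocations during the last five years

@dataclass
class Dwelling:
    surface: float      # internal area
    dw_age: int         # dwelling age in years
    standing: str       # "low", "medium" or "high"

@dataclass
class District:
    cid: int            # district identifier
    label: str          # district label
    dwellings: List[Dwelling] = field(default_factory=list)

# Example usage (values are illustrative, not shapefile data)
d = District(cid=1, label="Medina")
d.dwellings.append(Dwelling(surface=80.0, dw_age=30, standing="medium"))
```

Keeping the attribute names identical to those in the data mapping («mapped to» links by name) is what makes the DAMap diagram readable.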

Process overview and scheduling
At each time step (each month), a set of households (a percentage given as a parameter), constrained by their number of past moves and their tenure mode, check all the districts to see if there is one closer to their needs. In the chosen district (which may be the same as the current one), each household checks a given number of available dwellings to see if there is a better choice (a dwelling that is closer to its profile); if so, it moves to the best available choice. The distance between the household's preference and both the district and the dwelling is calculated as the Euclidean distance between the attributes of each entity: household (income, size), dwelling (standing, size), and district (mean standing, mean size). This dynamic alters the state of districts, households, and dwellings.
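The moving decision described above can be sketched in a few lines. The following Python fragment is a minimal illustration, not the GAML implementation: the numerical encoding of standing and the sample values are assumptions made for the example.

```python
# Minimal sketch of the intra-district choice: Euclidean distance between
# a household's preference and candidate dwellings. Attribute encodings
# (standing as a number, size in m^2) are illustrative assumptions.
import math

def distance(preference, candidate):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((p - c) ** 2 for p, c in zip(preference, candidate)))

def choose_dwelling(preference, current, candidates):
    """Move only if a sampled dwelling is strictly closer than the current one."""
    best = min(candidates, key=lambda c: distance(preference, c))
    if distance(preference, best) < distance(preference, current):
        return best
    return current

preference = (2.0, 100.0)              # (standing level, desired surface)
current = (1.0, 60.0)                  # current dwelling
candidates = [(2.0, 90.0), (1.0, 50.0)]
print(choose_dwelling(preference, current, candidates))  # (2.0, 90.0)
```

The inter-district stage works the same way, with (mean standing, mean size) as the district's attribute vector.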

Design concepts
Theoretical and empirical background - migration and housing decisions are the central concepts of the model. The complexity comes from the competition to access the restricted resources that affect household choice when migrating. Deciding whether to move is based on the household's perception of the availability of a better alternative that will increase its satisfaction. This decision is constrained by its characteristics, especially income and household size. The concept of choice-preference in the context of residential mobility is discussed in the literature (Lee; Zinas & Jusan). It was already used in a previous work about residential relocation in Marrakesh (Laatabi et al.). Statistics from a local administration called HCP (http://rgphentableaux.hcp.ma/) give household growth and building rates.
Individual decision-making - each household agent decides whether to move from its current dwelling to a new one. By performing this action, it seeks to maximize its housing satisfaction, i.e. to minimize the distance between its current choice and its preference. The decision is taken in two stages: an inter-district choice and an intra-district (dwelling) choice. An agent moves when it finds a house closer to its profile. Constraints such as income and household size are taken into account in the distance calculation. At each step, some attributes may change; as a consequence, the household may move again to adjust its dwelling to its new needs.
Learning - the decision process does not include any learning.
Individual sensing - every household has only a limited random set of choices (dwellings) at every time step. The chosen dwelling may not be the best one, and the household may never find the house closest to its preference, as its knowledge and perception of the environment are limited. The decision is therefore made under uncertainty.
Individual prediction - the agents do not make any predictions.
Interaction - interactions occur between household, dwelling, and district agents. Each household compares its housing needs (preference) to its current choice (dwelling) and to its available choices before making a decision. Due to the limited number of available dwellings, households compete to access these limited resources. The model is thus based only on stigmergic interactions, not direct ones.
Collectives - dwellings are located in a unique district. Dwellings inside a district are impacted by the evolution of the housing stock and by human migration. Households of the same district also form a community that is altered by migration and population growth.

Implementation details
The model is implemented in the GAML language on the GAMA platform (Taillandier et al. ), an open-source, multi-platform software for multi-agent simulations.

Initialization
The simulation is initialized with the available statistics saved in the GIS shapefile (numbers of dwellings and households, mean values for household and dwelling sizes). Continuous state variables are initialized using a normal distribution; categorical variables are initialized using a uniform law. All data used are loaded from source files when the model is initialized.
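The initialization rule above can be sketched as follows, in Python for clarity (the model itself is written in GAML). The mean value and the standard deviation used here are illustrative assumptions, not the actual shapefile statistics.

```python
import random

def init_continuous(mean, sd=1.0):
    # Continuous state variables: drawn from a normal distribution
    # around the empirical mean read from the shapefile.
    return random.gauss(mean, sd)

def init_categorical(categories):
    # Categorical state variables: drawn from a uniform law
    # over the categories (e.g. the low/medium/high standing levels).
    return random.choice(categories)

random.seed(0)
hh_size = max(1, round(init_continuous(mean=4.6)))       # e.g. hh_mean_size
standing = init_categorical(["low", "medium", "high"])   # standing levels
```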

Input data
The ODD+2D protocol extends this block. To avoid repetition, we detail this part in the following section, "ODD+2D Improvements to Describe a Model".

Submodels
This block explains the agents' behaviors. At each time step (each month), households consider their state and try to find a better option, e.g. moving to a dwelling near their preferred area.

Three main behaviors conduct household dynamics:
• change_location - the household checks whether any of the selected free dwellings is closer to its preference than its current choice. If so, it moves and updates all dependent variables (current district, current dwelling, degree of satisfaction, number of previous moves). This behavior is executed with a probability p1.
• grow - the household gives birth to a new household with similar social characteristics, with a probability p2.
• income - the household may increase or decrease its total revenue and change its social class. Such a change affects the decision to move or stay, as well as the choice of the next destination. This behavior runs with a specific probability p3.
Dwelling dynamics are governed by a behavior called adjust. According to household needs (family size - hh_size) and capabilities (financial status - income), this behavior adjusts the standing (standing) and the number of bedrooms, which affects the surface area (surface) of the building.
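The probabilistic scheduling of the three household behaviors can be sketched as follows. This is a hedged Python illustration of the monthly step; the Household stub, the probability values, and the sequential firing order are assumptions for demonstration (the real agents are defined in GAML).

```python
import random

class Household:
    """Minimal stub; the real agent also carries income, size, district..."""
    def __init__(self):
        self.log = []
    def change_location(self):
        self.log.append("move")      # move and update dependent variables
    def grow(self):
        self.log.append("grow")      # spawn a household with similar traits
    def update_income(self):
        self.log.append("income")    # revenue shift may change social class

def monthly_step(hh, p1, p2, p3):
    # Each behavior fires independently with its own probability.
    if random.random() < p1:
        hh.change_location()
    if random.random() < p2:
        hh.grow()
    if random.random() < p3:
        hh.update_income()

random.seed(1)
hh = Household()
monthly_step(hh, p1=0.3, p2=0.05, p3=0.1)
```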

ODD+2D Improvements to Describe a Model
The improvements of ODD+2D focus on data description, which takes place inside the "Input Data" block of ODD+D. ODD+2D distinguishes four sub-blocks called Data Overview, Data Structure, Data Mapping, and Data Patterns. The "Data Overview" block incorporates the contents of the old "Input Data" block of the ODD and ODD+D protocols, while the three other rubrics introduce new details that are important for understanding and successfully replicating the agent-based model. The use of these extended rubrics may help bridge the gap between data and model and push forward the resolution of the related problems introduced in the section "Shortcomings of the ODD protocol for describing data linking".

Input data
Data overview
Data are synthesized from a survey designed to collect information about residential mobility and households in Marrakesh, their housing choices, and their preferences. The database is supplemented with broad statistics of Marrakesh extracted from the General Census of Population and Housing (GCPH ). The map we simulate comes from a GIS shapefile of the city obtained from an online database (openstreetmap.org/node/ , January ). The original dataset is composed of variables extracted from the survey questions, grouped into four categories: household, dwelling, choice, and preference attributes.

Data structure
A data analysis and design using UML led to classifying all our selected variables into three classes (Table ).
Household describes the household as the entity chosen to model the population. We define the household as an atomic element to reduce complexity, because the decision to move is taken at this level. Dwelling represents the house as the housing unit, but the district can also be used for macro-level modeling. The city of Marrakesh is composed of six districts (zones), and every District represents a collection of dwellings.

Data mapping
The overall mapping diagram is depicted in Figure , which represents the transformations and operations applied to the data before they are loaded into the model:
• Households and Dwellings - these two population-synthesis operations generate agents for each district: hh2004 households and dw2004 dwellings. These operations use the conditional probabilities method.
• standing_transfor - creates the variable score_standing of the agent entity District by aggregating three variables (h_standing, m_standing, l_standing) from the data class District. This aggregation uses the simple mean function.
• income_transfor - converts the attribute income of type float to a state variable of type integer (social_class), expressing the social class of the household in three categories.
• distance_transfor -uses two attributes from Household data entity (hh_size, income), and two attributes from Dwelling (dw_size, standing) to build a new state variable distance which represents the Euclidean distance between a household and a dwelling.
• area_transfor - builds the variable surface of the agent Dwelling by multiplying two attributes of the Dwelling data entity (dw_size, room_size). This variable expresses the surface of a dwelling based on the number of bedrooms and a mean room area.
• Moving decision -expresses the dependence between the decision to move and two variables of the Household entity (tenure and moves5y).
• Housing choice -expresses the dependence between the chosen destination and the principal variables of the two entities: household and dwelling. These variables are the same as those used to calculate the distance variable by the previous distance_transfor pattern. This variable is used to decide what dwelling to choose.
• Financial change - this dependence captures the positive correlation between household size (hh_size) and dwelling size (dw_size). When there are new individuals in the family, the household has to move or the dwelling has to be adjusted. The household income restricts this operation.
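Three of the mapping operations above can be sketched in Python. The mean aggregation and the multiplication follow the text directly; the income thresholds, however, are pure assumptions, since the paper states only that income is discretized into three social classes.

```python
def standing_score(l_standing, m_standing, h_standing):
    """standing_transfor: aggregate the three standing rates with a mean."""
    return (l_standing + m_standing + h_standing) / 3.0

def social_class(income, low=2000.0, high=6000.0):
    """income_transfor: float income -> integer social class (1, 2 or 3).
    The `low` and `high` cut-offs are illustrative assumptions."""
    if income < low:
        return 1
    return 2 if income < high else 3

def surface(dw_size, room_size):
    """area_transfor: number of bedrooms x mean room area."""
    return dw_size * room_size
```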

Data patterns
• Distance transformation: the distance between a household and a dwelling is a function of four variables; each household chooses the available dwelling that minimizes this distance ( ):

∀ hh ∈ households, dw ∈ dwellings: hh.d ← dw such that hh.distance(dw) = min_i hh.distance(dw_i) ( )

Additional constraints are applied to the three agents (District, Household, and Dwelling) as depicted in Figure . These patterns guarantee the integrity of the model and ensure that the simulation runs to completion.

Discussion and benefits of newly added parts
ODD+2D is a guideline that organizes information and ideas according to a predetermined structure. This architecture prevents repetition and favors understanding. Nevertheless, modelers remain free in their choices when describing a complex system.
Classical ODD descriptions do not give enough information about data use and imputation in agent-based models. ODD+2D responds to this lack by (i) keeping the role of the Input Data block of ODD+D through "Data Overview"; and (ii) expanding it with new blocks ("Data Structure", "Data Mapping", and "Data Patterns"). These newly added parts of the Input Data block are necessary to understand the relationship between data and model, as explained before. The two sections "Entities, state variables and scales" and "Input data" may seem redundant, but they are complementary: the former describes the agent entities and variables of the model, while the latter describes the data entities, their attributes, and the link between data and the model.
A clear distinction between "Data Structure" and "Data Mapping" has to be made. Both parts detail the structure of empirical data, but "Data Structure" describes the structure of the native data coming from the literature and stakeholders for the case study, whereas "Data Mapping" focuses on the data actually used and their link to the agent-based model. Some information is repeated, but this shows which data are used and which are not. Note that in future work, the "Initialization" and "Input data" blocks should be switched to be in accordance with empirical models, where data have an impact on the initialization process. ODD+2D will undoubtedly benefit researchers in combining models and data because it favors:
• Data-analysis transparency by describing the data preprocessing.
• Avoiding overly complicated models by highlighting only reliable data.
• Establishing the link between data and agents by using a dedicated graphical language.
• Model engineering by software that generates the ODD+2D description and the GAML implementation model in a semi-automatic way.
• Interpreting simulation results by keeping in mind the context of research, from the beginning to the end of the modeling process.
• Validating models by controlling and documenting the modeling process.
• Readability of ODD descriptions by grouping information about data under one block and specifying which information is required.

Conclusions and Perspectives
In this paper, we argue that the use of data should be described in more detail to ease the understanding, development, validation, replication, and dissemination of models. The ODD+2D description, through the DAMap diagram, prompts the user to choose data, analyze them, and link them to an agent-based model. Establishing these direct links synthesizes the whole experience of the participants in the modeling process.

The DAMap diagram is based on graphical languages inspired by UML. A graphical user interface permits drawing such a visual model and generating the GAML implementation model and the ODD+2D textual description. This interface becomes the shared space in which to discuss the knowledge and data to consider.
ODD+2D intends to describe the role of data inside an agent-based model. It takes advantage of the ODD and ODD+D protocols because of their efficiency and popularity. ODD+2D improves ODD+D by providing new building blocks dedicated to data integration in ABMs. It keeps ODD's philosophy and recommendations: generic, structured, and detailed.
Four blocks are added to Input Data. Data Overview and Data Structure give a thorough description of the input data; Data Mapping and Data Patterns detail the migration of data into the model.
ODD+2D and the DAMap diagram have already been used to model the residential mobility of Marrakesh. This work was completed in collaboration with the local housing observatory administration. Thanks to this approach, we convinced stakeholders of the perspectives offered by agent-based modeling to simulate mobility based on their data. This work shows the efficiency of the approach in supporting multidisciplinary exchanges and integrating data into ABMs.
Stakeholders' acceptance of the approach depends on the ease of drawing models with empirical concepts and of running simulations as a shared game. The DAMap graphical interface responds to this challenge. ODD+2D should also be disseminated among the scientific community; its use in various case studies may provide the experience needed to evaluate it and add improvements.
The panel of ODD add-ons is expanding as ODD is increasingly used. A jungle of extensions may appear in the next few years, which may hinder efforts to make ODD a standard for the complex-systems community. Therefore, extending the ODD protocol has to be regulated by creating an ODD "meta-protocol" (as MOF is for UML). Such a meta-protocol gives a set of rules to normalize ODD extensions. It also: (i) limits conflicts and redundancies between extensions; and (ii) permits merging several extensions for a given case study to benefit from the features of each of them. For example, to design and develop a participatory game of urban migration in Marrakesh, CoODD (Collaborative ODD) and ODD+2D could be merged to introduce both urban data and collaboration description into an agent-based model of urban mobility.

Appendix B: ODD+2D description of (Klabunde et al. )
This appendix extends the "Input data" rubric of the ODD+D description of an agent-based decision model of migration embedded in the life course (Klabunde et al. ).
As mentioned in the previous section "Shortcomings of existing descriptions", this model, described with ODD+D, still has some issues in terms of its relationship with empirical data. Klabunde et al. ( ) give a short introduction of the dataset used in their model under the "Design Concepts / Theoretical and empirical background" building block, specifying the partnerships and funding of the project. They then outline which data are available at the individual level. In the next rubric, "Individual Decision-Making", they specify that the dataset comprises information on individuals migrating to a wide range of different countries. After that, in the "Initialization" rubric, they state that initial values will be based on data. In the "Input Data" element, the model is said to use external data files for all the demographic processes. The "Submodels" element specifies the parameters that are estimated from data: (i) the waiting-time distributions between demographic transitions; (ii) the probability of being married to specific individuals with given characteristics; and (iii) the maximum number of children.
As this summarized description shows, information about data is distributed throughout the document, and it is hard to gain clear insight into data integration while reading it. The handicap of the ODD protocols in terms of the data-model connection is obvious, and our extension helps bridge this gap by (i) grouping the data description under one block; and (ii) giving more details about the data structure and its mapping to model agents.
The whole description is given in the original paper. We focus here only on the new blocks added by ODD+2D.

Data overview
The model uses external input files from the MAFE-Senegal data (http://mafeproject.site.ined.fr/fr/donnees/). All the demographic processes are computed using the MicSim package in R, through the 'r'-extension in NetLogo (Klabunde et al. ). Data about migration from Senegal to many countries are available at the household and individual levels. The MAFE-Senegal survey contains individuals and variables.

Data structure
In the MAFE data, all survey weights have been normalized. Individuals are organized in households and can be connected through a social network. The dataset contains files, of which the two principal ones are:
• "sn_qm_household" - containing information about households with attributes;
• "sn_qm_indiv" - representing individuals with attributes.
The dataset is too complicated to be represented by a single UML diagram or data table, so the solution is to represent only the selected and related data. Since we did not develop the model, we cannot identify exactly which data are used to build the model entities. Nevertheless, we give a sample (Table ) to show the data-model mapping according to our understanding.

Data mapping
Given the ODD+D description (Klabunde et al. ), we deduce the DAMap diagram ( Figure ). It gives the structure of the agent-based model (in yellow), the data structure model (in blue), and the mapping between these two submodels (through linked mapping patterns). The data submodel is a sample of the available data; only the attributes used are shown in this diagram. For the others, the reader can refer to the tables "sn_qm_household" and "sn_qm_indiv" at https://mafeproject.site.ined.fr/fichier/rte/29/Codebooksenegalfr.pdf. Note that the mapping patterns depicted in Figure are deduced from the initial ODD+D description (Klabunde et al. ). Due to the lack of information about both the data and the mapping links, the mapping we give here is undoubtedly incomplete, and details of some patterns (e.g. mortality rates) had to be improvised.
• Income - aggregates two attributes, hworkers and hhwealth, into one state variable capital;
• Departure - creates the variable migration_stage from a set of attributes in the entity sn_qm_indiv;
• Mortality rates - are assumed to depend on age and gender.
• Marriage rates -are assumed to depend on age only for the unmarried population from the age of until the age of .
• Dissolution of marriage - depends on age and duration of the marriage.
• Childbirth -fertility rates depend on age, marital status and time elapsed since last birth.
• Wages -depend on location (home country or host country) and are drawn from an empirically determined distribution.
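The "Income" pattern above can be sketched as follows. This is a heavily hedged illustration: the description states only that hworkers and hhwealth are aggregated into one state variable capital; the weighted-sum form and the weights below are purely hypothetical, as the actual aggregation function is not given.

```python
def capital(hworkers, hhwealth, w_work=1.0, w_wealth=1.0):
    """Aggregate the number of working members (hworkers) and household
    wealth (hhwealth) into a single `capital` state variable.
    The weighted-sum form is an assumption, not the model's formula."""
    return w_work * hworkers + w_wealth * hhwealth
```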

Data patterns
Given the ODD+D description of Klabunde et al. ( ), the following data pattern is identified: Intention to migrate - based on many elements such as the agent's attitude, social norms, and the individuals' perceptions of their ability to perform a migration.
Additional constraints are applied to the two agents (Household and Individual) as depicted in Figure . These patterns guarantee the integrity of the model and ensure that the simulation runs to completion.

Appendix C: ODD+2D description of (Filatova )
This appendix extends the "Input data" block of the ODD+D description of an empirical agent-based land market model integrating adaptive economic behavior in urban land-use models (Filatova ).
Under "Entities, state variables and scales", Filatova ( ) presents a spatially explicit model based on GIS (Geographical Information System) and cadaster data coming from different sources. These empirical data are used to initialize the spatial landscape and to determine agents' properties ("Theoretical and empirical background"). In the "Input Data" rubric, the author specifies that, during initialization, the model uploads vector data from multiple GIS datasets. The paper also proposes a UML class diagram of the bilateral housing market: agents, their properties, and their functions. As in the previous example, it is hard to understand which data are used, and how they are loaded into the agent-based model, by reading only the given ODD+D description. Additional details are required to fully understand and replicate the model with its data basis.
The whole description is given in the original paper. Additional data can be found in Bin et al. ( ). We focus here only on the new blocks added by ODD+2D.

Input data Data overview
"RHEA is applied to the coastal town of Beaufort. The area is in general low lying and is prone to flooding with a probability of : and : in certain zones. At initialization, RHEA uploads vector data from multiple GIS data-sets on the locations of residential housing, coastal amenities (measured regarding distance from coastal water and sound, and a Boolean measure of waterfront), flood probabilities, distances to the CBD and national parks, and data on structural characteristics of properties. Distance to CBD in the GIS dataset is measured as the distance to the nearest main employment center in the area - a neighboring town Morehead (Bin et al. ). Also at initialization, realtor-agents get the empirical hedonic function (Bin et al. ) based on the real estate transactions from to after a period of active hurricane seasons from the middle of the s to . Data on households' incomes and preferences is taken from various sources". Extracted from the ODD+D description of Filatova ( ).
The model uses GIS and cadaster data (flood zones and residential property sales) from Carteret County, North Carolina. These data are produced by the National Flood Insurance Program.

Data structure
The "Entity" data entity (Table ) is the result of merging many sources of unknown origin. In Filatova ( ), no data are related to the Households, Market, and Realtors agent entities, so their data structures and the meaning of their attributes cannot be retrieved to consolidate the data used.

Data mapping
The DAMap diagram (see Figure ) shows that (i) the available data are summarized by the "Entity" data entity; and (ii) four agent entities are identified in the ABM (Parcels, Households, Realtors, and Market). Due to the lack of information about the data structure, Households, Realtors, and Market cannot be associated with the dataset. The Parcels entity is linked to the "Entity" data entity by the following mapping patterns:
• Sales price - parcel prices depend on their location.
• Bathrooms - the BATHRM attribute is converted to type float.
• Aging - the AGE attribute is converted to type float.
• Flooding - three variables (FLOOD, FLOOD , and FLOOD ) are aggregated to build the Boolean state variable probabilityOfFlood.
• Coastal amenities - the state variable distanceAmen of the agent Parcels depends on the result of the previous pattern (Flooding).

Data patterns
The following data patterns were deduced from two works using the same dataset, Bin et al. ( ) and Filatova ( ):
• We identify a spatial dependence in the data: residential properties sharing common features tend to cluster in space.
• Sales prices tend to cluster in space because houses in a neighborhood share similar location amenities.
• There is a strong positive correlation between coastal amenities and flood hazard.
Additional constraints are applied to the four agents (Parcels, Households, Realtors, and Market) as depicted in Figure . These patterns guarantee the integrity of the model and ensure that the simulation runs to completion.
Figure : DAMap diagram showing the mapping between the data entity "Entity" and the Parcels agent. The absence of other data sources makes the origin of the other agents opaque and unclear.