`R/BIOMOD_FormatingData.R`

`BIOMOD_FormatingData.Rd`

This function gathers together all input data needed (*xy,
presences/absences, explanatory variables, and the same for evaluation data if available*) to
run biomod2 models. It can also select pseudo-absences when no absence data is available,
using one of several strategies (see Details).

```
BIOMOD_FormatingData(
resp.name,
resp.var,
expl.var,
dir.name = ".",
resp.xy = NULL,
eval.resp.var = NULL,
eval.expl.var = NULL,
eval.resp.xy = NULL,
PA.nb.rep = 0,
PA.nb.absences = 1000,
PA.strategy = NULL,
PA.dist.min = 0,
PA.dist.max = NULL,
PA.sre.quant = 0.025,
PA.fact.aggr = NULL,
PA.user.table = NULL,
na.rm = TRUE,
filter.raster = FALSE,
seed.val = NULL
)
```

- resp.name
  a `character` corresponding to the species name
- resp.var
  a `vector`, a `SpatVector` without associated data (*if presence-only*), or a `SpatVector` object containing binary data (`0`: absence, `1`: presence, `NA`: indeterminate) for a single species that will be used to build the species distribution model(s). *Note that old formats from sp are still supported, such as* `SpatialPoints` (if presence-only) or `SpatialPointsDataFrame` objects containing binary data.
- expl.var
  a `matrix`, `data.frame`, `SpatVector` or `SpatRaster` object containing the explanatory variables (in columns or layers) that will be used to build the species distribution model(s). *Note that old formats from raster and sp are still supported, such as* `RasterStack` and `SpatialPointsDataFrame` objects.
- dir.name
  (*optional, default* `.`) A `character` corresponding to the modeling folder
- resp.xy
  (*optional, default* `NULL`) If `resp.var` is a `vector`, a 2-column `matrix` or `data.frame` containing the corresponding `X` and `Y` coordinates that will be used to build the species distribution model(s)
- eval.resp.var
  (*optional, default* `NULL`) A `vector`, a `SpatVector` without associated data (*if presence-only*), or a `SpatVector` object containing binary data (`0`: absence, `1`: presence, `NA`: indeterminate) for a single species that will be used to evaluate the species distribution model(s) with independent data. *Note that old formats from sp are still supported, such as* `SpatialPoints` (if presence-only) or `SpatialPointsDataFrame` objects containing binary data.
- eval.expl.var
  (*optional, default* `NULL`) A `matrix`, `data.frame`, `SpatVector` or `SpatRaster` object containing the explanatory variables (in columns or layers) that will be used to evaluate the species distribution model(s) with independent data. *Note that old formats from raster and sp are still supported, such as* `RasterStack` and `SpatialPointsDataFrame` objects.
- eval.resp.xy
  (*optional, default* `NULL`) If `eval.resp.var` is a `vector`, a 2-column `matrix` or `data.frame` containing the corresponding `X` and `Y` coordinates that will be used to evaluate the species distribution model(s) with independent data
- PA.nb.rep
  (*optional, default* `0`) If pseudo-absence selection, an `integer` corresponding to the number of sets (repetitions) of pseudo-absence points that will be drawn
- PA.nb.absences
  (*optional, default* `1000`) If pseudo-absence selection, and `PA.strategy = 'random'` or `PA.strategy = 'sre'` or `PA.strategy = 'disk'`, an `integer` corresponding to the number of pseudo-absence points that will be selected for each pseudo-absence repetition (true absences included). It can also be a `vector` of length `PA.nb.rep` containing `integer` values corresponding to the different numbers of pseudo-absences to be selected
- PA.strategy
  (*optional, default* `NULL`) If pseudo-absence selection, a `character` defining the strategy that will be used to select the pseudo-absence points. Must be `random`, `sre`, `disk` or `user.defined` (see Details)
- PA.dist.min
  (*optional, default* `0`) If pseudo-absence selection and `PA.strategy = 'disk'`, a `numeric` defining the minimal distance to presence points used to make the `disk` pseudo-absence selection (in the same projection system units as `resp.xy` and `expl.var`, see Details)
- PA.dist.max
  (*optional, default* `NULL`) If pseudo-absence selection and `PA.strategy = 'disk'`, a `numeric` defining the maximal distance to presence points used to make the `disk` pseudo-absence selection (in the same projection system units as `resp.xy` and `expl.var`, see Details)
- PA.sre.quant
  (*optional, default* `0.025`) If pseudo-absence selection and `PA.strategy = 'sre'`, a `numeric` between `0` and `0.5` defining the half-quantile used to make the `sre` pseudo-absence selection (see Details)
- PA.fact.aggr
  (*optional, default* `NULL`) If `PA.strategy = 'random'` or `PA.strategy = 'disk'`, an `integer` defining the factor of aggregation to reduce the resolution
- PA.user.table
  (*optional, default* `NULL`) If pseudo-absence selection and `PA.strategy = 'user.defined'`, a `matrix` or `data.frame` with as many rows as `resp.var` values, as many columns as `PA.nb.rep`, and containing `TRUE` or `FALSE` values defining which points will be used to build the species distribution model(s) for each repetition (see Details)
- na.rm
  (*optional, default* `TRUE`) A `logical` value defining whether points having one or several missing values for explanatory variables should be removed from the analysis or not
- filter.raster
  (*optional, default* `FALSE`) If `expl.var` is of raster type, a `logical` value defining whether `resp.var` is to be filtered when several points occur in the same raster cell
- seed.val
  (*optional, default* `NULL`) An `integer` value corresponding to the new seed value to be set
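
The shapes expected for `resp.var` and `resp.xy` can be checked before calling the function. The following is a minimal base R sketch, assuming a hypothetical occurrence table `occ` whose column names (`x`, `y`, `occurrence`) are purely illustrative and not part of biomod2:

```r
# Hypothetical occurrence table; column names are illustrative only
occ <- data.frame(
  x = c(5.1, 7.3, 10.2, 12.8, 15.0),
  y = c(46.0, 48.5, 51.2, 60.1, 65.4),
  occurrence = c(1, 0, 1, NA, 1)
)

# resp.var: a numeric vector restricted to 0 / 1 / NA
my_resp <- as.numeric(occ$occurrence)
stopifnot(all(my_resp %in% c(0, 1, NA)))

# resp.xy: a 2-column data.frame with one row per value of resp.var
my_resp_xy <- occ[, c("x", "y")]
stopifnot(ncol(my_resp_xy) == 2, nrow(my_resp_xy) == length(my_resp))

# Presence-only variant: every non-presence becomes NA,
# i.e. a candidate for pseudo-absence selection
my_resp_pa <- ifelse(my_resp %in% 1, 1, NA)
```

`my_resp` (or `my_resp_pa`) and `my_resp_xy` would then be passed as `resp.var` and `resp.xy`.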

A `BIOMOD.formated.data` object that can be used to build species distribution
model(s) with the `BIOMOD_Modeling` function.

`print/show`, `plot` and `summary` functions
are available to have a summary of the created object.

This function gathers and formats all input data needed to run biomod2 models. It
supports different kinds of inputs (e.g. `matrix`, `SpatVector`, `SpatRaster`)
and provides different methods to select pseudo-absences if needed.

**Concerning explanatory variables and XY coordinates:**

- if a `SpatRaster`, `RasterLayer` or `RasterStack` is provided for `expl.var` or `eval.expl.var`, biomod2 will extract the corresponding values from the XY coordinates provided:
  - either through `resp.xy` or `eval.resp.xy` respectively
  - or through `resp.var` or `eval.resp.var`, if provided as `SpatVector` or `SpatialPointsDataFrame`

  *Be sure to give the objects containing XY coordinates in the same projection system as the raster objects!*
- if a `data.frame` or `matrix` is provided for `expl.var` or `eval.expl.var`, biomod2 will simply merge it (`cbind`) with `resp.var` without considering XY coordinates. *Be sure to give explanatory and response values in the same row order!*
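
Because tabular explanatory variables are combined with the response by position, it can help to align both tables on a shared key beforehand. The sketch below uses base R only; the `site` identifier column and both tables are assumptions made for illustration, not something biomod2 requires:

```r
# Hypothetical tables sharing a site identifier ('site' is an assumed column)
resp_tab <- data.frame(site = c("s1", "s2", "s3", "s4"),
                       presence = c(1, 0, 1, NA))
expl_tab <- data.frame(site = c("s3", "s1", "s4", "s2"),
                       bio_3 = c(30.1, 25.4, 41.0, 28.7),
                       bio_12 = c(800, 650, 1200, 720))

# Position of each response site within the explanatory table
idx <- match(resp_tab$site, expl_tab$site)
stopifnot(!anyNA(idx))  # every response site needs explanatory values

# Reorder the explanatory rows so that they follow the response rows
my_expl <- expl_tab[idx, c("bio_3", "bio_12")]
rownames(my_expl) <- NULL
my_resp <- resp_tab$presence

stopifnot(nrow(my_expl) == length(my_resp))
```

After this step, row `i` of `my_expl` describes the same site as element `i` of `my_resp`.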

**Concerning pseudo-absence selection (see `bm_PseudoAbsences`):**

- if both presence and absence data are available, and there are enough absences: set `PA.nb.rep = 0` and no pseudo-absence will be selected.
- if no absence data is available, several pseudo-absence repetitions are recommended (to estimate the effect of pseudo-absence selection), as well as a high number of pseudo-absence points. *Be sure not to select more pseudo-absence points than the maximum number of pixels in the studied area!*
- it is now possible to create several pseudo-absence repetitions with different numbers of points, BUT with the same sampling strategy.
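
The constraints on `PA.nb.rep` and `PA.nb.absences` can be checked by hand before formatting the data. This is a rough base R sketch, not biomod2's own validation; `n_background` is a hypothetical count of available background pixels (or `NA` points) that you would compute from your own data:

```r
pa_nb_rep      <- 4
pa_nb_absences <- c(1000, 500, 500, 200)
n_background   <- 2500  # hypothetical number of available background points

# Either a single value shared by all repetitions, or one value per repetition
stopifnot(length(pa_nb_absences) %in% c(1, pa_nb_rep))

# Expand a single value to one value per repetition
pa_nb_absences <- rep_len(pa_nb_absences, pa_nb_rep)

# Flag repetitions that request more points than the background can offer
too_many <- pa_nb_absences > n_background
if (any(too_many)) {
  warning("Repetition(s) ", paste(which(too_many), collapse = ", "),
          " request more pseudo-absences than available background points")
}
```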

- Response variable
  biomod2 models a single species at a time (no multi-species). Hence, `resp.var` must be a uni-dimensional object (either a `vector`, a one-column `matrix` or `data.frame`, a `SpatVector` (*without associated data - if presence-only*), a `SpatialPoints` (*if presence-only*), a `SpatialPointsDataFrame` or `SpatVector` object), containing values among:
  - `1`: presences
  - `0`: true absences (if any)
  - `NA`: no information point (might be used to select pseudo-absences if any)

  If no true absences are available, pseudo-absence selection must be done.

  If `resp.var` is a non-spatial object (`vector`, `matrix` or `data.frame`), XY coordinates must be provided through `resp.xy`.

  If pseudo-absence points are to be selected, `NA` points must be provided in order to select pseudo-absences among them.
- Explanatory variables
  Factorial variables are allowed, but might lead to some pseudo-absence strategies or models being omitted (e.g. `sre`).
- Evaluation data
  Although biomod2 provides tools to automatically divide the dataset into calibration and validation parts during the modeling process (see the `CV.[..]` parameters of the `BIOMOD_Modeling` function, or the `bm_CrossValidation` function), it is also possible (and strongly advised) to directly provide two independent datasets, one for calibration/validation and one for evaluation.
- Pseudo-absence selection (see `bm_PseudoAbsences`)
  If no true absences are available, pseudo-absences must be selected from the *background data*, meaning data for which there is no information on whether the species of interest occurs or not. It corresponds either to the remaining pixels of `expl.var` (if provided as a `SpatRaster` or `RasterStack`) or to the points identified as `NA` in `resp.var` (if `expl.var` is provided as a `matrix` or `data.frame`).

  Several methods are available to do this selection:
  - random
    all points of the initial background are pseudo-absence candidates. `PA.nb.absences` points are drawn randomly, for each of the `PA.nb.rep` repetitions requested.
  - sre
    pseudo-absences have to be selected in conditions (combination of explanatory variables) that differ in a defined proportion (`PA.sre.quant`) from those of presence points. A *Surface Range Envelop* model is first run over the species of interest (see `bm_SRE`), and pseudo-absences are selected outside this envelop. *This case is appropriate when all the species climatic niche has been sampled, otherwise it may lead to over-optimistic model evaluations and predictions!*
  - disk
    pseudo-absences are selected within circles around presence points defined by `PA.dist.min` and `PA.dist.max` distance values (in the same projection system units as `resp.xy` and `expl.var`). This makes it possible to select pseudo-absence points that are neither too close to presences (avoiding same niche and pseudo-replication) nor too far from them (localized sampling strategy).
  - user.defined
    pseudo-absences are defined in advance and given as a `data.frame` through the `PA.user.table` parameter.
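
To make the `disk` idea more concrete, here is a small base R sketch of distance-based candidate filtering. It is an illustration under stated assumptions, not biomod2's implementation (see `bm_PseudoAbsences` for the actual behaviour): coordinates are assumed planar so that Euclidean distance is meaningful, a candidate is kept when its nearest presence lies between the two bounds, and all object names (`presences`, `candidates`, `dist_min`, `dist_max`) are hypothetical.

```r
set.seed(42)

# Hypothetical presence points and background candidates (planar coordinates)
presences  <- data.frame(x = c(0, 10), y = c(0, 0))
candidates <- expand.grid(x = seq(-20, 30, by = 5), y = seq(-20, 20, by = 5))

dist_min <- 5   # plays the role of PA.dist.min
dist_max <- 15  # plays the role of PA.dist.max

# Euclidean distance from every candidate (rows) to every presence (columns)
d <- sqrt(outer(candidates$x, presences$x, "-")^2 +
          outer(candidates$y, presences$y, "-")^2)

# Distance from each candidate to its nearest presence
nearest <- apply(d, 1, min)

# Keep candidates that are neither too close to nor too far from presences
eligible <- candidates[nearest >= dist_min & nearest <= dist_max, ]

# Draw the pseudo-absences among eligible candidates (at most 20 here)
n_draw     <- min(20, nrow(eligible))
pseudo_abs <- eligible[sample(nrow(eligible), n_draw), ]
```

With longitude/latitude coordinates, Euclidean distance is not appropriate and a geodesic distance would be needed instead.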

`bm_PseudoAbsences`, `BIOMOD_Modeling`

Other Main functions: `BIOMOD_EnsembleForecasting()`, `BIOMOD_EnsembleModeling()`, `BIOMOD_LoadModels()`, `BIOMOD_Modeling()`, `BIOMOD_Projection()`, `BIOMOD_RangeSize()`

```
library(terra)
# Load species occurrences (6 species available)
data(DataSpecies)
head(DataSpecies)
# Select the name of the studied species
myRespName <- 'GuloGulo'
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])
# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]
# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)
# Crop the variables to a smaller extent to keep the example fast
myExtent <- terra::ext(0,30,45,70)
myExpl <- terra::crop(myExpl, myExtent)
# ---------------------------------------------------------------#
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
                                     expl.var = myExpl,
                                     resp.xy = myRespXY,
                                     resp.name = myRespName)
myBiomodData
summary(myBiomodData)
plot(myBiomodData)
# ---------------------------------------------------------------#
# # Transform true absences into potential pseudo-absences
# myResp.PA <- ifelse(myResp == 1, 1, NA)
#
# # Format Data with pseudo-absences : random method
# myBiomodData.r <- BIOMOD_FormatingData(resp.var = myResp.PA,
# expl.var = myExpl,
# resp.xy = myRespXY,
# resp.name = myRespName,
# PA.nb.rep = 4,
# PA.nb.absences = 1000,
# PA.strategy = 'random')
#
# # Format Data with pseudo-absences : disk method
# myBiomodData.d <- BIOMOD_FormatingData(resp.var = myResp.PA,
# expl.var = myExpl,
# resp.xy = myRespXY,
# resp.name = myRespName,
# PA.nb.rep = 4,
# PA.nb.absences = 500,
# PA.strategy = 'disk',
# PA.dist.min = 5,
# PA.dist.max = 35)
#
# # Format Data with pseudo-absences : SRE method
# myBiomodData.s <- BIOMOD_FormatingData(resp.var = myResp.PA,
# expl.var = myExpl,
# resp.xy = myRespXY,
# resp.name = myRespName,
# PA.nb.rep = 4,
# PA.nb.absences = 1000,
# PA.strategy = 'sre',
# PA.sre.quant = 0.025)
#
# # Format Data with pseudo-absences : user.defined method
# myPAtable <- data.frame(PA1 = ifelse(myResp == 1, TRUE, FALSE),
# PA2 = ifelse(myResp == 1, TRUE, FALSE))
# for (i in 1:ncol(myPAtable)) myPAtable[sample(which(myPAtable[, i] == FALSE), 500), i] = TRUE
# myBiomodData.u <- BIOMOD_FormatingData(resp.var = myResp.PA,
# expl.var = myExpl,
# resp.xy = myRespXY,
# resp.name = myRespName,
# PA.strategy = 'user.defined',
# PA.user.table = myPAtable)
#
# myBiomodData.r
# myBiomodData.d
# myBiomodData.s
# myBiomodData.u
# plot(myBiomodData.r)
# plot(myBiomodData.d)
# plot(myBiomodData.s)
# plot(myBiomodData.u)
# ---------------------------------------------------------------#
# # Select multiple sets of pseudo-absences
#
# # Transform true absences into potential pseudo-absences
# myResp.PA <- ifelse(myResp == 1, 1, NA)
#
# # Format Data with pseudo-absences : random method
# myBiomodData.multi <- BIOMOD_FormatingData(resp.var = myResp.PA,
# expl.var = myExpl,
# resp.xy = myRespXY,
# resp.name = myRespName,
# PA.nb.rep = 4,
# PA.nb.absences = c(1000, 500, 500, 200),
# PA.strategy = 'random')
# myBiomodData.multi
# summary(myBiomodData.multi)
# plot(myBiomodData.multi)
```