`R/BIOMOD_EnsembleModeling.R`

`BIOMOD_EnsembleModeling.Rd`

This function allows to combine a range of models built with the
`BIOMOD_Modeling`

function in one (or several) ensemble model. Modeling
uncertainty can be assessed as well as variables importance, ensemble predictions can be
evaluated against original data, and created ensemble models can be projected over new
conditions (see Details).

```
BIOMOD_EnsembleModeling(
bm.mod,
models.chosen = "all",
em.by = "PA+run",
em.algo,
metric.select = "all",
metric.select.thresh = NULL,
metric.select.table = NULL,
metric.select.dataset = NULL,
metric.eval = c("KAPPA", "TSS", "ROC"),
var.import = 0,
EMci.alpha = 0.05,
EMwmean.decay = "proportional",
nb.cpu = 1,
seed.val = NULL,
do.progress = TRUE,
prob.mean,
prob.median,
prob.cv,
prob.ci,
committee.averaging,
prob.mean.weight,
prob.mean.weight.decay,
prob.ci.alpha
)
```

- bm.mod
a

`BIOMOD.models.out`

object returned by the`BIOMOD_Modeling`

function- models.chosen
a

`vector`

containing model names to be kept, must be either`all`

or a sub-selection of model names that can be obtained with the`get_built_models`

function- em.by
a

`character`

corresponding to the way kept models will be combined to build the ensemble models, must be among`all`

,`algo`

,`PA`

,`PA+algo`

,`PA+run`

- em.algo
a

`vector`

corresponding to the ensemble models that will be computed, must be among`'EMmean'`

,`'EMmedian'`

,`'EMcv'`

,`'EMci'`

,`'EMca'`

,`'EMwmean'`

- metric.select
a

`vector`

containing evaluation metric names to be used together with`metric.select.thresh`

to exclude single models based on their evaluation scores (for ensemble methods like probability weighted mean or committee averaging). Must be among`all`

(same evaluation metrics than those of`bm.mod`

),`user.defined`

(and defined through`metric.select.table`

) or`POD`

,`FAR`

,`POFD`

,`SR`

,`ACCURACY`

,`BIAS`

,`ROC`

,`TSS`

,`KAPPA`

,`OR`

,`ORSS`

,`CSI`

,`ETS`

,`BOYCE`

,`MPA`

- metric.select.thresh
(

*optional, default*`NULL`

)

A`vector`

of`numeric`

values corresponding to the minimum scores (one for each`metric.select`

) below which single models will be excluded from the ensemble model building- metric.select.table
(

*optional, default*`NULL`

)

If`metric.select = 'user.defined'`

, a`data.frame`

containing evaluation scores calculated for each single models and that will be compared to`metric.select.thresh`

values to exclude some of them from the ensemble model building, with`metric.select`

rownames, and`models.chosen`

colnames- metric.select.dataset
(

*optional, default*`'validation'`

*if possible*). A character determining which dataset should be used to filter and/or weigh the ensemble models should be among 'evaluation', 'validation' or 'calibration'.- metric.eval
a

`vector`

containing evaluation metric names to be used, must be among`POD`

,`FAR`

,`POFD`

,`SR`

,`ACCURACY`

,`BIAS`

,`ROC`

,`TSS`

,`KAPPA`

,`OR`

,`ORSS`

,`CSI`

,`ETS`

,`BOYCE`

,`MPA`

- var.import
(

*optional, default*`NULL`

)

An`integer`

corresponding to the number of permutations to be done for each variable to estimate variable importance- EMci.alpha
(

*optional, default*`0.05`

)

A`numeric`

value corresponding to the significance level to estimate confidence interval- EMwmean.decay
(

*optional, default*`proportional`

)

A value defining the relative importance of the weights (if`'EMwmean'`

was given to argument`em.algo`

). A high value will strongly discriminate*good*models from the*bad*ones (see Details), while`proportional`

will attribute weights proportionally to the models evaluation scores- nb.cpu
(

*optional, default*`1`

)

An`integer`

value corresponding to the number of computing resources to be used to parallelize the single models predictions and the ensemble models computation- seed.val
(

*optional, default*`NULL`

)

An`integer`

value corresponding to the new seed value to be set- do.progress
(

*optional, default*`TRUE`

)

A`logical`

value defining whether the progress bar is to be rendered or not- prob.mean
(

*deprecated*, please use`em.algo`

instead)

A`logical`

value defining whether to compute the mean probabilities across predictions or not- prob.median
(

*deprecated*, please use`em.algo`

instead)

A`logical`

value defining whether to compute the median probabilities across predictions or not- prob.cv
(

*deprecated*, please use`em.algo`

instead)

A`logical`

value defining whether to compute the coefficient of variation across predictions or not- prob.ci
(

*deprecated*, please use`em.algo`

instead)

A`logical`

value defining whether to compute the confidence interval around the`prob.mean`

ensemble model or not- committee.averaging
(

*deprecated*, please use`em.algo`

instead)

A`logical`

value defining whether to compute the committee averaging across predictions or not- prob.mean.weight
(

*deprecated*, please use`em.algo`

instead)

A`logical`

value defining whether to compute the weighted sum of probabilities across predictions or not- prob.mean.weight.decay
(

*deprecated*, please use`EMwmean.decay`

instead)

old argument name for`EMwmean.decay`

- prob.ci.alpha
(

*deprecated*, please use`EMci.alpha`

instead)

old argument name for`EMci.alpha`

A `BIOMOD.ensemble.models.out`

object containing models outputs, or links to saved
outputs.

Models outputs are stored out of R (for memory storage reasons) in 2 different
folders created in the current working directory :

a

*models*folder, named after the`resp.name`

argument of`BIOMOD_FormatingData`

, and containing all ensemble modelsa

*hidden*folder, named`.BIOMOD_DATA`

, and containing outputs related files (original dataset, calibration lines, pseudo-absences selected, predictions, variables importance, evaluation values...), that can be retrieved with`get_[...]`

or`load`

functions, and used by other biomod2 functions, like`BIOMOD_EnsembleForecasting`

- Models sub-selection (
`models.chosen`

) Applying

`get_built_models`

function to the`bm.mod`

object gives the names of the single models created with the`BIOMOD_Modeling`

function. The`models.chosen`

argument can take either a sub-selection of these single model names, or the`all`

default value, to decide which single models will be used for the ensemble model building.- Models assembly rules (
`em.by`

) Single models built with the

`BIOMOD_Modeling`

function can be combined in 5 different ways to obtain ensemble models :`PA+run`

: each combination of pseudo-absence and repetition datasets is done,*merging*algorithms together`PA+algo`

: each combination of pseudo-absence and algorithm datasets is done,*merging*repetitions together`PA`

: pseudo-absence datasets are considered individually,*merging*algorithms and repetitions together`algo`

: algorithm datasets are considered individually,*merging*pseudo-absence and repetitions together`all`

: all models are combined into one

Hence, depending on the chosen method, the number of ensemble models built will vary.

*Be aware that if no evaluation data was given to the*`BIOMOD_FormatingData`

function, some ensemble model evaluations may be biased due to difference in data used for single model evaluations.**Be aware that all of these combinations are allowed, but some may not make sense depending mainly on how pseudo-absence datasets have been built and whether all of them have been used for all single models or not (see**`PA.nb.absences`

and`models.pa`

parameters in`BIOMOD_FormatingData`

and`BIOMOD_Modeling`

functions respectively).- Evaluation metrics
: the selected metrics must be chosen among the ones used within the`metric.select`

`BIOMOD_Modeling`

function to build the`model.output`

object, unless`metric.select = 'user.defined'`

and therefore values will be provided through the`metric.select.table`

parameter.

In the case of the selection of several metrics, they will be used at different steps of the ensemble modeling function :remove

*low quality*single models, having a score lower than`metric.select.thresh`

perform the binary transformation needed if

`'EMca'`

was given to argument`em.algo`

weight models if

`'EMwmean'`

was given to argument`em.algo`

: as many values as evaluation metrics selected with the`metric.select.thresh`

`metric.select`

parameter, and defining the corresponding quality thresholds below which the single models will be excluded from the ensemble model building.: a`metric.select.table`

`data.frame`

must be given if`metric.select = 'user.defined'`

to allow the use of evaluation metrics other than those calculated within biomod2. The`data.frame`

must contain as many columns as`models.chosen`

with matching names, and as many rows as evaluation metrics to be used. The number of rows must match the length of the`metric.select.thresh`

parameter. The values contained in the`data.frame`

will be compared to those defined in`metric.select.thresh`

to remove*low quality*single models from the ensemble model building.: a`metric.select.dataset`

`character`

determining the dataset which evaluation metric should be used to filter and/or weigh the ensemble models. Should be among`evaluation`

,`validation`

or`calibration`

. By default`BIOMOD_EnsembleModeling`

will use the validation dataset unless no validation is available in which case calibration dataset are used.: the selected metrics will be used to validate/evaluate the ensemble models built`metric.eval`

- Ensemble-models algorithms
The set of models to be calibrated on the data.

6 modeling techniques are currently available :: Mean of probabilities over the selected models. Old name:`EMmean`

`prob.mean`

: Median of probabilities over the selected models`EMmedian`

The median is less sensitive to outliers than the mean, however it requires more computation time and memory as it loads all predictions (on the contrary to the mean or the weighted mean). Old name:`prob.median`

: Coefficient of variation (sd / mean) of probabilities over the selected models`EMcv`

This model is not scaled. It will be evaluated like all other ensemble models although its interpretation will be obviously different. CV is a measure of uncertainty rather a measure of probability of occurrence. If the CV gets a high evaluation score, it means that the uncertainty is high where the species is observed (which might not be a good feature of the model).*The lower is the score, the better are the models.*CV is a nice complement to the mean probability. Old name:`prob.cv`

&`EMci`

: Confidence interval around the mean of probabilities of the selected models`EMci.alpha`

It is also a nice complement to the mean probability. It creates 2 ensemble models :*LOWER*: there is less than`100 * EMci.alpha / 2`

% of chance to get probabilities lower than the given ones*UPPER*: there is less than`100 * EMci.alpha / 2`

% of chance to get probabilities upper than the given ones

These intervals are calculated with the following function : $$I_c = [ \bar{x} - \frac{t_\alpha sd }{ \sqrt{n} }; \bar{x} + \frac{t_\alpha sd }{ \sqrt{n} }]$$

Old parameter name:`prob.ci`

&`prob.ci.alpha`

: Probabilities from the selected models are first transformed into binary data according to the thresholds defined when building the`EMca`

`model.output`

object with the`BIOMOD_Modeling`

function, maximizing the evaluation metric score over the testing dataset. The committee averaging score is obtained by taking the average of these binary predictions. It is built on the analogy of a simple vote :each single model votes for the species being either present (

`1`

) or absent (`0`

)the sum of

`1`

is then divided by the number of single models*voting*

The interesting feature of this measure is that it gives both a prediction and a measure of uncertainty. When the prediction is close to

`0`

or`1`

, it means that all models agree to predict`0`

or`1`

respectively. When the prediction is around`0.5`

, it means that half the models predict`1`

and the other half`0`

.

Old parameter name:`committee.averaging`

&`EMwmean`

: Probabilities from the selected models are weighted according to their evaluation scores obtained when building the`EMwmean.decay`

`model.output`

object with the`BIOMOD_Modeling`

function (*better a model is, more importance it has in the ensemble*) and summed.

Old parameter name:`prob.mean.weight`

&`prob.mean.weight.decay`

The

`EMwmean.decay`

is the ratio between a weight and the next or previous one. The formula is :`W = W(-1) * EMwmean.decay`

.*For example, with the value of*`1.6`

and`4`

weights wanted, the relative importance of the weights will be`1/1.6/2.56(=1.6*1.6)/4.096(=2.56*1.6)`

from the weakest to the strongest, and gives`0.11/0.17/0.275/0.445`

considering that the sum of the weights is equal to one. The lower the`EMwmean.decay`

, the smoother the differences between the weights enhancing a weak discrimination between models.If

`EMwmean.decay = 'proportional'`

, the weights are assigned to each model proportionally to their evaluation scores. The discrimination is fairer than using the*decay*method where close scores can have strongly diverging weights, while the proportional method would assign them similar weights.It is also possible to define the

`EMwmean.decay`

parameter as a function that will be applied to single models scores and transform them into weights.*For example, if*`EMwmean.decay = function(x) {x^2}`

, the squared of evaluation score of each model will be used to weight the models predictions.

`BIOMOD_FormatingData`

, `bm_ModelingOptions`

,
`bm_CrossValidation`

, `bm_VariablesImportance`

,
`BIOMOD_Modeling`

, `BIOMOD_EnsembleForecasting`

,
`bm_PlotEvalMean`

, `bm_PlotEvalBoxplot`

,
`bm_PlotVarImpBoxplot`

, `bm_PlotResponseCurves`

Other Main functions:
`BIOMOD_EnsembleForecasting()`

,
`BIOMOD_FormatingData()`

,
`BIOMOD_LoadModels()`

,
`BIOMOD_Modeling()`

,
`BIOMOD_Projection()`

,
`BIOMOD_RangeSize()`

```
library(terra)
# Load species occurrences (6 species available)
data(DataSpecies)
head(DataSpecies)
# Select the name of the studied species
myRespName <- 'GuloGulo'
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])
# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]
# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)
# \dontshow{
myExtent <- terra::ext(0,30,45,70)
myExpl <- terra::crop(myExpl, myExtent)
# }
## ----------------------------------------------------------------------- #
file.out <- paste0(myRespName, "/", myRespName, ".AllModels.models.out")
if (file.exists(file.out)) {
myBiomodModelOut <- get(load(file.out))
} else {
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
expl.var = myExpl,
resp.xy = myRespXY,
resp.name = myRespName)
# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
modeling.id = 'AllModels',
models = c('RF', 'GLM'),
CV.strategy = 'random',
CV.nb.rep = 2,
CV.perc = 0.8,
OPT.strategy = 'bigboss',
metric.eval = c('TSS','ROC'),
var.import = 3,
seed.val = 42)
}
## ----------------------------------------------------------------------- #
# Model ensemble models
myBiomodEM <- BIOMOD_EnsembleModeling(bm.mod = myBiomodModelOut,
models.chosen = 'all',
em.by = 'all',
em.algo = c('EMmean', 'EMca'),
metric.select = c('TSS'),
metric.select.thresh = c(0.7),
metric.eval = c('TSS', 'ROC'),
var.import = 3,
seed.val = 42)
myBiomodEM
# Get evaluation scores & variables importance
get_evaluations(myBiomodEM)
get_variables_importance(myBiomodEM)
# Represent evaluation scores
bm_PlotEvalMean(bm.out = myBiomodEM, dataset = 'calibration')
bm_PlotEvalBoxplot(bm.out = myBiomodEM, group.by = c('algo', 'algo'))
# # Represent variables importance
# bm_PlotVarImpBoxplot(bm.out = myBiomodEM, group.by = c('expl.var', 'algo', 'algo'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodEM, group.by = c('expl.var', 'algo', 'merged.by.PA'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodEM, group.by = c('algo', 'expl.var', 'merged.by.PA'))
# # Represent response curves
# bm_PlotResponseCurves(bm.out = myBiomodEM,
# models.chosen = get_built_models(myBiomodEM),
# fixed.var = 'median')
# bm_PlotResponseCurves(bm.out = myBiomodEM,
# models.chosen = get_built_models(myBiomodEM),
# fixed.var = 'min')
# bm_PlotResponseCurves(bm.out = myBiomodEM,
# models.chosen = get_built_models(myBiomodEM, algo = 'EMmean'),
# fixed.var = 'median',
# do.bivariate = TRUE)
```