R/BIOMOD_EnsembleModeling.R
BIOMOD_EnsembleModeling.Rd
This function allows to combine a range of models built with the
BIOMOD_Modeling
function in one (or several) ensemble model. Modeling
uncertainty can be assessed as well as variables importance, ensemble predictions can be
evaluated against original data, and created ensemble models can be projected over new
conditions (see Details).
BIOMOD_EnsembleModeling(
bm.mod,
models.chosen = "all",
em.by = "PA+run",
em.algo,
metric.select = "all",
metric.select.thresh = NULL,
metric.select.table = NULL,
metric.select.dataset = NULL,
metric.eval = c("KAPPA", "TSS", "ROC"),
var.import = 0,
EMci.alpha = 0.05,
EMwmean.decay = "proportional",
nb.cpu = 1,
seed.val = NULL,
do.progress = TRUE,
prob.mean,
prob.median,
prob.cv,
prob.ci,
committee.averaging,
prob.mean.weight,
prob.mean.weight.decay,
prob.ci.alpha
)
a BIOMOD.models.out
object returned by the
BIOMOD_Modeling
function
a vector
containing model names to be kept, must be either
all
or a sub-selection of model names that can be obtained with the
get_built_models
function
a character
corresponding to the way kept models will be combined to build
the ensemble models, must be among all
, algo
, PA
, PA+algo
,
PA+run
a vector
corresponding to the ensemble models that will be computed,
must be among 'EMmean'
, 'EMmedian'
, 'EMcv'
, 'EMci'
,
'EMca'
, 'EMwmean'
a vector
containing evaluation metric names to be used together with
metric.select.thresh
to exclude single models based on their evaluation scores
(for ensemble methods like probability weighted mean or committee averaging). Must be among
all
(same evaluation metrics than those of bm.mod
), user.defined
(and defined through metric.select.table
) or POD
, FAR
, POFD
,
SR
, ACCURACY
, BIAS
, ROC
, TSS
, KAPPA
, OR
,
ORSS
, CSI
, ETS
, BOYCE
, MPA
(optional, default NULL
)
A vector
of numeric
values corresponding to the minimum scores (one for each
metric.select
) below which single models will be excluded from the ensemble model
building
(optional, default NULL
)
If
metric.select = 'user.defined'
, a data.frame
containing evaluation scores
calculated for each single models and that will be compared to metric.select.thresh
values to exclude some of them from the ensemble model building, with metric.select
rownames, and models.chosen
colnames
(optional, default 'validation'
if possible). A character determining which dataset should be used to filter and/or
weigh the ensemble models should be among 'evaluation', 'validation' or 'calibration'.
a vector
containing evaluation metric names to be used, must
be among POD
, FAR
, POFD
, SR
, ACCURACY
, BIAS
,
ROC
, TSS
, KAPPA
, OR
, ORSS
, CSI
, ETS
,
BOYCE
, MPA
(optional, default NULL
)
An integer
corresponding to the number of permutations to be done for each variable to
estimate variable importance
(optional, default 0.05
)
A numeric
value corresponding to the significance level to estimate confidence interval
(optional, default proportional
)
A
value defining the relative importance of the weights (if 'EMwmean'
was given to argument em.algo
). A high value will strongly
discriminate good models from the bad ones (see Details),
while proportional
will attribute weights proportionally to the
models evaluation scores
(optional, default 1
)
An integer
value corresponding to the number of computing resources to be used to
parallelize the single models predictions and the ensemble models computation
(optional, default NULL
)
An integer
value corresponding to the new seed value to be set
(optional, default TRUE
)
A logical
value defining whether the progress bar is to be rendered or not
(deprecated, please use em.algo
instead)
A logical
value defining whether to compute the mean probabilities
across predictions or not
(deprecated, please use em.algo
instead)
A logical
value defining whether to compute the median
probabilities across predictions or not
(deprecated, please use em.algo
instead)
A
logical
value defining whether to compute the coefficient of
variation across predictions or not
(deprecated, please use em.algo
instead)
A
logical
value defining whether to compute the confidence interval
around the prob.mean
ensemble model or not
(deprecated, please use em.algo
instead)
A logical
value defining whether to compute the
committee averaging across predictions or not
(deprecated, please use em.algo
instead)
A logical
value defining whether to compute the
weighted sum of probabilities across predictions or not
(deprecated, please use
EMwmean.decay
instead)
old argument name for
EMwmean.decay
(deprecated, please use EMci.alpha
instead)
old argument name for EMci.alpha
A BIOMOD.ensemble.models.out
object containing models outputs, or links to saved
outputs.
Models outputs are stored out of R (for memory storage reasons) in 2 different
folders created in the current working directory :
a models folder, named after the resp.name
argument of
BIOMOD_FormatingData
, and containing all ensemble models
a hidden folder, named .BIOMOD_DATA
, and containing outputs related
files (original dataset, calibration lines, pseudo-absences selected, predictions,
variables importance, evaluation values...), that can be retrieved with
get_[...]
or load
functions, and used by other biomod2 functions, like
BIOMOD_EnsembleForecasting
models.chosen
)Applying get_built_models
function to the bm.mod
object gives the names of the single models created
with the BIOMOD_Modeling
function. The models.chosen
argument can take
either a sub-selection of these single model names, or the all
default value, to
decide which single models will be used for the ensemble model building.
em.by
)Single models built with the
BIOMOD_Modeling
function can be combined in 5 different ways to obtain
ensemble models :
PA+run
: each combination of pseudo-absence and repetition
datasets is done, merging algorithms together
PA+algo
: each combination of pseudo-absence and algorithm datasets
is done, merging repetitions together
PA
: pseudo-absence datasets are considered individually,
merging algorithms and repetitions together
algo
: algorithm datasets are considered individually, merging
pseudo-absence and repetitions together
all
: all models are combined into one
Hence, depending on the chosen method, the number of ensemble models built will vary.
Be aware that if no evaluation data was given to the
BIOMOD_FormatingData
function, some ensemble model evaluations may be biased
due to difference in data used for single model evaluations.
Be aware that all of these combinations are allowed, but some may not make sense
depending mainly on how pseudo-absence datasets have been built and whether all of them
have been used for all single models or not (see PA.nb.absences
and models.pa
parameters in BIOMOD_FormatingData
and BIOMOD_Modeling
functions
respectively).
metric.select
: the selected metrics must be chosen among the ones used
within the BIOMOD_Modeling
function to build the model.output
object,
unless metric.select = 'user.defined'
and therefore values will be provided through
the metric.select.table
parameter.
In the case of the selection of several
metrics, they will be used at different steps of the ensemble modeling function :
remove low quality single models, having a score lower than
metric.select.thresh
perform the binary transformation needed if 'EMca'
was given to argument em.algo
weight models if 'EMwmean'
was given to argument em.algo
metric.select.thresh
: as many values as evaluation metrics
selected with the metric.select
parameter, and defining the corresponding quality
thresholds below which the single models will be excluded from the ensemble model
building.
metric.select.table
: a data.frame
must be given if
metric.select = 'user.defined'
to allow the use of evaluation metrics other than
those calculated within biomod2. The data.frame
must contain as many columns
as models.chosen
with matching names, and as many rows as evaluation metrics to be
used. The number of rows must match the length of the metric.select.thresh
parameter. The values contained in the data.frame
will be compared to those defined
in metric.select.thresh
to remove low quality single models from
the ensemble model building.
metric.select.dataset
: a character
determining the dataset
which evaluation metric should be used to filter and/or weigh the
ensemble models. Should be among evaluation
, validation
or
calibration
. By default BIOMOD_EnsembleModeling
will use
the validation dataset unless no validation is available in which case
calibration dataset are used.
metric.eval
: the selected metrics will be used to validate/evaluate
the ensemble models built
The set of models to be calibrated on the data.
6 modeling techniques are currently available :
EMmean
: Mean of probabilities over the selected models.
Old name: prob.mean
EMmedian
: Median of probabilities over the selected models
The median is less sensitive to outliers than the mean, however it requires more
computation time and memory as it loads all predictions (on the contrary to the mean or
the weighted mean). Old name: prob.median
EMcv
: Coefficient of variation (sd / mean) of probabilities
over the selected models
This model is not scaled. It will be evaluated like all other ensemble models although its
interpretation will be obviously different. CV is a measure of uncertainty rather a
measure of probability of occurrence. If the CV gets a high evaluation score, it means
that the uncertainty is high where the species is observed (which might not be a good
feature of the model). The lower is the score, the better are the models.
CV is a nice complement to the mean probability. Old name: prob.cv
EMci
& EMci.alpha
: Confidence interval around
the mean of probabilities of the selected models
It is also a nice complement to the mean probability. It creates 2 ensemble models :
LOWER : there is less than 100 * EMci.alpha / 2
% of chance to
get probabilities lower than the given ones
UPPER : there is less than 100 * EMci.alpha / 2
% of chance to
get probabilities upper than the given ones
These intervals are calculated with the following function :
$$I_c = [ \bar{x} - \frac{t_\alpha sd }{ \sqrt{n} };
\bar{x} + \frac{t_\alpha sd }{ \sqrt{n} }]$$
Old parameter name: prob.ci
& prob.ci.alpha
EMca
: Probabilities from the selected models are
first transformed into binary data according to the thresholds defined when building the
model.output
object with the BIOMOD_Modeling
function, maximizing the
evaluation metric score over the testing dataset. The committee averaging score is
obtained by taking the average of these binary predictions. It is built on the analogy
of a simple vote :
each single model votes for the species being either present (1
) or absent
(0
)
the sum of 1
is then divided by the number of single models voting
The interesting feature of this measure is that it gives both a prediction and a measure
of uncertainty. When the prediction is close to 0
or 1
, it means that all
models agree to predict 0
or 1
respectively. When the prediction is around
0.5
, it means that half the models predict 1
and the other half 0
.
Old parameter name: committee.averaging
EMwmean
& EMwmean.decay
:
Probabilities from the selected models are weighted according to their evaluation scores
obtained when building the model.output
object with the BIOMOD_Modeling
function (better a model is, more importance it has in the ensemble) and summed.
Old parameter name: prob.mean.weight
& prob.mean.weight.decay
The EMwmean.decay
is the ratio between a weight and the next or previous one.
The formula is : W = W(-1) * EMwmean.decay
. For example, with the value
of 1.6
and 4
weights wanted, the relative importance of the weights will be
1/1.6/2.56(=1.6*1.6)/4.096(=2.56*1.6)
from the weakest to the strongest, and gives
0.11/0.17/0.275/0.445
considering that the sum of the weights is equal to one. The
lower the EMwmean.decay
, the smoother the differences between the weights
enhancing a weak discrimination between models.
If EMwmean.decay = 'proportional'
, the weights are assigned to each model
proportionally to their evaluation scores. The discrimination is fairer than using the
decay method where close scores can have strongly diverging weights, while the
proportional method would assign them similar weights.
It is also possible to define the EMwmean.decay
parameter as a function that
will be applied to single models scores and transform them into weights. For example,
if EMwmean.decay = function(x) {x^2}
, the squared of evaluation score of each
model will be used to weight the models predictions.
BIOMOD_FormatingData
, bm_ModelingOptions
,
bm_CrossValidation
, bm_VariablesImportance
,
BIOMOD_Modeling
, BIOMOD_EnsembleForecasting
,
bm_PlotEvalMean
, bm_PlotEvalBoxplot
,
bm_PlotVarImpBoxplot
, bm_PlotResponseCurves
Other Main functions:
BIOMOD_EnsembleForecasting()
,
BIOMOD_FormatingData()
,
BIOMOD_LoadModels()
,
BIOMOD_Modeling()
,
BIOMOD_Projection()
,
BIOMOD_RangeSize()
library(terra)
# Load species occurrences (6 species available)
data(DataSpecies)
head(DataSpecies)
# Select the name of the studied species
myRespName <- 'GuloGulo'
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])
# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]
# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)
DONTSHOW({
myExtent <- terra::ext(0,30,45,70)
myExpl <- terra::crop(myExpl, myExtent)
})
## ----------------------------------------------------------------------- #
file.out <- paste0(myRespName, "/", myRespName, ".AllModels.models.out")
if (file.exists(file.out)) {
myBiomodModelOut <- get(load(file.out))
} else {
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
expl.var = myExpl,
resp.xy = myRespXY,
resp.name = myRespName)
# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
modeling.id = 'AllModels',
models = c('RF', 'GLM'),
CV.strategy = 'random',
CV.nb.rep = 2,
CV.perc = 0.8,
OPT.strategy = 'bigboss',
metric.eval = c('TSS','ROC'),
var.import = 3,
seed.val = 42)
}
## ----------------------------------------------------------------------- #
# Model ensemble models
myBiomodEM <- BIOMOD_EnsembleModeling(bm.mod = myBiomodModelOut,
models.chosen = 'all',
em.by = 'all',
em.algo = c('EMmean', 'EMca'),
metric.select = c('TSS'),
metric.select.thresh = c(0.7),
metric.eval = c('TSS', 'ROC'),
var.import = 3,
seed.val = 42)
myBiomodEM
# Get evaluation scores & variables importance
get_evaluations(myBiomodEM)
get_variables_importance(myBiomodEM)
# Represent evaluation scores
bm_PlotEvalMean(bm.out = myBiomodEM, dataset = 'calibration')
bm_PlotEvalBoxplot(bm.out = myBiomodEM, group.by = c('algo', 'algo'))
# # Represent variables importance
# bm_PlotVarImpBoxplot(bm.out = myBiomodEM, group.by = c('expl.var', 'algo', 'algo'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodEM, group.by = c('expl.var', 'algo', 'merged.by.PA'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodEM, group.by = c('algo', 'expl.var', 'merged.by.PA'))
# # Represent response curves
# bm_PlotResponseCurves(bm.out = myBiomodEM,
# models.chosen = get_built_models(myBiomodEM),
# fixed.var = 'median')
# bm_PlotResponseCurves(bm.out = myBiomodEM,
# models.chosen = get_built_models(myBiomodEM),
# fixed.var = 'min')
# bm_PlotResponseCurves(bm.out = myBiomodEM,
# models.chosen = get_built_models(myBiomodEM, algo = 'EMmean'),
# fixed.var = 'median',
# do.bivariate = TRUE)