This function calibrates and evaluates a range of modeling techniques for a given species distribution. The dataset can be split into calibration and validation parts, and the predictive power of the different models can be estimated using a range of evaluation metrics (see Details).
BIOMOD_Modeling(
  bm.format,
  modeling.id = as.character(format(Sys.time(), "%s")),
  models = c("ANN", "CTA", "FDA", "GAM", "GBM", "GLM", "MARS", "MAXENT", "MAXNET", "RF",
             "RFd", "SRE", "XGBOOST"),
  models.pa = NULL,
  CV.strategy = "random",
  CV.nb.rep = 1,
  CV.perc = NULL,
  CV.k = NULL,
  CV.balance = NULL,
  CV.env.var = NULL,
  CV.strat = NULL,
  CV.user.table = NULL,
  CV.do.full.models = TRUE,
  OPT.strategy = "default",
  OPT.user.val = NULL,
  OPT.user.base = "bigboss",
  OPT.user = NULL,
  metric.eval = c("KAPPA", "TSS", "ROC"),
  var.import = 0,
  weights = NULL,
  prevalence = NULL,
  scale.models = FALSE,
  nb.cpu = 1,
  seed.val = NULL,
  do.progress = TRUE
)
bm.format
a BIOMOD.formated.data or BIOMOD.formated.data.PA object returned by the BIOMOD_FormatingData function
modeling.id
a character corresponding to the name (ID) of the simulation set (by default, a number derived from the current system time)
models
a vector containing model names to be computed, must be among ANN, CTA, FDA, GAM, GBM, GLM, MARS, MAXENT, MAXNET, RF, RFd, SRE, XGBOOST
models.pa
(optional, default NULL) A list containing for each model a vector defining which pseudo-absence datasets are to be used, must be among colnames(bm.format@PA.table)
CV.strategy
a character corresponding to the cross-validation selection strategy, must be among random, kfold, block, strat, env or user.defined
CV.nb.rep
(optional, default 1) If CV.strategy = 'random' or CV.strategy = 'kfold', an integer corresponding to the number of sets (repetitions) of cross-validation points that will be drawn
CV.perc
(optional, default 0) If CV.strategy = 'random', a numeric between 0 and 1 defining the percentage of data that will be kept for calibration
CV.k
(optional, default 0) If CV.strategy = 'kfold' or CV.strategy = 'strat' or CV.strategy = 'env', an integer corresponding to the number of partitions
CV.balance
(optional, default 'presences') If CV.strategy = 'strat' or CV.strategy = 'env', a character corresponding to how data will be balanced between partitions, must be either presences or absences
CV.env.var
(optional, default NULL) If CV.strategy = 'env', a character corresponding to the environmental variables used to build the partition (all available variables by default), and for which CV.k partitions will be built
CV.strat
(optional, default 'both') If CV.strategy = 'env', a character corresponding to how data will be partitioned along the gradient, must be among x, y, both
CV.user.table
(optional, default NULL) If CV.strategy = 'user.defined', a matrix or data.frame defining for each repetition (in columns) which observation lines should be used for models calibration (TRUE) and validation (FALSE)
CV.do.full.models
(optional, default TRUE) A logical value defining whether models should also be calibrated and validated over the whole dataset (and pseudo-absence datasets) or not
OPT.strategy
a character corresponding to the method to select models' parameters values, must be either default, bigboss, user.defined, tuned
OPT.user.val
(optional, default NULL) A list containing parameters values for some (all) models
OPT.user.base
(optional, default bigboss) A character, default or bigboss, used when OPT.strategy = 'user.defined'. It sets the bases of parameters to be modified by user defined values.
OPT.user
(optional, default NULL) A BIOMOD.models.options object returned by the bm_ModelingOptions function
metric.eval
a vector containing evaluation metric names to be used, must be among ROC, TSS, KAPPA, ACCURACY, BIAS, POD, FAR, POFD, SR, CSI, ETS, OR, ORSS, BOYCE, MPA (binary data), RMSE, MAE, MSE, Rsquared, Rsquared_aj, Max_error (abundance / count / relative data), Accuracy, Recall, Precision, F1 (ordinal data)
var.import
(optional, default 0) An integer corresponding to the number of permutations to be done for each variable to estimate variable importance
weights
(optional, default NULL) A vector of numeric values corresponding to observation weights (one per observation, see Details)
prevalence
(optional, default 0.5) A numeric between 0 and 1 corresponding to the species prevalence to build 'weighted response weights' (see Details)
scale.models
(optional, default FALSE) A logical value defining whether all models predictions should be scaled with a binomial GLM or not
nb.cpu
(optional, default 1) An integer value corresponding to the number of computing resources to be used to parallelize the single models computation
seed.val
(optional, default NULL) An integer value corresponding to the new seed value to be set
do.progress
(optional, default TRUE) A logical value defining whether the progress bar is to be rendered or not
A BIOMOD.models.out object containing models outputs, or links to saved outputs.
Models outputs are stored out of R (for memory storage reasons) in 2 different folders created in the current working directory:
a models folder, named after the resp.name argument of BIOMOD_FormatingData, and containing all calibrated models for each repetition and pseudo-absence run
a hidden folder, named .BIOMOD_DATA, and containing output-related files (original dataset, calibration lines, pseudo-absences selected, predictions, variables importance, evaluation values...), that can be retrieved with get_[...] or load functions, and used by other biomod2 functions, like BIOMOD_Projection or BIOMOD_EnsembleModeling
If pseudo-absences have been added to the original dataset (see BIOMOD_FormatingData), PA.nb.rep * (nb.rep + 1) models will be created.
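As a quick sanity check on the count above, the arithmetic can be sketched in a few lines of R. The values of PA.nb.rep and nb.rep and the run labels are illustrative assumptions, not biomod2 output. The sketch also assumes that the count applies to each modeling technique, and that the extra run stands for the model built on the whole dataset.

```r
# Illustrative values (assumptions, not taken from a real run)
PA.nb.rep <- 3   # number of pseudo-absence datasets
nb.rep <- 2      # number of cross-validation repetitions

# Count given in the text: PA.nb.rep * (nb.rep + 1)
nb.models <- PA.nb.rep * (nb.rep + 1)

# Enumerate the combinations; the labels below are hypothetical
runs <- expand.grid(
  PA = paste0("PA", seq_len(PA.nb.rep)),
  run = c(paste0("RUN", seq_len(nb.rep)), "full"),
  stringsAsFactors = FALSE
)

print(nb.models)
print(nrow(runs))
```

Each row of runs corresponds to one model of the count, which makes it easy to see how quickly the number grows when several techniques are requested.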
The set of models to be calibrated on the data. 13 modeling techniques are currently available:
ANN: Artificial Neural Network (nnet)
CTA: Classification Tree Analysis (rpart)
FDA: Flexible Discriminant Analysis (fda)
GAM: Generalized Additive Model (gam::gam, mgcv::gam or mgcv::bam) (see bm_ModelingOptions for details on algorithm selection)
GBM: Generalized Boosting Model, usually called Boosted Regression Trees (gbm)
GLM: Generalized Linear Model (glm)
MARS: Multivariate Adaptive Regression Splines (earth)
MAXENT: Maximum Entropy (see Maxent website)
MAXNET: Maximum Entropy (maxnet)
RF: Random Forest (randomForest)
RFd: Random Forest downsampled (randomForest)
SRE: Surface Range Envelope, usually called BIOCLIM (bm_SRE)
XGBOOST: eXtreme Gradient Boosting Training (xgboost)
| ANN | CTA | FDA | GAM | GBM | GLM | MARS | MAXENT | MAXNET | RF | RFd | SRE | XGBOOST |
binary | x | x | x | x | x | x | x | x | x | x | x | x | x |
ordinal | x | x | x | x | x | x | x | ||||||
abundance / count / relative | x | x | x | x | x | x | x |
Different models might respond differently to different numbers of pseudo-absences. It is possible to create sets of pseudo-absences with different numbers of points (see BIOMOD_FormatingData) and to assign only some of these datasets to each single model.
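A minimal sketch of what such an assignment could look like is given below. It assumes that the list passed to models.pa is named by model, and it uses hypothetical pseudo-absence dataset names (PA1, PA2, PA3); in a real analysis the valid names are those of colnames(bm.format@PA.table).

```r
# Hypothetical pseudo-absence dataset names; in practice these would come
# from colnames(bm.format@PA.table)
available.pa <- c("PA1", "PA2", "PA3")

# One vector of dataset names per model (assumption: list named by model)
models.pa <- list(
  RF  = c("PA1", "PA2"),
  GLM = c("PA3"),
  GBM = c("PA1", "PA2", "PA3")
)

# Check that every requested dataset actually exists
is.valid <- vapply(models.pa, function(pa) all(pa %in% available.pa), logical(1))
print(is.valid)
```

Running a check of this kind before calling BIOMOD_Modeling helps catch a misspelled dataset name early.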
Different methods are available to calibrate/validate the single models (see bm_CrossValidation).
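To make the random strategy concrete, the sketch below draws calibration lines by hand and stores them as a logical table with one column per repetition, which is the layout described for CV.user.table (TRUE for calibration, FALSE for validation). It uses base R only. The number of observations and the column names are hypothetical, and the column names biomod2 actually expects should be checked in bm_CrossValidation.

```r
set.seed(42)

n.obs <- 100      # hypothetical number of observations
CV.perc <- 0.8    # share of data kept for calibration
CV.nb.rep <- 2    # number of repetitions

# One logical column per repetition: TRUE = calibration, FALSE = validation
calib.lines <- sapply(seq_len(CV.nb.rep), function(i) {
  keep <- rep(FALSE, n.obs)
  keep[sample(n.obs, size = round(CV.perc * n.obs))] <- TRUE
  keep
})
colnames(calib.lines) <- paste0("RUN", seq_len(CV.nb.rep))  # hypothetical names

print(dim(calib.lines))
print(colMeans(calib.lines))
```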
Different methods are available to parameterize the single models (see bm_ModelingOptions and BIOMOD.options.dataset).
default: only the default parameter values of the single models' underlying functions are retrieved. Nothing is changed, so it might not give good results.
bigboss: uses parameters pre-defined by the biomod2 team, available in the dataset OptionsBigboss (to be optimized in the near future).
user.defined: updates default or bigboss parameters with some parameter values defined by the user (but matching the format of a BIOMOD.models.options object)
tuned: calls the bm_Tuning function to try to optimize some default values
Please refer to the CAWRC website ("Methods for dichotomous forecasts") for a detailed description (simple/complex metrics).
Several evaluation metrics can be selected.
The optimal value of each metric can be obtained with the get_optim_value function.
POD: Probability of detection (hit rate)
FAR: False alarm ratio
POFD: Probability of false detection (fall-out)
SR: Success ratio
ACCURACY: Accuracy (fraction correct)
BIAS: Bias score (frequency bias)
ROC: Relative operating characteristic
TSS: True skill statistic (Hanssen and Kuipers discriminant, Peirce's skill score)
KAPPA: Cohen's Kappa (Heidke skill score)
OR: Odds Ratio
ORSS: Odds ratio skill score (Yule's Q)
CSI: Critical success index (threat score)
ETS: Equitable threat score (Gilbert skill score)
BOYCE: Boyce index
MPA: Minimal predicted area (cutoff optimizing MPA to predict 90% of presences)
RMSE: Root Mean Square Error
MSE: Mean Square Error
MAE: Mean Absolute Error
Rsquared: R squared
Rsquared_aj: R squared adjusted
Max_error: Maximum error
Accuracy: Accuracy
Recall: Macro average Recall
Precision: Macro average Precision
F1: Macro F1 score
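As an illustration of how two of the metrics above are derived from a confusion matrix, here is a small base R sketch using the standard definitions of TSS (sensitivity + specificity - 1) and Cohen's Kappa. The observed and predicted vectors are made up, and this is not the code biomod2 runs internally.

```r
# Made-up binary observations and thresholded predictions
obs  <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
pred <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 1)

# Confusion matrix cells
tp <- sum(obs == 1 & pred == 1)
fn <- sum(obs == 1 & pred == 0)
fp <- sum(obs == 0 & pred == 1)
tn <- sum(obs == 0 & pred == 0)
n <- length(obs)

# True skill statistic
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
tss.score <- sensitivity + specificity - 1

# Cohen's Kappa: observed agreement corrected for chance agreement
p.obs <- (tp + tn) / n
p.exp <- ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n^2
kappa.score <- (p.obs - p.exp) / (1 - p.exp)

print(c(TSS = tss.score, KAPPA = kappa.score))
```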
Results after modeling can be obtained through the get_evaluations function.
Evaluation metrics are calculated on the calibration data (column calibration), on the cross-validation data (column validation) or on the evaluation data (column evaluation).
For cross-validation data, see the CV.[...] parameters of the BIOMOD_Modeling function.
For evaluation data, see the eval.[...] parameters of BIOMOD_FormatingData.
A value characterizing how much each variable influences each model's predictions can be calculated by randomizing the variable of interest and computing the correlation between the predictions obtained with the original and with the shuffled variable (see bm_VariablesImportance).
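The permutation idea can be sketched in base R as follows. This is a simplified illustration on simulated data with a plain glm, assuming the score is one minus the correlation between predictions made before and after shuffling. The helper var_importance is hypothetical and is not the bm_VariablesImportance implementation.

```r
set.seed(42)

# Simulated data: x1 drives the response, x2 is pure noise
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(2 * x1))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# Hypothetical helper: shuffle one variable, predict again, compare predictions
var_importance <- function(model, data, variable, nb.perm = 3) {
  ref <- predict(model, newdata = data, type = "response")
  scores <- vapply(seq_len(nb.perm), function(i) {
    shuffled <- data
    shuffled[[variable]] <- sample(shuffled[[variable]])
    pred <- predict(model, newdata = shuffled, type = "response")
    1 - cor(ref, pred)
  }, numeric(1))
  mean(scores)
}

imp <- c(x1 = var_importance(fit, dat, "x1"),
         x2 = var_importance(fit, dat, "x2"))
print(round(imp, 3))
```

The nb.perm argument of the helper plays the role that var.import plays in BIOMOD_Modeling: more permutations give a more stable average at a higher computing cost.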
More or less weight can be given to some specific observations.
Automatically created weights will be integer values to prevent some modeling issues.
Note that MAXENT, MAXNET, RF, RFd and SRE models do not take weights into account.
If weights = prevalence = NULL, each observation (presence or absence) will have the same weight, no matter the total number of presences and absences.
If prevalence = 0.5, presences and absences will be weighted equally (i.e. the weighted sum of presences equals the weighted sum of absences).
If prevalence is set below (above) 0.5, more weight will be given to absences (presences).
If weights is defined, the prevalence argument will be ignored (EXCEPT for MAXENT).
If pseudo-absences have been generated (PA.nb.rep > 0 in BIOMOD_FormatingData), weights are by default calculated such that prevalence = 0.5. ## TODO: this is wrong
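The effect of prevalence on weights can be illustrated with a short base R sketch. The helper prevalence_weights is hypothetical: it simply makes the weighted share of presences equal to the requested prevalence, and unlike the weights biomod2 creates automatically it does not round to integer values.

```r
# Hypothetical helper: weights such that presences carry a share `prevalence`
# of the total weight (real-valued, not rounded to integers)
prevalence_weights <- function(resp, prevalence = 0.5) {
  stopifnot(all(resp %in% c(0, 1)), prevalence > 0, prevalence < 1)
  n.pres <- sum(resp == 1)
  n.abs <- sum(resp == 0)
  w <- ifelse(resp == 1, prevalence / n.pres, (1 - prevalence) / n.abs)
  w * length(resp)  # rescale so that weights sum to the number of observations
}

# Made-up response: 2 presences, 6 absences
resp <- c(1, 1, 0, 0, 0, 0, 0, 0)

w.equal <- prevalence_weights(resp, prevalence = 0.5)
w.low <- prevalence_weights(resp, prevalence = 0.2)

# With prevalence = 0.5 both classes carry the same total weight
print(c(presences = sum(w.equal[resp == 1]), absences = sum(w.equal[resp == 0])))
# With prevalence below 0.5 absences carry more total weight
print(c(presences = sum(w.low[resp == 1]), absences = sum(w.low[resp == 0])))
```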
A binomial GLM is created to scale predictions from 0 to 1. SRE is never scaled, and ANN and FDA categorical models always are.
Note that it may lead to a reduction in projected scale amplitude.
This parameter is quite experimental and it is recommended not to use it. It was developed to ensure comparable predictions by removing the scale prediction effect (the more extended projections are, the more they influence ensemble forecasting results).
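The scaling step described above can be pictured with a small base R sketch: a binomial GLM is fitted with the observations as response and the raw model scores as predictor, and its fitted probabilities are used as scaled predictions. The data are simulated and the variable names are hypothetical; this only illustrates the principle and is not the biomod2 implementation.

```r
set.seed(42)

# Simulated observations and hypothetical unscaled model scores
n <- 200
obs <- rbinom(n, 1, 0.4)
raw.pred <- 3 * obs + rnorm(n)

# Binomial GLM mapping raw scores to probabilities between 0 and 1
scaling.model <- glm(obs ~ raw.pred, family = binomial)
scaled.pred <- predict(scaling.model, type = "response")

print(range(raw.pred))
print(range(scaled.pred))
```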
glm, gam::gam, mgcv::gam, mgcv::bam, gbm, rpart, nnet, fda, earth, randomForest, maxnet, xgboost, BIOMOD_FormatingData, bm_ModelingOptions, bm_Tuning, bm_CrossValidation, bm_VariablesImportance, BIOMOD_Projection, BIOMOD_EnsembleModeling, bm_PlotEvalMean, bm_PlotEvalBoxplot, bm_PlotVarImpBoxplot, bm_PlotResponseCurves
Other Main functions: BIOMOD_EnsembleForecasting(), BIOMOD_EnsembleModeling(), BIOMOD_FormatingData(), BIOMOD_LoadModels(), BIOMOD_Projection(), BIOMOD_RangeSize()
library(biomod2)
library(terra)
# Load species occurrences (6 species available)
data(DataSpecies)
head(DataSpecies)
# Select the name of the studied species
myRespName <- 'GuloGulo'
# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])
# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]
# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)
# Crop the environmental layers to a smaller extent to keep the example fast
myExtent <- terra::ext(0, 30, 45, 70)
myExpl <- terra::crop(myExpl, myExtent)
# ---------------------------------------------------------------------------- #
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.name = myRespName,
                                     resp.var = myResp,
                                     resp.xy = myRespXY,
                                     expl.var = myExpl)
# ---------------------------------------------------------------------------- #
# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'AllModels',
                                    models = c('RF', 'GLM'),
                                    CV.strategy = 'random',
                                    CV.nb.rep = 2,
                                    CV.perc = 0.8,
                                    OPT.strategy = 'bigboss',
                                    metric.eval = c('TSS', 'ROC'),
                                    var.import = 2,
                                    seed.val = 42)
myBiomodModelOut
# Get evaluation scores & variables importance
get_evaluations(myBiomodModelOut)
get_variables_importance(myBiomodModelOut)
# Represent evaluation scores
bm_PlotEvalMean(bm.out = myBiomodModelOut, dataset = 'calibration')
bm_PlotEvalMean(bm.out = myBiomodModelOut, dataset = 'validation')
bm_PlotEvalBoxplot(bm.out = myBiomodModelOut, group.by = c('algo', 'run'))
# # Represent variables importance
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('expl.var', 'algo', 'algo'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('expl.var', 'algo', 'run'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('algo', 'expl.var', 'run'))
# # Represent response curves
# mods <- get_built_models(myBiomodModelOut, run = 'RUN1')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut,
#                       models.chosen = mods,
#                       fixed.var = 'median')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut,
#                       models.chosen = mods,
#                       fixed.var = 'min')
# mods <- get_built_models(myBiomodModelOut, full.name = 'GuloGulo_allData_RUN2_RF')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut,
#                       models.chosen = mods,
#                       fixed.var = 'median',
#                       do.bivariate = TRUE)