Run a range of species distribution models

This function calibrates and evaluates a range of modeling techniques for a given species distribution. The dataset can be split into calibration and validation parts, and the predictive power of the different models can be estimated using a range of evaluation metrics (see Details).

BIOMOD_Modeling(
  bm.format,
  modeling.id = as.character(format(Sys.time(), "%s")),
  models = c("ANN", "CTA", "FDA", "GAM", "GBM", "GLM", "MARS", "MAXENT", "MAXNET", "RF",
    "RFd", "SRE", "XGBOOST"),
  models.pa = NULL,
  CV.strategy = "random",
  CV.nb.rep = 1,
  CV.perc = NULL,
  CV.k = NULL,
  CV.balance = NULL,
  CV.env.var = NULL,
  CV.strat = NULL,
  CV.user.table = NULL,
  CV.do.full.models = TRUE,
  OPT.strategy = "default",
  OPT.user.val = NULL,
  OPT.user.base = "bigboss",
  OPT.user = NULL,
  metric.eval = c("KAPPA", "TSS", "AUCroc"),
  var.import = 0,
  weights = NULL,
  prevalence = 0.5,
  scale.models = FALSE,
  nb.cpu = 1,
  seed.val = NULL,
  do.progress = TRUE
)

Arguments

bm.format: a BIOMOD.formated.data or BIOMOD.formated.data.PA object returned by the BIOMOD_FormatingData function
modeling.id: a character corresponding to the name (ID) of the simulation set (by default, the current system time formatted as a character string)
models: a vector containing model names to be computed, must be among ANN, CTA, DNN, FDA, GAM, GBM, GLM, MARS, MAXENT, MAXNET, RF, RFd, SRE, XGBOOST
models.pa: (optional, default NULL)
A list containing for each model a vector defining which pseudo-absence datasets are to be used, must be among colnames(bm.format@PA.table)
CV.strategy: a character corresponding to the cross-validation selection strategy, must be among random, kfold, block, strat, env or user.defined
CV.nb.rep: (optional, default 1)
If CV.strategy = 'random' or CV.strategy = 'kfold', an integer corresponding to the number of sets (repetitions) of cross-validation points that will be drawn
CV.perc: (optional, default 0)
If CV.strategy = 'random', a numeric between 0 and 1 defining the proportion of data that will be kept for calibration
CV.k: (optional, default 0)
If CV.strategy = 'kfold', CV.strategy = 'strat' or CV.strategy = 'env', an integer corresponding to the number of partitions
CV.balance: (optional, default 'presences')
If CV.strategy = 'strat' or CV.strategy = 'env', a character corresponding to how data will be balanced between partitions, must be either presences or absences
CV.env.var: (optional, default NULL)
If CV.strategy = 'env', a character corresponding to the environmental variables used to build the partition (all available variables by default), and for which CV.k partitions will be built
CV.strat: (optional, default 'both')
If CV.strategy = 'strat', a character corresponding to how data will be partitioned along the gradient, must be among x, y, both
CV.user.table: (optional, default NULL)
If CV.strategy = 'user.defined', a matrix or data.frame defining for each repetition (in columns) which observation lines should be used for model calibration (TRUE) and validation (FALSE)
CV.do.full.models: (optional, default TRUE)
A logical value defining whether models should be also calibrated and validated over the whole dataset (and pseudo-absence datasets) or not
OPT.strategy: a character corresponding to the method used to select the models' parameter values, must be among default, bigboss, user.defined, tuned
OPT.user.val: (optional, default NULL)
A list containing parameter values for some (or all) models
OPT.user.base: (optional, default bigboss)
A character, either default or bigboss, used when OPT.strategy = 'user.defined'. It sets the base set of parameters that the user-defined values will modify.
OPT.user: (optional, default NULL)
A BIOMOD.models.options object returned by the bm_ModelingOptions function
metric.eval: a vector containing evaluation metric names to be used, must be among AUCroc, AUCprg, TSS, KAPPA, ACCURACY, BIAS, POD, FAR, POFD, SR, CSI, ETS, OR, ORSS, BOYCE, MPA (binary data), RMSE, MAE, MSE, Rsquared, Rsquared_aj, Max_error (abundance / count / relative data), Accuracy, Recall, Precision, F1 (multiclass / ordinal data)
var.import: (optional, default 0)
An integer corresponding to the number of permutations to be done for each variable to estimate variable importance
weights: (optional, default NULL)
A vector of numeric values corresponding to observation weights (one per observation, see Details)
prevalence: (optional, default 0.5)
A numeric between 0 and 1 corresponding to the species prevalence to build 'weighted response weights' (see Details)
scale.models: (optional, default FALSE)
A logical value defining whether all models predictions should be scaled with a binomial GLM or not
nb.cpu: (optional, default 1)
An integer value corresponding to the number of computing resources to be used to parallelize the single models computation
seed.val: (optional, default NULL)
An integer value corresponding to the new seed value to be set
do.progress: (optional, default TRUE)
A logical value defining whether the progress bar is to be rendered or not

Value

A BIOMOD.models.out object containing models outputs, or links to saved outputs.
Model outputs are stored outside of R (to save memory) in two folders created in the current working directory:

a models folder, named after the resp.name argument of BIOMOD_FormatingData, and containing all calibrated models for each repetition and pseudo-absence run
a hidden folder, named .BIOMOD_DATA, and containing outputs related files (original dataset, calibration lines, pseudo-absences selected, predictions, variables importance, evaluation values...), that can be retrieved with get_[...] or load functions, and used by other biomod2 functions, like BIOMOD_Projection or BIOMOD_EnsembleModeling

Details

bm.format

If pseudo-absences have been added to the original dataset (see BIOMOD_FormatingData), PA.nb.rep * (CV.nb.rep + 1) models will be created.
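As a quick check of the count above, the sketch below enumerates the pseudo-absence / run combinations in base R. The values of PA.nb.rep and CV.nb.rep and the run labels are illustrative assumptions, not output from biomod2, and the sketch assumes that the count applies to each algorithm and that the extra "+ 1" is the model built on the full dataset.

```r
# Hypothetical settings (assumptions for illustration only)
PA.nb.rep <- 3   # number of pseudo-absence datasets
CV.nb.rep <- 2   # number of cross-validation repetitions

# Each pseudo-absence dataset gets CV.nb.rep runs plus one full model
# (the labels "RUN1", "allRun", ... are illustrative)
runs <- c(paste0("RUN", seq_len(CV.nb.rep)), "allRun")
combos <- expand.grid(PA = paste0("PA", seq_len(PA.nb.rep)),
                      run = runs,
                      stringsAsFactors = FALSE)

# Number of combinations, equal to PA.nb.rep * (CV.nb.rep + 1)
n_models <- nrow(combos)
n_models
```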

models

The set of models to be calibrated on the data. 14 modeling techniques are currently available:

ANN : Artificial Neural Network (nnet)
CTA : Classification Tree Analysis (rpart)
DNN : Deep Neural Network (cito)
FDA : Flexible Discriminant Analysis (fda)
GAM : Generalized Additive Model (gam from the gam package, or gam or bam from the mgcv package; see bm_ModelingOptions for details on algorithm selection)
GBM : Generalized Boosting Model, usually called Boosted Regression Trees (gbm)
GLM : Generalized Linear Model (glm)
MARS : Multivariate Adaptive Regression Splines (earth)
MAXENT : Maximum Entropy (see Maxent website)
MAXNET : Maximum Entropy (maxnet)
RF : Random Forest (randomForest)
RFd : Random Forest downsampled (randomForest)
SRE : Surface Range Envelope, usually called BIOCLIM (bm_SRE)
XGBOOST : eXtreme Gradient Boosting Training (xgboost)

[Table: models (ANN, CTA, DNN, FDA, GAM, GBM, GLM, MARS, MAXENT, MAXNET, RFd, SRE, XGBOOST) by data type (binary, multiclass, ordinal, abundance / count / relative); cell contents not available]

models.pa

Different models might respond differently to different numbers of pseudo-absences. It is possible to create sets of pseudo-absences with different numbers of points (see BIOMOD_FormatingData) and to assign only some of these datasets to each single model.
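A minimal sketch of what such an assignment could look like is given below. The dataset names PA1, PA2 and PA3 are assumptions standing in for colnames(bm.format@PA.table), and the check is only an illustration of the "must be among" constraint, not biomod2's own validation code.

```r
# Hypothetical pseudo-absence dataset names, standing in for
# colnames(bm.format@PA.table)
pa_names <- c("PA1", "PA2", "PA3")

# One vector of pseudo-absence datasets per model
models.pa <- list(RF  = c("PA1", "PA2"),
                  GLM = c("PA3"))

# Every requested dataset must exist among the available ones
all_valid <- all(unlist(models.pa) %in% pa_names)
all_valid
```

A list built this way would then be passed as the models.pa argument, with the models argument containing the same model names.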

CV.[...] parameters

Different methods are available to calibrate/validate the single models (see bm_CrossValidation).
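To make the calibration/validation split concrete, the base R sketch below builds a logical table of the kind described for CV.user.table: one column per repetition, TRUE for calibration lines and FALSE for validation lines. The number of observations and the column names are assumptions made for illustration; check bm_CrossValidation for the exact naming that biomod2 expects.

```r
set.seed(42)

n_obs     <- 50    # hypothetical number of observations
CV.nb.rep <- 2     # number of repetitions
CV.perc   <- 0.8   # proportion of data kept for calibration

# For each repetition, draw the calibration lines at random
calib_lines <- sapply(seq_len(CV.nb.rep), function(i) {
  keep <- rep(FALSE, n_obs)
  keep[sample.int(n_obs, size = round(CV.perc * n_obs))] <- TRUE
  keep
})

# Column names are an assumed convention, not confirmed by this page
colnames(calib_lines) <- paste0("_allData_RUN", seq_len(CV.nb.rep))

# Number of calibration lines per repetition
colSums(calib_lines)
```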

OPT.[...] parameters

Different methods are available to parameterize the single models (see bm_ModelingOptions and BIOMOD.options.dataset).

default : only the default parameter values of the single models' functions are retrieved. Nothing is changed, so it might not give good results.
bigboss : uses parameters pre-defined by the biomod2 team, available in the dataset OptionsBigboss (to be optimized in the near future).
user.defined : updates default or bigboss parameters with parameter values defined by the user (but matching the format of a BIOMOD.models.options object)
tuned : calls the bm_Tuning function to try to optimize some default values
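The idea behind user.defined (a base set of parameters updated with a few user-supplied values) can be sketched with plain lists. This is only an analogy in base R: the parameter names and values are made up, and a real call would have to follow the BIOMOD.models.options format described in bm_ModelingOptions.

```r
# Hypothetical base parameter set for one model (values are made up)
base_params <- list(n.trees = 2500, interaction.depth = 7, shrinkage = 0.001)

# Hypothetical user-defined values: only the parameters to override
user_params <- list(n.trees = 1000)

# Base values are kept unless the user supplies a replacement
final_params <- utils::modifyList(base_params, user_params)
str(final_params)
```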

metric.eval

Please refer to the CAWRC website ("Methods for dichotomous forecasts") for a detailed description of the simple and complex metrics.
Several evaluation metrics can be selected.
The optimal value of each metric can be obtained with the get_optim_value function.

simple

POD : Probability of detection (hit rate)
FAR : False alarm ratio
POFD : Probability of false detection (fall-out)
SR : Success ratio
ACCURACY : Accuracy (fraction correct)
BIAS : Bias score (frequency bias)

complex

AUCroc : Area Under Curve of Relative operating characteristic
AUCprg : Area Under Curve of Precision-Recall-Gain curve
TSS : True skill statistic (Hanssen and Kuipers discriminant, Peirce's skill score)
KAPPA : Cohen's Kappa (Heidke skill score)
OR : Odds Ratio
ORSS : Odds ratio skill score (Yule's Q)
CSI : Critical success index (threat score)
ETS : Equitable threat score (Gilbert skill score)
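The simple and complex metrics listed above all derive from a 2 x 2 contingency table. The sketch below computes several of them in base R using their standard definitions for dichotomous forecasts; the counts are made-up values, and biomod2 may differ in details such as how the binarization threshold is chosen.

```r
# Hypothetical contingency-table counts (assumed values)
hits         <- 40   # predicted presence, observed presence
false_alarms <- 10   # predicted presence, observed absence
misses       <- 20   # predicted absence, observed presence
correct_neg  <- 30   # predicted absence, observed absence
n <- hits + false_alarms + misses + correct_neg

POD      <- hits / (hits + misses)
FAR      <- false_alarms / (hits + false_alarms)
POFD     <- false_alarms / (false_alarms + correct_neg)
SR       <- hits / (hits + false_alarms)
ACCURACY <- (hits + correct_neg) / n
BIAS     <- (hits + false_alarms) / (hits + misses)
CSI      <- hits / (hits + misses + false_alarms)
TSS      <- POD - POFD

# Number of correct predictions expected by chance, used by KAPPA
expected <- ((hits + misses) * (hits + false_alarms) +
             (correct_neg + misses) * (correct_neg + false_alarms)) / n
KAPPA <- (hits + correct_neg - expected) / (n - expected)

OR   <- (hits * correct_neg) / (misses * false_alarms)
ORSS <- (hits * correct_neg - misses * false_alarms) /
        (hits * correct_neg + misses * false_alarms)

round(c(POD = POD, FAR = FAR, POFD = POFD, SR = SR, ACCURACY = ACCURACY,
        BIAS = BIAS, CSI = CSI, TSS = TSS, KAPPA = KAPPA,
        OR = OR, ORSS = ORSS), 3)
```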

presence-only

BOYCE : Boyce index
MPA : Minimal predicted area (cutoff optimizing MPA to predict 90% of presences)

abundance / count / relative data

RMSE : Root Mean Square Error
MSE : Mean Square Error
MAE : Mean Absolute Error
Rsquared : R squared
Rsquared_aj : R squared adjusted
Max_error : Maximum error
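For abundance, count or relative data, most of the metrics above are simple functions of the prediction errors. The sketch below uses made-up observed and predicted values; R squared is computed here as the squared Pearson correlation, which is an assumption and may not match the definition used by biomod2.

```r
# Hypothetical observed and predicted values
obs  <- c(2, 4, 6, 8)
pred <- c(3, 4, 5, 10)
err  <- pred - obs

MSE       <- mean(err^2)
RMSE      <- sqrt(MSE)
MAE       <- mean(abs(err))
Max_error <- max(abs(err))

# Assumed definition: squared Pearson correlation
Rsquared  <- cor(obs, pred)^2

c(RMSE = RMSE, MSE = MSE, MAE = MAE, Rsquared = Rsquared, Max_error = Max_error)
```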

multiclass/ordinal data

Accuracy : Accuracy
Recall : Macro average Recall
Precision : Macro average Precision
F1 : Macro F1 score
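Macro averaging means that a metric is computed for each class separately and then averaged with equal weight per class. The sketch below illustrates this in base R on made-up labels; macro F1 is taken here as the mean of per-class F1 scores, which is one common definition and an assumption about what biomod2 computes.

```r
# Hypothetical observed and predicted classes
obs  <- c("a", "a", "a", "b", "b", "c", "c", "c")
pred <- c("a", "a", "b", "b", "c", "c", "c", "a")

# Square confusion matrix (rows: observed, columns: predicted)
lv   <- sort(unique(c(obs, pred)))
conf <- table(obs  = factor(obs,  levels = lv),
              pred = factor(pred, levels = lv))

tp          <- diag(conf)
recall_k    <- tp / rowSums(conf)   # per-class recall
precision_k <- tp / colSums(conf)   # per-class precision
f1_k        <- 2 * precision_k * recall_k / (precision_k + recall_k)

accuracy        <- sum(tp) / sum(conf)
macro_recall    <- mean(recall_k)
macro_precision <- mean(precision_k)
macro_f1        <- mean(f1_k)

c(Accuracy = accuracy, Recall = macro_recall,
  Precision = macro_precision, F1 = macro_f1)
```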

Results after modeling can be obtained through the get_evaluations function.
Evaluation metrics are calculated on the calibration data (column calibration), on the cross-validation data (column validation) or on the evaluation data (column evaluation).
For cross-validation data, see CV.[...] parameters in BIOMOD_Modeling function.
For evaluation data, see eval.[...] parameters in BIOMOD_FormatingData.

var.import

A value characterizing how much each variable affects each model's predictions can be calculated by randomizing the variable of interest and computing the correlation between the predictions obtained with the original variable and those obtained with the shuffled variable (see bm_VariablesImportance).
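The permutation approach can be sketched in base R as follows. This is not the bm_VariablesImportance code: the data are simulated, a plain glm stands in for a biomod2 model, and the score is assumed to be one minus the correlation between reference predictions and predictions made after shuffling one variable.

```r
set.seed(42)

# Simulated data: x1 drives the response, x2 is pure noise
n   <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(2 * dat$x1))

fit      <- glm(y ~ x1 + x2, data = dat, family = binomial)
pred_ref <- predict(fit, newdata = dat, type = "response")

# Shuffle one variable at a time and compare predictions
var_importance <- sapply(c("x1", "x2"), function(v) {
  shuffled      <- dat
  shuffled[[v]] <- sample(shuffled[[v]])
  pred_shuffled <- predict(fit, newdata = shuffled, type = "response")
  1 - cor(pred_ref, pred_shuffled)
})

var_importance
```

In this sketch a score near 0 means that shuffling the variable barely changes the predictions, while a higher score indicates a stronger influence. Repeating the shuffle several times, as var.import does, gives an idea of the variability of the score.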

weights & prevalence

More or less weight can be given to some specific observations.
Automatically created weights will be integer values to prevent some modeling issues.
Note that MAXENT, MAXNET, RF, RFd and SRE models do not take weights into account.

If prevalence = 0.5 (the default), presences and absences will be weighted equally (i.e. the weighted sum of presences equals the weighted sum of absences).
If prevalence is set below (above) 0.5, more weight will be given to absences (presences).
If weights is defined, prevalence argument will be ignored (EXCEPT for MAXENT).
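The effect of prevalence on weights can be illustrated with a small base R sketch. It is not the weighting code used by biomod2 (which, as noted above, produces integer weights): the response vector is made up and the weights are left as real numbers, chosen so that the weighted proportion of presences equals prevalence.

```r
# Hypothetical response: 20 presences and 80 absences
resp       <- c(rep(1, 20), rep(0, 80))
prevalence <- 0.5

n_pres <- sum(resp == 1)
n_abs  <- sum(resp == 0)

# Presences get weight 1; absences are rescaled so that
# sum(presence weights) / sum(all weights) equals prevalence
w <- ifelse(resp == 1, 1, (n_pres * (1 - prevalence)) / (n_abs * prevalence))

weighted_prev <- sum(w[resp == 1]) / sum(w)
c(presences = sum(w[resp == 1]), absences = sum(w[resp == 0]),
  weighted_prevalence = weighted_prev)
```

With prevalence set below 0.5, the same formula gives the absences a larger total weight than the presences, which matches the behaviour described above.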

scale.models

A binomial GLM is created to scale predictions from 0 to 1.
SRE is never scaled, and ANN and FDA categorical models always are.
Note that it may lead to reduction in projected scale amplitude.
This parameter is experimental and its use is not recommended. It was developed to ensure comparable predictions by removing the scale prediction effect (the more extended projections are, the more they influence ensemble forecasting results).
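The rescaling idea can be sketched in base R: a binomial GLM is fitted with the observed response as a function of the raw predictions, and its fitted probabilities are used as scaled predictions. The data below are simulated and the variable names are hypothetical; this is an illustration of the principle, not the code used by biomod2.

```r
set.seed(1)

# Simulated observations and raw predictions on an arbitrary scale
obs      <- rbinom(100, size = 1, prob = 0.5)
raw_pred <- 300 + 200 * obs + rnorm(100, sd = 150)

# Binomial GLM linking observations to raw predictions
scaling_glm <- glm(obs ~ raw_pred, family = binomial)

# Scaled predictions are the fitted probabilities, bounded by 0 and 1
scaled_pred <- predict(scaling_glm,
                       newdata = data.frame(raw_pred = raw_pred),
                       type = "response")

range(raw_pred)
range(scaled_pred)
```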

Author

Wilfried Thuiller, Damien Georges, Robin Engler

Examples

library(terra)

# Load species occurrences (6 species available)
data(DataSpecies)
head(DataSpecies)

# Select the name of the studied species
myRespName <- 'GuloGulo'

# Get corresponding presence/absence data
myResp <- as.numeric(DataSpecies[, myRespName])

# Get corresponding XY coordinates
myRespXY <- DataSpecies[, c('X_WGS84', 'Y_WGS84')]

# Load environmental variables extracted from BIOCLIM (bio_3, bio_4, bio_7, bio_11 & bio_12)
data(bioclim_current)
myExpl <- terra::rast(bioclim_current)

# Crop the environmental layers to a smaller extent
myExtent <- terra::ext(0, 30, 45, 70)
myExpl <- terra::crop(myExpl, myExtent)

# ---------------------------------------------------------------------------- #
# Format Data with true absences
myBiomodData <- BIOMOD_FormatingData(resp.name = myRespName,
                                     resp.var = myResp,
                                     resp.xy = myRespXY,
                                     expl.var = myExpl)


# ---------------------------------------------------------------------------- #
# Model single models
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'AllModels',
                                    models = c('RF', 'GLM'),
                                    CV.strategy = 'random',
                                    CV.nb.rep = 2,
                                    CV.perc = 0.8,
                                    OPT.strategy = 'bigboss',
                                    metric.eval = c('TSS','AUCroc'),
                                    var.import = 2,
                                    seed.val = 42)
myBiomodModelOut

# Get evaluation scores & variables importance
get_evaluations(myBiomodModelOut)
get_variables_importance(myBiomodModelOut)

# Represent evaluation scores 
bm_PlotEvalMean(bm.out = myBiomodModelOut, dataset = 'calibration')
bm_PlotEvalMean(bm.out = myBiomodModelOut, dataset = 'validation')
bm_PlotEvalBoxplot(bm.out = myBiomodModelOut, group.by = c('algo', 'run'))

# # Represent variables importance 
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('expl.var', 'algo', 'algo'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('expl.var', 'algo', 'run'))
# bm_PlotVarImpBoxplot(bm.out = myBiomodModelOut, group.by = c('algo', 'expl.var', 'run'))

# # Represent response curves 
# mods <- get_built_models(myBiomodModelOut, run = 'RUN1')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut, 
#                       models.chosen = mods,
#                       fixed.var = 'median')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut, 
#                       models.chosen = mods,
#                       fixed.var = 'min')
# mods <- get_built_models(myBiomodModelOut, full.name = 'GuloGulo_allData_RUN2_RF')
# bm_PlotResponseCurves(bm.out = myBiomodModelOut, 
#                       models.chosen = mods,
#                       fixed.var = 'median',
#                       do.bivariate = TRUE)
