Whether independent data are available or not, data-splitting methods make it possible to divide the input data into parts, so that models can be calibrated and validated on different subsets.
The most common procedures either split the original dataset randomly into two parts (random), with the larger proportion used to calibrate the models; or into k datasets of equal size (k-fold), each of them being used in turn to validate the model while the remaining ones are used for calibration. For both methods, the splitting can be repeated several times.
Other procedures are available to test for model overfitting and to assess transferability in either geographic or environmental space: the block method described in Muscarella et al. 2014 partitions the data into four bins of equal size (bottom-left, bottom-right, top-left and top-right), while the x-y-stratification described in Wenger and Olden 2012 uses k partitions along the x- (or y-) gradient and returns 2k partitions; environmental partitioning returns k partitions for each environmental variable provided.
These methods can be balanced over presences or absences to ensure an equal distribution over space, especially if some data are clumped on an edge of the study area.
The user can also define their own data partitioning (user.defined).
biomod2 allows you to use different strategies to separate your data into a calibration dataset and a validation dataset for the cross-validation. With the argument CV.strategy in BIOMOD_Modeling, you can select:

- random: the original dataset is split randomly into a calibration and a validation dataset. The splitting can be repeated nb.rep times. You can adjust the size of the split between calibration and validation with perc.
- k-fold: the k-fold method splits the original dataset into k datasets of equal size: each part is used successively as the validation dataset while the other k-1 parts are used for calibration, leading to k calibration/validation ensembles. This multiple splitting can be repeated nb.rep times.
- block: block stratification was described in Muscarella et al. 2014 (see References). Four bins of equal size are partitioned (bottom-left, bottom-right, top-left and top-right).
- strat: x and y stratification was described in Wenger and Olden 2012 (see References). y stratification uses k partitions along the y-gradient, and x stratification does the same along the x-gradient. both returns 2k partitions: k partitions stratified along the x-gradient and k partitions stratified along the y-gradient. You can choose x, y or both stratification with the argument strat.
- env: environmental stratification returns k partitions for each environmental variable given in env.var. You can choose whether the presences or the absences are balanced over the partitions with balance.
- user.defined: the user provides their own cross-validation table. For a presence-absence dataset, column names must be formatted as _allData_RUNx with x an integer. For a presence-only dataset for which several pseudo-absence datasets were generated, column names must be formatted as _PAx_RUNy with x and y integers and PAx an existing pseudo-absence dataset.
biomod2 allows you to use a cross-validation method to build (calibration) and validate (validation) the models, but models can also be tested on another independent dataset if available (evaluation).
This second independent dataset can be integrated with the eval.resp.var, eval.resp.xy and eval.expl.var parameters in the BIOMOD_FormatingData function.
Note that, while you can have as many evaluation values (computed on this second dataset) as cross-validation splits for single models, you can have only one evaluation value for ensemble models.
This can be circumvented by using em.by = 'PA+run' within the BIOMOD_EnsembleModeling function to build, for each cross-validation fold, an ensemble model across algorithms. You will obtain as many ensemble models as cross-validation splits, and thus as many evaluation values. But you will also have several ensemble models, which may defeat your purpose of having a single final model.
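As an illustration, here is a minimal sketch of such a per-fold ensemble, assuming myBiomodModelOut was built as in the examples below; the choices of em.algo and metric.select are arbitrary here:

# Build one ensemble (mean across algorithms) per pseudo-absence
# dataset x cross-validation run, instead of a single global ensemble
myBiomodEM <- BIOMOD_EnsembleModeling(bm.mod = myBiomodModelOut,
                                      em.by = 'PA+run',
                                      em.algo = 'EMmean',
                                      metric.select = 'TSS',
                                      metric.eval = c('TSS', 'ROC'))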
All the examples below use the data included in the package.
For the beginning of the code, see the main functions vignette.
To run a random cross-validation with 2 repetitions and an 80/20 split between calibration and validation:
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
modeling.id = 'Example',
models = c('RF', 'GLM'),
CV.strategy = 'random',
CV.nb.rep = 2,
CV.perc = 0.8,
metric.eval = c('TSS','ROC'))
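The k-fold strategy is selected in the same way; a sketch with 3 folds (CV.k) repeated twice:

myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
                                    modeling.id = 'Example',
                                    models = c('RF', 'GLM'),
                                    CV.strategy = 'kfold',
                                    CV.nb.rep = 2,
                                    CV.k = 3,
                                    metric.eval = c('TSS','ROC'))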
To retrieve the cross-validation table and visualize it on your BIOMOD.formated.data or BIOMOD.formated.data.PA object:
myCalibLines <- get_calib_lines(myBiomodModelOut)
plot(myBiomodData, calib.lines = myCalibLines)
To create a cross-validation table directly with bm_CrossValidation:
bm_CrossValidation(bm.format = myBiomodData,
strategy = "strat",
k = 2,
balance = "presences",
strat = "x")
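Environmental stratification works similarly; a sketch, assuming bio3 is one of the variables available in myBiomodData:

bm_CrossValidation(bm.format = myBiomodData,
                   strategy = "env",
                   k = 2,
                   balance = "presences",
                   env.var = "bio3")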
Example of a table (myCVtable) for the user.defined method:
| _PA1_RUN1 | _PA1_RUN2 | _PA2_RUN1 | _PA2_RUN2 |
|-----------|-----------|-----------|-----------|
| FALSE     | FALSE     | FALSE     | TRUE      |
| TRUE      | TRUE      | FALSE     | FALSE     |
| TRUE      | TRUE      | TRUE      | TRUE      |
| …         | …         | …         | …         |
myBiomodModelOut <- BIOMOD_Modeling(bm.format = myBiomodData,
modeling.id = 'Example',
models = c('RF', 'GLM'),
CV.strategy = 'user.defined',
CV.user.table = myCVtable,
metric.eval = c('TSS','ROC'))
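If you do not want to build such a table by hand, one option is to generate one automatically with bm_CrossValidation and adjust it before passing it to CV.user.table; a sketch:

# Returns a TRUE/FALSE matrix with properly formatted column names,
# which can then be edited and reused as a user-defined table
myCVtable <- bm_CrossValidation(bm.format = myBiomodData,
                                strategy = "random",
                                nb.rep = 2,
                                perc = 0.8)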
You can find more examples in the Secondary functions vignette.
To add an independent dataset for evaluation, you will need to provide the corresponding environmental variables (myEvalExpl) as a raster, a matrix or a data.frame.
Case 1: if your evaluation response (myEvalResp) is a raster:
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
expl.var = myExpl,
resp.xy = myRespXY,
resp.name = myRespName,
eval.resp.var = myEvalResp,
eval.expl.var = myEvalExpl)
Case 2: if your evaluation response (myEvalResp) is a vector, you will also need to provide the coordinates of your evaluation points (myEvalCoord). If myEvalExpl is a data.frame or a matrix, make sure the points are given in the same order.
myBiomodData <- BIOMOD_FormatingData(resp.var = myResp,
expl.var = myExpl,
resp.xy = myRespXY,
resp.name = myRespName,
eval.resp.var = myEvalResp,
eval.expl.var = myEvalExpl,
eval.resp.xy = myEvalCoord)
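Once the models are calibrated, the scores computed on this independent dataset can be retrieved together with the calibration and validation scores through get_evaluations; a sketch:

# data.frame with one row per model and metric, including calibration,
# validation and (when an independent dataset was provided) evaluation columns
myEvals <- get_evaluations(myBiomodModelOut)
head(myEvals)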
Wenger, S.J. and Olden, J.D. (2012), Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods in Ecology and Evolution, 3: 260-267. https://doi.org/10.1111/j.2041-210X.2011.00170.x
Muscarella, R., Galante, P.J., Soley-Guardia, M., Boria, R.A., Kass, J.M., Uriarte, M. and Anderson, R.P. (2014), ENMeval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in Ecology and Evolution, 5: 1198-1205. https://doi.org/10.1111/2041-210X.12261