
EcoCleanR: Overview of Steps for Merging Data from Online Biodiversity Resources
data_merging.Rmd
Introduction
This tutorial walks step by step through downloading occurrence data from GBIF, OBIS, and iDigBio using existing R packages, and through reading InvertEbase data from a local CSV file. The records from each source are standardized to a common format and then merged into a single dataset.
Example species: Mexacanthina lugubris
Input species name
library(rgbif) # provides name_backbone() and occ_data()
species_name <- "Mexacanthina lugubris"
taxonkey <- name_backbone(species_name)$usageKey
Create an attribute list with TDWG-standardized names
The attributes in this list can be changed or extended as required.
attribute_list <- c(
  "source", "catalogNumber", "basisOfRecord", "occurrenceStatus", "institutionCode",
  "verbatimEventDate", "scientificName", "individualCount", "organismQuantity", "abundance",
  "decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters", "locality",
  "verbatimLocality", "municipality", "county", "stateProvince", "country", "countryCode"
)
GBIF - data extraction and standardization
This step uses the occ_data function of the rgbif package to extract data from GBIF.
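The standardization applied to the download in this step (padding missing fields with NA, then combining individualCount and organismQuantity into abundance) can be tried offline first. The following is a minimal sketch on an invented toy table; `toy.occ`, `toy_attributes`, and all values are hypothetical stand-ins, not GBIF output.

```r
# Toy stand-in for a downloaded occurrence table (values invented for illustration).
toy.occ <- data.frame(
  catalogNumber = c("A1", "A2", "A3"),
  individualCount = c("4", NA, NA),
  organismQuantity = c(NA, "7", NA),
  stringsAsFactors = FALSE
)

# Shortened, hypothetical attribute list; the full list is handled the same way.
toy_attributes <- c(
  "source", "catalogNumber", "individualCount",
  "organismQuantity", "abundance", "locality"
)

# Add every attribute that the table lacks, filled with NA.
for (field in setdiff(toy_attributes, names(toy.occ))) {
  toy.occ[[field]] <- NA
}

# Prefer individualCount; fall back to organismQuantity when it is missing.
count <- as.numeric(toy.occ$individualCount)
quantity <- as.numeric(toy.occ$organismQuantity)
toy.occ$abundance <- ifelse(is.na(count), quantity, count)

# Tag the source and keep only the standardized columns, in list order.
toy.occ$source <- "toy"
toy.occ_temp <- toy.occ[, toy_attributes]
print(toy.occ_temp)
```

Subsetting by the attribute list at the end fixes both the set and the order of columns, which is what lets tables from different sources line up later.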
gbif.occ <- occ_data(taxonKey = taxonkey, occurrenceStatus = NULL, limit = 10000L)$data
# see article/cite_data.Rmd for instructions on how to cite data from GBIF data providers
## add a field that records the source of each record
gbif.occ$source <- "gbif"
## add any attribute missing from the download as an NA column
for (field in attribute_list) {
  if (!field %in% names(gbif.occ)) {
    gbif.occ[[field]] <- NA # Add the missing field as NA
  }
}
## build a single abundance column: use individualCount, falling back to organismQuantity when it is NA
gbif.occ$abundance <- ifelse(is.na(as.numeric(gbif.occ$individualCount)), as.numeric(gbif.occ$organismQuantity), as.numeric(gbif.occ$individualCount))
## keep only the standardized attributes, in list order
gbif.occ_temp <- gbif.occ[, attribute_list]
str(gbif.occ_temp[, 1:3])
OBIS - data extraction and standardization
This step uses the occurrence function of the robis package to extract data from OBIS.
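library(robis) # provides occurrence()
obis.occ <- occurrence(species_name)
## add any attribute missing from the download as an NA column
for (field in attribute_list) {
  if (!field %in% names(obis.occ)) {
    obis.occ[[field]] <- NA # Add the missing field as NA
  }
}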
obis.occ$abundance <- ifelse(is.na(as.numeric(obis.occ$individualCount)), as.numeric(obis.occ$organismQuantity), as.numeric(obis.occ$individualCount))
obis.occ$source <- "obis"
obis.occ$municipality <- ""
obis.occ_temp <- obis.occ[, attribute_list]
str(obis.occ_temp[, 1:3])
iDigBio - data extraction and standardization
This step uses the idig_search_records function of the ridigbio package to extract data from iDigBio.
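Because the iDigBio fields arrive in lower case, this step is mostly a renaming exercise. As an offline illustration, the sketch below renames a toy table with a base-R lookup vector instead of dplyr's rename; `toy.idig`, `dwc_lookup`, and the values are invented for the example and are not part of ridigbio.

```r
# Toy stand-in for an iDigBio result with lower-case field names (values invented).
toy.idig <- data.frame(
  catalognumber = c("X1", "X2"),
  scientificname = c("mexacanthina lugubris", "mexacanthina lugubris"),
  geopoint.lat = c(32.5, 31.9),
  geopoint.lon = c(-117.1, -116.7),
  stringsAsFactors = FALSE
)

# Named vector: names are the target Darwin Core terms, values the current column names.
dwc_lookup <- c(
  catalogNumber = "catalognumber",
  scientificName = "scientificname",
  decimalLatitude = "geopoint.lat",
  decimalLongitude = "geopoint.lon",
  stateProvince = "stateprovince" # absent from the toy table on purpose
)

# Rename only the columns that are present, so a missing field does not raise an error.
present <- dwc_lookup[dwc_lookup %in% names(toy.idig)]
names(toy.idig)[match(present, names(toy.idig))] <- names(present)

names(toy.idig)
```

A lookup vector keeps the mapping in one reusable object and tolerates absent columns, whereas rename stops with an error when a listed column does not exist. Either approach yields the same column names when every field is present.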
library(ridigbio) # provides idig_search_records()
library(dplyr) # provides %>%, mutate(), and rename()
idig.occ <- idig_search_records(
  rq = list("scientificname" = species_name),
  fields = "all",
  max_items = 10000L,
  limit = 10000L,
  offset = 0
)
idig.occ <- idig.occ %>%
  mutate(
    abundance = as.numeric(individualcount),
    source = "idigbio",
    occurrenceStatus = "",
    organismQuantity = ""
  ) %>%
  rename(
    decimalLatitude = geopoint.lat,
    decimalLongitude = geopoint.lon,
    basisOfRecord = basisofrecord,
    catalogNumber = catalognumber,
    scientificName = scientificname,
    stateProvince = stateprovince,
    coordinateUncertaintyInMeters = coordinateuncertainty,
    individualCount = individualcount,
    institutionCode = institutioncode,
    verbatimLocality = verbatimlocality,
    verbatimEventDate = verbatimeventdate,
    countryCode = countrycode
  )
idig.occ_temp <- idig.occ[, attribute_list]
str(idig.occ_temp[, 1:3])
Local file (InvertEbase) - data read and standardization
The local file “example_sp_invertebase” was downloaded manually from InvertEbase for Mexacanthina lugubris. See the example_sp_invertebase dataset for its attributes and Darwin Core (DwC) format.
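When the InvertEbase export sits on disk as a CSV file rather than in a packaged dataset, it can be read with base R. The sketch below is only an illustration: it writes an invented two-row CSV to a temporary file so that it runs without a download, and the column set, the values, and the "invertebase" source label are assumptions rather than the real export.

```r
# Write a tiny CSV to a temporary file so the example runs without any download.
# The rows are invented; a real export would come from the InvertEbase portal.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "catalogNumber,scientificName,individualCount,decimalLatitude,decimalLongitude",
  "INV-1,Mexacanthina lugubris,3,32.6,-117.2",
  "INV-2,Mexacanthina lugubris,,31.8,-116.6"
), csv_path)

# Read the file; empty cells are treated as NA.
local.occ <- read.csv(csv_path, stringsAsFactors = FALSE, na.strings = c("", "NA"))

# Derive abundance from individualCount and tag the source (label is hypothetical).
local.occ$abundance <- as.numeric(local.occ$individualCount)
local.occ$source <- "invertebase"
str(local.occ)
```

Treating empty cells as NA at read time means the later NA-based checks behave the same for the local file as for the downloaded sources.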
library(EcoCleanR) # provides the example_sp_invertebase dataset
sym.occ <- example_sp_invertebase
sym.occ$abundance <- as.numeric(sym.occ$individualCount)
## add any attribute missing from the file as an NA column
for (field in attribute_list) {
  if (!field %in% names(sym.occ)) {
    sym.occ[[field]] <- NA # Add the missing field as NA
  }
}
str(sym.occ[, 1:3])
Merging the databases
The ec_db_merge function in the EcoCleanR package merges data from all sources, provided that every source has the same attribute names and the same number of columns. It also filters the data by the specified type (e.g., modern or fossil) and removes records whose occurrenceStatus is ‘absent’.
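The precondition described here (identical attribute names and column counts across sources) can be checked before merging. The following base-R sketch uses invented toy tables to show such a check, followed by a plain row-wise stack and an absent-record filter. It illustrates the kind of operation described and is not the implementation of ec_db_merge; `toy_a`, `toy_b`, and `cols` are hypothetical.

```r
# Two toy tables with identical columns (invented values).
cols <- c("source", "catalogNumber", "occurrenceStatus")
toy_a <- data.frame(
  source = "gbif", catalogNumber = c("G1", "G2"),
  occurrenceStatus = c("PRESENT", "ABSENT"), stringsAsFactors = FALSE
)
toy_b <- data.frame(
  source = "obis", catalogNumber = "O1",
  occurrenceStatus = NA_character_, stringsAsFactors = FALSE
)
toy_list <- list(toy_a, toy_b)

# Check the precondition: every table has exactly the expected columns, in order.
same_columns <- vapply(toy_list, function(d) identical(names(d), cols), logical(1))
stopifnot(all(same_columns))

# Stack the tables row-wise.
toy_merged <- do.call(rbind, toy_list)

# Drop records flagged as absent, keeping rows whose status is missing.
is_absent <- !is.na(toy_merged$occurrenceStatus) &
  tolower(toy_merged$occurrenceStatus) == "absent"
toy_merged <- toy_merged[!is_absent, ]
print(toy_merged)
```

Running a column check like this on the real list of tables before the merge gives an early, readable failure if one source was not fully standardized.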
db_list <- list(gbif.occ_temp, obis.occ_temp, idig.occ_temp, sym.occ)
Mixdb.occ <- ec_db_merge(db_list = db_list, datatype = "modern")
str(Mixdb.occ[, 1:3])
ec_geographic_map(Mixdb.occ, "decimalLatitude", longitude = "decimalLongitude") # map the records that have coordinate values
Further documents:
* For the data cleaning steps applied to the merged (mixdb) dataset, see the vignette: [data_cleaning]
* For citation guidelines for files downloaded from GBIF, OBIS, iDigBio, and InvertEbase, see vignettes/article/cite_data.rmd