Create from SOI

In this vignette, we will cover the main steps involved in the creation of the Core GeoLocator Data Package (i.e., before analysis) with the Swiss Ornitholigical Institute data, starting from the placement of order. Here are the mains steps, with the corresponding code provided below:

1. Placing order

Ask collaborator to enter basic metadata information on the project by creating a datapackage.json:

title
constributors: including email to define who has access to the private Zenodo during embargo
licences: CC-BY-4.0 by default.
embargo

Tip

This can be done with GeoLocatoR::create_gldp() (see below) or using this datapackage.json template file.

2. Sending geolocator together with pre-filled tags.csv and observations.csv

Provide basic instructions to ringers/collaborator on how to fill these csv filer with link to the documentation: tags.csv and observations.csv:

Caution

Do not modify columns header, do not add or remove column.

Make sure to fill all required fields (marked by * in documentations)

Each tags should be presents only once in tags.csv and at least one in observations.csv with observation_type: "equipment".

You will only have access to the data when these two tables are returned complete.

It is strongly recommended to generate these two files with add_gldp_soi() with pre-filled field based on the access database (see below).

3. Processing returned data

Upon reception of the geolocator:

Extract geolocator data and store the data (as usual) on the Z: drive.
Add also the filled tags.csv and observations.csv in the same folder
Create the first core DataPackage with add_gldp_soi() in any temporary drive (see below)
Upload on Zenodo
Share access right to contributors based on their urls.

library(frictionless)
library(GeoLocatoR)
library(tidyverse)
library(writexl)
library(readxl)

1. Initiate DataPackage with `datapackage.json`

In the contract/agreement, the following terms should be specify:

List of contributor with email giving them access write to the datapackage on zenodo while still under embargo
Licenses of the data according to agreement
Embargo date according to agreement

Here is a simple example to create a geolocator data package based on these basic information:

pkg <- create_gldp(
  title = "Geolocator study of {species_name} in {location}", # required
  contributors = list( # required
    list(
      title = "Raphaël Nussbaumer",
      roles = c("ContactPerson", "DataCurator", "ProjectLeader"),
      email = "raphael.nussbaumer@vogelwarte.ch",
      path = "https://orcid.org/0000-0002-8185-1020",
      organization = "Swiss Ornithological Institute"
    ),
    list(
      title = "Yann Rime",
      roles = c("Researcher"),
      email = "yann.rimme@vogelwarte.ch",
      path = "https://orcid.org/0009-0005-7264-6753",
      organization = "Swiss Ornithological Institute"
    )
  ),
  # The default licenses is a Creative Common
  licenses = list(list(
    name = "CC-BY-4.0",
    title = "Creative Commons Attribution 4.0",
    path = "https://creativecommons.org/licenses/by/4.0/"
  )),
  # This is the default value, set to a past date so that there are no embargo
  embargo = "1970-01-01"
)

Read the datapackage specification to learn about all recommended metadata that can be added. They can be added in create_gldp() or update manually on pkg directly as below:

# Description is really important to provide some textual background information on the project.
pkg$description <- "Geolocator study of Mangrove Kingfisher and Red-capped Robin-chat on the coast of Kenya"
# If you have a website link, it's quite a nice way to link them up
pkg$homepage <- "https://github.com/Rafnuss/MK-RCRC"
# And the url to an image describing your datapackage
pkg$image <- NULL
# Versioning of the datapackage is a good idea to allow update of new data.
pkg$version <- "1.0.0"
# You can also add keywords:
pkg$keywords <- c("Mangrove Kingfisher", "Red-capped Robin Chat", "multi-sensor geolocator")

# Important field for publication/researcher
# By default, the licenses of the data is Creative Common BY 4.0.
# pkg$licenses
# Add DOI of the datapackage if already available or reserve it https://help.zenodo.org/docs/deposit/describe-records/reserve-doi/#reserve-doi
pkg$id <- "https://doi.org/10.5281/zenodo.11207141"
# Provide the recommended citation for the package
pkg$citation <- "Nussbaumer, R., & Rime, Y. (2024). Woodland Kingfisher: Migration route and timing of South African Woodland Kingfisher (v1.1). Zenodo. https://doi.org/10.5281/zenodo.11207141"
# Funding sources
pkg$grants <- c("Swiss Ornithological Intitute")
# Identifiers of resources related to the package (e.g. papers, project pages, derived datasets, APIs, etc.).
pkg$relatedIdentifiers <- c("Nussbaumer, R., Jackson, C. Using geolocators to unravel intra-African migrant strategies. http://dx.doi.org/10.13140/RG.2.2.34477.10721 ")
# List of references related to the package
pkg$references <- NULL

Once you’re done, you can visual them

str(pkg[!names(pkg) %in% "resources"])

You can export it as datapackage.json with:

package_json <- jsonlite::toJSON(pkg, pretty = TRUE, null = "null", na = "null", auto_unbox = TRUE)
write(package_json, "datapackage.json")

Tip

Save this file in the root of the project folder on the Z: drive

2. Create pre-filled `tags.csv` and `observations.csv` from SOI database (no data)

First, we need to extract the data from the access database.

# Root folder of the dataset (typically the Z-drive)
soi_data_directory <- "/Users/rafnuss/Library/CloudStorage/Box-Box/geolocator_data/UNIT_Vogelzug/"

# Load the geolocator (GDL) database from the access file
gdl0 <- read_gdl(access_file = file.path(soi_data_directory, "database/GDL_Data.accdb"), filter_col = FALSE)
#> Warning: One or more parsing issues, call `problems()` on your data frame for details,
#> e.g.:
#>   dat <- vroom(...)
#>   problems(dat)

For this example, we will take the geolocator data from the order OtuScoES24 and add them to the pk created earlier.

pkg <- pkg %>%
  add_gldp_soi(
    gdl = gdl0 %>% filter(str_detect(OrderName, "OtuScoES24")),
    directory_data = file.path(soi_data_directory, "data"),
    allow_empty_o = TRUE, # This is required to create an empty observation table
  )
#> Warning: We could not find the data directory for 49 tags (out of 49). GDL_IDs: 32TK,
#> 32OD, 32UE, 32TI, 32OQ, 32SL, 32SV, 32OU, 32UF, 32TJ, 32PJ, 32UX, 32OR, 32OT,
#> 32TM, 32NK, 32PI, 32UV, …, 32SS, and 32NL. These will not be imported.

A warning message should confirm that no data is available in the dataset. However, as we have allowed to create an empty observations table with allow_empty_o.

Note

In this example we used the initial datapackage pkg, but if you want to use an existing datapackage.json, you can you can create the package pkg with
pkg <- read_gldp("https://gist.githubusercontent.com/Rafnuss/457b9096a4fc0ad004115df1712a8923/raw/46d5d6e4461f919f725961b037199f4fcf8313b0/datapackage.json")
#> Warning: `file`
#> 'https://gist.githubusercontent.com/Rafnuss/457b9096a4fc0ad004115df1712a8923/raw/46d5d6e4461f919f725961b037199f4fcf8313b0/datapackage.json'
#> should have a resources property containing at least one resource.
#> ℹ Use `add_resource()` to add resources.
The warning message is normal, it’s reminding you that the package is actually empty.

As the geolocator have just been built and ready to be ship, you can generate the observations.xlsx and tags.xlsx table to be sent to the ringers. I suggest to create a .xlsx spreadsheet and not .csv to preserve the column class.

write_xlsx(tags(pkg), "./tags.xlsx")
write_xlsx(observations(pkg), "./observations.xlsx")

3. Create core DataPackage with data from SOI database

If we have data available on the Z: drive, we can use the same function add_gldp_soi() to add the three core resources on pkg

# Initiate pkg with a minimalist metadata, use exisiting datapackage.json if possible
pkg <- create_gldp(
  title = "Geolocator data of Cossypha natalensis in Kenya",
  contributors = list(
    list(
      title = "Raphaël Nussbaumer",
      roles = c("ContactPerson", "DataCurator", "ProjectLeader")
    )
  ),
)

pkg <- pkg %>% add_gldp_soi(
  gdl = gdl0 %>% filter(str_detect(OrderName, "CosNatKE")),
  directory_data = file.path(soi_data_directory, "data")
)
#> Warning: We could not find the data directory for 55 tags (out of 65). GDL_IDs: 27MK,
#> 27LF, 27LC, 27MM, 27ML, 27MO, 27LE, 24SX, 24SW, 24SY, 26YC, 26YE, 28AF, 28CF,
#> 28BE, 28BG, 28BF, 28BH, …, 32YS, and 30IJ. These will not be imported.
#> ⠙ 4/10 ETA:  7s |
#> ⠹ 5/10 ETA:  6s |
#> ⠸ 7/10 ETA:  4s |
print(pkg)
#> A GeoLocator Data Package with 3 resourcess:
#> • tags
#> • measurements
#> • observations
#> Use `unclass()` to print the Data Package as a list.

Whenever you add some new resources to a datapackage, gldp_update() should be called so that the following field are updated created, temporal, reference_location, taxonomic

4. Update `tags` and/or `observations` table

There are different way to modify your resources table.

In the ideal case, tags.csv and observations.csv have been returned by the ringer. In this case you can simply replace them with

tags(pkg) <- read_csv("tags.csv") # or read_xlsx()
observation(pkg) <- read_csv("observations.csv")

In other case, you might want to edit the default table. I suggest to create a temporary .xlsx spreadsheet (and not .csv to preserve the column class), modify it in excel, and read it back into R. You can also edit the tibble directly with dplyr functions.

temp_file <- tempfile(fileext = ".xlsx")
write_xlsx(tags(pkg), temp_file)
system(paste("open", temp_file))
# Edit it on the external program and once you're done, save you file and update the table
tags(pkg) <- read_xlsx(temp_file)

Finally, you can also modify the table (not recommended). For this vignette, I’ll just randomly create value for the two required field scientific_name and ring_number

tags(pkg) <- tags(pkg) %>%
  mutate(
    scientific_name = "Cossypha natalensis",
    ring_number = ifelse(is.na(ring_number), "XXXXX", ring_number)
  )
observations(pkg) <- observations(pkg) %>%
  mutate(
    ring_number = ifelse(is.na(ring_number), "XXXXX", ring_number)
  )

Warning

Don’t forget to update the meta-data information of your Data Package when you’ve updated any tables
pkg <- update_gldp(pkg)

5. Check and Validate datapackage

Before publishing your data, it is essential to validating your GeoLocator Data Package.

A first check can be done with check_gldp()

check_gldp(pkg)

The python package frictionless-py offers more advance and general validation.

First you’ll need to install python and the frictionless package,

pip install frictionless

validate_gldp(pkg, path = "/Users/rafnuss/anaconda3/bin/")

6. Write package to directory

Now that we have confirmed that all the data are correct, we can write the package with

write_package(pkg, directory = file.path("~/", pkg$name))

Tip

The folder created can now be uploaded on Zenodo. All information needed in the upload are provided in datapackage.json!

1. Initiate DataPackage with datapackage.json

2. Create pre-filled tags.csv and observations.csv from SOI database (no data)