Retrieving iNaturalist Observations for a region / observers with PyGbif
Good evening everyone
Following this first message, I've made a little progress in my discovery of PyGBIF and the GBIF API.
With the help of ChatGPT, I've written a script that works for my needs: retrieving iNaturalist observations from a given territory for a list of users (the aim being to integrate them into another local naturalist database).
I'm sharing the script with you, and I'd be grateful for any feedback or suggestions for improvement.
For example, it doesn't appear to be possible to pass a list of observers as a search parameter: I therefore retrieve all the observations from the bounding box and then filter them afterwards (in SQL, after loading the data into the database).
Here's the script:
import csv
from pygbif import occurrences as occ
# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335
# Search params
search_params = {
    'country': 'FR',  # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',  # iNaturalist dataset
    'limit': 300,  # Limit of 300 occurrences per page
}
# Function to retrieve all occurrences with pagination
def get_all_occurrences(params):
    all_occurrences = []
    offset = 0
    while True:
        params['offset'] = offset
        occurrences = occ.search(**params)
        results = occurrences['results']
        if not results:
            break
        all_occurrences.extend(results)
        offset += len(results)
        print(f"{offset} occurrences récupérées...")
    return all_occurrences
# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params)
# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())
# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'
with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})
print(f"iNaturalist occurrences have been recorded in the {output_file} file.")
Thanks for the code. I'm executing it as this is being written. The screen output so far is:
…
14400 occurrences récupérées...
14700 occurrences récupérées...
…
Roughly how many occurrences should we expect before the data is written to the output file and the program terminates?
Aha! Here is the answer to my question:
15365 occurrences récupérées...
iNaturalist occurrences have been recorded in the occurrences_GBIF_iNaturalist.csv file.
Yes, that's roughly the amount of data to be retrieved.
I've noticed that the first batches of data are retrieved quickly, but then it gradually slows down.
Perhaps this is a limitation to prevent abuse?
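As a side note, the expected total doesn't have to be guessed: the dict returned by occ.search() includes a 'count' field with the total number of matching records (and an 'endOfRecords' flag marking the last page), so a quick probe with the same filters shows the total up front. A minimal sketch:

from pygbif import occurrences as occ

# Probe with limit=1 just to read the total count for these filters
probe = occ.search(
    datasetKey='50c9509d-22c7-4a22-a47d-8c48425ef4a7',  # iNaturalist dataset
    country='FR',
    decimalLatitude='48.175391,48.977037',
    decimalLongitude='-0.867335,0.98335',
    limit=1,
)
print(probe['count'])  # total number of matching occurrences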
(By the way, sorry, I've realised that I haven't translated all the French comments/texts.)
Regarding your list of users, is the 'identifiedBy' key the one of interest?
With the following modification for a single user, 1579 occurrences were retrieved:
# Search params
id_by = 'Sylvain Montagner'
search_params = {
    'identifiedBy': id_by,
    'country': 'FR',  # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',  # iNaturalist dataset
    'limit': 300,  # Limit of 300 occurrences per page
}
Given a list of users, you could reorganize your code to loop through that list, performing a similar search for one user at a time in order to build up the collection of data. Within the loop, you would update the id_by variable and perform a search on each iteration.
Give the following a try for working with a list of observers:
import csv
from pygbif import occurrences as occ
# Bounding box (latitude/longitude)
min_latitude = 48.175391
max_latitude = 48.977037
min_longitude = -0.867335
max_longitude = 0.98335
# List of observers
observer_list = ['Sylvain Montagner', 'Clément Maouche', 'Quentin Benet-Cibois']
# Search params
search_params = {
    'country': 'FR',  # France
    'decimalLatitude': f'{min_latitude},{max_latitude}',
    'decimalLongitude': f'{min_longitude},{max_longitude}',
    'datasetKey': '50c9509d-22c7-4a22-a47d-8c48425ef4a7',  # iNaturalist dataset
    'limit': 300,  # Limit of 300 occurrences per page
}
# Function to retrieve all occurrences with pagination
def get_all_occurrences(params, observers):
    all_occurrences = []
    for observer in observers:
        offset = 0
        params['identifiedBy'] = observer
        while True:
            params['offset'] = offset
            occurrences = occ.search(**params)
            results = occurrences['results']
            if not results:
                break
            all_occurrences.extend(results)
            offset += len(results)
            print(f"{offset} occurrences retrieved for observer {observer}...")
    return all_occurrences
# Retrieve all occurrences
all_occurrences = get_all_occurrences(search_params, observer_list)
# List all available fields
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())
# Save occurrences in a CSV file
output_file = 'occurrences_GBIF_iNaturalist.csv'
with open(output_file, mode='w', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=list(all_fields))
    writer.writeheader()
    for occurrence in all_occurrences:
        writer.writerow({field: occurrence.get(field, '') for field in all_fields})
print(f"iNaturalist occurrences have been recorded in the {output_file} file.")
Why not filter the SIMPLE_CSV download with something like AWK before integrating into the database and having to devise SQL queries?
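(Sticking with Python rather than AWK for consistency with the rest of the thread, the same idea could look roughly like the sketch below, assuming a tab-delimited SIMPLE_CSV export with a recordedBy column; the file names are hypothetical.)

import csv

# Hypothetical pre-filter of a GBIF SIMPLE_CSV (tab-delimited) download by observer
observers = {'Sylvain Montagner', 'Clément Maouche', 'Quentin Benet-Cibois'}
with open('gbif_download.csv', newline='', encoding='utf-8') as src, \
     open('gbif_download_filtered.csv', 'w', newline='', encoding='utf-8') as dst:
    reader = csv.DictReader(src, delimiter='\t')
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter='\t')
    writer.writeheader()
    for row in reader:
        if row.get('recordedBy') in observers:
            writer.writerow(row)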
Thank you @quercitron
I've replaced identifiedBy with recordedBy for my purposes, and it works really well!
It avoids downloading tens of thousands of records that we don't need here.
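(In other words, in the loop from the script above, the assignment becomes something like:)

params['recordedBy'] = observer  # instead of params['identifiedBy'] = observer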
I hadn't tested this because ChatGPT had written to me:
The recordedBy parameter is not directly available in the GBIF API search parameters via pygbif. The GBIF API allows hits to be filtered by many criteria, but recordedBy is not a filter parameter supported directly by the API in standard hit searches.
I can therefore see that it had misled me.
So far, the script meets my needs!
I may come back with more questions or improvements, but don't hesitate to let me know if you see anything that could be improved.
Simply because I don't know AWK, and I'm fairly comfortable with SQL.
In any case, data management is done in SQL in our local databases.
In fact, I'm working on an improvement to the script that doesn't use a CSV export, but instead integrates the data directly into the database (tested OK with a local SQLite database, though eventually it will run on a PostgreSQL server).
@Sylvain_M, many thanks for your explanation. I would be very interested to learn how you imported the data (format?) into SQLite. If you don't think this forum is the appropriate place for technical details, please feel free to email me directly: robert.mesibov@gmail.com
Don't worry: I think this forum is suitable for this kind of discussion.
It's just that I won't have enough time to explain in detail (I'm doing this work voluntarily, and I'm making very slow progress).
Here's the part of the code that concerns the connection to the SQLite database (Spatialite).
But I ran into problems naming the fields, with unsupported characters and reserved names, hence the rather complex code (suggested by ChatGPT, and probably not as good as what a 'real developer' would have done).
import re
import sqlite3

# Clean column names (all_occurrences and table_name are defined earlier in the full script)
def clean_column_name(name):
    return re.sub(r'\W|^(?=\d)', '_', name)
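# For illustration (not in the original script): non-word characters become underscores and a
# leading digit gets an underscore prefix; word characters (letters, digits, underscore) are kept:
#   clean_column_name('http://unknown.org/captive')  ->  'http___unknown_org_captive'
#   clean_column_name('decimalLatitude')             ->  'decimalLatitude'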
# Convert values to supported types
def convert_value(value):
    if isinstance(value, (int, float, str)):
        return value
    else:
        return str(value)
# Create and connect to the Spatialite database
conn = sqlite3.connect('gbif.db')
cursor = conn.cursor()
# Collect all fields from all occurrences
all_fields = set()
for occurrence in all_occurrences:
    all_fields.update(occurrence.keys())
# Sort fields alphabetically for consistent order
ordered_fields = sorted(all_fields)
# Cleaned column names
cleaned_fields = [clean_column_name(field) for field in ordered_fields]
fields_definition = ', '.join([f'"{field}" TEXT' for field in cleaned_fields])
create_table_query = f'CREATE TABLE IF NOT EXISTS "{table_name}" ({fields_definition});'
cursor.execute(create_table_query)
# Insert occurrences into the database
for occurrence in all_occurrences:
    cleaned_occurrence = {clean_column_name(k): convert_value(v) for k, v in occurrence.items()}
    columns = ', '.join([f'"{clean_column_name(field)}"' for field in ordered_fields])
    placeholders = ', '.join(['?' for _ in ordered_fields])
    values = [cleaned_occurrence.get(clean_column_name(field), '') for field in ordered_fields]
    insert_query = f'INSERT INTO "{table_name}" ({columns}) VALUES ({placeholders})'
    cursor.execute(insert_query, tuple(values))
# Commit and close the database connection
conn.commit()
conn.close()
For the moment, all the fields are stored as text, which is not optimal: the way the type of each field is recognised needs to be improved.
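One possible direction, sketched here on the assumption that all_occurrences, ordered_fields and clean_column_name from the script above are available: infer a column type from the values actually seen for each field before creating the table, and fall back to TEXT for anything ambiguous.

# Hypothetical type inference: a column becomes INTEGER or REAL only if every
# non-empty value seen for that field parses as such; everything else stays TEXT.
def infer_sqlite_type(values):
    def parses(value, cast):
        try:
            cast(str(value))
            return True
        except (TypeError, ValueError):
            return False
    non_empty = [v for v in values if v not in (None, '')]
    if non_empty and all(parses(v, int) for v in non_empty):
        return 'INTEGER'
    if non_empty and all(parses(v, float) for v in non_empty):
        return 'REAL'
    return 'TEXT'

column_types = {field: infer_sqlite_type([rec.get(field) for rec in all_occurrences])
                for field in ordered_fields}
fields_definition = ', '.join(f'"{clean_column_name(field)}" {column_types[field]}'
                              for field in ordered_fields)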
@Sylvain_M, many thanks again. Your regex substitution is a little surprising, but I don't have a list of your raw column names (note that \W will not match underscore, so underscores are kept). And I guess you already know that typing Darwin Core fields will generate large numbers of exceptions, even after GBIF processing, so to enforce typing you will need to do a significant amount of cleaning.
Here are all the fields retrieved:
original | cleaned |
---|---|
acceptedScientificName | acceptedScientificName |
acceptedTaxonKey | acceptedTaxonKey |
basisOfRecord | basisOfRecord |
catalogNumber | catalogNumber |
class | class |
classKey | classKey |
collectionCode | collectionCode |
continent | continent |
coordinateUncertaintyInMeters | coordinateUncertaintyInMeters |
country | country |
countryCode | countryCode |
crawlId | crawlId |
datasetKey | datasetKey |
datasetName | datasetName |
dateIdentified | dateIdentified |
day | day |
decimalLatitude | decimalLatitude |
decimalLongitude | decimalLongitude |
endDayOfYear | endDayOfYear |
eventDate | eventDate |
eventTime | eventTime |
extensions | extensions |
facts | facts |
family | family |
familyKey | familyKey |
gadm | gadm |
gbifID | gbifID |
gbifRegion | gbifRegion |
genericName | genericName |
genus | genus |
genusKey | genusKey |
geodeticDatum | geodeticDatum |
hostingOrganizationKey | hostingOrganizationKey |
http://unknown.org/captive | http___unknown_org_captive |
http://unknown.org/nick | http___unknown_org_nick |
identificationID | identificationID |
identificationRemarks | identificationRemarks |
identifiedBy | identifiedBy |
identifiedByIDs | identifiedByIDs |
identifier | identifier |
identifiers | identifiers |
informationWithheld | informationWithheld |
infraspecificEpithet | infraspecificEpithet |
installationKey | installationKey |
institutionCode | institutionCode |
isInCluster | isInCluster |
isSequenced | isSequenced |
issues | issues |
iucnRedListCategory | iucnRedListCategory |
key | key |
kingdom | kingdom |
kingdomKey | kingdomKey |
lastCrawled | lastCrawled |
lastInterpreted | lastInterpreted |
lastParsed | lastParsed |
license | license |
lifeStage | lifeStage |
media | media |
modified | modified |
month | month |
occurrenceID | occurrenceID |
occurrenceRemarks | occurrenceRemarks |
occurrenceStatus | occurrenceStatus |
order | order |
orderKey | orderKey |
phylum | phylum |
phylumKey | phylumKey |
protocol | protocol |
publishedByGbifRegion | publishedByGbifRegion |
publishingCountry | publishingCountry |
publishingOrgKey | publishingOrgKey |
recordedBy | recordedBy |
recordedByIDs | recordedByIDs |
references | references |
relations | relations |
reproductiveCondition | reproductiveCondition |
rightsHolder | rightsHolder |
scientificName | scientificName |
sex | sex |
species | species |
speciesKey | speciesKey |
specificEpithet | specificEpithet |
startDayOfYear | startDayOfYear |
stateProvince | stateProvince |
taxonID | taxonID |
taxonKey | taxonKey |
taxonRank | taxonRank |
taxonomicStatus | taxonomicStatus |
verbatimEventDate | verbatimEventDate |
verbatimLocality | verbatimLocality |
year | year |