Extraction process
COCONUT provides its data as a MongoDB dump. The database is split into several parts; the largest is sourceNaturalProducts. We exported this part as JSON for easier processing. Using a self-created schema, we set up several rules to transform the JSON data into RDF data. For the transformation we used eccenca Corporate Memory.
The schema can be found at https://coconutKG.aksw.org/dataset/index.html
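To illustrate the kind of rule-based transformation involved, here is a minimal Python sketch that maps one exported JSON record to RDF triples with rdflib. The actual transformation is performed by the rules in eccenca Corporate Memory; the namespace, resource IRI pattern, and mapped field names (coconut_id, molecular_formula) below are illustrative assumptions.

```python
import json
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Hypothetical vocabulary namespace for illustration; the real terms are
# defined by the schema at https://coconutKG.aksw.org/dataset/index.html.
CKG = Namespace("https://coconutKG.aksw.org/vocab/")

def record_to_graph(record: dict) -> Graph:
    """Map one exported sourceNaturalProducts record to RDF triples."""
    g = Graph()
    g.bind("ckg", CKG)
    nested = record.get("uniqueNaturalProduct", {})
    # Assumption: coconut_id exists and is used to mint the subject IRI.
    subject = URIRef(f"https://coconutKG.aksw.org/resource/{nested['coconut_id']}")
    g.add((subject, RDF.type, CKG.NaturalProduct))
    # Example literal mappings; the field names are assumptions.
    if nested.get("molecular_formula"):
        g.add((subject, CKG.molecularFormula, Literal(nested["molecular_formula"])))
    if record.get("simpleInchi"):
        g.add((subject, CKG.inchi, Literal(record["simpleInchi"])))
    return g

with open("sourceNaturalProduct.jsonl") as f:
    graph = Graph()
    for line in f:
        graph += record_to_graph(json.loads(line))
print(graph.serialize(format="turtle"))
```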
Schema creation process
Some properties occur multiple times in an original COCONUT record, presumably because COCONUT has been created from multiple datasets. We decided not to carry these duplications over into our ontology, as far as they were visible to us. We used the command line tool jq to verify that the duplicated fields actually hold the same information. For example, to validate that there is no record in which simpleInchi and uniqueNaturalProduct.inchi differ, we used: jq 'select(.simpleInchi != .uniqueNaturalProduct.inchi)' sourceNaturalProduct.jsonl (a generalized check is sketched after the list below).
List of duplicates:
- $oid and $oid
- totalAtomNumber and total_atom_number
- simpleInchi and inchi
- simpleInchiKey and inchikey
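As referenced above, here is a minimal Python sketch of the same duplicate check, generalized over several pairs. The pairing of top-level fields with fields nested under uniqueNaturalProduct is an assumption based on the jq example.

```python
import json

# Pairs of (top-level field, nested field) assumed to duplicate each other.
DUPLICATE_PAIRS = [
    ("totalAtomNumber", "total_atom_number"),
    ("simpleInchi", "inchi"),
    ("simpleInchiKey", "inchikey"),
]

with open("sourceNaturalProduct.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        nested = record.get("uniqueNaturalProduct", {})
        for top, inner in DUPLICATE_PAIRS:
            # Report any record where the two fields disagree, mirroring
            # jq's select(.simpleInchi != .uniqueNaturalProduct.inchi).
            if record.get(top) != nested.get(inner):
                print(f"line {line_no}: {top} != uniqueNaturalProduct.{inner}")
```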
We have decided to merge citation and citationDOI into citationDOI. The properties synonyms and uniqueNaturalProduct.synonyms will also be merged. Since the property continent in many cases contains no information while geoLocation does contain information about the origin, we decided to keep only geoLocation. A sketch of these merge rules follows below.
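A minimal Python sketch of these merge decisions as a preprocessing step. The field shapes (lists vs. scalar values) in the dump are assumptions, so the sketch coerces everything to lists before merging.

```python
def as_list(value):
    """Coerce a missing or scalar field to a list for uniform merging."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

def apply_merge_rules(record: dict) -> dict:
    """Apply the merge decisions described above to one JSON record."""
    nested = record.get("uniqueNaturalProduct", {})
    # Merge citation into citationDOI, keeping distinct values only.
    merged_citations = as_list(record.pop("citation", None)) + \
        as_list(record.get("citationDOI"))
    record["citationDOI"] = sorted(set(merged_citations))
    # Merge the two synonym lists.
    merged_synonyms = as_list(record.get("synonyms")) + \
        as_list(nested.get("synonyms"))
    record["synonyms"] = sorted(set(merged_synonyms))
    # Drop continent; geoLocation carries the origin information.
    record.pop("continent", None)
    return record
```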
We also used the command line tool jq to find out which properties never contain values. For example, the following command prints only records in which the property is non-empty, so if it produces no output the property is always empty: jq '.uniqueNaturalProduct.collection | select(length > 0)' sourceNaturalProduct.jsonl. A Python variant that scans all candidate properties in one pass is sketched after the list below.
List of empty properties:
- allTaxa
- collection
- allWikidataIds
- taxid
- allChemClassifications
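As referenced above, here is a minimal Python sketch that checks these candidate properties across the whole export in a single pass. The assumption that all listed fields live under uniqueNaturalProduct is based on the jq example.

```python
import json

# Candidate properties, assumed to be nested under uniqueNaturalProduct.
CANDIDATES = ["allTaxa", "collection", "allWikidataIds", "taxid",
              "allChemClassifications"]

non_empty = set()
with open("sourceNaturalProduct.jsonl") as f:
    for line in f:
        nested = json.loads(line).get("uniqueNaturalProduct", {})
        for prop in CANDIDATES:
            # Mirrors jq's select(length > 0): any non-empty
            # list, string, or object counts as a value.
            if nested.get(prop):
                non_empty.add(prop)

for prop in CANDIDATES:
    status = "non-empty somewhere" if prop in non_empty else "always empty"
    print(f"{prop}: {status}")
```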