Extraction process

COCONUT provides its data as a MongoDB dump. The database is split into several collections; the largest one, sourceNaturalProducts, holds most of the data. For easier processing we exported this collection as JSON. Using a self-created schema, we set up several rules to transform the JSON data into RDF. For the transformation we used eccenca Corporate Memory.
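For orientation, a minimal sketch of the export step, assuming the dump is restored into a local MongoDB instance first (both the database name coconut and the dump path are assumptions, not documented by COCONUT):

  # Restore the dump into a local MongoDB instance (path and database name assumed)
  mongorestore --db=coconut dump/coconut/

  # Export the sourceNaturalProducts collection; mongoexport writes one JSON
  # document per line, i.e. the JSON Lines file used in the commands below
  mongoexport --db=coconut --collection=sourceNaturalProducts --out=sourceNaturalProduct.jsonl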

The schema can be found at https://coconutKG.aksw.org/dataset/index.html

Schema creation process

Some properties occur multiple times in an original COCONUT record, probably because COCONUT was assembled from multiple source datasets. As far as these duplications were visible to us, we decided not to carry them over into our ontology. We used the command-line tool jq to verify that the duplicated fields really agree. For example, to confirm that simpleInchi and uniqueNaturalProduct.inchi always share the same value, we used the following command, which prints every record where the two differ (so empty output confirms the duplication):

  jq 'select(.simpleInchi != .uniqueNaturalProduct.inchi)' sourceNaturalProduct.jsonl

The same pattern can be applied to the other pairs; a sketch follows the list below.

List of duplicates:

  • $oid and uniqueNaturalProduct's $oid (the MongoDB object ID appears both at the top level and inside the embedded record)

  • totalAtomNumber and total_atom_number

  • simpleInchi and inchi

  • simpleInchiKey and inchikey
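A sketch of these checks (which side of each pair sits under uniqueNaturalProduct is an assumption inferred from the inchi example above; empty output means a pair always agrees):

  # Every line printed is a record where the given pair disagrees
  jq 'select(._id."$oid" != .uniqueNaturalProduct._id."$oid")' sourceNaturalProduct.jsonl
  jq 'select(.totalAtomNumber != .uniqueNaturalProduct.total_atom_number)' sourceNaturalProduct.jsonl
  jq 'select(.simpleInchi != .uniqueNaturalProduct.inchi)' sourceNaturalProduct.jsonl
  jq 'select(.simpleInchiKey != .uniqueNaturalProduct.inchikey)' sourceNaturalProduct.jsonl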

We decided to merge citation and citationDOI into citationDOI.

synonyms and uniqueNaturalProduct.synonyms are merged in the same way.

Since continent is empty in many cases while geoLocation does carry information about the origin, we decided to keep only geoLocation.
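The merging itself is presumably expressed in the Corporate Memory transformation rules; purely as an illustration, an equivalent jq pre-processing step could look like this (it assumes the citation and synonym fields are JSON arrays, which is an assumption):

  # Illustrative only: merge the duplicated fields and drop the superseded ones.
  # Assumes citation, citationDOI and both synonyms fields are arrays or null;
  # the // [] fallback handles missing fields.
  jq '.citationDOI = (((.citationDOI // []) + (.citation // [])) | unique)
      | .synonyms = (((.synonyms // []) + (.uniqueNaturalProduct.synonyms // [])) | unique)
      | del(.citation, .continent)' sourceNaturalProduct.jsonl > merged.jsonl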

We also used jq to find out which properties contain no values at all. For example, the following command prints every non-empty value of uniqueNaturalProduct.collection, so empty output means the property is always empty:

  jq '.uniqueNaturalProduct.collection | select(length > 0)' sourceNaturalProduct.jsonl

A sketch generalizing this check to all candidates follows the list below.

List of empty properties:

  • allTaxa

  • collection

  • allWikidataIds

  • taxid

  • allChemClassifications
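The generalized check, looping the command above over all candidates (that every listed property lives under uniqueNaturalProduct, as collection does, is an assumption):

  # Prints nothing under a property's header if no record carries a value for it
  for p in allTaxa collection allWikidataIds taxid allChemClassifications; do
    echo "== ${p} =="
    jq ".uniqueNaturalProduct.${p} | select(length > 0)" sourceNaturalProduct.jsonl
  done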