Increasing Crossref Data Reusability With Format Experiments - Crossref

Every year, Crossref releases a full public data file of all of our metadata. This is partly a commitment to POSI and partly just what we do. We want the community to re-use our metadata and to find interesting ends to which they can be put!


This is a companion discussion topic for the original entry at https://0-www-crossref-org.pugwash.lib.warwick.ac.uk/blog/increasing-crossref-data-reusability-with-format-experiments

Hi, thanks for enabling the discussion!

JSONL would already be a great improvement for us in ingesting the file. Currently, to load this into Hive we need to unzip and flatten all of the records (effectively into JSONL), concatenate and re-zip the files down to a reasonable number to reduce PUT ops to HDFS, mount the files in Hive, and apply a handwritten schema.
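For reference, the flatten step looks roughly like the sketch below. It assumes each file in the dump is a gzipped JSON document with an "items" array of work records (as in recent public data files); the directory names are placeholders, and the concatenate/re-zip and Hive steps are left out.

```python
import gzip
import json
import pathlib

# Assumed layout (may differ between releases): each dump file is a gzipped
# JSON document whose "items" key holds an array of work records.
SRC = pathlib.Path("crossref_public_data_file")  # placeholder input directory
DST = pathlib.Path("jsonl")                      # placeholder output directory
DST.mkdir(exist_ok=True)

for path in sorted(SRC.glob("*.json.gz")):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        items = json.load(fh)["items"]
    # One gzipped JSONL file per input file, e.g. 0.json.gz -> 0.jsonl.gz
    with gzip.open(DST / f"{path.stem}l.gz", "wt", encoding="utf-8") as out:
        for record in items:
            out.write(json.dumps(record) + "\n")
```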

To do all of that whilst still being future-proof against schema changes, we would propose Avro as an alternative data format.

It:

  • Can be split into smaller files for distribution
  • Supports native compression (snappy)
  • Embeds the schema inside the file, allowing for full hydration natively by clients (e.g. Hadoop, BigQuery, Databricks, Snowflake) - also good for data integrity
  • Facilitates schema evolution
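As a rough illustration of the embedded-schema and native-compression points, converting the flattened JSONL to Avro could look like the sketch below, using the third-party fastavro library. The schema shown covers only a tiny, made-up subset of fields; real Crossref records would need a much fuller (ideally officially maintained) schema.

```python
import gzip
import json

from fastavro import parse_schema, writer  # pip install fastavro python-snappy

# Tiny illustrative schema only -- the real Crossref record has many more
# (and more deeply nested) fields, so an official schema would be needed.
schema = parse_schema({
    "type": "record",
    "name": "Work",
    "namespace": "org.crossref",
    "fields": [
        {"name": "DOI", "type": "string"},
        {"name": "type", "type": "string"},
        {"name": "title", "type": {"type": "array", "items": "string"}, "default": []},
    ],
})

def records(path):
    """Yield only the schema's fields from a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            work = json.loads(line)
            yield {
                "DOI": work.get("DOI", ""),
                "type": work.get("type", ""),
                "title": work.get("title", []),
            }

with open("works.avro", "wb") as out:
    # The schema is embedded in the file header; snappy compression is native.
    writer(out, schema, records("jsonl/0.jsonl.gz"), codec="snappy")
```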

To be even more POSI, has Crossref considered making its metadata available as a Dolt [1] database, and keeping such a Dolt database up to date regularly rather than annually?

My current impression is that:

  • the risk of vendor lock-in is extremely small,
  • Crossref has the option to host it themselves or have a database on dolthub [2],
  • hosting the data on dolthub.com will be free,
  • a Dolt database can handle this size of data,
  • community members can easily make clone databases (with enough available disk space),
  • clone databases can efficiently stay up-to-date with only differences being copied,
  • cloned databases automatically get the schema and SQL structure without any extra work.

[1] https://github.com/dolthub/dolt
[2] https://www.dolthub.com
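To make the clone / stay-up-to-date point concrete, the consumer-side workflow could look roughly like this. The crossref/metadata remote and the works table are purely hypothetical; nothing like that exists on DoltHub today.

```python
import pathlib
import subprocess

# Hypothetical remote and database name -- used only to sketch the workflow.
REMOTE = "crossref/metadata"
LOCAL = pathlib.Path("metadata")  # `dolt clone` creates a directory named after the database

def dolt(*args, cwd=None):
    """Run a dolt CLI command and return its stdout."""
    return subprocess.run(["dolt", *args], check=True, capture_output=True,
                          text=True, cwd=cwd).stdout

if not LOCAL.exists():
    dolt("clone", REMOTE)       # full copy the first time
else:
    dolt("pull", cwd=LOCAL)     # later: only the differences are fetched

# The schema and SQL interface come with the clone -- no extra loading step.
# (The `works` table name is an assumption for illustration.)
print(dolt("sql", "-q", "SELECT COUNT(*) FROM works", cwd=LOCAL))
```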

Just to add my thanks to the current commenters for these suggestions - I can’t promise anything, but I will investigate these formats and see what we can do. It may be that we will release code that will allow for the dump to be converted into these formats, rather than releasing the formats ourselves.

For Avro, what would be useful is to release an official Avro schema for the dataset. That way anyone can use it in conjunction with the JSONL format to generate their own Avro files.

https://avro.apache.org/docs/1.11.1/getting-started-python/#defining-a-schema

Given that the works collection has millions of records, getting the schema correct is important, as records are validated against it when written (so one malformed record can jeopardise generation of the whole Avro file).
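If such an official schema were published, consumers could also validate records up front instead of discovering a malformed one mid-write, e.g. with fastavro's validation helper. A sketch only, where `schema` is assumed to be an already-parsed Avro schema for a work record:

```python
import gzip
import json

from fastavro.validation import validate

def split_valid(path, schema):
    """Partition a gzipped JSONL file into schema-valid and invalid records."""
    good, bad = [], []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # raise_errors=False returns True/False instead of raising on failure
            if validate(record, schema, raise_errors=False):
                good.append(record)
            else:
                bad.append(record)
    return good, bad
```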

A guide à la OpenAlex for mounting the data in BigQuery or similar would also be useful (whether with Avro or not):

https://docs.openalex.org/download-all-data/upload-to-your-database/load-to-a-data-warehouse
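For what it's worth, once the data is in Avro the BigQuery load itself is nearly trivial because the schema travels with the files. Something like the sketch below, where the project, dataset, table and bucket names are all placeholders rather than existing Crossref resources:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholder identifiers -- substitute your own project, dataset and bucket.
table_id = "my-project.crossref.works"
uri = "gs://my-bucket/crossref/works-*.avro"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,  # schema is read from the Avro files
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
print(client.get_table(table_id).num_rows, "rows loaded")
```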