Increasing Crossref Data Reusability With Format Experiments - Crossref

Every year, Crossref releases a full public data file of all of our metadata. This is partly a commitment to POSI and partly just what we do. We want the community to re-use our metadata and to find interesting ends to which they can be put!


This is a companion discussion topic for the original entry at https://0-www-crossref-org.pugwash.lib.warwick.ac.uk/blog/increasing-crossref-data-reusability-with-format-experiments

Hi, thanks for enabling the discussion!

JSONL would already be a great improvement for us in ingesting the file. Currently, to load this into Hive we need to unzip and flatten all of the records (effectively into JSONL), concatenate and re-zip the files down to a reasonable number to reduce PUT ops to HDFS, mount the files in Hive, and apply a handwritten schema.
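For reference, the flatten step looks roughly like the sketch below. It assumes each file in the dump is a gzipped JSON document with an "items" array of work records (as in recent public data files); the directory names are placeholders, and the concatenate/re-zip and Hive steps are left out.

```python
import gzip
import json
import pathlib

# Assumed layout (may differ between releases): each dump file is a gzipped
# JSON document whose "items" key holds an array of work records.
SRC = pathlib.Path("crossref_public_data_file")  # placeholder input directory
DST = pathlib.Path("jsonl")                      # placeholder output directory
DST.mkdir(exist_ok=True)

for path in sorted(SRC.glob("*.json.gz")):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        items = json.load(fh)["items"]
    # One gzipped JSONL file per input file, e.g. 0.json.gz -> 0.jsonl.gz
    with gzip.open(DST / f"{path.stem}l.gz", "wt", encoding="utf-8") as out:
        for record in items:
            out.write(json.dumps(record) + "\n")
```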

To do all of that whilst still being future-proof against schema changes, we would propose Avro as an alternative data format.

It:

  • Can be split into smaller files for distribution
  • Supports native compression (snappy)
  • Embeds the schema inside the file, allowing for full hydration natively by clients (e.g. Hadoop, BigQuery, Databricks, Snowflake) - also good for data integrity
  • Facilitates schema evolution
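As a rough illustration of the embedded-schema and native-compression points, converting the flattened JSONL to Avro could look like the sketch below, using the third-party fastavro library. The schema shown covers only a tiny, made-up subset of fields; real Crossref records would need a much fuller (ideally officially maintained) schema.

```python
import gzip
import json

from fastavro import parse_schema, writer  # pip install fastavro python-snappy

# Tiny illustrative schema only -- the real Crossref record has many more
# (and more deeply nested) fields, so an official schema would be needed.
schema = parse_schema({
    "type": "record",
    "name": "Work",
    "namespace": "org.crossref",
    "fields": [
        {"name": "DOI", "type": "string"},
        {"name": "type", "type": "string"},
        {"name": "title", "type": {"type": "array", "items": "string"}, "default": []},
    ],
})

def records(path):
    """Yield only the schema's fields from a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            work = json.loads(line)
            yield {
                "DOI": work.get("DOI", ""),
                "type": work.get("type", ""),
                "title": work.get("title", []),
            }

with open("works.avro", "wb") as out:
    # The schema is embedded in the file header; snappy compression is native.
    writer(out, schema, records("jsonl/0.jsonl.gz"), codec="snappy")
```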

To be even more POSI, has Crossref considered making its metadata available as a Dolt [1] database, and keeping such a Dolt database up to date regularly rather than annually?

My current impression is that:

  • the risk of vendor lock-in is extremely small,
  • Crossref has the option to host it themselves or have a database on dolthub [2],
  • hosting the data on dolthub.com will be free,
  • a Dolt database can handle this size of data,
  • community members can easily make clone databases (with enough available disk space),
  • clone databases can efficiently stay up-to-date with only differences being copied,
  • cloned databases automatically get the schema and SQL structure without any extra work.

[1] https://github.com/dolthub/dolt
[2] https://www.dolthub.com
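To make the clone / stay-up-to-date point concrete, the consumer-side workflow could look roughly like this. The crossref/metadata remote and the works table are purely hypothetical; nothing like that exists on DoltHub today.

```python
import pathlib
import subprocess

# Hypothetical remote and database name -- used only to sketch the workflow.
REMOTE = "crossref/metadata"
LOCAL = pathlib.Path("metadata")  # `dolt clone` creates a directory named after the database

def dolt(*args, cwd=None):
    """Run a dolt CLI command and return its stdout."""
    return subprocess.run(["dolt", *args], check=True, capture_output=True,
                          text=True, cwd=cwd).stdout

if not LOCAL.exists():
    dolt("clone", REMOTE)       # full copy the first time
else:
    dolt("pull", cwd=LOCAL)     # later: only the differences are fetched

# The schema and SQL interface come with the clone -- no extra loading step.
# (The `works` table name is an assumption for illustration.)
print(dolt("sql", "-q", "SELECT COUNT(*) FROM works", cwd=LOCAL))
```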

Just to add my thanks to the current commenters for these suggestions - I can’t promise anything, but I will investigate these formats and see what we can do. It may be that we will release code that will allow for the dump to be converted into these formats, rather than releasing the formats ourselves.

For Avro, what would be useful is to release an official Avro schema for the dataset. That way anyone can use it in conjunction with the JSONL format to generate their own Avro files.

https://avro.apache.org/docs/1.11.1/getting-started-python/#defining-a-schema

Given that the works collection has millions of records, getting the schema correct is important, as records are validated against it when written (so one malformed record can jeopardise generation of the whole Avro file).
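If such an official schema were published, consumers could also validate records up front instead of discovering a malformed one mid-write, e.g. with fastavro's validation helper. A sketch only, where `schema` is assumed to be an already-parsed Avro schema for a work record:

```python
import gzip
import json

from fastavro.validation import validate

def split_valid(path, schema):
    """Partition a gzipped JSONL file into schema-valid and invalid records."""
    good, bad = [], []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # raise_errors=False returns True/False instead of raising on failure
            if validate(record, schema, raise_errors=False):
                good.append(record)
            else:
                bad.append(record)
    return good, bad
```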

A guide à la OpenAlex for mounting the data in BigQuery or similar would also be useful (whether with Avro or not):

https://docs.openalex.org/download-all-data/upload-to-your-database/load-to-a-data-warehouse
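For what it's worth, once the data is in Avro the BigQuery load itself is nearly trivial because the schema travels with the files. Something like the sketch below, where the project, dataset, table and bucket names are all placeholders rather than existing Crossref resources:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholder identifiers -- substitute your own project, dataset and bucket.
table_id = "my-project.crossref.works"
uri = "gs://my-bucket/crossref/works-*.avro"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,  # schema is read from the Avro files
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish
print(client.get_table(table_id).num_rows, "rows loaded")
```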