New public data file: 120+ million metadata records - Crossref

2020 wasn’t all bad. In April of last year, we released our first public data file. Though Crossref metadata is always openly available, and our board recently cemented this by voting to adopt the Principles of Open Scholarly Infrastructure (POSI), we’ve decided to release an updated file. This will provide a more efficient way to get such a large volume of records. The file (JSON records, 102.6GB) is now available, with thanks once again to Academic Torrents.


This is a companion discussion topic for the original entry at https://0-www-crossref-org.pugwash.lib.warwick.ac.uk/blog/new-public-data-file-120-million-metadata-records/

This public data file is indeed very useful. I would like to use the API to get the metadata records that are not included in the public data file, while avoiding duplicates. I guess I should pull those registered after “January 7, 2021”, according to the description in this post? And which date field should I use (e.g. indexed, created, deposited)?

For incremental updates we recommend using the from-index-date filter. The timestamp that from-index-date filters on is guaranteed to be updated every time there is a change to metadata requiring a reindex. This way you’ll pick up updated records in addition to new records.
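
A minimal sketch of that incremental pull in Python, in case it helps. It assumes the public REST API endpoint at `https://api.crossref.org/works` with cursor-based paging; the function names and the example date (taken from the question above) are illustrative only, so substitute the cut-off date that applies to your copy of the file.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Crossref REST API "works" route.
API_BASE = "https://api.crossref.org/works"


def build_incremental_url(since, cursor="*", rows=1000, mailto=None):
    """Build a /works URL for records indexed on or after `since` (YYYY-MM-DD)."""
    params = {
        "filter": f"from-index-date:{since}",
        "rows": rows,      # page size
        "cursor": cursor,  # "*" starts a new deep-paging session
    }
    if mailto:
        # Optional contact address, sent as a query parameter.
        params["mailto"] = mailto
    return f"{API_BASE}?{urlencode(params)}"


def iter_updated_records(since, mailto=None):
    """Yield every record returned for the filter, following next-cursor."""
    cursor = "*"
    while True:
        url = build_incremental_url(since, cursor, mailto=mailto)
        with urlopen(url) as resp:
            message = json.load(resp)["message"]
        items = message.get("items", [])
        if not items:
            break  # an empty page means the result set is exhausted
        yield from items
        cursor = message["next-cursor"]
```

Because the filter returns updated records as well as new ones, storing results keyed by DOI (for example `records[item["DOI"]] = item`) lets a fresh copy overwrite the one from the data file instead of creating a duplicate.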

I’m glad to hear you find the public data file useful! We’re preparing the 2022 public data file for release soon.

Thanks for the reply, it is all clear now.
One last question regarding the dates. If I look for a DOI with the search engine in XML format, for instance, DOI 10.1039/d0se01062f, I can see a publication_date field (29 September 2020). However, in the public data file, the JSON for that DOI includes several date fields like indexed, created, published-online, issued, deposited. Some of them like issued or published-online have only the year part. What is the field in the public data file JSON that should correspond to the publication date?

Hi, how do I import the Crossref data dump (the *.json.gz files)?
Do I import it into Elasticsearch, and if so, how?

Thank you for creating and offering this huge amount of valuable data as a downloadable set of files. Is there an update on the release date for the 2022 public data file?

Hi Jens,

We’re close. Expect an update in the next few days. We’re planning to perform a reindex to finish up a few fixes and enhancements, after which we’ll generate and publish the 2022 public data file.


The 2022 public data file is now live. Please see the blog post announcement: 2022 public data file of more than 134 million metadata records now available - Crossref


Hello, I need help reading the .json.gz files from the torrent using Python. Also, when I try to unzip a .gz file, it says it’s corrupted. Why is this?
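
A rough sketch of reading one of these files with only the Python standard library, plus a quick integrity check. It assumes each .json.gz file holds a single JSON object with an `items` list of records; that layout is an assumption to verify against your download, and `read_records` and `is_intact` are hypothetical helper names. A file that fails the check may simply be an incomplete download, but that is only one possibility, not a diagnosis.

```python
import gzip
import json
import zlib


def read_records(path):
    """Return the list of metadata records stored in one .json.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: records sit under an "items" key; fall back to a bare list.
    return data["items"] if isinstance(data, dict) else data


def is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # discard the data; we only care whether reading succeeds
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers a bad gzip header, EOFError a truncated file,
        # zlib.error damaged compressed data.
        return False
```

Running `is_intact` over every file in the download folder before parsing shows which files, if any, need to be fetched again.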