Preferred way of providing citation data


I am currently collecting and preparing data for deposit XML files, and I have run into some problems converting the articles' references into the citation list.

The available citation data is fairly well structured. (The journal articles were compiled with LaTeX, which produces .bbl files.) However, I am a little confused about the exact role of the citation element.

I think its purpose is to identify the cited publication.

  1. In the ideal case, it contains the DOI.
  2. When the data of the cited publication is available in structured form, it should describe the publication as precisely as possible.
  3. As a fallback, it should contain the citation data in the <unstructured_citation> element.
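
To make the three cases above concrete, here is a minimal Python sketch that builds one citation element of each kind with the standard library. The element names follow my reading of the Crossref deposit schema; the helper name `citation`, the keys, and all bibliographic values (including the DOI) are made-up placeholders, not data from a real deposit.

```python
import xml.etree.ElementTree as ET


def citation(key, **fields):
    """Build a <citation> element with one child element per keyword argument.

    Hypothetical helper: children are emitted in the order the keyword
    arguments are given, so the caller is responsible for schema order.
    """
    elem = ET.Element("citation", {"key": key})
    for tag, text in fields.items():
        ET.SubElement(elem, tag).text = text
    return elem


# Case 1: the DOI of the cited work is known.
with_doi = citation("ref1", doi="10.5555/12345678")

# Case 2: structured data, no DOI.
structured = citation(
    "ref2",
    journal_title="Journal of Examples",
    author="Doe",  # surname of the first author only
    volume="12",
    first_page="34",
    cYear="2020",
)

# Case 3: the whole reference as one plain-text string.
unstructured = citation(
    "ref3",
    unstructured_citation=(
        "J. Doe and R. Roe, An example article, "
        "Journal of Examples 12 (2020), 34-56."
    ),
)

for elem in (with_doi, structured, unstructured):
    print(ET.tostring(elem, encoding="unicode"))
```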

I assume the following:

  • Providing structured data is better than just using the <unstructured_citation> element.
  • When the DOI is available, all of the other elements of the citation can be ignored.
  • The unstructured data is necessary only when it contains data that is not described in the other elements.

Are my assumptions correct?

The root of my confusion is that the citation element is

  • more verbose than necessary for identification (for instance, the ISSN implies the journal title, and the first page implies the authors),
  • but not as precise as it could be (for instance, it records only the surname of the first author, and the schema has the first page but not the last).

I usually have the whole list of authors. In some cases my processed citations also include the publisher, the country, an article identification number, or a URL to the publication.

What is the preferred way of organizing the mentioned data in the XML?

Thank you in advance for your help!

Best Wishes,

Hi Imre,

Thanks for your questions.

Yes, exactly. The purpose is both to add the references to the metadata record of the citing work and, when possible, to facilitate our matching the citation to the cited work’s DOI.

This is basically correct, except I wouldn’t call unstructured citations a ‘fallback’ to the structured citations. It’s just an alternative.

It is true that when the DOI is included in the citation data, all of the other elements of the citation can be ignored. And if you supply a <doi> along with other citation data, everything except the DOI will be ignored by our system when processing the citation as well.

The other two points are only partially true. Basically, if the structured citation data is as comprehensive as what would be in an equivalent unstructured citation (using any citation style) and it is tagged perfectly, then there’s a slightly higher likelihood of getting a citation match from a structured citation.

In practice, however, it’s rare that structured citations are just as comprehensive as their unstructured equivalents. They often contain just a journal abbreviation, first author surname, and publication year, without volume and issue numbers, first page number, or article title, for example. And, of course, mistakes in tagging the citation elements do happen as well.

Moreover, our citation matching system is quite good at parsing unstructured citations to make accurate matches to the cited works. So, there’s really not a preference for structured over unstructured citations. Just go with whichever is easiest for you to submit in the xml.

It’s also true that, when any structured citation data is provided, no matter how minimal, an <unstructured_citation> will be ignored. So, in the case where you don’t know the DOI of the cited item, you should use either structured citation data or an <unstructured_citation> but not both.
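
The precedence described above can be sketched in Python roughly as follows. This is only an illustration of the rule (DOI alone when known, otherwise structured or unstructured data but not both), not our system's actual processing; the function name `build_citation`, its parameters, and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_citation(key, doi=None, structured=None, unstructured=None):
    """Return a <citation> element that carries exactly one kind of data.

    `structured` is a dict mapping element names (e.g. "journal_title")
    to text; `unstructured` is a plain citation string.
    """
    citation = ET.Element("citation", {"key": key})
    if doi:
        # With a DOI present, any other citation data would be ignored,
        # so only the DOI is emitted.
        ET.SubElement(citation, "doi").text = doi
    elif structured and unstructured:
        # Without a DOI the two forms should not be mixed, because the
        # unstructured text would be ignored.
        raise ValueError(f"{key}: choose structured or unstructured data, not both")
    elif structured:
        for tag, text in structured.items():
            ET.SubElement(citation, tag).text = text
    elif unstructured:
        ET.SubElement(citation, "unstructured_citation").text = unstructured
    else:
        raise ValueError(f"{key}: no citation data supplied")
    return citation


# The DOI wins; the extra text is simply not emitted.
example = build_citation("ref1", doi="10.5555/12345678", unstructured="ignored text")
print(ET.tostring(example, encoding="unicode"))
```

Raising an error when both forms are supplied is a design choice of this sketch: it forces the caller to decide which representation is the more complete one instead of silently dropping data.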

We have some examples and all supported structured citation data tags on our documentation site.

You can see there that we don’t have support for tagging authors beyond the first author, publishers, countries, article IDs, or URLs. With the exception of additional authors, most of those elements aren’t typically included in the bibliographic metadata for Crossref DOIs, so adding them wouldn’t help facilitate citation matches.

Please let me know if you have any questions.


Hi Shayn,

Thank you very much for your prompt and detailed answer!

Just go with whichever is easiest for you to submit in the xml.

In my actual use case, <unstructured_citation> is easier to implement and helps to avoid various kinds of parsing problems related to BibTeX log files.
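
For illustration, this is roughly the kind of conversion I have in mind: a minimal Python sketch that turns the entries of a .bbl file into plain-text strings suitable for <unstructured_citation>. The function name `bbl_to_plain_citations` and the sample entry are hypothetical, and the sketch assumes a simple .bbl layout; it does not handle accent macros, math, or custom commands.

```python
import re


def bbl_to_plain_citations(bbl_text):
    """Split .bbl content at each \\bibitem and strip basic LaTeX markup."""
    # Remove the thebibliography wrapper, including the width argument.
    body = re.sub(r"\\(begin|end)\{thebibliography\}(\{[^}]*\})?", "", bbl_text)
    # Everything before the first \bibitem is discarded.
    entries = re.split(r"\\bibitem(?:\[[^\]]*\])?\{[^}]*\}", body)[1:]
    citations = []
    for entry in entries:
        text = entry.replace("\\newblock", " ")
        # Old-style font switches such as {\em ...}.
        text = re.sub(r"\\(?:em|it|bf|sc)\b", "", text)
        # Font commands with an argument; the braces are dropped below.
        text = re.sub(r"\\(?:emph|textit|textbf|textsc)\s*", "", text)
        text = text.replace("~", " ")
        text = re.sub(r"-{2,}", "-", text)  # en/em dashes to a hyphen
        text = text.replace("{", "").replace("}", "")
        citations.append(" ".join(text.split()))  # collapse whitespace
    return citations


sample = r"""
\begin{thebibliography}{10}

\bibitem{doe2020}
J.~Doe and R.~Roe.
\newblock An example article.
\newblock {\em Journal of Examples}, 12:34--56, 2020.

\end{thebibliography}
"""

for line in bbl_to_plain_citations(sample):
    print(line)
```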

where you don’t know the DOI of the cited item, you should use either structured citation data or an <unstructured_citation> but not both.

This is crucial information: if somebody (like me) tries to extract the fields from the unstructured representation and the result is not precise enough, the effort does more harm than good.

Thank you for the link to the examples!
I think I can now prepare the citation data.

I have a few constructive comments about this page. (I wrote them in the hope that these will be useful, not as a critique.)

  • The citation key "ref=3" seems strange after "ref1" and "ref2". In the schema documentation I did not find anything specific about the key's format beyond the requirement that it be unique and the recommendation that it match the number in the publication's reference list.
  • The indentation of the longer XML examples could be more consistent. (For instance 1 or 2 spaces for indentation in all cases.)
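
Both points can be illustrated with a short Python sketch that generates sequential keys ("ref1", "ref2", ...) and a uniform two-space indentation. The function name `build_citation_list` and the sample strings are hypothetical, the "refN" pattern is only one possible convention for unique keys, and `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_citation_list(plain_citations):
    """Wrap plain-text citations in a <citation_list> with keys ref1, ref2, ..."""
    citation_list = ET.Element("citation_list")
    for number, text in enumerate(plain_citations, start=1):
        # The key follows the position in the reference list.
        citation = ET.SubElement(citation_list, "citation", {"key": f"ref{number}"})
        ET.SubElement(citation, "unstructured_citation").text = text
    # Two spaces per nesting level throughout.
    ET.indent(citation_list, space="  ")
    return citation_list


refs = build_citation_list(["First reference.", "Second reference.", "Third reference."])
print(ET.tostring(refs, encoding="unicode"))
```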

Best Wishes,

Hi Imre,

Thanks for that feedback about the documentation page! We’ll try and make those changes, so it reads more clearly.