Is it normal to get <html> tags in title metadata?

Hello,

In some records, we get tags in the title metadata. Is that an error? E.g.:

api.crossref.org/v1/works/10.1002/ajpa.24488

“title”: [“A population history of indigenous\n Bahamian\n islanders: Insights from ancient\n DNA”],

Is there a way to get only raw text with the CrossRef API?

Thanks,
Fred

Hi Fred,

It’s not especially common, but it is allowed.

We support certain face markup within the metadata that publishers supply for their registered content.

It’s up to each publisher when and whether they opt to supply those markup tags.

There’s not a way to query the API such that you’ll get back only the text, without tags. You’d have to clean up the data after the fact to strip them out, if that’s what you wanted.
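As a rough illustration of that after-the-fact cleanup, here is a minimal sketch in Python using only the standard library. The function name `strip_markup` is made up for this example, and the sample title in the usage note is just an illustration, not a verbatim API response.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects only the text content and discards every tag."""

    def __init__(self):
        # convert_charrefs=True also turns entities such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(title: str) -> str:
    """Hypothetical helper: return the title without tags.

    Runs of whitespace (including the newlines that appear in some
    titles) are collapsed to single spaces.
    """
    parser = _TextOnly()
    parser.feed(title)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()
```

For example, `strip_markup("An automatically updated <i>S</i>-wave model")` gives `"An automatically updated S-wave model"`. This drops all tags, not only face markup, so it is only suitable if you really want plain text.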

Best,
Shayn

Hi Shayn,

I had the same question. Thanks for the link and confirmation. I wonder if a full specification is available of what kind of face markup is permitted.

For example, a BibTeX query for the paper with DOI 10.1002/2015gl067329 gives me the title:

	title = {An automatically updated
		            $\less$i$\greater$S$\less$/i$\greater$
		            -wave model of the upper mantle and the depth extent of azimuthal anisotropy},

Notice that in order to process this I would need to first decode the LaTeX, and then decode the HTML tags, in that specific order. The face markup docs only mention HTML entities and MathML, not arbitrary LaTeX on top of that. I wonder if the face markup could be better constrained.
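To make the two-step order concrete, here is a small Python sketch. It only handles the two LaTeX escapes visible in the title above (`$\less$` and `$\greater$`); the table `LATEX_ESCAPES` and the function `bibtex_title_to_text` are names invented for this example, and other escapes may well occur in other records.

```python
import re

# Only the two escapes seen in the example title; this is an assumption,
# not a complete list of what a BibTeX response may contain.
LATEX_ESCAPES = {
    r"$\less$": "<",
    r"$\greater$": ">",
}


def bibtex_title_to_text(raw: str) -> str:
    """Hypothetical helper: decode LaTeX escapes first, then drop the tags."""
    text = raw
    # Step 1: turn the LaTeX-escaped angle brackets back into "<" and ">".
    for latex, char in LATEX_ESCAPES.items():
        text = text.replace(latex, char)
    # Step 2: only now do the tags exist as tags, so they can be removed.
    text = re.sub(r"</?[A-Za-z][^>]*>", "", text)
    # Collapse the line breaks and indentation of the BibTeX field.
    return re.sub(r"\s+", " ", text).strip()
```

Running this on the title above yields "An automatically updated S -wave model of the upper mantle and the depth extent of azimuthal anisotropy". The stray space in "S -wave" comes from the line breaks around the tag in the response and would need separate handling.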

Could a specification for the permitted face markup perhaps even be used to implement content negotiation in a way that “application/x-bibtex” queries would always return metadata in LaTeX? I suppose this would require a translation layer between the html (?) based face markup and an equivalent LaTeX representation. I appreciate that this would not be trivial, but it would greatly improve the quality of automated bibliography generation.

Hi, and thanks for your feedback.

The permitted markup, and the metadata elements where it’s allowed, are described in our documentation.

It’s relatively minimal, just bold (b), italic (i), underline (u), over-line (ovl), superscript (sup), subscript (sub), small caps (scp), and typewriter text (tt). And they can only occur in titles and citations.
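Because that tag set is so small, a client-side translation to LaTeX is fairly easy to sketch. The Python below is only an illustration of the idea: the mapping `FACE_TO_LATEX` and the function `face_markup_to_latex` are invented for this example, the choice of LaTeX commands is an assumption (in particular the over-line one), and it does not escape LaTeX special characters or handle MathML.

```python
import re

# Assumed LaTeX equivalents for the eight face markup tags listed above.
# Each entry is (text emitted for the opening tag, text for the closing tag).
FACE_TO_LATEX = {
    "b": (r"\textbf{", "}"),
    "i": (r"\textit{", "}"),
    "u": (r"\underline{", "}"),
    "ovl": (r"$\overline{\mbox{", "}}$"),  # \overline needs math mode
    "sup": (r"\textsuperscript{", "}"),
    "sub": (r"\textsubscript{", "}"),
    "scp": (r"\textsc{", "}"),
    "tt": (r"\texttt{", "}"),
}

_TAG = re.compile(r"<(/?)(b|i|u|ovl|sup|sub|scp|tt)>")


def face_markup_to_latex(title: str) -> str:
    """Hypothetical helper: replace face markup tags with LaTeX commands."""

    def repl(match):
        opening, closing = FACE_TO_LATEX[match.group(2)]
        # group(1) is "/" for a closing tag and empty for an opening tag
        return closing if match.group(1) else opening

    return _TAG.sub(repl, title)
```

For example, `face_markup_to_latex("An automatically updated <i>S</i>-wave model")` returns `An automatically updated \textit{S}-wave model`. Any tag outside the list is left untouched.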

I’m not sure whether that’s feasible, but I can pass the suggestion along to our technical team and the API product manager. Content negotiation is an especially complicated tool to make updates or improvements to, because it’s a collaboration between three organizations.

In some cases the markup can be extensive; see the title of DOI 10.1103/PhysRevB.56.6100, which has lots of MathML tags. (I can’t post the XML.)

I have added scrubbers in my script to remove the tags as I need metadata, not markup.

Hi Dave,

Thanks for following up. I’ve updated your privileges so you can post code and links within the forum.

We preserve deposited markup, so that’s why you’re seeing lots of MathML tags in the metadata record.

We’ve talked about this a lot, and it goes back a bit:

But, we do convert that markup to readable text in our products/services that are meant to be human readable, like this example:
https://search.crossref.org/?q=10.1103%2Fphysrevb.56.6100&from_ui=yes

My best,
Isaac

Thank you for updating my privileges and the explanation, Isaac.

From the perspective of the CrossRef front-end search, it makes sense to keep the markup in the metadata, since the search connects to the same data source.