How to make Zotero format organizational creators correctly when fields are parsed from metadata.

I puzzled over this one for a while, and given the answer was fairly simple I thought it was worth a write-up.

When I generated the bibliography for an assignment I was writing, I noticed that Zotero had made a mess of the author credit in some references. For example:
Nutrition, C. for F. S. and A. (n.d.) ‘Laboratory Methods - BAM: Staphylococcus aureus’, WebContent, [online] Available from: http://www.fda.gov/Food/FoodScienceResearch/LaboratoryMethods/ucm071429.htm (Accessed 10 February 2014).
The credit should actually be to the institution named "Center for Food Safety and Applied Nutrition".
Zotero is a bibliography management tool. It is useful for taking web page snapshots and for managing journal and academic reference lists and bibliographies, and as a bonus it also downloads and stores PDFs for future reference.

This is particularly handy if you need to get back to a source quickly and see it as it was when you cited it. News and blog articles, for example, are often "fixed" after publication, and you lose the original unless you archived it.

The major use case for Zotero is generating in-text citations and a bibliography list for articles and dissertations. It provides a "Save to Zotero" button via a Firefox plugin, which works out the reference details either by matching the URL against a list of regular expressions or by falling back to document metadata.

If, for example, you wanted to reference the DrugBank.ca article on Aspirin (http://www.drugbank.ca/drugs/DB00945), you could use Zotero to store the page and the author reference details.

Clicking the "Save to Zotero" button causes Zotero to match the URL against a list of URL regular expressions. In the case of DrugBank's Aspirin page it would match this one;
https?://(?:www\\.)?drugbank.ca/drugs/
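
As a rough illustration of that URL matching step, here is a minimal Python sketch. It uses the pattern quoted above with the doubled backslash reduced to a single one (I am assuming the doubling is just string escaping in the translator file). The function name `matches_drugbank` is made up for this example, and this is not Zotero's own code.

```python
import re

# Pattern from the post, with the escaped "\\." written as a plain regex "\.".
# Assumption: the doubled backslash is string escaping in the translator file.
DRUGBANK_TARGET = re.compile(r"https?://(?:www\.)?drugbank.ca/drugs/")


def matches_drugbank(url):
    """Return True if the URL starts with something the DrugBank pattern accepts."""
    return DRUGBANK_TARGET.match(url) is not None


# The Aspirin page is claimed by the pattern; the FDA page is not.
print(matches_drugbank("http://www.drugbank.ca/drugs/DB00945"))
print(matches_drugbank("http://www.fda.gov/Food/FoodScienceResearch/LaboratoryMethods/ucm071429.htm"))
```

A URL that matches no pattern in the list is the case where the metadata fallback comes into play.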

This selects the hard-coded template for DrugBank, called DrugBank.ca.js, which determines the author to be the institutional name "DrugBank". By convention this is stored in the "Last Name" field of the Zotero database entry. Using the "Open University Harvard" formatting, it looks like this when a bibliography report is generated;
DrugBank (ed.) (2013) ‘Acetylsalicylic acid (DB00945)’, DrugBank, [online] Available from: http://www.drugbank.ca/drugs/DB00945 (Accessed 10 February 2014).
Unfortunately, I noticed that a number of my references were being displayed with gibberish for the author credit like so;
Nutrition, C. for F. S. and A. (n.d.) ‘Laboratory Methods - BAM: Staphylococcus aureus’, WebContent, [online] Available from: http://www.fda.gov/Food/FoodScienceResearch/LaboratoryMethods/ucm071429.htm (Accessed 10 February 2014).
There is no template that covers items published under this fda.gov URL. Fortunately, the FDA had the foresight to include document metadata in the HTML source, like so;

<meta name="dc.creator" content="Center for Food Safety and Applied Nutrition"/>
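
To show what reading that tag might involve, here is a small sketch using Python's standard `html.parser`. The class and function names (`CreatorMetaParser`, `extract_creators`) are hypothetical, and this only approximates the idea of a metadata fallback; it is not how Zotero itself is implemented.

```python
from html.parser import HTMLParser


class CreatorMetaParser(HTMLParser):
    """Collect the content of every <meta name="dc.creator"> tag."""

    def __init__(self):
        super().__init__()
        self.creators = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta .../> are also routed here by default.
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name == "dc.creator" and attr.get("content"):
            self.creators.append(attr["content"])


def extract_creators(html):
    """Return the list of dc.creator strings found in an HTML document."""
    parser = CreatorMetaParser()
    parser.feed(html)
    return parser.creators


sample = '<meta name="dc.creator" content="Center for Food Safety and Applied Nutrition"/>'
print(extract_creators(sample))
```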

The metadata parser is apparently (I've not checked the code) using a convention that splits the text string "Center for Food Safety and Applied Nutrition" into a first name, "Center for Food Safety and Applied", and a last name, "Nutrition".

However, the "Open University Harvard" output style also abbreviates the parts of the first name to initials, so "Center for Food Safety and Applied" becomes "C. for F. S. and A."
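
The two steps described above (split on the last space, then abbreviate the first name) can be sketched as follows. This is my guess at the behaviour based purely on the output I saw: capitalised words are reduced to an initial and lowercase words such as "for" and "and" are left alone. The function names are hypothetical and the real parser and citation style may well work differently.

```python
def split_creator(name):
    """Split on the last space: everything before it is the first name."""
    first, _, last = name.rpartition(" ")
    return first, last


def initialise(first):
    """Abbreviate capitalised words to initials, leaving lowercase words as-is."""
    parts = []
    for word in first.split():
        parts.append(word[0] + "." if word[0].isupper() else word)
    return " ".join(parts)


def format_author(name):
    """Format a creator string as 'Last, F. M.' in the style of the bibliography."""
    first, last = split_creator(name)
    if not first:
        # Single-word names such as "DrugBank" have nothing to abbreviate.
        return last
    return f"{last}, {initialise(first)}"


print(format_author("Center for Food Safety and Applied Nutrition"))
# prints "Nutrition, C. for F. S. and A."
```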

In the general case this works, because the "dc.creator" field mostly maps to a real person rather than an institutional name. For example;
Havens, A. M., Pedersen, E. A., Shiozawa, Y. and Taichman, R. S. (2009) ‘Innovative mouse models for metastatic disease’, Drug Discovery Today: Disease Models, 6(1), pp. 27–31.
After looking into this for a while, and considering various patches to the parser or submitting a dedicated translator for this FDA website, I noticed that the GUI has an option to fix the parsing issue by concatenating the "first" and "last" fields into a single "last name" field;


After that you can see that it has got the Creator credit correctly in a single field, and is not abbreviating the parts;



Which fixes the output formatting from this;
Nutrition, C. for F. S. and A. (n.d.) ‘Laboratory Methods - BAM: Staphylococcus aureus’, WebContent, [online] Available from: http://www.fda.gov/Food/FoodScienceResearch/LaboratoryMethods/ucm071429.htm (Accessed 10 February 2014).
to this;
Center for Food Safety and Applied Nutrition (n.d.) ‘Laboratory Methods - Bacteriological Analytical Manual (BAM)’, WebContent, [online] Available from: http://www.fda.gov/food/foodscienceresearch/laboratorymethods/ucm2006949.htm (Accessed 10 February 2014).

I think it would be useful if the Zotero metadata parser did a sanity check on the contents of "dc.creator", so that strings such as "Center for...", "Department of..." or "Institution of..." are identified as "single field" names.
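
A minimal sketch of that kind of sanity check, assuming the three phrases above are the only hints and that a creator is represented as a plain dictionary. The `single_field` key and the `parse_creator` name are made up for this example and are not Zotero's API.

```python
import re

# Hypothetical hint list taken from the three phrases suggested above.
# Word boundaries and case sensitivity keep the match deliberately narrow.
INSTITUTION_HINT = re.compile(r"\b(?:Center for|Department of|Institution of)\b")


def parse_creator(name):
    """Return a creator dict, keeping institutional names in a single field."""
    name = name.strip()
    if INSTITUTION_HINT.search(name):
        # Looks like an organisation: store the whole string as the last name.
        return {"firstName": "", "lastName": name, "single_field": True}
    # Otherwise fall back to splitting on the last space, as for a person.
    first, _, last = name.rpartition(" ")
    return {"firstName": first, "lastName": last, "single_field": False}


print(parse_creator("Center for Food Safety and Applied Nutrition"))
print(parse_creator("Russell S. Taichman"))
```

Requiring a word boundary and an exact-case match is a deliberate choice here: a looser, case-insensitive match without boundaries would be more likely to misfire on personal names that happen to contain one of the phrases.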

(I am sure that there is a "Professor John McCenter Forrington Smith" or something that would break the reg-exp in some way as Scunthorpe did for obscenity checkers back in '01. But I think that is a marginal case.)





