Accepting formatting in longer text elements

From: Monica Omodei <monica.omodei_at_nyob>
Date: Fri, 24 Jun 2011 00:38:39 -0400
To: NGC4LIB_at_LISTSERV.ND.EDU

As an aggregator of metadata about research datasets/collections, we (the 
Australian National Data Service) currently treat the content of description 
elements as xsd:string but strip out any tagging when rendering, so the 
displayed text is unformatted (except for what can be done with spaces and 
line breaks).

Our aim is to be aggregated by larger discovery services, and we already 
support OpenSearch, SRU, RSS and OAI-PMH.

We want to support minimal markup because many of our contributors support 
it in their own systems, from which they export metadata for us to 
harvest. They would like us to preserve at least some formatting, e.g. lists, 
superscripts/subscripts for chemical compounds, emphasis, etc.

Is there consensus on best practice in this area, and/or what is common 
practice?

My initial reaction for our own aggregation and portal was to:

(a) set a minimal subset of xhtml which we guarantee to pass through to our 
portal display (are there any popular ones?)
(b) accept anything, but strip out whatever is not in the minimal subset 
before displaying the content

But what do we expose for others to harvest?
 (a) exactly what was provided by the contributor
 (b) what was provided but cleaned of possible malevolent tagging
 (c) just text with all tagging stripped out
 (d) what we ourselves render, i.e. the minimal subset (using the namespace of 
the minimal subset)
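
To make the options concrete, here is a rough sketch (not our current 
implementation) of the "accept anything, pass through only a minimal subset" 
idea, using only the Python standard library. The tag list in ALLOWED_TAGS is 
a hypothetical subset chosen for illustration, and the names SubsetFilter and 
sanitize are made up for this example. All attributes are dropped, and the 
content of script/style elements is discarded entirely. Passing an empty set 
of allowed tags gives the "all tagging stripped" variant, harvest option (c).

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical minimal subset; no agreed-upon subset is implied.
ALLOWED_TAGS = frozenset({"p", "br", "ul", "ol", "li", "em", "strong", "sub", "sup"})
# Elements whose text content is discarded along with the tags.
DROP_CONTENT = frozenset({"script", "style"})


class SubsetFilter(HTMLParser):
    """Re-emit only allowlisted tags (without attributes) and escaped text."""

    def __init__(self, allowed):
        super().__init__(convert_charrefs=True)
        self.allowed = allowed
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in self.allowed and self.skip_depth == 0:
            # Attributes (onclick, style, href, ...) are deliberately dropped.
            self.parts.append("<br/>" if tag == "br" else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.allowed and tag != "br" and self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Character references were decoded by the parser; re-escape them.
            self.parts.append(escape(data, quote=False))


def sanitize(description, allowed=ALLOWED_TAGS):
    """Return the description with everything outside `allowed` stripped."""
    parser = SubsetFilter(allowed)
    parser.feed(description)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


if __name__ == "__main__":
    sample = 'H<sub>2</sub>O <script>alert(1)</script><a href="x">link</a>'
    print(sanitize(sample))               # minimal-subset rendering
    print(sanitize(sample, frozenset()))  # text only, all tagging stripped
```

This filter does not repair unbalanced or badly nested tags, so the output is 
not guaranteed to be well-formed XHTML; a real service would probably need a 
further tidy/validation step before exposing it under an XHTML namespace.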

Monica Omodei (formerly Berko)
Senior Research Analyst
Australian National Data Service