Re: Regexp for rewriting LoC LCCN authorised personal names

From: Trail, Nate <ntra_at_nyob>
Date: Mon, 4 May 2026 21:05:01 +0000
To: CODE4LIB_at_LISTS.CLIR.ORG
MARC XML and MADS XML are listed at the bottom of each name page under "Alternate Formats". Since you are using XSL, those should work much better for you than a scraped HTML page in a WARC file.

If you know the LCCN, you can fetch the single page in the serialization you like:
    
https://id.loc.gov/authorities/names/n2001028682.madsxml.xml

https://id.loc.gov/authorities/names/n2001028682.marcxml.xml
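
A minimal Python sketch of building and fetching those URLs, assuming the pattern in the two examples above holds for other LCCNs. The helper names `name_authority_url` and `fetch_record` are illustrative only, and the LCCN shape check (letters followed by digits) is an assumption based on the identifiers quoted in this thread.

```python
import re
from urllib.request import urlopen

BASE = "https://id.loc.gov/authorities/names/"

# Suffixes copied from the two example URLs; no other serializations are assumed.
SUFFIXES = {"marcxml": ".marcxml.xml", "madsxml": ".madsxml.xml"}


def name_authority_url(lccn: str, fmt: str = "marcxml") -> str:
    """Build the URL for one name authority record (hypothetical helper)."""
    # Drop internal whitespace so "n 2001028682" becomes "n2001028682".
    normalized = re.sub(r"\s+", "", lccn).lower()
    # Assumed shape: a short alphabetic prefix followed by digits.
    if not re.fullmatch(r"[a-z]{1,3}\d+", normalized):
        raise ValueError(f"unexpected LCCN shape: {lccn!r}")
    if fmt not in SUFFIXES:
        raise ValueError(f"unknown format: {fmt!r}")
    return BASE + normalized + SUFFIXES[fmt]


def fetch_record(lccn: str, fmt: str = "marcxml", timeout: float = 30.0) -> bytes:
    """Fetch one record over HTTP and return the raw response bytes."""
    with urlopen(name_authority_url(lccn, fmt), timeout=timeout) as response:
        return response.read()
```

For anything beyond a handful of records, the bulk download mentioned earlier in the thread is kinder to the server than calling `fetch_record` in a loop.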


Nate
-----Original Message-----
From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Stuart A. Yeates
Sent: Monday, May 04, 2026 4:45 PM
To: CODE4LIB_at_LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised personal names

I've got many pages like
https://id.loc.gov/authorities/names/n2001028682.html   (stored in WARC
files)

I've got names.madsrdf.xml.gz, which is all the names in MADS/RDF, but it's disaggregated rather than in the format shown in the examples at https://www.loc.gov/standards/mads/rdf/ so it's not really amenable to processing in XSL. I'd prefer not to spin up a triple store and reasoner of any kind.

I suspect that what I need is the MARCXML, which I'm familiar with manipulating with XSL and has all the subfields I need explicitly marked.
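
A minimal sketch of that approach, in Python rather than XSL, assuming the heading sits in MARC field 100 with the name in subfield $a and the dates in $d. The sample record is hand-written from the "Mudge, Lewis Seymour, 1868-1945" example quoted further down this thread, not copied from id.loc.gov, and `direct_order` is a hypothetical helper that only handles the simple "Surname, Forenames" case.

```python
import xml.etree.ElementTree as ET

NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written illustration, not a real id.loc.gov response.
SAMPLE = """<marc:record xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:datafield tag="100" ind1="1" ind2=" ">
    <marc:subfield code="a">Mudge, Lewis Seymour,</marc:subfield>
    <marc:subfield code="d">1868-1945</marc:subfield>
  </marc:datafield>
</marc:record>"""


def personal_name_subfields(marcxml: str) -> dict:
    """Return {subfield code: text} for the first 100 field, or {} if none."""
    root = ET.fromstring(marcxml)
    field = root.find(".//marc:datafield[@tag='100']", NS)
    if field is None:
        return {}
    # A repeated subfield code would overwrite the earlier value here.
    return {
        sf.get("code"): (sf.text or "").strip()
        for sf in field.findall("marc:subfield", NS)
    }


def direct_order(subfields: dict) -> str:
    """Un-invert a simple 'Surname, Forenames' heading (hypothetical helper)."""
    # Strip the trailing ISBD-style comma that precedes the next subfield.
    name = subfields.get("a", "").rstrip(", ")
    if name.count(",") == 1:
        surname, forenames = (part.strip() for part in name.split(","))
        name = f"{forenames} {surname}"
    # Names with no comma or several commas are left exactly as found.
    dates = subfields.get("d", "").rstrip(", ")
    return f"{name}, {dates}" if dates else name


print(direct_order(personal_name_subfields(SAMPLE)))
```

Working from the subfields means the dates never have to be separated from the name by pattern matching; only the contents of $a need un-inverting.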

As I work, I've been documenting the differences I find between LoC and Wikidata, on the understanding that bridging LCCNs and Wikidata is unlikely to be the work of a single person; see https://www.wikidata.org/wiki/User:Stuartyeates/Wikidata_-_LoC_ontological_mismatches


cheers
stuart
--
...let us be heard from red core to black sky


On Tue, 5 May 2026 at 07:40, Michael Monaco < 000000b1471f1220-dmarc-request_at_lists.clir.org> wrote:

> As Kevin mentioned, there are in fact many possible patterns for names 
> to appear in, so it's probably not possible to un-invert all the names 
> in the NAF with a single RegEx.
>
> You mention that you've downloaded the records in bulk -- what format 
> are the records in? Could you provide some examples?
>
> Thanks,
>
> Mike Monaco
> Head, Technical Services & Coordinator, Cataloging Services Associate 
> Professor of Bibliography University Libraries Technical Services 261B 
> Bierce Library The University of Akron Akron, Ohio 44325-1712 
> He/him/his
> Office: 330-972-2446
> mmonaco_at_uakron.edu
> ORCID: 0000-0001-7244-5154
> https://www.uakron.edu/libraries
>
>
> -----Original Message-----
> From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of Stuart A.
> Yeates
> Sent: Monday, May 4, 2026 3:07 PM
> To: CODE4LIB_at_LISTS.CLIR.ORG
> Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised 
> personal names
>
> As it happens, I have already downloaded the records in bulk. What I 
> need is a regexp to parse the "quoted text"
>
> cheers
> stuart
>
> --
> ...let us be heard from red core to black sky
>
>
> On Tue, 5 May 2026 at 06:33, Trail, Nate <ntra_at_loc.gov> wrote:
>
> > Stuart,
> >
> > You could download the entire Names file in "nt" serialization, then 
> > there's one line for each name you can filter on:
> >
> >
> > <http://id.loc.gov/authorities/names/nr2001046558> <http://www.loc.gov/mads/rdf/v1#authoritativeLabel> "Smith, Jim, 1940 October 17-" .
> >
> > Then you can do what you want with the quoted text.
> >
> > Saves bandwidth for you and us.
> >
> > https://id.loc.gov/download/
> >
> > Good luck,
> >
> > Nate
> >
> >
> > -----------------------------------------
> > Nate Trail
> > Network Development & MARC Standards Office LCSG/DPS/ABA/NDMSO 
> > Library of Congress Washington DC 20540
> >
> >
> > -----Original Message-----
> > From: Code for Libraries <CODE4LIB_at_LISTS.CLIR.ORG> On Behalf Of 
> > Kevin Hawkins
> > Sent: Monday, May 04, 2026 2:08 PM
> > To: CODE4LIB_at_LISTS.CLIR.ORG
> > Subject: Re: [CODE4LIB] Regexp for rewriting LoC LCCN authorised 
> > personal names
> >
> > Hello Stuart,
> >
> > Do you mean that you want to convert LCNAF personal names from this 
> > sort of order:
> >
> > Mudge, Lewis Seymour, 1868-1945
> >
> > to something like this:
> >
> > Lewis Seymour Mudge, 1868-1945
> >
> > ?  But then also deal with authorized forms containing no commas, 
> > forms with more than two commas, and occasional use of parentheses.
> > So, as you know, it gets complicated.
> >
> > I wonder if a different approach might make more sense here:
> >
> > 1. Query the inverted LCNAF form at
> > https://id.loc.gov/
> >
> > 2. Retrieve the URI, extracting the identifier (beginning with "n")
> >
> > 3. Query Wikidata using this identifier.
> >
> > 4. Retrieve Wikidata's form of the name, which is not inverted.
> >
> > --Kevin
> >
> > On 5/3/26 1:25 PM, Stuart A. Yeates wrote:
> > > Does anyone know of somewhere that describes LCCN authorised
> > > personal names as regexps? I want to be able to rewrite them at
> > > scale to 'normal' order.
> > >
> > > AI appears to be actively undermining the functionality of search
> > > engines.
> > >
> > > cheers
> > > stuart
> > > --
> > > ...let us be heard from red core to black sky
> >
>
Received on Mon May 04 2026 - 17:07:07 EDT