[Dataset] TIB-SID – bilingual library subject indexing dataset (GND, 136k records)

From: Jennifer D'Souza <jenlindadsouza_at_nyob>
Date: Sat, 14 Mar 2026 13:01:34 +0100
To: CODE4LIB_at_LISTS.CLIR.ORG

Hi all,

I wanted to share a new resource that may be useful for people
experimenting with *AI-assisted cataloging, subject indexing, or metadata
enrichment*.

We recently released *TIB-SID*, a dataset of *136,569 real library catalog
records (English/German)* linked to the *GND authority file*, together with
a machine-actionable version of the subject taxonomy. The dataset frames
subject indexing as a realistic *extreme multi-label classification*
problem over controlled vocabulary terms.
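
For anyone new to this framing, here is a minimal Python sketch of how one record's ranked subject predictions could be scored against its gold subject set. Everything in it is an assumption for illustration: the function name is hypothetical, the identifiers are placeholders rather than real GND IDs, and precision/recall at k is simply a common choice for multi-label ranking, not necessarily the metric used in the shared tasks.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    ranked: Sequence[str], gold: Set[str], k: int
) -> Tuple[float, float]:
    """Precision@k and recall@k for one record's ranked subject predictions."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(ranked)[:k]
    # Count predicted labels in the top k that appear in the gold set.
    hits = sum(1 for label in top_k if label in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical record: placeholder identifiers, not real GND IDs.
gold_subjects = {"gnd:A", "gnd:B", "gnd:C"}
predicted = ["gnd:B", "gnd:X", "gnd:A", "gnd:Y", "gnd:C"]

p, r = precision_recall_at_k(predicted, gold_subjects, k=3)
print(f"P@3={p:.2f} R@3={r:.2f}")
```

In an extreme multi-label setting the label space is very large while each record carries only a few gold subjects, so ranking-based scores like these are usually averaged over all records at several values of k.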

The resource was originally introduced through the *LLMs4Subjects shared
tasks (SemEval 2025 and GermEval 2025)*, where more than a dozen teams
developed and evaluated automated subject tagging systems using the
dataset. The tasks explored approaches ranging from embedding-based
retrieval pipelines to LLM prompting and hybrid XMTC systems.

Resources:

Dataset
https://github.com/sciknoworg/tib-sid

Preprint
https://arxiv.org/abs/2603.10876

Shared task pages
https://sites.google.com/view/llms4subjects
https://sites.google.com/view/llms4subjects-germeval

If anyone is experimenting with *automated subject indexing, authority
control, or multilingual metadata*, we would be very interested to hear how
the dataset works in other settings.

We would also be happy to hear from others working on similar problems or
interested in collaborating on future evaluations or extensions of the
dataset.

Best,
Jennifer D’Souza
TIB – Leibniz Information Centre for Science and Technology