Linked data and multivalue fields

This guide makes linked data easier to work with in Archipelago. It includes a script that takes a “flattened” csv template and reformats the values so that they are compatible with the Archipelago Multi-Importer (AMI).

Why is this needed?

The spreadsheet template for Archipelago AMI batch import includes fields that require JSON arrays or objects to fit within a single cell.

Here is an example of a subjects_local field:

["American Civil War (United States : 1861-1865)", "Migration, Internal--Southern States", "Cherokee Indians--Tribal citizenship", "Oklahoma--Fort Gibson (Indian Territory)", "Oklahoma--Sequoyah District (Indian Territory)"].

Each heading is surrounded by quotation marks and separated from the next heading with a comma and a space. The entire string is in brackets.
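This bracketed string is simply a JSON array. As a sketch (using Python's standard json module, with a shortened hypothetical list of headings), serializing a list of strings produces exactly this cell format:

```python
import json

# Hypothetical subject headings for one object
headings = [
    "American Civil War (United States : 1861-1865)",
    "Migration, Internal--Southern States",
]

# json.dumps quotes each heading and joins them with a comma
# and a space, wrapping the whole string in square brackets
cell_value = json.dumps(headings, ensure_ascii=False)
print(cell_value)
# -> ["American Civil War (United States : 1861-1865)", "Migration, Internal--Southern States"]
```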

This bracketing becomes even more complicated when using linked data. For example, when doing a batch import, all of the following goes into a single cell under the creator_lod field:

[{"name_uri":"http://id.loc.gov/authorities/names/n2001078880","role_uri":"http://id.loc.gov/vocabulary/relators/cre","agent_type":"personal","name_label":"Hogg, James Stephen, 1851-1906.","role_label":"Creator"},{"name_uri":"https://id.loc.gov/authorities/names/n82158463","role_uri":"http://id.loc.gov/vocabulary/relators/rcp","agent_type":"personal","name_label":"Ross, Lawrence Sullivan, 1838-1898","role_label":"Addressee"}]

These are two name entries, each with five properties, all placed in the same cell. Typing this out by hand is time-consuming and error-prone, so here is a flat csv template and a script to convert it into an AMI-compatible spreadsheet.
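In JSON terms, the creator_lod cell is an array of objects, one per name entry. A minimal sketch (showing only the first entry from the example above) of how such a cell value could be built programmatically:

```python
import json

# One creator entry; each dict carries the five required properties
creators = [
    {
        "name_uri": "http://id.loc.gov/authorities/names/n2001078880",
        "role_uri": "http://id.loc.gov/vocabulary/relators/cre",
        "agent_type": "personal",
        "name_label": "Hogg, James Stephen, 1851-1906.",
        "role_label": "Creator",
    },
]

# The resulting string is what goes into the creator_lod cell
cell_value = json.dumps(creators, ensure_ascii=False)
```

Adding a second dict to the list yields the two-entry string shown above.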

What does this csv format do?

Instead of typing long strings into a single small cell, this “flat” template allows you to spread values across multiple columns for better readability and usability.

This is similar to an Avalon spreadsheet upload, where you can add as many columns as needed, except a few things differ.

  • Creators and subjects will be separated between local and linked data. In Avalon, all creators/subjects are treated the same.

  • Creators and contributors are both classified as “creators”. You can differentiate between them using a role property. In Avalon, creators and contributors are differentiated by being placed in columns labeled “Creator” or “Contributor”.

  • Every linked data value requires at least two columns, not one. To add an extra subject on Avalon, a single column needs to be created. To add an extra subject here, two columns must be created, one for the label and the other for the uri.

  • Most linked data fields will require two columns per entry except creator/contributor fields, which require five: personal/corporate, label, uri, role, and role uri.

  • Local subjects, local subjects - personal names, languages, publishers, descriptions, notes, and physical_description_extents can be added with a single column that contains the label. This is similar to Avalon.

  • Avalon column names are repeatable. Here, the columns require a number so the script knows which label to match with which uri.

    • For example, LCSH subjects follow this format: subject_loc_{i}_label and subject_loc_{i}_uri where i represents a number. i must match for a label and uri pair.

  • An Avalon spreadsheet is the final product that is uploaded. This template creates an intermediate spreadsheet that must be converted into an AMI-compatible spreadsheet.
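The numbered-column convention is what lets a script pair each label with its uri. A sketch of the matching logic (hypothetical header row; the regex mirrors the subject_loc_{i}_label pattern described above):

```python
import re

# Hypothetical header row from a flat csv
headers = ["subject_loc_1_label", "subject_loc_1_uri",
           "subject_loc_2_label", "subject_loc_2_uri", "label"]

# Capture the number i from each subject_loc_{i}_label column;
# the same i is then used to look up subject_loc_{i}_uri
pattern = re.compile(r"^subject_loc_(\d+)_label$")
indexes = sorted(int(m.group(1))
                 for m in map(pattern.match, headers) if m)
print(indexes)  # -> [1, 2]
```

Note that the unnumbered "label" column does not match, so plain AMI columns can coexist with the numbered sets.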

Not all Archipelago values require multiple “parts” to be strung together. This spreadsheet template is only used for adding the following repeatable fields:

  • LCNAF creators

  • LCSH subject headings

  • LCNAF subjects (personal)

  • LCNAF subjects (corporate)

  • LCNAF geographic subjects

  • LCGFT terms

  • Local creators

  • Local subjects

  • Local subjects - personal names

  • Wikidata

  • EDTF

  • Language

  • Publisher

  • Description

  • Note

  • Physical_description_extent

Other fields, like label or date_created, should be added directly to the final AMI-compatible spreadsheet.

Note

AMI spreadsheets are the ONLY way to add repeatable descriptions, notes, publishers, and physical_description_extents. If you try to add this data through the webform it will not work.

Input csv

Enter your metadata into this csv.

Be sure to look at the first sheet (Flat).

Add more columns if a field needs more metadata values. Remember that:

  • LCNAF creator fields are added in sets of 5.

  • Local creator, LCSH subject headings, LCNAF subjects (personal), LCNAF subjects (corporate), LCNAF geographic subjects, LCGFT terms, and Wikidata fields are added in sets of 2.

  • EDTF fields are added as a group of four. This group can only be added once.

  • All other fields must be added individually.
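For example, a flat header row containing one LCNAF creator (a set of 5) and two LCSH subjects (two sets of 2) might look like this (a sketch; the column names follow the numbering convention described above):

```python
# Build a hypothetical flat-template header row:
# one LCNAF creator set of 5, plus two LCSH subject sets of 2
header = (
    ["creator_lod_1_agent_type", "creator_lod_1_name_label",
     "creator_lod_1_name_uri", "creator_lod_1_role_label",
     "creator_lod_1_role_uri"]
    + ["subject_loc_1_label", "subject_loc_1_uri",
       "subject_loc_2_label", "subject_loc_2_uri"]
)
print(",".join(header))
```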

What do these column headers mean?

Note

date_type will be date_range if you are using date_to and date_from. date_type will be date_edtf if you are using the date_free field. Although it is labeled date_free, it must be written in Extended Date Time Format.
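Sketched as JSON, the two date shapes the conversion script produces look like this (the field names match the script below; the EDTF values themselves are hypothetical examples):

```python
import json

# date_range: pairs date_from and date_to
range_cell = json.dumps({
    "date_from": "1861",
    "date_to": "1865",
    "date_type": "date_range",
})

# date_edtf: uses date_free, which must still be valid EDTF
# (here "1861~" means "approximately 1861")
free_cell = json.dumps({
    "date_free": "1861~",
    "date_type": "date_edtf",
})
```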

Conversion

Once you have the input csv, you can run this script to convert it to a csv compatible with Archipelago AMI batch import.

import csv
import json
import re

INPUT_CSV = "test-flat.csv"
OUTPUT_CSV = "test-compressed.csv"

# --------------------
# Helpers
# --------------------

def non_empty(val):
    return val is not None and str(val).strip() != ""

def find_indexes(fieldnames, prefix, suffix):
    """
    Finds all numeric indexes for columns like:
    prefix{i}suffix
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+){re.escape(suffix)}$")
    indexes = set()

    for name in fieldnames:
        match = pattern.match(name)
        if match:
            indexes.add(int(match.group(1)))

    return sorted(indexes)

def build_lcnaf_array(row, fieldnames, base):
    indexes = find_indexes(fieldnames, f"{base}_", "_label")
    results = []

    for i in indexes:
        label = row.get(f"{base}_{i}_label")
        uri   = row.get(f"{base}_{i}_uri")

        if non_empty(label):
            results.append({
                "label": label,
                "uri": uri
            })

    return results

# --------------------
# Main
# --------------------

with open(INPUT_CSV, newline="", encoding="utf-8") as infile, \
    open(OUTPUT_CSV, "w", newline="", encoding="utf-8") as outfile:

    reader = csv.DictReader(infile)

    # Normalize headers
    reader.fieldnames = [h.strip() for h in reader.fieldnames]

    fieldnames = [

        "creator",
        "creator_lod",

        "subject_loc",
        "subject_lcnaf_personal_names",
        "subject_lcnaf_corporate_names",
        "subject_lcnaf_geographic_names",
        "subject_lcgft_terms",
        "subject_wikidata",

        "subjects_local",
        "subjects_local_personal_names",

        "date_created_edtf",
        "language",
        "publisher",
        "description",
        "note",
        "physical_description_extent"

    ]

    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()

    for row in reader:

        # ---- Creators ----
        creator_indexes = find_indexes(
            reader.fieldnames,
            "creator_",
            "_name_label"
        )

        creators = []

        for i in creator_indexes:
            name_label = row.get(f"creator_{i}_name_label")
            role_label = row.get(f"creator_{i}_role_label")

            if non_empty(name_label):
                creator_obj = {
                    "name_label": name_label
                }

                if non_empty(role_label):
                    creator_obj["role_label"] = role_label

                creators.append(creator_obj)

        # ---- Creator LOD ----
        lod_indexes = find_indexes(reader.fieldnames, "creator_lod_", "_name_label")
        creator_lod = []

        for i in lod_indexes:
            prefix = f"creator_lod_{i}_"
            fields = {
                "agent_type": row.get(prefix + "agent_type"),
                "name_label": row.get(prefix + "name_label"),
                "name_uri": row.get(prefix + "name_uri"),
                "role_label": row.get(prefix + "role_label"),
                "role_uri": row.get(prefix + "role_uri")
            }

            if any(non_empty(v) for v in fields.values()):
                creator_lod.append(fields)

        # ---- Subject LOC ----
        loc_indexes = find_indexes(reader.fieldnames, "subject_loc_", "_label")
        subject_loc = []

        for i in loc_indexes:
            label = row.get(f"subject_loc_{i}_label")
            uri   = row.get(f"subject_loc_{i}_uri")

            if non_empty(label):
                subject_loc.append({
                    "label": label,
                    "uri": uri
                })

        # ---- LCNAF ----
        lcnaf_personal   = build_lcnaf_array(row, reader.fieldnames, "subject_lcnaf_personal")
        lcnaf_corporate  = build_lcnaf_array(row, reader.fieldnames, "subject_lcnaf_corporate")
        lcnaf_geographic = build_lcnaf_array(row, reader.fieldnames, "subject_lcnaf_geographic")

        # ---- LCGFT ----
        lcgft_indexes = find_indexes(reader.fieldnames, "subject_lcgft_terms_", "_label")
        subject_lcgft = []

        for i in lcgft_indexes:
            label = row.get(f"subject_lcgft_terms_{i}_label")
            uri   = row.get(f"subject_lcgft_terms_{i}_uri")

            if non_empty(label):
                subject_lcgft.append({
                    "label": label,
                    "uri": uri
                })

        # ---- Wikidata ----
        # Wikidata columns use the same {base}_{i}_label / {base}_{i}_uri
        # pattern, so the LCNAF helper is reused here.
        wikidata = build_lcnaf_array(row, reader.fieldnames, "subject_wikidata")

        # ---- Subjects local ----
        local_indexes = find_indexes(reader.fieldnames, "subjects_local_", "")
        subjects_local = [
            row.get(f"subjects_local_{i}")
            for i in local_indexes
            if non_empty(row.get(f"subjects_local_{i}"))
        ]

        # ---- Subjects local personal names ----
        local_personal_indexes = find_indexes(
            reader.fieldnames,
            "subjects_local_personal_names_",
            ""
        )

        subjects_local_personal = [
            row.get(f"subjects_local_personal_names_{i}")
            for i in local_personal_indexes
            if non_empty(row.get(f"subjects_local_personal_names_{i}"))
        ]

        # ---- Date created (EDTF) ----
        raw_date_created = {
            "date_to": row.get("date_to"),
            "date_free": row.get("date_free"),
            "date_from": row.get("date_from"),
            "date_type": row.get("date_type")
        }

        # Remove empty values entirely
        date_created = {
            k: v for k, v in raw_date_created.items()
            if non_empty(v)
        }

        # Only write the cell if at least one value is present
        if not date_created:
            date_created = None

        # ---- Languages ----
        language_indexes = find_indexes(reader.fieldnames, "language_", "")
        languages = [
            row.get(f"language_{i}")
            for i in language_indexes
            if non_empty(row.get(f"language_{i}"))
        ]

        # ---- Publishers ----
        publisher_indexes = find_indexes(reader.fieldnames, "publisher_", "")
        publishers = [
            row.get(f"publisher_{i}")
            for i in publisher_indexes
            if non_empty(row.get(f"publisher_{i}"))
        ]

        # ---- Descriptions ----
        description_indexes = find_indexes(reader.fieldnames, "description_", "")
        descriptions = [
            row.get(f"description_{i}")
            for i in description_indexes
            if non_empty(row.get(f"description_{i}"))
        ]

        # ---- Notes ----
        note_indexes = find_indexes(reader.fieldnames, "note_", "")
        notes = [
            row.get(f"note_{i}")
            for i in note_indexes
            if non_empty(row.get(f"note_{i}"))
        ]

        # ---- physical_description_extent ----
        physical_description_extent_indexes = find_indexes(reader.fieldnames, "physical_description_extent_", "")
        physical_description_extents = [
            row.get(f"physical_description_extent_{i}")
            for i in physical_description_extent_indexes
            if non_empty(row.get(f"physical_description_extent_{i}"))
        ]

        # ---- Write row ----
        writer.writerow({

            "creator": json.dumps(creators, ensure_ascii=False),
            "creator_lod": json.dumps(creator_lod, ensure_ascii=False),

            "subject_loc": json.dumps(subject_loc, ensure_ascii=False),
            "subject_lcnaf_personal_names": json.dumps(lcnaf_personal, ensure_ascii=False),
            "subject_lcnaf_corporate_names": json.dumps(lcnaf_corporate, ensure_ascii=False),
            "subject_lcnaf_geographic_names": json.dumps(lcnaf_geographic, ensure_ascii=False),
            "subject_lcgft_terms": json.dumps(subject_lcgft, ensure_ascii=False),
            "subject_wikidata": json.dumps(wikidata, ensure_ascii=False),

            "subjects_local": json.dumps(subjects_local, ensure_ascii=False),
            "subjects_local_personal_names": json.dumps(subjects_local_personal, ensure_ascii=False),

            "date_created_edtf": (
                json.dumps(date_created, ensure_ascii=False)
                if date_created else ""
            ),

            "language": json.dumps(languages, ensure_ascii=False),
            "publisher": json.dumps(publishers, ensure_ascii=False),
            "description": json.dumps(descriptions, ensure_ascii=False),
            "note": json.dumps(notes, ensure_ascii=False),
            "physical_description_extent": json.dumps(physical_description_extents, ensure_ascii=False)
        })

Putting your csvs together

Now that you have your output csv (test-compressed.csv in the script above), you can copy and paste these columns into your AMI-compatible spreadsheet.
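If you prefer to merge the files programmatically rather than copy and paste, a minimal sketch is below. It assumes both csvs describe the same objects in the same row order; the function name and the two-file workflow are illustrative, not part of the script above.

```python
def merge_converted(ami_rows, converted_rows, ami_fields):
    """Append the converted columns onto each AMI row.

    Assumes both csvs list the same objects in the same row order.
    """
    # Columns present in the converted csv but not the AMI csv
    extra = [c for c in converted_rows[0] if c not in ami_fields]
    merged = [
        {**ami_row, **{k: conv_row[k] for k in extra}}
        for ami_row, conv_row in zip(ami_rows, converted_rows)
    ]
    return merged, list(ami_fields) + extra
```

Read both files with csv.DictReader, pass the row lists and the AMI header to this function, then write the merged rows out with csv.DictWriter using the returned field list.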