Daniel Hernandez bb78b8a758 Add step-08: merge migration reasons and refactor temporal properties
Replace tm:reason and tm:secondaryReason with a single tm:hasReason
property (937 triples). Refactor 6 flat date properties into structured
tm:uncertainBeginning/tm:uncertainEnd intervals using W3C OWL-Time,
introducing tm:FuzzyInterval as a superclass of tm:Migration,
org:Membership, tm:Relationship, tm:PersonName, and
tm:ReligionAffiliation. Output: data/graph-08.ttl (218,251 triples).
2026-03-01 17:41:50 +01:00

Theatre Migrants

This project generates a knowledge graph about migrants in the theatre in Europe.

Running the scripts

The mapping scripts have been reimplemented in Rust for faster execution. Both scripts must be run from this directory (mapping/).

Prerequisites: Start the MariaDB container before running step 1:

docker compose up -d

Step 1 — Direct Mapping from MariaDB to RDF (data/graph-01.ttl):

cargo run --release --bin step-01

Step 2 — Apply SPARQL UPDATE queries (data/graph-02.ttl):

cargo run --release --bin step-02

Alternatively, after installing with cargo install --path .:

step-01
step-02

Generating the ontology

The following steps describe how to generate the migrants RDF graph.

Step 1 - Loading the input data into a relational database

Task

The file teatre-migrants.sql contains the dump of a MariaDB database. The tables in this schema are described in the file db_schema.md. We will load this data into MariaDB so that it can be queried with SQL. To this end:

  1. Create a Dockerfile to create a docker container for MariaDB.

  2. Upload the dump into a database in the container.

  3. Create a Rust program src/map/step_01.rs that connects to the database. This program should write a file called graph-01.ttl containing all the data from the tables loaded in the database, using the direct mapping from relational databases to RDF.

Summary

The Dockerfile creates a MariaDB 10.11 container that automatically loads teatre-migrants.sql on first start. The docker-compose.yml exposes the database on port 3306 with a healthcheck.

The program src/map/step_01.rs connects to the database and implements the W3C Direct Mapping for all 9 tables (location, migration_table, organisation, person, person_profession, personnames, relationship, religions, work). Each table row becomes an RDF resource identified by its primary key, each column becomes a datatype property, and each foreign key becomes an object property linking to the referenced row. The output file graph-01.ttl contains 162,029 triples.
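The row-to-triples rule described above can be sketched as follows. This is an illustration only, not the code in src/map/step_01.rs: the `Row` type and the function names are hypothetical, the namespace is assumed to be http://example.org/migrants/, the `#ref-` naming for foreign-key predicates is an assumption borrowed from the W3C Direct Mapping specification, and literal escaping is kept minimal.

```rust
// Namespace assumed from this README; all types and function names are hypothetical.
const BASE: &str = "http://example.org/migrants/";
const RDF_TYPE: &str = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";

/// Simplified view of one table row (not the types used in src/map/step_01.rs).
struct Row<'a> {
    table: &'a str,
    primary_key: &'a str,
    /// (column, value) pairs for plain columns.
    columns: Vec<(&'a str, &'a str)>,
    /// (column, referenced table, referenced primary key) for foreign keys.
    foreign_keys: Vec<(&'a str, &'a str, &'a str)>,
}

/// IRI of a row, shaped like `base:location/ARG-BahBlanca-00`.
fn row_iri(table: &str, primary_key: &str) -> String {
    format!("<{}{}/{}>", BASE, table, primary_key)
}

/// Escape backslashes and double quotes for a string literal.
fn escape(value: &str) -> String {
    value.replace('\\', "\\\\").replace('"', "\\\"")
}

/// Emit N-Triples lines for one row: a type triple, one literal triple per
/// plain column, and one IRI-valued triple per foreign key.
fn direct_map(row: &Row) -> Vec<String> {
    let subject = row_iri(row.table, row.primary_key);
    let mut triples = vec![format!("{} <{}> <{}{}> .", subject, RDF_TYPE, BASE, row.table)];
    for (column, value) in &row.columns {
        triples.push(format!(
            "{} <{}{}#{}> \"{}\" .",
            subject, BASE, row.table, column, escape(value)
        ));
    }
    for (column, ref_table, ref_key) in &row.foreign_keys {
        // Assumption: foreign-key predicates are named `table#ref-column`.
        triples.push(format!(
            "{} <{}{}#ref-{}> {} .",
            subject, BASE, row.table, column, row_iri(ref_table, ref_key)
        ));
    }
    triples
}

fn main() {
    let row = Row {
        table: "location",
        primary_key: "ARG-BahBlanca-00",
        columns: vec![("City", "Bahia Blanca")],
        foreign_keys: vec![],
    };
    for line in direct_map(&row) {
        println!("{line}");
    }
}
```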

To run:

docker compose up -d
cargo run --release --bin step-01

Step 2 - Generate Objects

Continents and countries should be objects instead of literals. To this end, we can transform the following data:

base:location\/ARG-BahBlanca-00 a base:location;
  base:location\#City "Bahia Blanca";
  base:location\#Continent "South America";
  base:location\#Country "Argentina";
  base:location\#GeoNamesID "3865086";
  base:location\#IDLocation "ARG-BahBlanca-00";
  base:location\#latitude -3.87253e1;
  base:location\#longitude -6.22742e1;
  base:location\#wikidata "Q54108";
  base:location\#wikipedia "https://en.wikipedia.org/wiki/Bah%C3%ADa_Blanca" .

Into the following data:

base:location\/ARG-BahBlanca-00 a base:location;
  base:location\#City base:City-BahiaBlanca;
  base:location\#Continent base:Continent-SouthAmerica;
  base:location\#Country base:Country-Argentina;
  base:location\#GeoNamesID "3865086";
  base:location\#IDLocation "ARG-BahBlanca-00";
  base:location\#latitude -3.87253e1;
  base:location\#longitude -6.22742e1;
  base:location\#wikidata "Q54108";
  base:location\#wikipedia "https://en.wikipedia.org/wiki/Bah%C3%ADa_Blanca" .

base:City-BahiaBlanca a base:City;
  rdfs:label "Bahia Blanca"@en .

base:Continent-SouthAmerica a base:Continent;
  rdfs:label "South America"@en .

base:Country-Argentina a base:Country;
  rdfs:label "Argentina"@en .

Note that every rdfs:label value is tagged as English (@en).
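As a rough sketch of how an object name such as base:City-BahiaBlanca could be derived from its label: the real SPARQL UPDATE queries may mint names differently. Dropping every non-alphanumeric character is an assumption that merely reproduces the examples above, and both function names are hypothetical.

```rust
/// Build a local name such as `City-BahiaBlanca` from a class name and a
/// label by dropping every character that is not alphanumeric (assumed rule).
fn object_local_name(class: &str, label: &str) -> String {
    let compact: String = label.chars().filter(|c| c.is_alphanumeric()).collect();
    format!("{class}-{compact}")
}

/// Render the Turtle description of the generated object with an English
/// label. The label is not escaped, so it must not contain double quotes.
fn object_turtle(class: &str, label: &str) -> String {
    let name = object_local_name(class, label);
    format!("base:{name} a base:{class};\n  rdfs:label \"{label}\"@en .")
}

fn main() {
    println!("{}", object_turtle("City", "Bahia Blanca"));
    println!("{}", object_turtle("Continent", "South America"));
}
```

Two labels that differ only in punctuation or spacing would collide under this rule, which is one reason the actual queries may use a different scheme.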

Generate a SPARQL UPDATE query that performs this transformation for all rows of the table and save it in a new folder called updates. Do the same for the other tables, proposing which columns should be defined as objects. Define a separate SPARQL UPDATE query for each table and save it in the updates folder. Number the generated queries with a prefix such as 001, 002, 003, and so on.

After generating the update queries, generate a Rust program that executes the updates on the RDF graph generated in the previous step and generates a new RDF graph to be saved: data/graph-02.ttl.

Summary

19 SPARQL UPDATE queries in updates/ transform literal values into typed objects across all tables:

| Query | Table | Column | Object type |
|-------|-------|--------|-------------|
| 001 | location | Continent | Continent |
| 002 | location | Country | Country |
| 003 | location | State | State |
| 004 | location | City | City |
| 005 | migration_table | reason | MigrationReason |
| 006 | migration_table | reason2 | MigrationReason |
| 007 | organisation | InstType | InstitutionType |
| 008 | person | gender | Gender |
| 009 | person | Nametype | Nametype |
| 010 | person | Importsource | ImportSource |
| 011 | person_profession | Eprofession | Profession |
| 012 | personnames | Nametype | Nametype |
| 013 | relationship | Relationshiptype | RelationshipType |
| 014 | relationship | relationshiptype_precise | RelationshipTypePrecise |
| 015 | religions | religion | Religion |
| 016 | work | Profession | Profession |
| 017 | work | Profession2 | Profession |
| 018 | work | Profession3 | Profession |
| 019 | work | EmploymentType | EmploymentType |

Each query replaces a literal value with an object reference and creates the object with rdf:type and rdfs:label (in English). The program src/map/step_02.rs loads data/graph-01.ttl, applies all queries in order, and writes data/graph-02.ttl (164,632 triples).
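Because the queries are applied in order, the numeric prefix of each file name matters. A minimal sketch of how the execution order could be derived from file names is shown below; the file names and both functions are hypothetical and are not taken from src/map/step_02.rs.

```rust
/// Parse the leading digits of a file name, e.g. `010a-x.rq` gives `Some(10)`.
fn numeric_prefix(name: &str) -> Option<u32> {
    let digits: String = name.chars().take_while(|c| c.is_ascii_digit()).collect();
    digits.parse().ok()
}

/// Keep only files with a numeric prefix and sort them by that prefix,
/// falling back to the full name so that `010a` comes before `010b`.
fn execution_order(mut names: Vec<String>) -> Vec<String> {
    names.retain(|n| numeric_prefix(n).is_some());
    names.sort_by_key(|n| (numeric_prefix(n), n.clone()));
    names
}

fn main() {
    // Hypothetical file names; the real names in updates/ may differ.
    let names = vec![
        "010-person-importsource.rq".to_string(),
        "002-location-country.rq".to_string(),
        "001-location-continent.rq".to_string(),
    ];
    for name in execution_order(names) {
        println!("{name}");
    }
}
```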

To run:

cargo run --release --bin step-02

Step 3 - Annotate datatypes

In the previous example, dates such as "1894-12-31" are represented as plain xsd:string literals. Please infer the datatypes of these literals and create a new SPARQL query that generates a new RDF graph in which the literals use the inferred datatypes.
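One way to think about the inference is as a classification of each lexical form. The sketch below is only an illustration in Rust of such a rule set, not the SPARQL query used by the project; the `XsdType` enum and the precedence of the checks are assumptions. Identifier-like columns (for example GeoNamesID) also look like integers, so a real rule set would probably decide per column rather than per value.

```rust
#[derive(Debug, PartialEq)]
enum XsdType {
    Date,
    Integer,
    Decimal,
    String,
}

/// True for `YYYY-MM-DD` with a month in 1..=12 and a day in 1..=31.
/// Days are not checked against the month length.
fn is_date(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 10 || b[4] != b'-' || b[7] != b'-' {
        return false;
    }
    if !b.iter().enumerate().all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit()) {
        return false;
    }
    let month: u32 = s[5..7].parse().unwrap_or(0);
    let day: u32 = s[8..10].parse().unwrap_or(0);
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// True for an optional minus sign, digits, a dot, and more digits.
fn is_decimal(s: &str) -> bool {
    let body = s.strip_prefix('-').unwrap_or(s);
    match body.split_once('.') {
        Some((int, frac)) => {
            !int.is_empty()
                && !frac.is_empty()
                && int.chars().all(|c| c.is_ascii_digit())
                && frac.chars().all(|c| c.is_ascii_digit())
        }
        None => false,
    }
}

/// Pick the most specific datatype whose lexical form matches.
fn infer(lexical: &str) -> XsdType {
    if is_date(lexical) {
        XsdType::Date
    } else if lexical.parse::<i64>().is_ok() {
        XsdType::Integer
    } else if is_decimal(lexical) {
        XsdType::Decimal
    } else {
        XsdType::String
    }
}

fn main() {
    for value in ["1894-12-31", "3865086", "-38.7253", "Q54108"] {
        println!("{value:?} -> {:?}", infer(value));
    }
}
```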

Step 4 - Replace empty string with unbound values

Intuitively, the triple

work:4 workp:EmploymentType workp:comment "" .

is not intended to assert a comment "", but the absence of a comment. Write a query that excludes these empty comments from the next generated graph.
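The intended effect can be illustrated as a filter over triples. This is a sketch with hypothetical `Triple` and `Object` types, not the project's query; it treats only the exactly empty string as missing, and whether whitespace-only values should also be dropped is left open.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Object {
    Iri(String),
    Literal(String),
}

#[derive(Debug, Clone, PartialEq)]
struct Triple {
    subject: String,
    predicate: String,
    object: Object,
}

/// Keep every triple except those whose object is the empty string literal.
fn drop_empty_literals(graph: Vec<Triple>) -> Vec<Triple> {
    graph
        .into_iter()
        .filter(|t| !matches!(&t.object, Object::Literal(v) if v.is_empty()))
        .collect()
}

fn main() {
    let graph = vec![
        Triple {
            subject: "work:4".into(),
            predicate: "workp:comment".into(),
            object: Object::Literal(String::new()),
        },
        Triple {
            subject: "work:4".into(),
            // Hypothetical predicate and value, for illustration only.
            predicate: "workp:example".into(),
            object: Object::Iri("base:Example".into()),
        },
    ];
    println!("{} triple(s) kept", drop_empty_literals(graph).len());
}
```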

Step 5 - Use well-known vocabularies

Some classes, properties, and individuals can be represented with Schema.org. For example, the class migrants:person can be represented with the class schema:Person. Please propose which of these elements could use the Schema.org vocabulary and write a SPARQL query that generates the next graph. Consider using other vocabularies beyond Schema.org if you consider them appropriate for representing the information in this dataset.

Summary

7 SPARQL UPDATE queries in updates_step05/ add well-known vocabulary properties alongside the existing migrants: predicates:

| Query | Mapping |
|-------|---------|
| 001 | Person properties → schema:givenName, schema:familyName, schema:birthDate, schema:deathDate, schema:gender, schema:birthPlace, schema:deathPlace, schema:image, schema:hasOccupation, schema:citation, rdfs:comment |
| 002 | Person authority identifiers (Wikidata, GND, VIAF, CERL, LCCN, ISNI, SNAC) → owl:sameAs and wdtn: normalized properties |
| 003 | Location properties → wgs84:lat, wgs84:long; Wikipedia/Wikidata links → owl:sameAs |
| 004 | Organisation properties → schema:name, schema:location, rdfs:comment |
| 005 | Person labels → rdfs:label (generated from first_name + family_name) |
| 006 | Enumeration instances → skos:Concept + skos:prefLabel |
| 007 | Class types → schema:Person, schema:Place, schema:Organization |

The program src/map/step_05.rs loads data/graph-04.ttl, applies all queries, and writes data/graph-05.ttl (168,129 triples).

To run:

cargo run --release --bin step-05

Step 6 - Map to the Theatre Migrants ontology

Task

Define a custom OWL ontology (teatre-migrants.ttl) for domain-specific terms not covered by well-known vocabularies, published at https://daniel.degu.cl/ontologies/theatre-migrants/ with prefix tm:. Reuse existing vocabularies where possible:

  • Schema.org for persons, places, organizations, and occupations.
  • W3C Organization Ontology (org:) for work engagements, modeled as org:Membership (replacing the original migrants:work class). Properties org:member and org:organization link the membership to the person and organization.
  • SKOS for enumeration types as subclasses of skos:Concept.

Write SPARQL CONSTRUCT queries that produce a new graph using only the tm:, schema:, org:, skos:, owl:, wgs84:, and wdtn: vocabularies. The original http://example.org/migrants/ predicates and class types are replaced; only entity IRIs retain the migrants: namespace.

Summary

The ontology teatre-migrants.ttl defines:

  • 5 domain-specific classes: tm:Migration, tm:Relationship, tm:PersonName, tm:ReligionAffiliation, tm:ImportSource (tm:PersonProfession was removed in Step 7).
  • 11 enumeration classes (all rdfs:subClassOf skos:Concept): tm:Continent, tm:Country, tm:State, tm:City, tm:MigrationReason, tm:InstitutionType, tm:NameType, tm:RelationshipType, tm:RelationshipTypePrecise, tm:Religion, tm:EmploymentType.
  • Object and datatype properties with domains, ranges, and temporal uncertainty modeling (tm:dateStartMin, tm:dateStartMax, tm:dateEndMin, tm:dateEndMax, tm:dateStartFuzzy, tm:dateEndFuzzy).

12 SPARQL CONSTRUCT queries in constructs_step06/ transform the graph:

| Query | Description |
|-------|-------------|
| 001-persons | Persons with schema:Person properties and tm: extensions |
| 002-places | Places with wgs84: coordinates and tm: geographic hierarchy |
| 003-organisations | Organizations with schema:name and tm:institutionType |
| 004-migrations | Migration events with tm:migrant, tm:startPlace, tm:destinationPlace |
| 005-memberships | Work engagements as org:Membership with org:member, org:organization |
| 006-relationships | Interpersonal relationships with tm:activePerson, tm:passivePerson |
| 007-person-professions | Personprofession associations |
| 008-person-names | Historical/alternative person names |
| 009-religion-affiliations | Religion affiliations with temporal bounds |
| 010a-occupations-passthrough | Pass through existing schema:Occupation instances |
| 010b-occupations-from-profession | Retype migrants:Profession as schema:Occupation |
| 011-enumerations | Map enumeration instances to skos:Concept with tm: subtypes |

The program src/map/step_06.rs loads data/graph-05.ttl, runs all CONSTRUCT queries, collects the resulting triples into a new graph, and writes data/graph-06.ttl (148,985 triples).
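Collecting the results of several CONSTRUCT queries amounts to taking the set union of their triples, so a triple produced by more than one query appears only once in the output. A minimal sketch of that idea, using plain string tuples as a hypothetical stand-in for whatever triple type src/map/step_06.rs actually uses:

```rust
use std::collections::BTreeSet;

/// Hypothetical triple representation: (subject, predicate, object).
type Triple = (String, String, String);

/// Merge the result sets of all queries into one graph without duplicates.
fn union_of_results(results: Vec<Vec<Triple>>) -> BTreeSet<Triple> {
    results.into_iter().flatten().collect()
}

/// Small helper to build a triple from string slices.
fn triple(s: &str, p: &str, o: &str) -> Triple {
    (s.to_string(), p.to_string(), o.to_string())
}

fn main() {
    // Two illustrative result sets that share one triple.
    let from_query_a = vec![triple("migrants:o1", "rdf:type", "schema:Occupation")];
    let from_query_b = vec![
        triple("migrants:o1", "rdf:type", "schema:Occupation"),
        triple("migrants:o2", "rdf:type", "schema:Occupation"),
    ];
    let graph = union_of_results(vec![from_query_a, from_query_b]);
    println!("{} distinct triples", graph.len());
}
```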

To run:

cargo run --release --bin step-06

Step 7 - Clean up secondary organisations and simplify personprofession

Task

Two clean-up tasks are performed on the graph produced by Step 6:

Secondary organisations. org:Membership instances may carry a tm:secondaryOrganisation property in addition to org:organization. An analysis of the 1,222 memberships with a secondary organisation reveals:

| Category | Count |
|----------|-------|
| Secondary differs from primary | 736 |
| Secondary equals primary (redundant) | 230 |
| Secondary exists but no primary | 256 |
| Total with secondary organisation | 1,222 |

Two SPARQL UPDATE queries clean up these cases:

  1. Remove redundant secondary — when tm:secondaryOrganisation equals org:organization, delete the secondary (230 triples removed).
  2. Promote secondary to primary — when a membership has tm:secondaryOrganisation but no org:organization, move the secondary to primary (256 triples replaced).

After these updates, 736 memberships retain a tm:secondaryOrganisation that genuinely differs from the primary organisation.
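The two clean-up rules can be sketched as a function over a single membership. The `Membership` struct below is hypothetical and assumes at most one primary and one secondary organisation per membership, which the RDF data does not necessarily guarantee; the project itself applies the rules as SPARQL UPDATE queries.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Membership {
    /// Value of org:organization, if any.
    organization: Option<String>,
    /// Value of tm:secondaryOrganisation, if any.
    secondary: Option<String>,
}

/// Apply both clean-up rules to one membership.
fn clean(mut m: Membership) -> Membership {
    if m.secondary.is_some() && m.secondary == m.organization {
        // Rule 1: the secondary duplicates the primary, so drop it.
        m.secondary = None;
    } else if m.organization.is_none() && m.secondary.is_some() {
        // Rule 2: no primary exists, so promote the secondary.
        m.organization = m.secondary.take();
    }
    m
}

fn main() {
    // Hypothetical organisation identifiers.
    let redundant = Membership {
        organization: Some("org:A".into()),
        secondary: Some("org:A".into()),
    };
    let orphan = Membership {
        organization: None,
        secondary: Some("org:B".into()),
    };
    println!("{:?}", clean(redundant));
    println!("{:?}", clean(orphan));
}
```

A membership whose secondary organisation differs from its primary is returned unchanged, which corresponds to the 736 cases that keep tm:secondaryOrganisation.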

Personprofession simplification. The tm:PersonProfession class modeled an intermediate node linking persons to professions (from the person_profession database table). Since both the profession and Eprofession columns represent occupation names (schema:name), the intermediate class is replaced by direct schema:hasOccupation links from persons to schema:Occupation instances. The tm:PersonProfession class and its properties (tm:personProfessionPerson, tm:enumeratedProfession, tm:professionLabel) are removed from the ontology.

Summary

5 SPARQL UPDATE queries in updates_step07/:

| Query | Description | Affected |
|-------|-------------|----------|
| 001 | Remove tm:secondaryOrganisation when it equals org:organization | 230 |
| 002 | Promote tm:secondaryOrganisation to org:organization when no primary exists | 256 |
| 003 | Add schema:hasOccupation from person to enumerated profession | 3 |
| 004 | Create schema:Occupation from profession label and add schema:hasOccupation | 730 |
| 005 | Remove all tm:PersonProfession instances | 742 |

The program src/map/step_07.rs loads data/graph-06.ttl, applies all queries, and writes data/graph-07.ttl (147,431 triples).

To run:

cargo run --release --bin step-07

Step 8 - Merge migration reasons and refactor temporal properties

Task

Two structural changes are applied to the graph produced by Step 7:

Merge migration reasons. The functional properties tm:reason (774 uses) and tm:secondaryReason (163 uses) are replaced by a single non-functional property tm:hasReason, resulting in 937 reason triples.
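Merging the two properties means that each migration ends up with a set of reasons rather than one primary and one secondary slot. A small sketch of that idea follows; the function name and the reason identifiers are hypothetical. Because an RDF graph is a set, identical values would collapse into one triple, so the total of 937 matching 774 + 163 suggests that no migration listed the same reason twice.

```rust
use std::collections::BTreeSet;

/// Combine the former tm:reason and tm:secondaryReason values of one
/// migration into the set of its tm:hasReason values.
fn merge_reasons(reason: Option<&str>, secondary: Option<&str>) -> BTreeSet<String> {
    reason
        .into_iter()
        .chain(secondary)
        .map(str::to_string)
        .collect()
}

fn main() {
    // Hypothetical reason identifiers.
    let merged = merge_reasons(Some("reason-A"), Some("reason-B"));
    for r in &merged {
        println!("tm:hasReason {r}");
    }
}
```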

Refactor temporal properties into tm:FuzzyInterval. Six flat date properties (tm:dateStartMin, tm:dateStartMax, tm:dateEndMin, tm:dateEndMax, tm:dateStartFuzzy, tm:dateEndFuzzy) are replaced by a structured model based on W3C OWL-Time. A new class tm:FuzzyInterval (subclass of time:TemporalEntity) is introduced, with two object properties tm:uncertainBeginning and tm:uncertainEnd pointing to time:DateTimeInterval resources. Each interval has time:hasBeginning and time:hasEnd linking to time:Instant nodes with time:inXSDDate values, plus an optional rdfs:label for fuzzy date strings. Five classes are declared as subclasses of tm:FuzzyInterval: tm:Migration, org:Membership, tm:Relationship, tm:PersonName, tm:ReligionAffiliation.
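The shape of the refactoring can be sketched with plain Rust structs. Everything here is illustrative: the struct and function names are hypothetical, the dates are made up, and two points are assumptions not confirmed by this README, namely that an interval is created whenever any of its three source values is present, and that intervals and instants are rendered as blank nodes.

```rust
/// The six flat date properties of one resource before the refactoring.
#[derive(Debug, Clone, Default)]
struct FlatDates {
    date_start_min: Option<String>,
    date_start_max: Option<String>,
    date_end_min: Option<String>,
    date_end_max: Option<String>,
    date_start_fuzzy: Option<String>,
    date_end_fuzzy: Option<String>,
}

/// One time:DateTimeInterval bounding an uncertain beginning or end.
#[derive(Debug, Clone, PartialEq)]
struct UncertainInterval {
    has_beginning: Option<String>, // time:hasBeginning, from the *Min date
    has_end: Option<String>,       // time:hasEnd, from the *Max date
    label: Option<String>,         // rdfs:label, from the *Fuzzy string
}

/// The structured replacement attached to a tm:FuzzyInterval resource.
#[derive(Debug, Clone, PartialEq)]
struct FuzzyInterval {
    uncertain_beginning: Option<UncertainInterval>,
    uncertain_end: Option<UncertainInterval>,
}

/// Assumption: an interval exists if at least one of its values is present.
fn interval(
    min: Option<String>,
    max: Option<String>,
    fuzzy: Option<String>,
) -> Option<UncertainInterval> {
    if min.is_none() && max.is_none() && fuzzy.is_none() {
        return None;
    }
    Some(UncertainInterval { has_beginning: min, has_end: max, label: fuzzy })
}

fn refactor(flat: FlatDates) -> FuzzyInterval {
    FuzzyInterval {
        uncertain_beginning: interval(flat.date_start_min, flat.date_start_max, flat.date_start_fuzzy),
        uncertain_end: interval(flat.date_end_min, flat.date_end_max, flat.date_end_fuzzy),
    }
}

fn instant(date: &str) -> String {
    format!("[ a time:Instant ; time:inXSDDate \"{date}\"^^xsd:date ]")
}

/// Render one interval as a Turtle blank node (blank nodes are an assumption).
fn to_turtle(i: &UncertainInterval) -> String {
    let mut parts = vec!["a time:DateTimeInterval".to_string()];
    if let Some(d) = &i.has_beginning {
        parts.push(format!("time:hasBeginning {}", instant(d)));
    }
    if let Some(d) = &i.has_end {
        parts.push(format!("time:hasEnd {}", instant(d)));
    }
    if let Some(l) = &i.label {
        parts.push(format!("rdfs:label \"{l}\""));
    }
    format!("[ {} ]", parts.join(" ; "))
}

fn main() {
    // Made-up dates for illustration.
    let flat = FlatDates {
        date_start_min: Some("1933-01-01".into()),
        date_start_max: Some("1933-12-31".into()),
        date_start_fuzzy: Some("1933".into()),
        ..Default::default()
    };
    let fuzzy = refactor(flat);
    if let Some(b) = &fuzzy.uncertain_beginning {
        println!("tm:uncertainBeginning {} .", to_turtle(b));
    }
}
```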

Summary

8 SPARQL UPDATE queries in updates_step08/:

| Query | Description |
|-------|-------------|
| 001 | Merge tm:reason and tm:secondaryReason into tm:hasReason |
| 002a | Create tm:uncertainBeginning interval from tm:dateStartMin |
| 002b | Add upper bound to tm:uncertainBeginning from tm:dateStartMax |
| 003a | Create tm:uncertainEnd interval from tm:dateEndMin |
| 003b | Add upper bound to tm:uncertainEnd from tm:dateEndMax |
| 004a | Add rdfs:label on uncertainBeginning from tm:dateStartFuzzy |
| 004b | Add rdfs:label on uncertainEnd from tm:dateEndFuzzy |
| 005 | Remove all 6 old date properties |

The program src/map/step_08.rs loads data/graph-07.ttl, applies all queries, and writes data/graph-08.ttl (218,251 triples).

To run:

cargo run --release --bin step-08