# Theatre Migrants

This project generates a knowledge graph about migrants in the theatre in Europe.

## Running the scripts

The mapping scripts have been reimplemented in Rust for faster execution. Both
scripts must be run from this directory (`mapping/`).

**Prerequisites:** Start the MariaDB container before running step 1:

```sh
docker compose up -d
```

**Step 1** — Direct Mapping from MariaDB to RDF (`data/graph-01.ttl`):

```sh
cargo run --release --bin step-01
```

**Step 2** — Apply SPARQL UPDATE queries (`data/graph-02.ttl`):

```sh
cargo run --release --bin step-02
```

Alternatively, after installing with `cargo install --path .`:

```sh
step-01
step-02
```

## Generating the ontology

The following steps describe how to generate the migrants RDF graph.

### Step 1 - Loading the input data into a relational database

#### Task

The file `teatre-migrants.sql` contains the dump of a MariaDB database. The tables involved in this schema are described in the file `db_schema.md`. We will load this data in MariaDB to access the data with SQL. To this end:

1. Write a Dockerfile for a MariaDB Docker container.

2. Upload the dump into a database in the container.

3. Create a Rust program `src/map/step_01.rs` that connects to the database. This program should write a file called `graph-01.ttl` containing all the data from the tables loaded in the database, using the direct mapping from relational databases to RDF.

#### Summary

The `Dockerfile` creates a MariaDB 10.11 container that automatically loads `teatre-migrants.sql` on first start. The `docker-compose.yml` exposes the database on port 3306 with a healthcheck.

The program `src/map/step_01.rs` connects to the database and implements the [W3C Direct Mapping](https://www.w3.org/TR/rdb-direct-mapping/) for all 9 tables (`location`, `migration_table`, `organisation`, `person`, `person_profession`, `personnames`, `relationship`, `religions`, `work`). Each table row becomes an RDF resource identified by its primary key, each column becomes a datatype property, and each foreign key becomes an object property linking to the referenced row. The output file `graph-01.ttl` contains 162,029 triples.

To run:

```sh
docker compose up -d
cargo run --release --bin step-01
```
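
The row-to-triples rule described above can be sketched as follows. This is a minimal illustration, not the code in `src/map/step_01.rs`: the `Column` struct and the function names are hypothetical, the IRI shapes follow the `location` example shown in Step 2, every value is written as a plain string literal, and foreign keys and IRI percent-encoding are left out.

```rust
/// One column of a row; `None` models SQL NULL. (Hypothetical helper type.)
struct Column<'a> {
    name: &'a str,
    value: Option<&'a str>,
}

/// Escape a value for use inside a double-quoted N-Triples literal.
fn escape_literal(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            other => out.push(other),
        }
    }
    out
}

/// Emit N-Triples lines for one row: a type triple plus one triple per
/// non-NULL column. Assumption: the primary key needs no percent-encoding.
fn row_to_ntriples(base: &str, table: &str, pk: &str, columns: &[Column]) -> Vec<String> {
    const RDF_TYPE: &str = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
    let subject = format!("<{base}{table}/{pk}>");
    let mut triples = vec![format!("{subject} <{RDF_TYPE}> <{base}{table}> .")];
    for column in columns {
        // The Direct Mapping emits no triple for a NULL value.
        if let Some(value) = column.value {
            let predicate = format!("<{base}{table}#{}>", column.name);
            triples.push(format!("{subject} {predicate} \"{}\" .", escape_literal(value)));
        }
    }
    triples
}
```

Calling `row_to_ntriples` with the base `http://example.org/migrants/`, the table `location`, and the key `ARG-BahBlanca-00` yields one `rdf:type` triple and one triple per non-NULL column.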

### Step 2 - Generate Objects

Continents and countries should be objects instead of literals. To this end, we can transform the following data:

```
base:location\/ARG-BahBlanca-00 a base:location;
  base:location\#City "Bahia Blanca";
  base:location\#Continent "South America";
  base:location\#Country "Argentina";
  base:location\#GeoNamesID "3865086";
  base:location\#IDLocation "ARG-BahBlanca-00";
  base:location\#latitude -3.87253e1;
  base:location\#longitude -6.22742e1;
  base:location\#wikidata "Q54108";
  base:location\#wikipedia "https://en.wikipedia.org/wiki/Bah%C3%ADa_Blanca" .
```

Into the following data:

```
base:location\/ARG-BahBlanca-00 a base:location;
  base:location\#City base:City-BahiaBlanca;
  base:location\#Continent base:Continent-SouthAmerica;
  base:location\#Country base:Country-Argentina;
  base:location\#GeoNamesID "3865086";
  base:location\#IDLocation "ARG-BahBlanca-00";
  base:location\#latitude -3.87253e1;
  base:location\#longitude -6.22742e1;
  base:location\#wikidata "Q54108";
  base:location\#wikipedia "https://en.wikipedia.org/wiki/Bah%C3%ADa_Blanca" .

base:City-BahiaBlanca a base:City;
  rdfs:label "Bahia Blanca"@en .

base:Continent-SouthAmerica a base:Continent;
  rdfs:label "South America"@en .

base:Country-Argentina a base:Country;
  rdfs:label "Argentina"@en .
```

Note that every `rdfs:label` value is tagged as English (`@en`).

Generate a SPARQL UPDATE query that performs this transformation for all rows of the table and save it in a new folder called `updates`. Do the same for the other tables, proposing which columns should be defined as objects. Define a separate SPARQL UPDATE query for each table and save it in the `updates` folder. Number the generated queries with a prefix such as 001, 002, 003, and so on.

After generating the update queries, generate a Rust program that executes the updates on the RDF graph generated in the previous step and generates a new RDF graph to be saved: `data/graph-02.ttl`.

#### Summary

19 SPARQL UPDATE queries in `updates/` transform literal values into typed objects across all tables:

| Query | Table | Column | Object type |
|-------|-------|--------|-------------|
| 001 | location | Continent | Continent |
| 002 | location | Country | Country |
| 003 | location | State | State |
| 004 | location | City | City |
| 005 | migration_table | reason | MigrationReason |
| 006 | migration_table | reason2 | MigrationReason |
| 007 | organisation | InstType | InstitutionType |
| 008 | person | gender | Gender |
| 009 | person | Nametype | Nametype |
| 010 | person | Importsource | ImportSource |
| 011 | person_profession | Eprofession | Profession |
| 012 | personnames | Nametype | Nametype |
| 013 | relationship | Relationshiptype | RelationshipType |
| 014 | relationship | relationshiptype_precise | RelationshipTypePrecise |
| 015 | religions | religion | Religion |
| 016 | work | Profession | Profession |
| 017 | work | Profession2 | Profession |
| 018 | work | Profession3 | Profession |
| 019 | work | EmploymentType | EmploymentType |

Each query replaces a literal value with an object reference and creates the object with `rdf:type` and `rdfs:label` (in English). The program `src/map/step_02.rs` loads `data/graph-01.ttl`, applies all queries in order, and writes `data/graph-02.ttl` (164,632 triples).

To run:

```sh
cargo run --release --bin step-02
```
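
The literal-to-object rewrite that each query performs can be sketched in Rust as shown here. The naming rule (drop every character that is not an ASCII letter or digit) is an assumption inferred from the `City-BahiaBlanca` example; the actual queries in `updates/` may build IRIs differently, and both function names are hypothetical.

```rust
/// Build the local name of the object minted for a literal,
/// e.g. ("City", "Bahia Blanca") -> "City-BahiaBlanca".
/// Assumption: characters other than ASCII letters and digits are dropped.
fn object_local_name(class: &str, label: &str) -> String {
    let compact: String = label.chars().filter(|c| c.is_ascii_alphanumeric()).collect();
    format!("{class}-{compact}")
}

/// Turtle description of the minted object: its type and English label.
fn object_description(class: &str, label: &str) -> String {
    let name = object_local_name(class, label);
    // Escape backslashes and quotes so the label stays a valid Turtle string.
    let escaped = label.replace('\\', "\\\\").replace('"', "\\\"");
    format!("base:{name} a base:{class};\n  rdfs:label \"{escaped}\"@en .")
}
```

Two labels that differ only in punctuation or spacing would collapse to the same IRI under this rule, which is worth checking before relying on it.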

### Step 3 - Annotate datatypes

The previous example contains dates such as "1894-12-31", which are represented with the `xsd:string` datatype. Infer the datatypes of these literals and write a new SPARQL query that generates a new RDF graph in which the literals use the inferred datatypes.
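
One way to approach the inference is to test each lexical form against a few patterns. The sketch below is only an illustration under stated assumptions: the `XsdType` enum and both functions are hypothetical, only `xsd:date`, `xsd:integer`, and `xsd:decimal` are recognised, and the date check accepts any day from 1 to 31 without validating it against the month.

```rust
/// XSD datatypes this sketch can assign.
#[derive(Debug, PartialEq)]
enum XsdType {
    Date,
    Integer,
    Decimal,
    String,
}

/// True for `YYYY-MM-DD` with a month in 1..=12 and a day in 1..=31.
fn is_iso_date(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 10 || b[4] != b'-' || b[7] != b'-' {
        return false;
    }
    let digits = |r: std::ops::Range<usize>| b[r].iter().all(u8::is_ascii_digit);
    if !(digits(0..4) && digits(5..7) && digits(8..10)) {
        return false;
    }
    // All ten bytes are ASCII here, so slicing and parsing cannot fail.
    let month: u32 = s[5..7].parse().unwrap();
    let day: u32 = s[8..10].parse().unwrap();
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

/// Infer a datatype from the lexical form of a literal.
fn infer_datatype(lexical: &str) -> XsdType {
    if is_iso_date(lexical) {
        return XsdType::Date;
    }
    let all_digits = |p: &str| !p.is_empty() && p.bytes().all(|b| b.is_ascii_digit());
    let unsigned = lexical
        .strip_prefix('-')
        .or_else(|| lexical.strip_prefix('+'))
        .unwrap_or(lexical);
    if all_digits(unsigned) {
        return XsdType::Integer;
    }
    if let Some((int, frac)) = unsigned.split_once('.') {
        if all_digits(int) && all_digits(frac) {
            return XsdType::Decimal;
        }
    }
    XsdType::String
}
```

Inference per value can mislead: an identifier such as the GeoNames ID "3865086" looks like an integer but is better kept as a string, so it is probably safer to decide the datatype per column.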

### Step 4 - Replace empty string with unbound values

Intuitively, the triple

```
work:4 workp:comment "" .
```

is not intended to mean that the comment is "", but that there is no comment. Write a query that excludes these empty comments from the next generated graph.
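
The intended effect on the graph can be sketched as a filter over triples. The `Triple` and `Object` types here are hypothetical stand-ins for whatever RDF library the step uses, and the sketch removes only literals that are exactly empty (whitespace-only values are kept).

```rust
/// Object position of a triple: either an IRI or a plain literal.
#[derive(Debug, Clone, PartialEq)]
enum Object {
    Iri(String),
    Literal(String),
}

/// A simplified triple with string subject and predicate.
#[derive(Debug, Clone, PartialEq)]
struct Triple {
    subject: String,
    predicate: String,
    object: Object,
}

/// Keep every triple except those whose object is the empty string literal.
fn drop_empty_literals(triples: Vec<Triple>) -> Vec<Triple> {
    triples
        .into_iter()
        .filter(|t| !matches!(&t.object, Object::Literal(s) if s.is_empty()))
        .collect()
}
```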

### Step 5 - Use well-known vocabularies

Some classes, properties, and individuals can be represented with Schema.org. For example, the class `migrants:person` can be represented with the class `schema:Person`. Propose which of these elements could use the Schema.org vocabulary and write a SPARQL query that generates the next graph. Consider using other vocabularies beyond Schema.org if you consider them appropriate for representing the information in this dataset.

#### Summary

7 SPARQL UPDATE queries in `updates_step05/` add well-known vocabulary properties alongside the existing `migrants:` predicates:

| Query | Mapping |
|-------|---------|
| 001 | Person properties → `schema:givenName`, `schema:familyName`, `schema:birthDate`, `schema:deathDate`, `schema:gender`, `schema:birthPlace`, `schema:deathPlace`, `schema:image`, `schema:hasOccupation`, `schema:citation`, `rdfs:comment` |
| 002 | Person authority identifiers (Wikidata, GND, VIAF, CERL, LCCN, ISNI, SNAC) → `owl:sameAs` and `wdtn:` normalized properties |
| 003 | Location properties → `wgs84:lat`, `wgs84:long`; Wikipedia/Wikidata links → `owl:sameAs` |
| 004 | Organisation properties → `schema:name`, `schema:location`, `rdfs:comment` |
| 005 | Person labels → `rdfs:label` (generated from first\_name + family\_name) |
| 006 | Enumeration instances → `skos:Concept` + `skos:prefLabel` |
| 007 | Class types → `schema:Person`, `schema:Place`, `schema:Organization` |

The program `src/map/step_05.rs` loads `data/graph-04.ttl`, applies all queries, and writes `data/graph-05.ttl` (168,129 triples).

To run:

```sh
cargo run --release --bin step-05
```
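
As an illustration of the `owl:sameAs` links in queries 002 and 003, the sketch below turns an authority identifier into an IRI. It covers only three authorities, the `Authority` enum and `authority_iri` function are hypothetical, and the IRI prefixes are the ones commonly published by each authority; the queries in `updates_step05/` may use different forms.

```rust
/// Authorities handled by this sketch; the project covers more.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Authority {
    Wikidata,
    Viaf,
    Gnd,
}

/// Build the IRI for an identifier, or `None` when the identifier is blank.
fn authority_iri(authority: Authority, id: &str) -> Option<String> {
    let id = id.trim();
    if id.is_empty() {
        return None;
    }
    // Assumption: the stored value is the bare identifier, not a full URL.
    let prefix = match authority {
        Authority::Wikidata => "http://www.wikidata.org/entity/",
        Authority::Viaf => "http://viaf.org/viaf/",
        Authority::Gnd => "https://d-nb.info/gnd/",
    };
    Some(format!("{prefix}{id}"))
}
```

For the location shown in Step 2, `authority_iri(Authority::Wikidata, "Q54108")` gives `http://www.wikidata.org/entity/Q54108`.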

### Step 6 - Map to the Theatre Migrants ontology

#### Task

Define a custom OWL ontology (`teatre-migrants.ttl`) for domain-specific terms not covered by well-known vocabularies, published at `https://daniel.degu.cl/ontologies/theatre-migrants/` with prefix `tm:`. Reuse existing vocabularies where possible:

- **Schema.org** for persons, places, organizations, and occupations.
- **W3C Organization Ontology** (`org:`) for work engagements, modeled as `org:Membership` (replacing the original `migrants:work` class). Properties `org:member` and `org:organization` link the membership to the person and organization.
- **SKOS** for enumeration types as subclasses of `skos:Concept`.

Write SPARQL CONSTRUCT queries that produce a new graph using only the `tm:`, `schema:`, `org:`, `skos:`, `owl:`, `wgs84:`, and `wdtn:` vocabularies. The original `http://example.org/migrants/` predicates and class types are replaced; only entity IRIs retain the `migrants:` namespace.

#### Summary

The ontology `teatre-migrants.ttl` defines:

- **5 domain-specific classes:** `tm:Migration`, `tm:Relationship`, `tm:PersonName`, `tm:ReligionAffiliation`, `tm:ImportSource` (`tm:PersonProfession` was removed in Step 7).
- **11 enumeration classes** (all `rdfs:subClassOf skos:Concept`): `tm:Continent`, `tm:Country`, `tm:State`, `tm:City`, `tm:MigrationReason`, `tm:InstitutionType`, `tm:NameType`, `tm:RelationshipType`, `tm:RelationshipTypePrecise`, `tm:Religion`, `tm:EmploymentType`.
- Object and datatype properties with domains, ranges, and temporal uncertainty modeling (`tm:dateStartMin`, `tm:dateStartMax`, `tm:dateEndMin`, `tm:dateEndMax`, `tm:dateStartFuzzy`, `tm:dateEndFuzzy`).

12 SPARQL CONSTRUCT queries in `constructs_step06/` transform the graph:

| Query | Description |
|-------|-------------|
| 001-persons | Persons with `schema:Person` properties and `tm:` extensions |
| 002-places | Places with `wgs84:` coordinates and `tm:` geographic hierarchy |
| 003-organisations | Organizations with `schema:name` and `tm:institutionType` |
| 004-migrations | Migration events with `tm:migrant`, `tm:startPlace`, `tm:destinationPlace` |
| 005-memberships | Work engagements as `org:Membership` with `org:member`, `org:organization` |
| 006-relationships | Interpersonal relationships with `tm:activePerson`, `tm:passivePerson` |
| 007-person-professions | Person–profession associations |
| 008-person-names | Historical/alternative person names |
| 009-religion-affiliations | Religion affiliations with temporal bounds |
| 010a-occupations-passthrough | Pass through existing `schema:Occupation` instances |
| 010b-occupations-from-profession | Retype `migrants:Profession` as `schema:Occupation` |
| 011-enumerations | Map enumeration instances to `skos:Concept` with `tm:` subtypes |

The program `src/map/step_06.rs` loads `data/graph-05.ttl`, runs all CONSTRUCT queries, collects the resulting triples into a new graph, and writes `data/graph-06.ttl` (148,985 triples).

To run:

```sh
cargo run --release --bin step-06
```

### Step 7 - Clean up secondary organisations and simplify person–profession

#### Task

Two clean-up tasks are performed on the graph produced by Step 6:

**Secondary organisations.** `org:Membership` instances may carry a `tm:secondaryOrganisation` property in addition to `org:organization`. An analysis of the 1,222 memberships with a secondary organisation reveals:

| Category | Count |
|----------|------:|
| Secondary differs from primary | 736 |
| Secondary equals primary (redundant) | 230 |
| Secondary exists but no primary | 256 |
| **Total with secondary organisation** | **1,222** |

Two SPARQL UPDATE queries clean up these cases:

1. **Remove redundant secondary** — when `tm:secondaryOrganisation` equals `org:organization`, delete the secondary (230 triples removed).
2. **Promote secondary to primary** — when a membership has `tm:secondaryOrganisation` but no `org:organization`, move the secondary to primary (256 triples replaced).

After these updates, 736 memberships retain a `tm:secondaryOrganisation` that genuinely differs from the primary organisation.
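
The two clean-up rules can be summarised in a few lines of Rust. This is a sketch of the logic only, using a hypothetical `Membership` struct in place of the RDF graph; the actual work is done by the SPARQL UPDATE queries.

```rust
/// Primary (`org:organization`) and secondary (`tm:secondaryOrganisation`)
/// organisation of one membership, each possibly absent.
#[derive(Debug, Clone, PartialEq)]
struct Membership {
    primary: Option<String>,
    secondary: Option<String>,
}

/// Apply both clean-up rules to a single membership.
fn clean_membership(m: Membership) -> Membership {
    match (m.primary, m.secondary) {
        // Rule 1: the secondary repeats the primary, so drop it.
        (Some(p), Some(s)) if p == s => Membership { primary: Some(p), secondary: None },
        // Rule 2: no primary exists, so promote the secondary.
        (None, Some(s)) => Membership { primary: Some(s), secondary: None },
        // Otherwise the membership is left unchanged.
        (primary, secondary) => Membership { primary, secondary },
    }
}
```

A membership whose secondary organisation genuinely differs from the primary falls through to the last arm and keeps both values.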

**Person–profession simplification.** The `tm:PersonProfession` class modeled an intermediate node linking persons to professions (from the `person_profession` database table). Since both the `profession` and `Eprofession` columns represent occupation names (`schema:name`), the intermediate class is replaced by direct `schema:hasOccupation` links from persons to `schema:Occupation` instances. The `tm:PersonProfession` class and its properties (`tm:personProfessionPerson`, `tm:enumeratedProfession`, `tm:professionLabel`) are removed from the ontology.

#### Summary

5 SPARQL UPDATE queries in `updates_step07/`:

| Query | Description | Affected |
|-------|-------------|----------|
| 001 | Remove `tm:secondaryOrganisation` when it equals `org:organization` | 230 |
| 002 | Promote `tm:secondaryOrganisation` to `org:organization` when no primary exists | 256 |
| 003 | Add `schema:hasOccupation` from person to enumerated profession | 3 |
| 004 | Create `schema:Occupation` from profession label and add `schema:hasOccupation` | 730 |
| 005 | Remove all `tm:PersonProfession` instances | 742 |

The program `src/map/step_07.rs` loads `data/graph-06.ttl`, applies all queries, and writes `data/graph-07.ttl` (147,431 triples).

To run:

```sh
cargo run --release --bin step-07
```

### Step 8 - Merge migration reasons and refactor temporal properties

#### Task

Two structural changes are applied to the graph produced by Step 7:

**Merge migration reasons.** The functional properties `tm:reason` (774 uses) and `tm:secondaryReason` (163 uses) are replaced by a single non-functional property `tm:hasReason`, resulting in 937 reason triples.
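
In Rust terms, the merge amounts to collecting two optional values into one list. The function below is a hypothetical illustration; it removes a repeated value because an RDF graph stores identical triples only once.

```rust
/// Merge `tm:reason` and `tm:secondaryReason` into the values of `tm:hasReason`.
fn merge_reasons(reason: Option<&str>, secondary: Option<&str>) -> Vec<String> {
    let mut merged: Vec<String> = reason
        .into_iter()
        .chain(secondary)
        .map(str::to_owned)
        .collect();
    // At most two entries exist, so removing adjacent duplicates is enough.
    merged.dedup();
    merged
}
```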

**Refactor temporal properties into tm:FuzzyInterval.** Six flat date properties (`tm:dateStartMin`, `tm:dateStartMax`, `tm:dateEndMin`, `tm:dateEndMax`, `tm:dateStartFuzzy`, `tm:dateEndFuzzy`) are replaced by a structured model based on W3C OWL-Time. A new class `tm:FuzzyInterval` (subclass of `time:TemporalEntity`) is introduced, with two object properties `tm:uncertainBeginning` and `tm:uncertainEnd` pointing to `time:DateTimeInterval` resources. Each interval has `time:hasBeginning` and `time:hasEnd` linking to `time:Instant` nodes with `time:inXSDDate` values, plus an optional `rdfs:label` for fuzzy date strings. Five classes are declared as subclasses of `tm:FuzzyInterval`: `tm:Migration`, `org:Membership`, `tm:Relationship`, `tm:PersonName`, `tm:ReligionAffiliation`.
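
The shape of this refactoring can be sketched with plain Rust structs. All type and function names are hypothetical, dates are kept as strings, and the sketch assumes that an uncertain bound is created whenever any of its three source properties is present; the queries in `updates_step08/` may apply a stricter condition.

```rust
/// The six flat date properties of one resource, before Step 8.
#[derive(Debug, Default, Clone, PartialEq)]
struct FlatDates {
    start_min: Option<String>,   // tm:dateStartMin
    start_max: Option<String>,   // tm:dateStartMax
    end_min: Option<String>,     // tm:dateEndMin
    end_max: Option<String>,     // tm:dateEndMax
    start_fuzzy: Option<String>, // tm:dateStartFuzzy
    end_fuzzy: Option<String>,   // tm:dateEndFuzzy
}

/// One uncertain bound, modeled in the graph as a time:DateTimeInterval.
#[derive(Debug, Clone, PartialEq)]
struct UncertainBound {
    earliest: Option<String>, // time:hasBeginning -> time:inXSDDate
    latest: Option<String>,   // time:hasEnd -> time:inXSDDate
    label: Option<String>,    // rdfs:label holding the fuzzy date string
}

/// The structured replacement, mirroring tm:FuzzyInterval.
#[derive(Debug, Clone, PartialEq)]
struct FuzzyInterval {
    uncertain_beginning: Option<UncertainBound>, // tm:uncertainBeginning
    uncertain_end: Option<UncertainBound>,       // tm:uncertainEnd
}

/// Build a bound, or `None` when no source property is present.
fn bound(
    earliest: Option<String>,
    latest: Option<String>,
    label: Option<String>,
) -> Option<UncertainBound> {
    if earliest.is_none() && latest.is_none() && label.is_none() {
        return None;
    }
    Some(UncertainBound { earliest, latest, label })
}

/// Convert the flat properties into the structured model.
fn to_fuzzy_interval(d: FlatDates) -> FuzzyInterval {
    FuzzyInterval {
        uncertain_beginning: bound(d.start_min, d.start_max, d.start_fuzzy),
        uncertain_end: bound(d.end_min, d.end_max, d.end_fuzzy),
    }
}
```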

#### Summary

8 SPARQL UPDATE queries in `updates_step08/`:

| Query | Description |
|-------|-------------|
| 001 | Merge `tm:reason` and `tm:secondaryReason` into `tm:hasReason` |
| 002a | Create `tm:uncertainBeginning` interval from `tm:dateStartMin` |
| 002b | Add upper bound to `tm:uncertainBeginning` from `tm:dateStartMax` |
| 003a | Create `tm:uncertainEnd` interval from `tm:dateEndMin` |
| 003b | Add upper bound to `tm:uncertainEnd` from `tm:dateEndMax` |
| 004a | Add `rdfs:label` on `uncertainBeginning` from `tm:dateStartFuzzy` |
| 004b | Add `rdfs:label` on `uncertainEnd` from `tm:dateEndFuzzy` |
| 005 | Remove all 6 old date properties |

The program `src/map/step_08.rs` loads `data/graph-07.ttl`, applies all queries, and writes `data/graph-08.ttl`.

To run:

```sh
cargo run --release --bin step-08
```