# Theatre Migrants

This project generates a knowledge graph about migrants in European theatre.

## Running the scripts

The mapping scripts have been reimplemented in Rust for faster execution. Both
scripts must be run from this directory (`mapping/`).

**Prerequisites:** Start the MariaDB container before running step 1:

```sh
docker compose up -d
```

**Step 1:** Direct Mapping from MariaDB to RDF (writes `data/graph-01.ttl`):

```sh
cargo run --release --bin step-01
```

**Step 2:** Apply SPARQL UPDATE queries (writes `data/graph-02.ttl`):

```sh
cargo run --release --bin step-02
```

Alternatively, after installing with `cargo install --path .`:

```sh
step-01
step-02
```

## Generating the ontology

The following steps describe how to generate the migrants RDF graph.

### Step 1 - Loading the input data into a relational database

#### Task

The file `teatre-migrants.sql` contains a dump of a MariaDB database. The tables in its schema are described in the file `db_schema.md`. We will load this dump into MariaDB so that the data can be queried with SQL. To this end:

1. Create a Dockerfile that builds a Docker container for MariaDB.

2. Load the dump into a database in the container.

3. Create a Rust program `src/map/step_01.rs` that connects to the database. This program should write a file called `graph-01.ttl` containing all the data from the loaded tables, using the direct mapping from relational databases to RDF.

#### Summary

The `Dockerfile` creates a MariaDB 10.11 container that automatically loads `teatre-migrants.sql` on first start. The `docker-compose.yml` file exposes the database on port 3306 with a healthcheck.

The program `src/map/step_01.rs` connects to the database and implements the [W3C Direct Mapping](https://www.w3.org/TR/rdb-direct-mapping/) for all 9 tables (`location`, `migration_table`, `organisation`, `person`, `person_profession`, `personnames`, `relationship`, `religions`, `work`). Each table row becomes an RDF resource identified by its primary key, each column becomes a datatype property, and each foreign key becomes an object property linking to the referenced row. The output file `graph-01.ttl` contains 162,029 triples.

To run:

```sh
docker compose up -d
cargo run --release --bin step-01
```

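The row-to-resource rule described in the Step 1 summary can be sketched in a few lines of Rust. This is a minimal, std-only illustration and not the code in `src/map/step_01.rs`: the `Cell` type and the `row_to_turtle` function are hypothetical names, the base IRI is supplied by the caller, and keys are assumed to need no percent-encoding.

```rust
/// One column value of a relational row.
/// ASSUMPTION: hypothetical type for illustration; not taken from the project.
pub enum Cell<'a> {
    /// A plain column value, emitted as a string literal.
    Literal(&'a str),
    /// A foreign key, emitted as a link to the referenced row.
    ForeignKey { target_table: &'a str, key: &'a str },
}

/// Escapes a value for use inside a double-quoted Turtle literal.
fn escape_literal(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for c in value.chars() {
        match c {
            '\\' => out.push_str("\\\\"),
            '"' => out.push_str("\\\""),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            other => out.push(other),
        }
    }
    out
}

/// Renders one row as Turtle with full IRIs:
/// the row becomes `<base/table/key>`, typed as `<base/table>`,
/// and each column becomes the property `<base/table#column>`.
/// ASSUMPTION: keys and column names are already IRI-safe.
pub fn row_to_turtle(base: &str, table: &str, key: &str, cells: &[(&str, Cell)]) -> String {
    let mut out = format!("<{base}{table}/{key}> a <{base}{table}>");
    for (column, cell) in cells {
        let object = match cell {
            Cell::Literal(v) => format!("\"{}\"", escape_literal(v)),
            Cell::ForeignKey { target_table, key } => format!("<{base}{target_table}/{key}>"),
        };
        out.push_str(&format!(" ;\n    <{base}{table}#{column}> {object}"));
    }
    out.push_str(" .\n");
    out
}
```

For example, calling `row_to_turtle("http://example.org/", "location", "ARG-BahBlanca-00", &[("City", Cell::Literal("Bahia Blanca"))])` produces a resource typed as `<http://example.org/location>` with a single `location#City` property. The base `http://example.org/` is a placeholder.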
### Step 2 - Generate Objects

Values such as continents and countries should be objects instead of literals. To this end, we can transform the following data:

```
base:location\/ARG-BahBlanca-00 a base:location;
    base:location\#City "Bahia Blanca";
    base:location\#Continent "South America";
    base:location\#Country "Argentina";
    base:location\#GeoNamesID "3865086";
    base:location\#IDLocation "ARG-BahBlanca-00";
    base:location\#latitude -3.87253e1;
    base:location\#longitude -6.22742e1;
    base:location\#wikidata "Q54108";
    base:location\#wikipedia "https://en.wikipedia.org/wiki/Bah%C3%ADa_Blanca" .
```

Into the following data:

```
base:location\/ARG-BahBlanca-00 a base:location;
    base:location\#City base:City-BahiaBlanca;
    base:location\#Continent base:Continent-SouthAmerica;
    base:location\#Country base:Country-Argentina;
    base:location\#GeoNamesID "3865086";
    base:location\#IDLocation "ARG-BahBlanca-00";
    base:location\#latitude -3.87253e1;
    base:location\#longitude -6.22742e1;
    base:location\#wikidata "Q54108";
    base:location\#wikipedia "https://en.wikipedia.org/wiki/Bah%C3%ADa_Blanca" .

base:City-BahiaBlanca a base:City;
    rdfs:label "Bahia Blanca"@en .

base:Continent-SouthAmerica a base:Continent;
    rdfs:label "South America"@en .

base:Country-Argentina a base:Country;
    rdfs:label "Argentina"@en .
```

Note that every `rdfs:label` value is tagged as English (`@en`).

Write a SPARQL UPDATE query that performs this transformation for all elements of the table and save it in a new folder called `updates`. Do the same for the other tables, proposing which columns should be defined as objects. Define a separate SPARQL UPDATE query for each table and save it in the `updates` folder. Number the generated queries with a prefix such as 001, 002, 003, and so on.

After generating the update queries, write a Rust program that executes the updates on the RDF graph generated in the previous step and saves the resulting graph as `data/graph-02.ttl`.

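One way to produce such queries is to generate them from a (table, column, class) triple. The sketch below is an illustration under assumptions rather than the contents of `updates/`: `mint_local_name` and `objectify_query` are hypothetical helpers, the base IRI is a caller-supplied placeholder, the object IRI is minted by stripping whitespace from the literal (as in `City-BahiaBlanca`), and empty strings are skipped so that no object is minted for them.

```rust
/// Builds the local name of a minted object,
/// e.g. ("City", "Bahia Blanca") becomes "City-BahiaBlanca".
pub fn mint_local_name(class: &str, label: &str) -> String {
    let compact: String = label.chars().filter(|c| !c.is_whitespace()).collect();
    format!("{class}-{compact}")
}

/// Builds a SPARQL UPDATE that replaces the literal values of one column
/// with object references and gives each object a type and an English label.
/// ASSUMPTION: property IRIs follow the `<base><table>#<column>` shape and
/// literal values contain no characters that are illegal in an IRI.
pub fn objectify_query(base_iri: &str, table: &str, column: &str, class: &str) -> String {
    format!(
        r#"PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DELETE {{ ?row <{base_iri}{table}#{column}> ?value }}
INSERT {{
  ?row <{base_iri}{table}#{column}> ?object .
  ?object a <{base_iri}{class}> ;
          rdfs:label ?label .
}}
WHERE {{
  ?row <{base_iri}{table}#{column}> ?value .
  FILTER(isLiteral(?value) && STR(?value) != "")
  BIND(IRI(CONCAT("{base_iri}{class}-", REPLACE(STR(?value), "\\s+", ""))) AS ?object)
  BIND(STRLANG(STR(?value), "en") AS ?label)
}}
"#
    )
}
```

Generating the text keeps all per-column queries structurally identical; values with characters that are not valid in an IRI would need extra handling (for instance `ENCODE_FOR_URI`) that this sketch leaves out.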
#### Summary

19 SPARQL UPDATE queries in `updates/` transform literal values into typed objects across all tables:

| Query | Table | Column | Object type |
|-------|-------|--------|-------------|
| 001 | location | Continent | Continent |
| 002 | location | Country | Country |
| 003 | location | State | State |
| 004 | location | City | City |
| 005 | migration_table | reason | MigrationReason |
| 006 | migration_table | reason2 | MigrationReason |
| 007 | organisation | InstType | InstitutionType |
| 008 | person | gender | Gender |
| 009 | person | Nametype | Nametype |
| 010 | person | Importsource | ImportSource |
| 011 | person_profession | Eprofession | Profession |
| 012 | personnames | Nametype | Nametype |
| 013 | relationship | Relationshiptype | RelationshipType |
| 014 | relationship | relationshiptype_precise | RelationshipTypePrecise |
| 015 | religions | religion | Religion |
| 016 | work | Profession | Profession |
| 017 | work | Profession2 | Profession |
| 018 | work | Profession3 | Profession |
| 019 | work | EmploymentType | EmploymentType |

Each query replaces a literal value with an object reference and creates the object with `rdf:type` and `rdfs:label` (in English). The program `src/map/step_02.rs` loads `data/graph-01.ttl`, applies all queries in order, and writes `data/graph-02.ttl` (164,632 triples).

To run:

```sh
cargo run --release --bin step-02
```

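Applying the queries "in order" depends on reading the `updates/` folder deterministically. The following std-only sketch shows one way to do that, assuming the zero-padded numeric prefixes (001, 002, ...) make file-name order equal to execution order. The function names are hypothetical, and executing each query against the graph is left to whichever SPARQL engine the project uses.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Returns the regular files in `dir`, sorted by file name.
/// With zero-padded prefixes, this is also the intended execution order.
pub fn ordered_update_files(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut files = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_file() {
            files.push(path);
        }
    }
    files.sort_by(|a, b| a.file_name().cmp(&b.file_name()));
    Ok(files)
}

/// Reads every update query in execution order, pairing each path with its text.
pub fn load_updates(dir: &Path) -> io::Result<Vec<(PathBuf, String)>> {
    ordered_update_files(dir)?
        .into_iter()
        .map(|path| -> io::Result<(PathBuf, String)> {
            let query = fs::read_to_string(&path)?;
            Ok((path, query))
        })
        .collect()
}
```

Sorting explicitly matters because `read_dir` makes no promise about the order in which entries are returned.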
### Step 3 - Annotate datatypes

In the previous example we have dates like "1894-12-31", which are represented with the `xsd:string` datatype. Please infer the datatypes of these literals and create a new SPARQL query that generates a new RDF graph in which literals use these datatypes.

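Before writing the query, it can help to pin down the inference rule itself. The sketch below is one possible rule set, written in Rust purely as an illustration: the `XsdType` enum and `infer_datatype` function are hypothetical, only dates, integers, and decimals are recognised, and the date check validates the `YYYY-MM-DD` shape and month/day ranges without checking the calendar.

```rust
/// XSD datatypes this sketch can recognise.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum XsdType {
    Date,
    Integer,
    Decimal,
    String,
}

impl XsdType {
    /// Full IRI of the datatype in the XML Schema namespace.
    pub fn iri(self) -> &'static str {
        match self {
            XsdType::Date => "http://www.w3.org/2001/XMLSchema#date",
            XsdType::Integer => "http://www.w3.org/2001/XMLSchema#integer",
            XsdType::Decimal => "http://www.w3.org/2001/XMLSchema#decimal",
            XsdType::String => "http://www.w3.org/2001/XMLSchema#string",
        }
    }
}

/// Removes one optional leading sign.
fn strip_sign(s: &str) -> &str {
    s.strip_prefix('-').or_else(|| s.strip_prefix('+')).unwrap_or(s)
}

fn all_digits(s: &str) -> bool {
    !s.is_empty() && s.bytes().all(|b| b.is_ascii_digit())
}

/// True for `YYYY-MM-DD` with month 01..12 and day 01..31.
fn is_date(s: &str) -> bool {
    let b = s.as_bytes();
    if b.len() != 10 || b[4] != b'-' || b[7] != b'-' {
        return false;
    }
    let digits_ok = b
        .iter()
        .enumerate()
        .all(|(i, c)| i == 4 || i == 7 || c.is_ascii_digit());
    if !digits_ok {
        return false;
    }
    let month: u32 = s[5..7].parse().unwrap_or(0);
    let day: u32 = s[8..10].parse().unwrap_or(0);
    (1..=12).contains(&month) && (1..=31).contains(&day)
}

fn is_decimal(s: &str) -> bool {
    match strip_sign(s).split_once('.') {
        Some((int, frac)) => all_digits(int) && all_digits(frac),
        None => false,
    }
}

/// Infers a datatype from the lexical form alone; anything unrecognised stays a string.
pub fn infer_datatype(lexical: &str) -> XsdType {
    if is_date(lexical) {
        XsdType::Date
    } else if all_digits(strip_sign(lexical)) {
        XsdType::Integer
    } else if is_decimal(lexical) {
        XsdType::Decimal
    } else {
        XsdType::String
    }
}
```

Inference from the lexical form alone has a known weakness: identifier columns such as `GeoNamesID` look like integers even though they are better kept as strings, so a per-column allow-list is probably safer than applying the rule to every literal.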
### Step 4 - Replace empty strings with unbound values

Intuitively, the triple

```
work:4 workp:comment "" .
```

is not intended to mean that the comment is "", but that there is no comment. So, write a query that excludes these comments from the next generated graph.

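A minimal sketch of this clean-up follows, assuming that every empty string literal (not only comments) should be dropped. The `Triple` and `Object` types are hypothetical in-memory stand-ins for whatever RDF library the project uses, and `DROP_EMPTY_LITERALS` is one possible SPARQL UPDATE with the same effect.

```rust
/// Object position of a triple: either an IRI or a plain literal.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Object {
    Iri(String),
    Literal(String),
}

/// A simplified triple (ASSUMPTION: illustrative type, not the project's model).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Triple {
    pub subject: String,
    pub predicate: String,
    pub object: Object,
}

/// SPARQL UPDATE that deletes every triple whose object is an empty literal.
pub const DROP_EMPTY_LITERALS: &str = r#"DELETE { ?s ?p ?o }
WHERE {
  ?s ?p ?o .
  FILTER(isLiteral(?o) && STR(?o) = "")
}"#;

/// In-memory equivalent of the query above: keeps IRIs and non-empty literals.
pub fn drop_empty_literals(triples: Vec<Triple>) -> Vec<Triple> {
    triples
        .into_iter()
        .filter(|t| !matches!(&t.object, Object::Literal(v) if v.is_empty()))
        .collect()
}
```

To restrict the clean-up to comments only, the `?p` variable in the query would be replaced by the comment property.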
### Step 5 - Use Schema.org

Some classes, properties, and individuals can be represented with Schema.org. For example, the class `migrants:person` can be represented with the class `schema:Person`. Please propose which of these elements could use the Schema.org vocabulary and write a SPARQL query that generates the next graph.

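As a starting point for the proposal, the class-level part of the mapping could be kept in a small lookup table and turned into one retyping update per class. In the sketch below only `person` to `schema:Person` comes from the task description. The `organisation` and `location` entries are suggestions that would need review, and the function names and the `<base><table>` class IRI shape are assumptions.

```rust
/// Proposed Schema.org class for a Direct Mapping class, if any.
pub fn schema_org_class(table: &str) -> Option<&'static str> {
    match table {
        // Given in the task description.
        "person" => Some("https://schema.org/Person"),
        // ASSUMPTION: suggested equivalents, not confirmed by the project.
        "organisation" => Some("https://schema.org/Organization"),
        "location" => Some("https://schema.org/Place"),
        _ => None,
    }
}

/// Builds a SPARQL UPDATE that retypes every instance of `<base_iri><table>`
/// to the proposed Schema.org class. Returns `None` when no mapping is proposed.
pub fn retype_query(base_iri: &str, table: &str) -> Option<String> {
    let target = schema_org_class(table)?;
    Some(format!(
        "DELETE {{ ?s a <{base_iri}{table}> }}\n\
         INSERT {{ ?s a <{target}> }}\n\
         WHERE  {{ ?s a <{base_iri}{table}> }}\n"
    ))
}
```

Properties and individuals would need their own mapping tables; replacing a type outright also discards the original class, so adding the Schema.org type alongside it is a reasonable alternative.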