TermFactory Manual
© Lauri Carlson 2007-2012
TermFactory is an architecture and a workflow for Semantic Web based, multilingual, collaborative terminology work.
TF provides the means for people to work on terms jointly or separately, building on the results of the work of others, while maintaining quality and consistency between the different contributions.
TermFactory comprises software based on Semantic Web technologies together with operating guidelines.
The purpose of TermFactory is computer-aided, collaborative, professional (precise) multilingual terminology work on the web.
The TF pyramid is not necessarily wedge-shaped in reality; it can be more of an amoeba. It is drawn as a triangle here only because there are many more special concepts than general ones, and many more people working at the lower rungs than at the top. The divisions of the pyramid depict both the composite structure of the content and the division of work between experts in different fields.
TF architecture
A TF term, an instance of TF class term:Term, is an association (ordered pair) of a language-independent concept with a natural language designation. (This accords with terminology standard DIN 2342-1 and de Saussure's definition of a linguistic sign.)
Consider the two strings home and hallitus. English home has a general-language meaning, "abode", and a special-domain meaning, "institution" (as in "we must put granny into a home"). The same string means "mildew" in Finnish. The Estonian word for mildew is hallitus, and that same string means "government" in Finnish. Each such pairing of a meaning (blue) with a form (yellow) is a term/sign (green) in TF.
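The pairing of meanings and forms in this example can be sketched as a small data structure. This is only an illustration in Python, not TF's own representation: the class names and the plain-string concept labels are assumptions made for the sketch, whereas TF itself identifies concepts, designations, and terms with URIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Designation:
    """A natural-language form: a string tagged with its language code."""
    lang: str
    text: str


@dataclass(frozen=True)
class Term:
    """A sign: an ordered pair of a concept and a designation."""
    concept: str  # illustrative label, not a real TF concept URI
    designation: Designation


# The five signs described in the home / hallitus example.
TERMS = [
    Term("abode", Designation("en", "home")),
    Term("institution", Designation("en", "home")),
    Term("mildew", Designation("fi", "home")),
    Term("mildew", Designation("et", "hallitus")),
    Term("government", Designation("fi", "hallitus")),
]


def terms_for_string(text):
    """All terms whose designation has the given string, in any language."""
    return [t for t in TERMS if t.designation.text == text]


def designations_for_concept(concept):
    """All designations paired with the given concept."""
    return [t.designation for t in TERMS if t.concept == concept]
```

Looking up the string home returns three distinct terms, while looking up the concept "mildew" returns one Finnish and one Estonian designation. The same string can therefore take part in several terms, and the same concept in several terms.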
| Subset | gloss |
|---|---|
| TFS.owl | contains just enough to support day-to-day business with validated TF terms. |
| TFTop.owl | extends TFS.owl for general language vocabulary, in particular, |
| TFwn.owl | bridges Princeton Wordnet to TFTop.owl |
| TFSem.owl | extends TFS.owl for NL semantics, in particular, semantic role frames. |
| TFProp.owl | contains a taxonomy of TF properties for querying and conversion. |
| TFL10n.owl | and its language specific subsets TFL10n-??.owl localize TF vocabulary. |
| TFLang.owl | contains ISO language codes in TF format. |
| TFCtry.owl | contains ISO country codes in TF format. |
| TFDom.owl | contains the TF subject field classification for querying and browsing. |
| TFStrict.owl | contains integrity constraints for validating a TF term ontology. |
| DictionaryStrict.owl | separates terms from expressions. |
| LegacyStrict.owl | separates terms from concepts. |
WordNet is a large open-source lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets. WordNet has been converted by others to RDF/OWL more than once. We bridge WordNet to TF rather than produce yet another version. See the section on WordNet conversion.
WordNet 3.0 OWL has almost half a million URIs. FinWordnet adds another couple of hundred thousand.
WordNet consists of about 100K synsets. For editing the English-Finnish WordNet, TF splits the synsets into entry-size files for English terms, Finnish terms, and (English) relationships. An entry is assembled from these pieces on demand by a PHP entry generator script, shown below. A URL of the form wn30entry?e=synset-entity-noun-1 generates the WordNet entry for synset-entity-noun-1.
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix meta: <http://tfs.cc/meta/> .
    @prefix Wn30: <http://tfs.cc/wn30/> .
    @prefix Wn30en: <http://tfs.cc/wn30/en/> .
    @prefix Wn30enh: <http://tfs.cc/wn30/en/h/> .
    @prefix Wn30fi: <http://tfs.cc/wn30/fi/> .

    [] rdf:type meta:Entry ;
        meta:active Wn30fi:fi-<?=$_GET['e']?>.ttl ;
        meta:schema <http://tfs.cc/owl/wn/TFwn.owl> ;
        meta:template <file:etc/templates/sem.ttl> .

    Wn30:<?=$_GET['e']?>.ttl rdf:type owl:Ontology ;
        owl:imports Wn30en:en-<?=$_GET['e']?>.ttl ;
        owl:imports Wn30enh:enh-<?=$_GET['e']?>.ttl ;
        owl:imports Wn30fi:fi-<?=$_GET['e']?>.ttl ;
        owl:versionInfo "TF WordNet version 0.2 21.01.2012" .
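To make the substitution performed by the entry generator explicit, here is a minimal Python sketch that expands an entry name into the URIs the template produces. The prefixes and file name patterns are taken from the script above. The function name and the input check are assumptions added for illustration: the PHP script inserts the e parameter as it stands, and the set of characters allowed here is a guess.

```python
import re

# Namespace prefixes as declared in the entry generator template.
PREFIXES = {
    "Wn30": "http://tfs.cc/wn30/",
    "Wn30en": "http://tfs.cc/wn30/en/",
    "Wn30enh": "http://tfs.cc/wn30/en/h/",
    "Wn30fi": "http://tfs.cc/wn30/fi/",
}

# File name patterns used in the template for each prefix.
PATTERNS = {
    "Wn30": "{e}.ttl",
    "Wn30en": "en-{e}.ttl",
    "Wn30enh": "enh-{e}.ttl",
    "Wn30fi": "fi-{e}.ttl",
}

# Assumption: entry names use only letters, digits, '-', '_' and '.'.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def entry_uris(entry):
    """Expand an entry name into the full URIs the template would emit."""
    if not SAFE_NAME.fullmatch(entry):
        raise ValueError(f"unexpected characters in entry name: {entry!r}")
    return {
        prefix: PREFIXES[prefix] + PATTERNS[prefix].format(e=entry)
        for prefix in PREFIXES
    }
```

For example, entry_uris("synset-entity-noun-1")["Wn30fi"] is the URI of the Finnish piece, which the template also names as the active ontology of the entry.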
The current TF WordNet conversion uses the W3C namespace URIs.
The OWL version of the Suggested Upper Merged Ontology SUMO (SUMO.owl) contains around 10K classes at the top. The full ontology with instances contains about 250K URIs, including names of airports, languages, cities, etc.
SUMO has been tentatively bridged with TF in the bridge ontology SUMO2TFS.owl.
The Finnish national FinnOnto project has built a sizable thesaurus ontology collection, available at ONKI. Parts of it have been translated into English and Swedish. TF bridges to ONKI in the TF domain classification ontology TFDom.owl.
TF classes and properties like concept, term, and expression are also described in TF as (instances of) concepts, terms, and expressions. This means that TF is capable of reflection: it can document and localise itself. The TF schema localisation ontology TFL10n.owl provides definitions and translations of TFS descriptive classes and properties. This information is used to change the language in the TF front end tools. Note how the property names in the example below have been partially localised into Finnish (the coverage depends on the localisation ontology). This is not interface localisation, but content localisation (if the dichotomy makes sense at all here).
Working experience with ontologies shows that large ontologies take a lot of space and time to process. The approach to ontology work taken in TF is "small models": instead of working with a complete ontology, make it easy to extract and merge working subsets from larger ontologies.
Ontologies easily grow large and hard to manage. TermFactory aims to break ontologies down into manageable parts that can be freely combined.
In the TermFactory solution, ontologies are retrieved and split into pieces with ontology queries. Queries can be used to pick the desired extracts from a vocabulary. Queries can be named, and their results can be imported into other ontologies.
Example:
A TF query import URI points to a TF service which returns the result of a query as a model (possibly one already cached in the repository database or filesystem).
Here is an example of a query import:
Unlike traditional terminology work, TermFactory is not based on term entries. In an ontology, information consists of independent statements describing concepts, expressions, and their relations, from which many kinds of descriptions can be composed. The traditional concept-oriented term entry is just one of the possible compositions. TermFactory composes different views of terms with term queries (DESCRIBE queries). Example:
use a DESCRIBE query pattern to select the entities to describe (minimally, DESCRIBE r1 r2 ... ). The TF DESCRIBE query
TermFactory supports all of the ways of working mentioned above.
This is an XHTML format for viewing and editing TF entries on (X/HT)ML platforms. The format is not limited to TermFactory; it can be used to view and edit any RDF. The layout of the XHTML document can be customised with a template that is itself written in RDF.
The TF3 (TF triple) normal form of a TF ontology aims to minimise free variation between file versions.
JSON (an acronym for JavaScript Object Notation pronounced /ˈdʒeɪsən/) is a lightweight text-based open standard designed for human-readable data interchange. It is derived from the JavaScript scripting language for representing simple data structures and associative arrays, called objects. Despite its relationship to JavaScript, it is language-independent, with parsers available for most scripting languages. The JSON format was originally specified by Douglas Crockford, and is described in RFC 4627. The official Internet media type for JSON is application/json. The JSON filename extension is .json. The JSON format is often used for serializing and transmitting structured data over a network connection. It is primarily used to transmit data between a server and web application, serving as an alternative to XML.
There are several proposals for representing RDF in JSON. TermFactory currently supports the Talis specification. There is no general JSON to RDF reader in TF as yet. Script io/script/json2tf, which wraps Java class com.grapson.tf.rev.jena/Json2TF, converts the TF JSON localization file format back to TF.
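As an illustration of the Talis RDF/JSON layout, the following Python sketch groups triples into the nested subject / predicate / object-list structure that the specification describes. The subject and class URIs under example.org and the helper names are hypothetical, and the sketch says nothing about how TF itself produces this format.

```python
import json

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def uri(value):
    """An object node that is a resource."""
    return {"type": "uri", "value": value}


def literal(value, lang=None):
    """An object node that is a literal, optionally language-tagged."""
    node = {"type": "literal", "value": value}
    if lang:
        node["lang"] = lang
    return node


def to_rdf_json(triples):
    """Group (subject, predicate, object-node) triples as RDF/JSON:
    {subject: {predicate: [object-node, ...]}}."""
    doc = {}
    for subject, predicate, obj in triples:
        doc.setdefault(subject, {}).setdefault(predicate, []).append(obj)
    return doc


# Hypothetical subject and class URIs, used only for this example.
SUBJECT = "http://example.org/term/en-home"
TRIPLES = [
    (SUBJECT, RDF_TYPE, uri("http://example.org/Term")),
    (SUBJECT, RDFS_LABEL, literal("home", lang="en")),
]

if __name__ == "__main__":
    print(json.dumps(to_rdf_json(TRIPLES), indent=2))
```

Each subject appears once as a top-level key, so several statements about the same resource end up grouped under it.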
SPARQL SELECT query results are produced in the Jena result set JSON serialization.
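A client can read such a result set with any JSON parser. The sketch below assumes that the Jena serialization follows the W3C SPARQL Query Results JSON Format (a head with variable names and a list of bindings); the sample document and its example.org URI are made up for illustration.

```python
import json

# A made-up result set in the W3C SPARQL Query Results JSON layout.
SAMPLE = """
{
  "head": {"vars": ["inst", "label"]},
  "results": {
    "bindings": [
      {
        "inst": {"type": "uri", "value": "http://example.org/synset-entity-noun-1"},
        "label": {"type": "literal", "xml:lang": "en", "value": "entity"}
      }
    ]
  }
}
"""


def rows(result_json):
    """Flatten a SELECT result set into a list of {variable: value} dicts.

    Variables left unbound in a solution are simply absent from its row.
    """
    doc = json.loads(result_json)
    variables = doc["head"]["vars"]
    return [
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in doc["results"]["bindings"]
    ]
```

Calling rows(SAMPLE) yields one row with the inst and label values. Type and language information is dropped here for brevity; a real client would probably keep it.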
Another de facto standard JSON format for RDF is the Exhibit JSON format from the MIT Simile project. An online converter, Babel, is available.
TBX (Termbase Exchange format) is a localisation industry (LISA) standard for the interchange of terminology data including detailed lexical information.
WordNet 2.0 was converted for the W3C by Mark van Assem. More recently, the conversion has been updated to WordNet 3.0.
Talis.com provides a SPARQL endpoint for WordNet 3.0.
Variants of the following query can be used to inspect the contents of the Talis.com store:
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    DESCRIBE ?inst WHERE {
        ?inst rdfs:label ?label .
        FILTER(REGEX(STR(?label),"^entity"))
    }
    LIMIT 3
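Variants of this query differ only in the label prefix and the limit, so they can be generated from a template. The Python sketch below does that and encodes the query as a GET request in the style of the SPARQL protocol (a query parameter on the endpoint URL). The function names are hypothetical, the endpoint address is left to the caller because the source does not give it, and restricting the prefix to letters and digits is a simplification that avoids escaping issues.

```python
from urllib.parse import urlencode

QUERY_TEMPLATE = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
DESCRIBE ?inst WHERE {{
  ?inst rdfs:label ?label .
  FILTER(REGEX(STR(?label), "^{prefix}"))
}}
LIMIT {limit}"""


def describe_by_label_prefix(prefix, limit=3):
    """Build a DESCRIBE query for resources whose label starts with prefix."""
    if not prefix.isalnum():
        # Simplification: refuse anything that would need regex or string escaping.
        raise ValueError("prefix must consist of letters and digits only")
    return QUERY_TEMPLATE.format(prefix=prefix, limit=int(limit))


def request_url(endpoint, query):
    """Encode a query as a GET request URL for a SPARQL endpoint."""
    return endpoint + "?" + urlencode({"query": query})
```

For instance, describe_by_label_prefix("entity") reproduces the filter and limit of the query shown above, and request_url wraps it for whichever endpoint is in use.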
Another conversion of WordNet 3.0 to OWL is available from the SUMO site, ontologyportal.org. This version has been linked to SUMO.
The data models of both WordNet to RDF/OWL conversions match the TF sign structure. Word, WordSense, and Synset are distinguished as subclasses of TF form, sign, and meaning. Both conversions have been bridged to TF.
All of WordNet 3.0 has been translated into Finnish in the FIN-CLARIN project. For TF, the Finnish WordNet translations have been converted from XML to OWL for both the SUMO and vu.nl versions. The current TF WordNet conversion uses the W3C namespace.
This slide shows how the different tools play together. Retry locates entries and documents in the TF cloud. Rewrite converts between formats and does refactoring of names. Pellet4TF is the query engine and reasoner. Edit4TF supports ontology editing. Index4TF handles uploading of entries and documents to TermFactory repositories. A clockwise roundtrip on the perimeter of the figure traces the revisioning cycle of an ontology/entry through TF.
Big web service queries can be made with the QueryService Axis2 operations setQuery/getQuery.
The TF QueryForm servlet provides a RESTful endpoint for the TF back-end query service.
With parameter form=true, it serves a standalone query form.
The TF EditForm servlet provides a RESTful endpoint for the functions of the TF back-end query service.
With parameter form=true, it serves a standalone editor.
An early mockup of TermFactory specific web front end interfaces is shown below.
The current notion is that TF provides plugins and services for third party platforms.
MediaWiki and Drupal are being used as testbeds.
A dedicated user interface of this kind was originally planned.
The current TF provides tools for implementing such user interfaces on different platforms.
With parameter form=true, the TF QueryForm servlet provides a standalone query form. Option lang=fi produces a Finnish version. The TF query form exemplifies the use of the TF QueryService in a web application.
With parameter form=true, the TF EditForm servlet serves a standalone editor. Option lang=fi produces a Finnish version.
There are a variety of ways of editing TF content, depending on which facilities and options are used. The following table singles out some of them.
| action | editor | query | edit | cache | active | schema | template |
|---|---|---|---|---|---|---|---|
| Discuss and comment | | | | | | | |
| Edit entries locally | yes | | | | | | |
| Edit web source | yes | yes | | | | | |
| Validate edits | yes | yes | yes | | | | |
| Save and share source in TF database | yes | yes | yes | | | | |
| Save and share edits in TF database | yes | yes | yes | | | | |
| Edit an ontology entry | yes | yes | yes | yes | | | |
| Edit third party format as TF | yes | yes | | | | | |
| Edit 3rd party ontology entry | yes | yes | yes | yes | yes | yes | yes |
| Edit any HTML content | yes | yes | yes | yes | | | |
This figure shows the component types of a TermFactory system graph. Each type of component is exemplified with a concrete instance (one actually used in the reference implementation).
In a professional terminology use scenario, participants in the terminology workflow can assume different roles.
TermFactory participants have different roles.
Matching disease names in the PULS and BioCaster epidemic ontologies can be listed with a single TF query.
The combined PULS/BioCaster epidemic ontology contains about 3000 classes and 20K instances, altogether about 300K triples. Their materialization (inference closure) contains about 1M triples.
The disease match query on the merged PULS-BioCaster ontology takes a few minutes of real time on ordinary hardware.
For web use, offline processing of ontology queries and caching of results is the way to go.
Repositories can get big, but queries and entries need to remain of manageable size (the "small models" approach).
The combined PULS/BioCaster epidemic ontology contains about 3000 classes and 20K instances, and about 300K statements. The reasoner adds 600K statements, so about 1M statements follow from the ontology.
Ontology comparison queries take 5-10 minutes of real time.
TermFactory repositories can be large, but term queries must be fast and of reasonable size.
Solution: background processing and caching of results
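The caching idea can be sketched in a few lines of Python: results are stored on disk under a key derived from the query text, so a slow query runs once (for instance in an offline batch job) and later requests are served from the cache. This is a generic illustration, not TF's repository mechanism. The class name, the .ttl file naming, and the run_query callable are assumptions.

```python
import hashlib
from pathlib import Path


class QueryCache:
    """Serve query results from disk, running the query only on a cache miss."""

    def __init__(self, directory, run_query):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        # run_query is any callable taking the query text and returning a
        # serialized result (hypothetical stand-in for a real query service).
        self.run_query = run_query

    def _path(self, query):
        # The key is a hash of the exact query text.
        digest = hashlib.sha256(query.encode("utf-8")).hexdigest()
        return self.directory / f"{digest}.ttl"

    def get(self, query):
        path = self._path(query)
        if path.exists():
            return path.read_text(encoding="utf-8")
        result = self.run_query(query)
        path.write_text(result, encoding="utf-8")
        return result
```

Because the key is computed from the exact query text, two queries that differ only in whitespace are cached separately. A real deployment would also need a way to invalidate entries when the underlying ontology changes.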
Exploit CF synergy of web harvesting and crowdsourcing:
EU MOLTO 2010 - 2013
The overall design of the MOLTO WP3 Translation Tools prototype is shown here. The core is a client-side MOLTO translation editor that can be embedded in a third-party translation management platform such as GlobalSight. It is supported by web service APIs that take care of document, grammar, and term management, GF translation, and ontology services.