Thematic Pilot Interview: Language Resources

Read the interview with the CLARIN Thematic Pilot to discover the latest updates on the OSTrails pilot studies. Explore their progress in integrating open science principles and advancing research assessment. This month we had the pleasure of speaking with Daan Broeder and Menzo Windhouwer of CLARIN ERIC.
Menzo Windhouwer & Daan Broeder
"Each science cluster can benefit from seamless information exchange through the Scientific Knowledge Graph Interoperability Framework, which cross-connects repositories, databases, catalogues, knowledge graphs and Linked Open Data collections."
-Can you briefly introduce your organisation? How does it contribute to EOSC?
CLARIN is the European infrastructure whose core business is to provide access to language resources and tools for processing them. CLARIN was one of the first ERICs to be established and nowadays its network spans 26 countries (25 in Europe, plus South Africa). All these distributed resources and expertise will be made available to the wider EOSC ecosystem under the emerging model of service federation via one or more CLARIN nodes. CLARIN is one of the research infrastructures in the Social Science and Humanities Open Cluster (SSHOC).
Over the years, Daan Broeder and Menzo Windhouwer have been working for various institutions, all of which have been deeply involved in the development of the CLARIN infrastructure and its embedding in the European context.
Menzo is currently based at the Humanities Cluster of the Royal Netherlands Academy of Arts and Sciences (KNAW-HuC). Several institutes in the domain of Social Sciences and Humanities (SSH) are part of KNAW-HuC, including the Meertens Institute and the Huygens Institute. Both are also CLARIN centres.
Daan is currently based at the CLARIN ERIC central office, which coordinates the CLARIN research infrastructure. In the OSTrails project, CLARIN ERIC represents the SSHOC cluster and participates in the project board meetings as the representative of the five science clusters.
-What are you most excited about in OSTrails? What are you looking forward to?
Within CLARIN we are looking forward to making the FAIR principles more tangible to our communities. What does it mean for a dataset to be FAIR? What kind of positive impact will that have for a researcher? And will it spark the willingness to spend the required effort to make resources FAIR? And if so, will FAIR become the norm because funders enforce it, or because researchers see why it matters?
CLARIN has its own flexible metadata standard: CMDI. It can handle the many types of datasets and modalities in the language domain, e.g. raw text collections, annotated corpora, lexicons, speech recordings and field work on endangered languages, all in a multitude of languages. However, due to the flexibility of CMDI, metadata schema proliferation is an inherent challenge. CLARIN addresses this via a shared semantic overlay: the CLARIN Concept Registry. The proliferation issues could have been overcome if an extensible common core set of metadata profiles had been available to the community from the start. The Scientific Knowledge Graph Interoperability Framework (SKG-IF), initiated by a working group of the Research Data Alliance (RDA) and now further developed within OSTrails, gives us a fresh start for implementing such a strategy. CLARIN is looking forward to seeing how the SKG approach pans out. We plan to develop and test it together with the SSHOC partners in OSTrails. This joint work builds on the existing collaboration on the EOSC entry point for the SSH cluster: the SSH Open Marketplace.
-How is planning, tracking and assessing research being realised in your scientific domain?
In the CLARIN infrastructure, FAIR was part of the design avant la lettre, and what is now called FAIR assessment has been implemented through technical certification of the CLARIN centres. Enabling proper citation has also been on the agenda for over a decade. CLARIN was one of the first RIs to require proper PIDs for resources, and it offers the Virtual Collection Registry (VCR), a tool that enables users to build virtual collections of resources distributed across repositories and domains.
At country level, things are partly dependent on national circumstances. In the Netherlands, DMPs are still paper trails. Tracking happens in a disconnected way. Assessment of datasets is slowly being taken up by communities such as CLARIAH (a collaboration of the national humanities infrastructures CLARIN and DARIAH), ODISSEI (the Dutch social sciences infrastructure) and NDE (network of Dutch cultural heritage institutions). These initiatives are creating FAIR Implementation Profiles and are eager to make tools available for assessing whether the datasets produced by the communities actually match these profiles.
-Can you provide some details on your pilot's main actors, services and priorities? How will your pilot adopt the results of OSTrails?
CLARIN’s thematic pilot centres around the central catalogue and discovery platform that is used within CLARIN: the Virtual Language Observatory (VLO). This catalogue is based on a weekly harvest of the OAI-PMH providers of the CLARIN centres and other relevant providers. The centres provide their CMDI metadata, from which a dozen common facets are extracted using the shared semantic overlay.
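To make the harvesting step more concrete, here is a minimal Python sketch of one OAI-PMH `ListRecords` exchange: building the request URL and reading record identifiers and the resumption token from a response. It is an illustration only, not the VLO harvester. The endpoint `https://example.org/oai`, the `cmdi` metadata prefix, and the sample response are assumptions made for the example; only the `verb`, `metadataPrefix` and `resumptionToken` arguments and the XML namespace come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) provider."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text.strip())
    ids = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Made-up response, shaped like an OAI-PMH ListRecords reply.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:rec-1</identifier></header></record>
    <record><header><identifier>oai:example.org:rec-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""

first_url = list_records_url("https://example.org/oai", metadata_prefix="cmdi")
record_ids, next_token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai", resumption_token=next_token)
```

A real harvester would repeat the request with each returned resumption token until none is returned, and would then process the metadata payload of each record.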
As early as 2017, a pilot was implemented to make this joint metadata space available as RDF: CMD2RDF. In CLARIN’s thematic OSTrails pilot, CMD2RDF will be refreshed so that it delivers RDF that is compliant with the SKG-IF data model via the API developed by OSTrails and RDA. Some entity types will have to be added explicitly to the semantic overlay, e.g. persons, services and projects, including entities from SKGs to be developed by other OSTrails partners, such as CESSDA. For the linking of entities, we will use Lenticular Lens: an alignment tool developed at KNAW-HuC.
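As a rough illustration of what turning extracted facets into RDF can look like, the Python sketch below serialises a small facet dictionary as N-Triples. It is not CMD2RDF and does not follow the SKG-IF data model: the record IRI, the facet names and the choice of Dublin Core Terms properties as a stand-in vocabulary are all assumptions made for the example.

```python
# Dublin Core Terms namespace, used here only as a stand-in vocabulary.
DCT = "http://purl.org/dc/terms/"


def escape_literal(value):
    """Escape characters that may not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def facets_to_ntriples(subject_iri, facets):
    """Emit one N-Triples line per facet value, in sorted facet order."""
    lines = []
    for name, values in sorted(facets.items()):
        for value in values:
            lines.append(
                f'<{subject_iri}> <{DCT}{name}> "{escape_literal(value)}" .'
            )
    return lines


# Hypothetical record with two facets.
triples = facets_to_ntriples(
    "https://example.org/record/1",
    {"title": ["Corpus A"], "language": ["nl", "en"]},
)
```

A production pipeline would normally rely on an RDF library and a mapping that is agreed with the target data model rather than on hand-built strings.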
This alignment should enable researchers to use the SKG federation to connect a CLARIN dataset to related entities in the CESSDA SKG. This could, for example, help them discover datasets from different domains, which can be useful for interdisciplinary research, e.g. investigations into the influence of socio-economic status on language use. The alignment will also enhance the findability of the SKG resources that will become available for the SSH Open Marketplace.
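The idea of aligning entities across two knowledge graphs can be sketched in a few lines of Python. The example below links entities whose names are identical after simple normalisation (case folding, accent stripping, whitespace collapsing). This is a deliberately naive stand-in: it is not how Lenticular Lens works, and the IRIs and names are invented for the example.

```python
import unicodedata


def normalize(name):
    """Case-fold, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def align(left, right):
    """Return (left IRI, right IRI) pairs whose normalised names are equal."""
    index = {}
    for iri, name in right.items():
        index.setdefault(normalize(name), []).append(iri)
    links = []
    for iri, name in left.items():
        for match in index.get(normalize(name), []):
            links.append((iri, match))
    return links


# Hypothetical person entities from two separate graphs.
clarin_people = {"https://example.org/clarin/p1": "José  Müller"}
cessda_people = {
    "https://example.org/cessda/a9": "jose muller",
    "https://example.org/cessda/b2": "Someone Else",
}
links = align(clarin_people, cessda_people)
```

Exact matching on names produces both false positives and misses, so dedicated alignment tools typically combine several matching methods and let users review the resulting links.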
In addition, we will provide FAIR assessment information for the CLARIN datasets and guidance to both researchers (How can the FAIRness of this dataset facilitate the research process?) and providers (How can you improve the FAIRness of this dataset?).
-Ongoing activities and Next Steps?
Currently we are actively involved in the design of the SKG-IF data model and the API development. We are also processing the outcomes of the face-to-face hackathon in Athens (March 2025), and as the actual pilot implementation is gradually coming closer, some of the tooling is now being prepared:
- CMD2RDF is being adjusted to deal with the latest developments in the CMDI ecosystem and to take advantage of state-of-the-art RDF facilities, such as RDF*;
- Lenticular Lens is being generalized to take input data from any SPARQL-based triple store and will be tested with the SKG-IF data model;
- The FAIR assessment tool pyFAT is being extended with guidance fitting OSTrails developments.
These action lines should enable us to get a first version of the pilot going as soon as the OSTrails Interoperability Frameworks are available.