
Thematic Pilot Interview: Language Resources

Read the interview with the CLARIN Thematic Pilot to discover the latest updates on OSTrails pilot studies. Explore their progress in integrating open science principles and advancing research assessment. This month we had the pleasure of speaking with Daan Broeder and Menzo Windhouwer of CLARIN ERIC.

  - Menzo Windhouwer & Daan Broeder

"Each science cluster can benefit from seamless information exchange through the Scientific Knowledge Graph Interoperability Framework, which cross-connects repositories, databases, catalogues, knowledge graphs and Linked Open Data collections."

 

-Can you briefly introduce your organisation? How do they contribute to EOSC?  

CLARIN is the European infrastructure whose core business is to provide access to language resources and tools for processing them. CLARIN was one of the first ERICs to be established and nowadays its network spans 26 countries (25 in Europe, plus South Africa). All these distributed resources and expertise will be made available to the wider EOSC ecosystem under the emerging model of service federation via one or more CLARIN nodes. CLARIN is one of the research infrastructures in the Social Sciences and Humanities Open Cluster (SSHOC).

Over the years, Daan Broeder and Menzo Windhouwer have been working for various institutions, all of which have been deeply involved in the development of the CLARIN infrastructure and its embedding in the European context.

Menzo is currently based at the Humanities Cluster of the Royal Netherlands Academy of Arts and Sciences (KNAW-HuC). Several institutes in the domain of Social Sciences and Humanities (SSH) are part of KNAW-HuC, including the Meertens Institute and the Huygens Institute. Both are also CLARIN centres.

Daan is currently based at the CLARIN ERIC central office, which coordinates the CLARIN research infrastructure. In the OSTrails project, CLARIN ERIC represents the SSHOC cluster and participates in the project board meetings as representative for the five science clusters.

-What are you most excited about in OSTrails? What are you looking forward to?   

Within CLARIN we are looking forward to making the FAIR principles more tangible to our communities. What does it mean for a dataset to be FAIR? What kind of positive impact will that have for a researcher? And will it spark the willingness to spend the required effort to make resources FAIR? And if so, will FAIR become the norm because funders enforce it, or because the researchers see why it matters?

CLARIN has its own flexible metadata standard: CMDI. It can handle the many types of datasets and modalities in the language domain, e.g. raw text collections, annotated corpora, lexicons, speech recordings, field work on endangered languages, all in a multitude of languages. However, due to the flexibility of CMDI, metadata schema proliferation is an inherent challenge. CLARIN addresses this via a shared semantic overlay: the CLARIN Concept Registry. The proliferation issues could have been overcome if an extendable common core set of metadata profiles had been available for the community from the start. The Scientific Knowledge Graph Interoperability Framework (SKG-IF), initiated by a working group of the Research Data Alliance (RDA) and now further developed within OSTrails, gives us a fresh start for implementing such a strategy. CLARIN is looking forward to seeing the SKG approach pan out. We plan to develop and test it together with the SSHOC partners in OSTrails. This joint work builds on the existing collaboration on the EOSC entry point for the SSH cluster: the SSH Open Marketplace.

-How is planning, tracking and assessing research being realised in your scientific domain?

In the CLARIN infrastructure, FAIR was part of the design avant la lettre, and what is now called FAIR assessment has been implemented through technical certification of the CLARIN centres. Enabling proper citation has also been on the agenda for over a decade. CLARIN was one of the first RIs to require proper PIDs for resources, and it offers the Virtual Collection Registry (VCR), a tool for building virtual collections distributed across repositories and domains.

At country level things are partly dependent on national circumstances. In the Netherlands DMPs are still paper trails. Tracking happens in a disconnected way. Assessment of datasets is slowly taking off in communities like CLARIAH (a collaboration of the national humanities infrastructures CLARIN and DARIAH), ODISSEI (the Dutch social sciences infrastructure) and NDE (network of Dutch cultural heritage institutions). These initiatives are creating FAIR Implementation Profiles and are eager to make tools available for assessing whether the datasets produced by the communities actually match these profiles.

-Can you provide some details on your pilot's main actors, services and priorities? How will your pilot adopt the results of OSTrails?

CLARIN’s thematic pilot centres around the central catalogue and discovery platform that is used within CLARIN: the Virtual Language Observatory (VLO). This catalogue is based on a weekly harvest of the OAI-PMH providers of the CLARIN centres and other relevant providers. The centres provide their CMDI metadata, from which a dozen common facets are extracted using the shared semantic overlay.
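To make the harvesting step more concrete, here is a minimal sketch of how a client might page through an OAI-PMH `ListRecords` response. It is not the VLO harvester. The endpoint URL, the `cmdi` metadata prefix and the sample response are assumptions made for illustration; only the `verb`, `metadataPrefix` and `resumptionToken` parameters and the OAI-PMH 2.0 namespace come from the protocol itself.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="cmdi", resumption_token=None):
    """Build a ListRecords request URL.

    A resumption token is an exclusive argument in OAI-PMH, so it replaces
    metadataPrefix on follow-up requests. The "cmdi" prefix is an assumption;
    a real provider advertises its prefixes via ListMetadataFormats.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


# Invented sample page, used here instead of a network call.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:rec-1</identifier></header></record>
    <record><header><identifier>oai:example.org:rec-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

first_url = list_records_url("https://example.org/oai")
identifiers, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned token until a page arrives without one.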

Already in 2017, a pilot was implemented for making this joint metadata space available as RDF: CMD2RDF. In CLARIN’s thematic OSTrails pilot, CMD2RDF will be refreshed by making it deliver RDF that is compliant with the SKG-IF data model via the API developed by OSTrails and RDA. Some entity types will have to be added explicitly to the semantic overlay, e.g. persons, services and projects, including entities from SKGs to be developed by other OSTrails partners, such as CESSDA. For the linking of entities, we will use Lenticular Lens: an alignment tool developed at KNAW-HuC.
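As a rough illustration of what exposing harvested facets as RDF can look like, the sketch below serialises a few facet values as N-Triples. It is not CMD2RDF and it does not follow the SKG-IF data model. The facet names, the record IRI and the choice of Dublin Core Terms properties are assumptions made for this example.

```python
DCT = "http://purl.org/dc/terms/"

# Hypothetical mapping from facet names to RDF properties (not SKG-IF).
FACET_TO_PROPERTY = {
    "name": DCT + "title",
    "language": DCT + "language",
    "description": DCT + "description",
}


def escape_literal(value):
    """Escape the characters that N-Triples requires to be escaped in literals."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def facets_to_ntriples(subject_iri, facets):
    """Turn a {facet: [values]} dict into a list of N-Triples lines.

    Facets without a mapped property are skipped rather than guessed.
    """
    lines = []
    for facet, values in facets.items():
        prop = FACET_TO_PROPERTY.get(facet)
        if prop is None:
            continue
        for value in values:
            lines.append(f'<{subject_iri}> <{prop}> "{escape_literal(value)}" .')
    return lines


# Invented record for illustration only.
record_iri = "https://example.org/record/1"
facets = {
    "name": ['Corpus of "Spoken" Dutch'],
    "language": ["nld", "eng"],
    "unmapped": ["ignored"],
}
triples = facets_to_ntriples(record_iri, facets)
```

In a fuller pipeline the literal values for entities such as persons or projects would be replaced by IRIs, which is where an alignment step comes in.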

This alignment should enable researchers to use the SKG federation to connect a CLARIN dataset to related entities in the CESSDA SKG. This could for example help discover datasets from different domains, which can be useful for interdisciplinary research, e.g. investigations into the influence of socio-economic status on language use. The alignment will also enhance the findability of resources available in the SKG that will become available for the SSH Open Marketplace.

In addition, we will provide FAIR assessment information for the CLARIN datasets and guidance to both researchers (How can the FAIRness of this dataset facilitate the research process?) and providers (How can you improve the FAIRness of this dataset?).

-Ongoing activities and Next Steps? 

Currently we are actively involved with the design of the SKG-IF data model and the API development. We are also processing the outcomes of the face-to-face hackathon in Athens (March 2025), and as the actual pilot implementation is gradually coming closer, some of the tooling is now being prepared:

  • CMD2RDF is being adjusted to deal with the latest developments in the CMDI ecosystem and to take advantage of state-of-the-art RDF facilities, such as RDF*;
  • Lenticular Lens is being generalized to take input data from any SPARQL-based triple store and will be tested with the SKG-IF data model;
  • The FAIR assessment tool pyFAT is being extended with guidance fitting OSTrails developments.

These action lines should enable us to get a first version of the pilot going, as soon as the OSTrails Interoperability Frameworks are available.

 

Thematic Pilot Interview: Social Sciences & Humanities

Read the interview with the Social Sciences & Humanities Thematic Pilot to discover the latest updates on OSTrails pilot studies. Explore their progress in integrating open science principles and advancing research assessment. This month we had the pleasure of speaking with Michael Kurzmeier from the Digital Research Infrastructure for the Arts and Humanities (DARIAH).

  - Michael Kurzmeier

"Curation is an ongoing effort in the SSH Open Marketplace. Being able to exchange data with other SKGs through a common interoperability framework will help keep records up to date and make it easier for users to bring their research outputs into the Marketplace."

-Can you briefly introduce your organisation? How do they contribute to EOSC?

The creation of the SSH Open Marketplace was funded by the Social Sciences and Humanities Open Cloud (SSHOC) project, which aimed to integrate and consolidate thematic e-infrastructure platforms in preparation for connecting them to the European Open Science Cloud (EOSC). The overall objective of the SSHOC project was to establish the Social Sciences and Humanities segment of EOSC.

As a domain-specific discovery portal and the aggregator of the SSHOC project, the SSH Open Marketplace directly contributes to EOSC by supplementing existing services like the EOSC Resource Hub. It facilitates the seamless exchange of tools, services, data, and knowledge within the research community.

-What are you most excited about in OSTrails? What are you looking forward to?

One of our ongoing tasks in the SSH Open Marketplace is integrating new data sources. Given the differences in how data is handled by each source, we often need to map the source data model to our own. Additionally, when different vocabularies are used, we must address compatibility issues.

With OSTrails introducing a common Scientific Knowledge Graph (SKG) interoperability format, we anticipate a significant reduction in the need for custom ingest pipelines. This commonality will streamline the integration process and allow us to enrich existing entries by querying other SKGs for missing information.
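The mapping and enrichment described above can be sketched in a few lines. This is only an illustration under stated assumptions: the field names on both sides and the sample records are hypothetical and do not reflect the actual Marketplace or SKG-IF data models.

```python
# Hypothetical mapping from a source's field names to Marketplace-style names.
FIELD_MAP = {
    "title": "label",
    "abstract": "description",
    "homepage": "accessibleAt",
}


def map_record(source, field_map):
    """Rename source fields to target fields, dropping empty or missing values."""
    return {
        target: source[src]
        for src, target in field_map.items()
        if source.get(src)
    }


def enrich(entry, candidate):
    """Fill fields that are missing in entry from candidate.

    Values already present in entry are never overwritten, so locally
    curated metadata takes precedence over data from another graph.
    """
    merged = dict(entry)
    for key, value in candidate.items():
        if not merged.get(key):
            merged[key] = value
    return merged


# Invented records for illustration only.
source_record = {
    "title": "Example Tool",
    "abstract": "",
    "homepage": "https://example.org/tool",
}
entry = map_record(source_record, FIELD_MAP)

record_from_other_skg = {
    "label": "Example Tool (other graph)",
    "description": "A description supplied by another knowledge graph.",
}
enriched = enrich(entry, record_from_other_skg)
```

With a shared interoperability format, the per-source `FIELD_MAP` is the part that would no longer need to be written by hand for each new source.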

-How is planning, tracking and assessing research being realised in your scientific domain?

The SSH Open Marketplace is a community-driven discovery portal where researchers can create metadata entries for tools, services, training materials, data sources, publications, and workflows. The workflows enable the detailed description of research processes, linking tools, data sources, training materials, and publications to create a comprehensive view of the research journey.

Items can have relationships with one another, enabling users to discover unexpected connections between resources and fostering opportunities for serendipitous insights. These relationships contribute to a richer and more interconnected research environment, ultimately enhancing the visibility and accessibility of various tools, services, and data.

Curation is a collaborative effort managed by the editorial board and the curation task force. The editorial board oversees the overall strategy and ensures content quality, while the curation task force actively curates new entries, maintains metadata accuracy, and ensures that the resources in the marketplace are up-to-date and relevant to the community.

Because the SSH Open Marketplace brings together diverse resources from the SSH domain and offers a public API endpoint, it is used to track research outputs and acts as a single source of truth for various projects such as Text+, which uses the Marketplace API for the provision and maintenance of services.

-Can you provide some details on your pilot's main actors, services and priorities? How will your pilot adopt the results of OSTrails?

The Austrian Centre for Digital Humanities and Cultural Heritage (ACDH-CH) of the Austrian Academy of Sciences (OeAW) hosts and maintains the service. Matej Ďurčo coordinates efforts from the ACDH-CH team, with Stefan Probst acting as the front-end developer, and Dalibor Pančić handling Marketplace deployment. The Poznan Supercomputing and Networking Center (PSNC), affiliated with the Institute of Bioorganic Chemistry of the Polish Academy of Sciences, provides the data ingestion pipeline and maintains the service. Tomasz Parkoła is responsible for maintaining the back-end code, while Aleksandra Nowak serves as the lead developer of the DACE ingestion pipeline. The curation team includes Klaus Illmayer (OeAW), Alex König (CLARIN), Cesare Concordia (ISTI-CNR), Laure Barbot (DARIAH) and Michael Kurzmeier.

Regarding the adoption of OSTrails' results, we will integrate the interoperability framework into our processes. Ideally, this will allow us to easily connect to new sources and help users autofill most of the fields when creating a new entry.

-Ongoing activities and Next Steps? 

Currently, we are focused on expanding our data sources and refining our integration processes. Our next steps include aligning our infrastructure with the SKG interoperability standards introduced by OSTrails, which will help us facilitate smoother data ingestion and improved metadata enrichment for our users. We look forward to learning about the OSTrails interoperability framework so that we are ready to start our integrations with other tools as soon as it becomes available.

Thematic Pilots

    • Odile Hologne
  • CS Organisations:
    • INRAE

Science Clusters and Thematic Pilots


Overview

There are 8 thematic pilots in OSTrails, which reflects their importance in achieving project objectives across different communities. They play a crucial role in simplifying researchers’ efforts to handle data and follow best practices. First, they will design DMP templates tailored to community specifications and embedding FAIR metrics. Second, they will enrich knowledge graphs with more detail about experiments, datasets, instruments, and other important research results. Lastly, they will work together to define rules for FAIRness in their specific areas. In essence, these thematic pilots contribute to making data management more effective and tailored to the unique needs of different research fields, promoting FAIRness and interoperability within those domains.

In brief

  • Cross-domain and science cluster

    Working with and for all five ESFRI Science Clusters, this pilot focuses on the collaborative enrichment of FAIRsharing, a resource of interlinked community standards (to identify and report data and metadata), databases (repositories and knowledge bases), and policies (from all organisations). FAIRsharing content will power DMP tools and FAIR assistance services to find and extract relevant standards and databases for inclusion in DMPs, or for FAIRness validation processes. Records in FAIRsharing are also interlinked, creating rich knowledge graphs that show which standards are implemented by a database, how databases are related (e.g. exchanging data), and how standards are connected to each other (e.g. which terminologies are used by a given model). The FAIRsharing graph will be connected to the OpenAIRE Graph and other Cluster-specific complementary graphs. To enable this collaborative work, the FAIRsharing Community Champions Programme will ensure formal connections to all Clusters and their research infrastructures, crediting their individual and organizational contributions via the use of ORCID and ROR.

  • Physics

    The Photon and Neutron Open Science Cloud (PaNOSC) pilot on Physics aims at developing PaN RI community DMP ma-templates, embedding community FAIR metrics within the templates to assess DMPs against FAIRness of their described outputs, enhancing PaN RI with entities, PIDs and relationships for experiments, datasets, instruments, facilities, software, publications, vocabulary and grants (PUMA SKG), and enhancing FAIR assessment of the digital objects of the ICAT - PaN dataset repository and the SOLEIL and ILL RIs. For this purpose, a local DSW instance, the PUblication and experiment Metadata Analyser (PUMA), and the ICAT - PaN dataset repository will be employed.

  • Marine/ Coastal

    JERICO pilot on Marine and Coastal sciences aims at developing JERICO community ma-templates for DMPs (e.g. SOCIB), embedding community FAIR metrics within the templates to assess DMPs against FAIRness of their described outputs, enhancing SOCIB and JERICO with PIDs, while interoperating with SKGs, aligning community FAIR metrics for digital objects assessments with JERICO Label and KPIs, and assessing FAIRness of JERICO digital objects. For this purpose, JERICO-CORE, SOCIB Knowledge Catalogue, JERICO KG, Coastal Ocean Resource Infrastructure System (CORIS) will be employed.

  • Social Sciences

    The Consortium of European Social Science Data Archives (CESSDA) pilot on Social Sciences aims at enhancing the CESSDA Data Catalogue (CDC) by using standard vocabularies, PIDs and relationships for data, projects and maDMPs, while interoperating with SKGs. It also aims at aligning community FAIR metrics for digital object assessments with the CESSDA Data Catalogue Assessment; the resulting improvements will be inserted in the CDC and source repositories within the context of the ongoing collaboration with F-UJI in the FAIR-IMPACT project. For this purpose, the CESSDA Data Catalogue, CESSDA Metadata Validator, CESSDA Vocabulary Service, DDI Alliance Vocabulary service, CESSDA ELSST, Kuha2 (OAI-PMH endpoint), CESSDA Data Catalogue assessment and the F-UJI tool will be employed.

  • Social Sciences & Humanities

    The Social Sciences and Humanities Open Cloud (SSHOC) pilot on Social Sciences and Humanities aims at fostering collaboration in developing SSH community ma-templates for DMPs, embedding community FAIR metrics within the templates to assess DMPs against FAIRness of their described outputs, and enhancing the SSH OMP-SKG data model and content with DMPs, FAIR assessment results, and standard vocabularies, linking them with tools and services, datasets, training materials, workflows and publications, while interoperating with the OpenAIRE Graph and other SKGs. For this purpose, the SSH Open Marketplace (SSHOMP), SSH Vocabulary Commons and SSH Data Citation Service will be employed.

  • Biodiversity

    The LifeWatch ERIC pilot aims at developing community ma-templates for DMPs, embedding community FAIR metrics within the templates to assess described research outputs, configuring and linking templates with the relevant institutional and thematic data services, as well as other research products (e.g. preprints, publications, open peer-reviews, data, software, workflows, storage, organisations, projects, funders, services, researchers, facilities, etc.), enhancing the LifeBlock SKG with maDMPs, PIDs and links for the above research products and digital objects, and co-defining community FAIR assessment metrics. For this purpose, the LifeBlock SKG, workflow orchestrators and VRE-forming software, the Metadata catalogue and the Semantics repository will be employed.

  • Language Resources

    The CLARIN pilot on Language Resources aims at fostering collaboration in developing community ma-templates for DMPs, embedding community FAIR metrics within the templates to assess described research outputs, and enhancing the VLO SKG with DMPs, FAIR assessment results and standard vocabularies, while interoperating with the SSH-OMP-SKG, the OpenAIRE Graph and other SKGs. For this purpose, the Virtual Language Observatory (VLO) and CLARIN curation dashboard, Digital Object Gateway, SSH Vocabulary Commons, Lenticular Lens and the CMD2RDF service will be employed.

  • Astronomy and particle physics

    The ESCAPE pilot on Astronomy and particle physics aims at developing ESCAPE community ma-templates for DMPs, based on ObsParis DMP templates and the use case of the CTA Observatory, embedding metadata required to conduct FAIR metrics assessment with the astronomy community ecosystem, enhancing the community registries’ interfaces with PIDs and interoperating with SKGs, aligning community FAIR metrics for digital object assessments with astronomy metadata and KPIs, and assessing FAIRness of ESCAPE digital objects. For this purpose, the Metadata Catalogue, IVOA standard validators, IVOA data servers (with OAI-PMH endpoints), PADC data and metadata services, and the CTA Virtual Observatory repository will be employed.

Partners


  • FAIRsharing

  • University of Oxford

  • Photon and Neutron Open Science Cloud

  • European Synchrotron Radiation Facility

  • SOLEIL

  • Institut Laue-Langevin - ILL

  • Environmental Research Infrastructures

  • JERICO

  • CLARIN

  • CESSDA

  • Social Sciences and Humanities Open Cloud - SSHOC

  • Austrian Academy of Sciences

  • LIFEWATCH

  • ESCAPE

  • Observatoire de Paris