INDE lab | publications

Bibtex file with the publications listed below.

2025

The effects of mismatched train and test data cleaning pipelines on regression models: lessons for practice Nevin, James, Lees, Michael, and Groth, Paul PeerJ Computer Science 2025 [Abs] [Link] [DOI:10.7717/peerj-cs.2793]
Data quality problems are present in all real-world, large-scale datasets. Each of these potential problems can be addressed in multiple ways through data cleaning. However, there is no single best data cleaning approach that always produces a perfect result, meaning that a choice needs to be made about which approach to use. At the same time, machine learning (ML) models are being trained and tested on these cleaned datasets, usually with one single data cleaning pipeline applied. In practice, however, data cleaning pipelines are updated regularly, often without retraining of production models. It is therefore common to apply different test (or production) data than the data on which the models were originally trained. The changes in these new test data and the data cleaning process applied can have potential ramifications for model performance. In this article, we show the impact that altering a data cleaning pipeline between the training and testing steps of an ML workflow can have. Through the fitting and evaluation of over 6,000 models, we find that mismatches between cleaning pipelines on training and test data can have a meaningful impact on regression model performance. Counter-intuitively, such mismatches can improve test set performance and potentially alter model selection choices.
3K: Knowledge-Enriched Digital Twin Framework Karabulut, Erkan, Groth, Paul, and Degeler, Victoria In Proceedings of the 14th International Conference on the Internet of Things 2025 [Abs] [Link] [DOI:10.1145/3703790.3703834]
Digital Twins (DTs) are the digital equivalent of physical entities that facilitate, among others, monitoring and decision-making, thus helping extend the longevity of the twinned entity. DTs with automated decision-making capabilities require explainable inference mechanisms, especially for critical infrastructures such as water networks. Here we introduce 3K, a DT framework that aims for knowledge-enriched inference that is explainable and fast, by synthesizing knowledge representation (semantics) and knowledge discovery methods. 3K constructs a knowledge graph, which is becoming a mainstream way of metadata storage in DTs, and proposes a new method that can run on both sensor data and knowledge graphs to learn semantic association rules. The rules represent the expected working conditions of the DT and we argue that when combined with domain knowledge in the form of ontological axioms, semantic association rules can help perform downstream tasks in DTs, including extending the longevity of the twinned entities such as an Internet of Things (IoT) system. Furthermore, we demonstrate the 3K framework in a water distribution network use case and show how it can be used for downstream tasks.
A Deep Dive Into Cross-Dataset Entity Matching with Large and Small Language Models Zhang, Zeyu, Groth, Paul, Calixto, Iacer, and Schelter, Sebastian In Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025 2025 [Link] [DOI:10.48786/EDBT.2025.75] [Code]
The effects of data quality on machine learning performance on tabular data Mohammed, Sedir, Budach, Lukas, Feuerpfeil, Moritz, Ihde, Nina, Nathansen, Andrea, Noack, Nele, Patzlaff, Hendrik, Naumann, Felix, and Harmouch, Hazar Information Systems 2025 [Abs] [Link] [DOI:https://doi.org/10.1016/j.is.2025.102549]
Modern artificial intelligence (AI) applications require large quantities of training and test data. This need creates critical challenges not only concerning the availability of such data, but also regarding its quality. For example, incomplete, erroneous, or inappropriate training data can lead to unreliable models that produce ultimately poor decisions. Trustworthy AI applications require high-quality training and test data along many quality dimensions, such as accuracy, completeness, and consistency. We explore empirically the relationship between six data quality dimensions and the performance of 19 popular machine learning algorithms covering the tasks of classification, regression, and clustering, with the goal of explaining their performance in terms of data quality. Our experiments distinguish three scenarios based on the AI pipeline steps that were fed with polluted data: polluted training data, test data, or both. We conclude the paper with an extensive discussion of our observations.
Step-by-Step Data Cleaning Recommendations to Improve ML Prediction Accuracy Mohammed, Sedir, Naumann, Felix, and Harmouch, Hazar In Proceedings 28th International Conference on Extending Database Technology, EDBT 2025, Barcelona, Spain, March 25-28, 2025 2025 [Link] [DOI:10.48786/EDBT.2025.43]
Data Systems Education: Curriculum Recommendations, Course Syllabi, and Industry Needs Miedema, Daphne, Taipalus, Toni, Ajanovski, Vangel V., Alawini, Abdussalam, Goodfellow, Martin, Liut, Michael, Peltsverger, Svetlana, and Young, Tiffany In 2024 Working Group Reports on Innovation and Technology in Computer Science Education 2025 [Abs] [Link] [DOI:10.1145/3689187.3709609]
Data systems have been an important part of computing curricula for decades, and an integral part of data-focused industry roles such as software developers, data engineers, and data scientists. However, the field of data systems encompasses a large number of topics ranging from data manipulation and database distribution to creating data pipelines and data analytics solutions. Due to the slow nature of curriculum development, it remains unclear (i) which data systems topics are recommended across diverse higher education curriculum guidelines, (ii) which topics are taught in higher education data systems courses, and (iii) which data systems topics are actually valued in data-focused industry roles. In this study, we analyzed computing curriculum guidelines, course contents, and industry needs regarding data systems to uncover discrepancies between them. Our results show, for example, that topics such as data visualization, data warehousing, and semi-structured data models are valued in industry, yet seldom taught in courses. This work allows professionals to further align curriculum guidelines, higher education, and data systems industry to better prepare students for their working life by focusing on relevant skills in data systems education.
ANYMATCH – Efficient Zero-Shot Entity Matching with a Small Language Model Zhang, Zeyu, Groth, Paul, Calixto, Iacer, and Schelter, Sebastian In Workshop on Preparing Good Data for Generative AI: Challenges and Approaches at AAAI 2025 [Link] [Code]
FAIR Research Objects and computational workflows Soiland-Reyes, Stian 2025 [Abs] [Link]
This PhD thesis explores the topics of RO-Crate, FAIR Digital Objects (FDOs), and computational workflows, in order to examine research questions on how these can be implemented and integrated using Linked Data approaches – forming “FAIR Research Objects”. The background covers the evolution of the Semantic Web, Linked Data, and FAIR Digital Objects, which are evaluated against the FAIR principles and several frameworks, to consider these technologies as potential middleware for a global distributed object system that enable machine-actionable research outputs. This work introduces the community-developed method RO-Crate for packaging research artefacts with their contextual information, relationships and metadata – utilising Linked Data standards that are simplified and documented for pragmatic use by software developers. The tension between flexibility for implementations and rigidity of semantic constraints is explored, and demonstrated by profiles of RO-Crate across research domains such as bioinformatics, regulatory sciences, biodiversity and digital humanities. Computational workflows, for reproducible data analysis across execution platforms, are examined as potential FAIR Digital Objects, considering them both as shareable research outputs as well as a part of provenance of computational results, captured in a Workflow Run Crate. This thesis explores the emerging ecosystem of FAIR Digital Objects and how it can learn from the community development of RO-Crate to carefully adapt ’just enough’ of Linked Data technologies, balancing flexibility and predictability. The main findings of this thesis emphasise community-driven pragmatic solutions over strict semantic correctness, supporting advancement of the FAIR principles through practical and interoperable implementations of Web standards.

2024

The Ramifications of Data Handling for Computational Models Nevin, James Graham 2024 [Link]
Exploiting Subgraphs and Attributes for Representation Learning on Knowledge Graphs Daza Cruz, Daniel Fernando 2024 [Abs] [DOI:10.5463/thesis.823]
Knowledge graphs (KGs) are data structures that explicitly represent entities and the relations between them over some domain. They can be used to store information about people and their relatives or birth locations, about organizations and the countries where they are located, or about chemical compounds and their interactions with proteins in the human body. In this thesis, we investigate the problem of representation learning on knowledge graphs, which consists of learning vector representations of entities and relations that capture the information contained in the graph. These learned representations are useful for capturing patterns that occur in the graph but are not explicitly stated in it. Several methods for representation learning on KGs are based on predicting a link between a pair of entities in the graph. While efficient, this approach forgoes useful learning signals from other sources that exhibit patterns, such as subgraphs involving multiple entities, and attributes of entities. Examples of attributes are textual descriptions of people, or molecular structures of chemical compounds. The goal of this thesis is to explore such learning signals beyond pairwise interactions of entities. Our findings provide evidence that subgraphs and attributes are powerful signals from which we can learn representations in KGs. Not only do they yield improved representations, but they also broaden the range of tasks in which they can be applied. We hope that this serves as a motivation for learning from further sources of information that are already available in KGs, but whose potential is yet to be discovered.
Understanding the Impact of Entity Linking on the Topology of Entity Co-occurrence Networks for Social Media Analysis Nevin, James, Zhang, Pengyu, Dimitrov, Dimitar, Lees, Michael, Groth, Paul, and Dietze, Stefan In Knowledge Engineering and Knowledge Management (EKAW) 2024 [Link] [DOI:10.1007/978-3-031-77792-9_5] [Code]
Influence Beyond Similarity: A Contrastive Learning Approach to Object Influence Retrieval Liberatore, Teresa, Groth, Paul, Kackovic, Monika, and Wijnberg, Nachoem In Knowledge Engineering and Knowledge Management (EKAW) 2024 [Link] [DOI:10.1007/978-3-031-77792-9_3] [Code]
TIGER: Temporally Improved Graph Entity Linker Zhang, Pengyu, Cao, Congfeng, and Groth, Paul In 27th European Conference on Artificial Intelligence (ECAI 24) 2024 [Link] [DOI:10.3233/faia240933] [Code]
DiTEC: Digital Twin for Evolutionary Changes in Water Distribution Networks Degeler, Victoria, Hadadian, Mostafa, Karabulut, Erkan, Lazovik, Alexander, Loo, Hester, Tello, Andrés, and Truong, Huy In Leveraging Applications of Formal Methods, Verification and Validation. Application Areas 2024 [Abs] [Link] [Code]
Conventional digital twins (DT) for critical infrastructures are widely used to model and simulate the system’s state. But fundamental environment changes bring challenges for DT adaptation to new conditions, leading to a progressively decreasing correspondence of the DT to its physical counterpart. This paper introduces the DiTEC system, a Digital Twin for Evolutionary Changes in Water Distribution Networks (WDN). This framework combines novel techniques, including semantic rule learning, graph neural network-based state estimation, and adaptive model selection, to ensure that changes are adequately detected, processed and the DT is updated to the new state. The DiTEC system is tested on the Dutch Oosterbeek region WDN, with results showing the superiority of the approach compared to traditional methods.
CYCLE: Cross-Year Contrastive Learning in Entity-Linking Zhang, Pengyu, Cao, Congfeng, Zaporojets, Klim, and Groth, Paul In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM 24) 2024 [Abs] [Link] [DOI:10.1145/3627673.3679702] [Code]
Knowledge graphs constantly evolve with new entities emerging, existing definitions being revised, and entity relationships changing. These changes lead to temporal degradation in entity linking models, characterized as a decline in model performance over time. To address this issue, we propose leveraging graph relationships to aggregate information from neighboring entities across different time periods. This approach enhances the ability to distinguish similar entities over time, thereby minimizing the impact of temporal degradation. We introduce CYCLE: Cross-Year Contrastive Learning for Entity-Linking. This model employs a novel graph contrastive learning method to tackle temporal performance degradation in entity linking tasks. Our contrastive learning method treats newly added graph relationships as positive samples and newly removed ones as negative samples. This approach helps our model effectively prevent temporal degradation, achieving a 13.90% performance improvement over the state-of-the-art from 2023 when the time gap is one year, and a 17.79% improvement as the gap expands to three years. Further analysis shows that CYCLE is particularly robust for low-degree entities, which are less resistant to temporal degradation due to their sparse connectivity, making them particularly suitable for our method. The code and data are made available at https://github.com/pengyu-zhang/CYCLE-Cross-Year-Contrastive-Learning-in-Entity-Linking
Testing prompt engineering methods for knowledge extraction from text Polat, Fina, Tiddi, Ilaria, and Groth, Paul Semantic Web 2024 [Link] [DOI:10.3233/sw-243719]
How different is different? Systematically identifying distribution shifts and their impacts in NER datasets Li, Xue, and Groth, Paul Language Resources and Evaluation 2024 [Link] [DOI:10.1007/s10579-024-09754-8]
A Sparsity Principle for Partially Observable Causal Representation Learning Xu, Danru, Yao, Dingling, Lachapelle, Sébastien, Taslakian, Perouz, Kügelgen, Julius, Locatello, Francesco, and Magliacane, Sara International Conference on Machine Learning (ICML) 2024 [Link]
SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection Allen, Bradley P., Polat, Fina, and Groth, Paul In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024) 2024 [Link] [Code]
Towards Efficient Data Wrangling with LLMs using Code Generation Li, Xue, and Döhmen, Till In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning 2024 [Abs] [Link] [DOI:10.1145/3650203.3663334]
While LLM-based data wrangling approaches that process each row of data have shown promising benchmark results, computational costs still limit their suitability for real-world use cases on large datasets. We revisit code generation using LLMs for various data wrangling tasks, which show promising results particularly for data transformation tasks (up to 37.2 points improvement on F1 score) at much lower computational costs. We furthermore identify shortcomings of code generation methods especially for semantically challenging tasks, and consequently propose an approach that combines program generation with a routing mechanism using LLMs.
Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines" Grafberger, Stefan, Groth, Paul, and Schelter, Sebastian In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning 2024 [Abs] [Link] [DOI:10.1145/3650203.3663327]
Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision to apply incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.
Towards Federated LLM-Powered CEP Rule Generation and Refinement Lotfian Delouee, Majid, Pernes, Daria G., Degeler, Victoria, and Koldehofe, Boris In The 18th ACM International Conference on Distributed and Event-Based Systems (DEBS’24) 2024 [Abs] [Link]
In traditional event processing systems, patterns representing situations of interest are typically defined by domain experts or learned from historical data. These approaches often make rule generation reactive, time-consuming, and susceptible to human error. In this paper, we propose and investigate the integration of large language models (LLMs) to automate and accelerate query translation and rule generation in event processing systems. Furthermore, we introduce a federated learning schema to refine the initially generated rules by examining them over distributed event streams, ensuring greater accuracy and adaptability.Preliminary results demonstrate the potential of LLMs as a key component in proactively expediting the autonomous rule-generation process. Moreover, our findings suggest that employing customized prompt engineering techniques can further enhance the quality of the generated rules.
Multi-View Causal Representation Learning with Partial Observability Yao, Dingling, Xu, Danru, Lachapelle, Sébastien, Magliacane, Sara, Taslakian, Perouz, Martius, Georg, Kügelgen, Julius, and Locatello, Francesco In The Twelfth International Conference on Learning Representations 2024 [Link] [ Spotlight Presentation ]
Large-Scale Multipurpose Benchmark Datasets For Assessing Data-Driven Deep Learning Approaches For Water Distribution Networks Tello, Andrés, Truong, Huy, Lazovik, Alexander, and Degeler, Victoria In Engineering Proceedings 2024 [Abs] [Link] [Data]
Currently, the number of common benchmark datasets that researchers can use straight away for assessing data-driven deep learning approaches is very limited. Most studies provide data as configuration files. It is still up to each practitioner to follow a particular data generation method and run computationally intensive simulations to obtain usable data for model training and evaluation. In this work, we provide a collection of datasets that includes several small and medium size publicly available Water Distribution Networks (WDNs), including Anytown, Modena, Balerma, C-Town, D-Town, L-Town, Ky1, Ky6, Ky8, Ky10, and Ky13. In total 1,394,400 hours of WDNs data operating under normal conditions is made available to the community.
Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models Allen, Bradley P., and Groth, Paul T. In Proceedings of European Semantic Web Conference Special Track on Large Language Models for Knowledge Engineering 2024 [Link]
Retrieval-based Question Answering with Passage Expansion Using a Knowledge Graph Kruit, Benno, Xu, Yiming, and Kalo, Jan-Christoph In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) 2024 [Abs] [Link]
Recent advancements in dense neural retrievers and language models have led to large improvements in state-of-the-art approaches to open-domain Question Answering (QA) based on retriever-reader architectures. However, issues stemming from data quality and imbalances in the use of dense embeddings have hindered performance, particularly for less common entities and facts. To tackle these problems, this study explores a multi-modal passage retrieval model’s potential to bolster QA system performance. This study poses three key questions: (1) Can a distantly supervised question-relation extraction model enhance retrieval using a knowledge graph (KG), compensating for dense neural retrievers’ shortcomings with rare entities? (2) How does this multi-modal approach compare to existing QA systems based on textual features? (3) Can this QA system alleviate poor performance on less common entities on common benchmarks? We devise a multi-modal retriever combining entity features and textual data, leading to improved retrieval precision in some situations, particularly for less common entities. Experiments across different datasets confirm enhanced performance for entity-centric questions, but challenges remain in handling complex generalized questions.
Directions Towards Efficient and Automated Data Wrangling with Large Language Models Zhang, Zeyu, Groth, Paul, Calixto, Iacer, and Schelter, Sebastian In 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW) 2024 [Link] [DOI:10.1109/ICDEW61823.2024.00044]
SchemaPile: A Large Collection of Relational Database Schemas Döhmen, Till, Geacu, Radu, Hulsebos, Madelon, and Schelter, Sebastian Proc. ACM Manag. Data 2024 [Abs] [Link] [DOI:10.1145/3654975]
Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas.In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, seven million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of corpus to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.
Data Debugging with Shapley Importance over Machine Learning Pipelines Karlaš, Bojan, Dao, David, Interlandi, Matteo, Schelter, Sebastian, Wu, Wentao, and Zhang, Ce In The Twelfth International Conference on Learning Representations 2024 [Link]
Driving Towards Efficiency: Adaptive Resource-aware Clustered Federated Learning in Vehicular Networks Khalil, Ahmad, Lotfian Delouee, Majid, Degeler, Victoria, Meuser, Tobias, Fernandez Anta, Antonio, and Koldehofe, Boris In The 22nd Mediterranean Communication and Computer Networking Conference (MedComNet’24) 2024 [Abs] [Link]
Guaranteeing precise perception for fully autonomous driving in diverse driving conditions requires continuous improvement and training. In vehicular networks, federated learning (FL) facilitates this by enabling model training without sharing raw sensory data. As an extension, clustered FL reduces communication overhead and aligns well with the dynamic nature of these networks. However, current literature on this topic does not consider critical dimensions of FL, including (1) the correlation between perception performance and the networking overhead, (2) the limited vehicle storage, (3) the need for training with freshly captured data, and (4) the impact of non-IID data and varying traffic densities. To fill these research gaps, we introduce AR-CFL, an Adaptive Resource-aware Clustered Federated Learning framework. AR-CFL utilizes clustered FL to collectively model the environment of connected vehicles, integrating models from all vehicles and ensuring universal accessibility to the refined model. AR-CFL dynamically enhances system efficiency by adaptively adjusting the number of clusters and specific in-cluster participant selection strategies. Using AR-CFL, we systematically study the scenario of online car detection model training on non-IID data across varied conditions. The evaluation results highlight the robust detection performance exhibited by the trained model employing the clustered FL approach, despite the constraints posed by limited vehicle storage capacity. Furthermore, our investigation unveils superior training performance with clustered FL in comparison to specific classical FL scenarios, increasing the training efficiency in terms of participating nodes by up to 25% and reducing cellular communication by 33%.
Evaluating FAIR Digital Object and Linked Data as distributed object systems Soiland-Reyes, Stian, Goble, Carole, and Groth, Paul PeerJ Computer Science 2024 [Abs] [Link] [DOI:10.7717/peerj-cs.1781] [Data]
FAIR Digital Object (FDO) is an emerging concept that is highlighted by European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, by using five different conceptual frameworks that cover interoperability, middleware, FAIR principles, EOSC requirements and FDO guidelines themself. We compare the FDO approach with established Linked Data practices and the existing Web architecture, and provide a brief history of the Semantic Web while discussing why these technologies may have been difficult to adopt for FDO purposes. We conclude with recommendations for both Linked Data and FDO communities to further their adaptation and alignment.
Ontologies in digital twins: A systematic literature review Karabulut, Erkan, Pileggi, Salvatore F., Groth, Paul, and Degeler, Victoria Future Generation Computer Systems 2024 [Link] [DOI:10.1016/j.future.2023.12.013] [Data]
Empirical ontology design patterns and shapes from Wikidata Carriero, Valentina Anita, Groth, Paul, and Presutti, Valentina Semantic Web 2024 [Link] [DOI:10.3233/sw-243613]
Table Representation Learning Hulsebos, Madelon 2024 [Link]
Assisted design of data science pipelines Redyuk, Sergey, Kaoudi, Zoi, Schelter, Sebastian, and Markl, Volker The VLDB Journal 2024 [Link] [DOI:10.1007/s00778-024-00835-2]
Domain Generalization in Time Series Forecasting Deng, Songgaojun, Sprangers, Olivier, Li, Ming, Schelter, Sebastian, and Rijke, Maarten ACM Trans. Knowl. Discov. Data 2024 [Abs] [Link] [DOI:10.1145/3643035]
Domain generalization aims to design models that can effectively generalize to unseen target domains by learning from observed source domains. Domain generalization poses a significant challenge for time series data, due to varying data distributions and temporal dependencies. Existing approaches to domain generalization are not designed for time series data, which often results in suboptimal or unstable performance when confronted with diverse temporal patterns and complex data characteristics. We propose a novel approach to tackle the problem of domain generalization in time series forecasting. We focus on a scenario where time series domains share certain common attributes and exhibit no abrupt distribution shifts. Our method revolves around the incorporation of a key regularization term into an existing time series forecasting model: domain discrepancy regularization. In this way, we aim to enforce consistent performance across different domains that exhibit distinct patterns. We calibrate the regularization term by investigating the performance within individual domains and propose the domain discrepancy regularization with domain difficulty awareness. We demonstrate the effectiveness of our method on multiple datasets, including synthetic and real-world time series datasets from diverse domains such as retail, transportation, and finance. Our method is compared against traditional methods, deep learning models, and domain generalization approaches to provide comprehensive insights into its performance. In these experiments, our method showcases superior performance, surpassing both the base model and competing domain generalization models across all datasets. Furthermore, our method is highly general and can be applied to various time series models.
Red Onions, Soft Cheese and Data: From Food Safety to Data Traceability for Responsible AI Grafberger, Stefan, Zhang, Zeyu, Schelter, Sebastian, and Zhang, Ce IEEE Data Engineering Bulletin 2024 [Link]
Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment Martorana, Margherita, Kuhn, Tobias, Stork, Lise, and Ossenbruggen, Jacco 2024 [Link] [DOI:10.3233/SSW240006]
Editorial for the Special Issue on Knowledge Engineering Groth, Paul, Blomqvist, Eva, and Sequeda, Juan F. Journal of Web Semantics 2024 [Link] [DOI:10.1016/j.websem.2024.100840]
Graph Neural Networks for Pressure Estimation in Water Distribution Systems Truong, Huy, Tello, Andrés, Lazovik, Alexander, and Degeler, Victoria Water Resources Research 2024 [Link]
Large-Scale Forecasting of Electric Vehicle Charging Demand Using Global Time Series Modeling Etten, Tijmen, Degeler, Victoria, and Luo, Ding In Proceedings of the 10th International Conference on Vehicle Technology and Intelligent Transport Systems 2024 [Link] [DOI:10.5220/0012555400003702]
Standardizing Knowledge Engineering Practices with a Reference Architecture Allen, Bradley P., and Ilievski, Filip Transactions on Graph Data and Knowledge 2024 [Link] [DOI:10.4230/TGDK.2.1.5]
Too Good To Be True: accuracy overestimation in (re)current practices for Human Activity Recognition Tello, Andrés, Degeler, Victoria, and Lazovik, Alexander In 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops) 2024 [DOI:10.1109/PerComWorkshops59983.2024.10503465]
Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making Guha, Shubha, Khan, Falaah Arif, Stoyanovich, Julia, and Schelter, Sebastian IEEE Transactions on Knowledge and Data Engineering 2024 [Link] [DOI:10.1109/TKDE.2024.3365524]

2023

BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs Daza, Daniel, Alivanistos, Dimitrios, Mitra, Payal, Pijnenburg, Thom, Cochez, Michael, and Groth, Paul Journal of Biomedical Semantics 2023 [Link] [DOI:10.1186/s13326-023-00301-y] [Code] [Data]
APP-CEP: Adaptive Pattern-level Privacy Protection in Complex Event Processing Systems Lotfian Delouee, Majid, Degeler, Victoria, Amthor, Peter, and Koldehofe, Boris In 10th International Conference on Information Systems Security and Privacy (ICISSP’24) 2023 [Abs] [Link]
Although privacy-preserving mechanisms endeavor to safeguard sensitive information at the attribute level, detected event patterns can still disclose privacy-sensitive knowledge in distributed complex event processing systems (DCEP). Events might not be inherently sensitive, but their aggregation into a pattern could still breach privacy. In this paper, we study in the context of APP-CEP the problem of integrating pattern-level privacy in event-based systems by selective assignment of obfuscation techniques to conceal private information. Compared to state-of-the-art techniques, we seek to enforce privacy independent of the actual events in streams. To support this, we acquire queries and privacy requirements using CEP-like patterns. The protection of privacy is accomplished through generating pattern dependency graphs, leading to dynamically appointing those techniques that have no consequences on detecting other sensitive patterns, as well as non-sensitive patterns required to provide acceptable Quality of Service. Besides, we model the knowledge that might be possessed by potential adversaries to violate privacy and its impacts on the obfuscation procedure. We assessed the performance of APP-CEP in a real-world scenario involving an online retailer’s transactions. Our evaluationresults demonstrate that APP-CEP successfully provides a privacy-utility trade-off. Modeling the background knowledge also effectively prevents adversaries from realizing the modifications in the input streams.
Large Language Models and Knowledge Graphs: Opportunities and Challenges Pan, Jeff Z., Razniewski, Simon, Kalo, Jan-Christoph, Singhania, Sneha, Chen, Jiaoyan, Dietze, Stefan, Jabeen, Hajira, Omeliyanenko, Janna, Zhang, Wen, Lissandrini, Matteo, Biswas, Russa, Melo, Gerard, Bonifati, Angela, Vakaj, Edlira, Dragoni, Mauro, and Graux, Damien Transactions on Graph Data and Knowledge 2023 [Link] [DOI:10.4230/TGDK.1.1.2]
Evaluating the Knowledge Base Completion Potential of GPT Veseli, Blerta, Razniewski, Simon, Kalo, Jan-Christoph, and Weikum, Gerhard In Findings of the Association for Computational Linguistics: EMNLP 2023 2023 [Abs] [Link] [DOI:10.18653/v1/2023.findings-emnlp.426]
Structured knowledge bases (KBs) are an asset for search engines and other applications but are inevitably incomplete. Language models (LMs) have been proposed for unsupervised knowledge base completion (KBC), yet, their ability to do this at scale and with high accuracy remains an open question. Prior experimental studies mostly fall short because they only evaluate on popular subjects, or sample already existing facts from KBs. In this work, we perform a careful evaluation of GPT’s potential to complete the largest public KB: Wikidata. We find that, despite their size and capabilities, models like GPT-3, ChatGPT and GPT-4 do not achieve fully convincing results on this task. Nonetheless, it provides solid improvements over earlier approaches with smaller LMs. In particular, we show that it is feasible to extend Wikidata by 27M facts at 90% precision.
A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference Krieken, Emile, Thanapalasingam, Thiviyan, Tomczak, Jakub M., Harmelen, Frank Van, and Teije, Annette Ten In Thirty-seventh Conference on Neural Information Processing Systems 2023 [arXiv] [Link]
Adapting Neural Link Predictors for Data-Efficient Complex Query Answering Arakelyan, Erik, Minervini, Pasquale, Daza, Daniel, Cochez, Michael, and Augenstein, Isabelle In Thirty-seventh Conference on Neural Information Processing Systems 2023 [arXiv] [Link]
Observatory: Characterizing Embeddings of Relational Tables Cong, Tianji, Hulsebos, Madelon, Sun, Zhenjie, Groth, Paul, and Jagadish, H. V. Proceedings of the VLDB Endowment 2023 [Link] [DOI:10.14778/3636218.3636237]
Knowledge Engineering Using Large Language Models Allen, Bradley P., Stork, Lise, and Groth, Paul Transactions on Graph Data and Knowledge 2023 [Link] [DOI:10.4230/TGDK.1.1.3]
Preface: LM-KBC Challenge 2023 Sneha Singhania, Jan-Christoph Kalo, Simon Razniewski, Jeff Z. Pan In Joint proceedings of the 1st workshop on Knowledge Base Construction from Pre-Trained Language Models (KBC-LM) and the 2nd challenge on Language Models for Knowledge Base Construction (LM-KBC) 2023 [Link]
Do Instruction-tuned Large Language Models Help with Relation Extraction? Li, Xue, Polat, Fina, and Groth, Paul In KBC-LM’23: Knowledge Base Construction from Pre-trained Language Models workshop at ISWC 2023 2023 [Link] [Code]
Knowledge-centric Prompt Composition for Knowledge Base Construction from Pre-trained Language Models Li, Xue, Hughes, Anthony, Llugiqi, Majlinda, Polat, Fina, Groth, Paul, and Ekaputra, Fajar J. In KBC-LM’23: Knowledge Base Construction from Pre-trained Language Models workshop at ISWC 2023 2023 [Link] [Code]
Semantic Association Rule Learning from Time Series Data and Knowledge Graphs Karabulut, Erkan, Degeler, Victoria, and Groth1, Paul In SemIIM’23: 2nd International Workshop on Semantic Industrial Information Modelling co-located with 22nd International Semantic Web Conference (ISWC 2023) 2023 [arXiv] [Link]
Mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses over and Over? Grafberger, Stefan, Guha, Shubha, Groth, Paul, and Schelter, Sebastian Proc. VLDB Endow. 2023 [Abs] [Link] [DOI:10.14778/3611540.3611606] [Code]
Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this variant to see how the change impacts the pipeline’s output score.We recently proposed mlwhatif, a library that enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. We demonstrate how data scientists can leverage mlwhatif for a variety of pipelines and three different what-if analyses focusing on the robustness of a pipeline against data errors, the impact of data cleaning operations, and the impact of data preprocessing operations on fairness. In particular, we demonstrate step-by-step how mlwhatif generates and optimizes the required execution plans for the pipeline analyses. Our library is publicly available at https://github.com/stefan-grafberger/mlwhatif.
Improving Graph-to-Text Generation Using Cycle Training Polat, Fina, Tiddi, Ilaria, Groth, Paul, and Vossen, Piek In Proceedings of the 4th Conference on Language, Data and Knowledge 2023 [Link]
Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring Hu, Qingzhi, Daza, Daniel, Swinkels, Laurens, Usaite, Kristina, Hoen, Robbert-Jan, and Groth, Paul In KDD Fragile Earth Workshop 2023 [Link] [DOI:10.48550/arXiv.2308.02622]
An approach for analysing the impact of data integration on complex network diffusion models Nevin, James, Groth, Paul, and Lees, Michael Journal of Complex Networks 2023 [Link] [DOI:10.1093/comnet/cnad025] [Code]
Self-Contained Entity Discovery from Captioned Videos Ayoughi, Melika, Mettes, Pascal, and Groth, Paul ACM Trans. Multimedia Comput. Commun. Appl. 2023 [Abs] [Link] [DOI:10.1145/3583138]
This article introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g., faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating entities with labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame–caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks, SC-Friends and SC-BBT, based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.1
Data journeys: Explaining AI workflows through abstraction Daga, Enrico, and Groth, Paul Semantic Web 2023 [Abs] [Link] [DOI:10.3233/sw-233407]
"Artificial intelligence systems are not simply built on a single dataset or trained model. Instead, they are made by complex data science workflows involving multiple datasets, models, preparation scripts, and algorithms. Given this complexity, in order to understand these AI systems, we need to provide explanations of their functioning at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys from these workflows. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. Finally, we report on a user survey to reflect on the challenges and opportunities presented by computational data journeys for explainable AI."
Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines Grafberger, Stefan, Groth, Paul, and Schelter, Sebastian Proc. ACM Manag. of Data 2023 [Abs] [Link] [DOI:10.1145/3589273]
Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this pipeline variant to see how the change impacts the pipeline’s output score. The application of existing analysis techniques to ML pipelines is technically challenging as they are hard to integrate into existing pipeline code and their execution introduces large overheads due to repeated work.We propose mlwhatif to address these integration and efficiency challenges for data-centric what-if analyses on ML pipelines. mlwhatif enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. Our approach employs pipeline patches to specify changes to the data, operators and models of a pipeline. Based on these patches, we define a multi-query optimizer for efficiently executing the resulting pipeline variants jointly, with four subsumption-based optimization rules. Subsequently, we detail how to implement the pipeline variant generation and optimizer of mlwhatif. For that, we instrument native ML pipelines written in Python to extract dataflow plans with re-executable operators.We experimentally evaluate mlwhatif, and find that its speedup scales linearly with the number of pipeline variants in applicable cases, and is invariant to the input data size. In end-to-end experiments with four analyses on more than 60 pipelines, we show speedups of up to 13x compared to sequential execution, and find that the speedup is invariant to the model and featurization in the pipeline. Furthermore, we confirm the low instrumentation overhead of mlwhatif.
GitTables: A Large-Scale Corpus of Relational Tables Hulsebos, Madelon, Demiralp, Çagatay, and Groth, Paul Proc. ACM Manag. Data 2023 [Abs] [Link] [DOI:10.1145/3588710]
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.
AQuA-CEP: Adaptive Quality-Aware Complex Event Processing in the Internet of Things Lotfian Delouee, Majid, Koldehofe, Boris, and Degeler, Viktoriya In The 17th ACM International Conference on Distributed Event-Based Systems (DEBS 2023) 2023 [Abs] [Link]
Sensory data profoundly influences the quality of detected events in a distributed complex event processing system (DCEP). Since each sensor’s status is unstable at runtime, a single sensing assignment is often insufficient to fulfill the consumer’s quality requirements. In this paper, we study in the context of AQuA-CEP the problem of dynamic quality monitoring and adaptation of complex event processing by active integration of suitable data sources. To support this, in AQuA-CEP, queries to detect complex events are supplemented with consumer-definable quality policies that are evaluated and used to autonomously select (or even configure) suitable data sources of the sensing infrastructure. In addition, we studied different forms of expressing quality policies and analyzed how it affects the quality monitoring process. Various modes of evaluating and applying quality-related adaptations and their impacts on correlation efficiency are addressed, too. We assessed the performance of AQuA-CEP in IoT scenarios by utilizing the notion of the quality policy alongside the query processing adaptation using knowledge derived from quality monitoring. The results show that AQuA-CEP can improve the performance of DCEP systems in terms of the quality of results while fulfilling the consumer’s quality requirements. Quality-based adaptation can also increase the network’s lifetime by optimizing the sensor’s energy consumption due to efficient data source selection.
An Analysis of Machine Learning-Based Semantic Matchmaking Karabulut, Erkan, and Sofia, Rute C. IEEE Access 2023 [DOI:10.1109/ACCESS.2023.3259360] [Code]
How to Make an Outlier? Studying the Effect of Presentational Features on the Outlierness of Items in Product Search Results Sarvi, Fatemeh, Aliannejadi, Mohammad, Schelter, Sebastian, and Rijke, Maarten In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval 2023 [Abs] [Link] [DOI:10.1145/3576840.3578278]
In two-sided marketplaces, items compete for attention from users since attention translates to revenue for suppliers. Item exposure is an indication of the amount of attention that items receive from users in a ranking. It can be influenced by factors like position bias. Recent work suggests that another phenomenon related to inter-item dependencies may also affect item exposure, viz. outlier items in the ranking. Hence, a deeper understanding of outlier items is crucial to determining an item’s exposure distribution. In this work, we study the impact of different presentational e-commerce features on users’ perception of outlierness of an item in a search result page. Informed by visual search literature, we design a set of crowdsourcing tasks where we compare the observability of three main features, viz. price, star rating, and discount tag. We find that various factors affect item outlierness, namely, visual complexity (e.g., shape, color), discriminative item features, and value range. In particular, we observe that a distinctive visual feature such as a colored discount tag can attract users’ attention much easier than a high price difference, simply because of visual characteristics that are easier to spot. Moreover, we see that the magnitude of deviations in all features affects the task complexity, such that when the similarity between outlier and non-outlier items increases, the task becomes more difficult.
The Mysterious User of Research Data: Knitting Together Science and Technology Studies with Information and Computer Science Gregory, Kathleen, Groth, Paul, Scharnhorst, Andrea, and Wyatt, Sally 2023 [Abs] [Link] [DOI:10.1007/978-3-031-11108-2_11]
Open, accessible, and standardized research data are seen as essential scaffolding for open science. To support this vision, data repositories and scientific publishers have developed new tools to facilitate data discovery while funders and policy makers have implemented open science and data management policies. Users are often invoked as central to these efforts. Despite this stated focus, the concept of ‘user’ often remains an abstraction, visible only via anonymous ensembles of click behavior or data management plans. This chapter reports and reflects on a project which draws on science and technology studies (STS) to open up the black box of research data use, bridging the gap between designers of data search systems and researchers who (re-)use both data and these systems in their actual practices. Quantitative and qualitative studies conducted in the course of this project will be drawn upon to demonstrate the insights gained from an interdisciplinary approach.
SemTab 2022: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, co-located with the 21st International Semantic Web Conference, ISWC 2022, Virtual conference, October 23-27, 2022 2023 [Link]
E2EG: End-to-End Node Classification Using Graph Topology and Text-based Node Attributes Dinh, Tu Anh, Boef, Jeroen, Cornelisse, Joran, and Groth, Paul In 2023 IEEE International Conference on Data Mining Workshops (ICDMW) 2023 [DOI:10.1109/ICDMW60847.2023.00142]
Towards Declarative Systems for Data-Centric Machine Learning Grafberger, Stefan, Karlaš, Bojan, Groth, Paul, and Schelter, Sebastian In Proceedings of the Data-Centric Machine Learning Research work- shop (DMLR) at ICML, 2023 2023 [Link]
Approximate Answering of Graph Queries Cochez, Michael, Alivanistos, Dimitrios, Arakelyan, Erik, Berrendorf, Max, Daza, Daniel, Galkin, Mikhail, Minervini, Pasquale, Niepert, Mathias, and Ren, Hongyu 2023 [Link] [DOI:10.3233/FAIA230149]
Reasoning beyond Triples: Recent Advances in Knowledge Graph Embeddings Xiong, Bo, Nayyeri, Mojtaba, Daza, Daniel, and Cochez, Michael In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023 2023 [Link] [DOI:10.1145/3583780.3615294]
Reconstructing and Querying ML Pipeline Intermediates Schelter, Sebastian In 13th Conference on Innovative Data Systems Research, CIDR 2023, Amsterdam, The Netherlands, January 8-11, 2023 2023 [Link]
Figure of Speech Detection and Generation as a Service in IDN Authoring Support Akkerman, Simon, and Nack, Frank 2023 [Link] [DOI:10.1007/978-3-031-47658-7_8]
Empowering Machine Learning Development with Service-Oriented Computing Principles Yousefi, Mostafa Hadadian Nejad, Degeler, Viktoriya, and Lazovik, Alexander In Service-Oriented Computing 2023 [Abs]
Despite software industries’ successful utilization of Service-Oriented Computing (SOC) to streamline software development, machine learning (ML) development has yet to fully integrate these practices. This disparity can be attributed to multiple factors, such as the unique challenges inherent to ML development and the absence of a unified framework for incorporating services into this process. In this paper, we shed light on the disparities between services-oriented computing and machine learning development. We propose “Everything as a Module” (XaaM), a framework designed to encapsulate every ML artifacts including models, code, data, and configurations as individual modules, to bridge this gap. We propose a set of additional steps that need to be taken to empower machine learning development using services-oriented computing via an architecture that facilitates efficient management and orchestration of complex ML systems. By leveraging the best practices of services-oriented computing, we believe that machine learning development can achieve a higher level of maturity, improve the efficiency of the development process, and ultimately, facilitate the more effective creation of machine learning applications.
Results of SemTab 2023 Hassanzadeh, Oktie, Abdelmageed, Nora, Efthymiou, Vasilis, Chen, Jiaoyan, Cutrona, Vincenzo, Hulsebos, Madelon, Jiménez-Ruiz, Ernesto, Khatiwada, Aamod, Korini, Keti, Kruit, Benno, Sequeda, Juan, and Srinivas, Kavitha In Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, SemTab 2023, co-located with the 22nd International Semantic Web Conference, ISWC 2023, Athens, Greece, November 6-10, 2023 2023 [Link]
Introducing the Observatory Library for End-to-End Table Embedding Inference Cong, Tianji, Sun, Zhenjie, Groth, Paul, Jagadish, H., and Hulsebos, Madelon In NeurIPS 2023 Second Table Representation Learning Workshop 2023 [Link]
Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making Guha, Shubha, Khan, Falaah Arif, Stoyanovich, Julia, and Schelter, Sebastian In 2023 IEEE 39th International Conference on Data Engineering (ICDE) 2023 [DOI:10.1109/ICDE55515.2023.00303] [Code]
Forget Me Now: Fast and Exact Unlearning in Neighborhood-Based Recommendation Schelter, Sebastian, Ariannezhad, Mozhdeh, and Rijke, Maarten In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023 [Abs] [Link] [DOI:10.1145/3539618.3591989]
Modern search and recommendation systems are optimized using logged interaction data. There is increasing societal pressure to enable users of such systems to have some of their data deleted from those systems. This paper focuses on "unlearning" such user data from neighborhood-based recommendation models on sparse, high-dimensional datasets. We present caboose, a custom top-k index for such models, which enables fast and exact deletion of user interactions. We experimentally find that caboose provides competitive index building times, makes sub-second unlearning possible (even for a large index built from one million users and 256 million interactions), and, when integrated into three state-of-the-art next-basket recommendation models, allows users to effectively adjust their predictions to remove sensitive items.
Data Integration Landscapes: The Case for Non-optimal Solutions in Network Diffusion Models Nevin, James, Groth, Paul, and Lees, Michael In Computational Science – ICCS 2023 2023 [Abs] [DOI:10.1007/978-3-031-35995-8_35]
The successful application of computational models presupposes access to accurate, relevant, and representative datasets. The growth of public data, and the increasing practice of data sharing and reuse, emphasises the importance of data provenance and increases the need for modellers to understand how data processing decisions might impact model output. One key step in the data processing pipeline is that of data integration and entity resolution, where entities are matched across disparate datasets. In this paper, we present a new formulation of data integration in complex networks that incorporates integration uncertainty. We define an approach for understanding how different data integration setups can impact the results of network diffusion models under this uncertainty, allowing one to systematically characterise potential model outputs in order to create an output distribution that provides a more comprehensive picture.
Proactively Screening Machine Learning Pipelines with ARGUSEYES Schelter, Sebastian, Grafberger, Stefan, Guha, Shubha, Karlas, Bojan, and Zhang, Ce In Companion of the 2023 International Conference on Management of Data 2023 [Abs] [Link] [DOI:10.1145/3555041.3589682] [ 2nd Place Demo ]
Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually only detected in hindsight after deployment, after they caused harm in production. We demonstrate ArgusEyes, a system which enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems early before deployment to production. We demonstrate our system for three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data leakage in a price prediction pipeline, and addressing fairness violations in a credit scoring pipeline.
Seventh Workshop on Data Management for End-to-End Machine Learning (DEEM) Boehm, Matthias, Hulsebos, Madelon, Shankar, Shreya, and Varma, Paroma In Companion of the 2023 International Conference on Management of Data 2023 [Abs] [Link] [DOI:10.1145/3555041.3590819]
The DEEM’23 workshop (Data Management for End-to-End Machine Learning) is held on Sunday June 18th, in conjunction with SIGMOD/PODS 2023. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal to discuss the arising data management issues in ML application scenarios. The workshop solicits regular research papers (10 pages) describing preliminary and ongoing research results, including industrial experience reports of end-to-end ML deployments, related to DEEM topics. In addition, DEEM 2023 has a category for short papers (4 pages) as a forum for sharing interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, preliminary results, and descriptions of system components and tools related to end-to-end ML pipelines. The workshop received 13 high-quality submissions on diverse topics relevant to DEEM, of which 6 regular papers and 7 short papers.
Models and Practice of Neural Table Representations Hulsebos, Madelon, Deng, Xiang, Sun, Huan, and Papotti, Paolo In Companion of the 2023 International Conference on Management of Data 2023 [Abs] [Link] [DOI:10.1145/3555041.3589411]
In the last few years, the natural language processing community witnessed advances in neural representations of free-form text with transformer-based language models (LMs). Given the importance of knowledge available in relational tables, recent research efforts extend LMs by developing neural representations for tabular data. In this tutorial, we present these proposals with three main goals. First, we aim at introducing the potentials and limitations of current models to a database audience. Second, we want the attendees to see the benefit of such line of work in a large variety of data applications. Third, we would like to empower the audience with a new set of tools and to inspire them to tackle some of the important directions for neural table representations, including model and system design, evaluation, application and deployment. To achieve these goals, the tutorial is organized in two parts. The first part covers the background for neural table representations, including a survey of the most important systems. The second part is designed as a hands-on session, where attendees will use their laptop to explore this new framework and test neural models involving text and tabular data.
Provenance Tracking for End-to-End Machine Learning Pipelines Grafberger, Stefan, Groth, Paul, and Schelter, Sebastian In Companion Proceedings of the ACM Web Conference 2023 2023 [Link] [DOI:10.1145/3543873.3587557]
A Simulation Environment and Reinforcement Learning Method for Waste Reduction Jullien, Sami, Ariannezhad, Mozhdeh, Groth, Paul, and Rijke, Maarten Transactions on Machine Learning Research 2023 [Link]
Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century (Dagstuhl Seminar 22372) Groth, Paul, Simperl, Elena, Erp, Marieke, and Vrandečić, Denny Dagstuhl Reports 2023 [Link] [DOI:10.4230/DagRep.12.9.60]
Poster: Towards Pattern-Level Privacy Protection in Distributed Complex Event Processing Lotfian Delouee, Majid, Koldehofe, Boris, and Degeler, Viktoriya In Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems 2023 [Abs] [Link] [DOI:10.1145/3583678.3603278]
In event processing systems, detected event patterns can reveal privacy-sensitive information. In this paper, we propose and discuss how to integrate pattern-level privacy protection in event-based systems. Compared to state-of-the-art approaches, we aim to enforce privacy independent of the particularities of specific operators. We accomplish this by supporting the flexible integration of multiple obfuscation techniques and studying deployment strategies for privacy-enforcing mechanisms. In addition, we share ideas on how to model the adversary’s knowledge to select appropriate obfuscation techniques for the discussed deployment strategies. Initial results indicate that flexibly choosing obfuscation techniques and deployment strategies is essential to conceal privacy-sensitive event patterns accurately.
Parameter Efficient Node Classification on Homophilic Graphs Prieto, Lucas, Boef, Jeroen Den, Groth, Paul, and Cornelisse, Joran Transactions on Machine Learning Research 2023 [Link] [Code]

2022

Relational graph convolutional networks: a closer look Thanapalasingam, Thiviyan, Berkel, Lucas, Bloem, Peter, and Groth, Paul PeerJ Computer Science 2022 [Link] [DOI:10.7717/peerj-cs.1073]
Question Answering with Additive Restrictive Training (QuAART): Question Answering for the Rapid Development of New Knowledge Extraction Pipelines Harper, Corey A., Daniel, Ron, and Groth, Paul In Knowledge Engineering and Knowledge Management (EKAW) 2022 [Abs] [Link] [DOI:10.1007/978-3-031-17105-5_4]
Abstract Numerous studies have explored the use of language models and question answering techniques for knowledge extraction. In most cases, these models are trained on data specific to the new task at hand. We hypothesize that using models trained only on generic question answering data (e.g. SQuAD) is a good starting point for domain specific entity extraction. We test this hypothesis, and explore whether the addition of small amounts of training data can help lift model performance. We pay special attention to the use of null answers and unanswerable questions to optimize performance. To our knowledge, no studies have been done to evaluate the effectiveness of this technique. We do so for an end-to-end entity mention detection and entity typing task on HAnDS and FIGER, two common evaluation datasets for fine grained entity recognition. We focus on fine-grained entity recognition because it is challenging scenario, and because the long tail of types in this task highlights the need for entity extraction systems that can deal with new domains and types. To our knowledge, we are the first system beyond those presented in the original FIGER and HAnDS papers to tackle the task in an end-to-end fashion. Using an extremely small sample from the distantly-supervised HAnDS training data – 0.0015%, or less than 500 passages randomly chosen out of 31 million – we produce a CoNNL F1 score of 73.72 for entity detection on FIGER. Our end-to-end detection and typing evaluation produces macro and micro F1s of 45.11 and 54.75, based on the FIGER evaluation metrics. This work provides a foundation for the rapid development of new knowledge extraction pipelines.
Serenade - Low-Latency Session-Based Recommendation in e-Commerce at Scale Kersbergen, Barrie, Sprangers, Olivier, and Schelter, Sebastian In Proceedings of the 2022 International Conference on Management of Data 2022 [Abs] [Link] [DOI:10.1145/3514221.3517901]
Session-based recommendation predicts the next item with which a user will interact, given a sequence of her past interactions with other items. This machine learning problem targets a core scenario in e-commerce platforms, which aim to recommend interesting items to buy to users browsing the site. Session-based recommenders are difficult to scale due to their exponentially large input space of potential sessions. This impedes offline precomputation of the recommendations, and implies the necessity to maintain state during the online computation of next-item recommendations.We propose VMIS-kNN, an adaptation of a state-of-the-art nearest neighbor approach to session-based recommendation, which leverages a prebuilt index to compute next-item recommendations with low latency in scenarios with hundreds of millions of clicks to search through. Based on this approach, we design and implement the scalable session-based recommender system Serenade, which is in production usage at bol.com, a large European e-commerce platform.We evaluate the predictive performance of VMIS-kNN, and show that Serenade can answer a thousand recommendation requests per second with a 90th percentile latency of less than seven milliseconds in scenarios with millions of items to recommend. Furthermore, we present results from a three week long online A/B test with up to 600 requests per second for 6.5 million distinct items on more than 45 million user sessions from our e-commerce platform. To the best of our knowledge, we provide the first empirical evidence that the superior predictive performance of nearest neighbor approaches to session-based recommendation in offline evaluations translates to superior performance in a real world e-commerce setting.
Towards Data-Centric What-If Analysis for Native Machine Learning Pipelines Grafberger, Stefan, Groth, Paul, and Schelter, Sebastian In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning 2022 [Abs] [Link] [DOI:10.1145/3533028.3533303]
An important task of data scientists is to understand the sensitivity of their models to changes in the data that the models are trained and tested upon. Currently, conducting such data-centric what-if analyses requires significant and costly manual development and testing with the corresponding chance for the introduction of bugs. We discuss the problem of data-centric what-if analysis over whole ML pipelines (including data preparation and feature encoding), propose optimisations that reuse trained models and intermediate data to reduce the runtime of such analysis, and finally conduct preliminary experiments on three complex example pipelines, where our approach reduces the runtime by a factor of up to six.
Responsible Data Management Stoyanovich, Julia, Abiteboul, Serge, Howe, Bill, Jagadish, H. V., and Schelter, Sebastian Communications of the ACM 2022 [Abs] [Link] [DOI:10.1145/3488717]
Perspectives on the role and responsibility of the data-management research community in designing, developing, using, and overseeing automated decision systems.
SlotGAN: Detecting Mentions in Text via Adversarial Distant Learning Daza, Daniel, Cochez, Michael, and Groth, Paul In Proceedings of the Sixth Workshop on Structured Prediction for NLP 2022 [Abs] [Link] [DOI:10.18653/v1/2022.spnlp-1.4]
We present SlotGAN, a framework for training a mention detection model that only requires unlabeled text and a gazetteer. It consists of a generator trained to extract spans from an input sentence, and a discriminator trained to determine whether a span comes from the generator, or from the gazetteer.We evaluate the method on English newswire data and compare it against supervised, weakly-supervised, and unsupervised methods. We find that the performance of the method is lower than these baselines, because it tends to generate more and longer spans, and in some cases it relies only on capitalization. In other cases, it generates spans that are valid but differ from the benchmark. When evaluated with metrics based on overlap, we find that SlotGAN performs within 95% of the precision of a supervised method, and 84% of its recall. Our results suggest that the model can generate spans that overlap well, but an additional filtering mechanism is required.
CITRIS: Causal Identifiability from Temporal Intervened Sequences Lippe, Phillip, Magliacane, Sara, Löwe, Sindy, Asano, Yuki M., Cohen, Taco, and Gavves, Efstratios In Proceedings of the 39th International Conference on Machine Learning, ICML 2022 [arXiv]
Methods Included Crusoe, Michael R., Abeln, Sanne, Iosup, Alexandru, Amstutz, Peter, Chilton, John, Tijanić, Nebojša, Ménager, Hervé, Soiland-Reyes, Stian, Gavrilović, Bogdan, Goble, Carole, and The CWL Community, Communications of the ACM 2022 [Abs] [Link] [DOI:10.1145/3486897]
Standardizing computational reuse and portability with the Common Workflow Language.
Making Canonical Workflow Building Blocks Interoperable across Workflow Languages Soiland-Reyes, Stian, Bayarri, Genís, Andrio, Pau, Long, Robin, Lowe, Douglas, Niewielska, Ania, Hospital, Adam, and Groth, Paul Data Intelligence 2022 [Abs] [Link] [DOI:10.1162/dint_a_00135]
We introduce the concept of Canonical Workflow Building Blocks (CWBB), a methodology of describing and wrapping computational tools, in order for them to be utilised in a reproducible manner from multiple workflow languages and execution platforms. The concept is implemented and demonstrated with the BioExcel Building Blocks library (BioBB), a collection of tool wrappers in the field of computational biomolecular simulation. Interoperability across different workflow languages is showcased through a protein Molecular Dynamics setup transversal workflow, built using this library and run with 5 different Workflow Manager Systems (WfMS). We argue such practice is a necessary requirement for FAIR Computational Workflows and an element of Canonical Workflow Frameworks for Research (CWFR) in order to improve widespread adoption and reuse of computational methods across workflow language barriers.
Letter from the Special Issue Editor Schelter, Sebastian IEEE Data Engineering Bulletin (Special issue on Directions Towards GDPR-Compliant Data Systems and Applications) 2022 [Link]
Defining a Knowledge Graph Development Process Through a Systematic Review Tamašauskaitundefined, Gytundefined, and Groth, Paul ACM Transactios on Software Engineering and Methodology 2022 [Abs] [Link] [DOI:10.1145/3522586]
Knowledge graphs are widely used in industry and studied within the academic community. However, the models applied in the development of knowledge graphs vary. Analysing and providing a synthesis of the commonly used approaches to knowledge graph development would provide researchers and practitioners a better understanding of the overall process and methods involved. Hence, this paper aims to define the overall process of knowledge graph development and its key constituent steps. For this purpose, a systematic review and a conceptual analysis of the literature was conducted. The resulting process was compared to case studies to evaluate its applicability. The proposed process suggests a unified approach and provides guidance for both researchers and practitioners when constructing and managing knowledge graphs.
Packaging research artefacts with RO-Crate Soiland-Reyes, Stian, Sefton, Peter, Crosas, Mercè, Castro, Leyla Jael, Coppens, Frederik, Fernández, José M., Garijo, Daniel, Grüning, Björn, La Rosa, Marco, Leo, Simone, and al., Data Science 2022 [Link] [DOI:10.3233/DS-210053]
Data distribution debugging in machine learning pipelines Grafberger, Stefan, Groth, Paul, Stoyanovich, Julia, and Schelter, Sebastian The VLDB Journal 2022 [Link] [DOI:10.1007/s00778-021-00726-w]
Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation Schröder, Max, Staehlke, Susanne, Groth, Paul, Nebe, J. Barbara, Spors, Sascha, and Krüger, Frank Journal of Biomedical Semantics 2022 [Link] [DOI:10.1186/s13326-021-00257-x]
Screening Native Machine Learning Pipelines with ArgusEyes Schelter, Sebastian, Grafberger, Stefan, Guha, Shubha, Sprangers, Olivier, Karlas, Bojan, and Zhang, Ce In 12th Conference on Innovative Data Systems Research, CIDR 2022, Chaminade, CA, USA, January 9-12, 2022 2022 [Link]
The Semantic Web - 19th International Conference, ESWC 2022, Hersonissos, Crete, Greece, May 29 - June 2, 2022, Proceedings 2022 [Link] [DOI:10.1007/978-3-031-06981-9]
Towards improving Wikidata reuse with emerging patterns Carriero, Valentina Anita, Groth, Paul, and Presutti, Valentina In Proceedings of the 3rd Wikidata Workshop 2022 co-located with the 21st International Semantic Web Conference (ISWC2022), Virtual Event, Hanghzou, China, October 2022 2022 [Link]
Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021 Jiménez-Ruiz, Ernesto, Efthymiou, Vasilis, Chen, Jiaoyan, Cutrona, Vincenzo, Hassanzadeh, Oktie, Sequeda, Juan, Srinivas, Kavitha, Abdelmageed, Nora, Hulsebos, Madelon, Oliveira, Daniela, and Pesquita, Catia 2022 [Link]
AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning Huang, Biwei, Feng, Fan, Lu, Chaochao, Magliacane, Sara, and Zhang, Kun In International Conference on Learning Representations 2022 [Link] [ Spotlight Presentation ]
Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning Vos, David, Döhmen, Till, and Schelter, Sebastian In NeurIPS 2022 First Table Representation Workshop 2022 [Link]
GitSchemas: A Dataset for Automating Relational Data Preparation Tasks Döhmen, Till, Hulsebos, Madelon, Beecks, Christian, and Schelter, Sebastian In 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW) 2022 [Link] [DOI:10.1109/ICDEW55742.2022.00016]
Making Table Understanding Work in Practice Hulsebos, Madelon, Gathani, Sneha, Gale, James, Dillig, Isil, Groth, Paul, and Demiralp, In 12th Conference on Innovative Data Systems Research, CIDR 2022, Chaminade, CA, USA, January 9-12, 2022 2022 [Link]

2021

The non-linear impact of data handling on network diffusion models Nevin, James, Lees, Michael, and Groth, Paul Patterns 2021 [Link] [DOI:10.1016/j.patter.2021.100397]
GraphPOPE: Retaining Structural Graph Information Using Position-aware Node Embeddings Boef, Jeroen Den, Cornelisse, Joran, and Groth, Paul In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2021) 2021 [Link]
Quality Assessment of Knowledge Graph Hierarchies using KG-BERT Szarkowska, Kinga, Moore, Veronique, Vandenbussche, Pierre-Yves, and Groth, Paul In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2021) 2021 [Link]
Perspectives on automated composition of workflows in the life sciences Lamprecht, Anna-Lena, Palmblad, Magnus, Ison, Jon, Schwämmle, Veit, Manir, Mohammad Sadnan Al, Altintas, Ilkay, Baker, Christopher J. O., Amor, Ammar Ben Hadj, Capella-Gutierrez, Salvador, Charonyktakis, Paulos, Crusoe, Michael R., Gil, Yolanda, Goble, Carole, Griffin, Timothy J., Groth, Paul, Ienasescu, Hans, Jagtap, Pratik, Kalaš, Matúš, Kasalica, Vedran, Khanteymoori, Alireza, Kuhn, Tobias, Mei, Hailiang, Ménager, Hervé, Möller, Steffen, Richardson, Robin A., Robert, Vincent, Soiland-Reyes, Stian, Stevens, Robert, Szaniszlo, Szoke, Verberne, Suzan, Verhoeven, Aswin, and Wolstencroft, Katherine F1000Research 2021 [Link] [DOI:10.12688/f1000research.54159.1]
SemEval-2021 Task 8: MeasEval – Extracting Counts and Measurements and their Related Contexts Harper, Corey, Cox, Jessica, Kohler, Curt, Scerri, Antony, Daniel Jr., Ron, and Groth, Paul In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021) 2021 [Abs] [Link] [DOI:10.18653/v1/2021.semeval-1.38] [ SemEval 2021 Best Task Paper ]
We describe MeasEval, a SemEval task of extracting counts, measurements, and related context from scientific documents, which is of significant importance to the creation of Knowledge Graphs that distill information from the scientific literature. This is a new task in 2021, for which over 75 submissions from 25 participants were received. We expect the data developed for this task and the findings reported to be valuable to the scientific knowledge extraction, metrology, and automated knowledge base construction communities.
Further with Knowledge Graphs: Proceedings of the 17th International Conference on Semantic Systems, 6–9 September 2021, Amsterdam, The Netherlands 2021 [Link] [DOI:10.3233/SSW53]
Reinforcement Learning–Based Collective Entity Alignment with Adaptive Features Zeng, Weixin, Zhao, Xiang, Tang, Jiuyang, Lin, Xuemin, and Groth, Paul ACM Trans. Inf. Syst. 2021 [Abs] [Link] [DOI:10.1145/3446428]
Entity alignment (EA) is the task of identifying the entities that refer to the same real-world object but are located in different knowledge graphs (KGs). For entities to be aligned, existing EA solutions treat them separately and generate alignment results as ranked lists of entities on the other side. Nevertheless, this decision-making paradigm fails to take into account the interdependence among entities. Although some recent efforts mitigate this issue by imposing the 1-to-1 constraint on the alignment process, they still cannot adequately model the underlying interdependence and the results tend to be sub-optimal.To fill in this gap, in this work, we delve into the dynamics of the decision-making process, and offer a reinforcement learning (RL)–based model to align entities collectively. Under the RL framework, we devise the coherence and exclusiveness constraints to characterize the interdependence and restrict collective alignment. Additionally, to generate more precise inputs to the RL framework, we employ representative features to capture different aspects of the similarity between entities in heterogeneous KGs, which are integrated by an adaptive feature fusion strategy. Our proposal is evaluated on both cross-lingual and mono-lingual EA benchmarks and compared against state-of-the-art solutions. The empirical results verify its effectiveness and superiority.
Learnings from a Retail Recommendation System on Billions of Interactions at bol.com Kersbergen, B., and Schelter, S. In 2021 IEEE 37th International Conference on Data Engineering (ICDE) 2021 [Link] [DOI:10.1109/ICDE51399.2021.00277]
Inductive Entity Representations from Text via Link Prediction Daza, Daniel, Cochez, Michael, and Groth, Paul In Proceedings of The Web Conference 2021 [arXiv] [DOI:10.1145/3442381.3450141] [Code]
Letter from the Special Issue Editor Schelter, Sebastian IEEE Data Engineering Bulletin (Special issue on Data validation for machine learning models and applications) 2021 [Link]
Complex Query Answering with Neural Link Predictors Arakelyan, Erik, Daza, Daniel, Minervini, Pasquale, and Cochez, Michael In International Conference on Learning Representations (ICLR) 2021 [arXiv] [Link] [ Outstanding Paper Award ICLR 2021 ]
Taming Technical Bias in Machine Learning Pipelines Schelter, Sebastian, and Stoyanovich, Julia IEEE Data Engineering Bulletin (Special Issue on Interdisciplinary Perspectives on Fairness and Artificial Intelligence Systems) 2021 [Link]
Talking datasets – Understanding data sensemaking behaviours Koesten, Laura, Gregory, Kathleen, Groth, Paul, and Simperl, Elena International Journal of Human-Computer Studies 2021 [Abs] [arXiv] [Link] [DOI:10.1016/j.ijhcs.2020.102562]
The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little attention has been paid to a key step in data reuse: the behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 researchers from different disciplines as they summarised and interacted with both familiar and unfamiliar data. We use our findings to identify and detail common patterns of data-centric sensemaking across three clusters of activities that we present as a framework: inspecting data, engaging with content, and placing data within broader contexts. Additionally, we propose design recommendations for tools and documentation practices, which can be used to facilitate sensemaking and subsequent data reuse.
The Challenges of Cross-Document Coreference Resolution for Email Li, Xue, Magliacane, Sara, and Groth, Paul In Proceedings of the 11th on Knowledge Capture Conference 2021 [Abs] [Link] [DOI:10.1145/3460210.3493573]
Long-form conversations such as email are an important source of information for knowledge capture. For tasks such as knowledge graph construction, conversational search, and entity linking, being able to resolve entities from across documents is important. Building on recent work on within document coreference resolution for email, we study for the first time a cross-document formulation of the problem. Our results show that the current state-of-the-art deep learning models for general cross-document coreference resolution are insufficient for email conversations. Our experiments show that the general task is challenging and, importantly for knowledge intensive tasks, coreference resolution models that only treat entity mentions perform worse. Based on these results, we outline the work needed to address this challenging task.
Supporting Ontology Maintenance with Contextual Word Embeddings and Maximum Mean Discrepancy Shroff, Natasha, Vandenbussche, Pierre-Yves, Moore, Véronique, and Groth, Paul In Joint Proceedings of the 2nd International Workshop on Deep Learning meets Ontologies and Natural Language Processing (DeepOntoNLP 2021) & 6th International Workshop on Explainable Sentiment Mining and Emotion Detection (X-SENTIMENT 2021) co-located with co-located with 18th Extended Semantic Web Conference 2021, Hersonissos, Greece, June 6th - 7th, 2021 (moved online) 2021 [Link]
Proceedings of Machine Learning with Symbolic Methods and Knowledge Graphs co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021), Virtual, September 17, 2021 Alam, Mehwish, Ali, Mehdi, Groth, Paul, Hitzler, Pascal, Lehmann, Jens, Paulheim, Heiko, Rettinger, Achim, Sack, Harald, Sadeghi, Afshin, and Tresp, Volker 2021 [Link]
Verifiably Safe Exploration for End-to-End Reinforcement Learning Hunt, Nathan, Fulton, Nathan, Magliacane, Sara, Hoang, Trong Nghia, Das, Subhro, and Solar-Lezama, Armando In Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control 2021 [Abs] [Link] [DOI:10.1145/3447928.3456653] [ Best Paper Award ACM HSCC 2021 ]
Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints. Our benchmark draws from several proposed problem sets for safe learning and includes problems that emphasize challenges such as reward signals that are not aligned with safety constraints. On each of these benchmark problems, our algorithm completely avoids unsafe behavior while remaining competitive at optimizing for as much reward as is safe. We characterize safety constraints in terms of a refinement relation on Markov decision processes - rather than directly constraining the reinforcement learning algorithm so that it only takes safe actions, we instead refine the environment so that only safe actions are defined in the environment’s transition structure. This has pragmatic system design benefits and, more importantly, provides a clean conceptual setting in which we are able to prove important safety and efficiency properties. These allow us to transform the constrained optimization problem of acting safely in the original environment into an unconstrained optimization in a refined environment.
Summary of Tutorials at The Web Conference 2021 West, Robert, Bhagat, Smriti, Groth, Paul, Zitnik, Marinka, Couto, Francisco M., Lisena, Pasquale, Meroño-Peñuela, Albert, Zhao, Xiangyu, Fan, Wenqi, Yin, Dawei, Tang, Jiliang, Shou, Linjun, Gong, Ming, Pei, Jian, Geng, Xiubo, Zhou, Xingjie, Jiang, Daxin, Ricaud, Benjamin, Aspert, Nicolas, Miz, Volodymyr, Dy, Jennifer, Ioannidis, Stratis, Yıldız, undefinedlkay, Rezapour, Rezvaneh, Aref, Samin, Dinh, Ly, Diesner, Jana, Drutsa, Alexey, Ustalov, Dmitry, Popov, Nikita, Baidakova, Daria, Mishra, Shubhanshu, Gopalan, Arjun, Juan, Da-Cheng, Ilharco Magalhaes, Cesar, Ferng, Chun-Sung, Heydon, Allan, Lu, Chun-Ta, Pham, Philip, Yu, George, Fan, Yicheng, Wang, Yueqi, Laurent, Florian, Schraner, Yanick, Scheller, Christian, Mohanty, Sharada, Chen, Jiawei, Wang, Xiang, Feng, Fuli, He, Xiangnan, Teinemaa, Irene, Albert, Javier, Goldenberg, Dmitri, Vasile, Flavian, Rohde, David, Jeunen, Olivier, Benhalloum, Amine, Sakhi, Otmane, Rong, Yu, Huang, Wenbing, Xu, Tingyang, Bian, Yatao, Cheng, Hong, Sun, Fuchun, Huang, Junzhou, Fakhraei, Shobeir, Faloutsos, Christos, Çelebi, Onur, Müller, Martin, Schneider, Manuel, Altunina, Olesia, Wingerath, Wolfram, Wollmer, Benjamin, Gessert, Felix, Succo, Stephan, Ritter, Norbert, Courdier, Evann, Avram, Tudor Mihai, Cvetinovic, Dragan, Tsinadze, Levan, Jose, Johny, Howell, Rose, Koenig, Mario, Defferrard, Michaël, Kenthapadi, Krishnaram, Packer, Ben, Sameki, Mehrnoosh, and Sephus, Nashlie In Companion Proceedings of the Web Conference 2021 2021 [Abs] [Link] [DOI:10.1145/3442442.3453701]
This report summarizes the 23 tutorials hosted at The Web Conference 2021: nine lecture-style tutorials and 14 hands-on tutorials.
HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning Schelter, Sebastian, Grafberger, Stefan, and Dunning, Ted In Proceedings of the 2021 International Conference on Management of Data 2021 [Abs] [Link] [DOI:10.1145/3448016.3457239]
Software systems that learn from user data with machine learning (ML) have become ubiquitous over the last years. Recent law such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). However, this regulation does not only require the deletion of user data from databases, but also applies to ML models that have been learned from the stored data. We therefore argue that ML applications should offer users to unlearn their data from trained models in a timely manner. We explore how fast this unlearning can be done under the constraints imposed by real world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in-place under the removal of a small fraction of training samples without retraining.We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, which is designed to answer unlearning requests with low latency. We detail how to efficiently implement HedgeCut with vectorised operators for decision tree learning. We conduct an experimental evaluation on five privacy-sensitive datasets, where we find that HedgeCut can unlearn training samples with a latency of around 100 microseconds and answers up to 36,000 prediction requests per second, while providing a training time and predictive accuracy similar to widely used implementations of tree-based ML models such as Random Forests.
MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines Grafberger, Stefan, Guha, Shubha, Stoyanovich, Julia, and Schelter, Sebastian In Proceedings of the 2021 International Conference on Management of Data 2021 [Abs] [Link] [DOI:10.1145/3448016.3452759]
Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this wide-spread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. While bias detection cannot be fully automated, computational tools can help pinpoint particular types of data issues.We recently proposed mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. In this demonstration, we show how mlinspect can be used to detect data distribution bugs in a representative pipeline. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation. The library is publicly available at https://github.com/stefan-grafberger/mlinspect.
Automating Data Quality Validation for Dynamic Data Ingestion Redyuk, Sergey, Kaoudi, Zoi, Markl, Volker, and Schelter, Sebastian In Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021 2021 [Link] [DOI:10.5441/002/edbt.2021.07]
JENGA - A Framework to Study the Impact of Data Errors on the Predictions of Machine Learning Models Schelter, Sebastian, Rukat, Tammo, and Biessmann, Felix In Proceedings of the 24th International Conference on Extending Database Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021 2021 [Link] [DOI:10.5441/002/edbt.2021.63]
Lightweight Inspection of Data Preprocessing in Native Machine Learning Pipelines Grafberger, Stefan, Stoyanovich, Julia, and Schelter, Sebastian In 11th Conference on Innovative Data Systems Research, CIDR 2021, Virtual Event, January 11-15, 2021, Online Proceedings 2021 [Link]

2020

Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels Brate, Ryan, Groth, Paul, and Erp, Marieke In Proceedings of the The 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2020 [Abs] [Link]
Environmental factors determine the smells we perceive, but societal factors factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them ‘smell experiences’, offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper, we present two variations on a semi-supervised approach to identify smell experiences in English literature. The combined set of patterns from both implementations offer significantly better performance than a keyword-based baseline.
Dataset Reuse: Toward Translating Principles to Practice Koesten, Laura, Vougiouklis, Pavlos, Simperl, Elena, and Groth, Paul Patterns 2020 [Link] [DOI:10.1016/j.patter.2020.100136]
Effective distributed representations for academic expert search Berger, Mark, Zavrel, Jakub, and Groth, Paul In Proceedings of the First Workshop on Scholarly Document Processing at EMNLP 2020 [Abs] [Link]
Expert search aims to find and rank experts based on a user’s query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance.
Dataset search: a survey Chapman, Adriane, Simperl, Elena, Koesten, Laura, Konstantinidis, George, Ibáñez, Luis-Daniel, Kacprzak, Emilia, and Groth, Paul The VLDB Journal 2020 [arXiv] [Link] [DOI:10.1007/s00778-019-00564-x]
Introduction – FAIR data, systems and analysis Groth, Paul, and Dumontier, Michel Data Science 2020 [Link] [DOI:10.3233/DS-200029]
Fairness-Aware Instrumentation of Preprocessing Pipelines for Machine Learning Yang, Ke, Huang, Biao, Stoyanovich, Julia, and Schelter, Sebastian In Workshop on Human-In-the-Loop Data Analytics (HILDA’20) 2020 [Link] [DOI:10.1145/3398730.3399194]
Towards Entity Spaces Erp, Marieke, and Groth, Paul In Proceedings of The 12th Language Resources and Evaluation Conference 2020 [Abs] [Link]
Entities are a central element of knowledge bases and are important input to many knowledge-centric tasks including text analysis. For example, they allow us to find documents relevant to a specific entity irrespective of the underlying syntactic expression within a document. However, the entities that are commonly represented in knowledge bases are often a simplification of what is truly being referred to in text. For example, in a knowledge base, we may have an entity for Germany as a country but not for the more fuzzy concept of Germany that covers notions of German Population, German Drivers, and the German Government. Inspired by recent advances in contextual word embeddings, we introduce the concept of entity spaces - specific representations of a set of associated entities with near-identity. Thus, these entity spaces provide a handle to an amorphous grouping of entities. We developed a proof-of-concept for English showing how, through the introduction of entity spaces in the form of disambiguation pages, the recall of entity linking can be improved.
Lost or Found? Discovering Data Needed for Research Gregory, Kathleen, Groth, Paul, Scharnhorst, Andrea, and Wyatt, Sally Harvard Data Science Review 2020 [Link] [DOI:10.1162/99608f92.e38165eb]
PANDAcap: A Framework for Streamlining Collection of Full-System Traces Stamatogiannakis, Manolis, Bos, Herbert, and Groth, Paul In EuroSec 2020 [Link] [DOI:10.1145/3380786.3391396] [Code]
Estimating the imageability of words by mining visual characteristics from crawled image data Kastner, Marc A., Ide, Ichiro, Nack, Frank, Kawanishi, Yasutomo, Hirayama, Takatsugu, Deguchi, Daisuke, and Murase, Hiroshi Multimedia Tools and Applications 2020 [Link] [DOI:10.1007/s11042-019-08571-4]
FAIR Data Reuse – the Path through Data Citation Groth, Paul, Cousijn, Helena, Clark, Tim, and Goble, Carole Data Intelligence 2020 [Link] [DOI:10.1162/dint_a_00030]
Message Passing Query Embedding Daza, Daniel, and Cochez, Michael In ICML Workshop - Graph Representation Learning and Beyond 2020 [arXiv] [Link]
The state of altmetrics: a tenth anniversary celebration Altmetric Engineering, , Konkiel, Stacy, Priem, Jason, Adie, Euan, Derrick, Gemma, Didegah, Fereshteh, Groth, Paul, Neylon, Cameron, Shenmeng Xu, , Zahedi, Zohreh, Bowman, Timothy, Vanash M Patel, , Haunschild, Robin, Bornmann, Lutz, Taylor, Mike, Ross, Liesa, Theng, Yin-Leng, Hassan, Saeed-Ul, and Aljohani, Naif R. 2020 [Link] [DOI:10.6084/M9.FIGSHARE.13010000.V2]
CSSA’20: Workshop on Combining Symbolic and Sub-Symbolic Methods and Their Applications Alam, Mehwish, Groth, Paul, Hitzler, Pascal, Paulheim, Heiko, Sack, Harald, and Tresp, Volker In Proceedings of the 29th ACM International Conference on Information & Knowledge Management 2020 [Abs] [Link] [DOI:10.1145/3340531.3414072]
There has been a rapid growth in the use of symbolic representations along with their applications in many important tasks. Symbolic representations, in the form of Knowledge Graphs (KGs), constitute large networks of real-world entities and their relationships. On the other hand, sub-symbolic artificial intelligence has also become a mainstream area of research. This workshop brought together researchers to discuss and foster collaborations on the intersection of these two areas.
ICIDS2020 Panel: Building the Discipline of Interactive Digital Narratives Bernstein, Mark, Palosaari Eladhari, Mirjam, Koenitz, Hartmut, Louchart, Sandy, Nack, Frank, Martens, Chris, Rossi, Giulia Carla, Bosser, Anne-Gwenn, and Millard, David E. In Interactive Storytelling 2020 [Abs] [DOI:10.1007/978-3-030-62516-0_1]
Building our discipline has been an ongoing discussion since the early days of ICIDS. From earlier international joint efforts to integrate research from multiple fields of study to today’s endeavours by researchers to provide scholarly works of reference, the discussion on how to continue building Interactive Digital Narratives as a discipline with its own vocabulary, scope, evaluation and methods is far from over. This year, we have chosen to continue this discussion through a panel in order to explore what are the epistemological implications of the multiple disciplinary roots of our field, and what are the next steps we should take as a community.
Technical Perspective: Query Optimization for Faster Deep CNN Explanations Schelter, Sebastian ACM SIGMOD Record 2020 [Link]
Apache Mahout: Machine Learning on Distributed Dataflow Systems Anil, Robin, Capan, Gokhan, Drost-Fromm, Isabel, Dunning, Ted, Friedman, Ellen, Grant, Trevor, Quinn, Shannon, Ranjan, Paritosh, Schelter, Sebastian, and YÄ±lmazel, Ã–zgÃ¼r Journal of Machine Learning Research 2020 [Link]
Semantic Systems. In the Era of Knowledge Graphs - 16th International Conference on Semantic Systems, SEMANTiCS 2020, Amsterdam, The Netherlands, September 7-10, 2020, Proceedings Blomqvist, Eva, Groth, Paul, Boer, Victor, Pellegrini, Tassilo, Alam, Mehwish, Käfer, Tobias, Kieseberg, Peter, Kirrane, Sabrina, Meroño-Peñuela, Albert, and Pandit, Harshvardhan J. 2020 [Link] [DOI:10.1007/978-3-030-59833-4]
A longitudinal analysis of university rankings Selten, Friso, Neylon, Cameron, Huang, Chun-Kai, and Groth, Paul Quantitative Science Studies 2020 [Link] [DOI:10.1162/qss_a_00052]

2019

How Relevant Is Your Choice? Kolhoff, Lobke, and Nack, Frank In ICIDS 2019. Lecture Notes in Computer Science, vol 11869 2019 [Abs]
With the release of the film Black Mirror: Bandersnatch Netflix entered the area of interactive streamed narratives. We performed a qualitative analysis with 169 Netflix subscribers that had watched the episode. The key findings show (1) participants are initially engaged because of curiosity and the novelty value, and desire to explore the narrative regardless of satisfaction, (2) perceived agency is limited due to arbitrary choices and the lack of meaningful consequences, (3) the overall experience is satisfactory but adaptions are desirable in future design to make full use of the potential of the format.
Transfer Learning for Biomedical Named Entity Recognition with BioBERT Symeonidou, Anthi, Sazonau, Viachaslau, and Groth, Paul In Proceedings of the Posters and Demo Track of the 15th International Conference on Semantic Systems co-located with 15th International Conference on Semantic Systems (SEMANTiCS 2019), Karlsruhe, Germany, September 9th - to - 12th, 2019. 2019 [Link]
Understanding data search as a socio-technical practice Gregory, Kathleen M, Cousijn, Helena, Groth, Paul, Scharnhorst, Andrea, and Wyatt, Sally Journal of Information Science 2019 [Abs] [Link] [DOI:10.1177/0165551519837182]
Open research data are heralded as having the potential to increase effectiveness, productivity and reproducibility in science, but little is known about the actual practices involved in data search. The socio-technical problem of locating data for reuse is often reduced to the technological dimension of designing data search systems. We combine a bibliometric study of the current academic discourse around data search with interviews with data seekers. In this article, we explore how adopting a contextual, socio-technical perspective can help to understand user practices and behaviour and ultimately help to improve the design of data discovery systems.
Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines Gregory, Kathleen, Groth, Paul, Cousijn, Helena, Scharnhorst, Andrea, and Wyatt, Sally Journal of the Association for Information Science and Technology 2019 [Abs] [Link] [DOI:10.1002/asi.24165]
A cross-disciplinary examination of the user behaviors involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data in selected disciplines. Two analytical frameworks, rooted in information retrieval and science and technology studies, are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.
End-to-End Learning for Answering Structured Queries Directly over Text Groth, Paul T., Scerri, Antony, Daniel, Ron, and Allen, Bradley P. In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019) Co-located with the 16th Extended Semantic Web Conference 2019 (ESWC 2019), Portoroz, Slovenia, June 2, 2019. 2019 [arXiv] [Link]

2018

Open Information Extraction on Scientific Text: An Evaluation Groth, Paul T., Lauruhn, Michael, Scerri, Antony, and Daniel, Ron In Proceedings of the 27th International Conference on Computational Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26, 2018 2018 [Link]
Elsevier’s Healthcare Knowledge Graph and the Case for Enterprise Level Linked Data Standards DeJong, Alex, Bord, Radmila, Dowling, Will, Hoekstra, Rinke, Moquin, Ryan, O, Charlie, Samarasinghe, Mevan, Snyder, Paul, Stanley, Craig, Tordai, Anna, Trefry, Michael, and Groth, Paul T. In Proceedings of the ISWC 2018 Posters & Demonstrations, Industry and Blue Sky Ideas Tracks co-located with 17th International Semantic Web Conference (ISWC 2018), Monterey, USA, October 8th - to - 12th, 2018. 2018 [Link]
Use of Internal Testing Data to Help Determine Compensation for Crowdsourcing Tasks Lauruhn, Michael, Groth, Paul T., Harper, Corey A., and Deus, Helena F. In Proceedings of the 2nd International Workshop on Augmenting Intelligence with Humans\--in-\-the-\-Loop co-located with 17th International Semantic Web Conference (ISWC 2018), Monterey, California, October 9th, 2018. 2018 [Link]