BibTeX file with the publications listed below.
2025
-
FAIR Research Objects and computational workflows
Soiland-Reyes, Stian
2025
[Abs]
[Link]
This PhD thesis explores the topics of RO-Crate, FAIR Digital Objects (FDOs), and computational workflows, in order to examine research questions on how these can be implemented and integrated using Linked Data approaches – forming “FAIR Research Objects”. The background covers the evolution of the Semantic Web, Linked Data, and FAIR Digital Objects, which are evaluated against the FAIR principles and several frameworks, to consider these technologies as potential middleware for a global distributed object system that enable machine-actionable research outputs. This work introduces the community-developed method RO-Crate for packaging research artefacts with their contextual information, relationships and metadata – utilising Linked Data standards that are simplified and documented for pragmatic use by software developers. The tension between flexibility for implementations and rigidity of semantic constraints is explored, and demonstrated by profiles of RO-Crate across research domains such as bioinformatics, regulatory sciences, biodiversity and digital humanities. Computational workflows, for reproducible data analysis across execution platforms, are examined as potential FAIR Digital Objects, considering them both as shareable research outputs as well as a part of provenance of computational results, captured in a Workflow Run Crate. This thesis explores the emerging ecosystem of FAIR Digital Objects and how it can learn from the community development of RO-Crate to carefully adapt ’just enough’ of Linked Data technologies, balancing flexibility and predictability. The main findings of this thesis emphasise community-driven pragmatic solutions over strict semantic correctness, supporting advancement of the FAIR principles through practical and interoperable implementations of Web standards.
2024
-
Exploiting Subgraphs and Attributes for Representation Learning on Knowledge Graphs
Daza Cruz, Daniel Fernando
2024
[Abs]
[DOI:10.5463/thesis.823]
Knowledge graphs (KGs) are data structures that explicitly represent entities and the relations between them over some domain. They can be used to store information about people and their relatives or birth locations, about organizations and the countries where they are located, or about chemical compounds and their interactions with proteins in the human body. In this thesis, we investigate the problem of representation learning on knowledge graphs, which consists of learning vector representations of entities and relations that capture the information contained in the graph. These learned representations are useful for capturing patterns that occur in the graph but are not explicitly stated in it. Several methods for representation learning on KGs are based on predicting a link between a pair of entities in the graph. While efficient, this approach forgoes useful learning signals from other sources that exhibit patterns, such as subgraphs involving multiple entities, and attributes of entities. Examples of attributes are textual descriptions of people, or molecular structures of chemical compounds. The goal of this thesis is to explore such learning signals beyond pairwise interactions of entities. Our findings provide evidence that subgraphs and attributes are powerful signals from which we can learn representations in KGs. Not only do they yield improved representations, but they also broaden the range of tasks in which they can be applied. We hope that this serves as a motivation for learning from further sources of information that are already available in KGs, but whose potential is yet to be discovered.
-
The Ramifications of Data Handling for Computational Models
Nevin, James Graham
2024
[Link]
-
Understanding the Impact of Entity Linking on the Topology of Entity Co-occurrence Networks for Social Media Analysis
Nevin, James,
Zhang, Pengyu,
Dimitrov, Dimitar,
Lees, Michael,
Groth, Paul,
and Dietze, Stefan
In Knowledge Engineering and Knowledge Management (EKAW)
2024
[Link]
[DOI:10.1007/978-3-031-77792-9_5]
[Code]
-
Influence Beyond Similarity: A Contrastive Learning Approach to Object Influence Retrieval
Liberatore, Teresa,
Groth, Paul,
Kackovic, Monika,
and Wijnberg, Nachoem
In Knowledge Engineering and Knowledge Management (EKAW)
2024
[Link]
[DOI:10.1007/978-3-031-77792-9_3]
[Code]
-
DiTEC: Digital Twin for Evolutionary Changes in Water Distribution Networks
Degeler, Victoria,
Hadadian, Mostafa,
Karabulut, Erkan,
Lazovik, Alexander,
Loo, Hester,
Tello, Andrés,
and Truong, Huy
In Leveraging Applications of Formal Methods, Verification and Validation. Application Areas
2024
[Abs]
[Link]
[Code]
Conventional digital twins (DT) for critical infrastructures are widely used to model and simulate the system’s state. But fundamental environment changes bring challenges for DT adaptation to new conditions, leading to a progressively decreasing correspondence of the DT to its physical counterpart. This paper introduces the DiTEC system, a Digital Twin for Evolutionary Changes in Water Distribution Networks (WDN). This framework combines novel techniques, including semantic rule learning, graph neural network-based state estimation, and adaptive model selection, to ensure that changes are adequately detected, processed and the DT is updated to the new state. The DiTEC system is tested on the Dutch Oosterbeek region WDN, with results showing the superiority of the approach compared to traditional methods.
-
TIGER: Temporally Improved Graph Entity Linker
Zhang, Pengyu,
Cao, Congfeng,
and Groth, Paul
In 27th European Conference on Artificial Intelligence (ECAI 24)
2024
[Link]
[DOI:10.3233/faia240933]
[Code]
-
CYCLE: Cross-Year Contrastive Learning in Entity-Linking
Zhang, Pengyu,
Cao, Congfeng,
Zaporojets, Klim,
and Groth, Paul
In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM 24)
2024
[Abs]
[Link]
[DOI:10.1145/3627673.3679702]
[Code]
Knowledge graphs constantly evolve with new entities emerging, existing definitions being revised, and entity relationships changing. These changes lead to temporal degradation in entity linking models, characterized as a decline in model performance over time. To address this issue, we propose leveraging graph relationships to aggregate information from neighboring entities across different time periods. This approach enhances the ability to distinguish similar entities over time, thereby minimizing the impact of temporal degradation. We introduce CYCLE: Cross-Year Contrastive Learning for Entity-Linking. This model employs a novel graph contrastive learning method to tackle temporal performance degradation in entity linking tasks. Our contrastive learning method treats newly added graph relationships as positive samples and newly removed ones as negative samples. This approach helps our model effectively prevent temporal degradation, achieving a 13.90% performance improvement over the state-of-the-art from 2023 when the time gap is one year, and a 17.79% improvement as the gap expands to three years. Further analysis shows that CYCLE is particularly robust for low-degree entities, which are less resistant to temporal degradation due to their sparse connectivity, making them particularly suitable for our method. The code and data are made available at https://github.com/pengyu-zhang/CYCLE-Cross-Year-Contrastive-Learning-in-Entity-Linking
-
Testing prompt engineering methods for knowledge extraction from text
Polat, Fina,
Tiddi, Ilaria,
and Groth, Paul
Semantic Web
2024
[Link]
[DOI:10.3233/sw-243719]
-
How different is different? Systematically identifying distribution shifts and their impacts in NER datasets
Li, Xue,
and Groth, Paul
Language Resources and Evaluation
2024
[Link]
[DOI:10.1007/s10579-024-09754-8]
-
A Sparsity Principle for Partially Observable Causal Representation Learning
Xu, Danru,
Yao, Dingling,
Lachapelle, Sébastien,
Taslakian, Perouz,
Kügelgen, Julius,
Locatello, Francesco,
and Magliacane, Sara
International Conference on Machine Learning (ICML)
2024
[Link]
-
SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection
Allen, Bradley P.,
Polat, Fina,
and Groth, Paul
In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
2024
[Link]
[Code]
-
Towards Federated LLM-Powered CEP Rule Generation and Refinement
Lotfian Delouee, Majid,
Pernes, Daria G.,
Degeler, Victoria,
and Koldehofe, Boris
In The 18th ACM International Conference on Distributed and Event-Based Systems (DEBS’24)
2024
[Abs]
[Link]
In traditional event processing systems, patterns representing situations of interest are typically defined by domain experts or learned from historical data. These approaches often make rule generation reactive, time-consuming, and susceptible to human error. In this paper, we propose and investigate the integration of large language models (LLMs) to automate and accelerate query translation and rule generation in event processing systems. Furthermore, we introduce a federated learning schema to refine the initially generated rules by examining them over distributed event streams, ensuring greater accuracy and adaptability. Preliminary results demonstrate the potential of LLMs as a key component in proactively expediting the autonomous rule-generation process. Moreover, our findings suggest that employing customized prompt engineering techniques can further enhance the quality of the generated rules.
-
Towards Efficient Data Wrangling with LLMs using Code Generation
Li, Xue,
and Döhmen, Till
In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning
2024
[Abs]
[Link]
[DOI:10.1145/3650203.3663334]
While LLM-based data wrangling approaches that process each row of data have shown promising benchmark results, computational costs still limit their suitability for real-world use cases on large datasets. We revisit code generation using LLMs for various data wrangling tasks, which show promising results particularly for data transformation tasks (up to 37.2 points improvement on F1 score) at much lower computational costs. We furthermore identify shortcomings of code generation methods especially for semantically challenging tasks, and consequently propose an approach that combines program generation with a routing mechanism using LLMs.
-
Towards Interactively Improving ML Data Preparation Code via "Shadow Pipelines"
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning
2024
[Abs]
[Link]
[DOI:10.1145/3650203.3663327]
Data scientists develop ML pipelines in an iterative manner: they repeatedly screen a pipeline for potential issues, debug it, and then revise and improve its code according to their findings. However, this manual process is tedious and error-prone. Therefore, we propose to support data scientists during this development cycle with automatically derived interactive suggestions for pipeline improvements. We discuss our vision to generate these suggestions with so-called shadow pipelines, hidden variants of the original pipeline that modify it to auto-detect potential issues, try out modifications for improvements, and suggest and explain these modifications to the user. We envision to apply incremental view maintenance-based optimisations to ensure low-latency computation and maintenance of the shadow pipelines. We conduct preliminary experiments to showcase the feasibility of our envisioned approach and the potential benefits of our proposed optimisations.
-
Large-Scale Multipurpose Benchmark Datasets For Assessing Data-Driven Deep Learning Approaches For Water Distribution Networks
Tello, Andrés,
Truong, Huy,
Lazovik, Alexander,
and Degeler, Victoria
In Engineering Proceedings
2024
[Abs]
[Link]
[Data]
Currently, the number of common benchmark datasets that researchers can use straight away for assessing data-driven deep learning approaches is very limited. Most studies provide data as configuration files. It is still up to each practitioner to follow a particular data generation method and run computationally intensive simulations to obtain usable data for model training and evaluation. In this work, we provide a collection of datasets that includes several small and medium size publicly available Water Distribution Networks (WDNs), including Anytown, Modena, Balerma, C-Town, D-Town, L-Town, Ky1, Ky6, Ky8, Ky10, and Ky13. In total 1,394,400 hours of WDNs data operating under normal conditions is made available to the community.
-
Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models
Allen, Bradley P.,
and Groth, Paul T.
In Proceedings of European Semantic Web Conference Special Track on Large Language Models for Knowledge Engineering
2024
[Link]
-
Retrieval-based Question Answering with Passage Expansion Using a Knowledge Graph
Kruit, Benno,
Xu, Yiming,
and Kalo, Jan-Christoph
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
2024
[Abs]
[Link]
Recent advancements in dense neural retrievers and language models have led to large improvements in state-of-the-art approaches to open-domain Question Answering (QA) based on retriever-reader architectures. However, issues stemming from data quality and imbalances in the use of dense embeddings have hindered performance, particularly for less common entities and facts. To tackle these problems, this study explores a multi-modal passage retrieval model’s potential to bolster QA system performance. This study poses three key questions: (1) Can a distantly supervised question-relation extraction model enhance retrieval using a knowledge graph (KG), compensating for dense neural retrievers’ shortcomings with rare entities? (2) How does this multi-modal approach compare to existing QA systems based on textual features? (3) Can this QA system alleviate poor performance on less common entities on common benchmarks? We devise a multi-modal retriever combining entity features and textual data, leading to improved retrieval precision in some situations, particularly for less common entities. Experiments across different datasets confirm enhanced performance for entity-centric questions, but challenges remain in handling complex generalized questions.
-
Directions Towards Efficient and Automated Data Wrangling with Large Language Models
Zhang, Zeyu,
Groth, Paul,
Calixto, Iacer,
and Schelter, Sebastian
In 2024 IEEE 40th International Conference on Data Engineering Workshops (ICDEW)
2024
[Link]
[DOI:10.1109/ICDEW61823.2024.00044]
-
SchemaPile: A Large Collection of Relational Database Schemas
Döhmen, Till,
Geacu, Radu,
Hulsebos, Madelon,
and Schelter, Sebastian
Proc. ACM Manag. Data
2024
[Abs]
[Link]
[DOI:10.1145/3654975]
Access to fine-grained schema information is crucial for understanding how relational databases are designed and used in practice, and for building systems that help users interact with them. Furthermore, such information is required as training data to leverage the potential of large language models (LLMs) for improving data preparation, data integration and natural language querying. Existing single-table corpora such as GitTables provide insights into how tables are structured in-the-wild, but lack detailed schema information about how tables relate to each other, as well as metadata like data types or integrity constraints. On the other hand, existing multi-table (or database schema) datasets are rather small and attribute-poor, leaving it unclear to what extent they actually represent typical real-world database schemas. In order to address these challenges, we present SchemaPile, a corpus of 221,171 database schemas, extracted from SQL files on GitHub. It contains 1.7 million tables with 10 million column definitions, 700 thousand foreign key relationships, seven million integrity constraints, and data content for more than 340 thousand tables. We conduct an in-depth analysis on the millions of schema metadata properties in our corpus, as well as its highly diverse language and topic distribution. In addition, we showcase the potential of the corpus to improve a variety of data management applications, e.g., fine-tuning LLMs for schema-only foreign key detection, improving CSV header detection and evaluating multi-dialect SQL parsers. We publish the code and data for recreating SchemaPile and a permissively licensed subset SchemaPile-Perm.
-
Data Debugging with Shapley Importance over Machine Learning Pipelines
Karlaš, Bojan,
Dao, David,
Interlandi, Matteo,
Schelter, Sebastian,
Wu, Wentao,
and Zhang, Ce
In The Twelfth International Conference on Learning Representations
2024
[Link]
-
Multi-View Causal Representation Learning with Partial Observability
Yao, Dingling,
Xu, Danru,
Lachapelle, Sébastien,
Magliacane, Sara,
Taslakian, Perouz,
Martius, Georg,
Kügelgen, Julius,
and Locatello, Francesco
In The Twelfth International Conference on Learning Representations
2024
[Link]
[Spotlight Presentation]
-
Driving Towards Efficiency: Adaptive Resource-aware Clustered Federated Learning in Vehicular Networks
Khalil, Ahmad,
Lotfian Delouee, Majid,
Degeler, Victoria,
Meuser, Tobias,
Fernandez Anta, Antonio,
and Koldehofe, Boris
In The 22nd Mediterranean Communication and Computer Networking Conference (MedComNet’24)
2024
[Abs]
[Link]
Guaranteeing precise perception for fully autonomous driving in diverse driving conditions requires continuous improvement and training. In vehicular networks, federated learning (FL) facilitates this by enabling model training without sharing raw sensory data. As an extension, clustered FL reduces communication overhead and aligns well with the dynamic nature of these networks. However, current literature on this topic does not consider critical dimensions of FL, including (1) the correlation between perception performance and the networking overhead, (2) the limited vehicle storage, (3) the need for training with freshly captured data, and (4) the impact of non-IID data and varying traffic densities. To fill these research gaps, we introduce AR-CFL, an Adaptive Resource-aware Clustered Federated Learning framework. AR-CFL utilizes clustered FL to collectively model the environment of connected vehicles, integrating models from all vehicles and ensuring universal accessibility to the refined model. AR-CFL dynamically enhances system efficiency by adaptively adjusting the number of clusters and specific in-cluster participant selection strategies. Using AR-CFL, we systematically study the scenario of online car detection model training on non-IID data across varied conditions. The evaluation results highlight the robust detection performance exhibited by the trained model employing the clustered FL approach, despite the constraints posed by limited vehicle storage capacity. Furthermore, our investigation unveils superior training performance with clustered FL in comparison to specific classical FL scenarios, increasing the training efficiency in terms of participating nodes by up to 25% and reducing cellular communication by 33%.
-
Evaluating FAIR Digital Object and Linked Data as distributed object systems
Soiland-Reyes, Stian,
Goble, Carole,
and Groth, Paul
PeerJ Computer Science
2024
[Abs]
[Link]
[DOI:10.7717/peerj-cs.1781]
[Data]
FAIR Digital Object (FDO) is an emerging concept that is highlighted by European Open Science Cloud (EOSC) as a potential candidate for building an ecosystem of machine-actionable research outputs. In this work we systematically evaluate FDO and its implementations as a global distributed object system, by using five different conceptual frameworks that cover interoperability, middleware, FAIR principles, EOSC requirements and FDO guidelines themselves. We compare the FDO approach with established Linked Data practices and the existing Web architecture, and provide a brief history of the Semantic Web while discussing why these technologies may have been difficult to adopt for FDO purposes. We conclude with recommendations for both Linked Data and FDO communities to further their adaptation and alignment.
-
Ontologies in digital twins: A systematic literature review
Karabulut, Erkan,
Pileggi, Salvatore F.,
Groth, Paul,
and Degeler, Victoria
Future Generation Computer Systems
2024
[Link]
[DOI:10.1016/j.future.2023.12.013]
[Data]
-
Empirical ontology design patterns and shapes from Wikidata
Carriero, Valentina Anita,
Groth, Paul,
and Presutti, Valentina
Semantic Web
2024
[Link]
[DOI:10.3233/sw-243613]
-
Assisted design of data science pipelines
Redyuk, Sergey,
Kaoudi, Zoi,
Schelter, Sebastian,
and Markl, Volker
The VLDB Journal
2024
[Link]
[DOI:10.1007/s00778-024-00835-2]
-
Table Representation Learning
Hulsebos, Madelon
2024
[Link]
-
Domain Generalization in Time Series Forecasting
Deng, Songgaojun,
Sprangers, Olivier,
Li, Ming,
Schelter, Sebastian,
and Rijke, Maarten
ACM Trans. Knowl. Discov. Data
2024
[Abs]
[Link]
[DOI:10.1145/3643035]
Domain generalization aims to design models that can effectively generalize to unseen target domains by learning from observed source domains. Domain generalization poses a significant challenge for time series data, due to varying data distributions and temporal dependencies. Existing approaches to domain generalization are not designed for time series data, which often results in suboptimal or unstable performance when confronted with diverse temporal patterns and complex data characteristics. We propose a novel approach to tackle the problem of domain generalization in time series forecasting. We focus on a scenario where time series domains share certain common attributes and exhibit no abrupt distribution shifts. Our method revolves around the incorporation of a key regularization term into an existing time series forecasting model: domain discrepancy regularization. In this way, we aim to enforce consistent performance across different domains that exhibit distinct patterns. We calibrate the regularization term by investigating the performance within individual domains and propose the domain discrepancy regularization with domain difficulty awareness. We demonstrate the effectiveness of our method on multiple datasets, including synthetic and real-world time series datasets from diverse domains such as retail, transportation, and finance. Our method is compared against traditional methods, deep learning models, and domain generalization approaches to provide comprehensive insights into its performance. In these experiments, our method showcases superior performance, surpassing both the base model and competing domain generalization models across all datasets. Furthermore, our method is highly general and can be applied to various time series models.
-
Zero-Shot Topic Classification of Column Headers: Leveraging LLMs for Metadata Enrichment
Martorana, Margherita,
Kuhn, Tobias,
Stork, Lise,
and Ossenbruggen, Jacco
2024
[Link]
[DOI:10.3233/SSW240006]
-
Graph Neural Networks for Pressure Estimation in Water Distribution Systems
Truong, Huy,
Tello, Andrés,
Lazovik, Alexander,
and Degeler, Victoria
Water Resources Research
2024
[Link]
-
Large-Scale Forecasting of Electric Vehicle Charging Demand Using Global Time Series Modeling
Etten, Tijmen,
Degeler, Victoria,
and Luo, Ding
In Proceedings of the 10th International Conference on Vehicle Technology and Intelligent Transport Systems
2024
[Link]
[DOI:10.5220/0012555400003702]
-
Standardizing Knowledge Engineering Practices with a Reference Architecture
Allen, Bradley P.,
and Ilievski, Filip
Transactions on Graph Data and Knowledge
2024
[Link]
[DOI:10.4230/TGDK.2.1.5]
-
Too Good To Be True: accuracy overestimation in (re)current practices for Human Activity Recognition
Tello, Andrés,
Degeler, Victoria,
and Lazovik, Alexander
In 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)
2024
[DOI:10.1109/PerComWorkshops59983.2024.10503465]
-
Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making
Guha, Shubha,
Khan, Falaah Arif,
Stoyanovich, Julia,
and Schelter, Sebastian
IEEE Transactions on Knowledge and Data Engineering
2024
[Link]
[DOI:10.1109/TKDE.2024.3365524]
-
Red Onions, Soft Cheese and Data: From Food Safety to Data Traceability for Responsible AI
Grafberger, Stefan,
Zhang, Zeyu,
Schelter, Sebastian,
and Zhang, Ce
IEEE Data Engineering Bulletin
2024
[Link]
2023
-
APP-CEP: Adaptive Pattern-level Privacy Protection in Complex Event Processing Systems
Lotfian Delouee, Majid,
Degeler, Victoria,
Amthor, Peter,
and Koldehofe, Boris
In 10th International Conference on Information Systems Security and Privacy (ICISSP’24)
2023
[Abs]
[Link]
Although privacy-preserving mechanisms endeavor to safeguard sensitive information at the attribute level, detected event patterns can still disclose privacy-sensitive knowledge in distributed complex event processing systems (DCEP). Events might not be inherently sensitive, but their aggregation into a pattern could still breach privacy. In this paper, we study in the context of APP-CEP the problem of integrating pattern-level privacy in event-based systems by selective assignment of obfuscation techniques to conceal private information. Compared to state-of-the-art techniques, we seek to enforce privacy independent of the actual events in streams. To support this, we acquire queries and privacy requirements using CEP-like patterns. The protection of privacy is accomplished through generating pattern dependency graphs, leading to dynamically appointing those techniques that have no consequences on detecting other sensitive patterns, as well as non-sensitive patterns required to provide acceptable Quality of Service. Besides, we model the knowledge that might be possessed by potential adversaries to violate privacy and its impacts on the obfuscation procedure. We assessed the performance of APP-CEP in a real-world scenario involving an online retailer’s transactions. Our evaluation results demonstrate that APP-CEP successfully provides a privacy-utility trade-off. Modeling the background knowledge also effectively prevents adversaries from realizing the modifications in the input streams.
-
Observatory: Characterizing Embeddings of Relational Tables
Cong, Tianji,
Hulsebos, Madelon,
Sun, Zhenjie,
Groth, Paul,
and Jagadish, H. V.
Proceedings of the VLDB Endowment
2023
[Link]
[DOI:10.14778/3636218.3636237]
-
Large Language Models and Knowledge Graphs: Opportunities and Challenges
Pan, Jeff Z.,
Razniewski, Simon,
Kalo, Jan-Christoph,
Singhania, Sneha,
Chen, Jiaoyan,
Dietze, Stefan,
Jabeen, Hajira,
Omeliyanenko, Janna,
Zhang, Wen,
Lissandrini, Matteo,
Biswas, Russa,
Melo, Gerard,
Bonifati, Angela,
Vakaj, Edlira,
Dragoni, Mauro,
and Graux, Damien
Transactions on Graph Data and Knowledge
2023
[Link]
[DOI:10.4230/TGDK.1.1.2]
-
Evaluating the Knowledge Base Completion Potential of GPT
Veseli, Blerta,
Razniewski, Simon,
Kalo, Jan-Christoph,
and Weikum, Gerhard
In Findings of the Association for Computational Linguistics: EMNLP 2023
2023
[Abs]
[Link]
[DOI:10.18653/v1/2023.findings-emnlp.426]
Structured knowledge bases (KBs) are an asset for search engines and other applications but are inevitably incomplete. Language models (LMs) have been proposed for unsupervised knowledge base completion (KBC), yet, their ability to do this at scale and with high accuracy remains an open question. Prior experimental studies mostly fall short because they only evaluate on popular subjects, or sample already existing facts from KBs. In this work, we perform a careful evaluation of GPT’s potential to complete the largest public KB: Wikidata. We find that, despite their size and capabilities, models like GPT-3, ChatGPT and GPT-4 do not achieve fully convincing results on this task. Nonetheless, it provides solid improvements over earlier approaches with smaller LMs. In particular, we show that it is feasible to extend Wikidata by 27M facts at 90% precision.
-
Knowledge Engineering Using Large Language Models
Allen, Bradley P.,
Stork, Lise,
and Groth, Paul
Transactions on Graph Data and Knowledge
2023
[Link]
[DOI:10.4230/TGDK.1.1.3]
-
BioBLP: a modular framework for learning on multimodal biomedical knowledge graphs
Daza, Daniel,
Alivanistos, Dimitrios,
Mitra, Payal,
Pijnenburg, Thom,
Cochez, Michael,
and Groth, Paul
Journal of Biomedical Semantics
2023
[Link]
[DOI:10.1186/s13326-023-00301-y]
[Code]
[Data]
-
Adapting Neural Link Predictors for Data-Efficient Complex Query Answering
Arakelyan, Erik,
Minervini, Pasquale,
Daza, Daniel,
Cochez, Michael,
and Augenstein, Isabelle
In Thirty-seventh Conference on Neural Information Processing Systems
2023
[arXiv]
[Link]
-
A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference
Krieken, Emile,
Thanapalasingam, Thiviyan,
Tomczak, Jakub M.,
Harmelen, Frank Van,
and Teije, Annette Ten
In Thirty-seventh Conference on Neural Information Processing Systems
2023
[arXiv]
[Link]
-
Preface: LM-KBC Challenge 2023
Singhania, Sneha,
Kalo, Jan-Christoph,
Razniewski, Simon,
and Pan, Jeff Z.
In Joint proceedings of the 1st workshop on Knowledge Base Construction from Pre-Trained Language Models (KBC-LM) and the 2nd challenge on Language Models for Knowledge Base Construction (LM-KBC)
2023
[Link]
-
Do Instruction-tuned Large Language Models Help with Relation Extraction?
Li, Xue,
Polat, Fina,
and Groth, Paul
In KBC-LM’23: Knowledge Base Construction from Pre-trained Language Models workshop at ISWC 2023
2023
[Link]
[Code]
-
Knowledge-centric Prompt Composition for Knowledge Base Construction from Pre-trained Language Models
Li, Xue,
Hughes, Anthony,
Llugiqi, Majlinda,
Polat, Fina,
Groth, Paul,
and Ekaputra, Fajar J.
In KBC-LM’23: Knowledge Base Construction from Pre-trained Language Models workshop at ISWC 2023
2023
[Link]
[Code]
-
Semantic Association Rule Learning from Time Series Data and Knowledge Graphs
Karabulut, Erkan,
Degeler, Victoria,
and Groth, Paul
In SemIIM’23: 2nd International Workshop on Semantic Industrial Information Modelling co-located with 22nd International Semantic Web Conference (ISWC 2023)
2023
[arXiv]
[Link]
-
Improving Graph-to-Text Generation Using Cycle Training
Polat, Fina,
Tiddi, Ilaria,
Groth, Paul,
and Vossen, Piek
In Proceedings of the 4th Conference on Language, Data and Knowledge
2023
[Link]
-
Mlwhatif: What If You Could Stop Re-Implementing Your Machine Learning Pipeline Analyses over and Over?
Grafberger, Stefan,
Guha, Shubha,
Groth, Paul,
and Schelter, Sebastian
Proc. VLDB Endow.
2023
[Abs]
[Link]
[DOI:10.14778/3611540.3611606]
[Code]
Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this variant to see how the change impacts the pipeline’s output score. We recently proposed mlwhatif, a library that enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. We demonstrate how data scientists can leverage mlwhatif for a variety of pipelines and three different what-if analyses focusing on the robustness of a pipeline against data errors, the impact of data cleaning operations, and the impact of data preprocessing operations on fairness. In particular, we demonstrate step-by-step how mlwhatif generates and optimizes the required execution plans for the pipeline analyses. Our library is publicly available at https://github.com/stefan-grafberger/mlwhatif.
-
Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring
Hu, Qingzhi,
Daza, Daniel,
Swinkels, Laurens,
Usaite, Kristina,
Hoen, Robbert-Jan,
and Groth, Paul
In KDD Fragile Earth Workshop
2023
[Link]
[DOI:10.48550/arXiv.2308.02622]
-
An approach for analysing the impact of data integration on complex network diffusion models
Nevin, James,
Groth, Paul,
and Lees, Michael
Journal of Complex Networks
2023
[Link]
[DOI:10.1093/comnet/cnad025]
[Code]
-
Data journeys: Explaining AI workflows through abstraction
Daga, Enrico,
and Groth, Paul
Semantic Web
2023
[Abs]
[Link]
[DOI:10.3233/sw-233407]
Artificial intelligence systems are not simply built on a single dataset or trained model. Instead, they are made by complex data science workflows involving multiple datasets, models, preparation scripts, and algorithms. Given this complexity, in order to understand these AI systems, we need to provide explanations of their functioning at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys from these workflows. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. Finally, we report on a user survey to reflect on the challenges and opportunities presented by computational data journeys for explainable AI.
-
Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
Proc. ACM Manag. Data
2023
[Abs]
[Link]
[DOI:10.1145/3589273]
Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this pipeline variant to see how the change impacts the pipeline’s output score. The application of existing analysis techniques to ML pipelines is technically challenging as they are hard to integrate into existing pipeline code and their execution introduces large overheads due to repeated work. We propose mlwhatif to address these integration and efficiency challenges for data-centric what-if analyses on ML pipelines. mlwhatif enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. Our approach employs pipeline patches to specify changes to the data, operators and models of a pipeline. Based on these patches, we define a multi-query optimizer for efficiently executing the resulting pipeline variants jointly, with four subsumption-based optimization rules. Subsequently, we detail how to implement the pipeline variant generation and optimizer of mlwhatif. For that, we instrument native ML pipelines written in Python to extract dataflow plans with re-executable operators. We experimentally evaluate mlwhatif, and find that its speedup scales linearly with the number of pipeline variants in applicable cases, and is invariant to the input data size. In end-to-end experiments with four analyses on more than 60 pipelines, we show speedups of up to 13x compared to sequential execution, and find that the speedup is invariant to the model and featurization in the pipeline. Furthermore, we confirm the low instrumentation overhead of mlwhatif.
-
Self-Contained Entity Discovery from Captioned Videos
Ayoughi, Melika,
Mettes, Pascal,
and Groth, Paul
ACM Trans. Multimedia Comput. Commun. Appl.
2023
[Abs]
[Link]
[DOI:10.1145/3583138]
This article introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g., faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating entities with labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame–caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks, SC-Friends and SC-BBT, based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.
-
GitTables: A Large-Scale Corpus of Relational Tables
Hulsebos, Madelon,
Demiralp, Çagatay,
and Groth, Paul
Proc. ACM Manag. Data
2023
[Abs]
[Link]
[DOI:10.1145/3588710]
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.
-
AQuA-CEP: Adaptive Quality-Aware Complex Event Processing in the Internet of Things
Lotfian Delouee, Majid,
Koldehofe, Boris,
and Degeler, Viktoriya
In The 17th ACM International Conference on Distributed Event-Based Systems (DEBS 2023)
2023
[Abs]
[Link]
Sensory data profoundly influences the quality of detected events in a distributed complex event processing system (DCEP). Since each sensor’s status is unstable at runtime, a single sensing assignment is often insufficient to fulfill the consumer’s quality requirements. In this paper, we study in the context of AQuA-CEP the problem of dynamic quality monitoring and adaptation of complex event processing by active integration of suitable data sources. To support this, in AQuA-CEP, queries to detect complex events are supplemented with consumer-definable quality policies that are evaluated and used to autonomously select (or even configure) suitable data sources of the sensing infrastructure. In addition, we studied different forms of expressing quality policies and analyzed how it affects the quality monitoring process. Various modes of evaluating and applying quality-related adaptations and their impacts on correlation efficiency are addressed, too. We assessed the performance of AQuA-CEP in IoT scenarios by utilizing the notion of the quality policy alongside the query processing adaptation using knowledge derived from quality monitoring. The results show that AQuA-CEP can improve the performance of DCEP systems in terms of the quality of results while fulfilling the consumer’s quality requirements. Quality-based adaptation can also increase the network’s lifetime by optimizing the sensor’s energy consumption due to efficient data source selection.
-
E2EG: End-to-End Node Classification Using Graph Topology and Text-based Node Attributes
Dinh, Tu Anh,
Boef, Jeroen,
Cornelisse, Joran,
and Groth, Paul
In 2023 IEEE International Conference on Data Mining Workshops (ICDMW)
2023
[DOI:10.1109/ICDMW60847.2023.00142]
-
Towards Declarative Systems for Data-Centric Machine Learning
Grafberger, Stefan,
Karlaš, Bojan,
Groth, Paul,
and Schelter, Sebastian
In Proceedings of the Data-Centric Machine Learning Research workshop (DMLR) at ICML, 2023
2023
[Link]
-
Approximate Answering of Graph Queries
Cochez, Michael,
Alivanistos, Dimitrios,
Arakelyan, Erik,
Berrendorf, Max,
Daza, Daniel,
Galkin, Mikhail,
Minervini, Pasquale,
Niepert, Mathias,
and Ren, Hongyu
2023
[Link]
[DOI:10.3233/FAIA230149]
-
Reasoning beyond Triples: Recent Advances in Knowledge Graph Embeddings
Xiong, Bo,
Nayyeri, Mojtaba,
Daza, Daniel,
and Cochez, Michael
In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, United Kingdom, October 21-25, 2023
2023
[Link]
[DOI:10.1145/3583780.3615294]
-
Reconstructing and Querying ML Pipeline Intermediates
Schelter, Sebastian
In 13th Conference on Innovative Data Systems Research, CIDR 2023, Amsterdam, The Netherlands, January 8-11, 2023
2023
[Link]
-
Empowering Machine Learning Development with Service-Oriented Computing Principles
Yousefi, Mostafa Hadadian Nejad,
Degeler, Viktoriya,
and Lazovik, Alexander
In Service-Oriented Computing
2023
[Abs]
Despite software industries’ successful utilization of Service-Oriented Computing (SOC) to streamline software development, machine learning (ML) development has yet to fully integrate these practices. This disparity can be attributed to multiple factors, such as the unique challenges inherent to ML development and the absence of a unified framework for incorporating services into this process. In this paper, we shed light on the disparities between service-oriented computing and machine learning development. We propose “Everything as a Module” (XaaM), a framework designed to encapsulate every ML artifact, including models, code, data, and configurations, as an individual module, to bridge this gap. We propose a set of additional steps that need to be taken to empower machine learning development using service-oriented computing via an architecture that facilitates efficient management and orchestration of complex ML systems. By leveraging the best practices of service-oriented computing, we believe that machine learning development can achieve a higher level of maturity, improve the efficiency of the development process, and ultimately, facilitate the more effective creation of machine learning applications.
-
Results of SemTab 2023
Hassanzadeh, Oktie,
Abdelmageed, Nora,
Efthymiou, Vasilis,
Chen, Jiaoyan,
Cutrona, Vincenzo,
Hulsebos, Madelon,
Jiménez-Ruiz, Ernesto,
Khatiwada, Aamod,
Korini, Keti,
Kruit, Benno,
Sequeda, Juan,
and Srinivas, Kavitha
In Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, SemTab 2023, co-located with the 22nd International Semantic Web Conference, ISWC 2023, Athens, Greece, November 6-10, 2023
2023
[Link]
-
Introducing the Observatory Library for End-to-End Table Embedding Inference
Cong, Tianji,
Sun, Zhenjie,
Groth, Paul,
Jagadish, H.,
and Hulsebos, Madelon
In NeurIPS 2023 Second Table Representation Learning Workshop
2023
[Link]
-
Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making
Guha, Shubha,
Khan, Falaah Arif,
Stoyanovich, Julia,
and Schelter, Sebastian
In 2023 IEEE 39th International Conference on Data Engineering (ICDE)
2023
[DOI:10.1109/ICDE55515.2023.00303]
[Code]
-
Forget Me Now: Fast and Exact Unlearning in Neighborhood-Based Recommendation
Schelter, Sebastian,
Ariannezhad, Mozhdeh,
and Rijke, Maarten
In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
2023
[Abs]
[Link]
[DOI:10.1145/3539618.3591989]
Modern search and recommendation systems are optimized using logged interaction data. There is increasing societal pressure to enable users of such systems to have some of their data deleted from those systems. This paper focuses on "unlearning" such user data from neighborhood-based recommendation models on sparse, high-dimensional datasets. We present caboose, a custom top-k index for such models, which enables fast and exact deletion of user interactions. We experimentally find that caboose provides competitive index building times, makes sub-second unlearning possible (even for a large index built from one million users and 256 million interactions), and, when integrated into three state-of-the-art next-basket recommendation models, allows users to effectively adjust their predictions to remove sensitive items.
-
Data Integration Landscapes: The Case for Non-optimal Solutions in Network Diffusion Models
Nevin, James,
Groth, Paul,
and Lees, Michael
In Computational Science – ICCS 2023
2023
[Abs]
[DOI:10.1007/978-3-031-35995-8_35]
The successful application of computational models presupposes access to accurate, relevant, and representative datasets. The growth of public data, and the increasing practice of data sharing and reuse, emphasises the importance of data provenance and increases the need for modellers to understand how data processing decisions might impact model output. One key step in the data processing pipeline is that of data integration and entity resolution, where entities are matched across disparate datasets. In this paper, we present a new formulation of data integration in complex networks that incorporates integration uncertainty. We define an approach for understanding how different data integration setups can impact the results of network diffusion models under this uncertainty, allowing one to systematically characterise potential model outputs in order to create an output distribution that provides a more comprehensive picture.
-
Proactively Screening Machine Learning Pipelines with ARGUSEYES
Schelter, Sebastian,
Grafberger, Stefan,
Guha, Shubha,
Karlaš, Bojan,
and Zhang, Ce
In Companion of the 2023 International Conference on Management of Data
2023
[Abs]
[Link]
[DOI:10.1145/3555041.3589682]
[ 2nd Place Demo ]
Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually only detected in hindsight after deployment, after they caused harm in production. We demonstrate ArgusEyes, a system which enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems early before deployment to production. We demonstrate our system for three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data leakage in a price prediction pipeline, and addressing fairness violations in a credit scoring pipeline.
-
Seventh Workshop on Data Management for End-to-End Machine Learning (DEEM)
Boehm, Matthias,
Hulsebos, Madelon,
Shankar, Shreya,
and Varma, Paroma
In Companion of the 2023 International Conference on Management of Data
2023
[Abs]
[Link]
[DOI:10.1145/3555041.3590819]
The DEEM’23 workshop (Data Management for End-to-End Machine Learning) is held on Sunday June 18th, in conjunction with SIGMOD/PODS 2023. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal to discuss the arising data management issues in ML application scenarios. The workshop solicits regular research papers (10 pages) describing preliminary and ongoing research results, including industrial experience reports of end-to-end ML deployments, related to DEEM topics. In addition, DEEM 2023 has a category for short papers (4 pages) as a forum for sharing interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, preliminary results, and descriptions of system components and tools related to end-to-end ML pipelines. The workshop received 13 high-quality submissions on diverse topics relevant to DEEM, of which 6 were regular papers and 7 were short papers.
-
Models and Practice of Neural Table Representations
Hulsebos, Madelon,
Deng, Xiang,
Sun, Huan,
and Papotti, Paolo
In Companion of the 2023 International Conference on Management of Data
2023
[Abs]
[Link]
[DOI:10.1145/3555041.3589411]
In the last few years, the natural language processing community witnessed advances in neural representations of free-form text with transformer-based language models (LMs). Given the importance of knowledge available in relational tables, recent research efforts extend LMs by developing neural representations for tabular data. In this tutorial, we present these proposals with three main goals. First, we aim at introducing the potentials and limitations of current models to a database audience. Second, we want the attendees to see the benefit of such line of work in a large variety of data applications. Third, we would like to empower the audience with a new set of tools and to inspire them to tackle some of the important directions for neural table representations, including model and system design, evaluation, application and deployment. To achieve these goals, the tutorial is organized in two parts. The first part covers the background for neural table representations, including a survey of the most important systems. The second part is designed as a hands-on session, where attendees will use their laptop to explore this new framework and test neural models involving text and tabular data.
-
Provenance Tracking for End-to-End Machine Learning Pipelines
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
In Companion Proceedings of the ACM Web Conference 2023
2023
[Link]
[DOI:10.1145/3543873.3587557]
-
A Simulation Environment and Reinforcement Learning Method for Waste Reduction
Jullien, Sami,
Ariannezhad, Mozhdeh,
Groth, Paul,
and Rijke, Maarten
Transactions on Machine Learning Research
2023
[Link]
-
Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century (Dagstuhl Seminar 22372)
Groth, Paul,
Simperl, Elena,
Erp, Marieke,
and Vrandečić, Denny
Dagstuhl Reports
2023
[Link]
[DOI:10.4230/DagRep.12.9.60]
-
Poster: Towards Pattern-Level Privacy Protection in Distributed Complex Event Processing
Lotfian Delouee, Majid,
Koldehofe, Boris,
and Degeler, Viktoriya
In Proceedings of the 17th ACM International Conference on Distributed and Event-Based Systems
2023
[Abs]
[Link]
[DOI:10.1145/3583678.3603278]
In event processing systems, detected event patterns can reveal privacy-sensitive information. In this paper, we propose and discuss how to integrate pattern-level privacy protection in event-based systems. Compared to state-of-the-art approaches, we aim to enforce privacy independent of the particularities of specific operators. We accomplish this by supporting the flexible integration of multiple obfuscation techniques and studying deployment strategies for privacy-enforcing mechanisms. In addition, we share ideas on how to model the adversary’s knowledge to select appropriate obfuscation techniques for the discussed deployment strategies. Initial results indicate that flexibly choosing obfuscation techniques and deployment strategies is essential to conceal privacy-sensitive event patterns accurately.
-
Parameter Efficient Node Classification on Homophilic Graphs
Prieto, Lucas,
Boef, Jeroen Den,
Groth, Paul,
and Cornelisse, Joran
Transactions on Machine Learning Research
2023
[Link]
[Code]
-
How to Make an Outlier? Studying the Effect of Presentational Features on the Outlierness of Items in Product Search Results
Sarvi, Fatemeh,
Aliannejadi, Mohammad,
Schelter, Sebastian,
and Rijke, Maarten
In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval
2023
[Abs]
[Link]
[DOI:10.1145/3576840.3578278]
In two-sided marketplaces, items compete for attention from users since attention translates to revenue for suppliers. Item exposure is an indication of the amount of attention that items receive from users in a ranking. It can be influenced by factors like position bias. Recent work suggests that another phenomenon related to inter-item dependencies may also affect item exposure, viz. outlier items in the ranking. Hence, a deeper understanding of outlier items is crucial to determining an item’s exposure distribution. In this work, we study the impact of different presentational e-commerce features on users’ perception of outlierness of an item in a search result page. Informed by visual search literature, we design a set of crowdsourcing tasks where we compare the observability of three main features, viz. price, star rating, and discount tag. We find that various factors affect item outlierness, namely, visual complexity (e.g., shape, color), discriminative item features, and value range. In particular, we observe that a distinctive visual feature such as a colored discount tag can attract users’ attention much easier than a high price difference, simply because of visual characteristics that are easier to spot. Moreover, we see that the magnitude of deviations in all features affects the task complexity, such that when the similarity between outlier and non-outlier items increases, the task becomes more difficult.
-
The Mysterious User of Research Data: Knitting Together Science and Technology Studies with Information and Computer Science
Gregory, Kathleen,
Groth, Paul,
Scharnhorst, Andrea,
and Wyatt, Sally
2023
[Abs]
[Link]
[DOI:10.1007/978-3-031-11108-2_11]
Open, accessible, and standardized research data are seen as essential scaffolding for open science. To support this vision, data repositories and scientific publishers have developed new tools to facilitate data discovery while funders and policy makers have implemented open science and data management policies. Users are often invoked as central to these efforts. Despite this stated focus, the concept of ‘user’ often remains an abstraction, visible only via anonymous ensembles of click behavior or data management plans. This chapter reports and reflects on a project which draws on science and technology studies (STS) to open up the black box of research data use, bridging the gap between designers of data search systems and researchers who (re-)use both data and these systems in their actual practices. Quantitative and qualitative studies conducted in the course of this project will be drawn upon to demonstrate the insights gained from an interdisciplinary approach.
-
SemTab 2022: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, co-located with the 21st International Semantic Web Conference, ISWC 2022, Virtual conference, October 23-27, 2022
2023
[Link]
2022
-
Relational graph convolutional networks: a closer look
Thanapalasingam, Thiviyan,
Berkel, Lucas,
Bloem, Peter,
and Groth, Paul
PeerJ Computer Science
2022
[Link]
[DOI:10.7717/peerj-cs.1073]
-
Question Answering with Additive Restrictive Training (QuAART): Question Answering for the Rapid Development of New Knowledge Extraction Pipelines
Harper, Corey A.,
Daniel, Ron,
and Groth, Paul
In Knowledge Engineering and Knowledge Management (EKAW)
2022
[Abs]
[Link]
[DOI:10.1007/978-3-031-17105-5_4]
Numerous studies have explored the use of language models and question answering techniques for knowledge extraction. In most cases, these models are trained on data specific to the new task at hand. We hypothesize that using models trained only on generic question answering data (e.g. SQuAD) is a good starting point for domain specific entity extraction. We test this hypothesis, and explore whether the addition of small amounts of training data can help lift model performance. We pay special attention to the use of null answers and unanswerable questions to optimize performance. To our knowledge, no studies have been done to evaluate the effectiveness of this technique. We do so for an end-to-end entity mention detection and entity typing task on HAnDS and FIGER, two common evaluation datasets for fine grained entity recognition. We focus on fine-grained entity recognition because it is a challenging scenario, and because the long tail of types in this task highlights the need for entity extraction systems that can deal with new domains and types. To our knowledge, we are the first system beyond those presented in the original FIGER and HAnDS papers to tackle the task in an end-to-end fashion. Using an extremely small sample from the distantly-supervised HAnDS training data – 0.0015%, or less than 500 passages randomly chosen out of 31 million – we produce a CoNLL F1 score of 73.72 for entity detection on FIGER. Our end-to-end detection and typing evaluation produces macro and micro F1s of 45.11 and 54.75, based on the FIGER evaluation metrics. This work provides a foundation for the rapid development of new knowledge extraction pipelines.
-
Serenade - Low-Latency Session-Based Recommendation in e-Commerce at Scale
Kersbergen, Barrie,
Sprangers, Olivier,
and Schelter, Sebastian
In Proceedings of the 2022 International Conference on Management of Data
2022
[Abs]
[Link]
[DOI:10.1145/3514221.3517901]
Session-based recommendation predicts the next item with which a user will interact, given a sequence of her past interactions with other items. This machine learning problem targets a core scenario in e-commerce platforms, which aim to recommend interesting items to buy to users browsing the site. Session-based recommenders are difficult to scale due to their exponentially large input space of potential sessions. This impedes offline precomputation of the recommendations, and implies the necessity to maintain state during the online computation of next-item recommendations. We propose VMIS-kNN, an adaptation of a state-of-the-art nearest neighbor approach to session-based recommendation, which leverages a prebuilt index to compute next-item recommendations with low latency in scenarios with hundreds of millions of clicks to search through. Based on this approach, we design and implement the scalable session-based recommender system Serenade, which is in production usage at bol.com, a large European e-commerce platform. We evaluate the predictive performance of VMIS-kNN, and show that Serenade can answer a thousand recommendation requests per second with a 90th percentile latency of less than seven milliseconds in scenarios with millions of items to recommend. Furthermore, we present results from a three week long online A/B test with up to 600 requests per second for 6.5 million distinct items on more than 45 million user sessions from our e-commerce platform. To the best of our knowledge, we provide the first empirical evidence that the superior predictive performance of nearest neighbor approaches to session-based recommendation in offline evaluations translates to superior performance in a real world e-commerce setting.
-
Towards Data-Centric What-If Analysis for Native Machine Learning Pipelines
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
2022
[Abs]
[Link]
[DOI:10.1145/3533028.3533303]
An important task of data scientists is to understand the sensitivity of their models to changes in the data that the models are trained and tested upon. Currently, conducting such data-centric what-if analyses requires significant and costly manual development and testing with the corresponding chance for the introduction of bugs. We discuss the problem of data-centric what-if analysis over whole ML pipelines (including data preparation and feature encoding), propose optimisations that reuse trained models and intermediate data to reduce the runtime of such analysis, and finally conduct preliminary experiments on three complex example pipelines, where our approach reduces the runtime by a factor of up to six.
-
Responsible Data Management
Stoyanovich, Julia,
Abiteboul, Serge,
Howe, Bill,
Jagadish, H. V.,
and Schelter, Sebastian
Communications of the ACM
2022
[Abs]
[Link]
[DOI:10.1145/3488717]
Perspectives on the role and responsibility of the data-management research community in designing, developing, using, and overseeing automated decision systems.
-
Methods Included
Crusoe, Michael R.,
Abeln, Sanne,
Iosup, Alexandru,
Amstutz, Peter,
Chilton, John,
Tijanić, Nebojša,
Ménager, Hervé,
Soiland-Reyes, Stian,
Gavrilović, Bogdan,
Goble, Carole,
and The CWL Community
Communications of the ACM
2022
[Abs]
[Link]
[DOI:10.1145/3486897]
Standardizing computational reuse and portability with the Common Workflow Language.
-
CITRIS: Causal Identifiability from Temporal Intervened Sequences
Lippe, Phillip,
Magliacane, Sara,
Löwe, Sindy,
Asano, Yuki M.,
Cohen, Taco,
and Gavves, Efstratios
In Proceedings of the 39th International Conference on Machine Learning, ICML
2022
[arXiv]
-
SlotGAN: Detecting Mentions in Text via Adversarial Distant Learning
Daza, Daniel,
Cochez, Michael,
and Groth, Paul
In Proceedings of the Sixth Workshop on Structured Prediction for NLP
2022
[Abs]
[Link]
[DOI:10.18653/v1/2022.spnlp-1.4]
We present SlotGAN, a framework for training a mention detection model that only requires unlabeled text and a gazetteer. It consists of a generator trained to extract spans from an input sentence, and a discriminator trained to determine whether a span comes from the generator, or from the gazetteer. We evaluate the method on English newswire data and compare it against supervised, weakly-supervised, and unsupervised methods. We find that the performance of the method is lower than these baselines, because it tends to generate more and longer spans, and in some cases it relies only on capitalization. In other cases, it generates spans that are valid but differ from the benchmark. When evaluated with metrics based on overlap, we find that SlotGAN performs within 95% of the precision of a supervised method, and 84% of its recall. Our results suggest that the model can generate spans that overlap well, but an additional filtering mechanism is required.
-
Letter from the Special Issue Editor
Schelter, Sebastian
IEEE Data Engineering Bulletin (Special issue on Directions Towards GDPR-Compliant Data Systems and Applications)
2022
[Link]
-
Making Canonical Workflow Building Blocks Interoperable across Workflow Languages
Soiland-Reyes, Stian,
Bayarri, Genís,
Andrio, Pau,
Long, Robin,
Lowe, Douglas,
Niewielska, Ania,
Hospital, Adam,
and Groth, Paul
Data Intelligence
2022
[Abs]
[Link]
[DOI:10.1162/dint_a_00135]
We introduce the concept of Canonical Workflow Building Blocks (CWBB), a methodology of describing and wrapping computational tools, in order for them to be utilised in a reproducible manner from multiple workflow languages and execution platforms. The concept is implemented and demonstrated with the BioExcel Building Blocks library (BioBB), a collection of tool wrappers in the field of computational biomolecular simulation. Interoperability across different workflow languages is showcased through a protein Molecular Dynamics setup transversal workflow, built using this library and run with 5 different Workflow Manager Systems (WfMS). We argue such practice is a necessary requirement for FAIR Computational Workflows and an element of Canonical Workflow Frameworks for Research (CWFR) in order to improve widespread adoption and reuse of computational methods across workflow language barriers.
-
Defining a Knowledge Graph Development Process Through a Systematic Review
Tamašauskaitė, Gytė,
and Groth, Paul
ACM Transactions on Software Engineering and Methodology
2022
[Abs]
[Link]
[DOI:10.1145/3522586]
Knowledge graphs are widely used in industry and studied within the academic community. However, the models applied in the development of knowledge graphs vary. Analysing and providing a synthesis of the commonly used approaches to knowledge graph development would provide researchers and practitioners a better understanding of the overall process and methods involved. Hence, this paper aims to define the overall process of knowledge graph development and its key constituent steps. For this purpose, a systematic review and a conceptual analysis of the literature was conducted. The resulting process was compared to case studies to evaluate its applicability. The proposed process suggests a unified approach and provides guidance for both researchers and practitioners when constructing and managing knowledge graphs.
-
Data distribution debugging in machine learning pipelines
Grafberger, Stefan,
Groth, Paul,
Stoyanovich, Julia,
and Schelter, Sebastian
The VLDB Journal
2022
[Link]
[DOI:10.1007/s00778-021-00726-w]
-
Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation
Schröder, Max,
Staehlke, Susanne,
Groth, Paul,
Nebe, J. Barbara,
Spors, Sascha,
and Krüger, Frank
Journal of Biomedical Semantics
2022
[Link]
[DOI:10.1186/s13326-021-00257-x]
-
Packaging research artefacts with RO-Crate
Soiland-Reyes, Stian,
Sefton, Peter,
Crosas, Mercè,
Castro, Leyla Jael,
Coppens, Frederik,
Fernández, José M.,
Garijo, Daniel,
Grüning, Björn,
La Rosa, Marco,
Leo, Simone,
et al.,
Data Science
2022
[Link]
[DOI:10.3233/DS-210053]
-
Towards Parameter-Efficient Automation of Data Wrangling Tasks with Prefix-Tuning
Vos, David,
Döhmen, Till,
and Schelter, Sebastian
In NeurIPS 2022 First Table Representation Workshop
2022
[Link]
-
Towards improving Wikidata reuse with emerging patterns
Carriero, Valentina Anita,
Groth, Paul,
and Presutti, Valentina
In Proceedings of the 3rd Wikidata Workshop 2022 co-located with the
21st International Semantic Web Conference (ISWC2022), Virtual Event,
Hangzhou, China, October 2022
2022
[Link]
-
The Semantic Web - 19th International Conference, ESWC 2022, Hersonissos,
Crete, Greece, May 29 - June 2, 2022, Proceedings
2022
[Link]
[DOI:10.1007/978-3-031-06981-9]
-
Making Table Understanding Work in Practice
Hulsebos, Madelon,
Gathani, Sneha,
Gale, James,
Dillig, Isil,
Groth, Paul,
and Demiralp, Çağatay
In 12th Conference on Innovative Data Systems Research, CIDR 2022,
Chaminade, CA, USA, January 9-12, 2022
2022
[Link]
-
Screening Native Machine Learning Pipelines with ArgusEyes
Schelter, Sebastian,
Grafberger, Stefan,
Guha, Shubha,
Sprangers, Olivier,
Karlas, Bojan,
and Zhang, Ce
In 12th Conference on Innovative Data Systems Research, CIDR 2022,
Chaminade, CA, USA, January 9-12, 2022
2022
[Link]
-
GitSchemas: A Dataset for Automating Relational Data Preparation Tasks
Döhmen, Till,
Hulsebos, Madelon,
Beecks, Christian,
and Schelter, Sebastian
In 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)
2022
[Link]
[DOI:10.1109/ICDEW55742.2022.00016]
-
Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge
Graph Matching co-located with the 20th International Semantic Web
Conference (ISWC 2021), Virtual conference, October 27, 2021
Jiménez-Ruiz, Ernesto,
Efthymiou, Vasilis,
Chen, Jiaoyan,
Cutrona, Vincenzo,
Hassanzadeh, Oktie,
Sequeda, Juan,
Srinivas, Kavitha,
Abdelmageed, Nora,
Hulsebos, Madelon,
Oliveira, Daniela,
and Pesquita, Catia
2022
[Link]
-
AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning
Huang, Biwei,
Feng, Fan,
Lu, Chaochao,
Magliacane, Sara,
and Zhang, Kun
In International Conference on Learning Representations
2022
[Link]
[ Spotlight Presentation ]
2021
-
Quality Assessment of Knowledge Graph Hierarchies using KG-BERT
Szarkowska, Kinga,
Moore, Veronique,
Vandenbussche, Pierre-Yves,
and Groth, Paul
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2021)
2021
[Link]
-
GraphPOPE: Retaining Structural Graph Information Using Position-aware Node
Embeddings
Boef, Jeroen Den,
Cornelisse, Joran,
and Groth, Paul
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2021)
2021
[Link]
-
Perspectives on automated composition of workflows in the life sciences
Lamprecht, Anna-Lena,
Palmblad, Magnus,
Ison, Jon,
Schwämmle, Veit,
Manir, Mohammad Sadnan Al,
Altintas, Ilkay,
Baker, Christopher J. O.,
Amor, Ammar Ben Hadj,
Capella-Gutierrez, Salvador,
Charonyktakis, Paulos,
Crusoe, Michael R.,
Gil, Yolanda,
Goble, Carole,
Griffin, Timothy J.,
Groth, Paul,
Ienasescu, Hans,
Jagtap, Pratik,
Kalaš, Matúš,
Kasalica, Vedran,
Khanteymoori, Alireza,
Kuhn, Tobias,
Mei, Hailiang,
Ménager, Hervé,
Möller, Steffen,
Richardson, Robin A.,
Robert, Vincent,
Soiland-Reyes, Stian,
Stevens, Robert,
Szaniszlo, Szoke,
Verberne, Suzan,
Verhoeven, Aswin,
and Wolstencroft, Katherine
F1000Research
2021
[Link]
[DOI:10.12688/f1000research.54159.1]
-
Further with Knowledge Graphs: Proceedings of the 17th International Conference on Semantic Systems, 6–9 September 2021, Amsterdam, The Netherlands
2021
[Link]
[DOI:10.3233/SSW53]
-
SemEval-2021 Task 8: MeasEval – Extracting Counts and Measurements and their Related Contexts
Harper, Corey,
Cox, Jessica,
Kohler, Curt,
Scerri, Antony,
Daniel Jr., Ron,
and Groth, Paul
In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
2021
[Abs]
[Link]
[DOI:10.18653/v1/2021.semeval-1.38]
[ SemEval 2021 Best Task Paper ]
We describe MeasEval, a SemEval task of extracting counts, measurements, and related context from scientific documents, which is of significant importance to the creation of Knowledge Graphs that distill information from the scientific literature. This is a new task in 2021, for which over 75 submissions from 25 participants were received. We expect the data developed for this task and the findings reported to be valuable to the scientific knowledge extraction, metrology, and automated knowledge base construction communities.
-
Reinforcement Learning–Based Collective Entity Alignment with Adaptive Features
Zeng, Weixin,
Zhao, Xiang,
Tang, Jiuyang,
Lin, Xuemin,
and Groth, Paul
ACM Trans. Inf. Syst.
2021
[Abs]
[Link]
[DOI:10.1145/3446428]
Entity alignment (EA) is the task of identifying the entities that refer to the same real-world object but are located in different knowledge graphs (KGs). For entities to be aligned, existing EA solutions treat them separately and generate alignment results as ranked lists of entities on the other side. Nevertheless, this decision-making paradigm fails to take into account the interdependence among entities. Although some recent efforts mitigate this issue by imposing the 1-to-1 constraint on the alignment process, they still cannot adequately model the underlying interdependence and the results tend to be sub-optimal. To fill in this gap, in this work, we delve into the dynamics of the decision-making process, and offer a reinforcement learning (RL)–based model to align entities collectively. Under the RL framework, we devise the coherence and exclusiveness constraints to characterize the interdependence and restrict collective alignment. Additionally, to generate more precise inputs to the RL framework, we employ representative features to capture different aspects of the similarity between entities in heterogeneous KGs, which are integrated by an adaptive feature fusion strategy. Our proposal is evaluated on both cross-lingual and mono-lingual EA benchmarks and compared against state-of-the-art solutions. The empirical results verify its effectiveness and superiority.
-
Inductive Entity Representations from Text via Link Prediction
Daza, Daniel,
Cochez, Michael,
and Groth, Paul
In Proceedings of The Web Conference
2021
[arXiv]
[DOI:10.1145/3442381.3450141]
[Code]
-
Learnings from a Retail Recommendation System on Billions of Interactions at bol.com
Kersbergen, B.,
and Schelter, S.
In 2021 IEEE 37th International Conference on Data Engineering (ICDE)
2021
[Link]
[DOI:10.1109/ICDE51399.2021.00277]
-
Letter from the Special Issue Editor
Schelter, Sebastian
IEEE Data Engineering Bulletin (Special issue on Data validation for machine learning models and applications)
2021
[Link]
-
Complex Query Answering with Neural Link Predictors
Arakelyan, Erik,
Daza, Daniel,
Minervini, Pasquale,
and Cochez, Michael
In International Conference on Learning Representations (ICLR)
2021
[arXiv]
[Link]
[ Outstanding Paper Award ICLR 2021 ]
-
The Challenges of Cross-Document Coreference Resolution for Email
Li, Xue,
Magliacane, Sara,
and Groth, Paul
In Proceedings of the 11th on Knowledge Capture Conference
2021
[Abs]
[Link]
[DOI:10.1145/3460210.3493573]
Long-form conversations such as email are an important source of information for knowledge capture. For tasks such as knowledge graph construction, conversational search, and entity linking, being able to resolve entities from across documents is important. Building on recent work on within document coreference resolution for email, we study for the first time a cross-document formulation of the problem. Our results show that the current state-of-the-art deep learning models for general cross-document coreference resolution are insufficient for email conversations. Our experiments show that the general task is challenging and, importantly for knowledge intensive tasks, coreference resolution models that only treat entity mentions perform worse. Based on these results, we outline the work needed to address this challenging task.
-
Supporting Ontology Maintenance with Contextual Word Embeddings and
Maximum Mean Discrepancy
Shroff, Natasha,
Vandenbussche, Pierre-Yves,
Moore, Véronique,
and Groth, Paul
In Joint Proceedings of the 2nd International Workshop on Deep Learning
meets Ontologies and Natural Language Processing (DeepOntoNLP 2021)
& 6th International Workshop on Explainable Sentiment Mining
and Emotion Detection (X-SENTIMENT 2021) co-located
with 18th Extended Semantic Web Conference 2021, Hersonissos, Greece,
June 6th - 7th, 2021 (moved online)
2021
[Link]
-
Proceedings of Machine Learning with Symbolic Methods and Knowledge Graphs co-located
with European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML PKDD 2021), Virtual,
September 17, 2021
Alam, Mehwish,
Ali, Mehdi,
Groth, Paul,
Hitzler, Pascal,
Lehmann, Jens,
Paulheim, Heiko,
Rettinger, Achim,
Sack, Harald,
Sadeghi, Afshin,
and Tresp, Volker
2021
[Link]
-
Verifiably Safe Exploration for End-to-End Reinforcement Learning
Hunt, Nathan,
Fulton, Nathan,
Magliacane, Sara,
Hoang, Trong Nghia,
Das, Subhro,
and Solar-Lezama, Armando
In Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control
2021
[Abs]
[Link]
[DOI:10.1145/3447928.3456653]
[ Best Paper Award ACM HSCC 2021 ]
Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints. Our benchmark draws from several proposed problem sets for safe learning and includes problems that emphasize challenges such as reward signals that are not aligned with safety constraints. On each of these benchmark problems, our algorithm completely avoids unsafe behavior while remaining competitive at optimizing for as much reward as is safe. We characterize safety constraints in terms of a refinement relation on Markov decision processes - rather than directly constraining the reinforcement learning algorithm so that it only takes safe actions, we instead refine the environment so that only safe actions are defined in the environment’s transition structure. This has pragmatic system design benefits and, more importantly, provides a clean conceptual setting in which we are able to prove important safety and efficiency properties. These allow us to transform the constrained optimization problem of acting safely in the original environment into an unconstrained optimization in a refined environment.
-
Summary of Tutorials at The Web Conference 2021
West, Robert,
Bhagat, Smriti,
Groth, Paul,
Zitnik, Marinka,
Couto, Francisco M.,
Lisena, Pasquale,
Meroño-Peñuela, Albert,
Zhao, Xiangyu,
Fan, Wenqi,
Yin, Dawei,
Tang, Jiliang,
Shou, Linjun,
Gong, Ming,
Pei, Jian,
Geng, Xiubo,
Zhou, Xingjie,
Jiang, Daxin,
Ricaud, Benjamin,
Aspert, Nicolas,
Miz, Volodymyr,
Dy, Jennifer,
Ioannidis, Stratis,
Yıldız, İlkay,
Rezapour, Rezvaneh,
Aref, Samin,
Dinh, Ly,
Diesner, Jana,
Drutsa, Alexey,
Ustalov, Dmitry,
Popov, Nikita,
Baidakova, Daria,
Mishra, Shubhanshu,
Gopalan, Arjun,
Juan, Da-Cheng,
Ilharco Magalhaes, Cesar,
Ferng, Chun-Sung,
Heydon, Allan,
Lu, Chun-Ta,
Pham, Philip,
Yu, George,
Fan, Yicheng,
Wang, Yueqi,
Laurent, Florian,
Schraner, Yanick,
Scheller, Christian,
Mohanty, Sharada,
Chen, Jiawei,
Wang, Xiang,
Feng, Fuli,
He, Xiangnan,
Teinemaa, Irene,
Albert, Javier,
Goldenberg, Dmitri,
Vasile, Flavian,
Rohde, David,
Jeunen, Olivier,
Benhalloum, Amine,
Sakhi, Otmane,
Rong, Yu,
Huang, Wenbing,
Xu, Tingyang,
Bian, Yatao,
Cheng, Hong,
Sun, Fuchun,
Huang, Junzhou,
Fakhraei, Shobeir,
Faloutsos, Christos,
Çelebi, Onur,
Müller, Martin,
Schneider, Manuel,
Altunina, Olesia,
Wingerath, Wolfram,
Wollmer, Benjamin,
Gessert, Felix,
Succo, Stephan,
Ritter, Norbert,
Courdier, Evann,
Avram, Tudor Mihai,
Cvetinovic, Dragan,
Tsinadze, Levan,
Jose, Johny,
Howell, Rose,
Koenig, Mario,
Defferrard, Michaël,
Kenthapadi, Krishnaram,
Packer, Ben,
Sameki, Mehrnoosh,
and Sephus, Nashlie
In Companion Proceedings of the Web Conference 2021
2021
[Abs]
[Link]
[DOI:10.1145/3442442.3453701]
This report summarizes the 23 tutorials hosted at The Web Conference 2021: nine lecture-style tutorials and 14 hands-on tutorials.
-
HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning
Schelter, Sebastian,
Grafberger, Stefan,
and Dunning, Ted
In Proceedings of the 2021 International Conference on Management of Data
2021
[Abs]
[Link]
[DOI:10.1145/3448016.3457239]
Software systems that learn from user data with machine learning (ML) have become ubiquitous over the last years. Recent law such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). However, this regulation does not only require the deletion of user data from databases, but also applies to ML models that have been learned from the stored data. We therefore argue that ML applications should offer users to unlearn their data from trained models in a timely manner. We explore how fast this unlearning can be done under the constraints imposed by real world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in-place under the removal of a small fraction of training samples without retraining. We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, which is designed to answer unlearning requests with low latency. We detail how to efficiently implement HedgeCut with vectorised operators for decision tree learning. We conduct an experimental evaluation on five privacy-sensitive datasets, where we find that HedgeCut can unlearn training samples with a latency of around 100 microseconds and answers up to 36,000 prediction requests per second, while providing a training time and predictive accuracy similar to widely used implementations of tree-based ML models such as Random Forests.
-
MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines
Grafberger, Stefan,
Guha, Shubha,
Stoyanovich, Julia,
and Schelter, Sebastian
In Proceedings of the 2021 International Conference on Management of Data
2021
[Abs]
[Link]
[DOI:10.1145/3448016.3452759]
Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this wide-spread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. While bias detection cannot be fully automated, computational tools can help pinpoint particular types of data issues. We recently proposed mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. In this demonstration, we show how mlinspect can be used to detect data distribution bugs in a representative pipeline. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation. The library is publicly available at https://github.com/stefan-grafberger/mlinspect.
-
Automating Data Quality Validation for Dynamic Data Ingestion
Redyuk, Sergey,
Kaoudi, Zoi,
Markl, Volker,
and Schelter, Sebastian
In Proceedings of the 24th International Conference on Extending Database
Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021
2021
[Link]
[DOI:10.5441/002/edbt.2021.07]
-
JENGA - A Framework to Study the Impact of Data Errors on the
Predictions of Machine Learning Models
Schelter, Sebastian,
Rukat, Tammo,
and Biessmann, Felix
In Proceedings of the 24th International Conference on Extending Database
Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021
2021
[Link]
[DOI:10.5441/002/edbt.2021.63]
-
Taming Technical Bias in Machine Learning Pipelines
Schelter, Sebastian,
and Stoyanovich, Julia
IEEE Data Engineering Bulletin (Special Issue on Interdisciplinary Perspectives on Fairness and Artificial Intelligence Systems)
2021
[Link]
-
Talking datasets – Understanding data sensemaking behaviours
Koesten, Laura,
Gregory, Kathleen,
Groth, Paul,
and Simperl, Elena
International Journal of Human-Computer Studies
2021
[Abs]
[arXiv]
[Link]
[DOI:10.1016/j.ijhcs.2020.102562]
The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little attention has been paid to a key step in data reuse: the behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 researchers from different disciplines as they summarised and interacted with both familiar and unfamiliar data. We use our findings to identify and detail common patterns of data-centric sensemaking across three clusters of activities that we present as a framework: inspecting data, engaging with content, and placing data within broader contexts. Additionally, we propose design recommendations for tools and documentation practices, which can be used to facilitate sensemaking and subsequent data reuse.
2020
-
Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels
Brate, Ryan,
Groth, Paul,
and Erp, Marieke
In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
2020
[Abs]
[Link]
Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them ‘smell experiences’, offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper, we present two variations on a semi-supervised approach to identify smell experiences in English literature. The combined set of patterns from both implementations offer significantly better performance than a keyword-based baseline.
-
Dataset Reuse: Toward Translating Principles to Practice
Koesten, Laura,
Vougiouklis, Pavlos,
Simperl, Elena,
and Groth, Paul
Patterns
2020
[Link]
[DOI:10.1016/j.patter.2020.100136]
-
Effective distributed representations for academic expert search
Berger, Mark,
Zavrel, Jakub,
and Groth, Paul
In Proceedings of the First Workshop on Scholarly Document Processing at EMNLP
2020
[Abs]
[Link]
Expert search aims to find and rank experts based on a user’s query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance.
-
Dataset search: a survey
Chapman, Adriane,
Simperl, Elena,
Koesten, Laura,
Konstantinidis, George,
Ibáñez, Luis-Daniel,
Kacprzak, Emilia,
and Groth, Paul
The VLDB Journal
2020
[arXiv]
[Link]
[DOI:10.1007/s00778-019-00564-x]
-
Introduction – FAIR data, systems and analysis
Groth, Paul,
and Dumontier, Michel
Data Science
2020
[Link]
[DOI:10.3233/DS-200029]
-
Fairness-Aware Instrumentation of Preprocessing Pipelines for Machine
Learning
Yang, Ke,
Huang, Biao,
Stoyanovich, Julia,
and Schelter, Sebastian
In Workshop on Human-In-the-Loop Data Analytics (HILDA’20)
2020
[Link]
[DOI:10.1145/3398730.3399194]
-
Towards Entity Spaces
Erp, Marieke,
and Groth, Paul
In Proceedings of The 12th Language Resources and Evaluation Conference
2020
[Abs]
[Link]
Entities are a central element of knowledge bases and are important input to many knowledge-centric tasks including text analysis. For example, they allow us to find documents relevant to a specific entity irrespective of the underlying syntactic expression within a document. However, the entities that are commonly represented in knowledge bases are often a simplification of what is truly being referred to in text. For example, in a knowledge base, we may have an entity for Germany as a country but not for the more fuzzy concept of Germany that covers notions of German Population, German Drivers, and the German Government. Inspired by recent advances in contextual word embeddings, we introduce the concept of entity spaces - specific representations of a set of associated entities with near-identity. Thus, these entity spaces provide a handle to an amorphous grouping of entities. We developed a proof-of-concept for English showing how, through the introduction of entity spaces in the form of disambiguation pages, the recall of entity linking can be improved.
-
PANDAcap: A Framework for Streamlining Collection of Full-System Traces
Stamatogiannakis, Manolis,
Bos, Herbert,
and Groth, Paul
In EuroSec
2020
[Link]
[DOI:10.1145/3380786.3391396]
[Code]
-
Lost or Found? Discovering Data Needed for Research
Gregory, Kathleen,
Groth, Paul,
Scharnhorst, Andrea,
and Wyatt, Sally
Harvard Data Science Review
2020
[Link]
[DOI:10.1162/99608f92.e38165eb]
-
Estimating the imageability of words by mining visual characteristics from crawled image data
Kastner, Marc A.,
Ide, Ichiro,
Nack, Frank,
Kawanishi, Yasutomo,
Hirayama, Takatsugu,
Deguchi, Daisuke,
and Murase, Hiroshi
Multimedia Tools and Applications
2020
[Link]
[DOI:10.1007/s11042-019-08571-4]
-
FAIR Data Reuse – the Path through Data Citation
Groth, Paul,
Cousijn, Helena,
Clark, Tim,
and Goble, Carole
Data Intelligence
2020
[Link]
[DOI:10.1162/dint_a_00030]
-
Semantic Systems. In the Era of Knowledge Graphs - 16th International
Conference on Semantic Systems, SEMANTiCS 2020, Amsterdam, The Netherlands,
September 7-10, 2020, Proceedings
Blomqvist, Eva,
Groth, Paul,
Boer, Victor,
Pellegrini, Tassilo,
Alam, Mehwish,
Käfer, Tobias,
Kieseberg, Peter,
Kirrane, Sabrina,
Meroño-Peñuela, Albert,
and Pandit, Harshvardhan J.
2020
[Link]
[DOI:10.1007/978-3-030-59833-4]
-
The state of altmetrics: a tenth anniversary celebration
Altmetric Engineering,
Konkiel, Stacy,
Priem, Jason,
Adie, Euan,
Derrick, Gemma,
Didegah, Fereshteh,
Groth, Paul,
Neylon, Cameron,
Xu, Shenmeng,
Zahedi, Zohreh,
Bowman, Timothy,
Patel, Vanash M,
Haunschild, Robin,
Bornmann, Lutz,
Taylor, Mike,
Ross, Liesa,
Theng, Yin-Leng,
Hassan, Saeed-Ul,
and Aljohani, Naif R.
2020
[Link]
[DOI:10.6084/M9.FIGSHARE.13010000.V2]
-
CSSA’20: Workshop on Combining Symbolic and Sub-Symbolic Methods and Their Applications
Alam, Mehwish,
Groth, Paul,
Hitzler, Pascal,
Paulheim, Heiko,
Sack, Harald,
and Tresp, Volker
In Proceedings of the 29th ACM International Conference on Information & Knowledge Management
2020
[Abs]
[Link]
[DOI:10.1145/3340531.3414072]
There has been a rapid growth in the use of symbolic representations along with their applications in many important tasks. Symbolic representations, in the form of Knowledge Graphs (KGs), constitute large networks of real-world entities and their relationships. On the other hand, sub-symbolic artificial intelligence has also become a mainstream area of research. This workshop brought together researchers to discuss and foster collaborations on the intersection of these two areas.
-
ICIDS2020 Panel: Building the Discipline of Interactive Digital Narratives
Bernstein, Mark,
Palosaari Eladhari, Mirjam,
Koenitz, Hartmut,
Louchart, Sandy,
Nack, Frank,
Martens, Chris,
Rossi, Giulia Carla,
Bosser, Anne-Gwenn,
and Millard, David E.
In Interactive Storytelling
2020
[Abs]
[DOI:10.1007/978-3-030-62516-0_1]
Building our discipline has been an ongoing discussion since the early days of ICIDS. From earlier international joint efforts to integrate research from multiple fields of study to today’s endeavours by researchers to provide scholarly works of reference, the discussion on how to continue building Interactive Digital Narratives as a discipline with its own vocabulary, scope, evaluation and methods is far from over. This year, we have chosen to continue this discussion through a panel in order to explore what are the epistemological implications of the multiple disciplinary roots of our field, and what are the next steps we should take as a community.
-
Technical Perspective: Query Optimization for Faster Deep CNN Explanations
Schelter, Sebastian
ACM SIGMOD Record
2020
[Link]
-
Apache Mahout: Machine Learning on Distributed Dataflow Systems
Anil, Robin,
Capan, Gokhan,
Drost-Fromm, Isabel,
Dunning, Ted,
Friedman, Ellen,
Grant, Trevor,
Quinn, Shannon,
Ranjan, Paritosh,
Schelter, Sebastian,
and Yılmazel, Özgür
Journal of Machine Learning Research
2020
[Link]
-
Message Passing Query Embedding
Daza, Daniel,
and Cochez, Michael
In ICML Workshop - Graph Representation Learning and Beyond
2020
[arXiv]
[Link]
-
A longitudinal analysis of university rankings
Selten, Friso,
Neylon, Cameron,
Huang, Chun-Kai,
and Groth, Paul
Quantitative Science Studies
2020
[Link]
[DOI:10.1162/qss_a_00052]
2019
-
How Relevant Is Your Choice?
Kolhoff, Lobke,
and Nack, Frank
In ICIDS 2019. Lecture Notes in Computer Science, vol 11869
2019
[Abs]
With the release of the film Black Mirror: Bandersnatch, Netflix entered the area of interactive streamed narratives. We performed a qualitative analysis with 169 Netflix subscribers that had watched the episode. The key findings show (1) participants are initially engaged because of curiosity and the novelty value, and desire to explore the narrative regardless of satisfaction, (2) perceived agency is limited due to arbitrary choices and the lack of meaningful consequences, (3) the overall experience is satisfactory but adaptions are desirable in future design to make full use of the potential of the format.
-
Transfer Learning for Biomedical Named Entity Recognition with BioBERT
Symeonidou, Anthi,
Sazonau, Viachaslau,
and Groth, Paul
In Proceedings of the Posters and Demo Track of the 15th International
Conference on Semantic Systems co-located with 15th International
Conference on Semantic Systems (SEMANTiCS 2019), Karlsruhe, Germany,
September 9th - to - 12th, 2019.
2019
[Link]
-
Understanding data search as a socio-technical practice
Gregory, Kathleen M,
Cousijn, Helena,
Groth, Paul,
Scharnhorst, Andrea,
and Wyatt, Sally
Journal of Information Science
2019
[Abs]
[Link]
[DOI:10.1177/0165551519837182]
Open research data are heralded as having the potential to increase effectiveness, productivity and reproducibility in science, but little is known about the actual practices involved in data search. The socio-technical problem of locating data for reuse is often reduced to the technological dimension of designing data search systems. We combine a bibliometric study of the current academic discourse around data search with interviews with data seekers. In this article, we explore how adopting a contextual, socio-technical perspective can help to understand user practices and behaviour and ultimately help to improve the design of data discovery systems.
-
Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines
Gregory, Kathleen,
Groth, Paul,
Cousijn, Helena,
Scharnhorst, Andrea,
and Wyatt, Sally
Journal of the Association for Information Science and Technology
2019
[Abs]
[Link]
[DOI:10.1002/asi.24165]
A cross-disciplinary examination of the user behaviors involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data in selected disciplines. Two analytical frameworks, rooted in information retrieval and science and technology studies, are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.
-
End-to-End Learning for Answering Structured Queries Directly over
Text
Groth, Paul T.,
Scerri, Antony,
Daniel, Ron,
and Allen, Bradley P.
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs
(DL4KG2019) Co-located with the 16th Extended Semantic Web Conference
2019 (ESWC 2019), Portoroz, Slovenia, June 2, 2019.
2019
[arXiv]
[Link]
2018
-
Open Information Extraction on Scientific Text: An Evaluation
Groth, Paul T.,
Lauruhn, Michael,
Scerri, Antony,
and Daniel, Ron
In Proceedings of the 27th International Conference on Computational
Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26,
2018
2018
[Link]
-
Elsevier’s Healthcare Knowledge Graph and the Case for Enterprise
Level Linked Data Standards
DeJong, Alex,
Bord, Radmila,
Dowling, Will,
Hoekstra, Rinke,
Moquin, Ryan,
O, Charlie,
Samarasinghe, Mevan,
Snyder, Paul,
Stanley, Craig,
Tordai, Anna,
Trefry, Michael,
and Groth, Paul T.
In Proceedings of the ISWC 2018 Posters & Demonstrations, Industry
and Blue Sky Ideas Tracks co-located with 17th International Semantic
Web Conference (ISWC 2018), Monterey, USA, October 8th - to - 12th,
2018.
2018
[Link]
-
Use of Internal Testing Data to Help Determine Compensation for Crowdsourcing
Tasks
Lauruhn, Michael,
Groth, Paul T.,
Harper, Corey A.,
and Deus, Helena F.
In Proceedings of the 2nd International Workshop on Augmenting Intelligence
with Humans-in-the-Loop co-located with 17th International
Semantic Web Conference (ISWC 2018), Monterey, California, October
9th, 2018.
2018
[Link]