Bibtex file with the publications listed below.
2023
-
Harnessing the Web and Knowledge Graphs for Automated Impact Investing Scoring
Hu, Qingzhi,
Daza, Daniel,
Swinkels, Laurens,
Usaite, Kristina,
Hoen, Robbert-Jan,
and Groth, Paul
In KDD Fragile Earth Workshop
2023
[Link]
[DOI:10.48550/arXiv.2308.02622]
-
An approach for analysing the impact of data integration on complex network diffusion models
Nevin, James,
Groth, Paul,
and Lees, Michael
Journal of Complex Networks
2023
[Link]
[DOI:10.1093/comnet/cnad025]
-
Automating and Optimizing Data-Centric What-If Analyses on Native Machine Learning Pipelines
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
Proc. ACM Manag. Data
2023
[Abs]
[Link]
[DOI:10.1145/3589273]
Software systems that learn from data with machine learning (ML) are used in critical decision-making processes. Unfortunately, real-world experience shows that the pipelines for data preparation, feature encoding and model training in ML systems are often brittle with respect to their input data. As a consequence, data scientists have to run different kinds of data centric what-if analyses to evaluate the robustness and reliability of such pipelines, e.g., with respect to data errors or preprocessing techniques. These what-if analyses follow a common pattern: they take an existing ML pipeline, create a pipeline variant by introducing a small change, and execute this pipeline variant to see how the change impacts the pipeline’s output score. The application of existing analysis techniques to ML pipelines is technically challenging as they are hard to integrate into existing pipeline code and their execution introduces large overheads due to repeated work. We propose mlwhatif to address these integration and efficiency challenges for data-centric what-if analyses on ML pipelines. mlwhatif enables data scientists to declaratively specify what-if analyses for an ML pipeline, and to automatically generate, optimize and execute the required pipeline variants. Our approach employs pipeline patches to specify changes to the data, operators and models of a pipeline. Based on these patches, we define a multi-query optimizer for efficiently executing the resulting pipeline variants jointly, with four subsumption-based optimization rules. Subsequently, we detail how to implement the pipeline variant generation and optimizer of mlwhatif. For that, we instrument native ML pipelines written in Python to extract dataflow plans with re-executable operators. We experimentally evaluate mlwhatif, and find that its speedup scales linearly with the number of pipeline variants in applicable cases, and is invariant to the input data size. In end-to-end experiments with four analyses on more than 60 pipelines, we show speedups of up to 13x compared to sequential execution, and find that the speedup is invariant to the model and featurization in the pipeline. Furthermore, we confirm the low instrumentation overhead of mlwhatif.
-
Data journeys: Explaining AI workflows through abstraction
Daga, Enrico,
and Groth, Paul
Semantic Web
2023
[Abs]
[Link]
[DOI:10.3233/sw-233407]
Artificial intelligence systems are not simply built on a single dataset or trained model. Instead, they are made by complex data science workflows involving multiple datasets, models, preparation scripts, and algorithms. Given this complexity, in order to understand these AI systems, we need to provide explanations of their functioning at higher levels of abstraction. To tackle this problem, we focus on the extraction and representation of data journeys from these workflows. A data journey is a multi-layered semantic representation of data processing activity linked to data science code and assets. We propose an ontology to capture the essential elements of a data journey and an approach to extract such data journeys. Using a corpus of Python notebooks from Kaggle, we show that we are able to capture high-level semantic data flow that is more compact than using the code structure itself. Furthermore, we show that introducing an intermediate knowledge graph representation outperforms models that rely only on the code itself. Finally, we report on a user survey to reflect on the challenges and opportunities presented by computational data journeys for explainable AI.
-
Self-Contained Entity Discovery from Captioned Videos
Ayoughi, Melika,
Mettes, Pascal,
and Groth, Paul
ACM Trans. Multimedia Comput. Commun. Appl.
2023
[Abs]
[Link]
[DOI:10.1145/3583138]
This article introduces the task of visual named entity discovery in videos without the need for task-specific supervision or task-specific external knowledge sources. Assigning specific names to entities (e.g., faces, scenes, or objects) in video frames is a long-standing challenge. Commonly, this problem is addressed as a supervised learning objective by manually annotating entities with labels. To bypass the annotation burden of this setup, several works have investigated the problem by utilizing external knowledge sources such as movie databases. While effective, such approaches do not work when task-specific knowledge sources are not provided and can only be applied to movies and TV series. In this work, we take the problem a step further and propose to discover entities in videos from videos and corresponding captions or subtitles. We introduce a three-stage method where we (i) create bipartite entity-name graphs from frame–caption pairs, (ii) find visual entity agreements, and (iii) refine the entity assignment through entity-level prototype construction. To tackle this new problem, we outline two new benchmarks, SC-Friends and SC-BBT, based on the Friends and Big Bang Theory TV series. Experiments on the benchmarks demonstrate the ability of our approach to discover which named entity belongs to which face or scene, with an accuracy close to a supervised oracle, just from the multimodal information present in videos. Additionally, our qualitative examples show the potential challenges of self-contained discovery of any visual entity for future work. The code and the data are available on GitHub.
-
GitTables: A Large-Scale Corpus of Relational Tables
Hulsebos, Madelon,
Demiralp, Çagatay,
and Groth, Paul
Proc. ACM Manag. Data
2023
[Abs]
[Link]
[DOI:10.1145/3588710]
The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of 1M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 10M tables. Analyses of GitTables show that its structure, content, and topical coverage differ significantly from existing table corpora. We annotate table columns in GitTables with semantic types, hierarchical relations and descriptions from Schema.org and DBpedia. The evaluation of our annotation pipeline on the T2Dv2 benchmark illustrates that our approach provides results on par with human annotations. We present three applications of GitTables, demonstrating its value for learned semantic type detection models, schema completion methods, and benchmarks for table-to-KG matching, data search, and preparation. We make the corpus and code available at https://gittables.github.io.
-
AQuA-CEP: Adaptive Quality-Aware Complex Event Processing in the Internet of Things
Lotfian Delouee, Majid,
Koldehofe, Boris,
and Degeler, Viktoriya
In The 17th ACM International Conference on Distributed Event-Based Systems (DEBS 2023)
2023
[Abs]
[Link]
Sensory data profoundly influences the quality of detected events in a distributed complex event processing system (DCEP). Since each sensor’s status is unstable at runtime, a single sensing assignment is often insufficient to fulfill the consumer’s quality requirements. In this paper, we study in the context of AQuA-CEP the problem of dynamic quality monitoring and adaptation of complex event processing by active integration of suitable data sources. To support this, in AQuA-CEP, queries to detect complex events are supplemented with consumer-definable quality policies that are evaluated and used to autonomously select (or even configure) suitable data sources of the sensing infrastructure. In addition, we studied different forms of expressing quality policies and analyzed how it affects the quality monitoring process. Various modes of evaluating and applying quality-related adaptations and their impacts on correlation efficiency are addressed, too. We assessed the performance of AQuA-CEP in IoT scenarios by utilizing the notion of the quality policy alongside the query processing adaptation using knowledge derived from quality monitoring. The results show that AQuA-CEP can improve the performance of DCEP systems in terms of the quality of results while fulfilling the consumer’s quality requirements. Quality-based adaptation can also increase the network’s lifetime by optimizing the sensor’s energy consumption due to efficient data source selection.
-
A Simulation Environment and Reinforcement Learning Method for Waste Reduction
Jullien, Sami,
Ariannezhad, Mozhdeh,
Groth, Paul,
and Rijke, Maarten
Transactions on Machine Learning Research
2023
[Link]
-
Knowledge Graphs and their Role in the Knowledge Engineering of the 21st Century (Dagstuhl Seminar 22372)
Groth, Paul,
Simperl, Elena,
Erp, Marieke,
and Vrandečić, Denny
Dagstuhl Reports
2023
[Link]
[DOI:10.4230/DagRep.12.9.60]
-
Parameter Efficient Node Classification on Homophilic Graphs
Prieto, Lucas,
Boef, Jeroen Den,
Groth, Paul,
and Cornelisse, Joran
Transactions on Machine Learning Research
2023
[Link]
[Code]
-
How to Make an Outlier? Studying the Effect of Presentational Features on the Outlierness of Items in Product Search Results
Sarvi, Fatemeh,
Aliannejadi, Mohammad,
Schelter, Sebastian,
and Rijke, Maarten
In Proceedings of the 2023 Conference on Human Information Interaction and Retrieval
2023
[Abs]
[Link]
[DOI:10.1145/3576840.3578278]
In two-sided marketplaces, items compete for attention from users since attention translates to revenue for suppliers. Item exposure is an indication of the amount of attention that items receive from users in a ranking. It can be influenced by factors like position bias. Recent work suggests that another phenomenon related to inter-item dependencies may also affect item exposure, viz. outlier items in the ranking. Hence, a deeper understanding of outlier items is crucial to determining an item’s exposure distribution. In this work, we study the impact of different presentational e-commerce features on users’ perception of outlierness of an item in a search result page. Informed by visual search literature, we design a set of crowdsourcing tasks where we compare the observability of three main features, viz. price, star rating, and discount tag. We find that various factors affect item outlierness, namely, visual complexity (e.g., shape, color), discriminative item features, and value range. In particular, we observe that a distinctive visual feature such as a colored discount tag can attract users’ attention much easier than a high price difference, simply because of visual characteristics that are easier to spot. Moreover, we see that the magnitude of deviations in all features affects the task complexity, such that when the similarity between outlier and non-outlier items increases, the task becomes more difficult.
-
The Mysterious User of Research Data: Knitting Together Science and Technology Studies with Information and Computer Science
Gregory, Kathleen,
Groth, Paul,
Scharnhorst, Andrea,
and Wyatt, Sally
2023
[Abs]
[Link]
[DOI:10.1007/978-3-031-11108-2_11]
Open, accessible, and standardized research data are seen as essential scaffolding for open science. To support this vision, data repositories and scientific publishers have developed new tools to facilitate data discovery while funders and policy makers have implemented open science and data management policies. Users are often invoked as central to these efforts. Despite this stated focus, the concept of ‘user’ often remains an abstraction, visible only via anonymous ensembles of click behavior or data management plans. This chapter reports and reflects on a project which draws on science and technology studies (STS) to open up the black box of research data use, bridging the gap between designers of data search systems and researchers who (re-)use both data and these systems in their actual practices. Quantitative and qualitative studies conducted in the course of this project will be drawn upon to demonstrate the insights gained from an interdisciplinary approach.
-
SemTab 2022: Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, co-located with the 21st International Semantic Web Conference, ISWC 2022, Virtual conference, October 23-27, 2022
2023
[Link]
-
Forget Me Now: Fast and Exact Unlearning in Neighborhood-Based Recommendation
Schelter, Sebastian,
Ariannezhad, Mozhdeh,
and Rijke, Maarten
In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
2023
[Abs]
[Link]
[DOI:10.1145/3539618.3591989]
Modern search and recommendation systems are optimized using logged interaction data. There is increasing societal pressure to enable users of such systems to have some of their data deleted from those systems. This paper focuses on "unlearning" such user data from neighborhood-based recommendation models on sparse, high-dimensional datasets. We present caboose, a custom top-k index for such models, which enables fast and exact deletion of user interactions. We experimentally find that caboose provides competitive index building times, makes sub-second unlearning possible (even for a large index built from one million users and 256 million interactions), and, when integrated into three state-of-the-art next-basket recommendation models, allows users to effectively adjust their predictions to remove sensitive items.
-
Data Integration Landscapes: The Case for Non-optimal Solutions in Network Diffusion Models
Nevin, James,
Groth, Paul,
and Lees, Michael
In Computational Science – ICCS 2023
2023
[Abs]
[DOI:10.1007/978-3-031-35995-8_35]
The successful application of computational models presupposes access to accurate, relevant, and representative datasets. The growth of public data, and the increasing practice of data sharing and reuse, emphasises the importance of data provenance and increases the need for modellers to understand how data processing decisions might impact model output. One key step in the data processing pipeline is that of data integration and entity resolution, where entities are matched across disparate datasets. In this paper, we present a new formulation of data integration in complex networks that incorporates integration uncertainty. We define an approach for understanding how different data integration setups can impact the results of network diffusion models under this uncertainty, allowing one to systematically characterise potential model outputs in order to create an output distribution that provides a more comprehensive picture.
-
Proactively Screening Machine Learning Pipelines with ARGUSEYES
Schelter, Sebastian,
Grafberger, Stefan,
Guha, Shubha,
Karlas, Bojan,
and Zhang, Ce
In Companion of the 2023 International Conference on Management of Data
2023
[Abs]
[Link]
[DOI:10.1145/3555041.3589682]
[2nd Place Demo]
Software systems that learn from data with machine learning (ML) are ubiquitous. ML pipelines in these applications often suffer from a variety of data-related issues, such as data leakage, label errors or fairness violations, which require reasoning about complex dependencies between their inputs and outputs. These issues are usually only detected in hindsight after deployment, after they caused harm in production. We demonstrate ArgusEyes, a system which enables data scientists to proactively screen their ML pipelines for data-related issues as part of continuous integration. ArgusEyes instruments, executes and screens ML pipelines for declaratively specified pipeline issues, and analyzes data artifacts and their provenance to catch potential problems early before deployment to production. We demonstrate our system for three scenarios: detecting mislabeled images in a computer vision pipeline, spotting data leakage in a price prediction pipeline, and addressing fairness violations in a credit scoring pipeline.
-
Seventh Workshop on Data Management for End-to-End Machine Learning (DEEM)
Boehm, Matthias,
Hulsebos, Madelon,
Shankar, Shreya,
and Varma, Paroma
In Companion of the 2023 International Conference on Management of Data
2023
[Abs]
[Link]
[DOI:10.1145/3555041.3590819]
The DEEM’23 workshop (Data Management for End-to-End Machine Learning) is held on Sunday June 18th, in conjunction with SIGMOD/PODS 2023. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal to discuss the arising data management issues in ML application scenarios. The workshop solicits regular research papers (10 pages) describing preliminary and ongoing research results, including industrial experience reports of end-to-end ML deployments, related to DEEM topics. In addition, DEEM 2023 has a category for short papers (4 pages) as a forum for sharing interesting use cases, problems, datasets, benchmarks, visionary ideas, system designs, preliminary results, and descriptions of system components and tools related to end-to-end ML pipelines. The workshop received 13 high-quality submissions on diverse topics relevant to DEEM, of which 6 were regular papers and 7 were short papers.
-
Models and Practice of Neural Table Representations
Hulsebos, Madelon,
Deng, Xiang,
Sun, Huan,
and Papotti, Paolo
In Companion of the 2023 International Conference on Management of Data
2023
[Abs]
[Link]
[DOI:10.1145/3555041.3589411]
In the last few years, the natural language processing community witnessed advances in neural representations of free-form text with transformer-based language models (LMs). Given the importance of knowledge available in relational tables, recent research efforts extend LMs by developing neural representations for tabular data. In this tutorial, we present these proposals with three main goals. First, we aim at introducing the potentials and limitations of current models to a database audience. Second, we want the attendees to see the benefit of such line of work in a large variety of data applications. Third, we would like to empower the audience with a new set of tools and to inspire them to tackle some of the important directions for neural table representations, including model and system design, evaluation, application and deployment. To achieve these goals, the tutorial is organized in two parts. The first part covers the background for neural table representations, including a survey of the most important systems. The second part is designed as a hands-on session, where attendees will use their laptop to explore this new framework and test neural models involving text and tabular data.
-
Provenance Tracking for End-to-End Machine Learning Pipelines
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
In Companion Proceedings of the ACM Web Conference 2023
2023
[Link]
[DOI:10.1145/3543873.3587557]
2022
-
Relational graph convolutional networks: a closer look
Thanapalasingam, Thiviyan,
Berkel, Lucas,
Bloem, Peter,
and Groth, Paul
PeerJ Computer Science
2022
[Link]
[DOI:10.7717/peerj-cs.1073]
-
Question Answering with Additive Restrictive Training (QuAART): Question Answering for the Rapid Development of New Knowledge Extraction Pipelines
Harper, Corey A.,
Daniel, Ron,
and Groth, Paul
In Knowledge Engineering and Knowledge Management (EKAW)
2022
[Abs]
[Link]
[DOI:10.1007/978-3-031-17105-5_4]
Numerous studies have explored the use of language models and question answering techniques for knowledge extraction. In most cases, these models are trained on data specific to the new task at hand. We hypothesize that using models trained only on generic question answering data (e.g. SQuAD) is a good starting point for domain specific entity extraction. We test this hypothesis, and explore whether the addition of small amounts of training data can help lift model performance. We pay special attention to the use of null answers and unanswerable questions to optimize performance. To our knowledge, no studies have been done to evaluate the effectiveness of this technique. We do so for an end-to-end entity mention detection and entity typing task on HAnDS and FIGER, two common evaluation datasets for fine grained entity recognition. We focus on fine-grained entity recognition because it is a challenging scenario, and because the long tail of types in this task highlights the need for entity extraction systems that can deal with new domains and types. To our knowledge, we are the first system beyond those presented in the original FIGER and HAnDS papers to tackle the task in an end-to-end fashion. Using an extremely small sample from the distantly-supervised HAnDS training data – 0.0015%, or less than 500 passages randomly chosen out of 31 million – we produce a CoNLL F1 score of 73.72 for entity detection on FIGER. Our end-to-end detection and typing evaluation produces macro and micro F1s of 45.11 and 54.75, based on the FIGER evaluation metrics. This work provides a foundation for the rapid development of new knowledge extraction pipelines.
-
Serenade - Low-Latency Session-Based Recommendation in e-Commerce at Scale
Kersbergen, Barrie,
Sprangers, Olivier,
and Schelter, Sebastian
In Proceedings of the 2022 International Conference on Management of Data
2022
[Abs]
[Link]
[DOI:10.1145/3514221.3517901]
Session-based recommendation predicts the next item with which a user will interact, given a sequence of her past interactions with other items. This machine learning problem targets a core scenario in e-commerce platforms, which aim to recommend interesting items to buy to users browsing the site. Session-based recommenders are difficult to scale due to their exponentially large input space of potential sessions. This impedes offline precomputation of the recommendations, and implies the necessity to maintain state during the online computation of next-item recommendations. We propose VMIS-kNN, an adaptation of a state-of-the-art nearest neighbor approach to session-based recommendation, which leverages a prebuilt index to compute next-item recommendations with low latency in scenarios with hundreds of millions of clicks to search through. Based on this approach, we design and implement the scalable session-based recommender system Serenade, which is in production usage at bol.com, a large European e-commerce platform. We evaluate the predictive performance of VMIS-kNN, and show that Serenade can answer a thousand recommendation requests per second with a 90th percentile latency of less than seven milliseconds in scenarios with millions of items to recommend. Furthermore, we present results from a three week long online A/B test with up to 600 requests per second for 6.5 million distinct items on more than 45 million user sessions from our e-commerce platform. To the best of our knowledge, we provide the first empirical evidence that the superior predictive performance of nearest neighbor approaches to session-based recommendation in offline evaluations translates to superior performance in a real world e-commerce setting.
-
Towards Data-Centric What-If Analysis for Native Machine Learning Pipelines
Grafberger, Stefan,
Groth, Paul,
and Schelter, Sebastian
In Proceedings of the Sixth Workshop on Data Management for End-To-End Machine Learning
2022
[Abs]
[Link]
[DOI:10.1145/3533028.3533303]
An important task of data scientists is to understand the sensitivity of their models to changes in the data that the models are trained and tested upon. Currently, conducting such data-centric what-if analyses requires significant and costly manual development and testing with the corresponding chance for the introduction of bugs. We discuss the problem of data-centric what-if analysis over whole ML pipelines (including data preparation and feature encoding), propose optimisations that reuse trained models and intermediate data to reduce the runtime of such analysis, and finally conduct preliminary experiments on three complex example pipelines, where our approach reduces the runtime by a factor of up to six.
-
Responsible Data Management
Stoyanovich, Julia,
Abiteboul, Serge,
Howe, Bill,
Jagadish, H. V.,
and Schelter, Sebastian
Communications of the ACM
2022
[Abs]
[Link]
[DOI:10.1145/3488717]
Perspectives on the role and responsibility of the data-management research community in designing, developing, using, and overseeing automated decision systems.
-
Methods Included
Crusoe, Michael R.,
Abeln, Sanne,
Iosup, Alexandru,
Amstutz, Peter,
Chilton, John,
Tijanić, Nebojša,
Ménager, Hervé,
Soiland-Reyes, Stian,
Gavrilović, Bogdan,
Goble, Carole,
and The CWL Community
Communications of the ACM
2022
[Abs]
[Link]
[DOI:10.1145/3486897]
Standardizing computational reuse and portability with the Common Workflow Language.
-
CITRIS: Causal Identifiability from Temporal Intervened Sequences
Lippe, Phillip,
Magliacane, Sara,
Löwe, Sindy,
Asano, Yuki M.,
Cohen, Taco,
and Gavves, Efstratios
In Proceedings of the 39th International Conference on Machine Learning, ICML
2022
[arXiv]
-
SlotGAN: Detecting Mentions in Text via Adversarial Distant Learning
Daza, Daniel,
Cochez, Michael,
and Groth, Paul
In Proceedings of the Sixth Workshop on Structured Prediction for NLP
2022
[Abs]
[Link]
[DOI:10.18653/v1/2022.spnlp-1.4]
We present SlotGAN, a framework for training a mention detection model that only requires unlabeled text and a gazetteer. It consists of a generator trained to extract spans from an input sentence, and a discriminator trained to determine whether a span comes from the generator, or from the gazetteer. We evaluate the method on English newswire data and compare it against supervised, weakly-supervised, and unsupervised methods. We find that the performance of the method is lower than these baselines, because it tends to generate more and longer spans, and in some cases it relies only on capitalization. In other cases, it generates spans that are valid but differ from the benchmark. When evaluated with metrics based on overlap, we find that SlotGAN performs within 95% of the precision of a supervised method, and 84% of its recall. Our results suggest that the model can generate spans that overlap well, but an additional filtering mechanism is required.
-
Making Canonical Workflow Building Blocks Interoperable across Workflow Languages
Soiland-Reyes, Stian,
Bayarri, Genís,
Andrio, Pau,
Long, Robin,
Lowe, Douglas,
Niewielska, Ania,
Hospital, Adam,
and Groth, Paul
Data Intelligence
2022
[Abs]
[Link]
[DOI:10.1162/dint_a_00135]
We introduce the concept of Canonical Workflow Building Blocks (CWBB), a methodology of describing and wrapping computational tools, in order for them to be utilised in a reproducible manner from multiple workflow languages and execution platforms. The concept is implemented and demonstrated with the BioExcel Building Blocks library (BioBB), a collection of tool wrappers in the field of computational biomolecular simulation. Interoperability across different workflow languages is showcased through a protein Molecular Dynamics setup transversal workflow, built using this library and run with 5 different Workflow Manager Systems (WfMS). We argue such practice is a necessary requirement for FAIR Computational Workflows and an element of Canonical Workflow Frameworks for Research (CWFR) in order to improve widespread adoption and reuse of computational methods across workflow language barriers.
-
Letter from the Special Issue Editor
Schelter, Sebastian
IEEE Data Engineering Bulletin (Special issue on Directions Towards GDPR-Compliant Data Systems and Applications)
2022
[Link]
-
Defining a Knowledge Graph Development Process Through a Systematic Review
Tamašauskaitė, Gytė,
and Groth, Paul
ACM Transactions on Software Engineering and Methodology
2022
[Abs]
[Link]
[DOI:10.1145/3522586]
Knowledge graphs are widely used in industry and studied within the academic community. However, the models applied in the development of knowledge graphs vary. Analysing and providing a synthesis of the commonly used approaches to knowledge graph development would provide researchers and practitioners a better understanding of the overall process and methods involved. Hence, this paper aims to define the overall process of knowledge graph development and its key constituent steps. For this purpose, a systematic review and a conceptual analysis of the literature was conducted. The resulting process was compared to case studies to evaluate its applicability. The proposed process suggests a unified approach and provides guidance for both researchers and practitioners when constructing and managing knowledge graphs.
-
Data distribution debugging in machine learning pipelines
Grafberger, Stefan,
Groth, Paul,
Stoyanovich, Julia,
and Schelter, Sebastian
The VLDB Journal
2022
[Link]
[DOI:10.1007/s00778-021-00726-w]
-
Structure-based knowledge acquisition from electronic lab notebooks for research data provenance documentation
Schröder, Max,
Staehlke, Susanne,
Groth, Paul,
Nebe, J. Barbara,
Spors, Sascha,
and Krüger, Frank
Journal of Biomedical Semantics
2022
[Link]
[DOI:10.1186/s13326-021-00257-x]
-
Packaging research artefacts with RO-Crate
Soiland-Reyes, Stian,
Sefton, Peter,
Crosas, Mercè,
Castro, Leyla Jael,
Coppens, Frederik,
Fernández, José M.,
Garijo, Daniel,
Grüning, Björn,
La Rosa, Marco,
Leo, Simone,
et al.
Data Science
2022
[Link]
[DOI:10.3233/DS-210053]
-
The Semantic Web - 19th International Conference, ESWC 2022, Hersonissos, Crete, Greece, May 29 - June 2, 2022, Proceedings
2022
[Link]
[DOI:10.1007/978-3-031-06981-9]
-
GitSchemas: A Dataset for Automating Relational Data Preparation Tasks
Döhmen, Till,
Hulsebos, Madelon,
Beecks, Christian,
and Schelter, Sebastian
In 2022 IEEE 38th International Conference on Data Engineering Workshops (ICDEW)
2022
[Link]
[DOI:10.1109/ICDEW55742.2022.00016]
-
Towards improving Wikidata reuse with emerging patterns
Carriero, Valentina Anita,
Groth, Paul,
and Presutti, Valentina
In Proceedings of the 3rd Wikidata Workshop 2022 co-located with the 21st International Semantic Web Conference (ISWC2022), Virtual Event, Hangzhou, China, October 2022
2022
[Link]
-
Proceedings of the Semantic Web Challenge on Tabular Data to Knowledge Graph Matching co-located with the 20th International Semantic Web Conference (ISWC 2021), Virtual conference, October 27, 2021
Jiménez-Ruiz, Ernesto,
Efthymiou, Vasilis,
Chen, Jiaoyan,
Cutrona, Vincenzo,
Hassanzadeh, Oktie,
Sequeda, Juan,
Srinivas, Kavitha,
Abdelmageed, Nora,
Hulsebos, Madelon,
Oliveira, Daniela,
and Pesquita, Catia
2022
[Link]
-
AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning
Huang, Biwei,
Feng, Fan,
Lu, Chaochao,
Magliacane, Sara,
and Zhang, Kun
In International Conference on Learning Representations
2022
[Link]
[Spotlight Presentation]
-
Making Table Understanding Work in Practice
Hulsebos, Madelon,
Gathani, Sneha,
Gale, James,
Dillig, Isil,
Groth, Paul,
and Demiralp, Çagatay
In 12th Conference on Innovative Data Systems Research, CIDR 2022, Chaminade, CA, USA, January 9-12, 2022
2022
[Link]
-
Screening Native Machine Learning Pipelines with ArgusEyes
Schelter, Sebastian,
Grafberger, Stefan,
Guha, Shubha,
Sprangers, Olivier,
Karlas, Bojan,
and Zhang, Ce
In 12th Conference on Innovative Data Systems Research, CIDR 2022,
Chaminade, CA, USA, January 9-12, 2022
2022
[Link]
2021
-
GraphPOPE: Retaining Structural Graph Information Using Position-aware Node
Embeddings
Boef, Jeroen Den,
Cornelisse, Joran,
and Groth, Paul
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2021)
2021
[Link]
-
Quality Assessment of Knowledge Graph Hierarchies using KG-BERT
Szarkowska, Kinga,
Moore, Veronique,
Vandenbussche, Pierre-Yves,
and Groth, Paul
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG 2021)
2021
[Link]
-
Perspectives on automated composition of workflows in the life sciences
Lamprecht, Anna-Lena,
Palmblad, Magnus,
Ison, Jon,
Schwämmle, Veit,
Manir, Mohammad Sadnan Al,
Altintas, Ilkay,
Baker, Christopher J. O.,
Amor, Ammar Ben Hadj,
Capella-Gutierrez, Salvador,
Charonyktakis, Paulos,
Crusoe, Michael R.,
Gil, Yolanda,
Goble, Carole,
Griffin, Timothy J.,
Groth, Paul,
Ienasescu, Hans,
Jagtap, Pratik,
Kalaš, Matúš,
Kasalica, Vedran,
Khanteymoori, Alireza,
Kuhn, Tobias,
Mei, Hailiang,
Ménager, Hervé,
Möller, Steffen,
Richardson, Robin A.,
Robert, Vincent,
Soiland-Reyes, Stian,
Stevens, Robert,
Szaniszlo, Szoke,
Verberne, Suzan,
Verhoeven, Aswin,
and Wolstencroft, Katherine
F1000Research
2021
[Link]
[DOI:10.12688/f1000research.54159.1]
-
SemEval-2021 Task 8: MeasEval – Extracting Counts and Measurements and their Related Contexts
Harper, Corey,
Cox, Jessica,
Kohler, Curt,
Scerri, Antony,
Daniel Jr., Ron,
and Groth, Paul
In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
2021
[Abs]
[Link]
[DOI:10.18653/v1/2021.semeval-1.38]
[SemEval 2021 Best Task Paper]
We describe MeasEval, a SemEval task of extracting counts, measurements, and related context from scientific documents, which is of significant importance to the creation of Knowledge Graphs that distill information from the scientific literature. This is a new task in 2021, for which over 75 submissions from 25 participants were received. We expect the data developed for this task and the findings reported to be valuable to the scientific knowledge extraction, metrology, and automated knowledge base construction communities.
-
Further with Knowledge Graphs: Proceedings of the 17th International Conference on Semantic Systems, 6–9 September 2021, Amsterdam, The Netherlands
2021
[Link]
[DOI:10.3233/SSW53]
-
Reinforcement Learning–Based Collective Entity Alignment with Adaptive Features
Zeng, Weixin,
Zhao, Xiang,
Tang, Jiuyang,
Lin, Xuemin,
and Groth, Paul
ACM Trans. Inf. Syst.
2021
[Abs]
[Link]
[DOI:10.1145/3446428]
Entity alignment (EA) is the task of identifying the entities that refer to the same real-world object but are located in different knowledge graphs (KGs). For entities to be aligned, existing EA solutions treat them separately and generate alignment results as ranked lists of entities on the other side. Nevertheless, this decision-making paradigm fails to take into account the interdependence among entities. Although some recent efforts mitigate this issue by imposing the 1-to-1 constraint on the alignment process, they still cannot adequately model the underlying interdependence and the results tend to be sub-optimal. To fill in this gap, in this work, we delve into the dynamics of the decision-making process, and offer a reinforcement learning (RL)–based model to align entities collectively. Under the RL framework, we devise the coherence and exclusiveness constraints to characterize the interdependence and restrict collective alignment. Additionally, to generate more precise inputs to the RL framework, we employ representative features to capture different aspects of the similarity between entities in heterogeneous KGs, which are integrated by an adaptive feature fusion strategy. Our proposal is evaluated on both cross-lingual and mono-lingual EA benchmarks and compared against state-of-the-art solutions. The empirical results verify its effectiveness and superiority.
-
Learnings from a Retail Recommendation System on Billions of Interactions at bol.com
Kersbergen, B.,
and Schelter, S.
In 2021 IEEE 37th International Conference on Data Engineering (ICDE)
2021
[Link]
[DOI:10.1109/ICDE51399.2021.00277]
-
Inductive Entity Representations from Text via Link Prediction
Daza, Daniel,
Cochez, Michael,
and Groth, Paul
In Proceedings of The Web Conference
2021
[arXiv]
[DOI:10.1145/3442381.3450141]
[Code]
-
Letter from the Special Issue Editor
Schelter, Sebastian
IEEE Data Engineering Bulletin (Special issue on Data validation for machine learning models and applications)
2021
[Link]
-
Complex Query Answering with Neural Link Predictors
Arakelyan, Erik,
Daza, Daniel,
Minervini, Pasquale,
and Cochez, Michael
In International Conference on Learning Representations (ICLR)
2021
[arXiv]
[Link]
[Outstanding Paper Award ICLR 2021]
-
Taming Technical Bias in Machine Learning Pipelines
Schelter, Sebastian,
and Stoyanovich, Julia
IEEE Data Engineering Bulletin (Special Issue on Interdisciplinary Perspectives on Fairness and Artificial Intelligence Systems)
2021
[Link]
-
Talking datasets – Understanding data sensemaking behaviours
Koesten, Laura,
Gregory, Kathleen,
Groth, Paul,
and Simperl, Elena
International Journal of Human-Computer Studies
2021
[Abs]
[arXiv]
[Link]
[DOI:10.1016/j.ijhcs.2020.102562]
The sharing and reuse of data are seen as critical to solving the most complex problems of today. Despite this potential, relatively little attention has been paid to a key step in data reuse: the behaviours involved in data-centric sensemaking. We aim to address this gap by presenting a mixed-methods study combining in-depth interviews, a think-aloud task and a screen recording analysis with 31 researchers from different disciplines as they summarised and interacted with both familiar and unfamiliar data. We use our findings to identify and detail common patterns of data-centric sensemaking across three clusters of activities that we present as a framework: inspecting data, engaging with content, and placing data within broader contexts. Additionally, we propose design recommendations for tools and documentation practices, which can be used to facilitate sensemaking and subsequent data reuse.
-
The Challenges of Cross-Document Coreference Resolution for Email
Li, Xue,
Magliacane, Sara,
and Groth, Paul
In Proceedings of the 11th on Knowledge Capture Conference
2021
[Abs]
[Link]
[DOI:10.1145/3460210.3493573]
Long-form conversations such as email are an important source of information for knowledge capture. For tasks such as knowledge graph construction, conversational search, and entity linking, being able to resolve entities from across documents is important. Building on recent work on within document coreference resolution for email, we study for the first time a cross-document formulation of the problem. Our results show that the current state-of-the-art deep learning models for general cross-document coreference resolution are insufficient for email conversations. Our experiments show that the general task is challenging and, importantly for knowledge intensive tasks, coreference resolution models that only treat entity mentions perform worse. Based on these results, we outline the work needed to address this challenging task.
-
Supporting Ontology Maintenance with Contextual Word Embeddings and
Maximum Mean Discrepancy
Shroff, Natasha,
Vandenbussche, Pierre-Yves,
Moore, Véronique,
and Groth, Paul
In Joint Proceedings of the 2nd International Workshop on Deep Learning
meets Ontologies and Natural Language Processing (DeepOntoNLP 2021)
& 6th International Workshop on Explainable Sentiment Mining
and Emotion Detection (X-SENTIMENT 2021) co-located with
18th Extended Semantic Web Conference 2021, Hersonissos, Greece,
June 6th - 7th, 2021 (moved online)
2021
[Link]
-
Proceedings of Machine Learning with Symbolic Methods and Knowledge Graphs co-located
with European Conference on Machine Learning and Principles and Practice
of Knowledge Discovery in Databases (ECML PKDD 2021), Virtual,
September 17, 2021
Alam, Mehwish,
Ali, Mehdi,
Groth, Paul,
Hitzler, Pascal,
Lehmann, Jens,
Paulheim, Heiko,
Rettinger, Achim,
Sack, Harald,
Sadeghi, Afshin,
and Tresp, Volker
2021
[Link]
-
Verifiably Safe Exploration for End-to-End Reinforcement Learning
Hunt, Nathan,
Fulton, Nathan,
Magliacane, Sara,
Hoang, Trong Nghia,
Das, Subhro,
and Solar-Lezama, Armando
In Proceedings of the 24th International Conference on Hybrid Systems: Computation and Control
2021
[Abs]
[Link]
[DOI:10.1145/3447928.3456653]
[Best Paper Award ACM HSCC 2021]
Deploying deep reinforcement learning in safety-critical settings requires developing algorithms that obey hard constraints during exploration. This paper contributes a first approach toward enforcing formal safety constraints on end-to-end policies with visual inputs. Our approach draws on recent advances in object detection and automated reasoning for hybrid dynamical systems. The approach is evaluated on a novel benchmark that emphasizes the challenge of safely exploring in the presence of hard constraints. Our benchmark draws from several proposed problem sets for safe learning and includes problems that emphasize challenges such as reward signals that are not aligned with safety constraints. On each of these benchmark problems, our algorithm completely avoids unsafe behavior while remaining competitive at optimizing for as much reward as is safe. We characterize safety constraints in terms of a refinement relation on Markov decision processes - rather than directly constraining the reinforcement learning algorithm so that it only takes safe actions, we instead refine the environment so that only safe actions are defined in the environment’s transition structure. This has pragmatic system design benefits and, more importantly, provides a clean conceptual setting in which we are able to prove important safety and efficiency properties. These allow us to transform the constrained optimization problem of acting safely in the original environment into an unconstrained optimization in a refined environment.
-
Summary of Tutorials at The Web Conference 2021
West, Robert,
Bhagat, Smriti,
Groth, Paul,
Zitnik, Marinka,
Couto, Francisco M.,
Lisena, Pasquale,
Meroño-Peñuela, Albert,
Zhao, Xiangyu,
Fan, Wenqi,
Yin, Dawei,
Tang, Jiliang,
Shou, Linjun,
Gong, Ming,
Pei, Jian,
Geng, Xiubo,
Zhou, Xingjie,
Jiang, Daxin,
Ricaud, Benjamin,
Aspert, Nicolas,
Miz, Volodymyr,
Dy, Jennifer,
Ioannidis, Stratis,
Yıldız, İlkay,
Rezapour, Rezvaneh,
Aref, Samin,
Dinh, Ly,
Diesner, Jana,
Drutsa, Alexey,
Ustalov, Dmitry,
Popov, Nikita,
Baidakova, Daria,
Mishra, Shubhanshu,
Gopalan, Arjun,
Juan, Da-Cheng,
Ilharco Magalhaes, Cesar,
Ferng, Chun-Sung,
Heydon, Allan,
Lu, Chun-Ta,
Pham, Philip,
Yu, George,
Fan, Yicheng,
Wang, Yueqi,
Laurent, Florian,
Schraner, Yanick,
Scheller, Christian,
Mohanty, Sharada,
Chen, Jiawei,
Wang, Xiang,
Feng, Fuli,
He, Xiangnan,
Teinemaa, Irene,
Albert, Javier,
Goldenberg, Dmitri,
Vasile, Flavian,
Rohde, David,
Jeunen, Olivier,
Benhalloum, Amine,
Sakhi, Otmane,
Rong, Yu,
Huang, Wenbing,
Xu, Tingyang,
Bian, Yatao,
Cheng, Hong,
Sun, Fuchun,
Huang, Junzhou,
Fakhraei, Shobeir,
Faloutsos, Christos,
Çelebi, Onur,
Müller, Martin,
Schneider, Manuel,
Altunina, Olesia,
Wingerath, Wolfram,
Wollmer, Benjamin,
Gessert, Felix,
Succo, Stephan,
Ritter, Norbert,
Courdier, Evann,
Avram, Tudor Mihai,
Cvetinovic, Dragan,
Tsinadze, Levan,
Jose, Johny,
Howell, Rose,
Koenig, Mario,
Defferrard, Michaël,
Kenthapadi, Krishnaram,
Packer, Ben,
Sameki, Mehrnoosh,
and Sephus, Nashlie
In Companion Proceedings of the Web Conference 2021
2021
[Abs]
[Link]
[DOI:10.1145/3442442.3453701]
This report summarizes the 23 tutorials hosted at The Web Conference 2021: nine lecture-style tutorials and 14 hands-on tutorials.
-
HedgeCut: Maintaining Randomised Trees for Low-Latency Machine Unlearning
Schelter, Sebastian,
Grafberger, Stefan,
and Dunning, Ted
In Proceedings of the 2021 International Conference on Management of Data
2021
[Abs]
[Link]
[DOI:10.1145/3448016.3457239]
Software systems that learn from user data with machine learning (ML) have become ubiquitous over the last years. Recent law such as the "General Data Protection Regulation" (GDPR) requires organisations that process personal data to delete user data upon request (enacting the "right to be forgotten"). However, this regulation does not only require the deletion of user data from databases, but also applies to ML models that have been learned from the stored data. We therefore argue that ML applications should offer users to unlearn their data from trained models in a timely manner. We explore how fast this unlearning can be done under the constraints imposed by real world deployments, and introduce the problem of low-latency machine unlearning: maintaining a deployed ML model in-place under the removal of a small fraction of training samples without retraining. We propose HedgeCut, a classification model based on an ensemble of randomised decision trees, which is designed to answer unlearning requests with low latency. We detail how to efficiently implement HedgeCut with vectorised operators for decision tree learning. We conduct an experimental evaluation on five privacy-sensitive datasets, where we find that HedgeCut can unlearn training samples with a latency of around 100 microseconds and answers up to 36,000 prediction requests per second, while providing a training time and predictive accuracy similar to widely used implementations of tree-based ML models such as Random Forests.
-
MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines
Grafberger, Stefan,
Guha, Shubha,
Stoyanovich, Julia,
and Schelter, Sebastian
In Proceedings of the 2021 International Conference on Management of Data
2021
[Abs]
[Link]
[DOI:10.1145/3448016.3452759]
Machine Learning (ML) is increasingly used to automate impactful decisions, and the risks arising from this wide-spread use are garnering attention from policymakers, scientists, and the media. ML applications are often very brittle with respect to their input data, which leads to concerns about their reliability, accountability, and fairness. While bias detection cannot be fully automated, computational tools can help pinpoint particular types of data issues. We recently proposed mlinspect, a library that enables lightweight lineage-based inspection of ML preprocessing pipelines. In this demonstration, we show how mlinspect can be used to detect data distribution bugs in a representative pipeline. In contrast to existing work, mlinspect operates on declarative abstractions of popular data science libraries like estimator/transformer pipelines, can handle both relational and matrix data, and does not require manual code instrumentation. The library is publicly available at https://github.com/stefan-grafberger/mlinspect.
-
Automating Data Quality Validation for Dynamic Data Ingestion
Redyuk, Sergey,
Kaoudi, Zoi,
Markl, Volker,
and Schelter, Sebastian
In Proceedings of the 24th International Conference on Extending Database
Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021
2021
[Link]
[DOI:10.5441/002/edbt.2021.07]
-
JENGA - A Framework to Study the Impact of Data Errors on the
Predictions of Machine Learning Models
Schelter, Sebastian,
Rukat, Tammo,
and Biessmann, Felix
In Proceedings of the 24th International Conference on Extending Database
Technology, EDBT 2021, Nicosia, Cyprus, March 23 - 26, 2021
2021
[Link]
[DOI:10.5441/002/edbt.2021.63]
2020
-
Towards Olfactory Information Extraction from Text: A Case Study on Detecting Smell Experiences in Novels
Brate, Ryan,
Groth, Paul,
and Erp, Marieke
In Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
2020
[Abs]
[Link]
Environmental factors determine the smells we perceive, but societal factors shape the importance, sentiment and biases we give to them. Descriptions of smells in text, or as we call them ‘smell experiences’, offer a window into these factors, but they must first be identified. To the best of our knowledge, no tool exists to extract references to smell experiences from text. In this paper, we present two variations on a semi-supervised approach to identify smell experiences in English literature. The combined set of patterns from both implementations offers significantly better performance than a keyword-based baseline.
-
Dataset Reuse: Toward Translating Principles to Practice
Koesten, Laura,
Vougiouklis, Pavlos,
Simperl, Elena,
and Groth, Paul
Patterns
2020
[Link]
[DOI:10.1016/j.patter.2020.100136]
-
Effective distributed representations for academic expert search
Berger, Mark,
Zavrel, Jakub,
and Groth, Paul
In Proceedings of the First Workshop on Scholarly Document Processing at EMNLP
2020
[Abs]
[Link]
Expert search aims to find and rank experts based on a user’s query. In academia, retrieving experts is an efficient way to navigate through a large amount of academic knowledge. Here, we study how different distributed representations of academic papers (i.e. embeddings) impact academic expert retrieval. We use the Microsoft Academic Graph dataset and experiment with different configurations of a document-centric voting model for retrieval. In particular, we explore the impact of the use of contextualized embeddings on search performance. We also present results for paper embeddings that incorporate citation information through retrofitting. Additionally, experiments are conducted using different techniques for assigning author weights based on author order. We observe that using contextual embeddings produced by a transformer model trained for sentence similarity tasks produces the most effective paper representations for document-centric expert retrieval. However, retrofitting the paper embeddings and using elaborate author contribution weighting strategies did not improve retrieval performance.
-
Dataset search: a survey
Chapman, Adriane,
Simperl, Elena,
Koesten, Laura,
Konstantinidis, George,
Ibáñez, Luis-Daniel,
Kacprzak, Emilia,
and Groth, Paul
The VLDB Journal
2020
[arXiv]
[Link]
[DOI:10.1007/s00778-019-00564-x]
-
Introduction – FAIR data, systems and analysis
Groth, Paul,
and Dumontier, Michel
Data Science
2020
[Link]
[DOI:10.3233/DS-200029]
-
Fairness-Aware Instrumentation of Preprocessing Pipelines for Machine
Learning
Yang, Ke,
Huang, Biao,
Stoyanovich, Julia,
and Schelter, Sebastian
In Workshop on Human-In-the-Loop Data Analytics (HILDA’20)
2020
[Link]
[DOI:10.1145/3398730.3399194]
-
Towards Entity Spaces
Erp, Marieke,
and Groth, Paul
In Proceedings of The 12th Language Resources and Evaluation Conference
2020
[Abs]
[Link]
Entities are a central element of knowledge bases and are important input to many knowledge-centric tasks including text analysis. For example, they allow us to find documents relevant to a specific entity irrespective of the underlying syntactic expression within a document. However, the entities that are commonly represented in knowledge bases are often a simplification of what is truly being referred to in text. For example, in a knowledge base, we may have an entity for Germany as a country but not for the more fuzzy concept of Germany that covers notions of German Population, German Drivers, and the German Government. Inspired by recent advances in contextual word embeddings, we introduce the concept of entity spaces - specific representations of a set of associated entities with near-identity. Thus, these entity spaces provide a handle to an amorphous grouping of entities. We developed a proof-of-concept for English showing how, through the introduction of entity spaces in the form of disambiguation pages, the recall of entity linking can be improved.
-
Lost or Found? Discovering Data Needed for Research
Gregory, Kathleen,
Groth, Paul,
Scharnhorst, Andrea,
and Wyatt, Sally
Harvard Data Science Review
2020
[Link]
[DOI:10.1162/99608f92.e38165eb]
-
PANDAcap: A Framework for Streamlining Collection of Full-System Traces
Stamatogiannakis, Manolis,
Bos, Herbert,
and Groth, Paul
In EuroSec
2020
[Link]
[DOI:10.1145/3380786.3391396]
[Code]
-
Estimating the imageability of words by mining visual characteristics from crawled image data
Kastner, Marc A.,
Ide, Ichiro,
Nack, Frank,
Kawanishi, Yasutomo,
Hirayama, Takatsugu,
Deguchi, Daisuke,
and Murase, Hiroshi
Multimedia Tools and Applications
2020
[Link]
[DOI:10.1007/s11042-019-08571-4]
-
FAIR Data Reuse – the Path through Data Citation
Groth, Paul,
Cousijn, Helena,
Clark, Tim,
and Goble, Carole
Data Intelligence
2020
[Link]
[DOI:10.1162/dint_a_00030]
-
Message Passing Query Embedding
Daza, Daniel,
and Cochez, Michael
In ICML Workshop - Graph Representation Learning and Beyond
2020
[arXiv]
[Link]
-
The state of altmetrics: a tenth anniversary celebration
Altmetric Engineering,
Konkiel, Stacy,
Priem, Jason,
Adie, Euan,
Derrick, Gemma,
Didegah, Fereshteh,
Groth, Paul,
Neylon, Cameron,
Xu, Shenmeng,
Zahedi, Zohreh,
Bowman, Timothy,
Patel, Vanash M,
Haunschild, Robin,
Bornmann, Lutz,
Taylor, Mike,
Ross, Liesa,
Theng, Yin-Leng,
Hassan, Saeed-Ul,
and Aljohani, Naif R.
2020
[Link]
[DOI:10.6084/M9.FIGSHARE.13010000.V2]
-
CSSA’20: Workshop on Combining Symbolic and Sub-Symbolic Methods and Their Applications
Alam, Mehwish,
Groth, Paul,
Hitzler, Pascal,
Paulheim, Heiko,
Sack, Harald,
and Tresp, Volker
In Proceedings of the 29th ACM International Conference on Information & Knowledge Management
2020
[Abs]
[Link]
[DOI:10.1145/3340531.3414072]
There has been a rapid growth in the use of symbolic representations along with their applications in many important tasks. Symbolic representations, in the form of Knowledge Graphs (KGs), constitute large networks of real-world entities and their relationships. On the other hand, sub-symbolic artificial intelligence has also become a mainstream area of research. This workshop brought together researchers to discuss and foster collaborations on the intersection of these two areas.
-
ICIDS2020 Panel: Building the Discipline of Interactive Digital Narratives
Bernstein, Mark,
Palosaari Eladhari, Mirjam,
Koenitz, Hartmut,
Louchart, Sandy,
Nack, Frank,
Martens, Chris,
Rossi, Giulia Carla,
Bosser, Anne-Gwenn,
and Millard, David E.
In Interactive Storytelling
2020
[Abs]
[DOI:10.1007/978-3-030-62516-0_1]
Building our discipline has been an ongoing discussion since the early days of ICIDS. From earlier international joint efforts to integrate research from multiple fields of study to today’s endeavours by researchers to provide scholarly works of reference, the discussion on how to continue building Interactive Digital Narratives as a discipline with its own vocabulary, scope, evaluation and methods is far from over. This year, we have chosen to continue this discussion through a panel in order to explore what are the epistemological implications of the multiple disciplinary roots of our field, and what are the next steps we should take as a community.
-
Technical Perspective: Query Optimization for Faster Deep CNN Explanations
Schelter, Sebastian
ACM SIGMOD Record
2020
[Link]
-
Apache Mahout: Machine Learning on Distributed Dataflow Systems
Anil, Robin,
Capan, Gokhan,
Drost-Fromm, Isabel,
Dunning, Ted,
Friedman, Ellen,
Grant, Trevor,
Quinn, Shannon,
Ranjan, Paritosh,
Schelter, Sebastian,
and Yılmazel, Özgür
Journal of Machine Learning Research
2020
[Link]
-
Semantic Systems. In the Era of Knowledge Graphs - 16th International
Conference on Semantic Systems, SEMANTiCS 2020, Amsterdam, The Netherlands,
September 7-10, 2020, Proceedings
Blomqvist, Eva,
Groth, Paul,
Boer, Victor,
Pellegrini, Tassilo,
Alam, Mehwish,
Käfer, Tobias,
Kieseberg, Peter,
Kirrane, Sabrina,
Meroño-Peñuela, Albert,
and Pandit, Harshvardhan J.
2020
[Link]
[DOI:10.1007/978-3-030-59833-4]
-
A longitudinal analysis of university rankings
Selten, Friso,
Neylon, Cameron,
Huang, Chun-Kai,
and Groth, Paul
Quantitative Science Studies
2020
[Link]
[DOI:10.1162/qss_a_00052]
2019
-
How Relevant Is Your Choice?
Kolhoff, Lobke,
and Nack, Frank
In ICIDS 2019. Lecture Notes in Computer Science, vol 11869
2019
[Abs]
With the release of the film Black Mirror: Bandersnatch, Netflix entered the area of interactive streamed narratives. We performed a qualitative analysis with 169 Netflix subscribers that had watched the episode. The key findings show (1) participants are initially engaged because of curiosity and the novelty value, and desire to explore the narrative regardless of satisfaction, (2) perceived agency is limited due to arbitrary choices and the lack of meaningful consequences, (3) the overall experience is satisfactory but adaptations are desirable in future design to make full use of the potential of the format.
-
Transfer Learning for Biomedical Named Entity Recognition with BioBERT
Symeonidou, Anthi,
Sazonau, Viachaslau,
and Groth, Paul
In Proceedings of the Posters and Demo Track of the 15th International
Conference on Semantic Systems co-located with 15th International
Conference on Semantic Systems (SEMANTiCS 2019), Karlsruhe, Germany,
September 9th - to - 12th, 2019.
2019
[Link]
-
Understanding data search as a socio-technical practice
Gregory, Kathleen M,
Cousijn, Helena,
Groth, Paul,
Scharnhorst, Andrea,
and Wyatt, Sally
Journal of Information Science
2019
[Abs]
[Link]
[DOI:10.1177/0165551519837182]
Open research data are heralded as having the potential to increase effectiveness, productivity and reproducibility in science, but little is known about the actual practices involved in data search. The socio-technical problem of locating data for reuse is often reduced to the technological dimension of designing data search systems. We combine a bibliometric study of the current academic discourse around data search with interviews with data seekers. In this article, we explore how adopting a contextual, socio-technical perspective can help to understand user practices and behaviour and ultimately help to improve the design of data discovery systems.
-
Searching Data: A Review of Observational Data Retrieval Practices in Selected Disciplines
Gregory, Kathleen,
Groth, Paul,
Cousijn, Helena,
Scharnhorst, Andrea,
and Wyatt, Sally
Journal of the Association for Information Science and Technology
2019
[Abs]
[Link]
[DOI:10.1002/asi.24165]
A cross-disciplinary examination of the user behaviors involved in seeking and evaluating data is surprisingly absent from the research data discussion. This review explores the data retrieval literature to identify commonalities in how users search for and evaluate observational research data in selected disciplines. Two analytical frameworks, rooted in information retrieval and science and technology studies, are used to identify key similarities in practices as a first step toward developing a model describing data retrieval.
-
End-to-End Learning for Answering Structured Queries Directly over
Text
Groth, Paul T.,
Scerri, Antony,
Daniel, Ron,
and Allen, Bradley P.
In Proceedings of the Workshop on Deep Learning for Knowledge Graphs
(DL4KG2019) Co-located with the 16th Extended Semantic Web Conference
2019 (ESWC 2019), Portoroz, Slovenia, June 2, 2019.
2019
[arXiv]
[Link]
2018
-
Open Information Extraction on Scientific Text: An Evaluation
Groth, Paul T.,
Lauruhn, Michael,
Scerri, Antony,
and Daniel, Ron
In Proceedings of the 27th International Conference on Computational
Linguistics, COLING 2018, Santa Fe, New Mexico, USA, August 20-26,
2018
2018
[Link]
-
Elsevier’s Healthcare Knowledge Graph and the Case for Enterprise
Level Linked Data Standards
DeJong, Alex,
Bord, Radmila,
Dowling, Will,
Hoekstra, Rinke,
Moquin, Ryan,
O, Charlie,
Samarasinghe, Mevan,
Snyder, Paul,
Stanley, Craig,
Tordai, Anna,
Trefry, Michael,
and Groth, Paul T.
In Proceedings of the ISWC 2018 Posters & Demonstrations, Industry
and Blue Sky Ideas Tracks co-located with 17th International Semantic
Web Conference (ISWC 2018), Monterey, USA, October 8th - to - 12th,
2018.
2018
[Link]
-
Use of Internal Testing Data to Help Determine Compensation for Crowdsourcing
Tasks
Lauruhn, Michael,
Groth, Paul T.,
Harper, Corey A.,
and Deus, Helena F.
In Proceedings of the 2nd International Workshop on Augmenting Intelligence
with Humans-in-the-Loop co-located with 17th International
Semantic Web Conference (ISWC 2018), Monterey, California, October
9th, 2018.
2018
[Link]
2017
-
Indicators for the use of robotic labs in basic biomedical research: a literature analysis
Groth, Paul,
and Cox, Jessica
PeerJ
2017
[Abs]
[Link]
[DOI:10.7717/peerj.3997]
Robotic labs, in which experiments are carried out entirely by robots, have the potential to provide a reproducible and transparent foundation for performing basic biomedical laboratory experiments. In this article, we investigate whether these labs could be applicable in current experimental practice. We do this by text mining 1,628 papers for occurrences of methods that are supported by commercial robotic labs. Using two different concept recognition tools, we find that 86%–89% of the papers have at least one of these methods. This and our other results provide indications that robotic labs can serve as the foundation for performing many lab-based experiments.
-
Storing, Tracking, and Querying Provenance in Linked Data
Wylot, Marcin,
Cudré-Mauroux, Philippe,
Hauswirth, Manfred,
and Groth, Paul T.
IEEE Trans. Knowl. Data Eng.
2017
[Link]
[DOI:10.1109/TKDE.2017.2690299]
-
PROV2R: Practical Provenance Analysis of Unstructured
Processes
Stamatogiannakis, Manolis,
Athanasopoulos, Elias,
Bos, Herbert,
and Groth, Paul T.
ACM Trans. Internet Techn.
2017
[Link]
[DOI:10.1145/3062176]
2016
-
Sources of Change for Modern Knowledge Organization Systems
Lauruhn, Michael,
and Groth, Paul
KNOWLEDGE ORGANIZATION
2016
[arXiv]
[DOI:10.5771/0943-7444-2016-8-622]
-
Applying Universal Schemas for Domain Specific Ontology Expansion
Groth, Paul T.,
Pal, Sujit,
McBeath, Darin,
Allen, Brad,
and Daniel, Ron
In Proceedings of the 5th Workshop on Automated Knowledge Base Construction,
AKBC@NAACL-HLT 2016, San Diego, CA, USA, June 17, 2016
2016
[Link]
-
The FAIR Guiding Principles for scientific data management and stewardship
Wilkinson, Mark D.,
Dumontier, Michel,
Aalbersberg, IJsbrand Jan,
Appleton, Gabrielle,
Axton, Myles,
Baak, Arie,
Blomberg, Niklas,
Boiten, Jan Willem,
da Silva Santos, Luiz Bonino,
Bourne, Philip E.,
Bouwman, Jildau,
Brookes, Anthony J.,
Clark, Tim,
Crosas, Mercè,
Dillo, Ingrid,
Dumon, Olivier,
Edmunds, Scott,
Evelo, Chris T.,
Finkers, Richard,
Gonzalez-Beltran, Alejandra,
Gray, Alasdair J.G.,
Groth, Paul,
Goble, Carole,
Grethe, Jeffrey S.,
Heringa, Jaap,
’t Hoen, Peter A C,
Hooft, Rob,
Kuhn, Tobias,
Kok, Ruben,
Kok, Joost,
Lusher, Scott J.,
Martone, Maryann E,
Mons, Albert,
Packer, Abel L.,
Persson, Bengt,
Rocca-Serra, Philippe,
Roos, Marco,
van Schaik, Rene,
Sansone, Susanna Assunta,
Schultes, Erik,
Sengstag, Thierry,
Slater, Ted,
Strawn, George,
Swertz, Morris A.,
Thompson, Mark,
Van Der Lei, Johan,
Van Mulligen, Erik,
Velterop, Jan,
Waagmeester, Andra,
Wittenburg, Peter,
Wolstencroft, Katherine,
Zhao, Jun,
and Mons, Barend
Scientific Data
2016
[Abs]
[DOI:10.1038/sdata.2016.18]
There is an urgent need to improve the infrastructure supporting the reuse of scholarly data. A diverse set of stakeholders, representing academia, industry, funding agencies, and scholarly publishers, have come together to design and jointly endorse a concise and measurable set of principles that we refer to as the FAIR Data Principles. The intent is that these may act as a guideline for those wishing to enhance the reusability of their data holdings. Distinct from peer initiatives that focus on the human scholar, the FAIR Principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals. This Comment is the first formal publication of the FAIR Principles, and includes the rationale behind them, and some exemplar implementations in the community.