First Workshop on
Data-AI Systems (DAIS)
Co-located with ICDE 2025, Hong Kong, China
May 19, 2025 (Monday)

Large language models (LLMs) have garnered significant interest due to their impressive capabilities on a wide range of tasks. However, integrating LLM-powered agents in applications that operate over complex data ecosystems poses significant challenges related to heterogeneous data management and discovery while balancing trade-offs involving cost, latency, accuracy, interpretability, and trustworthiness. Within such ecosystems, dubbed compound AI systems, agentic workflows often deal with a wide variety of information, such as proprietary data in an enterprise, multilingual low-resource data, and heterogeneous data types and formats, among others. Therefore, a robust integration of LLM agents in real-world applications necessitates a systems approach to tackle these challenges and ensure effective and efficient utilization of heterogeneous data. This workshop will focus on exploring innovative approaches towards building such data-aware compound AI systems in the era of LLMs while balancing objectives such as cost, efficiency, robustness, and interpretability. It will be a full-day workshop involving invited talks representative of the work done in these communities, research presentations, and a panel discussion exploring the design space of compound AI systems.

Announcements

  • The workshop program is now available!
  • The list of accepted papers is now available!
  • The second keynote speaker has been announced!
  • The first keynote speaker has been announced!
  • The first call for papers of DAIS 2025 is out!
  • We are excited to announce the First Workshop on Data-AI Systems@ICDE 2025 (DAIS).

Keynote Speakers and Panelists

Ziawasch Abedjan is a Professor of Computer Science and chair of the Data Integration and Data Preparation Group at Technische Universität Berlin and the BIFOLD center of excellence.


Kaustubh Beedkar is an Assistant Professor in the Department of Computer Science and Engineering at the Indian Institute of Technology (IIT) Delhi. He is also a Junior Fellow at the Berlin Institute for the Foundations of Learning and Data (BIFOLD). From 2017 to 2023, he was a Senior Researcher in the Database Systems and Information Management (DIMA) group at Technische Universität Berlin. He earned his PhD in Computer Science from the Max Planck Institute for Informatics and the University of Mannheim, Germany. His research lies in the area of data management, with a current focus on compliant data processing in decentralized environments and data federations.


Estevam Hruschka is Lab Director and Staff Research Scientist at Megagon Labs. Prior to Megagon Labs, he was co-leader on the conception, design, creation, deployment, and development of the first Never-Ending Learning System in the history of Computer Science and Artificial Intelligence. Also, he was an associate professor of computer science at the Federal University of Sao Carlos (Brazil) and a Visiting Professor at Carnegie Mellon University (Pittsburgh, PA). His work focuses on both theoretical and applied problems and he is mainly interested in how to build intelligent computer systems capable of deeply understanding Natural Language in a Never-Ending Learning approach. Between 2017 and 2020, he was with Amazon (Alexa Search Team) in Seattle, WA.


Eser Kandogan is a Principal Research Engineer at Megagon Labs, where he leads research efforts on building enterprise multi-agent platforms. From 2000 to 2019, he was a research staff member at IBM Almaden Research Center, conducting research on visual analytics, human-computer interaction, semantic search, and data science and contributing to several IBM products and patents. He holds a Ph.D. in Computer Science from the University of Maryland.


Arpit Narechania is an Assistant Professor in the Department of Computer Science and Engineering (CSE) at the Hong Kong University of Science and Technology (HKUST). He received his PhD in Computer Science from Georgia Institute of Technology, USA and BS (B.Tech.) in Mechanical Engineering from Indian Institute of Technology (IIT) Mandi, India. His main research interests are in visual analytics, human-computer interaction (HCI), and artificial intelligence (AI), with particular focus on 'catalyzing' human-AI collaboration, as part of which he builds interactive tools and fosters conducive environments that actively enhance the way humans and AI work together. He has collaborated with industry leaders such as Adobe Research, Microsoft Research, Ford Motor Company, and Alibaba.com, leading interdisciplinary projects that have resulted in multiple patents, publications, product integrations, and open-source releases.

Program

Location: Y303

Time Event
09:15 - 09:30 Welcome Address
Session chair: Sajjadur Rahman
09:30 - 10:30 Keynote 1
Session chair: Anna Fariha
Speaker: Eser Kandogan
Title: Agentic for Enterprise: Challenges and Opportunities in the 'Wild' [Slides]
Abstract: With advances in AI and LLMs, in particular 'agentic computing', offering novel experiences and unprecedented capabilities, companies are increasingly eager to embrace agentic AI to streamline their businesses, improve customer experiences, and explore novel business models. Yet the road to 'agentic' isn't so straightforward, especially for enterprise companies. In this talk, I will present some challenges and opportunities of 'agentic for enterprise', from the perspective of both a developer of such agentic applications and a developer of agentic frameworks. The main thesis of the talk will be: as computation and data move from the 'controlled' zone of the deterministic, structured, and a priori into the 'wild' zone of the probabilistic, unstructured, and in situ, we need to rethink the whole software stack: how we develop and deploy applications, how we acquire and process data, and how we present and interface to the user. I will make the case by discussing examples, identify design requirements for 'agentic for enterprise', and put forward a few research opportunities for the data management, artificial intelligence, systems, and human-computer interaction research communities to tackle.
10:30 - 11:00 Coffee Break
11:00 - 11:30 Invited Paper Session
Session chair: Sairam Gurajada
11:30 - 12:30 Keynote 2
Session chair: Nikita Bhutani
Speaker: Ziawasch Abedjan
Title: Navigating Disruption: The Impact of AI Technologies on Data Integration Research [Slides]
Abstract: The advent of novel AI technologies, such as foundation models and agentic systems, has introduced major disruptions in the field of data management, particularly in data integration. This talk will share our experiences in advancing data integration and preparation solutions, highlighting cases where these technologies have brought significant disruption. I will explore the trade-offs associated with leveraging such powerful tools and shed light on the new pitfalls that make data integration research increasingly challenging. By addressing both opportunities and pitfalls, this talk aims to provide insights into navigating this rapidly evolving landscape.
12:30 - 14:00 Lunch
14:00 - 14:45 Long Paper Session
Session chair: Arpit Narechania
14:45 - 15:30 Short Paper Session
Session chair: Kaustubh Beedkar
15:30 - 16:00 Coffee Break
16:00 - 17:00 Panel Discussion: Towards Scalable, Responsible, and Reliable Application of Agentic Systems
Moderator: Sajjadur Rahman
Panelists: Ziawasch Abedjan, Kaustubh Beedkar, Estevam Hruschka, and Arpit Narechania
17:00 - 17:15 Closing Remarks
Speaker: Sajjadur Rahman

Accepted Papers

Long Papers

  • SEED: Enhancing Text-to-SQL Performance and Practical Usability Through Automatic Evidence Generation. Janghyeon Yun (Seoul National University), Sang-goo Lee (Seoul National University)
  • Fine-tuning Large Language Models for Entity Matching. Aaron Steiner (University of Mannheim), Ralph Peeters (University of Mannheim), Christian Bizer (University of Mannheim)
  • Orchestrating Agents and Data for Enterprise: A Blueprint Architecture for Compound AI. Eser Kandogan (Megagon Labs), Nikita Bhutani (Megagon Labs), Dan Zhang (Megagon Labs), Rafael Li Chen (Megagon Labs), Sairam Gurajada (Megagon Labs), Estevam Hruschka (Megagon Labs)

Short Papers

  • Data Cleaning Using Large Language Models. Shuo Zhang (Columbia University), Zezhou Huang (Columbia University), Eugene Wu (DSI, Columbia University)
  • SLPerf: A Research Library and Benchmark Framework for Split Learning. Zhanyi Hu (East China Normal University), Tianchen Zhou (East China Normal University), Bingzhe Wu (Tencent AI Lab), Cen Chen (East China Normal University), Yanhao Wang (East China Normal University)
  • Agentic Workflows for Extraction of Access Control Matrices from Policy Documents. Pranav Subramaniam (University of Chicago), Sanjay Krishnan (University of Chicago)
  • Towards Regaining Control over Messy Machine Learning Pipelines [Vision]. Stefan Grafberger (BIFOLD & TU Berlin), Hao Chen (BIFOLD & TU Berlin), Olga Ovcharenko (BIFOLD & TU Berlin), Sebastian Schelter (BIFOLD & TU Berlin)

Invited Papers

  • FeVisQA: Free-form Question Answering over Data Visualizations. Yuanfeng Song (WeBank), Jinwei Lu (Shenzhen University), Yuanwei Song (Huawei Technologies), Caleb Chen Cao (HKUST), Raymond Chi-Wing Wong (Hong Kong University of Science and Technology), Haodi Zhang (Shenzhen University)
  • ZeroED: Hybrid Zero-Shot Error Detection with Large Language Model Reasoning. Wei Ni (Zhejiang University; City University of Hong Kong), Kaihang Zhang (Zhejiang University), Xiaoye Miao (Zhejiang University), Xiangyu Zhao (City University of Hong Kong), Yangyang Wu (Zhejiang University), Yaoshu Wang (Shenzhen Institute of Computing Sciences, Shenzhen University), Jianwei Yin (Zhejiang University)

Call for Papers

We encourage participation from academic and industry researchers as well as practitioners in the data management, AI, and systems communities, and aim to foster interdisciplinary collaborations. We welcome work that proposes innovations in designing compound AI systems and their components, as well as work that evaluates components of such systems or empirically studies how humans interact with these systems. Topics in the spirit of this workshop include, but are not limited to, the following:

  • knowledge discovery from heterogeneous/multi-modal data sources
    • cross-modal data discovery (e.g., documents, tables, knowledge graphs)
    • preparation and integration of heterogeneous data including parametric and non-parametric knowledge in agentic workflows
    • metadata management for enabling querying over heterogeneous data
  • interaction with heterogeneous/multi-modal data
    • natural language query over structured data
    • multi-modal retrieval augmentation
    • querying parametric and non-parametric knowledge
    • declarative languages for AI systems
  • compound AI system architectures for agentic workflows
    • frameworks for designing compound AI systems
    • data systems for agentic workflows, e.g., data lakes and lakehouses
    • multi-modal query planning and optimization with novel operators
  • usability challenges in compound AI systems
    • fact-checking and verification of LLM generations
    • human-agent interactions, e.g., query refinement
    • fairness and explainability of LLM-based AI systems
  • benchmarking and fine-grained evaluation of components in compound AI systems

Submission Tracks and Instructions

Submissions should present original results and substantial new work not currently under review or published elsewhere. The following submissions are accepted:

  • Regular research papers (up to 8 pages)
  • Short research, application, and vision papers (up to 4 pages)

The page limits for both tracks exclude references. No appendix is allowed. Papers must be submitted via Microsoft CMT. Only electronic submissions in PDF format will be considered. Manuscripts must be prepared following the same rules as ICDE conference papers, i.e., in accordance with the IEEE format.

A paper submitted to DAIS cannot be under review for any other conference or journal. All accepted papers will appear in the conference proceedings. All accepted papers are expected to be presented at the workshop, and at least one author is required to register.

Evaluation Criteria and Reviewing Process

DAIS will follow a double-anonymous review process to evaluate submissions on the basis of relevance, originality, presentation quality, and technical contribution. The review process will be coordinated by the PC Chairs, resulting in a final decision to either accept or reject each submission.


Important Dates

  • Paper submission: February 3, 2025 11:59pm Pacific Time (extended from January 20)
  • Notification of acceptance: March 7, 2025 (moved from February 20)
  • Camera-ready due: March 20, 2025
  • Workshop: May 19, 2025

Organization

Workshop Chairs

Steering Committee

Program Committee

  • Lei Cao (University of Arizona)
  • Amine Mhedhbi (Polytechnique Montréal)
  • Zhengjie Miao (Simon Fraser University)
  • Yuval Moskovitch (Ben-Gurion University of the Negev)
  • Simone Papicchio (Polito.it/Eurecom)
  • El Kindi Rezig (University of Utah)
  • Vraj Shah (IBM Research)
  • Tarique Siddiqui (Microsoft Research)
  • Jin Wang (Megagon Labs)
  • Doris Xin (Linea.ai)
  • Jia Zou (Arizona State University)

Sponsors