DEEM: Workshop on Data Management for End-to-End Machine Learning @ ACM SIGMOD'20

Schedule

Sunday, June 14th (all times are in PDT)

Technical Chairs: Julia Stoyanovich, Ke Yang

DEEM is run as an online event via Zoom this year. The link for joining the online meeting can be found at https://www.sigmod2020online.org. Accessing this website requires registration, which is free.

8:00 - 9:00

Session 1

8:00 - 8:05

Welcome

8:05 - 8:55

Integrative Data Equity Systems [Academic Keynote] [Video]
Bill Howe (University of Washington)

Abstract: As the deployment of automated decision tools in society continues to accelerate, their interactions with fundamental questions in law, in the social sciences, and in public policy have become impossible to ignore. Although this technology holds the promise of reducing costs, reducing errors, and improving objectivity, there is enormous potential for harm. As we train algorithms on biased data, we are amplifying, operationalizing, and, most insidiously, legitimizing the historical discrimination and opacity that the technology was in part intended to address. End-to-end data systems provide abstractions to hide complexity behind (hopefully) simpler contractual interfaces. But as these systems are deployed in complex and changing social contexts, their abstractions can reduce transparency and exacerbate equity issues: It gets easier to do the wrong thing. In this talk, I'll argue that data systems research needs to broaden scope to explicitly model, manage, and communicate assumptions and requirements about the contexts in which they are deployed, making equity issues a first-class design consideration. I'll provide some examples of the tensions, and describe some projects in this space, including learning fair representations of heterogeneous data to combat bias and providing interactive warning labels to communicate fitness for use downstream. I'll end with some thoughts about how academic research in this space can be conducted more responsibly.

9:00 - 10:30

Session 2 - Chair: Steven Whang

9:00 - 9:30

Apache SystemDS: An ML System for the End-to-End Data Science Lifecycle [Video]
Matthias Boehm (Graz University of Technology)

Abstract: Machine learning (ML) applications are becoming increasingly common in many domains. Existing ML systems to execute these workloads include numerical computing frameworks, ML algorithm libraries, and more specialized deep neural network frameworks. These systems focus primarily on efficient model training and scoring. However, the data science process is exploratory, and deals with underspecified objectives, a range of different algorithms, and a wide variety of heterogeneous data sources with various errors and inconsistencies. Therefore, additional tools are employed for data engineering, exploration, and debugging, which requires boundary crossing and unnecessary manual effort, and lacks optimization across the lifecycle. This talk introduces Apache SystemDS, an open source ML system for the end-to-end data science lifecycle from data integration, cleaning, and preparation, over local and distributed model training, to debugging and serving. In the first part, we describe the overall system architecture and major design decisions that are motivated by lessons learned from Apache SystemML. In the second part, we then share early ideas and preliminary results of selected features including (1) lineage-based reuse of intermediates, (2) federated ML over raw data, and (3) the automatic tuning of data cleaning pipelines.

9:30 - 9:45

A Vision on Accelerating Enterprise IT System 2.0 [Video]
Rekha Singhal; Dheeraj Chahal; Shruti Kunde; Mayank Mishra; Manoj Nambiar (TCS)

9:45 - 10:00

Causality-based Explanation of Classification Outcomes [Video]
Leopoldo Bertossi (Universidad Adolfo Ibanez and RelationalAI Inc.); Dan Suciu; Maximilian Schleich (University of Washington); Jordan Li (Carleton University); Zografoula Vagena (RelationalAI Inc.)

10:00 - 10:15

IntegratedML: Every SQL Developer is a Data Scientist [Video]
Benjamin De Boe; Thomas Dyar; Tom Woodfin (InterSystems Corporation)

10:15 - 10:30

MLOS: an Infrastructure for Automated Performance Engineering [Video]
Carlo Curino; Neha Gowdal; Brian Kroth; Sergiy Kuryata; Greg Lapinski; Siqi Liu; Slava Oks; Olga Poppe; Adam Smiechowski; Ed Thayer; Markus Weimer; Yiwen Zhu (Microsoft)

11:00 - 12:30

Session 3 - Chair: Sebastian Schelter

11:00 - 11:30

Can you do impactful MLSys work outside of large companies? [Video]
Manasi Vartak (Verta AI)

Abstract: A lot of the most impactful MLSys work in recent years (e.g., TensorFlow, PyTorch, Michelangelo) has come from industry and particularly large companies. This is rather unusual compared to the field of databases, where lots of high-impact work was done in academia too, and it can leave academics wondering how to make a dent in this extremely interesting but also extremely competitive research area. In this talk, I will present some perspectives (supported by several examples) on why so much of the MLSys innovation has come from large companies so far and present a simple, anecdotal framework to assess whether a given MLSys problem is amenable to research in a large company, a startup, or academic setting. I will conclude by presenting several research problems which, based on this framework, are particularly well suited for academia, and I hope that the broader research community adopts them!

11:30 - 11:45

From Data to Models and Back [Video]
Mike Dreves; Gene Huang; Zhuo Peng; Neoklis Polyzotis; Evan Rosen; Paul Suganthan G. C. (Google)

11:45 - 12:00

Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle [Video]
Andrew Chen; Andy Chow; Aaron Davidson; Arjun DCunha; Ali Ghodsi; Sue Ann Hong; Andy Konwinski; Clemens Mewald; Siddharth Murching; Tomas Nykodym; Paul Ogilvie; Mani Parkhe; Avesh Singh; Fen Xie; Matei Zaharia; Richard Zang; Juntai Zheng; Corey Zumar (Databricks)

12:00 - 12:15

Amazon SageMaker Autopilot: a White Box AutoML Solution at Scale [Video]
Piali Das; Nikita Ivkin; Tanya Bansal; Laurence Rouesnel; Philip Gautier; Zohar Karnin; Leo Dirac; Lakshmi Ramakrishnan; Andre Perunicic; Iaroslav Shcherbatyi; Wilton Wu; Aida Zolic; Huibin Shen; Amr Ahmed; Fela Winkelmolen; Miroslav Miladinovic; Cedric Archembeau; Alex Tang; Bhaskar Dutt; Patricia Grao; Kumar Venkateswar (Amazon)

12:15 - 12:30

Resilient Neural Forecasting Systems [Video]
Michael Bohlke-Schneider; Shubham Kapoor; Tim Januschowski (Amazon)

13:30 - 14:30

Session 4

13:30 - 14:25

XLA: Compiling Machine Learning for Peak Performance [Industry Keynote] [Video]
Amit Sabne (Google Brain)

Abstract: XLA (accelerated linear algebra) is a compiler-based linear algebra execution engine. It is the backend that powers machine learning frameworks such as TensorFlow and JAX at Google, on a variety of devices including CPUs, GPUs, and TPUs. This talk will cover how ML growth has fueled accelerator architectures and the way XLA and related technologies help obtain high performance from the accelerators. Unlike traditional compilation methods, where high-level programs are first translated into a common intermediate representation (IR) and architecture-specific optimizations are applied only during final code generation, the XLA compiler maps the IR to accelerator architecture constructs much earlier. This talk will demonstrate how such early architecture mapping is essential for peak performance, how it performs on real-world workloads, and the related techniques under the hood.

14:25 - 14:30

Adjourn

Important Dates

Submission Deadline: ~~1st of March~~ 8th of March, 5pm Pacific Time
Submission Website: https://cmt3.research.microsoft.com/DEEM2020
Notification of Acceptance: ~~8th of April~~ 18th of April
Final papers due: ~~27th of April~~ 21st of May
Workshop: Sunday, 14th of June

Call for Papers

Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.

For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen; for that, a set of popular approaches such as linear models, decision trees, and deep neural networks often has to be tried out on the problem at hand. The prediction quality of such ML models depends heavily on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that poses huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing for the lifecycle of models (that become stale over time) to be managed. Managing this lifecycle requires careful bookkeeping of metadata and lineage (which data was used to train this model? which models are affected by changes in this feature?) and involves methods for continuous analysis, validation, and monitoring of data and models in production. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills who possibly prefer an easy-to-use cloud-based solution on the one hand, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms on the other hand.

Additionally, the importance of incorporating ethics and legal compliance into machine-assisted decision-making is being broadly recognized. Critical opportunities for improving data quality and representativeness, controlling for bias, and allowing humans to oversee and impact computational processes are missed if we do not consider the lifecycle stages upstream from model training and deployment. DEEM welcomes research on providing system-level support to data scientists who wish to develop and deploy responsible machine learning methods.

DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.

Topics of Interest

Areas of particular interest for the workshop include (but are not limited to):

Data Management in Machine Learning Applications
Definition, Execution and Optimization of Complex Machine Learning Pipelines
Systems for Managing the Lifecycle of Machine Learning Models
Systems for Efficient Hyperparameter Search and Feature Selection
Machine Learning Services in the Cloud
Modeling, Storage and Provenance of Machine Learning Experimentation Data
Integration of Machine Learning and Dataflow Systems
Integration of Machine Learning and ETL Processing
Definition and Execution of Complex Ensemble Predictors
Sourcing, Labeling, Integrating, and Cleaning Data for Machine Learning
Benchmarking of Machine Learning Applications
Responsible Data Management
Transparency and Accountability of Machine-Assisted Decision Making
Impact of Data Quality and Data Preprocessing on the Fairness of ML Predictions

Submission

The workshop will have two tracks: regular research papers and industrial papers. Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD). DEEM is a single-blind workshop; authors must include their names and affiliations on the manuscript cover page.

Submission Website: https://cmt3.research.microsoft.com/DEEM2020

Accepted Papers

From Data to Models and Back
Mike Dreves; Gene Huang; Zhuo Peng; Neoklis Polyzotis; Evan Rosen; Paul Suganthan G. C. (Google)
Amazon SageMaker Autopilot: a White Box AutoML Solution at Scale
Piali Das; Nikita Ivkin; Tanya Bansal; Laurence Rouesnel; Philip Gautier; Zohar Karnin; Leo Dirac; Lakshmi Ramakrishnan; Andre Perunicic; Iaroslav Shcherbatyi; Wilton Wu; Aida Zolic; Huibin Shen; Amr Ahmed; Fela Winkelmolen; Miroslav Miladinovic; Cedric Archembeau; Alex Tang; Bhaskar Dutt; Patricia Grao; Kumar Venkateswar (Amazon)
MLOS: an Infrastructure for Automated Performance Engineering
Carlo Curino; Neha Gowdal; Brian Kroth; Sergiy Kuryata; Greg Lapinski; Siqi Liu; Slava Oks; Olga Poppe; Adam Smiechowski; Ed Thayer; Markus Weimer; Yiwen Zhu (Microsoft)
Resilient Neural Forecasting Systems
Michael Bohlke-Schneider; Shubham Kapoor; Tim Januschowski (Amazon)
Developments in MLflow: A System to Accelerate the Machine Learning Lifecycle
Andrew Chen; Andy Chow; Aaron Davidson; Arjun DCunha; Ali Ghodsi; Sue Ann Hong; Andy Konwinski; Clemens Mewald; Siddharth Murching; Tomas Nykodym; Paul Ogilvie; Mani Parkhe; Avesh Singh; Fen Xie; Matei Zaharia; Richard Zang; Juntai Zheng; Corey Zumar (Databricks)
Causality-based Explanation of Classification Outcomes
Leopoldo Bertossi (Universidad Adolfo Ibanez and RelationalAI Inc.); Dan Suciu; Maximilian Schleich (University of Washington); Jordan Li (Carleton University); Zografoula Vagena (RelationalAI Inc.)
IntegratedML: Every SQL Developer is a Data Scientist
Benjamin De Boe; Thomas Dyar; Tom Woodfin (InterSystems Corporation)
A Vision on Accelerating Enterprise IT System 2.0
Rekha Singhal; Dheeraj Chahal; Shruti Kunde; Mayank Mishra; Manoj Nambiar (TCS)

Invited Speakers

Academic Keynote: Bill Howe (University of Washington)

Bill Howe is an Associate Professor in the Information School and Adjunct Associate Professor in Computer Science & Engineering at University of Washington. He leads an interdisciplinary group in Responsible Data Science with emphasis on urban applications. His group's research aims to make the techniques and technologies of data science dramatically more accessible and reliable, particularly at scale. Their applied methods are rooted in database models and languages, though they sometimes work in machine learning, visualization, HCI, and high-performance computing. His group is an applied, systems-oriented group, frequently sourcing projects through collaborations in the physical, life, and social sciences.

Industry Keynote: Amit Sabne (Google Brain)

Amit Sabne is a software engineer at Google Brain. He works on XLA, a high-performance compiler for Tensor Processing Units (TPUs). Before that, he was a software engineer at Microsoft, working in the Visual C++ compiler team. He researched and designed novel optimization techniques to improve program performance while lowering binary size and compilation time. He earned a PhD in Computer Engineering from the School of Electrical and Computer Engineering, Purdue. His PhD research area was High Performance Computing, with a focus on heterogeneous computing systems. His dissertation proposed and developed efficient programming models for accelerators, provided compiler and runtime support for these models, and formulated fast autotuning mechanisms for accelerator programs.

Invited Talk: Manasi Vartak (Verta.AI)

Manasi Vartak is the founder and CEO of Verta.AI, which is based on her PhD work at MIT CSAIL on systems for software to streamline the process of data science and machine learning. Previously, she was a PhD student in the Database Group at MIT. She worked on systems for the analysis of large scale data, specifically on making machine learning and visual analysis faster, interactive, and more efficient. She worked and interned at Twitter, Google, Facebook and Microsoft, and is a recipient of the Facebook PhD Fellowship and Google Anita Borg Fellowship.

Invited Talk: Matthias Boehm (TU Graz)

Matthias Boehm is a BMVIT-endowed professor for data management at Graz University of Technology, Austria, and a research area manager for data management at the colocated Know-Center GmbH, Austria. Prior to joining TU Graz in 2018, he was a research staff member at IBM Research - Almaden, CA, USA, with a major focus on compilation and runtime techniques for declarative, large-scale machine learning in Apache SystemML. Matthias received his Ph.D. from Dresden University of Technology, Germany in 2011 with a dissertation on cost-based optimization of integration flows. His previous research also includes systems support for time series forecasting as well as in-memory indexing and query processing. Matthias is a recipient of the 2016 VLDB Best Paper Award, a 2016 SIGMOD Research Highlight Award, and a 2016 IBM Pat Goldberg Memorial Best Paper Award.

About