The DEEM workshop will be held on Friday, 15th of June in Houston, TX in conjunction with SIGMOD/PODS 2018. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios.
The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments. Submissions can be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format.
Jens Dittrich is a full professor of Computer Science in the area of Databases, Data Management, and Big Data at Saarland University, Germany. Previous affiliations include U Marburg, SAP AG, and ETH Zurich. He received an Outrageous Ideas and Vision Paper Award at CIDR 2011, a BMBF VIP Grant in 2011, a best paper award at VLDB 2014, three CS teaching awards (one in 2018 for a data science seminar), as well as ~10 presentation awards. He has been a PC member and area chair/group leader of prestigious international database conferences and journals such as PVLDB/VLDB, SIGMOD, ICDE, and VLDB Journal. He is on the scientific advisory board of Software AG. He was a keynote speaker at VLDB 2017 with the talk “Deep Learning (m)eats Databases”. At Saarland University he co-organizes the Data Science Summer School. His research focuses on fast access to big data, in particular data analytics on large datasets, scalability, main-memory databases, database indexing, reproducibility, and scalable data science. Since 2016 he has been working on a start-up at the intersection of data science and data management (http://daimond.ai). He tweets at https://twitter.com/jensdittrich.
Martin Zinkevich is a Research Scientist at Google. He received his Ph.D. from Carnegie Mellon University and has conducted research at Brown University, the University of Alberta and the Machine Learning Group at Yahoo Research. His work has been published at numerous conferences such as NIPS, ICML, KDD, WWW, CIKM, AAAI and COLT, as well as in the Journal of the ACM and the Journal of Machine Learning Research. Additionally, Martin contributes to the discussion on data management and engineering aspects of ML with his online book Rules of Machine Learning: Best Practices for ML Engineering and a tutorial on Data Management Challenges in Production Machine Learning at SIGMOD 2017.
Matei Zaharia is an assistant professor at Stanford CS, where he works on computer systems and big data as part of Stanford DAWN. He is also co-founder and Chief Technologist of Databricks, the big data company commercializing Apache Spark. Prior to joining Stanford, he was an assistant professor of CS at MIT.
Joaquin Vanschoren is an assistant professor of machine learning at the Eindhoven University of Technology (TU/e). His research focuses on the progressive automation of machine learning. He founded and leads OpenML.org, an open science platform for machine learning research used all over the world. He has received several demonstration and application awards as well as the Dutch Data Prize, and has been an invited speaker at ECDA, StatComp, AutoML@ICML, CiML@NIPS, Reproducibility@ICML, and many other conferences. He has also co-organized machine learning conferences (e.g. ECMLPKDD 2013, LION 2016, Discovery Science 2017) and many workshops, including the AutoML Workshop series at ICML.
"Machine Learning", no wait, I mean "A.I.", no, that is the same as "Deep Learning“, isn’t it? What about "Data Science“ or "Big Data Analytics“, is that any better? Hmmmm, ok, let’s phrase it like this: <something> is going on out there. And <something> has a lot to do with playing BS bingo with buzzwords. What is the relationship of the data management community to <something>? Where are opportunities? Where can we help? Where can we learn? How do we increase our impact in the <something>-world? In my talk, I will show: 1.) opportunities for doing research at the intersection of <something> and data management, 2.) experiences from teaching <something>, and 3.) experiences from solving problems in the <something>-domain together with domain experts.
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, I’ll present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools. In addition, this kind of platform introduces new data management challenges that I will summarize in the talk.
There have been many amazing results in machine learning: machines that play checkers, chess, poker, and Go better than humans, self-driving cars, self-flying helicopters, programs that can transcribe speech, et cetera. In comparison, supervised learning, where one has a labeled data set and wants to build a model for ranking, regression, or classification, is not considered as challenging. In fact, most of the above problems distinguish themselves by how they are not as simple as supervised learning, and require novel solutions.
In this talk, I will discuss how supervised learning is usually nestled inside a problem that is inherently more difficult to solve. Specifically, as teams continue to work on supervised learning problems, the issues they face have less and less to do with optimizing a known, simple objective, and have more to do with finding the right objective, which is inherently not a supervised learning problem. In order to solve this, we need to do three things: gather data and infrastructure with the intent of solving this problem in a disciplined fashion, formalize the questions that we need to ask, and develop algorithms to answer these questions. In this talk, I will show early steps toward these three tasks.
Building machine learning systems remains something of a (black) art, requiring a lot of prior experience to compose appropriate ML workflows and their hyperparameters. To democratize machine learning, and make it easily accessible to those who need it, we need a more principled approach to experimentation to understand how to build machine learning systems and progressively automate this process as much as possible. First, we created OpenML, an open science platform allowing scientists to share datasets and train many machine learning models from many software tools in a frictionless yet principled way. It also organizes all results online, providing detailed insight into the performance of machine learning techniques, and allowing a more scientific, data-driven approach to building new machine learning systems. Second, we use this knowledge to create automatic machine learning (AutoML) techniques that learn from these experiments to help people build better models, faster, or automate the process entirely.
ML/AI systems and ML/AI-powered applications are transforming the landscape of computing, with almost all major tech companies pivoting towards an "AI-first" future and many enterprise companies creating applied ML/AI labs. A motley of computing research communities ranging from core ML/AI to data management, systems, computer architecture, human-computer interaction, programming languages, software engineering, and more are increasingly tackling novel technical problems posed by the new preponderance of ML/AI. As the home of data management and data systems research, is the SIGMOD/VLDB community really stepping up to the plate of driving a data-centric agenda in this increasingly important direction, or is it losing relevance and ceding leadership on data-centric research to nearby communities such as NSDI/OSDI/SOSP, HPCA/ISCA, etc.?
This panel brings together experts from multiple pertinent research communities to discuss and debate various aspects of the above question and chart the paths forward. There will be three main topics for discussion: research content and problem selection, logistics and optics of publication venues, and training the next generation of students.
The following panelists have confirmed their participation so far:
Assistant Professor at UC Berkeley
Assistant Professor of Machine Learning at the Eindhoven University of Technology
Assistant Professor of Computer Science, Stanford
Professor of Databases, Data Management and Big Data at the University of Saarland
PhD student, MIT
Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.
For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen, which often means trying out a set of popular approaches such as linear models, decision trees and deep neural networks on the problem at hand. The prediction quality of such ML models heavily depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing the lifecycle of models (which become stale over time) to be managed. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills who may prefer an easy-to-use cloud-based solution on the one hand, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms on the other hand.
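To illustrate why offline hyperparameter evaluation lends itself to parallelization, here is a minimal sketch in Python. It is not taken from any system discussed at the workshop: the toy dataset, the function names `fit_and_score` and `grid_search`, and the grid values are all assumptions chosen for illustration. The point it shows is that each configuration is trained and scored independently, so the evaluations can be distributed across workers without coordination.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical toy data following y = 2x + 1, split into train and validation.
DATA = [(float(x), 2.0 * x + 1.0) for x in range(10)]
TRAIN, VALID = DATA[:7], DATA[7:]


def fit_and_score(config):
    """Fit a one-feature linear model with gradient descent.

    Returns (validation mean squared error, config) so that results
    from different configurations can be compared directly.
    """
    learning_rate, epochs = config
    w, b = 0.0, 0.0
    n = len(TRAIN)
    for _ in range(epochs):
        # Gradients of the mean squared error on the training split.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in TRAIN) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in TRAIN) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    mse = sum((w * x + b - y) ** 2 for x, y in VALID) / len(VALID)
    return mse, config


def grid_search(learning_rates, epoch_counts, workers=4):
    """Evaluate every (learning rate, epochs) pair and return the best one.

    The configurations do not depend on each other, so they are simply
    mapped over a worker pool.
    """
    grid = list(product(learning_rates, epoch_counts))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fit_and_score, grid))
    return min(results)  # lowest validation error wins


if __name__ == "__main__":
    best_mse, best_config = grid_search([0.001, 0.01, 0.05], [10, 100, 1000])
    print("best config:", best_config, "validation MSE:", best_mse)
```

A thread pool keeps the sketch short and portable; for CPU-bound training in practice, a process pool or a cluster scheduler would be the more likely choice, and smarter search strategies could prune the grid instead of evaluating it exhaustively.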
DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
The workshop will have two tracks: one for regular research papers and one for industrial papers. Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format. DEEM is a single-blind workshop; authors must include their names and affiliations on the manuscript cover page.
Submission Website: https://cmt3.research.microsoft.com/DEEM2018