The DEEM workshop will be held on Sunday, June 30, in Amsterdam, NL, in conjunction with SIGMOD/PODS 2019. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios.
The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments. Submissions can be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD).
Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.
For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear-algebraic operations. Next, the class of ML model to use must be chosen, which often means trying out a set of popular approaches, such as linear models, decision trees, and deep neural networks, on the problem at hand. The prediction quality of such models depends heavily on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing the lifecycle of models (which become stale over time) to be managed. Managing this lifecycle requires careful bookkeeping of metadata and lineage (“which data was used to train this model?”, “which models are affected by changes in this feature?”) and involves methods for continuous analysis, validation, and monitoring of data and models in production. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills who may prefer an easy-to-use cloud-based solution, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms.
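The offline hyperparameter evaluation mentioned above can be sketched as an exhaustive search over a configuration grid, where each configuration is scored independently and could therefore be evaluated in parallel. This is a minimal illustrative sketch: the `evaluate` function, its parameters, and the grid values are hypothetical stand-ins for a real train-and-validate step.

```python
from itertools import product

def evaluate(learning_rate, depth):
    # Hypothetical scoring function: in practice this would train a model
    # and score it on held-out validation data. Here, a toy objective with
    # a known optimum at (0.1, 4) stands in for illustration only.
    return -((learning_rate - 0.1) ** 2 + (depth - 4) ** 2)

def grid_search(grid):
    """Exhaustive offline evaluation over a hyperparameter grid.

    Each configuration is scored independently of the others, so the
    loop is embarrassingly parallel (e.g. via multiprocessing or a
    cluster scheduler).
    """
    best_config, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = evaluate(**params)
        if score > best_score:
            best_config, best_score = params, score
    return best_config, best_score

grid = {"learning_rate": [0.01, 0.1, 1.0], "depth": [2, 4, 8]}
best, score = grid_search(grid)
```

Because the configurations share no state, distributing them is a natural first optimization target, which is one reason this stage is attractive to the data management community.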
DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management, and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
The workshop will have two tracks for regular research papers and industrial papers. Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format. Please use the latest ACM paper format (2017) and change the font size to 10 pts (analogous to SIGMOD). DEEM is single-blind; authors must include their names and affiliations on the manuscript cover page.
Submission Website: https://cmt3.research.microsoft.com/DEEM2019
Ce Zhang is an Assistant Professor in Computer Science at ETH Zurich. He believes that by making data—along with the processing of data—easily accessible to non-CS users, we have the potential to make the world a better place. His current research focuses on building data systems to support machine learning and help facilitate other sciences. Before joining ETH, Ce was advised by Christopher Ré. He finished his PhD round-tripping between the University of Wisconsin-Madison and Stanford University, and spent another year as a postdoctoral researcher at Stanford. His PhD work produced DeepDive, a trained data system for automatic knowledge-base construction. He participated in the research efforts that won the SIGMOD Best Paper Award (2014) and SIGMOD Research Highlight Award (2015), and was featured in special issues including Science (2017), Communications of the ACM (2017), “Best of VLDB” (2015), and Nature (2015).
Leonidas Galanis is an Engineering Manager at Facebook, where he supports the distributed training platform team in Facebook's Artificial Intelligence Infrastructure organization. Prior to that, he managed the RocksDB and MySQL software engineering teams, and built the engineering team that delivered the MyRocks server, which is used to store the Facebook social graph and Facebook Messenger data (among other data). Before his time at Facebook, he was a Director at Oracle, where he was responsible for the Diagnostic & Tuning Pack and the Real Application Testing option of the Oracle relational database. He received his Ph.D. in databases from the University of Wisconsin-Madison in 2004.

Distributed Training of Deep Learning Models for Recommendation Systems
Machine learning results have become increasingly impressive in recent years, with many successful real-world applications. Deep learning, in particular, has enabled major breakthroughs; for example, image classification can now achieve better results than humans. At Facebook, the need for faster training of deep learning models on larger amounts of data is growing. At the same time, it is essential to make use of existing hardware resources as they become available. We address these challenges with distributed training. This presentation provides an overview of distributed training at Facebook. There are challenges in reading training data and in interacting with model artifacts and training checkpoints during the often very long process of offline training. Moving and reconfiguring model data through the various training stages (offline, model warm-up, and online training), while scaling the model size beyond what fits on a single host, presents several problems. Devising distributed optimization algorithms that allow us to use increasingly many trainers to reduce training time, while opportunistically leveraging existing resources, is a very active research area.
Julia Stoyanovich is an Assistant Professor at New York University in the Department of Computer Science and Engineering at the Tandon School of Engineering, and the Center for Data Science. She is a recipient of an NSF CAREER award and of an NSF/CRA CI Fellowship. Julia's research focuses on responsible data management and analysis practices: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data acquisition and processing lifecycle. She established the Data, Responsibly consortium, and serves on the New York City Automated Decision Systems Task Force. In addition to data ethics, Julia works on management and analysis of preference data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst.
Zachary Lipton is an Assistant Professor at Carnegie Mellon University in the Tepper School of Business, and affiliate faculty in the Machine Learning Department and the Heinz School of Public Policy. His research interests span core ML methods as well as critical applications and the social impacts of ML. Right now, he is focused on the robustness of ML systems under distribution shift, applications of ML to healthcare (e.g., learning to diagnose), data-efficient deep learning, and questions regarding the fairness and interpretability of ML systems.