The DEEM workshop will be held in conjunction with SIGMOD/PODS 2017. DEEM brings together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios.
The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments. Submissions can be short papers (4 pages) or long papers (up to 10 pages) following the ACM proceedings format.
Vishal Chowdhary, Scott Greenwood (Microsoft Research)
EMT: End To End Model Training for MSR Machine Translation
Jose Picado, Arash Termehchy, Alan Fern, Sudhanshu Pathak (Oregon State University)
AutoMode: Relational Learning With Less Black Magic
Tom van der Weide, Dimitris Papadopoulos, Oleg Smirnov, Michal Zielinski, Tim van Kasteren (Schibsted Media Group)
Versioning for end-to-end machine learning pipelines
Lingjiao Chen, Paraschos Koutris (University of Wisconsin-Madison), Arun Kumar (University of California San Diego)
Model-based Pricing: Do Not Pay for More than What You Learn!
Rajesh Bordawekar (IBM Research), Oded Shmueli (Technion Haifa)
Using Word Embedding to Enable Semantic Queries in Relational Databases
Hui Miao, Ang Li, Larry Davis, Amol Deshpande (University of Maryland)
On Model Discovery For Hosted Data Science Projects
Matei Zaharia is an assistant professor at Stanford CS, where he works on computer systems and big data. He is also co-founder and Chief Technologist of Databricks, the big data company commercializing Apache Spark. Prior to joining Stanford, he was an assistant professor of computer science at MIT.
Arun Kumar is an Assistant Professor in the Department of Computer Science and Engineering at the University of California, San Diego. He obtained his PhD from the University of Wisconsin-Madison in 2016. His primary research interests are in data management, especially the intersection of data management and machine learning, with a focus on problems related to usability, developability, performance, and scalability. Systems and ideas based on his research have been released as part of the MADlib open-source library, shipped as part of products from EMC, Oracle, Cloudera, and IBM, and used internally by Facebook, LogicBlox, and Microsoft. A paper he co-authored was accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the 2016 Graduate Student Research Award for the best dissertation research in UW-Madison CS and the Anthony C. Klug NCR Fellowship in Database Systems in 2015.
Xin Luna Dong has been a principal scientist at Amazon since July 2016, leading the efforts to build the Amazon Product Graph. Prior to joining Amazon, she worked for Google and AT&T Labs - Research. She received her Ph.D. in Computer Science and Engineering at the University of Washington. Before coming to the United States, she obtained an M.S. in Computer Science at Peking University, and a B.S. in Computer Science at Nankai University in China. Her research interests include data integration, data cleaning, and knowledge management. She recently won the VLDB Early Career Research Contribution Award for "advancing the state of the art of knowledge fusion".
Ted Dunning is Chief Application Architect at MapR Technologies and a committer and PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects. Ted has been very active in mentoring new Apache projects and is currently serving as vice president of incubation for the Apache Software Foundation. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (later purchased by LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.
Markus Weimer is a Principal Scientist at Microsoft, where he leads the machine learning group of the Cloud Information Services Laboratory (CISL). He is also a member of the Apache Software Foundation, where he has been the PMC chair (VP) of Apache REEF. His work focuses on machine learning techniques, the systems that support them, and their applications. Recently, he has been heavily involved in building systems for these applications, both Apache REEF and internal ones.
Applying Machine Learning (ML) in real-world scenarios is a challenging task. In recent years, the main focus of the database community has been on creating systems and abstractions for the efficient training of ML models on large datasets. However, model training is only one of many steps in an end-to-end ML application, and a number of orthogonal data management problems arise from the large-scale use of ML, which require the attention of the data management community.
For example, data preprocessing and feature extraction workloads result in complex pipelines that often require the simultaneous execution of relational and linear algebraic operations. Next, the class of ML model to use needs to be chosen; this often means trying out a set of popular approaches, such as linear models, decision trees and deep neural networks, on the problem at hand. The prediction quality of such ML models heavily depends on the choice of features and hyperparameters, which are typically selected in a costly offline evaluation process that offers huge opportunities for parallelization and optimization. Afterwards, the resulting models must be deployed and integrated into existing business workflows in a way that enables fast and efficient predictions, while still allowing the lifecycle of models (which become stale over time) to be managed. As a further complication, the resulting systems need to take the target audience of ML applications into account; this audience is very heterogeneous, ranging from analysts without programming skills, who may prefer an easy-to-use cloud-based solution, to teams of data processing experts and statisticians developing and deploying custom-tailored algorithms.
DEEM aims to bring together researchers and practitioners at the intersection of applied machine learning, data management and systems research, with the goal of discussing the data management issues that arise in ML application scenarios. The workshop solicits regular research papers describing preliminary and ongoing research results. In addition, the workshop encourages the submission of industrial experience reports of end-to-end ML deployments.
The workshop will have two tracks: one for regular research papers and one for industrial papers. Submissions can be short papers (4 pages) or long papers (up to 10 pages). Authors are requested to prepare submissions following the ACM proceedings format.
Submission Website: https://cmt3.research.microsoft.com/DEEM2017/