MiDAS Seminar

Spring 2024 Calendar
Seminars are held on Thursdays in CDS 950 at 1:30pm [unless otherwise notified].
Date Speaker Title and Abstract
Thursday, May 16 Yanpeng Wei
(TBA)
Bio

TBA

Reading Group: TBA
Abstract

TBA

Thursday, May 09 TBA
(TBA)
Bio

TBA

TBA
Abstract

TBA

Thursday, Apr 25 Aneesh Raman
(Boston University)
Bio

TBA

Reading Group: Learned Index with Precise Positions
Abstract

TBA

Thursday, Apr 18 Tarikul Islam Papon
(Boston University)
Bio

TBA

Reading Group: WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs
Abstract

TBA

Thursday, Apr 11 Aaron Ang
(Boston University)
Bio

TBA

Reading Group: Automatic SQL Error Mitigation in Oracle
Abstract

TBA

Thursday, Apr 11 Ran Wei
(Boston University)
Bio

TBA

Reading Group: Improving LSM-Tree Based Key-Value Stores With Fine-Grained Compaction Mechanism
Abstract

TBA

Thursday, Apr 04 Zichen Zhu
(Boston University)
Bio

Zichen Zhu is currently a 5th-year PhD candidate in the MiDAS group at Boston University, advised by Prof. Manos Athanassoulis. Prior to joining Boston University, he was an MPhil student in the Database Group of the Department of Computer Science at the University of Hong Kong, advised by Prof. Reynold Cheng. His research focuses on workload-aware query optimization under memory constraints.

Reading Group: InfiniFilter: Expanding Filters to Infinity and Beyond
Abstract

Filter data structures have been used ubiquitously since the 1970s to answer approximate set-membership queries in various areas of computer science including architecture, networks, operating systems, and databases. Such filters need to be allocated with a given capacity in advance to provide a guarantee over the false positive rate. In many applications, however, the data size is not known in advance, requiring filters to dynamically expand. This paper shows that existing methods for expanding filters exhibit at least one of the following flaws: (1) they entail an expensive scan over the whole data set, (2) they require a lavish memory footprint, (3) their query, delete and/or insertion performance plummets, (4) their false positive rate skyrockets, and/or (5) they cannot expand indefinitely. We introduce InfiniFilter, a new method for expanding filters that addresses these shortcomings. InfiniFilter is a hash table that stores a fingerprint for each entry. It doubles in size when it reaches capacity, and it sacrifices one bit from each fingerprint to map it to the expanded hash table. The core novelty is a new and flexible hash slot format that sets longer fingerprints to newer entries. This keeps the average fingerprint length long and thus the false positive rate stable. At the same time, InfiniFilter provides stable insertion/query/delete performance as it is comprised of a unified hash table. We implement InfiniFilter on top of Quotient Filter, and we demonstrate theoretically and empirically that it offers superior cost properties compared to existing methods: it better scales performance, the false positive rate, and the memory footprint, all at the same time.
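
To make the expansion trick concrete, here is a minimal Python sketch of the bit-sacrificing idea, written by us purely for illustration (the actual InfiniFilter is built on a Quotient Filter and differs in many details):

    import hashlib

    def h(key, bits=32):
        # deterministic stand-in hash; a real filter uses a faster hash
        return int(hashlib.md5(str(key).encode()).hexdigest(), 16) & ((1 << bits) - 1)

    class ToyExpandableFilter:
        def __init__(self, slots=8, fp_bits=8):
            self.slots = slots                        # must be a power of two
            self.fp_bits = fp_bits                    # fingerprint length for NEW entries
            self.table = [[] for _ in range(slots)]   # per-slot list of (fp, fp_len)

        def _addr_bits(self):
            return self.slots.bit_length() - 1

        def insert(self, key):
            x = h(key)
            addr = x & (self.slots - 1)               # low hash bits address a slot
            fp = (x >> self._addr_bits()) & ((1 << self.fp_bits) - 1)
            self.table[addr].append((fp, self.fp_bits))

        def contains(self, key):
            x = h(key)
            a = self._addr_bits()
            for fp, n in self.table[x & (self.slots - 1)]:
                if fp == (x >> a) & ((1 << n) - 1):   # compare at each entry's own length
                    return True
            return False

        def expand(self):
            # double the table; each entry donates its lowest fingerprint bit,
            # which becomes the new top address bit, so older entries end up
            # with shorter fingerprints while new entries keep full-length ones
            a = self._addr_bits()
            new = [[] for _ in range(2 * self.slots)]
            for addr, bucket in enumerate(self.table):
                for fp, n in bucket:
                    new[((fp & 1) << a) | addr].append((fp >> 1, n - 1))
            self.table, self.slots = new, 2 * self.slots

    f = ToyExpandableFilter()
    for k in range(6):
        f.insert(k)
    f.expand()
    assert all(f.contains(k) for k in range(6))       # expansion keeps no false negatives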

Thursday, Mar 21 Andy Huynh
(Boston University)
Bio

TBA

Reading Group: Learning to Optimize LSM-trees: Towards A Reinforcement Learning based Key-Value Store for Dynamic Workloads
Abstract

TBA

Monday, Mar 18 Aamod Khatiwada
(Northeastern University)
Bio

Aamod Khatiwada is a PhD Candidate (Computer Science, Databases) at the Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA. He is originally from Jhorahat, a town in the southeastern part of Nepal. His primary research focus is to develop novel techniques and algorithms that find and integrate open data tables in a principled way. In his free time, he enjoys watching cricket and soccer and writing poems, stories, and news articles.

Table Discovery and Integration in Data Lakes (5:00pm - 6:15pm @ HAR 306)
Abstract

Data lakes serve as repositories that house millions of tables covering a wide range of subjects. Data scientists use tables from data lakes to support decision-making processes, train machine learning models, perform statistical analysis, and more. However, the enormous size and heterogeneity of data lakes, together with missing or unreliable metadata, make it challenging for data scientists to find suitable tables for their analyses. Furthermore, after table discovery, another challenge is to integrate the discovered tables together to get a unified view of the data, enabling queries that go beyond a single table. In this talk, I will present novel techniques for table discovery and integration tasks. First, I will overview SANTOS [1], which helps data scientists discover tables from a data lake that union with their query table and possibly extend it with more rows. Contrary to existing techniques that only use the semantics of columns to perform union search, we improve the search accuracy by also using semantic relationships between pairs of columns in a table. Second, I will cover ALITE (Align and Integrate) [2], the first proposal for scalable integration of tables that may have been discovered using join, union, or related table search. ALITE aligns the columns of discovered tables using holistic schema matching and then applies Full Disjunction, an associative version of outer join, to get an integrated table. I will also describe DIALITE [3], a novel table discovery pipeline that allows users to discover, integrate, and analyze open data tables. DIALITE is flexible such that the user can easily add and compare additional discovery and integration algorithms.
[1] Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald. SANTOS: Relationship-based Semantic Table Union Search. SIGMOD 2023.
[2] Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller. Integrating Data Lake Tables. VLDB 2023.
[3] Aamod Khatiwada, Roee Shraga, Renée J. Miller. DIALITE: Discover, Align and Integrate Open Data Tables. SIGMOD-Companion 2023.
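
For two tables, full disjunction coincides with the familiar full outer join; the associativity issue only appears with three or more tables. A toy Python sketch of that two-table case (ours, not from the papers above):

    def full_outer_join(left, right, key):
        """Full outer join of two lists of dicts on a shared column `key`.
        For two tables this equals their full disjunction; full disjunction
        generalizes it associatively to three or more tables."""
        out, matched_right = [], set()
        for l in left:
            hit = False
            for i, r in enumerate(right):
                if l[key] == r[key]:
                    out.append({**l, **r})      # matching tuples are merged
                    matched_right.add(i)
                    hit = True
            if not hit:
                out.append({**l})               # unmatched left tuple; missing attrs act as NULLs
        for i, r in enumerate(right):
            if i not in matched_right:
                out.append({**r})               # unmatched right tuple
        return out

    # e.g., integrating two discovered tables that share a "city" column:
    t1 = [{"city": "Boston", "pop": 650_000}]
    t2 = [{"city": "Boston", "team": "Red Sox"}, {"city": "Seattle", "team": "Mariners"}]
    print(full_outer_join(t1, t2, "city"))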

Thursday, Mar 14 Wolfgang Gatterbauer
(Northeastern University)
Bio

Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University. A major focus of his research is to extend the capabilities of modern data management systems in generic ways and to allow them to support novel functionalities that seem hard at first. He received an NSF Career award and, with his students and collaborators, a best paper award at EDBT 2021, best-of-conference selections for PODS 2021, SIGMOD 2017, WALCOM 2017, and VLDB 2015, and two out of three reproducibility awards for papers published at SIGMOD 2020.

https://gatterbauer.name

HITSnDIFFs: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property
Abstract

We analyze a general problem in a crowd-sourced setting where one user asks a question (also called an item) and other users return answers (also called labels) for this question. Different from existing crowdsourcing work, which focuses on finding the most appropriate label for the question (the "truth"), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem "ability discovery" to emphasize the connection to and duality with the more well-studied problem of "truth discovery".

To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call "HITSNDIFFS" (or HND), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), HND also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that HND produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm.
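
For intuition, a binary matrix has the Consecutive Ones Property when its columns can be reordered so that the ones in every row become contiguous. A small example of our own (not from the paper):

    M = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}
    \xrightarrow{\text{reorder columns as } (2,4,1,3)}
    \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}

Finding such a column permutation (here, of the users) is exactly what HND recovers in the ideal setting.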

Based on joint work with Zixuan Chen, Subhodeep Mitra, and R. Ravi.

https://arxiv.org/abs/2401.00013
https://github.com/northeastern-datalab/HITSnDIFFs/
https://www.youtube.com/watch?v=2pN1Nx7-sd0

Thursday, Mar 07 Konstantinos Karatsenidis
(Boston University)
Bio

TBA

Reading Group: LRU-C: Parallelizing Database I/Os for Flash SSDs
Abstract

TBA

Thursday, Feb 29 Ju Hyoung Mun
(Boston University)
Bio

TBA

Reading Group: Programmable DDRx Controllers
Abstract

TBA

Thursday, Feb 22 Teona Bagashvili
(Boston University)
Bio

TBA

Reading Group: ZNS SSDs
Abstract

TBA

Wednesday, Feb 21 Dan Suciu
(University of Washington)
Bio

Dan Suciu is a Microsoft Endowed Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Suciu conducts research in data management, on topics such as query optimization, probabilistic data, data pricing, parallel data processing, and data security. He is a co-author of two books: Data on the Web: From Relations to Semistructured Data and XML (1999) and Probabilistic Databases (2011). He received the ACM SIGMOD Codd Innovation Award, received several best paper and test of time awards, and is a Fellow of the ACM. Suciu is currently an associate editor for the Journal of the ACM. Suciu’s PhD students Gerome Miklau, Christopher Re, and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016, respectively, and Nilesh Dalvi was a runner-up in 2008.

Cardinality Estimation: Achilles’ Heel of Query Optimizers (11:00am @ CDS 1750; also part of the CS Distinguished Colloquium)
Abstract

Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The estimator has access to some limited statistics on the database, such as table cardinalities, number of distinct values in various columns, and sometimes histograms, and needs to estimate the output size of complex queries, often involving many joins and filter predicates. The cardinality estimator is a critical piece of any query optimizer, and is often the main culprit when the optimizer chooses a poor plan. In this talk, I will briefly survey existing cardinality estimation methods, discussing their pros and cons, and then introduce a new approach to the problem, which is based on computing an upper bound instead of an estimate. The upper bound is given by a mathematical formula, and can use the available database statistics in surprising ways.
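
The abstract does not give the formula itself; a classical example of such a bound (our illustration, not necessarily the talk's method) is the AGM bound. For the triangle query Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x), it guarantees

    |Q| \le \sqrt{|R| \cdot |S| \cdot |T|}

so with |R| = |S| = |T| = 10^6 the output has at most 10^9 tuples regardless of how the values are distributed, whereas a naive pairwise estimate has no such worst-case guarantee.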

Thursday, Feb 15 Jinqi Lu
(Boston University)
Bio

Jinqi is currently a CS Master's student in the BU DiSC Lab, advised by Prof. Athanassoulis. He possesses a wide range of interests, including database systems, cloud computing, and networks. Currently, he is engaged in the relational memory project and also contributes to the testing of computational and zoned storage devices.

Reading Group: Selection Pushdown in Column Stores using Bit Manipulation Instructions
Abstract

Modern analytical database systems predominantly rely on column-oriented storage, which offers superior compression efficiency due to the nature of the columnar layout. This compression, however, creates challenges in decoding speed during query processing. Previous research has explored predicate pushdown on encoded values to avoid decoding, but these techniques are restricted to specific encoding schemes and predicates, limiting their practical use. In this paper, we propose a generic predicate pushdown approach that supports arbitrary predicates by leveraging selection pushdown to reduce decoding costs. At the core of our approach is a fast select operator capable of directly extracting selected encoded values without decoding, by using Bit Manipulation Instructions, an instruction set extension to the X86 architecture. We empirically evaluate the proposed techniques in the context of Apache Parquet using both micro-benchmarks and the TPC-H benchmark, and show that our techniques improve the query performance of Parquet by up to one order of magnitude with representative scan queries. Further experimentation using Apache Spark demonstrates speed improvements of up to 5.5X even for end-to-end queries involving complex joins.
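The fast select operator relies on BMI2's parallel bit extract, PEXT (`_pext_u64` in C). As a rough illustration of the semantics only (not the paper's implementation; the helper names are ours), here is a pure-Python emulation: build a mask covering the selected bit-packed codes, and PEXT compacts exactly those code bits toward the low end of the word, with no per-value decoding.

    def pext(value: int, mask: int) -> int:
        """Software emulation of x86 BMI2 PEXT (_pext_u64): gather the bits
        of `value` at positions where `mask` is 1, packed toward bit 0."""
        out, j = 0, 0
        while mask:
            lsb = mask & -mask              # lowest set bit of the mask
            if value & lsb:
                out |= 1 << j
            j += 1
            mask ^= lsb
        return out

    def select_codes(word: int, selected: list, width: int = 8) -> int:
        """Extract the encoded values at the `selected` positions from a
        word of `width`-bit packed codes, without decoding any of them."""
        mask = 0
        for pos in selected:
            mask |= ((1 << width) - 1) << (pos * width)
        return pext(word, mask)

    # pack codes [7, 42, 99, 5] into one word, then select positions 1 and 3
    codes = [7, 42, 99, 5]
    word = 0
    for i, c in enumerate(codes):
        word |= c << (i * 8)
    assert select_codes(word, [1, 3]) == (42 | (5 << 8))

In hardware, a single PEXT instruction does in one cycle what this loop does bit by bit, which is where the speedup comes from.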

Thursday, Feb 01 Kapil Vaidya
(MIT)
Bio

Kapil Vaidya is an Applied Scientist at Amazon Redshift in the Learned Systems Group. He recently graduated from MIT, where he completed his PhD under the guidance of Professor Tim Kraska in the Data Systems Group. His research primarily revolves around the application of machine learning to data systems, with a special focus on the data structures and algorithms integral to these systems. Before that, Kapil completed his Bachelor's at IIT Bombay.

Sparse Numerical Array-Based Range Filters (SNARF) (12:00pm @ CDS 1101; also part of the BU Systems Seminar)
Click for abstract

We present Sparse Numerical Array-Based Range Filters (SNARF), a learned range filter that efficiently supports range queries for numerical data. SNARF creates a model of the data distribution to map the keys into a bit array which is stored in a compressed form. The model, together with the compressed bit array, constitutes SNARF and is used to answer membership queries. We evaluate SNARF on multiple synthetic and real-world datasets as a stand-alone filter and by integrating it into RocksDB. For range queries, SNARF provides up to a 50x better false positive rate than state-of-the-art range filters, such as SuRF and Rosetta, with the same space usage. We also evaluate SNARF in RocksDB as a filter replacement for filtering requests before they access on-disk data structures. For RocksDB, SNARF can improve the execution time of the system by up to 10x compared to SuRF and Rosetta for certain read-only workloads.
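
In greatly simplified form (our sketch, not the paper's code; a linear model stands in for SNARF's learned CDF, and the bit array is left uncompressed), the mechanism looks like this:

    class ToyRangeFilter:
        """Map each key through a CDF-like model into a bit array; a range
        query probes only the slots its endpoints can map to."""
        def __init__(self, keys, bits_per_key=4):
            self.lo, self.hi = min(keys), max(keys)
            self.n_bits = bits_per_key * len(keys)
            self.bits = [False] * self.n_bits
            for k in keys:
                self.bits[self._slot(k)] = True

        def _slot(self, key):
            # linear stand-in for the learned CDF model
            key = min(max(key, self.lo), self.hi)     # clamp into modeled domain
            span = (self.hi - self.lo) or 1
            return min(self.n_bits - 1, int((key - self.lo) / span * self.n_bits))

        def may_contain_range(self, qlo, qhi):
            # no false negatives: a stored key in [qlo, qhi] maps between the
            # endpoints' slots because the model is monotone
            return any(self.bits[self._slot(qlo):self._slot(qhi) + 1])

    f = ToyRangeFilter([10, 20, 30, 40])
    assert f.may_contain_range(15, 25)                # 20 lies in the range
    assert not f.may_contain_range(22, 28)            # empty range filtered out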

Spring 2024: Jinqi Lu, e-mail: jinqilu at bu dot edu
Fall 2023 Calendar
Seminars are held in CDS 950 at 11:30am [unless otherwise notified].
Date Speaker Title and Abstract
Monday, Dec 11 Internal Project Updates
Tuesday, Dec 05 Johes Bater
(Tufts University)
Bio

Johes Bater is an Assistant Professor of Computer Science at Tufts University and co-director of the Tufts Security and Privacy Lab. Before that, he was a postdoctoral researcher in the Database Group at Duke University, received his Ph.D. in computer science from Northwestern University, and completed his B.S. and M.S. in electrical engineering at Stanford University. His research centers on how to balance privacy, security, and utility to build fast, accurate database systems that support privacy-preserving analytics with provable security guarantees.

Building Systems That Protect Users and Their Data (CAS 313, 12:30pm)
Abstract

Simply put, data security and privacy affects every single person in the world. Data breaches, where organizations compromise sensitive user data, are frighteningly common. Current approaches, however, fail to properly safeguard security and privacy. Existing encrypted databases introduce data pipelines that leak enough side-channel information that data breaches are an inevitability, while cryptographically secure solutions are so slow and cumbersome that they no longer serve their intended purpose. Moreover, many business models - such as targeted advertising - are predicated on their continued ability to analyze and make decisions based on private user data. Without a principled approach to protecting user privacy, it is unclear if this setup is sustainable. To address this problem, we need systems that treat security and privacy as first-class citizens in their system design. In this talk, I will discuss how to build secure and private database management systems that balance the trade-offs among privacy, performance, and query result accuracy. This work spans numerous disciplines of computer science, including: 1) databases: modeling query workloads to estimate and optimize their privacy properties, runtime, and result accuracy across multiple settings; 2) security: extending secure computation protocols to evaluate SQL statements over the private data of many data owners without revealing data in plaintext; and 3) privacy: rigorously modeling the end-to-end information leakage profile of a privacy-preserving DBMS. The goal of this work is to bring together these disparate research areas to create usable query processing engines with strong privacy and security guarantees.

Thursday, Nov 30 Viktor Leis
(Technical University of Munich)
Bio

Viktor Leis is a Professor of Computer Science at the Technical University of Munich (TUM). His research revolves around designing high-performance data management systems and includes core database systems topics such as query processing, query optimization, transaction processing, index structures, and storage. Another major research area is designing cloud-native, cost-efficient information systems. Viktor Leis received his doctoral degree in 2016 from TUM and was a professor at the Friedrich-Schiller-Universität Jena and Friedrich-Alexander-Universität Erlangen-Nürnberg before returning to TUM in 2022. He received the ACM SIGMOD dissertation award, the IEEE TCDE Rising Star Award, best paper awards at IEEE ICDE and ACM SIGMOD, and an ERC Starting Grant.

LeanStore: A High-Performance Storage Engine for NVMe SSDs and Multi-Core CPUs (CAS 313, 12:30pm)
Abstract

The goal of the LeanStore project is to achieve robust performance comparable to in-memory systems when the data set fits into RAM, while being able to fully exploit the bandwidth of fast NVMe SSDs for larger data sets. To achieve these goals we had to redesign all components of the traditional DBMS stack, including buffer management, indexing, concurrency control, I/O management, logging, recovery, and page replacement. In this talk, I will give an overview of these components and the evolution of the system over the past 6 years.

Monday, Nov 06 Xiao Hu
(University of Waterloo)
Bio

Xiao Hu is an incoming Assistant Professor in the Cheriton School of Computer Science at the University of Waterloo. Prior to that, she was a Visiting Faculty Scholar in the Discrete Algorithm Group at Google Research and a postdoctoral associate in the Department of Computer Science at Duke University. She received her Ph.D. in Computer Science and Engineering from HKUST, and a BE degree in Computer Software from Tsinghua University. Her research has focused on studying fundamental problems in database theory and their implications for practical systems.

Incremental View Maintenance for Conjunctive Queries
Abstract

Incremental view maintenance (IVM) is a critical problem in the field of database management systems. It addresses the challenge of efficiently updating materialized views in response to changes in the underlying data, reducing computational cost and ensuring up-to-date query results. In this context, conjunctive queries, a subset of relational algebra, provide a rich framework for expressing complex queries of both theoretical and practical interest. In this talk, I will introduce a general IVM framework for conjunctive queries, such that views can be maintained efficiently while supporting enumeration of both full query results and delta query results with delay guarantees. I will describe the intrinsic relationship between the sequence of updates and the maintenance cost. Finally, I will briefly discuss some interesting questions that arise when aggregation is involved.
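
As standard IVM background (not specific to this talk), the delta rule for a two-way join view V = R ⋈ S under simultaneous inserts ΔR and ΔS is

    \Delta V = (\Delta R \bowtie S) \,\cup\, (R \bowtie \Delta S) \,\cup\, (\Delta R \bowtie \Delta S)

so the view is refreshed by joining only the (typically small) deltas against the base relations, instead of recomputing V from scratch.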

Wednesday, Oct 18 C. Mohan
(IBM Fellow)
Bio

Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University (named in 2023), a Distinguished Visiting Professor at Tsinghua University in China (since 2016), and a member of the inaugural Board of Governors of Digital University Kerala (inducted in 2021). He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2002-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. In recent years, he has focused on Blockchain, AI, Big Data and Cloud technologies (https://bit.ly/sigBcP, https://bit.ly/CMoTalks). Since 2017, he has been an evangelist of permissioned blockchains and a myth buster of permissionless blockchains. During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. In 2019, he became an Honorary Advisor to the Tamil Nadu e-Governance Agency. In 2020, he joined the Advisory Board of the Kerala Blockchain Academy. Mohan has served on the advisory board of IEEE Spectrum and on numerous conference and journal boards. During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He was also a Consultant to the Microsoft Data Team in 2020.

A Technical Retrospect: Q&A with Mohan (CDS 364, 2:30pm)
Abstract

In this talk, using my own life as an example, I discuss the excitement of doing research and developing advanced technologies. Starting with my undergraduate days as a chemical engineering student at IIT Madras, I narrate how my academic life changed when I got introduced to an IBM mainframe, which led to my ultimately doing a PhD in computer science at the University of Texas at Austin. Along the way, I talk about the frustrations of the prolonged hunt for a PhD thesis topic. Then, I trace my professional career at IBM, which included working on hot topics as well as what were initially thought by others to be passé topics which turned out to be not so! I deal with the difficulties of getting papers published and doing technology transfer to new products as well as old ones. Having never been a manager during my entire 38.5-year IBM career, even though as an IBM Fellow I had been an IBM executive for 23 years, I bring a unique perspective on what it takes to be on a long-term purely technical career path. I discuss the DOs and DONTs of choosing technical topics to work on, and on being successful in having impact internally and outside one’s employer. Based in general on my extensive travels across the world and in particular on my experience of working as the IBM India Chief Scientist, while based in Bangalore for almost 3 years, I also address issues that arise in being part of a multinational corporation and trying to be successful in doing deeply technical work in a developing country. I also talk about how I am keeping myself active technically and professionally since my retirement from IBM in June 2020.

Wednesday, Oct 18 C. Mohan
(IBM Fellow)
Bio

See the bio in the entry above.

Systems Design Dilemmas: Simplicity Versus Complexity (CDS 17th floor, 11:00am)
Abstract

Based on my four-decade-long involvement in and exposure to numerous database systems, in this talk I will discuss how various factors led multiple systems to evolve in ways that made them more complex in terms of their design as well as their usability. While simplicity of design and architecture might be a highly desirable objective, there will invariably be conflicting requirements that lead to complexity creeping in, at least with respect to the internals of such systems. Examples of such requirements relate to the cloud environment, performance, availability, heterogeneity in terms of supported hardware, operating systems and programming language environments, accommodation of multiple workload types, extensibility, etc. It is also the case that aiming for simplicity in one aspect of system design might cause complexity to be introduced in other areas. A case in point is the desire for the near elimination of the need for DBAs and tuning knobs in the cloud environment, which makes the internals of such systems much more complex so that they can dynamically detect and react to the different workload types that such systems might be used for. While KISS (Keep It Simple, Stupid) might be a desirable design principle, following it mindlessly might result in the kiss of death with respect to performance and reliability!

Fall 2023: Zichen Zhu, e-mail: zczhu at bu dot edu
Spring 2023 Calendar
Seminars are held in CDS 950 at 11:30am [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Jun 02 Tianru Zhang
(Uppsala University)
Bio

Tianru Zhang is a PhD student in the Department of Information Technology at Uppsala University. He received his MSc degree from ENSAI (the national school of statistics and information analysis of France) and his bachelor's degree from the University of Science and Technology of China. Before joining Uppsala University, he also worked as an assistant researcher at the Fujitsu R&D Center, Beijing. Zhang's research interests include cloud computing, data management, and federated machine learning.

Efficient Hierarchical Storage Management Empowered by Reinforcement Learning (Online only @ 11:30 AM)
Abstract

With the rapid development of big data and cloud computing, data management has become increasingly challenging. Over the years, a number of frameworks for data management have become available. Most of them are highly efficient, but ultimately create data silos. It becomes difficult to move and work coherently with data as new requirements emerge. A possible solution is to use an intelligent hierarchical (multi-tier) storage system (HSS). An HSS is a meta-solution that consists of different storage frameworks organized as a jointly constructed storage pool. A built-in data migration policy that determines the optimal placement of the datasets in the hierarchy is essential. Making placement decisions is a non-trivial task, since decisions should be made according to the characteristics of the dataset, the tier status in the hierarchy, and access patterns. This paper presents an open-source hierarchical storage framework with a dynamic migration policy based on reinforcement learning (RL). We present a mathematical model, a software architecture, and implementations based on both simulations and a live cloud-based environment. We compare the proposed RL-based strategy to a baseline of three rule-based policies, showing that the RL-based policy achieves significantly higher efficiency and optimal data distribution in different scenarios.
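
To give a flavor of RL-driven migration, here is a tiny tabular Q-learning sketch of our own; the states, actions, and reward shown are hypothetical placeholders, and the paper's actual formulation differs:

    import random
    from collections import defaultdict

    TIERS = ["ssd", "hdd", "archive"]          # actions: move a dataset to a tier
    ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1
    Q = defaultdict(float)                     # Q[(state, action)]

    def choose(state):
        if random.random() < EPS:              # explore occasionally
            return random.choice(TIERS)
        return max(TIERS, key=lambda a: Q[(state, a)])   # else exploit

    def update(state, action, reward, next_state):
        best_next = max(Q[(next_state, a)] for a in TIERS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

    # one simulated step: a hot dataset sitting on "hdd" is served slowly
    s, s2 = ("hot", "hdd"), ("hot", "ssd")
    a = choose(s)
    update(s, a, reward=-1.0 if a != "ssd" else -0.1, next_state=s2)

Here a state might encode an access-frequency bucket and the current tier, and the reward might combine access latency with migration cost; those choices are exactly what the framework's mathematical model pins down.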

Friday, May 05 Georgios Fakas
(Uppsala University)
Bio

Georgios Fakas is an Associate Professor at the Department of Information Technology, Uppsala University, where he leads the Uppsala DataBase Laboratory (UDBL). Before joining Uppsala University, he worked as a Research Fellow at the Hong Kong University of Science and Technology (with Prof. Dimitris Papadias), Hong Kong University (with Prof. Nikos Mamoulis), EPFL (Lausanne, Switzerland), UMIST (Manchester, UK), and the University of Cyprus. He also worked as a Senior Lecturer at Manchester Metropolitan University (UK). He obtained his Ph.D. in 1998, his M.Phil. in 1996, and his B.Sc. in Computation in 1995, all from the Department of Computation, UMIST, Manchester, UK.

Proportionality on Spatial Data with Context
Abstract

More often than not, spatial objects are associated with some context, in the form of text, descriptive tags (e.g., points of interest, flickr photos), or linked entities in semantic graphs (e.g., Yago2, DBpedia). Hence, location-based retrieval should be extended to consider not only the locations but also the context of the objects, especially when the retrieved objects are too many and the query result is overwhelming. In this article, we study the problem of selecting a subset of the query result, which is the most representative. We argue that objects with similar context and nearby locations should proportionally be represented in the selection. Proportionality dictates the pairwise comparison of all retrieved objects and hence bears a high cost. We propose novel algorithms which greatly reduce the cost of proportional object selection in practice. In addition, we propose pre-processing, pruning, and approximate computation techniques whose combination reduces the computational cost of the algorithms even further. We theoretically analyze the approximation quality of our approaches. Extensive empirical studies on real datasets show that our algorithms are effective and efficient. A user evaluation verifies that proportional selection is more preferable than random selection and selection based on object diversification.

Wednesday, Apr 26 Sudeepa Roy
(Duke University)
Bio

Sudeepa Roy is an Associate Professor in Computer Science at Duke University. She works broadly in data management, with a focus on the foundational aspects of big data analysis, which includes causality and explanations for big data, data repair, query optimization, probabilistic databases, and database theory. Before joining Duke in 2015, she did a postdoc at the University of Washington, and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of the VLDB Early Career Research Contributions Award, an NSF CAREER Award, and a Google Ph.D. fellowship in structured data. She is a co-director of the Almost Matching Exactly (AME) lab for interpretable causal inference at Duke (https://almost-matching-exactly.github.io/).

Connecting Causal Inference with Databases towards Actionable Data Analysis (CDS 701, 11:00am)
Abstract

Causal analysis – i.e., estimating the effect of a "treatment" on an "outcome" – lays the foundation of sound and robust policy making by providing a means to estimate the impact of a certain intervention to the world, and is practically indispensable in health, medicine, social sciences, and other domains. It forms the stepping stone for actionable data analysis that correlation, association, or model-based prediction analysis cannot provide -- as cited widely in statistical studies, correlation does not imply causation. In this talk, I will discuss some of our research in making a connection between traditional database research and causal inference research from AI and Statistics, and in exploring how they can strengthen each other. First, I will present a "matching" framework for causal inference called "Almost Matching Exactly", which rivals black box machine learning methods in estimation accuracy but has the benefit of being interpretable and easier to troubleshoot, and also can use efficient SQL query evaluations from databases for scalability. Then I will discuss how standard causal inference techniques on homogenous populations (single table) can be extended to "relational" (multi-table) data frequently found in practice. Finally, I will talk about how such "relational causal inference" can be applied to hypothetical reasoning in database research in terms of "what-if" and "how-to" queries, and conclude with an overview of our ongoing research directions on actionable and interpretable data analysis with causality and explanations.

Tuesday, Apr 25 Sanket Purandare
(Harvard University)
Bio

Sanket Purandare is a 4th-year PhD student at the Data and Systems Lab (DASLab) at Harvard University, advised by Prof. Stratos Idreos. His research interests lie in the area of systems for deep learning. In particular, he works on characterizing deep learning training and inference workloads and on optimizing the deep learning software infrastructure for hardware efficiency by applying systems concepts from operating systems, compilers, and computer architecture in a novel fashion. Currently, he is a Visiting Researcher with Meta (formerly Facebook) and works with the PyTorch Compilers and Distributed teams on building the compiler stack for distributed training.

μ-TWO: Multi-Model Training with Orchestration and Memory Optimization (CDS 701, 11:00am)
Abstract

Deep neural networks are becoming ever more useful for numerous applications. However, it is increasingly more challenging to utilize high-quality networks due to massive computational costs. In fact, computational costs continuously increase due to the need for ever more complex networks and due to the need to explore numerous network designs during the development of an AI solution. This leads to several problems including limiting development and access to high-quality AI solutions, time-to-market limitations for organizations, and negative environmental impact caused by increased use of resources. We observe that modern GPUs, the substrate where neural networks are developed, are severely underutilized. In fact, the utilization factor is only in the order of 50% and drops as GPUs get ever faster. Fusing operators across multiple models and larger mini-batch sizes are two key methods for increasing hardware utilization and training throughput. However, the large memory requirements of training algorithms coupled with the limited memory capacity of modern GPUs pose a strict restriction on the applicability of these methods. In this paper, we introduce μ-two, a new scheduling algorithm that maximizes GPU utilization by using operator fusion and memory optimization. The core novelty behind μ-two is in being able to schedule enough independent computational operations from multiple models such as to utilize all compute capacity of the available GPU while at the same time, it utilizes all compute time to overlap independent memory swapping operations. We show how to create μ-two schedules for arbitrary neural network and GPU architectures. We integrated μ-two within the PyTorch compiler and demonstrate a 3× speed-up with diverse network architectures and diverse NVIDIA GPUs, spanning vision, natural language processing and recommendation applications.

Friday, Apr 21 Immanuel Trummer
(Cornell University)
Bio

Immanuel Trummer is an assistant professor at Cornell University. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. His online lecture introducing students to database topics collected over a million views. He received the NSF CAREER Award and multiple Google Faculty Research Awards.

Towards Highly Customizable Database Systems via Generative AI (MCS B37, 12:30pm)
Abstract

TBA

Tuesday, Apr 18 Ryan Marcus
(University of Pennsylvania)
Bio

Ryan Marcus is working on using machine learning to build the next generation of data management tools that automatically adapt to new hardware and user workloads, invent novel processing strategies, and understand user intention. He is especially interested in query optimization, index structures, intelligent clouds, programming language runtimes, program synthesis for data processing, and applications of reinforcement learning to systems problems in general.

LSI: A Learned Secondary Index Structure (MCS B37, 12:30pm)
Abstract

Learned index structures have been shown to achieve favorable lookup performance and space consumption compared to their traditional counterparts such as B-trees. However, most learned index studies have focused on the primary indexing setting, where the base data is sorted. In this work, we investigate whether learned indexes sustain their advantage in the secondary indexing setting. We introduce the Learned Secondary Index (LSI), a first attempt to use learned indexes for indexing unsorted data. LSI works by building a learned index over a permutation vector, which allows binary search to be performed on the unsorted base data using random access. We additionally augment LSI with a fingerprint vector to accelerate equality lookups. We show that LSI achieves comparable lookup performance to state-of-the-art secondary indexes while being up to 6× more space efficient.
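
A minimal Python sketch of our own of the described layout (plain binary search stands in for the learned model, which in LSI narrows the search range further):

    class ToySecondaryIndex:
        """Base rows stay unsorted; a permutation vector lists row ids in key
        order, and a fingerprint vector cheaply pre-screens equality probes."""
        def __init__(self, rows, col, fp_bits=8):
            self.rows, self.col = rows, col
            # permutation vector: row ids ordered by the indexed column
            self.perm = sorted(range(len(rows)), key=lambda i: rows[i][col])
            self.fp_mask = (1 << fp_bits) - 1
            # small fingerprints of the keys, stored in permuted order
            self.fp = [hash(rows[i][col]) & self.fp_mask for i in self.perm]

        def _key_at(self, pos):                 # random access into base data
            return self.rows[self.perm[pos]][self.col]

        def lookup(self, key):
            lo, hi = 0, len(self.perm)          # binary search over the permutation
            while lo < hi:
                mid = (lo + hi) // 2
                if self._key_at(mid) < key:
                    lo = mid + 1
                else:
                    hi = mid
            out, want = [], hash(key) & self.fp_mask
            while lo < len(self.perm):
                # fingerprint check first, base-data access only to confirm
                if self.fp[lo] == want and self._key_at(lo) == key:
                    out.append(self.perm[lo])
                    lo += 1
                else:
                    break
            return out

    rows = [{"id": 3, "city": "Boston"}, {"id": 1, "city": "Seattle"}, {"id": 2, "city": "Boston"}]
    idx = ToySecondaryIndex(rows, "city")
    assert sorted(idx.lookup("Boston")) == [0, 2]     # row positions of matches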

Friday, Apr 14 Karan Vombatkere
(Boston University)
Bio

Karan Vombatkere is a 2nd year PhD student in the Massive Data & Algorithms (MiDAS) research group at BU, advised by Dr. Evimaria Terzi. His research interests are primarily focused on algorithmic data mining and computational social science problems. His current research is on designing efficient approximation algorithms for submodular optimization problems, specifically the Team Formation problem.

Balancing Task Coverage and Expert Workload in Team Formation
Abstract

In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.
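
The abstract does not spell out the combined objective; one natural way to fold the two goals into a single function (our illustration, not necessarily the paper's exact formulation) is

    \max_{A} \; \sum_{j} \mathrm{cov}(A, T_j) \; - \; \lambda \cdot \max_{i} \mathrm{load}_i(A)

where A assigns experts to tasks, cov(A, T_j) measures how well task T_j's required skills are covered, load_i counts expert i's assignments, and λ trades coverage against the heaviest workload.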

Friday, Mar 17 Tarikul Islam Papon
(Boston University)
Bio

Papon is a Ph.D. candidate in the DiSC Lab at BU, advised by Prof. Athanassoulis. His research interests broadly include data systems, specifically hardware-aware data management challenges stemming from the evolution of storage and memory devices. Currently, he is working on a new design paradigm for interacting with durable storage that takes into account the performance asymmetry between read and write operations, as well as the variable access concurrency that different devices may support. Prior to joining BU, Papon served as a lecturer for 4 years in the Department of Computer Science & Engineering (CSE) at Bangladesh University of Engineering & Technology (BUET). He completed his MSc and BSc degrees in the same department, where he worked on medical informatics using machine learning and embedded systems techniques.

ACEing the Bufferpool Management Paradigm for Modern Storage Devices
Abstract

Over the past few decades, solid-state drives (SSDs) have been replacing hard disk drives (HDDs) due to their faster reads and writes and their superior random access performance. Further, when compared to HDDs, SSDs have two fundamentally different properties: (i) read/write asymmetry (writes are slower than reads) and (ii) access concurrency (multiple I/Os can be executed in parallel to saturate the device bandwidth). However, several database operators are designed without considering storage asymmetry and concurrency, resulting in device underutilization, which is typically addressed opportunistically by device-specific tuning during deployment. As a key example and the focus of our work, the bufferpool management of a Database Management System (DBMS) is tightly connected to the underlying storage device; yet, state-of-the-art approaches treat reads and writes equally and do not expressly exploit the device concurrency, leading to subpar performance. In this paper, we propose a new Asymmetry & Concurrency-aware bufferpool management (ACE) that batches writes based on device concurrency and performs them in parallel to amortize the asymmetric write cost. In addition, ACE performs parallel prefetching to exploit the device's read concurrency. ACE does not modify the existing bufferpool replacement policy; rather, it is a wrapper that can be integrated with any replacement policy. We implement ACE in PostgreSQL and evaluate its benefits using a synthetic benchmark and TPC-C for several popular eviction policies (Clock Sweep, LRU, CFLRU, LRU-WSR). The ACE counterparts of all four policies lead to significant performance improvements, exhibiting up to 32.1% lower runtime for mixed workloads (33.8% for write-intensive TPC-C transactions) with a negligible increase in total disk writes and buffer misses, which shows that incorporating asymmetry and concurrency in algorithm design leads to more faithful storage modeling and, ultimately, to better device utilization.
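
A rough sketch of the wrapper idea, written by us under simplifying assumptions (the actual ACE policy details differ): dirty evictions are deferred until a batch of size k, the device's write concurrency, can be flushed in parallel.

    from concurrent.futures import ThreadPoolExecutor

    class ToyACEWrapper:
        """Defers dirty-page evictions and writes them in batches of k so the
        asymmetric write cost is amortized across parallel I/Os."""
        def __init__(self, write_page, k):
            self.write_page = write_page        # callable that persists one page
            self.k = k                          # device write concurrency
            self.pending = []                   # dirty pages awaiting eviction
            self.pool = ThreadPoolExecutor(max_workers=k)

        def evict(self, page, dirty):
            if not dirty:
                return                          # clean pages are dropped for free
            self.pending.append(page)
            if len(self.pending) >= self.k:
                self.flush()

        def flush(self):
            # issue the batched writes in parallel to saturate the device
            batch, self.pending = self.pending, []
            list(self.pool.map(self.write_page, batch))

    writes = []
    bp = ToyACEWrapper(write_page=writes.append, k=4)
    for p in range(8):
        bp.evict(page=p, dirty=True)
    assert len(writes) == 8                     # flushed as two parallel batches of 4

Note how the wrapper never decides *which* page to evict; that stays with the underlying replacement policy, matching the abstract's claim that ACE composes with any of them.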

Friday, Mar 03 Aneesh Raman
(Boston University)
Bio

Aneesh Raman is a PhD student at Boston University and a member of the DiSC Lab, working with Prof. Manos Athanassoulis. His research focuses on database systems, specifically indexing and access methods, and his interests include high-performance computing, data mining, and machine learning. Prior to Boston University, he was at Purdue University Fort Wayne, where he graduated with highest distinction with a Bachelor's in Computer Science.

SWARE: Indexing for Near-Sorted Data
Abstract

Indexing in data systems can be perceived as the process of adding structure to the incoming, otherwise unsorted data. If the data ingestion order matches the indexed attribute order, the index ingestion cost is entirely redundant and can be avoided altogether. In this work, we identify sortedness as a resource that can accelerate index ingestion. We propose a new sortedness-aware (SWARE) design paradigm that pays less indexing cost for near-sorted data. When applied to a classical index like the B+-tree, SWARE offers a 9x speedup for write-intensive workloads and a 4x benefit in mixed workloads by exploiting data sortedness.
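
One way to picture why sortedness helps, in a deliberately simplified sketch of our own (the actual SWARE design buffers near-sorted data and bulk-loads it opportunistically):

    class ToySortednessAwareIngest:
        """In-order arrivals take a cheap append fast path; out-of-order keys
        fall back to a full index insert. A real implementation would bulk-load
        the accumulated sorted run directly into the rightmost leaf."""
        def __init__(self, index_insert):
            self.index_insert = index_insert    # full-cost insert into the index
            self.tail = None                    # largest key ingested so far
            self.run = []                       # in-order keys awaiting bulk load

        def ingest(self, key):
            if self.tail is None or key >= self.tail:
                self.run.append(key)            # fast path: no tree traversal
                self.tail = key
            else:
                self.flush()
                self.index_insert(key)          # slow path for the out-of-order key

        def flush(self):
            for k in self.run:                  # stand-in for a bulk load
                self.index_insert(k)
            self.run = []

    inserted = []
    ing = ToySortednessAwareIngest(inserted.append)
    for k in [1, 2, 3, 5, 4, 6]:
        ing.ingest(k)
    ing.flush()
    assert inserted == [1, 2, 3, 5, 4, 6]       # only 4 took the slow path

The closer the input is to sorted, the larger the runs and the fewer slow-path inserts, which is the intuition behind paying "less indexing cost" for near-sorted data.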

Friday, Feb 17 Juhyoung Mun
(Boston University)
Bio

Juhyoung Mun is currently a postdoctoral associate in the Department of Computer Science at Boston University, working with Prof. Manos Athanassoulis. She received her Ph.D. in Electronic and Electrical Engineering from Ewha Womans University in Seoul, Korea. Her research interest lies in the area of data systems and networking, particularly enhancing lookup performance in LSM-trees and building data systems for complex hybrid transactional/analytical workloads using hardware/software co-design.

Relational Memory: Native In-Memory Accesses on Rows and Columns
Abstract

Analytical database systems are typically designed to use a column-first data layout to access only the desired fields. On the other hand, storing data row-first works great for accessing, inserting, or updating entire rows. Transforming rows to columns at runtime is expensive; hence, many analytical systems ingest data in row-first form and transform it in the background to columns to facilitate future analytical queries. How will this design change if we can always efficiently access only the desired set of columns? To address this question, we present a radically new approach to data transformation from rows to columns. We build upon recent advancements in embedded platforms with re-programmable logic to design native in-memory access on rows and columns. Our approach, termed Relational Memory, relies on an FPGA-based accelerator that sits between the CPU and main memory and transparently transforms base data to any group of columns with minimal overhead at runtime. This design allows accessing any group of columns as if it already exists in memory. We implement and deploy Relational Memory in real hardware, and we show that we can access the desired columns up to 1.63x faster than accessing them from their row-wise counterpart, while matching the performance of a pure columnar access for low projectivity, and outperforming it by up to 1.87x as projectivity (and tuple reconstruction cost) increases. Moreover, our approach can be easily extended to support offloading a number of operations to hardware, e.g., selection, group by, aggregation, and joins, with the potential to vastly simplify the software logic and accelerate query execution.
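
As a software analogy of our own for what the accelerator exposes logically (Relational Memory performs this transformation in hardware, between CPU and memory, with no extra copy visible to software):

    def column_group(rows, wanted):
        """Project row-major records into a contiguous column-group layout."""
        return [tuple(row[c] for c in wanted) for row in rows]

    rows = [(1, "a", 9.5), (2, "b", 7.2), (3, "c", 3.3)]
    # the query asks for columns 0 and 2 and sees them "as if" already columnar
    print(column_group(rows, (0, 2)))           # [(1, 9.5), (2, 7.2), (3, 3.3)]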

Friday, Feb 10 Subhadeep Sarkar
(Boston University)
Bio

Subhadeep Sarkar is a post-doctoral associate at Boston University. His current research interests are data systems, storage layouts, access methods, and their intersection with data privacy. Prior to this, Subhadeep spent two years as a post-doctoral researcher at INRIA Rennes, France. He completed his Ph.D. in 2017 at IIT Kharagpur. Subhadeep’s work appears at leading database and network conferences and journals, including SIGMOD, VLDB, EDBT, ICDE, IEEE Transactions on Cloud Computing, and IEEE Transactions on Computers. In 2016, Subhadeep was selected as one of the most qualified research scientists to attend the Heidelberg Laureate Forum.

Building Deletion-Compliant Data Systems (CDS 1001)
Abstract

Modern data systems are designed around the silent underlying assumption of perpetual data retention. Systems are designed to offer high performance for data ingestion and queries, while deletion of data is realized by simply logically invalidating the target entries (instead of modifying them in place). This has severe implications on the privacy front, especially, in light of the newly enforced privacy regulations which are aimed toward enabling the users to express their preferences about the deletion of their data. In this talk, we present our vision of enabling privacy-through-deletion as a design element in modern data systems empowering them with the ability to ensure timely and persistent deletion of data, and we envision doing so without hurting performance or increasing operational cost.

Spring 2023: Zichen Zhu, e-mail: zczhu at bu dot edu
Fall 2022 Calendar
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Oct 07 Mohammed Saeed
(EURECOM)
Bio

Mohammed Saeed obtained a double engineering diploma from the Lebanese University and Télécom ParisTech. Since 2019, he has been a PhD student in computational fact checking at EURECOM (France), under the supervision of Prof. Paolo Papotti. His research spans methods across Natural Language Processing, Knowledge Graphs, and Deep Learning.

Explainable Checking for Factual Claims
Abstract

Misinformation is an important problem, but fact checkers are overwhelmed by the amount of false content that is produced online every day. To support fact checkers in their efforts, we are creating data-driven verification methods that use data resources to assess claims and explain their decisions. However, using structured data for facts expressed in natural language is a complex task. In this talk, we present a first solution that jointly learns representations of structured and non-structured data. We translate text claims into SQL queries on relational databases by exploiting text classifiers and tentative execution of query candidates to narrow down the set of alternatives. We then discuss how logical rules can be used to verify claims in a checking task modeled as a probabilistic inference problem. We show that this form of reasoning can be “learned” by transformer-based language models. Experiments show that both methods enable the efficient and effective labeling of claims with interpretable explanations.

Fall 2022: Andy Huynh, e-mail: ndhuynh at bu dot edu
Spring 2022 Calendar
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
Date Speaker Title and Abstract
Monday, May 16 Oana Balmau
(McGill University)
Bio

Oana Balmau is an Assistant Professor in the School of Computer Science at McGill University, where she leads the Data-Intensive Storage and Computer Systems Lab (DISCS Lab). Her research focuses on storage systems and data management systems, with an emphasis on large-scale data management for workloads in Machine Learning, Data Science, and Edge Computing. She completed her PhD in Computer Science at the University of Sydney, advised by Prof. Willy Zwaenepoel. Before her PhD, Oana earned her Bachelor's and Master's degrees in Computer Science from EPFL, Switzerland. Oana won the CORE John Makepeace Bennett Award 2021 for the best computer science dissertation in Australia and New Zealand, an Honorable Mention for the ACM SIGOPS Dennis M. Ritchie Doctoral Dissertation Award 2021, and a Best Paper Award at the USENIX Annual Technical Conference (USENIX ATC) 2019. Before joining McGill, Oana worked with Nutanix, ABB Research, and HP Vertica. She is also part of MLCommons, an open engineering consortium that aims to accelerate machine learning innovation, where she leads the effort on storage benchmarking.

Reducing Space Amplification in Key-Value Stores (Monday Talk @ 11:30 AM EDT)
Abstract

As dataset sizes continue to grow, efficient storage use with low space amplification is increasingly important in production systems. In particular, in key-value stores (KVs), space amplification is caused by storing multiple versions of the same item (e.g., versions waiting to be garbage collected) or by large metadata (e.g., large indexes). While space amplification has been a well-known problem for in-memory data stores, it is becoming problematic when datasets are stored on modern devices as well, as fast NVMe SSDs can persist millions of items per second. In this talk, I will present two techniques to decrease space amplification at the disk and main memory levels. Our target workload is a mix of update-heavy and analytics queries running under snapshot isolation. First, I introduce KVell+, a new KV store which supports data analytics running alongside update-heavy workloads with close to no space amplification. KVell+ is built on top of KVell, but can be easily adapted to other state-of-the-art KVs such as RocksDB or LevelDB. At a high level, the key idea is to propagate an item X to the analytics query to get processed out of order, if X happens to be updated as the analytics job is running. Even though KVell+ has almost no space amplification on disk, in memory its index can grow significantly with the dataset size, especially for small keys. To address this issue, I introduce TONE, a new write-optimized learned index. In mixed workloads, TONE uses up to 2.5x less memory compared to state-of-the-art index data structures (e.g., skiplists, B+trees), with order-of-magnitude improvements in tail latency over state-of-the-art learned indexes (e.g., ALEX, XIndex).

Friday, May 13 Raja Appuswamy
(EURECOM)
Bio

Raja Appuswamy is an Assistant Professor in the Data Science department at EURECOM, a Grande École located in the sunny Sophia Antipolis tech valley of southern France. Previously, he was a Researcher and Visiting Professor at EPFL, Switzerland, a Visiting Researcher in the Systems and Networking group at Microsoft Research, Cambridge, and a Software Development Engineer in the Windows 7 kernel team at Microsoft, Redmond. He received his Ph.D. in Computer Science from the Vrije Universiteit, Amsterdam, where he worked under the guidance of Prof. Andrew S. Tanenbaum on designing and implementing a new storage stack for the MINIX 3 microkernel operating system. He also holds dual Master's degrees in Computer Science and Agricultural Engineering from the University of Florida, Gainesville.

Microform and Macromolecules: Archiving digital data on analog or biological storage media
Click for abstract

Today, most, if not all, enterprises in our society are driven by data. For decades, we wanted fast storage devices that could quickly deliver data, and storage technologies evolved to meet this requirement. As data-driven decision making becomes an integral part of enterprises, we are increasingly faced with a new need: one for cheap, long-term storage devices that can safely store the data we generate for tens or hundreds of years to meet legal and regulatory compliance requirements. In this talk, we will first explore recent trends in the storage hardware landscape, which show that all current storage media face fundamental limitations that threaten our ability to store, much less process, the data we generate over long time frames. We will then focus on two unconventional biological and analog media that have received considerable attention recently: synthetic Deoxyribonucleic acid (DNA) and film. After highlighting the pros and cons of using each as a digital storage medium, I will present our recent work in the EU-funded Future and Emerging Technologies (FET) project OligoArchive, which focuses on overcoming the challenges of using such media to build a deep archival tier for data management systems.

Friday, May 06 Raul Castro Fernandez
(University of Chicago)
Bio

I am interested in understanding the economics and value of data, including the potential of data markets to unlock that value. The goal of my research is to understand how to make the best use of data possible. For that, I build systems to share, discover, prepare, integrate, and process data, often using techniques from data management, statistics, and machine learning. I am an assistant professor in the Computer Science department at the University of Chicago. Before UChicago, I did a postdoc at MIT with Sam Madden and Mike Stonebraker. And before that, I completed my PhD at Imperial College London with Peter Pietzuch.

Data Station: Delegated, Trustworthy, and Auditable Computation to Enable Data-Sharing Consortia with a Data Escrow
Click for abstract

Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations. In this talk, I will present Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. The Data Station leverages hardware enclaves to generate trust among participants and exploits the centralization of data and computation to generate an audit log. Finally, I will contextualize the design and implementation of the Data Station in a larger research agenda that explores “the value of data”, the “economics of data”, and “data markets”.
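
The escrow pattern is easy to caricature in a few lines. The sketch below is purely conceptual (it is not the Data Station API, and it elides enclaves entirely): owners register data, users delegate functions, and the escrow enforces consent and appends every attempt to an audit log.

```python
from datetime import datetime, timezone

class ToyEscrow:
    """Conceptual sketch of delegated, auditable computation; not Data Station's API."""
    def __init__(self):
        self._data = {}        # dataset_id -> payload (never returned directly)
        self._consent = set()  # (dataset_id, user) pairs approved by the owner
        self.audit_log = []    # append-only record of every attempted computation

    def register(self, dataset_id, payload):
        self._data[dataset_id] = payload

    def grant(self, dataset_id, user):
        self._consent.add((dataset_id, user))

    def run(self, user, dataset_id, func):
        allowed = (dataset_id, user) in self._consent
        self.audit_log.append((datetime.now(timezone.utc), user, dataset_id,
                               func.__name__, "ok" if allowed else "denied"))
        if not allowed:
            raise PermissionError("owner has not consented to this computation")
        return func(self._data[dataset_id])  # only derived results leave the escrow

escrow = ToyEscrow()
escrow.register("sales", [100, 250, 75])
escrow.grant("sales", "alice")
print(escrow.run("alice", "sales", sum))  # 425; the raw list is never released
```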

Friday, Apr 22 Venu Satuluri
(Twitter)
Bio

Venu Satuluri is a Principal ML Engineer at Twitter, where he has worked on a variety of recommender systems and applied ML problems over the last 10 years. Venu received his Ph.D., focusing on graph clustering and similarity search, from The Ohio State University in 2012.

SimClusters: Community-Based Representations for Heterogeneous Recommendations at Twitter (talk @ 11:00 AM)
Click for abstract

Personalized recommendation products at Twitter target a variety of items: Tweets, Events, Topics, Hashtags, and users. Each of these targets varies in its cardinality (which affects the scale of the problem) and its shelf life (which constrains the latency of generating the recommendations). Although Twitter has built a variety of recommendation systems over the past decade, solutions to the broader problem were mostly tackled piecemeal. In this work, we present SimClusters, a general-purpose representation layer based on overlapping communities into which users as well as heterogeneous content can be captured as sparse, interpretable vectors to support a multitude of recommendation tasks. We propose a novel algorithm for community discovery based on Metropolis-Hastings sampling, which is both more accurate and significantly faster than off-the-shelf alternatives. SimClusters scales to networks with billions of users and has been effective across a variety of deployed applications at Twitter.
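
The core representational idea, users and items as sparse vectors over interpretable communities scored by overlap, can be sketched in a few lines (hypothetical data; the actual SimClusters pipeline discovers communities via Metropolis-Hastings sampling at billion-user scale):

```python
def sparse_dot(u, v):
    """Overlap score between two sparse community vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller dict
    return sum(w * v.get(c, 0.0) for c, w in u.items())

# community id -> affinity weight (illustrative values)
user = {"nba": 0.9, "ml": 0.4}
tweets = {
    "t1": {"nba": 0.8},
    "t2": {"ml": 0.7, "databases": 0.2},
    "t3": {"cooking": 0.9},
}

# Rank candidate tweets for this user by community overlap.
ranked = sorted(tweets, key=lambda t: sparse_dot(user, tweets[t]), reverse=True)
print(ranked)  # ['t1', 't2', 't3']
```

Because the same community space embeds Tweets, Topics, Hashtags, and users, one representation layer can serve heterogeneous recommendation tasks.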

Friday, Mar 18 Ippokratis Pandis
(Amazon AWS)
Bio

Ippokratis Pandis is a senior principal engineer at Amazon Web Services, currently working on Amazon Redshift. Redshift is Amazon's fully managed, petabyte-scale data warehouse service. Previously, Ippokratis held positions as a software engineer at Cloudera, where he worked on the Impala SQL-on-Hadoop query engine, and as a member of the research staff at the IBM Almaden Research Center, where he worked on IBM DB2 BLU. Ippokratis received his PhD from the Electrical and Computer Engineering department at Carnegie Mellon University. He is the recipient of Best Demonstration awards at ICDE 2006 and SIGMOD 2011, and of the Test-of-Time award at EDBT 2019. He has served or is serving as PC chair of DaMoN 2014, DaMoN 2015, CloudDM 2016, HPTS 2019 and ICDE Industrial 2022, as well as General Chair of SIGMOD 2023.

Amazon Redshift Reinvented (talk @ 12:00 PM)
Click for abstract

In 2013, eight years ago, Amazon Web Services revolutionized the data warehousing industry by launching Amazon Redshift, the first fully managed, petabyte-scale cloud data warehouse solution. Amazon Redshift made it simple and cost-effective to efficiently analyze large volumes of data using existing business intelligence tools. This launch was a significant leap from traditional on-premises data warehousing solutions, which were expensive, rigid (not elastic), and required a lot of tribal knowledge to operate. Unsurprisingly, customers embraced Amazon Redshift and it went on to become the fastest growing service in AWS. Today, tens of thousands of customers use Amazon Redshift in AWS's global infrastructure of 25 launched Regions and 81 Availability Zones (AZs) to process exabytes of data daily. The success of Amazon Redshift inspired a lot of innovation in the analytics industry, which in turn has benefited consumers. In the last few years, the use cases for Amazon Redshift have evolved, and in response Amazon Redshift has delivered a series of innovations that continue to delight customers. In this talk, we take a peek under the hood of Amazon Redshift and give an overview of its architecture. We focus on the core of the system and explain how Amazon Redshift maintains its differentiating industry-leading performance and scalability. We discuss how Amazon Redshift extends beyond traditional data warehousing workloads by integrating with the broad AWS ecosystem, making Amazon Redshift a one-stop solution for analytics. We then talk about Amazon Redshift’s autonomics and Amazon Redshift Serverless. In particular, we present how Redshift continuously monitors the system and uses machine learning to improve its performance and operational health without the need for dedicated administration resources, in an easy-to-use offering.

Friday, Mar 11 Prashant Pandey
(VMware)
Bio

Prashant Pandey is a Research Scientist at VMware Research. Previously, he did postdocs at the University of California, Berkeley and Carnegie Mellon University. He obtained his Ph.D. in Computer Science at Stony Brook University in December 2018. His goal as a researcher is to advance the theory and practice of resource-efficient data structures and employ them to democratize complex and large-scale data analyses. He designs and builds tools for large-scale data management problems across computational biology, stream processing, and storage. He is also the main contributor and maintainer of multiple open-source software tools that are used by hundreds of users across academia and industry. He won the prestigious Catacosinos Fellowship in 2018, a Best Paper award at FAST 2016, and Runner-Up for Best Paper at FAST 2015. His work has appeared at top conferences and journals, such as SIGMOD, FAST, SPAA, ESA, RECOMB, ISMB, Genome Biology, and Cell Systems.

Data Systems at Scale: Scaling Up by Scaling Down and Out (to Disk)
Click for abstract

Our ability to generate, acquire, and store data has grown exponentially over the past decade, making the scalability of data systems a major challenge. This talk presents my work on two techniques to tackle the scalability challenge: scaling down, i.e., shrinking data to fit in RAM, and scaling out to disk, i.e., organizing data on disk so that the application can still run fast. I will describe new compact and I/O-efficient data structures and their applications in computational biology, stream processing, and storage. In computational biology, my work shows how to shrink genomic and transcriptomic indexes by a factor of two while accelerating queries by an order of magnitude compared to state-of-the-art tools. In stream processing, my work bridges the gap between the worlds of external memory and stream processing to perform scalable and precise real-time event detection on massive streams. In file systems, my work improves file-system random-write performance by an order of magnitude without sacrificing sequential read/write performance.

Friday, Feb 11 Shay Cohen
(University of Edinburgh)
Bio

Shay Cohen is a Reader at the University of Edinburgh (School of Informatics). Prior to this, he was a postdoctoral research scientist in the Department of Computer Science at Columbia University, and held an NSF/CRA Computing Innovation Fellowship. He received his B.Sc. and M.Sc. from Tel Aviv University in 2000 and 2004, and his Ph.D. from Carnegie Mellon University in 2011. His research interests span a range of topics in natural language processing and machine learning, with a focus on structured prediction (for example, parsing) and text generation.

Document Summarisation: Modelling, Datasets and Verification of Content
Click for abstract

Within Natural Language Processing, document summarisation is one of the central problems. It has both short-term societal implications and long-term implications for the success of AI. I will describe advances made in this area with respect to three different aspects: methodology and modelling, dataset development, and enforcing the factuality of summaries. In relation to modelling, I will show how reinforcement learning can be used to directly maximise the metric by which the summaries are evaluated. With regards to dataset development, I will describe a dataset that we released for summarisation, XSum, in which a single sentence is used to describe the content of a whole article. The dataset has become a standard benchmark for summarisation. Finally, in relation to factuality, I will show how one can improve the factuality of summaries by re-ranking them in a beam based on a "verification" model.

Spring 2022: Andy Huynh, e-mail: ndhuynh at bu dot edu
Fall 2021 Calendar
Seminars are held in MCS B51 at 10:30am [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Dec 03 Cameron Musco
(University of Massachusetts Amherst)
Hutch++: Optimal Stochastic Trace Estimation
Click for abstract

We study the problem of estimating the trace of a matrix that can only be accessed through matrix-vector multiplication. We introduce a new randomized algorithm, Hutch++, which computes a (1+epsilon) approximation to the trace of any positive semidefinite (PSD) matrix using just O(1/epsilon) matrix-vector products. This improves quadratically on the ubiquitous Hutchinson's estimator, which requires O(1/epsilon^2) matrix-vector products. Our approach is based on a simple technique for reducing the variance of Hutchinson's estimator using low-rank approximation, and is easy to implement and analyze. Moreover, we prove that up to a logarithmic factor, the complexity of Hutch++ is optimal amongst all matrix-vector query algorithms. We show that it significantly outperforms Hutchinson's method in experiments, on both PSD and non-PSD matrices arising in applications such as network analysis and Bayesian machine learning.
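
The algorithm is simple enough to state in a few lines of NumPy. The sketch below follows the structure described in the abstract (low-rank sketch plus Hutchinson on the deflated residual); the constants and budget split are illustrative choices:

```python
import numpy as np

def hutchinson(matvec, n, m, rng):
    """Classic estimator: average g^T A g over m Rademacher vectors g."""
    G = rng.choice([-1.0, 1.0], size=(n, m))
    return np.sum(G * matvec(G)) / m

def hutchpp(matvec, n, m, rng):
    """Hutch++-style estimate using roughly m matrix-vector products."""
    k = m // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))
    Q, _ = np.linalg.qr(matvec(S))         # orthonormal basis for a sketch of A
    trace_top = np.trace(Q.T @ matvec(Q))  # exact trace of the low-rank part
    G = rng.choice([-1.0, 1.0], size=(n, k))
    G -= Q @ (Q.T @ G)                     # deflate: (I - QQ^T) G
    return trace_top + np.sum(G * matvec(G)) / k  # Hutchinson on the residual

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
A = A @ A.T                                # PSD test matrix
mv = lambda X: A @ X
print(np.trace(A), hutchpp(mv, 200, 60, rng), hutchinson(mv, 200, 60, rng))
```

The variance reduction comes from the deflation step: most of a PSD matrix's trace lives in its top eigenspace, which the QR sketch captures exactly, leaving only a small residual for the stochastic estimator.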

Friday, Nov 12 Utku Sirin
(Harvard University)
Bio

Utku is a postdoctoral researcher at the Harvard Data Systems Lab, working with Stratos Idreos. He completed his Ph.D. at EPFL, Switzerland, advised by Anastasia Ailamaki. He is interested in hardware- and energy-efficient data systems.

Characterization of OLTP Workloads: From Micro-architecture to Power/Performance
Click for abstract

Traditional online transaction processing (OLTP) systems under-utilize micro-architectural resources: more than half of the CPU cycles go to stalls, and the number of instructions retired per cycle barely reaches one on machines that are able to retire up to four. This wastes significant power and renders the power-hungry, heavily optimized Intel Xeon processors energy-inefficient. In this talk, we first examine the hardware utilization of OLTP workloads for state-of-the-art OLTP systems. We then proceed by examining whether and how machine learning models can be useful in improving system performance. Lastly, we compare a high-end Intel Xeon processor with a recently announced server-grade ARM processor in terms of their power, throughput, and latency characteristics when running OLTP workloads. Based on our observations, we conclude by commenting on power-saving and performance-improvement opportunities in co-designing OLTP software and hardware.

Friday, Nov 05 Immanuel Trummer
(Cornell University)
Bio

Immanuel Trummer is assistant professor at Cornell University, working towards making data analysis more efficient and more user-friendly. His papers were selected for “Best of VLDB”, “Best of SIGMOD”, for the ACM SIGMOD Research Highlight Award, and for publication in CACM as CACM Research Highlight. His current research is funded by the NSF and by multiple Google Faculty Research Awards.

Towards Database Tuning Tools that Read the Manual
Click for abstract

Recent advances in natural language processing (NLP) open up new applications in the context of database systems. In this talk, I discuss how NLP can yield useful information for various database performance tuning problems. In particular, I focus on the problem of database system parameter tuning.

Friday, Oct 29 Xuhao Chen
(Massachusetts Institute of Technology)
Bio

Xuhao Chen is a Research Scientist at MIT CSAIL, working with Prof. Arvind. Dr. Chen is broadly interested in parallel systems for AI & big-data, with a focus on emerging graph algorithms. He has built multiple software and hardware systems for data mining and machine learning on graphs. His work has been published in ISCA, MICRO, VLDB, ICS and DAC. Before joining MIT, Dr. Chen was a research fellow at UT Austin.

System Specialization for Graph Pattern Mining
Click for abstract

Graph algorithms play a key role in numerous real-world applications, including social networks, e-commerce, biomedicine and cybersecurity. These algorithms process massive irregular data, which poses both performance and programmability challenges in computing system design. In this talk, I will explore system specialization approaches for an important class of graph algorithms — graph pattern mining (GPM). I will introduce a specialized software programming framework and a dedicated hardware accelerator that we built for GPM, and describe experiences creating abstractions, optimization techniques and automation methods. Overall, by specializing in both software and hardware, we can make graph algorithms easy to program and run fast at the same time.

Friday, Oct 22 Siying Dong
(Facebook)
Bio

Siying Dong is a software engineer working in the Database Engineering team at Facebook, focusing on RocksDB. He also worked on Hive, HDFS, and some other data warehouse infrastructures. Before joining Facebook, Siying worked in the SQL Azure Team at Microsoft. He received a bachelor’s degree from Tsinghua University and a master’s degree from Brandeis University.

Online only @ 11:00 AM
RocksDB: The Changing Of Optimization Targets

Click for abstract

RocksDB is an embedded persistent storage engine optimized for flash-based SSD. It has been adapted to a wide range of workloads. In the talk, Siying will introduce how and why RocksDB’s resource optimization target migrated from write amplification, to space amplification, to CPU utilization, and why disaggregated storage is the current focus. Siying will also introduce RocksDB’s plan for supporting SSD/HDD hybrid, as well as challenges for it.

Friday, Sep 24 Brian Hentschel
(Harvard University)
Bio

Brian Hentschel is a Ph.D. candidate at Harvard advised by Stratos Idreos. He is interested in blending statistics and machine learning with classical techniques to improve computer systems. His research has been awarded the SIGMOD Research Highlight Award as well as a Best Paper Award from EDBT. He earned his BA in mathematics and computer science at Pomona College and has previously spent time at Microsoft, Amazon, LinkedIn, and IBM.

Augmenting Classical Data Structures through Learning
Click for abstract

Data systems are moving from data structures which are implicitly designed for specific data and workload characteristics towards data structures explicitly designed to exploit patterns in both. This generally leads to better expected performance but opens up questions about both the predictability of data structure performance and its robustness to workload or data shifts. In this talk, we make the case for ML-Augmented data structures, which solve these problems by using workload information and statistical modeling to redesign components of classical data structures. By keeping the core structure of classical data structures, we keep their robustness properties and provide a pathway towards arguing theoretically about their performance. In using ML for design, we transfer properties of the workload and data into the data structure and achieve better expected performance. The result is that we can argue theoretically that ML-designed data structures perform better than classical data structures and empirically show the benefits of this approach, all while maintaining many of the benefits of classical data structures. We present three applications of this approach. First, we present Stacked Filters, which uses workload skew to produce 100X lower false positive rates for filter data structures. Second, we present Entropy-Learned Hashing, which produces 10X faster hash function evaluation and 4X faster hash tables. And third, we present Column Sketches, which produces 7.5X faster database scans.

Fall 2021: Lina Qiu, e-mail: qlina at bu dot edu
Spring 2021 Calendar
Because of COVID-19, seminars are held online through Zoom at 11.00 AM [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Apr 23 Yannis Chronis
(University of Wisconsin Madison)
Bio

Yannis is a fourth-year Ph.D. student in the Computer Sciences Department at the University of Wisconsin-Madison advised by Prof. Jignesh Patel. His research focuses on efficient data processing using modern and emerging hardware.

Castle: Accelerating analytical query processing with associative computing
Click for abstract

We are in the midst of a Cambrian hardware evolution in which a rich variety of technologies and architectures are being invented with a flurry that we haven't seen in a long time. One class of hardware devices, called Processing-in-Memory (PIM), is enabling new ways of utilizing memories. PIM devices generally pack storage and computing closely together and have novel computing primitives built into the memory devices. Associative processors are a type of PIM device with massive data-level parallelism and single-cycle search and update capabilities. The focus of our work is on exploring how database analytic workloads can be run efficiently on an associative computing substrate. Our vehicle for this research is the Castle project, in which we are building a data processing engine to run on the CAPE hardware, a specific implementation of an associative PIM device. Initial results show that associative PIM hardware can potentially deliver manifold performance gains over traditional von Neumann architectures; however, to fully exploit CAPE's potential, data query processing kernels and query optimization methods both have to be adapted in new ways to leverage the style of data-parallel computing that can be carried out directly inside CAPE. We experiment with both individual query processing kernels and end-to-end queries and show that the performance gains using our methods are significant, and up to 46X in some cases. Tentatively, these results point to the exciting potential of accelerating database analytic query processing workloads using associative computing.

Friday, Feb 12 Tarikul Islam Papon
(Boston University)
Bio

Papon is a second-year Ph.D. student, part of MiDAS group at BU, advised by Prof. Athanassoulis. His research interests broadly include Data Systems, specifically on hardware-aware data management challenges, stemming from the evolution of storage and memory devices. Currently, he is working on proposing a new design paradigm for interacting with durable storage that takes into account performance asymmetry between read and write operations, as well as the variable access concurrency that different devices may support. Prior to joining BU, Papon served as a lecturer for 4 years at the Department of Computer Science & Engineering (CSE) in Bangladesh University of Engineering & Technology (BUET). He completed his MSc and BSc degree from the same department where he worked on medical informatics using various machine learning and embedded system techniques.

The Case for A Parametric I/O Model
Click for abstract

Storage devices have evolved to offer increasingly faster read/write access through flash-based and other solid-state storage technologies. Compared to classical rotating hard disk drives (HDDs), modern solid-state drives (SSDs) have two key differences: (i) the absence of mechanical parts, and (ii) an inherent difference between the process of reading and writing. The former removes a key performance bottleneck, enabling internal device parallelism, whereas the latter manifests as a read/write performance asymmetry. In other words, modern storage devices can serve multiple concurrent I/Os, and writes are generally slower than reads; neither of which is true for HDDs. To address this mismatch, we propose a simple yet expressive storage model, termed the Parametric I/O Model (PIO), that captures modern (and future) devices by parameterizing read/write asymmetry (α) and access concurrency (k). PIO enables making device-specific decisions at algorithm design time, rather than as an optimization during deployment. To showcase the benefits of algorithm design using PIO, we introduce a new family of bufferpool eviction strategies that offer performance improvements by accurately modeling the behavior of the underlying devices. We compare their performance on various synthetic traces with three state-of-the-art algorithms (LRU, CFLRU, LRU-WSR) and show that they achieve a 15.9%-44.5% improvement in runtime with a negligible increase in total writes (≤0.06%) and buffer misses (≤0.004%). We further report up to 1.83x speedup for TPC-C traces on PostgreSQL.
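
As a back-of-the-envelope illustration of why the two parameters matter (a simplification, not the paper's exact cost formula): under PIO-style accounting a write costs α reads and up to k I/Os overlap, so the same batch of operations is priced very differently than under the symmetric, one-at-a-time model.

```python
import math

def em_cost(reads, writes):
    """Classical EM-style accounting: symmetric cost, one I/O at a time."""
    return reads + writes

def pio_cost(reads, writes, alpha, k):
    """Simplified PIO-style accounting (illustrative, not the paper's formula):
    a write costs alpha times a read; up to k I/Os proceed concurrently."""
    return math.ceil((reads + alpha * writes) / k)

# 1000 reads + 1000 writes on a device with 3x write asymmetry, concurrency 8:
print(em_cost(1000, 1000))           # 2000 cost units
print(pio_cost(1000, 1000, 3.0, 8))  # 500 -- asymmetry hurts, concurrency helps
```

An eviction policy that knows α can, for example, prefer evicting clean pages over dirty ones whenever the write penalty outweighs the expected re-read cost, which is roughly the intuition that write-aware policies like CFLRU exploit.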

Friday, Jan 22 Nikolaos Tziavelis
(Northeastern University)
Bio

Nikolaos Tziavelis is a PhD student at the Khoury College of Computer Sciences of Northeastern University. His research interests lie in algorithms for database query processing and distributed computing. He holds an MEng in Electrical and Computer Engineering from the National Technical University of Athens.

Any-k: Optimal Join Algorithms meet Top-k
Click for abstract

Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. However, optimality results for queries involving joins hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. In this talk, I will present the problem of ranked enumeration, which lies in the intersection of these two research areas. The goal is to return query results one after the other in order of their importance, with theoretical guarantees that are better suited for in-memory computation. I will focus on two families of algorithms that have their roots in the k-th shortest path problem. The first one is based on the Lawler-Murty partitioning procedure, while the second one uses a recursive approach that has been developed over several decades.

Spring 2021: Zichen Zhu, e-mail: zczhu at bu dot edu
Fall 2020 Calendar
Because of COVID-19, seminars are held online through Zoom at 11.00 AM [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Dec 18 Chen Luo
(University of California Irvine)
Bio

Chen Luo received his PhD from the University of California, Irvine in December 2020 under the supervision of Professor Michael J. Carey. His research interests include storage management, indexing, and LSM-trees. In particular, he worked on optimizing LSM-trees for big data management systems. During his PhD studies, he interned at IBM Almaden (2017), working on HTAP indexing, and at Microsoft Research (2019), working on customized SSD controllers. Before that, he obtained his Master's degree from Tsinghua University in 2016 and his Bachelor's degree from Tongji University in 2013.

On Optimizing LSM-based Storage for Big Data Management Systems
[Talk Recording]
Click for abstract

In recent years, the Log-Structured Merge-tree (LSM-tree) has been widely used in the storage layer of modern NoSQL systems. Different from traditional index structures that apply updates in-place, an LSM-tree first buffers all writes in memory and subsequently flushes them to disk and merges them using sequential I/Os. This out-of-place update design brings a number of advantages, including superior write performance, high space utilization, tunability, and simplification of concurrency control and recovery. These advantages have enabled LSM-trees to serve a large variety of workloads in production systems. Despite the popularity of LSM-trees, existing research efforts have focused primarily on improving the maximum throughput of LSM-trees in simple key-value store settings. This leaves several gaps when adopting LSM-based storage techniques in a Big Data Management System (BDMS), which manages multiple heterogeneous LSM-trees and requires performance metrics beyond just maximum throughput. In this talk, I will describe some of my PhD research on optimizing LSM-trees for BDMSs, including efficient maintenance of LSM-based auxiliary structures, minimizing write stalls via better scheduling, and adaptive memory management of LSM-trees.
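
The out-of-place write path described above fits in a few lines. This toy sketch (no WAL, leveling, compaction policy, or Bloom filters) shows the essential mechanics: updates land in a memtable, full memtables are flushed as immutable sorted runs, and reads consult the newest data first.

```python
class TinyLSM:
    """Minimal LSM-style write path; illustrative only."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.limit = memtable_limit
        self.runs = []   # newest-first immutable sorted runs on "disk"

    def put(self, key, value):
        self.memtable[key] = value            # updates stay in memory...
        if len(self.memtable) >= self.limit:
            self.flush()                      # ...until flushed sequentially

    def flush(self):
        self.runs.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:              # newest version wins
            return self.memtable[key]
        for run in self.runs:                 # search runs newest to oldest
            if key in run:
                return run[key]
        return None
```

The topics the abstract raises (auxiliary structures, write stalls, memory management) all concern what happens around this core loop: when flushes and merges get scheduled, and how much memory the memtables receive.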

Friday, Dec 11 Sudhanshu (Dan) Chanpuriya
(UMass Amherst)
Bio

Sudhanshu Chanpuriya is a PhD student at the College of Information and Computer Sciences at UMass Amherst, where he is advised by Cameron Musco. His research interests lie at the intersection of machine learning and graph theory. Prior to joining UMass, he received his undergraduate degree from Dartmouth College.

Towards Understanding the Success and Limitations of Node Embeddings
Click for abstract

The node embedding task involves mapping nodes in networks to low-dimensional, continuous vectors for direct use in downstream machine learning algorithms, such as those for classification and clustering. This task has received great interest from the ML community in the past few years, and I will present some of our recent work, which tries to advance two parts of our understanding: (i) What are the limitations of node embeddings? In work appearing at NeurIPS 2020, we show that, contrary to recent doubts, node embeddings can capture fine network structures like triangles involving only low-degree nodes; in fact, they can exactly capture the full network structure at surprisingly low embedding dimensionalities. (ii) What has made recent node embedding algorithms more effective on downstream tasks? I will present our variant of the popular DeepWalk algorithm, InfiniteWalk, which attributes the success of DeepWalk to the addition of a nonlinear component to older algorithms based on spectral clustering.

Friday, Dec 04 Herodotos Herodotou
(Cyprus University of Technology)
Bio

Herodotos Herodotou is an Assistant Professor in the Department of Electrical Engineering, Computer Engineering and Informatics at the Cyprus University of Technology, where he is leading the Data Intensive Computing Research Lab. He received his Ph.D. in Computer Science from Duke University. His Ph.D. dissertation work on building a self-tuning system for big data analytics received the ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention as well as the Outstanding Ph.D. Dissertation Award in Computer Science at Duke. Before joining CUT, he held research positions at Microsoft Research, Yahoo! Labs, and Aster Data. His research interests are in large-scale Data Processing Systems, Database Systems, and Cloud Computing. In particular, his work focuses on ease-of-use, manageability, and automated tuning of both centralized and distributed data-intensive computing systems. In addition, he is interested in applying database techniques in other areas like maritime informatics, scientific computing, bioinformatics, and social computing. His research work to date has been published in several top scientific conferences and journals, two books, and two book chapters, and he actively participates in multiple European and nationally funded projects.

Automating Distributed Tiered Storage Management in Cluster Computing
[Talk Recording]
Click for abstract

Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them, while trying to optimize their workloads. In this talk, I will present OctopusFS, a novel distributed file system that is aware of heterogeneous storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes. Smart placement and retrieval policies employ multi-objective optimization techniques for making intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. In addition, redistribution policies employ machine learning for tracking and predicting file access patterns, which are used to decide when and which data to move up or down the storage tiers for increasing system performance. The approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to naturally adjust and adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.

Friday, Nov 06 Abdul Wasay
(Harvard)
Bio

I am a Ph.D. candidate at the Harvard John A. Paulson School of Engineering and Applied Sciences, where I am advised by Prof. Stratos Idreos. I design techniques to accelerate machine learning and data science pipelines, specifically, addressing the bottleneck of repeated data movement. Additionally, I am interested in mapping out the design space of deep neural network models to enable better design. Prior to joining Harvard, I was an undergraduate at Lahore University of Management Sciences (LUMS). In my undergraduate thesis, I investigated ways to improve buffer management in data center networks. During this time, I also spent a summer at École Polytechnique Fédérale de Lausanne (EPFL) working on techniques to enhance the privacy of social network graphs.

How and When to Build Neural Network Ensemble Models
Click for abstract

Neural network models are applied to many tasks with rising complexity and data sizes. Researchers and practitioners seek to scale the representational power of these models by adding more parameters. Increasing parameters, though, requires additional critical resources such as memory and compute time, leading to increased training and inference cost. Thus, a consistent challenge is to obtain the highest possible accuracy within a parameter budget. As neural network designers navigate this complex landscape, they are guided by conventional wisdom, which comes from empirical studies of the design space. We identify a critical part of this design space that is not well understood: how to decide between the alternatives of expanding a single network model or increasing the number of networks and using them together in an ensemble. We study this question in detail through extensive experimentation across various network architectures and data sets. Crucially, we provide a robust assessment, missing in previous studies, by fixing the number of parameters. We also consider a holistic set of metrics such as training time, inference time, and memory usage. Contrary to conventional wisdom, we show that when we perform a holistic and robust assessment, we uncover a wide design space, where ensembles not only provide better accuracy but also train faster, and deploy at a speed comparable to single convolutional networks with the same total number of parameters.

Fall 2020: Tianyi Chen, e-mail: ctony at bu dot edu
Spring 2020 Calendar
Because of COVID-19, seminars are held online through Zoom at 11.00 AM [unless otherwise notified].
Date Speaker Title and Abstract
Friday, May 22 Zichen Zhu
(BU)
Bio

Zichen Zhu is a first-year PhD student at Boston University advised by Prof. Manos Athanassoulis. Prior to this, he received his MPhil degree in 2019 from the University of Hong Kong. Zichen's current research interests are data access methods and LSM-based NoSQL key-value stores.

A Fast Workload-Driven Blocking Approach with Sub-optimal Skipping Effectiveness
Click for abstract

With rapidly growing data volumes, modern query engines are required to process enormous amounts of data in a timely manner. To speed up data access, an existing data-skipping technique works as follows: for each block of tuples, a bit vector (also called a feature vector) is constructed, where each dimension indicates whether the block contains a tuple qualifying for some specific filter. The effectiveness of data skipping therefore depends essentially on how we partition data into blocks. Existing skipping-oriented partitioning techniques assume the dimension of the feature vectors is given, and model feature selection as a frequent-item mining problem with the dimension as input. We argue that there may be a more intelligent way to determine these parameters. In this work, we introduce a greedy partitioning technique that achieves sub-optimal skipping performance without requiring the dimension to be specified. Experiments further demonstrate the effectiveness of our algorithm on a synthetic workload and dataset.
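
A minimal sketch of the feature-vector mechanism described above (hypothetical filters and data; which tuples go in which block is exactly the partitioning question the talk's greedy technique addresses):

```python
def build_feature_vectors(blocks, filters):
    """Per-block bit vector: bit i is set iff some tuple satisfies filter i."""
    return [[any(f(t) for t in block) for f in filters] for block in blocks]

def scan(blocks, vectors, filter_idx, predicate):
    """Skip every block whose feature bit for this query's filter is unset."""
    for block, bits in zip(blocks, vectors):
        if not bits[filter_idx]:
            continue                 # skipped without reading the block
        yield from (t for t in block if predicate(t))

filters = [lambda t: t < 10, lambda t: t % 2 == 0]   # workload-derived filters
blocks = [[1, 3, 5], [20, 22, 24], [11, 13, 15]]
vectors = build_feature_vectors(blocks, filters)
print(list(scan(blocks, vectors, 0, lambda t: t < 10)))  # [1, 3, 5]; 2 blocks skipped
```

The better the partitioning co-locates tuples that qualify for the same filters, the more blocks a query can skip.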

Friday, May 15 Subhadeep Sarkar
(BU)
Bio

Subhadeep Sarkar is a post-doctoral researcher at Boston University working with Prof. Manos Athanassoulis. Prior to this, Subhadeep spent nearly two years as a post-doctoral researcher at INRIA Rennes, France. He completed his Ph.D. in 2017 from Indian Institute of Technology Kharagpur. Subhadeep's current research interests are NoSQL key-value stores, data access methods, and data privacy.

Lethe: A Tunable Delete-Aware LSM Engine
Click for abstract

Data-intensive applications have fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM-engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, these LSM-engines only support deletion on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written. We highlight that fast persistent deletion without affecting read performance is key to supporting: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on the right-to-be-forgotten, and (iii) en masse cloud deployment of data systems that makes storage a precious resource.
In this talk, I will present Lethe, an LSM-based key-value storage engine that introduces a set of delete-aware compaction policies and a new physical data storage layout, which together allow for efficient and timely persistent deletion of data. Lethe is capable of supporting any user-defined threshold for the delete persistence latency, offering higher read throughput and lower space amplification with a modest increase in write amplification. In addition, by physically storing data in an interweaved order on the sort and the delete key, Lethe supports efficient range deletes on a secondary (delete) key without employing a costly full-tree merge. Our paper on Lethe is set to appear at SIGMOD 2020.
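
The tombstone mechanics at the heart of this work can be caricatured as follows (illustrative of out-of-place deletes generally, not of Lethe's delete-aware compaction policies or its interweaved layout):

```python
TOMBSTONE = object()   # sentinel marking a key as deleted

def merge(newer, older, is_last_level):
    """Merge two runs; newer entries shadow older ones. A tombstone keeps
    shadowing older versions until it reaches the last level, where it can
    finally be dropped -- which is why delete persistence takes time."""
    out = dict(older)
    out.update(newer)                  # newer versions and tombstones win
    if is_last_level:
        out = {k: v for k, v in out.items() if v is not TOMBSTONE}
    return out

run_old = {"a": 1, "b": 2}
run_new = {"b": TOMBSTONE, "c": 3}     # a delete of "b" arrives as a tombstone
print(merge(run_new, run_old, is_last_level=False))  # tombstone still present
print(merge(run_new, run_old, is_last_level=True))   # {'a': 1, 'c': 3}
```

Lethe's contribution is to bound how long that journey to the last level can take, for any user-specified delete-persistence threshold.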

Friday, May 08 Ju Hyoung Mun
(BU)
Bio

Ju Hyoung Mun is currently a postdoctoral associate in the Department of Computer Science at Boston University, working with Prof. Manos Athanassoulis. She received her Ph.D. in Electronic and Electrical Engineering at Ewha Womans University in Seoul, Korea, under the supervision of Prof. Hyesook Lim. Her doctoral research focused on how to retrieve data efficiently by using Bloom filters in Named Data Networking (NDN). Her research interest lies in the fields of Data Systems and Networking; particularly, in enhancing lookup performance.

FIB Sharing and Cache Sharing using Bloom filters in NDN
Click for abstract

A major portion of Internet bandwidth is now used for content distribution, such as video streaming. The conventional Internet infrastructure, which is based on a host-to-host communication model, is not efficient for content distribution. In order to accommodate today's communication needs, Named Data Networking (NDN) has been introduced as a future Internet architecture. NDN technology has two main characteristics: name-based content retrieval and in-network caching. There are many research challenges in the realization of NDN. This talk presents two methods of information sharing using Bloom filters in NDN: (1) Content Store sharing, which shares cache summaries with neighboring routers to increase the diversity of cached contents in the network as a whole, and (2) FIB table sharing, which reduces the traffic and the size of the FIB tables.

Friday, May 01 Tarikul Islam Papon
(BU)
Bio

Papon is a first-year Ph.D. student, part of MiDAS group at BU, advised by Prof. Athanassoulis. His research interests broadly include Data Systems, specifically on hardware-aware data management challenges, stemming from the evolution of storage and memory devices. Currently, he is working on proposing a new design paradigm for interacting with durable storage that takes into account performance asymmetry between read and write operations, as well as the variable access concurrency that different devices may support.
Prior to joining BU, Papon served as a lecturer for 4 years at the Department of Computer Science & Engineering (CSE) in Bangladesh University of Engineering & Technology (BUET). He completed his MSc and BSc degree from the same department where he worked on medical informatics using various machine learning and embedded system techniques.

The Case for A Parameterized I/O Model
Click for abstract

The traditional I/O model (EM Model) was developed based on the properties of HDDs and has not evolved accordingly, even though it is still widely used. The EM Model assumes that read and write accesses have the same (symmetric) cost, whereas most modern flash devices and emerging non-volatile memory (NVM) technologies show asymmetric costs: writes are significantly slower than reads. In addition, these devices provide some degree of access concurrency, which cannot be modeled with the EM Model because it considers only one I/O at a time. The EM Model thus cannot capture these key properties of modern memory technologies and is unable to exploit the full benefits of contemporary devices.
We propose a new Parameterized I/O Model, PIO(M, k, α), that captures present-day storage devices by considering read/write asymmetry (α) as well as the variable access concurrency (k) that different devices may support. In addition, we aim to propose a new database bufferpool page eviction algorithm for flash devices and compare its performance with traditional algorithms under our proposed PIO(M, k, α) model.

Friday, Apr 24 Subarna Chatterjee
(Harvard)
Bio

Subarna Chatterjee has been a post-doc at the Data Systems Lab (DASLab) at Harvard University since January 2019. Prior to this, she was a post-doc with the Myriads team at Inria, Rennes in France. She completed her Ph.D. at the Indian Institute of Technology Kharagpur, India, from 2013 to 2017. Her current research interests are NoSQL storage engines and reasoning about their cost and performance on the cloud.

Cosine: A Cloud-Cost Optimized NoSQL Storage Engine
Click for abstract

We present a key-value storage engine, Cosine, that guarantees an optimal cost-performance trade-off, given a workload and a budget. Cosine creates a massive search space comprising the entire data structure design and hardware space of (LSM-tree/B-tree) key-value stores over diverse cloud pricing policies for the top three cloud providers – AWS, GCP, and Azure. In order to prune configurations from this massive search space, we present distribution-aware cost models that precisely estimate the I/O costs of each possible data structure design, which in turn helps in comparing pairs of designs. By using the cost models, Cosine reduces the massive dimensionality of the search space into a continuum of holistically optimal configurations. Cosine also enables decision makers to quickly answer rich what-if questions about the changes in workload performance and cost as any of the design, hardware, or cloud provider changes.

Friday, Apr 17 Andy Huynh
(BU)
Bio

Andy is a third-year PhD student in the MiDAS group at Boston University, where he is advised by Professor Manos Athanassoulis. His research focuses on novel design concepts in hybrid analytical/transactional processing systems. He is currently working on an open-source column store called OakDB, a DBMS that focuses on auto-tuning under input uncertainty. Prior to Boston University, Andy attended the University of Minnesota, Twin Cities, earning a Bachelor's in Computer Engineering.

Tuning Data Systems Under Uncertainty in Workload
Click for abstract

Modern hybrid analytical/transactional processing (HTAP) systems are generally tuned with assumptions about the expected workload; however, workload knowledge is often inaccurate. Deploying systems tuned on inaccurate knowledge can result in unexpected latency increases of up to 2X over the optimal. Additionally, expected workloads may "drift" over time, effectively acting as uncertainty in the workload and leading to frequent and unnecessary re-tuning. This work focuses on exploring a new robust design paradigm for column stores, a popular layout for modern HTAP storage engines. We explore how much these uncertainties affect performance by benchmarking current systems and how well they handle workload drift. We then introduce some intuitive notions of workload drift and explore different heuristic strategies to combat it. We implement these strategies in OakDB, an open-source column store engine, to demonstrate robust tuning.

Friday, Apr 10 Jialin Ding
(MIT)
Bio

Jialin Ding is a second-year PhD student in the MIT Data Systems Group, where he is advised by Prof. Tim Kraska. His research focuses broadly on the application of machine learning to data systems. He also collaborates with Umar Farooq Minhas and the Database Group at Microsoft Research on learned data structures. Prior to MIT, Jialin was an undergraduate at Stanford University, where he worked on data-intensive systems with Prof. Peter Bailis as part of Stanford DAWN.

Learning Multi-dimensional Indexes
Click for abstract

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries.
In this talk, I will present Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system. Our paper on Flood will appear at SIGMOD 2020.
I will also talk about our continuing work on extending the ideas of Flood to address real-world challenges of indexing multi-dimensional data such as data correlation, non-uniform queries, and categorical attributes.
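
To make the "jointly optimize structure and layout" idea concrete, here is a toy grid layout in the spirit of (but far simpler than) Flood: per-dimension boundaries are chosen from the data so cells are evenly filled, and a range query touches only overlapping cells. All choices below (quantile boundaries, sorting within cells by the other dimension) are illustrative.

```python
import bisect

class ToyGrid:
    """Toy 2-D grid index; illustrative, not Flood's learned layout."""
    def __init__(self, points, cols=4):
        xs = sorted(p[0] for p in points)
        # Quantile boundaries keep cells evenly filled for this dataset.
        self.bounds = [xs[i * len(xs) // cols] for i in range(1, cols)]
        self.cells = [[] for _ in range(cols)]
        for p in points:
            self.cells[bisect.bisect_right(self.bounds, p[0])].append(p)
        for cell in self.cells:
            cell.sort(key=lambda p: p[1])    # order within a cell by y

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        first = bisect.bisect_left(self.bounds, x_lo)
        last = bisect.bisect_right(self.bounds, x_hi)
        out = []
        for cell in self.cells[first:last + 1]:      # only overlapping cells
            ys = [p[1] for p in cell]
            for p in cell[bisect.bisect_left(ys, y_lo):]:
                if p[1] > y_hi:
                    break
                if x_lo <= p[0] <= x_hi:
                    out.append(p)
        return out

grid = ToyGrid([(i, i % 7) for i in range(100)])
print(grid.range_query(10, 20, 0, 2))   # [(14, 0), (15, 1), (16, 2)]
```

A learned approach tunes the number of columns, the boundary positions, and the sort dimension to the query workload rather than fixing them a priori.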

Thursday, Mar 26 Ryan Marcus
(MIT)
Bio

Ryan Marcus is a postdoctoral researcher at MIT, working under Tim Kraska. Ryan recently graduated from Brandeis University, where he studied applications of machine learning to cloud databases under Olga Papaemmanouil. Before that, Ryan took courses in gender studies and mathematics at the University of Arizona, while banging his head against supercomputers at Los Alamos National Laboratory. He enjoys long walks down steep gradients, and shorter walks down gentler ones. He’s also looking for a job! You can find Ryan online at https://rmarcus.info.

MiDAS Seminar/CS591A1 Guest Lecture (Online@12:30 PM)
Machine Learning for Query Optimization: A Few Interesting Results & Thousands of Practical Barriers

Click for abstract

Query optimization is one of the most well-studied problems in database management systems. Because of the exponential problem space and the heuristic nature of current solutions, researchers have recently applied machine learning techniques to this challenging problem domain. Approaches range from using machine learning to perform cardinality estimation to outright replacing the query optimizer with deep reinforcement learning. Each of these approaches comes with its own unique set of advantages and constraints. This talk will lay out these recent advancements in a skeptical light (my own work included), highlighting both the potential advantages and the myriad of challenges that still need to be overcome to achieve a fully autonomous query optimizer.

Thursday, Mar 19 Stavros Papadopoulos
(TileDB)
Bio

Dr. Stavros Papadopoulos is the founder and CEO of TileDB, Inc. Prior to founding TileDB, Inc. in February 2017, Dr. Papadopoulos was a Senior Research Scientist at the Intel Parallel Computing Lab, and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for three years. He also spent about two years as a Visiting Assistant Professor at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD degree in Computer Science at HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoc fellow position at the Chinese University of Hong Kong with Prof. Yufei Tao.

MiDAS Seminar/CS591A1 Guest Lecture (Online@12:30 PM)
The TileDB array data storage engine

Click for abstract

In this talk I will present TileDB, a storage engine for multi-dimensional dense and sparse arrays. TileDB is open-source (MIT License) and built as a C++ embeddable library. It comes with 6 APIs (C, C++, Python, R, Java, Go), and integrates with Spark, Dask, MariaDB, PrestoDB, PDAL and GDAL. I will first explain the TileDB data format, the internal engine mechanics (parallelism, filters, integrations), and its applicability to data science and analytics. I will then describe TileDB’s value in application domains such as Genomics and Geospatial. Finally, I will show a demo of our recently launched product that offers access control, logging and serverless SQL and UDFs on the cloud.

Friday, Mar 06 Dimitris Staratzis
(BU)
Bio

Dimitris Staratzis is a first year PhD student at the Computer Science Department of Boston University. He is a member of the MiDAS group@BU, advised by Professor Manos Athanassoulis. His research interests broadly include Database and Big Data Management. Prior to joining Boston University, he received his BSc in Informatics from Athens University of Economics and Business.

Efficient Range Deletes in LSM Engines
Click for abstract

Modern data systems process enormous amounts of data, and selecting the appropriate data structure can have major implications for the responsiveness, availability, and scalability of such systems. Traditional systems have certain limitations, and thus more advanced key-value systems are used in most applications today. During updates, inserts, and deletes, LSM-trees utilize the main memory of the system to achieve better performance. Values are not deleted or updated in place. Each delete requires a new entry, referred to as a 'tombstone', that marks all previous entries with the same key as deleted. To our knowledge, none of the major key-value store developers have implemented range deletes except RocksDB. Their implementation uses so-called 'range tombstones', which together constitute a data structure that can answer whether a requested key is deleted by a range delete or not. For the database to construct such a response, multiple I/Os are needed. Our method reduces these I/Os by opportunistically storing past results in memory, thus increasing responsiveness, since I/Os have a major effect on performance.
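
A simplified sketch of the lookup this describes (illustrative; not RocksDB's actual range-tombstone representation): keep delete ranges sorted, answer point lookups with binary search, and opportunistically cache past answers so repeated lookups skip the index walk entirely.

```python
import bisect

class RangeTombstones:
    """Sorted, non-overlapping delete ranges with an opportunistic result cache."""
    def __init__(self, ranges):
        self.ranges = sorted(ranges)   # [(start, end)], end exclusive
        self.cache = {}                # key -> cached verdict

    def is_deleted(self, key):
        if key in self.cache:          # a past result: no index probe needed
            return self.cache[key]
        i = bisect.bisect_right(self.ranges, (key, float("inf"))) - 1
        verdict = i >= 0 and self.ranges[i][0] <= key < self.ranges[i][1]
        self.cache[key] = verdict      # remember for future lookups
        return verdict

rt = RangeTombstones([(10, 20), (40, 45)])
print(rt.is_deleted(12), rt.is_deleted(25), rt.is_deleted(12))  # True False True
```

In a real engine the uncached path costs I/Os across multiple sorted runs, which is where caching past results pays off.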

Friday, Feb 21 Showan Esmail Asyabi
(BU)
Bio

Showan Esmail Asyabi is a research assistant and a Ph.D. student at Boston University. He is advised by Prof. Azer Bestavros. His research focuses on designing and building system software for emerging computing models (e.g., cloud computing and stream processing). He has designed several CPU schedulers for virtualized and non-virtualized data centers to reduce power consumption or improve the performance of I/O-bound workloads. He is mainly interested in distributed systems, operating systems, cloud computing, and big data.

In-application CPU Scheduling to Reduce Power Consumption of In-memory Key-Value Stores
Click for abstract

The traffic load sent to key-value (KV) stores varies over timescales ranging from hours down to a few microseconds. Long-term variations present the opportunity to save power during periods of low or medium utilization. Several techniques exist to save power in servers, including feedback-based controllers that right-size the number of allocated CPU cores, dynamic voltage and frequency scaling (DVFS), and c-state (idle-state) mechanisms. In this talk, we demonstrate that existing power-saving techniques are not effective for KV stores. This is because the high rate of traffic, even under low load, prevents the system from entering low-power states for extended periods of time. To achieve power savings, we must unbalance the load among the CPU cores so that some of them can enter low-power states during periods of low load. We accomplish this by introducing the notion of in-application CPU scheduling. Instead of relying on the kernel to schedule threads, we pin threads to bypass the kernel CPU scheduler and then perform the scheduling within the KV store application. Our design is a KV store that features an in-application CPU scheduler that monitors the load to learn the workload characteristics and then scales the number of active CPU cores when the load drops, leading to notable power savings during periods of low or medium utilization. Our experiments demonstrate that our system outperforms state-of-the-art approaches such as Rubik and µDPM by up to 6x and 4x, respectively.
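
The in-application scheduling idea, pinning threads yourself and then parking workers so whole cores can idle, can be sketched as below. This is a Linux-only toy (os.sched_setaffinity with pid 0 pins the calling thread), not the paper's system or its learning-based controller.

```python
import os, queue, threading

tasks = queue.Queue()
gates = [threading.Event() for _ in range(os.cpu_count())]

def worker(core, gate):
    os.sched_setaffinity(0, {core})   # Linux: pid 0 pins the calling thread
    while True:
        gate.wait()                   # a parked worker lets its core go idle
        try:
            tasks.get(timeout=0.1)()  # run one queued task (a callable)
        except queue.Empty:
            pass

for core, gate in enumerate(gates):
    gate.set()
    threading.Thread(target=worker, args=(core, gate), daemon=True).start()

def rescale(load_is_low):
    """Toy controller: under low load, park half the cores so they can reach
    deep idle states; wake them again when load rises."""
    for gate in gates[len(gates) // 2:]:
        (gate.clear if load_is_low else gate.set)()
```

The key departure from kernel scheduling is deliberate imbalance: instead of spreading light load across all cores, work is concentrated on a few so the rest can sleep.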

Friday, Feb 07 Ju Hyoung Mun
(BU)
Bio

Ju Hyoung is currently a postdoctoral associate in the Department of Computer Science at Boston University, working with Prof. Manos Athanassoulis. Her research interest lies in the fields of Data Systems and Networking; particularly, in enhancing lookup performance. She received her Ph.D. in Electronic and Electrical Engineering at Ewha Womans University in Seoul, Korea, under the supervision of Prof. Hyesook Lim. Her doctoral research focused on how to retrieve data efficiently by using Bloom filters in Information Centric Networking (ICN).

Reducing Hash Overhead of Bloom Filters in LSM-trees
Click for abstract

A log-structured merge-tree (LSM-tree) is widely used in modern key-value stores. To balance read performance and ingestion rate, LSM-trees maintain sorted runs in multiple levels, so a lookup may require multiple I/Os. To accelerate point lookups, LSM-trees often employ Bloom filters in main memory to avoid unnecessary I/Os for levels that do not contain the matching key. Using Bloom filters is worthwhile when there is a vast gap between the cost of accessing them (hashing and probing memory) and the cost of accessing storage. As the access latency of modern storage devices like NVMe SSDs becomes comparable with hashing, the CPU cost of hashing becomes the performance bottleneck of lookups and hinders other CPU-intensive tasks. Hashing is crucial for LSM-trees since, by design, they have increased read amplification to mitigate insertion cost. To alleviate the hashing overhead, we propose to share the hash calculation among multiple levels and re-use it across the tree hierarchy. Furthermore, by sharing the hash calculation, we know in advance which sections of the Bloom filters will be probed at the different levels of the hierarchy. Using this property, we also propose to prefetch the necessary Bloom filter sections to further improve lookup performance. Simulation results show that our proposal achieves almost the same false-positive ratio as the state-of-the-art while keeping the hashing overhead constant.
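
The hash-sharing idea is straightforward to sketch: derive all probe positions from a single hash computation per key, then reuse them for the filter at every level. The sizes and the double-hashing scheme below are illustrative choices, not the paper's exact design.

```python
import hashlib

M = 1 << 16   # bits per filter (illustrative)
K = 4         # probes per lookup

def probe_positions(key):
    """Hash ONCE per lookup; every level reuses the same positions
    (Kirsch-Mitzenmacher double hashing)."""
    d = hashlib.blake2b(key.encode(), digest_size=16).digest()
    h1 = int.from_bytes(d[:8], "little")
    h2 = int.from_bytes(d[8:], "little") | 1
    return [(h1 + i * h2) % M for i in range(K)]

class LevelFilter:
    def __init__(self):
        self.bits = bytearray(M // 8)

    def add(self, positions):
        for p in positions:
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, positions):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in positions)

levels = [LevelFilter() for _ in range(4)]
levels[2].add(probe_positions("user42"))

pos = probe_positions("user42")                    # one hash computation...
print([lvl.might_contain(pos) for lvl in levels])  # ...probed at all 4 levels
```

Because the positions are known before any filter is touched, the relevant filter sections can also be prefetched, which is the second half of the proposal.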

Spring 2020: Tarikul Islam Papon, e-mail: papon at bu dot edu
Fall 2019 Calendar
Seminars are held in the old MCS B33 at 11.00 am [unless otherwise notified], followed by lunch.
Date Speaker Title and Abstract
Friday, Dec 06 Brian Hentschel
Bio

Brian is a PhD student at Harvard University in the Data Systems Laboratory. His work focuses on the usage of statistics and machine learning in making better data structures and algorithms, and conversely in the creation of data management solutions for better machine learning. Previous work of his has won the ACM SIGMOD Research Highlight Award as well as the Best Paper Award at EDBT.

Temporally Biased Sampling & Online Model Management
Click for abstract

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified "decay function". We then periodically retrain the models on the current sample. Time-biasing lets the models adapt to recent changes in the data while - unlike in a sliding-window approach - still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
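
A simplified Bernoulli variant conveys the core decay semantics (this is in the spirit of T-TBS; the paper's R-TBS additionally guarantees a bounded sample size): at each time step every retained item survives with probability e^(-λ), so an item that arrived d steps ago remains with probability reduced by e^(-λd).

```python
import math, random

def time_biased_sample(batches, decay=0.1, accept=0.5, seed=0):
    """Simplified Bernoulli time-biased sampling; illustrative of the decay
    semantics, not the paper's size-controlled algorithms."""
    rng = random.Random(seed)
    keep = math.exp(-decay)          # per-step survival probability
    sample = []
    for batch in batches:            # one batch of (time, item) per step
        sample = [x for x in sample if rng.random() < keep]    # decay old items
        sample += [x for x in batch if rng.random() < accept]  # admit new ones
    return sample

batches = [[(t, i) for i in range(100)] for t in range(50)]
s = time_biased_sample(batches)
ages = [49 - t for t, _ in s]
print(len(s), sum(ages) / len(s))    # recent items dominate the sample
```

Retraining a model on such a sample adapts it to recent data while retaining enough history for robustness, exactly the trade-off a sliding window cannot make.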

Monday, Nov 18 Stella Pantela
Bio

Stella is a senior distributed systems software engineer at Vertica, chair of the Vertica DataGals, and a member of the Vertica patent committee. At Vertica, Stella builds powerful infrastructure used by big data analytics companies including Uber, Etsy, Wayfair, and Philips. Her focus is on improving distributed systems protocols for more efficient maintenance and transfer of metadata in both cloud and enterprise environments. Stella's work has been featured at SIGMOD and North East Database Day. Before Vertica, Stella completed her B.A. with Honors in Computer Science at Harvard University in 2015, where she also conducted research in the area of column stores with Professor Stratos Idreos. In her free time she maintains a blog on distributed systems and technology at stylianipantela.com.

Guest lecture for CS460: Introduction to Databases [4:30-5:45pm, EPC 207]
Friday, Nov 08 Shreyas Sekar
Bio

Shreyas Sekar is a Postdoctoral Fellow at the Laboratory for Innovation Science at the Harvard Business School. Shreyas received his Ph.D. in Computer Science from Rensselaer Polytechnic Institute in 2017, where he was awarded the Robert McNaughton Prize for the best dissertation in CS. His dissertation was on Non-Discriminative Algorithmic Pricing. Broadly speaking, his research is centered on online decision making and learning in dynamic and strategic environments, with applications to e-commerce, transportation, and digital marketplaces. Currently, he is excited by questions pertaining to information design, i.e., how digital firms can leverage information as a tool for revenue management. Shreyas has also collaborated with industry partners such as Wayfair and Freelancer to help develop prescriptive strategies for optimizing core metrics.

Learning to Rank an Assortment of Products
Click for abstract

We consider the product ranking challenge that online retailers face when their customers typically behave as "window shoppers": they form an impression of the assortment after browsing products ranked in the initial positions and then decide whether to continue browsing. Further, customers' product preferences and attention spans are correlated and unknown to the retailer. We develop a class of online explore-then-exploit algorithms that prescribe a ranking to offer each customer, learning from preceding customers' clickstream data to offer better rankings to subsequent customers. Our algorithms balance product popularity with diversity: the notion of appealing to a large variety of heterogeneous customers. We prove that our learning algorithms rapidly converge to a ranking that matches the best-known approximation factors for the offline, complete information setting. Finally, we partner with Wayfair - a multi-billion dollar home goods online retailer - to estimate the impact of our algorithm in practice via simulations using actual clickstream data, and we find that our algorithm yields a significant increase (5-30%) in the number of customers that engage with the site.
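
The paper's algorithms are considerably more refined (in particular, they balance popularity with diversity); the toy loop below only illustrates the explore-then-exploit skeleton under hypothetical names, with a "window shopper" click model that stops browsing after the first engaging product:

    import random
    from collections import defaultdict

    def explore_then_exploit(products, customers, explore_rounds, top_k):
        """Toy explore-then-exploit ranking loop: random rankings during
        exploration estimate per-product click rates from clickstreams;
        afterwards, products are ranked by estimated attractiveness."""
        shows, clicks = defaultdict(int), defaultdict(int)
        ranking = list(products)[:top_k]
        for t, clicked in enumerate(customers):
            if t < explore_rounds:
                ranking = random.sample(products, top_k)      # explore
            else:
                ranking = sorted(products, reverse=True,      # exploit
                                 key=lambda p: clicks[p] / max(shows[p], 1))[:top_k]
            for p in ranking:             # window shopper browses top-down
                shows[p] += 1
                if clicked(p):            # clicked: product -> bool
                    clicks[p] += 1
                    break                 # stops browsing after engaging
        return ranking

    random.seed(0)
    products = list(range(20))
    appeal = {p: random.random() * 0.3 for p in products}
    customers = [lambda p: random.random() < appeal[p] for _ in range(2000)]
    print(explore_then_exploit(products, customers, 500, 5))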

Friday, Nov 01 Leonardo Torres
Bio

Leo is a 4th year PhD student at the Network Science Institute at Northeastern University, under the advisement of Tina Eliassi-Rad. He is broadly interested in the intersection of Network Science, Machine Learning, and Mathematics, focusing on metric aspects of complex networks and graph mining. He specializes in introducing novel mathematical and computational tools to the network science toolbox, usually coming from differential geometry, algebraic topology, spectral linear algebra, and unsupervised techniques. Leo has a B.S. in Mathematics from Pontificia Universidad Católica del Perú. He has taught mathematics and Spanish at the undergraduate level and he was a Research Programmer at Wolfram Research South America, working on the Wolfram|Alpha knowledge engine.

Non-Backtracking Cycles: Length Spectrum Theory and Graph Mining Applications
Click for abstract

Two basic tasks in graph analysis are: (1) computing the distance between two graphs and (2) embedding of the graph elements (i.e., nodes or links) into a lower-dimensional space. The former task has numerous applications from k-nearest neighbor search, to clustering a collection of graphs, to transfer learning. Unfortunately, there exists no canonical way of computing the distance between two finite graphs because the dissimilarity problem is ill-defined. Despite this fact, the relevant literature contains many methods for measuring graph distance with different heuristics, computational efficiency, interpretability, and theoretical soundness. We introduce a graph distance, called the Non-Backtracking Spectral Distance (NBD), which is theoretically sound and interpretable. NBD is based on a mathematical construct from algebraic topology (in particular, homotopy theory) called the length spectrum. The length spectrum of a graph is a function defined on the graph's first homotopy group (a.k.a. the fundamental group); and it uniquely characterizes a graph's 2-core up to isometry. Thus, if two graphs have the same length spectra, then their 2-cores are isometric. We show the relationship between a graph's length spectrum and its non-backtracking cycles; and present a method based on computing the eigenvalues of a graph's non-backtracking matrix. For the second task (i.e., graph embedding), most existing methods are stochastic and depend on black-box models such as deep networks. Both of these characteristics make their output difficult to analyze. We propose the Non-Backtracking Embedding Dimensions (NBED) for finding a graph embedding in low-dimensional space by computing the eigenvectors of the non-backtracking matrix. Both NBD and NBED are interpretable in terms of features of complex networks such as hubs, triangles, edge centrality, and communities. We showcase the usefulness of both NBD and NBED in experiments on synthetic and real-world networks.
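
To ground the central object, here is a small sketch (the standard construction, our code) that builds the non-backtracking matrix of an undirected graph, whose rows and columns are directed edges and whose entry for the pair \( (u \to v), (v \to w) \) is 1 iff \( w \neq u \), and returns its eigenvalues:

    import numpy as np

    def non_backtracking_eigenvalues(edges):
        """Build the non-backtracking (Hashimoto) matrix B of an
        undirected graph: rows/columns are directed edges, and
        B[(u,v),(v,w)] = 1 iff w != u. Returns its eigenvalues, the
        raw material of the NBD distance described above."""
        directed = [(u, v) for u, v in edges] + [(v, u) for u, v in edges]
        index = {e: i for i, e in enumerate(directed)}
        B = np.zeros((len(directed), len(directed)))
        for (u, v) in directed:
            for (x, w) in directed:
                if x == v and w != u:   # (u->v) may continue into (v->w)
                    B[index[(u, v)], index[(x, w)]] = 1.0
        return np.linalg.eigvals(B)

    # Example: a triangle (the 2-core) with a pendant edge
    eigs = non_backtracking_eigenvalues([(0, 1), (1, 2), (2, 0), (2, 3)])
    print(np.round(sorted(eigs, key=abs, reverse=True)[:3], 3))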

Friday, Oct 25 Matteo Riondato
Bio

Matteo Riondato is an assistant professor of computer science at Amherst College, and a visiting faculty at Brown University. Previously he was a research scientist at Two Sigma, and a postdoc at Stanford and Brown. He obtained his PhD in computer science from Brown. His research focuses on algorithms for knowledge discovery, data mining, and machine learning: he develops methods to analyze rich datasets, including graphs and time series, as fast as possible and in a statistically sound way. His works received best-of-conference awards at the 2014 SIAM International Conference on Data Mining (SDM), the 2016 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), and the 2018 IEEE International Conference on Data Mining (ICDM). He tweets @teorionda and lives at http://matteo.rionda.to.

SPuManTe: Significant Pattern Mining with Unconditional Testing
Click for abstract

We present SPuManTE, an efficient algorithm for mining significant patterns from a transactional dataset. SPuManTE controls the Family-wise Error Rate: it ensures that the probability of reporting one or more false discoveries is less than a user-specified threshold. A key ingredient of SPuManTE is UT, our novel unconditional statistical test for evaluating the significance of a pattern, which requires fewer assumptions on the data-generation process and is more appropriate for a knowledge discovery setting than classical conditional tests, such as the widely used Fisher's exact test. Computational requirements have limited the use of unconditional tests in significant pattern discovery, but UT overcomes this issue by obtaining the required probabilities in a novel, efficient way. SPuManTE combines UT with recent results on the supremum of the deviations of pattern frequencies from their expectations, grounded in statistical learning theory. Joint work with Leonardo Pellegrina (UniPD) and Fabio Vandin (UniPD), presented at KDD'19.
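
UT itself is the paper's own construction, but SciPy's implementation of Barnard's exact test can serve as a stand-in to illustrate the conditional/unconditional distinction on a single 2x2 contingency table (pattern presence vs. class label); the table values below are made up for the example:

    import numpy as np
    from scipy.stats import fisher_exact, barnard_exact

    # 2x2 contingency table: rows = pattern present/absent in a
    # transaction, columns = transaction label (e.g., class 1 vs 0).
    table = np.array([[12, 3],
                      [5, 10]])

    # Fisher's test conditions on the table margins being fixed...
    _, p_conditional = fisher_exact(table, alternative="two-sided")
    # ...while Barnard's unconditional test does not, which better
    # matches a data-generation process with random margins.
    p_unconditional = barnard_exact(table, alternative="two-sided").pvalue

    print(f"Fisher (conditional):    p = {p_conditional:.4f}")
    print(f"Barnard (unconditional): p = {p_unconditional:.4f}")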

Friday, Oct 18 Kiran Garimella
Bio

Kiran Garimella is the Michael Hammer postdoc at the MIT Institute for Data, Systems, and Society (IDSS). Before joining MIT, he was a postdoc at EPFL, Switzerland. His research focuses on using digital data for social good, including areas like polarization, misinformation and human migration. His work on studying and mitigating polarization on social media won the best student paper awards at WSDM 2017 and WebScience 2017. Kiran received his PhD at Aalto University, Finland, and Masters & Bachelors from IIIT Hyderabad, India. Prior to his PhD, he worked as a Research Engineer at Yahoo Research, Barcelona, and QCRI, Doha.

A first look at WhatsApp data — Problems, and Challenges
Click for abstract

In this work, I will talk about our ongoing efforts to collect and analyze large amounts of public WhatsApp data from political groups in Brazil, India, and Indonesia. I will discuss our efforts in monitoring a large number of WhatsApp groups and provide insights on various problems arising from such data. I will specifically give examples of novel ways in which misinformation, hate speech, and coordination spread through WhatsApp. I will briefly discuss some of the technical challenges that arise from the data, specifically related to identifying image-based misinformation. Finally, I will mention our efforts to collect data across platforms for the Indian election and why such cross-platform analysis is required to understand the spread of issues like misinformation.

Friday, Sep 27 Loqman Salamatian
Bio

Loqman is a Master's student in mathematics at Sorbonne University. His main interests lie at the crossroads of Complex Networks, Differential Geometry, and Optimal Transport. He is currently working at CAIDA on understanding the topology of the Internet and its relationship with geography.

Monitoring the Internet through Riemannian Geometry
Click for abstract

In this work, we tackle the problem of real-time monitoring of dynamic variations of the global Internet using topology-change information obtained by combining several BGP feeds. We develop a geometric approach that leverages notions of geometric curvature and recent developments in graph embedding based on Ollivier-Ricci curvature. In particular, we use our method to detect major events and changes via the geometry of the embedding of the graph, where a major event is defined as an incident that either impacts a substantial number of prefixes or can be observed by numerous route monitors within a given period of time. We first describe the Ricci curvature approach to characterizing the AS-level graph. The key idea is to detect changes between consecutive snapshots and to separate events that result in considerable geometric alterations from those that remain local. The variations of curvature are evaluated through a set of landmarks used as reference points. We then introduce an anomaly tracking mechanism to detect large variations of curvature that translate into major variations in the geometry of the graph. These changes are then considered major BGP-level events. We describe a mechanism for identifying the network elements responsible for a set of coordinated changes and for isolating their implications in the geometric space. We finally evaluate this system in an operational setting.
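
As background, the Ollivier-Ricci curvature that drives this embedding is defined, for an edge \( (x, y) \), as \( \kappa(x, y) = 1 - W_1(m_x, m_y) / d(x, y) \), where \( m_x \) is a probability measure spread (for example, uniformly) over the neighbors of \( x \), \( W_1 \) is the 1-Wasserstein (optimal transport) distance between the two neighborhood measures, and \( d(x, y) \) is the graph distance; edges inside well-connected clusters tend to have positive curvature, while bridge-like edges tend to have negative curvature.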

Friday, Sep 20 Charalampos (Babis) Tsourakakis Optimal learning of joint alignments
Click for abstract

We consider the following problem, which is useful in applications such as joint image and shape alignment. The goal is to recover \( n \) discrete variables \( g_i \in \{0, \ldots, k-1\} \) (up to some global offset) given noisy observations of a set of their pairwise differences \( \{g_i - g_j \bmod k\} \); specifically, with probability \( \frac{1}{k}+\delta \) for some \( \delta > 0 \) one obtains the correct answer, and with the remaining probability one obtains a uniformly random incorrect answer. We consider a learning-based formulation where one can perform a query to observe a pairwise difference, and the goal is to perform as few queries as possible while obtaining the exact joint alignment. We provide an easy-to-implement, time-efficient algorithm that performs \( O\left(n \lg n / (k \delta^2)\right) \) queries and recovers the joint alignment with high probability. We also show that our algorithm is optimal. This work significantly improves on the work of Chen and Candès, who view the problem as a constrained principal components analysis problem that can be solved using the power method. Our approach is simpler in both the algorithm and the analysis, and provides additional insights into the problem structure. Finally, our algorithm performs well experimentally compared to the algorithm of Chen and Candès, in terms of both accuracy and running time.
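
The paper's algorithm is more refined, but a simple baseline under the same query model (our sketch, not the paper's method) already conveys the setup: simulate the noisy oracle, repeatedly query each node's difference to a fixed reference node, and take the majority answer.

    import random
    from collections import Counter

    def noisy_diff_oracle(g, i, j, k, delta):
        """Return g[i]-g[j] mod k with probability 1/k + delta, else a
        uniformly random wrong answer (the query model from the abstract)."""
        truth = (g[i] - g[j]) % k
        if random.random() < 1.0 / k + delta:
            return truth
        return random.choice([a for a in range(k) if a != truth])

    def align_by_majority(g, k, delta, repeats):
        """Baseline: query every node's difference to node 0 repeatedly
        and take the majority answer; for enough repeats this recovers
        the alignment up to a global offset with high probability."""
        n = len(g)
        est = [0] * n
        for i in range(1, n):
            votes = Counter(noisy_diff_oracle(g, i, 0, k, delta)
                            for _ in range(repeats))
            est[i] = votes.most_common(1)[0][0]
        return est  # est[i] estimates g[i] - g[0] mod k

    random.seed(7)
    k, delta, n = 5, 0.3, 30
    g = [random.randrange(k) for _ in range(n)]
    est = align_by_majority(g, k, delta, repeats=60)
    print(all(est[i] == (g[i] - g[0]) % k for i in range(n)))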

Tuesday, Sep 03 Tianyi Chen and Konstantinos Sotiropoulos Novel Dense Subgraph Discovery Primitives: Risk Aversion and Exclusion Queries and TwitterMancer: Predicting User Interactions on Twitter
Click for abstract

Novel Dense Subgraph Discovery Primitives: Risk Aversion and Exclusion Queries. In the densest subgraph problem, given an undirected graph \( G(V, E, w) \) with non-negative edge weights, we are asked to find a subset of nodes \( S \subseteq V \) that maximizes the degree density \( w(S)/|S| \). This problem is solvable in polynomial time and is well studied in general. But what happens when we are asked questions such as "how can we find a dense subgraph in Twitter with lots of replies and mentions but no follows?", or "how do we extract a dense subgraph with high expected reward and low risk from an uncertain graph"? In this work, specifically, we formulate both problems mathematically as special instances of dense subgraph discovery in graphs with negative weights. We prove that the problem in general is NP-hard, but we also provide sufficient conditions under which it is poly-time solvable. We design an efficient approximation algorithm that works best in the presence of small negative weights, and an effective heuristic for the more general case. Finally, we perform experiments on various real-world datasets that verify the value of the proposed primitives and the effectiveness of our proposed algorithms.
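
For the classic non-negative setting that these primitives generalize, a standard baseline is Charikar's greedy peeling 2-approximation; the sketch below (our code, unweighted for simplicity, not the paper's negative-weight algorithm) repeatedly removes a minimum-degree node and keeps the densest intermediate subgraph:

    import heapq
    from collections import defaultdict

    def greedy_peel_densest(edges):
        """Charikar's greedy peeling 2-approximation for the classic
        unweighted densest subgraph problem: repeatedly remove a
        minimum-degree node, tracking the density |E(S)|/|S| of the
        remaining subgraph."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        n = len(adj)
        m = sum(len(nbrs) for nbrs in adj.values()) // 2
        best_density, best_size = (m / n if n else 0.0), n
        heap = [(len(adj[u]), u) for u in adj]
        heapq.heapify(heap)
        removed, order = set(), []
        while heap:
            d, u = heapq.heappop(heap)
            if u in removed or d != len(adj[u]):
                continue                      # stale heap entry, skip
            removed.add(u)
            order.append(u)
            m -= d
            for v in adj[u]:
                adj[v].discard(u)
                heapq.heappush(heap, (len(adj[v]), v))
            adj[u].clear()
            rest = n - len(removed)
            if rest and m / rest > best_density:
                best_density, best_size = m / rest, rest
        # The densest subgraph consists of the last best_size nodes peeled.
        return best_density, set(order[n - best_size:])

    # A dense 4-clique hanging off a sparse path
    edges = [(0,1), (0,2), (0,3), (1,2), (1,3), (2,3), (3,4), (4,5)]
    print(greedy_peel_densest(edges))  # density 1.5 on the clique {0,1,2,3}
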
TwitterMancer: Predicting User Interactions on Twitter. This work investigates the interplay between different types of user interactions on Twitter, with respect to predicting missing or unseen interactions. For example, given a set of retweet interactions between Twitter users, how accurately can we predict reply interactions? Is it more difficult to predict retweet or quote interactions between a pair of accounts? Also, which features of interaction patterns are most important to enable accurate prediction of specific Twitter interactions? Our empirical study of Twitter interactions contributes initial answers to these questions. We have crawled an extensive dataset of Twitter accounts and their follow, quote, retweet, reply interactions over a period of a month. Using a machine learning framework, we find we can accurately predict many interactions of Twitter users. Interestingly, the most predictive features vary with the user profiles, and are not the same across all users. For example, for a pair of users that interact with a large number of other Twitter users, we find that certain "higher-dimensional" triads, i.e., triads that involve multiple types of interactions, are very informative, whereas for less active Twitter users, certain in-degrees and out-degrees play a major role. Finally, we provide various other insights on Twitter user behavior.

Fall 2019: Konstantinos Sotiropoulos, e-mail: ksotirop at bu dot edu