MiDAS Seminar

Spring 2021 Calendar
Because of COVID-19, seminars are held online through Zoom at 11.00 AM [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Apr 23 Yannis Chronis (University of Wisconsin-Madison)
Bio

Yannis is a fourth-year Ph.D. student in the Computer Sciences Department at the University of Wisconsin-Madison advised by Prof. Jignesh Patel. His research focuses on efficient data processing using modern and emerging hardware.

Castle: Accelerating analytical query processing with associative computing
Abstract

We are in the midst of a Cambrian hardware evolution in which a rich variety of technologies and architectures are being invented with a flurry that we haven't seen in a long time. One class of hardware devices, called Processing-in-Memory (PIM), is enabling new ways of utilizing memories. PIM devices generally pack storage and computing closely together and have novel computing primitives that are built into the memory devices. Associative processors are a type of PIM device with massive data-level parallelism and single-cycle search and update capabilities. The focus of our work is on exploring how database analytic workloads can be run efficiently using an associative computing substrate. Our vehicle for this research is the Castle project, in which we are building a data processing engine to run on the CAPE hardware, a specific implementation of an associative PIM device. Initial results show that associative PIM hardware can potentially deliver manifold performance gains over traditional von Neumann architectures; however, to fully exploit CAPE's potential, data query processing kernels and query optimization methods both have to be adapted in new ways to leverage the style of data-parallel computing that can be carried out directly inside CAPE. We experiment with both individual query processing kernels and end-to-end queries and show that the performance gains using our methods are significant, up to 46X in some cases. Tentatively, these results point to the exciting potential of accelerating database analytic query processing workloads using associative computing.

Friday, Feb 12 Tarikul Islam Papon (Boston University)
Bio

Papon is a second-year Ph.D. student and a member of the MiDAS group at BU, advised by Prof. Athanassoulis. His research interests broadly include data systems, specifically hardware-aware data management challenges stemming from the evolution of storage and memory devices. Currently, he is working on a new design paradigm for interacting with durable storage that takes into account the performance asymmetry between read and write operations, as well as the variable access concurrency that different devices may support. Prior to joining BU, Papon served as a lecturer for 4 years in the Department of Computer Science & Engineering (CSE) at Bangladesh University of Engineering & Technology (BUET). He completed his MSc and BSc degrees in the same department, where he worked on medical informatics using various machine learning and embedded systems techniques.

The Case for A Parametric I/O Model
Abstract

Storage devices have evolved to offer increasingly faster read/write access, through flash-based and other solid-state storage technologies. When compared to classical rotating hard disk drives (HDDs), modern solid-state drives (SSDs) have two key differences: (i) the absence of mechanical parts, and (ii) an inherent difference between the process of reading and writing. The former removes a key performance bottleneck, enabling internal device parallelism, whereas the latter manifests as a read/write performance asymmetry. In other words, modern storage devices can serve multiple concurrent I/Os, and writes are generally slower than reads; neither of which is true for HDDs. To address this mismatch, we propose a simple yet expressive storage model, termed the Parametric I/O Model (PIO), that captures modern (and future) devices by parameterizing read/write asymmetry (α) and access concurrency (k). PIO enables making device-specific decisions at algorithm design time, rather than as an optimization during deployment. To showcase the benefits of algorithm design using PIO, we introduce a new family of bufferpool eviction strategies that offer performance improvements by accurately modeling the behavior of the underlying devices. We compare their performance on various synthetic traces with three state-of-the-art algorithms (LRU, CFLRU, LRU-WSR) to show that they achieve 15.9%–44.5% improvement in runtime with a negligible increase in total writes (≤0.06%) and buffer misses (≤0.004%). We further report up to 1.83x speedup for TPC-C traces on PostgreSQL.
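To make the model concrete, here is a minimal sketch of what a PIO-style cost calculation could look like; the formula and the numbers are illustrative assumptions of ours, not taken from the paper: each write is charged α times a read, and up to k outstanding I/Os are served as one device round.

```python
# Illustrative sketch of a Parametric I/O (PIO)-style cost model.
# alpha: write/read latency asymmetry; k: I/Os the device serves concurrently.
# The parameter names follow the abstract; the formula is our simplification.
import math

def pio_cost(reads: int, writes: int, alpha: float, k: int) -> float:
    """Device rounds needed to serve the I/Os, in units of one read.

    Assumes reads and writes are issued in batches of up to k that
    complete together, and that each write costs alpha reads.
    """
    read_rounds = math.ceil(reads / k)
    write_rounds = math.ceil(writes / k)
    return read_rounds + alpha * write_rounds

# Trading a few extra reads for fewer writes wins on asymmetric devices:
print(pio_cost(reads=1000, writes=1000, alpha=3.0, k=8))  # asymmetry-unaware
print(pio_cost(reads=1200, writes=800, alpha=3.0, k=8))   # write-avoiding
```

Under such a model, an eviction policy that prefers evicting clean pages can justify a few extra reads with fewer write rounds, which is exactly the kind of decision the abstract argues should be made at algorithm design time.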

Friday, Jan 22 Nikolaos Tziavelis (Northeastern University)
Bio

Nikolaos Tziavelis is a PhD student at the Khoury College of Computer Sciences of Northeastern University. His research interests lie in algorithms for database query processing and distributed computing. He holds an MEng in Electrical and Computer Engineering from the National Technical University of Athens.

Any-k: Optimal Join Algorithms meet Top-k
Abstract

Top-k queries have been studied intensively in the database community, and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. However, optimality results for queries involving joins hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. In this talk, I will present the problem of ranked enumeration, which lies at the intersection of these two research areas. The goal is to return query results one after the other in order of their importance, with theoretical guarantees that are better suited for in-memory computation. I will focus on two families of algorithms that have their roots in the k-th shortest path problem. The first one is based on the Lawler-Murty partitioning procedure, while the second one uses a recursive approach that has been developed over several decades.
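As a toy illustration of ranked enumeration (not the Lawler-Murty or recursive algorithms from the talk), the sketch below enumerates the results of a two-list join in increasing weight order with a priority queue, so the i-th result is produced without ever materializing the full output:

```python
import heapq

def ranked_pairs(a, b):
    """Yield (a[i] + b[j], i, j) in increasing sum order.

    a and b must be sorted ascending. Each pair enters the heap once:
    (i, j+1) is always pushed; (i+1, 0) only while j == 0.
    """
    heap = [(a[0] + b[0], 0, 0)]
    while heap:
        s, i, j = heapq.heappop(heap)
        yield s, i, j
        if j + 1 < len(b):
            heapq.heappush(heap, (a[i] + b[j + 1], i, j + 1))
        if j == 0 and i + 1 < len(a):
            heapq.heappush(heap, (a[i + 1] + b[0], i + 1, 0))

a, b = [1, 3, 7], [2, 4, 5]
for rank, (s, i, j) in enumerate(ranked_pairs(a, b), 1):
    print(rank, s, (i, j))
    if rank == 4:   # any-k: stop after however many results are needed
        break
```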

Spring 2021: Zichen Zhu, e-mail: zczhu at bu dot edu
Fall 2020 Calendar
Because of COVID-19, seminars are held online through Zoom at 11.00 AM [unless otherwise notified].
Date Speaker Title and Abstract
Friday, Dec 18 Chen Luo (University of California Irvine)
Bio

Chen Luo received his PhD from the University of California, Irvine in December 2020 under the supervision of Professor Michael J. Carey. His research interests include storage management, indexing, and LSM-trees. In particular, he worked on optimizing LSM-trees for big data management systems. During his PhD studies, he interned at IBM Almaden (2017), working on HTAP indexing, and at Microsoft Research (2019), working on customized SSD controllers. Before that, he obtained his Master's degree from Tsinghua University in 2016 and his Bachelor's degree from Tongji University in 2013.

On Optimizing LSM-based Storage for Big Data Management Systems [Talk Recording]
Abstract

In recent years, the Log-Structured Merge-tree (LSM-tree) has been widely used in the storage layer of modern NoSQL systems. Different from traditional index structures that apply updates in-place, an LSM-tree first buffers all writes in memory and subsequently flushes them to disk and merges them using sequential I/Os. This out-of-place update design brings a number of advantages, including superior write performance, high space utilization, tunability, and simplification of concurrency control and recovery. These advantages have enabled LSM-trees to serve a large variety of workloads in production systems. Despite the popularity of LSM-trees, existing research efforts have primarily focused on improving the maximum throughput of LSM-trees in simple key-value store settings. This leads to several shortcomings when adopting LSM-based storage techniques in a Big Data Management System (BDMS) with multiple heterogeneous LSM-trees and with performance requirements beyond just the maximum throughput. In this talk, I will describe some of my PhD research on optimizing LSM-trees for BDMSs, including efficient maintenance of LSM-based auxiliary structures, minimizing write stalls via better scheduling, and adaptive memory management of LSM-trees.
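The out-of-place write path described above can be illustrated with a toy sketch; a real engine adds a write-ahead log, Bloom filters, and background compaction, all omitted here:

```python
# Toy LSM-tree write path: buffer updates in memory, flush sorted runs,
# and read by checking the memtable first, then runs from newest to oldest.

class ToyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []            # sorted runs, newest first
        self.limit = memtable_limit

    def put(self, key, value):    # out-of-place: never touches old runs
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.insert(0, sorted(self.memtable.items()))
            self.memtable = {}    # a flush is one sequential write

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:     # the newest version of a key wins
            for k, v in run:      # a real run would use binary search
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"k{i}", i)
print(db.get("k3"), db.get("k9"))
```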

Friday, Dec 11 Sudhanshu (Dan) Chanpuriya (UMass Amherst)
Bio

Sudhanshu Chanpuriya is a PhD student at the College of Information and Computer Sciences at UMass Amherst, where he is advised by Cameron Musco. His research interests lie at the intersection of machine learning and graph theory. Prior to joining UMass, he received his undergraduate degree from Dartmouth College.

Towards Understanding the Success and Limitations of Node Embeddings
Abstract

The node embedding task involves mapping nodes in networks to low-dimensional, continuous vectors for direct use in downstream machine learning algorithms, such as those for classification and clustering. This task has received great interest from the ML community in the past few years, and I will present some of our recent work, which tries to advance two parts of our understanding: (i) What are the limitations of node embeddings? In work appearing at NeurIPS 2020, we show that, contrary to recent doubts, node embeddings can capture fine network structures like triangles involving only low-degree nodes; in fact, they can exactly capture the full network structure at surprisingly low embedding dimensionalities. (ii) What has made recent node embedding algorithms more effective on downstream tasks? I will present our variant of the popular DeepWalk algorithm, InfiniteWalk, which attributes the success of DeepWalk to the addition of a nonlinear component to older algorithms based on spectral clustering.

Friday, Dec 04 Herodotos Herodotou (Cyprus University of Technology)
Bio

Herodotos Herodotou is an Assistant Professor in the Department of Electrical Engineering, Computer Engineering and Informatics at the Cyprus University of Technology, where he is leading the Data Intensive Computing Research Lab. He received his Ph.D. in Computer Science from Duke University. His Ph.D. dissertation work on building a self-tuning system for big data analytics received the ACM SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention as well as the Outstanding Ph.D. Dissertation Award in Computer Science at Duke. Before joining CUT, he held research positions at Microsoft Research, Yahoo! Labs, and Aster Data. His research interests are in large-scale Data Processing Systems, Database Systems, and Cloud Computing. In particular, his work focuses on ease-of-use, manageability, and automated tuning of both centralized and distributed data-intensive computing systems. In addition, he is interested in applying database techniques in other areas like maritime informatics, scientific computing, bioinformatics, and social computing. His research work to date has been published in several top scientific conferences and journals, two books, and two book chapters, and he actively participates in multiple European and nationally funded projects.

Automating Distributed Tiered Storage Management in Cluster Computing [Talk Recording]
Abstract

Data-intensive platforms such as Hadoop and Spark are routinely used to process massive amounts of data residing on distributed file systems like HDFS. Increasing memory sizes and new hardware technologies (e.g., NVRAM, SSDs) have recently led to the introduction of storage tiering in such settings. However, users are now burdened with the additional complexity of managing the multiple storage tiers and the data residing on them, while trying to optimize their workloads. In this talk, I will present OctopusFS, a novel distributed file system that is aware of heterogeneous storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes. Smart placement and retrieval policies employ multi-objective optimization techniques for making intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. In addition, redistribution policies employ machine learning for tracking and predicting file access patterns, which are used to decide when and which data to move up or down the storage tiers for increasing system performance. The approach uses incremental learning to dynamically refine the models with new file accesses, allowing them to naturally adjust and adapt to workload changes over time. Our extensive evaluation using realistic workloads derived from Facebook and CMU traces compares our approach with several other policies and showcases significant benefits in terms of both workload performance and cluster efficiency.

Friday, Nov 06 Abdul Wasay (Harvard)
Bio

I am a Ph.D. candidate at the Harvard John A. Paulson School of Engineering and Applied Sciences, where I am advised by Prof. Stratos Idreos. I design techniques to accelerate machine learning and data science pipelines, specifically, addressing the bottleneck of repeated data movement. Additionally, I am interested in mapping out the design space of deep neural network models to enable better design. Prior to joining Harvard, I was an undergraduate at Lahore University of Management Sciences (LUMS). In my undergraduate thesis, I investigated ways to improve buffer management in data center networks. During this time, I also spent a summer at École Polytechnique Fédérale de Lausanne (EPFL) working on techniques to enhance the privacy of social network graphs.

How and When to Build Neural Network Ensemble Models
Abstract

Neural network models are applied to many tasks with rising complexity and data sizes. Researchers and practitioners seek to scale the representational power of these models by adding more parameters. Increasing parameters, though, requires additional critical resources such as memory and compute time, leading to increased training and inference cost. Thus, a consistent challenge is to obtain as high an accuracy as possible within a parameter budget. As neural network designers navigate this complex landscape, they are guided by conventional wisdom, which comes from empirical studies of the design space. We identify a critical part of this design space that is not well understood: how to decide between the alternatives of expanding a single network model or increasing the number of networks and using them together in an ensemble. We study this question in detail through extensive experimentation across various network architectures and data sets. Crucially, we provide a robust assessment, missing in previous studies, by fixing the number of parameters. We also consider a holistic set of metrics such as training time, inference time, and memory usage. Contrary to conventional wisdom, we show that when we perform a holistic and robust assessment, we uncover a wide design space where ensembles not only provide better accuracy but also train faster and deploy at a speed comparable to single convolutional networks with the same total number of parameters.

Fall 2020: Tianyi Chen, e-mail: ctony at bu dot edu
Spring 2020 Calendar
Because of COVID-19, seminars are held online through Zoom at 11.00 AM [unless otherwise notified].
Date Speaker Title and Abstract
Friday, May 22 Zichen Zhu (BU)
Bio

Zichen Zhu is a first-year PhD student at Boston University advised by Prof. Manos Athanassoulis. Prior to this, he received his MPhil degree in 2019 from the University of Hong Kong. Zichen's current research interests are data access methods and LSM-based NoSQL key-value stores.

A Fast Workload-Driven Blocking Approach with Sub-optimal Skipping Effectiveness
Abstract

With rapidly growing data volumes, modern query engines are required to process enormous amounts of data in a timely manner. To speed up data access, a data skipping technique has been proposed: for each block of tuples, a bit vector (also known as a feature vector) is constructed, where each dimension represents whether the block contains a tuple qualifying for some specific filter. The effectiveness of data skipping therefore essentially depends on how we partition data into blocks. Existing skipping-oriented partitioning techniques assume the dimension of the feature vectors is given and model feature selection as a frequent-item mining problem with the dimension as the input. We argue that there may exist a more intelligent way to determine these parameters. In this work, a greedy partitioning technique is introduced that achieves sub-optimal skipping performance without specifying the dimension. Experiments further demonstrate the effectiveness of our algorithm on a synthetic workload and dataset.
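A minimal sketch of the feature-vector mechanism the abstract builds on (the filters and data here are made up; the talk's contribution is how to choose the features and partition the blocks, not this mechanism):

```python
# Feature-vector-based data skipping: each block stores one bit per workload
# filter saying whether any tuple in the block satisfies it; a query using
# filter f skips every block whose f-th bit is 0.

filters = [
    ("price<10", lambda t: t["price"] < 10),
    ("city=BOS", lambda t: t["city"] == "BOS"),
]

def build_feature_vector(block):
    return [any(pred(t) for t in block) for _, pred in filters]

def scan(blocks, vectors, filter_idx):
    pred = filters[filter_idx][1]
    hits, skipped = [], 0
    for block, vec in zip(blocks, vectors):
        if not vec[filter_idx]:
            skipped += 1          # no qualifying tuple: skip the whole block
            continue
        hits += [t for t in block if pred(t)]
    return hits, skipped

blocks = [
    [{"price": 5, "city": "BOS"}, {"price": 20, "city": "NYC"}],
    [{"price": 30, "city": "NYC"}, {"price": 25, "city": "SEA"}],
]
vectors = [build_feature_vector(b) for b in blocks]
hits, skipped = scan(blocks, vectors, 0)
print(hits, "blocks skipped:", skipped)
```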

Friday, May 15 Subhadeep Sarkar (BU)
Bio

Subhadeep Sarkar is a post-doctoral researcher at Boston University working with Prof. Manos Athanassoulis. Prior to this, Subhadeep spent nearly two years as a post-doctoral researcher at INRIA Rennes, France. He completed his Ph.D. in 2017 from Indian Institute of Technology Kharagpur. Subhadeep's current research interests are NoSQL key-value stores, data access methods, and data privacy.

Lethe: A Tunable Delete-Aware LSM Engine
Abstract

Data-intensive applications fueled the evolution of log-structured merge (LSM) based key-value engines that employ the out-of-place paradigm to support high ingestion rates with low read/write interference. These benefits, however, come at the cost of treating deletes as second-class citizens. A delete inserts a tombstone that invalidates older instances of the deleted key. State-of-the-art LSM engines do not provide guarantees as to how fast a tombstone will propagate to persist the deletion. Further, these LSM engines support deletion only on the sort key. To delete on another attribute (e.g., timestamp), the entire tree is read and re-written. We highlight that fast persistent deletion without affecting read performance is key to supporting: (i) streaming systems operating on a window of data, (ii) privacy with latency guarantees on the right-to-be-forgotten, and (iii) en masse cloud deployment of data systems, which makes storage a precious resource.
In this talk, I will present Lethe, an LSM-based key-value storage engine that introduces a set of delete-aware compaction policies and a new physical data storage layout, which, together, allow for efficient and timely persistent deletion of data. Lethe is capable of supporting any user-defined threshold for the delete persistence latency, offering higher read throughput and lower space amplification with a modest increase in write amplification. In addition, by physically storing data in an interweaved order on the sort and the delete key, Lethe supports efficient range deletes on a secondary (delete) key without employing a costly full-tree merge. Our paper on Lethe is set to appear at SIGMOD 2020.
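For readers unfamiliar with LSM deletes, the sketch below shows the basic tombstone semantics during a merge of two sorted runs; Lethe's contribution is the policies for when and how such merges happen, not this mechanism itself:

```python
# Delete-as-tombstone semantics during an LSM merge (illustrative only).
TOMBSTONE = object()

def merge_runs(newer, older):
    """Merge two sorted runs of (key, value); newer entries win.

    A tombstone in the newer run drops the older version. The tombstone
    itself must survive until no older run can still contain the key.
    """
    out, i, j = [], 0, 0
    while i < len(newer) or j < len(older):
        if j == len(older) or (i < len(newer) and newer[i][0] <= older[j][0]):
            if i < len(newer) and j < len(older) and newer[i][0] == older[j][0]:
                j += 1                    # older version invalidated
            out.append(newer[i]); i += 1
        else:
            out.append(older[j]); j += 1
    return out

newer = [("a", TOMBSTONE), ("c", 3)]
older = [("a", 1), ("b", 2)]
print([(k, "DEL" if v is TOMBSTONE else v) for k, v in merge_runs(newer, older)])
# [('a', 'DEL'), ('b', 2), ('c', 3)]
```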

Friday, May 08 Ju Hyoung Mun (BU)
Bio

Ju Hyoung Mun is currently a postdoctoral associate in the Department of Computer Science at Boston University, working with Prof. Manos Athanassoulis. She received her Ph.D. in Electronic and Electrical Engineering at Ewha Womans University in Seoul, Korea, under the supervision of Prof. Hyesook Lim. Her doctoral research focused on how to retrieve data efficiently using Bloom filters in Named Data Networking (NDN). Her research interest lies in the fields of Data Systems and Networking, particularly in enhancing lookup performance.

FIB Sharing and Cache Sharing using Bloom filters in NDN
Abstract

The major portion of the Internet bandwidth is now being used for content distribution like video streaming. The conventional Internet infrastructure, which is based on a host-to-host communication model, is not efficient for content distribution. In order to accommodate today's communication needs, Named Data Networking (NDN) has been introduced as a future Internet architecture. NDN technology has two main characteristics: name-based content retrieval and in-network caching. There are many research challenges for the realization of NDN. This talk presents two methods of information sharing using Bloom filters in NDN: (1) Content Store sharing, which shares cache summaries with neighboring routers to increase the diversity of cached contents in a network as a whole, and (2) FIB table sharing, which reduces the traffic and the size of the FIB tables.
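Both methods rest on exchanging compact Bloom filter summaries; a minimal Bloom filter sketch follows (the sizing and the NDN packet formats are out of scope, and the parameters here are illustrative):

```python
# Minimal Bloom filter of the kind a router could share as a cache summary.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=4):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0                      # integer used as a bitmap

    def _positions(self, item: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: str):
        # False positives possible, false negatives impossible.
        return all((self.bits >> p) & 1 for p in self._positions(item))

summary = BloomFilter()
summary.add("/videos/clip42")
print("/videos/clip42" in summary, "/videos/clip7" in summary)
```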

Friday, May 01 Tarikul Islam Papon (BU)
Bio

Papon is a first-year Ph.D. student and a member of the MiDAS group at BU, advised by Prof. Athanassoulis. His research interests broadly include data systems, specifically hardware-aware data management challenges stemming from the evolution of storage and memory devices. Currently, he is working on a new design paradigm for interacting with durable storage that takes into account the performance asymmetry between read and write operations, as well as the variable access concurrency that different devices may support.
Prior to joining BU, Papon served as a lecturer for 4 years in the Department of Computer Science & Engineering (CSE) at Bangladesh University of Engineering & Technology (BUET). He completed his MSc and BSc degrees in the same department, where he worked on medical informatics using various machine learning and embedded systems techniques.

The Case for A Parameterized I/O Model
Abstract

The traditional I/O model (EM Model) was developed around the properties of HDDs and, although still widely used, has not evolved with the underlying hardware. The EM Model assumes that read and write accesses have the same cost (symmetric cost), whereas most modern flash devices and emerging non-volatile memory (NVM) technologies show an asymmetric cost: writes are significantly slower than reads. In addition, these devices provide some degree of access concurrency, which cannot be modeled with the EM Model because it considers only one I/O at a time. The EM Model thus cannot capture these key properties of modern memory technologies and is unable to exploit the full benefits of contemporary devices.
We propose a new Parameterized I/O Model, PIO(M, k, α), that captures present-day storage devices by considering the read/write asymmetry (α) as well as the variable access concurrency (k) that different devices may support. In addition, we aim to propose a new database buffer pool page eviction algorithm for flash devices and to compare its performance with the traditional algorithms under our proposed PIO(M, k, α) model.

Friday, Apr 24 Subarna Chatterjee (Harvard)
Bio

Subarna Chatterjee has been a post-doc at the Data Systems Lab (DASLab) at Harvard University since January 2019. Prior to this, she was a post-doc with the Myriads team at Inria, Rennes, in France. She completed her Ph.D. at the Indian Institute of Technology Kharagpur, India, between 2013 and 2017. Her current research interests are NoSQL storage engines and reasoning about their cost and performance on the cloud.

Cosine: A Cloud-Cost Optimized NoSQL Storage Engine
Abstract

We present a key-value storage engine, Cosine, that guarantees an optimal cost-performance trade-off, given a workload and a budget. Cosine creates a massive search space comprising the entire data structure design and hardware space of (LSM-tree/B-tree) key-value stores over diverse cloud pricing policies for the top three cloud providers – AWS, GCP, and Azure. To prune configurations from this massive search space, we present distribution-aware cost models that precisely estimate the I/O cost of each possible data structure design, which, in turn, helps in comparing pairs of designs. By using the cost models, Cosine reduces the massive dimensionality of the search space into a continuum of holistically optimal configurations. Cosine also enables decision makers in applications to quickly answer rich what-if questions about the changes in workload performance and cost as the design, hardware, or cloud provider changes.

Friday, Apr 17 Andy Huynh (BU)
Bio

Andy is a third-year PhD student in the MiDAS group at Boston University, where he is advised by Professor Manos Athanassoulis. His research focuses on novel design concepts in hybrid transactional/analytical processing systems. He is currently working on an open-source column store called OakDB, a DBMS that focuses on auto-tuning under input uncertainty. Prior to Boston University, Andy attended the University of Minnesota, Twin Cities, earning a Bachelor's in Computer Engineering.

Tuning Data Systems Under Uncertainty in Workload
Abstract

Modern hybrid transactional/analytical processing (HTAP) systems are generally tuned with assumptions about the expected workload; however, this workload knowledge is often inaccurate. Deploying systems tuned with inaccurate knowledge can result in unexpected latency increases of up to 2X over the optimal. Additionally, expected workloads may "drift" over time, effectively acting as uncertainty in the workload and leading to frequent and unnecessary re-tuning. This work focuses on exploring a new robust design paradigm for column stores, a popular layout for modern HTAP storage engines. We explore how much these uncertainties affect performance by benchmarking current systems and how well they handle workload drift. We then introduce some intuitive notions of workload drift and explore different heuristic strategies to combat it. Finally, we implement these strategies in OakDB, an open-source column-store engine, to demonstrate robust tuning.

Friday, Apr 10 Jialin Ding (MIT)
Bio

Jialin Ding is a second-year PhD student in the MIT Data Systems Group, where he is advised by Prof. Tim Kraska. His research focuses broadly on the application of machine learning to data systems. He also collaborates with Umar Farooq Minhas and the Database Group at Microsoft Research on learned data structures. Prior to MIT, Jialin was an undergraduate at Stanford University, where he worked on data-intensive systems with Prof. Peter Bailis as part of Stanford DAWN.

Learning Multi-dimensional Indexes
Abstract

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often create clustered indexes over a single dimension or multi-dimensional indexes such as R-Trees, or use complex sort orders (e.g., Z-ordering). However, these schemes are often hard to tune and their performance is inconsistent across different datasets and queries.
In this talk, I will present Flood, a multi-dimensional in-memory read-optimized index that automatically adapts itself to a particular dataset and workload by jointly optimizing the index structure and data storage layout. Flood achieves up to three orders of magnitude faster performance for range scans with predicates than state-of-the-art multi-dimensional indexes or sort orders on real-world datasets and workloads. Our work serves as a building block towards an end-to-end learned database system. Our paper on Flood will appear at SIGMOD 2020.
I will also talk about our continuing work on extending the ideas of Flood to address real-world challenges of indexing multi-dimensional data such as data correlation, non-uniform queries, and categorical attributes.
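As a rough illustration of the grid idea behind such an index (with hand-picked column boundaries standing in for the learned ones; this is not Flood's actual layout or tuning), consider this 2D sketch:

```python
# Toy grid layout for 2D range queries: partition one dimension into
# columns, sort each column by the other dimension, and answer a query
# by scanning only the overlapping columns.
import bisect

class GridIndex2D:
    def __init__(self, points, num_cols=4):
        xs = sorted(p[0] for p in points)
        # Column boundaries at x-quantiles: a stand-in for learned bounds.
        self.bounds = [xs[len(xs) * i // num_cols] for i in range(1, num_cols)]
        self.cols = [[] for _ in range(num_cols)]
        for x, y in points:
            self.cols[bisect.bisect_right(self.bounds, x)].append((y, x))
        for col in self.cols:
            col.sort()                     # within a column, sort by y

    def range_query(self, x_lo, x_hi, y_lo, y_hi):
        first = bisect.bisect_right(self.bounds, x_lo)
        last = bisect.bisect_right(self.bounds, x_hi)
        out = []
        for col in self.cols[first:last + 1]:
            i = bisect.bisect_left(col, (y_lo, float("-inf")))
            while i < len(col) and col[i][0] <= y_hi:
                y, x = col[i]
                if x_lo <= x <= x_hi:      # filter stragglers on x
                    out.append((x, y))
                i += 1
        return out

pts = [(i % 10, i * 7 % 13) for i in range(100)]
idx = GridIndex2D(pts)
print(len(idx.range_query(2, 5, 0, 6)))
```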

Friday, Apr 03 CS Department Seminar
Thursday, Mar 26 Ryan Marcus (MIT)
Bio

Ryan Marcus is a postdoc researcher at MIT, working under Tim Kraska. Ryan recently graduated from Brandeis University, where he studied applications of machine learning to cloud databases under Olga Papaemmanouil. Before that, Ryan took courses in gender studies and mathematics at the University of Arizona, while banging his head against supercomputers at Los Alamos National Laboratory. He enjoys long walks down steep gradients, and shorter walks down gentler ones. He’s also looking for a job! You can find Ryan online at https://rmarcus.info.

MiDAS Seminar/CS591A1 Guest Lecture (Online@12:30 PM)
Machine Learning for Query Optimization: A Few Interesting Results & Thousands of Practical Barriers
Abstract

Query optimization is one of the most well-studied problems in database management systems. Because of the exponential problem space and the heuristic nature of current solutions, researchers have recently applied machine learning techniques to this challenging problem domain. Approaches range from using machine learning to perform cardinality estimation to outright replacing the query optimizer with deep reinforcement learning. Each of these approaches comes with its own unique set of advantages and constraints. This talk will lay out these recent advancements in a skeptical light (my own work included), highlighting both the potential advantages and the myriad of challenges that still need to be overcome to achieve a fully autonomous query optimizer.

Thursday, Mar 19 Stavros Papadopoulos (TileDB)
Bio

Dr. Stavros Papadopoulos is the founder and CEO of TileDB, Inc. Prior to founding TileDB, Inc. in February 2017, Dr. Papadopoulos was a Senior Research Scientist at the Intel Parallel Computing Lab, and a member of the Intel Science and Technology Center for Big Data at MIT CSAIL for three years. He also spent about two years as a Visiting Assistant Professor at the Department of Computer Science and Engineering of the Hong Kong University of Science and Technology (HKUST). Stavros received his PhD degree in Computer Science at HKUST under the supervision of Prof. Dimitris Papadias, and held a postdoc fellow position at the Chinese University of Hong Kong with Prof. Yufei Tao.

MiDAS Seminar/CS591A1 Guest Lecture (Online@12:30 PM)
The TileDB array data storage engine
Abstract

In this talk I will present TileDB, a storage engine for multi-dimensional dense and sparse arrays. TileDB is open-source (MIT License) and built as a C++ embeddable library. It comes with 6 APIs (C, C++, Python, R, Java, Go), and integrates with Spark, Dask, MariaDB, PrestoDB, PDAL and GDAL. I will first explain the TileDB data format, the internal engine mechanics (parallelism, filters, integrations), and its applicability to data science and analytics. I will then describe TileDB’s value in application domains such as Genomics and Geospatial. Finally, I will show a demo of our recently launched product that offers access control, logging and serverless SQL and UDFs on the cloud.

Friday, Mar 06 Dimitris Staratzis (BU)
Bio

Dimitris Staratzis is a first-year PhD student at the Computer Science Department of Boston University. He is a member of the MiDAS group@BU, advised by Professor Manos Athanassoulis. His research interests broadly include Database and Big Data Management. Prior to joining Boston University, he received his BSc in Informatics from the Athens University of Economics and Business.

Efficient Range Deletes in LSM Engines
Click for abstract

Modern data systems process enormous amounts of data, and selecting the appropriate data structure can have major implications for the responsiveness, availability, and scalability of such systems. Traditional systems have certain limitations, and thus more advanced key-value systems are used in most applications today. During updates, inserts, and deletes, LSM-trees utilize the main memory of the system to achieve better performance. Values are not deleted or updated in place. Each delete requires a new entry, referred to as a 'tombstone', that marks all previous entries with the same key as deleted. To our knowledge, none of the major key-value store developers have implemented range deletes except RocksDB. Its implementation uses so-called 'range tombstones', which together constitute a data structure that can answer whether a requested key is deleted by a range delete or not. For the database to construct such a response, multiple I/Os are needed. Our method tries to reduce these I/Os by opportunistically storing past results in memory, thus increasing responsiveness, since I/Os majorly affect performance.

Friday, Feb 28 CS Department Seminar
Friday, Feb 21 Showan Esmail Asyabi (BU)
Bio

Showan Esmail Asyabi is a research assistant and Ph.D. student at Boston University, advised by Prof. Azer Bestavros. His research focuses on designing and building system software for emerging computing models (e.g., cloud computing and stream processing). He has designed several CPU schedulers for virtualized and non-virtualized data centers to reduce power consumption or improve the performance of I/O-bound workloads. He is mainly interested in distributed systems, operating systems, cloud computing, and big data.

In-application CPU Scheduling to Reduce Power Consumption of In-memory Key-Value Stores
Abstract

The traffic load sent to key-value (KV) stores varies over long timescales of hours down to short timescales of a few microseconds. Long-term variations present the opportunity to save power during periods of low or medium utilization. Several techniques exist to save power in servers, including feedback-based controllers that right-size the number of allocated CPU cores, dynamic voltage and frequency scaling (DVFS), and c-state (idle-state) mechanisms. In this talk, we demonstrate that existing power-saving techniques are not effective for KV stores, because the high rate of traffic, even under low load, prevents the system from entering low-power states for extended periods of time. To achieve power savings, we must unbalance the load among the CPU cores so that some of them can enter low-power states during periods of low load. We accomplish this by introducing the notion of in-application CPU scheduling. Instead of relying on the kernel to schedule threads, we pin threads to bypass the kernel CPU scheduler and then perform the scheduling within the KV store application. Our design is a KV store featuring an in-application CPU scheduler that monitors the load to learn the workload characteristics and then scales the number of active CPU cores when the load drops, leading to notable power savings during periods of low or medium utilization. Our experiments demonstrate that our system outperforms state-of-the-art approaches such as Rubik and µDPM by up to 6x and 4x, respectively.
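The core mechanism, pinning worker threads so that the kernel scheduler is bypassed and idle cores can reach deep sleep states, can be sketched as follows. This is a Linux-only illustration that assumes cores 0 and 1 exist; the monitoring and scaling logic of the talk's system is far more elaborate:

```python
# Sketch of "in-application scheduling": pin each worker thread to a core
# ourselves instead of letting the kernel migrate it, so that cores outside
# the active set stay idle and can enter low-power states.
import os
import queue
import threading

tasks = queue.Queue()

def worker(core_id):
    os.sched_setaffinity(0, {core_id})    # 0 = the calling thread (Linux)
    while True:
        item = tasks.get()
        if item is None:                  # shutdown sentinel
            return
        # ... serve a KV request here ...

active_cores = [0, 1]                     # shrink this list under low load
threads = [threading.Thread(target=worker, args=(c,)) for c in active_cores]
for t in threads:
    t.start()
for _ in threads:
    tasks.put(None)
for t in threads:
    t.join()
```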

Friday, Feb 14 CS Department Seminar
Friday, Feb 07 Ju Hyoung Mun (BU)
Bio

Juhyoung is currently a postdoctoral associate in the Department of Computer Science at Boston University, working with Prof. Manos Athanassoulis. Her research interest lies in the fields of Data Systems and Networking, particularly in enhancing lookup performance. She received her Ph.D. in Electronic and Electrical Engineering at Ewha Womans University in Seoul, Korea, under the supervision of Prof. Hyesook Lim. Her doctoral research focused on how to retrieve data efficiently using Bloom filters in Information Centric Networking (ICN).

Reducing Hash Overhead of Bloom Filters in LSM-trees
Abstract

A log-structured merge-tree (LSM-tree) is widely used in modern key-value stores. To balance read performance and ingestion rate, LSM-trees maintain sorted runs in multiple levels, so a lookup may require multiple I/Os. To accelerate point lookups, LSM-trees often employ Bloom filters in main memory to avoid unnecessary I/Os for levels that do not contain the matching key. Using Bloom filters is useful when there is a vast gap between the cost of accessing the Bloom filters (hashing and probing memory) and the cost of accessing storage. As the access latency of modern storage devices like NVMe SSDs becomes comparable to hashing, the CPU cost of hashing becomes the performance bottleneck of lookups and hinders other CPU-intensive tasks. Hashing is crucial for LSM-trees since, by design, they have increased read amplification to mitigate the insertion cost. To alleviate the hashing overhead, we propose to share the hash calculation across multiple levels and re-use it throughout the tree hierarchy. Furthermore, by sharing the hash calculation, the sections of the Bloom filters that will be searched across the tree hierarchy are known in advance. Using this property, we also propose to prefetch the necessary Bloom filter sections to further improve lookup performance. Simulation results show that our proposal achieves almost the same false-positive ratio as the state of the art while keeping the hashing overhead constant.
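One standard way to hash once and reuse the result for every level's filter is Kirsch-Mitzenmacher double hashing; the sketch below is in the spirit of the proposal, not necessarily its exact scheme:

```python
# Pay for hashing once per key, then derive all probe positions for every
# level's Bloom filter from two base hashes (Kirsch-Mitzenmacher).
import hashlib

def base_hashes(key: str):
    d = hashlib.sha256(key.encode()).digest()
    # Force h2 odd so probes cycle through power-of-two filter sizes.
    return int.from_bytes(d[:8], "big"), int.from_bytes(d[8:16], "big") | 1

def probe_positions(h1, h2, k, m):
    return [(h1 + i * h2) % m for i in range(k)]

key = "user:42"
h1, h2 = base_hashes(key)                 # hashed exactly once per lookup
for level, (m, k) in enumerate([(1 << 10, 4), (1 << 14, 4), (1 << 18, 4)]):
    # The same (h1, h2) is reused at every level, and since the positions
    # are known up front, the filter words can be prefetched.
    print(level, probe_positions(h1, h2, k, m))
```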

Spring 2020: Tarikul Islam Papon, e-mail: papon at bu dot edu
Fall 2019 Calendar
Seminars are held in the old MCS B33 at 11.00 am [unless otherwise notified], followed by lunch.
Date Speaker Title and Abstract
Friday, Dec 06 Brian Hentschel
Bio

Brian is a PhD student at Harvard University in the Data Systems Laboratory. His work focuses on the usage of statistics and machine learning in making better data structures and algorithms, and conversely in the creation of data management solutions for better machine learning. Previous work of his has won the ACM SIGMOD Research Highlight Award as well as the Best Paper Award at EDBT.

Temporally Biased Sampling & Online Model Management
Abstract

To maintain the accuracy of supervised learning models in the presence of evolving data streams, we provide temporally-biased sampling schemes that weight recent data most heavily, with inclusion probabilities for a given data item decaying over time according to a specified "decay function". We then periodically retrain the models on the current sample. Time-biasing lets the models adapt to recent changes in the data while - unlike in a sliding-window approach - still keeping some old data to ensure robustness in the face of temporary fluctuations and periodicities in the data values. In addition, the sampling-based approach allows existing analytic algorithms for static data to be applied to dynamic streaming data essentially without change. We provide and analyze both a simple sampling scheme (T-TBS) that probabilistically maintains a target sample size and a novel reservoir-based scheme (R-TBS) that is the first to provide both control over the decay rate and a guaranteed upper bound on the sample size. The R-TBS and T-TBS schemes are of independent interest, extending the known set of unequal-probability sampling schemes. We discuss distributed implementation strategies; experiments in Spark illuminate the performance and scalability of the algorithms, and show that our approach can increase machine learning robustness in the face of evolving data.
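A minimal time-biased Bernoulli sampler in the spirit of T-TBS might look as follows (R-TBS additionally bounds the sample size and controls the decay rate, which this sketch does not attempt; the decay and acceptance parameters are illustrative):

```python
# Time-biased sampling sketch: at every batch arrival, previously sampled
# items survive with probability exp(-lam), so an item's inclusion
# probability decays exponentially with its age.
import math
import random

def tbs_stream(batches, lam=0.5, accept=0.5):
    sample = []
    for t, batch in enumerate(batches):
        keep = math.exp(-lam)
        sample = [x for x in sample if random.random() < keep]   # decay old
        sample += [x for x in batch if random.random() < accept] # admit new
        yield t, list(sample)

random.seed(0)
stream = [[(t, i) for i in range(20)] for t in range(5)]
for t, s in tbs_stream(stream):
    ages = [t - x[0] for x in s]
    print(f"after batch {t}: size={len(s)}, mean age={sum(ages)/len(s):.2f}")
```

Retraining a model on each yielded sample gives the periodic-retrain loop the abstract describes, with recent data dominating but older data never fully excluded.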

Monday, Nov 18 Stella Pantela
Bio

Stella is a senior distributed systems software engineer at Vertica, chair of the Vertica DataGals, and a member of the Vertica patent committee. At Vertica, Stella builds powerful infrastructure used by big data analytics companies including Uber, Etsy, Wayfair, and Philips. Her focus is on improving distributed systems protocols for more efficient maintenance and transfer of metadata in both cloud and enterprise environments. Stella's work has been featured at SIGMOD and North East Database Day. Before Vertica, Stella completed her B.A. with Honors in Computer Science at Harvard University in 2015, where she also conducted research in the area of column stores with Professor Stratos Idreos. In her free time she maintains a blog on distributed systems and technology at stylianipantela.com.

Guest lecture for CS460: Introduction to Databases [4:30-5:45pm, EPC 207]
Friday, Nov 08 Shreyas Sekar
Bio

Shreyas Sekar is a Postdoctoral Fellow at the Laboratory for Innovation Science at the Harvard Business School. Shreyas received his Ph.D. in Computer Science from Rensselaer Polytechnic Institute in 2017, where he was awarded the Robert McNaughton Prize for the best dissertation in CS. His dissertation was on Non-Discriminative Algorithmic Pricing. Broadly speaking, his research is centered around online decision making and learning in dynamic and strategic environments, with applications to e-commerce, transportation, and digital marketplaces. Currently, he is excited by questions pertaining to information design, i.e., how digital firms can leverage information as a tool for revenue management. Shreyas has also collaborated with industry partners such as Wayfair and Freelancer to help develop prescriptive strategies for optimizing core metrics.

Learning to Rank an Assortment of Products
Abstract

We consider the product ranking challenge that online retailers face when their customers typically behave as "window shoppers": they form an impression of the assortment after browsing products ranked in the initial positions and then decide whether to continue browsing. Further, customers' product preferences and attention spans are correlated and unknown to the retailer. We develop a class of online explore-then-exploit algorithms that prescribe a ranking to offer each customer, learning from preceding customers' clickstream data to offer better rankings to subsequent customers. Our algorithms balance product popularity with diversity: the notion of appealing to a large variety of heterogeneous customers. We prove that our learning algorithms rapidly converge to a ranking that matches the best-known approximation factors for the offline, complete information setting. Finally, we partner with Wayfair - a multi-billion dollar home goods online retailer - to estimate the impact of our algorithm in practice via simulations using actual clickstream data, and we find that our algorithm yields a significant increase (5-30%) in the number of customers that engage with the site.

Friday, Nov 01 Leonardo Torres
Bio

Leo is a fourth-year PhD student at the Network Science Institute at Northeastern University, under the advisement of Tina Eliassi-Rad. He is broadly interested in the intersection of Network Science, Machine Learning, and Mathematics, focusing on metric aspects of complex networks and graph mining. He specializes in introducing novel mathematical and computational tools to the network science toolbox, usually coming from differential geometry, algebraic topology, spectral linear algebra, and unsupervised techniques. Leo has a B.S. in Mathematics from Pontificia Universidad Católica del Perú. He has taught mathematics and Spanish at the undergraduate level, and he was a Research Programmer at Wolfram Research South America, working on the Wolfram|Alpha knowledge engine.

Non-Backtracking Cycles: Length Spectrum Theory and Graph Mining Applications
Abstract

Two basic tasks in graph analysis are: (1) computing the distance between two graphs and (2) embedding of the graph elements (i.e., nodes or links) into a lower-dimensional space. The former task has numerous applications from k-nearest neighbor search, to clustering a collection of graphs, to transfer learning. Unfortunately, there exists no canonical way of computing the distance between two finite graphs because the dissimilarity problem is ill-defined. Despite this fact, the relevant literature contains many methods for measuring graph distance with different heuristics, computational efficiency, interpretability, and theoretical soundness. We introduce a graph distance, called the Non-Backtracking Spectral Distance (NBD), which is theoretically sound and interpretable. NBD is based on a mathematical construct from algebraic topology (in particular, homotopy theory) called the length spectrum. The length spectrum of a graph is a function defined on the graph's first homotopy group (a.k.a. the fundamental group); and it uniquely characterizes a graph's 2-core up to isometry. Thus, if two graphs have the same length spectra, then their 2-cores are isometric. We show the relationship between a graph's length spectrum and its non-backtracking cycles; and present a method based on computing the eigenvalues of a graph's non-backtracking matrix. For the second task (i.e., graph embedding), most existing methods are stochastic and depend on black-box models such as deep networks. Both of these characteristics make their output difficult to analyze. We propose the Non-Backtracking Embedding Dimensions (NBED) for finding a graph embedding in low-dimensional space by computing the eigenvectors of the non-backtracking matrix. Both NBD and NBED are interpretable in terms of features of complex networks such as hubs, triangles, edge centrality, and communities. We showcase the usefulness of both NBD and NBED in experiments on synthetic and real-world networks.
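For the curious, constructing the non-backtracking matrix of a small graph and computing its spectrum takes only a few lines; the distance computation and the embedding built on top of it are omitted here:

```python
# The non-backtracking matrix B is indexed by directed edges:
# B[(u,v),(x,y)] = 1 iff v == x and y != u (no immediate backtracking).
import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]   # a triangle plus a pendant edge
directed = edges + [(v, u) for u, v in edges]
index = {e: i for i, e in enumerate(directed)}

B = np.zeros((len(directed), len(directed)))
for (u, v), i in index.items():
    for (x, y), j in index.items():
        if v == x and y != u:
            B[i, j] = 1.0

eig = np.linalg.eigvals(B)
print(np.round(sorted(eig, key=abs, reverse=True)[:4], 3))
```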

Friday, Oct 25 Matteo Riondato
Bio

Matteo Riondato is an assistant professor of computer science at Amherst College, and a visiting faculty at Brown University. Previously he was a research scientist at Two Sigma, and a postdoc at Stanford and Brown. He obtained his PhD in computer science from Brown. His research focuses on algorithms for knowledge discovery, data mining, and machine learning: he develops methods to analyze rich datasets, including graphs and time series, as fast as possible and in a statistically sound way. His works received best-of-conference awards at the 2014 SIAM International Conference on Data Mining (SDM), the 2016 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), and the 2018 IEEE International Conference on Data Mining (ICDM). He tweets @teorionda and lives at http://matteo.rionda.to.

SPuManTe: Significant Pattern Mining with Unconditional Testing
Abstract

We present SPuManTE, an efficient algorithm for mining significant patterns from a transactional dataset. SPuManTE controls the Family-wise Error Rate: it ensures that the probability of reporting one or more false discoveries is less than a user-specified threshold. A key ingredient of SPuManTE is UT, our novel unconditional statistical test for evaluating the significance of a pattern, which requires fewer assumptions on the data generation process and is more appropriate for a knowledge discovery setting than classical conditional tests, such as the widely used Fisher's exact test. Computational requirements have limited the use of unconditional tests in significant pattern discovery, but UT overcomes this issue by obtaining the required probabilities in a novel, efficient way. SPuManTE combines UT with recent results on the supremum of the deviations of pattern frequencies from their expectations, grounded in statistical learning theory. Joint work with Leonardo Pellegrina (UniPD) and Fabio Vandin (UniPD), presented at KDD'19.

Friday, Oct 18 Kiran Garimella
Bio

Kiran Garimella is the Michael Hammer postdoc at the MIT Institute for Data, Systems, and Society (IDSS). Before joining MIT, he was a postdoc at EPFL, Switzerland. His research focuses on using digital data for social good, including areas like polarization, misinformation, and human migration. His work on studying and mitigating polarization on social media won the best student paper awards at WSDM 2017 and WebScience 2017. Kiran received his PhD at Aalto University, Finland, and his Master's and Bachelor's degrees from IIIT Hyderabad, India. Prior to his PhD, he worked as a Research Engineer at Yahoo Research, Barcelona, and QCRI, Doha.

A first look at WhatsApp data — Problems, and Challenges
Abstract

In this talk, I will discuss our ongoing efforts to collect and analyze large amounts of public WhatsApp data from political groups in Brazil, India, and Indonesia. I will describe our efforts in monitoring a large number of WhatsApp groups and provide insights on various problems arising from such data. I will specifically give examples of novel ways in which misinformation, hate speech, and coordination are being spread through WhatsApp. I will briefly discuss some of the technical challenges that arise from the data, specifically those related to identifying image-based misinformation. Finally, I will mention our efforts on collecting data across platforms for the Indian election and why such a cross-platform analysis is required to understand the spread of issues like misinformation.

Friday, Sep 27 Loqman Salamatian
Bio

Loqman is a Master's student in mathematics at Sorbonne University. His main interests lie at the crossroads of Complex Networks, Differential Geometry, and Optimal Transport. He is currently working at CAIDA on understanding the topology of the Internet and its relationship with geography.

Monitoring the Internet through Riemannian Geometry
Abstract

In this work, we tackle the problem of real-time monitoring of dynamic variations of the global Internet, using topology-change information obtained by combining several BGP feeds. We develop a geometric approach that leverages notions of geometric curvature and recent developments in graph embedding using Ollivier-Ricci curvature. In particular, we use our method to detect major events and changes via the geometry of the embedding of the graph, where a major event is defined as an incident that either impacts a substantial number of prefixes or can be observed by numerous route monitors within a given period of time. We first describe the Ricci curvature approach to characterizing the AS-level graph. The key idea is to detect changes between consecutive snapshots and to separate events that result in considerable geometric alterations from those that remain local. The variations of curvature are evaluated through a set of landmarks as reference points. We then introduce an anomaly-tracking mechanism to detect large variations of curvature that translate into major variations in the geometry of the graph. These changes are then considered major BGP-level events. We describe a mechanism for identifying the network elements responsible for a set of coordinated changes and for isolating their implications in the geometric space. We finally evaluate this system in an operational setting.

Friday, Sep 20 Charalampos (Babis) Tsourakakis Optimal learning of joint alignments
Abstract

We consider the following problem, which is useful in applications such as joint image and shape alignment. The goal is to recover \( n \) discrete variables \( g_i \in \{0, \ldots, k-1\} \) (up to some global offset) given noisy observations of a set of their pairwise differences \( \{g_i - g_j \bmod k\} \); specifically, with probability \( \frac{1}{k}+\delta \) for some \( \delta > 0 \) one obtains the correct answer, and with the remaining probability one obtains a uniformly random incorrect answer. We consider a learning-based formulation where one can perform a query to observe a pairwise difference, and the goal is to perform as few queries as possible while obtaining the exact joint alignment. We provide an easy-to-implement, time-efficient algorithm that performs \( O(n \lg n/\delta^2 k) \) queries and recovers the joint alignment with high probability. We also show that our algorithm is optimal. This work significantly improves on the work of Chen and Candès, who view the problem as a constrained principal components analysis problem that can be solved using the power method. Our approach is simpler in both the algorithm and the analysis, and provides additional insights into the problem structure. Finally, our algorithm experimentally performs well compared to the algorithm of Chen and Candès, both in terms of accuracy and running time.
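To make the query model concrete, the sketch below simulates the noisy oracle and recovers the alignment with a naive repeated-query majority vote over the pairs \( (i, 0) \); the paper's algorithm achieves its query bound by being far more economical than this baseline:

```python
# Simulating the noisy pairwise-difference oracle and a naive majority-vote
# baseline (not the paper's algorithm). All parameter values are arbitrary.
import random
from collections import Counter

def make_oracle(g, k, delta, rng):
    """Correct with probability 1/k + delta; otherwise a uniformly
    random *incorrect* answer, as in the abstract."""
    def query(i, j):
        if rng.random() < 1.0 / k + delta:
            return (g[i] - g[j]) % k
        return (g[i] - g[j] + rng.randrange(1, k)) % k
    return query

rng = random.Random(1)
n, k, delta, reps = 50, 4, 0.2, 60
g = [rng.randrange(k) for _ in range(n)]
query = make_oracle(g, k, delta, rng)

recovered = [0] * n                       # fix the global offset at node 0
for i in range(1, n):
    votes = Counter(query(i, 0) for _ in range(reps))
    recovered[i] = votes.most_common(1)[0][0]

truth = [(x - g[0]) % k for x in g]
print("accuracy:", sum(r == t for r, t in zip(recovered, truth)) / n)
```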

Tuesday, Sep 03 Tianyi Chen and Konstantinos Sotiropoulos Novel Dense Subgraph Discovery Primitives: Risk Aversion and Exclusion Queries and TwitterMancer: Predicting User Interactions on Twitter
Abstract

Novel Dense Subgraph Discovery Primitives: Risk Aversion and Exclusion Queries. In the densest subgraph problem, given an undirected graph \( G(V, E, w) \) with non-negative edge weights we are asked to find a subset of nodes \( S \) from \( V \) that maximizes the degree density \( w(S)/|S| \). This problem is solvable in polynomial time, and in general is well studied. But what happens when we are asked questions such as "how can we find a dense subgraph in Twitter with lots of replies and mentions but no follows?", "how do we extract a dense subgraph with high expected reward and low risk from an uncertain graph"? In this work, specifically, we formulate both problems mathematically as special instances of dense subgraph discovery in graphs with negative weights. We prove that the problem in general is NP-hard, but we also provide sufficient conditions under which it is poly-time solvable. We design an efficient approximation algorithm that works best in the presence of small negative weights, and an effective heuristic for the more general case. Finally, we perform experiments on various real-world datasets that verify the value of the proposed primitives, and the effectiveness of our proposed algorithms.
TwitterMancer: Predicting User Interactions on Twitter. This work investigates the interplay between different types of user interactions on Twitter, with respect to predicting missing or unseen interactions. For example, given a set of retweet interactions between Twitter users, how accurately can we predict reply interactions? Is it more difficult to predict retweet or quote interactions between a pair of accounts? Also, which features of interaction patterns are most important to enable accurate prediction of specific Twitter interactions? Our empirical study of Twitter interactions contributes initial answers to these questions. We have crawled an extensive dataset of Twitter accounts and their follow, quote, retweet, reply interactions over a period of a month. Using a machine learning framework, we find we can accurately predict many interactions of Twitter users. Interestingly, the most predictive features vary with the user profiles, and are not the same across all users. For example, for a pair of users that interact with a large number of other Twitter users, we find that certain "higher-dimensional" triads, i.e., triads that involve multiple types of interactions, are very informative, whereas for less active Twitter users, certain in-degrees and out-degrees play a major role. Finally, we provide various other insights on Twitter user behavior.
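As background for the first abstract, the classic non-negative densest subgraph problem already admits a simple greedy 1/2-approximation (Charikar's peeling), sketched below; handling negative weights, as in the talk, is the hard part and is not attempted here:

```python
# Greedy peeling for unweighted densest subgraph: repeatedly delete a
# minimum-degree node and remember the best density |E(S)| / |S| seen.
import heapq

def densest_subgraph_density(adj):
    deg = {u: len(vs) for u, vs in adj.items()}
    edges = sum(deg.values()) // 2
    alive = set(adj)
    heap = [(d, u) for u, d in deg.items()]
    heapq.heapify(heap)
    best = edges / len(alive)
    while len(alive) > 1:
        d, u = heapq.heappop(heap)
        if u not in alive or deg[u] != d:
            continue                      # stale heap entry, skip it
        alive.remove(u)
        edges -= deg[u]
        for v in adj[u]:
            if v in alive:
                deg[v] -= 1
                heapq.heappush(heap, (deg[v], v))
        best = max(best, edges / len(alive))
    return best

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(densest_subgraph_density(adj))      # triangle {0, 1, 2}: density 1.0
```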

Fall 2019: Konstantinos Sotiropoulos, e-mail: ksotirop at bu dot edu