EuroSys 2021 has ended
Monday, April 26 • 16:00 - 21:00
[HAOC] 1st Workshop on High Availability and Observability of Cloud Systems

Website: https://haoc2021.cs.jhu.edu/
Slack: #workshop-haoc
YouTube: https://www.youtube.com/playlist?list=PLzDuHU-z7gNi_wZuMBGCIFXJssdydBihB
Proceedings: http://dl.acm.org/citation.cfm?id=3447851
Organizers: Rebecca Isaacs (Twitter), Ryan Huang (Johns Hopkins University)

(all times are in British Summer Time)

16:00 - 16:05: Welcome
16:05 - 17:05: Keynote: Haryadi Gunawi (University of Chicago)
  • Too Many Tests, Too Little Time: How to Find Bugs Faster
17:05 - 17:30: Break
17:30 - 18:00: Paper Presentation Session I
  • CARE: Infusing Causal Aware Thinking to Root Cause Analysis in Cloud System
Yong Xu and Xu Zhang (Microsoft Research, Beijing, China); Chuan Luo and Si Qin (Microsoft Research); Rohit Pandey (Microsoft, Redmond); Chao Du and Qingwei Lin (Microsoft Research, Beijing, China); Yingnong Dang (Microsoft, Redmond); Andrew Zhou (Microsoft)
  • Frisbee: A Suite for Benchmarking Systems Recovery
Fotis Nikolaidis (FORTH, Greece); Angelos Bilas (Univ. of Crete and FORTH, Greece); Manolis Marazakis (FORTH-ICS); Antonis Chazapis (FORTH, Greece)

18:00 - 18:30: Invited Talk I: Samer Al-Kiswany (University of Waterloo)
  • On the Art of Wielding a Double-Edged Sword (or Finessing Modern Networks)
18:30 - 18:40: Break
18:40 - 19:10: Paper Presentation Session II
  • Examining Raft's behaviour during partial network failures
Chris Jensen, Heidi Howard, and Richard Mortier (University of Cambridge)
  • Service mesh circuit breaker: From panic button to performance management tool
Mohammad Reza Saleh Sedghpour, Cristian Klein, and Johan Tordsson (Umeå University)

19:10 - 19:40: Invited Talk II: Kay Ousterhout (LightStep)
  • Sampling Distributed Traces: Evolution from 2005 to Today
19:40 - 19:50: Break
19:50 - 20:50: Panel
  • Chair: John Wilkes, Google
  • Evangelia Kalyvianaki, University of Cambridge
  • Jonathan Mace, Max Planck Institute for Software Systems (MPI-SWS)
  • Noa Zilberman, University of Oxford
  • Yonatan Zunger, Twitter
20:50 - 21:00: Wrap-up

  • 16:05 - 17:05: Keynote: Haryadi Gunawi (University of Chicago)
Too Many Tests, Too Little Time: How to Find Bugs Faster

As more data and computation move from local to cloud environments, datacenter distributed systems have become a dominant backbone for many modern applications. However, the complexity of cloud-scale hardware and software ecosystems has outpaced existing testing, debugging, and verification tools. I will describe a classic class of bugs that surface in large-scale datacenter distributed systems: distributed concurrency bugs, caused by non-deterministic timings of distributed events such as message arrivals as well as multiple crashes and reboots. The challenge is that there are too many tests to perform. I will describe our seven years of experience in taming this problem, in particular how to systematically reduce the number of tests to perform when building software model checkers for distributed datacenter systems.

Bio: Haryadi S. Gunawi is an Associate Professor in the Department of Computer Science at the University of Chicago, where he leads the UCARE research group (UChicago systems research on Availability, Reliability, and Efficiency). He received his Ph.D. in Computer Science from the University of Wisconsin, Madison in 2009. He was a postdoctoral fellow at the University of California, Berkeley from 2010 to 2012. His current research focuses on cloud computing reliability and new storage technology. He has won numerous awards, including an NSF CAREER Award, an NSF Computing Innovation Fellowship, a Google Faculty Research Award, NetApp Faculty Fellowships, and an Honorable Mention for the 2009 ACM Doctoral Dissertation Award.

  • 18:00 - 18:30: Invited Talk I: Samer Al-Kiswany (University of Waterloo)
On the Art of Wielding a Double-Edged Sword (or Finessing Modern Networks)

Unprecedented advances in networking technology have introduced network configurability and programmability. However, this increase in network "softwarization" is a double-edged sword. On one hand, network softwarization facilitates the building of line-rate application-specific packet-processing logic. On the other hand, increased network softwarization (perhaps unsurprisingly) increases the frequency and complexity of network faults. In this talk, I will discuss a peculiar type of network fault that my group identified: partial network partitioning. First, I will present a comprehensive study of system failures caused by this type of fault. Our study reveals that the studied failures are catastrophic (e.g., lead to data loss) and are easily manifested. Second, I will present an analysis of fault-tolerance techniques for eight popular systems and highlight their shortcomings. Finally, I will present Nifty, a transparent communication layer that masks partial network partitions. Nifty overcomes the shortcomings of current fault-tolerance approaches and effectively masks partial partitions while imposing negligible overhead.

Bio: Samer Al-Kiswany is an assistant professor at the David Cheriton School of Computer Science at the University of Waterloo, Canada. His research interests are in distributed systems, networking, and data management and processing engines. In particular, his work focuses on reconsidering systems design in light of recent changes in cloud applications and platforms. Samer received his PhD from the University of British Columbia in 2013. After earning his PhD, he joined the University of Wisconsin–Madison as a postdoctoral fellow. Dr. Al-Kiswany is the recipient of ten national and international awards, including the Killam Doctoral Fellowship and the NSERC Postdoctoral Fellowship.

  • 19:10 - 19:40: Invited Talk II: Kay Ousterhout (LightStep)
Sampling Distributed Traces: Evolution from 2005 to Today

Distributed tracing has become a widespread tool for understanding the performance of large-scale systems: unlike metrics or logs, traces follow a request through every service to provide critical context for debugging problems. Retaining traces for every request through a system is typically prohibitively expensive, so all systems for distributed tracing sample a subset of traces to save. In this talk, I’ll cover three iterations of sampling for distributed tracing, starting from the simple, random approach employed by Google’s production tracing system, Dapper, in 2005, and ending with a new approach that Lightstep is beginning to use in production. Over time, technology improvements have allowed sampling decisions to be delayed later and later in the data ingestion pipeline, which has enabled more sophisticated sampling algorithms that consider more dimensions in choosing which traces to save. I’ll end by talking about ongoing challenges and opportunities in selecting the most useful sample of tracing data.

Bio: Kay Ousterhout is a software engineer at Lightstep, where she's building performance management tools that enable users to understand the performance of complex distributed systems. Before Lightstep, Kay received a PhD from UC Berkeley. Her thesis focused on building high-performance data analytics frameworks that allow

Monday April 26, 2021 16:00 - 21:00 BST
  Workshop, HAOC
  • Slack Channel #workshop-haoc
  • Volunteers Brian Choi, Yigong Hu, Haoyu Liu, Kostis Kaffes