Enabling Fast and Accurate Run-Time Decisions in Geo-Distributed Systems : Better Achieving Service Level Objectives

Sammanfattning: Computing services are highly integrated into modern society and used  by millions of people daily. To meet these high demands, many popular  services are implemented and deployed as geo-distributed applications on  top of third-party virtualized cloud providers. However, the nature of  such a deployment leads to variable performance. To deliver high quality  of service, these systems strive to adapt to ever-changing conditions by  monitoring changes in state and making informed run-time decisions, such  as choosing server peering, replica placement, and redirection of requests. In  this dissertation, we seek to improve the quality of run-time decisions made  by geo-distributed systems. We attempt to achieve this through: (1) a better  understanding of the underlying deployment conditions, (2) systematic and  thorough testing of the decision logic implemented in these systems, and (3)  by providing a clear view of the network and system states allowing services  to make better-informed decisions.  First, we validate an application’s decision logic used in popular  storage systems by examining replica selection algorithms. We do this by  introducing GeoPerf, a tool that uses symbolic execution and modeling to  perform systematic testing of replica selection algorithms. GeoPerf was used  to test two popular storage systems and found one bug in each.  Then, using measurements across EC2, we observed persistent correlation  between network paths and network latency. Based on these observations,  we introduce EdgeVar, a tool that decouples routing and congestion based  changes in network latency. This additional information improves estimation  of latency, as well as increases the stability of network path selection.  Next, we introduce Tectonic, a tool that tracks an application’s requests  and responses both at the user and kernel levels. In combination with  EdgeVar, it decouples end-to-end request completion time into three  components of network routing, network congestion, and service time.  Finally, we demonstrate how this decoupling of request completion  time components can be leveraged in practice by developing Kurma, a  fast and accurate load balancer for geo-distributed storage systems. At  runtime, Kurma integrates network latency and service time distributions to  accurately estimate the rate of Service Level Objective (SLO) violations, for  requests redirected between geo-distributed datacenters. Using real-world  data, we demonstrate Kurma’s ability to effectively share load among  datacenters while reducing SLO violations by a factor of up to 3 in high  load settings or reducing the cost of running the service by up to 17%. The  techniques described in this dissertation are important for current and future  geo-distributed services that strive to provide the best quality of service to  customers while minimizing the cost of operating the service.  

  KLICKA HÄR FÖR ATT SE AVHANDLINGEN I FULLTEXT. (PDF-format)