Reducing Long Tail Latencies in Geo-Distributed Systems

Detta är en avhandling från Stockholm : KTH Royal Institute of Technology

Sammanfattning: Computing services are highly integrated into modern society. Millions of people rely on these services daily for communication, coordination, trading, and accessing to information. To meet high demands, many popular services are implemented and deployed as geo-distributed applications on top of third party virtualized cloud providers. However, the nature of such deployment provides variable performance characteristics. To deliver high quality of service, such systems strive to adapt to ever-changing conditions by monitoring changes in state and making run-time decisions, such as choosing server peering, replica placement, and quorum selection.In this thesis, we seek to improve the quality of run-time decisions made by geo-distributed systems. We attempt to achieve this through: (1) a better understanding of the underlying deployment conditions, (2) systematic and thorough testing of the decision logic implemented in these systems, and (3) by providing a clear view into the network and system states which allows these services to perform better-informed decisions.We performed a long-term cross datacenter latency measurement of the Amazon EC2 cloud provider. We used this data to quantify the variability of network conditions and demonstrated its impact on the performance of the systems deployed on top of this cloud provider.Next, we validate an application’s decision logic used in popular storage systems by examining replica selection algorithms. We introduce GeoPerf, a tool that uses symbolic execution and lightweight modeling to perform systematic testing of replica selection algorithms. We applied GeoPerf to test two popular storage systems and we found one bug in each.Then, using traceroute and one-way delay measurements across EC2, we demonstrated persistent correlation between network paths and network latency. We introduce EdgeVar, a tool that decouples routing and congestion based changes in network latency. By providing this additional information, we improved the quality of latency estimation, as well as increased the stability of network path selection.Finally, we introduce Tectonic, a tool that tracks an application’s requests and responses both at the user and kernel levels. In combination with EdgeVar, it provides a complete view of the delays associated with each processing stage of a request and response. Using Tectonic, we analyzed the impact of sharing CPUs in a virtualized environment and can infer the hypervisor’s scheduling policies. We argue for the importance of knowing these policies and propose to use them in applications’ decision making process.

  KLICKA HÄR FÖR ATT SE AVHANDLINGEN I FULLTEXT. (PDF-format)