Application service resilience In cloud : (Record no. 433019)
000 -LEADER | |
---|---|
fixed length control field | 05704nam a22002657a 4500 |
008 - FIXED-LENGTH DATA ELEMENTS--GENERAL INFORMATION | |
fixed length control field | 250111b |||||||| |||| 00| 0 eng d |
041 ## - LANGUAGE CODE | |
Language code of text/sound track or separate title | en |
082 ## - DEWEY DECIMAL CLASSIFICATION NUMBER | |
Classification number | 004.8 |
Item number | MAT |
100 ## - MAIN ENTRY--PERSONAL NAME | |
Personal name | Mathews, Dhanya R |
245 ## - TITLE STATEMENT | |
Title | Application service resilience In cloud : |
Remainder of title | end-to-end perspective |
260 ## - PUBLICATION, DISTRIBUTION, ETC. (IMPRINT) | |
Place of publication, distribution, etc | Bangalore : |
Name of publisher, distributor, etc | Indian Institute of Science, |
Date of publication, distribution, etc | 2024. |
300 ## - PHYSICAL DESCRIPTION | |
Extent | N/A |
Accompanying material | E-Thesis |
500 ## - GENERAL NOTE | |
General note | Includes bibliographical references. |
502 ## - DISSERTATION NOTE | |
Dissertation note | PhD;2024;Computational and Data Sciences. |
520 ## - SUMMARY, ETC. | |
Summary, etc | Embargo up to 10/1/2026. The idea of computing as a utility was realized with the emergence of the cloud computing paradigm. Cloud service providers offer a wide range of services that are delivered over the Internet to cloud service consumers. In their current manifestation, cloud services are realized over multiple logical, virtualized, and distributed resources, typically using a multi-layered architecture. The providers document non-functional service level guarantees such as availability, performance, and security in Service Level Agreements (SLAs), provided to the consumer as Service Level Objectives (SLOs). The wide adoption of cloud computing, compounded with the emergence of microservice architecture, has resulted in a considerable increase in the number of components involved in service delivery. Manually addressing failures in real time is inefficient and often impossible at cloud scale, where failures are the norm rather than the exception. Ensuring the quality of an application service, as documented in the SLA, therefore requires autonomous mechanisms to enhance cloud services' resilience. Though cloud setups rely on highly autonomous service layers for managing, provisioning, and monitoring applications, most of them focus on a specific cloud service architecture layer or consider only a particular set of faults. Any component across the cloud service stack involved in service delivery could disrupt the SLO. Further, as cloud services use shared infrastructure, monitoring and acting on individual service layer metrics is limiting. In such a scenario, visibility of failure anywhere in the stack can enable effective recovery/remediation strategies; hence, there is a case for an application-oriented resiliency solution that takes an end-to-end view of failures. 
Towards this, we propose an end-to-end service resilience framework that employs data-dependent intelligent autonomous mechanisms to deal with cloud service disruptions efficiently. The intelligence to reduce the effect of disruptions is based on understanding the complex interconnections and inter-dependencies of end-to-end components in the cloud service stack. The different cloud service abstraction layers and infrastructure sharing have resulted in an increased occurrence of faults, specifically saturation faults. The initial phase of this work examines real-world disruption scenarios to understand the faults that could disrupt a cloud service. With ever-changing applications and the environments on which they are hosted, realizing a failure repository for cloud service faults is infeasible. This makes conventional data-oriented approaches less practical and methods oriented toward dynamic observability data more desirable. Towards this, the second phase of this work developed a Topology Aware Root Cause Detection Algorithm (TA-RCD) that considers the observability data from end-to-end service components and their interconnectedness. Our results from the fault injection studies show that, on average, the proposed approach outperforms the state-of-the-art RCD algorithm by at least 2x for Top-5 recall and 4x for Top-3 recall. To autonomously recover a service from its anomalous state, the remediation should target the root cause of the anomalous behavior. The root-cause localizations, though accurate, are not restricted to a specific component because of causal effects due to service interactions. In order to identify the anomalous component, the third phase of this work developed a Topology Aware end-to-end failure Recovery framework (TA-REC) that identifies the appropriate remediation strategy for an anomaly. 
Anomaly score assignment and component activity tracking in TA-REC facilitate the identification of the affected component and of the remediation to apply to it. For the saturation fault scenarios injected across the stack, TA-REC can identify a more adequate remediation/recovery strategy than the state of the art because its end-to-end view gives better visibility into the origin of the failure. In conclusion, this work demonstrated the usefulness of the end-to-end topology of a cloud application service for efficiently remediating anomalies that challenge service quality. The observations show that treating the service as a black box restricts the development of intelligent autonomous approaches to guarantee SLOs. The proof-of-concept evaluations demonstrated that the intelligence needed to maintain service resilience effectively depends on an accurate understanding of the end-to-end state, as it facilitates maintaining component serviceability by targeting the cause of failure in the stack. Future work aims to evaluate both TA-RCD and TA-REC for a broader range of fault scenarios in real-life production deployments. |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Cloud Application Services |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Resilience |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Cloud computing |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Topology Aware Root Cause Detection Algorithm |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Service Level Objectives |
650 ## - SUBJECT ADDED ENTRY--TOPICAL TERM | |
Topical term or geographic name as entry element | Distributed Systems |
700 ## - ADDED ENTRY--PERSONAL NAME | |
Personal name | Advised by Lakshmi, J |
856 ## - ELECTRONIC LOCATION AND ACCESS | |
Uniform Resource Identifier | https://etd.iisc.ac.in/handle/2005/6763 |
942 ## - ADDED ENTRY ELEMENTS (KOHA) | |
Koha item type | Thesis |