Site reliability engineering : How Google runs production systems | WorldCat.org

Physical Description: 1 online resource

ISBN:

9781491951187, 9781491951170, 1491951184, 1491951176

OCLC Number / Unique Identifier: 945577030

Additional Physical Form Entry:

Contents:

Cover; Copyright; Table of Contents; Foreword; Preface; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments; Part I. Introduction; Chapter 1. Introduction; The Sysadmin Approach to Service Management; Google's Approach to Service Management: Site Reliability Engineering; Tenets of SRE; Ensuring a Durable Focus on Engineering; Pursuing Maximum Change Velocity Without Violating a Service's SLO; Monitoring; Emergency Response; Change Management; Demand Forecasting and Capacity Planning; Provisioning; Efficiency and Performance; The End of the Beginning; Chapter 2. The Production Environment at Google, from the Viewpoint of an SRE; Hardware; System Software That "Organizes" the Hardware; Managing Machines; Storage; Networking; Other System Software; Lock Service; Monitoring and Alerting; Our Software Infrastructure; Our Development Environment; Shakespeare: A Sample Service; Life of a Request; Job and Data Organization; Part II. Principles; Chapter 3. Embracing Risk; Managing Risk; Measuring Service Risk; Risk Tolerance of Services; Identifying the Risk Tolerance of Consumer Services; Identifying the Risk Tolerance of Infrastructure Services; Motivation for Error Budgets (footnote 1: An early version of this section appeared as an article in ;login:, August 2015, vol. 40, no. 4); Forming Your Error Budget; Benefits; Chapter 4. Service Level Objectives; Service Level Terminology; Indicators; Objectives; Agreements; Indicators in Practice; What Do You and Your Users Care About?; Collecting Indicators; Aggregation; Standardize Indicators; Objectives in Practice; Defining Objectives; Choosing Targets; Control Measures; SLOs Set Expectations; Agreements in Practice; Chapter 5. Eliminating Toil; Toil Defined; Why Less Toil Is Better; What Qualifies as Engineering?; Is Toil Always Bad?; Conclusion; Chapter 6. Monitoring Distributed Systems; Definitions; Why Monitor?; Setting Reasonable Expectations for Monitoring; Symptoms Versus Causes; Black-Box Versus White-Box; The Four Golden Signals; Worrying About Your Tail (or, Instrumentation and Performance); Choosing an Appropriate Resolution for Measurements; As Simple as Possible, No Simpler; Tying These Principles Together; Monitoring for the Long Term; Bigtable SRE: A Tale of Over-Alerting; Gmail: Predictable, Scriptable Responses from Humans; The Long Run; Conclusion; Chapter 7. The Evolution of Automation at Google; The Value of Automation; Consistency; A Platform; Faster Repairs; Faster Action; Time Saving; The Value for Google SRE; The Use Cases for Automation; Google SRE's Use Cases for Automation; A Hierarchy of Automation Classes; Automate Yourself Out of a Job: Automate ALL the Things!; Soothing the Pain: Applying Automation to Cluster Turnups; Detecting Inconsistencies with Prodtest; Resolving Inconsistencies Idempotently; The Inclination to Specialize

More Information:

archive.org Free eBook from the Internet Archive

openlibrary.org Additional information and access via Open Library