Site Reliability Engineers

About

Exploring this Job

Try to learn as much as you can about site reliability engineering. Several Google engineers have written and edited books about the field that you can read for free online at https://landing.google.com/sre/books. YouTube also has videos about the field.

One of the best ways to prepare for this career is to learn how to code. Programming languages that are frequently used by SREs include C, C++, Java, Python, Go, Perl, and Ruby. The following online learning platforms offer free or low-cost classes in coding: Codecademy (https://www.codecademy.com), edX (https://www.edx.org), Coursera (https://www.coursera.org), and Khan Academy (https://www.khanacademy.org).

Conducting an informational interview with a site reliability engineer, or arranging to job shadow one, is another effective way to learn the ins and outs of this career.

The Job

Job duties for site reliability engineers vary by company, but most are responsible for the availability, latency (i.e., the total time it takes a data packet to travel from one node to another), performance, and capacity (i.e., the maximum possible output that can be produced by a product) of digital products and systems. These engineers are concerned with both keeping apps and computer systems (software, hardware, network) running effectively and responding to any event (e.g., bandwidth outage, hardware degradation, high usage, configuration errors) that affects the ability of customers to use the product.
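As a rough illustration of how two of these measures might be computed, the short Python sketch below works from a handful of made-up request records. The record fields (status, latency_ms), the sample values, and the choice to count any status code below 500 as a success are assumptions made for this example, not a standard. Latency here is measured per request rather than per data packet, which is one common way to track it in practice.

```python
import math


def availability(requests):
    """Return the fraction of requests that succeeded (status below 500)."""
    ok = sum(1 for r in requests if r["status"] < 500)
    return ok / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct
    percent of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical sample data: five requests, one of which failed slowly.
requests = [
    {"status": 200, "latency_ms": 120},
    {"status": 200, "latency_ms": 95},
    {"status": 500, "latency_ms": 2000},
    {"status": 200, "latency_ms": 110},
    {"status": 200, "latency_ms": 130},
]

latencies = [r["latency_ms"] for r in requests]
print("availability:", availability(requests))
print("median latency (ms):", percentile(latencies, 50))
print("95th percentile latency (ms):", percentile(latencies, 95))
```

In this sample the single failed request lowers availability to 80 percent and also dominates the 95th percentile latency, which is why SREs often watch high percentiles rather than averages.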

At many companies, SREs spend about 50 percent of their time on call to resolve issues with technology. Some of these issues can be fixed in a few minutes, while others take hours or even days to resolve. During an incident, the engineer refers to a runbook (a summary of past technical issues and instructions on how they were fixed) and works through a series of steps to fix the problem. They also collaborate with other engineers and managers to solve the problem. When a runbook is unavailable, SREs must use their analytical and problem-solving abilities to assess the issue, determine potential causes, and devise solutions. As they work to resolve the problem, they record their actions and hypotheses so that a runbook can be created for reference if the problem recurs.

Once the problem is resolved, the engineer prepares an incident response report that details what happened, what steps they took during the incident to find the root cause, and what was done to solve the problem.

For major incidents, members of a site reliability engineering team participate in what is known as a blameless postmortem meeting. This meeting focuses on the facts, rather than singling out employees who may have made a mistake that caused or worsened the problem. During the meeting, SREs discuss the information presented in the incident response report to determine how similar incidents can be prevented in the future. The SRE might be asked to prepare additional documentation regarding the incident, and the group may conduct further investigations to gather more information or to test hypotheses about the root cause(s) of the issue.

During the other 50 percent of their workdays, SREs monitor the product or system in real time in order to track trends in performance that may indicate reduced reliability. When they identify an area of concern, they conduct tests, write replacement code (if necessary), and otherwise work to make their company’s products as reliable as possible. When possible, engineers write code that automates time-consuming tasks that have reduced reliability or that have even caused products or systems to malfunction. Other duties include preparing service overviews of new products that summarize their system architecture, components and dependencies, and other parameters; conducting production readiness reviews to ensure that a new product meets expected standards for performance and reliability; projecting future demand for a company’s products in order to ensure that there is enough bandwidth and other computing resources available to satisfy expected customer demand; developing plans to upgrade the behavior or performance of a service, while preserving service reliability; and developing plans to decommission a dated product or system in a way that does not affect the performance of related products or systems.
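To give a sense of what projecting future demand can look like in code, the Python sketch below fits a straight-line trend to monthly peak usage and estimates how many months remain before usage reaches a capacity limit. The function names, the sample figures, and the assumption that demand grows linearly are all hypothetical choices made for illustration; real capacity planning usually accounts for seasonality, launches, and safety margins.

```python
def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ...
    Returns (slope, intercept)."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


def months_until_full(monthly_peaks, capacity):
    """Estimate months from the latest data point until the fitted trend
    reaches capacity. Returns None if usage is flat or falling."""
    slope, intercept = linear_fit(monthly_peaks)
    if slope <= 0:
        return None
    last_month = len(monthly_peaks) - 1
    month_at_capacity = (capacity - intercept) / slope
    return max(0.0, month_at_capacity - last_month)


# Hypothetical peak requests per second over four months,
# against an assumed limit of 200 requests per second.
peaks = [100, 110, 120, 130]
print("months of headroom:", months_until_full(peaks, 200))
```

With these sample numbers, usage grows by 10 requests per second each month, so the estimate is about seven months of headroom before more resources would be needed.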

Overview

Requirements