On-Call

Expectations for On-Call

Customer Emergency On-Call Rotation

Infrastructure Engineer On-Call

The Infrastructure department’s SREs provide 24x7 on-call coverage for the production environment. For details, please see incident-management.

In addition to incident management responsibilities, the Engineer On-Call (EOC) is also responsible for time-sensitive interrupt work that supports the production environment and is not owned by another team. This includes:

  1. Fulfilling Security Incident Response Team (SIRT) requests
  2. Fulfilling Legal Preservation requests
  3. Reviewing and handling certain change requests (CRs). This includes:
    1. Reviewing CRs to ensure they do not conflict with any ongoing incidents or investigations
    2. Executing the CR directly if the author does not have the required permissions to make the change themselves (such as admin-level changes)
    3. Supporting C1 CRs, such as database upgrades, that may occur on weekends
  4. Handling urgent Teleport access requests
  5. Approving an exception for running ChatOps commands when they fail their safety checks
  6. Investigating and fixing buggy/flapping alerts
  7. Removing alerts that are no longer relevant
  8. Collecting production information when requested
  9. Responding to @sre-oncall Slack mentions
  10. Assisting Release Managers with deployment problems
  11. Being the DRI for incident reviews

Engineering Incident Manager

Development Team On-Call Rotation

Gitaly Engineer On-Call

For more details, see the team page

DBO On-Call

For more details, see the DBO escalation process

Security Team On-Call Rotation

Security Operations (SecOps)

Security Managers

Developer Experience Stage On-Call Rotation

We use PagerDuty to set the on-call schedules, and to route notifications to the appropriate individual(s).

Swapping On-Call Duty

Team members covering a shift for someone else are responsible for adding the override in PagerDuty. Swaps can be arranged in the #eoc-general Slack channel. The person covering can delegate adding the override back to the requestor, but only after explicitly confirming that they will cover the requested shift(s).

To set an override, click the “Schedule an Override” button in the side navigation on the Schedule page, or select the relevant block of time on the calendar or timeline view first and then click it. The override defaults to you, because PagerDuty assumes that you are the person volunteering. If you are setting the override on behalf of another team member, select their name from the drop-down list. Also see this article for reference.
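The same override can also be created through the PagerDuty REST API instead of the web UI. The sketch below only builds the request and sends nothing. It assumes the v2 `POST /schedules/{id}/overrides` endpoint with an `override` object in the body. The schedule ID, user ID, and API key are hypothetical placeholders, so check the PagerDuty API reference for the current request shape before relying on it.

```python
from datetime import datetime, timedelta, timezone

PAGERDUTY_API = "https://api.pagerduty.com"


def build_override_headers(api_key):
    """Headers for a PagerDuty REST API v2 call authenticated with an API key."""
    return {
        "Authorization": f"Token token={api_key}",
        "Accept": "application/vnd.pagerduty+json;version=2",
        "Content-Type": "application/json",
    }


def build_override_request(schedule_id, user_id, start, end):
    """Return (url, body) for an override that puts user_id on call from start to end.

    user_id is the person covering the shift, not necessarily the person
    making the request (the UI defaults to the requester; here it is explicit).
    """
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if end <= start:
        raise ValueError("override must end after it starts")
    url = f"{PAGERDUTY_API}/schedules/{schedule_id}/overrides"
    body = {
        "override": {
            "start": start.isoformat(),
            "end": end.isoformat(),
            "user": {"id": user_id, "type": "user_reference"},
        }
    }
    return url, body


if __name__ == "__main__":
    # Hypothetical IDs, for illustration only.
    shift_start = datetime(2024, 1, 6, 0, 0, tzinfo=timezone.utc)
    url, body = build_override_request(
        "PSCHED1", "PUSER01", shift_start, shift_start + timedelta(hours=12)
    )
    print(url)
    print(body)
```

To send the request, pass `url`, the headers, and `body` as JSON to an HTTP client of your choice. Keeping the builder separate from the network call makes it easy to review exactly who will be on call, and when, before anything changes in PagerDuty.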

Adding and removing people from the roster

When a new team member is added to the on-call roster, the rotation schedule inevitably shifts. Where possible, the manager adds the new member toward the end of the current rotation so that the current schedule does not change. The manager also raises the change with their team(s) so that everyone has ample time to review it.
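A small sketch can show why adding someone at the end of the rotation leaves the current schedule alone. It assumes a simple weekly round-robin in roster order, which may not match how a given PagerDuty schedule is actually configured, and the names are hypothetical.

```python
from datetime import date, timedelta


def weekly_schedule(roster, first_shift, weeks):
    """Assign one person per week, cycling through the roster in order."""
    return [
        (first_shift + timedelta(weeks=i), roster[i % len(roster)])
        for i in range(weeks)
    ]


if __name__ == "__main__":
    start = date(2024, 1, 1)
    current = ["ana", "ben", "chi"]  # hypothetical roster

    appended = current + ["dee"]     # new member added at the end
    prepended = ["dee"] + current    # new member added at the front

    # Compare who is on call each week under the three rosters.
    for label, roster in [("current", current), ("appended", appended), ("prepended", prepended)]:
        print(label, [name for _, name in weekly_schedule(roster, start, 4)])
```

Under these assumptions, appending keeps every shift in the current pass through the roster unchanged and only alters the schedule from the next cycle onward, whereas adding the new member earlier in the order moves existing shifts immediately.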

Slack

The #eoc-general channel is for informal conversations about the on-call process and quality of life, for coordinating shifts, and for broader announcements.