DBO Escalation Process

About This Page

This page outlines the DBO team’s incident escalation policy.

Shortcuts

SLO and Expectations

Escalation Process

Scope and Qualifiers

  1. GitLab.com S1 and S2 production incidents raised by the Incident Manager On Call, Engineer On Call, and Security teams.
    • NB1: GitLab Dedicated support is consultative at this point. The DBO team is currently not equipped to support Dedicated databases, i.e. it lacks the necessary access and training. This may change in the future; check back here for updates on this topic.
    • NB2: Self-Managed support is discretionary and will be evaluated on a case-by-case basis.
    • NB3: This process is NOT a path to reach the DBO team for non-urgent issues. For non-urgent issues, please create a Request for Help (RFH) issue using this Issue template.
    • NB4: The DBO on-shift is responsible for coordinating warm handoffs during shift changes, especially when there is an ongoing, active incident.

Escalation

  1. The EOC/IM, Development, or Security pages the DBO on-call via PagerDuty (a minimal sketch of such a page appears after this list).
  2. The DBO responds by acknowledging the page and joining the incident channel and Zoom.
  3. The DBO triages the issue and works towards a solution.
  4. If necessary, the DBO reaches out to domain experts for further help.
    • NB1: If the DBO on-call does not respond, the escalation path defined within PagerDuty ensues.
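The page in step 1 is typically raised from the PagerDuty UI, but for illustration here is a minimal sketch of the equivalent trigger through the PagerDuty Events API v2. The routing key, summary, and incident URL are hypothetical placeholders, not the DBO service's actual values.

```python
# Minimal sketch (assumptions noted): trigger a page to the DBO on-call
# through the PagerDuty Events API v2. The routing key below is a
# placeholder for the real key of the DBO service in PagerDuty.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
DBO_ROUTING_KEY = "REPLACE_WITH_DBO_SERVICE_ROUTING_KEY"  # placeholder


def page_dbo_oncall(summary: str, incident_url: str) -> str:
    """Send a trigger event and return the PagerDuty dedup key."""
    event = {
        "routing_key": DBO_ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,        # what the responder sees in the page
            "source": "gitlab.com",    # origin of the incident
            "severity": "critical",    # S1/S2 incidents map to critical
            "custom_details": {"incident_url": incident_url},
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()
    return response.json()["dedup_key"]


if __name__ == "__main__":
    dedup_key = page_dbo_oncall(
        summary="S1: database saturation on GitLab.com",
        incident_url="https://gitlab.com/.../issues/<INCIDENT NUMBER>",
    )
    print(f"Page sent, dedup key: {dedup_key}")
```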

Resources

Responding Guidelines

When responding to an incident, use the procedure below as a guideline to assist both yourself and the members requesting your assistance:

  1. Join the incident Zoom - this can be found bookmarked in the relevant incident Slack channel.
  2. Join the appropriate incident Slack channel for all text-based communications - normally this is #inc-<INCIDENT NUMBER>.
  3. Work with the EOC to determine if a known code path is problematic.
  4. Work with the Incident Manager to ensure that the incident issue is assigned to the appropriate Engineering Manager - if applicable.

Shadowing An Incident Triage Session

Feel free to join any incident triage call if you would like a few rehearsals of how the process usually works. Simply watch for active incidents in #incidents-dotcom and join the Situation Room Zoom call (the link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.

Replaying Previous Incidents

Situation Room recordings from previous incidents are available in this Google Drive folder (internal).

Shadowing A Whole Shift

To get an idea of what’s expected of an on-call DBO and how often incidents occur, it can be helpful to shadow another shift. To do this, simply identify and contact the DBO on-call to let them know you’ll be shadowing. During the shift, keep an eye on #incidents-dotcom for incidents.

Tips & Tricks of Troubleshooting

  1. How to Investigate a 500 error using Sentry and Kibana.
  2. Walkthrough of GitLab.com’s SLO Framework.
  3. Scalability documentation.
  4. Use Grafana and Kibana to look at PostgreSQL data to find the root cause.
  5. Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown (see the sketch after this list).
  6. Let’s make 500s more fun.
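For item 5, here is a minimal sketch of the kind of latency query you might run against a Prometheus-compatible endpoint (Prometheus or Thanos) while investigating an API slowdown. The endpoint URL, metric name, and job label are assumptions for illustration, not GitLab.com's actual values.

```python
# Minimal sketch (assumptions noted): run an instant PromQL query via the
# standard Prometheus HTTP API. The endpoint and metric names below are
# placeholders, not GitLab.com's real ones.
import requests

PROMETHEUS_URL = "https://prometheus.example.com"  # placeholder endpoint


def instant_query(promql: str) -> list:
    """Run an instant query and return the list of result series."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=30,
    )
    response.raise_for_status()
    body = response.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


if __name__ == "__main__":
    # p95 API request latency over the last 5 minutes, assuming a
    # conventional histogram metric and an "api" job label.
    promql = (
        "histogram_quantile(0.95, "
        'sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) '
        "by (le))"
    )
    for series in instant_query(promql):
        print(series["metric"], series["value"])
```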

Tools for Engineers

  1. Training videos of available tools
    1. Visualization Tools Playlist.
    2. Monitoring Tools Playlist.
    3. How to create Kibana visualizations for checking performance.
  2. Dashboard examples; more are available via the dropdown list at the upper-left corner of any dashboard below
    1. Saturation Component Alert.
    2. Service Platform Metrics.
    3. SLAs.
    4. Web Overview.