DBO Escalation Process

About This Page

This page outlines the DBO team’s incident escalation policy.

Shortcuts

SLO and Expectations

Escalation Process

Scope and Qualifiers

  1. GitLab.com S1 and S2 production incidents raised by the Incident Manager On Call, Engineer On Call, and Security teams.
    • NB1: GitLab Dedicated support is consultative at this point. The DBO team is currently not equipped to support Dedicated databases, i.e. it lacks the necessary access and training. This may change in the future; check back here for updates on this topic.
    • NB2: Self-Managed support is discretionary and will be evaluated on a case-by-case basis.
    • NB3: This process is NOT a path to reach the DBO team for non-urgent issues. For non-urgent issues, please create a Request for Help (RFH) issue using this Issue template.
    • NB4: The DBO on-shift is responsible for coordinating warm handoffs during shift changes, especially when there is an ongoing, active incident.

Escalation

  1. The EOC/IM, Development, or Security pages the DBO on-call via PagerDuty (a minimal sketch of such a page appears after this list).
  2. The DBO responds by acknowledging the page and joining the incident channel and Zoom.
  3. The DBO triages the issue and works towards a solution.
  4. If necessary, the DBO reaches out to domain experts for further help.
    • NB1: If the DBO on-call does not respond, the escalation path defined within PagerDuty ensues.
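The page in step 1 is typically raised from the PagerDuty UI, but for illustration here is a minimal sketch of the equivalent trigger through the PagerDuty Events API v2. The routing key, summary, and incident URL are hypothetical placeholders, not the DBO service's actual values.

```python
# Minimal sketch (assumptions noted): trigger a page to the DBO on-call
# through the PagerDuty Events API v2. The routing key below is a
# placeholder for the real key of the DBO service in PagerDuty.
import requests

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
DBO_ROUTING_KEY = "REPLACE_WITH_DBO_SERVICE_ROUTING_KEY"  # placeholder


def page_dbo_oncall(summary: str, incident_url: str) -> str:
    """Send a trigger event and return the PagerDuty dedup key."""
    event = {
        "routing_key": DBO_ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,        # what the responder sees in the page
            "source": "gitlab.com",    # origin of the incident
            "severity": "critical",    # S1/S2 incidents map to critical
            "custom_details": {"incident_url": incident_url},
        },
    }
    response = requests.post(PAGERDUTY_EVENTS_URL, json=event, timeout=10)
    response.raise_for_status()
    return response.json()["dedup_key"]


if __name__ == "__main__":
    dedup_key = page_dbo_oncall(
        summary="S1: database saturation on GitLab.com",
        incident_url="https://gitlab.com/.../issues/<INCIDENT NUMBER>",
    )
    print(f"Page sent, dedup key: {dedup_key}")
```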

Resources

Responding Guidelines

When responding to an incident, use the procedure below as a guideline to assist both yourself and the members requesting your assistance:

  1. Join the incident Zoom - this can be found bookmarked in the relevant incident Slack channel.
  2. Join the appropriate incident Slack channel for all text-based communications - normally this is #inc-<INCIDENT NUMBER>.
  3. Work with the EOC to determine if a known code path is problematic.
  4. Work with the Incident Manager to ensure that the incident issue is assigned to the appropriate Engineering Manager - if applicable.

Shadowing An Incident Triage Session

Feel free to join any incident triage call if you would like a few rehearsals of how the process usually works. Simply watch for active incidents in #incidents-dotcom and join the Situation Room Zoom call (the link can be found in the channel) for synchronous troubleshooting. There is a nice blog post about the shadowing experience.

Replaying Previous Incidents

Situation Room recordings from previous incidents are available in this Google Drive folder (internal).

Shadowing A Whole Shift

To get an idea of what’s expected of an on-call DBO and how often incidents occur, it can be helpful to shadow another shift. To do this, simply identify and contact the DBO on-call to let them know you’ll be shadowing. During the shift, keep an eye on #incidents-dotcom for incidents.

Tips & Tricks of Troubleshooting

  1. How to Investigate a 500 error using Sentry and Kibana.
  2. Walkthrough of GitLab.com’s SLO Framework.
  3. Scalability documentation.
  4. Use Grafana and Kibana to look at PostgreSQL data to find the root cause.
  5. Use Grafana, Thanos, and Prometheus to troubleshoot API slowdown (see the sketch after this list).
  6. Let’s make 500s more fun.
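For item 5, here is a minimal sketch of the kind of latency query you might run against a Prometheus-compatible endpoint (Prometheus or Thanos) while investigating an API slowdown. The endpoint URL, metric name, and job label are assumptions for illustration, not GitLab.com's actual values.

```python
# Minimal sketch (assumptions noted): run an instant PromQL query via the
# standard Prometheus HTTP API. The endpoint and metric names below are
# placeholders, not GitLab.com's real ones.
import requests

PROMETHEUS_URL = "https://prometheus.example.com"  # placeholder endpoint


def instant_query(promql: str) -> list:
    """Run an instant query and return the list of result series."""
    response = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query",
        params={"query": promql},
        timeout=30,
    )
    response.raise_for_status()
    body = response.json()
    if body["status"] != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


if __name__ == "__main__":
    # p95 API request latency over the last 5 minutes, assuming a
    # conventional histogram metric and an "api" job label.
    promql = (
        "histogram_quantile(0.95, "
        'sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) '
        "by (le))"
    )
    for series in instant_query(promql):
        print(series["metric"], series["value"])
```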

Tools for Engineers

  1. Training videos of available tools
    1. Visualization Tools Playlist.
    2. Monitoring Tools Playlist.
    3. How to create Kibana visualizations for checking performance.
  2. Dashboard examples; more are available via the dropdown list at the upper-left corner of any dashboard below
    1. Saturation Component Alert.
    2. Service Platform Metrics.
    3. SLAs.
    4. Web Overview.