Security Operations On-Call Guide

The Security Operations sub-department is collectively responsible for responding to reports of actual or potential security incidents on a 24/7/365 basis.

Prior to scheduling planned time off, Security Operations team members should consult with the team to ensure proper coverage will be available.

Security Operations Managers also share in On-Call responsibilities and need to ensure proper coverage for escalations and emergencies. The Security department maintains a series of On-Call escalations to ensure that every reported incident is responded to in a reasonable timeframe.

On-Call scheduling for SIRT is organized in PagerDuty within the Security Responder policy.

On-Call scheduling for Trust & Safety is organized in PagerDuty within the Trust & Safety Responder policy.

On-Call Security Handoffs

Times

Standard handoff times are as scheduled. However, team members are empowered to agree on a temporary modified handoff schedule, provided everyone involved consents and the team is notified of the change.

SIRT (November - April)

SIRT (April - November)

SIRT times are reflected in the SIRT Handoffs calendar. If both parties agree to change the time, the change should be reflected in the calendar at least 24 hours before the handoff, so that other team members can plan their schedules the night before. Changing the meeting time counts as a notification.

Trust and Safety

SIRT Written Handoff

Incident issues are the SSoT for any incident. Be sure to include any significant incident updates within the incident summaries.

SIRT uses the self-developed tool Handogotchi (check promotion of Handogotchi on Tine) for written handover summaries. Handogotchi reminds the SIRT engineer on call to update incidents and add additional information one hour before handoff time. It automatically sends links to open incidents half an hour before handoff time.

Written handoffs are required to be completed at least half an hour before the end of every shift and are the basis for warm handoffs.
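The timings above can be sketched as a small calculation. This is an illustrative sketch only, not how Handogotchi is implemented: the function and constant names are hypothetical, and it assumes the handoff time coincides with the end of the outgoing shift.

```python
from datetime import datetime, timedelta

# Offsets taken from the text above; the names are hypothetical.
UPDATE_REMINDER_LEAD = timedelta(hours=1)         # reminder to update incidents
OPEN_INCIDENT_LINKS_LEAD = timedelta(minutes=30)  # links to open incidents are sent
WRITTEN_HANDOFF_LEAD = timedelta(minutes=30)      # written handoff must be complete


def handoff_timeline(handoff_time: datetime) -> dict[str, datetime]:
    """Return the key times leading up to a handoff.

    ASSUMPTION: the handoff time is also the end of the outgoing shift.
    """
    return {
        "update_reminder": handoff_time - UPDATE_REMINDER_LEAD,
        "open_incident_links": handoff_time - OPEN_INCIDENT_LINKS_LEAD,
        "written_handoff_due": handoff_time - WRITTEN_HANDOFF_LEAD,
    }


if __name__ == "__main__":
    # Example handoff time chosen for illustration only.
    for name, when in handoff_timeline(datetime(2024, 1, 15, 16, 0)).items():
        print(f"{name}: {when:%H:%M}")
```

Under these assumptions, the written handoff deadline and the automatic links to open incidents fall at the same moment, so the summaries should be ready by the time the links go out.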

For all incidents, we strongly encourage pausing to document after no more than one hour of work, and making it a general habit to keep documentation up to date after every work step. For all questions on documentation, please consult the Incident Report Writing Guide.

SIRT Warm Handoff

SIRT uses warm handoffs to clarify written handoffs and avoid misunderstandings in complex situations. They should take no more than 15 minutes. One person per region is required (outgoing and incoming).

The outgoing region prepares the handoff as described in the section above. The incoming region should familiarize themselves with the written handoff before attending the meeting. Team members should be prepared for warm handoff meetings, so that the meetings will be efficient, with a focus on open questions and clarifications.

Because the incidents are well documented, there is no written agenda for warm handoffs. All significant discussion points that come up must be immediately documented and corrected in the incident issue.

Warm handoffs are only required during the week. This is because warm handoffs would otherwise have to be performed at around midnight on weekends for EMEA.

SIRT On-Call

During all On-Call shifts, the On-Call Engineer fills the role of the "Triage Engineer".

Weekday

Weekday On-Call Engineer Responsibilities

Security Operations provides weekday On-Call coverage using a follow-the-sun model. Weekday On-Call Security Engineers are the team members who cover the On-Call responsibilities during their region’s sunny time.

Weekend

Weekend On-Call Security Responsibilities

The Weekend On-Call Security team member will be responsible for covering On-Call responsibilities from AMER Friday evening until APAC Monday morning according to the established On-Call Security Handoff times.

Weekend On-Call Security Scheduling

Weekend On-Call Security Relief

When scheduled for the Weekend On-Call Security shift, team members should:

SIRT On-Call Paging

On-Call Paging Workflow

The SIRT On-Call paging workflow is currently designed to follow this escalation path:

  1. The first notification goes to our incident Slack channel.
  2. The designated Security Engineer On-Call in the sunny region is paged after 5 minutes of no response.
  3. All Security Engineers in the sunny region are paged after 10 minutes of no response. During the weekend, one person will have volunteered to take sole responsibility for weekend coverage. Team members not designated as the Security Engineer On-Call can and should provide assistance if the Security Engineer On-Call misses the page. Team members who provide weekend assistance should request time off in lieu from their manager, targeting 1:1 (hour for hour) time off immediately following the weekend, as long as proper coverage is available.
  4. The Security Operations manager in the sunny region is paged as a backup after 15 minutes if the team members don’t acknowledge the pages.
  5. Security Managers who volunteer as backups are paged if SIRT does not acknowledge the previous pages after 15 minutes.
  6. The VP of Security Operations is paged if Security Managers don’t acknowledge the pages after 15 minutes.
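The escalation path above can be sketched as data plus a small lookup. This is a minimal illustration, not the actual PagerDuty configuration: the class and function names are hypothetical, and treating each delay as measured from the previous step (rather than from the first notification) is an assumption the text does not confirm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    target: str
    wait_minutes: int  # minutes without a response before this step fires


# Steps and delays as listed above.
# ASSUMPTION: each delay is counted from the previous step, not from the
# first notification. Adjust if the real policy counts differently.
ESCALATION_PATH = [
    EscalationStep("incident Slack channel", 0),
    EscalationStep("Security Engineer On-Call (sunny region)", 5),
    EscalationStep("all Security Engineers (sunny region)", 10),
    EscalationStep("Security Operations manager (sunny region)", 15),
    EscalationStep("volunteer backup Security Managers", 15),
    EscalationStep("VP of Security Operations", 15),
]


def notified_targets(minutes_unacknowledged: int) -> list[str]:
    """List everyone notified so far for a page left unacknowledged."""
    notified = []
    elapsed = 0
    for step in ESCALATION_PATH:
        elapsed += step.wait_minutes
        if elapsed > minutes_unacknowledged:
            break
        notified.append(step.target)
    return notified


if __name__ == "__main__":
    for minutes in (0, 5, 15, 30, 60):
        print(minutes, notified_targets(minutes))
```

Under this reading, a page that stays unacknowledged reaches the VP of Security Operations one hour after the first notification. If the delays are instead counted from the first notification, the later steps would fire sooner.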

SIRT On-Call Paging Duties

The On-Call Engineer’s primary concern is to provide timely and adequate responses to incoming pages. When receiving a page:

  1. Acknowledge the alarm in the corresponding alert Slack channel or through PagerDuty directly.
  2. Review the incident’s GitLab issue and follow the checklists posted there for triaging.

If the alarm is not acknowledged, the Security Incident Manager On-Call will be alerted.

Engineers should acknowledge pages within the first 15 minutes, and start performing initial triage of potential incidents within the first hour of the initial page.
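As a rough illustration, the two response targets above could be checked after the fact like this. The function name and the timestamp arguments are hypothetical, and treating the targets as inclusive limits (exactly 15 minutes still counts as on time) is an assumption.

```python
from datetime import datetime, timedelta

ACK_TARGET = timedelta(minutes=15)  # acknowledge the page within 15 minutes
TRIAGE_TARGET = timedelta(hours=1)  # start initial triage within one hour


def response_targets_met(
    paged_at: datetime,
    acknowledged_at: datetime,
    triage_started_at: datetime,
) -> dict[str, bool]:
    """Compare a page's response times against the targets above.

    Both targets are measured from the initial page.
    ASSUMPTION: the limits are inclusive.
    """
    return {
        "acknowledged_in_time": acknowledged_at - paged_at <= ACK_TARGET,
        "triage_started_in_time": triage_started_at - paged_at <= TRIAGE_TARGET,
    }


if __name__ == "__main__":
    # Timestamps invented for illustration only.
    paged = datetime(2024, 1, 15, 9, 0)
    print(response_targets_met(
        paged,
        acknowledged_at=paged + timedelta(minutes=12),
        triage_started_at=paged + timedelta(minutes=70),
    ))
```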

Security Managers On-Call

In addition to the Security Engineers being On-Call, the Security Managers across the GitLab Security Department act as backups in the event the Security Engineers are unable to acknowledge security pages. PagerDuty will automatically engage the Security Manager On-Call if SIRT doesn’t acknowledge the paging attempts.

It’s the responsibility of the Security Manager On-Call to:

Triage Engineer

During On-Call shifts the Security Engineer On-Call is the Triage DRI and has these core responsibilities:

  1. Acknowledge and triage pages; delegate
  2. Monitor and triage incidents; delegate
  3. Monitor and triage alerts; delegate alerts that are for their own activity; delegate resulting incidents
  4. Continue critical assignments as determined by management

Weekday only:

  1. Monitor and respond to security-related Slack channels; delegate
  2. Improve on-call and incident handling processes - document on-call related problems and improvement opportunities by opening issues and any necessary support tickets (e.g. Devo support).
  3. Continue ongoing work where possible

Triage Engineer Responsibility to delegate

Delegation allows the team to spread the workload across the global team while maintaining adequate coverage and response times. It also minimizes the risk of one single person having to handle spikes in incident volume.

Triage Process

Incident Triage

Incidents are triaged by following the checklist posted by the TriageBot.

Alert Triage

Alert triage is a continuous task on weekdays and a task that should be done at least once a day on weekends.

Any alert can be converted to an incident and assigned to another team member following the triage delegation guidelines above.

Incident Ownership

Whoever is assigned to the incident after the initial triage is responsible for incident resolution. Use the assignee field in the GitLab incident to identify the responsible person. A priority 1 incident should have one assignee per region, so that work continues around the globe until the incident is contained and the priority is lowered. At that point, the incident should go back to having a single assignee and DRI. When multiple people work on one incident, the work is divided into tasks, each with its own assignee.
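The assignment guidance above can be expressed as a simple check. This is a sketch under stated assumptions, not an existing tool: the function name is hypothetical, the region list (AMER, EMEA, APAC) is assumed to be complete, and each assignee is represented only by their region.

```python
# ASSUMPTION: these three regions are the full set used for coverage.
REGIONS = ("AMER", "EMEA", "APAC")


def assignee_problems(priority: int, assignee_regions: list[str]) -> list[str]:
    """Return deviations from the assignment guidance above.

    `assignee_regions` holds one region name per current assignee.
    An empty result means the assignment matches the guidance.
    """
    problems = []
    if priority == 1:
        # Priority 1: exactly one assignee in every region.
        for region in REGIONS:
            count = assignee_regions.count(region)
            if count != 1:
                problems.append(
                    f"P1 incident has {count} assignees in {region}, expected 1"
                )
    elif len(assignee_regions) != 1:
        # Any other priority: a single assignee and DRI.
        problems.append(
            f"incident has {len(assignee_regions)} assignees, expected 1"
        )
    return problems


if __name__ == "__main__":
    print(assignee_problems(1, ["AMER", "EMEA"]))  # APAC is uncovered
    print(assignee_problems(3, ["EMEA"]))          # matches the guidance
```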

Ownership of an incident means being the person responsible for:

Being the responsible person does not imply being the sole person to perform incident tasks. Team members from all departments can be called upon to assist in incident resolution, and these requests should be documented as a task or related issue.

Exceptions

Exceptions to this procedure will be tracked as per the Information Security Policy Exception Management Process.