Performance

Performance Facets

We categorize performance into three facets:

  1. Backend
  2. Frontend
  3. Infrastructure

Backend performance

Backend performance is scoped to the response time of the API, controllers, and command line interfaces (e.g. git).

DRI: Tim Zallman, VP of Engineering, Core Development.

Performance Indicators:

Frontend performance

Frontend performance is scoped to response time of the visible pages and UI components of GitLab.

DRI: Tim Zallman, VP of Engineering, Core Development

Performance Indicators:

Infrastructure performance

Infrastructure performance is scoped to the performance of GitLab SaaS Infrastructure.

DRI: Marin Jankovski, Sr. Director of Infrastructure, SaaS Platforms.

Performance Indicators:

A meta issue tracking the various issues listed here is on the infrastructure tracker.

GitLab’s Application performance

Measurement

Target

Performance of GitLab and GitLab.com is ultimately about the user experience. As also described in the product management handbook, “faster applications are better applications”.

Our current focus is on two indicators:

In the mid term we aim to focus on all of the Web Vitals, introducing a bigger focus on First Input Delay (FID) and Cumulative Layout Shift (CLS). So if routes are already performing well on our main indicators, please extend optimizations to those metrics.

There are many other performance metrics that can be useful in analyzing and prioritizing work; some of these are discussed in the sections below. But the user-experienced LCP is the target for the site as a whole, and should be what everything ties back to in the end.

Groups should closely monitor the user experience with regard to performance, and should also improve perceived performance beyond those measured performance indicators, for example when an action after page load is very slow and takes a lot of time.

What we measure

We measure every end-user performance metric we can through sitespeed.io, with automatic runs every 4 hours. All data we collect helps us analyze specific routes for improvements and bottlenecks. We send the data to a Graphite instance for continuous storage, which backs all of the Grafana dashboards. On top of that, we also save full reports (links are visible by activating the Runs toggle on a sitespeed dashboard) to get deeper insight: slow-motion data, HAR files, and full Lighthouse reports.
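As a rough sketch of the collection pipeline described above: Graphite accepts metrics over a plaintext protocol, one "path value timestamp" line per metric, typically over TCP port 2003. The host and metric path below are illustrative, not GitLab's actual configuration.

```ruby
require 'socket'

# Build one line of Graphite's plaintext protocol: "<path> <value> <timestamp>\n".
def graphite_line(path, value, timestamp)
  "#{path} #{value} #{timestamp}\n"
end

# Hypothetical metric path for a sitespeed.io Speed Index sample.
line = graphite_line('sitespeed_io.gitlab.explore.speedindex', 1336, 1582000000)

# A real sender would write the line over TCP, e.g.:
#   TCPSocket.open('graphite.example.com', 2003) { |s| s.write(line) }
```

Grafana dashboards then query Graphite for these stored series.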

How we measure

We currently measure every 4 hours with an empty cache, the connection throttled to Cable speed, and a medium-CPU machine located in us-central.

Past and Current Performance

The URLs from GitLab.com listed in the table below form the basis for measuring performance improvements - these are heavy use cases. The times indicate time passed from web request to “the average time at which visible parts of the page are displayed” (per the definition of Speed Index). Since the “user” of these URLs is a controlled entity in this case, it represents an external measure of our previous performance metric “Speed Index”.
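Speed Index, as used in this table, is defined as the integral over time of the fraction of the page that is not yet visually complete, so a page that becomes visually complete sooner scores lower (better). A minimal sketch of that calculation, with made-up sample data:

```ruby
# Sampled visual progress: [time in ms, fraction of the viewport visually complete].
samples = [
  [0,    0.0],
  [500,  0.3],
  [1000, 0.8],
  [1500, 1.0],
]

# Left-rectangle approximation of the integral of (1 - completeness) over time.
speed_index = samples.each_cons(2).sum do |(t0, done), (t1, _)|
  (1.0 - done) * (t1 - t0)
end

puts speed_index  # ≈ 950 ms for this made-up data
```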

| Type | 2018-04 | 2019-09 | 2020-02 | Now* |
|------|---------|---------|---------|------|
| Issue List: GitLab FOSS Issue List | 2872 | 1197 | - | N/A |
| Issue List: GitLab Issue List | | | 1581 | |
| Issue: GitLab FOSS #4058 | 2414 | 1332 | 1954 | |
| Issue Boards: GitLab FOSS repo boards | 3295 | 1773 | - | N/A |
| Issue Boards: GitLab repo boards | | | 2619 | |
| Merge request: GitLab FOSS !9546 | 27644 | 2450 | 1937 | |
| Pipelines: GitLab FOSS pipelines | 1965 | 4098 | - | N/A |
| Pipelines: GitLab pipelines | | | 4289 | |
| Pipeline: GitLab FOSS pipeline 9360254 | 4131 | 2672 | 2546 | |
| Project: GitLab FOSS project | 3909 | 1863 | - | N/A |
| Project: GitLab project | | | 1533 | |
| Repository: GitLab FOSS Repository | 3149 | 1571 | - | N/A |
| Repository: GitLab Repository | | | 1867 | |
| Single File: GitLab FOSS Single File Repository | 2000 | 1292 | - | N/A |
| Single File: GitLab Single File Repository | | | 2012 | |
| Explore: GitLab explore | 2346 | 1354 | 1336 | |
| Snippet: GitLab Snippet 1662597 | 1681 | 1082 | 1378 | |

*To access the sitespeed Grafana dashboards you need to be logged in to your Google account

Note: Since this table spans the time before and after the move to a single codebase, we kept GitLab FOSS pages next to the GitLab ones to enable comparisons, even though they are not exactly the same project.

All Sitespeed Dashboards

Sitespeed - Site summary

Sitespeed - Page summary

Sitespeed - Page timing summaries

If you activate the Runs toggle, you will see annotations with links to all full reports. Currently we are running measurements every 2 hours.


Steps

Web Request

All items that start with the tachometer symbol represent a step in the flow that we measure. Wherever possible, the tachometer icon links to the relevant dashboard in our monitoring. Each step in the listing below links back to its corresponding entry in the goals table.

Consider the scenario of a user opening their browser, and surfing to their dashboard by typing gitlab.com/dashboard, here is what happens:

  1. User request
    1. User enters gitlab.com/dashboard in their browser and hits enter
    2. Lookup IP in DNS (not measured)
      • Browser looks up IP address in DNS server
      • DNS request goes out and comes back (typically ~10-20 ms, [data?]; often it is already cached, in which case it is faster).
      • For more details on the steps from browser to application, enjoy reading https://github.com/alex/what-happens-when
    3. Browser to Azure LB (not measured)
      • Now that the browser knows where to find the IP address, browser sends the web request (for gitlab.com/dashboard) to Azure’s load balancer (LB).
  2. Backend processes
    1. Azure LB to HAProxy (not measured)
      • Azure’s load balancer determines where to route the packet (request), and sends the request to our Frontend Load Balancer(s) (also referred to as HAProxy).
    2. HAProxy SSL with browser (not measured)
      • HAProxy (load balancer) does SSL negotiation with the browser
    3. HAProxy to NGINX (not measured)
      • HAProxy forwards the request to NGINX in one of our front end workers. In this case, since we are tracking a web request, it would be the NGINX box in the “Web” box in the production-architecture diagram; but alternatively the request can come in via API or a git command from the command line, hence the API, and git “boxes” in that diagram.
      • Since all of our servers are in ONE Azure VNET, the overhead of SSL handshake and teardown between HAProxy and NGINX should be close to negligible.
    4. NGINX buffers request (not measured)
      • NGINX gathers all network packets related to the request (“request buffering”). The request may be split into multiple packets by the intervening network, for more on that, read up on MTUs.
      • In other flows, this won’t be true. Specifically, request buffering is switched off for LFS.
    5. NGINX to Workhorse (not measured)
      • NGINX forwards the full request to Workhorse (in one combined request).
    6. Workhorse distributes request
      • Workhorse splits the request into parts to forward to:
      • Unicorn. Time spent waiting for Unicorn to pick up a request is HTTP queue time.
      • Gitaly [not in this scenario, but not measured in any case]
      • NFS (git clone through HTTP) [not in this scenario, but not measured in any case]
      • Redis (long polling) [not in this scenario, but not measured in any case]
    7. Unicorn calls services
      • Unicorn (often just called “Rails”, or the “application server”) translates the request into a Rails controller request; in this case RootController#index. The round trip time it takes for a request to start in Unicorn and leave Unicorn is what we call Transaction Timings. Rails controller requests are sent to (and data is received from):
      • PostgreSQL (SQL timings),
      • NFS (git timings),
      • Redis (cache timings).
      • In this gitlab.com/dashboard example, the controller addresses all three.
      • There are usually multiple SQL (or file, or cache, etc.) calls for a given controller request. These add to the overall timing, especially since they are sequential. For example, in this scenario, there are 29 SQL calls (search for Load) when this particular user hits gitlab.com/dashboard/issues. The number of SQL calls will depend on how many projects the person has, how much may already be in cache, etc.
      • Rails tackles the steps within a controller request sequentially. In other words, if it needs to make calls out to the database and to git, it is not set up to do those in parallel, but rather has to wait for the response to the first step before proceeding to the next step.
      • In the Rails stack, middleware typically adds to the number of round trips to Redis, NFS, and PostgreSQL, per controller call, in addition to the timings of Rails controllers. Middleware is used for {session state, user identity, endpoint authorization, rate limiting, logging, etc} while the controllers typically have at least one round trip for each of {retrieve settings, cache check, build model views, cache store, etc.}. Each such roundtrip is estimated to take < 10 ms.
    8. Unicorn constructs Views
      • The construction of views can take a long time (view timings). In some controllers, data is gathered first after which a view is constructed. In other controllers, data is gathered from within a View, so that the view timing in those cases includes the time it took to call NFS, PostgreSQL, Redis, etc. And in many cases, both are done.
      • A particular view in Rails will often be constructed from multiple partial views. These will be used from a template file, specified by the controller action, that is, itself, generally included within a layout template. Partials can include other partials. This is done for good code organization and reuse. As an example, when the particular user from the example above loads gitlab.com/dashboard/issues, there are 56 nested / partial views rendered (search for View::)
      • Partial views may be cached via various Rails techniques, such as Fragment Caching. In addition, GitLab has a Markdown cache stored in the database that is used to speed up the conversion of Markdown to HTML.
      • Perceived performance in the way of First Paint can be affected by how much of the content of a view is rendered by the backend vs. sending a “minimal” html blob to the user and relying on Javascript / AJAX / etc. to fetch additional elements that take the page from First Paint to “Fully Loaded”. See the section about the frontend for more on this.
    9. Unicorn makes HTML (not measured)
      • Once the Views are built, Unicorn completes making the “HTML blob” that is then returned to the browser.
      • Some of these blobs are expensive to compute, and are sometimes hard-coded to be sent from Unicorn to Redis (i.e. to cache) once rendered.
    10. HTML to Browser (not measured)
  3. Render Page
    1. First Byte
    • The time when the browser receives the first byte. In addition to everything in the backend, this also depends on network speed. In the dashboard linked to by the tachometer above, First Byte is measured from a Digital Ocean box in the US with relatively little network lag thus representing an estimate of internal First Byte. Past performance on first byte is recorded elsewhere on this page.
    • For any page, you can use your browser’s “inspect” tool to look at “TTFB” (time to first byte).
    • First Byte - External is measured for a hand-selected set of URLs using SiteSpeed.
    2. Speed Index
    • Browser parses the HTML blob and sends out further requests to GitLab.com to fetch assets such as javascript bundles, CSS, images, and webfonts.
    • The timing of this step depends (amongst other things) on the number and the size of assets, as well as network speed. For each static asset, there is a round trip: for cached assets, browser → NGINX (confirms the cached asset is still valid) → browser; for non-cached or expired cached assets, browser → Workhorse (grabs the asset from its local cache) → browser; for a page served through GitLab Pages, browser → pages daemon (an independent service in the architecture) → browser.
    • Stylesheets can block page rendering by default, which can lead to unnecessary delays in page rendering.
    • Starting in 9.5, scripts won’t block rendering anymore as they are loaded with defer="true", so they are parsed and executed in the same order as they are called, but only after HTML + CSS have been rendered.
    • Enough meaningful content is rendered on screen to calculate the “Speed Index”.
    3. Fully Loaded
    • When the scripts are loaded, Javascript compiles and evaluates them within the page.
    • On some pages, we use AJAX to allow for async loading. The AJAX call can be triggered by all kinds of things, for example a frontend element (a button) or the DOMContentLoaded event. The new call is for a new URL; such requests are routed through either the Web or API workers, invoke their respective Rails controllers on the backend, and return the requested data (HTML, JSON, etc.). For example, the calendar and activity feeds on a username page gitlab.com/username are two separate AJAX calls, triggered by DOMContentLoaded. (The DOMContentLoaded event “marks the point when both the DOM is ready and there are no stylesheets that are blocking JavaScript execution”, taken from an article about the critical rendering path.) The alternative to using AJAX would be to include the full Rails code to generate the calendar and activity feed within the same controller that is called by the gitlab.com/username URL, which would lead to a slower First Paint since it involves more calls to the database etc.
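The cost of the sequential behaviour described in the backend steps above can be illustrated with a toy calculation (all timings made up): because Rails executes the SQL, git, and Redis round trips of a controller action one after another, the transaction time is roughly the sum of every round trip, not the slowest one.

```ruby
# Made-up per-round-trip latencies (ms) for one hypothetical controller action.
round_trips_ms = {
  sql:   [4, 7, 3, 12],
  git:   [25],
  redis: [1, 1, 2],
}

# Sequential execution: every round trip adds to the total.
sequential_ms = round_trips_ms.values.flatten.sum

# Hypothetical lower bound if the three services could be queried in parallel:
# the slowest single service group dominates.
parallel_floor_ms = round_trips_ms.values.map(&:sum).max

puts "sequential: #{sequential_ms} ms, parallel floor: #{parallel_floor_ms} ms"
```

This is why reducing the number of round trips (caching, batching queries) tends to matter more than shaving each individual call.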

Git Commit Push

First read about the steps in a web request above, then pick up the thread here.

After pushing to a repository, e.g. from the web UI:

  1. In a web browser, make an edit to a repo file, type a commit message, and hit “Commit”
  2. NGINX receives the git commit and passes it to Workhorse
  3. Workhorse launches a git-receive-pack process (on the workhorse machine) to save the new commit to NFS
  4. On the workhorse machine, git-receive-pack fires a git hook to trigger GitLab Shell.
    • GitLab Shell accepts Git payloads pushed over SSH and acts upon them (e.g. by checking if you’re authorized to perform the push, scheduling the data for processing, etc).
    • In this case, GitLab Shell provides the post-receive hook, and the git-receive-pack process passes along details of what was pushed to the repo to the post-receive hook. More specifically, it passes a list of three items: old revision, new revision, and ref (e.g. tag or branch) name.
  5. Workhorse then passes the post-receive hook to Redis, which is the Sidekiq queue.
    • Workhorse is informed that the push succeeded or failed (it could have failed due to the repo not being available, Redis being down, etc.)
  6. Sidekiq picks up the job from Redis and removes the job from the queue
  7. Sidekiq updates PostgreSQL
  8. Unicorn can now query PostgreSQL.
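The old revision, new revision, and ref triple mentioned in step 4 arrives on the post-receive hook's standard input, one line per updated ref. A minimal sketch of parsing it (the SHAs below are shortened and made up):

```ruby
# A Git post-receive hook reads lines of the form "<old-sha> <new-sha> <ref-name>"
# from stdin; here we parse a single made-up line instead of reading $stdin.
stdin_line = "95790bf891e7 8910462a9a3a refs/heads/master"
old_rev, new_rev, ref = stdin_line.split(' ', 3)

# GitLab Shell forwards these three values so background jobs can work out
# which branch or tag changed and which commits were added.
puts "ref #{ref}: #{old_rev} -> #{new_rev}"
```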

Goals

Web Request

Consider the scenario of a user opening their browser, and surfing to their favorite URL on GitLab.com. The steps are described in the section on “web request”. In this table, the steps are measured and goals for improvement are set.

Guide to this table:

| Step | # per request | p99 Q2-17 | p99 Now | p99 Q3-17 goal | Issue links and impact |
|------|---------------|-----------|---------|----------------|------------------------|
| USER REQUEST | | | | | |
| Lookup IP in DNS | 1 | ~10 | ? | ~10 | Use a second DNS provider |
| Browser to Azure LB | 1 | ~10 | ? | ~10 | |
| BACKEND PROCESSES | | | | | Extend monitoring horizon |
| Azure LB to HAProxy | 1 | ~2 | ? | ~2 | |
| HAProxy SSL with Browser | 1 | ~10 | ? | ~10 | Speed up SSL |
| HAProxy to NGINX | 1 | ~2 | ? | ~2 | |
| NGINX buffers request | 1 | ~10 | ? | ~10 | |
| NGINX to Workhorse | 1 | ~2 | ? | ~2 | |
| Workhorse distributes request | 1 | | | | Adding monitoring to workhorse |
| Workhorse to Unicorn | 1 | 18 | | 10 | Adding Unicorns |
| Workhorse to Gitaly | ? | | | | |
| Workhorse to NFS | ? | | | | |
| Workhorse to Redis | ? | | | | |
| Unicorn calls services | 1 | 2500 | | 1000 | Allow more GitLab internals monitoring |
| Unicorn ↔ Postgres | | 250 | | 100 | Speed up slow queries |
| Unicorn ↔ NFS | | 460 | | 200 | Move to Gitaly - sample result |
| Unicorn ↔ Redis | | 18 | | | |
| Unicorn constructs Views | | 1500 | | | |
| Unicorn makes HTML | | | | | |
| HTML to Browser | | | | | |
| Unicorn to Workhorse | 1 | ~2 | ? | ~2 | |
| Workhorse to NGINX | 1 | ~2 | ? | ~2 | |
| NGINX to HAProxy | 1 | ~2 | ? | ~2 | Compress HTML in NGINX |
| HAProxy to Azure LB | 1 | ~2 | ? | ~2 | |
| Azure LB to Browser | 1 | ~20 | ? | ~20 | |
| RENDER PAGE | | | | | |
| First Byte (see note 1) | | 1080 - 6347 | | 1000 | |
| Speed Index (see note 2) | | 3230 - 14454 | | 2000 | Remove inline scripts, Defer script loading when possible, Lazy load images, Set up a CDN for faster asset loading, Use image resizing in CDN |
| Fully Loaded (see note 3) | | 6093 - 14003 | | not specified | Enable webpack code splitting |

Notes:

Git Commit Push

Table to be built; merge requests welcome!

Modifiers

For any performance metric, the following modifiers can be applied:

First byte

External

The timing history for First Byte is listed in the table below (click on the tachometer icons for current timings). All times are in milliseconds.

| Type | End of Q4-17 | Now |
|------|--------------|-----|
| Issue: GitLab CE #4058 | 857 | |
| Merge request: GitLab CE !9546 | 18673 | |
| Pipeline: GitLab CE pipeline 9360254 | 1529 | |
| Repo: GitLab CE repo | 1076 | |

Internal

To go a little deeper and measure the performance of the application and infrastructure without consideration for frontend and network aspects, we look at “transaction timings” as recorded by Unicorn. These timings can be seen on the Rails Controller dashboard for each URL that is accessed.

Availability and Performance labels

Availability

This section has been moved to Availability severity.

Performance

To clarify the priority of issues that relate to GitLab.com’s performance, you should add the ~performance label as well as a severity label. There are two factors that influence which severity label you should pick:

  1. How frequently something is used.
  2. How likely it is for something to cause an outage.

For strictly performance-related work you can use the Controller Timings Overview Grafana dashboard. This dashboard categorizes data into three different categories, each with its associated severity label:

  1. Frequently Used: ~severity::2
  2. Commonly Used: ~severity::3
  3. Rarely Used: ~severity::4

This means that if a controller (e.g. UsersController#show) is in the “Frequently Used” category you assign it the ~severity::2 label.

For database-related timings you can also use the SQL Timings Overview dashboard. This is the dashboard primarily used by the Database Team to determine the AP label to use for database-related performance work.

Database Performance

Some general notes about parameters that affect database performance, at a very crude level.