xdsclient: support fallback within an authority by easwars · Pull Request #7701 · grpc/grpc-go (original) (raw)

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Conversation51 Commits4 Checks14 Files changed

Conversation

This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters

[ Show hidden characters]({{ revealButtonHref }})

easwars

#a71-xds-fallback
Fixes #6902

This PR adds support to fallback to a lower priority server (within an authority) when a higher priority goes down and also adds support to revert to a higher priority server when it comes back up.

This PR also adds e2e style tests for the scenarios specified in the fallback interop test spec.

RELEASE NOTES:

@easwars

@easwars easwars changed the titlexdsclient: complete refactor and fallback support xdsclient: support fallback within an authority

Nov 1, 2024

@easwars easwars marked this pull request as ready for review

November 1, 2024 16:40

@codecov Codecov

zasweq

@easwars

purnesh42H

fallbackChannel := a.xdsChannelConfigs[fallbackServerIdx]
// If the server to fallback to already has an xdsChannel, it means that
// this connectivity error is from a server with a higher priority. There

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't quite understand this comment. If there is an error from higher priority server, we should fallback to a lower priority server. How is having an existing channel for fallback server makes any difference?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets say the authority has two servers: primary and fallback. We start off with primary, and say it fails and therefore we switch to fallback. And lets say the connection to fallback works and we get all resources from it and we are happy. Now, we get another error from the primary (in fact, we will keep getting stream errors from it since we retry the stream with backoff). At this point, we will see that we already have a channel to the next server in the list which is the fallback server, and therefore we have nothing to do here.

// resource that has not yet been cached.
//
// Only executed in the context of a serializer callback.
func (a *authority) uncachedWatcherExists() bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: name seems to convey "if any uncached watcher exist" where it should convey if there is a watcher that is looking for an uncached resource. May be rename to something like "uncachedResourceExistToWatch" or flip the bool to "allWatchableResourceCached" or simply "areAllRequestedResourceCached"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// existing resources.
//
// Only executed in the context of a serializer callback.
func (a *authority) triggerFallbackOnStreamFailure(failingServerConfig *bootstrap.ServerConfig) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this function is simply falling back and not checking any stream failures. I guess that's done by caller? In that case name can be simply "triggerFallback". Also, in docstring we should mention it switches to next fallback server from the list (i.e. next lower priority than current failing one) and if no more servers below current then it returns as no-op. May be name can be more explicit "switchToNextFallbackServerIfAvailable"?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to go with fallbackToNextServerIfPossible since Available is a term that can be viewed differently in this context, i.e. that the server is actually available for traffic, which is not what we check here.

purnesh42H

// unsubscribe, and remove the channel from the list of
// channels that this resource is subscribed to.
if xc == cfg {
state.xdsChannelConfigs = append(state.xdsChannelConfigs[:idx], state.xdsChannelConfigs[idx+1:]...)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this panic due to idx + 1 because of all these mutations?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah if it hits the last index does the index+1: panic, or is it just like nil or something?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to go with a map instead of a slice and that significantly simplifies this piece of code.

zasweq

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some testing comments

@easwars

zasweq

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

// unsubscribe, and remove the channel from the list of
// channels that this resource is subscribed to.
if xc == cfg {
xc.xc.unsubscribe(rType, resourceName)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I still get confused by xc.xc, can we switch the top level symbol to xcc or something

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, in a few other places as well. Also, I renames the struct fields to be more descriptive.

// channels that this resource is subscribed to.
if xc == cfg {
xc.xc.unsubscribe(rType, resourceName)
delete(state.xdsChannelConfigs, xc)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two operations can only hit once per xDS Channel right? I think you can break out of the for once this hits.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Also, ended up inverting the conditional.

@easwars

@easwars easwars deleted the xdsclient_ads_stream branch

November 6, 2024 19:52

misvivek pushed a commit to misvivek/grpc-go that referenced this pull request

Nov 7, 2024

@easwars @misvivek