xdsclient: support fallback within an authority by easwars · Pull Request #7701 · grpc/grpc-go (original) (raw)
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Conversation51 Commits4 Checks14 Files changed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.Learn more about bidirectional Unicode characters
[ Show hidden characters]({{ revealButtonHref }})
#a71-xds-fallback
Fixes #6902
This PR adds support to fallback to a lower priority server (within an authority
) when a higher priority goes down and also adds support to revert to a higher priority server when it comes back up.
This PR also adds e2e style tests for the scenarios specified in the fallback interop test spec.
RELEASE NOTES:
- xdsclient: add support to fallback to lower priority servers when higher priority ones are down
easwars changed the title
xdsclient: complete refactor and fallback support xdsclient: support fallback within an authority
easwars marked this pull request as ready for review
fallbackChannel := a.xdsChannelConfigs[fallbackServerIdx] |
---|
// If the server to fallback to already has an xdsChannel, it means that |
// this connectivity error is from a server with a higher priority. There |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't quite understand this comment. If there is an error from higher priority server, we should fallback to a lower priority server. How is having an existing channel for fallback server makes any difference?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets say the authority has two servers: primary and fallback. We start off with primary, and say it fails and therefore we switch to fallback. And lets say the connection to fallback works and we get all resources from it and we are happy. Now, we get another error from the primary (in fact, we will keep getting stream errors from it since we retry the stream with backoff). At this point, we will see that we already have a channel to the next server in the list which is the fallback server, and therefore we have nothing to do here.
// resource that has not yet been cached. |
---|
// |
// Only executed in the context of a serializer callback. |
func (a *authority) uncachedWatcherExists() bool { |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: name seems to convey "if any uncached watcher exist" where it should convey if there is a watcher that is looking for an uncached resource. May be rename to something like "uncachedResourceExistToWatch" or flip the bool to "allWatchableResourceCached" or simply "areAllRequestedResourceCached"
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// existing resources. |
---|
// |
// Only executed in the context of a serializer callback. |
func (a *authority) triggerFallbackOnStreamFailure(failingServerConfig *bootstrap.ServerConfig) { |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this function is simply falling back and not checking any stream failures. I guess that's done by caller? In that case name can be simply "triggerFallback". Also, in docstring we should mention it switches to next fallback server from the list (i.e. next lower priority than current failing one) and if no more servers below current then it returns as no-op. May be name can be more explicit "switchToNextFallbackServerIfAvailable"?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to go with fallbackToNextServerIfPossible
since Available
is a term that can be viewed differently in this context, i.e. that the server is actually available for traffic, which is not what we check here.
// unsubscribe, and remove the channel from the list of |
---|
// channels that this resource is subscribed to. |
if xc == cfg { |
state.xdsChannelConfigs = append(state.xdsChannelConfigs[:idx], state.xdsChannelConfigs[idx+1:]...) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this panic due to idx + 1 because of all these mutations?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah if it hits the last index does the index+1: panic, or is it just like nil or something?
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to go with a map instead of a slice and that significantly simplifies this piece of code.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some testing comments
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
// unsubscribe, and remove the channel from the list of |
---|
// channels that this resource is subscribed to. |
if xc == cfg { |
xc.xc.unsubscribe(rType, resourceName) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I still get confused by xc.xc, can we switch the top level symbol to xcc or something
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, in a few other places as well. Also, I renames the struct fields to be more descriptive.
// channels that this resource is subscribed to. |
---|
if xc == cfg { |
xc.xc.unsubscribe(rType, resourceName) |
delete(state.xdsChannelConfigs, xc) |
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two operations can only hit once per xDS Channel right? I think you can break out of the for once this hits.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Also, ended up inverting the conditional.
easwars deleted the xdsclient_ads_stream branch
misvivek pushed a commit to misvivek/grpc-go that referenced this pull request