PyBal - Wikitech

PyBal is an automated manager for LVS. We use it to continuously monitor Varnish or Apache servers and change the LVS load balancer pooling and weights accordingly. The service is written in Python using the Twisted framework.

For more information about Wikimedia's LVS setup in general, see LVS.

Features

PyBal distinguishes itself from lvsmon in a few aspects:

Setup

PyBal is currently installed on our LVS hosts, in the directory /usr/sbin. Start or stop it with systemctl start pybal.service or systemctl stop pybal.service.

Configuration is in /etc/pybal/. The file pybal.conf defines the LVS service parameters.

The list of pooled hosts resides on whichever host has the pybal::web class installed via puppet, in the directory /srv/pybal-config, with one file per LVS service. It is reachable at the internal address http://configuration-master.$site.wmnet/pybal. Attributes:

The format should be fairly self-explanatory; the files more or less use Python assignment / dictionary syntax.
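Because the files use Python literal syntax, a line can be read without executing any code by using ast.literal_eval. The following is a minimal sketch under stated assumptions: the keys 'host', 'weight' and 'enabled' and the hostnames are illustrative and do not come from this page, and PyBal's own parser may work differently.

```python
import ast


def parse_pool_lines(text):
    """Parse one dictionary literal per line; skip blank lines and '#' comments."""
    servers = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entry = ast.literal_eval(line)  # evaluates literals only, never code
        if not isinstance(entry, dict):
            raise ValueError("expected a dictionary literal: %r" % line)
        servers.append(entry)
    return servers


# Hypothetical file content; the keys and hostnames are illustrative.
SAMPLE = """
{'host': 'cp-example1.example.wmnet', 'weight': 10, 'enabled': True}
{'host': 'cp-example2.example.wmnet', 'weight': 5, 'enabled': False}
"""

pooled = [s["host"] for s in parse_pool_lines(SAMPLE) if s["enabled"]]
print(pooled)
```

Using literal_eval rather than eval means a malformed or malicious line raises an error instead of running arbitrary code.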

PyBal supports multiple LVS services through a single instance and configuration file pybal.conf, e.g.:

[text]
protocol = tcp
ip = 145.97.39.155
port = 80
scheduler = wlc
config = file:///etc/pybal/text-squids

[images]
protocol = tcp
ip = 145.97.39.156
port = 80
scheduler = wlc
config = file:///etc/pybal/upload-squids
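To illustrate how one file can describe several services, here is a minimal sketch that reads the two sections above with Python's standard configparser module. It assumes the file is plain INI syntax; PyBal's own loading code is not shown on this page and may differ. The function name load_services is hypothetical.

```python
import configparser

SAMPLE = """
[text]
protocol = tcp
ip = 145.97.39.155
port = 80
scheduler = wlc
config = file:///etc/pybal/text-squids

[images]
protocol = tcp
ip = 145.97.39.156
port = 80
scheduler = wlc
config = file:///etc/pybal/upload-squids
"""


def load_services(text):
    """Return a dict mapping each section name to its service parameters."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    services = {}
    for name in parser.sections():
        section = parser[name]
        services[name] = {
            "protocol": section["protocol"],
            "ip": section["ip"],
            "port": section.getint("port"),  # converted to an integer
            "scheduler": section["scheduler"],
            "config": section["config"],
        }
    return services


services = load_services(SAMPLE)
for name, svc in services.items():
    print(name, svc["ip"], svc["port"], svc["config"])
```

Each section name becomes one LVS service, so adding a service is a matter of adding another section to the same file.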

Beware: the code as checked out from git has DryRun = True set in ipvs.py, so it does not modify any actual IPVS state and only shows the commands, which is useful for debugging. This should be changed to a command-line option, but for now edit that file and set DryRun = False.
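The dry-run behaviour described above follows a common pattern, sketched below. This is not the code in ipvs.py: the class CommandRunner and its methods are hypothetical, and the sketch only shows the general idea of printing a command instead of executing it while a flag is set.

```python
import subprocess


class CommandRunner:
    """Hypothetical helper: print commands in dry-run mode, run them otherwise."""

    def __init__(self, dry_run=True):
        self.dry_run = dry_run
        self.log = []  # every command that was requested, kept for inspection

    def run(self, argv):
        self.log.append(list(argv))
        if self.dry_run:
            # Show what would be executed without touching any state.
            print("DRY RUN:", " ".join(argv))
            return None
        return subprocess.run(argv, check=True)


runner = CommandRunner(dry_run=True)
result = runner.run(["ipvsadm", "--list", "--numeric"])
```

With dry_run=True nothing is executed, so the sketch is safe to run on a machine without ipvsadm installed.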

The configuration files are generated via puppet.

How to

See LVS.

Updating PyBal on LVS instances

After testing new releases on pybal-test2003.codfw.wmnet, PyBal should be updated site-by-site in the usual order.

Within each datacenter, first update the passive instances and check that everything is fine, then move on to the active instances. As long as BGP is enabled, redirecting traffic from an active instance to the passive one should be as easy as stopping PyBal on the active instance. After stopping it, you should see the number of active connections increase in the ipvsadm output on the passive instance.
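One way to compare active connections before and after stopping PyBal is to sum the ActiveConn column of the ipvsadm listing. The sketch below is only an illustration: the sample text is made up (documentation addresses, invented counters) and assumes the usual column layout of a numeric ipvsadm listing; the function name is hypothetical.

```python
# Illustrative output only; addresses and counters are invented.
SAMPLE_OUTPUT = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:80 wlc
  -> 198.51.100.1:80              Route   10     120        35
  -> 198.51.100.2:80              Route   10     98         41
"""


def total_active_connections(output):
    """Sum the ActiveConn column over all real-server rows."""
    total = 0
    for line in output.splitlines():
        fields = line.split()
        # Real-server rows start with "->" and carry numeric counters;
        # the column header row also starts with "->" but is not numeric.
        if len(fields) == 6 and fields[0] == "->" and fields[4].isdigit():
            total += int(fields[4])
    return total


print(total_active_connections(SAMPLE_OUTPUT))
```

Running the same sum on the passive instance before and after the switch gives a quick sanity check that traffic has moved.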

After updating a BGP enabled PyBal instance you can check that everything is good on the router side with the following commands:

show bgp summary | match

show bgp neighbor

show route receive-protocol bgp

On the PyBal instance the following commands are useful:

/usr/local/lib/nagios/plugins/check_pybal_ipvs_diff --prometheus-url http://:9100/metrics

Testing

New PyBal releases can be tested on pybal-test2003.codfw.wmnet. The systems are deployed with role(pybaltest). Configuration example:

/etc/pybal/pybal.conf on pybal-test2001

[global]
bgp = yes
bgp-local-asn = 64496
bgp-peer-address = 10.192.16.140
#bgp-as-path = 64460
bgp-nexthop-ipv4 = 10.192.16.139
bgp-nexthop-ipv6 = 2620:0:860:101:10:192:1:3
instrumentation = yes
instrumentation_ips = [ '127.0.0.1', '::1', '10.192.16.139' ]

# Lower is preferred
bgp-med = 50

# Service definition
[textlb6_80]
protocol = tcp
ip = 2620:0:860:ed1a::1
port = 80
scheduler = sh

config = etcd://conf2001.codfw.wmnet/conftool/v1/pools/codfw/cache_text/varnish-fe/

depool-threshold = .5
monitors = ["IdleConnection"]

# IdleConnection monitor configuration
idleconnection.max-delay = 300
idleconnection.timeout-clean-reconnect = 3

/etc/pybal/pybal.conf on pybal-test2003

[global]
bgp = yes
bgp-local-asn = 64496
bgp-peer-address = 10.192.16.140
#bgp-as-path = 64460
bgp-nexthop-ipv4 = 10.192.16.141
bgp-nexthop-ipv6 = 2620:0:860:101:10:192:1:3
instrumentation = yes
instrumentation_ips = [ '127.0.0.1', '::1', '10.192.16.141' ]
#Lower is preferred
bgp-med = 100

# Service definition

[...]

A Quagga instance is installed on pybal-test2002 and can be used to test the BGP component of PyBal:

log file /var/log/quagga/quagga.log
!
debug zebra rib
debug bgp events
debug bgp updates
debug bgp zebra
!
password SECRET
!
interface eth0
 ipv6 nd suppress-ra
!
interface lo
!
router bgp 64460
 bgp router-id 10.192.16.140
 no bgp default ipv4-unicast
 network 127.0.0.2/32
 neighbor 10.192.16.139 remote-as 64496
 neighbor 10.192.16.139 description PyBal on pybal-test2001
 neighbor 10.192.16.139 activate
 neighbor 10.192.16.139 prefix-list NONE out

 neighbor 10.192.16.141 remote-as 64496
 neighbor 10.192.16.141 description PyBal on pybal-test2003
 neighbor 10.192.16.141 activate
 neighbor 10.192.16.141 prefix-list NONE out
!
 address-family ipv6
 network 2620:0:860:102::/64
 neighbor 10.192.16.139 activate
 neighbor 10.192.16.141 activate
 exit-address-family
!
ip prefix-list NONE seq 5 deny any
!
ip forwarding
ipv6 forwarding
!
line vty
!

The IPv4 routing table can be inspected with:

vtysh -c 'show ip route'

Similarly, to inspect the IPv6 routing table:

vtysh -c 'show ipv6 route'

Alerts

PyBal IPVS diff check

The alert fires whenever PyBal and IPVS disagree on the current configuration.

Services in IPVS but unknown to PyBal

For example, after services are removed from PyBal (or their ports are changed), the stale IPVS virtual services might not get removed. The check then reports: CRITICAL: Services in IPVS but unknown to PyBal: set(['addr:port']). In such cases it is sufficient to delete the stale TCP service from the LVS pair:

ipvsadm --delete-service --tcp-service addr:port

Services known to PyBal but not to IPVS

This alert is usually temporary. It is caused by new services being set up (i.e. they are in etcd, so PyBal knows about them) while PyBal has not been restarted yet and thus has not had a chance to program IPVS correctly. The fix is to restart PyBal.

For example:

PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.54:443])
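Both mismatch alerts above amount to a set difference between the services PyBal knows about and the services present in IPVS. The sketch below illustrates that idea only; it is not the implementation of check_pybal_ipvs_diff, the function names are hypothetical, and apart from 10.2.1.54:443 the addresses are invented.

```python
def ipvs_diff(pybal_services, ipvs_services):
    """Return (only_in_ipvs, only_in_pybal) as sorted lists of 'addr:port'."""
    pybal_services = set(pybal_services)
    ipvs_services = set(ipvs_services)
    only_in_ipvs = sorted(ipvs_services - pybal_services)
    only_in_pybal = sorted(pybal_services - ipvs_services)
    return only_in_ipvs, only_in_pybal


def describe(only_in_ipvs, only_in_pybal):
    """Turn the two differences into human-readable messages."""
    messages = []
    if only_in_ipvs:
        messages.append(
            "Services in IPVS but unknown to PyBal: " + ", ".join(only_in_ipvs))
    if only_in_pybal:
        messages.append(
            "Services known to PyBal but not to IPVS: " + ", ".join(only_in_pybal))
    return messages or ["OK: no difference between PyBal and IPVS"]


# Illustrative data: one stale IPVS service and one service not yet in IPVS.
pybal = {"10.2.1.54:443", "10.2.1.1:80"}
ipvs = {"10.2.1.1:80", "10.2.1.99:8080"}
stale, missing = ipvs_diff(pybal, ipvs)
for line in describe(stale, missing):
    print(line)
```

In this example the stale entry would be cleaned up with ipvsadm, while the missing one would appear after a PyBal restart.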

Pybal service has not been restarted

This alert means that the PyBal configuration file was changed but the service has not been restarted. To fix this, simply restart the PyBal service on the respective host. Note that, as expected, the restart picks up all changes made to the configuration, not only the most recent one.
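The condition behind this alert can be pictured as a timestamp comparison: the configuration file is newer than the running service. How the real check obtains these timestamps is not described on this page, so the sketch below is an assumption. It uses a temporary file in place of the configuration file, and needs_restart is a hypothetical name.

```python
import os
import tempfile
import time


def needs_restart(config_path, service_start_time):
    """True if the file was modified after the service started (epoch seconds)."""
    return os.path.getmtime(config_path) > service_start_time


# A temporary file stands in for the real configuration file.
with tempfile.NamedTemporaryFile(suffix=".conf", delete=False) as handle:
    demo_path = handle.name

now = time.time()
os.utime(demo_path, (now, now))             # pretend the file was edited just now
stale = needs_restart(demo_path, now - 60)  # service started before the edit
fresh = needs_restart(demo_path, now + 60)  # service started after the edit
os.remove(demo_path)

print(stale, fresh)
```

A restart moves the service start time past the file's modification time, which is why restarting clears the alert.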

See also