[02:21:15] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.21) (duration: 07m 49s)
[02:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 7 02:27:14 UTC 2017 (duration 5m 59s)
[02:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:13] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:30:04] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[03:57:13] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:25:13] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:03:03] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:11:03] PROBLEM - configured eth on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:11:23] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:11:53] RECOVERY - configured eth on ms-be2001 is OK: OK - interfaces up
[05:12:13] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2001 is OK: OK ferm input default policy is set
[05:29:53] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 91 failures. Last run 2 minutes ago with 91 failures.
Failed resources (up to 3 shown): Package[python3],Package[ack-grep],Package[python-etcd],Package[screen]
[05:57:53] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[08:20:53] PROBLEM - Varnish HTCP daemon on cp4018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd
[08:28:42] I checked cat /tmp/vhtcpd.stats and systemctl status, seems fine --^
[08:32:05] ah ok, ps aux | grep vhtcpd doesn't return anything
[08:42:08] so last uptime is May 07 08:18:00
[08:42:17] (UTC)
[08:42:35] since I have no idea what's happening, I think it is safer to just depool it
[08:43:46] !log depooled cp4018.ulsfo.wmnet (sudo -i depool from localhost) due to issues with HTCP
[08:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:06] bblack, ema --^
[09:01:18] elukey: thank you, looking
[09:04:45] [Sun May 7 08:17:53 2017] vhtcpd[1069]: segfault at 64 ip 00007fdba18394a0 sp 00007ffd04c94d48 error 4 in libc-2.19.so[7fdba1702000+1a1000]
[09:07:03] RECOVERY - Varnish HTCP daemon on cp4018 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[09:08:04] !log cp4018: restart vhtcpd and varnish services; repool
[09:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:43] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 2.078 second response time
[09:50:43] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 74845 bytes in 0.235 second response time
[09:54:03] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
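The cp4018 triage above boils down to a simple decision: if no vhtcpd process is found, depool the host rather than leave it serving with a dead HTCP purger. A minimal sketch of that decision, assuming the process count comes from something like `ps aux | grep vhtcpd` or `pgrep`; the `depool` command here only mirrors what the operator logged, this is not the actual Wikimedia tooling:

```shell
# Illustrative sketch, not the real conftool wrapper: decide whether to
# depool a cache host based on whether vhtcpd is running.
vhtcpd_action() {
  # $1 = number of vhtcpd processes found
  # (in the log, ps aux | grep vhtcpd returned nothing, i.e. 0)
  if [ "$1" -ge 1 ]; then
    echo "ok: vhtcpd running, leave pooled"
  else
    # On the real host the operator ran: sudo -i depool
    echo "depool: vhtcpd not running"
  fi
}

vhtcpd_action 1
vhtcpd_action 0
```

The same logic applies later in the day when cp4016 hits the identical symptom.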
[09:54:53] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
[10:42:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[10:46:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[10:49:43] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[10:51:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[10:51:43] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[11:00:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[11:03:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[11:15:03] Is that meant to happen ^^
[11:30:14] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[11:33:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[12:15:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[12:45:39] Operations, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3241375 (Aklapper) Regarding Debian packages
for GNU Mailman 3, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799292 for the issues.
[14:25:14] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:53:13] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[15:00:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[15:03:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[15:46:44] (PS2) Andrew Bogott: labs::lvm: Fix extend-instance-vol to work on stretch [puppet] - https://gerrit.wikimedia.org/r/352168 (https://phabricator.wikimedia.org/T164534)
[15:53:42] (CR) Andrew Bogott: [C: 2] labs::lvm: Fix extend-instance-vol to work on stretch [puppet] - https://gerrit.wikimedia.org/r/352168 (https://phabricator.wikimedia.org/T164534) (owner: Andrew Bogott)
[16:00:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[16:03:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[16:05:27] !log restarted pdns and pdns-recursor on labcontrol1002
[16:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:41] !log restarted designate-central on labservices1002 due to many log messages like 'Deadlock detected. Retrying...'
[16:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:23] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:17:03] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[16:21:23] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[16:22:02] ^ no-op puppet run just took 346.65 seconds
[16:30:14] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[16:33:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[16:44:03] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[16:48:05] (PS1) Andrew Bogott: Revert "Switch labservices1002 to the primary designate/dns server." [puppet] - https://gerrit.wikimedia.org/r/352476 (https://phabricator.wikimedia.org/T164014)
[16:49:19] (CR) Andrew Bogott: [C: 2] Revert "Switch labservices1002 to the primary designate/dns server." [puppet] - https://gerrit.wikimedia.org/r/352476 (https://phabricator.wikimedia.org/T164014) (owner: Andrew Bogott)
[16:52:04] !log switching primary designate server from labservices1002 to labservices1001
[16:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:23] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 50 failures. Last run 2 minutes ago with 50 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Mount[/srv/sda3],Mount[/srv/sdb3],Mount[/var/lib/nginx]
[17:07:23] PROBLEM - Varnish HTCP daemon on cp4016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (vhtcpd), args vhtcpd
[17:12:11] !log rebooting labservices1002 in hopes of getting its IO unstuck
[17:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:21] !log clearing out broken instances in the nova fullstack queue and restarting the tests.
[17:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:12] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[17:31:22] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[20:53:37] [Sun May 7 17:05:51 2017] vhtcpd[986]: segfault at 90 ip 00007eff33e204a0 sp 00007ffcf6588a28 error 4 in libc-2.19.so
[20:53:42] this is for 4016
[20:53:49] (cp4016)
[20:53:59] same thing that happened before for cp4018
[21:09:02] !log depooled cp4016.ulsfo.wmnet (sudo -i depool from localhost) due to issues with vhtcpd (segfaults in dmesg).
[21:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:38] ema, bblack --^ - hello again, depooled cp4016 too if you need to investigate it
[21:11:41] (afk)
[21:51:12] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:32] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:02] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[21:52:22] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2003 is OK: OK ferm input default policy is set
[22:36:13] Anyone awake?
[22:36:46] Attempting to log in to Quarry (using OAuth) is giving a “502 bad gateway” error.
[22:37:44] Other OAuth tools (CropTool is the one I just tested) work.
[22:38:27] Er, never mind, now it’s working… glitch, I guess.
[23:06:12] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:06:12] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:07:02] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:07:02] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
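Both the cp4018 and cp4016 depools earlier in the day were confirmed by the same kind of kernel segfault line in dmesg. A small sketch of pulling the faulting object (the "in <name>" field) out of such a line with plain text processing; the sample line is copied from the 20:53:37 message above, and the variable names are illustrative:

```shell
# Extract the faulting object from a kernel segfault message, e.g. to see
# at a glance that vhtcpd crashed inside libc rather than its own code.
line='vhtcpd[986]: segfault at 90 ip 00007eff33e204a0 sp 00007ffcf6588a28 error 4 in libc-2.19.so'
faulting=$(printf '%s\n' "$line" | sed -n 's/.* in \([^ ]*\).*/\1/p')
echo "$faulting"
```

For the cp4018 variant of the line, the mapped address range (`[7fdba1702000+1a1000]`) is glued onto the library name, so a real parser would want to strip the trailing bracketed range as well.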