[02:21:15] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.21) (duration: 07m 49s)
[02:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:27:14] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun May 7 02:27:14 UTC 2017 (duration 5m 59s)
[02:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:13] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly
[02:30:04] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1001 is OK: OK ferm input default policy is set
[03:57:13] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:25:13] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[05:03:03] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[05:11:03] PROBLEM - configured eth on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:11:23] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:11:53] RECOVERY - configured eth on ms-be2001 is OK: OK - interfaces up
[05:12:13] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2001 is OK: OK ferm input default policy is set
[05:29:53] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 91 failures. Last run 2 minutes ago with 91 failures.
Failed resources (up to 3 shown): Package[python3],Package[ack-grep],Package[python-etcd],Package[screen]
[05:57:53] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[08:20:53] PROBLEM - Varnish HTCP daemon on cp4018 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (vhtcpd), args vhtcpd
[08:28:42] I checked cat /tmp/vhtcpd.stats and systemctl status, seems fine --^
[08:32:05] ah ok, ps aux | grep vhtcpd doesn't return anything
[08:42:08] so last uptime is May 07 08:18:00
[08:42:17] (UTC)
[08:42:35] since I have no idea what's happening, I think it is safer to just depool it
[08:43:46] !log depooled cp4018.ulsfo.wmnet (sudo -i depool from localhost) due to issues with HTCP
[08:44:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:06] bblack, ema --^
[09:01:18] elukey: thank you, looking
[09:04:45] [Sun May 7 08:17:53 2017] vhtcpd[1069]: segfault at 64 ip 00007fdba18394a0 sp 00007ffd04c94d48 error 4 in libc-2.19.so[7fdba1702000+1a1000]
[09:07:03] RECOVERY - Varnish HTCP daemon on cp4018 is OK: PROCS OK: 1 process with UID = 114 (vhtcpd), args vhtcpd
[09:08:04] !log cp4018: restart vhtcpd and varnish services; repool
[09:08:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:43] PROBLEM - HHVM rendering on mw1265 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 2.078 second response time
[09:50:43] RECOVERY - HHVM rendering on mw1265 is OK: HTTP OK: HTTP/1.1 200 OK - 74845 bytes in 0.235 second response time
[09:54:03] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
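The cp4018 triage above boils down to a simple decision: if no vhtcpd process is found, depool the host rather than leave it serving with a dead HTCP purger. A minimal sketch of that decision, assuming the process count comes from something like `ps aux | grep vhtcpd` or `pgrep`; the `depool` command here only mirrors what the operator logged, this is not the actual Wikimedia tooling:

```shell
# Illustrative sketch, not the real conftool wrapper: decide whether to
# depool a cache host based on whether vhtcpd is running.
vhtcpd_action() {
  # $1 = number of vhtcpd processes found
  # (in the log, ps aux | grep vhtcpd returned nothing, i.e. 0)
  if [ "$1" -ge 1 ]; then
    echo "ok: vhtcpd running, leave pooled"
  else
    # On the real host the operator ran: sudo -i depool
    echo "depool: vhtcpd not running"
  fi
}

vhtcpd_action 1
vhtcpd_action 0
```

The same logic applies later in the day when cp4016 hits the identical symptom.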
[09:54:53] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures
[10:42:33] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0
xe-3/2/3: down - Core: cr2-codfw:xe-5/0/1 (Zayo, OGYX/120003//ZYO) 36ms {#2909} [10Gbps wave]
[10:46:43] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[10:49:43] PROBLEM - nova-compute process on labvirt1013 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute
[10:51:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[10:51:43] RECOVERY - nova-compute process on labvirt1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute
[11:00:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[11:03:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[11:15:03] Is that meant to happen ^^
[11:30:14] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[11:33:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[12:15:53] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[12:45:39] Operations, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864#3241375 (Aklapper) Regarding Debian packages
for GNU Mailman 3, see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=799292 for the issues.
[14:25:14] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:53:13] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[15:00:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[15:03:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[15:46:44] (PS2) Andrew Bogott: labs::lvm: Fix extend-instance-vol to work on stretch [puppet] - https://gerrit.wikimedia.org/r/352168 (https://phabricator.wikimedia.org/T164534)
[15:53:42] (CR) Andrew Bogott: [C: 2] labs::lvm: Fix extend-instance-vol to work on stretch [puppet] - https://gerrit.wikimedia.org/r/352168 (https://phabricator.wikimedia.org/T164534) (owner: Andrew Bogott)
[16:00:13] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[16:03:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[16:05:27] !log restarted pdns and pdns-recursor on labcontrol1002
[16:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:07:41] !log restarted designate-central on labservices1002 due to many log messages like 'Deadlock detected. Retrying...'
[16:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:14:23] PROBLEM - puppet last run on labservices1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:17:03] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[16:21:23] RECOVERY - puppet last run on labservices1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[16:22:02] ^ no-op puppet run just took 346.65 seconds
[16:30:14] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[16:33:13] PROBLEM - nova instance creation test on labnet1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack
[16:44:03] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[16:48:05] (PS1) Andrew Bogott: Revert "Switch labservices1002 to the primary designate/dns server." [puppet] - https://gerrit.wikimedia.org/r/352476 (https://phabricator.wikimedia.org/T164014)
[16:49:19] (CR) Andrew Bogott: [C: 2] Revert "Switch labservices1002 to the primary designate/dns server." [puppet] - https://gerrit.wikimedia.org/r/352476 (https://phabricator.wikimedia.org/T164014) (owner: Andrew Bogott)
[16:52:04] !log switching primary designate server from labservices1002 to labservices1001
[16:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:23] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 50 failures. Last run 2 minutes ago with 50 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Mount[/srv/sda3],Mount[/srv/sdb3],Mount[/var/lib/nginx]
[17:07:23] PROBLEM - Varnish HTCP daemon on cp4016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (vhtcpd), args vhtcpd
[17:12:11] !log rebooting labservices1002 in hopes of getting its IO unstuck
[17:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:20:21] !log clearing out broken instances in the nova fullstack queue and restarting the tests.
[17:20:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:22:12] RECOVERY - nova instance creation test on labnet1001 is OK: PROCS OK: 1 process with command name python, args nova-fullstack
[17:31:22] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[20:53:37] [Sun May 7 17:05:51 2017] vhtcpd[986]: segfault at 90 ip 00007eff33e204a0 sp 00007ffcf6588a28 error 4 in libc-2.19.so
[20:53:42] this is for 4016
[20:53:49] (cp4016)
[20:53:59] same thing that happened before for cp4018
[21:09:02] !log depooled cp4016.ulsfo.wmnet (sudo -i depool from localhost) due to issues with vhtcpd (segfaults in dmesg).
[21:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:38] ema, bblack --^ - hello again, depooled cp4016 too if you need to investigate it
[21:11:41] (afk)
[21:51:12] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:51:32] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[21:52:02] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 17 minutes ago with 0 failures
[21:52:22] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2003 is OK: OK ferm input default policy is set
[22:36:13] Anyone awake?
[22:36:46] Attempting to log in to Quarry (using OAuth) is giving a “502 bad gateway” error.
[22:37:44] Other OAuth tools (CropTool is the one I just tested) work.
[22:38:27] Er, never mind, now it’s working… glitch, I guess.
[23:06:12] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:06:12] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[23:07:02] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:07:02] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient
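Both the cp4018 and cp4016 depools earlier in the day were confirmed by the same kind of kernel segfault line in dmesg. A small sketch of pulling the faulting object (the "in <name>" field) out of such a line with plain text processing; the sample line is copied from the 20:53:37 message above, and the variable names are illustrative:

```shell
# Extract the faulting object from a kernel segfault message, e.g. to see
# at a glance that vhtcpd crashed inside libc rather than its own code.
line='vhtcpd[986]: segfault at 90 ip 00007eff33e204a0 sp 00007ffcf6588a28 error 4 in libc-2.19.so'
faulting=$(printf '%s\n' "$line" | sed -n 's/.* in \([^ ]*\).*/\1/p')
echo "$faulting"
```

For the cp4018 variant of the line, the mapped address range (`[7fdba1702000+1a1000]`) is glued onto the library name, so a real parser would want to strip the trailing bracketed range as well.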