[00:00:04] RoanKattouw, ^d: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150128T0000). [00:00:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [00:00:27] RECOVERY - Disk space on amssq62 is OK: DISK OK [00:00:27] RECOVERY - dhclient process on amssq62 is OK: PROCS OK: 0 processes with command name dhclient [00:00:27] RECOVERY - salt-minion processes on amssq62 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:00:27] RECOVERY - puppet last run on amssq62 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [00:00:31] ^d: no swat [00:00:36] RECOVERY - Varnish HTTP text-backend on amssq62 is OK: HTTP OK: HTTP/1.1 200 OK - 190 bytes in 0.191 second response time [00:00:37] RECOVERY - HTTPS on amssq62 is OK: SSLXNN OK - 36 OK [00:00:47] RECOVERY - RAID on amssq62 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:00:47] RECOVERY - configured eth on amssq62 is OK: NRPE: Unable to read output [00:01:06] RECOVERY - DPKG on amssq62 is OK: All packages OK [00:01:27] RECOVERY - Varnish HTCP daemon on amssq62 is OK: PROCS OK: 1 process with UID = 110 (vhtcpd), args vhtcpd [00:01:48] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [00:01:57] RECOVERY - NTP on cp1068 is OK: NTP OK: Offset -0.01376736164 secs [00:02:27] RECOVERY - Varnish traffic logger on amssq62 is OK: PROCS OK: 2 processes with command name varnishncsa [00:02:49] RECOVERY - Varnish HTTP text-frontend on amssq62 is OK: HTTP OK: HTTP/1.1 200 OK - 284 bytes in 0.194 second response time [00:02:56] RECOVERY - Varnishkafka log producer on amssq62 is OK: PROCS OK: 1 process with command name varnishkafka [00:03:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [00:04:57] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [00:06:23] YuviPanda: done https://phabricator.wikimedia.org/T87716 [00:07:36] PROBLEM - Host cp1052 is DOWN: PING CRITICAL - Packet loss = 100% [00:09:26] RECOVERY - Host cp1052 is UP: PING OK - Packet loss = 0%, RTA = 3.25 ms [00:11:46] PROBLEM - Varnish HTTP text-frontend on cp1052 is CRITICAL: Connection refused [00:11:47] PROBLEM - DPKG on cp1052 is CRITICAL: Connection refused by host [00:11:47] PROBLEM - configured eth on cp1052 is CRITICAL: Connection refused by host [00:11:56] PROBLEM - salt-minion processes on cp1052 is CRITICAL: Connection refused by host [00:11:56] PROBLEM - Varnish traffic logger on cp1052 is CRITICAL: Connection refused by host [00:11:57] PROBLEM - Disk space on cp1052 is CRITICAL: Connection refused by host [00:12:06] PROBLEM - HTTPS on cp1052 is CRITICAL: Return code of 255 is out of bounds [00:12:06] PROBLEM - puppet last run on cp1052 is CRITICAL: Connection refused by host [00:12:16] PROBLEM - Varnishkafka log producer on cp1052 is CRITICAL: Connection refused by host [00:12:16] PROBLEM - RAID on cp1052 is CRITICAL: Connection refused by host [00:12:17] PROBLEM - Varnish HTTP text-backend on cp1052 is CRITICAL: Connection refused [00:12:26] PROBLEM - Varnish HTCP daemon on cp1052 is CRITICAL: Connection refused by host [00:12:26] PROBLEM - dhclient process on cp1052 is CRITICAL: Connection refused by host [00:15:16] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures [00:15:17] RECOVERY 
- Varnishkafka log producer on cp1052 is OK: PROCS OK: 1 process with command name varnishkafka [00:15:26] RECOVERY - RAID on cp1052 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [00:15:27] RECOVERY - Varnish HTTP text-backend on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.007 second response time [00:15:27] RECOVERY - Varnish HTCP daemon on cp1052 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [00:15:28] RECOVERY - dhclient process on cp1052 is OK: PROCS OK: 0 processes with command name dhclient [00:15:56] RECOVERY - Varnish HTTP text-frontend on cp1052 is OK: HTTP OK: HTTP/1.1 200 OK - 285 bytes in 0.010 second response time [00:15:57] RECOVERY - DPKG on cp1052 is OK: All packages OK [00:15:57] RECOVERY - configured eth on cp1052 is OK: NRPE: Unable to read output [00:16:06] RECOVERY - salt-minion processes on cp1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [00:16:06] RECOVERY - Varnish traffic logger on cp1052 is OK: PROCS OK: 2 processes with command name varnishncsa [00:16:06] RECOVERY - Disk space on cp1052 is OK: DISK OK [00:18:17] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail [00:18:26] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [00:20:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [00:20:27] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333 [00:21:07] PROBLEM - Host cp4013 is DOWN: PING CRITICAL - Packet loss = 100% [00:22:36] RECOVERY - Host cp4013 is UP: PING OK - Packet loss = 0%, RTA = 80.99 ms [00:23:57] RECOVERY - HTTPS on cp1052 is OK: SSLXNN OK - 36 OK [00:23:57] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [00:24:07] PROBLEM - NTP on cp1052 is CRITICAL: NTP CRITICAL: Offset unknown [00:24:37] PROBLEM - Varnish HTTP upload-frontend on cp4013 is CRITICAL: Connection refused [00:24:37] PROBLEM - RAID on cp4013 is CRITICAL: Connection refused by host [00:24:57] PROBLEM - configured eth on cp4013 is CRITICAL: Connection refused by host [00:25:06] PROBLEM - Varnish HTCP daemon on cp4013 is CRITICAL: Connection refused by host [00:25:17] PROBLEM - Varnish HTTP upload-backend on cp4013 is CRITICAL: Connection refused [00:25:17] PROBLEM - puppet last run on cp4013 is CRITICAL: Connection refused by host [00:25:26] PROBLEM - HTTPS on cp4013 is CRITICAL: Return code of 255 is out of bounds [00:25:26] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [00:25:36] PROBLEM - Varnish traffic logger on cp4013 is CRITICAL: Connection refused by host [00:25:36] PROBLEM - dhclient process on cp4013 is CRITICAL: Connection refused by host [00:25:37] PROBLEM - Varnishkafka log producer on cp4013 is CRITICAL: Connection refused by host [00:25:37] PROBLEM - Disk space on cp4013 is CRITICAL: Connection refused by host [00:25:37] PROBLEM - DPKG on cp4013 is CRITICAL: Connection refused by host [00:25:37] PROBLEM - salt-minion processes on cp4013 is CRITICAL: Connection refused by host [00:28:17] RECOVERY - NTP on cp1052 is OK: NTP OK: Offset 0.002259850502 secs [00:35:07] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [00:35:46] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] 
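The recurring "HTTP 5xx req/min on tungsten" alerts above report what fraction of recent datapoints sits above a fixed rate threshold (500 req/min for CRITICAL in these messages, 250 in the OK line). A minimal sketch of that kind of computation follows; it is an illustration only, not the actual Icinga/Graphite check used here, and the sample window values are made up to show how a figure like "14.29% of data above the critical threshold" can arise.

```
# Illustrative only: mimics the "N% of data above the critical threshold [500.0]"
# style of alert seen in the log above. Not the real WMF check plugin.

def percent_above(series, threshold):
    """Return the percentage of datapoints strictly above `threshold`."""
    if not series:
        return 0.0
    hits = sum(1 for value in series if value > threshold)
    return 100.0 * hits / len(series)

# Hypothetical 5xx req/min samples for the last 14 minutes (one per minute).
window = [120, 95, 130, 110, 640, 150, 90, 105, 98, 112, 87, 101, 99, 580]

critical_pct = percent_above(window, 500.0)  # 2 of 14 samples -> 14.29%
print("%.2f%% of data above the critical threshold [500.0]" % critical_pct)
```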
[00:38:06] PROBLEM - NTP on cp4013 is CRITICAL: NTP CRITICAL: Offset unknown [00:38:16] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:38:37] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:40:37] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.090 second response time [00:41:38] bblack: do you have any idea how it.m.wikipedia.org can still sometimes require an action=purge to show the correct editing permissions for unregistered users? It was supposedly fixed in November. :( https://meta.wikimedia.org/wiki/It.m.wikipedia.org#Unregistered_users [00:42:17] RECOVERY - NTP on cp4013 is OK: NTP OK: Offset -0.002918362617 secs [00:43:05] Nemo_bis: what's an unregistered user? [00:43:43] is that the same as an anonymous not-logged-in user? [00:43:57] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:46:13] in any case, can you get more depth on this? is it sometimes, or always? what's the difference between the edits that require the manual purge and those that don't? is this because it.m.wp is doing something different enough from others that the normal purges aren't being called? [00:46:18] PROBLEM - HTTP on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:47:26] bblack: yes, unregistered user is the technical term for anon [00:48:02] it just seems like an odd term I guess. if they're unregistered, they're not really a user in the sense of having a username. [00:48:28] Username requires being a user, not the contrary [00:48:48] (also, I'm mostly stalling because our plates are kinda full today with fallout from the libc issue, I don't really have time to deeply investigate something unknown and relatively minor) [00:49:32] Yeah, sure, but I don't know how to investigate more [00:50:01] Other than manually visiting Special:Random many times [00:50:15] do you think the issue is that all of these edits have been broken since November and the purge then just temporarily addressed the backlog of already-broken things? [00:51:30] No idea.. for instance, now I can't find a single broken page [00:51:48] I think there was a regression at some point, possibly temporary [00:52:17] Hm, I just got served http://it.m.wikipedia.org/wiki/Diocesi_di_Gallup unstyled [00:52:51] oh, somewhere in those links about the issue, it sounds like they only planned to turn on the anon editing for the month of December? [00:53:02] did they turn it back off at the end of December? [00:53:05] It was made permanent now [00:53:09] oh ok [00:53:11] It wasn't turned off for a second [00:53:48] well, file a phabricator task I guess to investigate it [00:53:50] So if you have directions on what we could pay attention to, that would be wonderful [00:54:10] Otherwise I'll just regularly check the situation and file bugs if it comes back [00:54:17] honestly I have no idea. 
step 1 is a clear problem report with something that's easily visible/reproducible [00:54:25] otherwise I'm just grasping [00:54:37] RECOVERY - HTTP on virt1000 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.535 second response time [00:55:17] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.108 second response time [00:55:17] afk, bus stopping [00:55:34] My problem is how to make it reproducible ;) [00:55:38] Ok, thanks [00:58:36] PROBLEM - puppet last run on virt1000 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:59:27] PROBLEM - RAID on virt1000 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:59:46] PROBLEM - puppetmaster https on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:00:46] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 6 minutes ago with 0 failures [01:03:57] PROBLEM - puppet last run on virt1000 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:06:47] PROBLEM - Unmerged changes on repository puppet on virt1000 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:37] PROBLEM - DPKG on virt1000 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:07:47] PROBLEM - SSH on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:08:46] RECOVERY - DPKG on virt1000 is OK: All packages OK [01:10:56] RECOVERY - SSH on virt1000 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [01:11:54] _joe_: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=labmon1001 [01:13:44] _joe_: http://en.wikipedia.beta.wmflabs.org/wiki/Main_Page?debug=True [01:13:46] PROBLEM - configured eth on virt1000 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [01:14:07] PROBLEM - SSH on virt1000 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:14:47] RECOVERY - configured eth on virt1000 is OK: NRPE: Unable to read output [01:15:32] <^d> YuviPanda: I can't hit wikitech to check instance names :( [01:15:46] ^d: yup, it’s dead. andrew.bogott is working on that [01:15:55] * ^d goes and finds a beer in the meantime [01:16:07] RECOVERY - SSH on virt1000 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [01:16:37] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 22 minutes ago with 0 failures [01:19:27] RECOVERY - Unmerged changes on repository puppet on virt1000 is OK: No changes to merge. [01:19:37] RECOVERY - RAID on virt1000 is OK: RAID STATUS: OPTIMAL [01:21:57] ACKNOWLEDGEMENT - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:22:07] PROBLEM - HTTP on virt1000 is CRITICAL: Connection refused [01:23:17] ACKNOWLEDGEMENT - DPKG on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:18] ACKNOWLEDGEMENT - Disk space on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:18] ACKNOWLEDGEMENT - HTTPS on cp3016 is CRITICAL: Return code of 255 is out of bounds alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) 
[01:23:18] ACKNOWLEDGEMENT - NTP on cp3016 is CRITICAL: NTP CRITICAL: No response from NTP server alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:18] ACKNOWLEDGEMENT - RAID on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:18] ACKNOWLEDGEMENT - SSH on cp3016 is CRITICAL: Connection timed out alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:18] ACKNOWLEDGEMENT - Varnish HTCP daemon on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:19] ACKNOWLEDGEMENT - Varnish HTTP upload-backend on cp3016 is CRITICAL: Connection timed out alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:19] ACKNOWLEDGEMENT - Varnish HTTP upload-frontend on cp3016 is CRITICAL: Connection timed out alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:20] ACKNOWLEDGEMENT - Varnish traffic logger on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:20] ACKNOWLEDGEMENT - Varnishkafka log producer on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:21] ACKNOWLEDGEMENT - configured eth on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:21] ACKNOWLEDGEMENT - dhclient process on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) [01:23:22] ACKNOWLEDGEMENT - puppet last run on cp3016 is CRITICAL: Timeout while attempting connection alexandros kosiaris bblack: !log cp3016 out of service for now, needs reinstall (precise!) 
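Returning to the it.m.wikipedia.org discussion above ("My problem is how to make it reproducible"): one low-effort way to gather evidence is to fetch a suspect page anonymously, note the cache-related response headers, and compare the rendered body before and after an action=purge. The sketch below only illustrates that approach; the header names (X-Cache, Age) are the conventional Varnish-style ones and may differ in practice, and the page title is just an example taken from the log.

```
# Sketch of a reproduction aid for the it.m.wikipedia.org caching question above.
# Assumes Varnish-style response headers (X-Cache, Age); illustration only.
import urllib.request

PAGE = "https://it.m.wikipedia.org/wiki/Diocesi_di_Gallup"  # example page from the log

def fetch(url):
    req = urllib.request.Request(url, headers={"User-Agent": "cache-repro-check/0.1"})
    with urllib.request.urlopen(req) as resp:
        return dict(resp.getheaders()), resp.read()

before_headers, before_body = fetch(PAGE)
print("Age:", before_headers.get("Age"), "X-Cache:", before_headers.get("X-Cache"))

# Force a purge the same way a reader would (action=purge), then re-fetch.
fetch(PAGE + "?action=purge")
after_headers, after_body = fetch(PAGE)

# If the rendered HTML only changes after the purge, the page was stale in cache.
print("changed after purge:", before_body != after_body)
```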
[01:28:57] ACKNOWLEDGEMENT - DPKG on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:28:57] ACKNOWLEDGEMENT - Disk space on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:28:58] ACKNOWLEDGEMENT - NTP on osm-cp1001 is CRITICAL: NTP CRITICAL: No response from NTP server daniel_zahn not in use [01:28:58] ACKNOWLEDGEMENT - RAID on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:28:58] ACKNOWLEDGEMENT - SSH on osm-cp1001 is CRITICAL: Connection refused daniel_zahn not in use [01:28:58] ACKNOWLEDGEMENT - configured eth on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:28:58] ACKNOWLEDGEMENT - dhclient process on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:28:58] ACKNOWLEDGEMENT - puppet last run on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:28:59] ACKNOWLEDGEMENT - salt-minion processes on osm-cp1001 is CRITICAL: Connection refused by host daniel_zahn not in use [01:30:07] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100% [01:30:47] RECOVERY - dhclient process on cp1057 is OK: PROCS OK: 0 processes with command name dhclient [01:30:47] RECOVERY - RAID on cp1057 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [01:30:56] RECOVERY - Varnish HTTP bits on cp1057 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.009 second response time [01:30:57] RECOVERY - salt-minion processes on cp1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:31:01] !log power down osm-cp1001 [01:31:07] RECOVERY - Disk space on cp1057 is OK: DISK OK [01:31:26] RECOVERY - DPKG on cp1057 is OK: All packages OK [01:31:26] RECOVERY - HTTPS on cp1057 is OK: SSLXNN OK - 36 OK [01:31:27] RECOVERY - configured eth on cp1057 is OK: NRPE: Unable to read output [01:31:36] RECOVERY - Varnishkafka log producer on cp1068 is OK: PROCS OK: 1 process with command name varnishkafka [01:31:46] RECOVERY - Varnish traffic logger on cp1068 is OK: PROCS OK: 2 processes with command name varnishncsa [01:31:47] RECOVERY - configured eth on cp1068 is OK: NRPE: Unable to read output [01:31:57] RECOVERY - dhclient process on cp1068 is OK: PROCS OK: 0 processes with command name dhclient [01:32:07] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 79.61 ms [01:32:07] RECOVERY - RAID on cp1068 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [01:32:07] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:32:18] RECOVERY - salt-minion processes on cp1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:32:19] RECOVERY - Disk space on cp1068 is OK: DISK OK [01:32:19] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:32:19] RECOVERY - Varnish HTTP text-frontend on cp1068 is OK: HTTP OK: HTTP/1.1 200 OK - 285 bytes in 0.083 second response time [01:32:26] RECOVERY - DPKG on cp1068 is OK: All packages OK [01:32:27] RECOVERY - Varnish HTTP text-backend on cp1068 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.029 second response time [01:32:37] RECOVERY - Varnish HTCP daemon on cp1068 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [01:32:47] RECOVERY - Varnish HTTP upload-backend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.161 second response time [01:32:56] RECOVERY - puppet last run on cp4013 
is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [01:33:06] RECOVERY - Varnish traffic logger on cp4013 is OK: PROCS OK: 2 processes with command name varnishncsa [01:33:06] RECOVERY - salt-minion processes on cp4013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:33:07] RECOVERY - DPKG on cp4013 is OK: All packages OK [01:33:07] RECOVERY - dhclient process on cp4013 is OK: PROCS OK: 0 processes with command name dhclient [01:33:07] RECOVERY - Disk space on cp4013 is OK: DISK OK [01:33:07] RECOVERY - Varnishkafka log producer on cp4013 is OK: PROCS OK: 1 process with command name varnishkafka [01:33:07] RECOVERY - Varnish HTTP upload-frontend on cp4013 is OK: HTTP OK: HTTP/1.1 200 OK - 367 bytes in 0.159 second response time [01:33:16] RECOVERY - RAID on cp4013 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [01:33:36] RECOVERY - configured eth on cp4013 is OK: NRPE: Unable to read output [01:33:36] RECOVERY - Varnish HTCP daemon on cp4013 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd [01:34:17] PROBLEM - Varnishkafka log producer on cp4007 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [01:34:56] RECOVERY - puppetmaster https on virt1000 is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.321 second response time [01:35:06] RECOVERY - HTTP on virt1000 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 0.002 second response time [01:35:07] PROBLEM - Host osm-cp1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:35:26] RECOVERY - Varnishkafka log producer on cp4007 is OK: PROCS OK: 1 process with command name varnishkafka [01:38:46] PROBLEM - Disk space on cp3022 is CRITICAL: DISK CRITICAL - free space: / 351 MB (3% inode=78%): [01:39:26] !log disabled puppet on tools-submit, stopped bigfuckingbrother [01:40:07] RECOVERY - HTTPS on cp4013 is OK: SSLXNN OK - 36 OK [01:42:06] RECOVERY - HTTPS on cp1068 is OK: SSLXNN OK - 36 OK [01:42:26] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [01:43:35] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [01:44:16] PROBLEM - Disk space on cp3021 is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=79%): [01:44:26] PROBLEM - Host cp1053 is DOWN: PING CRITICAL - Packet loss = 100% [01:44:36] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.009 second response time [01:45:16] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [01:45:25] RECOVERY - Host cp1053 is UP: PING WARNING - Packet loss = 61%, RTA = 3.05 ms [01:45:26] PROBLEM - Host labs-ns0.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [01:46:08] !log rebooting fluorine [01:47:16] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 2.61 ms [01:47:45] PROBLEM - Host snapshot1001 is DOWN: PING CRITICAL - Packet loss = 100% [01:47:45] PROBLEM - Host snapshot1003 is DOWN: PING CRITICAL - Packet loss = 100% [01:47:56] PROBLEM - Host snapshot1002 is DOWN: PING CRITICAL - Packet loss = 100% [01:48:15] RECOVERY - Host snapshot1002 is UP: PING OK - Packet loss = 0%, RTA = 2.15 ms [01:48:36] RECOVERY - Host labs-ns0.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [01:48:56] RECOVERY - Host snapshot1003 is UP: PING OK - Packet loss = 0%, RTA = 3.19 ms [01:49:37] PROBLEM - Disk space on cp3019 is CRITICAL: DISK CRITICAL - free 
space: / 352 MB (3% inode=83%): [01:51:06] RECOVERY - Host snapshot1001 is UP: PING OK - Packet loss = 0%, RTA = 0.95 ms [01:52:51] !log did hotfix on fluorine for incorrect udp2log conf file location [01:52:56] Logged the message, Master [01:58:05] PROBLEM - Host cp1058 is DOWN: PING CRITICAL - Packet loss = 100% [01:58:16] RECOVERY - Disk space on cp3019 is OK: DISK OK [01:59:06] RECOVERY - Host cp1058 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [01:59:25] RECOVERY - Disk space on cp3021 is OK: DISK OK [02:00:36] RECOVERY - Disk space on cp3022 is OK: DISK OK [02:07:22] !log started nodejs-ocg on ocg1001 (didnt listen on 8000 as opposed to ocg1002) [02:07:27] Logged the message, Master [02:10:46] PROBLEM - Host amssq38 is DOWN: PING CRITICAL - Packet loss = 100% [02:11:26] RECOVERY - Host amssq38 is UP: PING OK - Packet loss = 0%, RTA = 95.90 ms [02:13:15] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:14:55] PROBLEM - NTP on cp1058 is CRITICAL: NTP CRITICAL: Offset unknown [02:15:25] RECOVERY - DPKG on labmon1001 is OK: All packages OK [02:18:06] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail [02:19:17] RECOVERY - NTP on cp1058 is OK: NTP OK: Offset -0.1115738153 secs [02:19:55] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [02:20:41] !log l10nupdate Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 02s) [02:20:44] Logged the message, Master [02:21:48] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-28 02:20:45+00:00 [02:21:51] Logged the message, Master [02:22:56] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.085 second response time [02:25:25] PROBLEM - RAID on labmon1001 is CRITICAL: Connection refused by host [02:25:35] PROBLEM - Disk space on labmon1001 is CRITICAL: Connection refused by host [02:25:35] PROBLEM - Graphite Carbon on labmon1001 is CRITICAL: Connection refused by host [02:25:46] PROBLEM - salt-minion processes on labmon1001 is CRITICAL: Connection refused by host [02:25:55] PROBLEM - configured eth on labmon1001 is CRITICAL: Connection refused by host [02:26:05] PROBLEM - SSH on labmon1001 is CRITICAL: Connection refused [02:26:06] PROBLEM - dhclient process on labmon1001 is CRITICAL: Connection refused by host [02:26:15] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: Connection refused [02:28:06] PROBLEM - NTP on amssq38 is CRITICAL: NTP CRITICAL: Offset unknown [02:31:17] RECOVERY - NTP on amssq38 is OK: NTP OK: Offset 0.001534938812 secs [02:32:55] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100% [02:32:55] RECOVERY - RAID on labmon1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [02:32:55] RECOVERY - DPKG on labmon1001 is OK: All packages OK [02:32:55] RECOVERY - uWSGI web apps on labmon1001 is OK: OK: All defined uWSGI apps are runnning. [02:33:06] RECOVERY - Disk space on labmon1001 is OK: DISK OK [02:33:06] RECOVERY - Graphite Carbon on labmon1001 is OK: OK: All defined Carbon jobs are runnning. 
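The "!log started nodejs-ocg on ocg1001 (didnt listen on 8000 as opposed to ocg1002)" entry above is the kind of symptom that can be confirmed quickly with a plain TCP connect test. A minimal sketch follows; the port comes from that log line, the fully qualified host names are assumed for illustration, and this is not part of the OCG service or of WMF monitoring.

```
# Quick TCP check: is anything accepting connections on host:port?
# Port 8000 is from the !log line above; the .eqiad.wmnet names are assumed.
import socket

def is_listening(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in ("ocg1001.eqiad.wmnet", "ocg1002.eqiad.wmnet"):  # hypothetical FQDNs
    print(host, "listening on 8000:", is_listening(host, 8000))
```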
[02:33:16] RECOVERY - salt-minion processes on labmon1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [02:33:26] RECOVERY - configured eth on labmon1001 is OK: NRPE: Unable to read output [02:33:35] RECOVERY - SSH on labmon1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [02:33:36] RECOVERY - dhclient process on labmon1001 is OK: PROCS OK: 0 processes with command name dhclient [02:33:45] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 96.14 ms [02:33:45] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.664 second response time [02:34:38] !log l10nupdate Synchronized php-1.25wmf15/cache/l10n: (no message) (duration: 00m 02s) [02:34:43] Logged the message, Master [02:35:45] !log LocalisationUpdate completed (1.25wmf15) at 2015-01-28 02:34:42+00:00 [02:35:48] Logged the message, Master [02:36:37] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [02:37:46] PROBLEM - Disk space on cp3020 is CRITICAL: DISK CRITICAL - free space: / 353 MB (3% inode=83%): [02:43:46] (03CR) 10Yuvipanda: "Testing" [puppet] - 10https://gerrit.wikimedia.org/r/187078 (owner: 10Yuvipanda) [02:45:36] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100% [02:47:26] RECOVERY - Host cp4014 is UP: PING OK - Packet loss = 0%, RTA = 80.14 ms [02:47:35] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [02:49:35] PROBLEM - NTP on cp3019 is CRITICAL: NTP CRITICAL: Offset unknown [02:50:45] RECOVERY - Disk space on cp3020 is OK: DISK OK [02:52:46] RECOVERY - NTP on cp3019 is OK: NTP OK: Offset 0.002361297607 secs [02:59:25] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [03:02:26] PROBLEM - NTP on cp4014 is CRITICAL: NTP CRITICAL: Offset unknown [03:06:45] RECOVERY - NTP on cp4014 is OK: NTP OK: Offset -0.000493645668 secs [03:22:06] (03PS1) 10Ori.livneh: ve: Add vbench tool [puppet] - 10https://gerrit.wikimedia.org/r/187083 [03:24:48] (03PS2) 10Ori.livneh: ve: Add vbench tool [puppet] - 10https://gerrit.wikimedia.org/r/187083 [03:25:26] (03CR) 10Ori.livneh: [C: 032 V: 032] ve: Add vbench tool [puppet] - 10https://gerrit.wikimedia.org/r/187083 (owner: 10Ori.livneh) [04:01:15] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100% [04:02:36] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 95.20 ms [04:03:36] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:14:06] PROBLEM - Host cp1067 is DOWN: PING CRITICAL - Packet loss = 100% [04:16:15] RECOVERY - Host cp1067 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [04:27:35] PROBLEM - Host cp1046 is DOWN: PING CRITICAL - Packet loss = 100% [04:29:36] RECOVERY - Host cp1046 is UP: PING WARNING - Packet loss = 28%, RTA = 2.76 ms [04:32:15] PROBLEM - NTP on cp1067 is CRITICAL: NTP CRITICAL: Offset unknown [04:34:57] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.019 second response time [04:35:25] RECOVERY - NTP on cp1067 is OK: NTP OK: Offset -0.001196026802 secs [04:40:06] PROBLEM - Disk space on cp1056 is CRITICAL: DISK CRITICAL - free space: / 351 MB (3% inode=83%): [04:41:56] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100% [04:43:05] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA 
= 96.03 ms [04:44:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [04:44:46] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.018 second response time [04:45:28] (03Abandoned) 10Giuseppe Lavagetto: hhvm: disable stat_cache [puppet] - 10https://gerrit.wikimedia.org/r/186579 (owner: 10Giuseppe Lavagetto) [04:52:34] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jan 28 04:51:31 UTC 2015 (duration 51m 30s) [04:52:43] Logged the message, Master [04:55:05] PROBLEM - Host cp1062 is DOWN: PING CRITICAL - Packet loss = 100% [04:56:45] RECOVERY - Host cp1062 is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms [04:58:26] PROBLEM - NTP on cp3003 is CRITICAL: NTP CRITICAL: Offset unknown [04:58:54] (03PS15) 10Giuseppe Lavagetto: Strongswan: IPsec Puppet module [puppet] - 10https://gerrit.wikimedia.org/r/181742 (owner: 10Gage) [05:02:56] RECOVERY - NTP on cp3003 is OK: NTP OK: Offset 0.001697063446 secs [05:07:46] RECOVERY - Host virt1009 is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms [05:08:55] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [05:09:27] RECOVERY - Host amssq40 is UP: PING OK - Packet loss = 0%, RTA = 95.73 ms [05:09:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [05:09:57] PROBLEM - RAID on virt1009 is CRITICAL: Connection refused by host [05:10:05] PROBLEM - DPKG on virt1009 is CRITICAL: Connection refused by host [05:10:05] PROBLEM - Disk space on virt1009 is CRITICAL: Connection refused by host [05:10:05] PROBLEM - puppet last run on virt1009 is CRITICAL: Connection refused by host [05:10:22] (03PS1) 10Dzahn: move all files/icinga to modules/icinga/files [puppet] - 10https://gerrit.wikimedia.org/r/187087 [05:10:25] PROBLEM - configured eth on virt1009 is CRITICAL: Connection refused by host [05:12:16] PROBLEM - NTP on cp1062 is CRITICAL: NTP CRITICAL: Offset unknown [05:15:35] RECOVERY - NTP on cp1062 is OK: NTP OK: Offset 0.00236749649 secs [05:16:58] (03CR) 10Dzahn: [C: 04-1] "http://wikidata.beta.wmflabs.org/w/api.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 (owner: 10Jdlrobson) [05:19:42] (03PS2) 10Dzahn: Correct wikidata uri for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 (owner: 10Jdlrobson) [05:22:15] PROBLEM - Host amssq60 is DOWN: PING CRITICAL - Packet loss = 100% [05:23:06] PROBLEM - Host virt1009 is DOWN: PING CRITICAL - Packet loss = 100% [05:23:16] RECOVERY - Host amssq60 is UP: PING OK - Packet loss = 0%, RTA = 96.63 ms [05:25:45] PROBLEM - NTP on amssq40 is CRITICAL: NTP CRITICAL: Offset unknown [05:25:46] PROBLEM - Varnishkafka log producer on amssq60 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [05:26:16] RECOVERY - Disk space on virt1009 is OK: DISK OK [05:26:17] RECOVERY - DPKG on virt1009 is OK: All packages OK [05:26:17] RECOVERY - RAID on virt1009 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [05:26:26] RECOVERY - Host virt1009 is UP: PING OK - Packet loss = 0%, RTA = 2.18 ms [05:26:55] RECOVERY - Varnishkafka log producer on amssq60 is OK: PROCS OK: 1 process with command name varnishkafka [05:29:05] RECOVERY - NTP on amssq40 is OK: NTP OK: Offset 0.003779888153 secs [05:30:06] PROBLEM - Disk space on cp1069 is CRITICAL: DISK CRITICAL - free space: / 354 MB (3% inode=83%): [05:35:56] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100% [05:37:25] RECOVERY - 
Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 95.17 ms [05:38:36] PROBLEM - NTP on amssq60 is CRITICAL: NTP CRITICAL: Offset unknown [05:39:46] PROBLEM - Varnishkafka log producer on cp3014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [05:39:57] PROBLEM - Varnish traffic logger on cp3014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [05:40:56] RECOVERY - Varnishkafka log producer on cp3014 is OK: PROCS OK: 1 process with command name varnishkafka [05:41:15] RECOVERY - Varnish traffic logger on cp3014 is OK: PROCS OK: 2 processes with command name varnishncsa [05:41:35] PROBLEM - NTP on virt1009 is CRITICAL: NTP CRITICAL: No response from NTP server [05:41:56] RECOVERY - NTP on amssq60 is OK: NTP OK: Offset 0.002561450005 secs [05:43:06] PROBLEM - Host virt1009 is DOWN: PING CRITICAL - Packet loss = 100% [05:46:16] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures [05:47:15] RECOVERY - Host virt1009 is UP: PING OK - Packet loss = 0%, RTA = 1.60 ms [05:49:46] PROBLEM - Host cp1037 is DOWN: PING CRITICAL - Packet loss = 100% [05:49:56] PROBLEM - Disk space on virt1009 is CRITICAL: Connection refused by host [05:49:56] PROBLEM - DPKG on virt1009 is CRITICAL: Connection refused by host [05:49:56] PROBLEM - RAID on virt1009 is CRITICAL: Connection refused by host [05:50:06] PROBLEM - configured eth on virt1009 is CRITICAL: Connection refused by host [05:51:26] RECOVERY - Host cp1037 is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms [05:52:36] PROBLEM - NTP on cp3014 is CRITICAL: NTP CRITICAL: Offset unknown [05:55:35] RECOVERY - Disk space on cp1056 is OK: DISK OK [05:55:46] RECOVERY - NTP on cp3014 is OK: NTP OK: Offset -0.0001796483994 secs [06:00:26] PROBLEM - Host virt1009 is DOWN: PING CRITICAL - Packet loss = 100% [06:01:06] RECOVERY - Disk space on cp1069 is OK: DISK OK [06:03:07] PROBLEM - Host cp1043 is DOWN: PING CRITICAL - Packet loss = 100% [06:03:25] PROBLEM - Disk space on cp4003 is CRITICAL: DISK CRITICAL - free space: / 355 MB (3% inode=79%): [06:04:25] RECOVERY - Host virt1009 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [06:04:45] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [06:04:46] RECOVERY - Host cp1043 is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [06:11:27] RECOVERY - Disk space on virt1009 is OK: DISK OK [06:11:27] RECOVERY - RAID on virt1009 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0 [06:11:36] RECOVERY - dhclient process on virt1009 is OK: PROCS OK: 0 processes with command name dhclient [06:11:37] RECOVERY - configured eth on virt1009 is OK: NRPE: Unable to read output [06:12:35] RECOVERY - DPKG on virt1009 is OK: All packages OK [06:12:35] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 9 failures [06:16:56] PROBLEM - Host cp1044 is DOWN: PING CRITICAL - Packet loss = 100% [06:17:15] RECOVERY - Disk space on cp4003 is OK: DISK OK [06:18:15] RECOVERY - Host cp1044 is UP: PING OK - Packet loss = 0%, RTA = 2.47 ms [06:18:25] PROBLEM - Disk space on cp4004 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=83%): [06:18:25] PROBLEM - Disk space on cp4002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=83%): [06:18:57] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:19:55] PROBLEM - NTP on cp1043 is CRITICAL: NTP CRITICAL: Offset unknown [06:24:06] RECOVERY - NTP on cp1043 is 
OK: NTP OK: Offset -0.008637070656 secs [06:28:26] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:36] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:46] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:15] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:26] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:46] RECOVERY - salt-minion processes on virt1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [06:29:57] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:57] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:05] PROBLEM - Host amssq51 is DOWN: PING CRITICAL - Packet loss = 100% [06:30:16] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:56] RECOVERY - Host amssq51 is UP: PING OK - Packet loss = 0%, RTA = 94.81 ms [06:31:16] PROBLEM - Disk space on cp4004 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=83%): [06:31:16] PROBLEM - Disk space on cp4002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=83%): [06:32:57] 3operations: Need to add contractor to fundraising email lists - https://phabricator.wikimedia.org/T87672#997824 (10Dzahn) a:3Dzahn [06:33:15] 3operations: Need to add contractor to fundraising email lists - https://phabricator.wikimedia.org/T87672#996529 (10Dzahn) p:5Triage>3Normal [06:34:06] PROBLEM - NTP on cp1044 is CRITICAL: NTP CRITICAL: Offset unknown [06:37:55] PROBLEM - Disk space on cp4001 is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=83%): [06:42:52] 3operations: add contractor Michael Beattie to fundraising email alias fr-online@ - https://phabricator.wikimedia.org/T87672#997833 (10Dzahn) [06:42:56] RECOVERY - Disk space on cp4004 is OK: DISK OK [06:42:56] RECOVERY - Disk space on cp4002 is OK: DISK OK [06:43:16] RECOVERY - Disk space on cp4001 is OK: DISK OK [06:43:23] 3operations: add contractor Michael Beattie to fundraising email alias fr-online@ - https://phabricator.wikimedia.org/T87672#996529 (10Dzahn) 5Open>3Resolved Hi Caitlin, done. i added mbeattie to fr-online@. fr-all will be automatic because it includes fr-online. Best, Daniel reference for Michael Beatti... 
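Dzahn's note above that "fr-all will be automatic because it includes fr-online" describes nested mail aliases: an address added to an inner alias is reached through every alias that includes it. A tiny, generic sketch of that expansion logic follows; the alias names come from the log, the member addresses are placeholders, and this is not the actual WMF alias configuration.

```
# Generic nested-alias expansion; not the actual WMF mail alias setup.
# Alias names from the log; member addresses are placeholders.
ALIASES = {
    "fr-online": ["mbeattie@example.org", "someone@example.org"],
    "fr-all": ["fr-online", "another-team@example.org"],
}

def expand(alias, seen=None):
    """Recursively resolve an alias to concrete addresses."""
    seen = set() if seen is None else seen
    if alias in seen:  # guard against alias cycles
        return set()
    seen.add(alias)
    members = set()
    for entry in ALIASES.get(alias, [alias]):
        if entry in ALIASES:
            members |= expand(entry, seen)
        else:
            members.add(entry)
    return members

print(sorted(expand("fr-all")))  # mbeattie@... is reached via fr-online
```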
[06:43:47] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100% [06:44:45] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:44:46] RECOVERY - NTP on cp1044 is OK: NTP OK: Offset 0.0005965232849 secs [06:45:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [06:45:25] RECOVERY - Host cp3012 is UP: PING OK - Packet loss = 0%, RTA = 95.39 ms [06:45:56] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:56] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:46:06] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:26] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:36] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:15] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:47:26] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:50:15] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:58:26] PROBLEM - Host cp4004 is DOWN: PING CRITICAL - Packet loss = 100% [06:58:55] RECOVERY - Host cp4004 is UP: PING OK - Packet loss = 0%, RTA = 84.68 ms [07:01:05] PROBLEM - NTP on cp3012 is CRITICAL: NTP CRITICAL: Offset unknown [07:04:15] RECOVERY - NTP on cp3012 is OK: NTP OK: Offset 0.001710653305 secs [07:06:37] (03PS1) 10BBlack: nginx: raise body buffer from 16k to 64k [puppet] - 10https://gerrit.wikimedia.org/r/187090 [07:07:18] !log repool amssq42 for text-https [07:07:24] Logged the message, Master [07:11:11] 3operations: migrate all ops-core items into operations project - https://phabricator.wikimedia.org/T87469#997874 (10Dzahn) duplicate of T87291 [07:12:10] 3operations: migrate all ops-core items into operations project - https://phabricator.wikimedia.org/T87469#997876 (10Dzahn) [07:12:12] 3Phabricator, operations: merge tickets in project "ops-core" into project "operations" - https://phabricator.wikimedia.org/T87291#988096 (10Dzahn) [07:13:55] PROBLEM - Varnishkafka log producer on amssq45 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [07:14:45] PROBLEM - NTP on cp4004 is CRITICAL: NTP CRITICAL: Offset unknown [07:14:57] RECOVERY - Varnishkafka log producer on amssq45 is OK: PROCS OK: 1 process with command name varnishkafka [07:17:55] RECOVERY - NTP on cp4004 is OK: NTP OK: Offset 0.008156776428 secs [07:26:46] PROBLEM - NTP on amssq45 is CRITICAL: NTP CRITICAL: Offset unknown [07:30:56] RECOVERY - NTP on amssq45 is OK: NTP OK: Offset 0.001429319382 secs [08:30:36] PROBLEM - Host amssq45 is DOWN: PING CRITICAL - Packet loss = 100% [08:31:26] RECOVERY - Host amssq45 is UP: PING OK - Packet loss = 16%, RTA = 94.93 ms [08:45:17] PROBLEM - Host cp3022 is DOWN: PING CRITICAL - Packet loss = 100% [08:46:25] RECOVERY - Host cp3022 is UP: PING OK - Packet loss = 0%, RTA = 98.16 ms [08:47:16] PROBLEM - NTP on amssq45 is CRITICAL: NTP CRITICAL: Offset unknown 
[08:50:25] RECOVERY - NTP on amssq45 is OK: NTP OK: Offset -0.0009464025497 secs [08:58:06] PROBLEM - Host cp4019 is DOWN: PING CRITICAL - Packet loss = 100% [08:59:56] RECOVERY - Host cp4019 is UP: PING OK - Packet loss = 0%, RTA = 81.25 ms [09:02:06] PROBLEM - NTP on cp3022 is CRITICAL: NTP CRITICAL: Offset unknown [09:05:16] RECOVERY - NTP on cp3022 is OK: NTP OK: Offset 0.0009537935257 secs [09:11:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [09:11:25] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [09:15:56] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 94.93 ms [09:18:06] PROBLEM - HTTPS on cp3015 is CRITICAL: Return code of 255 is out of bounds [09:18:26] PROBLEM - Disk space on cp3015 is CRITICAL: Connection refused by host [09:18:55] PROBLEM - Varnish HTTP upload-backend on cp3015 is CRITICAL: Connection refused [09:18:56] PROBLEM - Varnish HTTP upload-frontend on cp3015 is CRITICAL: Connection refused [09:18:57] PROBLEM - Varnish HTCP daemon on cp3015 is CRITICAL: Connection refused by host [09:18:57] PROBLEM - configured eth on cp3015 is CRITICAL: Connection refused by host [09:18:57] PROBLEM - Varnishkafka log producer on cp3015 is CRITICAL: Connection refused by host [09:18:57] PROBLEM - dhclient process on cp3015 is CRITICAL: Connection refused by host [09:18:57] PROBLEM - salt-minion processes on cp3015 is CRITICAL: Connection refused by host [09:18:57] PROBLEM - puppet last run on cp3015 is CRITICAL: Connection refused by host [09:18:57] PROBLEM - DPKG on cp3015 is CRITICAL: Connection refused by host [09:18:58] PROBLEM - Varnish traffic logger on cp3015 is CRITICAL: Connection refused by host [09:18:58] PROBLEM - RAID on cp3015 is CRITICAL: Connection refused by host [09:24:38] (03PS1) 10BBlack: disable cp3015 for reinstall as well [puppet] - 10https://gerrit.wikimedia.org/r/187095 [09:25:18] (03CR) 10BBlack: [C: 032 V: 032] disable cp3015 for reinstall as well [puppet] - 10https://gerrit.wikimedia.org/r/187095 (owner: 10BBlack) [09:25:56] PROBLEM - Host cp3010 is DOWN: PING CRITICAL - Packet loss = 100% [09:26:36] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [09:26:40] (03CR) 10BBlack: [C: 032 V: 032] nginx: raise body buffer from 16k to 64k [puppet] - 10https://gerrit.wikimedia.org/r/187090 (owner: 10BBlack) [09:26:46] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 96.49 ms [09:28:05] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 95.14 ms [09:38:10] (03PS1) 10BBlack: switch cp301[56] installer to precise [puppet] - 10https://gerrit.wikimedia.org/r/187098 [09:38:55] (03CR) 10BBlack: [C: 032 V: 032] switch cp301[56] installer to precise [puppet] - 10https://gerrit.wikimedia.org/r/187098 (owner: 10BBlack) [09:40:05] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [09:40:25] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [09:42:25] PROBLEM - NTP on cp3010 is CRITICAL: NTP CRITICAL: Offset unknown [09:45:06] PROBLEM - Host cp1038 is DOWN: PING CRITICAL - Packet loss = 100% [09:46:25] RECOVERY - Host cp1038 is UP: PING WARNING - Packet loss = 93%, RTA = 0.33 ms [09:46:35] RECOVERY - NTP on cp3010 is OK: NTP OK: Offset 0.003423929214 secs [09:49:47] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.009 second response time [09:52:26] RECOVERY - Host cp3016 is UP: PING WARNING - Packet 
loss = 37%, RTA = 97.40 ms [09:55:35] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 94.91 ms [09:59:15] PROBLEM - Host cp4002 is DOWN: PING CRITICAL - Packet loss = 100% [10:00:16] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 79.81 ms [10:00:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.028 second response time [10:03:26] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [10:04:36] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 95.46 ms [10:12:25] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100% [10:13:25] RECOVERY - Host cp4018 is UP: PING WARNING - Packet loss = 64%, RTA = 81.46 ms [10:15:56] PROBLEM - NTP on cp4002 is CRITICAL: NTP CRITICAL: Offset unknown [10:19:15] RECOVERY - NTP on cp4002 is OK: NTP OK: Offset -0.001715540886 secs [10:20:56] PROBLEM - NTP on cp3015 is CRITICAL: NTP CRITICAL: No response from NTP server [10:25:56] PROBLEM - Host cp4020 is DOWN: PING CRITICAL - Packet loss = 100% [10:26:56] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 93%, RTA = 79.77 ms [10:28:36] PROBLEM - NTP on cp4018 is CRITICAL: NTP CRITICAL: Offset unknown [10:29:46] PROBLEM - Varnishkafka log producer on cp4020 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [10:30:46] RECOVERY - Varnishkafka log producer on cp4020 is OK: PROCS OK: 1 process with command name varnishkafka [10:32:55] RECOVERY - NTP on cp4018 is OK: NTP OK: Offset 0.001896977425 secs [10:34:56] RECOVERY - Disk space on cp3015 is OK: DISK OK [10:35:16] RECOVERY - dhclient process on cp3015 is OK: PROCS OK: 0 processes with command name dhclient [10:35:16] RECOVERY - DPKG on cp3015 is OK: All packages OK [10:35:16] RECOVERY - Varnish HTCP daemon on cp3015 is OK: PROCS OK: 1 process with UID = 112 (vhtcpd), args vhtcpd [10:35:16] RECOVERY - Varnish traffic logger on cp3015 is OK: PROCS OK: 2 processes with command name varnishncsa [10:35:16] RECOVERY - salt-minion processes on cp3015 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:35:16] RECOVERY - RAID on cp3015 is OK: OK: optimal, 2 logical, 2 physical [10:35:17] RECOVERY - configured eth on cp3015 is OK: NRPE: Unable to read output [10:36:36] RECOVERY - NTP on cp3015 is OK: NTP OK: Offset 0.05555558205 secs [10:37:16] RECOVERY - Varnishkafka log producer on cp3015 is OK: PROCS OK: 1 process with command name varnishkafka [10:37:16] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 3 failures [10:38:16] RECOVERY - Varnish HTTP upload-frontend on cp3015 is OK: HTTP OK: HTTP/1.1 200 OK - 367 bytes in 0.192 second response time [10:38:16] RECOVERY - Varnish HTTP upload-backend on cp3015 is OK: HTTP OK: HTTP/1.1 200 OK - 189 bytes in 0.197 second response time [10:39:26] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:39:45] PROBLEM - Host cp1056 is DOWN: PING CRITICAL - Packet loss = 100% [10:40:35] RECOVERY - Host cp1056 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [10:41:16] RECOVERY - HTTPS on cp3015 is OK: SSLXNN OK - 36 OK [10:42:26] PROBLEM - NTP on cp4020 is CRITICAL: NTP CRITICAL: Offset unknown [10:45:53] !log cp301[56] frontends repooled [10:46:00] Logged the message, Master [10:46:23] (03PS1) 10BBlack: re-enable cp3015 backend [puppet] - 10https://gerrit.wikimedia.org/r/187103 [10:46:25] (03PS1) 10BBlack: re-enable cp3016 backend [puppet] - 
10https://gerrit.wikimedia.org/r/187104 [10:46:36] RECOVERY - NTP on cp4020 is OK: NTP OK: Offset -0.004033684731 secs [10:48:19] (03CR) 10BBlack: [C: 032] re-enable cp3015 backend [puppet] - 10https://gerrit.wikimedia.org/r/187103 (owner: 10BBlack) [10:52:46] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:26] RECOVERY - Host cp4017 is UP: PING OK - Packet loss = 0%, RTA = 79.83 ms [11:02:36] (03CR) 10BBlack: [C: 032] re-enable cp3016 backend [puppet] - 10https://gerrit.wikimedia.org/r/187104 (owner: 10BBlack) [11:06:26] PROBLEM - Host cp1040 is DOWN: PING CRITICAL - Packet loss = 100% [11:07:45] RECOVERY - Host cp1040 is UP: PING WARNING - Packet loss = 66%, RTA = 0.84 ms [11:10:05] PROBLEM - NTP on cp4017 is CRITICAL: NTP CRITICAL: Offset unknown [11:14:16] RECOVERY - NTP on cp4017 is OK: NTP OK: Offset 0.002462387085 secs [11:19:26] PROBLEM - Host amssq32 is DOWN: PING CRITICAL - Packet loss = 100% [11:20:25] RECOVERY - Host amssq32 is UP: PING OK - Packet loss = 0%, RTA = 95.43 ms [11:23:25] PROBLEM - NTP on cp1040 is CRITICAL: NTP CRITICAL: Offset unknown [11:27:35] RECOVERY - NTP on cp1040 is OK: NTP OK: Offset -0.001349806786 secs [11:35:36] PROBLEM - NTP on amssq32 is CRITICAL: NTP CRITICAL: Offset unknown [11:39:55] RECOVERY - NTP on amssq32 is OK: NTP OK: Offset 0.0009305477142 secs [11:47:36] PROBLEM - Host cp4006 is DOWN: PING CRITICAL - Packet loss = 100% [11:48:05] RECOVERY - Host cp4006 is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms [11:48:35] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [11:49:26] PROBLEM - NTP on amssq58 is CRITICAL: NTP CRITICAL: Offset unknown [11:53:36] RECOVERY - NTP on amssq58 is OK: NTP OK: Offset 0.0007489919662 secs [12:00:06] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100% [12:02:05] RECOVERY - Host cp3007 is UP: PING OK - Packet loss = 0%, RTA = 95.71 ms [12:03:15] PROBLEM - NTP on cp4006 is CRITICAL: NTP CRITICAL: Offset unknown [12:07:26] RECOVERY - NTP on cp4006 is OK: NTP OK: Offset 0.002272367477 secs [12:13:56] PROBLEM - Host amssq54 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:47] RECOVERY - Host amssq54 is UP: PING OK - Packet loss = 0%, RTA = 95.32 ms [12:14:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:16:55] PROBLEM - Varnishkafka log producer on amssq54 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [12:17:57] RECOVERY - Varnishkafka log producer on amssq54 is OK: PROCS OK: 1 process with command name varnishkafka [12:27:15] PROBLEM - Host cp1054 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:56] RECOVERY - Host cp1054 is UP: PING OK - Packet loss = 0%, RTA = 3.15 ms [12:30:36] PROBLEM - NTP on amssq54 is CRITICAL: NTP CRITICAL: Offset unknown [12:33:46] RECOVERY - NTP on amssq54 is OK: NTP OK: Offset 0.001343727112 secs [12:44:15] PROBLEM - NTP on cp1054 is CRITICAL: NTP CRITICAL: Offset unknown [12:48:25] RECOVERY - NTP on cp1054 is OK: NTP OK: Offset -0.003492951393 secs [12:54:16] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100% [12:55:55] RECOVERY - Host cp3006 is UP: PING WARNING - Packet loss = 61%, RTA = 95.30 ms [12:56:55] PROBLEM - NTP on amssq39 is CRITICAL: NTP CRITICAL: Offset unknown [13:01:15] RECOVERY - NTP on amssq39 is OK: NTP OK: Offset 0.001913785934 secs [13:07:35] PROBLEM - Host amssq31 is DOWN: PING CRITICAL - Packet loss = 100% [13:08:26] RECOVERY - Host amssq31 is UP: PING WARNING - Packet 
loss = 86%, RTA = 94.63 ms [13:10:55] PROBLEM - Varnishkafka log producer on amssq31 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [13:11:46] PROBLEM - NTP on cp3006 is CRITICAL: NTP CRITICAL: Offset unknown [13:11:55] RECOVERY - Varnishkafka log producer on amssq31 is OK: PROCS OK: 1 process with command name varnishkafka [13:14:56] RECOVERY - NTP on cp3006 is OK: NTP OK: Offset 4.79221344e-05 secs [13:22:16] PROBLEM - Host cp1045 is DOWN: PING CRITICAL - Packet loss = 100% [13:23:05] RECOVERY - Host cp1045 is UP: PING OK - Packet loss = 0%, RTA = 1.74 ms [13:23:26] PROBLEM - NTP on amssq31 is CRITICAL: NTP CRITICAL: Offset unknown [13:27:36] RECOVERY - NTP on amssq31 is OK: NTP OK: Offset -0.0006786584854 secs [13:35:06] PROBLEM - Host cp4015 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [13:36:36] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 79.66 ms [13:43:43] (03PS1) 10Mjbmr: Enable new user messaage for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187112 [13:49:06] PROBLEM - Host cp4009 is DOWN: PING CRITICAL - Packet loss = 100% [13:49:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:49:55] RECOVERY - Host cp4009 is UP: PING OK - Packet loss = 0%, RTA = 80.48 ms [13:51:27] PROBLEM - NTP on cp4015 is CRITICAL: NTP CRITICAL: Offset unknown [13:55:45] RECOVERY - NTP on cp4015 is OK: NTP OK: Offset 0.006914496422 secs [14:02:07] PROBLEM - Host cp1055 is DOWN: PING CRITICAL - Packet loss = 100% [14:03:45] RECOVERY - Host cp1055 is UP: PING OK - Packet loss = 0%, RTA = 1.80 ms [14:05:17] PROBLEM - NTP on cp4009 is CRITICAL: NTP CRITICAL: Offset unknown [14:09:26] RECOVERY - NTP on cp4009 is OK: NTP OK: Offset -0.002692580223 secs [14:15:16] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [14:16:16] RECOVERY - Host amssq43 is UP: PING WARNING - Packet loss = 37%, RTA = 96.44 ms [14:29:07] PROBLEM - Host cp1051 is DOWN: PING CRITICAL - Packet loss = 100% [14:30:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [14:30:46] RECOVERY - Host cp1051 is UP: PING OK - Packet loss = 0%, RTA = 1.42 ms [14:31:55] PROBLEM - NTP on amssq43 is CRITICAL: NTP CRITICAL: Offset unknown [14:36:05] RECOVERY - NTP on amssq43 is OK: NTP OK: Offset -0.005672454834 secs [14:43:17] PROBLEM - Host cp3008 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:15] RECOVERY - Host cp3008 is UP: PING WARNING - Packet loss = 61%, RTA = 95.35 ms [14:46:25] PROBLEM - NTP on cp1051 is CRITICAL: NTP CRITICAL: Offset unknown [14:50:35] RECOVERY - NTP on cp1051 is OK: NTP OK: Offset -0.002853035927 secs [14:55:46] PROBLEM - Host cp1048 is DOWN: PING CRITICAL - Packet loss = 100% [14:58:06] RECOVERY - Host cp1048 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [14:59:17] PROBLEM - NTP on cp3008 is CRITICAL: NTP CRITICAL: Offset unknown [14:59:35] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [15:03:35] RECOVERY - NTP on cp3008 is OK: NTP OK: Offset -0.001182675362 secs [15:10:26] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:11:56] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.35 ms [15:13:55] PROBLEM - NTP on cp1048 is CRITICAL: NTP 
CRITICAL: Offset unknown [15:16:56] RECOVERY - NTP on cp1048 is OK: NTP OK: Offset -0.001486539841 secs [15:18:35] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [15:22:56] PROBLEM - Host amssq57 is DOWN: PING CRITICAL - Packet loss = 100% [15:23:56] RECOVERY - Host amssq57 is UP: PING WARNING - Packet loss = 86%, RTA = 105.77 ms [15:25:57] PROBLEM - Varnishkafka log producer on amssq57 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [15:25:57] PROBLEM - Varnish traffic logger on amssq57 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishncsa [15:26:48] PROBLEM - NTP on cp3020 is CRITICAL: NTP CRITICAL: Offset unknown [15:27:05] RECOVERY - Varnishkafka log producer on amssq57 is OK: PROCS OK: 1 process with command name varnishkafka [15:27:05] RECOVERY - Varnish traffic logger on amssq57 is OK: PROCS OK: 2 processes with command name varnishncsa [15:29:57] RECOVERY - NTP on cp3020 is OK: NTP OK: Offset -0.009299397469 secs [15:32:26] (03CR) 10Glaisher: "Create a task with a link to community consensus first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187112 (owner: 10Mjbmr) [15:37:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [15:38:05] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:55] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 95.15 ms [15:39:37] PROBLEM - NTP on amssq57 is CRITICAL: NTP CRITICAL: Offset unknown [15:42:55] RECOVERY - NTP on amssq57 is OK: NTP OK: Offset -0.00222992897 secs [15:51:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:51:26] PROBLEM - Host cp4001 is DOWN: PING CRITICAL - Packet loss = 100% [15:52:03] (03PS1) 10BBlack: Let icinga client NTP check wait longer for success [puppet] - 10https://gerrit.wikimedia.org/r/187117 [15:52:16] RECOVERY - Host cp4001 is UP: PING OK - Packet loss = 0%, RTA = 81.90 ms [15:52:29] (03CR) 10BBlack: [C: 032 V: 032] Let icinga client NTP check wait longer for success [puppet] - 10https://gerrit.wikimedia.org/r/187117 (owner: 10BBlack) [15:53:17] 3WMF-NDA-Requests, operations: Grant Nikerabbit access to WMF-NDA group - https://phabricator.wikimedia.org/T86632#998346 (10Qgil) I spoke yesterday with @LuisV_WMF and he said that any existing WMF employee and contractor with #req has an NDA signed, and all new employees/contractors-with-#req sign one as part... [15:53:38] 3WMF-NDA-Requests, operations: Grant Nikerabbit access to WMF-NDA group - https://phabricator.wikimedia.org/T86632#998347 (10Qgil) a:5LuisV_WMF>3Aklapper [15:54:24] 3WMF-Legal, WMF-NDA-Requests, operations: Grant WMF-NDA access to Stas in Phabricator - https://phabricator.wikimedia.org/T85170#998349 (10Qgil) a:5LuisV_WMF>3Aklapper See T86632#998346 [15:54:27] PROBLEM - NTP on cp3009 is CRITICAL: NTP CRITICAL: Offset unknown [15:56:49] 3WMF-Legal, WMF-NDA-Requests, operations: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#998357 (10Qgil) For what is worth, Maarten, Luis, and me met yesterday night at the MediaWiki Developer Summit party. Not the most serious and official environment, but we talked about this task a... [15:57:45] RECOVERY - NTP on cp3009 is OK: NTP OK: Offset -0.003054261208 secs [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. 
Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150128T1600). [16:04:28] PROBLEM - Host cp4010 is DOWN: PING CRITICAL - Packet loss = 100% [16:05:27] RECOVERY - Host cp4010 is UP: PING OK - Packet loss = 0%, RTA = 80.01 ms [16:07:08] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [16:10:45] * ^d takes swat [16:10:46] <^d> looks like no patches [16:10:48] <^d> ok, swat over [16:11:57] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 90.19 ms [16:17:28] PROBLEM - Host cp1059 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:28] RECOVERY - Host cp1059 is UP: PING OK - Packet loss = 0%, RTA = 2.93 ms [16:28:11] Does wmgUseDualLicense have any effect? [16:30:06] I don't see any difference in wikis where it is set to true from the wikis where it is set to false. [16:31:44] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:24] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:32:54] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 79.83 ms [16:43:33] marktraceur: http://en.wikipedia.beta.wmflabs.org/wiki/User:Jdforrester_(WMF)/Sandbox ain't got no images. [16:47:08] Hm. [16:47:23] James_F: I have access to the deployment-upload machine but not sure how to diagnose or fix [16:47:27] I could just restart it [16:47:39] marktraceur: Can you see if it's running any services? [16:48:04] PROBLEM - Varnishkafka log producer on amssq53 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [16:48:13] PROBLEM - puppet last run on amssq53 is CRITICAL: CRITICAL: puppet fail [16:48:23] A few services running [16:48:33] James_F: Which ones would you be interested in? [16:49:00] marktraceur: I have no idea. [16:49:13] RECOVERY - Varnishkafka log producer on amssq53 is OK: PROCS OK: 1 process with command name varnishkafka [16:49:14] RECOVERY - puppet last run on amssq53 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:49:44] 3Datasets-General-or-Unknown, operations: Enable IPv6 on dumps.wikimedia.org - https://phabricator.wikimedia.org/T68996#998380 (10Dzahn) ``` 2: eth0: mtu 1500 qdisc noop state DOWN qlen 1000 link/ether 78:2b:cb:76:42:ee brd ff:ff:ff:ff:ff:ff ``` ``` 4: eth2: (03PS1) 10Dzahn: add IPv6 interface to dataset1001 (eth2) [puppet] - 10https://gerrit.wikimedia.org/r/187121 (https://phabricator.wikimedia.org/T68996) [16:53:24] (03CR) 10Dzahn: "new attempt in Change-Id: Iade83d3af3a21841c8" [puppet] - 10https://gerrit.wikimedia.org/r/183074 (owner: 10Faidon Liambotis) [16:58:03] PROBLEM - Host amssq55 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:03] RECOVERY - Host amssq55 is UP: PING OK - Packet loss = 0%, RTA = 88.23 ms [17:00:37] Glaisher: Please elaborate: https://wikitech.wikimedia.org/w/index.php?title=Deployments&oldid=142163&diff=prev [17:00:57] i talked with a tech yesterday and he said to me i can add it there *Sigh* [17:01:19] Steinsplitter: Oh I did that because I might not be around when it was deployed [17:01:30] oh, ok :) [17:01:30] and wasn't some question asked on the task? [17:02:12] if really needed, it can be merged later in a separate patch. The actual patch should solve the problem. [17:07:31] (03CR) 10Nuria: "Sean:" [puppet] - 10https://gerrit.wikimedia.org/r/186356 (owner: 10Milimetric) [17:07:43] YuviPanda: yt?
[17:08:16] nuria: vaguely. Still in bed, waking up :) [17:08:34] YuviPanda: let's talk when you have had your chai then ...will be here [17:08:41] YuviPanda: no rush [17:09:02] nuria: cool. Are you still in the office? [17:09:25] YuviPanda: no, not anymore but I am on PST so it'll be easy to sync up [17:10:58] (03CR) 10Manybubbles: "I'd prefer to default the thread count to whatever the default is in Elasticsearch (3?) and make Cirrus set it to 1 but this is cool too." [puppet] - 10https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) (owner: 10BryanDavis) [17:11:53] PROBLEM - Host cp4012 is DOWN: PING CRITICAL - Packet loss = 100% [17:13:00] (03Abandoned) 10KartikMistry: WIP: Accept requests from the given domains [puppet] - 10https://gerrit.wikimedia.org/r/183429 (owner: 10KartikMistry) [17:13:24] RECOVERY - Host cp4012 is UP: PING OK - Packet loss = 0%, RTA = 80.87 ms [17:14:47] 3Wikimedia-OTRS, operations: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#998431 (10Dzahn) [17:15:59] 3Wikimedia-OTRS, operations: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#986330 (10Dzahn) This sounds like it's an upstream issue with OTRS itself and a feature request more than an Ops thing. ? [17:23:36] 3Wikimedia-OTRS, operations: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#998470 (10pajz) Second Rjd0060's comment here. OTRS can be configured to stop constantly comparing your current ip with your 'login ip' (see my first comment), and, of course, technically, we could chan... [17:24:08] (03CR) 10BryanDavis: "Default is `Math.max(1, Math.min(3, Runtime.getRuntime().availableProcessors() / 2))`; I think this would generally == 3 for all of our se" [puppet] - 10https://gerrit.wikimedia.org/r/186986 (https://phabricator.wikimedia.org/T87526) (owner: 10BryanDavis) [17:25:04] PROBLEM - Host cp1060 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:44] RECOVERY - Host cp1060 is UP: PING WARNING - Packet loss = 93%, RTA = 98.47 ms [17:30:59] 3Triagers, Phabricator, operations, Project-Creators: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#998482 (10RobLa-WMF) Please add @Gilles to Project-Creators. He is managing sprints for the Multimedia team. Thanks! [17:32:50] 3Wikimedia-OTRS, operations: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#998487 (10Dzahn) [17:34:43] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#998494 (10Dzahn) >>! In T73156#974224, @RobH wrote: > bug-attachment.wikimedia.org - old service, should push behind misc web since its depreciating > bugzilla.wikimedia.org - old service, should push behind misc web si... 
[17:37:53] !log restarted elasticsearch on logstash1001; OOM [17:38:00] Logged the message, Master [17:38:53] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [17:39:43] RECOVERY - Host amssq35 is UP: PING OK - Packet loss = 0%, RTA = 88.17 ms [17:44:45] 3Triagers, Phabricator, operations, Project-Creators: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#998551 (10Legoktm) [17:44:58] 3Triagers, Phabricator, operations, Project-Creators: Broaden the group of users that can create projects in Phabricator - https://phabricator.wikimedia.org/T706#998557 (10Qgil) {{done}} [17:46:25] 3ops-core, operations: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#998563 (10Dzahn) p:5Triage>3Normal [17:46:44] 3operations: ircd doesnt come back after server reboot - https://phabricator.wikimedia.org/T87679#998564 (10Dzahn) p:5Triage>3Normal [17:50:08] 3operations: ircd doesnt come back after server reboot - https://phabricator.wikimedia.org/T87679#998572 (10Dzahn) a:3chasemp Thanks:) Wanna just merge my change and watch the puppet run? I just copied the upstart file from the ircecho service and removed the one line with the "respawn limit". As long as it do... [17:51:18] 3operations, Beta-Cluster: Minimize differences between beta and production (Tracking) - https://phabricator.wikimedia.org/T87220#998581 (10Dzahn) p:5Triage>3Normal [17:51:51] 3Phabricator, operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#998583 (10Dzahn) p:5Triage>3Normal [17:52:25] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [17:53:09] 3operations, Wikimedia-Bugzilla: analyze Bugzilla access logs - https://phabricator.wikimedia.org/T86859#998584 (10Dzahn) p:5Triage>3Low [17:53:23] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 97.32 ms [17:53:30] 3ops-core, operations: Setup the main appservers cluster in codfw - https://phabricator.wikimedia.org/T86893#998585 (10Dzahn) p:5Triage>3Normal [17:53:37] 3ops-core, operations: Setup the api appservers cluster in codfw - https://phabricator.wikimedia.org/T86892#998588 (10Dzahn) p:5Triage>3Normal [17:53:46] 3ops-core, operations: Setup videoscalers cluster in codfw - https://phabricator.wikimedia.org/T86891#998590 (10Dzahn) p:5Triage>3Normal [17:53:55] 3ops-core, operations: Setup imagescalers cluster in codfw - https://phabricator.wikimedia.org/T86890#998592 (10Dzahn) p:5Triage>3Normal [17:54:02] 3ops-core, operations: Setup jobrunners cluster in codfw - https://phabricator.wikimedia.org/T86889#998594 (10Dzahn) p:5Triage>3Normal [17:54:11] 3ops-core, operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#998597 (10Dzahn) p:5Triage>3Normal [17:54:18] 3ops-core, operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#998599 (10Dzahn) p:5Triage>3Normal [17:55:55] 3MediaWiki-Core-Team, operations: Audit log sources and see if we can make them less spammy - https://phabricator.wikimedia.org/T87205#998605 (10Dzahn) p:5Triage>3Low is it about running out of disk space or just to remove noise for humans? 
[17:57:04] 3Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#998607 (10Dzahn) [17:57:37] 3Wikimedia-Git-or-Gerrit, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#998609 (10Dzahn) p:5Triage>3High [17:58:42] (03PS1) 10Jforrester: Beta Labs: Point to the Citoid service on WMF Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187131 [17:58:44] (03PS1) 10Jforrester: Provide the Citoid extension for test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187132 [17:59:38] 3Phabricator, operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#998617 (10csteipp) a:5csteipp>3None >>! In T87611#995412, @RobH wrote: > I've assigned this to Chris for his commentary. > > Chris: Please provide feedback and then feel free to unassign yourself as owner (or assi... [18:01:29] 3ops-core, operations: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#998621 (10Dzahn) p:5Triage>3Normal how urgent is this? well, when it says "chronically unhealthy" that sounds like "normal" is appropriate? [18:05:02] 3operations: Remove misc/statistics.pp - https://phabricator.wikimedia.org/T87450#998639 (10Dzahn) p:5Triage>3Normal [18:05:44] PROBLEM - Host amssq33 is DOWN: PING CRITICAL - Packet loss = 100% [18:06:09] 3operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#998643 (10Dzahn) >>! In T86971#981861, @faidon wrote: > slight preference against non-human users having a home directory under /home.> > I'd prefer e.g. /var/lib/mwdeploy for this. ^ this. +1 [18:06:55] 3operations: Decide on /var/lib vs /home as locations of homedir for mwdeploy - https://phabricator.wikimedia.org/T86971#998644 (10Dzahn) p:5Triage>3Low [18:07:13] RECOVERY - Host amssq33 is UP: PING OK - Packet loss = 0%, RTA = 93.87 ms [18:08:35] 3Phabricator, operations: Add @emailbot to #wmf-nda - https://phabricator.wikimedia.org/T87611#998655 (10Dzahn) a:3RobH [18:08:58] 3operations: Problems applying role::mediawiki to a fresh Trusty install - https://phabricator.wikimedia.org/T87550#998658 (10Dzahn) p:5Triage>3Normal [18:09:24] 3operations: db1021 %iowait up - https://phabricator.wikimedia.org/T87277#998670 (10Dzahn) @Springle Btw, do you want a project tag for db related things? [18:10:06] (03PS2) 10KartikMistry: cxserver: Remove unused dict packages [puppet] - 10https://gerrit.wikimedia.org/r/186983 [18:10:20] 3ops-core, operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#998672 (10Dzahn) p:5Triage>3Normal [18:12:30] 3operations, Beta-Cluster: Set up an alert for unmerged changes in deployment-prep - https://phabricator.wikimedia.org/T87616#998675 (10Dzahn) production has a check for this, should be among this: modules/monitoring/manifests/icinga/git_merge.pp: description => "Unmerged changes on repository ${title}"... [18:14:32] 3ops-core, operations: Allocate a few servers to logstash - https://phabricator.wikimedia.org/T87031#998679 (10bd808) I think we should kill this request and just get the procurement ticket in for "real" hardware. @YuviPanda is working on that now I think but I don't know if the ticket exists yet. We got a littl... 
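As an aside on the production check Dzahn points at above for T87616: going by the quoted file path, Puppet's autoloader would expose it as a monitoring::icinga::git_merge define, and the production usage for the MediaWiki config repository presumably looks roughly like the sketch below. This is a hypothetical reconstruction: the resource title and the dir parameter are inferred from the alert text that appears later in this log ("Unmerged changes on repository mediawiki_config ... dir /srv/mediawiki-staging/"), not copied from the module.

```
# Hypothetical usage sketch of the existing production check; parameter
# names are assumptions inferred from the icinga alert output.
monitoring::icinga::git_merge { 'mediawiki_config':
    dir => '/srv/mediawiki-staging/',  # directory shown in the alert text
}
# Yields an icinga service described as
# "Unmerged changes on repository mediawiki_config"; a deployment-prep
# alert would mirror this declaration for the beta cluster's repos.
```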
[18:14:54] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.136:9200/_cluster/health error while fetching: Request timed out. [18:14:54] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.137:9200/_cluster/health error while fetching: Request timed out. [18:15:15] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.138:9200/_cluster/health error while fetching: Request timed out. [18:15:53] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 119, initializing_shards: 2, number_of_data_nodes: 3 [18:16:13] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 2, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 119, initializing_shards: 2, number_of_data_nodes: 3 [18:17:03] 3operations, Wikimedia-Etherpad: move etherpad behind misc-web - https://phabricator.wikimedia.org/T85788#998682 (10Dzahn) [18:17:23] 3Wikimedia-Planet, operations: move planet behind misc-web - https://phabricator.wikimedia.org/T85789#998683 (10Dzahn) [18:17:33] 3operations, Wikimedia-Bugzilla: move old-bugzilla / bug-attachment behind misc-web - https://phabricator.wikimedia.org/T85785#998685 (10Dzahn) [18:18:58] 3operations, ops-requests: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#998689 (10Dzahn) [18:19:34] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 34 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 30, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 89, uinitializing_shards: 4, unumber_of_data_nodes: 3} [18:19:34] PROBLEM - Host cp4008 is DOWN: PING CRITICAL - Packet loss = 100% [18:20:14] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 17 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 13, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 106, uinitializing_shards: 4, unumber_of_data_nodes: 3} [18:20:34] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 8, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 4, number_of_data_nodes: 3 [18:20:51] blah elasticsearch what are you doing [18:21:03] RECOVERY - Host cp4008 is UP: PING OK - Packet loss = 0%, RTA = 80.11 ms [18:21:04] (03CR) 10Catrope: [C: 032] Beta Labs: Point to the Citoid service on WMF Labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187131 (owner: 10Jforrester) [18:21:09] (03Merged) 10jenkins-bot: Beta Labs: Point to the Citoid service on WMF Labs 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/187131 (owner: 10Jforrester) [18:21:14] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 4, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 115, initializing_shards: 4, number_of_data_nodes: 3 [18:21:14] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 4, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 115, initializing_shards: 4, number_of_data_nodes: 3 [18:21:24] 3operations, ops-requests: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#998710 (10Dzahn) >The repository should be made compliant with the lint checks .. Yes. that. It's just an ungoing and very long task. We have been making tons of lint changes over the... [18:22:39] 3operations, ops-requests: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#998722 (10Dzahn) [18:23:00] 3operations, ops-requests: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#963609 (10Dzahn) The patch has been abandoned but says there will be a new pep8 compliant version. Added Operations tag. [18:23:21] (03CR) 10Catrope: [C: 04-1] Provide the Citoid extension for test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187132 (owner: 10Jforrester) [18:24:17] (03PS1) 10Ori.livneh: Pass --proxy-server=http://127.0.0.1 to Chromium on osmium [puppet] - 10https://gerrit.wikimedia.org/r/187135 [18:26:13] (03CR) 10Ori.livneh: [C: 032] Pass --proxy-server=http://127.0.0.1 to Chromium on osmium [puppet] - 10https://gerrit.wikimedia.org/r/187135 (owner: 10Ori.livneh) [18:30:54] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [18:32:49] 3ops-core, Analytics, operations: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#998750 (10Ottomata) FYI, the pagecounts-raw files found at dumps.wikimedia.org/other/pagecounts-raw/ use the nginx logs. We and are now recreating this data using Hive via varnishkafka. Christian and I... [18:33:04] PROBLEM - Host cp4011 is DOWN: PING CRITICAL - Packet loss = 100% [18:34:28] (03PS1) 10Ori.livneh: Add role::mediawiki::appserver to osmium [puppet] - 10https://gerrit.wikimedia.org/r/187138 [18:34:34] RECOVERY - Host cp4011 is UP: PING OK - Packet loss = 0%, RTA = 81.45 ms [18:34:38] 3Analytics, operations: Hadoop logs on logstash are being really spammy - https://phabricator.wikimedia.org/T87206#998753 (10Ottomata) FYI, I restarted most hadoop daemons yesterday. There might be a few that I didn't, but the volume of logs should be much less now. [18:34:59] (03CR) 10Ori.livneh: [C: 032 V: 032] Add role::mediawiki::appserver to osmium [puppet] - 10https://gerrit.wikimedia.org/r/187138 (owner: 10Ori.livneh) [18:35:24] 3ops-core, Analytics, operations: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#998754 (10faidon) Are they? 
So are these just counting the X% of requests that come via HTTPS, where X is < 5 probably (and also a biased sample, as this is predominantly editors)? [18:37:54] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: puppet fail [18:38:26] 3operations: Remove misc/statistics.pp - https://phabricator.wikimedia.org/T87450#998761 (10Ottomata) I'm close! Just geowiki stuff left to do. I have this mostly done, but it needs a little more thought,. [18:39:34] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [18:40:37] (03PS1) 10Ori.livneh: Use role() to apply webserver role to osmium [puppet] - 10https://gerrit.wikimedia.org/r/187139 [18:41:04] (03CR) 10Ori.livneh: [C: 032 V: 032] Use role() to apply webserver role to osmium [puppet] - 10https://gerrit.wikimedia.org/r/187139 (owner: 10Ori.livneh) [18:41:38] 3ops-core, Analytics, operations: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#998771 (10Ottomata) So, in udp2log, the https and the duplicated proxied http request both exist. That means that a any given https request will have 2 entries in the logs. The webstatscollector code ch... [18:42:14] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [18:42:14] (03PS1) 10Ori.livneh: Remove duplicate 'admin' from osmium [puppet] - 10https://gerrit.wikimedia.org/r/187140 [18:42:30] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove duplicate 'admin' from osmium [puppet] - 10https://gerrit.wikimedia.org/r/187140 (owner: 10Ori.livneh) [18:44:24] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:45:03] RECOVERY - Host cp3018 is UP: PING WARNING - Packet loss = 54%, RTA = 108.53 ms [18:45:51] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#998778 (10Dzahn) https://gerrit.wikimedia.org/r/#/c/187077/ [18:46:24] PROBLEM - Host cp1039 is DOWN: PING CRITICAL - Packet loss = 100% [18:47:43] (03CR) 10KartikMistry: "We can go ahead with this!" [puppet] - 10https://gerrit.wikimedia.org/r/186538 (owner: 10KartikMistry) [18:48:24] RECOVERY - Host cp1039 is UP: PING OK - Packet loss = 16%, RTA = 0.50 ms [18:49:30] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#998779 (10Dzahn) http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20esams&h=amslvs1.esams.wikimedia.org&m=cpu_report&r=hour&s=descending&hc=4&mc=2 http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20esams&h=amslvs2.esams.wikimedia.org... [18:50:57] 3Wikimedia-OTRS, operations: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#998789 (10lfaraone) a:3lfaraone >>! In T87217#998470, @pajz wrote: > Second Rjd0060's comment here. OTRS can be configured to stop constantly comparing your current ip with your 'login ip' (see my fir... 
[18:51:16] nuria: ‘sup [18:52:14] (03CR) 10BBlack: [C: 031] "These should be fine to kill w/ regular decom procedure from https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Reclaim_or_Decommission" [puppet] - 10https://gerrit.wikimedia.org/r/187077 (https://phabricator.wikimedia.org/T87729) (owner: 10Dzahn) [18:52:26] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#998796 (10Dzahn) a:3Dzahn [18:52:33] YuviPanda: Talking to springle on #wikimedis-analytics about the db changes in labs [18:54:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:55:38] (03PS1) 10Reedy: Enable Extension:Graph on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187142 (https://phabricator.wikimedia.org/T87770) [18:55:38] 3operations: update ServerLifecycle page - https://phabricator.wikimedia.org/T87782#998806 (10Dzahn) 3NEW [18:56:00] 3operations: update ServerLifecycle page - https://phabricator.wikimedia.org/T87782#998816 (10Dzahn) [18:58:24] (03PS3) 10Reedy: Correct wikidata uri for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 (owner: 10Jdlrobson) [18:58:32] (03CR) 10Reedy: [C: 032] Correct wikidata uri for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 (owner: 10Jdlrobson) [18:58:52] (03Merged) 10jenkins-bot: Correct wikidata uri for beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186889 (owner: 10Jdlrobson) [18:59:13] Reedy: :) [18:59:29] (03PS2) 10Reedy: Enable Extension:Graph on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187142 (https://phabricator.wikimedia.org/T87770) [18:59:35] (03CR) 10Reedy: [C: 032] Enable Extension:Graph on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187142 (https://phabricator.wikimedia.org/T87770) (owner: 10Reedy) [18:59:39] (03Merged) 10jenkins-bot: Enable Extension:Graph on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187142 (https://phabricator.wikimedia.org/T87770) (owner: 10Reedy) [19:00:22] PROBLEM - Host cp1050 is DOWN: PING CRITICAL - Packet loss = 100% [19:01:26] (03PS2) 10Reedy: Enable new user messaage for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187112 (owner: 10Mjbmr) [19:01:53] (03CR) 10Reedy: [C: 032] Enable new user messaage for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187112 (owner: 10Mjbmr) [19:01:53] RECOVERY - Host cp1050 is UP: PING OK - Packet loss = 0%, RTA = 2.11 ms [19:01:57] (03Merged) 10jenkins-bot: Enable new user messaage for fawikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187112 (owner: 10Mjbmr) [19:03:57] (03PS2) 10Reedy: Tidyup extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186910 [19:04:31] (03CR) 10Reedy: [C: 032] Tidyup extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186910 (owner: 10Reedy) [19:04:35] (03Merged) 10jenkins-bot: Tidyup extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186910 (owner: 10Reedy) [19:04:43] PROBLEM - mediawiki-installation DSH group on osmium is CRITICAL: Host osmium is not in mediawiki-installation dsh group [19:05:51] (03PS3) 10Dzahn: decom amslvs1-4 [puppet] - 10https://gerrit.wikimedia.org/r/187077 (https://phabricator.wikimedia.org/T87729) [19:06:16] (03CR) 10Dzahn: [C: 032] decom amslvs1-4 [puppet] - 10https://gerrit.wikimedia.org/r/187077 (https://phabricator.wikimedia.org/T87729) (owner: 10Dzahn) [19:07:56] kart_: Do you want https://gerrit.wikimedia.org/r/#/c/186358/ merging and 
deploying? [19:10:02] (03PS2) 10Reedy: Don't include DefaultSettings.php or set $DP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164010 (owner: 10PleaseStand) [19:10:09] (03CR) 10Reedy: [C: 032] Don't include DefaultSettings.php or set $DP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164010 (owner: 10PleaseStand) [19:10:13] !log decom amslvs1-4, removing from puppet [19:10:14] (03Merged) 10jenkins-bot: Don't include DefaultSettings.php or set $DP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164010 (owner: 10PleaseStand) [19:10:17] Logged the message, Master [19:11:16] (03PS4) 10Reedy: Remove Anexo namespace on pt.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172012 (https://bugzilla.wikimedia.org/73164) (owner: 10Dereckson) [19:11:25] (03PS5) 10Reedy: Remove Anexo namespace on pt.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172012 (https://phabricator.wikimedia.org/T75164) (owner: 10Dereckson) [19:11:59] (03CR) 10Reedy: [C: 04-1] "Needs rebasing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181358 (https://phabricator.wikimedia.org/T85049) (owner: 10Gergő Tisza) [19:12:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nope, not really. We expect https://phabricator.wikimedia.org/T87587 so we can configure the HTTP forward proxy in the config file (via th" [puppet] - 10https://gerrit.wikimedia.org/r/186538 (owner: 10KartikMistry) [19:13:43] (03CR) 10Reedy: [C: 04-1] "Needs rebasing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186539 (owner: 10Mattflaschen) [19:13:50] (03PS2) 10Reedy: Update logo URL for nostalgiawiki to point to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186617 (owner: 10PleaseStand) [19:13:58] (03CR) 10Reedy: [C: 032] Update logo URL for nostalgiawiki to point to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186617 (owner: 10PleaseStand) [19:14:04] (03Merged) 10jenkins-bot: Update logo URL for nostalgiawiki to point to Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186617 (owner: 10PleaseStand) [19:15:08] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#998903 (10Dzahn) removed all 4 from stored puppet configs, stopped puppet agents, revoked puppet certs on puppetmaster, revoked salt keys, removed from dsh groups, from site.pp ... 
[19:15:13] (03PS2) 10Reedy: Change enwikinews $wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186742 (https://phabricator.wikimedia.org/T87522) (owner: 10Glaisher) [19:15:19] (03CR) 10Reedy: [C: 032] Change enwikinews $wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186742 (https://phabricator.wikimedia.org/T87522) (owner: 10Glaisher) [19:15:24] (03Merged) 10jenkins-bot: Change enwikinews $wgNamespacesToBeSearchedDefault [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186742 (https://phabricator.wikimedia.org/T87522) (owner: 10Glaisher) [19:15:42] (03PS3) 10Reedy: Create "eliminator" user group on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186763 (https://phabricator.wikimedia.org/T87558) (owner: 10Calak) [19:15:50] (03CR) 10Reedy: [C: 032] Create "eliminator" user group on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186763 (https://phabricator.wikimedia.org/T87558) (owner: 10Calak) [19:16:01] (03Merged) 10jenkins-bot: Create "eliminator" user group on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186763 (https://phabricator.wikimedia.org/T87558) (owner: 10Calak) [19:16:04] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [19:16:06] (03PS2) 10Reedy: Remove toolserver IP ConfirmEdit whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186823 (owner: 10Hoo man) [19:16:12] (03PS1) 10Nuria: Switching wikimetrics to connect to labsdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/187147 [19:16:14] (03CR) 10Reedy: [C: 032] Remove toolserver IP ConfirmEdit whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186823 (owner: 10Hoo man) [19:16:26] (03Merged) 10jenkins-bot: Remove toolserver IP ConfirmEdit whitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186823 (owner: 10Hoo man) [19:16:45] (03PS2) 10Reedy: Set $wmgAbuseFilterEmergencyDisableCount to 25 at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [19:16:52] (03CR) 10Reedy: [C: 032] Set $wmgAbuseFilterEmergencyDisableCount to 25 at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [19:16:56] (03Merged) 10jenkins-bot: Set $wmgAbuseFilterEmergencyDisableCount to 25 at commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186743 (https://phabricator.wikimedia.org/T87431) (owner: 10Glaisher) [19:17:22] (03CR) 10Springle: [C: 031] Switching wikimetrics to connect to labsdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/187147 (owner: 10Nuria) [19:17:24] (03PS3) 10Reedy: Add $wgLogo for wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186919 (owner: 10Glaisher) [19:17:34] (03CR) 10Reedy: [C: 032] Add $wgLogo for wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186919 (owner: 10Glaisher) [19:17:39] (03Merged) 10jenkins-bot: Add $wgLogo for wikimania2016wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186919 (owner: 10Glaisher) [19:18:15] (03CR) 10Springle: [C: 032] Switching wikimetrics to connect to labsdb1001 [puppet] - 10https://gerrit.wikimedia.org/r/187147 (owner: 10Nuria) [19:19:31] (03PS3) 10Yuvipanda: scap: Move 'common_scripts' into scripts class [puppet] - 10https://gerrit.wikimedia.org/r/186597 (https://phabricator.wikimedia.org/T87221) [19:19:50] !log 
reedy Synchronized wmf-config/: Various config updates (duration: 00m 06s) [19:19:56] Logged the message, Master [19:19:58] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [19:23:52] (03PS1) 10Dzahn: remove amslvs[1-4] from DNS, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/187152 (https://phabricator.wikimedia.org/T87729) [19:24:37] (03CR) 10Yuvipanda: [C: 032] scap: Move 'common_scripts' into scripts class [puppet] - 10https://gerrit.wikimedia.org/r/186597 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:27:06] (03CR) 10Chad: "scap seems like a random place to put some of these...they're definitely not all scap-related." [puppet] - 10https://gerrit.wikimedia.org/r/186597 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:27:26] ^lurker: yup, needs further refactoring into mediawiki::deployment and other places as well [19:27:35] (03PS3) 10Yuvipanda: scap: Move scap master code into own class [puppet] - 10https://gerrit.wikimedia.org/r/186598 (https://phabricator.wikimedia.org/T87221) [19:27:40] * ^lurker nods [19:27:49] (03PS2) 10Alexandros Kosiaris: Remove init script for default udp2log instance on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/187082 (https://phabricator.wikimedia.org/T87726) (owner: 10Mark Bergsma) [19:27:58] <^lurker> YuviPanda: Long as it wasn't just moving them to move them :) [19:28:07] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:28:07] PROBLEM - Host cp1069 is DOWN: PING CRITICAL - Packet loss = 100% [19:28:19] (03CR) 10Yuvipanda: [C: 032 V: 032] scap: Move scap master code into own class [puppet] - 10https://gerrit.wikimedia.org/r/186598 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:28:56] ^lurker: yeah, I think the current series of patches are noops and I want to get them in first. 
[19:29:07] RECOVERY - Host cp1069 is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [19:30:26] (03PS1) 10Yuvipanda: scap: Add missing role file [puppet] - 10https://gerrit.wikimedia.org/r/187155 (https://phabricator.wikimedia.org/T87221) [19:31:12] (03PS2) 10Yuvipanda: scap: Add missing role file [puppet] - 10https://gerrit.wikimedia.org/r/187155 (https://phabricator.wikimedia.org/T87221) [19:31:26] (03CR) 10Yuvipanda: [C: 032] scap: Add missing role file [puppet] - 10https://gerrit.wikimedia.org/r/187155 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:31:37] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [19:31:56] (03CR) 10Yuvipanda: [V: 032] scap: Add missing role file [puppet] - 10https://gerrit.wikimedia.org/r/187155 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:32:53] (03PS3) 10Yuvipanda: scap: Clean up absent'd lint related files / packages [puppet] - 10https://gerrit.wikimedia.org/r/186599 [19:32:59] (03PS3) 10Yuvipanda: scap: Move l10nupdate into module [puppet] - 10https://gerrit.wikimedia.org/r/186600 (https://phabricator.wikimedia.org/T87221) [19:33:07] (03CR) 10Yuvipanda: [C: 032] scap: Clean up absent'd lint related files / packages [puppet] - 10https://gerrit.wikimedia.org/r/186599 (owner: 10Yuvipanda) [19:33:16] (03CR) 10Yuvipanda: [C: 032] scap: Move l10nupdate into module [puppet] - 10https://gerrit.wikimedia.org/r/186600 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:33:54] (03CR) 10Yuvipanda: [V: 032] scap: Clean up absent'd lint related files / packages [puppet] - 10https://gerrit.wikimedia.org/r/186599 (owner: 10Yuvipanda) [19:33:56] 3operations: add contractor Michael Beattie to fundraising email alias fr-online@ - https://phabricator.wikimedia.org/T87672#999019 (10Aklapper) [19:34:04] (03CR) 10Yuvipanda: [V: 032] scap: Move l10nupdate into module [puppet] - 10https://gerrit.wikimedia.org/r/186600 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:34:39] 3ops-esams, operations: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#999023 (10Dzahn) 3NEW a:3Dzahn [19:36:27] (03PS1) 10Yuvipanda: scap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/187159 (https://phabricator.wikimedia.org/T87221) [19:36:42] (03CR) 10Yuvipanda: [C: 032 V: 032] scap: Fix typo [puppet] - 10https://gerrit.wikimedia.org/r/187159 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:36:49] !log shutting down amslvs1 [19:36:57] Logged the message, Master [19:36:58] (03PS1) 10Ori.livneh: Exempt osmium's script URL from bits rewriting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187160 [19:37:45] (03CR) 10Ori.livneh: [C: 032] Exempt osmium's script URL from bits rewriting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187160 (owner: 10Ori.livneh) [19:38:15] 3ops-esams, operations: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#999056 (10Dzahn) [19:38:17] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#999055 (10Dzahn) [19:38:34] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#997436 (10Dzahn) [19:38:37] 3ops-esams, operations: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#999023 (10Dzahn) [19:39:06] (03CR) 10Ori.livneh: [V: 032] Exempt osmium's script URL from bits rewriting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187160 (owner: 10Ori.livneh) [19:39:07] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 22 
seconds ago with 0 failures [19:39:29] (03PS2) 10Yuvipanda: scap: Move rsync proxies into module [puppet] - 10https://gerrit.wikimedia.org/r/186601 (https://phabricator.wikimedia.org/T87221) [19:39:37] !log ori Synchronized wmf-config/CommonSettings.php: Exempt osmium's script URL from bits rewriting (duration: 00m 05s) [19:39:42] Logged the message, Master [19:39:57] (03CR) 10Yuvipanda: [C: 032 V: 032] scap: Move rsync proxies into module [puppet] - 10https://gerrit.wikimedia.org/r/186601 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:41:20] (03PS2) 10Yuvipanda: logging: Move fatalmonitor into role::mediawiki::logging [puppet] - 10https://gerrit.wikimedia.org/r/186603 (https://phabricator.wikimedia.org/T87221) [19:42:54] (03CR) 10Yuvipanda: [C: 032 V: 032] logging: Move fatalmonitor into role::mediawiki::logging [puppet] - 10https://gerrit.wikimedia.org/r/186603 (https://phabricator.wikimedia.org/T87221) (owner: 10Yuvipanda) [19:44:39] (03PS3) 10Alexandros Kosiaris: Remove init script for default udp2log instance on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/187082 (https://phabricator.wikimedia.org/T87726) (owner: 10Mark Bergsma) [19:45:41] ACKNOWLEDGEMENT - mediawiki-installation DSH group on osmium is CRITICAL: Host osmium is not in mediawiki-installation dsh group ori.livneh Ori setting up osmium for VE perf testing [19:46:07] (03CR) 10Alexandros Kosiaris: [C: 032] Remove init script for default udp2log instance on fluorine [puppet] - 10https://gerrit.wikimedia.org/r/187082 (https://phabricator.wikimedia.org/T87726) (owner: 10Mark Bergsma) [19:47:12] 3WMF-Legal, WMF-NDA-Requests, operations: Add multichill to WMF-NDA group - https://phabricator.wikimedia.org/T87097#999146 (10Aklapper) 5Open>3Resolved Added Multichill to the group. Welcome :) [19:47:27] 3ops-esams, operations: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#999148 (10Dzahn) a:5Dzahn>3None [19:47:42] 3ops-esams, operations: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#999023 (10Dzahn) [19:48:37] (03PS1) 10Reedy: Add Graph to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187165 [19:48:51] 3WMF-NDA-Requests, operations: Grant Nikerabbit access to WMF-NDA group - https://phabricator.wikimedia.org/T86632#999154 (10Aklapper) 5Open>3Resolved Added Nikerabbit to the group. Welcome :) [19:49:04] (03CR) 10Jforrester: Provide the Citoid extension for test wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187132 (owner: 10Jforrester) [19:49:14] * Reedy sighs [19:49:36] PROBLEM - puppet last run on fluorine is CRITICAL: CRITICAL: puppet fail [19:52:38] ACKNOWLEDGEMENT - puppet last run on fluorine is CRITICAL: CRITICAL: puppet fail alexandros kosiaris known, possibly due to 7bfad15, investigating [19:53:15] akosiaris: I merged a change on fluorine too, but it worked ok after I hand ran puppet [19:53:32] Reedy: Whysosigh? [19:53:33] YuviPanda: The fatalmonitor one ? [19:53:38] yeah [19:53:46] niah, it can't be that [19:53:47] James_F: Extensions apparently being deployed incorrectly [19:53:50] * akosiaris thinks [19:54:02] Just saw "" on cawiki [19:54:02] ok [19:54:10] Reedy: Oh. Lovely. 
[19:56:09] (03PS3) 10Giuseppe Lavagetto: hiera: fix up for labs hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/185940 [19:56:47] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: fix up for labs hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/185940 (owner: 10Giuseppe Lavagetto) [19:58:17] 3WMF-Legal, WMF-NDA-Requests, operations: Grant WMF-NDA access to Stas in Phabricator - https://phabricator.wikimedia.org/T85170#999183 (10Aklapper) 5Open>3Resolved Added Smalyshev to the group. [19:59:23] 3WMF-Legal, WMF-NDA-Requests, operations: Grant WMF-NDA access to Stas in Phabricator - https://phabricator.wikimedia.org/T85170#999202 (10Smalyshev) Halleluyah! Thanks @Aklapper :) [19:59:38] (03PS1) 10Ottomata: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 [20:00:16] (03PS2) 10Ottomata: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 [20:01:35] (03PS1) 10Alexandros Kosiaris: Fix typo introduced in 7bfad15 [puppet] - 10https://gerrit.wikimedia.org/r/187169 [20:02:02] (03PS3) 10Ottomata: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 [20:02:42] (03CR) 10Reedy: [C: 032] Add Graph to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187165 (owner: 10Reedy) [20:02:47] (03Merged) 10jenkins-bot: Add Graph to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187165 (owner: 10Reedy) [20:03:09] !log shutdown remaining amslvs2-4 [20:03:14] Logged the message, Master [20:03:16] !log reedy Synchronized wmf-config/extension-list: Add Graph to extension-list (duration: 00m 07s) [20:03:21] Logged the message, Master [20:05:31] (03CR) 10Alexandros Kosiaris: [C: 032] Fix typo introduced in 7bfad15 [puppet] - 10https://gerrit.wikimedia.org/r/187169 (owner: 10Alexandros Kosiaris) [20:07:37] RECOVERY - puppet last run on fluorine is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:09:01] !log reedy Started scap: mostly nooop, but adding graph to l10n cache [20:09:06] Logged the message, Master [20:11:04] (03CR) 10QChris: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/187168 (owner: 10Ottomata) [20:12:28] (03PS4) 10Ottomata: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 (https://phabricator.wikimedia.org/T86656) [20:12:53] @speak [20:12:57] oops [20:14:44] (03PS5) 10Ottomata: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 (https://phabricator.wikimedia.org/T86656) [20:16:47] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.008 second response time [20:17:57] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.005 second response time [20:40:35] !log reedy Finished scap: mostly nooop, but adding graph to l10n cache (duration: 31m 33s) [20:40:44] Logged the message, Master [20:43:48] (03PS1) 10QChris: Add maintainership and backend to overview page for pagecounts-raw dataset [puppet] - 
10https://gerrit.wikimedia.org/r/187174 [20:43:50] (03PS1) 10QChris: Add link to the dataset's wiki page to overview page of pagecounts-raw dataset [puppet] - 10https://gerrit.wikimedia.org/r/187175 [20:49:37] PROBLEM - Apache HTTP on mw1069 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50360 bytes in 0.021 second response time [20:50:06] PROBLEM - HHVM rendering on mw1069 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50360 bytes in 0.010 second response time [20:54:11] _joe_: ^ should I restart? [21:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150128T2100). Please do the needful. [21:00:07] RECOVERY - Apache HTTP on mw1069 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.074 second response time [21:00:16] !log restarted hhvm on mw1069 [21:00:20] Logged the message, Master [21:00:37] RECOVERY - HHVM rendering on mw1069 is OK: HTTP OK: HTTP/1.1 200 OK - 66273 bytes in 0.165 second response time [21:09:53] (03PS2) 10Yuvipanda: Move deployment-prep hiera config into ops/puppet [puppet] - 10https://gerrit.wikimedia.org/r/186852 (https://phabricator.wikimedia.org/T87223) [21:19:40] !log restarting parsoid across wtp* hosts for subbu [21:19:47] Logged the message, Master [21:20:16] (03CR) 10Dzahn: [C: 032] remove amslvs[1-4] from DNS, keep mgmt [dns] - 10https://gerrit.wikimedia.org/r/187152 (https://phabricator.wikimedia.org/T87729) (owner: 10Dzahn) [21:20:49] (03PS1) 10Florianschmidtwelzow: mediawikiwiki: Allow sysop to add and remove themself from translationadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) [21:20:51] !log synced parsoid version 88605a4a for restart [21:20:56] Logged the message, Master [21:21:38] 3ops-esams, operations: decom amslvs1-4 (dc work) - https://phabricator.wikimedia.org/T87790#999383 (10Dzahn) [21:21:39] 3operations: decom amslvs1-4 - https://phabricator.wikimedia.org/T87729#999381 (10Dzahn) 5Open>3Resolved shutdown and removed from DNS a little while later. 
resolving [21:21:40] !log synced parsoid version 88605a4a for restart [21:26:49] (03PS1) 10Dzahn: add misc-web varnish for dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187184 (https://phabricator.wikimedia.org/T308) [21:28:46] (03CR) 10Ottomata: [C: 032] Add maintainership and backend to overview page for pagecounts-raw dataset [puppet] - 10https://gerrit.wikimedia.org/r/187174 (owner: 10QChris) [21:30:27] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100% [21:31:03] (03CR) 10Ottomata: [C: 032] Add link to the dataset's wiki page to overview page of pagecounts-raw dataset [puppet] - 10https://gerrit.wikimedia.org/r/187175 (owner: 10QChris) [21:31:17] RECOVERY - Host cp3021 is UP: PING OK - Packet loss = 0%, RTA = 97.14 ms [21:32:16] 3operations: decom cp1037,cp1038,cp1039,cp1040 - https://phabricator.wikimedia.org/T87800#999413 (10BBlack) 3NEW a:3Dzahn [21:32:41] (03PS3) 10Alexandros Kosiaris: cxserver: Remove unused dict packages [puppet] - 10https://gerrit.wikimedia.org/r/186983 (owner: 10KartikMistry) [21:32:42] (03PS1) 10Alexandros Kosiaris: Remove unused dictd package declarations in cxserver [puppet] - 10https://gerrit.wikimedia.org/r/187185 [21:33:28] (03PS1) 10Dzahn: add basic Apache site and docroot for dev.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/187186 (https://phabricator.wikimedia.org/T308) [21:34:09] (03PS1) 10QChris: Drop qchris from eventlogging groups [puppet] - 10https://gerrit.wikimedia.org/r/187188 [21:34:15] 3operations: Fix udp2log init script in puppet - https://phabricator.wikimedia.org/T87726#999434 (10akosiaris) 5Open>3Resolved a:3akosiaris Marking as resolved. I merged this and a followup commit to fix a typo and shepherded them in production [21:35:25] (03CR) 10QChris: Drop qchris from eventlogging groups (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/187188 (owner: 10QChris) [21:37:08] (03CR) 10Dzahn: [C: 032] add misc-web varnish for dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187184 (https://phabricator.wikimedia.org/T308) (owner: 10Dzahn) [21:37:21] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Remove unused dict packages [puppet] - 10https://gerrit.wikimedia.org/r/186983 (owner: 10KartikMistry) [21:38:34] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#999463 (10ssastry) 3NEW [21:38:43] 3operations: support big files in atftpd - https://phabricator.wikimedia.org/T87804#999470 (10fgiunchedi) 3NEW a:3fgiunchedi [21:38:53] (03PS2) 10Ottomata: Drop qchris from eventlogging groups [puppet] - 10https://gerrit.wikimedia.org/r/187188 (owner: 10QChris) [21:40:27] (03CR) 10Ottomata: [C: 032] "If others are empty, this group should be fine empty too :)" [puppet] - 10https://gerrit.wikimedia.org/r/187188 (owner: 10QChris) [21:41:17] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#999497 (10Dzahn) I thought ssh from all internal networks was still allowed in base::firewall, how come this even blocks ssh from tin? [21:41:18] <_joe_> YuviPanda: I'm on it [21:41:46] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#999499 (10akosiaris) Yeah, the firewall now does not accept connections from tin, but bast1001 is allowed. The dsh command works as documented on that wikitech page as long as it is run from tin. Perh... 
[21:43:47] PROBLEM - Host amssq48 is DOWN: PING CRITICAL - Packet loss = 100% [21:43:57] <_joe_> oh you already did it. <3 [21:44:07] RECOVERY - Host amssq48 is UP: PING OK - Packet loss = 0%, RTA = 96.71 ms [21:44:18] 3operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#999505 (10akosiaris) @Dzahn, no, what would be the point of allowing all internal networks anyway ? We would not be achieving any kind of network separation. [21:46:08] (03PS1) 10BBlack: dysprosium -> jessie cache::text reinstall T83070 [puppet] - 10https://gerrit.wikimedia.org/r/187230 [21:47:04] (03CR) 10BBlack: [C: 032] dysprosium -> jessie cache::text reinstall T83070 [puppet] - 10https://gerrit.wikimedia.org/r/187230 (owner: 10BBlack) [21:47:06] akosiaris: so first patch will remove packages and then we can remove code from Puppet (that's what I understood) :) [21:47:32] kart_: absolutely correct [21:47:39] two staged removal [21:47:43] akosiaris: thanks! [21:47:57] kart_: don't mention it [21:48:03] :) [21:49:06] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: Puppet has 1 failures [21:52:40] akosiaris: thought about, https://gerrit.wikimedia.org/r/#/c/184217/ ? [21:56:48] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:57:37] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [21:57:40] bblack: I am merging the dysprosium change [21:57:55] akosiaris: ok, it's fine, sorry! [21:57:56] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [21:58:46] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:59:10] kart_: yeah, we can merge it know, but IIRC it needs some config changes as well [21:59:21] let me verify and comment at the change [21:59:29] akosiaris: sure [21:59:47] PROBLEM - Varnishkafka log producer on amssq37 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [22:00:56] RECOVERY - Varnishkafka log producer on amssq37 is OK: PROCS OK: 1 process with command name varnishkafka [22:01:01] (03PS1) 10Ori.livneh: vbench: coerce URL scheme to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/187237 [22:05:19] _joe_: yeah, restarted [22:08:27] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [22:10:03] (03CR) 10Yuvipanda: [C: 04-2] "Yes, current naming / location seems fuckall. Need to find a better place." 
[puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [22:10:37] PROBLEM - Host cp1065 is DOWN: PING CRITICAL - Packet loss = 100% [22:10:56] (03PS2) 10Yuvipanda: Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) [22:11:04] (03PS6) 10Ottomata: Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 (https://phabricator.wikimedia.org/T86656) [22:11:57] RECOVERY - Host cp1065 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [22:13:29] (03CR) 10Ottomata: [C: 032] Copy pagecounts-raw dataset from stat1002 hdfs-archive (data generated by Hive) to dumps.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/187168 (https://phabricator.wikimedia.org/T86656) (owner: 10Ottomata) [22:17:31] (03PS1) 10Ottomata: Add trailing / to incoming dir for pagecounts-raw copy [puppet] - 10https://gerrit.wikimedia.org/r/187245 [22:17:37] (03CR) 10jenkins-bot: [V: 04-1] Add trailing / to incoming dir for pagecounts-raw copy [puppet] - 10https://gerrit.wikimedia.org/r/187245 (owner: 10Ottomata) [22:17:39] (03PS2) 10Ottomata: Add trailing / to incoming dir for pagecounts-raw copy [puppet] - 10https://gerrit.wikimedia.org/r/187245 [22:18:56] (03CR) 10Ottomata: [C: 032] Add trailing / to incoming dir for pagecounts-raw copy [puppet] - 10https://gerrit.wikimedia.org/r/187245 (owner: 10Ottomata) [22:21:36] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I would use "has_standard_mail_relay" or something (I suck at naming). Also, get finally rid of standard-noexim and substitute it everywhe" [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [22:23:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:24:26] PROBLEM - Host cp1061 is DOWN: PING CRITICAL - Packet loss = 100% [22:24:47] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: Puppet has 1 failures [22:26:16] RECOVERY - Host cp1061 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms [22:27:36] PROBLEM - Apache HTTP on osmium is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 268 bytes in 0.046 second response time [22:27:58] ori: ^ [22:28:07] PROBLEM - HHVM rendering on osmium is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 268 bytes in 0.036 second response time [22:28:40] mutante: ack; it's not serving user requests, so it shouldn't have alerts. i'll fix that. [22:28:42] sorry for the noise. [22:28:47] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:28:58] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:29:35] ottomata: ^ is that you? [22:29:46] ori: no worries, i'm aware it's a test box, just wanted to let you know the test failed [22:30:15] oop [22:30:15] yes [22:30:16] thanks [22:30:21] merged [22:30:56] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:31:06] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [22:31:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
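To make the review exchange on change 186891 above concrete: what is being suggested is roughly a single boolean on the standard class that decides whether the exim role gets pulled in, so that the separate standard-noexim variant can be retired. The sketch below only illustrates that shape; the default value, the class body, and the name of the included mail class are assumptions, not the content of the actual patch.

```
# Sketch of the reviewed idea, not the merged change.
class standard (
    # name suggested in the review ("has_standard_mail_relay"); the final
    # parameter name and default may differ
    $has_standard_mail_relay = true,
) {
    # ... the existing includes (base, diamond, etc.) stay as they are ...

    if $has_standard_mail_relay {
        # placeholder for whichever exim class 'standard' currently
        # includes unconditionally
        include standard::mail::sender
    }
}

# A host that used the old standard-noexim variant would instead declare:
class { 'standard':
    has_standard_mail_relay => false,
}
```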
[22:36:07] (03PS2) 10Ori.livneh: vbench: coerce URL scheme to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/187237 [22:36:12] (03CR) 10Ori.livneh: [C: 032 V: 032] vbench: coerce URL scheme to HTTP [puppet] - 10https://gerrit.wikimedia.org/r/187237 (owner: 10Ori.livneh) [22:36:50] (03PS2) 10Dzahn: add basic Apache site and docroot for dev.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/187186 (https://phabricator.wikimedia.org/T308) [22:37:57] PROBLEM - Host cp1049 is DOWN: PING CRITICAL - Packet loss = 100% [22:38:23] mmhh that spike of 500s seems related to thumbs, namely the requested width is greater than the original [22:38:46] RECOVERY - Host cp1049 is UP: PING WARNING - Packet loss = 66%, RTA = 1.72 ms [22:39:19] Wondering why we have a ton of npm failures in Jenkins today... they are unrelated to the changes under review. e.g. https://integration.wikimedia.org/ci/job/mwext-DonationInterface-npm/904/console [22:39:49] kart_: where does cxserver search for its config file ? Relative to pwd or to Server.js ? [22:40:42] (03CR) 10Dzahn: [C: 032] "easy enough, just like the annualreport site" [puppet] - 10https://gerrit.wikimedia.org/r/187186 (https://phabricator.wikimedia.org/T308) (owner: 10Dzahn) [22:41:08] (03PS1) 10Ori.livneh: osmium: remove appserver role; apply mediawiki::web [puppet] - 10https://gerrit.wikimedia.org/r/187257 [22:44:14] kart_: I think I have answered my own question. https://git.wikimedia.org/blob/mediawiki%2Fservices%2Fcxserver/f7214aff10883bfa260017027b60cd82f9ea9f92/utils%2FConf.js#L14 [22:44:32] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow using a different web user than apache [puppet] - 10https://gerrit.wikimedia.org/r/187259 [22:45:10] (03PS3) 10Yuvipanda: Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) [22:45:11] (03PS1) 10Yuvipanda: Get rid of superfluous diamond includes [puppet] - 10https://gerrit.wikimedia.org/r/187261 [22:45:18] Maybe there's something wrong with our npm package mirror? [22:45:45] "npm install" works fine in my dev instance [22:46:31] awight: krinkle just found a permissions problem caused by puppet on the jenkins slaves, could be related? [22:47:07] awight: ebernhardson: https://integration.wikimedia.org/ci/job/mwext-DonationInterface-npm/904/console has nothing to do with permissions [22:47:11] it's just cache corruption [22:47:15] npmjs.org cache [22:47:25] Have someone clear it or wait until I'm done fixing the current outage [22:47:26] RECOVERY - HHVM rendering on osmium is OK: HTTP OK: HTTP/1.1 200 OK - 1046 bytes in 0.030 second response time [22:47:33] npm cache corruption happens all the time, not a current outage [22:47:56] RECOVERY - Apache HTTP on osmium is OK: HTTP OK: HTTP/1.1 200 OK - 1016 bytes in 0.006 second response time [22:48:09] (03PS2) 10Ori.livneh: osmium: remove appserver role [puppet] - 10https://gerrit.wikimedia.org/r/187257 [22:48:42] ori: Can you maybe chip in a second? I need puppet to git-clone a repo on integration slaves in a way that the files are executable and readable by non-root. [22:49:04] ori: It seems it worked when I created it, but now every other month when I do a git push it resets the whole thing to root:root rw------ [22:49:05] set umask for the git clone command? [22:49:23] ori: the files are executable according to git, and also on the slaves, it's just the chmod being root-only [22:49:23] currently automatic puppet runs default to an 077 umask for things puppet runs...
[22:49:31] https://github.com/wikimedia/operations-puppet/blob/production/modules/contint/manifests/slave-scripts.pp#L21-L27 [22:49:38] Krinkle: cool, thx for the note [22:50:31] 3operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#999736 (10Gage) 3NEW [22:50:57] PROBLEM - Host amssq52 is DOWN: PING CRITICAL - Packet loss = 100% [22:51:28] RECOVERY - Host amssq52 is UP: PING WARNING - Packet loss = 93%, RTA = 96.11 ms [22:51:47] bblack: I'm reading https://github.com/wikimedia/operations-puppet/blob/production/modules/git/manifests/clone.pp but not getting the issue. [22:51:55] (03CR) 10Yuvipanda: "Better now, I think." [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [22:52:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:52:17] Can anyone here refresh the npm cache that Jenkins is using? There's corruption... [22:52:19] Jenkins is essentially broken right now due to this. It worked when it was manually set up, but now that git-clone has pulled a new commit, the slaves are all broken with permissions issues. [22:52:33] 3Ops-Access-Requests: Requesting access to EventLogging cluster for mforns - https://phabricator.wikimedia.org/T87816#999749 (10mforns) 3NEW [22:53:21] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Missing:" [puppet] - 10https://gerrit.wikimedia.org/r/184217 (owner: 10KartikMistry) [22:54:10] awight: done, integration-slave1007 now has a fresh cache for DonationInterface [22:54:21] Krinkle: badass. thx [22:54:26] bblack: Do I use the 'shared [22:54:39] bblack: Do I use the 'shared' flag? That seems for group sharing, not root to non-root reading from a user level. [22:54:42] Krinkle: the umask is set at a far outer scope for all of the puppet run. if nothing else is explicitly setting it more permissively for your one action/command, most files created by things underneath puppet will lack all permissions for group/other (umask 077) [22:54:47] PROBLEM - HHVM rendering on osmium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 708 bytes in 0.105 second response time [22:54:48] 3operations: Icinga check for Torrus - https://phabricator.wikimedia.org/T87817#999760 (10Gage) 3NEW [22:54:52] The files modified in the last git-pull were made readable only by root via puppet [22:55:33] https://docs.puppetlabs.com/references/latest/type.html#exec-attribute-umask [22:55:40] bblack: Are you saying the only way puppet can ensure files are readable by non-root is to have all of the puppet client run under a different umask? How am I supposed to do that? [22:55:46] !log restarting torrus on netmon1001 [22:55:47] RECOVERY - HHVM rendering on osmium is OK: HTTP OK: HTTP/1.1 200 OK - 66113 bytes in 0.239 second response time [22:55:53] Logged the message, Master [22:55:53] ^ basically, the "exec" that executes "git clone" probably needs the "umask => 022" attribute or similar [22:56:13] window 43 [22:56:18] gah [22:56:22] Krinkle: no, I'm saying if you want to do things that are world-readable, you may have to set a more-permissive umask for the specific task in question within puppet [22:56:25] bblack: Is there a flag for that in git.clone.pp? [22:56:37] I'm not seeing it [22:56:41] https://github.com/wikimedia/operations-puppet/blob/production/modules/git/manifests/clone.pp#L121 [22:56:49] ^ in that exec block there, add a umask attribute alongside the others [22:57:00] That would affect all invocations.
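To make the suggestion above concrete, here is a minimal sketch in Puppet of an exec-based git clone that sets its own umask. This is not the actual git::clone from operations/puppet (the real definition is linked above); the define name, parameters, and defaults here are assumptions for illustration only.

    # Minimal sketch, not the real modules/git/manifests/clone.pp.
    # The key point is the umask attribute on the exec, which overrides the
    # restrictive 077 umask that the surrounding puppet agent run applies.
    define git_clone_sketch(
        $origin,
        $directory,
        $umask = '022',   # assumed parameter; 022 makes cloned files group/other readable
    ) {
        exec { "git_clone_${title}":
            command => "/usr/bin/git clone ${origin} ${directory}",
            creates => "${directory}/.git",   # only clone if the checkout does not exist yet
            umask   => $umask,
        }
    }

Under umask 022 the clone produces 644 files and 755 directories, whereas under the default 077 anything newly written ends up 600/700, which matches the 755/644 versus 600 split described below.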
[22:57:04] and perhaps the other exec blocks in that file as well [22:57:14] I wonder how this worked in the past. [22:57:14] I'll hack something together [22:57:30] how it worked in the past is that the default umask puppet ran under was not as restrictive, and now it is [22:57:36] Since when? [22:57:43] at least, that's my working theory, I ran into this elsewhere [22:58:20] bblack: can the '$mode' flag be used somehow? [22:59:14] I think since I7cd0c2406235163cec77e508572d1e48ebbf92ef (Dec 17?) [22:59:26] This is the first deployment since then, yeah. [22:59:42] It defaults to 2775 it says, that's world readable, right? [23:00:21] The directory and .git are 755/644, but any files from after Dec 17 are 600 [23:00:27] right [23:00:30] 3operations, ops-core, Analytics: Deprecate HTTPS udp2log stream? - https://phabricator.wikimedia.org/T86656#999773 (10Ottomata) Update: the Hive generated pagecounts-raw data is now being copied every hour from HDFS to dumps.wikimedia.org. The data is still being backfilled in hadoop. Once all backfill jobs... [23:00:31] (03PS1) 10Gilles: Remove Media Viewer tag [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187268 [23:00:47] E.g. on integration-slave1008.eqiad.wmflabs /srv/deployment/integration/slave-scripts [23:01:00] so I really think the answer is "umask => 022" attribute on those execs. if that's not desired for all invocations through that class, then it would need to be a parameter to control when you are or aren't needing that [23:01:47] PROBLEM - Apache HTTP on osmium is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 268 bytes in 0.037 second response time [23:02:24] Yup that's me breaking osmium [23:02:24] I guess a new optional parameter at the top of the manifest, "umask", defaulting to 077, and then you can change it for this instance [23:02:26] PROBLEM - HHVM rendering on osmium is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 268 bytes in 0.038 second response time [23:02:30] and pass down to all the execs [23:02:57] RECOVERY - Apache HTTP on osmium is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.043 second response time [23:04:07] PROBLEM - Host cp3011 is DOWN: PING CRITICAL - Packet loss = 100% [23:04:32] actually it may have been more like Dec 11; this change put in the actual umask: I4ae2b64d096d38b3eeeda2791949274fb1f69ffa [23:04:44] but I'm not 100% sure as to when related commits were merged and took effect in practice [23:06:07] RECOVERY - Host cp3011 is UP: PING OK - Packet loss = 0%, RTA = 95.85 ms [23:06:07] PROBLEM - Apache HTTP on osmium is CRITICAL: HTTP CRITICAL: HTTP/1.0 500 Internal Server Error - 268 bytes in 0.030 second response time [23:06:27] 3operations: Torrus is broken - https://phabricator.wikimedia.org/T87815#999788 (10Gage) 5Open>3Resolved a:3Gage Fixed by following this procedure: https://wikitech.wikimedia.org/wiki/Torrus#Deadlock_problem sudo service torrus-common stop sudo db5.1_recover -h /var/lib/torrus/db sudo -u Debian-torr... [23:07:26] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [23:08:38] (03PS5) 10KartikMistry: Use only cxserver/deploy in deployment [puppet] - 10https://gerrit.wikimedia.org/r/184217 [23:09:14] (03CR) 10KartikMistry: "See change in base_path too."
[puppet] - 10https://gerrit.wikimedia.org/r/184217 (owner: 10KartikMistry) [23:09:32] (03PS6) 10KartikMistry: Use only cxserver/deploy in deployment [puppet] - 10https://gerrit.wikimedia.org/r/184217 [23:09:54] 3Ops-Access-Requests: Requesting access to EventLogging cluster for mforns - https://phabricator.wikimedia.org/T87816#999809 (10Tnegrin) Approved by manager [23:14:27] (03PS1) 10Mforns: Add mforns to eventlogging-admins group [puppet] - 10https://gerrit.wikimedia.org/r/187271 (https://phabricator.wikimedia.org/T87816) [23:15:50] 3Ops-Access-Requests: Requesting access to EventLogging cluster for mforns - https://phabricator.wikimedia.org/T87816#999823 (10mforns) Added a changeset to gerrit: https://gerrit.wikimedia.org/r/#/c/187271 [23:19:16] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:27:48] (03PS4) 10Yuvipanda: Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) [23:28:42] (03CR) 10Giuseppe Lavagetto: Make standard class's exim including behavior configurable (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [23:28:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] Use only cxserver/deploy in deployment (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/184217 (owner: 10KartikMistry) [23:29:40] (03CR) 10Yuvipanda: Make standard class's exim including behavior configurable (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [23:29:52] (03CR) 10Yuvipanda: "Got an in-person +1 from Joe" [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [23:29:57] (03PS5) 10Yuvipanda: Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) [23:31:46] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:17] RECOVERY - Host amssq44 is UP: PING OK - Packet loss = 0%, RTA = 94.76 ms [23:32:50] (03PS7) 10KartikMistry: Use only cxserver/deploy in deployment [puppet] - 10https://gerrit.wikimedia.org/r/184217 [23:34:27] PROBLEM - Varnishkafka log producer on amssq44 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [23:35:16] (03CR) 10Giuseppe Lavagetto: [C: 031] Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [23:35:36] RECOVERY - Varnishkafka log producer on amssq44 is OK: PROCS OK: 1 process with command name varnishkafka [23:35:43] (03CR) 10Yuvipanda: [C: 032] Make standard class's exim including behavior configurable [puppet] - 10https://gerrit.wikimedia.org/r/186891 (https://phabricator.wikimedia.org/T86575) (owner: 10Yuvipanda) [23:38:26] alright, puppet failures incoming [23:38:28] (that’s all me) [23:40:17] PROBLEM - puppet last run on magnesium is CRITICAL: CRITICAL: puppet fail [23:40:22] 3operations, Beta-Cluster: Move scap puppet code into a module - https://phabricator.wikimedia.org/T87221#999959 (10greg) p:5Triage>3Normal [23:40:27] _joe_: hmm, so puppet isn’t picking up the hiera values at all [23:40:27] RECOVERY - Apache HTTP on osmium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 433 bytes in 0.254 second response time [23:40:28] 3operations, Beta-Cluster: 
Set up an alert for unmerged changes in deployment-prep - https://phabricator.wikimedia.org/T87616#999963 (10greg) p:5Triage>3Normal [23:40:58] PROBLEM - puppet last run on iodine is CRITICAL: CRITICAL: puppet fail [23:41:07] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: puppet fail [23:41:17] <_joe_> YuviPanda: wat? [23:41:32] <_joe_> mmmh [23:41:38] <_joe_> at least puppet is failing [23:41:45] _joe_: yeah. ^ puppet failures seem to be caused by the conflicting exim4s, which probably points to them being set to true [23:41:52] heh, yeah [23:42:05] <_joe_> YuviPanda: did the change get merged on both hosts? [23:42:14] <_joe_> puppetmasters I mean [23:42:20] _joe_: hmm, I just ran it on palladium [23:42:23] let me check strontium [23:42:39] <_joe_> so [23:42:50] <_joe_> magnesium right? [23:43:09] PROBLEM - puppet last run on lead is CRITICAL: CRITICAL: puppet fail [23:43:47] _joe_: it’s merged on strontium too, I think [23:44:01] <_joe_> yeah so.... [23:44:09] <_joe_> let's understand what does not work [23:44:16] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: puppet fail [23:44:17] <_joe_> 1 minute please [23:44:32] _joe_: ok [23:44:37] PROBLEM - Host cp1066 is DOWN: PING CRITICAL - Packet loss = 100% [23:45:04] (03PS1) 10Spage: Add Dev namespace on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187278 (https://phabricator.wikimedia.org/T369) [23:45:37] PROBLEM - Host cp4005 is DOWN: PING CRITICAL - Packet loss = 100% [23:46:07] PROBLEM - Host amssq49 is DOWN: PING CRITICAL - Packet loss = 100% [23:46:16] ^ all ok [23:46:37] RECOVERY - Host cp1066 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [23:46:47] <_joe_> mmmmh I am not sure, Yuvi [23:46:49] (03PS5) 10Ottomata: Create geowiki module [puppet] - 10https://gerrit.wikimedia.org/r/186551 [23:46:57] RECOVERY - Host amssq49 is UP: PING WARNING - Packet loss = 66%, RTA = 149.51 ms [23:46:59] <_joe_> I'm gonna test it again [23:47:13] <_joe_> we can manage to have puppet broken for 30 mins, right? 
[23:47:19] _joe_: yeah, I think so [23:47:27] RECOVERY - Host cp4005 is UP: PING OK - Packet loss = 0%, RTA = 82.15 ms [23:48:05] 3operations, Beta-Cluster: Move deployment-prep hiera data values into ops/puppet.git repo - https://phabricator.wikimedia.org/T87223#1000016 (10greg) p:5Triage>3Normal [23:48:07] PROBLEM - Host cp1063 is DOWN: PING CRITICAL - Packet loss = 100% [23:48:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [500.0] [23:48:16] (03CR) 10Ottomata: [C: 032 V: 032] Create geowiki module [puppet] - 10https://gerrit.wikimedia.org/r/186551 (owner: 10Ottomata) [23:48:17] RECOVERY - HHVM rendering on osmium is OK: HTTP OK: HTTP/1.1 200 OK - 70728 bytes in 0.187 second response time [23:49:07] PROBLEM - Host cp1047 is DOWN: PING CRITICAL - Packet loss = 100% [23:49:28] PROBLEM - puppet last run on polonium is CRITICAL: CRITICAL: puppet fail [23:51:46] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [23:54:49] (03PS1) 10Alexandros Kosiaris: Remove server beryllium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/187283 [23:55:40] (03CR) 10Yuvipanda: [C: 031] Remove server beryllium from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/187283 (owner: 10Alexandros Kosiaris) [23:58:44] <^lurker> ping superm401, ebernhardson, gi11es for swat [23:59:16] (03PS1) 10Ottomata: Remove bad comma [puppet] - 10https://gerrit.wikimedia.org/r/187285 [23:59:30] (03CR) 10Ottomata: [C: 032 V: 032] Remove bad comma [puppet] - 10https://gerrit.wikimedia.org/r/187285 (owner: 10Ottomata) [23:59:48] 3ops-core: reclaim dysprosium for spare (was: server status) - https://phabricator.wikimedia.org/T83070#1000048 (10BBlack) dysprosium mgmt console is broken. BMC appears to be dead and can't be revived from anything I tried with ipmi or omsa tools.
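For background on the change being rolled out and debugged above ("Make standard class's exim including behavior configurable", https://gerrit.wikimedia.org/r/186891), here is a minimal sketch of the general pattern involved: a hiera-overridable class parameter gating an include. The parameter name follows the naming suggested in the earlier review comment; the class body and the included class are assumptions for illustration, not the actual code in operations/puppet.

    # Minimal sketch, not the real 'standard' class.
    class standard(
        # With automatic parameter lookup, hiera can override this via a key like:
        #   standard::has_standard_mail_relay: false
        $has_standard_mail_relay = true,
    ) {
        if $has_standard_mail_relay {
            include standard::mail   # hypothetical stand-in for the exim/mail relay class
        }
    }

If the hiera override is not picked up, every host falls back to the default of true, which would be consistent with the conflicting exim configuration and puppet failures seen above.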