[00:00:51] (CR) Dzahn: "before: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed" [puppet] - https://gerrit.wikimedia.org/r/161631 (owner: Dzahn)
[00:07:22] bblack: OK that's fine
[00:09:57] gwicke: Hey so do I see correctly that mathoid doesn't have a deployment repo? How does that work? How is its config managed? How should the mathoid config be managed?
[00:10:03] (CR) Ori.livneh: wmflib: add to_milliseconds() / to_seconds() (3 comments) [puppet] - https://gerrit.wikimedia.org/r/159692 (owner: Ori.livneh)
[00:10:41] RoanKattouw: it doesn't have a deploy repo yet
[00:10:57] OK so how does it function without one?
[00:10:57] (PS2) Ori.livneh: wmflib: add to_milliseconds() / to_seconds() [puppet] - https://gerrit.wikimedia.org/r/159692
[00:11:17] magic!
[00:11:25] RoanKattouw: I haven't looked at it in detail yet, I suspect that it's just doing a npm install in beta labs
[00:11:30] which of course won't fly in prod
[00:12:07] Oh is mathoid not in prod yet?
[00:12:11] no, not yet
[00:12:14] Ugh
[00:12:16] OK
[00:12:27] Where is it in labs?
[00:12:43] I wouldn't hold you back if you'd like to deploy it too ;)
[00:12:59] I'll probably have to
[00:13:06] I mean, it would be approximately zero extra work
[00:13:26] But the fact that mathoid isn't already deployed means there are bugs that you haven't found yet, which means this is going to be more work than I thought :|
[00:14:15] I need to do some patches to citoid too, though. It needs to not 404 on / because of the health check
[00:14:18] it's deployed in beta somewhere
[00:14:18] (PS1) Dzahn: icinga - fix syntax error for check_ssl_ldap [puppet] - https://gerrit.wikimedia.org/r/161633
[00:14:26] looking at the puppet code
[00:14:40] And I want to make it use GET query strings or maybe urlencoded POST, but not this JSON RPC thing
[00:14:56] RoanKattouw, gwicke: btw, have you seen http://meta.math.stackexchange.com/q/16809 ?
[00:15:31] (CR) Dzahn: [C: 2] icinga - fix syntax error for check_ssl_ldap [puppet] - https://gerrit.wikimedia.org/r/161633 (owner: Dzahn)
[00:15:31] no, not yet
[00:15:42] RoanKattouw: mathjax is plenty fast for us right now
[00:15:50] and SVG is very fast to render on the client
[00:16:01] hah
[00:16:07] eh, ori ^^
[00:16:12] Maybe once it matures, it'll still be faster
[00:16:19] And hopefully better designed than MathJax
[00:16:27] yeah, we'll see
[00:16:30] from an external-facing API perspective that is
[00:16:35] easy to swap one lib for another
[00:17:12] * ori nods
[00:17:30] there are quite a few nice features around mathjax though that might take a while to replicate
[00:17:30] not suggesting we abandon plans whenever some new repo trends on github. but this one looks pretty cool :)
[00:17:50] like Chromevox and mathml
[00:18:47] (a fork of chromevox can add readable annotations to the SVG, so that a screenreader can produce something that makes more sense than the raw latex)
[00:39:21] (PS1) Dzahn: add README.md to all modules [puppet] - https://gerrit.wikimedia.org/r/161634
[00:41:14] (CR) Dzahn: [C: -2] "meh, that's not what i meant yet" [puppet] - https://gerrit.wikimedia.org/r/161634 (owner: Dzahn)
[00:42:08] RECOVERY - Certificate expiration on virt1000 is OK: SSL_CERT OK - X.509 certificate for virt1000.wikimedia.org from RapidSSL CA valid until Jan 22 21:31:13 2015 GMT (expires in 125 days)
[00:42:20] ^ yay
[00:42:45] have a good weekend, everybody
[00:42:59] have a good weekend
[00:44:16] (CR) Dzahn: "17:44 <+icinga-wm> RECOVERY - Certificate expiration on virt1000 is OK: SSL_CERT OK - X.509 certificate for virt1000.wikimedia.org from Ra" [puppet] - https://gerrit.wikimedia.org/r/161631 (owner: Dzahn)
[01:19:36] PROBLEM - CI tmpfs disk space on lanthanum is CRITICAL: DISK CRITICAL - free space: /var/lib/jenkins-slave/tmpfs 27 MB (5% inode=99%):
[02:08:24] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3616 MB (3% inode=99%):
[02:19:34] !log LocalisationUpdate completed (1.24wmf20) at 2014-09-20 02:19:34+00:00
[02:19:49] Logged the message, Master
[02:33:34] !log LocalisationUpdate completed (1.24wmf21) at 2014-09-20 02:33:34+00:00
[02:33:41] Logged the message, Master
[02:46:05] !log LocalisationUpdate completed (1.24wmf22) at 2014-09-20 02:46:05+00:00
[02:46:11] Logged the message, Master
[03:00:03] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:46:00] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Sep 20 03:46:00 UTC 2014 (duration 45m 59s)
[03:46:07] Logged the message, Master
[04:11:56] (PS1) Ori.livneh: update dotfiles [puppet] - https://gerrit.wikimedia.org/r/161642
[04:12:53] (CR) Ori.livneh: [C: 2] update dotfiles [puppet] - https://gerrit.wikimedia.org/r/161642 (owner: Ori.livneh)
[05:25:53] PROBLEM - Disk space on ms1001 is CRITICAL: DISK CRITICAL - free space: /export 2284794 MB (3% inode=99%):
[06:28:13] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Epic puppet fail
[06:28:42] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Epic puppet fail
[06:29:02] PROBLEM - puppet last run on gallium is CRITICAL: CRITICAL: Epic puppet fail
[06:29:52] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:02] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:03] PROBLEM - puppet last run on iron is CRITICAL: CRITICAL: Puppet has 3 failures
[06:30:12] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:23] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:23] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:34] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:54] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:55] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on amssq46 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:13] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:45:43] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:45:54] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on iron is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:46] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 61 seconds ago with 0 failures
[06:46:46] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 62 seconds ago with 0 failures
[06:46:47] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:47:19] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:47:19] RECOVERY - puppet last run on gallium is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:47:38] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:47:46] RECOVERY - puppet last run on amssq46 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:58:26] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:06:17] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:06:39] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:07:33] getting lots of 503 errors on en.wiki and commons
[07:07:33] PROBLEM - HTTPS on cp4012 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[07:07:33] PROBLEM - HTTPS on cp4005 is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[07:07:34] PROBLEM - RAID on cp4014 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - DPKG on cp4010 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - puppet last run on cp4008 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - Varnish HTCP daemon on cp4014 is CRITICAL: Timeout while attempting connection
[07:07:34] PROBLEM - Varnish HTCP daemon on cp4018 is CRITICAL: Timeout while attempting connection
[07:07:35] PROBLEM - DPKG on cp4020 is CRITICAL: Timeout while attempting connection
[07:07:35] PROBLEM - RAID on cp4004 is CRITICAL: Timeout while attempting connection
[07:07:36] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 68827 bytes in 9.921 second response time
[07:08:13] PROBLEM - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:08:23] PROBLEM - Disk space on cp4002 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - puppet last run on cp4003 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - puppet last run on lvs4002 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - check configured eth on cp4015 is CRITICAL: Timeout while attempting connection
[07:08:23] PROBLEM - check if dhclient is running on cp4015 is CRITICAL: Timeout while attempting connection
[07:08:24] PROBLEM - RAID on cp4018 is CRITICAL: Timeout while attempting connection
[07:08:24] PROBLEM - DPKG on bast4001 is CRITICAL: Timeout while attempting connection
[07:08:25] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0]
[07:08:35] PROBLEM - Varnish traffic logger on cp4014 is CRITICAL: Timeout while attempting connection
[07:08:35] PROBLEM - RAID on lvs4002 is CRITICAL: Timeout while attempting connection
[07:09:03] PROBLEM - Host cp4015 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:09:03] RECOVERY - Host mobile-lb.ulsfo.wikimedia.org_ipv6 is UP: PING WARNING - Packet loss = 54%, RTA = 110.76 ms
[07:09:14] RECOVERY - Disk space on cp4002 is OK: DISK OK
[07:09:14] RECOVERY - check if dhclient is running on cp4015 is OK: PROCS OK: 0 processes with command name dhclient
[07:09:14] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 403 seconds ago with 0 failures
[07:09:14] RECOVERY - check configured eth on cp4015 is OK: NRPE: Unable to read output
[07:09:23] RECOVERY - DPKG on bast4001 is OK: All packages OK
[07:09:23] RECOVERY - RAID on cp4018 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:09:24] RECOVERY - Host cp4015 is UP: PING OK - Packet loss = 0%, RTA = 74.05 ms
[07:09:33] RECOVERY - HTTPS on cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 487 days)
[07:09:34] RECOVERY - HTTPS on cp4005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 487 days)
[07:09:34] RECOVERY - Varnish HTCP daemon on cp4014 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd
[07:09:35] RECOVERY - DPKG on cp4010 is OK: All packages OK
[07:09:35] RECOVERY - Varnish traffic logger on cp4014 is OK: PROCS OK: 2 processes with command name varnishncsa
[07:09:35] RECOVERY - Varnish HTCP daemon on cp4018 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd
[07:09:35] RECOVERY - DPKG on cp4020 is OK: All packages OK
[07:09:35] RECOVERY - RAID on lvs4002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:09:36] RECOVERY - RAID on cp4014 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:09:37] RECOVERY - RAID on cp4004 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:11:35] PROBLEM - Host cp4018 is DOWN: PING CRITICAL - Packet loss = 100%
[07:11:35] PROBLEM - Host cp4017 is DOWN: PING CRITICAL - Packet loss = 100%
[07:11:35] PROBLEM - Host cp4014 is DOWN: PING CRITICAL - Packet loss = 100%
[07:11:35] PROBLEM - Host cp4003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:12:14] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[07:12:19] uhoh
[07:12:37] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:12:38] wonder if anyone's awake
[07:12:41] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:12:44] PROBLEM - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:12:48] PROBLEM - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:12:54] RECOVERY - Host cp4003 is UP: PING OK - Packet loss = 0%, RTA = 119.42 ms
[07:12:54] PROBLEM - Host cp4004 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:54] PROBLEM - Host cp4009 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:54] PROBLEM - Host cp4008 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:54] PROBLEM - Host cp4019 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:55] PROBLEM - Host lvs4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:55] PROBLEM - Host cp4005 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:56] PROBLEM - Host cp4010 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:56] PROBLEM - Host cp4002 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:57] PROBLEM - Host cp4001 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:57] PROBLEM - Host lvs4002 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:58] PROBLEM - Host cp4020 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:58] PROBLEM - Host cp4006 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:59] PROBLEM - Host cp4013 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[07:12:59] RECOVERY - Host cp4018 is UP: PING WARNING - Packet loss = 50%, RTA = 119.23 ms
[07:12:59] people should be getting pages by now?
[07:13:07] hopefully
[07:13:10] RECOVERY - Host cp4009 is UP: PING WARNING - Packet loss = 28%, RTA = 121.52 ms
[07:13:10] RECOVERY - Host cp4004 is UP: PING WARNING - Packet loss = 28%, RTA = 122.20 ms
[07:13:10] RECOVERY - Host cp4017 is UP: PING WARNING - Packet loss = 37%, RTA = 121.30 ms
[07:13:10] RECOVERY - Host cp4005 is UP: PING WARNING - Packet loss = 44%, RTA = 122.25 ms
[07:13:18] RECOVERY - Host cp4014 is UP: PING WARNING - Packet loss = 50%, RTA = 121.92 ms
[07:13:18] RECOVERY - Host cp4008 is UP: PING WARNING - Packet loss = 50%, RTA = 122.96 ms
[07:13:18] RECOVERY - Host lvs4001 is UP: PING WARNING - Packet loss = 50%, RTA = 121.91 ms
[07:13:18] RECOVERY - Host cp4006 is UP: PING WARNING - Packet loss = 50%, RTA = 121.35 ms
[07:13:18] RECOVERY - Host cp4020 is UP: PING WARNING - Packet loss = 54%, RTA = 121.64 ms
[07:13:19] RECOVERY - Host cp4001 is UP: PING WARNING - Packet loss = 54%, RTA = 121.64 ms
[07:13:19] RECOVERY - Host cp4013 is UP: PING WARNING - Packet loss = 73%, RTA = 121.80 ms
[07:13:20] RECOVERY - Host cp4019 is UP: PING WARNING - Packet loss = 73%, RTA = 121.05 ms
[07:13:20] RECOVERY - Host cp4010 is UP: PING WARNING - Packet loss = 80%, RTA = 120.79 ms
[07:13:21] RECOVERY - Host lvs4002 is UP: PING WARNING - Packet loss = 73%, RTA = 120.67 ms
[07:13:32] yes
[07:13:35] PROBLEM - RAID on cp4015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:13:38] :)
[07:13:38] ah... no pages cause phone is in the other room and I'm still in pjs
[07:13:44] PROBLEM - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:14:15] RECOVERY - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 699 bytes in 0.970 second response time
[07:14:23] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 699 bytes in 1.535 second response time
[07:14:28] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22391 bytes in 0.817 second response time
[07:14:42] no pages on phone. hrm
[07:14:52] RECOVERY - LVS HTTP IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22341 bytes in 1.296 second response time
[07:14:56] RECOVERY - LVS HTTP IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 677 bytes in 1.593 second response time
[07:15:00] RECOVERY - Host cp4002 is UP: PING OK - Packet loss = 0%, RTA = 100.98 ms
[07:15:12] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[07:15:22] RECOVERY - RAID on cp4015 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[07:15:22] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 68860 bytes in 4.567 second response time
[07:16:05] RECOVERY - LVS HTTP IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 68749 bytes in 7.090 second response time
[07:16:29] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:16:40] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Epic puppet fail
[07:16:40] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Epic puppet fail
[07:16:41] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Epic puppet fail
[07:17:31] hi folks, is this something we should tweet from @wikimedia/@wikipedia? https://wikitech.wikimedia.org/wiki/Incident_response#Communicating_with_the_public
[07:17:35] happy to help
[07:17:41] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 3 failures
[07:18:51] PROBLEM - Packetloss_Average on erbium is CRITICAL: packet_loss_average CRITICAL: 10.7821244706
[07:20:51] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:21:51] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:22:50] PROBLEM - Packetloss_Average on oxygen is CRITICAL: packet_loss_average CRITICAL: 8.09920889831
[07:23:06] PROBLEM - Packetloss_Average on analytics1026 is CRITICAL: packet_loss_average CRITICAL: 16.438700339
[07:23:18] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 2 failures
[07:24:03] are folks still getting errors now?
[07:26:09] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[07:26:43] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:26:59] RECOVERY - Packetloss_Average on analytics1026 is OK: packet_loss_average OKAY: 3.56611737288
[07:27:01] RECOVERY - Packetloss_Average on erbium is OK: packet_loss_average OKAY: 1.81911928571
[07:27:09] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[07:28:18] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:29:18] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:29:18] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[07:29:18] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:30:18] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[07:31:48] Eloquence: Seems good to me
[07:32:04] looks good to me (i was gettting lots of 503s before)
[07:33:26] kaldari, haeb - thanks. this outage should only have affected traffic that hit our san francisco data center so limited global user impact
[07:34:19] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[07:35:48] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[07:35:52] ori, thanks for the en.wp post
[07:36:58] Eloquence: i figured.. but that's still a continent or two, right? a tweet at least from @wikimedia would have been defendable, i think
[07:37:18] RECOVERY - Packetloss_Average on oxygen is OK: packet_loss_average OKAY: 0.894191101695
[07:37:49] defensible! :)
[07:39:11] (PS1) Mark Bergsma: Move ulsfo traffic to eqiad [dns] - https://gerrit.wikimedia.org/r/161646
[07:39:52] ori: defendable is defendable: https://en.wiktionary.org/wiki/defendable ! ;)
[07:41:04] that it may be; but it is not, alas, defensible
[07:41:44] (CR) Mark Bergsma: [C: 2] Move ulsfo traffic to eqiad [dns] - https://gerrit.wikimedia.org/r/161646 (owner: Mark Bergsma)
[07:43:12] What I don't understand is why the server admin log is not used nowadays
[07:43:26] it is?
[07:43:52] <_joe_> Nemo_bis: ?
[07:43:53] never seen it in use for events recently
[07:44:22] <_joe_> it is used for manual operations on the cluster, usually
[07:44:30] just a lot more of those are now automated
[07:44:39] https://wikitech.wikimedia.org/wiki/Incident_documentation is very useful but only after some days; during events one has to scan all the logs
[07:45:19] <_joe_> oh so you mean _during_ outages
[07:45:39] yes, once upon a time it was very useful for communication during incidents
[07:45:52] but in the last several years the communication about events has decreased a lot
[07:46:09] maybe it's less needed now because users already tell each other there's a problem on twitter, dunno
[07:46:59] there's also https://wikitech.wikimedia.org/wiki/Incident_response#Logging ...
[07:47:04] so, in this case, there have been no actions other than my traffic move just now
[08:00:23] (PS1) Ori.livneh: hhvm: disable internal stat collection [puppet] - https://gerrit.wikimedia.org/r/161647
[08:03:13] <_joe_> ori: I'm going out now, but just set hhvm.stats = false would be enough
[08:03:51] _joe_: except some of the configuration keys don't do anything, and some of the others are changing -- so i'd rather not risk having this become voodoo
[08:04:33] <_joe_> mmmh, ok, it's easy to put them back
[08:04:52] <_joe_> (also, I guess we should try to look at 3.3.0 for metrics to work again)
[08:05:04] <_joe_> but, good night!
[08:06:20] good night
[08:10:41] ori: do you need that merged now?
[08:11:17] not critical but would help if issues recur in the weekend
[08:11:35] do I need to babysit it on all boxes?
nah, it's safe™
[08:11:58] i'll stick around while it rolls out
[08:12:10] but it's a revert to known-good config
[08:12:12] (CR) Mark Bergsma: [C: 2] hhvm: disable internal stat collection [puppet] - https://gerrit.wikimedia.org/r/161647 (owner: Ori.livneh)
[08:12:24] thanks
[08:14:51] just tested on one, seems fine indeed
[08:15:53] /bits.wikimedia.org/skins-1.5/common/images/ajax-loader.gif
[08:15:56] cool, thanks again
[08:16:01] can somone pls tell me the new url?
[08:16:32] is the grafic located on a other .bits /dir ?
[08:18:25] ah, https://commons.wikimedia.org/wiki/File:Ajax-loader.gif. its the same :=)
[08:47:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[10:51:57] (PS1) Anomie: 'securepoll-create-poll' for sysop on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/161653
[10:54:09] (CR) Anomie: "Planned for Monday SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/161653 (owner: Anomie)
[10:56:50] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:08:00] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:17:19] (PS3) Yuvipanda: nagios_common: Refactor custom command definitions [puppet] - https://gerrit.wikimedia.org/r/161478
[11:42:39] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:44:02] (Abandoned) Yuvipanda: quarry: Use mysql-server-5.6 instead of 5.5 [puppet] - https://gerrit.wikimedia.org/r/153784 (owner: Yuvipanda)
[11:44:15] (Abandoned) Yuvipanda: toollabs: Increase space on /var for all nodes [puppet] - https://gerrit.wikimedia.org/r/143098 (owner: Yuvipanda)
[11:46:13] (Abandoned) Yuvipanda: Implement last command (per greg-g) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/150082 (owner: Yuvipanda)
[13:44:33] querycachetwo is not unique...
yay -.- [13:48:30] RECOVERY - CI tmpfs disk space on lanthanum is OK: DISK OK [14:41:42] heh, ganglia marks large amounts of free disk space in red, and small amounts in green. As though a surplus of free space was a crisis [14:48:36] (03PS1) 10Florianschmidtwelzow: Add REL1_24 as branch in ExtensionDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161666 [16:56:30] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 5.1GB (= 5.0GB critical): /srv/deployment/ocg/output 3903439758B: /srv/deployment/ocg/postmortem 1604914B: ocg_job_status 12299 msg: ocg_render_job_queue 0 msg [17:00:28] PROBLEM - puppet last run on virt1009 is CRITICAL: CRITICAL: Puppet has 1 failures [17:05:17] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.1GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12301 msg: ocg_render_job_queue 0 msg [17:15:34] (03PS1) 1001tonythomas: Corrected the exim regex expression [puppet] - 10https://gerrit.wikimedia.org/r/161679 [17:18:37] RECOVERY - puppet last run on virt1009 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:30:18] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 14.4GB (= 5.0GB critical): /srv/deployment/ocg/output 3964604775B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12303 msg: ocg_render_job_queue 0 msg [17:42:28] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 22.2GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:44:37] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 23.4GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:47:38] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 14.2GB (= 
5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:50:37] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 27.2GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1611320B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:52:09] (03PS1) 10Gerrit Patch Uploader: Remove "MW" and "C" namespace alias on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) [17:52:15] (03CR) 10Gerrit Patch Uploader: "This commit was uploaded using the Gerrit Patch Uploader [1]." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) (owner: 10Gerrit Patch Uploader) [17:52:38] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 16.0GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [17:55:38] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 13.1GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1640275B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:02:47] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 20.8GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:07:58] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 22.4GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:08:57] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 16.8GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1640275B: ocg_job_status 12304 msg: 
ocg_render_job_queue 0 msg [18:10:59] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 23.2GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12304 msg: ocg_render_job_queue 0 msg [18:12:18] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [18:13:22] (03PS2) 10Steinsplitter: Remove "MW" and "C" namespace alias on ckbwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/161684 (https://bugzilla.wikimedia.org/70564) (owner: 10Gerrit Patch Uploader) [18:18:07] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 21.6GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1645462B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:19:07] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 25.3GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:22:17] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 25.3GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:24:17] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 25.4GB (= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1645462B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:27:18] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 25.3GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg [18:30:27] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: /mnt/tmpfs 5.8GB (= 5.0GB critical): /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1226765B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg 
[18:30:28] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[18:31:17] RECOVERY - OCG health on ocg1003 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 3474830228B: /srv/deployment/ocg/postmortem 1255714B: ocg_job_status 12305 msg: ocg_render_job_queue 0 msg
[18:32:27] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: /mnt/tmpfs 25.4GB (>= 5.0GB critical): /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1648026B: ocg_job_status 12306 msg: ocg_render_job_queue 0 msg
[18:35:27] RECOVERY - OCG health on ocg1002 is OK: OK: /mnt/tmpfs 0B: /srv/deployment/ocg/output 4003066292B: /srv/deployment/ocg/postmortem 1677105B: ocg_job_status 12306 msg: ocg_render_job_queue 0 msg
[18:41:59] greg-g: awake ?
[19:25:38] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: Epic puppet fail
[19:33:08] for a config change (operations/mediawiki-config) there is a slot in SWAT needed, right?
[19:37:17] FlorianSW: it is the best way to get something merged without having luck :)
[19:38:15] JohnLewis: ok, thanks for the info :)
[19:44:48] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[20:04:59] (03PS1) 10Andrew Bogott: Remove virt1006 from the nova-compute pool [puppet] - 10https://gerrit.wikimedia.org/r/161745
[20:06:13] (03PS1) 10Krinkle: contint: Increase CPU monitoring sensitivity from 95-99 to 90-97 [puppet] - 10https://gerrit.wikimedia.org/r/161746
[20:06:54] (03PS3) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:11:05] (03PS4) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:22:26] (03PS5) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:22:54] (03PS6) 10Krinkle: contint: Ensure nodejs-legacy is installed [puppet] - 10https://gerrit.wikimedia.org/r/159226
[20:22:56] (03CR) 10Andrew Bogott: [C: 032] Remove virt1006 from the nova-compute pool [puppet] - 10https://gerrit.wikimedia.org/r/161745 (owner: 10Andrew Bogott)
[20:23:44] (03PS1) 10Krinkle: contint: Package 'php5-parsekit' is absent on Trusty, don't require it [puppet] - 10https://gerrit.wikimedia.org/r/161748
[20:26:59] (03PS2) 10Krinkle: labmon: Increase CPU sensitivity for contint from 95-99 to 90-97 [puppet] - 10https://gerrit.wikimedia.org/r/161746
[20:29:56] !log moved all VMs off of virt1006, disabled compute service
[20:30:03] Logged the message, Master
[20:30:36] !log rebooting virt1006 to make good and sure it doesn't spontaneously re-enter the compute pool
[20:30:42] Logged the message, Master
[20:32:59] PROBLEM - DPKG on virt1006 is CRITICAL: Connection refused by host
[20:32:59] PROBLEM - SSH on virt1006 is CRITICAL: Connection refused
[20:33:07] PROBLEM - check if dhclient is running on virt1006 is CRITICAL: Connection refused by host
[20:33:19] PROBLEM - puppet last run on virt1006 is CRITICAL: Connection refused by host
[20:33:19] PROBLEM - Disk space on virt1006 is CRITICAL: Connection refused by host
[20:33:28] PROBLEM - RAID on virt1006 is CRITICAL: Connection refused by host
[20:33:28] PROBLEM - check configured eth on virt1006 is CRITICAL: Connection refused by host
[20:34:29] (03CR) 10Andrew Bogott: [C: 032] labmon: Increase CPU sensitivity for contint from 95-99 to 90-97 [puppet] - 10https://gerrit.wikimedia.org/r/161746 (owner: 10Krinkle)
[20:36:49] ACKNOWLEDGEMENT - DPKG on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - Disk space on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - NTP on virt1006 is CRITICAL: NTP CRITICAL: No response from NTP server andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - RAID on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:49] ACKNOWLEDGEMENT - SSH on virt1006 is CRITICAL: Connection refused andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:50] ACKNOWLEDGEMENT - check configured eth on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:50] ACKNOWLEDGEMENT - check if dhclient is running on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:36:51] ACKNOWLEDGEMENT - puppet last run on virt1006 is CRITICAL: Connection refused by host andrew bogott I moved all VMs off of Virt1006 and its slated for rebuild. Its failure to come up after reboot is interesting but not urgent.
[20:57:10] (03CR) 10Aaron Schulz: "I don't really get the hyperthreading/threads bit. There is of course less context switching with less threads. On the other hand parsing " [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke)
[21:05:49] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Epic puppet fail
[21:23:58] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[22:28:32] !log Reloading Zuul to deploy I0170766cfc06b8e6
[22:28:38] Logged the message, Master
[23:03:40] (03CR) 10GWicke: "> I don't really get the hyperthreading/threads bit. There is of course less context switching with less threads. On the other hand parsin" [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke)
[23:20:18] (03CR) 10GWicke: "I should add that the linux run queue can include processes blocked on disk IO. Jobs are typically blocking on network though, which means" [puppet] - 10https://gerrit.wikimedia.org/r/161473 (owner: 10GWicke)
[23:21:10] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[23:37:29] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
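GWicke's point in the 23:20:18 review comment, that the Linux run queue (and hence the load average) can include processes blocked on disk IO, can be observed directly: Linux counts tasks in uninterruptible sleep (state `D`, typically disk-blocked) toward the load average, while tasks blocked on the network usually sleep interruptibly (state `S`) and are not counted. A rough sketch assuming a Linux `/proc` filesystem (the D-state scan is illustrative, not any code from the change under review):

```python
import glob
import os


def parse_loadavg(text):
    """Parse the 1/5/15-minute averages from a /proc/loadavg line."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def count_d_state():
    """Count tasks in uninterruptible sleep ('D'), e.g. blocked on disk IO.

    These tasks inflate the Linux load average even though they use no CPU,
    which is why load alone can mislead when sizing worker/thread counts.
    """
    count = 0
    for stat in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(stat) as f:
                # Field layout: pid (comm) state ...; comm may contain spaces,
                # so split after the closing parenthesis.
                fields = f.read().rsplit(")", 1)[1].split()
            if fields[0] == "D":
                count += 1
        except OSError:
            continue  # process exited while we were scanning
    return count


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print(parse_loadavg(f.read()), "D-state tasks:", count_d_state())
```

A load average close to the core count can therefore mean either CPU saturation or many disk-blocked tasks; comparing it against the D-state count distinguishes the two.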