[00:01:13] andrewbogott, syncing [00:01:23] !log krenair Synchronized php-1.25wmf23/extensions/OpenStackManager/nova/OpenStackNovaHost.php: https://gerrit.wikimedia.org/r/201385 (duration: 00m 13s) [00:01:26] please test [00:01:28] Logged the message, Master [00:01:32] * andrewbogott tests [00:02:56] Krenair: all looks good. Thank you! [00:03:13] !log krenair Synchronized php-1.25wmf24/extensions/OpenStackManager/nova/OpenStackNovaHost.php: https://gerrit.wikimedia.org/r/201386 (duration: 00m 12s) [00:03:15] ok [00:03:18] Logged the message, Master [00:03:20] -> pm [00:05:55] 6operations: DNS zones do not get re-generated when adding new language - https://phabricator.wikimedia.org/T84684#1172570 (10Dzahn) [00:07:04] Krenair: can i sneak one more config change in? :) https://gerrit.wikimedia.org/r/#/c/201388 [00:07:07] 6operations, 10Analytics, 10Analytics-EventLogging, 6Analytics-Kanban, 5Patch-For-Review: Disk space full on vanadium from logs in /var/log/upstart - https://phabricator.wikimedia.org/T93185#1172575 (10Dzahn) 5Open>3Resolved a:3Dzahn [00:07:12] one sec [00:07:17] thanks [00:08:43] ok [00:09:01] ebernhardson, this is not exactly 'sneaking in' since we're 9 minutes out already :) [00:09:04] but sure [00:09:14] eternal SWAT [00:09:39] uhhhhh [00:09:43] James_F, have you seen that patch? [00:09:55] Is that OK? [00:10:05] Krenair: Discussion in #mediawiki-visualeditor right now. [00:10:09] Krenair: Short form: No. [00:10:23] right :) [00:10:35] (03CR) 10Jforrester: [C: 04-1] "This is a pretty disruptive user-facing change (even if it's only in group0)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201388 (https://phabricator.wikimedia.org/T94282) (owner: 10Mattflaschen) [00:10:54] Krenair: Also, we're 40 minutes after SWAT should have closed, not 10. 
:_) [00:11:12] (03PS1) 10Dzahn: remove haedus and capella from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/201394 (https://phabricator.wikimedia.org/T94474) [00:11:26] deployment calendar says 23:00 - 00:00... it's now 00:11 [00:11:35] (03Abandoned) 10Mattflaschen: Add NS_TALK to VE for Flow, only on MW.org, test, and test2. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201388 (https://phabricator.wikimedia.org/T94282) (owner: 10Mattflaschen) [00:12:31] Krenair: Hmm. SWAT was meant to last 30 minutes. Maybe it changed? [00:12:38] Okay, let's scratch it (mainly for UX reasons, not because it's 10 minutes late). [00:12:42] yes [00:13:51] superm401: Sorry. [00:14:50] (03PS1) 10Dzahn: remove haedus,capella from hiera, DHCP, netboot [puppet] - 10https://gerrit.wikimedia.org/r/201395 (https://phabricator.wikimedia.org/T94474) [00:14:58] It's okay, I should have followed up about that issue earlier. [00:17:08] (03PS1) 10Yuvipanda: tools: Make webservice2 the default webservice [puppet] - 10https://gerrit.wikimedia.org/r/201396 (https://phabricator.wikimedia.org/T90855) [00:17:24] (03PS2) 10Yuvipanda: tools: Make webservice2 the default webservice [puppet] - 10https://gerrit.wikimedia.org/r/201396 (https://phabricator.wikimedia.org/T90855) [00:17:41] (03CR) 10Yuvipanda: [C: 032] "WHEEEE" [puppet] - 10https://gerrit.wikimedia.org/r/201396 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [00:17:51] (03CR) 10Yuvipanda: [V: 032] "WHEEEE" [puppet] - 10https://gerrit.wikimedia.org/r/201396 (https://phabricator.wikimedia.org/T90855) (owner: 10Yuvipanda) [00:17:54] "This network has 51 channels or dialogs associated with it, would you really like to -" yes -_- [00:19:30] (03PS1) 10Dzahn: remove haedus/capella, decom [dns] - 10https://gerrit.wikimedia.org/r/201397 (https://phabricator.wikimedia.org/T94474) [00:19:31] 6operations, 10ops-esams: Upgrade cp3011-3014 with 10G cards - https://phabricator.wikimedia.org/T88684#1172621 (10BBlack) [00:19:33] 
6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1172620 (10BBlack) [00:20:17] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#973608 (10BBlack) ^ Removed blocker on the 10G card upgrades. With the current plan for how to accomplish the role migrations, this is only a "nice-to-have" at this point; it do... [00:41:26] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172656 (10Dzahn) We are using "cassandra::seeds" as the list of allowed source hosts/IPs, but the structure in hieradata is diff... [00:45:53] (03PS1) 10Dzahn: set cassandra vars in cassandra.yaml, not restbase [puppet] - 10https://gerrit.wikimedia.org/r/201401 [00:49:58] mutante: is the puppet compiler still down? [00:50:10] gwicke: no, it worked again [00:51:10] did you test https://gerrit.wikimedia.org/r/201389 already? 
[00:51:42] (03PS1) 10Yuvipanda: tools: Fix how bigbrother calls webservice2 for trusty [puppet] - 10https://gerrit.wikimedia.org/r/201402 [00:51:56] (03PS2) 10Yuvipanda: tools: Fix how bigbrother calls webservice2 for trusty [puppet] - 10https://gerrit.wikimedia.org/r/201402 [00:52:05] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Fix how bigbrother calls webservice2 for trusty [puppet] - 10https://gerrit.wikimedia.org/r/201402 (owner: 10Yuvipanda) [00:53:17] gwicke: no [00:53:36] mutante: kk, let me give it a try [00:53:37] (03PS1) 10Dzahn: dumps: add rsync client hostnames to hiera data [puppet] - 10https://gerrit.wikimedia.org/r/201404 [00:53:40] i have another similar case here , for dumps [00:53:44] ok, cool [00:54:19] (03PS2) 10Dzahn: dumps: add rsync client hostnames to hiera data [puppet] - 10https://gerrit.wikimedia.org/r/201404 [00:55:25] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172664 (10GWicke) @dzahn, I just ran this through the puppet compiler, and got a puppet failure: http://puppet-compiler.wmflabs... [00:55:33] (03CR) 10Dzahn: "@Alexandros: how about this? now adding them to hiera instead: https://gerrit.wikimedia.org/r/#/c/201404/" [puppet] - 10https://gerrit.wikimedia.org/r/188188 (owner: 10Dzahn) [00:58:52] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172667 (10Dzahn) @gwicke: http://puppet-compiler.wmflabs.org/668/change/201389/compiled/puppet_catalogs_3_production/restbase10... [01:00:20] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172668 (10Dzahn) But it _did_ work for cerium earlier. 
http://puppet-compiler.wmflabs.org/666/change/197840/html/cerium.eqiad.wm... [01:01:40] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172669 (10GWicke) @dzahn, I see.. tricky. Maybe your idea of disabling puppet & just trying on one node is the most reliable / s... [01:04:16] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1172671 (10Dzahn) @gwicke actually, that is unrelated to my change, it already fails like that in production: http://puppet-comp... [01:04:35] gwicke: the fail you see on the compiler, it's not my change [01:04:43] it already fails in the "before" version [01:04:47] with the same error [01:05:07] fun ;( [01:05:07] like if you compare the 2 on http://puppet-compiler.wmflabs.org/668/change/201389/compiled/ [01:05:35] modules/cassandra/manifests/init.pp:246 [01:05:59] if (!is_ip_address($listen_address)) { [01:06:08] ^that, but it gets a hostname [01:06:23] mutante: I'm fine with disabling puppet & trying on one node [01:06:44] let's fix the current thing first then [01:06:48] so we can compile [01:07:17] so it's not a puppet-compiler specific issue, but actually broken in prod? 
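Note on the check quoted at 01:05:59: the theory in the channel is that `is_ip_address($listen_address)` fails because the value from hiera is a hostname, not an IP literal. The following is a minimal Python sketch of that distinction, not the Puppet module's code. The function names, the `resolve` parameter, and the example hostname and address are all hypothetical.

```python
import ipaddress
import socket


def is_ip_address(value):
    """Return True if value parses as an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(value)
        return True
    except ValueError:
        return False


def normalize_listen_address(value, resolve=socket.gethostbyname):
    """Return an IP literal, resolving value first if it is a hostname.

    `resolve` is injectable so the lookup can be stubbed out in tests.
    """
    if is_ip_address(value):
        return value
    return resolve(value)


if __name__ == "__main__":
    # A hostname fails the strict check, which is the failure mode
    # suspected in the log.
    print(is_ip_address("cassandra-a.example"))
    print(is_ip_address("192.0.2.10"))
```

Whether the right fix is to resolve the name or to store an IP in the data is a design decision the log leaves open; the sketch only shows why a strict IP check rejects a hostname.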
[01:07:33] yea [01:07:48] there is a check in the module to check if listen_address is an IP [01:08:00] but it tries to use the hostname from hiera [01:09:34] well, that's my current theory [01:09:49] see that line 246 though [05:13:02] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:07] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 1.96 ms [05:13:12] PROBLEM - puppet last run on labstore1001 is CRITICAL: Connection refused by host [05:13:13] PROBLEM - Disk space on labstore1001 is CRITICAL: Connection refused by host [05:13:14] PROBLEM - configured eth on labstore1001 is CRITICAL: Connection refused by host [05:13:15] PROBLEM - RAID on labstore1001 is CRITICAL: Connection refused by host [05:13:15] PROBLEM - salt-minion processes on labstore1001 is CRITICAL: Connection refused by host [05:13:15] PROBLEM - dhclient process on labstore1001 is CRITICAL: Timeout while attempting connection [05:13:15] PROBLEM - DPKG on labstore1001 is CRITICAL: Timeout while attempting connection [05:13:16] that seems unfortunate [05:13:16] unfortunate, heh [05:13:16] ops are working on it in -labs [05:13:16] oh good [05:13:16] * springle ignores it [05:13:16] PROBLEM - Host labstore1001 is DOWN: PING CRITICAL - Packet loss = 100% [05:13:16] mutante, aww [05:13:16] Krenair: is that a good or a bad aww ? [05:13:16] bad [05:13:16] don't let that keep you from the discussion [05:13:16] aww [05:13:17] one more or less doesn't make a big difference imho, we can still discuss whether they should all be google or all not be google [05:13:17] ? 
[05:13:17] it's already using both [05:13:17] ori: mail aliases [05:13:17] context: https://phabricator.wikimedia.org/T94789 [05:13:17] oh [05:13:18] really, don't let that influence a general discussion where they should be [05:13:18] RECOVERY - Host labstore1001 is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms [05:13:18] but unless we move them all it didnt make a difference to me [05:13:18] took us 10 seconds to do it or a minute to ask OIT [05:13:19] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [05:13:19] and mixed usage is already there anyways [05:13:19] RECOVERY - Disk space on labstore1001 is OK: DISK OK [05:13:19] RECOVERY - configured eth on labstore1001 is OK: NRPE: Unable to read output [05:13:19] RECOVERY - RAID on labstore1001 is OK: OK: optimal, 72 logical, 72 physical [05:13:19] RECOVERY - salt-minion processes on labstore1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [05:13:20] RECOVERY - dhclient process on labstore1001 is OK: PROCS OK: 0 processes with command name dhclient [05:13:20] RECOVERY - DPKG on labstore1001 is OK: All packages OK [05:13:21] https://wikitech.wikimedia.org/wiki/Category:Server_type:Bastion fenari and even "pascal" , hah :p .. night [05:13:23] PROBLEM - NTP on labstore1001 is CRITICAL: NTP CRITICAL: Offset unknown [05:13:24] PROBLEM - RAID on db1035 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [05:13:27] !log manually merged Ximilian global account per request and account confirmation [05:13:30] Jamesofur: you're doing those now? [05:13:57] apparently not logging it atm without a log bot... 
[05:14:17] yeah, labs was having issues, should be back soon [05:14:56] greg-g: I do them some, I'm on the global renamers list (given my work with the stewards and that area in general) and I know lego/keegan are slammed [05:15:02] Might as well use my access to help out [05:15:25] Jamesofur: does it require ssh/some cli tool, or is it web interface? [05:16:08] yeah cli [05:16:55] * greg-g nods [05:17:03] and occasionally even db manipulation (though I attempt to avoid that), the weird edge cases are, shockingly, weird edge cases [05:18:56] Jamesofur: what are you doing that requires direct db touching? [05:19:40] legoktm: generally I don't :) I use eval to change or set the email address (though I'd done the db touch in the past not completely realizing) [05:20:37] ah [05:20:54] if you touch the db directly it won't invalidate memcache [05:20:58] * Jamesofur nods [05:38:16] 6operations, 10ops-eqiad: db1035 raid degraded - https://phabricator.wikimedia.org/T94805#1172994 (10Springle) 3NEW [05:38:57] ACKNOWLEDGEMENT - RAID on db1035 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Sean Pringle T94805 [05:48:15] morebots: everything ok? [05:48:16] I am a logbot running on tools-exec-11. [05:48:16] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [05:48:16] To log a message, type !log . 
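Note on the 05:20:54 remark that touching the db directly won't invalidate memcache: a toy Python sketch of the general pattern follows. It is an assumption-laden illustration, not MediaWiki's caching code. The `UserStore` class and its two dicts are hypothetical stand-ins for a database and a cache.

```python
class UserStore:
    """Toy model: a database dict fronted by a cache dict."""

    def __init__(self):
        self.db = {}
        self.cache = {}

    def get_email(self, user):
        # Read path: serve from the cache, otherwise read the
        # database and remember the result.
        if user not in self.cache:
            self.cache[user] = self.db.get(user)
        return self.cache[user]

    def set_email(self, user, email):
        # Application write path: update the database, then drop
        # the cached copy so the next read sees the new value.
        self.db[user] = email
        self.cache.pop(user, None)


if __name__ == "__main__":
    store = UserStore()
    store.set_email("example-user", "old@example.org")
    store.get_email("example-user")               # fills the cache
    store.db["example-user"] = "new@example.org"  # direct write, cache untouched
    print(store.get_email("example-user"))        # still the cached old value
```

Going through the application's own write path (as with the eval approach mentioned in the log) is what keeps the cached copy and the stored row in step.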
[05:48:25] awesome [06:12:38] (03PS1) 10BBlack: Revert "Drain esams of all traffic v2 (scheduled outage)" [dns] - 10https://gerrit.wikimedia.org/r/201418 [06:13:10] (03CR) 10BBlack: [C: 032] Revert "Drain esams of all traffic v2 (scheduled outage)" [dns] - 10https://gerrit.wikimedia.org/r/201418 (owner: 10BBlack) [06:13:53] !log re-pooling esams (GTT event never happened AFAICS) [06:14:02] Logged the message, Master [06:17:46] (03PS1) 10Prtksxna: Add $wgPopupsSurveyLink if $wmgUsePopups is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201419 (https://phabricator.wikimedia.org/T1005) [06:29:37] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [06:29:47] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: puppet fail [06:29:48] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: Puppet has 2 failures [06:29:48] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:17] PROBLEM - puppet last run on polonium is CRITICAL: CRITICAL: puppet fail [06:30:36] PROBLEM - puppet last run on mc1012 is CRITICAL: CRITICAL: puppet fail [06:31:27] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: puppet fail [06:31:27] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: puppet fail [06:32:27] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:37] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:58] (03CR) 10Legoktm: [C: 031] Add $wgPopupsSurveyLink if $wmgUsePopups is true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201419 (https://phabricator.wikimedia.org/T1005) (owner: 10Prtksxna) [06:34:02] !log manually merged Ximilian global account per request and account confirmation [06:34:07] Logged the message, Master [06:34:17] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:26] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 1 failures 
[06:34:26] PROBLEM - puppet last run on wtp2012 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:27] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:27] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:47] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:47] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:17] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 3 failures [06:35:36] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:57] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:06] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:06] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:06] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:07] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:26] PROBLEM - puppet last run on mw2206 is CRITICAL: CRITICAL: Puppet has 2 failures [06:36:27] PROBLEM - puppet last run on mw2017 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:27] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:36] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:57] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:17] PROBLEM - puppet last run on mw2079 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:17] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:17] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:17] PROBLEM - puppet last run on mw2059 is 
CRITICAL: CRITICAL: Puppet has 1 failures [06:37:38] PROBLEM - puppet last run on mw2030 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:38] PROBLEM - puppet last run on mw2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:39:07] PROBLEM - LVS HTTPS IPv4 on upload-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:28] PROBLEM - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:33] PROBLEM - LVS HTTP IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:37] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/2: down - Core: cr2-knams:xe-1/1/0 (GTT, 00341724) [10Gbps MPLS]BR [06:39:37] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:40] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:43] PROBLEM - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:39:48] (03PS1) 10BBlack: Revert "Revert "Drain esams of all traffic v2 (scheduled outage)"" [dns] - 10https://gerrit.wikimedia.org/r/201420 [06:39:58] (03CR) 10BBlack: [C: 032 V: 032] Revert "Revert "Drain esams of all traffic v2 (scheduled outage)"" [dns] - 10https://gerrit.wikimedia.org/r/201420 (owner: 10BBlack) [06:40:17] PROBLEM - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:40:20] PROBLEM - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:40:46] PROBLEM - HTTP 5xx req/min on graphite1002 is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [06:40:56] !log re-depooled esams ... 
[06:40:56] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:41:03] Logged the message, Master [06:41:17] RECOVERY - LVS HTTPS IPv4 on upload-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 759 bytes in 9.436 second response time [06:41:37] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [06:41:37] RECOVERY - LVS HTTP IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4044 bytes in 9.233 second response time [06:41:47] PROBLEM - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:41:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [06:42:06] RECOVERY - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4049 bytes in 0.341 second response time [06:42:37] PROBLEM - puppet last run on lvs3002 is CRITICAL: CRITICAL: puppet fail [06:42:37] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [06:42:46] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 65682 bytes in 5.263 second response time [06:43:27] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65713 bytes in 0.373 second response time [06:43:30] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18071 bytes in 0.433 second response time [06:43:33] PROBLEM - puppet last run on amssq59 is CRITICAL: CRITICAL: puppet fail [06:43:33] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail [06:43:33] RECOVERY - LVS HTTPS IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 65838 bytes in 0.590 second response time [06:43:37] RECOVERY - LVS HTTP IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 4044 bytes in 2.409 second response time 
[06:43:57] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: puppet fail [06:44:36] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [06:45:07] RECOVERY - LVS HTTP IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 18112 bytes in 0.253 second response time [06:45:17] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [06:45:26] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail [06:45:37] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:45:38] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:45:38] PROBLEM - puppet last run on amssq37 is CRITICAL: CRITICAL: Puppet has 12 failures [06:45:47] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:45:48] RECOVERY - LVS HTTPS IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 65806 bytes in 9.926 second response time [06:46:18] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: puppet fail [06:46:18] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 16 failures [06:46:18] PROBLEM - puppet last run on cp3043 is CRITICAL: CRITICAL: puppet fail [06:46:26] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:46:27] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:46:27] RECOVERY - puppet last run on mw2079 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:38] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:46] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 
failures [06:46:47] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:46:56] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on wtp2012 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw2113 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:46:57] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:47:06] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:07] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2206 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on mw2017 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:47:18] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:47:18] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:18] RECOVERY - puppet last run on amssq47 is OK: OK: 
Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:47] RECOVERY - puppet last run on polonium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:47:47] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:56] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:57] RECOVERY - puppet last run on mc1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:07] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:08] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:16] RECOVERY - puppet last run on mw2059 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw2030 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:48:37] RECOVERY - puppet last run on mw2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:57] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:52:56] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail [06:52:57] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail [06:52:57] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail [06:53:18] PROBLEM - puppet last run on amssq48 is CRITICAL: CRITICAL: puppet fail [06:53:37] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet 
fail [06:53:37] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: puppet fail [06:53:37] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail [06:53:58] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [06:55:28] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [06:55:46] RECOVERY - puppet last run on amssq37 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:57:07] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:57:07] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:57:07] RECOVERY - puppet last run on amssq59 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:57:27] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:57] RECOVERY - puppet last run on lvs3002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:58:06] RECOVERY - puppet last run on cp3043 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:58:06] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:06] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [06:58:06] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:47] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [06:58:48] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:59:48] RECOVERY - puppet last run on cp3020 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:03:00] (03PS1) 10Ori.livneh: webservice2: EAFP, not LBYL [puppet] - 10https://gerrit.wikimedia.org/r/201421 [07:03:03] YuviPanda: ^ [07:04:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:04:57] RECOVERY - HTTP 5xx req/min on graphite1002 is OK: OK: Less than 1.00% above the threshold [250.0] [07:07:47] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [07:08:17] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:08:26] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:47] RECOVERY - puppet last run on amssq48 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:09:07] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:09:07] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [07:09:07] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [07:10:06] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:17:15] <_joe_> ori: I'm not sure that is a better coding style, in that specific case [07:17:44] <_joe_> I think EAFP is a good idea for internal things, not for external resources, but maybe that's just my taste [07:18:12] it's one less system call this way, but the bigger point is that it fixes a bug [07:18:22] <_joe_> oh ok :P [07:18:26] because the file can go away between the exists and open calls [07:18:53] <_joe_> right [07:20:36] (03PS7) 10KartikMistry: CX: Enable newarticle campaign in cawiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/197491 [07:45:23] (03PS1) 10Springle: repool db1027 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201426 [07:45:48] (03CR) 10Springle: [C: 032] repool db1027 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201426 (owner: 10Springle) [07:45:53] (03Merged) 10jenkins-bot: repool db1027 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201426 (owner: 10Springle) [07:46:43] !log springle Synchronized wmf-config/db-eqiad.php: repool db1027, warm up (duration: 00m 11s) [07:46:54] Logged the message, Master [07:48:43] !log on sync-file from tin: mw2213.codfw.wmnet returned [255]: Host key verification failed [07:48:48] Nemo_bis: you've submitted puppet patches to mediawiki-vagrant before, no? If you are interested, I could help you whip up a Puppet module for LimeSurvey that would work in production. [07:48:50] Logged the message, Master [07:50:03] <_joe_> springle: ach, I'll take a look [07:50:24] _joe_: tnx. couple other recent hist in the SAL for mw2213, but i havn't investigated much [07:50:28] hits* [07:50:41] <_joe_> heh, still need to be properly installed it seems? [07:51:18] that would do it [07:55:39] ori: I'd like to do that, sure. Thanks for the offer to help! [07:56:33] However, if I did that before having a concrete demand, I would be *pushing* LimeSurvey. And I can't do that without knowing if it's actually suitable from a statistical perspective etc. (Which the researchers should tell, I think.) [07:57:03] OTOH I could just volunteer on analytics@ and see if someone is interested. [07:58:16] Nemo_bis: bd808's scholarship app (modules/wikimedia_scholarships in operations/puppet, puppet/modules/scholarships in mediawiki-vagrant) would be a good place to start, since it's a standalone LAMP app. LimeSurvey works well with MySQL and Apache by the looks of it and has no real dependencies beyond that. [07:59:44] Right. Good point, that's a nice starter. 
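Note on the EAFP versus LBYL exchange at 07:17-07:18: the point made there is that checking for a file and then opening it costs an extra system call and leaves a window in which the file can disappear. A minimal Python 3 sketch of the two styles follows. It is not the webservice2 patch itself, and the function names are hypothetical.

```python
import os


def read_lbyl(path):
    """Look before you leap: check first, then open."""
    if os.path.exists(path):
        # The file can be removed between the check above and the
        # open() below, in which case open() still raises.
        with open(path) as f:
            return f.read()
    return None


def read_eafp(path):
    """Easier to ask forgiveness than permission: just try to open."""
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        # A missing file is handled here, with no separate check.
        return None
```

Both functions return `None` for a missing file in the common case; only `read_eafp` also does so when the file vanishes at the wrong moment.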
[08:00:44] (03CR) 10Krinkle: "Once tested, we should probably make sure this applies to all wikis that use the 'default' $wgLogo." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201050 (owner: 10Ori.livneh) [08:03:40] PROBLEM - RAID on mw2213 is CRITICAL: Connection refused by host [08:03:49] (03CR) 10Krinkle: [C: 031] webservice2: EAFP, not LBYL [puppet] - 10https://gerrit.wikimedia.org/r/201421 (owner: 10Ori.livneh) [08:04:19] PROBLEM - configured eth on mw2213 is CRITICAL: Connection refused by host [08:04:30] PROBLEM - dhclient process on mw2213 is CRITICAL: Connection refused by host [08:04:50] PROBLEM - nutcracker port on mw2213 is CRITICAL: Connection refused by host [08:05:00] PROBLEM - nutcracker process on mw2213 is CRITICAL: Connection refused by host [08:05:09] PROBLEM - puppet last run on mw2213 is CRITICAL: Connection refused by host [08:05:19] PROBLEM - DPKG on mw2213 is CRITICAL: Connection refused by host [08:05:19] PROBLEM - salt-minion processes on mw2213 is CRITICAL: Connection refused by host [08:05:30] PROBLEM - Disk space on mw2213 is CRITICAL: Connection refused by host [08:06:00] PROBLEM - HHVM processes on mw2213 is CRITICAL: Connection refused by host [08:22:43] (03CR) 10Filippo Giunchedi: [C: 031] various role classes - indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/200110 (owner: 10Dzahn) [08:22:54] (03PS2) 10Filippo Giunchedi: various role classes - indentation fixes [puppet] - 10https://gerrit.wikimedia.org/r/200110 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [08:23:37] morning godog [08:23:38] PROBLEM - HHVM processes on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:24:37] PROBLEM - RAID on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:24:49] PROBLEM - configured eth on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:24:50] _joe_: is mw2209 you? 
[08:25:08] PROBLEM - dhclient process on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:25:18] PROBLEM - nutcracker port on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:25:23] <_joe_> ori: yes, I'm installing it actually [08:25:37] PROBLEM - nutcracker process on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:25:37] cool :) [08:25:45] <_joe_> since it's just 2 systems, I didn't fanatically tried to schedule downtime for them [08:25:48] PROBLEM - puppet last run on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:25:58] PROBLEM - salt-minion processes on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:25:58] PROBLEM - DPKG on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:26:17] PROBLEM - Disk space on mw2209 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [08:26:21] (03CR) 10Filippo Giunchedi: "I'd assumed that's the username from http basic auth? i.e. 
labs full name" [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) (owner: 10Dzahn) [08:26:27] hey ori [08:27:28] RECOVERY - DPKG on mw2213 is OK: All packages OK [08:27:28] RECOVERY - HHVM processes on mw2213 is OK: PROCS OK: 1 process with command name hhvm [08:27:28] RECOVERY - configured eth on mw2213 is OK: NRPE: Unable to read output [08:27:28] RECOVERY - salt-minion processes on mw2213 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:27:39] RECOVERY - DPKG on mw2209 is OK: All packages OK [08:27:39] RECOVERY - salt-minion processes on mw2209 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:27:57] RECOVERY - Disk space on mw2209 is OK: DISK OK [08:27:58] RECOVERY - RAID on mw2209 is OK: OK: no RAID installed [08:28:08] RECOVERY - RAID on mw2213 is OK: OK: no RAID installed [08:28:09] RECOVERY - Disk space on mw2213 is OK: DISK OK [08:28:18] RECOVERY - dhclient process on mw2213 is OK: PROCS OK: 0 processes with command name dhclient [08:28:18] RECOVERY - configured eth on mw2209 is OK: NRPE: Unable to read output [08:28:37] RECOVERY - dhclient process on mw2209 is OK: PROCS OK: 0 processes with command name dhclient [08:28:37] RECOVERY - nutcracker port on mw2213 is OK: TCP OK - 0.000 second response time on port 11212 [08:28:48] RECOVERY - nutcracker port on mw2209 is OK: TCP OK - 0.000 second response time on port 11212 [08:28:48] RECOVERY - nutcracker process on mw2213 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:28:48] RECOVERY - HHVM processes on mw2209 is OK: PROCS OK: 1 process with command name hhvm [08:28:58] RECOVERY - nutcracker process on mw2209 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [08:30:47] PROBLEM - puppet last run on mw2213 is CRITICAL: CRITICAL: Puppet has 6 failures [08:30:58] PROBLEM - puppet last run on mw2209 is CRITICAL: CRITICAL: Puppet has 6 
failures [08:31:27] PROBLEM - HHVM rendering on mw2213 is CRITICAL: HTTP CRITICAL - No data received from host [08:31:46] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Apr 2 08:30:42 UTC 2015 (duration 30m 41s) [08:31:53] Logged the message, Master [08:32:28] RECOVERY - puppet last run on mw2213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:32:31] (03CR) 10Hashar: [C: 031] "The modifications are fine themselves and is a step to toward enabling the arrow_alignement check." [puppet] - 10https://gerrit.wikimedia.org/r/200110 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [08:32:38] RECOVERY - puppet last run on mw2209 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:33:07] RECOVERY - HHVM rendering on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 65422 bytes in 0.473 second response time [08:46:27] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:46:52] (03PS3) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) [08:46:54] (03PS5) 10Hoo man: Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [08:50:28] 6operations, 10ops-codfw: ms-be2002.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T94014#1173122 (10fgiunchedi) 5Open>3Resolved disk back in service [08:51:27] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:54:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59654 bytes in 0.160 second response time [08:59:15] (03PS1) 10Filippo Giunchedi: admin: add pcoombe to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/201430 (https://phabricator.wikimedia.org/T94466) [09:00:54] 
(03PS4) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) [09:01:08] (03CR) 10Alexandros Kosiaris: "What Brandon said. The use of parsoid-lb.eqiad.wikimedia.org directly is/was and remains a mistake. As soon as codfw parsoid cluster is up" [puppet] - 10https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847) (owner: 10Alexandros Kosiaris) [09:03:48] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: Puppet has 1 failures [09:06:54] 6operations: luis@? - https://phabricator.wikimedia.org/T94789#1173174 (10faidon) @Dzahn, this should be done by OIT, not by us. We've been talking with Joel about moving all of the existing aliases to OIT, but let's not add new ones and redirect people to OIT instead. (too late for this one, JFYI). [09:07:40] (03CR) 10Alexandros Kosiaris: [C: 031] "LGTM, a minor nitpick in inline comment. +1 from me" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201404 (owner: 10Dzahn) [09:09:05] (03CR) 10Faidon Liambotis: "I believe the plan is to deprecate parsoid-lb entirely and use the public endpoint rest.wikimedia.org (which obviously has HTTPS)." [puppet] - 10https://gerrit.wikimedia.org/r/185181 (https://phabricator.wikimedia.org/T86847) (owner: 10Alexandros Kosiaris) [09:10:08] akosiaris: ^ [09:11:18] 6operations, 10ops-eqiad: fluorine console not working - https://phabricator.wikimedia.org/T94554#1173195 (10fgiunchedi) ack, let me know 10-15 min before bringing it down here or on irc [09:13:17] paravoid: yeah I am not sure on the details of that plan tbh [09:13:52] as in, I don't really know them [09:14:56] I do remember gabriel saying that the parsoid varnishes will be deprecated [09:15:10] but there are various services using them, like for example CX [09:15:33] and iegreview [09:16:02] and well... 
restbase defaults to it and then is overriden in hiera [09:16:27] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: puppet fail [09:17:37] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [09:24:13] (03PS5) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) [09:27:23] (03PS6) 10Hoo man: Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) [09:27:59] PROBLEM - Host platinum is DOWN: PING CRITICAL - Packet loss = 100% [09:28:18] PROBLEM - Host thallium is DOWN: PING CRITICAL - Packet loss = 100% [09:28:57] 6operations, 10ops-esams: Audit racktables - https://phabricator.wikimedia.org/T94819#1173246 (10mark) 3NEW [09:29:58] RECOVERY - Host thallium is UP: PING OK - Packet loss = 0%, RTA = 0.46 ms [09:30:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 601 [09:30:28] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [09:31:27] RECOVERY - Host platinum is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [09:34:23] (03CR) 10ArielGlenn: [C: 032] Adopt dumpwikidatajson.sh to the new naming pattern [puppet] - 10https://gerrit.wikimedia.org/r/201238 (https://phabricator.wikimedia.org/T72385) (owner: 10Hoo man) [09:35:17] RECOVERY - check_mysql on db1008 is OK: Uptime: 1795548 Threads: 1 Questions: 11932810 Slow queries: 12030 Opens: 36021 Flush tables: 2 Open tables: 64 Queries per second avg: 6.645 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:35:38] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:36:27] PROBLEM - Host thallium is DOWN: PING CRITICAL - Packet loss = 100% 
[09:36:47] PROBLEM - Host platinum is DOWN: PING CRITICAL - Packet loss = 100% [09:38:05] 6operations, 10ops-esams: Audit racktables - https://phabricator.wikimedia.org/T94819#1173258 (10fgiunchedi) p:5Triage>3Normal [09:38:13] 6operations, 10ops-eqiad: db1035 raid degraded - https://phabricator.wikimedia.org/T94805#1173260 (10fgiunchedi) p:5Triage>3Normal [09:38:25] 6operations, 10Labs-Vagrant: Backport Vagrant 1.7+ from Debian experimental to our Trusty apt repo - https://phabricator.wikimedia.org/T93153#1173264 (10fgiunchedi) p:5Triage>3Normal [09:38:48] RECOVERY - Host thallium is UP: PING OK - Packet loss = 0%, RTA = 1.32 ms [09:39:47] 6operations, 10MediaWiki-extensions-GeoData, 10OpenStreetMap, 10Wikimedia-Search: Assess growing GeoData hardware requirements - https://phabricator.wikimedia.org/T94768#1173269 (10fgiunchedi) p:5Triage>3Normal [09:39:58] 6operations, 5Patch-For-Review: failed icinga/graphite login for Moritz - https://phabricator.wikimedia.org/T94729#1173271 (10fgiunchedi) p:5Triage>3Normal [09:40:17] 6operations, 10ops-esams: Remove unused fibers - https://phabricator.wikimedia.org/T94704#1173273 (10fgiunchedi) p:5Triage>3Normal [09:40:27] PROBLEM - Host berkelium is DOWN: PING CRITICAL - Packet loss = 100% [09:40:38] <_joe_> berlekium? 
[09:40:44] uh oh [09:40:49] I just did an RE swap on a switch [09:40:52] 6operations, 10Deployment-Systems, 6Services: Automate compiling service dependencies using production Jessie libraries - https://phabricator.wikimedia.org/T94611#1173275 (10fgiunchedi) p:5Triage>3Normal [09:40:53] PROBLEM - NTP on thallium is CRITICAL: NTP CRITICAL: Offset unknown [09:40:58] berkelium is me :( [09:41:03] kernel crashed during ipsec load testing [09:41:10] i am not happy about this [09:42:02] !log asw-d-eqiad: routing-engine backup switch FPC 7 -> FPC 5, master switchover FPC 8 -> FPC 5 [09:42:03] PROBLEM - Host thallium is DOWN: PING CRITICAL - Packet loss = 100% [09:42:12] Logged the message, Master [09:42:12] jgage: did you get a backtrace? [09:42:32] yeah [09:43:02] RECOVERY - Host berkelium is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [09:43:03] !log asw-d-eqiad: routing-engine backup switch FPC 8 -> FPC 4 [09:43:11] Logged the message, Master [09:43:42] PROBLEM - Host gold is DOWN: PING CRITICAL - Packet loss = 100% [09:44:22] RECOVERY - Host gold is UP: PING OK - Packet loss = 0%, RTA = 1.57 ms [09:44:31] not sure what to make of it [09:44:33] [ 144.020469] Call Trace: [09:44:34] [ 144.022906] [] ? dump_stack+0x41/0x51 [09:44:34] [ 144.028834] [] ? warn_slowpath_common+0x77/0x90 [09:44:37] [ 144.035000] [] ? update_process_times+0x59/0x70 [09:44:40] [ 144.041168] [] ? tick_sched_handle.isra.16+0x20/0x60 [09:44:43] [ 144.047768] [] ? tick_sched_timer+0x3c/0x60 [09:44:45] will attempt to repeat [09:44:47] that's a warn, not a crash, no? 
[09:44:53] 6operations, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1173277 (10fgiunchedi) p:5Triage>3Normal [09:45:22] RECOVERY - Host thallium is UP: PING OK - Packet loss = 0%, RTA = 0.93 ms [09:45:28] 6operations, 10ops-codfw, 6Labs: rack and connect labstore-array4-codfw in codfw - https://phabricator.wikimedia.org/T93215#1173279 (10fgiunchedi) p:5Triage>3Normal [09:45:35] 6operations: deploy eventlog2001 services - https://phabricator.wikimedia.org/T93220#1173282 (10fgiunchedi) p:5Triage>3Normal [09:45:51] ah you're right there's a warn right after the panic output [09:46:57] i wonder if the fact that i was tcpdumping at the time is related [09:47:06] * jgage retries [09:48:11] 6operations, 10Continuous-Integration: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173290 (10fgiunchedi) p:5Triage>3Normal [09:48:34] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1173292 (10fgiunchedi) p:5Triage>3Normal [09:48:45] 6operations, 10Deployment-Systems, 6Services: Evaluate Docker as a container deployment tool - https://phabricator.wikimedia.org/T93439#1173293 (10fgiunchedi) p:5Triage>3Normal [09:49:44] PROBLEM - Host gold is DOWN: PING CRITICAL - Packet loss = 100% [09:50:03] PROBLEM - Host thallium is DOWN: PING CRITICAL - Packet loss = 100% [09:50:10] dammit now i got a panic on curium. i'm just just running wget of a tiny file in a loop with a 1 second pause. [09:52:12] RECOVERY - Host gold is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [09:52:12] RECOVERY - Host thallium is UP: PING OK - Packet loss = 0%, RTA = 2.19 ms [09:56:17] hashar: there? I was looking at python-gear and looks like the upstream branch isn't pushed to alioth? 
[09:56:37] godog: yeah I am using uscan [09:56:46] but dont bother with it, I have just pinged the original sponsor mika [09:57:10] following your triage of the task :D [09:57:24] Oh I mean Upload python-gear 0.5.5-2 to Debian project https://phabricator.wikimedia.org/T89952 [09:57:33] for our wikimedia distro ( https://phabricator.wikimedia.org/T92684 ) [09:57:42] you would want to use uscan --force-download [09:57:50] I think there was a getorig target as well [09:58:02] I wish git-buildpackage could rely on uscan to fetch the orig tarball [09:58:14] PROBLEM - Host berkelium is DOWN: PING CRITICAL - Packet loss = 100% [09:58:50] this time berkelium crashed when i was just restarting ipsec service [09:59:07] i'm commenting out bits of config each time to get closer to defaults [09:59:50] gage@berkelium:~$ sudo service ipsec restart [09:59:51] Warning! D-Bus connection terminated. [09:59:51] Failed to wait for response: Success [09:59:55] hashar: what should happen after uscan btw? [10:00:12] RECOVERY - Host berkelium is UP: PING OK - Packet loss = 0%, RTA = 1.37 ms [10:01:04] godog: magic? [10:01:18] godog: what i do is git clone the sources, then uscan --force-download --rename [10:01:44] that got me the upstream tarball python-gear_0.5.5.orig.tar.gz in the parent dir [10:01:55] then I invoke git-buildpackage which magically find the tarball [10:02:16] when I say "magic" it is that I have no clue what is happening inside git buildpackage [10:02:40] I guess it first look for a tarball and if not found fallback to upstream/${version} [10:04:46] hashar: possible, I'll take a look [10:05:27] godog: the original uploaded is looking at pushing the latest 0.5.5-2 to debian project :) [10:05:40] I could push upstream source to alioth [10:05:51] but I am not sure whether the resulting tarball would be the same as the one they publish [10:06:19] yeah for that pristine-tar is used usually [10:09:17] jgage: so what's the panic then? 
[10:09:21] the backtrace [10:10:58] paravoid i've captured two so far, one in __wake_up and one in __wake_up_sync_key [10:11:10] i don't have much experience reading these, but i can email them if you like [10:11:24] running another test now with default ciphers and so far it's stable [10:11:29] phab please :) [10:11:34] k :) [10:16:02] 6operations, 6MediaWiki-Core-Team, 5Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1173311 (10fgiunchedi) so fluorine disk space is stable now after cleaning up uncompressed logs, what's left is unsampled vs sampled. if we go unsampled a daily rotated log file i... [10:17:15] (03PS2) 10Hoo man: Generalize wikidata dump scripts [puppet] - 10https://gerrit.wikimedia.org/r/201372 [10:19:02] 6operations, 3Interdatacenter-IPsec: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test - https://phabricator.wikimedia.org/T94820#1173312 (10Gage) 3NEW [10:19:14] why are you trying with 3.16? [10:19:22] berkelium was on 3.19 [10:20:08] no, berkelium & curium are both apt-get dist-upgraded and running 3.16 [10:21:16] 6operations, 3Interdatacenter-IPsec: Kernel panics on Jessie (3.16.0-4-amd64) during IPsec load test - https://phabricator.wikimedia.org/T94820#1173323 (10faidon) [10:27:06] 6operations, 10Continuous-Integration: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173325 (10hashar) As pointed by Filippo, the Alioth repository does not have an upstream branch containing the source. There is a debian/watch file though so one ca... 
[10:30:33] PROBLEM - Host gold is DOWN: PING CRITICAL - Packet loss = 100% [10:30:44] PROBLEM - Host thallium is DOWN: PING CRITICAL - Packet loss = 100% [10:32:08] (03PS3) 10Hoo man: Generalize wikidata dump scripts [puppet] - 10https://gerrit.wikimedia.org/r/201372 [10:34:22] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Set up the mediawiki application layer in codfw - https://phabricator.wikimedia.org/T86894#1173341 (10Joe) a:3Joe [10:37:33] paravoid, i'm headed to bed. will triage those ciphers in the morning. thanks for looking at the traces. [10:37:43] ok [10:37:45] thanks :) [10:37:52] 6operations, 3codfw-appserver-setup, 3wikis-in-codfw: Pybal RunCommand monitor doesn't work correctly on ubuntu trusty - https://phabricator.wikimedia.org/T94822#1173342 (10Joe) 3NEW a:3Joe [10:38:32] <_joe_> !log stopping pybal on lvs2003, running manually to help debugging [10:38:41] Logged the message, Master [10:39:51] (03CR) 10ArielGlenn: [C: 032] Generalize wikidata dump scripts [puppet] - 10https://gerrit.wikimedia.org/r/201372 (owner: 10Hoo man) [10:42:59] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1173355 (10fgiunchedi) [10:43:01] 6operations, 10Continuous-Integration: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173352 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi python-gear 0.5.5-2 uploaded to both jessie-wikimedia and trusty-wikimedia [10:43:03] hashar: ^ [10:43:34] (03PS6) 10Hoo man: Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [10:52:35] 6operations, 10ops-esams: cp3011 hardware fault - https://phabricator.wikimedia.org/T92306#1173362 (10mark) Dell's Lifecycle Controller's hardware diagnostics gives the following error codes: Error Code: 
2000-0251 Validation 78714 [10:54:23] RECOVERY - NTP on labstore1001 is OK: NTP OK: Offset 0.05301225185 secs [10:57:48] 6operations: Upload python-gear 0.5.5-2 to Debian project - https://phabricator.wikimedia.org/T89952#1173375 (10hashar) Debian Developer [[ https://qa.debian.org/developer.php?login=mika%40debian.org Michael "mika" Prokop ]] kindly reviewed 0.5.5-2 from the alioth git repository and uploaded the package to the D... [11:00:38] (03PS7) 10Hoo man: Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [11:02:02] !log Shutting down cp3012 for 10G upgrade [11:02:10] Logged the message, Master [11:03:13] 6operations, 10Continuous-Integration: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173378 (10hashar) Confirmed. Thanks a lot @fgiunchedi From T89952 : Debian Developer [[ https://qa.debian.org/developer.php?login=mika%40debian.org | Michael "mika"... [11:03:30] godog: thanks a lot. I feel much more comfortable with Debian nowadays :) [11:03:55] np [11:04:03] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100% [11:10:37] pybal flap on codfw? [11:10:39] _joe_: that you? 
[11:15:45] (03PS8) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) [11:19:36] (03CR) 10ArielGlenn: [C: 032] Add a script to create Wikidata ttl dumps [puppet] - 10https://gerrit.wikimedia.org/r/201003 (https://phabricator.wikimedia.org/T93658) (owner: 10Smalyshev) [11:20:46] (03PS9) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) [11:23:19] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: puppet fail [11:23:35] we know, working on it [11:24:01] PROBLEM - puppet last run on ganeti1002 is CRITICAL: CRITICAL: Puppet has 3 failures [11:24:59] PROBLEM - puppet last run on ganeti2005 is CRITICAL: CRITICAL: Puppet has 3 failures [11:25:18] PROBLEM - puppet last run on ganeti1004 is CRITICAL: CRITICAL: Puppet has 3 failures [11:25:29] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Puppet has 3 failures [11:25:58] PROBLEM - puppet last run on ganeti2004 is CRITICAL: CRITICAL: Puppet has 3 failures [11:26:08] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Puppet has 3 failures [11:26:12] hmmm [11:26:49] PROBLEM - puppet last run on ganeti2006 is CRITICAL: CRITICAL: Puppet has 3 failures [11:27:08] PROBLEM - puppet last run on ganeti2003 is CRITICAL: CRITICAL: Puppet has 3 failures [11:27:15] 6operations, 10ops-esams: Upgrade cp3011-3014 with 10G cards - https://phabricator.wikimedia.org/T88684#1173403 (10mark) I've just replaced the network daughterboard in cp3012 with a 2x 10G + 2x 1G one. After boot, it came up with NICs eth4 and up, due to /etc/udev/rules.d/70-persistent-net-rules.conf. I remo... 
[11:30:39] (03CR) 10Alexandros Kosiaris: "I updated switches configuration and racktables as well" [dns] - 10https://gerrit.wikimedia.org/r/200573 (owner: 10Alexandros Kosiaris) [11:32:14] (03PS1) 10Hoo man: Fix snapshot::wikidatadumps::common inclusion [puppet] - 10https://gerrit.wikimedia.org/r/201440 [11:32:27] (03PS5) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) [11:33:39] (03CR) 10ArielGlenn: [C: 032] Fix snapshot::wikidatadumps::common inclusion [puppet] - 10https://gerrit.wikimedia.org/r/201440 (owner: 10Hoo man) [11:36:59] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:37:23] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-R_________: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1173425 (10phuedx) a:3phuedx [11:37:39] 6operations, 10Wikimedia-Site-requests, 5Patch-For-Review: Rename "chapcomwiki" to "affcomwiki" - https://phabricator.wikimedia.org/T41482#1173427 (10fgiunchedi) p:5Normal>3Low [11:39:43] 6operations, 7Monitoring: Job queue stats are broken - https://phabricator.wikimedia.org/T87594#1173432 (10fgiunchedi) a:3fgiunchedi [11:39:59] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:40:36] <_joe_> paravoid: yes it was me [11:40:44] <_joe_> I think I logged it too [11:42:09] (03PS1) 10Hoo man: snapshot::wikidatadumps: Declare dependencies [puppet] - 10https://gerrit.wikimedia.org/r/201441 [11:44:35] (03PS2) 10Hoo man: snapshot::wikidatadumps: Declare dependencies [puppet] - 10https://gerrit.wikimedia.org/r/201441 [11:45:11] 6operations, 10ops-esams: Remove unused fibers - https://phabricator.wikimedia.org/T94704#1173441 (10fgiunchedi) a:3mark [11:45:29] RECOVERY - Host cp3012 
is UP: PING OK - Packet loss = 0%, RTA = 89.25 ms [11:45:31] 6operations, 10ops-esams: cp3011 hardware fault - https://phabricator.wikimedia.org/T92306#1173449 (10fgiunchedi) a:3mark [11:46:01] 6operations, 10ops-eqiad: db1035 raid degraded - https://phabricator.wikimedia.org/T94805#1173450 (10fgiunchedi) a:3Christopher [11:46:18] 6operations, 10ops-esams: Audit racktables - https://phabricator.wikimedia.org/T94819#1173454 (10fgiunchedi) a:3mark [11:46:33] 6operations, 10ops-eqiad: fluorine console not working - https://phabricator.wikimedia.org/T94554#1173455 (10fgiunchedi) a:3Christopher [11:48:15] (03CR) 10ArielGlenn: [C: 032] snapshot::wikidatadumps: Declare dependencies [puppet] - 10https://gerrit.wikimedia.org/r/201441 (owner: 10Hoo man) [11:51:23] 6operations, 10ops-esams: Upgrade cp3011-3014 with 10G cards - https://phabricator.wikimedia.org/T88684#1173467 (10faidon) firmware-bnx2x was missing from all of them. I copied over the firmware via virtual media on cp3012 to have it connect to the network again and apt-get install'ed it and rebooted to be sur... [11:57:39] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100% [11:58:59] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Puppet has 3 failures [12:03:19] (03PS1) 10Alexandros Kosiaris: base: is_virtual comparison fix [puppet] - 10https://gerrit.wikimedia.org/r/201445 [12:05:56] 6operations, 10ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1173480 (10akosiaris) 3NEW [12:08:28] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: puppet fail [12:08:57] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-R_________: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1173496 (10phuedx) It's worth noting that the lists are static. 
[12:09:30] 6operations, 10ops-codfw: ganeti2002 has an unresponsive iDRAC - https://phabricator.wikimedia.org/T94827#1173497 (10akosiaris) 3NEW [12:13:49] RECOVERY - Host cp3013 is UP: PING OK - Packet loss = 0%, RTA = 89.04 ms [12:14:57] godog: I think the zuul package for Precise is good enough for now. I have addressed the last few points yesterday :) ( https://gerrit.wikimedia.org/r/#/c/195272/ ) [12:15:20] !log Shutting down cp3014 for 10G upgrade [12:15:27] Logged the message, Master [12:16:49] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100% [12:19:38] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures [12:21:28] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:22:29] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [12:22:40] (03PS1) 10Andrew Bogott: Add resolv.conf alternates for new dns server. [puppet] - 10https://gerrit.wikimedia.org/r/201448 [12:22:58] 6operations, 10ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1173513 (10Cmjohnson) a:3Cmjohnson [12:25:30] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-R_________: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1173531 (10phuedx) Depending on the size of the list, parsing a massive configuration variable on every... 
[12:26:38] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:27:35] (03PS1) 10Alexandros Kosiaris: Add forward DNS RRs for ganeti01.svc.{eqiad,codfw} [dns] - 10https://gerrit.wikimedia.org/r/201449 [12:28:09] RECOVERY - Host cp3014 is UP: PING OK - Packet loss = 0%, RTA = 89.90 ms [12:31:02] 6operations, 10ops-esams: Upgrade cp3011-3014 with 10G cards - https://phabricator.wikimedia.org/T88684#1173544 (10mark) cp3012-3014 have been replaced and are back up and running. cp3011 I'll do after it has been fixed. At that point, we should also remove the old GigE links, which are still connected now to... [12:32:09] PROBLEM - puppet last run on db2023 is CRITICAL: CRITICAL: puppet fail [12:35:08] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [12:37:53] (03CR) 10Alexandros Kosiaris: "Error message is exactly the same." [debs/contenttranslation/apertium-dan] - 10https://gerrit.wikimedia.org/r/195897 (https://phabricator.wikimedia.org/T91493) (owner: 10KartikMistry) [12:38:30] (03CR) 10Alexandros Kosiaris: [C: 032] Add forward DNS RRs for ganeti01.svc.{eqiad,codfw} [dns] - 10https://gerrit.wikimedia.org/r/201449 (owner: 10Alexandros Kosiaris) [12:40:33] (03PS1) 10Faidon Liambotis: Pool esams back [dns] - 10https://gerrit.wikimedia.org/r/201450 [12:40:50] (03PS2) 10Faidon Liambotis: Pool esams back [dns] - 10https://gerrit.wikimedia.org/r/201450 [12:41:43] akosiaris: uhm, that space is LVS (i.e. BGP), how are you going to do that with ganeti? [12:42:29] (03PS1) 10Glaisher: Add 100/106 namespaces to be searched by default at frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201452 (https://phabricator.wikimedia.org/T94698) [12:42:39] !log upgrading junos on mr1-esams [12:42:50] Logged the message, Master [12:42:52] paravoid: ignore, stupidity kicked in.. 
I 'll amend [12:43:12] thanks for spotting it btw [12:43:45] :) [12:45:09] (03PS2) 10Mobrovac: Citoid: switch from localsettings.js to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/200356 [12:45:28] (03CR) 10Mobrovac: [C: 031] Citoid: switch from localsettings.js to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/200356 (owner: 10Mobrovac) [12:45:55] akosiaris: ok, good to go ^^ [12:47:22] (03CR) 10Faidon Liambotis: [C: 032] Pool esams back [dns] - 10https://gerrit.wikimedia.org/r/201450 (owner: 10Faidon Liambotis) [12:47:29] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:48:14] !log repooling esams [12:48:24] Logged the message, Master [12:49:55] RECOVERY - puppet last run on db2023 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:50:35] (03PS1) 10Alexandros Kosiaris: Assign correct IPs to ganeti01.svc.{codfw,eqiad} [dns] - 10https://gerrit.wikimedia.org/r/201456 [12:51:34] !log restarted opendj, pdns on neptunium, nembus, virt1000, labcontrol2001 [12:51:44] Logged the message, Master [12:56:13] (03PS1) 10Glaisher: Disably mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) [12:56:18] (03CR) 10jenkins-bot: [V: 04-1] Disably mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) (owner: 10Glaisher) [12:57:22] (03PS2) 10Glaisher: Disably mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) [12:58:20] (03CR) 10Alexandros Kosiaris: "Minor comment, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/200356 (owner: 10Mobrovac) [12:58:39] mobrovac: ^ [12:58:49] yup [12:59:57] akosiaris: afaik, no warning/error is emitted by erb/puppet if the value is undef, right? 
[13:00:19] if so, we're good, yaml and the service can handle it being empty [13:00:53] so proxy: [13:00:56] hmmm [13:01:13] :) [13:01:16] 6operations, 6MediaWiki-Core-Team, 7Wikimedia-log-errors: rbf1001 and rbf1002 are timing out / dropping clients for Redis - https://phabricator.wikimedia.org/T92591#1173562 (10Gilles) [13:01:21] but k, will put a guard [13:01:37] hmm, I was wondering how to convince you [13:01:41] so, thanks [13:02:30] (03CR) 10Revi: [C: 031] Disably mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) (owner: 10Glaisher) [13:03:21] btw, I would be fine with a comment as well. Anything to avoid surprise [13:04:22] 6operations, 10ops-eqiad: Verify visually that the labstore shelves' wiring is stable - https://phabricator.wikimedia.org/T94828#1173565 (10coren) 3NEW a:3Cmjohnson [13:05:19] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59655 bytes in 2.142 second response time [13:07:55] (03PS3) 10Mobrovac: Citoid: switch from localsettings.js to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/200356 [13:08:04] akosiaris: there ^ [13:08:21] akosiaris: also removed logrotate stuff, as logs are to be sent to logstash from now on [13:08:53] only ? [13:09:11] at some point we need a uniform policy for logging [13:09:47] <_joe_> I'd personally prefer to have local logs and to be able to kill logstash logs easily in case of need [13:09:50] I still like local logs for these types of low request number services [13:11:02] ok, can do that as well, np [13:11:20] _joe_: akosiaris: double log to logstash and syslog sounds ok ? [13:11:50] mobrovac: sounds perfect :-) [13:11:53] <_joe_> yep [13:13:15] euh, syslog needs a bit more work, it's not supported out of the box [13:13:22] how about a local file instead? 
[13:13:45] sounds fine [13:13:48] (with a promise that redirection to syslog will be available shortly) [13:13:57] +2 from me [13:14:04] a human promise, not a node.js promise :D [13:14:08] cool [13:14:09] lol [13:19:20] (03PS1) 10Glaisher: Set $wgRestrictDisplayTitle to false at cawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201460 (https://phabricator.wikimedia.org/T94346) [13:20:50] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1173620 (10Joe) [13:21:11] <_joe_> mobrovac: thanks for amending it btw [13:21:27] np [13:21:44] we still have to polish it a bit, i think [13:22:03] but we have the necessary ingredients for the first step [13:30:18] (03PS4) 10Mobrovac: Citoid: switch from localsettings.js to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/200356 [13:30:28] akosiaris: ^ [13:33:01] mobrovac: nice [13:33:03] merging [13:33:07] cool [13:33:15] btw [13:33:22] level: info [13:33:34] is that applying to both gelf and file ? 
[13:33:45] no, for gelf warn is used [13:33:51] info is only for local file logging [13:33:53] oh, a default [13:33:56] ok [13:34:02] to have a bit more logs locally [13:34:14] ok, sounds fine [13:34:33] (03CR) 10Alexandros Kosiaris: [C: 032] Citoid: switch from localsettings.js to config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/200356 (owner: 10Mobrovac) [13:35:05] cool, now wait for puppet to do its thing and then have fun with trebuchet [13:35:13] 6operations, 7Service-Architecture: Create a nagios check script that can monitor multiple endpoints based on what the service exposes - https://phabricator.wikimedia.org/T94831#1173633 (10Joe) 3NEW [13:35:26] 6operations, 7Service-Architecture: Create a nagios check script that can monitor multiple endpoints based on what the service exposes - https://phabricator.wikimedia.org/T94831#1173633 (10Joe) a:3Joe [13:37:35] mobrovac: done and citoid was reloaded by puppet [13:37:51] ok, will deploy now [13:37:56] akosiaris: thnx [13:38:01] (03CR) 10Alexandros Kosiaris: [C: 032] Assign correct IPs to ganeti01.svc.{codfw,eqiad} [dns] - 10https://gerrit.wikimedia.org/r/201456 (owner: 10Alexandros Kosiaris) [13:40:09] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, but remove the unneeded ganglia_aggregator defs" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/200734 (https://phabricator.wikimedia.org/T90271) (owner: 10RobH) [13:44:43] 6operations: boron passive checks aren't being collected - https://phabricator.wikimedia.org/T89983#1173642 (10Jgreen) hacked around this with puppet in frack, it clobbers the trusty binary with the one from precise, and modifies nsca-client.md5sums with the correct md5sum [13:45:03] 6operations, 10ops-fundraising, 10Wikimania-Hackathon-2015, 10Wikimedia-Hackathon-2015: overhaul fundraising cluster monitoring - https://phabricator.wikimedia.org/T91508#1173645 (10Jgreen) [13:45:05] 6operations: boron passive checks aren't being collected - 
https://phabricator.wikimedia.org/T89983#1173644 (10Jgreen) 5Open>3Resolved [13:50:51] (03PS1) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [13:54:47] Krenair: if/when you are up and about, I would love some help getting https://gerrit.wikimedia.org/r/#/c/201461/ merged and onto the proper branches [13:57:10] (03PS2) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [14:02:02] anomie: will you be around during SWAT? [14:02:15] kart_: Probably [14:02:27] anomie: need review/+2 on https://gerrit.wikimedia.org/r/201131 :) [14:02:43] anomie: or if you review and +1, someone can +2 on it. [14:04:04] PROBLEM - RAID on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:08] kart_: Seems like a straightforward backport, any SWATter should be able to +2 it during the window. [14:04:35] PROBLEM - configured eth on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:44] PROBLEM - dhclient process on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:45] PROBLEM - nutcracker port on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:04:55] PROBLEM - nutcracker process on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:05:15] PROBLEM - puppet last run on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:05:25] PROBLEM - salt-minion processes on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:05:25] PROBLEM - DPKG on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:05:25] anomie: yesterday, SWATer wasn't comfortable, you +1 from you is good idea. [14:05:35] PROBLEM - Disk space on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:05:37] anomie: thanks! 
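[editor's note] The 13:08 to 13:34 exchange above settles on logging Citoid to both logstash and a local file, with "info" for the local file and "warn" for gelf. A minimal sketch of that idea (one logger, two destinations, a different threshold on each) is given here in Python's standard logging module. This is not Citoid's config.yaml, which the log does not show; `build_logger`, `ListHandler` and the logger name are hypothetical, and `ListHandler` merely stands in for a GELF/logstash handler.

```python
import logging
import os
import tempfile


class ListHandler(logging.Handler):
    """Hypothetical stand-in for a remote (GELF/logstash) handler: collects messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def build_logger(local_path, remote_handler):
    """Attach a local file handler at INFO and a remote handler at WARNING."""
    logger = logging.getLogger("citoid-sketch")  # name is an assumption
    logger.setLevel(logging.DEBUG)   # let each handler apply its own threshold
    logger.propagate = False
    logger.handlers.clear()

    local = logging.FileHandler(local_path)
    local.setLevel(logging.INFO)     # more detail kept locally
    logger.addHandler(local)

    remote_handler.setLevel(logging.WARNING)  # only warnings and above leave the host
    logger.addHandler(remote_handler)
    return logger


def demo():
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    remote = ListHandler()
    logger = build_logger(path, remote)

    logger.info("request served")
    logger.warning("upstream slow")

    # Close the handlers so the file is fully written before reading it back.
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    with open(path) as f:
        local_lines = f.read().splitlines()
    os.remove(path)
    return local_lines, remote.messages
```

Calling `demo()` returns the lines written to the local file and the messages the stand-in remote handler received; only the warning reaches the latter.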
[14:06:04] PROBLEM - HHVM processes on mw2050 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [14:06:36] RECOVERY - nutcracker process on mw2050 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:07:05] RECOVERY - salt-minion processes on mw2050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:07:05] RECOVERY - DPKG on mw2050 is OK: All packages OK [14:07:24] RECOVERY - Disk space on mw2050 is OK: DISK OK [14:07:25] RECOVERY - RAID on mw2050 is OK: OK: no RAID installed [14:07:45] RECOVERY - HHVM processes on mw2050 is OK: PROCS OK: 1 process with command name hhvm [14:08:04] RECOVERY - configured eth on mw2050 is OK: NRPE: Unable to read output [14:08:05] RECOVERY - dhclient process on mw2050 is OK: PROCS OK: 0 processes with command name dhclient [14:08:15] RECOVERY - nutcracker port on mw2050 is OK: TCP OK - 0.000 second response time on port 11212 [14:09:31] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1173688 (10mobrovac) [14:10:15] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 6 failures [14:10:25] PROBLEM - Host cp3042 is DOWN: PING CRITICAL - Packet loss = 100% [14:12:04] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:12:05] PROBLEM - HHVM rendering on mw2050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:12:57] !log Jenkins: migrated Zuul cloner on Precise labs slaves (100[1-4] to a version provided by a Debian package. 
Jobs console output should now shows Zuul version: 2.0.0-304-g685ca22-wmf1precise1 [14:13:08] Logged the message, Master [14:13:35] RECOVERY - HHVM rendering on mw2050 is OK: HTTP OK: HTTP/1.1 200 OK - 65849 bytes in 0.448 second response time [14:15:31] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-R_________: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1173699 (10phuedx) @kaldari: since the lists are static, are there any advantages to storing 'em in JSO... [14:18:11] (03CR) 10Ottomata: [C: 031] admin: add pcoombe to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/201430 (https://phabricator.wikimedia.org/T94466) (owner: 10Filippo Giunchedi) [14:19:45] RECOVERY - Host cp3042 is UP: PING OK - Packet loss = 0%, RTA = 90.41 ms [14:20:12] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1173705 (10mark) [14:20:39] kart_ or anomie, I could use help with getting a patch into SWAT, if one of you has time to hold my hand. [14:20:54] andrewbogott: Sure [14:21:01] https://gerrit.wikimedia.org/r/#/c/201461/ [14:21:14] Needs a +2 but then also needs branch commits (which I haven’t done in ages) [14:21:25] greg suggested that there’s a semi-automatic way to do that in gerrit? [14:22:22] <^d> andrewbogott: "Cherry pick to" button [14:22:38] <^d> Then give it the destination wmf/* branch :) [14:22:42] ok… that’s after +2 or before? [14:22:47] andrewbogott: Partially. Step 1 is getting the +2 on the patch you linked. [14:22:53] 'k [14:23:09] <^d> Prefer +2'd first, but really you can at any time from gerrit's perspective :) [14:23:25] andrewbogott: Also, for step 3 you'll want a checkout of mediawiki/core on one of the wmf branches. [14:23:45] Which can take a long time, so if you don't already have one you might want to start now. 
[14:23:55] have, I think, lemme check [14:24:41] 6operations, 10Continuous-Integration: Get python-gear 0.5.5 to trusty-wikimedia and jessie-wikimedia - https://phabricator.wikimedia.org/T92684#1173711 (10hashar) [14:24:43] 6operations: Upload python-gear 0.5.5-2 to Debian project - https://phabricator.wikimedia.org/T89952#1173709 (10hashar) 5Open>3Resolved The package has been accepted. https://packages.qa.debian.org/p/python-gear/news/20150402T131852Z.html ``` Accepted: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256 Form... [14:24:44] And make sure it's had a recent fetch done [14:24:52] Is ‘step 3’ referring to docs someplace? I can’t quite conceive of why I’d have to do something locally if gerrit can cherry-pick to a branch already... [14:25:33] * andrewbogott fetches [14:25:33] andrewbogott: "step 3" is in my internal checklist. https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Updating_the_submodule is the actual docs. [14:27:11] ah, for the submodule commit. I suppose gerrit doesn’t do that [14:27:21] Unfortunately not [14:28:21] 6operations, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1173719 (10fgiunchedi) 3NEW [14:28:37] (03PS2) 10Filippo Giunchedi: admin: add pcoombe to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/201430 (https://phabricator.wikimedia.org/T94466) [14:28:42] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] admin: add pcoombe to statistics-users [puppet] - 10https://gerrit.wikimedia.org/r/201430 (https://phabricator.wikimedia.org/T94466) (owner: 10Filippo Giunchedi) [14:29:20] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1/EventLogging data for Pcoombe - https://phabricator.wikimedia.org/T94466#1173728 (10fgiunchedi) 5Open>3Resolved change merged, should propagate in 1hr or so [14:30:05] RECOVERY - Host cp3031 is UP: PING OK - Packet loss = 0%, RTA = 90.27 ms [14:30:05] RECOVERY - Host cp3035 is UP: PING OK - Packet 
loss = 0%, RTA = 87.81 ms [14:30:15] RECOVERY - Host cp3039 is UP: PING OK - Packet loss = 0%, RTA = 89.34 ms [14:30:15] RECOVERY - Host cp3030 is UP: PING OK - Packet loss = 0%, RTA = 89.98 ms [14:30:15] RECOVERY - Host cp3037 is UP: PING OK - Packet loss = 0%, RTA = 89.43 ms [14:30:15] RECOVERY - Host cp3034 is UP: PING OK - Packet loss = 0%, RTA = 89.75 ms [14:30:15] RECOVERY - Host cp3036 is UP: PING OK - Packet loss = 0%, RTA = 89.12 ms [14:30:15] RECOVERY - Host cp3032 is UP: PING OK - Packet loss = 0%, RTA = 88.67 ms [14:30:25] RECOVERY - Host cp3038 is UP: PING OK - Packet loss = 0%, RTA = 89.25 ms [14:30:35] RECOVERY - Host cp3033 is UP: PING OK - Packet loss = 0%, RTA = 89.03 ms [14:32:52] anomie: ok, there are the cherry-pick-to-branch commits. [14:33:06] But /they/ need +2 before I can make the submodule commits, right? [14:33:40] Probably. Trying to hack around that requirement would probably just break stuff. [14:33:55] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail [14:34:06] 6operations, 10ops-esams: Rack and configure asw-esams (new 2xQFX5100 stack) - https://phabricator.wikimedia.org/T91643#1173735 (10mark) [14:34:18] ok. So, does the +2 of the branch commits happen now, or during the SWAT window? [14:34:25] Or do I leave that to the commander-in-swat? [14:34:25] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail [14:34:34] PROBLEM - puppet last run on cp3033 is CRITICAL: CRITICAL: puppet fail [14:34:52] At least for the morning window, +2 the branch commits now and create the submodule updates against mediawiki/core but don't merge them. [14:34:59] ok [14:35:05] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: puppet fail [14:35:15] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail [14:35:16] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [14:35:17] I’m allowed to self-merge since it’s a branch commit? 
[14:35:27] YuviPanda: FYI https://phabricator.wikimedia.org/T94834 [14:35:34] 6operations, 10ops-esams: Rack and configure asw-esams (new 2xQFX5100 stack) - https://phabricator.wikimedia.org/T91643#1092139 (10mark) asw-oe10-esams has asset tag WMF4425, asw-oe13-esams has asset tag WMF4427. They've both been labeled. [14:35:37] Since it's a simple backport of something that was already merged, a self-merge is ok. [14:35:59] If you had to do complex rebasing for the cherry pick you might want someone else to look at it. [14:36:00] godog: I’m looking at that now, although probably will take a break for breakfast [14:36:33] andrewbogott: ok! thanks [14:36:54] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:37:54] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [14:38:01] (03PS3) 10Jforrester: Disable mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) (owner: 10Glaisher) [14:38:13] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1173756 (10faidon) Switch ports are configured, all 10 are reachable now. [14:38:30] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1173757 (10faidon) [14:38:45] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:39:45] RECOVERY - puppet last run on cp3033 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:40:24] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:43:57] Krenair, knee-jerk: your docs say to use git-pull. Never do that! 
https://wikitech.wikimedia.org/wiki/Help:Git_rebase [14:46:05] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [14:48:14] 6operations, 10ops-esams: Rack and configure asw-esams (new 2xQFX5100 stack) - https://phabricator.wikimedia.org/T91643#1174025 (10mark) a:5mark>3Cmjohnson These switches are now ready to go. These have been put in racktables and in the right rack locations, but need some extra information (serials, purch... [14:48:26] 6operations, 10ops-esams: Rack and configure asw-esams (new 2xQFX5100 stack) - https://phabricator.wikimedia.org/T91643#1174028 (10mark) p:5Normal>3Low [14:50:11] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1174036 (10mark) [14:50:13] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1174037 (10mark) [14:50:15] 6operations, 10ops-esams, 7HTTPS, 3HTTPS-by-default: esams power capacity issues - https://phabricator.wikimedia.org/T90000#1174034 (10mark) 5Open>3Resolved After moving 10 servers out of OE10, with all remaining 10 servers still on one 16A, I ran stress on all of them. Power usage got up to ~3300W, wh... [14:52:07] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1174039 (10mark) a:5mark>3Cmjohnson These servers are now ready for use. After we complete the racktables entries (for which all information should be available in this ticket), we can resolve this ticket... 
[14:52:17] 6operations, 10ops-esams: Rack, cable, prepare cp3030-3049 - https://phabricator.wikimedia.org/T92514#1174041 (10mark) p:5High>3Low [14:53:08] 6operations, 10ops-eqiad: rename platinum, gold, mercury, thallium to ganeti100{1,2,3,4} respectively - https://phabricator.wikimedia.org/T94839#1174043 (10akosiaris) 3NEW [14:54:41] ok… anomie, https://gerrit.wikimedia.org/r/#/c/201472/ and https://gerrit.wikimedia.org/r/#/c/201475/ — look right? [14:55:09] andrewbogott: At a glance, yes [14:56:51] anomie: ok, thanks [14:57:35] (03CR) 10Ottomata: "Hey, I want to have a proper discussion about this, so anytime anyone wants to find me to talk about it, in email, in google hangout, in I" [puppet] - 10https://gerrit.wikimedia.org/r/196335 (https://phabricator.wikimedia.org/T92560) (owner: 10Eevans) [14:59:15] 6operations, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174079 (10hashar) [14:59:27] 6operations, 10ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1174082 (10Cmjohnson) /admin1-> racadm getsel Record: 1 Date/Time: 03/19/2014 06:51:25 Source: system Severity: Ok Description: Log cleared. --------------------------------------------------------------... [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur, Krenair, manybubbles: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150402T1500). [15:00:15] I am here! [15:00:33] heya manybubbles [15:00:44] Is I the swatter? [15:00:56] Hi. I'm here too. [15:01:33] manybubbles: Go for it. 
[15:01:36] on it [15:01:43] 6operations, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174093 (10hashar) [15:01:47] (03CR) 10Manybubbles: [C: 032] CX: Enable newarticle campaign in cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197491 (owner: 10KartikMistry) [15:02:25] PROBLEM - Host ganeti1003 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:27] manybubbles: you also need to merge other patch to get above working :) [15:02:41] kart_: oh so the other one first? [15:02:48] both at the same time? [15:02:49] yes [15:02:59] first [15:03:12] (03Merged) 10jenkins-bot: CX: Enable newarticle campaign in cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/197491 (owner: 10KartikMistry) [15:03:25] PROBLEM - NTP on cp3037 is CRITICAL: NTP CRITICAL: Offset unknown [15:03:27] manybubbles: core is first, but that's fine. [15:03:36] kart_: k. I'll get core out first. [15:05:06] RECOVERY - NTP on cp3037 is OK: NTP OK: Offset -0.006937861443 secs [15:06:12] 6operations, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174109 (10hashar) modules/puppet/manifests/self/client.pp has: ``` # We'd best be sure that our ldap config is set up properly # before puppet goes to work, though. clas... [15:06:54] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1174110 (10GWicke) Yeah, we established that production is not actually broken, despite the compiler output. Since the compiler w... 
[15:09:44] (03CR) 10Manybubbles: [C: 031] Disable mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) (owner: 10Glaisher) [15:10:05] (03CR) 10Manybubbles: [C: 031] Set $wgRestrictDisplayTitle to false at cawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201460 (https://phabricator.wikimedia.org/T94346) (owner: 10Glaisher) [15:10:16] manybubbles: +1? :) [15:10:35] 6operations, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174124 (10Joe) what filippo is reporting is that at least one of those files is not defined in puppet by default, so that class fails, even if the file is somehow managed (like, in... [15:10:41] Glaisher: just marking it +1 so I will not forget and rereview when I get to it later [15:10:58] oh, okay [15:11:12] (03CR) 10Papaul: [C: 031] remove haedus,capella from hiera, DHCP, netboot [puppet] - 10https://gerrit.wikimedia.org/r/201395 (https://phabricator.wikimedia.org/T94474) (owner: 10Dzahn) [15:11:23] (03CR) 10Manybubbles: [C: 031] "Do this one last - it needs the Cirrus saneitizer run after merging I believe." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201452 (https://phabricator.wikimedia.org/T94698) (owner: 10Glaisher) [15:12:05] 6operations, 10ops-eqiad: Verify visually that the labstore shelves' wiring is stable - https://phabricator.wikimedia.org/T94828#1174130 (10coren) p:5Unbreak!>3Normal Due to labstore1002 not having been power cycled, it still sees the cuplrit shelf as lame so it was identified to be the most recent additio... [15:12:43] 6operations, 10ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1174134 (10Cmjohnson) swapped DIMM A2 with DIMM B2 and the error followed the DIMM. This will need a DIMM replacement from Dell. Error: Memory initialization warning detected. 
MEMBIST Memory Test failure DIMM B2 [15:12:45] (03PS1) 10Giuseppe Lavagetto: analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) [15:12:54] <_joe_> ottomata: you're served! [15:13:43] andrewbogott: I'll do your changes after I do kart_'s - so about 10 minutes? more or less [15:13:59] hahah oh my [15:14:12] _joe_, we'll need to test labs! [15:14:13] manybubbles: sounds good, thanks [15:14:15] <_joe_> ottomata: it's most surely a noop [15:14:20] <_joe_> labs? why? [15:14:41] hmm, wait, you didn't change the role class at all [15:14:43] hmmmMMMm [15:14:48] <_joe_> nope [15:14:54] ok reading [15:15:06] <_joe_> btw, the puppet compiler will help us here [15:15:11] 7Blocked-on-Operations, 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation, and 2 others: Create a Debian package for Zuul - https://phabricator.wikimedia.org/T48552#1174146 (10hashar) Thanks to all @fgiunchedi reviews, I used patchset 16 of https://gerrit.wikimedia.org/r/#/c/195272/... 
[15:15:15] <_joe_> gonna run it in a few [15:15:27] !log manybubbles Synchronized php-1.25wmf23/includes/User.php: SWAT user preferences load from the master by default (duration: 00m 12s) [15:15:36] Logged the message, Master [15:16:23] !log manybubbles Synchronized wmf-config/CommonSettings.php: SWAT enable newarticle campaign on cawiki 1/3 (duration: 00m 12s) [15:16:30] Logged the message, Master [15:16:31] !log ignore last log - its a noop failure on my part [15:16:38] Logged the message, Master [15:16:59] !log manybubbles Synchronized wmf-config/CommonSettings.php: SWAT enable newarticle campaign on cawiki 1/3 (duration: 00m 13s) [15:17:03] _joe_: quick thang: analytics1004 and analytics1010 are ciscos and not in production anymore [15:17:06] Logged the message, Master [15:17:20] !log manybubbles Synchronized wmf-config/CommonSettings-labs.php: SWAT enable newarticle campaign on cawiki 2/3 (duration: 00m 12s) [15:17:25] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1174149 (10mobrovac) Trusty: - Ruby 1.9 - Puppet 3.4.3 Jessie: - Ruby 2.1 - Puppet 3.7 Wrt Ruby, there are no major incompatibi... [15:17:26] Logged the message, Master [15:18:02] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT enable newarticle campaign on cawiki 3/3 (duration: 00m 14s) [15:18:04] kart_: ^^^^^ you are live now [15:18:10] Logged the message, Master [15:18:24] manybubbles: thank you [15:18:29] manybubbles: thanks! [15:18:30] hm, _joe_, ok i have questions, want to ping me when you are back? [15:18:31] it is kids / dinner time. Have a good evening everyone [15:18:47] 6operations, 10ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1174151 (10Cmjohnson) Congratulations: Work Order WO6747231 was successfully submitted. 
[15:19:12] (03CR) 10Manybubbles: [C: 032] Disable mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) (owner: 10Glaisher) [15:19:25] manybubbles: let me know when the sync is done and I’ll test [15:19:37] andrewbogott: it'll be a few minutes before jenkins merges [15:20:36] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1174154 (10GWicke) We could consider using the test/example request/response pairs from the swagger spec. An attribute could mark an end point as 'this should be monitored'. Fo... [15:20:55] manybubbles: based on testing by me and kart_: our patches work as expected [15:21:02] Nikerabbit: hot! [15:25:54] <_joe_> ottomata: so can we remove them from site.pp altogether? [15:27:39] _joe_: yes, I am not sure the proper thing to do here, because I don't want to turn them off. since no one else wants them, and since they are beefy, they would be useful for some realtime framework testing i might like to do someday [15:27:59] i wouldn't use them in production ever, but they still work, and they would be good for testing stuff [15:28:12] so, until someone needs the rackspace or whatever, i'd like to keep them [15:28:18] so it might be worth keeping them in puppet so people know they exist [15:28:21] not sure. [15:29:46] zuul - I'm waiting [15:32:29] <_joe_> ottomata: if they are turned on, they should stay in puppet [15:34:35] 6operations, 10ops-codfw: ganeti2002 has an unresponsive iDRAC - https://phabricator.wikimedia.org/T94827#1174217 (10Papaul) 5Open>3Resolved a:3Papaul @Akosiaris had to unplug the server for while to resolve the problem. You can access the server now. [15:34:57] (03Merged) 10jenkins-bot: Disable mobile IP editing at kowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201457 (https://phabricator.wikimedia.org/T94388) (owner: 10Glaisher) [15:35:04] Finally. 
[15:35:56] !log manybubbles Synchronized php-1.25wmf24/extensions/OpenStackManager/: SWAT update openstackmanager extension (duration: 00m 11s) [15:36:06] Logged the message, Master [15:36:11] andrewbogott: ^^^ [15:36:20] manybubbles: thanks [15:37:06] 6operations, 10ops-eqiad: rename platinum, gold, mercury, thallium to ganeti100{1,2,3,4} respectively - https://phabricator.wikimedia.org/T94839#1174225 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Done, thx for the ticket [15:39:59] !log manybubbles Synchronized php-1.25wmf23/extensions/OpenStackManager/: SWAT update openstackmanager extension (duration: 00m 14s) [15:40:09] Logged the message, Master [15:40:25] andrewbogott: ^^^^^ [15:42:15] !log manybubbles Synchronized wmf-config/CommonSettings.php: SWAT disable mobile ip editing at kowiki 1/2 (duration: 00m 11s) [15:42:23] Logged the message, Master [15:42:40] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT disable mobile ip editing at kowiki 2/2 (duration: 00m 12s) [15:42:42] Glaisher: ^^^^^^ [15:42:44] Logged the message, Master [15:42:56] (03CR) 10Manybubbles: [C: 032] Set $wgRestrictDisplayTitle to false at cawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201460 (https://phabricator.wikimedia.org/T94346) (owner: 10Glaisher) [15:44:38] (03PS2) 10Giuseppe Lavagetto: analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) [15:44:47] (03CR) 10Manybubbles: [C: 032] Add 100/106 namespaces to be searched by default at frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201452 (https://phabricator.wikimedia.org/T94698) (owner: 10Glaisher) [15:44:50] 6operations, 10ops-esams, 10procurement: Buy fiber patches - https://phabricator.wikimedia.org/T94846#1174251 (10mark) 3NEW [15:44:54] manybubbles: looks like it's working [15:45:04] Glaisher: sweet. 
two more to go [15:45:59] (03Merged) 10jenkins-bot: Set $wgRestrictDisplayTitle to false at cawikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201460 (https://phabricator.wikimedia.org/T94346) (owner: 10Glaisher) [15:48:19] (03Merged) 10jenkins-bot: Add 100/106 namespaces to be searched by default at frwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201452 (https://phabricator.wikimedia.org/T94698) (owner: 10Glaisher) [15:48:56] !log manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT Set $wgRestrictDisplayTitle to false at cawikimedia (duration: 00m 11s) [15:49:00] Glaisher: ^^^^^^^^ [15:49:06] Logged the message, Master [15:49:28] https://ca.wikimedia.org/wiki/User:Glaisher working [15:49:55] <_joe_> ottomata: ach, found another small error, fixing [15:50:05] !log last sync accidentally picked up 'Add 100/106 namespaces to be searched by default at frwiktionary' - that one might require a cirrus script to finish running before its working properly [15:50:13] Logged the message, Master [15:51:18] !log actually that last patch seems to be working too. cool. sweet. still running the cirrus script just in case. [15:51:26] Logged the message, Master [15:51:37] Glaisher: ^^^ I just verified the last patch. [15:51:43] * manybubbles thinks he's done with swat [15:52:45] so, _joe_ role X is a replacement for mainrole yaml files? [15:53:20] <_joe_> ottomata: a better one, yes [15:54:20] cool, i like it better too [15:54:25] at least the way it looks :) [15:54:30] manybubbles: thanks for long sweat! [15:54:32] fewer files [15:54:47] * kart_ will go ahead with merging patch for SWAT [15:54:53] manybubbles: thanks! [15:55:01] (03PS1) 10John F. Lewis: shinken: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/201479 [15:55:19] kart_: did I forget one? [15:55:20] cool. 
_joe_, as far as I can tell this will be a no op, and since you didn't have to change the actual manifests at all [15:55:23] should be fine in labs too [15:55:32] (03CR) 10Ottomata: [C: 031] analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) (owner: 10Giuseppe Lavagetto) [15:56:02] manybubbles: this is next deployment :) [15:56:08] kart_: got it [15:56:18] manybubbles: ContentTranslation. [15:57:21] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] shinken: typo fix [puppet] - 10https://gerrit.wikimedia.org/r/201479 (owner: 10John F. Lewis) [15:58:37] (03PS3) 10Giuseppe Lavagetto: analytics: use role, hiera [puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) [15:59:47] 6operations, 10ops-eqiad, 10ops-fundraising: barium has a failed HDD - https://phabricator.wikimedia.org/T93899#1174331 (10Cmjohnson) received the new disk. This will require downtime. Sent an email to FR-ALL for Tuesday 4/7 at 930est. [16:00:05] kart_, ^d: Respected human, time to deploy Content Translation/cxserver (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150402T1600). Please do the needful. [16:00:12] I see Glaisher like closing down wikis ;) [16:00:14] yes, jouncebot [16:00:33] waiting to get patches merged for CX. [16:00:37] Not really. [16:03:49] (03CR) 10Ottomata: "I ain't scrrd!" 
[puppet] - 10https://gerrit.wikimedia.org/r/201477 (https://phabricator.wikimedia.org/T86774) (owner: 10Giuseppe Lavagetto) [16:08:51] <_joe_> ottomata: I'll merge it in the next few days, I'm too tired to spend ~ 1 hour verifying I didn't break anything [16:08:56] <_joe_> :) [16:12:41] !log kartik Started scap: Update ContentTranslation [16:12:50] Logged the message, Master [16:16:25] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 3539 MB (3% inode=98%): [16:18:23] ^ looking [16:22:35] l [16:22:37] k [16:23:25] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: / 3522 MB (3% inode=98%): [16:25:25] !log reload uwsgi on graphite1001 [16:25:32] Logged the message, Master [16:25:52] ahh files removed but fd left open, interview question time [16:26:45] RECOVERY - Disk space on graphite1001 is OK: DISK OK [16:41:09] 6operations, 10Wikimedia-Labs-General: role::puppet::self broken on new labs instances - https://phabricator.wikimedia.org/T94834#1174469 (10Andrew) 5Open>3Resolved a:3Andrew This was a result of https://gerrit.wikimedia.org/r/#/c/201461/, an attempt to not override GUI settings with default new-instance... [16:42:55] 6operations: Fatal errors not going to fatal.log - https://phabricator.wikimedia.org/T94854#1174486 (10EBernhardson) 3NEW [16:43:30] 6operations: Fatal errors not going to fatal.log - https://phabricator.wikimedia.org/T94854#1174493 (10hoo) [16:45:24] 6operations: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1174501 (10EBernhardson) 3NEW [16:48:28] 6operations, 6Release-Engineering: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1174515 (10EBernhardson) [16:49:54] andrewbogott, I wonder if the assignees on some of https://phabricator.wikimedia.org/maniphest/query/EHePOvZiFKRt/#R should be changed [16:51:15] Krenair: probably! 
That’s from the bugzilla import, I have no idea how many of those are still relevant. [16:51:37] 6operations, 10MediaWiki-Logging, 6Release-Engineering, 7HHVM: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1174518 (10greg) [16:52:31] 6operations, 10ops-codfw, 3codfw-appserver-setup, 3wikis-in-codfw: mw2208-2209, mw2213 have unreachable mgmt interfaces - https://phabricator.wikimedia.org/T93857#1174522 (10Papaul) i was on the phone with Dell support for the IDRAC problem on mw2208. After 40 minutes of troubleshooting the Engineer came t... [16:52:37] andrewbogott, yeah... e.g. https://phabricator.wikimedia.org/T45603 - https://wikitech.wikimedia.org/wiki/Special:Contributions/127.0.0.1 [16:52:41] not since july [16:52:43] !log kartik Finished scap: Update ContentTranslation (duration: 40m 01s) [16:52:51] Logged the message, Master [17:07:15] 6operations, 10Deployment-Systems: Use FQDNs for mediawiki-installation - https://phabricator.wikimedia.org/T93983#1174579 (10greg) 5Open>3Resolved >>! In T93983#1157794, @Dzahn wrote: >>>! In T93983#1151838, @bd808 wrote: >> The fix will be to update `mediawiki-installation` which is currently maintained... 
[17:12:12] (03PS3) 10Dzahn: dumps: add rsync client hostnames to hiera data [puppet] - 10https://gerrit.wikimedia.org/r/201404 [17:13:29] (03CR) 10Dzahn: dumps: add rsync client hostnames to hiera data (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/201404 (owner: 10Dzahn) [17:14:39] (03CR) 10Dzahn: [C: 032] "just adding them so they can be used in manifests next" [puppet] - 10https://gerrit.wikimedia.org/r/201404 (owner: 10Dzahn) [17:15:16] !log kartik Synchronized php-1.25wmf23/extensions/ContentTranslation/modules/campaigns/ext.cx.campaigns.contributionsmenu.js: (no message) (duration: 00m 15s) [17:15:25] Logged the message, Master [17:15:54] (03CR) 10Dzahn: "added to hiera now instead" [puppet] - 10https://gerrit.wikimedia.org/r/188188 (owner: 10Dzahn) [17:17:03] (03CR) 10Dzahn: "should now use hiera data instead (from https://gerrit.wikimedia.org/r/#/c/201404/3)" [puppet] - 10https://gerrit.wikimedia.org/r/188204 (owner: 10Dzahn) [17:17:33] how does one purge resource loader cache for a given module? I want messages to appear in https://bits.wikimedia.org/es.wikipedia.org/load.php?debug=true&lang=en&modules=ext.cx.campaigns.contributionsmenu&skin=vector [17:20:15] ^d: ^^ [17:20:37] <^d> i ono [17:22:25] ah [17:28:37] (03PS3) 10Chad: Move web::sites to web::prod_sites; begin unification in new class [puppet] - 10https://gerrit.wikimedia.org/r/197655 [17:31:32] (03PS4) 10Chad: Move web::sites to web::prod_sites; begin unification in new class [puppet] - 10https://gerrit.wikimedia.org/r/197655 [17:31:35] (03PS3) 10Dzahn: dumps: ferm service for rsyncd clients using hiera [puppet] - 10https://gerrit.wikimedia.org/r/188204 [17:40:52] (03PS1) 10Chad: Remove labs config of swift backups. 
Testing done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201490 [17:42:39] (03PS1) 10Chad: Remove search alternatives, already gone from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201491 [17:45:37] (03PS1) 10Chad: Remove labs-specific search all fields config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201493 [17:46:43] (03PS1) 10Chad: Remove labs-specific regex plugin configuration, use default from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201494 [17:48:51] <^d> manybubbles: When you get a chance (it's totally not urgent) that chain of commits could use review before I sync them ^ [18:01:50] (03CR) 10Manybubbles: [C: 031] Remove labs config of swift backups. Testing done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201490 (owner: 10Chad) [18:02:15] (03CR) 10Manybubbles: [C: 031] "Good idea." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201491 (owner: 10Chad) [18:02:33] (03CR) 10Manybubbles: [C: 031] Remove labs-specific search all fields config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201493 (owner: 10Chad) [18:02:49] (03CR) 10Manybubbles: [C: 031] Remove labs-specific regex plugin configuration, use default from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201494 (owner: 10Chad) [18:02:57] (03CR) 10Manybubbles: "Thanks for catching all these!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201494 (owner: 10Chad) [18:14:45] PROBLEM - puppet last run on rdb2002 is CRITICAL: CRITICAL: puppet fail [18:17:25] <^d> manybubbles: Just trying to reduce deltas between environments :) [18:17:32] you are a good man [18:18:17] delta force [18:18:17] (03CR) 10Chad: [C: 032] Remove labs config of swift backups. 
Testing done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201490 (owner: 10Chad) [18:18:23] (03CR) 10Chad: [C: 032] Remove search alternatives, already gone from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201491 (owner: 10Chad) [18:18:27] (03CR) 10Chad: [C: 032] Remove labs-specific search all fields config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201493 (owner: 10Chad) [18:18:31] (03CR) 10Chad: [C: 032] Remove labs-specific regex plugin configuration, use default from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201494 (owner: 10Chad) [18:19:52] <^d> And of course, I get caught behind 2 or 3 mw/core gate and submits [18:20:34] ^d: you deploying config? [18:20:50] can you do https://gerrit.wikimedia.org/r/#/c/200898/ [18:20:55] I forgot to do it during swat [18:21:05] <^d> I was going to pull it in to tin, it's a no-op in prod [18:23:56] k. mine can wait [18:26:35] (03CR) 10Chad: [C: 032] Remove "using new search" message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200898 (owner: 10Manybubbles) [18:31:44] RECOVERY - puppet last run on rdb2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:34:59] (03PS2) 10Dzahn: remove haedus,capella from hiera, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/201395 (https://phabricator.wikimedia.org/T94474) [18:35:01] (03Merged) 10jenkins-bot: Remove labs config of swift backups. 
Testing done [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201490 (owner: 10Chad) [18:35:03] (03Merged) 10jenkins-bot: Remove search alternatives, already gone from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201491 (owner: 10Chad) [18:35:05] (03Merged) 10jenkins-bot: Remove labs-specific search all fields config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201493 (owner: 10Chad) [18:35:07] (03Merged) 10jenkins-bot: Remove labs-specific regex plugin configuration, use default from prod [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201494 (owner: 10Chad) [18:35:09] (03Merged) 10jenkins-bot: Remove "using new search" message [mediawiki-config] - 10https://gerrit.wikimedia.org/r/200898 (owner: 10Manybubbles) [18:36:12] (03PS3) 10Dzahn: remove haedus,capella from hiera, DHCP [puppet] - 10https://gerrit.wikimedia.org/r/201395 (https://phabricator.wikimedia.org/T94474) [18:37:00] !log demon Synchronized wmf-config/CirrusSearch-common.php: turn off "yay new search!!" msg. old news now (duration: 00m 11s) [18:37:07] Logged the message, Master [18:37:14] (03PS4) 10Dzahn: remove hiera admin groups from haedus,capella [puppet] - 10https://gerrit.wikimedia.org/r/201395 (https://phabricator.wikimedia.org/T94474) [18:37:26] !log demon Synchronized wmf-config/CirrusSearch-labs.php: cleanups for labs, no-op (duration: 00m 12s) [18:37:27] <^d> manybubbles: went ahead and did it for you [18:37:31] Logged the message, Master [18:37:33] thanks! [18:37:53] I'll fix mw:Deployments [18:38:07] kart_: Can I ask you some questions about your CX messages deployment problems? [18:40:42] <^d> manybubbles: "A search box automatically queries the database if its contents do not exactly match a page name." [18:40:49] ^d: Are you done deploying things? [18:40:51] <^d> what does enwiki mean lol? [18:40:59] <^d> RoanKattouw: yeah, all yours [18:41:02] OK [18:41:18] ^d: ? 
[18:41:30] <^d> Last sentence of the lead of https://en.wikipedia.org/wiki/Help:Searching [18:41:31] 6operations, 6Mobile-Web, 3Mobile-Web-Sprint-44-R_________: Spike: figure out the simplest possible way to apply tags to a large group of articles on en wikipedia - https://phabricator.wikimedia.org/T94755#1175003 (10kaldari) I was thinking about the left-field possibility (does that translate into British E... [18:42:26] <^d> Also, https://en.wikipedia.org/wiki/Help:Searching#Delay_in_updating_the_search_index is horribly out of date [18:43:05] <^d> I'm going to edit the latter [18:43:29] !log Running clearMessageBlobs.php [18:43:37] Logged the message, Mr. Obvious [18:44:48] ^d: I guess it's a reference to the "go" feature, trying to be less technical (with little success) [18:45:06] <^d> The go feature doesn't hit the DB either if we can avoid it [18:46:06] ^d: that's what the sentence tries to say [18:46:30] Useless info anyway, for a help page [18:47:00] (03CR) 10Dzahn: [C: 032] remove hiera admin groups from haedus,capella [puppet] - 10https://gerrit.wikimedia.org/r/201395 (https://phabricator.wikimedia.org/T94474) (owner: 10Dzahn) [18:47:46] <^d> Nemo_bis: rm'd [18:50:35] (03PS2) 10Dzahn: remove orientdb role from haedus and capella [puppet] - 10https://gerrit.wikimedia.org/r/201394 (https://phabricator.wikimedia.org/T94474) [18:50:44] PROBLEM - puppet last run on mw2095 is CRITICAL: CRITICAL: puppet fail [18:52:13] !log running puppet on mw2095 - proxy error [18:52:20] Logged the message, Master [18:53:08] (03CR) 10Dzahn: [C: 032] remove orientdb role from haedus and capella [puppet] - 10https://gerrit.wikimedia.org/r/201394 (https://phabricator.wikimedia.org/T94474) (owner: 10Dzahn) [18:54:04] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:54:23] 6operations, 5Patch-For-Review: reclaim / decom haedus and capella - https://phabricator.wikimedia.org/T94474#1175059 (10Dzahn) removed 
admin groups and system role, kept in DHCP and site.pp with just "standard,admin" to be reclaimed for something else [18:54:46] 6operations, 5Patch-For-Review: reclaim / decom haedus and capella - https://phabricator.wikimedia.org/T94474#1175061 (10Dzahn) next: wipe and reinstall, i assume [19:01:49] 6operations, 7HTTPS, 3HTTPS-by-default: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#1175108 (10BBlack) As I see it today, basically the remaining software-level plan for bringing up esams capacity goes something like this (subject to change as work is done and s... [19:01:55] 6operations, 5Patch-For-Review: Install phpunit on caesium - https://phabricator.wikimedia.org/T94486#1175109 (10Dzahn) 5Open>3declined caesium is precise. simulated install pulls these: Inst php5-common (5.3.10-1ubuntu3.17+wmf1ubuntu1 Wikimedia:12.04/precise-wikimedia [amd64]) Inst php5-cli (5.3.10-1ubu... [19:02:49] 6operations, 5Patch-For-Review: Install phpunit on caesium - https://phabricator.wikimedia.org/T94486#1175117 (10Dzahn) 5declined>3Open [19:33:18] (03CR) 10Ori.livneh: "@Krinkle yeah, good idea." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/201050 (owner: 10Ori.livneh) [19:33:36] (03PS2) 10Ori.livneh: Set $wgLogoHD for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201050 [19:35:55] (03CR) 10Ori.livneh: [C: 032] Set $wgLogoHD for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201050 (owner: 10Ori.livneh) [19:36:00] (03Merged) 10jenkins-bot: Set $wgLogoHD for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201050 (owner: 10Ori.livneh) [19:37:23] !log ori Synchronized wmf-config/InitialiseSettings.php: I3bbf2418d: Set $wgLogoHD for enwiki (duration: 00m 12s) [19:37:28] Logged the message, Master [19:49:54] 6operations, 6Commons, 6Multimedia, 7HHVM, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1175265 (10Eloquence) @Joe, do we need to push this to next week? [19:50:00] 6operations, 5Patch-For-Review: Install phpunit on caesium - https://phabricator.wikimedia.org/T94486#1175266 (10demon) a:5demon>3None [19:50:05] (03CR) 10Krinkle: "Removed from MediaWiki:Common.css, works great. https://en.wikipedia.org/w/index.php?title=MediaWiki%3ACommon.css&diff=654676782&oldid=653" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201050 (owner: 10Ori.livneh) [19:54:17] 6operations, 6Commons, 6Multimedia, 7HHVM, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1175272 (10ori) >>! In T84842#1175265, @Eloquence wrote: > @Joe, do we need to push this to next week? At minimum. Giuseppe ran into segfaults with Tim's output buffer p... [19:55:57] 6operations, 6Commons, 6Multimedia, 7HHVM, and 4 others: Convert Imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#931787 (10Eloquence) OK, let's schedule to a specific week once we have confidence on the ETA. 
[20:09:58] (03CR) 10Aaron Schulz: Add pool counter config for Translate (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199263 (https://phabricator.wikimedia.org/T54728) (owner: 10Nikerabbit) [20:31:14] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:55] 6operations, 5Patch-For-Review: Install phpunit on caesium - https://phabricator.wikimedia.org/T94486#1175438 (10Dzahn) >>! In T94486#1166608, @Aklapper wrote: > Is this ops territory? actually. why? [20:35:13] 6operations, 10ops-eqiad: ganeti1003 DIMM problem - https://phabricator.wikimedia.org/T94825#1175440 (10Cmjohnson) Dear Johnson, Christopher, Your dispatch shipped on 4/2/2015 4:30:51 PM [20:35:51] (03CR) 10Dzahn: [C: 032] install phpunit on release server [puppet] - 10https://gerrit.wikimedia.org/r/200872 (https://phabricator.wikimedia.org/T94486) (owner: 10Dzahn) [20:36:32] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/201389 (https://phabricator.wikimedia.org/T92680) (owner: 10Dzahn) [20:37:07] 6operations, 5Patch-For-Review: Install phpunit on caesium - https://phabricator.wikimedia.org/T94486#1175457 (10Dzahn) a:3Dzahn [20:37:30] 6operations, 5Patch-For-Review: Install phpunit on caesium - https://phabricator.wikimedia.org/T94486#1175458 (10Dzahn) 5Open>3Resolved Notice: /Stage[main]/Releases/Package[phpunit]/ensure: ensure changed 'purged' to 'present' [20:38:05] 6operations, 7Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1175461 (10faidon) 3NEW [20:38:41] (03Abandoned) 10Dzahn: remove haedus/capella, decom [dns] - 10https://gerrit.wikimedia.org/r/201397 (https://phabricator.wikimedia.org/T94474) (owner: 10Dzahn) [20:39:15] ori, bblack ^ [20:40:44] (03CR) 10Tim Landscheidt: [C: 031] "Apart from moving the "options" line one down, this seems to be a noop on Labs instances, so fine with me. 
But I sincerely dislike the "u" [puppet] - 10https://gerrit.wikimedia.org/r/201448 (owner: 10Andrew Bogott) [20:41:57] (03CR) 10Andrew Bogott: "It's not too late to rename! What would you suggest? use_dnsmasq_dns?" [puppet] - 10https://gerrit.wikimedia.org/r/201448 (owner: 10Andrew Bogott) [20:42:17] (03PS3) 10Dzahn: icinga: give Moritz permissions to run commands [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) [20:43:03] (03CR) 10Andrew Bogott: "Oh, actually, I withdraw the offer; renaming it will be a drag :(" [puppet] - 10https://gerrit.wikimedia.org/r/201448 (owner: 10Andrew Bogott) [20:43:17] (03CR) 10Dzahn: [C: 032] "Filippo is right" [puppet] - 10https://gerrit.wikimedia.org/r/201251 (https://phabricator.wikimedia.org/T94717) (owner: 10Dzahn) [20:48:05] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:59:46] (03CR) 10Tim Landscheidt: "As said, keeping "use_dnsmasq" is fine with me. 
In general, I would try to leave dnsmasq out of the name: In this case, it's the DNS stru" [puppet] - 10https://gerrit.wikimedia.org/r/201448 (owner: 10Andrew Bogott) [21:00:04] (03PS3) 10Dzahn: move jobqueue monitoring out of ganglia.pp [puppet] - 10https://gerrit.wikimedia.org/r/199942 (https://phabricator.wikimedia.org/T93776) [21:00:23] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/675/change/199942/html/terbium.eqiad.wmnet.html" [puppet] - 10https://gerrit.wikimedia.org/r/199942 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [21:03:31] (03PS3) 10Dzahn: ganglia: remove class ganglia::logtailer [puppet] - 10https://gerrit.wikimedia.org/r/199943 (https://phabricator.wikimedia.org/T93776) [21:03:59] (03PS1) 10Yuvipanda: labs: Add monitoring for high iowait on labstore instances [puppet] - 10https://gerrit.wikimedia.org/r/201591 (https://phabricator.wikimedia.org/T94606) [21:05:02] Coren: andrewbogott ^ [21:05:19] I’m also thinking of putting one in for ‘load’ as well, but only as a ‘catchall' [21:05:38] like, over 50 for 5 minutes [21:05:43] I think that should need alerting [21:06:03] looking at graphite, that would’ve alerted us a few minutes earlier [21:06:31] (03CR) 10coren: [C: 031] "Sane." [puppet] - 10https://gerrit.wikimedia.org/r/201591 (https://phabricator.wikimedia.org/T94606) (owner: 10Yuvipanda) [21:09:07] ganglia.wmflabs.org is 502 Bad Gateway - do we use it? [21:12:54] (03CR) 10Dzahn: [C: 032] ganglia: remove class ganglia::logtailer [puppet] - 10https://gerrit.wikimedia.org/r/199943 (https://phabricator.wikimedia.org/T93776) (owner: 10Dzahn) [21:14:17] mutante: we don't [21:16:20] YuviPanda: ok, saw the updates. 
maybe it should have a ticket to remove the remnants [21:18:14] kind of wishes it was working to test production changes on ganglia [21:18:41] but i guess it's like nagios.wmflabs.org was [21:18:53] separate setup [21:18:55] yeah [21:18:58] well [21:19:06] ganglia also couldn’t be *hosted* on a labs machine [21:19:11] they’re just not good enough / big enough [21:19:16] so we moved graphite to labmon1001 [21:19:18] hrmm, ok [21:19:26] and if we still want ganglia we’d need to get a prod box for it, set it up, etc [21:21:56] PROBLEM - RAID on vanadium is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0 [21:22:40] ugh [21:22:45] milimetric: nuria ^ [21:23:24] i just suppose we still want it [21:23:33] vanadium? [21:23:37] ganglia [21:23:52] hmm, maybe. for labs itself we’re happy with just graphite so far. [21:24:02] also IIRC when this happened our ganglia code was a bit eh [21:24:08] i wish i could use labs as originally intended [21:24:10] so maybe it makes sense not to put ganglia. [21:24:46] yea, well, the reason to touch it is the "code was a bit eh" part i guess:) [21:25:08] ok [21:25:09] YuviPanda: yikes [21:25:17] milimetric: yeah, that doesn’t sound good. [21:25:27] milimetric: we’re ok atm, I don’t think we lost any data ‘coz only 2 failed... [21:25:33] milimetric: but can you look and confirm? [21:25:51] i will tail some logs, kick some tires, thx [21:25:59] hmm, just closed that ticket about diskspace on vanadium as resolved [21:26:02] and now a disk dies? [21:26:02] milimetric: ok. it should still be repaired. [21:26:12] it’s an old out of warranty box, isn’t it? [21:26:19] lemme check [21:26:25] YuviPanda: we have been trying to replace that damn box for a while [21:26:29] but everyone's bzzy [21:26:47] yeah.. [21:26:48] Purchase Date:2011-01-27 [21:26:51] milimetric: is there a ticket for it? [21:27:06] milimetric: is it a SPOF? does it have un-backed-up critical data?
[21:27:26] YuviPanda: it's a SPOF [21:27:30] it's an old rickety box [21:27:40] ugh. [21:27:42] I think of it as a smart rat inside a toaster, [21:27:52] a really smart rat, sure [21:27:54] :) [21:28:02] milimetric: is there a procurement ticket for a new / replacement box? [21:29:26] i see access requests for vanadium when searching for that [21:29:55] that are not in operations though? [21:30:40] YuviPanda: yes, searching (grr) [21:30:59] YuviPanda: https://phabricator.wikimedia.org/T90363 [21:32:01] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Upgrade box for EventLogging (vanadium) - https://phabricator.wikimedia.org/T90363#1175784 (10yuvipanda) ``` PROBLEM - RAID on vanadium is CRITICAL: CRITICAL: Active: 4, Working: 4, Failed: 2, Spare: 0 ``` Just happened, so this box needs to be repla... [21:32:01] YuviPanda: nothing makes me feel like Vanadium's dropping events [21:32:09] (checked graphite and logs and stuff) [21:32:44] milimetric: ok. it should still be replaced very quickly tho [21:32:54] agreed [21:33:19] I'll tell otto first thing tomorrow, he's gone tonight [21:33:37] YuviPanda: so until then it'll be hobbled right? [21:33:49] as in, nobody's going in there to repair it, right? [21:35:57] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: Upgrade box for EventLogging (vanadium) - https://phabricator.wikimedia.org/T90363#1175804 (10yuvipanda) p:5High>3Unbreak! since this is a SPOF for EventLogging, and plenty of people will be sadface if eventlogging dies. 
[21:36:21] 10Ops-Access-Reviews, 6operations: eventlogging-roots for nuria - https://phabricator.wikimedia.org/T88823#1175808 (10Dzahn) [21:37:06] 10Ops-Access-Reviews, 6operations: eventlogging-roots for milimetric - https://phabricator.wikimedia.org/T88822#1175812 (10Dzahn) [21:39:56] 10Ops-Access-Reviews, 6operations: eventlogging-roots for milimetric - https://phabricator.wikimedia.org/T88822#1175822 (10Dzahn) this shouldn't have had a custom policy and the security dropdown selected. it went unnoticed as well because it wasn't in the Operations project. fixed that looks like these reque... [21:40:15] 10Ops-Access-Reviews, 6operations: eventlogging-roots for nuria - https://phabricator.wikimedia.org/T88823#1175825 (10Dzahn) this shouldn't have had a custom policy and the security dropdown selected. it went unnoticed as well because it wasn't in the Operations project. fixed that looks like these requests h... [21:43:47] PROBLEM - HHVM busy threads on mw1147 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [21:50:26] Mobile apps (Deskana) are reporting a high captcha failure rate today. Anybody know of any captcha changes that went out yesterday or would have hit enwiki with the train deploy? 
[21:51:37] PROBLEM - OCG health on ocg1003 is CRITICAL: CRITICAL: ocg_job_status 745383 msg: ocg_render_job_queue 3061 msg (=3000 critical) [21:52:07] PROBLEM - HHVM busy threads on mw1147 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [86.4] [21:52:36] PROBLEM - OCG health on ocg1002 is CRITICAL: CRITICAL: ocg_job_status 745567 msg: ocg_render_job_queue 3245 msg (=3000 critical) [21:52:46] PROBLEM - OCG health on ocg1001 is CRITICAL: CRITICAL: ocg_job_status 745599 msg: ocg_render_job_queue 3277 msg (=3000 critical) [21:52:46] (03PS2) 10Yuvipanda: labs: Add monitoring for high iowait on labstore instances [puppet] - 10https://gerrit.wikimedia.org/r/201591 (https://phabricator.wikimedia.org/T94606) [21:53:27] (03PS2) 10Dzahn: move misc/labsdebrepo out of misc to module [puppet] - 10https://gerrit.wikimedia.org/r/194796 [21:55:59] (03CR) 10Dzahn: "since this was moved from a class to a define in https://gerrit.wikimedia.org/r/#/c/118796/6/manifests/misc/labsdebrepo.pp how do all the " [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [21:57:06] (03CR) 10Dzahn: "there are several references to the class that was changed to a define here. for example require => Class['misc::labsdebrepo'] in dynamicp" [puppet] - 10https://gerrit.wikimedia.org/r/118796 (https://phabricator.wikimedia.org/T62925) (owner: 10Tim Landscheidt) [21:57:12] mutante: ^ misc::labsdebrepo is still there in the link you point to :) see bottom. [21:58:04] YuviPandaa: ah! i see it now [21:58:52] 7Puppet, 6operations, 6Labs, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1175889 (10Krinkle) 3NEW a:3Krinkle [21:59:51] (03CR) 10Dzahn: "ah!
i see it now, further down, thx Yuvi" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [22:00:37] PROBLEM - HHVM busy threads on mw1147 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [22:01:07] did anything change recently on the API cluster that could lead to a sudden hike in response times? [22:01:10] (03CR) 10Dzahn: "i just wanna delete as much as possible from the global ./misc/ directory" [puppet] - 10https://gerrit.wikimedia.org/r/194796 (owner: 10Dzahn) [22:03:44] Alert: shit's on fire. [22:04:07] 7Puppet, 6operations, 10Continuous-Integration: Puppet (silently) fails to setup apache on some integration-slave14xx instances - https://phabricator.wikimedia.org/T91832#1175914 (10Krinkle) [22:04:08] It's not possible to create an account on the Wikipedia app right now, including on versions that haven't had code changes for months [22:04:10] Please help. [22:04:22] 7Puppet, 6operations, 10Continuous-Integration: Puppet (silently) fails to setup apache on new trusty instances - https://phabricator.wikimedia.org/T91832#1175915 (10Krinkle) [22:04:30] 7Puppet, 6operations, 6Labs, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1175918 (10Dzahn) you would think the "if ! defined" already protects against this, but apparently not. ``` if ! defined ( Package['gd... [22:05:49] mutante: OK.I'll write patch [22:07:47] Deskana: when did this start? [22:07:54] 7Puppet, 6operations, 6Labs, 7Regression: Puppet: "Package[gdb] is already declared in file modules/java/manifests/tools.pp" - https://phabricator.wikimedia.org/T94917#1175939 (10Dzahn) https://groups.google.com/forum/#!topic/puppet-users/OF_WYN41dMI [22:08:05] Krinkle: ok, cool! [22:08:13] greg-g: Some time in the last 24 hours. [22:08:21] greg-g: Let me try to narrow that down... [22:08:25] mutante: hm.. 
require_package, Package, ensure_package, ensure_packages [22:08:26] that'd be nice :) [22:09:07] anomie: still around for a bit? we might need some api help, see Deskana's note above about app account creation being broken [22:09:22] mutante: Hm.. standard-packages is doing 'latest' instead of 'present' on these [22:09:25] so can't use ensure_packages, right? [22:09:45] Deskana: who are your api experts? [22:09:58] greg-g: dr0ptp4kt, bearND [22:10:41] dr0ptp4kt: you have shell access, I believe, can you take point for log grep'ing for this (account creation failure for apps)? [22:10:47] mutante: the if statement doesn't work because tools.pp is included first [22:13:09] !log Account creation is broken/not working for either iOS or Android WP apps, investigation in -mobile [22:13:12] greg-g: Account creations seem to have ceased around 20150401230000, UTC [22:13:16] Logged the message, Master [22:13:47] greg-g: Around 23 hours ago [22:13:49] well, crap, that's when a lot of stuff happened, including the branch update [22:14:26] 22:31 logmsgbot: twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf23 [22:16:28] That's the right time for it to break (new wmf branch on enwiki) [22:16:33] other things around there are gather related [22:16:49] So we are looking for changes to captcha and/or api in wmf23 [22:16:50] {action=createaccount, format=json, name=xxx, password=xxx, token=6dcdfb5221e37c865643b01cf952cdab, captchaid=23272803, captchaword=pegsheet} [I've xxx'd out user name/pwd] [22:17:17] RECOVERY - HHVM busy threads on mw1147 is OK: OK: Less than 30.00% above the threshold [57.6] [22:17:45] result = {"createaccount":{"result":"NeedCaptcha","warnings":[{"type":"warning","message":"captcha-createaccount-fail","params":[]}],"captcha":{"type":"image","mime":"image\/png","id":"1206516922","url":"\/w\/index.php?title=Special:Captcha\/image&wpCaptchaId=1206516922"}}} [22:17:57] What about this one? 
-- https://gerrit.wikimedia.org/r/#/c/197483/ [22:18:15] A 500 turned into a 400 [22:18:21] https://gerrit.wikimedia.org/r/#/c/198150/ ? [22:20:03] (03PS1) 10Krinkle: Fix duplicate Package[gdb] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) [22:20:19] I like bryan's idea [22:20:32] (of the culprit) [22:20:36] (03PS2) 10Krinkle: Fix duplicate Package[gdb] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) [22:22:06] (03CR) 10Krinkle: [C: 031] "Cherry-picked to integration-puppetmaster. Fixes the error." [puppet] - 10https://gerrit.wikimedia.org/r/201598 (https://phabricator.wikimedia.org/T94917) (owner: 10Krinkle) [22:22:10] that's what i keep getting, too [22:22:19] Krinkle: ok, i was about to say that, put an "if" in both [22:22:39] Krinkle: but we could also question "latest" as wrong [22:22:42] Hehe, yeah, took me a minute to realise [22:23:01] dr0ptp4kt: ideas on what bryan proposes as the culprit above? [22:23:01] Apr 2 22:22:51 integration-slave-precise-1011 puppet-agent[14854]: Could not retrieve catalog from remote server: Error 400 on SERVER: Duplicate declaration: Package[git-core] is already declared in file /etc/puppet/modules/authdns/manifests/scripts.pp:15; cannot redeclare at /etc/puppet/modules/base/manifests/standard-packages.pp:37 on node i-00000a44.eqiad.wmflabs [22:23:03] ugh [22:23:18] How come these all started now? [22:23:20] What changed recently [22:23:39] Neither of these changed recently [22:23:57] greg-g: that seems reasonable to me [22:23:58] Krinkle: mutante sorry, but can ya'll take this to -releng while we're diagnosing this broken account creation issue [22:24:05] ok [22:24:28] dr0ptp4kt: revert that patch or? [22:24:31] mutante, Krinkle: Thank you. :-) [22:24:33] looking for guidance here [22:25:00] Do we cache 400?
[22:25:16] That's what I was just wondering [22:25:20] 7Puppet, 6operations, 10Continuous-Integration, 7Regression: Puppet: "Package[git-core] is already declared in file modules/authdns/manifests/scripts.pp" - https://phabricator.wikimedia.org/T94921#1176059 (10Krinkle) 3NEW a:3Krinkle [22:25:23] bblack: do we cache 400s? [22:25:37] I think so [22:25:54] greg-g: i think attempting a revert would be worth a shot, although it's really guessing [22:26:13] puppet agent is hitting a varnish cache? :P [22:26:16] what else do we have right now? [22:26:36] bblack: no, https://phabricator.wikimedia.org/T94915#1175957 [22:27:48] greg-g, bd808: 197483 seems unlikely. [22:27:56] so the theory is that the wikiversions update caused a temporary 400 on a previously-valid and still-valid URL, which got cached? [22:28:14] bblack: this change of return code: https://gerrit.wikimedia.org/r/#/c/197483/1/FancyCaptcha.class.php [22:28:24] but anomie doesn't agree :) [22:28:28] greg-g, bd808: Definitely not 198150. [22:28:33] k [22:28:40] But it does seem to be a captcha issue of some sort. [22:28:49] greg-g: I tend to agree with you [22:28:58] let me confirm re: 400 caching [22:28:59] anomie: yeah, that one from me was just guessing/looking for "api" on the changes for wmf23 [22:29:26] (if we are 400-caching, then yes, returning a 400 from a bad captcha input would be a bad idea :)) [22:29:36] So the app is returning images; I see them changing on the screen [22:30:05] legoktm: fyi ^^ re https://gerrit.wikimedia.org/r/#/c/197483/ [22:30:27] bblack: It looks like the change was just for fetching an image based on id. If the id isn't there then it isn't there so caching is actually good [22:30:55] so that one seems to be a false lead [22:31:15] hmmm ok [22:31:19] anomie: any other ideas?
:) [22:31:25] greg-g: Looking [22:31:32] * greg-g nods [22:31:41] (03PS1) 10Krinkle: Fix duplicate Package[git-core] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201603 (https://phabricator.wikimedia.org/T94921) [22:31:45] I thought maybe it was a dynamic 400-or-not to a URL that should still work for others [22:31:49] Hash anyone tried solving some captchas on desktop? [22:32:00] csteipp: yeah, no problems [22:32:01] csteipp: Yeah, it worked for me. [22:32:40] do we know exactly what URL mobile's hitting that fails? [22:33:53] bblack: It's a POST to api.php on enwiki I believe [22:33:58] bearND, dr0ptp4kt: ^ [22:33:59] (03PS2) 10Krinkle: Fix duplicate Package[git-core] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201603 (https://phabricator.wikimedia.org/T94921) [22:34:17] YuviPandaa: did you ping me earlier? [22:35:04] bblack: yes, a POST to https://en.m.wikipedia.org/w/api.php [22:35:34] bblack: want me to repeat the request params + response? [22:35:42] Deskana: And mobile didn't change their session handling recently, right? [22:35:44] it can't hurt [22:35:53] request params = {action=createaccount, format=json, name=xxx, password=xxx, token=6dcdfb5221e37c865643b01cf952cdab, captchaid=23272803, captchaword=pegsheet} [22:36:03] response = {"createaccount":{"token":"6dcdfb5221e37c865643b01cf952cdab","result":"NeedToken"}} [22:36:04] grrrit-wm: https://gerrit.wikimedia.org/r/#/c/179838/ is looking suspicious [22:36:08] Krinkle: "base" is in theory applied everywhere, and should be applied for authdns servers. Can't we just dump git-core from authdns and leave it in base? [22:36:14] csteipp: This is happening on *all* versions of the Wikipedia app, even ones from months ago. [22:36:16] greg-g: https://gerrit.wikimedia.org/r/#/c/179838/ is looking suspicious [22:36:27] csteipp: So this isn't caused by a code change on our side. 
[22:36:39] Krinkle: or perhaps to be more-correct, have some appropriate part of authdns do require => Package['git-core'] [22:37:03] bblack: I'm just wondering why it started failing both for those two. Neither class changed recently. [22:37:15] what caused it to start failing where? [22:37:43] oh, wth does this have to do with java? :) [22:37:50] bblack: correction of response: {"createaccount":{"result":"NeedCaptcha","warnings":[{"type":"warning","message":"captcha-createaccount-fail","params":[]}],"captcha":{"type":"image","mime":"image\/png","id":"1206516922","url":"\/w\/index.php?title=Special:Captcha\/image&wpCaptchaId=1206516922"}}} [22:38:06] bblack: Failures seem to correspond with enwiki getting the 1.25wmf23 branch yesterday [22:38:13] bblack: mutante: btw, greg-g asked us to continue in #wikimedia-releng [22:38:31] (re the git-core/puppet issue) [22:38:33] nuria: let’s talk in -analytics [22:38:40] YuviPandaa: k [22:38:59] yeah ok [22:39:07] dr0ptp4kt: bearND see https://gerrit.wikimedia.org/r/#/c/179838/ ???? [22:39:38] 7Puppet, 6operations, 10Continuous-Integration, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1176120 (10Krinkle) 3NEW a:3Krinkle [22:40:04] (03CR) 10Krinkle: "Deployed on integration-puppetmaster. Fixes the error." 
[puppet] - 10https://gerrit.wikimedia.org/r/201603 (https://phabricator.wikimedia.org/T94921) (owner: 10Krinkle) [22:40:04] Looks like we use CaptchaCacheStore so session handling shouldn't have anything to do with it [22:40:08] (03CR) 10Krinkle: [C: 031] Fix duplicate Package[git-core] declaration [puppet] - 10https://gerrit.wikimedia.org/r/201603 (https://phabricator.wikimedia.org/T94921) (owner: 10Krinkle) [22:40:26] a captchaid maps into memcached and not the user session [22:41:19] 7Puppet, 6operations, 10Continuous-Integration, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1176120 (10Krinkle) [22:41:30] 7Puppet, 6operations, 10Continuous-Integration, 7Regression: Puppet: "Could not find class role::ci::slave::labs" - https://phabricator.wikimedia.org/T94925#1176120 (10Krinkle) [22:41:56] greg-g: unclear to me if that's it [22:44:55] So now it's using $wgRequest sometimes and $context->getRequest() other times. legoktm would that be prone to failure? [22:48:17] The change at line 921 of https://gerrit.wikimedia.org/r/#/c/179838/6/Captcha.php,unified seems suspicious based on the behavior. [22:48:54] If wpCaptchaId isn't being passed into the request properly then lookup will fail [22:49:33] bd808: Very likely $wgRequest !== $loginForm->getContext()->getRequest() in SimpleCaptcha::addNewAccountApiForm(), so when it uses $wgRequest to check later on it doesn't see the parameter renames made by that hook function. [22:49:48] Oh, you just spotted that too. [22:49:58] !log lots of SYSTEM ERROR responses from nutcracker on mw1147 [22:50:06] Logged the message, Master [22:50:56] I see in captcha.log lots of " ConfirmEdit: new captcha session; new account 'Anomie test 2 for ACC bug'" which tells me it's not seeing the session. 
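[editor's note] The theory discussed above (a hook renames request parameters on one request object while a later check reads a different one) can be illustrated with a minimal sketch. This is not ConfirmEdit code: the `Request` class, the store, the function names, and the sample id/word are all hypothetical stand-ins. Only the parameter names `captchaid`, `captchaword`, and `wpCaptchaId` appear in the log; `wpCaptchaWord` is assumed by analogy.

```python
# Hypothetical model of the suspected bug; none of these names are real
# ConfirmEdit or MediaWiki APIs.
class Request:
    """A tiny stand-in for a web request that owns its parameter dict."""

    def __init__(self, params):
        self.params = dict(params)  # each request gets its own copy

    def get(self, key, default=None):
        return self.params.get(key, default)

    def set(self, key, value):
        self.params[key] = value


# Assumed captcha store: captcha id -> expected word (sample values only).
CAPTCHA_STORE = {"12345": "exampleword"}


def rename_api_params(request):
    """Model of the hook: copy API-style names to the names the checker reads."""
    request.set("wpCaptchaId", request.get("captchaid"))
    request.set("wpCaptchaWord", request.get("captchaword"))


def captcha_passes(request):
    """Model of the later check: it only looks at the renamed parameters."""
    expected = CAPTCHA_STORE.get(request.get("wpCaptchaId"))
    return expected is not None and expected == request.get("wpCaptchaWord")


api_params = {"captchaid": "12345", "captchaword": "exampleword"}

# Case 1: hook and check share one object, so the renames are visible.
shared = Request(api_params)
rename_api_params(shared)
same_object_result = captcha_passes(shared)

# Case 2: the hook renames on one object, the check reads another one.
context_request = Request(api_params)
global_request = Request(api_params)
rename_api_params(context_request)
split_object_result = captcha_passes(global_request)

print(same_object_result, split_object_result)
```

In the second case the lookup never sees `wpCaptchaId`, so a correct answer is still rejected, which would match the behaviour reported by the app users if the theory is right.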
[22:51:51] anomie: bd808, greg-g bblack bearND, so the behavior i'm observing is [22:51:54] a request like: [22:52:01] { [22:52:01] action = createaccount; [22:52:03] captchaid = 646254063; [22:52:04] captchaword = digsafros; [22:52:06] email = ""; [22:52:07] format = json; [22:52:09] language = en; [22:52:10] name = hotelier0000; [22:52:12] password = hotelier0001; [22:52:13] realname = ""; [22:52:14] reason = "iOS App Account Creation"; [22:52:15] token = a413fc20c1bf1dd9f1bc5890908609c9; [22:52:16] } [22:52:30] leads to a response like this: [22:52:34] { [22:52:34] createaccount = { [22:52:36] captcha = { [22:52:37] bd808, greg-g: So my vote is https://gerrit.wikimedia.org/r/#/c/179838/ broke things by making it use different request objects in different places, so the hackery the extension does to rename parameters for other parts of itself is b0rken. [22:52:38] id = 1496012160; [22:52:39] mime = "image/png"; [22:52:41] type = image; [22:52:42] url = "/w/index.php?title=Special:Captcha/image&wpCaptchaId=1496012160"; [22:52:44] }; [22:52:45] result = NeedCaptcha; [22:52:47] warnings = ( [22:52:48] { [22:52:50] message = "captcha-createaccount-fail"; [22:52:51] params = ( [22:52:53] ); [22:52:54] type = warning; [22:52:55] it keeps coming [22:52:56] } [22:52:57] ); [22:52:58] }; [22:52:59] } [22:53:11] dr0ptp4kt: *nod* [22:53:17] anomie: I concur [22:53:21] alright, who's up for a revert and push? [22:53:37] ebernhardson merged it ;) [22:53:39] Not me, my dinner is almost done cooking and I want to eat it. [22:53:41] bd808: and i remain in an infinite loop of answering the captcha correctly [22:53:51] bd808: which? 
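[editor's note] The responses pasted in this log (`NeedToken`, then `NeedCaptcha` over and over) suggest how a client might guard against the loop described above. The sketch below only classifies canned responses and makes no network calls. The function names, the action labels, and the `max_captchas` limit are hypothetical; the response shapes are copied from the JSON quoted earlier in the log.

```python
def next_action(response):
    """Decide what a client would do next for one createaccount response."""
    body = response.get("createaccount", {})
    result = body.get("result")
    if result == "NeedToken":
        # Resend the same request, this time including the returned token.
        return ("resend_with_token", body["token"])
    if result == "NeedCaptcha":
        # Show the captcha identified by this id and ask the user again.
        return ("solve_captcha", body["captcha"]["id"])
    return ("other", result)


def summarize(responses, max_captchas=3):
    """Walk canned responses; give up once too many captchas were requested."""
    captchas = 0
    for response in responses:
        action, _ = next_action(response)
        if action == "solve_captcha":
            captchas += 1
            if captchas >= max_captchas:
                return "gave_up_after_captcha_loop"
        elif action == "other":
            return "finished"
    return "incomplete"


# Response shapes as quoted in the log.
NEED_TOKEN = {
    "createaccount": {
        "token": "6dcdfb5221e37c865643b01cf952cdab",
        "result": "NeedToken",
    }
}
NEED_CAPTCHA = {
    "createaccount": {
        "result": "NeedCaptcha",
        "warnings": [
            {"type": "warning", "message": "captcha-createaccount-fail", "params": []}
        ],
        "captcha": {
            "type": "image",
            "mime": "image/png",
            "id": "1206516922",
            "url": "/w/index.php?title=Special:Captcha/image&wpCaptchaId=1206516922",
        },
    }
}

print(next_action(NEED_TOKEN))
print(summarize([NEED_TOKEN, NEED_CAPTCHA, NEED_CAPTCHA, NEED_CAPTCHA]))
```

A cap like this would not fix the server-side problem; it would only let a client stop and report an error instead of re-prompting indefinitely.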
[22:53:57] https://gerrit.wikimedia.org/r/#/c/179838/6 [22:54:14] anomie: thakns a ton for your help [22:54:23] ebernhardson: < anomie> bd808: Very likely $wgRequest !== $loginForm->getContext()->getRequest() in SimpleCaptcha::addNewAccountApiForm(), so when it uses $wgRequest to check later on it doesn't see the parameter renames made by that hook function. [22:54:52] bd808: :( should be safe to revert [22:55:15] ebernhardson: go for it plz [22:55:33] it broke account creation for apps [22:55:35] thank you guys! [22:55:44] bearND: don't thank us yet :) [22:56:29] * bearND reverts thank you (until later) ;) [22:56:35] Apps and other API consumers, I'd imagine. [22:58:04] greg-g: will take a minute, it doesn't revert cleanly and i have to back a matching patch out of flow [22:58:27] since then aaron has done some optimizations of master db handling in same code [22:58:38] weee [22:58:40] thanks [23:00:04] RoanKattouw, ^d, Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150402T2300). Please do the needful. [23:00:30] hey swatters, please hold off while ebernhardson does a revert [23:01:09] this will probably need a half hour to test and ensure i don't just break more things, can probably go ahead [23:02:07] * greg-g nods [23:02:15] fire at will, swatters [23:03:27] Krenair: RoanKattouw swat'ing today? [23:03:39] chad went _away at a most opportune time [23:04:55] I'll do it [23:05:01] ty [23:05:17] superm401 jdlrobson3: You guys here for SWAT? 
[23:05:27] \o present [23:05:34] Yep [23:05:54] RoanKattouw: jdlrobson3 superm401, ebernhardson is doing emergency work right now [23:06:13] erik will be at least 25 minutes, so go ahead with some quick ones [23:06:35] the vector skin change is also high prio: https://phabricator.wikimedia.org/T93050 [23:06:38] greg-g: ebernhardson thx [23:07:00] ^ yeh that should be pretty quick to test [23:07:40] yeah, let's do that one first [23:07:54] and depending on timing, either the erik revert or other swats [23:08:17] add a "then" in that sentence to make it make sense [23:09:59] OK [23:10:07] I'm gonna +2 things first and let them go through Jenkins [23:10:09] That'll take a while [23:10:11] * greg-g nods [23:10:49] ebernhardson: Should https://gerrit.wikimedia.org/r/#/c/201599/ still be deployed? [23:10:58] !log temp. disabling puppet on restbase servers [23:11:04] Logged the message, Master [23:11:11] Deskana, what's the app CAPTCHA bug number? It might also be related to https://phabricator.wikimedia.org/T94276 (I didn't test the reported fix to that). [23:11:35] superm401: https://phabricator.wikimedia.org/T94915 [23:11:59] (03PS2) 10Dzahn: cassandra: add firewalling on prod [puppet] - 10https://gerrit.wikimedia.org/r/201389 (https://phabricator.wikimedia.org/T92680) [23:13:12] Can someone kick FatBack (see topic history)? [23:13:16] !ops again, over here this time :/ [23:13:32] superm401: he was also causing trouble over in -tech [23:13:45] So not so much kick as ban his IP. [23:14:00] the problem of having too many channels [23:14:38] (03CR) 10Dzahn: [C: 032] cassandra: add firewalling on prod [puppet] - 10https://gerrit.wikimedia.org/r/201389 (https://phabricator.wikimedia.org/T92680) (owner: 10Dzahn) [23:16:21] grr, i must need cookies to use the createaccount api? ... makes testing from curl annoying [23:16:25] if i need to update my ssh key can i just commit that myself to the production puppet repot or do i need to go through phab? 
[23:16:46] tfinc: yeah, commit it yourself and poke one of us. [23:16:52] ebernhardson: Needs a session for the CSRF token. [23:16:59] tfinc: commiting it yourself is the easiest thing to do since it verifies that it *is* you... [23:17:00] anomie: anon csrf token is always \+ [23:17:13] ebernhardson: Not for login or createaccount. [23:17:32] even using the curl cookie jar its not working :( [23:17:34] tfinc: if you put it up on gerrit and ping the on duty it should be no issue [23:17:38] or hit me up no worries [23:17:47] ha, too bad I can't because my old key isn't working. how is that for chicken and egg :D [23:17:52] heh [23:18:20] tfinc: you can then put it up on phab and someone in the office can verify you :) [23:18:34] if it's linked to the MW staff account [23:18:40] and you put in on your profile in phab [23:18:42] should be ok? [23:19:16] mutante: can you kickban FatBack here also, plz? [23:19:32] 6operations: Update ssh key for 'tomasz' - https://phabricator.wikimedia.org/T94934#1176250 (10Tfinc) 3NEW [23:20:53] ticket cut [23:20:56] mutante: would you please type /cs access #wikimedia-operations add Barras (MANAGER) [23:21:09] so I can take care of it... [23:21:21] or bblack ^ [23:21:39] yeah, bblack could as well [23:23:04] do we need a kick for FatBack ? [23:23:16] Yes [23:23:19] fuck no [23:23:21] hes cool [23:23:23] tfinc: trying to get someone who has the right access to respond :) [23:23:32] greg-g: thanks [23:23:45] (03PS3) 10Yuvipanda: labs: Add monitoring for high iowait on labstore instances [puppet] - 10https://gerrit.wikimedia.org/r/201591 (https://phabricator.wikimedia.org/T94606) [23:23:47] why is wmfgc not set as op :( [23:25:00] Barras: good question [23:25:39] greg-g: Sorry that I can't help, but there is no staff around to give me rights... 
[23:25:54] Barras: yeah, we'll deal, thanks for trying [23:26:09] I've been pinging relevant people in another channel, but no response yet [23:26:18] (03CR) 10Yuvipanda: [C: 032] labs: Add monitoring for high iowait on labstore instances [puppet] - 10https://gerrit.wikimedia.org/r/201591 (https://phabricator.wikimedia.org/T94606) (owner: 10Yuvipanda) [23:27:26] thanks kloeri [23:27:52] thanks kloeri :) [23:27:56] welcome [23:28:43] Thanks, kloeri. Is #freenode the right place to ask if we need assistance in the future? [23:29:18] What decides who can and can’t kick? [23:29:50] Barras: ^ andrewbogott can give the right access to -operations, pm him some commands :) [23:29:58] i dont know we have any rules here, but in other channels i frequent we've kinda just given chanserv @op to everyone who has been around >1year [23:30:00] ie "trustable" [23:30:06] ensures there is always someone around to deal with things [23:30:23] Eloquence: {"servedby":"mw1227","error":{"code":"500","info":"parsoidserver-http: HTTP 500","*":"See https://en.wikipedia.org/w/api.php for API usage"}} [23:30:24] yep [23:30:32] gwicke: around? [23:30:40] (03PS1) 10Yuvipanda: labs: Alert on high load in labstore* [puppet] - 10https://gerrit.wikimedia.org/r/201618 (https://phabricator.wikimedia.org/T94606) [23:30:41] cscott [23:30:50] (03CR) 10jenkins-bot: [V: 04-1] labs: Alert on high load in labstore* [puppet] - 10https://gerrit.wikimedia.org/r/201618 (https://phabricator.wikimedia.org/T94606) (owner: 10Yuvipanda) [23:31:01] ori: yup [23:31:05] (03PS2) 10Yuvipanda: labs: Alert on high load in labstore* [puppet] - 10https://gerrit.wikimedia.org/r/201618 (https://phabricator.wikimedia.org/T94606) [23:31:11] we just enabled the new firewalling on the RB cluster [23:31:12] ori: yup [23:31:28] it's likely that some requests were dropped [23:31:29] gwicke: Eloquence is noticing that Parsoid is 500ing fairly frequently [23:31:56] andrewbogott: You can give more people op rights in here if you want. 
(e.g. wmfgc account or me or generally other people) [23:31:59] try Special:Random -> Edit, happens about 1 out of 10 times [23:32:09] some take a lot of time, some fail [23:32:49] for later: the telemetry sucks, we need alerts for this [23:32:50] andrewbogott: "wmfgc" is the group of Wikimedia IRC ops who can come to help us if we just type "ops" with a ! before it (didn't want to ping 'em) [23:32:54] how are these requests done? [23:33:11] is it VE page loads? [23:33:17] yep [23:33:35] still the case right now? [23:33:43] yep [23:34:04] Obama loaded quickly on enwiki [23:34:08] greg-g: wmfgc is actually only used by three people :) Not really that much and if we are all sleeping also not helpful [23:34:19] Barras: ah, misremembered [23:34:28] andrewbogott: you could add me too, and perhapas greg-g as well? [23:34:33] superm401: #freenode is generally a good place to get our attention [23:34:44] YuviPanda: ok, still trying to figure out what the commands are... [23:34:47] andrewbogott: what YuviPanda said [23:34:47] superm401: we might not always be around of course [23:34:52] Right. Thanks, kloeri [23:35:09] OK, SWAT update [23:35:13] The pull-throughs are all merged now [23:35:14] andrewbogott: /cs access #wikimedia-operations add IRCNICK manager [23:35:31] ebernhardson: Are you done doing what you were doing and can I do the SWAT now? [23:35:35] for (almost) full access to the channel. [23:35:53] Otherwise you may use subsets of flags :) [23:36:37] ok, I think I have done all of the above. [23:36:54] andrewbogott: Yeah, looks good :) [23:37:00] aaaaand, now I’m going to go cook dinner. 
[23:37:03] Thanks Barras [23:37:07] you did not give him full access o_O [23:37:09] you successfully added some people :) [23:37:11] thanks andrewbogott [23:37:40] RoanKattouw: i told greg earlier, and he aggreed, go forward with swat i am only testing this revert so far not deploying it yet [23:37:49] will be deploying soon, once i can see it works on beta [23:37:52] but stilla bit [23:37:53] voice some peop [23:38:28] Mjbmr: can’t managers voice themselves if they want to? [23:38:28] +v'ing is up to someone else, this channel hasn't normally had that status inditation [23:38:35] Ah, ok [23:38:40] * andrewbogott ducks out [23:39:06] yeah, they can [23:39:23] ebernhardson: OK so I'm gonna SWAT now then [23:39:34] gwicke, https://grafana.wikimedia.org/#/dashboard/db/visualeditor-load-save seems to be stabilizing [23:39:35] ori, Eloquence: both manual testing and graphite seem to indicate that things are pretty much back to normal after the firewall was enabled [23:39:43] RoanKattouw: sounds good [23:40:06] yeah, I can no longer get 500s [23:40:17] ebernhardson: Actually never mind I don't have all the pull-throughs lined up yet, I need some more time [23:40:21] I'll ping when I'm actually ready [23:40:24] RoanKattouw: ok [23:40:32] sorry about that, I think it was rolled out a bit too quickly across all nodes [23:41:22] but on the plus side, the cassandra cluster is now properly firewalled off [23:41:32] how long ago did you roll it out? 
[23:42:05] between 1630 and 1634 [23:42:26] it's pretty easy to see in http://grafana.wikimedia.org/#/dashboard/db/restbase [23:42:30] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Requesting access to analytics-users (stat1002) for Jkatz - https://phabricator.wikimedia.org/T94939#1176345 (10JKatzWMF) 3NEW a:3Ottomata [23:42:37] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1176354 (10Dzahn) It actually worked fine. We had disabled puppet first to just try on restbase1001, confirmed, then reenabled it... [23:44:49] ori: https://gerrit.wikimedia.org/r/#/c/201389/ i had puppet disabled except on one, then re-enabled on the others, i should have probably waited a bit longer in between them [23:46:02] the actual change itself works fine though [23:52:06] testing 123 :) [23:53:00] urandom: it's enabled in prod :) [23:55:28] mutante: sweet! [23:55:53] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1176404 (10Dzahn) a:3Dzahn [23:57:46] 6operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1176419 (10Dzahn) [23:57:49] 6operations, 10RESTBase, 10RESTBase-Cassandra, 6Security, 5Patch-For-Review: iptables firewall to limit access to Cassandra services - https://phabricator.wikimedia.org/T92680#1176417 (10Dzahn) 5Open>3Resolved ``` @restbase1001:~# cat /etc/ferm/conf.d/ 00_defs 10_bastion-ssh... 
[23:57:52] (03PS2) 10Yuvipanda: webservice2: EAFP, not LBYL [puppet] - 10https://gerrit.wikimedia.org/r/201421 (owner: 10Ori.livneh) [23:58:07] (03CR) 10Yuvipanda: [C: 032] webservice2: EAFP, not LBYL [puppet] - 10https://gerrit.wikimedia.org/r/201421 (owner: 10Ori.livneh) [23:58:18] (03CR) 10Yuvipanda: [V: 032] webservice2: EAFP, not LBYL [puppet] - 10https://gerrit.wikimedia.org/r/201421 (owner: 10Ori.livneh) [23:59:22] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail