[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem, tgr: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T0000).
[00:00:21] I'll do it!
[00:00:36] <^d> I WAS HERE FIRST ROANKATTOUW
[00:00:39] ^d: I'm here :)
[00:00:39] <^d> GET YER OWN SWAT
[00:00:40] <^d> :)
[00:00:42] lol
[00:00:47] You can do it if you like
[00:00:52] RoanKattouw wanted a prize
[00:00:53] ^d: In that case, we have two extras not on the list. ;-)
[00:00:54] If you're willing to put up with my last-minute addition
[00:01:09] <^d> Argh!!!!!!
[00:01:15] ^d: See? :-)
[00:01:19] That's why I claimed it
[00:01:27] But I didn't notice you'd claimed it already
[00:01:34] <^d> I was offering prizes.
[00:01:40] <^d> But all I have is more cake, so feel free.
[00:02:23] who's doing SWAT, and would they put up with a massively late addition from me? GPG is failing to execute for SecurePoll and so I want to turn $wgSecurePollShowErrorDetail=true; on ... [or if someone wants to go digging in the logs for me]
[00:03:17] <^d> RoanKattouw and I are arguing over who's doing it :)
[00:03:26] I'll just start doing it now
[00:03:28] perfect, you guys keep arguing and I'll submit a patch
[00:03:50] kaldari is here so his change can go out
[00:03:56] Where is everyone else
[00:04:01] legoktm: tgr: SWAT time
[00:04:11] RoanKattouw: hi
[00:04:14] <^d> I can watch lego's if he doesn't respond.
[00:04:17] <^d> Oh, there he is
[00:04:28] I can see them both in the office
[00:04:30] So it'll be fine
[00:05:11] (CR) Catrope: [C: 2] Adding Wikipedia wordmark for mobile and switching to it [mediawiki-config] - https://gerrit.wikimedia.org/r/174585 (https://bugzilla.wikimedia.org/58886) (owner: Kaldari)
[00:05:17] RoanKattouw: ready if you are
[00:05:18] (Merged) jenkins-bot: Adding Wikipedia wordmark for mobile and switching to it [mediawiki-config] - https://gerrit.wikimedia.org/r/174585 (https://bugzilla.wikimedia.org/58886) (owner: Kaldari)
[00:05:22] Awesome
[00:05:25] Config changes first
[00:05:53] (CR) Catrope: [C: 2] Add SkinDistributor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/174480 (owner: Legoktm)
[00:06:01] (Merged) jenkins-bot: Add SkinDistributor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/174480 (owner: Legoktm)
[00:06:08] (CR) Catrope: [C: 2] Revert "Revert "Enable JPG thumbnail chaining on all wikis except commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/174451 (owner: Gilles)
[00:06:19] (Merged) jenkins-bot: Revert "Revert "Enable JPG thumbnail chaining on all wikis except commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/174451 (owner: Gilles)
[00:08:43] (PS1) Jalexander: Enable SecurePoll error detail for debugging [mediawiki-config] - https://gerrit.wikimedia.org/r/174603 (https://bugzilla.wikimedia.org/73626)
[00:09:26] !log catrope Synchronized images/mobile/: SWAT: new Wikipedia wordmark for mobile (duration: 00m 03s)
[00:09:33] Logged the message, Master
[00:09:36] RoanKattouw: I'll add it to the list but if possible I'd like to get that out ^
[00:10:00] apologies for the late timing (it was discovered now because polls, including test polls like this, start at 00:00 )
[00:10:44] !log catrope Synchronized wmf-config/: SWAT (duration: 00m 04s)
[00:10:46] Logged the message, Master
[00:10:57] jamesofur: Will deploy. Happy to do it as long as you add it to the wiki page for posterity
[00:11:03] * jamesofur nods
[00:11:05] thanks, adding now
[00:11:06] (CR) Catrope: [C: 2] Enable SecurePoll error detail for debugging [mediawiki-config] - https://gerrit.wikimedia.org/r/174603 (https://bugzilla.wikimedia.org/73626) (owner: Jalexander)
[00:11:14] (Merged) jenkins-bot: Enable SecurePoll error detail for debugging [mediawiki-config] - https://gerrit.wikimedia.org/r/174603 (https://bugzilla.wikimedia.org/73626) (owner: Jalexander)
[00:13:26] !log catrope Synchronized wmf-config/: SWAT: temp debugging for SecurePoll (duration: 00m 04s)
[00:13:28] Logged the message, Master
[00:13:33] OK, those were the config changes
[00:13:33] thank ye, it's added on wiki
[00:13:50] tgr, kaldari, legoktm, jamesofur: Please verify and confirm those
[00:14:02] verified on my side
[00:14:09] RoanKattouw: Looks good!
[00:14:10] RoanKattouw: mine was a no-op :)
[00:14:52] I've broken it :/
[00:15:13] RoanKattouw: hard to confirm but nothing is broken
[00:15:13] Request: POST http://en.wikipedia.org/w/index.php?title=Special:MovePage&action=submit, from 10.128.0.116 via cp4010 cp4010 ([10.128.0.110]:3128), Varnish XID 575413096
[00:17:19] OK, extension time
[00:26:26] !log catrope Synchronized php-1.25wmf8/includes/media/: SWAT: don't apply EXIF rotation to chained thumbnails (duration: 00m 04s)
[00:26:28] Logged the message, Master
[00:26:34] tgr: ---^^
[00:34:13] (PS1) GWicke: Add cassandra submodule [puppet] - https://gerrit.wikimedia.org/r/174608
[00:34:17] !log catrope Synchronized php-1.25wmf8/extensions/VisualEditor: SWAT (duration: 00m 04s)
[00:34:19] Logged the message, Master
[00:34:21] !log catrope Synchronized php-1.25wmf9/extensions/VisualEditor: SWAT (duration: 00m 04s)
[00:34:24] Logged the message, Master
[00:36:55] (PS1) Kaldari: Changing to relative URL for now [mediawiki-config] - https://gerrit.wikimedia.org/r/174609
[00:36:57] ori: https://gerrit.wikimedia.org/r/174608
[00:37:16] MaxSem: https://gerrit.wikimedia.org/r/#/c/174609/
[00:37:30] kaldari: Could you improve the commit summary?
[00:37:32] RoanKattouw: verified, thanks
[00:37:42] kaldari: Since there are hundreds of URLs in that file, let alone the repo
[00:38:21] RoanKattouw: Sure
[00:38:47] kaldari: Also do you need that deployed right now? I only just finished SWAT so I can throw it in if you want
[00:39:02] (PS2) Kaldari: Changing to relative URL for for mobile wordmark image (per MaxSem) [mediawiki-config] - https://gerrit.wikimedia.org/r/174609
[00:39:35] RoanKattouw: yes, go ahead and deploy that
[00:39:48] OK
[00:39:57] kaldari: Could you do me a favor and edit the wiki page to add that commit?
[00:40:03] While I deploy it
[00:40:08] Just so it's on the record
[00:40:09] RoanKattouw: sure
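For the record, the SecurePoll change synced at [00:13:26] amounts to a one-line configuration flag. A minimal sketch, assuming the usual wmf-config layout (file placement and comment are illustrative, not the actual patch):

    // wmf-config/CommonSettings.php -- placement assumed for illustration
    // Temporarily surface detailed SecurePoll/GPG errors so the failing
    // gpg invocation can be diagnosed (bug 73626). Revert once debugging
    // is done, since error detail can leak internals to voters.
    $wgSecurePollShowErrorDetail = true;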
[00:40:14] (CR) Catrope: [C: 2] Changing to relative URL for for mobile wordmark image (per MaxSem) [mediawiki-config] - https://gerrit.wikimedia.org/r/174609 (owner: Kaldari)
[00:40:24] (Merged) jenkins-bot: Changing to relative URL for for mobile wordmark image (per MaxSem) [mediawiki-config] - https://gerrit.wikimedia.org/r/174609 (owner: Kaldari)
[00:41:13] !log catrope Synchronized wmf-config/InitialiseSettings.php: Change mobile wordmark image to relative URL (duration: 00m 04s)
[00:41:15] Logged the message, Master
[00:41:22] kaldari: Done --^^
[00:42:28] jgage: ping
[00:42:50] RoanKattouw: Updated the page
[00:44:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:44:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:50:30] !log maxsem Synchronized php-1.25wmf9/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/174613/ (duration: 00m 04s)
[00:50:35] Logged the message, Master
[00:51:31] * gwicke needs a review https://gerrit.wikimedia.org/r/#/c/174608/ in order to be able to test the module in labs
[00:51:33] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 321 seconds
[00:51:35] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 327 seconds
[00:51:55] gwicke: you can cherry-pick it in labs
[00:53:39] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[00:53:49] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:10] (PS4) GWicke: Give parsoid-roots access to ruthenium; split cassandra test hosts [puppet] - https://gerrit.wikimedia.org/r/172780 (owner: Cscott)
[00:54:42] ori: you mean on the beta labs puppet master?
[00:55:04] yep
[00:55:31] hmm.. won't that potentially conflict with other updates?
[00:55:57] we forgot to actually add the submodule after the module was merged
[00:56:44] https://gerrit.wikimedia.org/r/#/c/166888/11
[00:58:58] ori: it would really be cleaner to just merge this
[00:59:29] OK.
[01:00:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:00:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:00:24] (CR) Ori.livneh: [C: 2] Add cassandra submodule [puppet] - https://gerrit.wikimedia.org/r/174608 (owner: GWicke)
[01:00:37] ori: thanks!
[01:14:38] ori: do you know how to get the hiera entry fields for configured roles on a node?
[01:19:44] * gwicke has the suspicion that puppet is generally broken in beta labs
[01:20:35] (PS1) Rush: phab security-bug macro testing [puppet] - https://gerrit.wikimedia.org/r/174614
[01:20:39] https://gist.github.com/gwicke/df92917779ad4f731368
[01:21:33] <^d> Did you point it at the beta puppetmaster or is it still pointing at the normal one?
[01:22:17] ^d: I didn't change anything about the master config after creating the instance in beta
[01:22:55] won't it default to the right master?
[01:23:06] (CR) Rush: [C: 2 V: 2] phab security-bug macro testing [puppet] - https://gerrit.wikimedia.org/r/174614 (owner: Rush)
[01:23:14] <^d> gwicke: No, they don't default to beta's puppetmaster unless that's changed.
[01:23:17] <^d> Sec, there's docs.
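The "Add cassandra submodule" step merged at [01:00:24] is the part that was missed after the module repo itself landed. A minimal sketch of how such a submodule gets registered, assuming the module lives in its own repo (URL and path here are illustrative, not the actual change):

    # in a checkout of operations/puppet; URL and path are assumptions
    git submodule add https://gerrit.wikimedia.org/r/p/operations/puppet/cassandra modules/cassandra
    # records the URL in .gitmodules and pins a specific commit
    git commit -m "Add cassandra submodule"
    # other checkouts pick it up after the change merges with:
    git submodule update --init modules/cassandra

Until that second commit exists, the puppet master has no modules/cassandra directory at all, which is why the cassandra class appeared to be missing.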
[01:24:02] the strange thing is that it earlier complained about the cassandra class missing
[01:24:37] after adding the submodule that error is now gone, so clearly puppet is applying the roles I assigned to the node in the wikitech interface
[01:26:15] <^d> gwicke: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Converting_a_host_to_use_local_puppetmaster_and_salt_master
[01:28:46] I see, thanks
[01:29:39] hmmm.. do I really need puppet::self?
[01:31:22] * gwicke always thought that sets up another puppet master
[01:32:59] <^d> gwicke: No idea. I just follow the crowd. :)
[01:33:51] kk; makes no difference so far, but might not be applied yet
[01:36:39] * gwicke defers this to another day
[02:21:35] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-20 02:21:35+00:00
[02:21:42] Logged the message, Master
[02:27:08] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (12.50%)
[02:34:12] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-20 02:34:12+00:00
[02:34:19] Logged the message, Master
[02:56:09] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail
[03:10:54] (PS2) Ori.livneh: hhvm: enable perf_pid.map files w/automatic pruning [puppet] - https://gerrit.wikimedia.org/r/174356
[03:13:03] (PS3) Ori.livneh: hhvm: enable perf_pid.map files w/automatic pruning [puppet] - https://gerrit.wikimedia.org/r/174356
[03:13:57] (CR) Ori.livneh: [C: 2] hhvm: enable perf_pid.map files w/automatic pruning [puppet] - https://gerrit.wikimedia.org/r/174356 (owner: Ori.livneh)
[03:14:29] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:14:49] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago
[03:16:58] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[03:18:18] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:18:49] (PS1) Ori.livneh: HHVM: enable perf_pid_map for FCGI only; not CLI. [puppet] - https://gerrit.wikimedia.org/r/174625
[03:19:21] (CR) Ori.livneh: [C: 2 V: 2] HHVM: enable perf_pid_map for FCGI only; not CLI. [puppet] - https://gerrit.wikimedia.org/r/174625 (owner: Ori.livneh)
[03:20:19] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:21:19] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[03:24:08] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:24:58] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:25:19] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[03:26:58] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[03:27:58] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:29:19] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:30] (PS1) Ori.livneh: Fix for ensure_jemalloc_prof_deactivated check [puppet] - https://gerrit.wikimedia.org/r/174626
[03:34:42] (CR) Ori.livneh: [C: 2 V: 2] Fix for ensure_jemalloc_prof_deactivated check [puppet] - https://gerrit.wikimedia.org/r/174626 (owner: Ori.livneh)
[03:36:28] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[04:27:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Nov 20 04:27:25 UTC 2014 (duration 27m 24s)
[04:27:29] Logged the message, Master
[04:36:26] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[04:36:38] (PS2) Glaisher: Delete vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/171219 (https://bugzilla.wikimedia.org/55737)
[05:05:56] PROBLEM - Varnish HTCP daemon on cp1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (vhtcpd), args vhtcpd
[05:10:30] sorry that's me above ^, that's not even a prod machine, I'm not sure why it's in icinga :p
[05:15:04] !log made myself an administrator on phabricator
[05:15:08] Logged the message, Master
[05:28:15] !log ~30m to esams power out, starting equipment shutdown and such for OE13/OE15
[05:28:18] Logged the message, Master
[05:33:56] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:34:23] (PS1) BBlack: esams local nets -> eqiad [dns] - https://gerrit.wikimedia.org/r/174630
[05:34:38] oh that too I guess
[05:34:57] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[05:34:59] (CR) BBlack: [C: 2] esams local nets -> eqiad [dns] - https://gerrit.wikimedia.org/r/174630 (owner: BBlack)
[05:35:29] ignore those, I didn't think about the big addrs when I set downtimes
[05:41:04] !log amssq31-62, cp300[12], lvs300[34], ssl300[123] all shut down for esams power event (and downtimed)
[05:41:08] Logged the message, Master
[05:43:06] PROBLEM - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:43:23] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:43:27] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:43:30] PROBLEM - LVS HTTPS IPv4 on upload-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:43:33] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:43:37] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:43:51] really? :p
[05:43:52] wooo
[05:44:33] oh yeah, SSL is hidden under the same names in icinga
[05:44:42] more things to disable!
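Downtimes like the ones bblack set are scheduled through Icinga's external command interface. A minimal sketch, from memory of the standard Nagios/Icinga command syntax (the command-file path and the two-hour window are assumptions):

    # schedule a fixed 2h host downtime for cp3001; times are epoch seconds
    now=$(date +%s)
    printf '[%s] SCHEDULE_HOST_DOWNTIME;cp3001;%s;%s;1;0;7200;bblack;esams power work\n' \
        "$now" "$now" "$((now + 7200))" > /var/lib/icinga/rw/icinga.cmd

A host downtime does not cover the host's service checks, which need their own SCHEDULE_SVC_DOWNTIME (or SCHEDULE_HOST_SVC_DOWNTIME) entries; that is exactly why the SSL checks "hidden under the same names" still paged above.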
[05:45:30] Nov 20 05:40:29 berkelium charon: [IKE] IKE_SA berkelium-cp3001[23] state change: DELETING => DESTROYING
[05:45:33] DESTROY
[05:46:44] sorry for the pages. if you wake up randomly and come here looking, go back to sleep!
[05:46:55] DAMN YOU
[05:47:03] ;)
[05:48:38] good morning!
[05:48:42] hi
[05:49:16] we should be good to go. are they planning to contact us/you about start/finish times for the actual work?
[05:49:25] yes they'll call me before they start
[05:49:29] ok
[05:55:33] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: No route to host
[05:55:51] hmmm what's that?
[05:55:57] no idea
[05:56:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 57, down: 15, shutdown: 1BRPeering with AS13335 not established - The + flag cannot be used with the sub-query features described below.BRPeering with AS3216 not established - SOVAM-ASBRPeering with AS1273 not established - CWBRPeering with AS5650 not established - The + flag cannot be used with the sub-query features described below.BRPeering w
[05:56:55] seems related to esams, something about wikidata.org depends on it somehow
[05:57:28] that icinga check anyways
[05:58:01] it's listed in icinga under eqiad misc, but has an esams IP which is text-lb.esams
[05:58:43] weird
[06:00:33] it's defined in puppet as /usr/lib/nagios/plugins/check_http -H www.wikidata.org -I 91.198.174.192 -S -u "/w/api.php?action=query&meta=siteinfo&format=json&siprop=statistics" --linespan --ereg '"median":[^}]*"lag":[1-5]?[0-9]'
[06:00:53] I'm guessing that's intentional, to measure some kind of replication lag from eqiad->esams for wikidata, and now it can't reach it to check it.
[06:01:05] i doubt it's intentional
[06:01:29] I mean maybe it's intentional that it's checking esams, since it claims to be about lag
[06:01:35] no idea
[06:02:31] ok
[06:02:33] evoswitch called
[06:02:36] they're starting in a few mins
[06:02:54] cool
[06:07:23] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 72, down: 0, shutdown: 1
[06:08:44] so are we recording purges?
[06:09:14] yes, with socat. the format is a bit ugly, but I think it can be post-processed if necc
[06:09:24] on which box?
[06:09:31] cp1008.eqiad.wmnet
[06:09:39] ok
[06:09:53] (it's my ssl test host)
[06:10:51] OE15 went missing
[06:11:24] yay
[06:11:43] and back
[06:13:48] perhaps we should upgrade the software on the switches and routers now we have no traffic anyway
[06:14:43] sure
[06:17:02] ok they're done
[06:17:08] so we can power up the machines
[06:17:10] that was quick :)
[06:17:16] both racks are good to go?
[06:17:19] seems so
[06:17:49] ok starting on that
[06:19:48] i'm looking at the software upgrades in the meantime
[06:25:50] <_joe_> morning :)
[06:25:55] hi
[06:27:00] cool, ipsec reestablished as soon as cp3001 came back
[06:27:31] <_joe_> jgage: :)
[06:27:44] not a single unencrypted ping reply got through before sec was renegotiated
[06:27:54] RECOVERY - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4055 bytes in 0.515 second response time
[06:27:57] RECOVERY - LVS HTTPS IPv4 on upload-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 734 bytes in 0.508 second response time
[06:28:54] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69134 bytes in 0.486 second response time
[06:29:04] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:53] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:06] <_joe_> oh mod_passenger o'clock
[06:30:43] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 4056 bytes in 0.514 second response time
[06:31:13] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69175 bytes in 0.572 second response time
[06:31:20] bblack: let me know when you're ready, so I can start the switch upgrade
[06:31:26] ready for?
[06:31:38] when all machines are turned on
[06:31:41] the network will go down
[06:31:55] not management in theory though
[06:31:58] they're all on now I think, but waiting for the last batch to show up ok in icinga
[06:32:02] ok
[06:32:13] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 734 bytes in 0.538 second response time
[06:32:23] then I'll start
[06:32:24] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 21982 bytes in 0.595 second response time
[06:32:37] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22013 bytes in 0.605 second response time
[06:32:50] stupid pages
[06:32:57] i'm not getting any
[06:33:03] i got them
[06:33:05] might be because of time
[06:33:05] <_joe_> all hosts are up in icinga
[06:33:10] yup
[06:33:16] ok starting
[06:33:23] <_joe_> mark: we get paged 8 AM - midnight I guess
[06:33:28] yes
[06:33:36] the console server didn't like my attempt to parallel ssh into it like 40x at once, I had to do it in 3 batches heh
[06:33:37] mark, still junos 11.x?
[06:34:03] what do you mean?
[06:34:18] iirc the switches are running junos version 11.something
[06:34:28] just wondering if you're upgrading within 11.x or to 12 etc
[06:34:32] i think they're up to 14.x
[06:34:49] once you're done blipping the network, I'll stop the packet log too, as they should be getting invalidation flow again
[06:35:44] <_joe_> bblack: once we're done, I've seen you saying some service showed up as misc_eqiad and that was wrong; I guess this might be a puppet error
[06:36:13] i'm upgrading to 12.3R6
[06:36:29] neat. *looks for changelog*
[06:36:34] _joe_: maybe, I'm not really sure what's up with that check
[06:37:54] _joe_: it's check_wikidata in puppet if you grep for it. it has an explicit esams IP
[06:38:07] <_joe_> ok
[06:45:48] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:26] starting upgrade
[06:46:29] this can get messy
[06:46:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:52:09] oh are we going to lose lots of other hosts? should I try to downtime all of esams or something to avoid spam?
[06:52:20] yes
[06:52:24] not of the paging kind though ;)
[06:52:27] but all will go down
[06:53:11] package is installed, upgrade will start when I reboot individual switches
[06:53:23] not of the paging kind?
[06:53:31] individual servers won't page no
[06:53:35] LVS will of course
[06:53:38] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:16] so what do you want to do?
[06:55:38] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1509 bytes in 0.857 second response time
[06:55:38] I downtimed lvs300[12]
[06:55:46] the other two are still in downtime
[06:55:51] those won't page either
[06:55:54] just the LVS ips will
[06:55:59] oh really?
[06:56:03] yeah
[06:56:11] the LVS ips are all in downtime at this point anyways
[06:56:15] ok
[06:56:21] i'm going to reboot a single member switch first
[06:57:15] Uptime: 768d7h13m34s
[06:57:32] heh
[06:57:38] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100%
[06:58:17] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:58:17] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100%
[06:58:34] that's the EX450
[06:58:35] 4500
[06:58:43] which is the odd duck ;)
[06:58:57] PROBLEM - Host cp3015 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3016 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3005 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3018 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3010 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:58] PROBLEM - Host cp3022 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:58] PROBLEM - Host cp3017 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:59] PROBLEM - Host cp3008 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:59:05] wheee spam
[06:59:07] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:07] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:07] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:18] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:18] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:30] <_joe_> wait for my next puppet change that involves all production :)
[06:59:38] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: csw2-esams:xe-5/0/39 {#10088} [10Gbps DF]BRxe-1/1/0: down - Core: csw2-esams:xe-5/0/38 {#10089} [10Gbps DF]BR
[07:00:55] I guess since the text caches are still in downtime and they're the bulk of the machines, it won't be so bad in here
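mark's procedure above (install the package, then reboot each member) matches the standard two-step JunOS upgrade on an EX virtual chassis. A sketch from memory of the JunOS CLI; the package filename is illustrative:

    # on csw2-esams -- stage the new image (takes effect only on reboot)
    request system software add /var/tmp/jinstall-ex-4500-12.3R6-domestic-signed.tgz
    # reboot one virtual-chassis member at a time to limit the blast radius
    request system reboot member 5

Rebooting a single member first, as done here, is a cheap sanity check before cycling the rest of the chassis.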
[07:01:22] i like irc spam
[07:01:37] you can nicely see what's happening with it
[07:01:51] I guess with a big change like this, it's ok
[07:01:56] <_joe_> mark: do you want moar irc spam?
[07:02:10] i do, others don't ;p
[07:02:12] my general habit is to try to remember to downtime things so I don't make people randomly wonder wtf is up and go looking or worrying
[07:02:14] <_joe_> :)
[07:02:20] <_joe_> bblack: me too
[07:03:03] <_joe_> but well, when puppet fails to apply a change, even if the catalog compiles, everyone will be annoyed
[07:03:12] * yuvipandaops wants some +1 on https://gerrit.wikimedia.org/r/#/c/174430/ which increases spam here *slightly*
[07:04:00] (CR) Giuseppe Lavagetto: [C: -1] "I think that screams in -operations should be limited to production." [puppet] - https://gerrit.wikimedia.org/r/174430 (owner: Yuvipanda)
[07:04:14] <_joe_> yuvipandaops: you know I don't agree :)
[07:05:07] someone should go hack our icinga-bot to be less-spammy in general
[07:05:08] well, betalabs breakages are caused by changes to production that don't take into account betalabs exists
[07:05:30] i'm going to reboot the rest of the switches
[07:05:53] <_joe_> yuvipandaops: betalabs breakages are caused by the fact we don't design with multi env in mind, and that beta diverges from prod in silly ways
[07:06:16] after it hits some immediate-term ratelimit (like 5 messages in 15 seconds?), it should go into a spam-reduction mode, and maybe put out a message every 30 seconds like "55x icinga events affecting hosts: cp3011, cp3012, ..."
[07:06:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/3: down - Peering: ! Equinix Exchange {#2648} [10Gbps DF]BR
[07:06:28] <_joe_> bblack: yes
[07:07:00] true, so solution is to fix the designing, and screaming whenever it goes wrong might be a good first step :)
[07:07:28] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:28] PROBLEM - Host cp3011 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:28] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:28] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:58] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100%
[07:08:56] <_joe_> yuvipandaops: I'd rather mail us
[07:09:14] hmm
[07:09:16] that's true
[07:09:29] email is so overused and messy
[07:10:14] I'd rather have a site like statuslog.wm.o that shows the icinga log entries with a nice UI, and I can configure it to filter/display whatever and to make little dinging sounds on new events if I want, etc
[07:10:25] but no one will be looking at it
[07:10:26] (PS3) Yuvipanda: puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472)
[07:10:27] i think irc is fine
[07:10:31] and a real annoyance
[07:10:38] IRC is a good start, at least...
[07:10:44] <_joe_> bblack: I tend to care immediately of any icinga-wm irc notification I see
[07:10:57] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[07:11:14] we all have browser tabs open all day anyways right? just make it make a sound over there, and then it doesn't spam our conversations in here
[07:11:41] you know what this needs? AN ANDROID APP!
[07:11:45] * yuvipandaops slinks away
[07:11:46] yes!
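bblack's throttling idea at [07:06:16] is simple enough to sketch. A minimal illustration in Python of the proposed behaviour, with the thresholds and rollup format taken from his description (this is not the actual bot code):

    import time

    class SpamGuard:
        """Relay messages until 5 arrive within 15s; then batch into rollups."""
        def __init__(self, burst=5, window=15, rollup_every=30):
            self.burst, self.window, self.rollup_every = burst, window, rollup_every
            self.recent = []   # timestamps of recently relayed messages
            self.held = []     # (host, line) pairs held during spam mode
            self.last_rollup = 0.0

        def handle(self, host, line, now=None):
            now = now if now is not None else time.time()
            # forget relays older than the rate window
            self.recent = [t for t in self.recent if now - t < self.window]
            if self.held or len(self.recent) >= self.burst:
                # spam-reduction mode: hold the event, emit periodic rollups
                self.held.append((host, line))
                if now - self.last_rollup >= self.rollup_every:
                    hosts = sorted({h for h, _ in self.held})
                    print("%dx icinga events affecting hosts: %s"
                          % (len(self.held), ", ".join(hosts)))
                    self.held, self.last_rollup = [], now
            else:
                # normal pass-through
                self.recent.append(now)
                print(line)

Once the held queue drains and the 15-second window empties out, the guard falls back to relaying messages one by one.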
[07:11:58] <_joe_> yuvipandaops: anag if well configured can help
[07:12:04] <_joe_> I think sean uses it
[07:12:14] one that needs permission to modify my global security settings and read all my email, just so it can launch a browser wrapper
[07:12:14] ooooh, nice
[07:12:22] not to mention your location
[07:12:29] and constant microphone access
[07:12:38] just so it can alert extra loudly when you're in the shower
[07:12:45] heh
[07:12:47] PROBLEM - Host wikidata is DOWN: CRITICAL - Network Unreachable (91.198.174.192)
[07:13:04] _joe_: https://gerrit.wikimedia.org/r/#/c/174132/ reworked after you pointed out obvious flaw....
[07:13:05] there's check_wikidata again
[07:15:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0
[07:17:47] PROBLEM - Host ns2-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::e
[07:17:52] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:52] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:52] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:52] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:01] PROBLEM - Host hooft is DOWN: CRITICAL - Network Unreachable (91.198.174.113)
[07:18:02] PROBLEM - Host ns2-v4 is DOWN: CRITICAL - Network Unreachable (91.198.174.239)
[07:18:02] PROBLEM - Host nescio is DOWN: CRITICAL - Network Unreachable (91.198.174.106)
[07:18:02] PROBLEM - Host mr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.247)
[07:18:08] PROBLEM - Host eeden is DOWN: CRITICAL - Network Unreachable (91.198.174.121)
[07:18:08] PROBLEM - Host amslvs3 is DOWN: CRITICAL - Network Unreachable (91.198.174.111)
[07:18:08] PROBLEM - Host 91.198.174.6 is DOWN: CRITICAL - Network Unreachable (91.198.174.6)
[07:18:17] PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:27] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:58] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:58] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 48, down: 13, dormant: 0, excluded: 2, unused: 0BRae1.102: down - Subnet toolserver1-esamsBRae1.401: down - cr1-esams:ae1.401BRae1.100: down - Subnet public1-esamsBRxe-0/0/0: down - Core: csw2-esams:xe-2/1/1 (GBLX leg 1) {#14006} [10Gbps DF]BRae1.405: down - mr1-esams:ge-0/0/1.405BRae1: down - csw2-esams:ae2BRae1.301: down - Subnet
[07:19:38] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:39] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:23:43] i don't think esams is coming back
[07:25:07] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 81, down: 0, dormant: 0, excluded: 2, unused: 0
[07:25:13] I was about to say wikipedia is very slow relatively, but now i know why, so i'll shut up
[07:25:17] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 94.76 ms
[07:25:21] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 94.88 ms
[07:25:26] maybe it will
[07:25:50] _joe_: feel free to merge, https://gerrit.wikimedia.org/r/#/c/173763/
[07:26:07] <_joe_> kart_: ok thanks I was about to ping you actually
[07:26:14] :)
[07:26:34] * yuvipandaops also prods _joe_ with https://gerrit.wikimedia.org/r/#/c/174132/ again :)
[07:26:38] <_joe_> I'll wait for Reedy, he has an apache patch as well and I'd really like to send them together
[07:26:55] (CR) Giuseppe Lavagetto: [C: 1] puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472) (owner: Yuvipanda)
[07:27:07] <_joe_> yuvipandaops: ^^ I was _already_ doing it
[07:27:12] hehe
[07:27:33] <_joe_> before you said "now I don't look 30 anymore" like being 30 means being super-old
[07:29:12] (CR) Yuvipanda: [C: -1] "Hmm, not sure if this belongs in this module. Perhaps have a pdu (or facilities) module, and then put this there?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173999 (owner: Dzahn)
[07:29:53] _joe_: aww/ow, hmm, I didn't mean that.
[07:30:11] <_joe_> yuvipandaops: eheh I know, I get derailed easily
[07:30:16] :)
[07:31:28] (PS4) Yuvipanda: puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472)
[07:33:13] (CR) Yuvipanda: [C: 2] puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472) (owner: Yuvipanda)
[07:35:38] hmmm
[07:35:47] cloudadmins should be able to edit all of hiera, methinks
[07:39:09] * bblack isn't a cloudadmin :(
[07:39:12] <_joe_> yuvipanda: yes, and we need a per-instance lookup too, sooner or later
[07:39:30] well, I can make all of ops able to edit too, but should be cloudadmin...
[07:39:52] _joe_: is trivial already, I think. Hiera:/ can already be a page, editable only by cloudadmins.
[07:40:00] yeah I run into that all the time, where I go to look at something on wikitech and it tells me to go away because I'm not a cloudadmin
[07:40:21] just become cloudadmin! :)
[07:40:22] and then I'm like "I'm so going to go log into whatever this is running on and give it to myself", but then I never bother
[07:40:28] hehe
[07:40:38] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:40:42] _joe_: also, we need labs.yaml, similar to production.yaml, I think.
[07:40:48] I don't think we have it already
[07:40:51] <_joe_> yuvipanda: make that editable by project admins
[07:41:04] <_joe_> yuvipanda: I think I added it in a patch of gabriel's
[07:41:05] _joe_: bah, I meant projectadmins, not cloudadmins
[07:41:53] <_joe_> yuvipanda: oh ok
[07:41:57] ok, making hiera editable by cloudadmins.
[07:42:00] * yuvipanda works on patch
[07:42:06] hmm, I seem to get distracted all the time.
[07:42:27] <_joe_> can you add a section of docs here https://wikitech.wikimedia.org/wiki/Puppet_Hiera about labs?
[07:46:51] _joe_: I've it in a tab somewhere, let me finish that up.
[07:53:06] _joe_: hmm, as an update, subpages aren't enabled yet, I'll co-ordinate with MW folks to deploy that as well (is a config change)
[07:53:08] let me file a bug
[07:53:43] <_joe_> yuvipanda: ok np, I also need to update the hiera config I guess
[07:53:56] yeah
[07:58:48] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:00:43] (CR) Giuseppe Lavagetto: varnish: remove cache separation for HHVM (1 comment) [puppet] - https://gerrit.wikimedia.org/r/174390 (owner: Giuseppe Lavagetto)
[08:01:00] (PS2) Giuseppe Lavagetto: varnish: remove cache separation for HHVM [puppet] - https://gerrit.wikimedia.org/r/174390
[08:03:38] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:38] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:50] there's the power cycle
[08:04:21] _joe_: filed https://phabricator.wikimedia.org/T1356?workflow=create
[08:04:48] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 59, down: 12, dormant: 0, excluded: 2, unused: 0BRae1.102: down - Subnet toolserver1-esamsBRae1.401: down - cr1-esams:ae1.401BRae1.100: down - Subnet public1-esamsBRxe-0/0/0: down - Core: csw2-esams:xe-2/1/1 (GBLX leg 1) {#14006} [10Gbps DF]BRae1.405: down - mr1-esams:ge-0/0/1.405BRae1: down - csw2-esams:ae2BRae1.301: down - Subnet
[08:10:38] RECOVERY - Host cp3013 is UP: PING WARNING - Packet loss = 50%, RTA = 96.79 ms
[08:10:42] oh
[08:10:43] there we go
[08:10:48] RECOVERY - Host cp3011 is UP: PING OK - Packet loss = 0%, RTA = 92.92 ms
[08:10:48] RECOVERY - Host cp3012 is UP: PING OK - Packet loss = 0%, RTA = 94.63 ms
[08:10:48] RECOVERY - Host cp3022 is UP: PING WARNING - Packet loss = 86%, RTA = 88.46 ms
[08:10:48] RECOVERY - Host cp3015 is UP: PING WARNING - Packet loss = 93%, RTA = 94.63 ms
[08:10:48] RECOVERY - Host cp3014 is UP: PING WARNING - Packet loss = 37%, RTA = 95.42 ms
[08:10:48] RECOVERY - Host ms-be3001 is UP: PING WARNING - Packet loss = 44%, RTA = 96.18 ms
[08:10:48] RECOVERY - Host ms-be3002 is UP: PING WARNING - Packet loss = 44%, RTA = 95.39 ms
[08:10:49] RECOVERY - Host ms-be3003 is UP: PING WARNING - Packet loss = 44%, RTA = 95.28 ms
[08:10:49] RECOVERY - Host cr1-esams is UP: PING WARNING - Packet loss = 37%, RTA = 96.48 ms
[08:10:54] !
[08:10:58] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.27 ms
[08:10:58] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 95.54 ms
[08:10:58] RECOVERY - Host cp3021 is UP: PING OK - Packet loss = 0%, RTA = 95.75 ms
[08:10:58] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 95.23 ms
[08:11:09] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 95.57 ms
[08:11:09] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 95.30 ms
[08:11:09] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.45 ms
[08:11:10] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 97.35 ms
[08:11:10] RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 96.16 ms
[08:11:10] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 95.13 ms
[08:11:10] RECOVERY - Host amslvs3 is UP: PING OK - Packet loss = 0%, RTA = 98.07 ms
[08:11:11] RECOVERY - Host nescio is UP: PING OK - Packet loss = 0%, RTA = 96.55 ms
[08:11:11] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 95.11 ms
[08:11:12] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 95.53 ms
[08:11:12] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 95.50 ms
[08:11:13] RECOVERY - Host cp3007 is UP: PING OK - Packet loss = 0%, RTA = 96.15 ms
[08:11:13] RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 96.35 ms
[08:11:14] RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 97.04 ms
[08:11:14] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 95.34 ms
[08:11:15] RECOVERY - Host cp3018 is UP: PING OK - Packet loss = 0%, RTA = 95.86 ms
[08:11:15] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 95.70 ms
[08:11:16] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 95.03 ms
[08:11:16] RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 98.16 ms
[08:11:18] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 95.20 ms
[08:11:18] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 94.73 ms
[08:11:18] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 96.29 ms
[08:11:23] yay, all good
[08:11:27] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 99.33 ms
[08:11:29] <_joe_> :)
[08:11:37] <_joe_> mark: no daytrip for you, it seems
[08:11:43] indeed
[08:11:48] well i really need to go there anyway some time soon
[08:11:49] but yeah
[08:11:57] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 81, down: 0, dormant: 0, excluded: 2, unused: 0
[08:12:07] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 95.43 ms
[08:12:58] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: puppet fail
[08:13:07] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[08:13:07] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail
[08:13:09] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[08:13:09] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail
[08:13:10] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[08:13:17] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail
[08:13:19] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[08:13:30] <_joe_> oh I love "puppet last run" spam
[08:13:37] <_joe_> at least now it's accurate
[08:13:48] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: puppet fail
[08:13:48] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail
[08:13:58] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: puppet fail
[08:14:07] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[08:14:12] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[08:14:12] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[08:14:13] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[08:14:13] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[08:15:08] RECOVERY - Host wikidata is UP: PING OK - Packet loss = 0%, RTA = 95.87 ms
[08:15:18] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[08:15:58] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:15:58] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:16:08] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:17:08] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[08:17:18] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:17:19] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[08:18:09] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:18:27] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:20:17] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:21:19] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:21:19] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:22:18] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:22:18] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:22:23] (PS1) Yuvipanda: Add .gitreview file [software/ircyall] - https://gerrit.wikimedia.org/r/174641
[08:22:28] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:22:38] (CR) Yuvipanda: [C: 2 V: 2] Add .gitreview file [software/ircyall] - https://gerrit.wikimedia.org/r/174641 (owner: Yuvipanda)
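The .gitreview file merged above is what lets `git review` find the right Gerrit remote; it follows a standard INI format. Roughly what it would contain for this repo, reconstructed from the usual convention rather than copied from the patch (the project path is an assumption):

    [gerrit]
    host=gerrit.wikimedia.org
    port=29418
    project=operations/software/ircyall.git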
[08:23:18] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:23:28] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:25:27] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:25:28] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:26:27] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[08:26:27] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[08:26:28] (PS1) BBlack: Revert "esams local nets -> eqiad" [dns] - https://gerrit.wikimedia.org/r/174644
[08:26:47] (CR) BBlack: [C: 2] Revert "esams local nets -> eqiad" [dns] - https://gerrit.wikimedia.org/r/174644 (owner: BBlack)
[08:27:27] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[08:27:33] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:28:08] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:30:17] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:30:31] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:35:00] (PS1) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[08:35:17] ori: https://gerrit.wikimedia.org/r/#/c/174643/
[08:38:42] AaronSchulz: reading
[08:39:12] the ternary is strong with this one
[08:40:48] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR
[08:46:26] the null / '' thing is a little evil
[09:00:30] (PS1) BBlack: Revert "esams drain: GR/HU/NL/NO/PL/RO -> eqiad, LU/IM/IT -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174648
[09:00:32] (PS1) BBlack: Revert "esams drain: 13x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174649
[09:00:34] (PS1) BBlack: Revert "esams drain: IE/IS/PT->ulsfo, FR/ES eqiad->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174650
[09:00:36] (PS2) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[09:00:44] (CR) BBlack: [C: 2] Revert "esams drain: GR/HU/NL/NO/PL/RO -> eqiad, LU/IM/IT -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174648 (owner: BBlack)
[09:00:57] (CR) BBlack: [C: 2] Revert "esams drain: 13x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174649 (owner: BBlack)
[09:01:12] (CR) BBlack: [C: 2] Revert "esams drain: IE/IS/PT->ulsfo, FR/ES eqiad->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174650 (owner: BBlack)
[09:05:56] (PS1) BBlack: Revert "esams drain: GB -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174652
[09:05:58] (PS1) BBlack: Revert "esams drain: AF + 6x countries esams->eqiad" [dns] - https://gerrit.wikimedia.org/r/174653
[09:06:00] (PS1) BBlack: Revert "esams drain: rest of AS esams->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174654
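These reverts repool esams in GeoDNS: the original "drain" commits had pointed each country's traffic at eqiad or ulsfo instead. In gdnsd, which serves these maps, a drain is roughly an edit to the datacenter preference lists inside the geoip plugin's map. A heavily hedged sketch of the general shape, from memory of gdnsd's config syntax and not from the actual operations/dns repo:

    # illustrative only -- the real map lives in the operations/dns repo
    plugins => {
      geoip => {
        maps => {
          generic-map => {
            datacenters => [eqiad, ulsfo, esams]
            map => {
              EU => {
                # repooled preference order; during the drain GB led with ulsfo
                GB => [esams, eqiad, ulsfo]
              }
            }
          }
        }
      }
    }

Draining per country, as the commit subjects show, lets traffic shift in controlled slices rather than all at once.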
[09:07:52] (CR) BBlack: [C: 2] Revert "esams drain: GB -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174652 (owner: BBlack)
[09:08:05] (CR) BBlack: [C: 2] Revert "esams drain: AF + 6x countries esams->eqiad" [dns] - https://gerrit.wikimedia.org/r/174653 (owner: BBlack)
[09:08:16] (CR) BBlack: [C: 2] Revert "esams drain: rest of AS esams->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174654 (owner: BBlack)
[09:11:43] (PS3) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[09:22:18] (PS1) BBlack: Revert "esams drain: 7x->ulsfo + 7x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174655
[09:22:20] (PS1) BBlack: Revert "esams drain: CH/CZ/DE->eqiad, TR/UA->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174656
[09:22:22] (PS1) BBlack: Revert "esams drain: RU->ulsfo, 8x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174657
[09:23:01] (CR) BBlack: [C: 2] Revert "esams drain: 7x->ulsfo + 7x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174655 (owner: BBlack)
[09:23:12] (CR) BBlack: [C: 2] Revert "esams drain: CH/CZ/DE->eqiad, TR/UA->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174656 (owner: BBlack)
[09:23:23] (CR) BBlack: [C: 2] Revert "esams drain: RU->ulsfo, 8x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174657 (owner: BBlack)
[09:27:57] (PS4) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[09:44:35] Sigh, people relying on Commons IPs https://www.mediawiki.org/w/index.php?title=Talk:InstantCommons&diff=0&oldid=1198749
[09:46:52] <_joe_> Nemo_bis: the only answer I have is "do regular dns queries and update your firewall accordingly"
[09:47:17] <_joe_> and yes, that's sad
[09:47:51] Nemo_bis: or don't whitelist IPs at all
[09:47:55] Yes, I thought of that but I gave a hammer suggestion instead https://www.mediawiki.org/wiki/Talk:InstantCommons#Behind_a_firewall_or_load_balancer
[09:48:20] If they're so keen on specifying IPs
[09:49:20] <_joe_> maybe we can explain him why those IP changes happen
[09:49:25] <_joe_> him/her
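_joe_'s "do regular dns queries and update your firewall accordingly" can be automated on the remote side. A rough sketch of the idea for an InstantCommons host behind an egress firewall; the chain name and cron cadence are assumptions, and this is not an endorsement of IP whitelisting, which the thread discourages:

    #!/bin/sh
    # refresh-wikimedia-ips.sh -- re-resolve endpoints and rebuild an
    # iptables chain; run from cron (e.g. hourly) so DNS changes are tracked
    CHAIN=wikimedia-egress   # assumed chain, created beforehand
    iptables -F "$CHAIN"
    for host in upload.wikimedia.org commons.wikimedia.org; do
        for ip in $(dig +short A "$host"); do
            iptables -A "$CHAIN" -d "$ip" -p tcp --dport 443 -j ACCEPT
        done
    done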
[10:14:09] (PS1) Giuseppe Lavagetto: deployment: make scap proxies configured in one place [puppet] - https://gerrit.wikimedia.org/r/174664
[10:14:28] <_joe_> ok this ^^ is a "real hiera benefit"
[11:15:53] (PS1) Filippo Giunchedi: syslog: deprecate /home/wikipedia/syslog [puppet] - https://gerrit.wikimedia.org/r/174673
[11:23:32] (CR) Matanya: [C: 1] ganglia: remove pmtpa varnish stanza [puppet] - https://gerrit.wikimedia.org/r/174205 (owner: Dzahn)
[11:28:23] (CR) Matanya: realm.pp - remove pmtpa (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173476 (owner: Dzahn)
[11:29:48] (CR) Matanya: [C: 1] Change ru.wikinews.org to HTTPS only. [puppet] - https://gerrit.wikimedia.org/r/173078 (owner: JanZerebecki)
[11:42:26] _joe_: well not really, we already had a way to assign global variables like $::site etc.
[11:42:50] <_joe_> paravoid: yes of course
[11:43:16] <_joe_> but this is neater, don't you think? :)
[11:45:05] honestly? I think I'll start liking hiera a lot more once we start to do hierarchical stuff :)
[11:45:24] <_joe_> well, we already do a (little) bit
[11:45:32] _joe_: merged? Sorry, I was in meetings etc
[11:45:53] <_joe_> kart_: nope, I was planning to do it after lunch
[11:46:03] _joe_: sure. Thanks.
[11:46:10] like for this change specifically
[11:46:12] <_joe_> right now I'm trying to get what's wrong with labs/private :)
[11:46:29] why eqiad.yaml? :)
[11:46:42] why can't it be the 'default' scap proxies, that we can override in a DC if we want to
[11:46:53] <_joe_> right
[11:47:51] tbh, I don't really like how we do hiera so far :)
[11:48:05] but I need to write something more detailed than that, which is why I haven't spoken up yet
[11:48:06] <_joe_> paravoid: what part you don't like?
[11:48:13] <_joe_> yes please
[11:48:20] <_joe_> it's all up for discussion btw
[11:48:41] I don't like having to e.g. define "nagios_group = swift" on a hostname basis, where that's something that belongs to the role class
[11:48:44] (PS1) Filippo Giunchedi: txstatsd: gather runtime self metrics under statsd [puppet] - https://gerrit.wikimedia.org/r/174675
[11:49:09] or the whole swift configs that are in there
[11:49:31] and at the same time, we're not really benefitting from hierarchy
[11:49:37] <_joe_> paravoid: oh that (nagios_group) is there because I didn't want to restructure everything everywhere
[11:49:43] springle: meanwhile, we need to add tables in Beta. Who can do that?
[11:49:50] <_joe_> but I agree completely
[11:49:54] springle: ping me if you're around.
[11:50:01] I was hoping that we'd be finally able to say "these are our ntp servers" somewhere centrally, but also say "but for esams, just use the esams one instead"
[11:50:06] <_joe_> I don't see why swift configs are a problem
[11:50:22] <_joe_> paravoid: that's the idea, and you're able to do that
[11:50:29] there's no "default.yml" :)
[11:50:37] <_joe_> paravoid: "common"
[11:50:47] <_joe_> and yes got your point about that
[11:51:12] <_joe_> "common" is also already hierarchically structured in distinct files, so I may just move things there
[11:52:08] the problem with e.g. the swift config is that it's stuff that we used to have in its own place, under a swift role class that you wouldn't even look at if you weren't working on swift
[11:52:23] and now it's on the "eqiad" file, or the "codfw" file
[11:52:33] that's just flattening out all of our abstractions
[11:52:36] <_joe_> and in common, it will be under common/swift/whatever
[11:52:50] what about swift eqiad's config?
[11:53:13] <_joe_> oh ask godog :) I think he needs to backport swift_new there
[11:53:23] ok, what about swift codfw's config
[11:53:28] that's not going to be under common/, will it?
[11:53:51] <_joe_> that should be under codfw/swift/whatever.yaml
[11:54:00] <_joe_> I do agree that flat files are horrible
[11:54:06] there is no codfw/ yet?
[11:54:11] as far as I can see?
[11:54:16] <_joe_> no, I need to make a change for that
[11:54:19] right :)
[11:54:41] <_joe_> I'm gathering your inputs right now
[11:54:45] I didn't say I didn't like hiera in general, just that we need some finetuning there :)
[11:54:54] <_joe_> I completely agree
[11:55:05] I really like that we're doing it and I'm really grateful that someone's doing it
[11:55:16] so I don't mean to be all negative here, that's why I haven't spoken up so far :)
[11:55:18] <_joe_> and I also was sure that we would change things around when we start actually using those
[11:55:26] <_joe_> yeah don't worry
[11:55:35] <_joe_> I'm not reading this negatively, I'm old
[11:55:37] <_joe_> :P
[11:55:52] I see a lot of abuses over there already
[11:56:00] $ cat hosts/analytics1027.yaml
[11:56:00] monitoring::configuration::group: analytics_eqiad
[11:56:00] cluster: analytics
[11:56:07] (ewww)
[11:56:19] <_joe_> well, we had that abuse in site.pp
[11:56:29] sure :)
[11:56:39] it wasn't pretty at all, at least it was in *one* place :)
[11:56:42] now it's in two
[11:57:19] <_joe_> because we're in transition
[11:57:27] how would it work in the end?
[11:57:45] <_joe_> hopefully, each server will be assigned a "main role"
[11:58:00] <_joe_> and all those global variables would be configured in the role
[11:58:05] <_joe_> in its hiera file
[11:58:16] 0099
[11:58:24] so *roles* would have a hiera file, not hosts?
[11:58:35] how would you do that?
[11:58:38] <_joe_> unluckily, both
[11:59:31] <_joe_> I had a couple of ideas on how to simplify that, but I need to wrap my head around that, what I did now doesn't satisfy me
[11:59:48] <_joe_> but I need some time to find something I think is better than what we have
[12:00:18] <_joe_> I'll write a page on wikitech with options and write to ops@
[12:00:34] <_joe_> I just need a break of ~ 2 days to really work on that
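One way to read paravoid's wish list ("these are our ntp servers" centrally, "but for esams, just use the esams one") plus _joe_'s "main role" idea is a lookup hierarchy going from host to role to site to a common default. A hand-wavy sketch of what that could look like; paths, the role variable, and the server names are invented for illustration, not the repo's actual layout:

    # hiera.yaml -- hierarchy order is illustrative
    :hierarchy:
      - "hosts/%{::hostname}"     # last-resort per-host overrides
      - "roles/%{::role}"         # assumes a 'main role' fact/variable exists
      - "%{::site}"               # e.g. eqiad, esams, codfw
      - common                    # the central default

    # common.yaml -- invented server names
    ntp_servers: ['ntp1001.eqiad.wmnet', 'ntp1002.eqiad.wmnet']

    # esams.yaml -- site-level override wins over common
    ntp_servers: ['ntp3001.esams.wmnet']

With that shape, things like nagios_group or the swift settings would live in the role's file instead of per-host or per-datacenter flat files, which is the restructuring being debated above.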
Task is https://phabricator.wikimedia.org/T1075 [13:13:36] apparently should be about dropping /var/lib/carbon/whisper/jenkins/ [13:15:10] and restarting txstatsd [13:24:49] PROBLEM - very high load average likely xfs on ms-be1007 is CRITICAL: CRITICAL - load average: 239.40, 123.25, 60.26 [13:25:19] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [13:34:29] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [13:55:46] (03CR) 10Hashar: "This change breaks the puppet run on deployment-bastion.eqiad.wmflabs:" [puppet] - 10https://gerrit.wikimedia.org/r/173353 (owner: 10Ori.livneh) [13:59:48] hashar: nice, thanks! yeah yuvipanda I'll restart txstatsd too [14:00:58] !log restart txstatsd on tungsten to stop receiving jenkins metrics [14:01:01] Logged the message, Master [14:01:27] godog: for Zuul we would need to tweak Zuul reporting capabilities [14:01:40] Zuul statsd is an all-or-nothing switch [14:02:06] hasharCall: yep, do we have a local copy of the code already? [14:02:54] godog: yes that is upstream + some huge hack to add python modules + a couple patches pending merge/approval per upstream [14:14:53] (03PS3) 10Giuseppe Lavagetto: deployment: make scap proxies configured in one place [puppet] - 10https://gerrit.wikimedia.org/r/174664 [14:22:40] (03PS6) 10Alexandros Kosiaris: WIP: Modularize torrus [puppet] - 10https://gerrit.wikimedia.org/r/174389 [14:33:31] hashar: sth like https://gerrit.wikimedia.org/r/#/c/174691/ ? [14:34:00] in call for now [14:40:50] heya apergos [14:40:56] this change is good to go: [14:40:56] https://gerrit.wikimedia.org/r/#/c/168104/ [14:40:59] but, it depends on this: [14:41:01] https://gerrit.wikimedia.org/r/#/c/144640/ [14:49:22] (03PS1) 10Giuseppe Lavagetto: hiera: a few tweaks [puppet] - 10https://gerrit.wikimedia.org/r/174694 [14:49:48] (03PS2) 10Giuseppe Lavagetto: Make qualitywiki HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/173493 (owner: 10Reedy) [14:50:16] (03CR) 10Giuseppe Lavagetto: [C: 032] Make qualitywiki HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/173493 (owner: 10Reedy) [14:53:03] godog: looking :) [14:54:20] (03PS4) 10Giuseppe Lavagetto: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [14:54:24] (03PS1) 10Cmjohnson: adding new mw servers to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/174695 [14:54:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [14:54:46] <_joe_> cmjohnson: \o/ [14:55:05] godog: yeah something like that [14:55:36] godog: though, as I propose patches to upstream, I will probably make the keys configurable via zuul.conf instead [14:57:05] hashar: sure that's even better [14:57:53] godog: I will create a subtask in our phabricator [14:57:58] then file a bug/feature request upstream [14:58:01] and work on a patch [14:58:22] at least dropping jenkins.ci hierarchy should have bought us some time / disk space [14:58:50] hashar: yep, sounds good to me [15:01:21] !log Restarting Jenkins AND Zuul. Beta cluster jobs are still deadlocked. [15:01:25] Logged the message, Master [15:11:24] <_joe_> kart_: your change is merged [15:11:39] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:11:49] PROBLEM - check configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:11:58] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:11:59] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:11:59] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:09] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:39] PROBLEM - check if salt-minion is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:13:08] RECOVERY - Disk space on rhenium is OK: DISK OK [15:13:29] RECOVERY - check if salt-minion is running on rhenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:13:38] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:13:39] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output [15:13:48] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [15:13:49] RECOVERY - DPKG on rhenium is OK: All packages OK [15:13:49] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:15:28] _joe_: thanks. [15:21:41] (03CR) 10Cmjohnson: [C: 032] adding new mw servers to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/174695 (owner: 10Cmjohnson) [15:24:29] <_joe_> cmjohnson: whenever you need help/the servers are ready to image, just let me know [15:24:41] <_joe_> should I prepare the node defs as well in the meantime? [15:26:08] _joe_ they are ready to image except 2 ....apparently i typo'd the idrac cfg yesterday mw1229 and mw1239 [15:26:16] will get them later today [15:26:46] <_joe_> cmjohnson: ok so out of curiosity, why starting with mw1227 and not mw1221? [15:27:11] because there are 6 already in the rack that need to be relocated [15:27:18] <_joe_> oh ok [15:27:19] <_joe_> :) [15:27:25] <_joe_> thanks a lot! [15:27:32] so those 6 will be added once you are comfortable [15:27:54] <_joe_> which servers need moving? [15:28:00] oh the last 18 arrived ...will have those 12 for you tomorrow [15:28:37] <_joe_> so it's 38 total, I'd use 15 for api at least [15:28:50] mw1201-3 and mw1208-1210 [15:29:02] need to be relocated ^ [15:29:21] yep 38 total [15:29:21] <_joe_> cmjohnson: just to confirm, in the end we'll have mw1221-mw1258 right? [15:29:26] yes [15:29:35] <_joe_> ok, good! thanks! [15:29:45] * _joe_ on it [15:29:56] do you wanna img? [15:30:09] _joe_ ^ [15:30:13] <_joe_> well, first let me do the node defs [15:30:22] <_joe_> then we can just split the work I guess [15:30:46] ok, lmk [15:30:50] <_joe_> or, if the other servers arrived [15:31:04] <_joe_> you can focus on those and let me deal with imaging for now [15:31:15] <_joe_> I think that's even better [15:31:47] that would be great...cuz there is a lot of the bs that needs to be done...and it's time consuming (racktables mostly) [15:32:10] <_joe_> I also have a varnish change coming up, but I'll continue tomorrow morning [15:33:01] <_joe_> Coren, ottomata you could help as well if you're up to it [15:33:55] _joe_: I'm working on a scary problem with one of the virt boxen, but as soon as I squash it I'll see about doing a couple for you. [15:34:27] <_joe_> Coren: ok thanks, I'll put up a list somewhere [15:34:40] <_joe_> Coren: these are new servers, so it's going to be faster [15:34:54] Ooo. Fresh metal. 
:-) [15:34:58] <_joe_> I hope faster than gerrit is now [15:35:06] <_joe_> Coren: with kickass processors as well [15:37:17] <_joe_> is it me, or is gerrit very slow? [15:37:58] (03CR) 10Yuvipanda: [C: 04-2] "The more I think about it, the more this seems like a bad idea." [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [15:39:29] (03CR) 10Mark Bergsma: [C: 031] varnish: remove cache separation for HHVM [puppet] - 10https://gerrit.wikimedia.org/r/174390 (owner: 10Giuseppe Lavagetto) [15:43:46] <_joe_> mark: thanks [15:43:50] (03CR) 10Giuseppe Lavagetto: "@Yuvi: we do keep backups of list archives, so I don't see why we should restrict this. List admins are, moreover, the sole people respons" [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [15:50:12] <^d> _joe_: No problems for me. [15:50:44] * anomie assumes manybubbles will SWAT today [15:50:47] <_joe_> ^d: it was running git pull --rebase that was painfully slow [15:51:26] <^d> real 0m7.130s [15:51:26] <^d> user 0m1.681s [15:51:26] <^d> sys 0m1.530s [15:51:35] <^d> That was for mediawiki/core, hadn't pulled since last night [15:52:18] <_joe_> !log disabling puppet on all caches, before a pretty large change, will be re-enabled after a few tests [15:52:22] Logged the message, Master [15:52:30] (03PS3) 10Giuseppe Lavagetto: varnish: remove cache separation for HHVM [puppet] - 10https://gerrit.wikimedia.org/r/174390 [15:52:56] (03PS7) 10Alexandros Kosiaris: Modularize torrus [puppet] - 10https://gerrit.wikimedia.org/r/174389 [15:53:04] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: remove cache separation for HHVM [puppet] - 10https://gerrit.wikimedia.org/r/174390 (owner: 10Giuseppe Lavagetto) [15:53:34] <_joe_> come ooon jenkins [15:58:01] (03CR) 10Nemo bis: "But we have hundreds of lists, so the questions arise:" [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T1600). Please do the needful. [16:00:53] anomie: I'm actually going to do a meeting now :( I can validate but shouldn't SWAT while I'm in there [16:01:00] ^d: can you SWAT for me? [16:01:07] I'm sorry! [16:02:21] <^d> manybubbles: Yeah no worries [16:02:25] thanks! [16:05:07] ^d: I don't really have a test case for anything but enwiki. so long as searches don't blow up on wmf9 I think it is good [16:05:15] <^d> Yeah [16:05:58] <^d> Surprise surprise, qunit has failed on me again [16:08:18] (03CR) 10BryanDavis: "Did that modules/varnishkafka change sneak in by accident?"
[puppet] - 10https://gerrit.wikimedia.org/r/174664 (owner: 10Giuseppe Lavagetto) [16:13:30] <_joe_> bd808: thanks man [16:13:38] <_joe_> it did, damn submodules [16:15:00] !log demon Synchronized php-1.25wmf8/extensions/CirrusSearch: (no message) (duration: 00m 05s) [16:15:03] Logged the message, Master [16:15:36] !log demon Synchronized php-1.25wmf9/extensions/CirrusSearch: (no message) (duration: 00m 04s) [16:15:40] Logged the message, Master [16:26:35] <_joe_> !log puppet reenabled everywhere, change tested and live on all varnishes within the next 20 minutes [16:26:39] Logged the message, Master [16:29:52] _joe_: \o/ [16:30:08] <_joe_> ori: and we have more good news, new appservers arrived [16:30:16] \o/ \o/ [16:30:23] <_joe_> so we'll add quite a few hhvm appservers to api [16:31:28] ori, yyoo [16:32:02] (03PS8) 10Ottomata: add varnish::kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/174195 (owner: 10Ori.livneh) [16:32:04] ready for statsv? [16:32:40] ottomata: yep, but maybe we should wait, seeing as the cache separation change is rolling out to the varnishes atm [16:32:58] <_joe_> ori: btw, http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=HHVM%2520Appservers%2520eqiad&tab=m&vn=&hide-hf=false [16:33:04] oh. ok [16:33:09] <_joe_> the hhvm cluster is going to be _really_ bored soon [16:33:18] (03PS8) 10Alexandros Kosiaris: Modularize torrus [puppet] - 10https://gerrit.wikimedia.org/r/174389 [16:33:53] _joe_: do you think ottomata and i could push out a change to logging on bits? [16:34:16] _joe_: interesting, why did load decrease? [16:34:48] ori, lemme know, my car needs a jump and i need to call a cab to jump it. i will do it now if we are going to delay this [16:34:51] <_joe_> ori: the zend pool has more visitors => wider cache [16:34:57] <_joe_> that's my hypothesis [16:35:20] <_joe_> also, whatever is already cached by zend, will not be requested to hhvm again [16:36:04] makes sense. ottomata, i think we can do it now; it only affects bits, and not varnish itself [16:36:18] sorry to be slightly schizophrenic [16:36:47] ok [16:36:49] let's dooo it [16:36:55] ja? [16:37:03] ori, say merge! :) [16:37:08] merge! [16:37:35] (03CR) 10Ottomata: [C: 032] add varnish::kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/174195 (owner: 10Ori.livneh) [16:37:52] actually, wait! ha, we need to make the kafka topic :p [16:37:53] on it [16:37:58] RECOVERY - Varnish HTCP daemon on cp1008 is OK: PROCS OK: 1 process with UID = 112 (vhtcpd), args vhtcpd [16:38:05] !log demon Synchronized php-1.25wmf9/extensions/Math: (no message) (duration: 00m 06s) [16:38:05] i'm going to just give it the same settings as the webrequest ones, [16:38:06] <^d> James_F: ^^^ [16:38:08] Logged the message, Master [16:38:11] rep =3, partitions = 12 [16:38:13] ^d: Testing. [16:38:16] hey who killed my downtime? :p [16:38:38] oh nobody did [16:38:47] I guess it reported it down so it's reporting it up [16:39:11] ^d: Confirmed fixed. Thanks! [16:39:15] <^d> yw [16:39:26] <^d> Ok, swat done :D [16:39:41] ok, done, running puppet on cp1056 [16:42:38] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [16:43:06] yeah yeah [16:43:34] ?
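(For reference: creating the statsv topic with the settings ottomata mentions above, replication factor 3 and 12 partitions, would look roughly like this with Kafka's stock CLI of that era; the topic name comes from the conversation, while the ZooKeeper address is a placeholder.)

    kafka-topics.sh --create \
        --zookeeper zookeeper.example.org:2181 \
        --topic statsv \
        --replication-factor 3 \
        --partitions 12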
[16:43:40] (03PS1) 10Ottomata: Fix for varnish::kafka::statsv include [puppet] - 10https://gerrit.wikimedia.org/r/174713 [16:43:54] (03CR) 10Ottomata: [C: 032 V: 032] Fix for varnish::kafka::statsv include [puppet] - 10https://gerrit.wikimedia.org/r/174713 (owner: 10Ottomata) [16:43:58] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [16:43:59] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [16:44:52] <_joe_> wat? [16:44:54] lol [16:44:58] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail [16:45:10] it's varnishkafka patch o'clock! [16:45:21] there it goes [16:45:55] <_joe_> nah [16:46:28] hmm, worked on cp1056, something else wrong on cp3109.... [16:46:31] 3019 [16:48:24] <_joe_> it's bogus there [16:48:28] <_joe_> nevermind it [16:48:49] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:48:49] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:48:58] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:05] <_joe_> is someone running puppet by hand there? [16:49:06] ori, the ganglia stuff is being weird [16:49:18] _joe_, i'm testing this on 1056 and 3019 [16:49:22] on those, yes [16:50:11] ottomata: it'll fix itself now [16:50:19] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:50:38] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:51:09] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:51:13] hm, it isn't, the mv command is failing for no good reason [16:51:18] hm, or did it [16:51:24] hmm, i guess it did [16:52:15] (03CR) 10Alexandros Kosiaris: [C: 032] "Puppet compiler says almost noop (cosmetic changes)" [puppet] - 10https://gerrit.wikimedia.org/r/174389 (owner: 10Alexandros Kosiaris) [16:52:49] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:52:49] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:54:43] ok, ori, i think this is running [16:54:47] how can we test?
i'm trying to [16:54:59] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:55:08] curl -H 'Host: bits.wikimedia.org' http://localhost:80/statsv?hi=there [16:55:10] on cp1056 [16:55:17] and am consuming this topic from kafka [16:55:19] ottomata: you'll get a 404, not a 204, but that's fine [16:55:59] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:56:04] it's going to fail again [16:56:48] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: puppet fail [16:57:59] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:57:59] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures [16:58:58] (03PS1) 10Ori.livneh: handle missing .pyconf.new gracefully [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174717 [16:59:08] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [16:59:16] (03PS2) 10Ori.livneh: handle missing .pyconf.new gracefully [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174717 [16:59:18] PROBLEM - puppet last run on analytics1021 is CRITICAL: CRITICAL: puppet fail [16:59:27] (03CR) 10Ori.livneh: [C: 032 V: 032] handle missing .pyconf.new gracefully [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174717 (owner: 10Ori.livneh) [16:59:53] <_joe_> !log restart apache on mw1218, stuck in an apc futex [16:59:57] Logged the message, Master [17:00:19] (03PS1) 10Ottomata: Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174720 [17:00:25] (03PS1) 10Ori.livneh: Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174721 [17:00:30] ori ^ [17:00:32] let's put mine in too [17:00:33] before you merge that [17:00:34] (03CR) 10Ori.livneh: [C: 032 V: 032] Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174721 (owner: 10Ori.livneh) [17:00:39] PROBLEM - puppet last run on analytics1019 is CRITICAL: CRITICAL: puppet fail [17:00:48] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [17:00:53] ottomata: d'oh [17:01:02] heh [17:01:03] !log reboot ms-be1007, xfs-induced high load [17:01:05] Logged the message, Master [17:01:06] s'ok i gotcha then [17:01:29] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [17:01:38] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: puppet fail [17:01:43] were you able to consume it from the topic?
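(Consuming the topic to answer that question would be something along these lines with the stock console consumer shipped with Kafka of that era; as before, the ZooKeeper address is a placeholder.)

    kafka-console-consumer.sh --zookeeper zookeeper.example.org:2181 --topic statsv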
[17:01:56] db2033 / db1024 unrelated to what otto and I are doing btw ^ [17:02:00] (03PS1) 10Ottomata: Update varnishkafka module with logrotate fix [puppet] - 10https://gerrit.wikimedia.org/r/174722 [17:02:20] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: puppet fail [17:02:26] (03CR) 10Ottomata: [C: 032 V: 032] Update varnishkafka module with logrotate fix [puppet] - 10https://gerrit.wikimedia.org/r/174722 (owner: 10Ottomata) [17:02:38] ori, i am consuming, haven't seen anything come through yet [17:03:04] (03CR) 10Ottomata: [C: 032 V: 032] Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174720 (owner: 10Ottomata) [17:03:07] ottomata: because there's a buffer [17:03:08] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: puppet fail [17:03:09] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:03:15] of 1000+ messages or something [17:03:18] PROBLEM - swift-account-replicator on ms-be1007 is CRITICAL: Connection refused by host [17:03:28] PROBLEM - check if salt-minion is running on ms-be1007 is CRITICAL: Connection refused by host [17:03:29] PROBLEM - swift-account-auditor on ms-be1007 is CRITICAL: Connection refused by host [17:03:39] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: puppet fail [17:03:39] ehh, it's a count and time buffer [17:03:41] PROBLEM - swift-object-replicator on ms-be1007 is CRITICAL: Connection refused by host [17:03:45] it'll send every timeout too [17:03:59] sigh didn't silence ms-be1007 in time, apologies [17:04:27] around a second i think [17:04:59] PROBLEM - puppet last run on es2008 is CRITICAL: CRITICAL: puppet fail [17:05:10] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet last ran 16 hours ago [17:05:19] PROBLEM - puppet last run on platinum is CRITICAL: CRITICAL: puppet fail [17:06:13] <_joe_> on virt1006 puppet is disabled [17:07:10] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:07:15] hmm ori [17:07:17] my curl [17:07:20] is being redirected? [17:07:22] //bits.wikimedia.org/wiki/statsv/ [17:07:26] Refresh: 5; url=http://bits.wikimedia.org/wiki/statsv/?hi=there&ts=1416503231 [17:07:49] heh, i curled hi=there [17:08:01] it must be working [17:08:04] ? [17:08:19] <_joe_> akosiaris: on es2008 Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function t at /etc/puppet/modules/torrus/manifests/xml_generation/cdn.pp:1 on node es2008.codfw.wmnet [17:08:23] while true; do curl --head -H 'Host: bits.wikimedia.org' "http://localhost:80/statsv/?hi=cpupt^Cere&ts=$(date +%s)"; sleep 1; done [17:08:23] <_joe_> Warning: Not using cache on failed catalog [17:08:37] oops [17:08:37] while true; do curl --head -H 'Host: bits.wikimedia.org' "http://localhost:80/statsv/?hi=there&ts=$(date +%s)"; sleep 1; done [17:08:48] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [17:08:50] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [17:09:09] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [17:09:18] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [17:09:39] puppetmaster is going nuts [17:09:42] puppet is not failing on those hosts [17:10:14] ori, is varnish redirecting my request?
[17:10:16] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:10:17] to /wiki/statsv? [17:10:36] mediawiki did that because we're not intercepting the pattern yet [17:10:41] but that doesn't matter [17:10:49] varnish should still log the original request, aye [17:10:50] hm [17:10:56] ottomata: regardless of whether or not the reqs are going through, i think we should leave it be for now, too many things going on atm infrastructure-wise [17:10:57] it's not coming through kafka [17:11:07] ok [17:11:14] sounds fine with me, i am going to jump my car then :) [17:11:18] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Puppet has 1 failures [17:11:18] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [17:11:28] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [17:11:37] ottomata: nod, doesn't make sense to debug this through a storm of puppet alerts [17:11:57] ottomata: thanks very much, good luck with the car! [17:12:01] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [17:12:09] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [17:12:30] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:08] RECOVERY - very high load average likely xfs on ms-be1007 is OK: OK - load average: 25.35, 5.72, 1.88 [17:13:12] ottomata: your logrotate patch broke [17:13:29] hm [17:13:29] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:29] RECOVERY - swift-account-replicator on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:13:38] RECOVERY - check if salt-minion is running on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:13:48] RECOVERY - swift-account-auditor on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:13:50] PROBLEM - puppet last run on cp1038 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:51] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:55] Error: /Stage[main]/Varnishkafka/File[/etc/logrotate.d/varnishkafka]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/varnishkafka/varnishkafka_logrotate [17:13:58] RECOVERY - swift-object-replicator on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:14:29] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures [17:14:44] ha, i don't see that at all on cp1052, but it isn't replacing the file either [17:15:59] ottomata: revert? [17:16:08] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:16:17] eh i want to fix, but i can't see what's wrong with it! [17:17:34] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [17:17:38] RECOVERY - puppet last run on analytics1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:17:41] ok reverting.
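(The "Could not retrieve information from environment production source(s)" failure quoted above generally means a File resource's source URL does not resolve to a file the puppet master can serve. A sketch of the shape of the problem; the manifest path is illustrative, not the actual module code:)

    # modules/varnishkafka/manifests/init.pp (sketch)
    file { '/etc/logrotate.d/varnishkafka':
        ensure => present,
        # puppet:///modules/varnishkafka/varnishkafka_logrotate resolves to
        # modules/varnishkafka/files/varnishkafka_logrotate on the master;
        # if that file is absent there (say, a submodule bump that did not
        # carry the new file along), agents fail with the error above,
        # which is why the revert that follows clears it up.
        source => 'puppet:///modules/varnishkafka/varnishkafka_logrotate',
    }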
[17:17:59] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:09] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:18] (03PS1) 10Ottomata: Something was wrong with that logrotate varnishkafka module change, reverting to previous revision [puppet] - 10https://gerrit.wikimedia.org/r/174726 [17:18:33] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:44] (03CR) 10Ottomata: [C: 032 V: 032] Something was wrong with that logrotate varnishkafka module change, reverting to previous revision [puppet] - 10https://gerrit.wikimedia.org/r/174726 (owner: 10Ottomata) [17:18:59] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:08] RECOVERY - puppet last run on analytics1019 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:19:18] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:19:19] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:29] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:29] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:30] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:59] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:19:59] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:20:21] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:28] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:29] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:29] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:49] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:20:49] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Puppet has 1 failures [17:21:26] _joe_: must have been transient... cause I can not reproduce. But it is worrying... 
like the manifests were at some weird state when es2008's manifest was being compiled [17:21:29] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [17:21:39] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:21:59] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:22:18] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:19] RECOVERY - puppet last run on es2008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:22:30] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:39] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: Puppet has 1 failures [17:23:18] ori, it looks like it is refreshing gmond on every puppet run again [17:23:29] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:24:08] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:24:09] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:24:09] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:24:29] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:24:29] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:24:39] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:24:46] (03CR) 10Hashar: [C: 031] "The beta cluster properly override the base path ( /data/project/syslog ) see inline comment for the exact place of the definition." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/174673 (owner: 10Filippo Giunchedi) [17:24:49] RECOVERY - puppet last run on platinum is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:25:08] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:25:19] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:25:38] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:25:38] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:25:39] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:25:58] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:26:19] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:26:39] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:26:48] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:27:09] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:27:38] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:27:58] RECOVERY - puppet last run on cp1038 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:27:59] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:29:07] (03PS1) 10Giuseppe Lavagetto: mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 [17:29:49] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:29:49] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:29:49] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:29:49] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:30:28] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:31:14] (03PS2) 10Giuseppe Lavagetto: mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 [17:31:28] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 (owner: 10Giuseppe Lavagetto) [17:31:49] (03CR) 
10Giuseppe Lavagetto: [V: 032] mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 (owner: 10Giuseppe Lavagetto) [17:33:38] (03PS1) 10Alexandros Kosiaris: Allocate codfw Labs networks [dns] - 10https://gerrit.wikimedia.org/r/174732 [17:33:58] (03CR) 10Cscott: [C: 031] Give parsoid-roots access to ruthenium; split cassandra test hosts [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott) [17:34:27] _joe_: wow, that's awesome -- 14? [17:35:07] <_joe_> ori: 15 to api, 23 to the appserver pool [17:35:15] woooo [17:35:19] racked and everything? [17:35:30] <_joe_> chris is racking them now [17:35:30] (03PS5) 10GWicke: Give parsoid-admins access to ruthenium; split cassandra test hosts [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott) [17:35:40] <_joe_> I'm going to reimage a few now [17:35:58] <_joe_> and tomorrow morning I count on putting 8 in production [17:36:00] <_joe_> in the api cluster [17:36:22] <_joe_> and one in the main appserver pool [17:36:34] <_joe_> so we have a full weekend under "normal" load [17:36:48] <_joe_> before we go for dissolving the pools on tuesday [17:36:54] (03PS1) 10Ori.livneh: Don't notify Service[gmond] on new Pyconf [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174733 [17:37:20] _joe_, mark: yay for more api servers! [17:37:31] (03CR) 10Ori.livneh: [C: 032 V: 032] Don't notify Service[gmond] on new Pyconf [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174733 (owner: 10Ori.livneh) [17:37:33] <_joe_> gwicke: I'm giving them priority [17:39:10] 15 additional ones should help quite a bit [17:41:17] <_joe_> gwicke: they're also quite powerful [17:41:46] (03PS1) 10Ori.livneh: Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174734 [17:41:58] (03CR) 10Ori.livneh: [C: 032 V: 032] "per otto" [puppet] - 10https://gerrit.wikimedia.org/r/174734 (owner: 10Ori.livneh) [17:43:59] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [17:44:04] (03PS1) 10Ori.livneh: Revert "Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174735 [17:44:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [17:44:15] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174735 (owner: 10Ori.livneh) [17:44:29] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: puppet fail [17:44:46] <_joe_> .win 32 [17:44:48] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - puppet last run on curium is CRITICAL: CRITICAL: puppet fail [17:45:09] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [17:45:54] (03PS1) 10Ori.livneh: Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174737 [17:46:05] (03CR) 10Ori.livneh: [C: 032 V: 032] Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174737 (owner: 10Ori.livneh) [17:47:47] varnishkafka stuff done [17:52:00] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [17:53:54] <_joe_> can someone take a look at all those puppet failures? 
[17:54:27] sure [17:57:35] <_joe_> thanks jgage :) [17:58:08] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:00:04] ^d, legoktm: Respected human, time to deploy Extension Distributor (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T1800). Please do the needful. [18:00:22] hm well hooft was ok but it took forever, checking out another puppet-fail host.. [18:02:08] RECOVERY - puppet last run on curium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:02:49] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:03:09] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:03:19] PROBLEM - DPKG on mw1227 is CRITICAL: Connection refused by host [18:03:19] PROBLEM - mediawiki-installation DSH group on mw1227 is CRITICAL: Host mw1227 is not in mediawiki-installation dsh group [18:03:19] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:03:20] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:03:20] whatever was wrong with puppet is no longer wrong, runs by hand are not producing errors [18:03:29] PROBLEM - Disk space on mw1227 is CRITICAL: Connection refused by host [18:03:29] PROBLEM - nutcracker port on mw1227 is CRITICAL: Connection refused by host [18:03:48] PROBLEM - nutcracker process on mw1227 is CRITICAL: Connection refused by host [18:03:50] <_joe_> jgage: then it's the puppet server [18:03:51] hmm, mw1227 is being re-imaged, I think [18:03:58] <_joe_> it is being imaged [18:03:58] PROBLEM - puppet last run on mw1227 is CRITICAL: Connection refused by host [18:03:58] PROBLEM - HHVM processes on mw1227 is CRITICAL: Connection refused by host [18:03:59] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:04:02] ah, imaged. [18:04:12] ^demon|brb: ready?
[18:04:18] PROBLEM - HHVM rendering on mw1227 is CRITICAL: Connection refused [18:04:28] <_joe_> sorry guys [18:04:39] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [18:04:42] hmm, I was going to check if it was the report cleanup script, but the time is wrong [18:04:43] PROBLEM - RAID on mw1227 is CRITICAL: Connection refused by host [18:04:59] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail [18:04:59] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: puppet fail [18:05:08] PROBLEM - check configured eth on mw1227 is CRITICAL: Connection refused by host [18:05:09] PROBLEM - puppet last run on search1016 is CRITICAL: CRITICAL: puppet fail [18:05:09] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [18:05:09] PROBLEM - check if dhclient is running on mw1227 is CRITICAL: Connection refused by host [18:05:19] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail [18:05:39] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [18:07:09] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [18:11:19] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: puppet fail [18:11:29] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [18:12:09] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: puppet fail [18:15:03] <^demon|brb> legoktm: Yeah, I got sucked into a Solr/ES discussion. Lez go [18:15:07] <^demon|brb> (sorry) [18:15:11] ok :P [18:16:04] I copied all the extensions into http://extdist.wmflabs.org/dist/extensions/ so there shouldn't be any downtime, just possibly outdated tarballs. [18:16:25] <^d> I'll merge to master & start porting to wmf9 [18:16:51] (03PS1) 10Andrew Bogott: Switch back to using the wikistatus package. [puppet] - 10https://gerrit.wikimedia.org/r/174746 [18:16:59] legoktm: is it time yet? :) [18:17:08] yuvipanda: once we deploy the mw change :) [18:17:22] legoktm: I still don't understand why that would make that much of a difference [18:17:32] this just sets up additional cron jobs and stuff, doesn't seem to modify the original bits in any form [18:17:42] yuvipanda: it moves directories around [18:17:47] (03CR) 10Andrew Bogott: [C: 032] Switch back to using the wikistatus package. [puppet] - 10https://gerrit.wikimedia.org/r/174746 (owner: 10Andrew Bogott) [18:17:47] wait, does it? 
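(The zero-downtime shuffle legoktm describes above, dist becoming dist/extensions, amounts to pre-populating the new layout before puppet switches the cron jobs over. Roughly, with an assumed document root:)

    # /srv/dist is an assumption for the tarball document root
    mkdir -p /srv/dist/extensions
    cp -a /srv/dist/*.tar.gz /srv/dist/extensions/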
[18:17:50] yes [18:17:54] dist --> dist/extensions [18:17:56] oh [18:17:57] yea [18:17:57] h [18:17:58] it does [18:18:00] true, true [18:19:31] RECOVERY - DPKG on mw1227 is OK: All packages OK [18:19:48] RECOVERY - nutcracker port on mw1227 is OK: TCP OK - 0.000 second response time on port 11212 [18:19:49] RECOVERY - Disk space on mw1227 is OK: DISK OK [18:19:59] RECOVERY - RAID on mw1227 is OK: OK: no RAID installed [18:20:11] RECOVERY - nutcracker process on mw1227 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [18:20:12] RECOVERY - HHVM processes on mw1227 is OK: PROCS OK: 1 process with command name hhvm [18:20:19] RECOVERY - check configured eth on mw1227 is OK: NRPE: Unable to read output [18:20:29] RECOVERY - check if dhclient is running on mw1227 is OK: PROCS OK: 0 processes with command name dhclient [18:21:31] <^d> yuvipanda, legoktm: Ok, I'm all staged on tin and ready to hit enter on sync-dir [18:21:52] ^d: lets push the mw change first, make sure it's working, and then do the puppet one [18:21:58] <^d> kk [18:22:08] !log demon Synchronized php-1.25wmf9/extensions/ExtensionDistributor/: (no message) (duration: 00m 07s) [18:22:10] Logged the message, Master [18:22:14] (03PS1) 10Ori.livneh: mediawiki::packages: require ::apt [puppet] - 10https://gerrit.wikimedia.org/r/174748 [18:22:34] https://www.mediawiki.org/w/api.php?action=query&list=extdistrepos has skins :D [18:22:39] yuvipanda: ok, ready for puppet now [18:22:39] <^d> Blahhh, I have to rebuild i18n [18:22:48] hehok [18:22:49] (03PS4) 10Yuvipanda: extdist: Support distributing skins [puppet] - 10https://gerrit.wikimedia.org/r/174471 (owner: 10Legoktm) [18:23:45] ^d: oh yeah, this'll need a scap :/ [18:23:45] legoktm: waiting for jenkins [18:23:58] <^d> legoktm: I already sync-dir'd. Why not just l10nupdate? :) [18:24:04] <^d> Faster than a full scap. 
[18:24:07] ok :P [18:24:08] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:24:09] lol [18:24:19] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:24:19] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:24:23] (03CR) 10Yuvipanda: [C: 032] extdist: Support distributing skins [puppet] - 10https://gerrit.wikimedia.org/r/174471 (owner: 10Legoktm) [18:24:29] RECOVERY - puppet last run on search1016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:24:30] legoktm: I should set up monitoring for this too [18:24:39] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: puppet fail [18:24:43] legoktm: merged [18:24:48] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: puppet fail [18:24:49] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:24:51] ok, running puppet manually [18:25:10] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:25:10] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: puppet fail [18:25:39] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:25:59] Notice: /Stage[main]/Extdist/File[/etc/skindist.conf]/ensure: created [18:25:59] Notice: /Stage[main]/Extdist/Cron[skindist-generate-tarballs]/ensure: created [18:26:08] legoktm: I'm setting up alerts for extdist now. which email would you like to use? :) [18:26:18] yuvipanda: legoktm@wikimedia.org plz [18:26:24] also ty :D [18:26:28] ops@ ? :P [18:26:33] hehe [18:26:39] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail [18:27:07] !log updated the python-openstack-wikistatus on carbon to 2014.11 [18:27:11] Logged the message, Master [18:27:29] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: puppet fail [18:27:29] (03PS1) 10Yuvipanda: shinken: Add monitoring for extdist project too [puppet] - 10https://gerrit.wikimedia.org/r/174751 [18:27:33] legoktm: ^ check your email. [18:28:44] will in a few minutes, found a bug in nightly.py [18:28:48] legoktm: heh ok [18:28:49] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:13] <^d> legoktm: I'm going to be naughty and walk away while l10nupdate is still running...meeting. [18:29:24] ok [18:29:27] <^d> I guess I could take my laptop lol. [18:29:28] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:37] <^d> Probably will drop and I didn't use screen ugh [18:29:49] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:30:38] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:30:45] http://extdist.wmflabs.org/dist/skins/ woot [18:30:49] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:31:11] hmm [18:31:29] ^d: Pretty soon we will be able to use mosh and screen to do deploys because Ori has almost got rid of the need for the agent. Pretty damn cool! [18:31:30] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:05] ok, all looks good. 
[18:36:32] ori: heh, was just going to investigate that labs/private thing. thanks! [18:36:33] (03PS1) 10Legoktm: Un-disenable Special:SkinDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174759 [18:37:15] ^d: once l10nupdate finishes, we can turn on Special:SkinDistributor ^ [18:39:00] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-20 18:39:00+00:00 [18:39:03] Logged the message, Master [18:39:18] legoktm: did you get an alert? [18:39:23] legoktm: well, an empty alert at least? :) [18:39:28] * legoktm opens up email [18:39:57] oh, hmm [18:40:01] shinken says it can't even reach them [18:40:14] Subject: ** PROBLEM alert - extdist2/ is ** [18:40:17] yeah [18:40:21] very nicely done, I know [18:40:24] it just means host is down [18:40:34] I can't ping extdist1, 2 or 3 [18:40:37] also why do we have 3? [18:40:47] 3 has a bigger /var/log [18:41:04] after this skin stuff settles, I'm going to make 3 the real one and get rid of 2 [18:41:11] but for now extdist.wmflabs.org points to 2 [18:42:02] legoktm: yeah [18:42:04] legoktm: ok [18:42:41] and I think 1 has the self puppetmaster thing set up for testing things [18:43:01] ah,hmm [18:43:02] that's good [18:43:06] just get rid of 2 later ;) [18:43:29] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:43:52] ** PROBLEM alert - extdist1/ is ** [18:44:08] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:44:09] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:44:09] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [18:44:09] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:44:18] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: puppet fail [18:44:58] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: puppet fail [18:44:58] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: puppet fail [18:44:59] PROBLEM - puppet last run on analytics1017 is CRITICAL: CRITICAL: puppet fail [18:44:59] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [18:45:58] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: puppet fail [18:45:58] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:46:09] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: puppet fail [18:46:09] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:46:18] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: puppet fail [18:46:32] legoktm: hmm, unsure why I can't ping extdist tho [18:46:42] from where? [18:46:54] it might have some strict firewall things? [18:47:49] on https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup it only has 22 and 80 listed [18:47:51] (03PS1) 10Yuvipanda: shinken: Add monitoring for analytics project too [puppet] - 10https://gerrit.wikimedia.org/r/174763 [18:47:53] milimetric: ^ [18:48:02] legoktm: hmm, ping should be open nonetheless, I think. [18:48:07] legoktm: since that's icmp [18:48:10] thx yuvipanda [18:48:21] milimetric: yw. can you verify your email address in that patch and +1? 
[18:48:53] (03CR) 10Milimetric: [C: 031] shinken: Add monitoring for analytics project too [puppet] - 10https://gerrit.wikimedia.org/r/174763 (owner: 10Yuvipanda) [18:49:00] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:49:05] legoktm: can you +1 your patch as well, check email is right? [18:49:18] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:49:58] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:50:18] (03CR) 10Legoktm: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/174751 (owner: 10Yuvipanda) [18:50:34] yuvipanda: o.O I can't ping it from bastion2.wmflabs.org [18:51:40] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-20 18:51:39+00:00 [18:51:43] Logged the message, Master [18:53:10] (03CR) 10Yuvipanda: [C: 032] shinken: Add monitoring for extdist project too [puppet] - 10https://gerrit.wikimedia.org/r/174751 (owner: 10Yuvipanda) [18:53:23] (03CR) 10Yuvipanda: [C: 032] shinken: Add monitoring for analytics project too [puppet] - 10https://gerrit.wikimedia.org/r/174763 (owner: 10Yuvipanda) [18:58:58] <^d> Gah, this failed miserably. [18:59:21] <^d> mw1218: 18:51:37 Updated 0 CDB files(s) in /srv/mediawiki/php-1.25wmf9/cache/l10n [18:59:27] <^d> (times every single mw*) [19:01:18] <^d> legoktm: scap it is, I suppose... [19:01:27] :| [19:02:08] <^d> bd808: l10nupdate busted for me really really bad running it alone. [19:02:12] <^d> Pastebinning... [19:02:51] ^d: I'll look but l10nupdate is spooky magic as far as I'm concerned [19:03:04] <^d> P100! [19:03:05] <^d> https://phabricator.wikimedia.org/P100 [19:03:09] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:03:19] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:03:29] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:03:29] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:03:39] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: puppet fail [19:03:49] <^d> bd808: I'm mostly concerned about the "unhandled error" around line 940. [19:04:09] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:04:19] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail [19:04:29] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:04:39] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: puppet fail [19:05:18] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:05:19] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:05:29] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: puppet fail [19:05:53] ^d: I'm not sure why it kept going after that... but I think problem #1 was that you didn't sudo -u l10nupdate to run it [19:06:12] <^d> ugh. 
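(One plausible explanation for the ping failures above: the project's OpenStack security group only allows TCP 22 and 80, and ICMP needs its own rule. With the nova CLI of that era, adding one would look roughly like this; the security group name and CIDR are assumptions.)

    # icmp type/code -1 -1 means "all ICMP"
    nova secgroup-add-rule default icmp -1 -1 10.0.0.0/8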
[19:06:25] !log demon Started scap: (no message) [19:06:27] Logged the message, Master [19:06:29] sudo: /usr/local/bin/scap-rebuild-cdbs: command not found [19:06:29] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [19:06:29] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: puppet fail [19:06:32] <^d> Screw it, I'm just doing the whole thing [19:06:34] That looks not right too [19:07:09] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [19:07:18] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: puppet fail [19:07:19] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [19:07:19] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [19:07:29] <^d> bd808: It's there, permissions probably why I couldn't find it? [19:07:51] <^d> No. [19:08:09] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail [19:08:29] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: puppet fail [19:08:38] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: puppet fail [19:09:26] My browser really hates that paste. Scroll sooooo slow [19:09:39] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: puppet fail [19:10:08] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail [19:10:19] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:11:35] ^d: It's broken. [19:11:42] Like all the time [19:11:48] <^d> Yeahhhh [19:12:00] Broken in l10nupdate.log-20141116.gz logs too [19:12:29] Reedy did some work to convert it to run using scap. That must not be quite right [19:14:09] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:17:39] ^d: https://phabricator.wikimedia.org/T1383 [19:19:57] <^d> thx for filing that [19:20:14] <^d> sync-common is at 65% for my full scap [19:22:10] robh: sorry for delay, all clear to merge key update and resolve ticket :) [19:22:21] mw1070 is getting overloaded by the scap again :( -- https://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1070.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:22:39] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:22:50] rmoen: cool, i'll do so now [19:22:53] load average: 103.23, 95.48, 57.43 [19:22:58] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:22:58] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:23:38] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:23:56] (03PS3) 10RobH: stat1003 access for rmoen (rt #8870) and update ssh key [puppet] - 10https://gerrit.wikimedia.org/r/173811 (owner: 10ArielGlenn) [19:24:10] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [19:24:10] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [19:24:28] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [19:24:28] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: puppet fail [19:24:48] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [19:24:48] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [19:24:48] PROBLEM - puppet last run on 
searchidx1001 is CRITICAL: CRITICAL: puppet fail [19:24:49] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [19:24:49] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:24:59] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: puppet fail [19:25:09] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: puppet fail [19:25:09] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: puppet fail [19:25:09] PROBLEM - puppet last run on es1008 is CRITICAL: CRITICAL: puppet fail [19:25:18] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: puppet fail [19:25:28] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [19:25:29] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: puppet fail [19:25:39] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [19:25:48] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:25:48] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:49] PROBLEM - puppet last run on search1010 is CRITICAL: CRITICAL: puppet fail [19:26:19] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail [19:26:20] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [19:26:59] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [19:26:59] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:27:11] (03CR) 10RobH: [C: 032] "cleared up with rob via irc, ok to merge" [puppet] - 10https://gerrit.wikimedia.org/r/173811 (owner: 10ArielGlenn) [19:27:19] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:27:28] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:27:38] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:27:48] rmoen: ok, it is now live, it'll take an hour or so to hit all the various servers [19:27:49] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail [19:27:59] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:28:07] if there are specific servers you need to access before then, just let me know and i'll manually fire a puppet run on them =] [19:28:50] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:30:02] !log demon Finished scap: (no message) (duration: 23m 37s) [19:30:05] Logged the message, Master [19:30:08] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: puppet fail [19:30:38] <_joe_> hey is someone looking at this shower of puppet failures? [19:32:19] just got out of meeting, i'll take another look [19:32:59] ^d: deploy https://gerrit.wikimedia.org/r/#/c/174759/ now? 
[19:33:18] <_joe_> jgage: I don't think you're the only opsen around :) [19:33:47] well i seem to be the only one who responded :) [19:33:49] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:34:21] <_joe_> it seems like transient failures [19:34:42] !log restarted puppetmasters [19:34:47] Logged the message, Master [19:35:03] yeah, but some of them are across the lan which suggests the master rather than net [19:35:03] <_joe_> jgage: the error is most of the times [19:35:05] <_joe_> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function t at /etc/puppet/modules/torrus/manifests/xml_generation/cdn.pp:1 on node mw1126.eqiad.wmnet [19:35:06] let's see if that helps [19:35:18] oh interesting, 400 [19:35:22] <_joe_> yes, hopefully yes [19:35:23] (03PS1) 10Nuria: Adding user to monitoring of analytics projects [puppet] - 10https://gerrit.wikimedia.org/r/174776 [19:35:37] <_joe_> 400 is the lame way puppet has to say it failed compilation [19:35:39] <_joe_> MEH [19:37:29] notify => Exec['torrus compile --tree=CDN'], [19:37:40] bleh torrus [19:40:21] (03PS1) 10Ottomata: Increase heapsize for hive-server2 and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/174778 [19:40:35] (03CR) 10Ottomata: [C: 032 V: 032] Increase heapsize for hive-server2 and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/174778 (owner: 10Ottomata) [19:40:41] <^d> _joe_: Running puppet on a host that was failing worked fine for me [19:40:52] <^d> (was able to get search1018 and elastic1008 to recover) [19:41:36] (03CR) 10Chad: [C: 032] Un-disenable Special:SkinDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174759 (owner: 10Legoktm) [19:41:41] ty robh [19:41:53] welcome :] [19:41:57] (03Merged) 10jenkins-bot: Un-disenable Special:SkinDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174759 (owner: 10Legoktm) [19:42:26] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 04s) [19:42:29] Logged the message, Master [19:42:30] <^d> legoktm: ^^^ [19:42:39] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:42:54] ^d: yaaay https://www.mediawiki.org/wiki/Special:SkinDistributor :D [19:43:00] thanks! [19:43:08] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:43:09] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:43:18] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:43:28] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:43:28] RECOVERY - puppet last run on es1008 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:43:29] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:43:29] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:43:38] <^d> legoktm: woot! 
[19:43:39] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:43:39] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:44:08] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [19:44:18] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: puppet fail [19:44:19] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:44:19] PROBLEM - puppet last run on analytics1025 is CRITICAL: CRITICAL: puppet fail [19:44:28] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:44:29] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [19:44:38] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:44:49] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: puppet fail [19:44:59] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [19:45:08] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:45:18] RECOVERY - puppet last run on search1010 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:45:29] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:45:29] PROBLEM - puppet last run on analytics1020 is CRITICAL: CRITICAL: puppet fail [19:45:29] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [19:45:58] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail [19:45:59] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:46:18] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:46:40] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail [19:48:08] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [19:48:08] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [19:48:09] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [19:48:28] (03PS1) 10Yuvipanda: nagios_common: Split out notification_commands for shinken/icinga [puppet] - 10https://gerrit.wikimedia.org/r/174780 [19:48:29] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: puppet fail [19:48:39] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [19:48:49] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: puppet fail [19:48:58] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [19:49:19] PROBLEM - puppet last run on pc1002 is CRITICAL: CRITICAL: puppet fail [19:49:20] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:49:28] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [19:49:58] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: puppet fail [19:50:03] hmm [19:50:05] * yuvipanda takes a look too [19:50:54] hmm [19:50:57] seem fine when run manually [19:50:58] * yuvipanda looks at logs [19:50:59] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 7 
seconds ago with 0 failures [19:51:13] <^d> yuvipanda: Yeah that's what I said a bit ago. I ran it manually on 2 nodes and they recovered. [19:51:18] hmmm [19:51:20] torrus error? [19:51:59] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function t at /etc/puppet/modules/torrus/manifests/xml_generation/cdn.pp:1 on node labnet1001.eqiad.wmnet [19:52:19] * yuvipanda tries on another machine [19:52:36] yup [19:52:37] torrus [19:52:43] why is that intermittent?! [19:53:57] i'm guessing that the compilation is expensive? [19:54:05] leading to timeouts [19:55:13] jgage: actually... [19:55:17] (03PS1) 10Yuvipanda: torrus: Remove stray characters in file causing failures [puppet] - 10https://gerrit.wikimedia.org/r/174783 [19:55:18] jgage: ^ [19:55:32] (03PS1) 10BryanDavis: l10nupdate: Fix scap command paths [puppet] - 10https://gerrit.wikimedia.org/r/174784 [19:55:38] * jgage looks [19:55:49] ha! [19:55:51] nice catch [19:55:53] indeed [19:56:06] (03CR) 10Gage: [C: 032] torrus: Remove stray characters in file causing failures [puppet] - 10https://gerrit.wikimedia.org/r/174783 (owner: 10Yuvipanda) [19:56:26] jgage: ty! [19:56:30] merged [19:56:36] let's see if that fixes it [19:56:47] no idea *why* it was intermittent, though... [19:56:56] yeah that is especially weird [19:57:12] this kind of thing, you would expect to just... fail [19:57:12] hard [19:59:12] * jgage -> lunch [19:59:17] (03CR) 10BryanDavis: "Resolving this bug also requires that a root change the ownership of /var/lock/scap to l10nupdate:wikidev. The l10nupdate use is not a mem" [puppet] - 10https://gerrit.wikimedia.org/r/174784 (owner: 10BryanDavis) [20:00:09] I's appreciate a review and merge on that puppet patch by any opsen who has time ^ [20:00:26] l10nupdate is broked in prod (and has been for 2 weeks apaprently) [20:00:42] I'm pretty sure that patch will fix it [20:01:00] There was a bug [20:01:03] along with the file permission change I noted in the comments [20:01:21] Nemo_bis: https://phabricator.wikimedia.org/T1383 [20:01:48] should we move l10nupdate out of the puppet repo alongside scap? [20:02:16] greg-g: eh. maybe? [20:02:25] (03PS2) 10Yuvipanda: nagios_common: Split out notification_commands for shinken/icinga [puppet] - 10https://gerrit.wikimedia.org/r/174780 [20:02:40] greg-g: Faidon wasn't happy about scap moving out. [20:02:49] I mean, what reasoning is there to have scap outside but not l10nupdate? /me shrugs just a passing thought [20:03:02] well, then give my team +2 on puppet and I'll be happy with whatever :) [20:03:10] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:03:12] ;) [20:03:19] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:03:22] I think I asked for that once... [20:03:29] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:03:29] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:03:38] RECOVERY - puppet last run on analytics1025 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:03:41] bd808: I'm looking at the patch now [20:03:49] RECOVERY - puppet last run on analytics1020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:03:56] yuvipanda: woot! 
I forgot about your new super powers [20:03:59] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:04:02] bd808: :) [20:04:19] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:04:20] oh man, I knew this was in bash, but ugh [20:04:32] it's not pretty [20:04:42] I mean even for bash it's scabby [20:04:49] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:04:58] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:05:28] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:06:17] bd808: /var/lock/scap on where? tin? [20:06:27] yuvipanda: Yeah tin [20:06:28] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:06:59] (03PS1) 10Jforrester: Enable VisualEditor Beta Feature on other wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174793 [20:07:08] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:07:28] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:07:32] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:07:49] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:08:10] bd808: heh, it's ori:wikidev now [20:08:18] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:08:20] bd808: I wonder if it gets reset every time someone writes to it [20:08:20] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:08:23] He ran it first I assume [20:08:35] ah, right [20:08:37] It shouldn't [20:08:38] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:08:44] If it did Chad would own it [20:08:47] anyway, merging. [20:08:49] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:09:05] bd808: I meant https://bugzilla.wikimedia.org/show_bug.cgi?id=73586 [20:09:30] !log run chown l10nupdate:wikidev /var/lock/scap on tin, for https://gerrit.wikimedia.org/r/#/c/174784/1 [20:09:33] Logged the message, Master [20:09:42] (03PS2) 10Yuvipanda: l10nupdate: Fix scap command paths [puppet] - 10https://gerrit.wikimedia.org/r/174784 (owner: 10BryanDavis) [20:10:34] (03CR) 10Yuvipanda: [C: 032] l10nupdate: Fix scap command paths [puppet] - 10https://gerrit.wikimedia.org/r/174784 (owner: 10BryanDavis) [20:10:40] Nemo_bis: Chad ran a full scap today so I bet that bug is fixed now, but yeah it was probably because of this problem that the l10n didn't get updated. [20:11:01] bd808: am forcing a puppet run on tin now [20:11:17] gogo panda powers! [20:11:28] * bd808 needs food [20:11:48] yuvipanda: I'll try running it after I eat something to see if I really fixed it or not [20:12:02] bd808: yeah, ok. I might not be around, though. but might also be. no idea. [20:12:39] yuvipanda: No worries.
I can beg for help from others if needed [20:12:46] * bd808 is good at begging [20:12:52] bd808: heh. feel free to add me to similar patches in the future, etc. [20:12:55] * hashar gives a quarter to bd808 [20:13:22] * bd808 buys a stick of gum with hashar's quarter [20:13:26] bd808: puppet ran successfully [20:13:36] cool [20:18:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [20:19:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:25:29] hmmm [20:25:39] shinken is still not sending emails for service state changes [20:25:43] Y U NO LIKE ME SHINKEN [20:28:15] I'm getting many transcoding timeouts recently, where should I look to find the root cause? [20:32:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:32:39] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:41:39] PROBLEM - Host mw1227 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:38] (03PS2) 10Ori.livneh: mediawiki::packages: require ::apt [puppet] - 10https://gerrit.wikimedia.org/r/174748 [20:43:27] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::packages: require ::apt [puppet] - 10https://gerrit.wikimedia.org/r/174748 (owner: 10Ori.livneh) [20:54:23] hmm, so host notification emails just work. [20:54:26] service ones, however... [21:00:09] !log Updated EventLogging to 39de1d3faacc8463db7532405e8fc003b80ecb79 [21:00:14] Logged the message, Master [21:04:53] bam, *that* is why [21:05:32] (03PS3) 10Yuvipanda: nagios_common: Split out notification_commands for shinken/icinga [puppet] - 10https://gerrit.wikimedia.org/r/174780 [21:06:57] is greg-g around today? [21:07:43] * greg-g nods [21:07:47] ah :) [21:07:51] :) [21:07:55] what's up? [21:07:56] * aude wants to deploy https://gerrit.wikimedia.org/r/#/c/174800/ (update to the property suggester) [21:08:13] we would have done it tuesday except i was travelling and we need to update the table also [21:08:21] (maintenance script) [21:08:31] is there time open today? [21:09:16] otherwise can do it on monday [21:09:18] yep, between now and 2 hours from now is open [21:09:22] ok, great [21:09:27] now is fine if you're good to go [21:09:30] ok [21:09:47] (it's early -- afternoon here ;) [21:10:04] thanks [21:11:15] Anybody worried if I run l10nupdate to see if it's fixed now? [21:11:20] greg-g: ? [21:11:55] bd808: if i can quickly deploy an update for wikidata first? [21:12:03] not sure they should be at the same time? [21:12:11] aude: Yeah no problem. [21:12:18] waiting on jenkins [21:12:33] bd808: I'll be around for the next 20mins too, if you need anything [21:12:35] bd808: oh right, you were waiting on that, sorry 'bout the delay [21:12:39] aaah, have to do wmf9 too [21:12:43] first [21:12:59] greg-g: No problem. I've got lots of other "real work" to do :) [21:13:48] won't take long [21:17:54] jenkins is taking its time...
[21:19:24] * hashar blames the test suite [21:20:02] heh [21:26:12] !log aude Synchronized php-1.25wmf9/extensions/Wikidata: Update test.wikidata - property suggester (duration: 00m 10s) [21:26:16] Logged the message, Master [21:26:24] * aude verifying [21:28:07] !log aude Synchronized php-1.25wmf8/extensions/Wikidata: Update Wikidata - property suggester (duration: 00m 10s) [21:28:09] Logged the message, Master [21:28:25] done, except need to run a script now [21:37:53] bd808: done (also with the script) [21:38:14] aude: thx. I'll see if I unbroke l10nupdate or not [21:38:36] ok [21:40:02] !log Testing l10nupdate changes [21:40:05] Logged the message, Master [21:40:25] (03PS4) 10Yuvipanda: shinken: Fix notification commands to make email work [puppet] - 10https://gerrit.wikimedia.org/r/174780 [21:42:09] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] hmm 502 for wikipedia? [21:42:09] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:10] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:10] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:11] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:11] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:12] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:12] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:19] what a timing [21:42:29] RECOVERY - Host amssq34 is UP: PING OK - Packet loss = 0%, RTA = 96.07 ms [21:42:32] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 95.76 ms [21:42:32] RECOVERY - Host amssq44 is UP: PING OK - Packet loss = 0%, RTA = 95.13 ms [21:42:32] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 95.29 ms [21:42:32] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 95.98 ms [21:42:32] RECOVERY - Host amssq35 is UP: PING OK - Packet loss = 0%, RTA = 96.04 ms [21:42:33] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 95.61 ms [21:42:33] RECOVERY - Host cp3018 is UP: PING OK - Packet loss = 0%, RTA = 95.47 ms [21:42:34] RECOVERY - Host amssq42 is UP: PING OK - Packet loss = 0%, RTA = 95.32 ms [21:42:34] RECOVERY - Host amssq43 is UP: PING OK - Packet loss = 0%, RTA = 95.52 ms [21:42:35] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 94.84 ms [21:42:38] RECOVERY - Host amssq39 is UP: PING OK - Packet loss = 0%, RTA = 95.10 ms [21:42:48] RECOVERY - Host amssq40 is UP: PING OK - Packet loss = 0%, RTA = 95.86 ms [21:43:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:43:48] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [21:43:49] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: Puppet has 5 failures [21:43:49] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: puppet fail [21:44:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:44:49] PROBLEM - puppet last run on amssq57 is CRITICAL: 
CRITICAL: Puppet has 1 failures [21:44:49] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [21:44:59] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Puppet has 1 failures [21:45:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 5 failures [21:45:19] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR [21:48:39] hm what's up esams [21:58:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:38] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:52] jgage: second time today [21:59:44] !log bd808 Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 05m 05s) [21:59:48] Logged the message, Master [22:00:15] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:01:00] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:01:19] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:01:59] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:02:29] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:03:06] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-20 22:03:06+00:00 [22:03:09] Logged the message, Master [22:03:10] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:03:59] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:17:08] (03PS1) 10Jalexander: Revert "Enable SecurePoll error detail for debugging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174829 [22:18:48] <^d> Jamesofur: ^ is that for swat or you need it sooner? [22:19:00] nah, SWAT is fine, I'm adding it to the list there [22:19:06] <^d> okie dokie [22:19:12] it's just backing out a debugging setting [22:19:39] <^d> *nod* yeah [22:21:41] <^d> Jamesofur: Get it on the list now, we're basically at the 8 patch max :) [22:23:03] * Jamesofur shakes fist at edit conflict [22:23:17] apparently with myself [22:23:18] it's on there [22:23:40] !log bd808 Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 08m 05s) [22:23:44] Logged the message, Master [22:23:57] 8 minutes, not too bad [22:24:11] That's just the data sync, but yeah [22:24:19] <^d> Jamesofur: It was so much harder to edit conflict yourself in the days before tabbed browsing :p [22:24:28] so true [22:25:38] (03PS1) 10Legoktm: Deploy GlobalUserPage extension to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174831 [22:26:14] (03PS1) 10Gage: Hadoop: Logstash: log INFO level messages [puppet] - 10https://gerrit.wikimedia.org/r/174832 [22:26:33] :o swat is full already? [22:27:00] <^d> yeah pretty much. [22:27:08] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-20 22:27:08+00:00 [22:27:11] Logged the message, Master [22:27:16] hmm, I have a beta-only change. [22:27:46] <^d> We could do a beta-only change now? I don't think that needs to wait for swat. 
[22:27:54] (03CR) 10BryanDavis: [C: 031] "If they are noise to someone, it is easy enough to add filter to hide all hadoop messages in a given Kibana dashboard." [puppet] - 10https://gerrit.wikimedia.org/r/174832 (owner: 10Gage) [22:27:55] legoktm: yeah, jfdi [22:27:55] legoktm: Aren't you deployer yourself now? [22:28:04] * hoo scratches his head [22:28:21] hoo: yeah, but I still haven't learned how to deploy yet. or.i is going to teach me on monday [22:28:32] :P [22:28:46] <^d> legoktm: link to patch? [22:28:48] git fetch; git rebase; sync-file [22:28:51] ^d: https://gerrit.wikimedia.org/r/#/c/174831/ [22:29:20] <^d> bd808: `git pull -r` because I'm a rebel and live on the edge :D [22:30:04] (03CR) 10Chad: [C: 032] Deploy GlobalUserPage extension to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174831 (owner: 10Legoktm) [22:30:12] (03Merged) 10jenkins-bot: Deploy GlobalUserPage extension to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174831 (owner: 10Legoktm) [22:31:15] !log demon Synchronized wmf-config/: globaluserpage on beta, no-op sync (duration: 00m 07s) [22:31:17] Logged the message, Master [22:31:42] thanks [22:32:05] <^d> yw [22:32:54] thanks bd808, glad we agree about logging :) [22:33:51] (03CR) 10Gage: [C: 032] "I agree with bd808!" [puppet] - 10https://gerrit.wikimedia.org/r/174832 (owner: 10Gage) [22:34:10] jgage: There are some log events that aren't worth recording in prod, but by and large if we can hold the data and keep up with the event stream I don't see any reason to drop things that might be useful [22:34:58] The worst thing in debugging an intermittent prod problem is to find out that the log level was one notch too low. :( [22:34:59] yeah. at minimum i'd rather get the whole feed and then selectively drop specific event types identified as useless/noise [22:35:33] (03CR) 10Andrew Bogott: [C: 032] Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [22:35:36] You'll have a hard time out-spamming OCG. ;) [22:35:43] :D [22:35:52] * jgage sets level DEBUG [22:36:14] Why not trace? [22:36:26] for real java crazy log dumpage [22:36:36] <^d> We could log elastic logs to logstash. [22:36:41] <^d> s/could/should/ [22:36:45] should [22:36:47] yes [22:36:55] especially the slow logs [22:37:04] and the mysql slow logs would be awesome too [22:37:27] ironically, logstash doesn't log its own messages to logstash either [22:37:45] chicken eating egg problem [22:37:58] also the logstash logs are crap [22:38:13] yeah, agreed on both [22:38:17] That ruby logging library it uses is painful to look at [22:38:17] the latter is annoying [22:38:18] <^d> I read a blog post awhile ago about somebody who'd set up logstash consuming its own logs. [22:38:46] so it would log that it had logged about processing a log while it logged? [22:38:46] i'd be interested to read that if you have the link handy, ^d [22:39:09] <^d> I'm trying to dig it up :) [22:39:12] danke [22:39:25] feeding logstash debug log stream into logstash would be a fun way to self DOS [22:39:42] each event larger than the last [22:40:28] <^d> nvm, it was just the silly one about logging elasticsearch events in logstash. [22:40:33] ^d, greg-g: l10nupdate is fixed [22:40:58] hm ok [22:41:00] <^d> wheee [22:41:26] bd808: sweet [22:41:27] <^d> Heh, third-to-final paragraph [22:41:29] <^d> "The usefulness of this entire setup is a bit suspicious." [22:42:18] heh!
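bd808's "one notch too low" point is the crux of the logging exchange above: emit events at honest levels in code and let configuration decide which levels actually ship to logstash. A minimal sketch of what that looks like with MediaWiki's PSR-3 LoggerFactory (introduced during the 1.25 cycle); the 'transcode' channel and the example values are made up for illustration:

```php
<?php
// Sketch of level-based logging via MediaWiki's PSR-3 LoggerFactory.
// Which levels reach logstash is decided per channel in the logging
// config, so raising verbosity during an incident needs no code change.
use MediaWiki\Logger\LoggerFactory;

$logger = LoggerFactory::getInstance( 'transcode' ); // channel name is hypothetical
$cmd = 'ffmpeg -i in.webm out.ogv'; // example values
$sec = 42;

$logger->debug( 'running {cmd}', array( 'cmd' => $cmd ) ); // usually filtered out
$logger->info( 'transcode finished in {sec}s', array( 'sec' => $sec ) ); // the notch that saves a debugging session
$logger->warning( 'transcode timed out after {sec}s', array( 'sec' => $sec ) );
```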
[22:43:14] soooo how long does it take for code to get deployed to beta? http://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version still doesn't show it yet [22:43:35] but eval.php shows it's being loaded... [22:44:11] legoktm: Just a config change? They are almost instant [22:44:44] hmmmm [22:44:50] it's not showing up on Special:Version :| [22:44:53] legoktm: scap is running there now -- https://integration.wikimedia.org/ci/view/Beta/ [22:45:55] legoktm: and the scap that is running is for your config change -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/30556/ [22:46:05] ah [22:46:06] ok :D [22:46:13] Huh, neat. [22:50:07] legoktm: Looks like you caught (or caused) an l10n change that took a while [22:50:28] 22:43:09 Finished mw-update-l10n (duration: 12m 41s) [22:50:37] o.O [22:50:44] it's on http://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version :D [22:50:56] l10nupdate is teh slow [22:53:28] (03PS6) 10Andrew Bogott: Move the openstack_version setting to hiera. [puppet] - 10https://gerrit.wikimedia.org/r/173904 [22:55:13] (03CR) 10Andrew Bogott: [C: 032] Move the openstack_version setting to hiera. [puppet] - 10https://gerrit.wikimedia.org/r/173904 (owner: 10Andrew Bogott) [23:00:04] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T2300). [23:02:27] (03PS15) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [23:13:29] !log maxsem Synchronized php-1.25wmf9/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/174749/ (duration: 00m 04s) [23:13:33] Logged the message, Master [23:19:47] <^d> !log running sync-common on mw1135. out of sync? [23:19:52] Logged the message, Master [23:20:11] greg-g: Around? [23:24:56] I'm getting an error message when trying to log in to any beta.wmflabs wiki: [23:24:58] Unable to connect [23:24:58] Firefox can't establish a connection to the server at login.wikimedia.beta.wmflabs.org. [23:25:01] " [23:25:20] with URL: https://login.wikimedia.beta.wmflabs.org/wiki/Special:CentralLogin/start?token=8722... [23:25:31] hoo: yeah, what's up? [23:25:48] legoktm: ^ see quiddity [23:25:57] greg-g: Did you intentionally change the time for next week's Tuesday train? [23:26:08] 19:00–21:00 UTC [23:26:09] uh [23:26:09] hmmm [23:26:15] <^d> !log graceful'd mw1135, apc stale? [23:26:21] Logged the message, Master [23:26:22] hoo: no :) [23:26:34] hoo: copy paste from this week, then forgot it was changed this week :) [23:26:37] greg-g: But... can we keep it? [23:26:41] quiddity: it loads for me... [23:26:50] But then when I reload the mainpage at the actual wiki (meta or en) I'm successfully logged in. (However i'm not logged in to the other betaSUL wikis..., e.g. wikidata) [23:26:55] Reedy: thoughts on next week's Tuesday deploy time? [23:27:12] I'm in university during the normal Tuesday deploy time... so having it after 6pm would be better for me [23:27:24] that ok with aude ? [23:27:25] 6pm CET [23:27:41] greg-g: Not sure... let's wait for her to come back from having food [23:27:44] kk [23:27:59] hoo: can you two talk and just send me an email? whatever works for you [23:28:11] Well, if Reedy is ok with it [23:28:36] I'll just talk to aude and then write an email to you two to see what we can come up with [23:28:53] * greg-g nods [23:29:09] <^d> Ah, it was stale apc. [23:29:11] ok with me [23:29:13] <^d> yay, fixed.
[23:29:15] * aude won't be around [23:29:29] quiddity: did the whole login flow and it wfm [23:29:44] aude: what's the eta for implementing a solution like the one aaron suggested for bug 56602? [23:30:18] ori: we have an immediate fix to constrain use of memcached for this [23:30:21] aude: Ok, I can then do all Tuesday deploys (until traveling or so) [23:30:28] Will write the mail [23:30:38] the solution aaron suggests, need to talk to lydia and tobi when we can schedule [23:30:41] aude: constrain by how much? [23:30:43] * aude has holiday next week [23:30:51] ori: to only users opting into the beta feature [23:30:58] aude: On Wednesday, that is, right? [23:31:23] hoo: not around on wednesday also [23:31:39] Yeah, we will need to find someone for Wednesday, probably [23:31:43] :S [23:31:52] ori: my patch is in gerrit, if merged, maybe we can backport it [23:32:05] Maybe I can do it, but not very likely [23:32:28] hoo: deployments shouldn't depend on just us two [23:32:42] * aude thinks tobi + jan can handle it for example [23:32:43] aude: how much longer will you be around? I can review it now [23:33:00] ori: it's somewhat non-trivial [23:33:06] * aude around for a while [23:33:14] * ori reviews [23:33:14] aude: Yeah, guess they can... just need to ask them [23:33:36] https://gerrit.wikimedia.org/r/#/c/174113/ [23:33:52] In the end, they can still revert or call me up or so [23:33:53] jan knows where all the config is and how to do stuff :) [23:33:58] tobi knows how to do the build [23:34:03] and yes, can revert [23:34:27] or maybe i won't be at the airport yet :) [23:34:52] aude: Don't miss your flight :D [23:34:56] hoo: :) [23:35:47] aren't there some wikis where it's not a BF? [23:35:49] legoktm, hmm, i notice that the http://meta.wikimedia.beta.wmflabs.org/ URL is http, but the URL that logging in redirected me to is http*s* - could that be it? [23:36:05] legoktm: a few [23:36:19] quiddity: secure login is enabled, so you should only be logged in on https I think. [23:36:59] I can't load anything at https://meta.wikimedia.beta.wmflabs.org/ it instantly errors. [23:38:11] also don't know that my patch is a clean cherrypick [23:39:15] quiddity: Firefox can't connect to blah or a different error? [23:39:28] Firefox can't establish a connection to the server at meta.wikimedia.beta.wmflabs.org. [23:39:28] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [23:39:43] quiddity: network issue on your end? [23:40:00] quiddity: can you ping it? [23:40:21] <^d> legoktm: I'm getting instant fail too [23:40:23] ori: for https://gerrit.wikimedia.org/r/#/c/174113/, i suggest if it gets merged tomorrow or so, then we can put it on test2 etc [23:40:26] as a wmf9 build [23:40:36] o.O [23:40:40] <^d> (yes, I can ping) [23:40:47] wfm... [23:41:27] how would I ping a https? I only know $ ping google.com [23:41:31] aude: I'll be working tomorrow... guess I can have a look [23:41:35] hoo: thanks [23:41:35] aude: wfm [23:41:53] ori: ok :) [23:41:54] quiddity: ping https://foobar.com [23:42:15] $ ping https://google.com/ [23:42:16] ping: unknown host https://google.com/ [23:42:23] o.O [23:42:41] but without the https:// it works fine.
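For context, the "immediate fix" aude describes earlier in this exchange is to gate the cache traffic on the beta-feature opt-in, so only opted-in users ever hit memcached. A hedged sketch of that shape — the feature key, cache key, and computeSuggestions() helper are all invented here, and this is not the actual Wikibase patch (gerrit 174113):

```php
<?php
// Sketch of constraining memcached use to beta-feature opt-ins.
// 'wikibase-suggester' and computeSuggestions() are hypothetical.
function getSuggestionsFor( User $user ) {
	global $wgMemc;

	if ( !BetaFeatures::isFeatureEnabled( $user, 'wikibase-suggester' ) ) {
		return null; // opted-out users generate no cache traffic at all
	}

	$key = wfMemcKey( 'suggester', $user->getId() );
	$suggestions = $wgMemc->get( $key );
	if ( $suggestions === false ) {
		$suggestions = computeSuggestions( $user ); // hypothetical expensive step
		$wgMemc->set( $key, $suggestions, 300 ); // cache for five minutes
	}
	return $suggestions;
}
```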
km-mpb:skins km$ ping https://meta.wikimedia.beta.wmflabs.org [23:42:46] PING https://meta.wikimedia.beta.wmflabs.org (208.80.155.135): 56 data bytes [23:42:46] ori: and for the caching solution aaron suggests, we might want to make some of the code used for the i18n cache generalized [23:42:55] that could be used also for the sites data [23:42:59] re*used [23:43:19] what does the sites data consist of, again, and how often does it change? [23:43:26] ori: like the interwiki data [23:43:48] <^d> chad@notsexy /a/vag (master)$ curl --dump-header - "https://meta.wikimedia.beta.wmflabs.org" [23:43:48] <^d> curl: (7) Failed to connect to meta.wikimedia.beta.wmflabs.org port 443: Connection refused [23:43:51] <^d> legoktm: ^ [23:44:04] like global id (enwiki), local ids, "language", etc [23:44:09] aude: how often does it change? [23:44:14] only when we have a new wiki [23:44:27] ^d: ok, I can reproduce that... [23:44:28] can you just store it in a .php file? [23:44:31] this is weird. [23:44:32] Can we cdb it? [23:44:33] $siteData = (...)? [23:44:38] hoo: cdb is a bit evil [23:44:40] afaik [23:44:52] <^d> It's not evil [23:44:52] Well, it's static and stuff... but it's fast [23:44:56] php would be ok or json [23:45:03] * aude doesn't care too much [23:45:03] this changes so infrequently that I think just having a PHP file in mediawiki-config would be best. [23:45:04] static as in: Need to recreate to change [23:45:20] <^d> cdb is only painful for things that change often. like msgs. [23:45:21] <^d> :) [23:45:26] ^d: i see [23:45:29] ^d, quiddity: I have no idea what's going on..... [23:45:32] that might work then [23:45:36] <^d> legoktm: nor do I. [23:45:39] why not a PHP file? [23:45:45] you'll get better caching and better performance [23:45:51] ori: it would be dynamically generated [23:45:57] you can dynamically generate a php file [23:45:58] i think cdb or json would be better for that [23:46:00] ori: You mean just a file with 'return array( /* stuffs */ );' [23:46:01] i know :) [23:46:06] ? [23:46:09] hoo: yes [23:46:28] ori: if you think that is best, then ok with me [23:46:32] <^d> The problem with PHP is it's completely non-portable to other things. [23:46:33] I'd also prefer to have json or cdb... dynamically creating PHP is evil [23:46:37] I want us to fix the performance issue that has been throwing off the whole cluster for months, and then you guys can iterate [23:46:39] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 19756 MB (3% inode=99%): [23:46:39] !log maxsem Synchronized php-1.25wmf9/extensions/WikiGrok/: https://gerrit.wikimedia.org/r/174847 (duration: 00m 04s) [23:46:44] Logged the message, Master [23:46:50] ori: my patch will do that mostly [23:47:14] yes, but your patch is nontrivial, and it doesn't conflict with doing what i'm suggesting [23:47:20] yeah [23:47:48] !log maxsem Synchronized php-1.25wmf8/extensions/WikiGrok/: https://gerrit.wikimedia.org/r/174847 (duration: 00m 04s) [23:47:50] Logged the message, Master [23:48:32] aude, hoo: can one of you pastebin a JSON-serialized version of the current data? [23:49:31] should be trivial... can you do it, aude? [23:49:39] I actually wanted to do other stuffs today :P [23:49:42] Always get lost...
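A sketch of the generated-PHP flavor ori argues for above: dump the array with var_export() into a file ending in return, write it atomically, and load it with require() so the opcode cache (APC, at the time) makes every read after the first essentially free. Function and path names here are invented for illustration:

```php
<?php
// Sketch of ori's "dynamically generate a php file" suggestion. Names
// are made up; the point is the var_export()/require() round trip and
// the atomic rename so readers never see a half-written cache file.
function writeSitesCache( array $sites, $path ) {
	$php = "<?php\n// Generated cache; do not edit.\nreturn "
		. var_export( $sites, true ) . ";\n";
	$tmp = $path . '.tmp.' . getmypid();
	file_put_contents( $tmp, $php );
	rename( $tmp, $path ); // atomic on the same filesystem
}

function readSitesCache( $path ) {
	// Repeated require() calls are served from the opcode cache.
	return is_readable( $path ) ? require $path : false;
}

writeSitesCache(
	array( 'enwiki' => array( 'lang' => 'en', 'group' => 'wikipedia' ) ),
	'/tmp/sites-example.php'
);
var_dump( readSitesCache( '/tmp/sites-example.php' ) );
```

^d's portability objection still holds, though: a generated PHP file is opaque to non-PHP consumers, which is exactly why json and cdb keep coming up as alternatives in the exchange above.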
[23:50:48] ori: give me a minute [23:50:53] thanks [23:52:53] (03CR) 10Kaldari: [C: 032] Add 'types of albums' WikiGrok campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174586 (owner: 10Bmansurov) [23:53:05] (03Merged) 10jenkins-bot: Add 'types of albums' WikiGrok campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174586 (owner: 10Bmansurov) [23:57:15] I'm claiming SWAT today [23:57:22] Because I have a late patch again [23:57:47] * ^d gives RoanKattouw the deployment conch [23:58:49] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures