[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem, tgr: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T0000).
[00:00:21] I'll do it!
[00:00:36] <^d> I WAS HERE FIRST ROANKATTOUW
[00:00:39] ^d: I'm here :)
[00:00:39] <^d> GET YER OWN SWAT
[00:00:40] <^d> :)
[00:00:42] lol
[00:00:47] You can do it if you like
[00:00:52] RoanKattouw wanted a prize
[00:00:53] ^d: In that case, we have two extras not on the list. ;-)
[00:00:54] If you're willing to put up with my last-minute addition
[00:01:09] <^d> Argh!!!!!!
[00:01:15] ^d: See? :-)
[00:01:19] That's why I claimed it
[00:01:27] But I didn't notice you'd claimed it already
[00:01:34] <^d> I was offering prizes.
[00:01:40] <^d> But all I have is more cake, so feel free.
[00:02:23] who's doing SWAT, and would they put up with a massively late addition from me? GPG is failing to execute for SecurePoll and so I want to turn $wgSecurePollShowErrorDetail=true; on ... [or if someone wants to go digging in the logs for me]
[00:03:17] <^d> RoanKattouw and I are arguing over who's doing it :)
[00:03:26] I'll just start doing it now
[00:03:28] perfect, you guys keep arguing and I'll submit a patch
[00:03:50] kaldari is here so his change can go out
[00:03:56] Where is everyone else
[00:04:01] legoktm: tgr: SWAT time
[00:04:11] RoanKattouw: hi
[00:04:14] <^d> I can watch lego's if he doesn't respond.
[00:04:17] <^d> Oh, there he is
[00:04:28] I can see them both in the office
[00:04:30] So it'll be fine
[00:05:11] (CR) Catrope: [C: 2] Adding Wikipedia wordmark for mobile and switching to it [mediawiki-config] - https://gerrit.wikimedia.org/r/174585 (https://bugzilla.wikimedia.org/58886) (owner: Kaldari)
[00:05:17] RoanKattouw: ready if you are
[00:05:18] (Merged) jenkins-bot: Adding Wikipedia wordmark for mobile and switching to it [mediawiki-config] - https://gerrit.wikimedia.org/r/174585 (https://bugzilla.wikimedia.org/58886) (owner: Kaldari)
[00:05:22] Awesome
[00:05:25] Config changes first
[00:05:53] (CR) Catrope: [C: 2] Add SkinDistributor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/174480 (owner: Legoktm)
[00:06:01] (Merged) jenkins-bot: Add SkinDistributor configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/174480 (owner: Legoktm)
[00:06:08] (CR) Catrope: [C: 2] Revert "Revert "Enable JPG thumbnail chaining on all wikis except commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/174451 (owner: Gilles)
[00:06:19] (Merged) jenkins-bot: Revert "Revert "Enable JPG thumbnail chaining on all wikis except commons"" [mediawiki-config] - https://gerrit.wikimedia.org/r/174451 (owner: Gilles)
[00:08:43] (PS1) Jalexander: Enable SecurePoll error detail for debugging [mediawiki-config] - https://gerrit.wikimedia.org/r/174603 (https://bugzilla.wikimedia.org/73626)
[00:09:26] !log catrope Synchronized images/mobile/: SWAT: new Wikipedia wordmark for mobile (duration: 00m 03s)
[00:09:33] Logged the message, Master
[00:09:36] RoanKattouw: I'll add it to the list but if possible I'd like to get that out ^
[00:10:00] apologies for the late timing (it was discovered now because polls, including test polls like this, start at 00:00 )
[00:10:44] !log catrope Synchronized wmf-config/: SWAT (duration: 00m 04s)
[00:10:46] Logged the message, Master
[00:10:57] jamesofur: Will deploy. Happy to do it as long as you add it to the wiki page for posterity
[00:11:03] * jamesofur nods
[00:11:05] thanks, adding now
[00:11:06] (CR) Catrope: [C: 2] Enable SecurePoll error detail for debugging [mediawiki-config] - https://gerrit.wikimedia.org/r/174603 (https://bugzilla.wikimedia.org/73626) (owner: Jalexander)
[00:11:14] (Merged) jenkins-bot: Enable SecurePoll error detail for debugging [mediawiki-config] - https://gerrit.wikimedia.org/r/174603 (https://bugzilla.wikimedia.org/73626) (owner: Jalexander)
[00:13:26] !log catrope Synchronized wmf-config/: SWAT: temp debugging for SecurePoll (duration: 00m 04s)
[00:13:28] Logged the message, Master
[00:13:33] OK, those were the config changes
[00:13:33] thank ye, it's added on wiki
[00:13:50] tgr, kaldari, legoktm, jamesofur: Please verify and confirm those
[00:14:02] verified on my side
[00:14:09] RoanKattouw: Looks good!
[00:14:10] RoanKattouw: mine was a no-op :)
[00:14:52] I've broken it :/
[00:15:13] RoanKattouw: hard to confirm but nothing is broken
[00:15:13] Request: POST http://en.wikipedia.org/w/index.php?title=Special:MovePage&action=submit, from 10.128.0.116 via cp4010 cp4010 ([10.128.0.110]:3128), Varnish XID 575413096
[00:17:19] OK, extension time
[00:26:26] !log catrope Synchronized php-1.25wmf8/includes/media/: SWAT: don't apply EXIF rotation to chained thumbnails (duration: 00m 04s)
[00:26:28] Logged the message, Master
[00:26:34] tgr: ---^^
[00:34:13] (PS1) GWicke: Add cassandra submodule [puppet] - https://gerrit.wikimedia.org/r/174608
[00:34:17] !log catrope Synchronized php-1.25wmf8/extensions/VisualEditor: SWAT (duration: 00m 04s)
[00:34:19] Logged the message, Master
[00:34:21] !log catrope Synchronized php-1.25wmf9/extensions/VisualEditor: SWAT (duration: 00m 04s)
[00:34:24] Logged the message, Master
[00:36:55] (PS1) Kaldari: Changing to relative URL for now [mediawiki-config] - https://gerrit.wikimedia.org/r/174609
[00:36:57] ori: https://gerrit.wikimedia.org/r/174608
[00:37:16] MaxSem: https://gerrit.wikimedia.org/r/#/c/174609/
[00:37:30] kaldari: Could you improve the commit summary?
[00:37:32] RoanKattouw: verified, thanks
[00:37:42] kaldari: Since there are hundreds of URLs in that file, let alone the repo
[00:38:21] RoanKattouw: Sure
[00:38:47] kaldari: Also do you need that deployed right now? I only just finished SWAT so I can throw it in if you want
[00:39:02] (PS2) Kaldari: Changing to relative URL for for mobile wordmark image (per MaxSem) [mediawiki-config] - https://gerrit.wikimedia.org/r/174609
[00:39:35] RoanKattouw: yes, go ahead and deploy that
[00:39:48] OK
[00:39:57] kaldari: Could you do me a favor and edit the wiki page to add that commit?
[00:40:03] While I deploy it
[00:40:08] Just so it's on the record
[00:40:09] RoanKattouw: sure
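For the record, the SecurePoll change synced at [00:13:26] amounts to a one-line configuration flag. A minimal sketch, assuming the usual wmf-config layout (file placement and comment are illustrative, not the actual patch):

    // wmf-config/CommonSettings.php -- placement assumed for illustration
    // Temporarily surface detailed SecurePoll/GPG errors so the failing
    // gpg invocation can be diagnosed (bug 73626). Revert once debugging
    // is done, since error detail can leak internals to voters.
    $wgSecurePollShowErrorDetail = true;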
[00:40:14] (CR) Catrope: [C: 2] Changing to relative URL for for mobile wordmark image (per MaxSem) [mediawiki-config] - https://gerrit.wikimedia.org/r/174609 (owner: Kaldari)
[00:40:24] (Merged) jenkins-bot: Changing to relative URL for for mobile wordmark image (per MaxSem) [mediawiki-config] - https://gerrit.wikimedia.org/r/174609 (owner: Kaldari)
[00:41:13] !log catrope Synchronized wmf-config/InitialiseSettings.php: Change mobile wordmark image to relative URL (duration: 00m 04s)
[00:41:15] Logged the message, Master
[00:41:22] kaldari: Done --^^
[00:42:28] jgage: ping
[00:42:50] RoanKattouw: Updated the page
[00:44:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:44:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0]
[00:50:30] !log maxsem Synchronized php-1.25wmf9/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/174613/ (duration: 00m 04s)
[00:50:35] Logged the message, Master
[00:51:31] * gwicke needs a review https://gerrit.wikimedia.org/r/#/c/174608/ in order to be able to test the module in labs
[00:51:33] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 321 seconds
[00:51:35] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 327 seconds
[00:51:55] gwicke: you can cherry-pick it in labs
[00:53:39] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[00:53:49] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:10] (PS4) GWicke: Give parsoid-roots access to ruthenium; split cassandra test hosts [puppet] - https://gerrit.wikimedia.org/r/172780 (owner: Cscott)
[00:54:42] ori: you mean on the beta labs puppet master?
[00:55:04] yep
[00:55:31] hmm.. won't that potentially conflict with other updates?
[00:55:57] we forgot to actually add the submodule after the module was merged
[00:56:44] https://gerrit.wikimedia.org/r/#/c/166888/11
[00:58:58] ori: it would really be cleaner to just merge this
[00:59:29] OK.
[01:00:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:00:09] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[01:00:24] (CR) Ori.livneh: [C: 2] Add cassandra submodule [puppet] - https://gerrit.wikimedia.org/r/174608 (owner: GWicke)
[01:00:37] ori: thanks!
[01:14:38] ori: do you know how to get the hiera entry fields for configured roles on a node?
[01:19:44] * gwicke has the suspicion that puppet is generally broken in beta labs
[01:20:35] (PS1) Rush: phab security-bug macro testing [puppet] - https://gerrit.wikimedia.org/r/174614
[01:20:39] https://gist.github.com/gwicke/df92917779ad4f731368
[01:21:33] <^d> Did you point it at the beta puppetmaster or is it still pointing at the normal one?
[01:22:17] ^d: I didn't change anything about the master config after creating the instance in beta
[01:22:55] won't it default to the right master?
[01:23:06] (CR) Rush: [C: 2 V: 2] phab security-bug macro testing [puppet] - https://gerrit.wikimedia.org/r/174614 (owner: Rush)
[01:23:14] <^d> gwicke: No, they don't default to beta's puppetmaster unless that's changed.
[01:23:17] <^d> Sec, there's docs.
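The "Add cassandra submodule" step merged at [01:00:24] is the part that was missed after the module repo itself landed. A minimal sketch of how such a submodule gets registered, assuming the module lives in its own repo (URL and path here are illustrative, not the actual change):

    # in a checkout of operations/puppet; URL and path are assumptions
    git submodule add https://gerrit.wikimedia.org/r/p/operations/puppet/cassandra modules/cassandra
    # records the URL in .gitmodules and pins a specific commit
    git commit -m "Add cassandra submodule"
    # other checkouts pick it up after the change merges with:
    git submodule update --init modules/cassandra

Until that second commit exists, the puppet master has no modules/cassandra directory at all, which is why the cassandra class appeared to be missing.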
[01:24:02] the strange thing is that it earlier complained about the cassandra class missing
[01:24:37] after adding the submodule that error is now gone, so clearly puppet is applying the roles I assigned to the node in the wikitech interface
[01:26:15] <^d> gwicke: https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/How_code_is_updated#Converting_a_host_to_use_local_puppetmaster_and_salt_master
[01:28:46] I see, thanks
[01:29:39] hmmm.. do I really need puppet::self?
[01:31:22] * gwicke always thought that sets up another puppet master
[01:32:59] <^d> gwicke: No idea. I just follow the crowd. :)
[01:33:51] kk; makes no difference so far, but might not be applied yet
[01:36:39] * gwicke defers this to another day
[02:21:35] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-20 02:21:35+00:00
[02:21:42] Logged the message, Master
[02:27:08] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (12.50%)
[02:34:12] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-20 02:34:12+00:00
[02:34:19] Logged the message, Master
[02:56:09] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail
[03:10:54] (PS2) Ori.livneh: hhvm: enable perf_pid.map files w/automatic pruning [puppet] - https://gerrit.wikimedia.org/r/174356
[03:13:03] (PS3) Ori.livneh: hhvm: enable perf_pid.map files w/automatic pruning [puppet] - https://gerrit.wikimedia.org/r/174356
[03:13:57] (CR) Ori.livneh: [C: 2] hhvm: enable perf_pid.map files w/automatic pruning [puppet] - https://gerrit.wikimedia.org/r/174356 (owner: Ori.livneh)
[03:14:29] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[03:14:49] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet last ran 23 hours ago
[03:16:58] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[03:18:18] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:18:49] (PS1) Ori.livneh: HHVM: enable perf_pid_map for FCGI only; not CLI. [puppet] - https://gerrit.wikimedia.org/r/174625
[03:19:21] (CR) Ori.livneh: [C: 2 V: 2] HHVM: enable perf_pid_map for FCGI only; not CLI. [puppet] - https://gerrit.wikimedia.org/r/174625 (owner: Ori.livneh)
[03:20:19] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:21:19] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[03:24:08] PROBLEM - puppet last run on mw1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:24:58] PROBLEM - puppet last run on mw1031 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:25:19] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[03:26:58] RECOVERY - puppet last run on mw1028 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[03:27:58] RECOVERY - puppet last run on mw1031 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[03:29:19] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:34:30] (PS1) Ori.livneh: Fix for ensure_jemalloc_prof_deactivated check [puppet] - https://gerrit.wikimedia.org/r/174626
[03:34:42] (CR) Ori.livneh: [C: 2 V: 2] Fix for ensure_jemalloc_prof_deactivated check [puppet] - https://gerrit.wikimedia.org/r/174626 (owner: Ori.livneh)
[03:36:28] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[04:27:25] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Nov 20 04:27:25 UTC 2014 (duration 27m 24s)
[04:27:29] Logged the message, Master
[04:36:26] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK
[04:36:38] (PS2) Glaisher: Delete vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/171219 (https://bugzilla.wikimedia.org/55737)
[05:05:56] PROBLEM - Varnish HTCP daemon on cp1008 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (vhtcpd), args vhtcpd
[05:10:30] sorry that's me above ^, that's not even a prod machine, I'm not sure why it's in icinga :p
[05:15:04] !log made myself an administrator on phabricator
[05:15:08] Logged the message, Master
[05:28:15] !log ~30m to esams power out, starting equipment shutdown and such for OE13/OE15
[05:28:18] Logged the message, Master
[05:33:56] PROBLEM - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:34:23] (PS1) BBlack: esams local nets -> eqiad [dns] - https://gerrit.wikimedia.org/r/174630
[05:34:38] oh that too I guess
[05:34:57] PROBLEM - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[05:34:59] (CR) BBlack: [C: 2] esams local nets -> eqiad [dns] - https://gerrit.wikimedia.org/r/174630 (owner: BBlack)
[05:35:29] ignore those, I didn't think about the big addrs when I set downtimes
[05:41:04] !log amssq31-62, cp300[12], lvs300[34], ssl300[123] all shut down for esams power event (and downtimed)
[05:41:08] Logged the message, Master
[05:43:06] PROBLEM - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:43:23] PROBLEM - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:43:27] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:43:30] PROBLEM - LVS HTTPS IPv4 on upload-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:43:33] PROBLEM - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:43:37] PROBLEM - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is CRITICAL: No route to host
[05:43:51] really? :p
[05:43:52] wooo
[05:44:33] oh yeah, SSL is hidden under the same names in icinga
[05:44:42] more things to disable!
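Downtimes like the ones bblack set are scheduled through Icinga's external command interface. A minimal sketch, from memory of the standard Nagios/Icinga command syntax (the command-file path and the two-hour window are assumptions):

    # schedule a fixed 2h host downtime for cp3001; times are epoch seconds
    now=$(date +%s)
    printf '[%s] SCHEDULE_HOST_DOWNTIME;cp3001;%s;%s;1;0;7200;bblack;esams power work\n' \
        "$now" "$now" "$((now + 7200))" > /var/lib/icinga/rw/icinga.cmd

A host downtime does not cover the host's service checks, which need their own SCHEDULE_SVC_DOWNTIME (or SCHEDULE_HOST_SVC_DOWNTIME) entries; that is exactly why the SSL checks "hidden under the same names" still paged above.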
[05:45:30] Nov 20 05:40:29 berkelium charon: [IKE] IKE_SA berkelium-cp3001[23] state change: DELETING => DESTROYING
[05:45:33] DESTROY
[05:46:44] sorry for the pages. if you wake up randomly and come here looking, go back to sleep!
[05:46:55] DAMN YOU
[05:47:03] ;)
[05:48:38] good morning!
[05:48:42] hi
[05:49:16] we should be good to go. are they planning to contact us/you about start/finish times for the actual work?
[05:49:25] yes they'll call me before they start
[05:49:29] ok
[05:55:33] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: No route to host
[05:55:51] hmmm what's that?
[05:55:57] no idea
[05:56:04] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, sessions up: 57, down: 15, shutdown: 1BRPeering with AS13335 not established - The + flag cannot be used with the sub-query features described below.BRPeering with AS3216 not established - SOVAM-ASBRPeering with AS1273 not established - CWBRPeering with AS5650 not established - The + flag cannot be used with the sub-query features described below.BRPeering w
[05:56:55] seems related to esams, something about wikidata.org depends on it somehow
[05:57:28] that icinga check anyways
[05:58:01] it's listed in icinga under eqiad misc, but has an esams IP which is text-lb.esams
[05:58:43] weird
[06:00:33] it's defined in puppet as /usr/lib/nagios/plugins/check_http -H www.wikidata.org -I 91.198.174.192 -S -u "/w/api.php?action=query&meta=siteinfo&format=json&siprop=statistics" --linespan --ereg '"median":[^}]*"lag":[1-5]?[0-9]'
[06:00:53] I'm guessing that's intentional, to measure some kind of replication lag from eqiad->esams for wikidata, and now it can't reach it to check it.
[06:01:05] i doubt it's intentional
[06:01:29] I mean maybe it's intentional that it's checking esams, since it claims to be about lag
[06:01:35] no idea
[06:02:31] ok
[06:02:33] evoswitch called
[06:02:36] they're starting in a few mins
[06:02:54] cool
[06:07:23] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 72, down: 0, shutdown: 1
[06:08:44] so are we recording purges?
[06:09:14] yes, with socat. the format is a bit ugly, but I think it can be post-processed if necc
[06:09:24] on which box?
[06:09:31] cp1008.eqiad.wmnet
[06:09:39] ok
[06:09:53] (it's my ssl test host)
[06:10:51] OE15 went missing
[06:11:24] yay
[06:11:43] and back
[06:13:48] perhaps we should upgrade the software on the switches and routers now we have no traffic anyway
[06:14:43] sure
[06:17:02] ok they're done
[06:17:08] so we can power up the machines
[06:17:10] that was quick :)
[06:17:16] both racks are good to go?
[06:17:19] seems so
[06:17:49] ok starting on that
[06:19:48] i'm looking at the software upgrades in the meantime
[06:25:50] <_joe_> morning :)
[06:25:55] hi
[06:27:00] cool, ipsec reestablished as soon as cp3001 came back
[06:27:31] <_joe_> jgage: :)
[06:27:44] not a single unencrypted ping reply got through before sec was renegotiated
[06:27:54] RECOVERY - LVS HTTPS IPv4 on bits-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 4055 bytes in 0.515 second response time
[06:27:57] RECOVERY - LVS HTTPS IPv4 on upload-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 734 bytes in 0.508 second response time
[06:28:54] RECOVERY - LVS HTTP IPv6 on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 69134 bytes in 0.486 second response time
[06:29:04] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:33] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:29:53] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:06] <_joe_> oh mod_passenger o'clock
[06:30:43] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] RECOVERY - LVS HTTPS IPv6 on bits-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 4056 bytes in 0.514 second response time
[06:31:13] RECOVERY - LVS HTTP IPv4 on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 69175 bytes in 0.572 second response time
[06:31:20] bblack: let me know when you're ready, so I can start the switch upgrade
[06:31:26] ready for?
[06:31:38] when all machines are turned on
[06:31:41] the network will go down
[06:31:55] not management in theory though
[06:31:58] they're all on now I think, but waiting for the last batch to show up ok in icinga
[06:32:02] ok
[06:32:13] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 734 bytes in 0.538 second response time
[06:32:23] then I'll start
[06:32:24] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 21982 bytes in 0.595 second response time
[06:32:37] RECOVERY - LVS HTTPS IPv4 on mobile-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 22013 bytes in 0.605 second response time
[06:32:50] stupid pages
[06:32:57] i'm not getting any
[06:33:03] i got them
[06:33:05] might be because of time
[06:33:05] <_joe_> all hosts are up in icinga
[06:33:10] yup
[06:33:16] ok starting
[06:33:23] <_joe_> mark: we get paged 8 AM - midnight I guess
[06:33:28] yes
[06:33:36] the console server didn't like my attempt to parallel ssh into it like 40x at once, I had to do it in 3 batches heh
[06:33:37] mark, still junos 11.x?
[06:34:03] what do you mean?
[06:34:18] iirc the switches are running junos version 11.something
[06:34:28] just wondering if you're upgrading within 11.x or to 12 etc
[06:34:32] i think they're up to 14.x
[06:34:49] once you're done blipping the network, I'll stop the packet log too, as they should be getting invalidation flow again
[06:35:44] <_joe_> bblack: once we're done, I've seen you saying some service showed up as misc_eqiad and that was wrong; I guess this might be a puppet error
[06:36:13] i'm upgrading to 12.3R6
[06:36:29] neat. *looks for changelog*
[06:36:34] _joe_: maybe, I'm not really sure what's up with that check
[06:37:54] _joe_: it's check_wikidata in puppet if you grep for it. it has an explicit esams IP
[06:38:07] <_joe_> ok
[06:45:48] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:07] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:08] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:26] starting upgrade
[06:46:29] this can get messy
[06:46:37] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:52:09] oh are we going to lose lots of other hosts? should I try to downtime all of esams or something to avoid spam?
[06:52:20] yes
[06:52:24] not of the paging kind though ;)
[06:52:27] but all will go down
[06:53:11] package is installed, upgrade will start when I reboot individual switches
[06:53:23] not of the paging kind?
[06:53:31] individual servers won't page no
[06:53:35] LVS will of course
[06:53:38] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: Puppet has 1 failures
[06:54:16] so what do you want to do?
[06:55:38] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1509 bytes in 0.857 second response time
[06:55:38] I downtimed lvs300[12]
[06:55:46] the other two are still in downtime
[06:55:51] those won't page either
[06:55:54] just the LVS ips will
[06:55:59] oh really?
[06:56:03] yeah
[06:56:11] the LVS ips are all in downtime at this point anyways
[06:56:15] ok
[06:56:21] i'm going to reboot a single member switch first
[06:57:15] Uptime: 768d7h13m34s
[06:57:32] heh
[06:57:38] PROBLEM - Host cp3007 is DOWN: PING CRITICAL - Packet loss = 100%
[06:58:17] PROBLEM - Host cp3004 is DOWN: PING CRITICAL - Packet loss = 100%
[06:58:17] PROBLEM - Host cp3020 is DOWN: PING CRITICAL - Packet loss = 100%
[06:58:34] that's the EX450
[06:58:35] 4500
[06:58:43] which is the odd duck ;)
[06:58:57] PROBLEM - Host cp3015 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3016 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3005 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3018 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:57] PROBLEM - Host cp3010 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:58] PROBLEM - Host cp3022 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:58] PROBLEM - Host cp3017 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:58:59] PROBLEM - Host cp3008 is DOWN: CRITICAL - Plugin timed out after 15 seconds
[06:59:05] wheee spam
[06:59:07] PROBLEM - Host cp3021 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:07] PROBLEM - Host cp3006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:07] PROBLEM - Host cp3003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:18] PROBLEM - Host cp3019 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:18] PROBLEM - Host cp3009 is DOWN: PING CRITICAL - Packet loss = 100%
[06:59:30] <_joe_> wait for my next puppet change that involves all production :)
[06:59:38] PROBLEM - Router interfaces on cr1-esams is CRITICAL: CRITICAL: host 91.198.174.245, interfaces up: 78, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: csw2-esams:xe-5/0/39 {#10088} [10Gbps DF]BRxe-1/1/0: down - Core: csw2-esams:xe-5/0/38 {#10089} [10Gbps DF]BR
[07:00:55] I guess since the text caches are still in downtime and they're the bulk of the machines, it won't be so bad in here
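mark's procedure above (install the package, then reboot each member) matches the standard two-step JunOS upgrade on an EX virtual chassis. A sketch from memory of the JunOS CLI; the package filename is illustrative:

    # on csw2-esams -- stage the new image (takes effect only on reboot)
    request system software add /var/tmp/jinstall-ex-4500-12.3R6-domestic-signed.tgz
    # reboot one virtual-chassis member at a time to limit the blast radius
    request system reboot member 5

Rebooting a single member first, as done here, is a cheap sanity check before cycling the rest of the chassis.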
[07:01:22] i like irc spam
[07:01:37] you can nicely see what's happening with it
[07:01:51] I guess with a big change like this, it's ok
[07:01:56] <_joe_> mark: do you want moar irc spam?
[07:02:10] i do, others don't ;p
[07:02:12] my general habit is to try to remember to downtime things so I don't make people randomly wonder wtf is up and go looking or worrying
[07:02:14] <_joe_> :)
[07:02:20] <_joe_> bblack: me too
[07:03:03] <_joe_> but well, when puppet fails to apply a change, even if the catalog compiles, everyone will be annoyed
[07:03:12] * yuvipandaops wants some +1 on https://gerrit.wikimedia.org/r/#/c/174430/ which increases spam here *slightly*
[07:04:00] (CR) Giuseppe Lavagetto: [C: -1] "I think that screams in -operations should be limited to production." [puppet] - https://gerrit.wikimedia.org/r/174430 (owner: Yuvipanda)
[07:04:14] <_joe_> yuvipandaops: you know I don't agree :)
[07:05:07] someone should go hack our icinga-bot to be less-spammy in general
[07:05:08] well, betalabs breakages are caused by changes to production that don't take into account betalabs exists
[07:05:30] i'm going to reboot the rest of the switches
[07:05:53] <_joe_> yuvipandaops: betalabs breakages are caused by the fact we don't design with multi env in mind, and that beta diverges from prod in silly ways
[07:06:16] after it hits some immediate-term ratelimit (like 5 messages in 15 seconds?), it should go into a spam-reduction mode, and maybe put out a message every 30 seconds like "55x icinga events affecting hosts: cp3011, cp3012, ..."
[07:06:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 203, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/3: down - Peering: ! Equinix Exchange {#2648} [10Gbps DF]BR
[07:06:28] <_joe_> bblack: yes
[07:07:00] true, so solution is to fix the designing, and screaming whenever it goes wrong might be a good first step :)
[07:07:28] PROBLEM - Host cp3012 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:28] PROBLEM - Host cp3011 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:28] PROBLEM - Host ms-be3003 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:28] PROBLEM - Host cp3013 is DOWN: PING CRITICAL - Packet loss = 100%
[07:07:58] PROBLEM - Host cp3014 is DOWN: PING CRITICAL - Packet loss = 100%
[07:08:56] <_joe_> yuvipandaops: I'd rather mail us
[07:09:14] hmm
[07:09:16] that's true
[07:09:29] email is so overused and messy
[07:10:14] I'd rather have a site like statuslog.wm.o that shows the icinga log entries with a nice UI, and I can configure it to filter/display whatever and to make little dinging sounds on new events if I want, etc
[07:10:25] but no one will be looking at it
[07:10:26] (PS3) Yuvipanda: puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472)
[07:10:27] i think irc is fine
[07:10:31] and a real annoyance
[07:10:38] IRC is a good start, at least...
[07:10:44] <_joe_> bblack: I tend to care immediately of any icinga-wm irc notification I see
[07:10:57] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[07:11:14] we all have browser tabs open all day anyways right? just make it make a sound over there, and then it doesn't spam our conversations in here
[07:11:41] you know what this needs? AN ANDROID APP!
[07:11:45] * yuvipandaops slinks away
[07:11:46] yes!
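bblack's throttling idea at [07:06:16] is simple enough to sketch. A minimal illustration in Python of the proposed behaviour, with the thresholds and rollup format taken from his description (this is not the actual bot code):

    import time

    class SpamGuard:
        """Relay messages until 5 arrive within 15s; then batch into rollups."""
        def __init__(self, burst=5, window=15, rollup_every=30):
            self.burst, self.window, self.rollup_every = burst, window, rollup_every
            self.recent = []   # timestamps of recently relayed messages
            self.held = []     # (host, line) pairs held during spam mode
            self.last_rollup = 0.0

        def handle(self, host, line, now=None):
            now = now if now is not None else time.time()
            # forget relays older than the rate window
            self.recent = [t for t in self.recent if now - t < self.window]
            if self.held or len(self.recent) >= self.burst:
                # spam-reduction mode: hold the event, emit periodic rollups
                self.held.append((host, line))
                if now - self.last_rollup >= self.rollup_every:
                    hosts = sorted({h for h, _ in self.held})
                    print("%dx icinga events affecting hosts: %s"
                          % (len(self.held), ", ".join(hosts)))
                    self.held, self.last_rollup = [], now
            else:
                # normal pass-through
                self.recent.append(now)
                print(line)

Once the held queue drains and the 15-second window empties out, the guard falls back to relaying messages one by one.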
[07:11:58] <_joe_> yuvipandaops: anag if well configured can help
[07:12:04] <_joe_> I think sean uses it
[07:12:14] one that needs permission to modify my global security settings and read all my email, just so it can launch a browser wrapper
[07:12:14] ooooh, nice
[07:12:22] not to mention your location
[07:12:29] and constant microphone access
[07:12:38] just so it can alert extra loudly when you're in the shower
[07:12:45] heh
[07:12:47] PROBLEM - Host wikidata is DOWN: CRITICAL - Network Unreachable (91.198.174.192)
[07:13:04] _joe_: https://gerrit.wikimedia.org/r/#/c/174132/ reworked after you pointed out obvious flaw....
[07:13:05] there's check_wikidata again
[07:15:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 205, down: 0, dormant: 0, excluded: 0, unused: 0
[07:17:47] PROBLEM - Host ns2-v6 is DOWN: /bin/ping6 -n -U -w 15 -c 5 2620:0:862:ed1a::e
[07:17:52] PROBLEM - Host amslvs1 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:52] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:52] PROBLEM - Host ms-fe3001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:17:52] PROBLEM - Host amslvs2 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:01] PROBLEM - Host hooft is DOWN: CRITICAL - Network Unreachable (91.198.174.113)
[07:18:02] PROBLEM - Host ns2-v4 is DOWN: CRITICAL - Network Unreachable (91.198.174.239)
[07:18:02] PROBLEM - Host nescio is DOWN: CRITICAL - Network Unreachable (91.198.174.106)
[07:18:02] PROBLEM - Host mr1-esams is DOWN: CRITICAL - Network Unreachable (91.198.174.247)
[07:18:08] PROBLEM - Host eeden is DOWN: CRITICAL - Network Unreachable (91.198.174.121)
[07:18:08] PROBLEM - Host amslvs3 is DOWN: CRITICAL - Network Unreachable (91.198.174.111)
[07:18:08] PROBLEM - Host 91.198.174.6 is DOWN: CRITICAL - Network Unreachable (91.198.174.6)
[07:18:17] PROBLEM - Host ms-fe3002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:27] PROBLEM - Host ms-be3004 is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:58] PROBLEM - Host cr1-esams is DOWN: PING CRITICAL - Packet loss = 100%
[07:18:58] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 48, down: 13, dormant: 0, excluded: 2, unused: 0BRae1.102: down - Subnet toolserver1-esamsBRae1.401: down - cr1-esams:ae1.401BRae1.100: down - Subnet public1-esamsBRxe-0/0/0: down - Core: csw2-esams:xe-2/1/1 (GBLX leg 1) {#14006} [10Gbps DF]BRae1.405: down - mr1-esams:ge-0/0/1.405BRae1: down - csw2-esams:ae2BRae1.301: down - Subnet
[07:19:38] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:39] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:23:43] i don't think esams is coming back
[07:25:07] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 81, down: 0, dormant: 0, excluded: 2, unused: 0
[07:25:13] I was about to say wikipedia is very slow relatively, but now i know why, so i'll shut up
[07:25:17] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 94.76 ms
[07:25:21] RECOVERY - Host ms-be3001 is UP: PING OK - Packet loss = 0%, RTA = 94.88 ms
[07:25:26] maybe it will
[07:25:50] _joe_: feel free to merge, https://gerrit.wikimedia.org/r/#/c/173763/
[07:26:07] <_joe_> kart_: ok thanks I was about to ping you actually
[07:26:14] :)
[07:26:34] * yuvipandaops also prods _joe_ with https://gerrit.wikimedia.org/r/#/c/174132/ again :)
[07:26:38] <_joe_> I'll wait for Reedy, he has an apache patch as well and I'd really like to send them together
[07:26:55] (CR) Giuseppe Lavagetto: [C: 1] puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472) (owner: Yuvipanda)
[07:27:07] <_joe_> yuvipandaops: ^^ I was _already_ doing it
[07:27:12] hehe
[07:27:33] <_joe_> before you said "now I don't look 30 anymore" like being 30 means being super-old
[07:29:12] (CR) Yuvipanda: [C: -1] "Hmm, not sure if this belongs in this module. Perhaps have a pdu (or facilities) module, and then put this there?" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173999 (owner: Dzahn)
[07:29:53] _joe_: aww/ow, hmm, I didn't mean that.
[07:30:11] <_joe_> yuvipandaops: eheh I know, I get derailed easily
[07:30:16] :)
[07:31:28] (PS4) Yuvipanda: puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472)
[07:33:13] (CR) Yuvipanda: [C: 2] puppetmaster: Make time to keep old reports for configurable [puppet] - https://gerrit.wikimedia.org/r/174132 (https://bugzilla.wikimedia.org/73472) (owner: Yuvipanda)
[07:35:38] hmmm
[07:35:47] cloudadmins should be able to edit all of hiera, methinks
[07:39:09] * bblack isn't a cloudadmin :(
[07:39:12] <_joe_> yuvipanda: yes, and we need a per-instance lookup too, sooner or later
[07:39:30] well, I can make all of ops able to edit too, but should be cloudadmin...
[07:39:52] _joe_: is trivial already, I think. Hiera:/ can already be a page, editable only by cloudadmins.
[07:40:00] yeah I run into that all the time, where I go to look at something on wikitech and it tells me to go away because I'm not a cloudadmin
[07:40:21] just become cloudadmin! :)
[07:40:22] and then I'm like "I'm so going to go log into whatever this is running on and give it to myself", but then I never bother
[07:40:28] hehe
[07:40:38] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:40:42] _joe_: also, we need labs.yaml, similar to production.yaml, I think.
[07:40:48] I don't think we have it already
[07:40:51] <_joe_> yuvipanda: make that editable by project admins
[07:41:04] <_joe_> yuvipanda: I think I added it in a patch of gabriel's
[07:41:05] _joe_: bah, I meant projectadmins, not cloudadmins
[07:41:53] <_joe_> yuvipanda: oh ok
[07:41:57] ok, making hiera editable by cloudadmins.
[07:42:00] * yuvipanda works on patch
[07:42:06] hmm, I seem to get distracted all the time.
[07:42:27] <_joe_> can you add a section of docs here https://wikitech.wikimedia.org/wiki/Puppet_Hiera about labs?
[07:46:51] _joe_: I've it in a tab somewhere, let me finish that up.
[07:53:06] _joe_: hmm, as an update, subpages aren't enabled yet, I'll co-ordinate with MW folks to deploy that as well (is a config change)
[07:53:08] let me file a bug
[07:53:43] <_joe_> yuvipanda: ok np, I also need to update the hiera config I guess
[07:53:56] yeah
[07:58:48] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[08:00:43] (CR) Giuseppe Lavagetto: varnish: remove cache separation for HHVM (1 comment) [puppet] - https://gerrit.wikimedia.org/r/174390 (owner: Giuseppe Lavagetto)
[08:01:00] (PS2) Giuseppe Lavagetto: varnish: remove cache separation for HHVM [puppet] - https://gerrit.wikimedia.org/r/174390
[08:03:38] PROBLEM - Host ms-be3001 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:38] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:03:50] there's the power cycle
[08:04:21] _joe_: filed https://phabricator.wikimedia.org/T1356?workflow=create
[08:04:48] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 59, down: 12, dormant: 0, excluded: 2, unused: 0BRae1.102: down - Subnet toolserver1-esamsBRae1.401: down - cr1-esams:ae1.401BRae1.100: down - Subnet public1-esamsBRxe-0/0/0: down - Core: csw2-esams:xe-2/1/1 (GBLX leg 1) {#14006} [10Gbps DF]BRae1.405: down - mr1-esams:ge-0/0/1.405BRae1: down - csw2-esams:ae2BRae1.301: down - Subnet
[08:10:38] RECOVERY - Host cp3013 is UP: PING WARNING - Packet loss = 50%, RTA = 96.79 ms
[08:10:42] oh
[08:10:43] there we go
[08:10:48] RECOVERY - Host cp3011 is UP: PING OK - Packet loss = 0%, RTA = 92.92 ms
[08:10:48] RECOVERY - Host cp3012 is UP: PING OK - Packet loss = 0%, RTA = 94.63 ms
[08:10:48] RECOVERY - Host cp3022 is UP: PING WARNING - Packet loss = 86%, RTA = 88.46 ms
[08:10:48] RECOVERY - Host cp3015 is UP: PING WARNING - Packet loss = 93%, RTA = 94.63 ms
[08:10:48] RECOVERY - Host cp3014 is UP: PING WARNING - Packet loss = 37%, RTA = 95.42 ms
[08:10:48] RECOVERY - Host ms-be3001 is UP: PING WARNING - Packet loss = 44%, RTA = 96.18 ms
[08:10:48] RECOVERY - Host ms-be3002 is UP: PING WARNING - Packet loss = 44%, RTA = 95.39 ms
[08:10:49] RECOVERY - Host ms-be3003 is UP: PING WARNING - Packet loss = 44%, RTA = 95.28 ms
[08:10:49] RECOVERY - Host cr1-esams is UP: PING WARNING - Packet loss = 37%, RTA = 96.48 ms
[08:10:54] !
[08:10:58] RECOVERY - Host cp3003 is UP: PING OK - Packet loss = 0%, RTA = 95.27 ms
[08:10:58] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 95.54 ms
[08:10:58] RECOVERY - Host cp3021 is UP: PING OK - Packet loss = 0%, RTA = 95.75 ms
[08:10:58] RECOVERY - Host cp3019 is UP: PING OK - Packet loss = 0%, RTA = 95.23 ms
[08:11:09] RECOVERY - Host cp3004 is UP: PING OK - Packet loss = 0%, RTA = 95.57 ms
[08:11:09] RECOVERY - Host cp3005 is UP: PING OK - Packet loss = 0%, RTA = 95.30 ms
[08:11:09] RECOVERY - Host cp3020 is UP: PING OK - Packet loss = 0%, RTA = 96.45 ms
[08:11:10] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 97.35 ms
[08:11:10] RECOVERY - Host cp3006 is UP: PING OK - Packet loss = 0%, RTA = 96.16 ms
[08:11:10] RECOVERY - Host hooft is UP: PING OK - Packet loss = 0%, RTA = 95.13 ms
[08:11:10] RECOVERY - Host amslvs3 is UP: PING OK - Packet loss = 0%, RTA = 98.07 ms
[08:11:11] RECOVERY - Host nescio is UP: PING OK - Packet loss = 0%, RTA = 96.55 ms
[08:11:11] RECOVERY - Host amslvs1 is UP: PING OK - Packet loss = 0%, RTA = 95.11 ms
[08:11:12] RECOVERY - Host ms-be3004 is UP: PING OK - Packet loss = 0%, RTA = 95.53 ms
[08:11:12] RECOVERY - Host ms-fe3001 is UP: PING OK - Packet loss = 0%, RTA = 95.50 ms
[08:11:13] RECOVERY - Host cp3007 is UP: PING OK - Packet loss = 0%, RTA = 96.15 ms
[08:11:13] RECOVERY - Host ns2-v6 is UP: PING OK - Packet loss = 0%, RTA = 96.35 ms
[08:11:14] RECOVERY - Host cp3008 is UP: PING OK - Packet loss = 0%, RTA = 97.04 ms
[08:11:14] RECOVERY - Host cp3009 is UP: PING OK - Packet loss = 0%, RTA = 95.34 ms
[08:11:15] RECOVERY - Host cp3018 is UP: PING OK - Packet loss = 0%, RTA = 95.86 ms
[08:11:15] RECOVERY - Host cp3010 is UP: PING OK - Packet loss = 0%, RTA = 95.70 ms
[08:11:16] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 95.03 ms
[08:11:16] RECOVERY - Host ms-fe3002 is UP: PING OK - Packet loss = 0%, RTA = 98.16 ms
[08:11:18] RECOVERY - Host amslvs4 is UP: PING OK - Packet loss = 0%, RTA = 95.20 ms
[08:11:18] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 94.73 ms
[08:11:18] RECOVERY - Host amslvs2 is UP: PING OK - Packet loss = 0%, RTA = 96.29 ms
[08:11:23] yay, all good
[08:11:27] RECOVERY - Host 91.198.174.6 is UP: PING OK - Packet loss = 0%, RTA = 99.33 ms
[08:11:29] <_joe_> :)
[08:11:37] <_joe_> mark: no daytrip for you, it seems
[08:11:43] indeed
[08:11:48] well i really need to go there anyway some time soon
[08:11:49] but yeah
[08:11:57] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 81, down: 0, dormant: 0, excluded: 2, unused: 0
[08:12:07] RECOVERY - Host mr1-esams is UP: PING OK - Packet loss = 0%, RTA = 95.43 ms
[08:12:58] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: puppet fail
[08:13:07] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: puppet fail
[08:13:07] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: puppet fail
[08:13:08] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail
[08:13:09] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail
[08:13:09] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail
[08:13:10] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail
[08:13:17] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail
[08:13:19] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail
[08:13:30] <_joe_> oh I love "puppet last run" spam
[08:13:37] <_joe_> at least now it's accurate
[08:13:48] PROBLEM - puppet last run on ms-be3001 is CRITICAL: CRITICAL: puppet fail
[08:13:48] PROBLEM - puppet last run on ms-be3002 is CRITICAL: CRITICAL: puppet fail
[08:13:58] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: puppet fail
[08:14:07] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on amslvs1 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail
[08:14:11] PROBLEM - puppet last run on ms-fe3002 is CRITICAL: CRITICAL: puppet fail
[08:14:12] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: puppet fail
[08:14:12] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail
[08:14:13] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[08:14:13] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[08:15:08] RECOVERY - Host wikidata is UP: PING OK - Packet loss = 0%, RTA = 95.87 ms
[08:15:18] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[08:15:58] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:15:58] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[08:16:08] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:17:08] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[08:17:18] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[08:17:19] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[08:18:09] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[08:18:27] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[08:20:17] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[08:21:19] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[08:21:19] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:22:18] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[08:22:18] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[08:22:23] (PS1) Yuvipanda: Add .gitreview file [software/ircyall] - https://gerrit.wikimedia.org/r/174641
[08:22:28] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:22:38] (CR) Yuvipanda: [C: 2 V: 2] Add .gitreview file [software/ircyall] - https://gerrit.wikimedia.org/r/174641 (owner: Yuvipanda)
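The .gitreview file merged above is what lets `git review` find the right Gerrit remote; it follows a standard INI format. Roughly what it would contain for this repo, reconstructed from the usual convention rather than copied from the patch (the project path is an assumption):

    [gerrit]
    host=gerrit.wikimedia.org
    port=29418
    project=operations/software/ircyall.git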
[08:23:18] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[08:23:28] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[08:25:27] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:25:28] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[08:26:27] RECOVERY - puppet last run on amslvs1 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures
[08:26:27] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[08:26:28] (PS1) BBlack: Revert "esams local nets -> eqiad" [dns] - https://gerrit.wikimedia.org/r/174644
[08:26:47] (CR) BBlack: [C: 2] Revert "esams local nets -> eqiad" [dns] - https://gerrit.wikimedia.org/r/174644 (owner: BBlack)
[08:27:27] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[08:27:33] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[08:28:08] RECOVERY - puppet last run on ms-be3002 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:30:17] RECOVERY - puppet last run on ms-be3001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[08:30:31] RECOVERY - puppet last run on ms-fe3002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:35:00] (PS1) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[08:35:17] ori: https://gerrit.wikimedia.org/r/#/c/174643/
[08:38:42] AaronSchulz: reading
[08:39:12] the ternary is strong with this one
[08:40:48] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR
[08:46:26] the null / '' thing is a little evil
[09:00:30] (PS1) BBlack: Revert "esams drain: GR/HU/NL/NO/PL/RO -> eqiad, LU/IM/IT -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174648
[09:00:32] (PS1) BBlack: Revert "esams drain: 13x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174649
[09:00:34] (PS1) BBlack: Revert "esams drain: IE/IS/PT->ulsfo, FR/ES eqiad->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174650
[09:00:36] (PS2) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[09:00:44] (CR) BBlack: [C: 2] Revert "esams drain: GR/HU/NL/NO/PL/RO -> eqiad, LU/IM/IT -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174648 (owner: BBlack)
[09:00:57] (CR) BBlack: [C: 2] Revert "esams drain: 13x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174649 (owner: BBlack)
[09:01:12] (CR) BBlack: [C: 2] Revert "esams drain: IE/IS/PT->ulsfo, FR/ES eqiad->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174650 (owner: BBlack)
[09:05:56] (PS1) BBlack: Revert "esams drain: GB -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174652
[09:05:58] (PS1) BBlack: Revert "esams drain: AF + 6x countries esams->eqiad" [dns] - https://gerrit.wikimedia.org/r/174653
[09:06:00] (PS1) BBlack: Revert "esams drain: rest of AS esams->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174654
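These reverts repool esams in GeoDNS: the original "drain" commits had pointed each country's traffic at eqiad or ulsfo instead. In gdnsd, which serves these maps, a drain is roughly an edit to the datacenter preference lists inside the geoip plugin's map. A heavily hedged sketch of the general shape, from memory of gdnsd's config syntax and not from the actual operations/dns repo:

    # illustrative only -- the real map lives in the operations/dns repo
    plugins => {
      geoip => {
        maps => {
          generic-map => {
            datacenters => [eqiad, ulsfo, esams]
            map => {
              EU => {
                # repooled preference order; during the drain GB led with ulsfo
                GB => [esams, eqiad, ulsfo]
              }
            }
          }
        }
      }
    }

Draining per country, as the commit subjects show, lets traffic shift in controlled slices rather than all at once.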
[09:07:52] (CR) BBlack: [C: 2] Revert "esams drain: GB -> ulsfo" [dns] - https://gerrit.wikimedia.org/r/174652 (owner: BBlack)
[09:08:05] (CR) BBlack: [C: 2] Revert "esams drain: AF + 6x countries esams->eqiad" [dns] - https://gerrit.wikimedia.org/r/174653 (owner: BBlack)
[09:08:16] (CR) BBlack: [C: 2] Revert "esams drain: rest of AS esams->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174654 (owner: BBlack)
[09:11:43] (PS3) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[09:22:18] (PS1) BBlack: Revert "esams drain: 7x->ulsfo + 7x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174655
[09:22:20] (PS1) BBlack: Revert "esams drain: CH/CZ/DE->eqiad, TR/UA->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174656
[09:22:22] (PS1) BBlack: Revert "esams drain: RU->ulsfo, 8x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174657
[09:23:01] (CR) BBlack: [C: 2] Revert "esams drain: 7x->ulsfo + 7x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174655 (owner: BBlack)
[09:23:12] (CR) BBlack: [C: 2] Revert "esams drain: CH/CZ/DE->eqiad, TR/UA->ulsfo" [dns] - https://gerrit.wikimedia.org/r/174656 (owner: BBlack)
[09:23:23] (CR) BBlack: [C: 2] Revert "esams drain: RU->ulsfo, 8x->eqiad" [dns] - https://gerrit.wikimedia.org/r/174657 (owner: BBlack)
[09:27:57] (PS4) Yuvipanda: ircyall: Introduce module for web2irc relay [puppet] - https://gerrit.wikimedia.org/r/174647
[09:44:35] Sigh, people relying on Commons IPs https://www.mediawiki.org/w/index.php?title=Talk:InstantCommons&diff=0&oldid=1198749
[09:46:52] <_joe_> Nemo_bis: the only answer I have is "do regular dns queries and update your firewall accordingly"
[09:47:17] <_joe_> and yes, that's sad
[09:47:51] Nemo_bis: or don't whitelist IPs at all
[09:47:55] Yes, I thought of that but I gave a hammer suggestion instead https://www.mediawiki.org/wiki/Talk:InstantCommons#Behind_a_firewall_or_load_balancer
[09:48:20] If they're so keen on specifying IPs
[09:49:20] <_joe_> maybe we can explain him why those IP changes happen
[09:49:25] <_joe_> him/her
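_joe_'s "do regular dns queries and update your firewall accordingly" can be automated on the remote side. A rough sketch of the idea for an InstantCommons host behind an egress firewall; the chain name and cron cadence are assumptions, and this is not an endorsement of IP whitelisting, which the thread discourages:

    #!/bin/sh
    # refresh-wikimedia-ips.sh -- re-resolve endpoints and rebuild an
    # iptables chain; run from cron (e.g. hourly) so DNS changes are tracked
    CHAIN=wikimedia-egress   # assumed chain, created beforehand
    iptables -F "$CHAIN"
    for host in upload.wikimedia.org commons.wikimedia.org; do
        for ip in $(dig +short A "$host"); do
            iptables -A "$CHAIN" -d "$ip" -p tcp --dport 443 -j ACCEPT
        done
    done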
[10:14:09] (PS1) Giuseppe Lavagetto: deployment: make scap proxies configured in one place [puppet] - https://gerrit.wikimedia.org/r/174664
[10:14:28] <_joe_> ok this ^^ is a "real hiera benefit"
[11:15:53] (PS1) Filippo Giunchedi: syslog: deprecate /home/wikipedia/syslog [puppet] - https://gerrit.wikimedia.org/r/174673
[11:23:32] (CR) Matanya: [C: 1] ganglia: remove pmtpa varnish stanza [puppet] - https://gerrit.wikimedia.org/r/174205 (owner: Dzahn)
[11:28:23] (CR) Matanya: realm.pp - remove pmtpa (1 comment) [puppet] - https://gerrit.wikimedia.org/r/173476 (owner: Dzahn)
[11:29:48] (CR) Matanya: [C: 1] Change ru.wikinews.org to HTTPS only. [puppet] - https://gerrit.wikimedia.org/r/173078 (owner: JanZerebecki)
[11:42:26] _joe_: well not really, we already had a way to assign global variables like $::site etc.
[11:42:50] <_joe_> paravoid: yes of course
[11:43:16] <_joe_> but this is neater, don't you think? :)
[11:45:05] honestly? I think I'll start liking hiera a lot more once we start to do hierarchical stuff :)
[11:45:24] <_joe_> well, we already do a (little) bit
[11:45:32] _joe_: merged? Sorry, I was in meetings etc
[11:45:53] <_joe_> kart_: nope, I was planning to do it after lunch
[11:46:03] _joe_: sure. Thanks.
[11:46:10] like for this change specifically
[11:46:12] <_joe_> right now I'm trying to get what's wrong with labs/private :)
[11:46:29] why eqiad.yaml? :)
[11:46:42] why can't it be the 'default' scap proxies, that we can override in a DC if we want to
[11:46:53] <_joe_> right
[11:47:51] tbh, I don't really like how we do hiera so far :)
[11:48:05] but I need to write something more detailed than that, which is why I haven't spoken up yet
[11:48:06] <_joe_> paravoid: what part you don't like?
[11:48:13] <_joe_> yes please
[11:48:20] <_joe_> it's all up for discussion btw
[11:48:41] I don't like having to e.g. define "nagios_group = swift" on a hostname basis, where that's something that belongs to the role class
[11:48:44] (PS1) Filippo Giunchedi: txstatsd: gather runtime self metrics under statsd [puppet] - https://gerrit.wikimedia.org/r/174675
[11:49:09] or the whole swift configs that are in there
[11:49:31] and at the same time, we're not really benefitting from hierarchy
[11:49:37] <_joe_> paravoid: oh that (nagios_group) is there because I didn't want to restructure everything everywhere
[11:49:43] springle: meanwhile, we need to add tables in Beta. Who can do that?
[11:49:50] <_joe_> but I agree completely
[11:49:54] springle: ping me if you're around.
[11:50:01] I was hoping that we'd be finally able to say "these are our ntp servers" somewhere centrally, but also say "but for esams, just use the esams one instead"
[11:50:06] <_joe_> I don't see why swift configs are a problem
[11:50:22] <_joe_> paravoid: that's the idea, and you're able to do that
[11:50:29] there's no "default.yml" :)
[11:50:37] <_joe_> paravoid: "common"
[11:50:47] <_joe_> and yes got your point about that
[11:51:12] <_joe_> "common" is also already hierarchically structured in distinct files, so I may just move things there
[11:52:08] the problem with e.g. the swift config is that it's stuff that we used to have in its own place, under a swift role class that you wouldn't even look at if you weren't working on swift
[11:52:23] and now it's on the "eqiad" file, or the "codfw" file
[11:52:33] that's just flattening out all of our abstractions
[11:52:36] <_joe_> and in common, it will be under common/swift/whatever
[11:52:50] what about swift eqiad's config?
[11:53:13] <_joe_> oh ask godog :) I think he needs to backport swift_new there
[11:53:23] ok, what about swift codfw's config
[11:53:28] that's not going to be under common/, will it?
[11:53:51] <_joe_> that should be under codfw/swift/whatever.yaml
[11:54:00] <_joe_> I do agree that flat files are horrible
[11:54:06] there is no codfw/ yet?
[11:54:11] as far as I can see?
[11:54:16] <_joe_> no, I need to make a change for that
[11:54:19] right :)
[11:54:41] <_joe_> I'm gathering your inputs right now
[11:54:45] I didn't say I didn't like hiera in general, just that we need some finetuning there :)
[11:54:54] <_joe_> I completely agree
[11:55:05] I really like that we're doing it and I'm really grateful that someone's doing it
[11:55:16] so I don't mean to be all negative here, that's why I haven't spoken up so far :)
[11:55:18] <_joe_> and I also was sure that we would change things around when we start actually using those
[11:55:26] <_joe_> yeah don't worry
[11:55:35] <_joe_> I'm not reading this negatively, I'm old
[11:55:37] <_joe_> :P
[11:55:52] I see a lot of abuses over there already
[11:56:00] $ cat hosts/analytics1027.yaml
[11:56:00] monitoring::configuration::group: analytics_eqiad
[11:56:00] cluster: analytics
[11:56:07] (ewww)
[11:56:19] <_joe_> well, we had that abuse in site.pp
[11:56:29] sure :)
[11:56:39] it wasn't pretty at all, at least it was in *one* place :)
[11:56:42] now it's in two
[11:57:19] <_joe_> because we're in transition
[11:57:27] how would it work in the end?
[11:57:45] <_joe_> hopefully, each server will be assigned a "main role"
[11:58:00] <_joe_> and all those global variables would be configured in the role
[11:58:05] <_joe_> in its hiera file
[11:58:16] 0099
[11:58:24] so *roles* would have a hiera file, not hosts?
[11:58:35] how would you do that?
[11:58:38] <_joe_> unluckily, both
[11:59:31] <_joe_> I had a couple of ideas on how to simplify that, but I need to wrap my head around that, what I did now doesn't satisfy me
[11:59:48] <_joe_> but I need some time to find something I think is better than what we have
[12:00:18] <_joe_> I'll write a page on wikitech with options and write to ops@
[12:00:34] <_joe_> I just need a break of ~ 2 days to really work on that
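One way to read paravoid's wish list ("these are our ntp servers" centrally, "but for esams, just use the esams one") plus _joe_'s "main role" idea is a lookup hierarchy going from host to role to site to a common default. A hand-wavy sketch of what that could look like; paths, the role variable, and the server names are invented for illustration, not the repo's actual layout:

    # hiera.yaml -- hierarchy order is illustrative
    :hierarchy:
      - "hosts/%{::hostname}"     # last-resort per-host overrides
      - "roles/%{::role}"         # assumes a 'main role' fact/variable exists
      - "%{::site}"               # e.g. eqiad, esams, codfw
      - common                    # the central default

    # common.yaml -- invented server names
    ntp_servers: ['ntp1001.eqiad.wmnet', 'ntp1002.eqiad.wmnet']

    # esams.yaml -- site-level override wins over common
    ntp_servers: ['ntp3001.esams.wmnet']

With that shape, things like nagios_group or the swift settings would live in the role's file instead of per-host or per-datacenter flat files, which is the restructuring being debated above.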
Task is https://phabricator.wikimedia.org/T1075 [13:13:36] apparently should be about dropping /var/lib/carbon/whisper/jenkins/ [13:15:10] and restarting txstatsd [13:24:49] PROBLEM - very high load average likely xfs on ms-be1007 is CRITICAL: CRITICAL - load average: 239.40, 123.25, 60.26 [13:25:19] PROBLEM - CI: Low disk space on /var on labmon1001 is CRITICAL: CRITICAL: integration.integration-puppetmaster.diskspace._var.byte_avail.value (11.11%) [13:34:29] RECOVERY - CI: Low disk space on /var on labmon1001 is OK: OK: All targets OK [13:55:46] (03CR) 10Hashar: "This change breaks the puppet run on deployment-bastion.eqiad.wmflabs:" [puppet] - 10https://gerrit.wikimedia.org/r/173353 (owner: 10Ori.livneh) [13:59:48] hashar: nice, thanks! yeah yuvipanda I'll restart txstatsd too [14:00:58] !log restart txstatsd on tungsten to stop receiving jenkins metrics [14:01:01] Logged the message, Master [14:01:27] godog: for Zuul we would need to tweak Zuul reporting capabilities [14:01:40] Zuul statsd is an all-or-nothing switch [14:02:06] hasharCall: yep, do we have a local copy of the code already? [14:02:54] godog: yes that is upstream + some huge hack to add python modules + a couple patches pending merge/approval per upstream [14:14:53] (03PS3) 10Giuseppe Lavagetto: deployment: make scap proxies configured in one place [puppet] - 10https://gerrit.wikimedia.org/r/174664 [14:22:40] (03PS6) 10Alexandros Kosiaris: WIP: Modularize torrus [puppet] - 10https://gerrit.wikimedia.org/r/174389 [14:33:31] hashar: sth like https://gerrit.wikimedia.org/r/#/c/174691/ ? [14:34:00] in call for now [14:40:50] heya apergos [14:40:56] this change is good to go: [14:40:56] https://gerrit.wikimedia.org/r/#/c/168104/ [14:40:59] but, it depends on this: [14:41:01] https://gerrit.wikimedia.org/r/#/c/144640/ [14:49:22] (03PS1) 10Giuseppe Lavagetto: hiera: a few tweaks [puppet] - 10https://gerrit.wikimedia.org/r/174694 [14:49:48] (03PS2) 10Giuseppe Lavagetto: Make qualitywiki HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/173493 (owner: 10Reedy) [14:50:16] (03CR) 10Giuseppe Lavagetto: [C: 032] Make qualitywiki HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/173493 (owner: 10Reedy) [14:53:03] godog: looking :) [14:54:20] (03PS4) 10Giuseppe Lavagetto: Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [14:54:24] (03PS1) 10Cmjohnson: adding new mw servers to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/174695 [14:54:37] (03CR) 10Giuseppe Lavagetto: [C: 032] Add support for woff2 files [puppet] - 10https://gerrit.wikimedia.org/r/173763 (owner: 10KartikMistry) [14:54:46] <_joe_> cmjohnson: \o/ [14:55:05] godog: yeah something like that [14:55:36] godog: though, as I propose patches to upstream, I will probably make the keys configurable via zuul.conf instead [14:57:05] hashar: sure that's even better [14:57:53] godog: I will create a subtask in our phabricator [14:57:58] then file a bug/feature request upstream [14:58:01] and work on a patch [14:58:22] at least dropping jenkins.ci hierarchy should have bought us some time / disk space [14:58:50] hashar: yep, sounds good to me [15:01:21] !log Restarting Jenkins AND Zuul. Beta cluster jobs are still deadlocked. [15:01:25] Logged the message, Master [15:11:24] <_joe_> kart_: your change is merged [15:11:39] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:11:49] PROBLEM - check configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:11:58] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:11:59] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:11:59] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:09] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:12:39] PROBLEM - check if salt-minion is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:13:08] RECOVERY - Disk space on rhenium is OK: DISK OK [15:13:29] RECOVERY - check if salt-minion is running on rhenium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:13:38] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [15:13:39] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output [15:13:48] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [15:13:49] RECOVERY - DPKG on rhenium is OK: All packages OK [15:13:49] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:15:28] _joe_: thanks. [15:21:41] (03CR) 10Cmjohnson: [C: 032] adding new mw servers to dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/174695 (owner: 10Cmjohnson) [15:24:29] <_joe_> cmjohnson: whenever you need help/the servers are ready to image, just let me know [15:24:41] <_joe_> should I prepare the node defs as well in the meantime? [15:26:08] _joe_ they are ready to image except 2 ....apparently i typo'd the idrac cfg yesterday mw1229 and mw1239 [15:26:16] will get them later today [15:26:46] <_joe_> cmjohnson: ok so out of curiosity, why starting with mw1227 and not mw1221? [15:27:11] because there are 6 already in the rack that need to be relocated [15:27:18] <_joe_> oh ok [15:27:19] <_joe_> :) [15:27:25] <_joe_> thanks a lot! [15:27:32] so those 6 will be added once you are comfortable [15:27:54] <_joe_> which servers need moving? [15:28:00] oh the last 18 arrived ...will have those 12 for you tomorrow [15:28:37] <_joe_> so it's 38 total, I'd use 15 for api at least [15:28:50] mw1201-3 and mw1208-1210 [15:29:02] need to be relocated ^ [15:29:21] yep 38 total [15:29:21] <_joe_> cmjohnson: just to confirm, in the end we'll have mw1221-mw1258 right? [15:29:26] yes [15:29:35] <_joe_> ok, good! thanks! [15:29:45] * _joe_ on it [15:29:56] do you wanna img? [15:30:09] _joe_ ^ [15:30:13] <_joe_> well, first let me do the node defs [15:30:22] <_joe_> then we can just split the work I guess [15:30:46] ok, lmk [15:30:50] <_joe_> or, if the other servers arrived [15:31:04] <_joe_> you can focus on those and let me deal with imaging for now [15:31:15] <_joe_> I think that's even better [15:31:47] that would be great...cuz there is a lot of the bs that needs to be done...and it's time consuming (racktables mostly) [15:32:10] <_joe_> I also have a varnish change coming up, but I'll continue tomorrow morning [15:33:01] <_joe_> Coren, ottomata you could help as well if you're up to it [15:33:55] _joe_: I'm working on a scary problem with one of the virt boxen, but as soon as I squash it I'll see about doing a couple for you. [15:34:27] <_joe_> Coren: ok thanks, I'll put up a list somewhere [15:34:40] <_joe_> Coren: these are new servers, so it's going to be faster [15:34:54] Ooo. Fresh metal. 
:-) [15:34:58] <_joe_> I hope faster than gerrit is now [15:35:06] <_joe_> Coren: with kickass processors as well [15:37:17] <_joe_> is it me, or is gerrit very slow? [15:37:58] (03CR) 10Yuvipanda: [C: 04-2] "The more I think about it, the more this seems like a bad idea." [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [15:39:29] (03CR) 10Mark Bergsma: [C: 031] varnish: remove cache separation for HHVM [puppet] - 10https://gerrit.wikimedia.org/r/174390 (owner: 10Giuseppe Lavagetto) [15:43:46] <_joe_> mark: thanks [15:43:50] (03CR) 10Giuseppe Lavagetto: "@Yuvi: we do keep backups of list archives, so I don't see why we should restrict this. List admins are, moreover, the sole people respons" [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [15:50:12] <^d> _joe_: No problems for me. [15:50:44] * anomie assumes manybubbles will SWAT today [15:50:47] <_joe_> ^d: it was running git pull --rebase that was painfully slow [15:51:26] <^d> real 0m7.130s [15:51:26] <^d> user 0m1.681s [15:51:26] <^d> sys 0m1.530s [15:51:35] <^d> That was for mediawiki/core, hadn't pulled since last night [15:52:18] <_joe_> !log disabling puppet on all caches, before a pretty large change, will be re-enabled after a few tests [15:52:22] Logged the message, Master [15:52:30] (03PS3) 10Giuseppe Lavagetto: varnish: remove cache separation for HHVM [puppet] - 10https://gerrit.wikimedia.org/r/174390 [15:52:56] (03PS7) 10Alexandros Kosiaris: Modularize torrus [puppet] - 10https://gerrit.wikimedia.org/r/174389 [15:53:04] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: remove cache separation for HHVM [puppet] - 10https://gerrit.wikimedia.org/r/174390 (owner: 10Giuseppe Lavagetto) [15:53:34] <_joe_> come ooon jenkins [15:58:01] (03CR) 10Nemo bis: "But we have hundreds of lists, so the questions arise:" [puppet] - 10https://gerrit.wikimedia.org/r/170398 (owner: 10John F. Lewis) [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T1600). Please do the needful. [16:00:53] anomie: I'm actually going to do a meeting now :( I can validate but shouldn't SWAT while I'm in there [16:01:00] ^d: can you SWAT for me? [16:01:07] I'm sorry! [16:02:21] <^d> manybubbles: Yeah no worries [16:02:25] thanks! [16:05:07] ^d: I don't really have a test case for anything but enwiki. so long as searches don't blow up on wmf9 I think it is good [16:05:15] <^d> Yeah [16:05:58] <^d> Surprise surprise, qunit has failed on me again [16:08:18] (03CR) 10BryanDavis: "Did that modules/varnishkafka change sneak in by accident?"
[puppet] - 10https://gerrit.wikimedia.org/r/174664 (owner: 10Giuseppe Lavagetto) [16:13:30] <_joe_> bd808: thanks man [16:13:38] <_joe_> it did, damn submodules [16:15:00] !log demon Synchronized php-1.25wmf8/extensions/CirrusSearch: (no message) (duration: 00m 05s) [16:15:03] Logged the message, Master [16:15:36] !log demon Synchronized php-1.25wmf9/extensions/CirrusSearch: (no message) (duration: 00m 04s) [16:15:40] Logged the message, Master [16:26:35] <_joe_> !log puppet reenabled everywhere, change tested and live on all varnishes within the next 20 minutes [16:26:39] Logged the message, Master [16:29:52] _joe_: \o/ [16:30:08] <_joe_> ori: and we have more good news, new appservers arrived [16:30:16] \o/ \o/ [16:30:23] <_joe_> so we'll add quite a few hhvm appservers to api [16:31:28] ori, yyoo [16:32:02] (03PS8) 10Ottomata: add varnish::kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/174195 (owner: 10Ori.livneh) [16:32:04] ready for statsv? [16:32:40] ottomata: yep, but maybe we should wait, seeing as the cache separation change is rolling out to the varnishes atm [16:32:58] <_joe_> ori: btw, http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=HHVM%2520Appservers%2520eqiad&tab=m&vn=&hide-hf=false [16:33:04] oh. ok [16:33:09] <_joe_> the hhvm cluster is going to be _really_ bored soon [16:33:18] (03PS8) 10Alexandros Kosiaris: Modularize torrus [puppet] - 10https://gerrit.wikimedia.org/r/174389 [16:33:53] _joe_: do you think ottomata and i could push out a change to logging on bits? [16:34:16] _joe_: interesting, why did load decrease? [16:34:48] ori, lemme know, my car needs a jump and i need to call a cab to jump it. i will do it now if we are going to delay this [16:34:51] <_joe_> ori: the zend pool has more visitors => wider cache [16:34:57] <_joe_> that's my hypothesis [16:35:20] <_joe_> also, whatever is already cached by zend, will not be requested to hhvm again [16:36:04] makes sense. ottomata, i think we can do it now; it only affects bits, and not varnish itself [16:36:18] sorry to be slightly schizophrenic [16:36:47] ok [16:36:49] let's dooo it [16:36:55] ja? [16:37:03] ori, say merge! :) [16:37:08] merge! [16:37:35] (03CR) 10Ottomata: [C: 032] add varnish::kafka::statsv [puppet] - 10https://gerrit.wikimedia.org/r/174195 (owner: 10Ori.livneh) [16:37:52] actually, wait! ha, we need to make the kafka topic :p [16:37:53] on it [16:37:58] RECOVERY - Varnish HTCP daemon on cp1008 is OK: PROCS OK: 1 process with UID = 112 (vhtcpd), args vhtcpd [16:38:05] !log demon Synchronized php-1.25wmf9/extensions/Math: (no message) (duration: 00m 06s) [16:38:05] i'm going to just give it the same settings as the webrequest ones, [16:38:06] <^d> James_F: ^^^ [16:38:08] Logged the message, Master [16:38:11] rep =3, partitions = 12 [16:38:13] ^d: Testing. [16:38:16] hey who killed my downtime? :p [16:38:38] oh nobody did [16:38:47] I guess it reported it down so it's reporting it up [16:39:11] ^d: Confirmed fixed. Thanks! [16:39:15] <^d> yw [16:39:26] <^d> Ok, swat done :D [16:39:41] ok, done, running puppet on cp1056 [16:42:38] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: puppet fail [16:43:06] yeah yeah [16:43:34] ?
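(For reference: creating the statsv topic with the settings ottomata mentions above, replication factor 3 and 12 partitions, would look roughly like this with Kafka's stock CLI of that era; the topic name comes from the conversation, while the ZooKeeper address is a placeholder.)

    kafka-topics.sh --create \
        --zookeeper zookeeper.example.org:2181 \
        --topic statsv \
        --replication-factor 3 \
        --partitions 12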
[16:43:40] (03PS1) 10Ottomata: Fix for varnish::kafka::statsv include [puppet] - 10https://gerrit.wikimedia.org/r/174713 [16:43:54] (03CR) 10Ottomata: [C: 032 V: 032] Fix for varnish::kafka::statsv include [puppet] - 10https://gerrit.wikimedia.org/r/174713 (owner: 10Ottomata) [16:43:58] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [16:43:59] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: puppet fail [16:44:52] <_joe_> wat? [16:44:54] lol [16:44:58] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: puppet fail [16:45:10] it's varnishkafka patch o'clock! [16:45:21] there it goes [16:45:55] <_joe_> nah [16:46:28] hmm, worked on cp1056, something else wrong on cp3109.... [16:46:31] 3019 [16:48:24] <_joe_> it's bogus there [16:48:28] <_joe_> nevermind it [16:48:49] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [16:48:49] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:48:58] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures [16:49:05] <_joe_> is someone running puppet by hand there? [16:49:06] ori, the ganglia stuff is being weird [16:49:18] _joe_, i'm testing this on 1056 and 3019 [16:49:22] on those, yes [16:50:11] ottomata: it'll fix itself now [16:50:19] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:50:38] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:51:09] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [16:51:13] hm, it isn't, the mv command is failing for no good reason [16:51:18] hm, or did it [16:51:24] hmm, i guess it did [16:52:15] (03CR) 10Alexandros Kosiaris: [C: 032] "Puppet compiler says almost noop (cosmetic changes)" [puppet] - 10https://gerrit.wikimedia.org/r/174389 (owner: 10Alexandros Kosiaris) [16:52:49] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [16:52:49] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:54:43] ok, ori, i think this is running [16:54:47] how can we test?
i'm trying to [16:54:59] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [16:55:08] curl -H 'Host: bits.wikimedia.org' http://localhost:80/statsv?hi=there [16:55:10] on cp1056 [16:55:17] and am consuming this topic from kafka [16:55:19] ottomata: you'll get a 404, not a 204, but that's fine [16:55:59] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [16:56:04] it's going to fail again [16:56:48] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: puppet fail [16:57:59] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:57:59] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures [16:58:58] (03PS1) 10Ori.livneh: handle missing .pyconf.new gracefully [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174717 [16:59:08] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail [16:59:16] (03PS2) 10Ori.livneh: handle missing .pyconf.new gracefully [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174717 [16:59:18] PROBLEM - puppet last run on analytics1021 is CRITICAL: CRITICAL: puppet fail [16:59:27] (03CR) 10Ori.livneh: [C: 032 V: 032] handle missing .pyconf.new gracefully [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174717 (owner: 10Ori.livneh) [16:59:53] <_joe_> !log restart apache on mw1218, stuck in an apc futex [16:59:57] Logged the message, Master [17:00:19] (03PS1) 10Ottomata: Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174720 [17:00:25] (03PS1) 10Ori.livneh: Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174721 [17:00:30] ori ^ [17:00:32] let's put mine in too [17:00:33] before you merge that [17:00:34] (03CR) 10Ori.livneh: [C: 032 V: 032] Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174721 (owner: 10Ori.livneh) [17:00:39] PROBLEM - puppet last run on analytics1019 is CRITICAL: CRITICAL: puppet fail [17:00:48] RECOVERY - Apache HTTP on mw1218 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [17:00:53] ottomata: d'oh [17:01:02] heh [17:01:03] !log reboot ms-be1007, xfs-induced high load [17:01:05] Logged the message, Master [17:01:06] s'ok i gotcha then [17:01:29] PROBLEM - puppet last run on db2033 is CRITICAL: CRITICAL: puppet fail [17:01:38] PROBLEM - puppet last run on db1024 is CRITICAL: CRITICAL: puppet fail [17:01:43] were you able to consume it from the topic?
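(Consuming the topic to answer that question would be something along these lines with the stock console consumer shipped with Kafka of that era; as before, the ZooKeeper address is a placeholder.)

    kafka-console-consumer.sh --zookeeper zookeeper.example.org:2181 --topic statsv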
[17:01:56] db2033 / db1024 unrelated to what otto and I are doing btw ^ [17:02:00] (03PS1) 10Ottomata: Update varnishkafka module with logrotate fix [puppet] - 10https://gerrit.wikimedia.org/r/174722 [17:02:20] PROBLEM - puppet last run on elastic1026 is CRITICAL: CRITICAL: puppet fail [17:02:26] (03CR) 10Ottomata: [C: 032 V: 032] Update varnishkafka module with logrotate fix [puppet] - 10https://gerrit.wikimedia.org/r/174722 (owner: 10Ottomata) [17:02:38] ori, i am consuming, haven't seen anything come through yet [17:03:04] (03CR) 10Ottomata: [C: 032 V: 032] Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174720 (owner: 10Ottomata) [17:03:07] ottomata: because there's a buffer [17:03:08] PROBLEM - puppet last run on mw1178 is CRITICAL: CRITICAL: puppet fail [17:03:09] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:03:15] of 1000+ messages or something [17:03:18] PROBLEM - swift-account-replicator on ms-be1007 is CRITICAL: Connection refused by host [17:03:28] PROBLEM - check if salt-minion is running on ms-be1007 is CRITICAL: Connection refused by host [17:03:29] PROBLEM - swift-account-auditor on ms-be1007 is CRITICAL: Connection refused by host [17:03:39] PROBLEM - puppet last run on mw1038 is CRITICAL: CRITICAL: puppet fail [17:03:39] ehh, it's a count and time buffer [17:03:41] PROBLEM - swift-object-replicator on ms-be1007 is CRITICAL: Connection refused by host [17:03:45] it'll send every timeout too [17:03:59] sigh didn't silence ms-be1007 in time, apologies [17:04:27] around a second i think [17:04:59] PROBLEM - puppet last run on es2008 is CRITICAL: CRITICAL: puppet fail [17:05:10] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet last ran 16 hours ago [17:05:19] PROBLEM - puppet last run on platinum is CRITICAL: CRITICAL: puppet fail [17:06:13] <_joe_> on virt1006 puppet is disabled [17:07:10] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:07:15] hmm ori [17:07:17] my curl [17:07:20] is being redirected? [17:07:22] //bits.wikimedia.org/wiki/statsv/ [17:07:26] Refresh: 5; url=http://bits.wikimedia.org/wiki/statsv/?hi=there&ts=1416503231 [17:07:49] heh, i curled hi=there [17:08:01] it must be working [17:08:04] ? [17:08:19] <_joe_> akosiaris: on es2008 Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function t at /etc/puppet/modules/torrus/manifests/xml_generation/cdn.pp:1 on node es2008.codfw.wmnet [17:08:23] while true; do curl --head -H 'Host: bits.wikimedia.org' "http://localhost:80/statsv/?hi=cpupt^Cere&ts=$(date +%s)"; sleep 1; done [17:08:23] <_joe_> Warning: Not using cache on failed catalog [17:08:37] oops [17:08:37] while true; do curl --head -H 'Host: bits.wikimedia.org' "http://localhost:80/statsv/?hi=there&ts=$(date +%s)"; sleep 1; done [17:08:48] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [17:08:50] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [17:09:09] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [17:09:18] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: Puppet has 1 failures [17:09:39] puppetmaster is going nuts [17:09:42] puppet is not failing on those hosts [17:10:14] ori, is varnish redirecting my request?
[17:10:16] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures [17:10:17] to /wiki/statsv? [17:10:36] mediawiki did that because we're not intercepting the pattern yet [17:10:41] but that doesn't matter [17:10:49] varnish should still log the original request, aye [17:10:50] hm [17:10:56] ottomata: regardless of whether or not the reqs are going through, i think we should leave it be for now, too many things going on atm infrastructure-wise [17:10:57] it's not coming through kafka [17:11:07] ok [17:11:14] sounds fine with me, i am going to jump my car then :) [17:11:18] PROBLEM - puppet last run on amssq51 is CRITICAL: CRITICAL: Puppet has 1 failures [17:11:18] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Puppet has 1 failures [17:11:28] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures [17:11:37] ottomata: nod, doesn't make sense to debug this through a storm of puppet alerts [17:11:57] ottomata: thanks very much, good luck with the car! [17:12:01] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [17:12:09] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [17:12:30] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:08] RECOVERY - very high load average likely xfs on ms-be1007 is OK: OK - load average: 25.35, 5.72, 1.88 [17:13:12] ottomata: your logrotate patch broke [17:13:29] hm [17:13:29] PROBLEM - puppet last run on amssq41 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:29] RECOVERY - swift-account-replicator on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:13:38] RECOVERY - check if salt-minion is running on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:13:48] RECOVERY - swift-account-auditor on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:13:50] PROBLEM - puppet last run on cp1038 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:51] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [17:13:55] Error: /Stage[main]/Varnishkafka/File[/etc/logrotate.d/varnishkafka]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///modules/varnishkafka/varnishkafka_logrotate [17:13:58] RECOVERY - swift-object-replicator on ms-be1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:14:29] PROBLEM - puppet last run on amssq38 is CRITICAL: CRITICAL: Puppet has 1 failures [17:14:44] ha, i don't see that at all on cp1052, but it isn't replacing the file either [17:15:59] ottomata: revert? [17:16:08] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [17:16:17] eh i want to fix, but i can't see what's wrong with it! [17:17:34] PROBLEM - puppet last run on amssq33 is CRITICAL: CRITICAL: Puppet has 1 failures [17:17:38] RECOVERY - puppet last run on analytics1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:17:41] ok reverting.
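(The "Could not retrieve information from environment production source(s)" failure quoted above generally means a File resource's source URL does not resolve to a file the puppet master can serve. A sketch of the shape of the problem; the manifest path is illustrative, not the actual module code:)

    # modules/varnishkafka/manifests/init.pp (sketch)
    file { '/etc/logrotate.d/varnishkafka':
        ensure => present,
        # puppet:///modules/varnishkafka/varnishkafka_logrotate resolves to
        # modules/varnishkafka/files/varnishkafka_logrotate on the master;
        # if that file is absent there (say, a submodule bump that did not
        # carry the new file along), agents fail with the error above,
        # which is why the revert that follows clears it up.
        source => 'puppet:///modules/varnishkafka/varnishkafka_logrotate',
    }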
[17:17:59] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:09] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:18] (03PS1) 10Ottomata: Something was wrong with that logrotate varnishkafka module change, reverting to previous revision [puppet] - 10https://gerrit.wikimedia.org/r/174726 [17:18:33] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: Puppet has 1 failures [17:18:44] (03CR) 10Ottomata: [C: 032 V: 032] Something was wrong with that logrotate varnishkafka module change, reverting to previous revision [puppet] - 10https://gerrit.wikimedia.org/r/174726 (owner: 10Ottomata) [17:18:59] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:08] RECOVERY - puppet last run on analytics1019 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:19:18] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:19:19] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:29] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:29] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:30] PROBLEM - puppet last run on amssq50 is CRITICAL: CRITICAL: Puppet has 1 failures [17:19:59] RECOVERY - puppet last run on db2033 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:19:59] RECOVERY - puppet last run on db1024 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:20:21] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:28] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:29] PROBLEM - puppet last run on cp4015 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:29] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [17:20:49] RECOVERY - puppet last run on elastic1026 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [17:20:49] PROBLEM - puppet last run on amssq52 is CRITICAL: CRITICAL: Puppet has 1 failures [17:21:26] _joe_: must have been transient... cause I can not reproduce. But it is worrying... 
like the manifests were at some weird state when es2008's manifest was being compiled [17:21:29] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [17:21:39] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [17:21:59] RECOVERY - puppet last run on mw1038 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [17:22:18] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:19] RECOVERY - puppet last run on es2008 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:22:30] PROBLEM - puppet last run on amssq57 is CRITICAL: CRITICAL: Puppet has 1 failures [17:22:39] PROBLEM - puppet last run on amssq58 is CRITICAL: CRITICAL: Puppet has 1 failures [17:23:18] ori, it looks like it is refreshing gmond on every puppet run again [17:23:29] RECOVERY - puppet last run on mw1178 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [17:24:08] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:24:09] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:24:09] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [17:24:29] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:24:29] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [17:24:31] RECOVERY - puppet last run on cp4015 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:24:39] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:24:46] (03CR) 10Hashar: [C: 031] "The beta cluster properly override the base path ( /data/project/syslog ) see inline comment for the exact place of the definition." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/174673 (owner: 10Filippo Giunchedi) [17:24:49] RECOVERY - puppet last run on platinum is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [17:25:08] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [17:25:19] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [17:25:38] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [17:25:38] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:25:39] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [17:25:58] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [17:26:19] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [17:26:39] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:26:48] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:27:09] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:27:38] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq50 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq58 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq51 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:27:49] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:27:58] RECOVERY - puppet last run on cp1038 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:27:59] RECOVERY - puppet last run on amssq52 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:29:07] (03PS1) 10Giuseppe Lavagetto: mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 [17:29:49] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:29:49] RECOVERY - puppet last run on amssq33 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:29:49] RECOVERY - puppet last run on amssq38 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [17:29:49] RECOVERY - puppet last run on amssq41 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [17:30:28] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [17:31:14] (03PS2) 10Giuseppe Lavagetto: mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 [17:31:28] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 (owner: 10Giuseppe Lavagetto) [17:31:49] (03CR) 
10Giuseppe Lavagetto: [V: 032] mediawiki: add node-level definitions for new appservers [puppet] - 10https://gerrit.wikimedia.org/r/174730 (owner: 10Giuseppe Lavagetto) [17:33:38] (03PS1) 10Alexandros Kosiaris: Allocate codfw Labs networks [dns] - 10https://gerrit.wikimedia.org/r/174732 [17:33:58] (03CR) 10Cscott: [C: 031] Give parsoid-roots access to ruthenium; split cassandra test hosts [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott) [17:34:27] _joe_: wow, that's awesome -- 14? [17:35:07] <_joe_> ori: 15 to api, 23 to the appserver pool [17:35:15] woooo [17:35:19] racked and everything? [17:35:30] <_joe_> chris is racking them now [17:35:30] (03PS5) 10GWicke: Give parsoid-admins access to ruthenium; split cassandra test hosts [puppet] - 10https://gerrit.wikimedia.org/r/172780 (owner: 10Cscott) [17:35:40] <_joe_> I'm going to reimage a few now [17:35:58] <_joe_> and tomorrow morning I count on putting 8 in production [17:36:00] <_joe_> in the api cluster [17:36:22] <_joe_> and one in the main appserver pool [17:36:34] <_joe_> so we have a full weekend under "normal" load [17:36:48] <_joe_> before we go for dissolving the pools on tuesday [17:36:54] (03PS1) 10Ori.livneh: Don't notify Service[gmond] on new Pyconf [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174733 [17:37:20] _joe_, mark: yay for more api servers! [17:37:31] (03CR) 10Ori.livneh: [C: 032 V: 032] Don't notify Service[gmond] on new Pyconf [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174733 (owner: 10Ori.livneh) [17:37:33] <_joe_> gwicke: I'm giving them priority [17:39:10] 15 additional ones should help quite a bit [17:41:17] <_joe_> gwicke: they're also quite powerful [17:41:46] (03PS1) 10Ori.livneh: Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174734 [17:41:58] (03CR) 10Ori.livneh: [C: 032 V: 032] "per otto" [puppet] - 10https://gerrit.wikimedia.org/r/174734 (owner: 10Ori.livneh) [17:43:59] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [17:44:04] (03PS1) 10Ori.livneh: Revert "Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174735 [17:44:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [17:44:15] (03CR) 10Ori.livneh: [C: 032 V: 032] Revert "Add logrotate file that will properly rotate all varnishkafka instance *.stats.json files" [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/174735 (owner: 10Ori.livneh) [17:44:29] PROBLEM - puppet last run on db1029 is CRITICAL: CRITICAL: puppet fail [17:44:46] <_joe_> .win 32 [17:44:48] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [17:44:58] PROBLEM - puppet last run on curium is CRITICAL: CRITICAL: puppet fail [17:45:09] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [17:45:54] (03PS1) 10Ori.livneh: Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174737 [17:46:05] (03CR) 10Ori.livneh: [C: 032 V: 032] Update Varnishkafka submodule [puppet] - 10https://gerrit.wikimedia.org/r/174737 (owner: 10Ori.livneh) [17:47:47] varnishkafka stuff done [17:52:00] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [17:53:54] <_joe_> can someone take a look at all those puppet failures? 
[17:54:27] sure [17:57:35] <_joe_> thanks jgage :) [17:58:08] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:00:04] ^d, legoktm: Respected human, time to deploy Extension Distributor (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T1800). Please do the needful. [18:00:22] hm well hooft was ok but it took forever, checking out another puppet-fail host.. [18:02:08] RECOVERY - puppet last run on curium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [18:02:49] RECOVERY - puppet last run on db1029 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [18:03:09] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [18:03:19] PROBLEM - DPKG on mw1227 is CRITICAL: Connection refused by host [18:03:19] PROBLEM - mediawiki-installation DSH group on mw1227 is CRITICAL: Host mw1227 is not in mediawiki-installation dsh group [18:03:19] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [18:03:20] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:03:20] whatever was wrong with puppet is no longer wrong, runs by hand are not producing errors [18:03:29] PROBLEM - Disk space on mw1227 is CRITICAL: Connection refused by host [18:03:29] PROBLEM - nutcracker port on mw1227 is CRITICAL: Connection refused by host [18:03:48] PROBLEM - nutcracker process on mw1227 is CRITICAL: Connection refused by host [18:03:50] <_joe_> jgage: then it's the puppet server [18:03:51] hmm, mw1227 is being re-imaged, I think [18:03:58] <_joe_> it is being imaged [18:03:58] PROBLEM - puppet last run on mw1227 is CRITICAL: Connection refused by host [18:03:58] PROBLEM - HHVM processes on mw1227 is CRITICAL: Connection refused by host [18:03:59] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:04:02] ah, imaged. [18:04:12] ^demon|brb: ready?
[18:04:18] PROBLEM - HHVM rendering on mw1227 is CRITICAL: Connection refused [18:04:28] <_joe_> sorry guys [18:04:39] PROBLEM - puppet last run on db2019 is CRITICAL: CRITICAL: puppet fail [18:04:42] hmm, I was going to check if it was the report cleanup script, but the time is wrong [18:04:43] PROBLEM - RAID on mw1227 is CRITICAL: Connection refused by host [18:04:59] PROBLEM - puppet last run on db1031 is CRITICAL: CRITICAL: puppet fail [18:04:59] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: puppet fail [18:05:08] PROBLEM - check configured eth on mw1227 is CRITICAL: Connection refused by host [18:05:09] PROBLEM - puppet last run on search1016 is CRITICAL: CRITICAL: puppet fail [18:05:09] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [18:05:09] PROBLEM - check if dhclient is running on mw1227 is CRITICAL: Connection refused by host [18:05:19] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: puppet fail [18:05:39] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [18:07:09] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: puppet fail [18:11:19] PROBLEM - puppet last run on db1020 is CRITICAL: CRITICAL: puppet fail [18:11:29] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: puppet fail [18:12:09] PROBLEM - puppet last run on db1062 is CRITICAL: CRITICAL: puppet fail [18:15:03] <^demon|brb> legoktm: Yeah, I got sucked into a Solr/ES discussion. Lez go [18:15:07] <^demon|brb> (sorry) [18:15:11] ok :P [18:16:04] I copied all the extensions into http://extdist.wmflabs.org/dist/extensions/ so there shouldn't be any downtime, just possibly outdated tarballs. [18:16:25] <^d> I'll merge to master & start porting to wmf9 [18:16:51] (03PS1) 10Andrew Bogott: Switch back to using the wikistatus package. [puppet] - 10https://gerrit.wikimedia.org/r/174746 [18:16:59] legoktm: is it time yet? :) [18:17:08] yuvipanda: once we deploy the mw change :) [18:17:22] legoktm: I still don't understand why that would make that much of a difference [18:17:32] this just sets up additional cron jobs and stuff, doesn't seem to modify the original bits in any form [18:17:42] yuvipanda: it moves directories around [18:17:47] (03CR) 10Andrew Bogott: [C: 032] Switch back to using the wikistatus package. [puppet] - 10https://gerrit.wikimedia.org/r/174746 (owner: 10Andrew Bogott) [18:17:47] wait, does it? 
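(The zero-downtime shuffle legoktm describes above, dist becoming dist/extensions, amounts to pre-populating the new layout before puppet switches the cron jobs over. Roughly, with an assumed document root:)

    # /srv/dist is an assumption for the tarball document root
    mkdir -p /srv/dist/extensions
    cp -a /srv/dist/*.tar.gz /srv/dist/extensions/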
[18:17:50] yes [18:17:54] dist --> dist/extensions [18:17:56] oh [18:17:57] yea [18:17:57] h [18:17:58] it does [18:18:00] true, true [18:19:31] RECOVERY - DPKG on mw1227 is OK: All packages OK [18:19:48] RECOVERY - nutcracker port on mw1227 is OK: TCP OK - 0.000 second response time on port 11212 [18:19:49] RECOVERY - Disk space on mw1227 is OK: DISK OK [18:19:59] RECOVERY - RAID on mw1227 is OK: OK: no RAID installed [18:20:11] RECOVERY - nutcracker process on mw1227 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [18:20:12] RECOVERY - HHVM processes on mw1227 is OK: PROCS OK: 1 process with command name hhvm [18:20:19] RECOVERY - check configured eth on mw1227 is OK: NRPE: Unable to read output [18:20:29] RECOVERY - check if dhclient is running on mw1227 is OK: PROCS OK: 0 processes with command name dhclient [18:21:31] <^d> yuvipanda, legoktm: Ok, I'm all staged on tin and ready to hit enter on sync-dir [18:21:52] ^d: lets push the mw change first, make sure it's working, and then do the puppet one [18:21:58] <^d> kk [18:22:08] !log demon Synchronized php-1.25wmf9/extensions/ExtensionDistributor/: (no message) (duration: 00m 07s) [18:22:10] Logged the message, Master [18:22:14] (03PS1) 10Ori.livneh: mediawiki::packages: require ::apt [puppet] - 10https://gerrit.wikimedia.org/r/174748 [18:22:34] https://www.mediawiki.org/w/api.php?action=query&list=extdistrepos has skins :D [18:22:39] yuvipanda: ok, ready for puppet now [18:22:39] <^d> Blahhh, I have to rebuild i18n [18:22:48] hehok [18:22:49] (03PS4) 10Yuvipanda: extdist: Support distributing skins [puppet] - 10https://gerrit.wikimedia.org/r/174471 (owner: 10Legoktm) [18:23:45] ^d: oh yeah, this'll need a scap :/ [18:23:45] legoktm: waiting for jenkins [18:23:58] <^d> legoktm: I already sync-dir'd. Why not just l10nupdate? :) [18:24:04] <^d> Faster than a full scap. 
[18:24:07] ok :P [18:24:08] RECOVERY - puppet last run on db2019 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [18:24:09] lol [18:24:19] RECOVERY - puppet last run on db1031 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:24:19] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:24:23] (03CR) 10Yuvipanda: [C: 032] extdist: Support distributing skins [puppet] - 10https://gerrit.wikimedia.org/r/174471 (owner: 10Legoktm) [18:24:29] RECOVERY - puppet last run on search1016 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [18:24:30] legoktm: I should set up monitoring for this too [18:24:39] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: puppet fail [18:24:43] legoktm: merged [18:24:48] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: puppet fail [18:24:49] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [18:24:51] ok, running puppet manually [18:25:10] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:25:10] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: puppet fail [18:25:39] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [18:25:59] Notice: /Stage[main]/Extdist/File[/etc/skindist.conf]/ensure: created [18:25:59] Notice: /Stage[main]/Extdist/Cron[skindist-generate-tarballs]/ensure: created [18:26:08] legoktm: I'm setting up alerts for extdist now. which email would you like to use? :) [18:26:18] yuvipanda: legoktm@wikimedia.org plz [18:26:24] also ty :D [18:26:28] ops@ ? :P [18:26:33] hehe [18:26:39] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail [18:27:07] !log updated the python-openstack-wikistatus on carbon to 2014.11 [18:27:11] Logged the message, Master [18:27:29] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: puppet fail [18:27:29] (03PS1) 10Yuvipanda: shinken: Add monitoring for extdist project too [puppet] - 10https://gerrit.wikimedia.org/r/174751 [18:27:33] legoktm: ^ check your email. [18:28:44] will in a few minutes, found a bug in nightly.py [18:28:48] legoktm: heh ok [18:28:49] PROBLEM - puppet last run on virt1004 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:13] <^d> legoktm: I'm going to be naughty and walk away while l10nupdate is still running...meeting. [18:29:24] ok [18:29:27] <^d> I guess I could take my laptop lol. [18:29:28] PROBLEM - puppet last run on virt1003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:37] <^d> Probably will drop and I didn't use screen ugh [18:29:49] RECOVERY - puppet last run on db1020 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:30:38] RECOVERY - puppet last run on db1062 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [18:30:45] http://extdist.wmflabs.org/dist/skins/ woot [18:30:49] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:31:11] hmm [18:31:29] ^d: Pretty soon we will be able to use mosh and screen to do deploys because Ori has almost got rid of the need for the agent. Pretty damn cool! [18:31:30] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 1 failures [18:32:05] ok, all looks good. 
[18:36:32] ori: heh, was just going to investigate that labs/private thing. thanks! [18:36:33] (03PS1) 10Legoktm: Un-disenable Special:SkinDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174759 [18:37:15] ^d: once l10nupdate finishes, we can turn on Special:SkinDistributor ^ [18:39:00] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-20 18:39:00+00:00 [18:39:03] Logged the message, Master [18:39:18] legoktm: did you get an alert? [18:39:23] legoktm: well, an empty alert at least? :) [18:39:28] * legoktm opens up email [18:39:57] oh, hmm [18:40:01] shinken says it can't even reach them [18:40:14] Subject: ** PROBLEM alert - extdist2/ is ** [18:40:17] yeah [18:40:21] very nicely done, I know [18:40:24] it just means host is down [18:40:34] I can't ping extdist1, 2 or 3 [18:40:37] also why do we have 3? [18:40:47] 3 has a bigger /var/log [18:41:04] after this skin stuff settles, I'm going to make 3 the real one and get rid of 2 [18:41:11] but for now extdist.wmflabs.org points to 2 [18:42:02] legoktm: yeah [18:42:04] legoktm: ok [18:42:41] and I think 1 has the self puppetmaster thing set up for testing things [18:43:01] ah,hmm [18:43:02] that's good [18:43:06] just get rid of 2 later ;) [18:43:29] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:43:52] ** PROBLEM alert - extdist1/ is ** [18:44:08] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [18:44:09] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [18:44:09] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [18:44:09] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [18:44:18] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: puppet fail [18:44:58] PROBLEM - puppet last run on mc1006 is CRITICAL: CRITICAL: puppet fail [18:44:58] PROBLEM - puppet last run on mw1007 is CRITICAL: CRITICAL: puppet fail [18:44:59] PROBLEM - puppet last run on analytics1017 is CRITICAL: CRITICAL: puppet fail [18:44:59] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: puppet fail [18:45:58] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: puppet fail [18:45:58] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:46:09] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: puppet fail [18:46:09] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [18:46:18] PROBLEM - puppet last run on mc1002 is CRITICAL: CRITICAL: puppet fail [18:46:32] legoktm: hmm, unsure why I can't ping extdist tho [18:46:42] from where? [18:46:54] it might have some strict firewall things? [18:47:49] on https://wikitech.wikimedia.org/wiki/Special:NovaSecurityGroup it only has 22 and 80 listed [18:47:51] (03PS1) 10Yuvipanda: shinken: Add monitoring for analytics project too [puppet] - 10https://gerrit.wikimedia.org/r/174763 [18:47:53] milimetric: ^ [18:48:02] legoktm: hmm, ping should be open nonetheless, I think. [18:48:07] legoktm: since that's icmp [18:48:10] thx yuvipanda [18:48:21] milimetric: yw. can you verify your email address in that patch and +1? 
[18:48:53] (03CR) 10Milimetric: [C: 031] shinken: Add monitoring for analytics project too [puppet] - 10https://gerrit.wikimedia.org/r/174763 (owner: 10Yuvipanda) [18:49:00] RECOVERY - puppet last run on virt1003 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [18:49:05] legoktm: can you +1 your patch as well, check email is right? [18:49:18] RECOVERY - puppet last run on virt1004 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [18:49:58] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [18:50:18] (03CR) 10Legoktm: [C: 031] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/174751 (owner: 10Yuvipanda) [18:50:34] yuvipanda: o.O I can't ping it from bastion2.wmflabs.org [18:51:40] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-20 18:51:39+00:00 [18:51:43] Logged the message, Master [18:53:10] (03CR) 10Yuvipanda: [C: 032] shinken: Add monitoring for extdist project too [puppet] - 10https://gerrit.wikimedia.org/r/174751 (owner: 10Yuvipanda) [18:53:23] (03CR) 10Yuvipanda: [C: 032] shinken: Add monitoring for analytics project too [puppet] - 10https://gerrit.wikimedia.org/r/174763 (owner: 10Yuvipanda) [18:58:58] <^d> Gah, this failed miserably. [18:59:21] <^d> mw1218: 18:51:37 Updated 0 CDB files(s) in /srv/mediawiki/php-1.25wmf9/cache/l10n [18:59:27] <^d> (times every single mw*) [19:01:18] <^d> legoktm: scap it is, I suppose... [19:01:27] :| [19:02:08] <^d> bd808: l10nupdate busted for me really really bad running it alone. [19:02:12] <^d> Pastebinning... [19:02:51] ^d: I'll look but l10nupdate is spooky magic as far as I'm concerned [19:03:04] <^d> P100! [19:03:05] <^d> https://phabricator.wikimedia.org/P100 [19:03:09] RECOVERY - puppet last run on mw1007 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:03:19] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:03:29] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [19:03:29] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:03:39] PROBLEM - puppet last run on dbproxy1002 is CRITICAL: CRITICAL: puppet fail [19:03:49] <^d> bd808: I'm mostly concerned about the "unhandled error" around line 940. [19:04:09] RECOVERY - puppet last run on mc1006 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [19:04:19] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: puppet fail [19:04:29] RECOVERY - puppet last run on mc1002 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [19:04:39] PROBLEM - puppet last run on mw1012 is CRITICAL: CRITICAL: puppet fail [19:05:18] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [19:05:19] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:05:29] PROBLEM - puppet last run on mc1003 is CRITICAL: CRITICAL: puppet fail [19:05:53] ^d: I'm not sure why it kept going after that... but I think problem #1 was that you didn't sudo -u l10nupdate to run it [19:06:12] <^d> ugh. 
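(One plausible explanation for the ping failures above: the project's OpenStack security group only allows TCP 22 and 80, and ICMP needs its own rule. With the nova CLI of that era, adding one would look roughly like this; the security group name and CIDR are assumptions.)

    # icmp type/code -1 -1 means "all ICMP"
    nova secgroup-add-rule default icmp -1 -1 10.0.0.0/8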
[19:06:25] !log demon Started scap: (no message) [19:06:27] Logged the message, Master [19:06:29] sudo: /usr/local/bin/scap-rebuild-cdbs: command not found [19:06:29] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [19:06:29] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: puppet fail [19:06:32] <^d> Screw it, I'm just doing the whole thing [19:06:34] That looks not right too [19:07:09] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [19:07:18] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: puppet fail [19:07:19] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: puppet fail [19:07:19] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: puppet fail [19:07:29] <^d> bd808: It's there, permissions probably why I couldn't find it? [19:07:51] <^d> No. [19:08:09] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail [19:08:29] PROBLEM - puppet last run on amssq34 is CRITICAL: CRITICAL: puppet fail [19:08:38] PROBLEM - puppet last run on amssq55 is CRITICAL: CRITICAL: puppet fail [19:09:26] My browser really hates that paste. Scroll sooooo slow [19:09:39] PROBLEM - puppet last run on mc1014 is CRITICAL: CRITICAL: puppet fail [19:10:08] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: puppet fail [19:10:19] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:11:35] ^d: It's broken. [19:11:42] Like all the time [19:11:48] <^d> Yeahhhh [19:12:00] Broken in l10nupdate.log-20141116.gz logs too [19:12:29] Reedy did some work to convert it to run using scap. That must not be quite right [19:14:09] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:17:39] ^d: https://phabricator.wikimedia.org/T1383 [19:19:57] <^d> thx for filing that [19:20:14] <^d> sync-common is at 65% for my full scap [19:22:10] robh: sorry for delay, all clear to merge key update and resolve ticket :) [19:22:21] mw1070 is getting overloaded by the scap again :( -- https://ganglia.wikimedia.org/latest/?c=Application%20servers%20eqiad&h=mw1070.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:22:39] RECOVERY - puppet last run on carbon is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:22:50] rmoen: cool, i'll do so now [19:22:53] load average: 103.23, 95.48, 57.43 [19:22:58] RECOVERY - puppet last run on mw1012 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [19:22:58] RECOVERY - puppet last run on dbproxy1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:23:38] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:23:56] (03PS3) 10RobH: stat1003 access for rmoen (rt #8870) and update ssh key [puppet] - 10https://gerrit.wikimedia.org/r/173811 (owner: 10ArielGlenn) [19:24:10] PROBLEM - puppet last run on amssq49 is CRITICAL: CRITICAL: puppet fail [19:24:10] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail [19:24:28] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: puppet fail [19:24:28] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: puppet fail [19:24:48] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [19:24:48] PROBLEM - puppet last run on amssq61 is CRITICAL: CRITICAL: puppet fail [19:24:48] PROBLEM - puppet last run on 
searchidx1001 is CRITICAL: CRITICAL: puppet fail [19:24:49] PROBLEM - puppet last run on amslvs2 is CRITICAL: CRITICAL: puppet fail [19:24:49] RECOVERY - puppet last run on mc1003 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [19:24:59] PROBLEM - puppet last run on elastic1004 is CRITICAL: CRITICAL: puppet fail [19:25:09] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: puppet fail [19:25:09] PROBLEM - puppet last run on rbf1002 is CRITICAL: CRITICAL: puppet fail [19:25:09] PROBLEM - puppet last run on es1008 is CRITICAL: CRITICAL: puppet fail [19:25:18] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: puppet fail [19:25:28] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [19:25:29] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: puppet fail [19:25:39] PROBLEM - puppet last run on db2034 is CRITICAL: CRITICAL: puppet fail [19:25:48] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:25:48] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:25:49] PROBLEM - puppet last run on search1010 is CRITICAL: CRITICAL: puppet fail [19:26:19] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: puppet fail [19:26:20] PROBLEM - puppet last run on elastic1008 is CRITICAL: CRITICAL: puppet fail [19:26:59] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: puppet fail [19:26:59] RECOVERY - puppet last run on amssq55 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:27:11] (03CR) 10RobH: [C: 032] "cleared up with rob via irc, ok to merge" [puppet] - 10https://gerrit.wikimedia.org/r/173811 (owner: 10ArielGlenn) [19:27:19] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [19:27:28] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:27:38] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:27:48] rmoen: ok, it is now live, it'll take an hour or so to hit all the various servers [19:27:49] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: puppet fail [19:27:59] RECOVERY - puppet last run on mc1014 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:28:07] if there are specific servers you need to access before then, just let me know and i'll manually fire a puppet run on them =] [19:28:50] RECOVERY - puppet last run on amssq34 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:30:02] !log demon Finished scap: (no message) (duration: 23m 37s) [19:30:05] Logged the message, Master [19:30:08] PROBLEM - puppet last run on amssq36 is CRITICAL: CRITICAL: puppet fail [19:30:38] <_joe_> hey is someone looking at this shower of puppet failures? [19:32:19] just got out of meeting, i'll take another look [19:32:59] ^d: deploy https://gerrit.wikimedia.org/r/#/c/174759/ now? 
[19:33:18] <_joe_> jgage: I don't think you're the only opsen around :) [19:33:47] well i seem to be the only one who responded :) [19:33:49] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [19:34:21] <_joe_> it seems like transient failures [19:34:42] !log restarted puppetmasters [19:34:47] Logged the message, Master [19:35:03] yeah, but some of them are across the lan which suggests the master rather than net [19:35:03] <_joe_> jgage: the error is most of the times [19:35:05] <_joe_> Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function t at /etc/puppet/modules/torrus/manifests/xml_generation/cdn.pp:1 on node mw1126.eqiad.wmnet [19:35:06] let's see if that helps [19:35:18] oh interesting, 400 [19:35:22] <_joe_> yes, hopefully yes [19:35:23] (03PS1) 10Nuria: Adding user to monitoring of analytics projects [puppet] - 10https://gerrit.wikimedia.org/r/174776 [19:35:37] <_joe_> 400 is the lame way puppet has to say it failed compilation [19:35:39] <_joe_> MEH [19:37:29] notify => Exec['torrus compile --tree=CDN'], [19:37:40] bleh torrus [19:40:21] (03PS1) 10Ottomata: Increase heapsize for hive-server2 and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/174778 [19:40:35] (03CR) 10Ottomata: [C: 032 V: 032] Increase heapsize for hive-server2 and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/174778 (owner: 10Ottomata) [19:40:41] <^d> _joe_: Running puppet on a host that was failing worked fine for me [19:40:52] <^d> (was able to get search1018 and elastic1008 to recover) [19:41:36] (03CR) 10Chad: [C: 032] Un-disenable Special:SkinDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174759 (owner: 10Legoktm) [19:41:41] ty robh [19:41:53] welcome :] [19:41:57] (03Merged) 10jenkins-bot: Un-disenable Special:SkinDistributor [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174759 (owner: 10Legoktm) [19:42:26] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 04s) [19:42:29] Logged the message, Master [19:42:30] <^d> legoktm: ^^^ [19:42:39] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:42:54] ^d: yaaay https://www.mediawiki.org/wiki/Special:SkinDistributor :D [19:43:00] thanks! [19:43:08] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [19:43:09] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [19:43:18] RECOVERY - puppet last run on amslvs2 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [19:43:28] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:43:28] RECOVERY - puppet last run on es1008 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [19:43:29] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [19:43:29] RECOVERY - puppet last run on amssq49 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:43:38] <^d> legoktm: woot! 
[19:43:39] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:43:39] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:44:08] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [19:44:18] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: puppet fail [19:44:19] RECOVERY - puppet last run on elastic1004 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:44:19] PROBLEM - puppet last run on analytics1025 is CRITICAL: CRITICAL: puppet fail [19:44:28] RECOVERY - puppet last run on rbf1002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [19:44:29] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [19:44:38] RECOVERY - puppet last run on elastic1008 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [19:44:49] PROBLEM - puppet last run on ms-be1006 is CRITICAL: CRITICAL: puppet fail [19:44:59] PROBLEM - puppet last run on ms-fe2004 is CRITICAL: CRITICAL: puppet fail [19:45:08] RECOVERY - puppet last run on amssq61 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [19:45:18] RECOVERY - puppet last run on search1010 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:45:29] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:45:29] PROBLEM - puppet last run on analytics1020 is CRITICAL: CRITICAL: puppet fail [19:45:29] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: puppet fail [19:45:58] PROBLEM - puppet last run on es2001 is CRITICAL: CRITICAL: puppet fail [19:45:59] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:46:18] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:46:40] PROBLEM - puppet last run on lvs1005 is CRITICAL: CRITICAL: puppet fail [19:48:08] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: puppet fail [19:48:08] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail [19:48:09] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: puppet fail [19:48:28] (03PS1) 10Yuvipanda: nagios_common: Split out notification_commands for shinken/icinga [puppet] - 10https://gerrit.wikimedia.org/r/174780 [19:48:29] PROBLEM - puppet last run on db1016 is CRITICAL: CRITICAL: puppet fail [19:48:39] PROBLEM - puppet last run on labcontrol2001 is CRITICAL: CRITICAL: puppet fail [19:48:49] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: puppet fail [19:48:58] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: puppet fail [19:49:19] PROBLEM - puppet last run on pc1002 is CRITICAL: CRITICAL: puppet fail [19:49:20] RECOVERY - puppet last run on amssq36 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [19:49:28] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: puppet fail [19:49:58] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: puppet fail [19:50:03] hmm [19:50:05] * yuvipanda takes a look too [19:50:54] hmm [19:50:57] seem fine when run manually [19:50:58] * yuvipanda looks at logs [19:50:59] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 7 
seconds ago with 0 failures [19:51:13] <^d> yuvipanda: Yeah that's what I said a bit ago. I ran it manually on 2 nodes and they recovered. [19:51:18] hmmm [19:51:20] torrus error? [19:51:59] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Unknown function t at /etc/puppet/modules/torrus/manifests/xml_generation/cdn.pp:1 on node labnet1001.eqiad.wmnet [19:52:19] * yuvipanda tries on another machine [19:52:36] yup [19:52:37] torrus [19:52:43] why is that intermittent?! [19:53:57] i'm guessing that the compilation is expensive? [19:54:05] leading to timeouts [19:55:13] jgage: actually... [19:55:17] (03PS1) 10Yuvipanda: torrus: Remove stray characters in file causing failures [puppet] - 10https://gerrit.wikimedia.org/r/174783 [19:55:18] jgage: ^ [19:55:32] (03PS1) 10BryanDavis: l10nupdate: Fix scap command paths [puppet] - 10https://gerrit.wikimedia.org/r/174784 [19:55:38] * jgage looks [19:55:49] ha! [19:55:51] nice catch [19:55:53] indeed [19:56:06] (03CR) 10Gage: [C: 032] torrus: Remove stray characters in file causing failures [puppet] - 10https://gerrit.wikimedia.org/r/174783 (owner: 10Yuvipanda) [19:56:26] jgage: ty! [19:56:30] merged [19:56:36] let's see if that fixes it [19:56:47] no idea *why* it was intermittent, though... [19:56:56] yeah that is especially weird [19:57:12] this kind of thing, you would expect to just... fail [19:57:12] hard [19:59:12] * jgage -> lunch [19:59:17] (03CR) 10BryanDavis: "Resolving this bug also requires that a root change the ownership of /var/lock/scap to l10nupdate:wikidev. The l10nupdate use is not a mem" [puppet] - 10https://gerrit.wikimedia.org/r/174784 (owner: 10BryanDavis) [20:00:09] I's appreciate a review and merge on that puppet patch by any opsen who has time ^ [20:00:26] l10nupdate is broked in prod (and has been for 2 weeks apaprently) [20:00:42] I'm pretty sure that patch will fix it [20:01:00] There was a bug [20:01:03] along with the file permission change I noted in the comments [20:01:21] Nemo_bis: https://phabricator.wikimedia.org/T1383 [20:01:48] should we move l10nupdate out of the puppet repo alongside scap? [20:02:16] greg-g: eh. maybe? [20:02:25] (03PS2) 10Yuvipanda: nagios_common: Split out notification_commands for shinken/icinga [puppet] - 10https://gerrit.wikimedia.org/r/174780 [20:02:40] greg-g: Faidon wasn't happy about scap moving out. [20:02:49] I mean, what reasoning is there to have scap outside but not l10nupdate? /me shrugs just a passing thought [20:03:02] well, then give my team +2 on puppet and I'll be happy with whatever :) [20:03:10] RECOVERY - puppet last run on ms-be1006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:03:12] ;) [20:03:19] RECOVERY - puppet last run on db2034 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:03:22] I think I asked for that once... [20:03:29] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:03:29] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:03:38] RECOVERY - puppet last run on analytics1025 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:03:41] bd808: I'm looking at the patch now [20:03:49] RECOVERY - puppet last run on analytics1020 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:03:56] yuvipanda: woot! 
I forgot about your new super powers [20:03:59] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:04:02] bd808: :) [20:04:19] RECOVERY - puppet last run on ms-fe2004 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:04:20] oh man, I knew this was in bash, but ugh [20:04:32] it's not pretty [20:04:42] I mean even for bash it's scabby [20:04:49] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:04:58] RECOVERY - puppet last run on lvs1005 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:05:28] RECOVERY - puppet last run on es2001 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:06:17] bd808: /var/lock/scap on where? tin? [20:06:27] yuvipanda: Yeah tin [20:06:28] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:06:59] (03PS1) 10Jforrester: Enable VisualEditor Beta Feature on other wikis too [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174793 [20:07:08] RECOVERY - puppet last run on labcontrol2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:07:28] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:07:32] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:07:49] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:08:10] bd808: heh, it's ori:wikidev now [20:08:18] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:08:20] bd808: I wonder if it gets reset every time someone writes to it [20:08:20] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:08:23] He ran it first I assume [20:08:35] ah, right [20:08:37] It shouldn't [20:08:38] RECOVERY - puppet last run on pc1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:08:44] If it did Chad would own it [20:08:47] anyway, merging. [20:08:49] RECOVERY - puppet last run on db1016 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:09:05] bd808: I meant https://bugzilla.wikimedia.org/show_bug.cgi?id=73586 [20:09:30] !log run chown l10nupdate:wikidev /var/lock/scap on tin, for https://gerrit.wikimedia.org/r/#/c/174784/1 [20:09:33] Logged the message, Master [20:09:42] (03PS2) 10Yuvipanda: l10nupdate: Fix scap command paths [puppet] - 10https://gerrit.wikimedia.org/r/174784 (owner: 10BryanDavis) [20:10:34] (03CR) 10Yuvipanda: [C: 032] l10nupdate: Fix scap command paths [puppet] - 10https://gerrit.wikimedia.org/r/174784 (owner: 10BryanDavis) [20:10:40] Nemo_bis: Chad ran a full scap today so I bet that bug is fixed now, but yeah it was probably because of this problem that the l10n didn't get updated. [20:11:01] bd808: am forcing a puppet run on tin now [20:11:17] gogo panda powers! [20:11:28] * bd808 needs food [20:11:48] yuvipanda: I'll try running it after I eat something to see if I really fixed it or not [20:12:02] bd808: yeah, ok. I might not be around, though. but might also be. no idea. [20:12:39] yuvipanda: No worries.
I can beg for help from others if needed [20:12:46] * bd808 is good at begging [20:12:52] bd808: heh. feel free to add me to similar patches in the future, etc. [20:12:55] * hashar gives a quarter to bd808 [20:13:22] * bd808 buys a stick of gum with hashar's quarter [20:13:26] bd808: puppet ran successfully [20:13:36] cool [20:18:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [20:19:08] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [500.0] [20:25:29] hmmm [20:25:39] shinken is still not sending emails for service state changes [20:25:43] Y U NO LIKE ME SHINKEN [20:28:15] I'm getting many transcoding timeouts recently, where should I look to find the root cause? [20:32:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:32:39] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:41:39] PROBLEM - Host mw1227 is DOWN: PING CRITICAL - Packet loss = 100% [20:42:38] (03PS2) 10Ori.livneh: mediawiki::packages: require ::apt [puppet] - 10https://gerrit.wikimedia.org/r/174748 [20:43:27] (03CR) 10Ori.livneh: [C: 032 V: 032] mediawiki::packages: require ::apt [puppet] - 10https://gerrit.wikimedia.org/r/174748 (owner: 10Ori.livneh) [20:54:23] hmm, so host notification emails just work. [20:54:26] service ones, however... [21:00:09] !log Updated EventLogging to 39de1d3faacc8463db7532405e8fc003b80ecb79 [21:00:14] Logged the message, Master [21:04:53] bam, *that* is why [21:05:32] (03PS3) 10Yuvipanda: nagios_common: Split out notification_commands for shinken/icinga [puppet] - 10https://gerrit.wikimedia.org/r/174780 [21:06:57] is greg-g around today? [21:07:43] * greg-g nods [21:07:47] ah :) [21:07:51] :) [21:07:55] what's up? [21:07:56] * aude wants to deploy https://gerrit.wikimedia.org/r/#/c/174800/ (update to the property suggester) [21:08:13] we would have done it tuesday except i was travelling and we need to update the table also [21:08:21] (maintenance script) [21:08:31] is there time open today? [21:09:16] otherwise can do it on monday [21:09:18] yep, between now and 2 hours from now is open [21:09:22] ok, great [21:09:27] now is fine if you're good to go [21:09:30] ok [21:09:47] (it's early -- afternoon here ;) [21:10:04] thanks [21:11:15] Anybody worried if I run l10nupdate to see if it's fixed now? [21:11:20] greg-g: ? [21:11:55] bd808: if i can quickly deploy an update for wikidata first? [21:12:03] not sure they should be at the same time? [21:12:11] aude: Yeah no problem. [21:12:18] waiting on jenkins [21:12:33] bd808: I'll be around for the next 20mins too, if you need anything [21:12:35] bd808: oh right, you were waiting on that, sorry 'bout the delay [21:12:39] aaah, have to do wmf9 too [21:12:43] first [21:12:59] greg-g: No problem. I've got lots of other "real work" to do :) [21:13:48] won't take long [21:17:54] jenkins is taking its time...
[21:19:24] * hashar blames the test suite [21:20:02] heh [21:26:12] !log aude Synchronized php-1.25wmf9/extensions/Wikidata: Update test.wikidata - property suggester (duration: 00m 10s) [21:26:16] Logged the message, Master [21:26:24] * aude verifying [21:28:07] !log aude Synchronized php-1.25wmf8/extensions/Wikidata: Update Wikidata - property suggester (duration: 00m 10s) [21:28:09] Logged the message, Master [21:28:25] done, except need to run a script now [21:37:53] bd808: done (also with the script) [21:38:14] aude: thx. I'll see if I unbroke l10nupdate or not [21:38:36] ok [21:40:02] !log Testing l10nupdate changes [21:40:05] Logged the message, Master [21:40:25] (03PS4) 10Yuvipanda: shinken: Fix notification commands to make email work [puppet] - 10https://gerrit.wikimedia.org/r/174780 [21:42:09] PROBLEM - Host amssq39 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] hmm 502 for wikipedia? [21:42:09] PROBLEM - Host amssq42 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host cp3015 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host amssq36 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host cp3018 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host amssq40 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:09] PROBLEM - Host cp3017 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:10] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:10] PROBLEM - Host amssq44 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:11] PROBLEM - Host amssq47 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:11] PROBLEM - Host cp3016 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:12] PROBLEM - Host amssq35 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:12] PROBLEM - Host amssq34 is DOWN: PING CRITICAL - Packet loss = 100% [21:42:19] what a timing [21:42:29] RECOVERY - Host amssq34 is UP: PING OK - Packet loss = 0%, RTA = 96.07 ms [21:42:32] RECOVERY - Host amssq47 is UP: PING OK - Packet loss = 0%, RTA = 95.76 ms [21:42:32] RECOVERY - Host amssq44 is UP: PING OK - Packet loss = 0%, RTA = 95.13 ms [21:42:32] RECOVERY - Host cp3015 is UP: PING OK - Packet loss = 0%, RTA = 95.29 ms [21:42:32] RECOVERY - Host cp3017 is UP: PING OK - Packet loss = 0%, RTA = 95.98 ms [21:42:32] RECOVERY - Host amssq35 is UP: PING OK - Packet loss = 0%, RTA = 96.04 ms [21:42:33] RECOVERY - Host cp3016 is UP: PING OK - Packet loss = 0%, RTA = 95.61 ms [21:42:33] RECOVERY - Host cp3018 is UP: PING OK - Packet loss = 0%, RTA = 95.47 ms [21:42:34] RECOVERY - Host amssq42 is UP: PING OK - Packet loss = 0%, RTA = 95.32 ms [21:42:34] RECOVERY - Host amssq43 is UP: PING OK - Packet loss = 0%, RTA = 95.52 ms [21:42:35] RECOVERY - Host amssq36 is UP: PING OK - Packet loss = 0%, RTA = 94.84 ms [21:42:38] RECOVERY - Host amssq39 is UP: PING OK - Packet loss = 0%, RTA = 95.10 ms [21:42:48] RECOVERY - Host amssq40 is UP: PING OK - Packet loss = 0%, RTA = 95.86 ms [21:43:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:43:48] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: puppet fail [21:43:49] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: Puppet has 5 failures [21:43:49] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: puppet fail [21:44:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:44:49] PROBLEM - puppet last run on amssq57 is CRITICAL: 
CRITICAL: Puppet has 1 failures [21:44:49] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail [21:44:59] PROBLEM - puppet last run on cp3022 is CRITICAL: CRITICAL: Puppet has 1 failures [21:45:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 5 failures [21:45:19] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR [21:48:39] hm what's up esams [21:58:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:38] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:52] jgage: second time today [21:59:44] !log bd808 Synchronized php-1.25wmf8/cache/l10n: (no message) (duration: 05m 05s) [21:59:48] Logged the message, Master [22:00:15] RECOVERY - puppet last run on amssq57 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:01:00] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [22:01:19] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [22:01:59] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:02:29] RECOVERY - puppet last run on cp3022 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:03:06] !log LocalisationUpdate completed (1.25wmf8) at 2014-11-20 22:03:06+00:00 [22:03:09] Logged the message, Master [22:03:10] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [22:03:59] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [22:17:08] (03PS1) 10Jalexander: Revert "Enable SecurePoll error detail for debugging" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174829 [22:18:48] <^d> Jamesofur: ^ is that for swat or you need it sooner? [22:19:00] nah, SWAT is fine, I'm adding it to the list there [22:19:06] <^d> okie dokie [22:19:12] it's just backing out a debugging setting [22:19:39] <^d> *nod* yeah [22:21:41] <^d> Jamesofur: Get it on the list now, we're basically at the 8 patch max :) [22:23:03] * Jamesofur shakes fist at edit conflict [22:23:17] apparently with myself [22:23:18] it's on there [22:23:40] !log bd808 Synchronized php-1.25wmf9/cache/l10n: (no message) (duration: 08m 05s) [22:23:44] Logged the message, Master [22:23:57] 8 minutes, not too bad [22:24:11] That's just the data sync, but yeah [22:24:19] <^d> Jamesofur: It was so much harder to edit conflict yourself in the days before tabbed browsing :p [22:24:28] so true [22:25:38] (03PS1) 10Legoktm: Deploy GlobalUserPage extension to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174831 [22:26:14] (03PS1) 10Gage: Hadoop: Logstash: log INFO level messages [puppet] - 10https://gerrit.wikimedia.org/r/174832 [22:26:33] :o swat is full already? [22:27:00] <^d> yeah pretty much. [22:27:08] !log LocalisationUpdate completed (1.25wmf9) at 2014-11-20 22:27:08+00:00 [22:27:11] Logged the message, Master [22:27:16] hmm, I have a beta-only change. [22:27:46] <^d> We could do a beta-only change now? I don't think that needs to wait for swat. 
[22:27:54] (03CR) 10BryanDavis: [C: 031] "If they are noise to someone, it is easy enough to add filter to hide all hadoop messages in a given Kibana dashboard." [puppet] - 10https://gerrit.wikimedia.org/r/174832 (owner: 10Gage) [22:27:55] legoktm: yeah, jfdi [22:27:55] legoktm: Aren't you deployer yourself now? [22:28:04] * hoo scratches his head [22:28:21] hoo: yeah, but I still haven't learned how to deploy yet. or.i is going to teach me on monday [22:28:32] :P [22:28:46] <^d> legoktm: link to patch? [22:28:48] git fetch; git rebase; sync-file [22:28:51] ^d: https://gerrit.wikimedia.org/r/#/c/174831/ [22:29:20] <^d> bd808: `git pull -r` because I'm a rebel and live on the edge :D [22:30:04] (03CR) 10Chad: [C: 032] Deploy GlobalUserPage extension to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174831 (owner: 10Legoktm) [22:30:12] (03Merged) 10jenkins-bot: Deploy GlobalUserPage extension to betalabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174831 (owner: 10Legoktm) [22:31:15] !log demon Synchronized wmf-config/: globaluserpage on beta, no-op sync (duration: 00m 07s) [22:31:17] Logged the message, Master [22:31:42] thanks [22:32:05] <^d> yw [22:32:54] thanks bd808, glad we agree about logging :) [22:33:51] (03CR) 10Gage: [C: 032] "I agree with bd808!" [puppet] - 10https://gerrit.wikimedia.org/r/174832 (owner: 10Gage) [22:34:10] jgage: There are some log events that aren't worth recording in prod, but by and large if we can hold the data and keep up with the event stream I don't see any reason to drop things that might be useful [22:34:58] The worst thing in debugging an intermittent prod problem is to find out that the log level was one notch too low. :( [22:34:59] yeah. at minimum i'd rather get the whole feed and then selectively drop specific event types identified as useless/noise [22:35:33] (03CR) 10Andrew Bogott: [C: 032] Allow sshd to pull ssh keys from ldap on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/173066 (owner: 10Andrew Bogott) [22:35:36] You'll have a hard time out-spamming OCG. ;) [22:35:43] :D [22:35:52] * jgage sets level DEBUG [22:36:14] Why not trace? [22:36:26] for real java crazy log dumpage [22:36:36] <^d> We could log elastic logs to logstash. [22:36:41] <^d> s/could/should/ [22:36:45] should [22:36:47] yes [22:36:55] especially the slow logs [22:37:04] and the mysql slow logs would be awesome too [22:37:27] ironically, logstash doesn't log its own messages to logstash either [22:37:45] chicken eating egg problem [22:37:58] also the logstash logs are crap [22:38:13] yeah, agreed on both [22:38:17] That ruby logging library it uses is painful to look at [22:38:17] the latter is annoying [22:38:18] <^d> I read a blog post awhile ago about somebody who'd set up logstash consuming its own logs. [22:38:46] so it would log that it had logged about processing a log while it logged? [22:38:46] i'd be interested to read that if you have the link handy, ^d [22:39:09] <^d> I'm trying to dig it up :) [22:39:12] danke [22:39:25] feeding logstash debug log stream into logstash would be a fun way to self DOS [22:39:42] each event larger than the last [22:40:28] <^d> nvm, it was just the silly one about logging elasticsearch events in logstash. [22:40:33] ^d, greg-g: l10nupdate is fixed [22:40:58] hm ok [22:41:00] <^d> wheee [22:41:26] bd808: sweet [22:41:27] <^d> Heh, third-to-final paragraph [22:41:29] <^d> "The usefulness of this entire setup is a bit suspicious." [22:42:18] heh!
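bd808's "one notch too low" point is the crux of the logging exchange above: emit events at honest levels in code and let configuration decide which levels actually ship to logstash. A minimal sketch of what that looks like with MediaWiki's PSR-3 LoggerFactory (introduced during the 1.25 cycle); the 'transcode' channel and the example values are made up for illustration:

```php
<?php
// Sketch of level-based logging via MediaWiki's PSR-3 LoggerFactory.
// Which levels reach logstash is decided per channel in the logging
// config, so raising verbosity during an incident needs no code change.
use MediaWiki\Logger\LoggerFactory;

$logger = LoggerFactory::getInstance( 'transcode' ); // channel name is hypothetical
$cmd = 'ffmpeg -i in.webm out.ogv'; // example values
$sec = 42;

$logger->debug( 'running {cmd}', array( 'cmd' => $cmd ) ); // usually filtered out
$logger->info( 'transcode finished in {sec}s', array( 'sec' => $sec ) ); // the notch that saves a debugging session
$logger->warning( 'transcode timed out after {sec}s', array( 'sec' => $sec ) );
```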
[22:43:14] soooo how long does it take for code to get deployed to beta? http://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version still doesn't show it yet [22:43:35] but eval.php shows it's being loaded... [22:44:11] legoktm: Just a config change? They are almost instant [22:44:44] hmmmm [22:44:50] it's not showing up on Special:Version :| [22:44:53] legoktm: scap is running there now -- https://integration.wikimedia.org/ci/view/Beta/ [22:45:55] legoktm: and the scap that is running is for your config change -- https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/30556/ [22:46:05] ah [22:46:06] ok :D [22:46:13] Huh, neat. [22:50:07] legoktm: Looks like you caught (or caused) an l10n change that took a while [22:50:28] 22:43:09 Finished mw-update-l10n (duration: 12m 41s) [22:50:37] o.O [22:50:44] it's on http://meta.wikimedia.beta.wmflabs.org/wiki/Special:Version :D [22:50:56] l10nupdate is teh slow [22:53:28] (03PS6) 10Andrew Bogott: Move the openstack_version setting to hiera. [puppet] - 10https://gerrit.wikimedia.org/r/173904 [22:55:13] (03CR) 10Andrew Bogott: [C: 032] Move the openstack_version setting to hiera. [puppet] - 10https://gerrit.wikimedia.org/r/173904 (owner: 10Andrew Bogott) [23:00:04] maxsem, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141120T2300). [23:02:27] (03PS15) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [23:13:29] !log maxsem Synchronized php-1.25wmf9/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/174749/ (duration: 00m 04s) [23:13:33] Logged the message, Master [23:19:47] <^d> !log running sync-common on mw1135. out of sync? [23:19:52] Logged the message, Master [23:20:11] greg-g: Around? [23:24:56] I'm getting an error message when trying to log in to any beta.wmflabs wiki: [23:24:58] Unable to connect [23:24:58] Firefox can't establish a connection to the server at login.wikimedia.beta.wmflabs.org. [23:25:01] " [23:25:20] with URL: https://login.wikimedia.beta.wmflabs.org/wiki/Special:CentralLogin/start?token=8722... [23:25:31] hoo: yeah, what's up? [23:25:48] legoktm: ^ see quiddity [23:25:57] greg-g: Did you intentionally change the time for next week's Tuesday train? [23:26:08] 19:00–21:00 UTC [23:26:09] uh [23:26:09] hmmm [23:26:15] <^d> !log graceful'd mw1135, apc stale? [23:26:21] Logged the message, Master [23:26:22] hoo: no :) [23:26:34] hoo: copy paste from this week, then forgot it was changed this week :) [23:26:37] greg-g: But... can we keep it? [23:26:41] quiddity: it loads for me... [23:26:50] But then when I reload the mainpage at the actual wiki (meta or en) I'm successfully logged in. (However i'm not logged in to the other betaSUL wikis..., e.g. wikidata) [23:26:55] Reedy: thoughts on next week's Tuesday deploy time? [23:27:12] I'm in university during the normal Tuesday deploy time... so having it after 6pm would be better for me [23:27:24] that ok with aude ? [23:27:25] 6pm CET [23:27:41] greg-g: Not sure... let's wait for her to come back from having food [23:27:44] kk [23:27:59] hoo: can you two talk and just send me an email? whatever works for you [23:28:11] Well, if Reedy is ok with it [23:28:36] I'll just talk to aude and then write an email to you two to see what we can come up with [23:28:53] * greg-g nods [23:29:09] <^d> Ah, it was stale apc. [23:29:11] ok with me [23:29:13] <^d> yay, fixed.
[23:29:15] * aude won't be around [23:29:29] quiddity: did the whole login flow and it wfm [23:29:44] aude: what's the eta for implementing a solution like the one aaron suggested for bug 56602? [23:30:18] ori: we have an immediate fix to constrain use of memcached for this [23:30:21] aude: Ok, I can then do all Tuesday deploys (until traveling or so) [23:30:28] Will write the mail [23:30:38] the solution aaron suggests, need to talk to lydia and tobi when we can schedule [23:30:41] aude: constrain by how much? [23:30:43] * aude has holiday next week [23:30:51] ori: to only users opting into the beta feature [23:30:58] aude: On Wednesday, that is, right? [23:31:23] hoo: not around on wednesday also [23:31:39] Yeah, we will need to find someone for Wednesday, probably [23:31:43] :S [23:31:52] ori: my patch is in gerrit, if merged, maybe we can backport it [23:32:05] Maybe I can do it, but not very likely [23:32:28] hoo: deployments shouldn't depend on just us two [23:32:42] * aude thinks tobi + jan can handle it for example [23:32:43] aude: how much longer will you be around? I can review it now [23:33:00] ori: it's somewhat non-trivial [23:33:06] * aude around for a while [23:33:14] * ori reviews [23:33:14] aude: Yeah, guess they can... just need to ask them [23:33:36] https://gerrit.wikimedia.org/r/#/c/174113/ [23:33:52] In the end, they can still revert or call me up or so [23:33:53] jan knows where all the config is and how to do stuff :) [23:33:58] tobi knows how to do the build [23:34:03] and yes, can revert [23:34:27] or maybe i won't be at the airport yet :) [23:34:52] aude: Don't miss your flight :D [23:34:56] hoo: :) [23:35:47] aren't there some wikis where it's not a BF? [23:35:49] legoktm, hmm, i notice that the http://meta.wikimedia.beta.wmflabs.org/ URL is http, but the URL that logging in redirected me to is http*s* - could that be it? [23:36:05] legoktm: a few [23:36:19] quiddity: secure login is enabled, so you should only be logged in on https I think. [23:36:59] I can't load anything at https://meta.wikimedia.beta.wmflabs.org/ it instantly errors. [23:38:11] also don't know that my patch is a clean cherrypick [23:39:15] quiddity: Firefox can't connect to blah or a different error? [23:39:28] Firefox can't establish a connection to the server at meta.wikimedia.beta.wmflabs.org. [23:39:28] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [23:39:43] quiddity: network issue on your end? [23:40:00] quiddity: can you ping it? [23:40:21] <^d> legoktm: I'm getting instant fail too [23:40:23] ori: for https://gerrit.wikimedia.org/r/#/c/174113/, i suggest if it gets merged tomorrow or so, then we can put it on test2 etc [23:40:26] as a wmf9 build [23:40:36] o.O [23:40:40] <^d> (yes, I can ping) [23:40:47] wfm... [23:41:27] how would I ping a https? I only know $ ping google.com [23:41:31] aude: I'll be working tomorrow... guess I can have a look [23:41:35] hoo: thanks [23:41:35] aude: wfm [23:41:53] ori: ok :) [23:41:54] quiddity: ping https://foobar.com [23:42:15] $ ping https://google.com/ [23:42:16] ping: unknown host https://google.com/ [23:42:23] o.O [23:42:41] but without the https:// it works fine.
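For context, the "immediate fix" aude describes earlier in this exchange is to gate the cache traffic on the beta-feature opt-in, so only opted-in users ever hit memcached. A hedged sketch of that shape — the feature key, cache key, and computeSuggestions() helper are all invented here, and this is not the actual Wikibase patch (gerrit 174113):

```php
<?php
// Sketch of constraining memcached use to beta-feature opt-ins.
// 'wikibase-suggester' and computeSuggestions() are hypothetical.
function getSuggestionsFor( User $user ) {
	global $wgMemc;

	if ( !BetaFeatures::isFeatureEnabled( $user, 'wikibase-suggester' ) ) {
		return null; // opted-out users generate no cache traffic at all
	}

	$key = wfMemcKey( 'suggester', $user->getId() );
	$suggestions = $wgMemc->get( $key );
	if ( $suggestions === false ) {
		$suggestions = computeSuggestions( $user ); // hypothetical expensive step
		$wgMemc->set( $key, $suggestions, 300 ); // cache for five minutes
	}
	return $suggestions;
}
```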
km-mpb:skins km$ ping https://meta.wikimedia.beta.wmflabs.org [23:42:46] PING https://meta.wikimedia.beta.wmflabs.org (208.80.155.135): 56 data bytes [23:42:46] ori: and for the caching solution aaron suggests, we might want to make some of the code used for the i18n cache generalized [23:42:55] that could be used also for the sites data [23:42:59] re*used [23:43:19] what does the sites data consist of, again, and how often does it change? [23:43:26] ori: like the interwiki data [23:43:48] <^d> chad@notsexy /a/vag (master)$ curl --dump-header - "https://meta.wikimedia.beta.wmflabs.org" [23:43:48] <^d> curl: (7) Failed to connect to meta.wikimedia.beta.wmflabs.org port 443: Connection refused [23:43:51] <^d> legoktm: ^ [23:44:04] like global id (enwiki), local ids, "language", etc [23:44:09] aude: how often does it change? [23:44:14] only when we have a new wiki [23:44:27] ^d: ok, I can reproduce that... [23:44:28] can you just store it in a .php file? [23:44:31] this is weird. [23:44:32] Can we cdb it? [23:44:33] $siteData = (...)? [23:44:38] hoo: cdb is a bit evil [23:44:40] afaik [23:44:52] <^d> It's not evil [23:44:52] Well, it's static and stuff... but it's fast [23:44:56] php would be ok or json [23:45:03] * aude doesn't care too much [23:45:03] this changes so infrequently that I think just having a PHP file in mediawiki-config would be best. [23:45:04] static as in: Need to recreate to change [23:45:20] <^d> cdb is only painful for things that change often. like msgs. [23:45:21] <^d> :) [23:45:26] ^d: i see [23:45:29] ^d, quiddity: I have no idea what's going on..... [23:45:32] that might work then [23:45:36] <^d> legoktm: nor do I. [23:45:39] why not a PHP file? [23:45:45] you'll get better caching and better performance [23:45:51] ori: it would be dynamically generated [23:45:57] you can dynamically generate a php file [23:45:58] i think cdb or json would be better for that [23:46:00] ori: You mean just a file with 'return array( /* stuffs */ );' [23:46:01] i know :) [23:46:06] ? [23:46:09] hoo: yes [23:46:28] ori: if you think that is best, then ok with me [23:46:32] <^d> The problem with PHP is it's completely non-portable to other things. [23:46:33] I'd also prefer to have json or cdb... dynamically creating PHP is evil [23:46:37] I want us to fix the performance issue that has been throwing off the whole cluster for months, and then you guys can iterate [23:46:39] PROBLEM - Disk space on graphite1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 19756 MB (3% inode=99%): [23:46:39] !log maxsem Synchronized php-1.25wmf9/extensions/WikiGrok/: https://gerrit.wikimedia.org/r/174847 (duration: 00m 04s) [23:46:44] Logged the message, Master [23:46:50] ori: my patch will do that mostly [23:47:14] yes, but your patch is nontrivial, and it doesn't conflict with doing what i'm suggesting [23:47:20] yeah [23:47:48] !log maxsem Synchronized php-1.25wmf8/extensions/WikiGrok/: https://gerrit.wikimedia.org/r/174847 (duration: 00m 04s) [23:47:50] Logged the message, Master [23:48:32] aude, hoo: can one of you pastebin a JSON-serialized version of the current data? [23:49:31] should be trivial... can you do it, aude? [23:49:39] I actually wanted to do other stuffs today :P [23:49:42] Always get lost...
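A sketch of the generated-PHP flavor ori argues for above: dump the array with var_export() into a file ending in return, write it atomically, and load it with require() so the opcode cache (APC, at the time) makes every read after the first essentially free. Function and path names here are invented for illustration:

```php
<?php
// Sketch of ori's "dynamically generate a php file" suggestion. Names
// are made up; the point is the var_export()/require() round trip and
// the atomic rename so readers never see a half-written cache file.
function writeSitesCache( array $sites, $path ) {
	$php = "<?php\n// Generated cache; do not edit.\nreturn "
		. var_export( $sites, true ) . ";\n";
	$tmp = $path . '.tmp.' . getmypid();
	file_put_contents( $tmp, $php );
	rename( $tmp, $path ); // atomic on the same filesystem
}

function readSitesCache( $path ) {
	// Repeated require() calls are served from the opcode cache.
	return is_readable( $path ) ? require $path : false;
}

writeSitesCache(
	array( 'enwiki' => array( 'lang' => 'en', 'group' => 'wikipedia' ) ),
	'/tmp/sites-example.php'
);
var_dump( readSitesCache( '/tmp/sites-example.php' ) );
```

^d's portability objection still holds, though: a generated PHP file is opaque to non-PHP consumers, which is exactly why json and cdb keep coming up as alternatives in the exchange above.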
[23:50:48] ori: give me a minute [23:50:53] thanks [23:52:53] (03CR) 10Kaldari: [C: 032] Add 'types of albums' WikiGrok campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174586 (owner: 10Bmansurov) [23:53:05] (03Merged) 10jenkins-bot: Add 'types of albums' WikiGrok campaign [mediawiki-config] - 10https://gerrit.wikimedia.org/r/174586 (owner: 10Bmansurov) [23:57:15] I'm claiming SWAT today [23:57:22] Because I have a late patch again [23:57:47] * ^d gives RoanKattouw the deployment conch [23:58:49] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures