[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151202T0000). Please do the needful. [00:00:56] (03CR) 10Rush: [C: 032] add labtest realm for ldap/manifests/role/config.pp [puppet] - 10https://gerrit.wikimedia.org/r/256359 (owner: 10Rush) [00:01:45] (03CR) 10Aaron Schulz: [C: 031] Lower redis connection timeout from 2s to 0.5s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255560 (owner: 10Ori.livneh) [00:11:12] 6operations, 6Phabricator: migrate RT maint-announce into phabricator - https://phabricator.wikimedia.org/T118176#1843329 (10RobH) IRC Update: @Dzahn proposes we evaulate using phabricator's calendar tracking for these in lieu of using google. Benefits: Open source \o/, single interface Drawbacks: We don't h... [00:13:13] 7Puppet, 6Phabricator, 6Release-Engineering-Team: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1843333 (10Negative24) @mmodell Thanks for your details (and icons to go with it :)). Shouldn't `sudo service apache2 restart` be used instead of the init V script? 
[00:14:17] 7Puppet, 6Phabricator, 6Release-Engineering-Team: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1843335 (10Negative24) p:5High>3Low [00:22:14] (03PS1) 10BBlack: tlsproxy: settable upstream IP, defaulting to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/256366 [00:22:16] (03PS1) 10BBlack: varnish: only believe XRIP from local nginx [puppet] - 10https://gerrit.wikimedia.org/r/256367 [00:22:18] (03PS1) 10BBlack: varnish: handle XFF whitespace better [puppet] - 10https://gerrit.wikimedia.org/r/256368 [00:23:47] (03CR) 10BBlack: [C: 032] tlsproxy: settable upstream IP, defaulting to 127.0.0.1 [puppet] - 10https://gerrit.wikimedia.org/r/256366 (owner: 10BBlack) [00:23:49] Reedy: 16:26 < shinken-wm> PROBLEM - Parsoid on deployment-parsoid05 is CRITICAL: Connection refused [00:24:03] Yeah, it'll come back :P [00:24:14] just cause i know you were trying to ssh to those :p [00:24:16] ok [00:24:29] yeah, it already is back :D [00:24:31] :) [00:28:20] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [00:29:24] ^ and that might happen again [00:29:55] !log puppetstoredconfigclean.rb labcontrol2001.wikimedia.org fixes icinga config [00:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:38:13] (03PS2) 10Dzahn: icinga: add virtual host for ores (test) [puppet] - 10https://gerrit.wikimedia.org/r/256352 [00:40:34] Hello there friends! I have a question about the password reset for the WM mailing lists. I didn't receive an email about the new password. How can I get a new password for that list? [00:41:19] KatyLove: which list are you asking about? 
[00:42:15] It's a list for which I am a moderator
[00:42:37] I am a moderator for 3 lists, and 2 of them I received a password reset mutante
[00:42:48] But the third I did not
[00:43:13] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[00:43:18] (CR) Dzahn: "@Yuvipanda so it works when this gets included in the icinga module. still a mystery why in the previous location it did not but did creat" [puppet] - https://gerrit.wikimedia.org/r/256352 (owner: Dzahn)
[00:43:18] Where might I go to reset my password or to gain access?
[00:44:23] bblack: Yo.
[00:44:44] bblack: There should hopefully be an incoming Phabricator ticket about https://en.wikipedia.org/wiki/Special:Contributions/127.0.0.1
[00:44:59] It looks like https://gerrit.wikimedia.org/r/256366 might be related?
[00:45:00] operations: Recent edits / vandalism from 127.0.0.1 on enwiki - https://phabricator.wikimedia.org/T120043#1843441 (Slakr) NEW
[00:45:25] The wikis have become self aware and are vandalising themselves
[00:45:31] KatyLove: which is the list name you need it for? moderator passwords would ideally be set by the admins for that list
[00:45:55] so the thing is we mailed this "owner" address for each list
[00:45:59] operations: Recent edits / vandalism from 127.0.0.1 on enwiki - https://phabricator.wikimedia.org/T120043#1843460 (ori) p:Triage>Unbreak!
[00:46:10] Ah, ok mutante. I thought I was an owner. I guess I am not.
[00:46:15] and initially assumed that means only admins, but actually it meant "admins and moderators"
[00:46:24] The list is wmfkids@lists.wikimedia.org
[00:46:24] operations: Recent edits / vandalism from 127.0.0.1 on enwiki - https://phabricator.wikimedia.org/T120043#1843441 (ori) Likely {2f3d2e92d231568}.
[00:46:39] operations: Recent edits / vandalism from 127.0.0.1 on enwiki - https://phabricator.wikimedia.org/T120043#1843465 (MZMcBride) Given the timing, this is probably related to .
[00:46:41] KatyLove: yea, you should have received a mail for all of them either way. hold on
[00:46:56] I swear I didn't... :) Thanks for checking mutante!
[00:47:12] operations: Recent edits / vandalism from 127.0.0.1 - https://phabricator.wikimedia.org/T120043#1843466 (Krenair)
[00:47:35] KatyLove: so for this one, you are more than moderator, you are administrator
[00:47:38] operations: Recent edits / vandalism from 127.0.0.1 - https://phabricator.wikimedia.org/T120043#1843441 (Krenair) The timing makes this seem related to (CR) BBlack: [C: 2] tlsproxy: settable upstream IP, defaulting to 127.0.0.1 [puppet] - https://gerrit.wikimedia.org/r/256366 (owner: BBlack)
[00:47:49] That is what I thought mutante
[00:47:51] Oh, but I see everyone else already made that comment.
[00:47:53] Thanks for checking
[00:47:53] Oh well.
[00:47:59] Heh.
[00:48:04] KatyLove: i can reset it, it just means it will also change for Dario
[00:48:06] We're adorable.
[00:48:09] Poor Dario
[00:48:18] I asked him about this too I think
[00:48:21] So he is aware of my problem
[00:48:23] you did? good
[00:48:46] https://www.wikidata.org/w/index.php?title=Q3876511&action=history wtf
[00:48:47] because this is easy to go in circles when one of the admins resets the password
[00:48:48] Is anyone messing?
[00:48:54] Yeah I imagine mutante
[00:48:56] change will be reverted in a sec
[00:49:04] (PS1) BBlack: Revert "tlsproxy: settable upstream IP, defaulting to 127.0.0.1" [puppet] - https://gerrit.wikimedia.org/r/256370
[00:49:12] (CR) BBlack: [C: 2 V: 2] Revert "tlsproxy: settable upstream IP, defaulting to 127.0.0.1" [puppet] - https://gerrit.wikimedia.org/r/256370 (owner: BBlack)
[00:49:22] ah, I see
[00:49:26] ok then
[00:50:03] heh
[00:50:08] Please tell me we don't have the same password though mutante ...
[00:50:25] yeah, so.... we should track down that MW bug :P
[00:50:25] 123456?
[00:50:29] KatyLove: you do. that's the thing about these password resets :)
[00:50:42] why on earth is it treating 127.0.0.1 as the client IP if it sees that in the XFF list? :P
[00:50:42] Ahhh....well if that's the case I can just go ask him to give it to me.
[00:50:43] operations: Recent edits / vandalism from 127.0.0.1 - https://phabricator.wikimedia.org/T120043#1843476 (He7d3r) Also seen on ptwiki: https://pt.wikipedia.org/wiki/Special:Contribs/127.0.0.1
[00:50:58] KatyLove: if he has it, that would be easiest.. except.. i just changed it :)
[00:51:10] operations: Recent edits / vandalism from 127.0.0.1 - https://phabricator.wikimedia.org/T120043#1843478 (BBlack) Yes, it's related and revert is in progress, we don't need further confirmation reports
[00:51:12] ha!! that is kinda funny mutante
[00:51:17] operations: Recent edits / vandalism from 127.0.0.1 - https://phabricator.wikimedia.org/T120043#1843479 (ori) Open>Resolved a:ori Reverted in 5f6512ac9850d1.
[00:51:18] Thanks so much
[00:51:50] why is 127.0.0.1 not globally blocked, just in case, anyway?
[00:52:10] well the software would/should never believe that 127.0.0.1 is the client IP anyways
[00:52:16] KatyLove: check mail now
[00:52:17] there's a bug in mediawiki somewhere
[00:52:31] Done! And accessed. Thanks so much mutante - I have been getting daily requests that I can't act on!
[00:52:39] And I see my co-owner hasn't either ;)
[00:52:41] this probably all goes back to TrustedXFF and related code
[00:53:33] it probably "trusts" our local networks and rewinds through them in the XFF list to reach the "real" client IP, but doesn't consider 127.0.0.1 to be part of our local network for those purposes, so it stops there and calls that the client IP
[00:53:34] KatyLove: ok, cool, so since i sent to "wmfkinds-owner" Dario has also received it
[00:53:40] kids
[00:53:46] I like kinds better
[00:53:53] Maybe I'll start that one ;)
[00:54:04] (my change was to switch one of our internal forwarding proxies from using a local server's own IP to using 127.0.0.1 to reach the same, which replaced a WMF IP with 127.0.0.1 in the midst of the XFF list)
[00:54:05] and it can be a place where people just send kindness all around
[00:54:14] Really appreciate it. thanks mutante
[00:54:26] heh, that sounds nice, just sooo many lists :) you're welcome
[00:54:59] * ori looks
[00:55:40] FWIW, when we do similar processing in varnish, we've handled this by including both definitions of localhost in the list of "our network IPs"
[00:55:55] (127.0.0.1 and ::1/128)
[00:57:12] all revert should be fully runtime-applied now
[00:57:28] Puppet, Phabricator, Gerrit-Migration: Configure backula to backup the /srv/phab/repos directory - https://phabricator.wikimedia.org/T120045#1843488 (greg) NEW
[00:57:51] Someone might want to reply at https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#User:127.0.0.1
[00:58:03] well it affects all wikis, but sure
[00:58:08] * hoo does another attempt at going to bed
[01:08:53] ACKNOWLEDGEMENT - Unmerged changes on repository puppet on labtestcontrol2001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). daniel_zahn scheduled downtime. test host that should not be in icinga anyways?
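(Editor's note on the XFF discussion at 00:53:33 to 00:55:55: the guess in the channel is that the client IP is found by walking the X-Forwarded-For list from the right and skipping trusted proxies, so an untrusted 127.0.0.1 in the middle of the list ends the walk early. The Python below is only a minimal sketch of that guess, not MediaWiki's or Varnish's actual code. The function names, the 10.0.0.0/8 range and the sample addresses are all hypothetical and chosen for illustration.)

```python
import ipaddress


def is_trusted(ip, trusted_networks):
    """Return True if ip falls inside any of the trusted networks."""
    addr = ipaddress.ip_address(ip)
    # Membership across IP versions (v4 address, v6 network) is simply False.
    return any(addr in net for net in trusted_networks)


def client_ip(remote_addr, xff_header, trusted_networks):
    """Walk the proxy chain right to left and return the first untrusted hop.

    remote_addr is the peer that connected to us; xff_header is the raw
    X-Forwarded-For value. Falls back to the leftmost hop if all are trusted.
    """
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    hops.append(remote_addr)
    for hop in reversed(hops):
        if not is_trusted(hop, trusted_networks):
            return hop
    return hops[0]


# Hypothetical "our network" list WITHOUT localhost (the suspected bug).
without_localhost = [ipaddress.ip_network("10.0.0.0/8")]

# Same list plus both definitions of localhost, as described for varnish.
with_localhost = without_localhost + [
    ipaddress.ip_network("127.0.0.1/32"),
    ipaddress.ip_network("::1/128"),
]

# Illustrative request: real client, then a local proxy hop via 127.0.0.1.
xff = "203.0.113.7, 127.0.0.1"
peer = "10.64.0.5"

print(client_ip(peer, xff, without_localhost))
print(client_ip(peer, xff, with_localhost))
```

(With localhost missing from the trusted list the walk stops at 127.0.0.1; with it included the walk reaches the leftmost address. That matches the behavior described in the channel, under the assumptions stated above.)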
[01:08:53] ACKNOWLEDGEMENT - keystone process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/keystone-all daniel_zahn scheduled downtime. test host that should not be in icinga anyways? [01:08:59] ACKNOWLEDGEMENT - nova-conductor process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-conductor daniel_zahn scheduled downtime. test host that should not be in icinga anyways? [01:08:59] ACKNOWLEDGEMENT - nova-scheduler process on labtestcontrol2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-scheduler daniel_zahn scheduled downtime. test host that should not be in icinga anyways? [01:08:59] ACKNOWLEDGEMENT - puppetmaster https on labtestcontrol2001 is CRITICAL: Connection refused daniel_zahn scheduled downtime. test host that should not be in icinga anyways? [01:09:23] what, not just in icinga, it even sends SMS [01:09:33] as a host called "test" [01:09:45] come on, that cant be right [01:09:50] (03PS1) 10Chad: Phab: back up code repositories in backula [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) [01:12:05] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1843529 (10Dzahn) 3NEW [01:12:54] (03PS2) 10BBlack: varnish: only believe XRIP from local nginx [puppet] - 10https://gerrit.wikimedia.org/r/256367 [01:12:56] (03PS2) 10BBlack: varnish: handle XFF whitespace better [puppet] - 10https://gerrit.wikimedia.org/r/256368 [01:13:00] (03CR) 10Greg Grossmeier: "ur qwik" [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) (owner: 10Chad) [01:13:42] (03CR) 10BBlack: [C: 032 V: 032] varnish: only believe XRIP from local nginx [puppet] - 10https://gerrit.wikimedia.org/r/256367 (owner: 10BBlack) [01:14:10] (03CR) 10BBlack: [C: 032 V: 
032] varnish: handle XFF whitespace better [puppet] - 10https://gerrit.wikimedia.org/r/256368 (owner: 10BBlack) [01:14:18] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: labtestcontrol2001 should not make Icinga page us - https://phabricator.wikimedia.org/T120047#1843545 (10Dzahn) [01:14:42] (03CR) 10Chad: "ez patch is ez" [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) (owner: 10Chad) [01:14:49] ostriches: :P [01:15:34] (03PS2) 10Dzahn: Phab: back up code repositories in backula [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) (owner: 10Chad) [01:15:51] (03CR) 10Dzahn: [C: 032] "wfm" [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) (owner: 10Chad) [01:16:25] (03CR) 10Chad: "Actually, do I need to include ::role::backup::host?" [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) (owner: 10Chad) [01:16:41] mutante: I was looking some moar [01:16:50] Ah, yerp, I do [01:16:51] (03CR) 10Dzahn: "yes :)" [puppet] - 10https://gerrit.wikimedia.org/r/256373 (https://phabricator.wikimedia.org/T120045) (owner: 10Chad) [01:16:59] Amending [01:17:41] shit, too 2late. 
[01:19:15] (03PS1) 10Dzahn: phabricator: make iridium a backup host [puppet] - 10https://gerrit.wikimedia.org/r/256374 (https://phabricator.wikimedia.org/T120045) [01:19:35] Heh, you beat me [01:19:53] (03CR) 10Chad: [C: 031] phabricator: make iridium a backup host [puppet] - 10https://gerrit.wikimedia.org/r/256374 (https://phabricator.wikimedia.org/T120045) (owner: 10Dzahn) [01:20:06] except the dependency that doesnt belong there :p [01:20:33] (03PS2) 10Dzahn: phabricator: make iridium a backup host [puppet] - 10https://gerrit.wikimedia.org/r/256374 (https://phabricator.wikimedia.org/T120045) [01:20:59] (03PS3) 10Dzahn: phabricator: make iridium a backup host [puppet] - 10https://gerrit.wikimedia.org/r/256374 (https://phabricator.wikimedia.org/T120045) [01:21:06] (03CR) 10Dzahn: [C: 032] phabricator: make iridium a backup host [puppet] - 10https://gerrit.wikimedia.org/r/256374 (https://phabricator.wikimedia.org/T120045) (owner: 10Dzahn) [01:24:38] mutante: thx [01:25:57] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: icinga config broken due to duplicate labstestcontrol - https://phabricator.wikimedia.org/T120050#1843587 (10Dzahn) 3NEW [01:26:34] 6operations, 6Labs, 10Labs-Infrastructure, 7Icinga: icinga config broken due to duplicate labs-ns1 / labcontrol2001 - https://phabricator.wikimedia.org/T120050#1843595 (10Dzahn) [01:27:08] ostriches: welcome! [01:27:44] 7Puppet, 6Phabricator, 5Gerrit-Migration, 5Patch-For-Review: Configure backula to backup the /srv/phab/repos directory - https://phabricator.wikimedia.org/T120045#1843598 (10demon) 5Open>3Resolved a:3demon This should be all setup now. [01:28:02] 7Puppet, 6Phabricator, 5Gerrit-Migration, 5Patch-For-Review: Configure backula to backup the /srv/phab/repos directory - https://phabricator.wikimedia.org/T120045#1843602 (10Dzahn) I can confirm the bacula stuff has been installed on iridium, the phab server. If you want to really have this confirmed let's... 
[01:28:45] PROBLEM - puppet last run on rdb2003 is CRITICAL: CRITICAL: puppet fail [01:34:20] PROBLEM - MariaDB Slave Lag: m3 on db1048 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 394 [01:36:10] ^ got paged again [01:36:17] it's running just phabricator tickets [01:36:20] RECOVERY - MariaDB Slave Lag: m3 on db1048 is OK: OK slave_sql_lag Seconds_Behind_Master: 0 [01:36:22] eh, queries [01:36:38] and there is a pattern like that every 24 hours exactly [01:37:54] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [01:43:17] (03PS1) 10Dzahn: ores: move monitoring to icinga [puppet] - 10https://gerrit.wikimedia.org/r/256376 (https://phabricator.wikimedia.org/T119340) [01:45:25] RECOVERY - puppet last run on rdb2003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [01:45:33] (03PS2) 10Dzahn: ores: move monitoring to icinga [puppet] - 10https://gerrit.wikimedia.org/r/256376 (https://phabricator.wikimedia.org/T119340) [01:49:03] (03PS3) 10Dzahn: ores: move monitoring to icinga [puppet] - 10https://gerrit.wikimedia.org/r/256376 (https://phabricator.wikimedia.org/T119340) [01:49:22] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1405/neon.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/256376 (https://phabricator.wikimedia.org/T119340) (owner: 10Dzahn) [02:03:46] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [02:15:00] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [02:26:26] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 10m 47s) [02:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:27:59] !log labcontrol2001 - disable puppet, kill from puppet stored configs [02:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:56] 
(03Abandoned) 10Dzahn: icinga: add virtual host for ores (test) [puppet] - 10https://gerrit.wikimedia.org/r/256352 (owner: 10Dzahn) [02:34:52] bblack: You're awesome, thank you. [03:18:12] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [5000000.0] [03:36:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 1.00% above the threshold [1000000.0] [03:42:51] PROBLEM - puppet last run on mw2029 is CRITICAL: CRITICAL: puppet fail [03:43:43] 7Puppet, 6Phabricator, 6Release-Engineering-Team: phabricator at labs is not up to date - https://phabricator.wikimedia.org/T117441#1843808 (10mmodell) >>! In T117441#1843333, @Negative24 wrote: > @mmodell Thanks for your details (and icons to go with it :)). > > Shouldn't `sudo service apache2 restart` be... [04:10:21] RECOVERY - puppet last run on mw2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:53:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Dec 2 05:53:21 UTC 2015 (duration 53m 20s) [05:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:09:26] (03PS2) 10Ori.livneh: Add wikidata mainpage to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256187 (https://phabricator.wikimedia.org/T117555) (owner: 10Addshore) [06:09:34] (03CR) 10Ori.livneh: [C: 032 V: 032] Add wikidata mainpage to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256187 (https://phabricator.wikimedia.org/T117555) (owner: 10Addshore) [06:09:52] (03PS3) 10Ori.livneh: Add WD Q64 static version to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256188 (https://phabricator.wikimedia.org/T117555) (owner: 10Addshore) [06:10:00] (03CR) 10Ori.livneh: [C: 032 V: 032] Add WD Q64 static version to webpref asset-check [puppet] - 10https://gerrit.wikimedia.org/r/256188 (https://phabricator.wikimedia.org/T117555) (owner: 10Addshore) 
[06:10:58] why is 127.0.0.1 not globally blocked, just in case, anyway? [06:11:24] It probably is blocked on some wikis. But I think it would probably make issues like this a lot more difficult to diagnose. [06:14:08] 6operations: Recent edits / vandalism from 127.0.0.1 - https://phabricator.wikimedia.org/T120043#1843880 (10MZMcBride) Has a task been filed about the bug in MediaWiki? ``` there's a bug in mediawiki somewhere this probably all goes back to TrustedXFF and related code it probably "tru... [06:25:31] !log ori@tin Synchronized php-1.27.0-wmf.7/extensions/NavigationTiming: Idb675cdce: Add isHiDPI and isHttp2 properties; drop isHttps (T119014) (duration: 00m 50s) [06:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:31:31] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:32] PROBLEM - puppet last run on restbase2006 is CRITICAL: CRITICAL: puppet fail [06:32:12] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:12] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:31] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:40] PROBLEM - puppet last run on db1056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:40] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:50] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:11] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:21] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures [06:41:00] PROBLEM - Ubuntu mirror in sync with upstream on carbon is CRITICAL: /srv/mirrors/ubuntu is over 12 hours old. 
[06:44:51] RECOVERY - Ubuntu mirror in sync with upstream on carbon is OK: /srv/mirrors/ubuntu is over 0 hours old. [06:56:11] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [06:56:51] RECOVERY - puppet last run on restbase2006 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:57:31] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [07:02:39] (03CR) 10Giuseppe Lavagetto: [C: 032] base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/256267 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [07:02:56] (03PS2) 10Giuseppe Lavagetto: base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/256267 (https://phabricator.wikimedia.org/T114638) [07:03:56] <_joe_> oh FFS jenkins [07:04:13] (03CR) 10Giuseppe Lavagetto: [V: 032] base::certificates: add puppet's CA to the trusted store [puppet] - 10https://gerrit.wikimedia.org/r/256267 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [07:08:00] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 1.00% above the threshold [1000000.0] [07:24:01] PROBLEM - dhclient process on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:11] PROBLEM - Check size of conntrack table on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:42] PROBLEM - configured eth on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:24:42] PROBLEM - puppet last run on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:11] PROBLEM - DPKG on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:11] PROBLEM - Disk space on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[07:25:40] PROBLEM - RAID on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:25:42] PROBLEM - salt-minion processes on planet1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [07:26:00] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [07:26:40] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:26:41] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [07:26:51] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:26:51] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [07:27:00] RECOVERY - puppet last run on db1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:11] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:21] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [07:28:42] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:30:00] (03CR) 10TheDJ: "For posterity: caused edits to be attributed to 127.0.0.1" [puppet] - 10https://gerrit.wikimedia.org/r/256366 (owner: 10BBlack) [07:33:05] (03PS3) 10Giuseppe Lavagetto: k8s: switch to using systems' CA [puppet] - 10https://gerrit.wikimedia.org/r/243662 (https://phabricator.wikimedia.org/T114638) [07:53:03] (03PS3) 10Giuseppe Lavagetto: etcd: switch to using the system-wide puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/243663 (https://phabricator.wikimedia.org/T114638) [07:54:14] (03PS4) 10Giuseppe Lavagetto: etcd: switch to using the system-wide puppet ca [puppet] - 
10https://gerrit.wikimedia.org/r/243663 (https://phabricator.wikimedia.org/T114638) [07:55:16] (03PS3) 10Muehlenhoff: openldap: Make slapd.conf 0440 [puppet] - 10https://gerrit.wikimedia.org/r/256009 [07:57:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] openldap: Make slapd.conf 0440 [puppet] - 10https://gerrit.wikimedia.org/r/256009 (owner: 10Muehlenhoff) [07:57:52] (03PS4) 10Muehlenhoff: openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 [07:58:42] (03PS5) 10Giuseppe Lavagetto: etcd: switch to using the system-wide puppet ca [puppet] - 10https://gerrit.wikimedia.org/r/243663 (https://phabricator.wikimedia.org/T114638) [07:59:09] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/243663 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [08:00:15] (03CR) 10Muehlenhoff: [C: 032 V: 032] openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 (owner: 10Muehlenhoff) [08:00:26] (03PS5) 10Muehlenhoff: openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 [08:00:37] (03CR) 10Muehlenhoff: [V: 032] openldap: Allow passing a higher size limit for LDAP queries [puppet] - 10https://gerrit.wikimedia.org/r/256213 (owner: 10Muehlenhoff) [08:10:09] (03PS3) 10Giuseppe Lavagetto: conftool: switch to using system-wide certs [puppet] - 10https://gerrit.wikimedia.org/r/243664 (https://phabricator.wikimedia.org/T114638) [08:10:31] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/243664 (https://phabricator.wikimedia.org/T114638) (owner: 10Giuseppe Lavagetto) [08:23:32] PROBLEM - Labs LDAP on seaborgium is CRITICAL: Could not bind to the LDAP server [08:28:09] ^ seaborgium is me, was briefly debugging some schema change, slapd is back up [08:29:32] RECOVERY - Labs LDAP on seaborgium is OK: LDAP OK - 0.017 
seconds response time [08:33:00] PROBLEM - NTP on planet1001 is CRITICAL: NTP CRITICAL: No response from NTP server [08:36:50] <_joe_> looking into planet [08:37:01] RECOVERY - RAID on planet1001 is OK: OK: no RAID installed [08:37:11] RECOVERY - salt-minion processes on planet1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [08:37:21] RECOVERY - dhclient process on planet1001 is OK: PROCS OK: 0 processes with command name dhclient [08:37:21] RECOVERY - Check size of conntrack table on planet1001 is OK: OK: nf_conntrack is 0 % full [08:38:02] RECOVERY - configured eth on planet1001 is OK: OK - interfaces up [08:38:02] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [08:38:22] RECOVERY - Disk space on planet1001 is OK: DISK OK [08:38:26] <_joe_> uhm interesting [08:38:30] RECOVERY - DPKG on planet1001 is OK: All packages OK [08:38:50] RECOVERY - NTP on planet1001 is OK: NTP OK: Offset 0.002647638321 secs [08:38:51] <_joe_> !log planet1001 recovered as soon as I did get into console via gnt-instance [08:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:42:37] 6operations, 7Database, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1844080 (10jcrespo) Of all ciphers, only a few work: ``` for cipher in ECDHE-RSA-AES256-GCM-SHA384 ECDHE-ECDSA-AES256-GCM-SHA384 ECDHE-RSA-AES256-SHA384 ECDHE-ECDSA-AES256-SHA384 DHE-DS... [08:49:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [5000000.0] [08:55:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [09:02:16] 6operations, 6Labs, 10Labs-Infrastructure, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. 
- https://phabricator.wikimedia.org/T98009#1844087 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [09:02:57] 6operations, 6Labs, 10Labs-Infrastructure, 7LDAP, 7discovery-system: Allow creation of SRV records in labs. - https://phabricator.wikimedia.org/T98009#1257018 (10MoritzMuehlenhoff) The new openldap::labs servers based on OpenLDAP (seaborgium, serpens) provide that schema, a quick test was fine. [09:05:50] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [09:13:28] (03CR) 10Addshore: "Hmm, does something else have to happen to get this running?" [puppet] - 10https://gerrit.wikimedia.org/r/256187 (https://phabricator.wikimedia.org/T117555) (owner: 10Addshore) [09:22:31] <_joe_> !log stopped ocg service on ocg1003 [09:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:31:41] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 29 failures [09:31:44] (03PS1) 10Muehlenhoff: Set size_limit for openldap::labs to 32768 [puppet] - 10https://gerrit.wikimedia.org/r/256386 [09:32:49] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844125 (10fgiunchedi) @maxsem or @milimetric I can't access either of those instances, could you add my wikitech user 'Filippo Giunchedi' to the project(admin) ? thanks! [09:33:30] godog: you should be able to ssh as root@ [09:35:45] yuvipanda: heh that's what I thought too, doesn't seem to be working while presenting my labs key to root@puppet-test02.maps-team.eqiad.wmflabs [09:36:20] ditto root@limn1.eqiad.wmflabs [09:36:40] godog: https://github.com/wikimedia/labs-private/blob/master/files/ssh/root-authorized-keys [09:36:43] godog: is that right key? [09:37:21] PROBLEM - RAID on db2019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [09:38:54] yuvipanda: err, PEBCAK, prod vs labs key, thanks! [09:39:01] godog: :) ok! 
[09:40:25] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844127 (10fgiunchedi) >>! In T119541#1844125, @fgiunchedi wrote: > @maxsem or @milimetric I can't access either of those instances, could you add my wikitech user 'Filippo Giunchedi' to the project(ad... [09:45:51] PROBLEM - HHVM rendering on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:31] PROBLEM - Apache HTTP on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:46:51] (03PS2) 10Muehlenhoff: Set size_limit for openldap::labs to 32768 [puppet] - 10https://gerrit.wikimedia.org/r/256386 [09:47:50] PROBLEM - RAID on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:50] PROBLEM - dhclient process on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:51] PROBLEM - configured eth on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:01] PROBLEM - nutcracker port on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:11] PROBLEM - Check size of conntrack table on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:22] PROBLEM - nutcracker process on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:30] PROBLEM - DPKG on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:41] PROBLEM - Disk space on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:50] PROBLEM - salt-minion processes on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:00] PROBLEM - SSH on mw1130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:49:01] PROBLEM - puppet last run on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:01] PROBLEM - HHVM processes on mw1130 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:52:17] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844142 (10fgiunchedi) debugged a bit further, e.g. on `puppet-test02` I can get past the error by explicitly `include base` on the node definition in `site.pp` (since `compile puppet.conf` is defined... [09:52:59] yuvipanda: ^ _why_ is that I don't know heh [09:53:15] godog: lololol [09:53:25] godog: I've a theory [09:54:13] godog: about two weeks ago, I was removing and cleaning up roles from ldap [09:54:24] godog: role::labs::instance (which includes base) used to be applied via LDAP [09:54:32] so when you do role::puppet::self [09:54:35] they're both applied via LDAP [09:54:38] but *now* [09:54:43] role::labs::instance is included via site.pp [09:54:51] and role::puppet::self is included via base [09:54:53] err [09:54:55] via LDAP [09:54:58] so I don't know if this is the cause [09:55:04] but it might explain the flapping about of dependencies [09:55:47] yup, if role::labs::instance doesn't cause any problems it is worth trying a rollback [09:56:20] godog: that's kinda complicated, since we'll have to add it via LDAP to every single role and then fix some Openstack related shims... [09:56:46] godog: I wonder if a 'require base' (or require role::labs::instance? Lol?) might [09:56:52] also fuck you too, puppet, I guess [09:57:06] 6operations, 6Services: Discussion: Use XFS for Cassandra data partition? - https://phabricator.wikimedia.org/T120004#1844146 (10fgiunchedi) as I said, let's first determine if we have a problem with ext4 before committing resources [09:57:41] yuvipanda: heheh it is late for you too, it can wait I suppose [09:58:27] godog: heh :) but I don't want to bring them back into LDAP at all... [09:58:52] if that's what's causing the problem then maybe it's time for me to roll up sleeves and get that ENC done [10:00:04] aude: Dear anthropoid, the time has come. 
Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151202T1000). [10:01:24] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#1844149 (10jcrespo) 3NEW [10:03:12] (03CR) 10Muehlenhoff: "Hmm, puppet compiler fails, but I don't see an obvious reason why: http://puppet-compiler.wmflabs.org/1408/" [puppet] - 10https://gerrit.wikimedia.org/r/256386 (owner: 10Muehlenhoff) [10:06:03] RECOVERY - nutcracker port on mw1130 is OK: TCP OK - 0.000 second response time on port 11212 [10:06:10] RECOVERY - dhclient process on mw1130 is OK: PROCS OK: 0 processes with command name dhclient [10:06:10] RECOVERY - configured eth on mw1130 is OK: OK - interfaces up [10:06:22] RECOVERY - Check size of conntrack table on mw1130 is OK: OK: nf_conntrack is 0 % full [10:06:31] RECOVERY - nutcracker process on mw1130 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:06:40] RECOVERY - DPKG on mw1130 is OK: All packages OK [10:06:40] RECOVERY - Apache HTTP on mw1130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.740 second response time [10:06:42] RECOVERY - Disk space on mw1130 is OK: DISK OK [10:06:52] RECOVERY - salt-minion processes on mw1130 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:07:02] RECOVERY - SSH on mw1130 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [10:07:10] RECOVERY - HHVM processes on mw1130 is OK: PROCS OK: 6 processes with command name hhvm [10:07:55] (03PS3) 10Muehlenhoff: Switch everything to the new openldap ldap servers. 
[puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) (owner: 10Andrew Bogott) [10:08:02] RECOVERY - RAID on mw1130 is OK: OK: no RAID installed [10:08:02] RECOVERY - HHVM rendering on mw1130 is OK: HTTP OK: HTTP/1.1 200 OK - 65282 bytes in 1.657 second response time [10:08:51] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#1844184 (10jcrespo) Maybe it is worth checking too slots 2 and 7: ``` megacli -PDList -aALL | grep "\(Media Error Count\|S.M.A.R.T\|Slot\)" Slot Number: 0 Media Error Count: 0 Drive has flagged a S.M.A.R.T alert :... [10:11:42] 6operations, 7Monitoring: ganglia graphs should not have "N" as units - https://phabricator.wikimedia.org/T81659#1844192 (10fgiunchedi) p:5Normal>3Lowest [10:12:52] godog: many thanks for those cleanup of some of my mad graphite metrics!! [10:13:02] RECOVERY - puppet last run on mw1130 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [10:13:33] 6operations: check puppet freshness monitoring - https://phabricator.wikimedia.org/T84037#1844205 (10fgiunchedi) [10:13:56] 6operations: check puppet freshness monitoring - https://phabricator.wikimedia.org/T84037#1844209 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi looks like we're now checking puppet freshness via proper means [10:14:02] addshore: haha no worries [10:14:23] <_joe_> !log clearing the job cache for ocg1003 [10:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:14:35] I was looking at making a ldap group for being able to move & delete metrics this morning actually (so I dont have to take up valuable ops time) ;) [10:14:52] I don't really think you can do that with LDAP [10:15:02] you need specific rights on the graphite host I think [10:16:26] * aude deploying https://gerrit.wikimedia.org/r/#/c/255063/ (ok from greg + fundraising team, as long as we don't touch metawiki now) [10:18:34] (03CR) 10Aude: [C: 
032] "ok from greg + fundraising team to do this, as long as we exclude meta-wiki for now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255063 (https://phabricator.wikimedia.org/T109780) (owner: 10Aude) [10:19:16] (03Merged) 10jenkins-bot: Enable data access for wikinews, mediawiki.org and wikispecies [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255063 (https://phabricator.wikimedia.org/T109780) (owner: 10Aude) [10:20:07] aude: your right, the second time running that query takes 3 seconds rather than 3 minutes ;) [10:20:46] !log aude@tin Synchronized dblists/arbitraryaccess.dblist: Enabling data access for wikinews, wikispecies and mediawiki.org (duration: 00m 30s) [10:21:42] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enabling data access for wikinews, wikispecies and mediawiki.org (duration: 00m 27s) [10:22:40] 6operations, 10Deployment-Systems: /srv/deployment wrong permissions on new installs - https://phabricator.wikimedia.org/T90588#1844221 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi looks like this was fixed, from `mw2208` freshly reimaged yesterday `drwxr-xr-x 3 root root 4096 Dec 1 13:55 /srv/deployme... [10:23:34] done, looks good [10:23:42] addshore: must be cached etc now [10:23:47] yup [10:27:51] (03CR) 10Alexandros Kosiaris: "seems like pep8 is still complaining.." 
[puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad) [10:28:58] (03CR) 10Alexandros Kosiaris: [C: 032] check-raid.py: minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/256292 (owner: 10Chad) [10:29:00] PROBLEM - Apache HTTP on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:02] (03PS2) 10Alexandros Kosiaris: check-raid.py: minor pep8 fix [puppet] - 10https://gerrit.wikimedia.org/r/256292 (owner: 10Chad) [10:29:21] PROBLEM - HHVM rendering on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:29:43] 6operations, 7Monitoring: Multiple entries exists for each matrix with minor change in matrix naming - https://phabricator.wikimedia.org/T86034#1844237 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi I believe that's by design, queued messages overall vs bounces queued ``` sub read_queue { open Q, "/usr/s... [10:30:06] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "LGTM, thanks that not syntax was counter intuitive!" [puppet] - 10https://gerrit.wikimedia.org/r/256291 (owner: 10Chad) [10:30:14] (03PS2) 10Alexandros Kosiaris: pep8: fix list-last-n-good-dumps style, mostly in/not in stuff [puppet] - 10https://gerrit.wikimedia.org/r/256291 (owner: 10Chad) [10:31:21] PROBLEM - nutcracker process on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:31:31] PROBLEM - Disk space on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:31:39] 6operations, 10ops-eqiad: cp1037-1040 reclaim as spares - https://phabricator.wikimedia.org/T83553#1844245 (10fgiunchedi) @cmjohnson anything left to do here? [10:31:40] PROBLEM - Check size of conntrack table on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:31:41] 6operations, 10ops-eqiad: cp1037-1040 reclaim as spares - https://phabricator.wikimedia.org/T83553#1844247 (10fgiunchedi) [10:31:42] PROBLEM - SSH on mw1120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:32:07] (03CR) 10Alexandros Kosiaris: [C: 032] ps_mem.py: fix ori's code to be pep8 compliant :) [puppet] - 10https://gerrit.wikimedia.org/r/256289 (owner: 10Chad) [10:32:11] (03PS2) 10Alexandros Kosiaris: ps_mem.py: fix ori's code to be pep8 compliant :) [puppet] - 10https://gerrit.wikimedia.org/r/256289 (owner: 10Chad) [10:32:11] PROBLEM - salt-minion processes on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:32:11] PROBLEM - DPKG on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:32:11] PROBLEM - HHVM processes on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:32:31] PROBLEM - configured eth on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:32:39] (03PS3) 10Alexandros Kosiaris: pep8: fix list-last-n-good-dumps style, mostly in/not in stuff [puppet] - 10https://gerrit.wikimedia.org/r/256291 (owner: 10Chad) [10:32:47] (03PS3) 10Alexandros Kosiaris: ps_mem.py: fix ori's code to be pep8 compliant :) [puppet] - 10https://gerrit.wikimedia.org/r/256289 (owner: 10Chad) [10:32:51] PROBLEM - RAID on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:00] PROBLEM - nutcracker port on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:00] PROBLEM - dhclient process on mw1120 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:33:20] (03CR) 10Alexandros Kosiaris: [C: 032] pep8: minor whitespace fix in deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/256288 (owner: 10Chad) [10:35:44] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/256346 (https://phabricator.wikimedia.org/T101299) (owner: 10Andrew Bogott) [10:36:59] 6operations, 10ops-codfw: update ILO firmware on fdb2001 - https://phabricator.wikimedia.org/T84806#1844263 (10fgiunchedi) [10:38:22] (03PS4) 10Alexandros Kosiaris: ps_mem.py: fix ori's code to be pep8 compliant :) [puppet] - 10https://gerrit.wikimedia.org/r/256289 (owner: 10Chad) [10:38:26] (03CR) 10Alexandros Kosiaris: [V: 032] ps_mem.py: fix ori's code to be pep8 compliant :) [puppet] - 10https://gerrit.wikimedia.org/r/256289 (owner: 10Chad) [10:38:43] (03PS2) 10Alexandros Kosiaris: pep8: minor whitespace fix in deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/256288 (owner: 10Chad) [10:38:51] (03CR) 10Alexandros Kosiaris: [V: 032] pep8: minor whitespace fix in deploy.py [puppet] - 10https://gerrit.wikimedia.org/r/256288 (owner: 10Chad) [10:43:20] 6operations, 10ops-codfw: update ILO firmware on fdb2001 - https://phabricator.wikimedia.org/T84806#1844278 (10fgiunchedi) @jgreen it'd likely require fixing other hosts too, still worth fixing piecemeal or the workaround is good enough? [10:44:58] (03CR) 10Bmansurov: "> What's the rationale for this being wikipedia-only?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [10:45:29] 6operations, 7Monitoring: Job queue ganglia monitoring @terbium stopped working - https://phabricator.wikimedia.org/T84705#1844280 (10fgiunchedi) 5Open>3Invalid a:3fgiunchedi job queue metrics got moved to graphite, https://grafana.wikimedia.org/dashboard/db/job-queue-health [10:47:31] (03CR) 10Alexandros Kosiaris: [C: 032] varnish: clean up a bunch of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256278 (owner: 10Chad) [10:47:37] (03PS2) 10Alexandros Kosiaris: varnish: clean up a bunch of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256278 (owner: 10Chad) [10:47:42] (03CR) 10Alexandros Kosiaris: [V: 032] varnish: clean up a bunch of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256278 (owner: 10Chad) [10:50:41] 6operations, 10ops-codfw: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#1844296 (10fgiunchedi) p:5Triage>3Normal [10:51:04] (03CR) 10Alexandros Kosiaris: [C: 031] "Happy to see this happening." 
[puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [10:52:58] (03PS4) 10Merlijn van Deen: package_builder: add option to use built packages during build [puppet] - 10https://gerrit.wikimedia.org/r/256176 [10:53:36] (03CR) 10Alexandros Kosiaris: [C: 031] Set size_limit for openldap::labs to 32768 [puppet] - 10https://gerrit.wikimedia.org/r/256386 (owner: 10Muehlenhoff) [10:53:40] PROBLEM - puppet last run on cp2022 is CRITICAL: CRITICAL: Puppet has 1 failures [10:53:51] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Puppet has 1 failures [10:53:57] (03PS5) 10Merlijn van Deen: package_builder: add option to use built packages during build [puppet] - 10https://gerrit.wikimedia.org/r/256176 [10:54:41] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:02] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:02] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:08] 6operations, 7Availability, 7Performance, 7Wikimedia-log-errors: Memcached error for key "WANCache:v:enwiki:image_redirect:254363f3d14af58bbe12c644ee69ccf7" on server "/var/run/nutcracker/nutcracker.sock:0": A TIMEOUT OCCURRED - https://phabricator.wikimedia.org/T102916#1844315 (10fgiunchedi) 5Open>3Res... 
[10:55:12] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:12] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:12] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:20] PROBLEM - puppet last run on cp1071 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:30] RECOVERY - nutcracker process on mw1120 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [10:55:30] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:32] RECOVERY - Disk space on mw1120 is OK: DISK OK [10:55:32] PROBLEM - puppet last run on cp2015 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:41] RECOVERY - SSH on mw1120 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [10:55:41] RECOVERY - Check size of conntrack table on mw1120 is OK: OK: nf_conntrack is 0 % full [10:55:50] PROBLEM - puppet last run on cp1043 is CRITICAL: CRITICAL: Puppet has 1 failures [10:55:53] (03CR) 10Merlijn van Deen: package_builder: add option to use built packages during build (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/256176 (owner: 10Merlijn van Deen) [10:56:00] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 1 failures [10:56:10] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: Puppet has 1 failures [10:56:10] RECOVERY - salt-minion processes on mw1120 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [10:56:10] RECOVERY - DPKG on mw1120 is OK: All packages OK [10:56:10] RECOVERY - HHVM processes on mw1120 is OK: PROCS OK: 6 processes with command name hhvm [10:56:21] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: Puppet has 1 failures [10:56:31] RECOVERY - configured eth on mw1120 is OK: OK - interfaces up [10:56:34] mhh, looking on cp1071, varnish.py failed [10:56:43] akosiaris: ^ [10:56:51] RECOVERY - RAID on mw1120 is OK: OK: no 
RAID installed [10:57:00] RECOVERY - dhclient process on mw1120 is OK: PROCS OK: 0 processes with command name dhclient [10:57:00] RECOVERY - nutcracker port on mw1120 is OK: TCP OK - 0.000 second response time on port 11212 [10:57:21] RECOVERY - Apache HTTP on mw1120 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.219 second response time [10:57:21] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Puppet has 1 failures [10:57:21] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures [10:57:42] RECOVERY - HHVM rendering on mw1120 is OK: HTTP OK: HTTP/1.1 200 OK - 65286 bytes in 1.808 second response time [10:58:01] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures [10:58:10] PROBLEM - puppet last run on cp3039 is CRITICAL: CRITICAL: Puppet has 1 failures [10:58:11] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: Puppet has 1 failures [10:59:21] PROBLEM - puppet last run on cp3019 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:11] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:11] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:23] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:00:31] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:31] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:31] PROBLEM - puppet last run on cp2002 is CRITICAL: CRITICAL: Puppet has 1 failures [11:00:38] fixing the syntax error [11:01:01] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures [11:01:03] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: Puppet has 1 failures [11:01:12] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [11:01:21] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: 
Puppet has 1 failures [11:01:21] PROBLEM - puppet last run on cp4010 is CRITICAL: CRITICAL: Puppet has 1 failures [11:01:31] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [11:01:41] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:21] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:21] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures [11:02:21] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: Puppet has 1 failures [11:03:01] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet has 1 failures [11:03:20] PROBLEM - puppet last run on cp3032 is CRITICAL: CRITICAL: Puppet has 1 failures [11:03:31] PROBLEM - puppet last run on cp3003 is CRITICAL: CRITICAL: Puppet has 1 failures [11:03:37] (03PS1) 10Filippo Giunchedi: varnish: fix ganglia-varnish.py continuation lines [puppet] - 10https://gerrit.wikimedia.org/r/256394 [11:04:04] !log restart cassandra on restbase1001 (to effect openjdk security updates plus related libs) [11:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:04:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnish: fix ganglia-varnish.py continuation lines [puppet] - 10https://gerrit.wikimedia.org/r/256394 (owner: 10Filippo Giunchedi) [11:04:21] PROBLEM - puppet last run on cp4018 is CRITICAL: CRITICAL: Puppet has 1 failures [11:04:21] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Puppet has 1 failures [11:04:40] PROBLEM - puppet last run on cp1057 is CRITICAL: CRITICAL: Puppet has 1 failures [11:05:02] PROBLEM - puppet last run on cp1048 is CRITICAL: CRITICAL: Puppet has 1 failures [11:05:09] godog: sigh, my mistake sorry [11:05:38] akosiaris: np, it'll converge eventually :) [11:05:45] I wonder how we could catch those tho [11:05:50] PROBLEM - puppet last run on cp2004 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:10] 
PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:22] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:23] PROBLEM - puppet last run on cp3015 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:30] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:40] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:06:50] !log restarting cassandra on restbase100[2-4] (subsequently) (to effect openjdk security updates plus related libs) [11:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:07:01] PROBLEM - puppet last run on cp2006 is CRITICAL: CRITICAL: Puppet has 1 failures [11:07:18] zero fetch is also gonna complain [11:07:22] lemme fix that [11:07:31] RECOVERY - puppet last run on cp1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:07:32] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 1 failures [11:07:51] RECOVERY - puppet last run on cp2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:08:10] PROBLEM - puppet last run on cp1070 is CRITICAL: CRITICAL: Puppet has 1 failures [11:08:23] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Puppet has 1 failures [11:08:23] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Puppet has 1 failures [11:08:31] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:12] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:20] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: Puppet has 1 failures [11:09:51] PROBLEM - puppet last run on cp2005 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:01] PROBLEM - puppet last run on cp1044 is CRITICAL: CRITICAL: Puppet has 1 failures [11:10:03] yeah pylint 1.3.1 catched that syntax error [11:12:14] but the CI pep8 did 
not :( [11:12:17] (PS1) Jcrespo: Allow ssl key usage [puppet/mariadb] - https://gerrit.wikimedia.org/r/256395 [11:12:35] (CR) jenkins-bot: [V: -1] Allow ssl key usage [puppet/mariadb] - https://gerrit.wikimedia.org/r/256395 (owner: Jcrespo) [11:13:46] (PS2) Jcrespo: Allow ssl key usage [puppet/mariadb] - https://gerrit.wikimedia.org/r/256395 [11:14:23] (CR) jenkins-bot: [V: -1] Allow ssl key usage [puppet/mariadb] - https://gerrit.wikimedia.org/r/256395 (owner: Jcrespo) [11:17:29] hashar: I take it we're not ci-running pylint on puppet (?) [11:17:32] (PS1) Bmansurov: Enable RelatedArticles and Cards on beta wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/256396 [11:18:58] godog: we run pep8 via a wrapper https://integration.wikimedia.org/ci/job/operations-puppet-pep8/5274/console [11:19:00] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [11:19:02] RECOVERY - puppet last run on cp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:25] ah here it is https://github.com/wikimedia/integration-jenkins/blob/master/tools/puppet_pep8.py [11:19:31] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:31] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:19:31] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [11:19:41] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:19:52] RECOVERY - puppet last run on cp2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:20:10] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:20:21] RECOVERY - puppet last
run on cp3013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [11:20:22] godog: basically it crawl the directories and runs pep8 1.4.6 against the .py files in each directory [11:20:34] so pep8 can take in account a .pep8 file in the dir if it exists [11:20:40] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [11:20:43] actually seems like zerofetcher is just ifne [11:20:45] fine* [11:20:51] we should really phase that out in favor of running pep8/pyflakes/pylint from the root of the repo [11:21:20] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:21:21] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:21:31] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [11:21:47] (PS3) Bmansurov: Enable RelatedArticles on all wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) [11:21:52] RECOVERY - puppet last run on cp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:21:58] hashar: we should also make ops/puppet run flake8 instead of pep8 [11:22:01] https://gerrit.wikimedia.org/r/#/c/244148/ <-- attempts to run pep8 within a venv, which would let us add more linters to it [11:22:01] RECOVERY - puppet last run on cp1043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:22:03] since flake8 is what's used elsewhere [11:22:11] also flake8 respects tox.ini...
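[editor's note] The gap discussed above (a style-only pep8 run passing a file that does not parse) could be closed by a plain syntax check over every .py file in the tree. The following is a minimal sketch under stated assumptions: it is not the puppet_pep8.py wrapper or any existing CI job, the function name find_syntax_errors is hypothetical, and it uses only the Python standard library.

```python
import ast
import os


def find_syntax_errors(root):
    """Return sorted (path, line, message) tuples for .py files that fail to parse.

    Hypothetical helper for illustration; not part of any Wikimedia tooling.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            # Read bytes so the parser applies any encoding declaration itself.
            with open(path, "rb") as handle:
                source = handle.read()
            try:
                ast.parse(source, filename=path)
            except SyntaxError as err:
                failures.append((path, err.lineno, err.msg))
            except ValueError as err:
                # Some Python versions raise ValueError for null bytes in source.
                failures.append((path, None, str(err)))
    return sorted(failures, key=lambda item: item[0])
```

A CI step could call find_syntax_errors on the repository root (the path is an assumption) and fail the build when the returned list is non-empty. This only checks that files parse; it does not replace pep8, pyflakes, flake8 or pylint.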
[11:22:12] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:22:16] (CR) Bmansurov: [C: -1] Enable RelatedArticles on all wikipedias [mediawiki-config] - https://gerrit.wikimedia.org/r/255553 (https://phabricator.wikimedia.org/T116676) (owner: Bmansurov) [11:22:16] flake8 actually caught that [11:22:46] 2.2.2-1 that is [11:23:20] !log restarting cassandra on restbase100[78] (subsequently) (to effect openjdk security updates plus related libs) [11:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:30] well [11:23:30] (PS3) Hashar: tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) [11:23:36] we can complete that gerrit change [11:23:41] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:23:41] RECOVERY - puppet last run on cp3019 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [11:23:44] and enable tox on operations/puppet [11:23:46] would let one add more linters as needed [11:24:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [150.0] [11:24:21] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:30] RECOVERY - puppet last run on cp3039 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [11:24:30] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:24:30] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:25:16] (CR) jenkins-bot: [V: -1] tox entry point to run pep8==1.4.6 [puppet] - https://gerrit.wikimedia.org/r/244148
(https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [11:25:20] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [11:26:00] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [11:26:31] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:50] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:51] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:26:51] RECOVERY - puppet last run on cp2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:27:20] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:27:21] PROBLEM - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above the critical threshold [500.0] [11:27:31] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:27:41] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:27:41] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:27:42] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:27:51] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [150.0] [11:28:32] RECOVERY - puppet last run on cp4018 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:28:32] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [11:28:32] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:28:32] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:28:50] RECOVERY - puppet last run on cp1057 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:29:20] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:29:51] RECOVERY - puppet last run on cp3003 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:30:50] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:30:52] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [11:31:30] RECOVERY - puppet last run on cp2006 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:31:50] RECOVERY - puppet last run on cp3032 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:32:03] (03CR) 10Hashar: "Would run pep8 1.4.6 just like the current Jenkins job. 
The difference is this change runs it from the root of the repository instead of r" [puppet] - 10https://gerrit.wikimedia.org/r/244148 (https://phabricator.wikimedia.org/T114887) (owner: 10Hashar) [11:32:41] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [11:32:51] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:32:51] RECOVERY - puppet last run on cp3015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:33:02] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:33:29] 6operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1844369 (10Joe) 3NEW [11:34:01] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:34:20] RECOVERY - puppet last run on cp2005 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [11:34:30] RECOVERY - puppet last run on cp1044 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [11:34:31] RECOVERY - puppet last run on cp1070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:34:40] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:34:51] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:34:52] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:35:00] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:35:41] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently 
enabled, last run 1 minute ago with 0 failures [11:35:42] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:38:11] (03PS3) 10Jcrespo: Allow ssl key usage [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256395 [11:39:07] 7Puppet, 6operations, 6Services, 7Monitoring: OCG checks should be CRITICAL when reading from the server times out - https://phabricator.wikimedia.org/T120078#1844383 (10Joe) 3NEW [11:40:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [11:41:51] RECOVERY - HTTP 5xx reqs/min -https://grafana.wikimedia.org/dashboard/db/varnish-http-errors- on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:45:56] 6operations, 10OCG-General-or-Unknown, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1844399 (10Joe) 3NEW [11:46:09] 6operations, 10OCG-General-or-Unknown, 6Services: The OCG cleanup cache script doesn't work properly - https://phabricator.wikimedia.org/T120079#1844399 (10Joe) [11:48:50] PROBLEM - puppet last run on db1066 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[11:49:57] (03PS3) 10Muehlenhoff: Set size_limit for openldap::labs to 32768 [puppet] - 10https://gerrit.wikimedia.org/r/256386 [11:50:41] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:51:32] (03CR) 10Muehlenhoff: [C: 032 V: 032] Set size_limit for openldap::labs to 32768 [puppet] - 10https://gerrit.wikimedia.org/r/256386 (owner: 10Muehlenhoff) [11:52:22] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Puppet last ran 19 hours ago [11:54:21] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:56:42] 7Puppet, 6operations, 6Services, 7Monitoring: OCG checks should be CRITICAL when reading from the server times out - https://phabricator.wikimedia.org/T120078#1844420 (10Joe) a:3Joe [12:03:10] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [12:10:31] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 16.67% of data above the critical threshold [5000000.0] [12:14:30] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 83.33% of data above the critical threshold [5000000.0] [12:14:38] (03PS1) 10Filippo Giunchedi: deployment: fix socket_connect_timeout argument [puppet] - 10https://gerrit.wikimedia.org/r/256403 (https://phabricator.wikimedia.org/T118380) [12:14:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [5000000.0] [12:15:19] (03PS2) 10Filippo Giunchedi: deployment: fix socket_connect_timeout argument [puppet] - 10https://gerrit.wikimedia.org/r/256403 (https://phabricator.wikimedia.org/T118380) [12:15:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] deployment: fix socket_connect_timeout argument [puppet] - 10https://gerrit.wikimedia.org/r/256403 (https://phabricator.wikimedia.org/T118380) (owner: 10Filippo Giunchedi) [12:18:31] 
PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0] [12:21:38] (03PS1) 10Jcrespo: Update mariadb module to the latest commit [puppet] - 10https://gerrit.wikimedia.org/r/256405 [12:22:08] (03CR) 10Jcrespo: [C: 04-2] "Depends on 256395" [puppet] - 10https://gerrit.wikimedia.org/r/256405 (owner: 10Jcrespo) [12:22:30] (03CR) 10jenkins-bot: [V: 04-1] Update mariadb module to the latest commit [puppet] - 10https://gerrit.wikimedia.org/r/256405 (owner: 10Jcrespo) [12:22:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 85.71% of data above the critical threshold [5000000.0] [12:24:40] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [5000000.0] [12:30:11] yuvipanda: well, deleting, moving & merging graphite things would be a good thing for me ;) [12:30:22] there is a script for the later [12:30:51] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 1.00% above the threshold [1000000.0] [12:31:01] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 1.00% above the threshold [1000000.0] [12:41:15] !log restart cassandra on maps-test200{1,2,3,4}.codfw.wmnet [12:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:48:41] 6operations, 7LDAP: Fix LDAP replication OIT hostname - https://phabricator.wikimedia.org/T82675#1844506 (10fgiunchedi) p:5Normal>3Low [12:50:23] 6operations, 10netops: Setup family inet6 ACLs for analytics vlans - https://phabricator.wikimedia.org/T83669#1844512 (10fgiunchedi) p:5Normal>3Low [12:52:00] 6operations: SSL address space separation - https://phabricator.wikimedia.org/T83736#1844523 (10fgiunchedi) [12:53:55] 6operations: reserve blocks for root user by default - https://phabricator.wikimedia.org/T84634#1844526 (10fgiunchedi) [12:54:36] 6operations: reserve blocks for root user by default 
- https://phabricator.wikimedia.org/T84634#1844529 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi blocks are reserved for root as expected on a new install ``` mw2208:~$ sudo tune2fs -l /dev/sda1 | grep -i reserved Reserved block count: 5860672 Reserv... [12:55:37] 6operations, 5Patch-For-Review: Create roles for test systems and spares - https://phabricator.wikimedia.org/T115489#1844532 (10MoritzMuehlenhoff) 5Open>3Resolved These are available and in use for a while now. [13:00:32] (03CR) 10Phuedx: [C: 04-1] Enable RelatedArticles and Cards on beta wikipedias (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (owner: 10Bmansurov) [13:12:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [13:12:10] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 7.14% of data above the critical threshold [150.0] [13:16:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:16:12] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [75.0] [13:26:21] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 4 failures [13:36:18] (03CR) 10Krinkle: [C: 04-1] "Image is not well-compressed and uses inconsistent text rendering." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) (owner: 10Dereckson) [13:37:42] (03CR) 10Krinkle: "Also, the update included a change of the Wikipedia logo (from v1 to v2). However it is no longer centered. 
It was left-aligned with the o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254471 (https://phabricator.wikimedia.org/T118491) (owner: 10Dereckson) [13:51:22] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:51:31] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:53:20] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:53:22] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:53:32] PROBLEM - RAID on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:53:52] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:00] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:01] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:21] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:30] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:41] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:54:42] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:54:50] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[14:01:40] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212 [14:01:40] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient [14:02:02] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [14:02:11] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up [14:02:12] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 0 % full [14:02:31] RECOVERY - Disk space on mw1136 is OK: DISK OK [14:02:31] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:02:51] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [14:02:51] RECOVERY - DPKG on mw1136 is OK: All packages OK [14:03:00] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm [14:03:11] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [14:03:51] RECOVERY - RAID on mw1136 is OK: OK: no RAID installed [14:17:53] 6operations, 6Project-Creators: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#1844627 (10jcrespo) [14:24:07] (03PS7) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [14:25:26] akosiaris, I've fixed some stuff in ^^ but have problems with testing it as in labs it explodes in various places [14:33:48] <_joe_> otto is still not around? [14:33:49] <_joe_> mh [14:35:26] (03PS1) 10ArielGlenn: dumps: one more check to see if db is skipped before we dump it [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256410 (https://phabricator.wikimedia.org/T116564) [14:35:39] paravoid: https://gerrit.wikimedia.org/r/#/c/255555/6 has your changes implemented. 
Can you give it a quick once-over so that I can start on closing this up? :-) [14:36:20] (03CR) 10ArielGlenn: [C: 032 V: 032] dumps: one more check to see if db is skipped before we dump it [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/256410 (https://phabricator.wikimedia.org/T116564) (owner: 10ArielGlenn) [14:37:12] MaxSem: like ? [14:37:33] Error: Failed to apply catalog: Could not find dependency File[/etc/ldap/ldap.conf] for Class[Puppet::Self::Config] at /etc/puppet/modules/puppet/manifests/self/master.pp:62 [14:38:38] (03CR) 10Faidon Liambotis: [C: 031] Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [14:39:03] paravoid: ευχαριστώ :-) [14:39:08] yvw [14:42:33] 6operations: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1844688 (10Halfak) 3NEW [14:42:41] (03PS7) 10coren: Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) [14:43:02] MaxSem: you 've defined a puppet node in site.pp over there ? why ? [14:43:23] akosiaris, what's a better option? [14:43:45] wikitech roles ? [14:44:01] mmm, and why is site.pp failing? :P [14:44:23] cause labs have LDAP puppet enc integration ? [14:44:55] you shouldn't even have to mess with site.pp [14:45:05] I am not surprised it is causing problem [14:45:07] <_joe_> rotfl [14:45:07] problems* [14:45:19] !log deploying cleanup of labs PAM configuration - this should be a functional noop but may cause some puppet noise [14:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:30] (03CR) 10coren: [C: 032] Labs: switch PAM handling to use pam-auth-update [puppet] - 10https://gerrit.wikimedia.org/r/255555 (https://phabricator.wikimedia.org/T85910) (owner: 10coren) [14:47:08] so, MaxSem: my advice. 
use wikitech to create a role you want to test, and apply it to the VM via the wikitech interface [14:47:26] no need to mess with site.pp... in fact it is only going to cause problems [14:48:14] wee, works! [14:48:42] MaxSem: :-) [14:52:22] PSA: I've just renamed #Database to #DBA [14:53:42] jynus: I suppose I understand, but to me this feels like those are now tasks for fixing the DBAs. :-) [14:54:08] ah [14:54:26] I wonder that, I think without s it reflects reality better [14:54:32] *wondered [14:54:44] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1844724 (10akosiaris) I just realized the `site.pp` comment above is the reason there are problems on the `puppet-test02` is having problems. labs have an LDAP enc, reusing site.pp to override/overload... [14:55:12] considering I have more tasks than all of #operations and most of mine are #operations, too... [14:55:21] jynus: That tag is mostly for your benefit (and your soon-to-be-colleague's). Don't let my bikeshedding discourage you. :-) [14:57:00] o no, with procurement now there are 170 on #operations, 130 on #DBA [14:57:32] jynus: Clearly, you are underworked! 
:-P [15:00:24] (03PS1) 10Giuseppe Lavagetto: ocg: send out an alarm when ocg doesn't respond to health checks [puppet] - 10https://gerrit.wikimedia.org/r/256412 (https://phabricator.wikimedia.org/T120078) [15:05:16] 6operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1844757 (10Joe) p:5Unbreak!>3High [15:06:46] 6operations: 503 errors on datasets.wikimedia.org - https://phabricator.wikimedia.org/T120091#1844777 (10Ottomata) [15:08:33] (03PS2) 10Giuseppe Lavagetto: ocg: send out an alarm when ocg doesn't respond to health checks [puppet] - 10https://gerrit.wikimedia.org/r/256412 (https://phabricator.wikimedia.org/T120078) [15:10:22] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/256412 (https://phabricator.wikimedia.org/T120078) (owner: 10Giuseppe Lavagetto) [15:14:01] 6operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1844808 (10Joe) [15:14:02] 7Puppet, 6operations, 6Services, 7Monitoring, 5Patch-For-Review: OCG checks should be CRITICAL when reading from the server times out - https://phabricator.wikimedia.org/T120078#1844807 (10Joe) 5Open>3Resolved [15:14:30] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:16:38] hasharMeeting: https://gerrit.wikimedia.org/r/#/c/244148/ looks good, though if we move right away to flake8 we get syntax error for free (?) 
that's the main thing I was interested in [15:19:06] 6operations: update exim::listserve::private::mailing_lists value in puppet - https://phabricator.wikimedia.org/T82350#1844820 (10fgiunchedi) p:5Normal>3Low [15:19:29] 6operations, 10netops, 7Monitoring: Setup flow monitoring of *internal* network traffic - https://phabricator.wikimedia.org/T79755#1844830 (10fgiunchedi) [15:20:19] 6operations, 7Monitoring: Setup BGP monitoring for PyBal, including amount of prefixes - https://phabricator.wikimedia.org/T79124#1844832 (10fgiunchedi) p:5Normal>3Low [15:25:18] (03PS3) 10Filippo Giunchedi: diamond: send log to stdout at level INFO [puppet] - 10https://gerrit.wikimedia.org/r/255528 [15:25:26] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] diamond: send log to stdout at level INFO [puppet] - 10https://gerrit.wikimedia.org/r/255528 (owner: 10Filippo Giunchedi) [15:25:54] paravoid: It appears to be working flawlessly; no effect before the script is run (because no --force), and the switch with the script creates pristine config. I'm going to salt soon; do you think it's worthwhile to cleanup the .orig files this leaves around once I'm done and everything works? [16:43:22] (03PS1) 10Rush: labtestcontrol install changes [puppet] - 10https://gerrit.wikimedia.org/r/256435 [16:50:01] (03PS2) 10Rush: labtestcontrol install changes [puppet] - 10https://gerrit.wikimedia.org/r/256435 [16:50:01] (03CR) 10Rush: [C: 032] labtestcontrol install changes [puppet] - 10https://gerrit.wikimedia.org/r/256435 (owner: 10Rush) [16:50:02] (03PS1) 10Reedy: Fix apple-touch-icon.png on wikipedias [puppet] - 10https://gerrit.wikimedia.org/r/256437 (https://phabricator.wikimedia.org/T115965) [16:50:02] 6operations, 5Patch-For-Review, 7Regression: [Regression] 404 Not Found: https://en.wikipedia.org/apple-touch-icon.png - https://phabricator.wikimedia.org/T115965#1845096 (10Reedy) >>! 
In T115965#1845062, @fgiunchedi wrote: > the one `public-wiki-rewrites.incl` above seems to cover wikimedia-related virtualh... [16:50:22] godog: I think it's mostly _joe_ being busy and not having time to carry on refactoring stuff out [16:51:34] * Reedy creates a dependent patch [16:52:22] Reedy: likely, only wikipedia and not the rest on purpose? [16:52:40] It doesn't want adding to every wiki group [16:52:47] As not every has wgAppleTouchIcon defined [16:52:52] I'm just gonna fill in the other gaps now [16:53:23] (03PS1) 10Chad: gmond_memcached.py: fix all kinds of pep8 warnings [puppet] - 10https://gerrit.wikimedia.org/r/256438 [16:55:51] Reedy: sweet, thanks! [16:55:57] 1 file changed, 3 insertions(+), 23 deletions(-) [16:56:00] I love doing stuff like this [16:56:04] Remove crap, make more stuff work [16:56:48] (03PS1) 10Reedy: Add apple-touch-icon.png to Wikidata, Wikinews and Wiktionary [puppet] - 10https://gerrit.wikimedia.org/r/256440 [16:57:23] (03CR) 10Chad: "Weird, since pep8 was completely silent on the issue locally. Will amend." [puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad) [16:59:07] Reedy: /r/frisson [16:59:27] (03PS1) 10Reedy: Remove www.de rewrites [puppet] - 10https://gerrit.wikimedia.org/r/256441 [16:59:31] mutante|away: ^^ :D [16:59:46] I wonder if this might be a good thing to keep me busy [17:00:04] andrewbogott akosiaris: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151202T1700). [17:04:09] 6operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1845136 (10bd808) > I seem to understand this is not easy to fix [[https://phabricator.wikimedia.org/diffusion/OMWC/browse/master/wmf-config... 
[17:04:41] 6operations, 5Patch-For-Review, 7Regression: [Regression] 404 Not Found: https://en.wikipedia.org/apple-touch-icon.png - https://phabricator.wikimedia.org/T115965#1845139 (10Reedy) https://gerrit.wikimedia.org/r/256440 is for other Wiki projects that have the $wgAppleTouchIcon defined. Might aswell fix them... [17:05:45] 6operations, 10Analytics, 6Analytics-Backlog, 10Deployment-Systems, and 2 others: Deploy AQS with scap3 - https://phabricator.wikimedia.org/T114999#1845141 (10JAllemandou) >>! In T114999#1845008, @mobrovac wrote: >>>! In T114999#1844957, @greg wrote: >> (Off-topic-ish: There's no AQS project in Phab? Do al... [17:06:19] (03PS1) 10Rush: openstack: fix glance /srv permissions [puppet] - 10https://gerrit.wikimedia.org/r/256444 [17:06:28] (03PS2) 10Rush: openstack: fix glance /srv permissions [puppet] - 10https://gerrit.wikimedia.org/r/256444 [17:07:42] (03CR) 10Rush: [C: 032] openstack: fix glance /srv permissions [puppet] - 10https://gerrit.wikimedia.org/r/256444 (owner: 10Rush) [17:08:20] (03PS3) 10Isart: Adding diamond collector to send P_S metrics to graphite. Fixing user/pass on template. Making python code PEP-8 compliant. 
[puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256007 [17:09:59] (03CR) 10Hashar: "operations-puppet-pep8 traverses all directories and in each runs something like:" [puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad) [17:21:59] (03CR) 10Dzahn: [C: 031] "per "plain unencrypted HTTP over port 443"" [puppet] - 10https://gerrit.wikimedia.org/r/253917 (https://phabricator.wikimedia.org/T118956) (owner: 10BBlack) [17:22:41] (03CR) 10Ottomata: Eventloggging alarm that triggers when sql insertion decreases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255724 (https://phabricator.wikimedia.org/T119771) (owner: 10Nuria) [17:24:24] (03CR) 10DCausse: [C: 031] Elastic: move merge_threads to hiera [puppet] - 10https://gerrit.wikimedia.org/r/207377 (owner: 10Chad) [17:28:02] (03CR) 10Dzahn: [C: 031] "dzahn@sphinx:~/wmf/dns$ for wtfde in quote pedia books source news versity; do host www.de.wiki${wtfde}; done" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [17:28:52] (03CR) 10Dzahn: "for wtfde in quote pedia books source news versity; do host www.de.wiki${wtfde}.org; done" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [17:30:32] (03CR) 10Jcrespo: [C: 032] Allow ssl key usage [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256395 (owner: 10Jcrespo) [17:30:58] (03PS2) 10Jcrespo: Update mariadb module to the latest commit [puppet] - 10https://gerrit.wikimedia.org/r/256405 [17:33:37] (03CR) 10Dzahn: "the redirect is ^www\.([a-z-]+)\, so not just "de" besides the comment, i also don't see other languages, but there is "www.m" which actua" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [17:34:24] (03CR) 10Jcrespo: [C: 032] Update mariadb module to the latest commit [puppet] - 10https://gerrit.wikimedia.org/r/256405 (owner: 10Jcrespo) [17:35:23] (03CR) 10Dzahn: ""www.m" exists in DNS but only for Wikipedia, not the other projects. 
this might break that one redirect..shrug" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [17:36:07] (03CR) 10Dzahn: "by shrug i mean no idea why the other projects don't have it and if it should be consistent" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [17:36:24] * jynus waits for the puppet storm [17:47:37] 6operations, 10Traffic, 10Wikimedia-Stream, 5Patch-For-Review: rcstream service on port 443 is broken, spamming logs - https://phabricator.wikimedia.org/T118956#1845226 (10Dzahn) +1 i also think we should merge it per "plain http on 443" [17:48:11] (03CR) 1020after4: "gitblit doesn't even work most of the time so redirecting to a dead IP address would be an improvement. The phabricator side of this is t" [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [17:48:29] (03CR) 1020after4: "tldr; let's merge this!" [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [17:48:36] (03PS3) 1020after4: Gerrit: use Diffusion for repo browsing [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [17:50:39] (03CR) 10Giuseppe Lavagetto: [C: 031] "yeah this is just a relic, I guess" [puppet] - 10https://gerrit.wikimedia.org/r/253917 (https://phabricator.wikimedia.org/T118956) (owner: 10BBlack) [17:54:45] (03PS1) 10Jcrespo: Fixing import issue for ssl_key (minor syntax change) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256451 [17:55:04] (03CR) 10jenkins-bot: [V: 04-1] Fixing import issue for ssl_key (minor syntax change) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256451 (owner: 10Jcrespo) [17:57:05] (03PS3) 10Nuria: Eventloggging alarm that triggers when sql insertion decreases [puppet] - 10https://gerrit.wikimedia.org/r/255724 (https://phabricator.wikimedia.org/T119771) [17:57:24] (03PS2) 10Jcrespo: Fixing import issue for ssl_key (minor 
syntax change) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256451 [17:57:34] (03CR) 10Nuria: Eventloggging alarm that triggers when sql insertion decreases (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255724 (https://phabricator.wikimedia.org/T119771) (owner: 10Nuria) [17:57:44] (03CR) 10jenkins-bot: [V: 04-1] Fixing import issue for ssl_key (minor syntax change) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256451 (owner: 10Jcrespo) [17:59:07] (03PS3) 10Jcrespo: Fixing import issue for ssl_key (minor syntax change) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256451 [17:59:40] icinga says gallium has no git daemon running [17:59:42] (03PS4) 10Ori.livneh: Clean up l10nupdate settings (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [17:59:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Clean up l10nupdate settings (2/2) [puppet] - 10https://gerrit.wikimedia.org/r/256026 (https://phabricator.wikimedia.org/T119746) (owner: 10BryanDavis) [17:59:50] hmm [18:00:42] (03CR) 10Jcrespo: [C: 032] Fixing import issue for ssl_key (minor syntax change) [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256451 (owner: 10Jcrespo) [18:01:15] @seen hashar [18:01:15] mutante: Last time I saw hashar they were quitting the network with reason: Quit: Textual IRC Client: www.textualapp.com N/A at 12/2/2015 5:10:33 PM (50m41s ago) [18:01:29] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845269 (10MaxSem) [18:01:40] (03PS1) 10Jcrespo: Updating mariadb to the latest codebase [puppet] - 10https://gerrit.wikimedia.org/r/256454 [18:01:50] root@gallium:~# /etc/init.d/git-daemon status * git-daemon is running [18:01:57] so is it running or not :p [18:02:05] (03PS2) 10Jcrespo: Updating mariadb to the latest codebase [puppet] - 10https://gerrit.wikimedia.org/r/256454 [18:02:12] already fixed itself.. 
[18:02:16] (03CR) 10Jcrespo: [C: 032] Updating mariadb to the latest codebase [puppet] - 10https://gerrit.wikimedia.org/r/256454 (owner: 10Jcrespo) [18:05:09] (03PS1) 10Ori.livneh: Fix I772920: mediawiki::users::web is a variable, not a class [puppet] - 10https://gerrit.wikimedia.org/r/256457 [18:05:30] bd808: fyi ^ [18:05:41] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix I772920: mediawiki::users::web is a variable, not a class [puppet] - 10https://gerrit.wikimedia.org/r/256457 (owner: 10Ori.livneh) [18:05:54] ori: oh thanks [18:06:01] dumb error from me [18:06:19] np, i missed it too [18:07:31] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [18:07:42] !log mw1136 - hhvm restart [18:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:27] bd808: applied successfully on tin 2nd time around [18:08:59] (03PS1) 10Jcrespo: Fixing resource type error [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256458 [18:09:09] 6operations, 5Patch-For-Review: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1845287 (10Dzahn) asked Mark if he has concerns. 
we can move both torrus and smokeping [18:09:12] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 65618 bytes in 1.821 second response time [18:09:20] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.100 second response time [18:09:32] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:09:40] (03CR) 10Jcrespo: [C: 032] Fixing resource type error [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256458 (owner: 10Jcrespo) [18:10:39] (03PS1) 10Jcrespo: Update mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/256460 [18:11:08] (03PS2) 10Jcrespo: Update mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/256460 [18:11:27] 6operations, 10Deployment-Systems, 10Wikimedia-General-or-Unknown, 5Patch-For-Review, 15User-bd808: localisationupdate broken on wmf wikis by scap master-master sync changes - https://phabricator.wikimedia.org/T119746#1845293 (10bd808) >>! In T119746#1837536, @gerritbot wrote: > Change 255952 merged by O... [18:12:32] (03CR) 10Dzahn: "for some reason a compiler run said "Error: Duplicate declaration: Class[Exim4]" and since that i think it was just rebased." 
[puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [18:12:43] (03CR) 10Dzahn: [C: 04-1] Phab: clean up role, remove ::config and ::main abstraction [puppet] - 10https://gerrit.wikimedia.org/r/235778 (owner: 10Chad) [18:13:19] (03CR) 10Jcrespo: [C: 032] Update mariadb submodule [puppet] - 10https://gerrit.wikimedia.org/r/256460 (owner: 10Jcrespo) [18:14:47] (03PS1) 10Muehlenhoff: Assign per-data Salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/256462 (https://phabricator.wikimedia.org/T111006) [18:18:27] (03PS1) 10Jcrespo: Fixing namespace [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256463 [18:18:51] (03PS2) 10Muehlenhoff: Assign per-datacentre Salt grains for debdeploy [puppet] - 10https://gerrit.wikimedia.org/r/256462 (https://phabricator.wikimedia.org/T111006) [18:19:15] (03CR) 10Jcrespo: [C: 032] Fixing namespace [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256463 (owner: 10Jcrespo) [18:19:58] (03PS1) 10Jcrespo: Update mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/256465 [18:20:26] (03PS2) 10Jcrespo: Update mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/256465 [18:20:37] twentyafterfour: feel like merging this one? 
https://gerrit.wikimedia.org/r/#/c/247794/ i dont have +2 on that repo but keep seeing it since i commented and we once talked about it on IRC where you said it won't get auto deployed but fine to merge [18:21:39] (03CR) 10Jcrespo: [C: 032] Update mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/256465 (owner: 10Jcrespo) [18:22:39] 6operations, 6Security-Team, 10Wikimedia-General-or-Unknown: Non-NDA users cannot access graphite.wikimedia.org - https://phabricator.wikimedia.org/T56713#1845353 (10Krinkle) [18:23:59] mutante: ok [18:24:25] twentyafterfour: thx, just kept popping up in gerrit once you ever leave a comment or vote [18:25:22] (03PS4) 10Dzahn: add wikilovesmonument.org [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [18:26:02] (03CR) 10Dzahn: "added that it will point to schippers.wikimedia.nl. and more reviewers" [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [18:29:36] Hello operators! I have a db query on analytics-store that won't complete no matter what. Any ideas what's going on? [18:29:39] Details: https://www.irccloud.com/pastebin/0CBpADwm/ [18:30:36] #analytics said I should ask here. [18:34:15] (03PS1) 10Dzahn: icinga cleanup: move gsb monitoring to ./monitor/ [puppet] - 10https://gerrit.wikimedia.org/r/256467 [18:34:30] (03PS4) 10Ottomata: Eventloggging alarm that triggers when sql insertion decreases [puppet] - 10https://gerrit.wikimedia.org/r/255724 (https://phabricator.wikimedia.org/T119771) (owner: 10Nuria) [18:34:38] (03CR) 10Ottomata: [C: 032 V: 032] Eventloggging alarm that triggers when sql insertion decreases [puppet] - 10https://gerrit.wikimedia.org/r/255724 (https://phabricator.wikimedia.org/T119771) (owner: 10Nuria) [18:34:49] jynus, ^ [18:36:05] neilpquinn: it would be great if you could paste that into phab and we'll tag it for DBA attention [18:36:10] Krenair, what? 
[18:36:18] neilpquinn had a database question [18:39:41] neilpquinn, how much times does it take now the other query? [18:40:19] (03PS2) 10Dzahn: icinga cleanup: move gsb monitoring to ./monitor/ [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) [18:42:36] (03CR) 10Dzahn: [C: 04-1] "eh.. "Failed to compile catalog for node neon.wikimedia.org: Must pass client_id to Class[Icinga::Monitor::Gsb]" ?" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [18:45:02] (03PS1) 10Jcrespo: Getting rid of separate class [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256468 [18:45:36] (03CR) 10Jcrespo: [C: 032] Getting rid of separate class [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256468 (owner: 10Jcrespo) [18:45:50] (03CR) 10Dzahn: "why does this work before the change ?:P" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [18:48:00] 6operations, 10Gitblit-Deprecate, 6Phabricator, 10Phabricator-Upstream: PHD ensuring umask goodness - https://phabricator.wikimedia.org/T91648#1845440 (10greg) So, just to summarize, what needs to happen here now for us (WMF) going forward? is this blocking anything (this task is in the #gitblit-deprecate... [18:48:53] (03PS1) 10Jcrespo: Update mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/256470 [18:49:44] let me check [18:50:29] (03CR) 10Jcrespo: [C: 032] Update mariadb module [puppet] - 10https://gerrit.wikimedia.org/r/256470 (owner: 10Jcrespo) [18:51:17] 6operations, 10Gitblit-Deprecate, 6Phabricator, 10Phabricator-Upstream: PHD ensuring umask goodness - https://phabricator.wikimedia.org/T91648#1845458 (10chasemp) AFA what can be done here, we could wrap phd in our own init script or something. But for #Gitblit-Deprecate I don't think this blocks? 
[18:51:52] 6operations, 6Analytics-Kanban, 10CirrusSearch, 6Discovery, and 2 others: Delete logs on stat1002 in /a/mw-log/archive that are more than 90 days old {hawk} - https://phabricator.wikimedia.org/T118527#1845459 (10Nuria) 5Open>3Resolved [18:52:01] jynus: it still runs a lot faster. I just started it, and it's already queried 2.5 M rows in 21 s (according to SHOW FULL PROCESSLIST). The other one has only done 420 K even though it's been running for nearly an hour. [18:53:01] so it looks like you are trying to execute a slow query [18:53:29] I recommend you writing a faster query [18:53:42] 6operations, 5Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#1845473 (10Dzahn) puppet-lint: re-enable unquoted resource check https://gerrit.wikimedia.org/r/#/c/253652/ [18:53:42] Apparently, but I can't figure out why it would be slow. As I read the EXPLAIN output, it's less complicated. [18:53:43] if you need help with that, file a bug on phabricator [18:54:04] Hmm. Thanks. [18:54:05] as doctor house says, EXPLAIN allways lies [18:54:13] (03CR) 10Chad: "That's not it. That .pep8 file only ignores E501 line-too-long. I was running pep8 . in the top level which exposed a ton of errors but di" [puppet] - 10https://gerrit.wikimedia.org/r/256311 (owner: 10Chad) [18:54:19] SELECT, on the other side, always tells the truth [18:54:20] Haha, okay. [18:58:01] PROBLEM - puppet last run on mw2118 is CRITICAL: CRITICAL: puppet fail [18:58:18] 6operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1845500 (10GWicke) I don't know much too much about OCG's Redis queues, but one possibility to potentially look into is stored jobs in the qu... 
[18:58:55] 6operations, 6Phabricator, 10Phabricator-Upstream: PHD ensuring umask goodness - https://phabricator.wikimedia.org/T91648#1845502 (10demon) [19:00:25] (03CR) 10coren: [C: 031] "Sane, with a (very minor) quibble that does not otherwise prevent merging." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/256170 (owner: 10Andrew Bogott) [19:01:32] (03PS1) 10Dzahn: eventlogging, varnish: fix last 2 quoting warnings [puppet] - 10https://gerrit.wikimedia.org/r/256472 (https://phabricator.wikimedia.org/T93645) [19:02:11] PROBLEM - puppet last run on db2055 is CRITICAL: CRITICAL: puppet fail [19:04:31] 6operations, 10OCG-General-or-Unknown, 6Services: OCG should not be contacted directly from the appservers but only via LVS - https://phabricator.wikimedia.org/T120077#1845549 (10Joe) @bd808 apart from finding the code itself, what is problematic is that this is an essential part of the design of OCG, as I u... [19:06:18] ^puppet failing just for the laughs on db2055 [19:06:54] mutante: Filed as https://phabricator.wikimedia.org/T120119 [19:08:00] (03PS2) 10Andrew Bogott: Wikitech: Explicitly rebuild smw data four times/day [puppet] - 10https://gerrit.wikimedia.org/r/256170 [19:08:11] RECOVERY - puppet last run on db2055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:08:35] (03PS2) 10Dzahn: eventlogging, varnish: fix last 2 quoting warnings [puppet] - 10https://gerrit.wikimedia.org/r/256472 (https://phabricator.wikimedia.org/T93645) [19:09:58] neilpquinn: cool, thanks. 
that was for jynus actually, so that it can be continued non-realtime [19:11:27] (03PS3) 10Dzahn: eventlogging, varnish: fix last 2 quoting warnings [puppet] - 10https://gerrit.wikimedia.org/r/256472 (https://phabricator.wikimedia.org/T93645) [19:12:33] (03CR) 10Dzahn: [C: 032] "the varnish change is only inside a comment block" [puppet] - 10https://gerrit.wikimedia.org/r/256472 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [19:13:43] !log restarting mysql on db2067 to test a configuration change [19:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:29] (03PS2) 10Dzahn: varnish: move file to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [19:16:50] (03PS3) 10Dzahn: varnish: move varnish-test-geoip to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [19:17:41] (03PS4) 10Dzahn: varnish: move file to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [19:18:30] (03PS5) 10Dzahn: varnish: move file to module [puppet] - 10https://gerrit.wikimedia.org/r/253457 [19:20:25] (03CR) 10Dzahn: "besides this i just see one other file in global ./files/varnish/ while everything else is in the module" [puppet] - 10https://gerrit.wikimedia.org/r/253457 (owner: 10Dzahn) [19:26:11] RECOVERY - puppet last run on mw2118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:51] PROBLEM - puppet last run on mw2141 is CRITICAL: CRITICAL: puppet fail [19:28:10] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail [19:29:05] (03PS1) 10Giuseppe Lavagetto: Imported Upstream version 0.4.3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256475 [19:29:07] (03PS1) 10Giuseppe Lavagetto: New version [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256476 [19:32:57] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Imported Upstream version 0.4.3 [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256475 (owner: 10Giuseppe Lavagetto) [19:33:22] (03CR) 
10Giuseppe Lavagetto: [C: 032 V: 032] New version [debs/python-etcd] - 10https://gerrit.wikimedia.org/r/256476 (owner: 10Giuseppe Lavagetto) [19:35:10] 6operations, 10DBA, 5Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1845611 (10jcrespo) Replication is ready, we only need to: * Set SSL => 'on' puppet and restart all servers to apply the config changes (we will set it on the role once it has been applied e... [19:35:30] 6operations, 10DBA: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1845613 (10jcrespo) [19:38:32] 6operations, 10Traffic: Improve Varnish XFF processing for trusted proxies - https://phabricator.wikimedia.org/T120121#1845623 (10BBlack) 3NEW [19:40:21] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:45:13] (03PS3) 10Dzahn: lint: re-enable double quoted strings check [puppet] - 10https://gerrit.wikimedia.org/r/243859 (https://phabricator.wikimedia.org/T93645) [19:45:50] (03CR) 10jenkins-bot: [V: 04-1] lint: re-enable double quoted strings check [puppet] - 10https://gerrit.wikimedia.org/r/243859 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [19:46:49] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845648 (10jcrespo) 3NEW a:3jcrespo [19:47:11] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845648 (10jcrespo) [19:47:12] 6operations, 10DBA: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1845655 (10jcrespo) [19:47:39] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845648 (10jcrespo) [19:47:40] 6operations, 10DBA, 
5Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1845657 (10jcrespo) [19:48:12] 6operations, 5Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#1845659 (10jcrespo) [19:48:13] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845648 (10jcrespo) [19:48:50] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845662 (10jcrespo) [19:48:51] 6operations, 10DBA, 7Tracking: Migrate MySQLs to use ROW-based replication (tracking) - https://phabricator.wikimedia.org/T109179#1845661 (10jcrespo) [19:49:34] (03PS1) 10Rush: openstack: refactor designate role/class for labtest [puppet] - 10https://gerrit.wikimedia.org/r/256477 [19:49:39] 6operations, 10DBA: Perform a rolling restart of all MySQL slaves (masters too for those services with low traffic) - https://phabricator.wikimedia.org/T120122#1845648 (10jcrespo) [19:49:55] (03PS2) 10Rush: openstack: refactor designate role/class for labtest [puppet] - 10https://gerrit.wikimedia.org/r/256477 [19:53:21] (03PS1) 10Dzahn: minimal lint fixes [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/256478 [19:54:16] (03PS2) 10Dzahn: fix the last quoted boolean [puppet/kafkatee] - 10https://gerrit.wikimedia.org/r/256478 [19:55:01] RECOVERY - puppet last run on mw2141 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [19:56:28] (03PS3) 10Rush: openstack: refactor designate role/class for labtest [puppet] - 10https://gerrit.wikimedia.org/r/256477 [19:56:41] (03PS4) 10Rush: openstack: refactor designate role/class for labtest [puppet] - 10https://gerrit.wikimedia.org/r/256477 [19:58:19] (03PS1) 10Giuseppe Lavagetto: Use system-wide etcd configurations for the etcd driver 
[software/conftool] - 10https://gerrit.wikimedia.org/r/256480 [19:58:20] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail [19:58:21] (03PS1) 10Giuseppe Lavagetto: Clarified the error message since we're in a multi-host setup now. [software/conftool] - 10https://gerrit.wikimedia.org/r/256481 [19:58:23] (03PS1) 10Giuseppe Lavagetto: Fix tests [software/conftool] - 10https://gerrit.wikimedia.org/r/256482 [19:58:30] (03PS1) 10Dzahn: fix double quoted string warnings [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256483 [19:58:52] how can puppet flop? [19:59:12] (03CR) 10jenkins-bot: [V: 04-1] Clarified the error message since we're in a multi-host setup now. [software/conftool] - 10https://gerrit.wikimedia.org/r/256481 (owner: 10Giuseppe Lavagetto) [19:59:14] (03CR) 10jenkins-bot: [V: 04-1] Use system-wide etcd configurations for the etcd driver [software/conftool] - 10https://gerrit.wikimedia.org/r/256480 (owner: 10Giuseppe Lavagetto) [19:59:23] (03CR) 10jenkins-bot: [V: 04-1] Fix tests [software/conftool] - 10https://gerrit.wikimedia.org/r/256482 (owner: 10Giuseppe Lavagetto) [20:02:01] (03PS1) 10Dzahn: fix lint warnings [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/256484 [20:05:45] (03CR) 10Jcrespo: [C: 031] fix double quoted string warnings [puppet/mariadb] - 10https://gerrit.wikimedia.org/r/256483 (owner: 10Dzahn) [20:06:22] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:07:58] oh: ! 
[remote rejected] HEAD -> refs/publish/master (project is read only) [20:08:02] for cassandra [20:11:06] (03CR) 10Rush: [C: 032] "hard to see as a noop w/ the compiler issues with hiera private but it's somewhat clear as a noop :)" [puppet] - 10https://gerrit.wikimedia.org/r/256477 (owner: 10Rush) [20:12:43] (03PS1) 10Dzahn: minimal lint fix, indentation warning [puppet/cdh] - 10https://gerrit.wikimedia.org/r/256487 [20:13:33] (03PS2) 10Jdlrobson: Enable RelatedArticles and Cards on beta wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/256396 (https://phabricator.wikimedia.org/T116676) (owner: 10Bmansurov) [20:14:09] (03PS1) 10Dzahn: ferm: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256489 [20:16:24] (03PS1) 10Dzahn: wikilabels: fix lint warning [puppet] - 10https://gerrit.wikimedia.org/r/256491 [20:18:22] (03PS3) 10Rush: openstack: convert missing ldappassword param hiera [puppet] - 10https://gerrit.wikimedia.org/r/256492 [20:18:29] (03PS4) 10Rush: openstack: convert missing ldappassword param hiera [puppet] - 10https://gerrit.wikimedia.org/r/256492 [20:19:05] (03CR) 10Rush: [C: 032] openstack: convert missing ldappassword param hiera [puppet] - 10https://gerrit.wikimedia.org/r/256492 (owner: 10Rush) [20:19:30] (03CR) 10Rush: [V: 032] openstack: convert missing ldappassword param hiera [puppet] - 10https://gerrit.wikimedia.org/r/256492 (owner: 10Rush) [20:19:57] (03PS1) 10Dzahn: role: fix "ensure found on line but not the first" [puppet] - 10https://gerrit.wikimedia.org/r/256493 [20:23:12] (03PS1) 10Dzahn: dataset,ores: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256494 [20:24:40] (03PS1) 10Dzahn: graphite: fix "ensure not the first" warnings [puppet] - 10https://gerrit.wikimedia.org/r/256495 [20:24:42] (03PS3) 10Andrew Bogott: Wikitech: Explicitly rebuild smw data four times/day [puppet] - 10https://gerrit.wikimedia.org/r/256170 [20:26:17] (03CR) 10Andrew Bogott: [C: 032] Wikitech: 
Explicitly rebuild smw data four times/day [puppet] - 10https://gerrit.wikimedia.org/r/256170 (owner: 10Andrew Bogott) [20:27:29] (03PS1) 10Dzahn: fix lint warnings [puppet/nginx] - 10https://gerrit.wikimedia.org/r/256496 [20:29:26] 6operations, 10Dumps-Generation, 10hardware-requests: determine hardware needs for dumps in eqiad (boxes out of warranty, capacity planning) - https://phabricator.wikimedia.org/T118154#1845741 (10RobH) Sorry, this fell off my radar, and it shouldn't have. I'm picking this back up now, and I'll be generating... [20:29:54] (03PS1) 10Rush: labtest: labtestservices2001 as designate [puppet] - 10https://gerrit.wikimedia.org/r/256497 [20:30:03] (03PS1) 10Dzahn: varnish: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256498 [20:31:25] (03CR) 10jenkins-bot: [V: 04-1] varnish: fix last lint warnings [puppet] - 10https://gerrit.wikimedia.org/r/256498 (owner: 10Dzahn) [20:31:41] (03CR) 10Rush: [C: 032] labtest: labtestservices2001 as designate [puppet] - 10https://gerrit.wikimedia.org/r/256497 (owner: 10Rush) [20:37:17] !log enable mail queue monitoring for fundraising [20:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:27] 6operations, 10Dumps-Generation, 10hardware-requests: eqiad: (2) snapshot hosts (similar+ to snapshot1001) - https://phabricator.wikimedia.org/T120126#1845751 (10RobH) 3NEW a:3RobH [20:39:29] (03PS1) 10Dzahn: icinga: remove user from dialout group [puppet] - 10https://gerrit.wikimedia.org/r/256508 (https://phabricator.wikimedia.org/T110893) [20:39:34] bah [20:39:43] that was meant to be in another space, good thing no sensitive data yet ;] [20:39:55] ^ i'm pretty sure that 'dialout' group for icinga.. 
is from the days when we sent out SMS with a USB dongle [20:40:01] (03PS22) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:40:02] so for that it would have to dial out [20:40:37] tries to help cleanup that entire module [20:52:10] (03PS1) 10Dzahn: icinga/labsnfs: move monitoring groups to labsnfs [puppet] - 10https://gerrit.wikimedia.org/r/256509 (https://phabricator.wikimedia.org/T110893) [20:55:31] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845820 (10akosiaris) `puppet-test02` can be considered fixed btw [20:56:26] (03PS23) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:57:05] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845826 (10yuvipanda) So was the problem just including things in site.pp vs including it via LDAP? [20:58:36] (03PS24) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [20:59:40] (03PS25) 10Ottomata: [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) [21:00:26] (03CR) 10jenkins-bot: [V: 04-1] [WIP] Puppetize eventlogging-service with systemd [puppet] - 10https://gerrit.wikimedia.org/r/253465 (https://phabricator.wikimedia.org/T118780) (owner: 10Ottomata) [21:01:55] 7Puppet, 6operations, 6Labs: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1845854 (10akosiaris) More precisely it was the fact that the //node// was defined in both site.pp and LDAP. [21:04:47] (03CR) 10Alexandros Kosiaris: "hiera. 
Look at the labs/private repo" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [21:04:51] (03PS1) 10Dzahn: mediawiki: move roles into separate files [puppet] - 10https://gerrit.wikimedia.org/r/256574 [21:06:44] (03CR) 10Dzahn: "aah, thanks! so for review purposes, assume i will change this in the private repo(s) too" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [21:07:57] (03PS2) 10Dzahn: mediawiki: move roles into separate files [puppet] - 10https://gerrit.wikimedia.org/r/256574 [21:14:07] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware (mwoffliner) - https://phabricator.wikimedia.org/T117095#1845870 (10RobH) a:5RobH>3mark I only seem to have a single spare that meets these requirements closely: Dell PowerEdge R420, Dual Intel Xeon E5-2440, 32GB... [21:14:30] hello [21:24:28] PROBLEM - cassandra CQL 10.64.48.110:9042 on restbase1009 is CRITICAL: Connection refused [21:24:57] this is the decommission finishing ^^ [21:27:59] ACKNOWLEDGEMENT - cassandra CQL 10.64.48.110:9042 on restbase1009 is CRITICAL: Connection refused gwicke Decommission finished. [21:30:47] !log rebooting lvs1007 for interface config test (not active, no BGP) [21:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:37:53] 6operations, 10Traffic, 5Patch-For-Review: Fix ethernet startup race on HP LVS w/ jessie - https://phabricator.wikimedia.org/T110530#1845977 (10BBlack) So, @faidon pointed out that this would probably fix it self with `s/^auto eth/allow-hotplug eth/` on `/etc/network/interfaces`. The eth0 entry there is alr... [21:41:53] are you touching the network? 
[21:42:19] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1846016 (10Tgr) Tested with beta Commons file uploads after updating librsvg2-2 / librsvg2-bin / librsvg2-common on `deployment-mediawiki02` (per T84950 tha... [21:42:40] I have 30 alerts CHECK_NRPE: Socket timeout after 10 seconds. [21:44:03] !log disabling all alert notifications for dbstore1002 [21:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:46:41] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1846029 (10Tgr) So, * the beta appserver does not have the right fonts installed. This is expected, beta cluster has no real image scalers. * beta cluster's... [21:56:34] (03PS1) 10Brian Wolff: Fix upload rewrite rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/256589 [21:58:51] (03CR) 1020after4: "Anyone willing to merge this? Alexandros or Chase?" [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [22:00:11] 7Blocked-on-Operations, 6operations, 6Commons, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1846073 (10Bawolff) > * beta cluster's thumb.php hack does not handle extra parameters I submitted https://gerrit.wikimedia.org/r/#/c/256589/ for this. [22:05:31] (03CR) 10Paladox: "What about the problem with un merged patchs. You carn't view them in diffusion." 
[puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [22:06:11] I think we will start by killing some queries here [22:09:25] !log unscheduled restart of dbstore1002 (analytics-slave) [22:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:51] "unscheduled restart" [22:10:22] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1846098 (10RobH) a:3Smalyshev Ok some working notes: Both wdqs1001/wdqs1002 are Dell Poweredge R420s that have space for 8 total SFF... [22:10:45] SMalyshev: I think my updates on the above ^ make sense but let me know if not [22:11:01] and i think we can totally do it. [22:11:26] these systems use a model SSD that we have plenty of spares and no longer use in new systems. [22:11:44] 6operations: Grant tomasz access to Google Web Master Tools for top 10 languages across desktop and mobile plus wikipedia.org portal - https://phabricator.wikimedia.org/T120136#1846102 (10Tfinc) 3NEW [22:13:45] (03CR) 10Reedy: [C: 04-1] "So, not to break anything, I need to limit the wikipedia rewrite to be only www.m.wikipedia.org" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [22:14:13] (03CR) 10Gergő Tisza: [C: 031] "IIRC beta image requests go something like nginx -> custom thumb.php -> upload varnish -> text varnish -> apache -> proper thumb.php so I " [puppet] - 10https://gerrit.wikimedia.org/r/256589 (owner: 10Brian Wolff) [22:16:13] (03PS2) 10Gergő Tisza: Fix upload rewrite rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/256589 (https://phabricator.wikimedia.org/T71757) (owner: 10Brian Wolff) [22:16:34] (03CR) 10Dzahn: "or let's first suggest to remove that from DNS for consistency and get a +1 from mobile for it. 
and once that is gone this patch here can " [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [22:17:40] (03CR) 10Brian Wolff: "ugh, really? :(" [puppet] - 10https://gerrit.wikimedia.org/r/256589 (https://phabricator.wikimedia.org/T71757) (owner: 10Brian Wolff) [22:19:11] (03PS3) 10Ori.livneh: Fix upload rewrite rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/256589 (https://phabricator.wikimedia.org/T71757) (owner: 10Brian Wolff) [22:19:17] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix upload rewrite rules for beta [puppet] - 10https://gerrit.wikimedia.org/r/256589 (https://phabricator.wikimedia.org/T71757) (owner: 10Brian Wolff) [22:19:58] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1846127 (10RobH) 5Open>3stalled Presently EQIAD has 7 of Seagate Barracuda ST1000MD003 in 2TB and CODFW has 0. So this would be an order for 3 @ EQIAD and 10 & CODFW. I'll create a sub-task for the order... [22:20:38] PROBLEM - git_daemon_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/lib/git-core/git-daemon [22:21:43] ^^^ looking at it [22:21:46] not sure why [22:23:23] 6operations: Grant tomasz access to Google Web Master Tools for top 10 languages across desktop and mobile plus wikipedia.org portal - https://phabricator.wikimedia.org/T120136#1846137 (10Deskana) Related tasks: {T101157} and {T116822}. @chasemp and I worked on those and we're working to create a dedicated accou... [22:24:46] RECOVERY - git_daemon_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon [22:25:09] (03CR) 10Faidon Liambotis: [C: 04-1] "Why are we not owning this domain? I don't think we should be hosting domains that we do not own. 
Pointing records elsewhere is OK, but on" [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [22:26:14] (03CR) 10Dzahn: "< mutante> hello mobile team, when cleaning up Apache config in a different matter (old redirects for stuff like www.de.wikipedia.org that" [puppet] - 10https://gerrit.wikimedia.org/r/256441 (owner: 10Reedy) [22:27:36] 6operations, 10hardware-requests: spare swift disks order - https://phabricator.wikimedia.org/T119698#1846179 (10RobH) a:3RobH I'll keep this stalled and assigned to me, as the purchase task is pending approvals. [22:33:47] 6operations, 10hardware-requests: Site: 2 hardware access request for ORES - https://phabricator.wikimedia.org/T119598#1846188 (10RobH) a:3mark @akosiaris: So using the old out of warranty systems still needs @Mark to approve. Right now we are kind of in the process of killing off older systems that are out... [22:34:14] 6operations, 10hardware-requests: eqiad: (2) spare servers request for ORES - https://phabricator.wikimedia.org/T119598#1846191 (10RobH) [22:35:24] 6operations, 10hardware-requests: Hardware access request for yubico auth servers - https://phabricator.wikimedia.org/T118983#1846194 (10RobH) @Mark: We're still also pending your approval of this spare allocation in EQIAD for Yubiauth system. (It may be unclear since this also has a sub-task for the ordering... [22:40:48] (03PS1) 10Hashar: zuul: tweak git-daemon monitoring [puppet] - 10https://gerrit.wikimedia.org/r/256593 [22:42:38] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: puppet fail [22:44:09] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1846226 (10Smalyshev) @RobH would it require full reimage or can be done incrementally while preserving existing data? If the latter,... 
[22:46:13] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, 10hardware-requests: Additional diskspace of wdqs1001/wdqs1002 - https://phabricator.wikimedia.org/T119579#1846235 (10RobH) It would require a reimage to replace the raid1 with a raid10. However, it would be the best option in the long run,... [22:50:56] (03PS1) 10Andrew Bogott: labtestcontrol2001: set is_labs_puppet_master and is_puppet_master [puppet] - 10https://gerrit.wikimedia.org/r/256595 [22:53:03] (03CR) 10Andrew Bogott: [C: 032] labtestcontrol2001: set is_labs_puppet_master and is_puppet_master [puppet] - 10https://gerrit.wikimedia.org/r/256595 (owner: 10Andrew Bogott) [22:56:54] (03CR) 10Chad: "I'm not convinced viewing unmerged patches is actually a useful feature of gerrit vs. just adding confusion and extra links to the UI." [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [22:58:33] (03PS4) 10Yuvipanda: Gerrit: use Diffusion for repo browsing [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [22:58:47] (03CR) 10Yuvipanda: [C: 032 V: 032] "KILLKILLKILLKILL" [puppet] - 10https://gerrit.wikimedia.org/r/256262 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [22:58:55] ostriches: does this need a gerrit restart [23:01:15] 6operations, 7Mobile: Investiage if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1846295 (10Reedy) 3NEW [23:01:32] ah it autorestarts [23:01:38] That. [23:02:04] tbh, all those links /will break/ lmao [23:02:12] Hence my inline comment I hadn't answered yet. [23:03:03] :) I don't know if people used it [23:03:06] we'll know now [23:05:25] Heh, gerrit hella slow. 
Caches cold [23:05:31] 6operations, 7Mobile, 5Patch-For-Review: Investiage if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1846321 (10Dzahn) If i got it right then ~ 0.000272% of all hits in /srv/log/webrequest/sampled-1000.json on oxygen have been to www.m.wikipedia.org . a random example s... [23:06:46] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures [23:07:07] 6operations, 7Mobile, 5Patch-For-Review: Investiage if www.m.wikipedia.org needs to stay around - https://phabricator.wikimedia.org/T120143#1846328 (10Dzahn) 14:25 < mutante> hello mobile team, when cleaning up Apache config in a different matter (old redirects for stuff like www.de.wikipedia.org that dont e... [23:10:19] yuvipanda: Bleh. [23:10:23] https://gerrit.wikimedia.org/r/#/c/256599/ [23:10:57] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:12:28] ostriches: ok. running puppet again [23:12:36] grrrit-wm didn't take to the restarts very well, eh [23:12:49] Clearly [23:12:59] 6operations, 6Analytics-Backlog, 7HTTPS: EventLogging sees too few distinct client IPs - https://phabricator.wikimedia.org/T119144#1846335 (10atgo) [23:14:13] I restarted it [23:16:50] (03CR) 10Dzahn: "like this? 
https://gerrit.wikimedia.org/r/#/c/256601/" [puppet] - 10https://gerrit.wikimedia.org/r/256467 (https://phabricator.wikimedia.org/T110893) (owner: 10Dzahn) [23:23:12] (03CR) 10Alex Monk: [C: 031] delete www.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/256597 (https://phabricator.wikimedia.org/T120143) (owner: 10Dzahn) [23:24:52] (03PS2) 10Dzahn: delete www.m.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/256597 (https://phabricator.wikimedia.org/T120143) [23:30:37] PROBLEM - pybal on lvs1007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal [23:32:36] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:34:36] (03PS1) 10Chad: Revert "Revert "Gerrit: use Diffusion for repo browsing"" [puppet] - 10https://gerrit.wikimedia.org/r/256605 [23:40:30] (03PS2) 10Chad: Revert "Revert "Gerrit: use Diffusion for repo browsing"" [puppet] - 10https://gerrit.wikimedia.org/r/256605 [23:41:40] (03CR) 1020after4: [C: 031] Revert "Revert "Gerrit: use Diffusion for repo browsing"" [puppet] - 10https://gerrit.wikimedia.org/r/256605 (owner: 10Chad) [23:42:14] revert revert revert revert [23:42:27] all day long [23:42:33] (03CR) 10Rush: [C: 031] "if it comes down to the double negative revert though you win a prize" [puppet] - 10https://gerrit.wikimedia.org/r/256605 (owner: 10Chad) [23:43:18] * twentyafterfour likes type-o-tripple-negative reverts [23:44:57] twentyafterfour: now you got cinnamon girl in my head but the super creepy version [23:45:54] it's not halloween, it's not time for me to break out those albums [23:45:56] hahah ... my roommate has been playing "lovin you is like lovin the dead..." for a while now. [23:49:05] wait chasemp: is there a version of cinnamon girl that isn't creepy? 
[23:49:44] well...I mean that depends on your pov but https://www.youtube.com/watch?v=aAdtUDaBfRA :) [23:49:54] (that's the original) [23:57:48] 7Blocked-on-Operations, 7Varnish: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1846476 (10ori) 3NEW