[00:01:22] PROBLEM - DPKG on snapshot1007 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[00:02:11] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[00:03:12] RECOVERY - DPKG on snapshot1007 is OK: All packages OK
[00:03:20] (PS2) ArielGlenn: fix one more stray directory reference in snapshot misc cron job [puppet] - https://gerrit.wikimedia.org/r/281560
[00:04:45] (CR) ArielGlenn: [C: 2] fix one more stray directory reference in snapshot misc cron job [puppet] - https://gerrit.wikimedia.org/r/281560 (owner: ArielGlenn)
[00:06:49] (PS1) ArielGlenn: delay the monthly dump cron run one more lousy day [puppet] - https://gerrit.wikimedia.org/r/281575
[00:07:13] (PS2) ArielGlenn: delay the monthly dump cron run one more lousy day [puppet] - https://gerrit.wikimedia.org/r/281575
[00:09:22] RECOVERY - RAID on stat1002 is OK: OK: optimal, 1 logical, 12 physical
[00:09:48] (CR) ArielGlenn: [C: 2] delay the monthly dump cron run one more lousy day [puppet] - https://gerrit.wikimedia.org/r/281575 (owner: ArielGlenn)
[00:12:16] !log renamed frontend.navtiming.loading -> frontend.navtiming.loadEventStart and frontend.navtiming.sending -> frontend.navtiming.fetchStart on graphite2001 and graphite1001 ahead of merging https://gerrit.wikimedia.org/r/#/c/281082
[00:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:28] (PS4) Ori.livneh: webperf: Rename navtiming 'loading' and 'sending' to standard equivalent [puppet] - https://gerrit.wikimedia.org/r/281082 (owner: Krinkle)
[00:13:35] (CR) Ori.livneh: [C: 2 V: 2] webperf: Rename navtiming 'loading' and 'sending' to standard equivalent [puppet] - https://gerrit.wikimedia.org/r/281082 (owner: Krinkle)
[00:13:51] (PS2) Ori.livneh: webperf: Convert navtiming metric mapping into list [puppet] - https://gerrit.wikimedia.org/r/281497 (owner: Krinkle)
[00:14:36] !log krinkle@terbium Running deleteEqualMessages.php over previously cleaned wikis with --lang-code (T45917)
[00:14:37] (CR) Ori.livneh: [C: 2 V: 2] webperf: Convert navtiming metric mapping into list [puppet] - https://gerrit.wikimedia.org/r/281497 (owner: Krinkle)
[00:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:20] (CR) Ori.livneh: [C: -1] "I think you forgot domInteractive" [puppet] - https://gerrit.wikimedia.org/r/281498 (owner: Krinkle)
[00:19:08] (PS2) Ori.livneh: coal-web: Show domInteractive instead of domComplete [puppet] - https://gerrit.wikimedia.org/r/281501 (owner: Krinkle)
[00:19:17] (CR) Ori.livneh: [C: 2 V: 2] coal-web: Show domInteractive instead of domComplete [puppet] - https://gerrit.wikimedia.org/r/281501 (owner: Krinkle)
[00:21:21] (PS2) Krinkle: webperf: Collect metrics for 'domInteractive' and 'domComplete' [puppet] - https://gerrit.wikimedia.org/r/281498
[00:21:32] (CR) Krinkle: "Thanks" [puppet] - https://gerrit.wikimedia.org/r/281498 (owner: Krinkle)
[00:21:38] (PS3) Krinkle: webperf: Collect metrics for 'domInteractive' and 'domComplete' [puppet] - https://gerrit.wikimedia.org/r/281498
[00:21:56] (CR) jenkins-bot: [V: -1] webperf: Collect metrics for 'domInteractive' and 'domComplete' [puppet] - https://gerrit.wikimedia.org/r/281498 (owner: Krinkle)
[00:22:34] (CR) Ori.livneh: [C: 2 V: 2] webperf: Collect metrics for 'domInteractive' and 'domComplete' [puppet] - https://gerrit.wikimedia.org/r/281498 (owner: Krinkle)
[01:15:11] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=544 [critical =500]
[01:45:10] RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=1
[02:07:52] PROBLEM - puppet last run on labvirt1002 is CRITICAL: CRITICAL: puppet fail
[02:27:02] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.19) (duration: 11m 55s)
[02:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:30:44] (PS1) Yuvipanda: k8s: Stop using packages for master components [puppet] - https://gerrit.wikimedia.org/r/281586 (https://phabricator.wikimedia.org/T130972)
[02:30:46] (PS1) Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972)
[02:31:33] RECOVERY - RAID on db1052 is OK: OK: optimal, 1 logical, 2 physical
[02:34:41] RECOVERY - puppet last run on labvirt1002 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[02:36:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Apr 5 02:36:21 UTC 2016 (duration 9m 19s)
[02:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:50:11] PROBLEM - puppet last run on mw2114 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:17:01] RECOVERY - puppet last run on mw2114 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[03:58:01] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0]
[03:58:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [5000000.0]
[04:06:09] (PS2) Yuvipanda: k8s: Stop using packages for master components [puppet] - https://gerrit.wikimedia.org/r/281586 (https://phabricator.wikimedia.org/T130972)
[04:06:11] (PS2) Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972)
[04:06:13] (PS1) Yuvipanda: k8s: Add simple script for deploying master [puppet] - https://gerrit.wikimedia.org/r/281589 (https://phabricator.wikimedia.org/T130972)
[04:06:15] (PS1) Yuvipanda: k8s: Simple script to deploy worker & proxy [puppet] - https://gerrit.wikimedia.org/r/281590 (https://phabricator.wikimedia.org/T130972)
[04:06:17] (PS1) Yuvipanda: k8s: Switch to new format for ABAC [puppet] - https://gerrit.wikimedia.org/r/281591 (https://phabricator.wikimedia.org/T130972)
[04:14:12] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000000.0]
[04:23:10] PROBLEM - puppet last run on cp4005 is CRITICAL: CRITICAL: puppet fail
[04:25:20] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100%
[04:26:01] RECOVERY - Host mw2031 is UP: PING OK - Packet loss = 0%, RTA = 37.07 ms
[04:50:00] RECOVERY - puppet last run on cp4005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:51:41] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[04:51:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[04:58:18] Operations, Performance-Team, Patch-For-Review, Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#2179751 (Ricordisamoa) Out of curiosity, have you got any stats about the foreseen performance improvements?
[06:03:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 62.07% of data above the critical threshold [5000000.0]
[06:31:00] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:21] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:21] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:01] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:21] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:01] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:11] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:51:35] Operations, ops-eqiad: db1052 degraded RAID - https://phabricator.wikimedia.org/T131701#2179827 (Volans) Open>Resolved Rebuild completed, virtual drive back to optimal.
[06:56:21] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:56:30] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:02] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:57:10] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:57:11] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:57:21] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:30] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:58:00] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0
failures
[07:04:03] Operations, Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2179841 (Joe) I don't think measuring latencies for memcached (where they are usually around 1 ms) is that significant; improving the cache hit ratio would be very significant on the...
[07:12:50] PROBLEM - Varnishkafka log producer on cp1044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[07:14:50] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: puppet fail
[07:14:56] <_joe_> !log uploading the hhvm 3.12.1 backport package for jessie to reprepro
[07:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:36:01] RECOVERY - Varnishkafka log producer on cp1044 is OK: PROCS OK: 1 process with command name varnishkafka
[07:38:40] Operations, Continuous-Integration-Infrastructure, Services, Patch-For-Review: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2179870 (Ricordisamoa) >>! In T124474#2164007, @Krinkle wrote: > It would additionally be interesting to explore actual multi-version node (nvm's primary featu...
[07:41:41] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[07:45:38] (PS1) ArielGlenn: add hhvm include to snapshots experimentally on jessie [puppet] - https://gerrit.wikimedia.org/r/281600
[07:46:57] ah _joe_ that was you, huh :-)
[07:48:37] (CR) ArielGlenn: [C: 2] add hhvm include to snapshots experimentally on jessie [puppet] - https://gerrit.wikimedia.org/r/281600 (owner: ArielGlenn)
[07:50:24] meh and I can't actually include the stanza because it's got ubuntu >= trusty hardcoded in there
[07:52:02] requires_os('ubuntu >= trusty')
[07:54:04] (PS1) ArielGlenn: allow hhvm installs on jessie [puppet] - https://gerrit.wikimedia.org/r/281601
[08:00:53] <_joe_> apergos: don't :)
[08:01:04] no? bummer
[08:01:11] <_joe_> no
[08:01:19] I was going to ask before doing anything but there's one more change I'd have to make in a manifest
[08:01:25] <_joe_> 1) there is a complete patch from me from yesterday
[08:01:43] <_joe_> https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+branch:production+topic:no_more_zend,n,z
[08:02:11] ah this is a lot more than I need but if it covers the case, great
[08:02:13] <_joe_> 2) hhvm packages on jessie are being backported right now and I have an issue with the wikidiff2 extension
[08:02:21] oh, still not ready
[08:02:24] <_joe_> nope
[08:02:29] <_joe_> there is a ticket
[08:02:36] drat
[08:02:45] <_joe_> https://phabricator.wikimedia.org/T131755
[08:03:11] subscribed
[08:04:17] Operations, MediaWiki-General-or-Unknown, HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2179897 (Joe) Tidy compiled fine with minimal tweaks to debian/control, while wikidiff2 failed to compile with a rather mysteri...
[08:04:24] well this likely means that I'll revisit after this month's run then. too much delay on my part already. thanks for the heads up
[08:04:43] (PS3) Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754)
[08:05:40] (PS1) ArielGlenn: Revert "add hhvm include to snapshots experimentally on jessie" [puppet] - https://gerrit.wikimedia.org/r/281602
[08:05:53] I'll drop back to trusty on these boxes then
[08:07:20] (CR) ArielGlenn: [C: 2] Revert "add hhvm include to snapshots experimentally on jessie" [puppet] - https://gerrit.wikimedia.org/r/281602 (owner: ArielGlenn)
[08:09:23] Operations, MediaWiki-General-or-Unknown, HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2179900 (Joe) Apparently I forgot to add libtbb-dev to the builddeps of wikidiff2, so the error makes much more sense now :) I'...
[08:20:04] <_joe_> apergos: we can merge my patch once I'm done with the packages, but I'd first test those a bit
[08:20:06] (Abandoned) ArielGlenn: allow hhvm installs on jessie [puppet] - https://gerrit.wikimedia.org/r/281601 (owner: ArielGlenn)
[08:20:12] yep
[08:20:26] It's better for me to wait
[08:20:26] <_joe_> also, you're just using the cli?
[08:20:33] yes but.
[08:20:48] in order to get mediawiki on there it wants all the rest of the crud
[08:21:01] <_joe_> because I am working on adapting the hhvm module to jessie as well
[08:21:06] <_joe_> it's a fair bit of work
[08:21:11] I imagine
[08:22:27] well I would not mind being a guinea pig for testing at some point on one box
[08:22:38] just that it won't be a full test of everything
[08:25:34] (CR) Muehlenhoff: "Looks good to me, but let's avoid the transitional packages." (4 comments) [puppet] - https://gerrit.wikimedia.org/r/281409 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:27:49] (CR) Muehlenhoff: [C: 1] "Looks good to me."
[puppet] - https://gerrit.wikimedia.org/r/281408 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:28:23] <_joe_> moritzm: thanks! :)
[08:33:04] (CR) Muehlenhoff: [C: 1] "Minor nitpick" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281407 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:34:30] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/281410 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:38:37] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/281412 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:44:11] (CR) Muehlenhoff: [C: 1] "Looks good to me." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/281419 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[08:46:55] Operations, Traffic: Varnish 4 panic log registered on cp1004 - https://phabricator.wikimedia.org/T131830#2179913 (elukey)
[08:47:20] Operations, Traffic: Varnish 4 panic log registered on cp1004 - https://phabricator.wikimedia.org/T131830#2179925 (elukey)
[08:48:34] Operations, Traffic: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2179913 (elukey)
[09:09:10] (CR) Muehlenhoff: [C: 1] "Looks good to me" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/281418 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[09:10:10] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 667
[09:17:45] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/281411 (https://phabricator.wikimedia.org/T126310) (owner: Giuseppe Lavagetto)
[09:20:10] RECOVERY - check_mysql on lutetium is OK: Uptime: 1623830 Threads: 1 Questions: 13508299 Slow queries: 9904 Opens: 102078 Flush tables: 2 Open tables: 64 Queries per second avg: 8.318 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:31:03] Operations, Traffic: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2179963 (ema) Relevant logs as displayed by journalctl: https://phabricator.wikimedia.org/P2855
[09:31:21] Operations, Traffic: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2179964 (ema) p:Triage>High
[09:33:11] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail
[09:34:55] Operations, ops-eqiad, DC-Ops: Broken disk on aqs1001.eqiad.wmnet - https://phabricator.wikimedia.org/T130816#2147470 (faidon) Ping?
[09:51:12] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file - https://phabricator.wikimedia.org/T131832#2179992 (Steinsplitter)
[09:51:36] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file - https://phabricator.wikimedia.org/T131832#2180006 (Steinsplitter) p:Triage>High
[09:53:48] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file - https://phabricator.wikimedia.org/T131832#2180008 (Steinsplitter) The same for: Errors: ``` 503 Service Temporarily Unavailable ``` ``` [1e984968e61432e4f88d9f01] 2016-04-05 09:51:55: Fatal exception of type MWExce...
[09:55:09] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file - https://phabricator.wikimedia.org/T131832#2180011 (Steinsplitter) p:High>Triage
[09:57:11] (CR) Nemo bis: "The commit message is accurate and the rationale is explained in the phabricator task. Please revoke your inaccurate CR." (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: Nemo bis)
[10:00:30] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:02:38] (PS1) Filippo Giunchedi: diamond: send production traffic via graphite line protocol [puppet] - https://gerrit.wikimedia.org/r/281622 (https://phabricator.wikimedia.org/T121861)
[10:11:07] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file - https://phabricator.wikimedia.org/T131832#2179992 (Pokefan95) Looks like the file is successfully restored now. Maybe it failed because it has a very large file size (2.76 GB).
[10:11:58] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2180021 (Pokefan95)
[10:15:14] (CR) Filippo Giunchedi: [C: -1] "shouldn't pose a problem with graphite disk space as some tcp metrics will change, though to be deployed on wed/thurs after services codfw" [puppet] - https://gerrit.wikimedia.org/r/281622 (https://phabricator.wikimedia.org/T121861) (owner: Filippo Giunchedi)
[10:15:55] Operations, Puppet, Jenkins: Run puppet unit tests as part of continuous integration - https://phabricator.wikimedia.org/T131833#2180022 (Gehel)
[10:16:12] <_joe_> !log all hhvm extensions uploaded for debian jessie as well
[10:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:19:34] Operations, Attribution-Generator, Commons, TCB-Team: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2180034 (Steinsplitter) yes, it has taken a while. I pressed the deletion button accidentally yesterday, then noticed and closed the browser. Th...
[10:22:57] Operations, MediaWiki-General-or-Unknown, HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2180038 (Joe) I had to fix the debian/control files of a couple of packages, but now they're built and uploaded to reprepro. B...
[10:25:02] Operations, Traffic: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2180044 (ema) Backtrace including symbols missing from the logs: 0x433ea5: varnishd() [0x433ea5] <- pan_ic + 357 0x4311e4: varnishd() [0x4311e4] <- obj_getmethods + 84 0x43283e: varnishd(ObjGet...
[10:30:59] Operations, Puppet, Jenkins: Run puppet unit tests as part of continuous integration - https://phabricator.wikimedia.org/T131833#2180051 (Gehel)
[10:31:02] Operations, Continuous-Integration-Infrastructure: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2180052 (Gehel)
[10:33:31] PROBLEM - DPKG on gallium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:33:38] (PS3) Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451)
[10:33:50] Operations, Continuous-Integration-Infrastructure: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2180065 (Gehel) The job used to run those rspec was disabled recently in https://gerrit.wikimedia.org/r/#/c/179244/. The issue was that different modules used...
[10:34:49] (CR) jenkins-bot: [V: -1] graphite: add 'big_users' route and cluster [puppet] - https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[10:39:00] RECOVERY - DPKG on gallium is OK: All packages OK
[10:39:25] gallium errors are from me
[10:44:21] PROBLEM - DPKG on gallium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:46:11] RECOVERY - DPKG on gallium is OK: All packages OK
[10:46:28] (PS1) Giuseppe Lavagetto: cache::text: route traffic for restbase, citoid, cxserver to codfw [puppet] - https://gerrit.wikimedia.org/r/281626
[10:50:40] (PS1) Giuseppe Lavagetto: Make MediaWiki call the codfw restbase from all datacenters [mediawiki-config] - https://gerrit.wikimedia.org/r/281628
[10:51:16] Operations, Continuous-Integration-Infrastructure: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2180095 (hashar) We gave it a try a while ago. @Gehel proposed to have some kind of hacking sprint to revisit running the Puppet module tests easily. At first...
[10:51:18] <_joe_> godog: ^^
[10:52:23] Operations, Continuous-Integration-Infrastructure: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2180096 (hashar) A change that got abandoned which might be of interest https://gerrit.wikimedia.org/r/#/c/178810/ . That one is to run unit tests for the vari...
[10:56:13] _joe_: ack, thanks I'll take a look
[10:56:44] (Restored) Hashar: Basic rspec setup [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: Hashar)
[11:00:41] (PS1) Muehlenhoff: Enable ChallengeResponseAuthentication [puppet] - https://gerrit.wikimedia.org/r/281629
[11:00:43] (PS1) Muehlenhoff: WIP: Use Yubico OTPs as a second authentication factor for members of the yubiauth group [puppet] - https://gerrit.wikimedia.org/r/281630
[11:02:09] (PS9) Hashar: Basic rspec setup [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342)
[11:03:16] (CR) jenkins-bot: [V: -1] Basic rspec setup [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: Hashar)
[11:03:20] (CR) Hashar: "Rebased. rspec-puppet 2+ is now published on ruby gems: https://rubygems.org/gems/rspec-puppet so the Gemfile no more installs from githu" [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: Hashar)
[11:03:41] (PS4) Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451)
[11:03:43] (PS1) Filippo Giunchedi: graphite: add cluster_servers graphite-web setting [puppet] - https://gerrit.wikimedia.org/r/281631 (https://phabricator.wikimedia.org/T85451)
[11:05:00] (CR) jenkins-bot: [V: -1] graphite: add 'big_users' route and cluster [puppet] - https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[11:05:14] (CR) jenkins-bot: [V: -1] graphite: add cluster_servers graphite-web setting [puppet] - https://gerrit.wikimedia.org/r/281631 (https://phabricator.wikimedia.org/T85451) (owner: Filippo Giunchedi)
[11:13:06] (PS10) Hashar: Basic rspec setup [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342)
[11:13:34] (CR) Hashar: "Replaced rspec 'should' with expect(subject).to." [puppet] - https://gerrit.wikimedia.org/r/178810 (https://phabricator.wikimedia.org/T78342) (owner: Hashar)
[11:15:16] Operations, Continuous-Integration-Infrastructure: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#2180130 (hashar) I have reopened/rebased/fixed https://gerrit.wikimedia.org/r/#/c/178810/ which introduces a very basic spec for wmflib.os_version(). Can be in...
[11:37:56] Operations, MediaWiki-General-or-Unknown, HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2180135 (Joe) Open>Resolved
[11:37:58] Operations, MediaWiki-General-or-Unknown, HHVM: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2180136 (Joe)
[11:38:45] Operations, MediaWiki-General-or-Unknown, HHVM: Backport HHVM from sid => jessie and build all of our extensions for jessie as well - https://phabricator.wikimedia.org/T131755#2177353 (Joe) I performed some smoke tests to verify our extensions worked correctly. Please note our puppet module for hhvm is...
[11:40:00] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:40:07] (PS1) BBlack: Bugfix for wmf nukelru patch [debs/varnish4] (debian-wmf) - https://gerrit.wikimedia.org/r/281634 (https://phabricator.wikimedia.org/T131830)
[11:40:10] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:40:11] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:40:11] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:41:07] Operations, Traffic, Patch-For-Review: Varnish 4 panic log registered on cp1044 - https://phabricator.wikimedia.org/T131830#2180142 (BBlack) ^ This should fix the panic, needs review -> package -> deploy
[11:41:50] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[11:42:57] <_joe_> seems like an outage from one of our remote services
[11:43:06] <_joe_> I'm checking
[11:44:02] <_joe_> yup everything is fine now
[11:45:20] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[11:45:21] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:50:50] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:51] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:50:51] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:52:31] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:52:31] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[11:52:40] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[11:57:51] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[11:58:10] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:58:10] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:58:11] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:59:51] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:59:51] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[12:00:00] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[12:05:31] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:32] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:12:40] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:14:30] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[12:16:01] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[12:16:01] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[12:25:30] (PS1) Muehlenhoff: Clarify role description [puppet] - https://gerrit.wikimedia.org/r/281636
[12:26:51] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:27:30] Operations, CirrusSearch, Discovery, Discovery-Search-Backlog, and 5 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2180218 (Gehel)
[12:27:33] Operations, CirrusSearch, Discovery, Discovery-Search-Backlog, and 4 others: Activate SSL + connection pooling for CirrusSearch on PROD - https://phabricator.wikimedia.org/T131839#2180204 (Gehel)
[12:32:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:32:41] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:35:31] (CR) Muehlenhoff: [C: 2 V: 2] Clarify role description [puppet] - https://gerrit.wikimedia.org/r/281636 (owner: Muehlenhoff)
[12:36:01] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:38:01] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[12:41:30] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[12:43:21] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:45:47] (PS1) Merlijn van Deen: toollabs: install virtualenvwrapper [puppet] - https://gerrit.wikimedia.org/r/281639 (https://phabricator.wikimedia.org/T131840)
[12:45:49] (PS1) Matthias Mullie: Add Flow dumps schema [mediawiki-config] - https://gerrit.wikimedia.org/r/281640
[12:45:59] (PS2) Merlijn van Deen: toollabs: install virtualenvwrapper [puppet] - https://gerrit.wikimedia.org/r/281639 (https://phabricator.wikimedia.org/T131840)
[12:46:09] (PS3) Merlijn van Deen: toollabs: install virtualenvwrapper [puppet] - https://gerrit.wikimedia.org/r/281639 (https://phabricator.wikimedia.org/T131840)
[12:46:51] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:47:39] * urandom waves at _joe_ and godog
[12:50:32] Operations, Traffic, Patch-For-Review, Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2180261 (BBlack) So I've spent a few days pondering all of this. There are definitely some improvements we could make to cache_upload's way of doing...
[12:51:31] 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#2007434 (10BBlack) [12:51:44] 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: Solve large-object/stream/pass/chunked in our shared VCL - https://phabricator.wikimedia.org/T131761#2180266 (10BBlack) [12:51:46] 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: Convert misc cluster to Varnish 4 - https://phabricator.wikimedia.org/T131501#2180267 (10BBlack) [12:51:48] 6Operations, 10Traffic, 13Patch-For-Review, 7Varnish: cache_misc's misc_fetch_large_objects has issues - https://phabricator.wikimedia.org/T128813#2180265 (10BBlack) [12:52:10] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [12:52:20] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [12:52:21] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [12:53:40] I would really appreciate if anyone review these patches of mine in puppet: https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+owner:%22Ladsgroup+%253Cladsgroup%2540gmail.com%253E%22,n,z [12:54:27] all of ores patches except the one for staging role are already cherry-picked in beta cluster and they work very well [12:55:41] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [12:57:32] (03PS2) 10Matthias Mullie: Add Flow dumps schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281640 (https://phabricator.wikimedia.org/T112799) [12:58:34] (03PS1) 10ArielGlenn: change installer for snapshot 1005-1007 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/281642 [12:58:56] 6Operations, 6Commons: Unable to restore file that has a very large file size - https://phabricator.wikimedia.org/T131832#2180279 (10Tobi_WMDE_SW) Unrelated [12:59:24] hey urandom [13:00:15] (03CR) 10ArielGlenn: [C: 032] change installer for snapshot 1005-1007 to trusty [puppet] - 
10https://gerrit.wikimedia.org/r/281642 (owner: 10ArielGlenn) [13:00:44] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2178301 (10mark) A redundant setup was actually suggested during the initial setup of Phabricator, but decided against because: # Phabricator didn't actually have good s... [13:00:47] trusty? [13:00:56] because of hhvm? [13:10:11] (03CR) 10Ema: [C: 032 V: 032] Bugfix for wmf nukelru patch [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/281634 (https://phabricator.wikimedia.org/T131830) (owner: 10BBlack) [13:12:11] PROBLEM - Apache HTTP on mw1248 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [13:13:09] [1657097.483269] init: hhvm main process (2735) killed by SEGV signal ---^ [13:13:21] (03PS1) 10Ema: New WMF version: 4.1.2-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/281645 (https://phabricator.wikimedia.org/T131830) [13:13:25] anybody working on it? 
[13:13:37] otherwise I'll restart [13:13:52] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.050 second response time [13:14:16] or for example I could use 'last' and check [13:14:41] !log restarted hhvm on mw1248 (hhvm was segfaulting) [13:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:35] (03CR) 10BBlack: [C: 031] New WMF version: 4.1.2-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/281645 (https://phabricator.wikimedia.org/T131830) (owner: 10Ema) [13:18:38] (03CR) 10Ema: [C: 032 V: 032] New WMF version: 4.1.2-1wm2 [debs/varnish4] (debian-wmf) - 10https://gerrit.wikimedia.org/r/281645 (https://phabricator.wikimedia.org/T131830) (owner: 10Ema) [13:21:38] Phabricator has a problem: the Search querys are throwing 504 [13:21:43] (03PS1) 10Rillke: Add UploadsLink to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281646 (https://phabricator.wikimedia.org/T130018) [13:22:37] some here who can take a look? [13:22:44] Seems to work again [13:23:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [13:24:30] But really slow [13:26:46] 6Operations, 10Phabricator: Phabricator search querys are actually very slow - https://phabricator.wikimedia.org/T131843#2180340 (10Luke081515) [13:29:35] !log re-installing snapshot1007 with trusy [13:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:33:47] 6Operations, 10Analytics: Upgrade aqs* to nodejs 4.3 - https://phabricator.wikimedia.org/T123629#2180383 (10elukey) The above procedure caused (sigh) a small outage yesterday when I tried to update aqs1001. The main problem happened right after stopping cassandra, triggering the following failure: ``` com.goo... 
[13:33:54] (03PS4) 10Rush: toollabs: install virtualenvwrapper [puppet] - 10https://gerrit.wikimedia.org/r/281639 (https://phabricator.wikimedia.org/T131840) (owner: 10Merlijn van Deen) [13:35:16] (03CR) 10Rush: [C: 032] toollabs: install virtualenvwrapper [puppet] - 10https://gerrit.wikimedia.org/r/281639 (https://phabricator.wikimedia.org/T131840) (owner: 10Merlijn van Deen) [13:35:21] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: Puppet has 1 failures [13:40:09] !log depooling cp1044 (maps) for varnish upgrade to 4.1.2-1wm2. Bug: T131830 [13:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:41] (03CR) 10BBlack: [C: 031] "Works in the compiler as intended, commitmsg update inbound though" [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) (owner: 10BBlack) [13:45:40] godog: should we preemptively throttle the 2004-b bootstrap? [13:46:53] (03PS1) 10Rillke: Set up UploadsLink on beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281651 (https://phabricator.wikimedia.org/T131844) [13:47:00] urandom: yeah, like halve the speed, 2MB/s [13:47:07] ok [13:49:24] !log Throttling outbound stream throughput to 15Mbps on restbase2004-a.codfw.wmnet and restbase2003.codfw.wmnet : T95253 [13:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:23] 6Operations, 6Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2180416 (10mark) @RobH: please request quotes for 2x 4 servers as outlined above. Thanks! [13:52:03] !log repooling cp1044 (maps) after varnish upgrade to 4.1.2-1wm2. 
Bug: T131830 [13:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:52:14] (03CR) 10Filippo Giunchedi: [C: 031] cache::text: route traffic for restbase, citoid, cxserver to codfw [puppet] - 10https://gerrit.wikimedia.org/r/281626 (owner: 10Giuseppe Lavagetto) [13:52:20] (03CR) 10Filippo Giunchedi: [C: 031] Make MediaWiki call the codfw restbase from all datacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281628 (owner: 10Giuseppe Lavagetto) [13:52:21] <_joe_> so, should we go? [13:52:55] in 8 minutes! [13:54:36] (03PS2) 10Giuseppe Lavagetto: cache::text: route traffic for restbase, citoid, cxserver to codfw [puppet] - 10https://gerrit.wikimedia.org/r/281626 [13:58:16] !log depooling cp1043 (maps) for varnish upgrade to 4.1.2-1wm2. Bug: T131830 [13:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:21] (03PS3) 10BBlack: VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) [13:58:57] <_joe_> bblack, ema heads-up we're switching restbase, citoid and cxserver to work through codfw [13:59:21] <_joe_> godog, urandom green light? [13:59:28] (03CR) 10Giuseppe Lavagetto: [C: 032] cache::text: route traffic for restbase, citoid, cxserver to codfw [puppet] - 10https://gerrit.wikimedia.org/r/281626 (owner: 10Giuseppe Lavagetto) [13:59:30] +1 [13:59:40] <_joe_> ok, submitting then [13:59:56] * urandom crosses fingers [14:00:02] <_joe_> oh come on [14:00:05] :) [14:00:08] <_joe_> last time this part was a breeze [14:00:27] ok ok [14:00:55] let's do it! 
[14:00:58] yeah that part's easy, just need to puppet the text caches in eqiad to take effect [14:01:12] <_joe_> I'm starting the puppet run on the eqiad caches [14:01:36] TIL "to puppet" [14:01:37] you could mod cache::route_table too, but IMHO that's already been tested at the varnish layer with our swift testing, and it's more-complicated and not necessary here... [14:02:21] <_joe_> bblack: we already tested it, it works pretty well :) [14:02:28] ok :) [14:03:09] yeah it's just scarier, and IMHO not really necessary to test that the applayer switching works ok [14:03:36] (scarier in that it's possible to create loops with serial changes there which aren't fully in effect before the next one hits) [14:04:05] <_joe_> uhm puppet is being slow, I remembered it being faster on the caches [14:05:08] godog: fyi, it's going to take a bit for the outbound throttling of 2003 to take effect [14:05:17] it's still @ 4MB/s [14:05:25] ok! [14:05:36] (if you remember, it has to finish the file it's on before it takes effect) [14:05:50] !log repooling cp1043 (maps) after varnish upgrade to 4.1.2-1wm2. Bug: T131830 [14:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:11] <_joe_> I see traffic flowing through codfw now [14:06:35] yup: https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-client-requests?from=1459821983442&to=1459865183443&panelId=32&fullscreen&var-datacenter=2&var-node=All&var-quantile=99percentile [14:07:46] <_joe_> so, let's wait for the things to be settled and we can think of switching over the mediawiki traffic as well [14:08:22] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:08:45] _joe_: SGTM [14:09:01] (03CR) 10Ottomata: "Cool, looks good. Missing doc about the args[0] as full cluster name though. 
Add that and +1 from me, and then I think we can merge this" [puppet] - 10https://gerrit.wikimedia.org/r/279280 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [14:10:55] <_joe_> !log external traffic for restbase, citoid, cxserver fully switched to codfw [14:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:15] <_joe_> external traffic reaching parsoid is almost below detectability [14:14:30] <_joe_> godog, urandom should we make The Big Switch now? [14:14:47] (03CR) 10Ottomata: Hieraize keyholder::agent configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [14:14:50] <_joe_> which would be, let's point the mediawiki cannon at restbase/parsoid in codfw [14:15:12] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:16:03] _joe_: yup, another 10/15 min perhaps so it'll be clear from the graphs [14:16:42] (03PS1) 10ArielGlenn: add hiera config settings for old snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/281659 [14:17:11] <_joe_> godog: cool [14:17:32] yeah, +1 to waiting for a few metrics collection cycles [14:17:49] 6Operations, 6Performance-Team, 10Traffic, 13Patch-For-Review: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2180445 (10BBlack) re: nginx upstream+debian: debian's "master" branch is still at 1.9.10-1, but their "dyn" branch has work beyond that up through 1.9.13 and not yet released, which als... [14:26:47] <_joe_> so, latencies don't seem that bad in absolute terms [14:27:02] <_joe_> given we have a full round-trip for every mediawiki request [14:27:30] _joe_: what are you using as a measure? 
[14:28:15] <_joe_> https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-cf-performance [14:28:21] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: Puppet has 2 failures [14:29:16] <_joe_> interestingly, the 3-instances machine is underperforming, apparently [14:29:29] _joe_: that's not incurring any cross-dc latency in that graph [14:29:46] <_joe_> urandom: how can that be? [14:29:57] (03PS2) 10ArielGlenn: add hiera config settings for old snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/281659 [14:30:01] <_joe_> if parsoid gets a request, it calls mediawiki in eqiad [14:30:08] <_joe_> and rb does the same on its side [14:30:15] <_joe_> whenever it needs to call the mw api [14:30:16] apergos: hey -- you're doing tons of refactoring work on all those snapshot stuff [14:30:22] but that graph is of Cassandra [14:30:32] apergos: perhaps you shouldn't be self-merging these right away and get a few reviews [14:30:32] <_joe_> just cassandra? [14:30:34] it's CF latency according to Cassandra [14:30:36] 6Operations, 10Traffic, 7Varnish: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180463 (10BBlack) Noted while investigating other things: https://www.varnish-cache.org/trac/ticket/1643 - the ticket is still open, but it's not clear to me whether it's only an open issue in 4... [14:30:44] apergos: at least wait a day or two maybe? :) [14:30:58] paravoid: I'm just trying to get the old snaps to where I can enable puppet on them [14:31:00] again [14:31:03] i.e. shortly [14:31:31] <_joe_> urandom: meh I got confused [14:31:44] and yes I do absolutely want comments as soon as I have things in a stable (not broken) state [14:31:58] _joe_: i'm looking at https://grafana.wikimedia.org/dashboard/db/restbase, which is from the perspective of restbase [14:32:17] i.e. 
if Cassandra were slow, i'd expect to see it in the latencies there [14:32:52] but the latencies from codfw will be coming from codfw, so i don't expect them to be higher, unless it's due to something going on there [14:33:04] i.e. that's not incurring inter-datacenter latency either [14:33:09] <_joe_> ok [14:33:17] $ git log --oneline --since="1 month ago" modules/snapshot/ |wc -l [14:33:17] 48 [14:33:38] that's quite substantial work to be self-merged [14:33:52] _joe_: but those all look good to me so far [14:35:04] _joe_: this one will give you Cassandra rate and latency from a (more or less) client perspective: https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-client-requests?from=1459780463520&to=1459866863520&var-datacenter=2&var-node=All&var-quantile=99percentile [14:35:35] apergos: aye, i'm noticing these too because of this rebase conflict [14:35:36] i; [14:35:37] need to fix that date range [14:35:39] i'd be happy to help review [14:35:47] this new stuff [14:35:51] _joe_: FWIW anytime is good to switch the rest (pun intented) [14:35:54] but its harder to make changes after the stuff is applied sometimes [14:35:57] https://grafana.wikimedia.org/dashboard/db/restbase-cassandra-client-requests?from=now-24h&var-datacenter=2&var-node=All&var-quantile=99percentile [14:36:01] _joe_: ^^ [14:36:04] so even though it takes a little longer, its better to get review up front [14:36:50] <_joe_> cool, it doesn't look bad [14:37:05] <_joe_> urandom: let's see how it handles the big guns :P [14:37:26] _joe_: are you going to point the mediawiki canon, or cannon, at it? 
[14:37:32] (03CR) 10Giuseppe Lavagetto: [C: 032] Make MediaWiki call the codfw restbase from all datacenters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281628 (owner: 10Giuseppe Lavagetto) [14:37:37] <_joe_> yeah [14:38:00] my lame joke was wasted [14:38:17] <_joe_> :) [14:38:18] apergos: so, could you start asking/waiting for some reviews for your refactoring work (at least once you're done with bringing those old snaphost hosts back to running puppet?) [14:38:27] yes that's fine [14:38:34] thanks :) [14:38:41] feel free to add me as a reviewer [14:38:51] and poke me :) [14:38:53] sorry, I've been trying to beat a moving deadline and failing [14:38:54] sure [14:39:29] <_joe_> deploy started [14:40:04] !log oblivian@tin Synchronized wmf-config/ProductionServices.php: make mediawiki talk to codfw restbase only (duration: 00m 47s) [14:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:40:10] (03CR) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [14:40:10] PROBLEM - HHVM rendering on mw1250 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.010 second response time [14:40:44] ja apergos add me as reviewer too and we can make them move faster [14:42:01] RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 64559 bytes in 0.092 second response time [14:42:56] (03PS21) 10Ottomata: Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [14:43:00] <_joe_> urandom: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=Restbase+codfw&m=cpu_report&s=by+name&mc=2&g=network_report [14:43:13] thar she blows [14:43:28] <_joe_> that's where I got my "90%" figure last time :P [14:43:32] PROBLEM - 
Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0] [14:45:50] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [5000000.0] [14:46:18] 6Operations, 10Traffic, 7Varnish: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180493 (10ema) >>! In T131502#2180463, @BBlack wrote: > Noted while investigating other things: https://www.varnish-cache.org/trac/ticket/1643 - the ticket is still open, but it's not clear to m... [14:52:54] _joe_: Seeing some 500s: https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase [14:54:42] (03PS22) 10Ottomata: Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [14:56:31] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:59:55] CI weekly chat https://plus.google.com/hangouts/_/wikimedia.org/btest-ci-weekly [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160405T1500). [15:00:04] Urbanecm: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:02:32] (03CR) 10ArielGlenn: [C: 032] add hiera config settings for old snapshot hosts [puppet] - 10https://gerrit.wikimedia.org/r/281659 (owner: 10ArielGlenn) [15:03:51] hmm, no Urbanecm. I'm around to SWAT if needed. 
[15:04:51] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet last ran 19 hours ago [15:06:50] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:07:51] (03PS2) 10Filippo Giunchedi: graphite: add cluster_servers graphite-web setting [puppet] - 10https://gerrit.wikimedia.org/r/281631 (https://phabricator.wikimedia.org/T85451) [15:07:53] (03PS5) 10Filippo Giunchedi: graphite: add 'big_users' route and cluster [puppet] - 10https://gerrit.wikimedia.org/r/277490 (https://phabricator.wikimedia.org/T85451) [15:11:10] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: puppet fail [15:13:58] 6Operations, 10Traffic, 7Varnish: Convert upload cluster to Varnish 4 - https://phabricator.wikimedia.org/T131502#2180630 (10ema) I've tried to reproduce the corrupt range response issue on 4.0.3 without success. Instead, I've encountered a different problem: $ curl -v -H 'Range:bytes=0-' http://localhost... [15:15:12] (03CR) 10Dereckson: [C: 04-1] Add new namespaces and new aliases for newikibooks (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:21:45] (03PS1) 10ArielGlenn: swap classes on snapshot1002 for cron dump run job [puppet] - 10https://gerrit.wikimedia.org/r/281668 [15:22:46] (03PS1) 10Muehlenhoff: Ignore packages in deinstalled status [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/281669 [15:35:51] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:38:09] o/ _joe_ [15:38:23] <_joe_> halfak: uh, what's up? :) [15:38:27] In revscoring, we're blocked on a few puppet patches that akosiaris hasn't been able to get to yet [15:38:40] Would it be OK if we flagged you as a reviewer on a few of them? 
[15:39:18] <_joe_> halfak: of course, but I cannot make promises on when I can look; I'd also still like alex to have a final say [15:39:39] 6Operations, 6Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2180679 (10Gehel) To consolidate infos. >>! In T125126#1991742, @Yurik wrote: > Ok, seems some confusion has been clarified. The maps team immediate need to launch maps for all wiki pr... [15:39:42] OK. We'll add you to reviewers and see how it goes. [15:39:51] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [15:40:03] In the future, should we ask first like this or just add you as a reviewer? [15:42:58] (03PS23) 10Ottomata: Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [15:43:44] !log ms-be1019 to weight 3500 - T116842 [15:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:28] (03PS19) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [15:45:31] _joe_, ^ [15:46:44] _joe_ or whoever knows about ganglia reporting ... looks like network graphs in codfw needs tweaking ... not sure why it has been showing a baseline 5-10m "in" traffic over the last month ... see https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Parsoid%20codfw&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1459871119&g=network_report&z=large [15:49:54] (03PS2) 10ArielGlenn: swap classes on snapshot1002 for cron dump run job [puppet] - 10https://gerrit.wikimedia.org/r/281668 [15:51:38] <_joe_> subbu: what's wrong with that? 
[15:51:47] <_joe_> it's puppet and monitoring [15:52:47] <_joe_> the actual number of requests is pretty low nowadays [15:53:01] <_joe_> but yeah it seems a bit excessive [15:53:08] _joe_, 10mb a sec! :) [15:53:17] <_joe_> over 20 machines [15:53:18] (03PS4) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) [15:53:37] <_joe_> so it's like 500kb/sec [15:53:38] i don't think codfw has been active at all except during switchovers like few weeks back and today. [15:53:42] <_joe_> which is still too much [15:53:43] (03CR) 10jenkins-bot: [V: 04-1] Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [15:53:44] right. [15:56:46] (03PS3) 10ArielGlenn: swap classes on snapshot1002 for cron dump run job [puppet] - 10https://gerrit.wikimedia.org/r/281668 [15:58:19] (03PS5) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) [16:00:04] godog moritzm: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160405T1600). [16:01:22] (03CR) 10ArielGlenn: [C: 032] swap classes on snapshot1002 for cron dump run job [puppet] - 10https://gerrit.wikimedia.org/r/281668 (owner: 10ArielGlenn) [16:01:35] bblack, would there be any objections to enable wikipedia for the maps, if we only allow the <maplink>, not <mapframe>? This is identical to the geohack usage, so no extra load [16:01:53] a lot of users have been asking about this [16:02:00] esp during hackathon [16:02:35] yo, no puppetswat patches? ping me if needed [16:02:50] yurik: is there some additional testing this enables? [16:03:03] bblack, ??
what do you mean [16:03:23] I mean any driver for pushing it broader would be to test additional things we're not testing now, IMHO [16:03:41] we don't want to spread it further on user demand for user use until we're ready to put it in production? [16:04:28] bblack, it will basically be allowing for replacement of the wmflabs' geohack. We won't be taking down geohack, so in the worst case, we can simply switch back to it [16:04:32] example - https://www.mediawiki.org/wiki/Help:Extension:Kartographer#.3Cmaplink.3E [16:05:01] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:05:26] bblack, doing this will allow us to meet user's asks for better functionality - so they can start testing this feature and tell us what needs improvement [16:06:01] bblack, thanks to this tactics, we are already getting tons of good feedback - https://www.mediawiki.org/wiki/Talk:Maps [16:06:49] great [16:06:57] (03PS1) 10ArielGlenn: use new cron job class for dumps on snapshot 1001 and 1004 also [puppet] - 10https://gerrit.wikimedia.org/r/281677 [16:07:06] but still, I don't see why we're trying to spread this to a broader audience if we're not production-ready [16:09:14] afaik we're still basically stopped up on https://phabricator.wikimedia.org/T125126 unresolved [16:10:57] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#1970805 (10Nuria) We are having problems with AQS nodes related to lack of SSDs: our iowait is going real high, to the point that we think might be affecting loadi... [16:11:35] bblack, if you are referring to the extra servers - having <maplink> won't require any new hardware. On the other hand, in my experience with maps & community, it takes a very long time for the community to agree on the new feature and figure out the blockers.
We have had kartographer on wikivoyage for a month now, and only now we are getting some active feedback and feature requests. It will be even longer for WP. But on the other hand, [16:11:35] enabling it allows advanced users to experiment with it in the context of their own wiki, and will get us much better feedback. [16:12:27] (03CR) 10BBlack: [C: 031] Remove puppet/recursor0/recursor1.esams CNAMEs [dns] - 10https://gerrit.wikimedia.org/r/281117 (owner: 10Faidon Liambotis) [16:12:55] (03CR) 10ArielGlenn: [C: 032] use new cron job class for dumps on snapshot 1001 and 1004 also [puppet] - 10https://gerrit.wikimedia.org/r/281677 (owner: 10ArielGlenn) [16:13:51] yurik: I don't know how many more times or ways I can say it: it's not about cpu load or i/o, it's about having a reliable production system in place for the long term, that we can work with operationally. what you're on now is crappy old loaner to-be-decommed hardware, not a real production config. [16:14:22] turning on more maps.wm.o features for a much broader audience seems like something that should wait on that getting resolved and having real production hardware in place [16:15:11] afaik it's not resolved. I have work pending on that as well, and I haven't gotten any positive updates lately on unblocking it. [16:16:22] bblack, i heard you. My concern is that it will be months after we resolve the hardware until we start getting good feedback from the community. 
So having it on does not preclude us from disabling it, but it does give us valuable feedback [16:16:45] you generally can't take back what you give to users :) [16:17:06] if I had a better picture of what the state of the hardware thing was, I might know more and make a better decision [16:17:15] tfinc, ^ [16:17:50] all I know is, it's not resolved, and there doesn't seem to be any functional hard reason it can't be resolved, it's all issues up at the soft layers of management/planning/budgeting [16:18:20] bblack, judging by all the feedback i get from tfinc and toby, the hardware question is very near to being decided from all the soft layers :) [16:19:12] well, when it is, then we can work on putting this thing in real production config, and then I don't really care how much of it you turn on where for how many users :) [16:19:37] yurik: what's the question here? is it if the budget is approved to move forward on maps hardware ? [16:20:09] also please pull in gehel to support these conversations [16:20:37] i'm about to commute in so let me know what you need [16:21:03] (03PS1) 10ArielGlenn: fix wrong owner for cron dumps script, should be root [puppet] - 10https://gerrit.wikimedia.org/r/281679 [16:21:45] tfinc, i will email the conv to get you up to speed. Basically trying to work with community on to replace geohack while hardware question is being solved [16:23:18] got it. budget wise were approved to move forward for 2015/2016 and no red flags for 2016/2017 so there isn't an existing issue when it comes to CAPEX [16:23:22] ok. 
heading in [16:23:40] yurik: and don't forget your expense report :) [16:23:43] bblack, ^ [16:23:56] (not the expense report portion :)))) [16:24:44] (03CR) 10ArielGlenn: [C: 032] fix wrong owner for cron dumps script, should be root [puppet] - 10https://gerrit.wikimedia.org/r/281679 (owner: 10ArielGlenn) [16:25:02] yurik: I'd still like to hear from mark on that first, there's been a lot of back and forth on this over a period of time now, and I'm not directly involved. [16:25:17] so would i :) [16:25:42] I don't know if the language above about budgets actually meets whatever he's waiting on to release hardware or whatever [16:26:29] bblack, i would love to organize a facetoface about this with everyone involved. My understanding is that tfinc & erik are both waiting for mark [16:26:46] no need for a facetoface [16:26:53] well, he's a busy guy! [16:26:54] yei, mark is here [16:26:57] oh and here he is [16:26:59] mark is waiting on others [16:27:12] and I expect it to be resolved within the next few days [16:27:17] awesome!!!!!!!!!!!!! [16:27:52] i've also asked rob to start getting quotes for maps backends [16:28:00] bblack, i will add wmftest.net:8080 to the refs - apparently people are using that for vagrant testing now [16:28:19] (unrelated to the above) [16:28:41] uh [16:28:53] no idea why, but i think vagrant does something magical for it [16:28:56] I thought wmftest was about perf testing from AWS? [16:29:00] yurik / tfink: sorry I'm late here ... [16:29:19] gehel, no worries, just wanted to keep you in the loop [16:29:32] I just saw the update from mark for the maps backend H/W [16:30:49] yurik: would you have time for some face time to make sure we are aligned? [16:30:57] gehel, sure thing [16:31:05] bblack, our vagrant has that URL everywhere now [16:31:12] now? lemme call you... [16:31:23] gehel, ok [16:31:28] yurik: actually, no give me 10' to sort my webcam issue... 
[16:31:32] sure [16:33:11] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:04] bblack: wmftest just points back to localhost, was setup for cross-domain simulation with vagrant [16:36:19] (03PS1) 10Yurik: Allow maps usage from vagrant [puppet] - 10https://gerrit.wikimedia.org/r/281681 [16:36:24] bblack, ^ [16:36:56] hmmm ok, yeah I see the older *.local there [16:37:06] wpt.wmftest.net is the one that was AWS-based [16:37:42] (03CR) 10BBlack: [C: 032] Allow maps usage from vagrant [puppet] - 10https://gerrit.wikimedia.org/r/281681 (owner: 10Yurik) [16:38:22] thx! [16:39:27] (03PS1) 10Andrew Bogott: In our libvirt hack, rename libvirt_images_type to images_type [puppet] - 10https://gerrit.wikimedia.org/r/281683 (https://phabricator.wikimedia.org/T131322) [16:40:25] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2180890 (10mmodell) p:5Triage>3Normal @mark: yes, both have changed since, I believe. Some specific points: 1. Phabricator has improved support for scaling and failove... [16:42:26] yurik: I should be good to go... [16:43:06] (03PS2) 10Andrew Bogott: In our libvirt hack, rename libvirt_images_type to images_type [puppet] - 10https://gerrit.wikimedia.org/r/281683 (https://phabricator.wikimedia.org/T131322) [16:44:12] all those free, spare servers that are coming out of nowhere :) [16:44:45] (03PS1) 10Cmjohnson: Changing installer to jesse for snapshot1005 for troubleshooting. 
[puppet] - 10https://gerrit.wikimedia.org/r/281686 [16:44:59] (03PS2) 10BBlack: text VCL: do_stream when creating hit-for-pass [puppet] - 10https://gerrit.wikimedia.org/r/281643 [16:45:07] (03CR) 10BBlack: [C: 032 V: 032] text VCL: do_stream when creating hit-for-pass [puppet] - 10https://gerrit.wikimedia.org/r/281643 (owner: 10BBlack) [16:46:03] (03PS4) 10BBlack: VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) [16:46:20] (03CR) 10BBlack: [C: 032 V: 032] VCL: remove all non-default between_bytes_timeout [puppet] - 10https://gerrit.wikimedia.org/r/281499 (https://phabricator.wikimedia.org/T131761) (owner: 10BBlack) [16:48:21] (03PS1) 10ArielGlenn: fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 [16:49:41] (03PS2) 10Cmjohnson: Changing installer to jesse for snapshot1005 for troubleshooting. [puppet] - 10https://gerrit.wikimedia.org/r/281686 [16:49:46] (03CR) 10jenkins-bot: [V: 04-1] fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 (owner: 10ArielGlenn) [16:50:07] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2180904 (10mmodell) New problem: Apparently jenkins can't access phabricator over ssh. [16:50:42] (03CR) 10Cmjohnson: [C: 032] Changing installer to jesse for snapshot1005 for troubleshooting. 
[puppet] - 10https://gerrit.wikimedia.org/r/281686 (owner: 10Cmjohnson) [16:52:00] (03PS2) 10ArielGlenn: fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 [16:53:11] (03PS3) 10ArielGlenn: fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 [16:54:57] (03CR) 10jenkins-bot: [V: 04-1] fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 (owner: 10ArielGlenn) [16:55:48] 6Operations, 6Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2180934 (10Gehel) Please only look at the backend side for the above comment. I will open another request for the varnish servers. [16:55:51] 6Operations, 10Phabricator: Phabricator search querys are actually very slow - https://phabricator.wikimedia.org/T131843#2180340 (10mmodell) That query matches a lot of documents. I can't think of any way that it would not be slow. [16:59:12] (03PS4) 10ArielGlenn: fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 [17:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160405T1700). Please do the needful. 
[17:00:36] no parsoid deploy [17:00:51] (03CR) 10jenkins-bot: [V: 04-1] fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 (owner: 10ArielGlenn) [17:02:13] (03PS5) 10ArielGlenn: fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 [17:03:07] 6Operations, 10Phabricator, 10hardware-requests: We need a backup phabricator front-end node - https://phabricator.wikimedia.org/T131775#2180958 (10chasemp) I'm pretty sure you mean LVS :) There have been a few tasks and outlines over time for this but the general threshold for pain I recall is: can we reb... [17:11:39] (03PS6) 10ArielGlenn: fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 [17:13:13] (03CR) 10ArielGlenn: [C: 032] fix repodir reference for monitor script for dumps [puppet] - 10https://gerrit.wikimedia.org/r/281687 (owner: 10ArielGlenn) [17:16:31] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:20:54] no kartographer deploy (i'm messing with maps2004, but not via depl) [17:22:44] anyone around who can look at a trusty install issue on these HP DL360s? It won't recognize the disks, tried hpsa driver (which is what runs on jessie), no good, jessie installs fine. in the meantime these systems are Canonical certified with 14.04 (trusty) [17:23:56] bd808: mh https://tools.wmflabs.org/sal seems to be broken, known? [17:24:32] 6Operations, 6Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2181011 (10Gehel) Summary of IRC discussion: * the current H/W for maps is not uniform. We have better specs on map-test2003 / 2004 than on 2001 / 2002. The better spec should be used a... [17:40:20] yurik: was there any followup after i logged out ? [17:46:36] hey ops, can someone hold my hand re: deploy versions briefly?
[17:46:55] tfinc, all's good, robh is working on the quote of the backend servers, mark should give us an update in a few days about varnish servers [17:47:04] great [17:47:42] a patch which was merged sat apr 2 (https://gerrit.wikimedia.org/r/247914) should start getting deployed today? what version # would that be? [17:48:29] seems like 1.27.0-wmf.20 ? but that branch hasn't been created yet? [17:49:22] 6Operations, 6Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2158431 (10RobH) >>! In T131180#2181011, @Gehel wrote: > Summary of IRC discussion: > > * the current H/W for maps is not uniform. We have better specs on map-test2003 / 2004 than on 20... [17:49:28] 6Operations, 6Discovery, 10Maps, 10hardware-requests: Maps back end hardware - https://phabricator.wikimedia.org/T131180#2181088 (10RobH) [17:54:03] (03PS1) 10MarcoAurelio: Adding museumcommons.wikimedia.nl on $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) [17:56:15] (03PS2) 10Yuvipanda: k8s: Add simple script for deploying master [puppet] - 10https://gerrit.wikimedia.org/r/281589 (https://phabricator.wikimedia.org/T130972) [17:56:17] (03CR) 10Urbanecm: "I fixed it." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [17:56:42] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Add simple script for deploying master [puppet] - 10https://gerrit.wikimedia.org/r/281589 (https://phabricator.wikimedia.org/T130972) (owner: 10Yuvipanda) [17:56:53] (03PS2) 10Yuvipanda: k8s: Simple script to deploy worker & proxy [puppet] - 10https://gerrit.wikimedia.org/r/281590 (https://phabricator.wikimedia.org/T130972) [17:56:59] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Simple script to deploy worker & proxy [puppet] - 10https://gerrit.wikimedia.org/r/281590 (https://phabricator.wikimedia.org/T130972) (owner: 10Yuvipanda) [17:58:35] marxarelli: you're doing the train deploy today? has the branch been cut yet? [17:59:03] cscott: yes and almost [17:59:08] about to start cutting now [17:59:32] marxarelli: ok, i just need to confirm that https://gerrit.wikimedia.org/r/247914 is going to be in 1.27.0-wmf.20 [17:59:54] cscott: yep! [17:59:58] marxarelli: Parsoid has a version-flag dependency (https://gerrit.wikimedia.org/r/280792) [18:00:04] ok, cool. 
[18:06:12] (03PS3) 10Yuvipanda: k8s: Stop using packages for master components [puppet] - 10https://gerrit.wikimedia.org/r/281586 (https://phabricator.wikimedia.org/T130972) [18:06:14] (03PS3) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [18:06:16] (03PS2) 10Yuvipanda: k8s: Switch to new format for ABAC [puppet] - 10https://gerrit.wikimedia.org/r/281591 (https://phabricator.wikimedia.org/T130972) [18:06:18] (03PS1) 10Yuvipanda: k8s: s/file/puppet/ [puppet] - 10https://gerrit.wikimedia.org/r/281704 [18:06:36] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: s/file/puppet/ [puppet] - 10https://gerrit.wikimedia.org/r/281704 (owner: 10Yuvipanda) [18:09:39] !log Creating new wmf/1.27.0-wmf.20 branch on tin [18:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:08] (03PS1) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [18:11:30] (03CR) 10jenkins-bot: [V: 04-1] ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [18:12:00] (03PS4) 10Yuvipanda: k8s: Stop using packages for master components [puppet] - 10https://gerrit.wikimedia.org/r/281586 (https://phabricator.wikimedia.org/T130972) [18:12:02] (03PS4) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [18:12:04] (03PS3) 10Yuvipanda: k8s: Switch to new format for ABAC [puppet] - 10https://gerrit.wikimedia.org/r/281591 (https://phabricator.wikimedia.org/T130972) [18:12:06] (03PS1) 10Yuvipanda: k8s: s/file/puppet/ again [puppet] - 10https://gerrit.wikimedia.org/r/281708 [18:13:51] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: s/file/puppet/ again [puppet] - 10https://gerrit.wikimedia.org/r/281708 (owner: 10Yuvipanda) [18:13:53] (03PS2) 10Madhuvishy: ci: Add 
gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [18:14:42] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#2181206 (10RobH) [18:14:51] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Switch to new format for ABAC [puppet] - 10https://gerrit.wikimedia.org/r/281591 (https://phabricator.wikimedia.org/T130972) (owner: 10Yuvipanda) [18:15:12] 6Operations, 10Analytics, 10hardware-requests, 13Patch-For-Review: eqiad: (3) AQS replacement nodes - https://phabricator.wikimedia.org/T124947#1970805 (10RobH) [18:17:10] (03CR) 10jenkins-bot: [V: 04-1] ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [18:19:37] (03PS3) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [18:20:34] (03PS1) 10ArielGlenn: try hw raid and suitable partman for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281709 [18:21:55] (03Abandoned) 10ArielGlenn: try hw raid and suitable partman for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281709 (owner: 10ArielGlenn) [18:24:26] (03PS1) 10ArielGlenn: try hw raid and suitable partman for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281712 [18:25:47] (03CR) 10ArielGlenn: [C: 032] try hw raid and suitable partman for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281712 (owner: 10ArielGlenn) [18:26:23] (03CR) 10Dereckson: [C: 04-1] "Technically looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281700 (https://phabricator.wikimedia.org/T131841) (owner: 10MarcoAurelio) [18:26:39] merging someone's kube stuff [18:26:45] YuviPanda: ? [18:27:08] ok to merge them on palladium? [18:27:13] apergos: ah, sure. sorry, forgot (labs automerges, so noop on palladium) [18:27:27] done! 
[18:30:44] 6Operations, 6Discovery, 10hardware-requests, 3Discovery-Search-Sprint: Relevance forge hardware - https://phabricator.wikimedia.org/T131184#2181262 (10Gehel) Systems are labs, so eqiad DC. [18:31:53] (03PS1) 10ArielGlenn: back to trusty installer for snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/281713 [18:31:57] (03CR) 10Dereckson: [C: 04-1] Add new namespaces and new aliases for newikibooks (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [18:32:17] (03PS4) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [18:36:13] (03CR) 10ArielGlenn: [C: 032] back to trusty installer for snapshot1005 [puppet] - 10https://gerrit.wikimedia.org/r/281713 (owner: 10ArielGlenn) [18:37:30] (03PS5) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [18:37:32] (03PS1) 10Yuvipanda: k8s: Set ownership and permissions of binaries properly [puppet] - 10https://gerrit.wikimedia.org/r/281715 [18:37:40] 6Operations, 6Discovery, 10hardware-requests, 3Discovery-Search-Sprint: Relevance forge hardware - https://phabricator.wikimedia.org/T131184#2158516 (10RobH) [18:38:00] 6Operations, 6Discovery, 10hardware-requests, 3Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2158516 (10RobH) [18:38:20] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Set ownership and permissions of binaries properly [puppet] - 10https://gerrit.wikimedia.org/r/281715 (owner: 10Yuvipanda) [18:41:00] (03PS6) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) [18:44:45] (03PS7) 10Urbanecm: Add new namespaces and new aliases for newikibooks [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) [18:46:09] (03CR) 10Urbanecm: "Fixed. Sorry for this typo." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281443 (https://phabricator.wikimedia.org/T131754) (owner: 10Urbanecm) [18:47:50] 6Operations, 6Discovery, 10hardware-requests, 3Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2158516 (10RobH) If labs instances need to route to them, it'll need to be in a labs support vlan. As for the hardware itself, we don't have any spare syst... [18:51:21] (03PS6) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [18:51:23] (03PS1) 10Yuvipanda: k8s: Fix abac template [puppet] - 10https://gerrit.wikimedia.org/r/281716 [18:52:13] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Fix abac template [puppet] - 10https://gerrit.wikimedia.org/r/281716 (owner: 10Yuvipanda) [18:53:18] (03PS7) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [18:53:20] (03PS1) 10Yuvipanda: k8s: Fix abac template again [puppet] - 10https://gerrit.wikimedia.org/r/281717 [18:53:49] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Fix abac template again [puppet] - 10https://gerrit.wikimedia.org/r/281717 (owner: 10Yuvipanda) [19:00:04] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160405T1900). Please do the needful. 
[19:01:05] (03PS1) 10Giuseppe Lavagetto: hhvm: watch extension packages from the service [puppet] - 10https://gerrit.wikimedia.org/r/281720 [19:01:07] (03PS1) 10Giuseppe Lavagetto: hhvm: parametrize directories [puppet] - 10https://gerrit.wikimedia.org/r/281721 [19:01:09] (03PS1) 10Giuseppe Lavagetto: hhvm: s/fcgi.ini/server.ini/ [puppet] - 10https://gerrit.wikimedia.org/r/281722 [19:07:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:11:21] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2181422 (10Ottomata) @Robh this was approved last week. It is hard for me to follow possible subtickets. Is this ordered? [19:12:24] ottomata: i can update you on that task now. it's approved, and i just didn't get it setup yet due to all the other pending orders! I'll create the task to allocate and set it up now. [19:12:48] ok awesome! [19:13:10] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [19:13:15] ottomata: preferred hostname? [19:13:27] or just element? [19:13:51] uh yeahHhhh [19:13:55] oo gotta think about this one [19:13:57] hm. [19:14:06] we might want to name it analytics* something...but also stat* something... [19:14:06] not trolling, i ran out of fucks about hostnames awhile ago, as long as it's on the infrastructure naming page is all i care about. [19:14:12] heh [19:14:25] robh it will be a stat like box, in that lots of people will log into it [19:14:38] but, it will be for using the Analytics (hadoop) Cluster [19:14:45] ok, so stat1004?
[19:14:48] right now stat1002 is used for this, as well as a LOT of other things [19:14:53] heh [19:15:08] yeah, hmmm, probably stat1004 [19:15:14] that will be the least confusing to users [19:15:26] although probably more so to opsen :p [19:15:33] every time someone asks for hadoop access [19:15:34] yeehaw! [19:15:35] it seems like the analytics/stat cluster could do an overhaul on naming standards [19:15:38] indeed [19:15:44] analytics cluster is ok [19:15:53] but that is likely a huge project and not time to do it right now, heh [19:15:55] stat* naming has no reason [19:16:13] thus far we don't give users shell access to analytics boxes [19:16:18] so, let's not call it that [19:16:24] stat1004 is the most consistent and sane right now i think [19:16:42] (03Abandoned) 10BBlack: dnsrecursor: add localhost data [puppet] - 10https://gerrit.wikimedia.org/r/267208 (https://phabricator.wikimedia.org/T125170) (owner: 10BBlack) [19:17:26] ottomata: so OS? also what vlan? [19:17:33] analytics vlan not normal right? [19:18:10] trusty, analytics vlan [19:18:15] cool [19:18:52] 6Operations: setup stat1003/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2181451 (10RobH) [19:18:56] ^ there we go =] [19:19:02] sorry, 1004, correcting [19:19:10] just wrong in task title. [19:19:19] 6Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2181451 (10RobH) [19:19:27] yeehaw thank you! [19:21:54] _joe_ marxarelli: parsoid has a cherry pick to deploy to match core's 1.27-wmf.20. any problem with doing that deploy as soon as the train deploy is done?
[19:23:29] !log Cloning mediawiki for checkout of new 1.27.0-wmf.20 branch on tin [19:23:32] (03PS1) 10ArielGlenn: add dir for test files to be served from datasets host [puppet] - 10https://gerrit.wikimedia.org/r/281723 [19:23:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:23:45] cscott: can it wait until the evening swat? [19:23:50] ottomata: also the tasks for the purchase of ssds for the analytics systems in eqiad have been escalated [19:23:56] for either ssd order or system with ssd order [19:24:05] the 6 recent allocations in eqiad. [19:24:50] great, thanks robh [19:25:40] marxarelli: well, visualeditor will use "enframed" instead of "framed" as the name of that image option (and some other similar funniness) after 1.27-wmf.20 is deployed and before parsoid's deploy occurs. [19:25:42] (03CR) 10ArielGlenn: [C: 032] add dir for test files to be served from datasets host [puppet] - 10https://gerrit.wikimedia.org/r/281723 (owner: 10ArielGlenn) [19:25:50] marxarelli: so i'd like to do it as soon as possible after the train deploy. [19:26:10] YuviPanda: merging your stuff again [19:26:19] YuviPanda: k8s: Fix abac template again (fbb328e) [19:26:20] ok? [19:26:38] apergos: yeah! sorry again, was just debugging stuff. [19:26:43] cscott: kk. i'll let you know when i'm done [19:26:49] done [19:27:59] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [19:28:00] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [19:28:06] marxarelli: ok. it's not a huge deal (group0 and relatively minor changes to image option localization), but it would be good to get it done. thanks.
[19:31:58] !log Applying security patches to wmf/1.27.0-wmf.20 checkout on tin [19:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:39] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181515 (10Gehel) [19:33:29] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181533 (10Gehel) [19:34:48] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181542 (10RobH) a:5BBlack>3RobH I'll create and link in #procurement tasks for pricing shortly. [19:39:02] 6Operations, 10hardware-requests: +1 'stat' type box for hadoop client usage - https://phabricator.wikimedia.org/T128808#2181548 (10RobH) 5Open>3Resolved WMF4721 is approved, and I've just created T131877 for its deployment. [19:43:19] (03CR) 10Legoktm: [C: 031] "Looks good, pending security review." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/281651 (https://phabricator.wikimedia.org/T131844) (owner: 10Rillke) [19:46:55] (03PS1) 10Yuvipanda: Fix tests to run with v1.2.0 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281725 [19:46:57] (03PS1) 10Yuvipanda: Fix case of "Pods" to make tests pass [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281726 [19:46:59] (03PS1) 10Yuvipanda: Remove uidenforcer plugin [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281727 [19:47:01] (03PS1) 10Yuvipanda: Add wmftoolsenforcer admission controller [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281728 [19:47:11] bah [19:47:15] (03PS1) 10Dduvall: Group0 to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281729 [19:47:19] (03PS1) 10Yuvipanda: Remove uidenforcer plugin [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281730 [19:47:21] (03PS1) 10Yuvipanda: Add wmftoolsenforcer admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281731 [19:47:21] I should delete the debs repo [19:47:48] chasemp: ^ is rebased setup. 
I'm going to checkout on tools-docker-builder-03 nad run tests now [19:48:23] kk [19:49:29] (03CR) 10Dduvall: [C: 032] Group0 to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281729 (owner: 10Dduvall) [19:49:34] (03CR) 10Aaron Schulz: [C: 031] Use ProductionServices for the jobqueue configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/279350 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [19:49:45] (03PS20) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [19:50:12] (03Merged) 10jenkins-bot: Group0 to 1.27.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281729 (owner: 10Dduvall) [19:53:56] (03CR) 1020after4: [C: 031] Hieraize keyholder::agent configuration [puppet] - 10https://gerrit.wikimedia.org/r/279198 (https://phabricator.wikimedia.org/T130419) (owner: 1020after4) [19:57:26] !log dduvall@tin Started scap: testwiki to php-1.27.0-wmf.20 and rebuild l10n cache [19:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:37] (03CR) 1020after4: [C: 031] Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [20:00:09] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [20:02:35] (03PS2) 10Yuvipanda: Add wmftoolsenforcer admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281731 [20:05:10] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [20:05:25] (03PS3) 10Yuvipanda: Add wmftoolsenforcer admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281731 [20:06:24] (03Abandoned) 10Yuvipanda: Fix tests to 
run with v1.2.0 [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281725 (owner: 10Yuvipanda) [20:06:35] (03Abandoned) 10Yuvipanda: Fix case of "Pods" to make tests pass [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281726 (owner: 10Yuvipanda) [20:06:47] (03Abandoned) 10Yuvipanda: Remove uidenforcer plugin [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281727 (owner: 10Yuvipanda) [20:06:57] (03Abandoned) 10Yuvipanda: Add wmftoolsenforcer admission controller [debs/kubernetes] - 10https://gerrit.wikimedia.org/r/281728 (owner: 10Yuvipanda) [20:10:09] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [20:11:58] 6Operations, 10Continuous-Integration-Infrastructure, 10Phabricator, 10netops, and 4 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2181736 (10hashar) Jenkins execute the jobs on labs instances, so it is not surprising t... [20:15:09] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 1 failures [20:15:09] PROBLEM - check_redis on payments2001 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [20:15:41] ^ fixing... [20:18:51] (03PS1) 10Ppchelko: Emit resource_change events from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) [20:20:09] PROBLEM - check_puppetrun on bismuth is CRITICAL: CRITICAL: Puppet has 1 failures [20:20:33] (03CR) 10jenkins-bot: [V: 04-1] Emit resource_change events from RESTBase. 
[puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) (owner: 10Ppchelko) [20:21:29] (03PS4) 10Yuvipanda: Add uidenforcer admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281731 [20:22:59] PROBLEM - Apache HTTP on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:39] PROBLEM - HHVM rendering on mw1134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:30] (03Abandoned) 10Yuvipanda: Remove uidenforcer plugin [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281730 (owner: 10Yuvipanda) [20:24:35] (03PS2) 10Ppchelko: Emit resource_change events from RESTBase. [puppet] - 10https://gerrit.wikimedia.org/r/281740 (https://phabricator.wikimedia.org/T126571) [20:25:09] RECOVERY - check_puppetrun on bismuth is OK: OK: Puppet is currently enabled, last run 157 seconds ago with 0 failures [20:25:25] 6Operations, 7Puppet, 10Wikimedia-Apache-configuration, 13Patch-For-Review, 7Technical-Debt: Refactor the mediawiki puppet classes to make HHVM default, drop zend compatibility - https://phabricator.wikimedia.org/T126310#2181794 (10Aklapper) [20:27:43] (03CR) 10Yuvipanda: [C: 032 V: 032] Add uidenforcer admission controller [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281731 (owner: 10Yuvipanda) [20:28:07] (03CR) 10Nuria: ">I don't understand why they are "fake requests" or "disguised as user requests"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [20:28:24] (03PS1) 10Yuvipanda: k8s: Build only arm64 binaries [puppet] - 10https://gerrit.wikimedia.org/r/281742 [20:28:26] (03PS1) 10Yuvipanda: k8s: Update tag [puppet] - 10https://gerrit.wikimedia.org/r/281743 [20:29:45] (03PS8) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [20:29:46] !log dduvall@tin Finished scap: testwiki to 
php-1.27.0-wmf.20 and rebuild l10n cache (duration: 32m 19s) [20:29:47] (03PS2) 10Yuvipanda: k8s: Update tag [puppet] - 10https://gerrit.wikimedia.org/r/281743 [20:29:49] (03PS2) 10Yuvipanda: k8s: Build only arm64 binaries [puppet] - 10https://gerrit.wikimedia.org/r/281742 [20:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:21] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Build only arm64 binaries [puppet] - 10https://gerrit.wikimedia.org/r/281742 (owner: 10Yuvipanda) [20:30:36] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Update tag [puppet] - 10https://gerrit.wikimedia.org/r/281743 (owner: 10Yuvipanda) [20:31:41] jouncebot next [20:31:42] In 2 hour(s) and 28 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160405T2300) [20:31:52] (03CR) 10Hashar: "That might do it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [20:32:49] !log dduvall@tin rebuilt wikiversions.php and synchronized wikiversions files: Group0 to 1.27.0-wmf.20 [20:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:35:00] bd808: where's your cool wikiversions thing live again? [20:35:18] https://tools.wmflabs.org/versions/ [20:35:40] neato! [20:36:00] bd808: welcome back [20:36:12] bd808: I assume you saw that stashbot and sal were having issues? [20:36:18] I'm not really here :) [20:36:19] RECOVERY - puppet last run on snapshot1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:36:30] !log dduvall@tin Purged l10n cache for 1.27.0-wmf.18 [20:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:36:43] I just restarted the stashbot process. It got sad and died when phab was down [20:36:44] bd808: then nevermind! [20:37:25] hmmm... 
maybe elastic is broken in tool labs :/ [20:37:27] this is bd808 http://arresteddevelopment.wikia.com/wiki/You_can_always_tell_a_Milford_man [20:39:44] bd808: yeah, it actually was breaking before phab went down for the thermal paste yesterday, everyone was blaming it, but it wasn't jiving with the timeline :) [20:40:00] (03CR) 10Hashar: "The puppet compiler is not helpful there:" [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [20:40:30] greg-g: yeah. looks like elastic won't start. I'm going to guess it's a puppet change for cirrus that leaked into my tool labs instances without all the right config [20:40:42] bd808: /me nods [20:41:00] be sure to kill stashbot again before par.avoid bans it again :) [20:42:49] (03PS5) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [20:48:51] (03PS21) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [20:51:28] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2181940 (10Gehel) According to @BBlack on IRC: * Specs is "standard, SSD-based varnish cluster machine configurations". * we should not need to buy any new hardwar... [20:51:29] cscott: deployed.
you have an extra 9 minutes :) [20:51:39] sorry, i went very slowly today [20:52:00] whoo [20:52:07] marxarelli: thanks [20:52:13] np [20:52:29] (03PS1) 10BryanDavis: Quote ship_to_logstash host value [puppet] - 10https://gerrit.wikimedia.org/r/281756 [20:53:26] YuviPanda: ^ I need that patch to fix elasticsearch and stashbot in tool labs [20:54:16] 7Blocked-on-Operations, 6Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2181955 (10MBinder_WMF) @greg thanks for waking this thread up. :) My teams are still asking for live updated boards. @chasemp I'm happy to... [20:54:46] someone can help me with autopromoting user groups? [20:54:51] Dereckson: ^ [20:54:55] patch for enwp [20:55:09] (03PS1) 10ArielGlenn: turn off dump cron on old snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281770 [20:55:23] ah, ich found it [20:55:51] (03CR) 10Yuvipanda: [C: 032] Quote ship_to_logstash host value [puppet] - 10https://gerrit.wikimedia.org/r/281756 (owner: 10BryanDavis) [20:56:40] *I [20:58:38] (03PS2) 10ArielGlenn: turn off dump cron on old snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281770 [21:02:08] 7Blocked-on-Operations, 6Operations, 10Phabricator, 10Traffic: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#2182000 (10greg) So, as for next steps: >>! In T112765#2168889, @chasemp wrote: > Someone from #releng could put up changes for both the mi... [21:02:30] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2182002 (10BBlack) Well to be completely clear: should not need to buy any new hardware **this quarter** - all of them need replacing on standard lifetimes, with th... [21:04:20] I'd love some help with https://phabricator.wikimedia.org/T41510 .. 
[21:05:24] Luke081515: https://www.mediawiki.org/wiki/Manual:$wgAutopromote [21:06:12] Luke081515: if I remember, they want edit count + age? [21:06:47] yeah, I got it [21:06:55] maybe you can review that patch in a few minutes? [21:07:02] I hope I did'nt forget anything [21:07:09] (03PS3) 10ArielGlenn: turn off dump cron on old snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281770 [21:08:34] (03CR) 10Hashar: [C: 031] "Dan told me he will baby sit the deployment once he is done with the MW train :)" [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [21:09:29] (03CR) 10ArielGlenn: [C: 032] turn off dump cron on old snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281770 (owner: 10ArielGlenn) [21:09:52] Krenair: Maybe you can help RileyH? see 23:04 UTC+2 [21:10:02] (21:04 UTC) [21:10:48] Luke081515: ok [21:11:22] (03PS1) 10Yuvipanda: Don't require securitycontext to be nil [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281806 [21:11:28] (03PS22) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [21:12:01] (03PS1) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [21:12:03] Dereckson: ^ [21:12:27] (03CR) 10jenkins-bot: [V: 04-1] Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [21:12:37] (03PS2) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [21:12:49] oh, an error [21:13:00] (03CR) 10jenkins-bot: [V: 04-1] Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 
(https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [21:13:59] (03PS1) 10ArielGlenn: add back nutcracker to the dump setup for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281808 [21:14:05] (03PS3) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [21:14:36] (03CR) 10jenkins-bot: [V: 04-1] Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [21:14:43] meh [21:14:48] * Luke081515 has no luck today [21:15:18] (03CR) 10ArielGlenn: [C: 032] add back nutcracker to the dump setup for new snapshots [puppet] - 10https://gerrit.wikimedia.org/r/281808 (owner: 10ArielGlenn) [21:15:51] Luke081515: it seems you forgot a ) in the autopromote block [21:16:04] hi RileyH [21:16:26] Luke081515: by the way don't forget to append comma to array element lines you add [21:16:54] thanks for the tip [21:17:23] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2182020 (10Gehel) As we already have the hardware, this just needs @mark's approval. [21:17:52] RileyH, what wiki do you need your watchlist cleared on? 
[21:18:33] (03PS1) 10ArielGlenn: turn off dumps cron on snapshot1006 and 1007 [puppet] - 10https://gerrit.wikimedia.org/r/281814 [21:19:03] (03PS4) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [21:19:17] I hope this won't be a try and catch game with jenkins [21:19:54] (03CR) 10Aaron Schulz: [C: 031] upload limit: raise to 4 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (owner: 10Matanya) [21:20:34] Dereckson: Jenkins was nice to me this time :D [21:20:40] should be ready for review now [21:20:50] (03CR) 10ArielGlenn: [C: 032] turn off dumps cron on snapshot1006 and 1007 [puppet] - 10https://gerrit.wikimedia.org/r/281814 (owner: 10ArielGlenn) [21:20:54] Luke081515: you can check locally the syntax with `php -l` to avoid this try and catch game [21:21:20] that's the advantage when we will switch to phab: There is some kind of a basic linter [21:23:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [21:23:30] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [1000.0] [21:24:13] (03PS23) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [21:24:26] (03CR) 10Alex Monk: [C: 031] "Looks about right" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [21:27:31] Luke081515: actually, we could have `arc lint` right now, even to use with Gerrit. 
[21:28:40] try it, ostriches has prepared a .arclint configuration for JSON and PHP lint [21:29:09] I don't have arcanist at the client for mw config actually [21:29:18] but I'm using it already for my bot php files [21:30:25] Dereckson: Maybe I shouldn't have created a new change, I think it would have been easier, if I'd used https://gerrit.wikimedia.org/r/#/c/270660/ -.- [21:30:36] can you abandon that dupe now? [21:32:05] (03CR) 10Yuvipanda: [C: 032 V: 032] Don't require securitycontext to be nil [software/kubernetes] - 10https://gerrit.wikimedia.org/r/281806 (owner: 10Yuvipanda) [21:32:23] (03PS9) 10Yuvipanda: k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) [21:32:42] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Stop using packages for k8s workers [puppet] - 10https://gerrit.wikimedia.org/r/281587 (https://phabricator.wikimedia.org/T130972) (owner: 10Yuvipanda) [21:33:08] (03PS24) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [21:33:28] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:34:15] !log https://tools.wmflabs.org/sal missing entries since 2016-04-04T16:36 [21:34:17] !log updated Parsoid to version a5be1cdc [21:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:34:41] I'll backfill https://tools.wmflabs.org/sal tomorrow [21:34:56] !log restarted parsoid on wtp1001.eqiad.wmnet as a canary [21:35:00] Dereckson: are you planning to write a patch to raise timeout limits ? 
[21:35:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:35:17] (03PS1) 10Yuvipanda: k8s: Bump tag again [puppet] - 10https://gerrit.wikimedia.org/r/281819 [21:35:48] (03Abandoned) 10Alex Monk: New group/right/protection level for the English Wikipedia: establishededitor (?) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/270660 (https://phabricator.wikimedia.org/T126607) (owner: 10Alex Monk) [21:36:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:36:09] (03PS25) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [21:36:39] Krenair: Commons please [21:36:43] (03PS2) 10Yuvipanda: k8s: Bump tag again [puppet] - 10https://gerrit.wikimedia.org/r/281819 [21:36:50] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Bump tag again [puppet] - 10https://gerrit.wikimedia.org/r/281819 (owner: 10Yuvipanda) [21:37:04] matanya: if csteipp thinks it's okay, yes [21:37:22] csteipp: can you please approve/deny ? [21:37:45] Dereckson: 300s ? [21:37:52] !log restarted parsoid on wtp2001.codfw.wmnet as a (better) canary [21:37:55] Dereckson: what task? [21:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:38:23] csteipp: https://phabricator.wikimedia.org/T118887 [21:38:53] Dereckson: can you please add my patch to today's swat ? [21:38:58] RileyH, I started it and promptly realised I probably should've done that as a batched query.... :| [21:39:22] meaning? [21:39:24] :P [21:39:45] s4 replag [21:40:00] (03PS1) 10Yuvipanda: k8s: Fix typo in deploy script [puppet] - 10https://gerrit.wikimedia.org/r/281820 [21:40:06] Query OK, 214722 rows affected (1 min 27.90 sec) eek [21:40:14] o.O [21:40:20] gj Krenair [21:41:00] matanya: 280831? 
you can add it to https://wikitech.wikimedia.org/wiki/Deployments#Week_of_April_4th [21:41:23] yes, adding [21:41:25] (03PS2) 10Yuvipanda: k8s: Fix typo in deploy script [puppet] - 10https://gerrit.wikimedia.org/r/281820 [21:41:33] Krenair: It's nice and empty, thank you! [21:41:34] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Fix typo in deploy script [puppet] - 10https://gerrit.wikimedia.org/r/281820 (owner: 10Yuvipanda) [21:42:47] Looking at https://tendril.wikimedia.org/host/view/db1040.eqiad.wmnet/3306 I may have got away with it [21:42:59] !log restarted parsoid on all nodes to complete deploy of version a5be1cdc [21:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:44:21] restart complete. [21:44:21] subbu: could you do a general sanity check of the new parsoid version on enwiki? [21:44:21] i'm going to test the group0 wikis in particular to check that the image options are correct. [21:44:36] Dereckson: added [21:46:26] (03PS26) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [21:50:36] ok, done with the parsoid deploy. have fun w/ swat [21:50:39] thanks again marxarelli [21:50:45] 6Operations, 10Parsoid, 10RESTBase, 6Services-next, and 3 others: Support following MediaWiki redirects when retrieving HTML revisions - https://phabricator.wikimedia.org/T118548#2182138 (10GWicke) [21:51:37] cscott: no problem [21:58:43] (03PS27) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [22:06:00] matanya: csteipp: okay, https://gerrit.wikimedia.org/r/281823 prepared for that. [22:07:37] thanks Dereckson, adding to train ? 
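Note on the "should've done that as a batched query" remark at 21:38:58: the single statement touched 214722 rows in one go, which is what produced the s4 replag. A minimal sketch of the batched alternative follows, assuming an in-memory SQLite database as a stand-in for the production MySQL cluster. The table and column names (`watchlist`, `wl_user`), the function name, and the batch size are illustrative assumptions, not the query that was actually run.

```python
import sqlite3


def delete_in_batches(conn, user_id, batch_size=1000):
    """Delete one user's rows in small transactions; return the total deleted.

    Hypothetical helper: each batch is committed on its own so that no
    single transaction has to replicate hundreds of thousands of rows.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM watchlist WHERE rowid IN "
            "(SELECT rowid FROM watchlist WHERE wl_user = ? LIMIT ?)",
            (user_id, batch_size),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # On a replicated cluster one would pause here until the replicas
        # have caught up before issuing the next batch (not modelled).
    return total


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE watchlist (wl_user INTEGER, wl_title TEXT)")
    rows = [(1, "Page_%d" % i) for i in range(2500)] + [(2, "Other")]
    conn.executemany("INSERT INTO watchlist VALUES (?, ?)", rows)
    conn.commit()
    print(delete_in_batches(conn, 1))
```

The subquery form is used because a plain `DELETE ... LIMIT` is not available in every SQLite build; on MySQL the `LIMIT` can usually go directly on the `DELETE`.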
[22:07:55] * swat [22:07:59] (03PS1) 10Gehel: Moving elasticsearch::https instatiation to elasticsearch role [puppet] - 10https://gerrit.wikimedia.org/r/281824 [22:09:42] (03CR) 10TTO: "Bit meh about the name..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [22:10:13] matanya: if there is no emergency, perhaps wait a little bit so anyone with a reason to object will get a chance? Until Thursday? [22:10:23] TTO: I can't change the name, it's already defined at Wikimedia Messages [22:10:44] (03CR) 10Ottomata: "Ok status!" [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) (owner: 10Ottomata) [22:10:56] (03CR) 10Luke081515: "The name is defined here: https://gerrit.wikimedia.org/r/#/c/279761/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [22:11:05] Dereckson: it was discuced over and over in the hackathon, but i don't care waiting a bit longer [22:11:54] (03PS1) 10Yuvipanda: k8s: Fix another typo [puppet] - 10https://gerrit.wikimedia.org/r/281825 [22:12:05] (03PS2) 10Yuvipanda: k8s: Fix another typo [puppet] - 10https://gerrit.wikimedia.org/r/281825 [22:12:13] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Fix another typo [puppet] - 10https://gerrit.wikimedia.org/r/281825 (owner: 10Yuvipanda) [22:12:14] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: Hardware for cache cluster for Maps - https://phabricator.wikimedia.org/T131880#2182167 (10RobH) a:5RobH>3mark Excellent, I was too quick to claim for processing! So the request is to allocate the 4 machines in codfw/eqiad/esams/ulsfo each... 
[22:13:09] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2181515 (10RobH) [22:13:28] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [22:13:38] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0] [22:13:52] 6Operations, 6Discovery, 10Kartotherian, 10Maps, and 2 others: codfw/eqiad/esams/ulsfo: (4) servers for maps caching cluster - https://phabricator.wikimedia.org/T131880#2181515 (10RobH) [22:14:55] ottomata: exciting! [22:15:01] (I've been keeping an eye on) [22:16:22] (03PS4) 10Matanya: upload limit: raise to 4 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) [22:17:22] (03CR) 10Dereckson: "Okay if this terminology is used consistently." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/250384 (https://phabricator.wikimedia.org/T130442) (owner: 10Nemo bis) [22:32:31] hi! I've been trying to get jenkins to push commits to gerrit - and figured out that the jenkins slaves don't trust the gerrit host. marxarelli is helping me out and suggested adding an sshkey resource to ci labs slave role - which we tried - but running into some alias and permissions issues [22:33:42] chasemp may be ^ if you are around? [22:34:48] just to elaborate on the permissions part, looks to be a known issue with sshkey https://tickets.puppetlabs.com/browse/PUP-2900 [22:35:01] wondering how ops has handled this in the past [22:39:46] marxarelli: madhuvishy In this case, I'd say you can put a big comment explaining the issue and an exec with an onlyif [22:40:14] YuviPanda: would a `file` with just mode also work? 
[22:40:26] marxarelli: not sure, it might [22:40:29] assuming the sshkey would autorequire the file [22:40:43] but i guess i shouldn't assume anything about sshkey at this point [22:41:01] another alternative would be to use GIT_SSH env variable to point to a script that does ssh with verification itself [22:41:10] this way you wouldn't need sshkey at all [22:41:29] hmm, hashar mentioned that earlier as well [22:41:37] i'm hesitant to do that on integration slaves [22:41:38] we can put a require on sshkey i suppose [22:42:02] 704b4067c92b0bc86c18a3b9551eaaf296a5f18d for how it used to be for labs self hosted puppetmaster [22:42:16] yeah, you can do that too madhuvishy [22:42:37] I think sshkey + exec + comments with link to bug is the right thing to do [22:43:03] exec as in chmod? [22:43:06] yeah [22:43:09] hmmm [22:43:13] so the exec is a refreshonly to chmod and sshkey notifies it? [22:43:13] if a file {} doesn't work, that is [22:43:15] ah, ok [22:43:27] yeah, if a file doesn't work then what marxarelli said [22:43:45] okay [22:44:06] marxarelli: i'm not sure where the port goes though - the host_aliases doesn't seem to be for port [22:44:17] cool. madhuvishy let's try the file (maybe w/ a `defined?` guard) first? [22:44:27] yup [22:44:30] madhuvishy: i think it's 'host:port' syntax [22:44:36] or '[host]:port' maybe [22:44:41] ah [22:44:44] hmmm [22:44:51] for the name too? [22:45:02] yes, i think so [22:45:12] trial and error it is [22:45:14] haha, seems like file_line would be soooo much clearer :) [22:45:35] oh puppet [22:45:38] TEDD! [22:45:47] Trial & Error Driven Development [22:45:54] lol [22:46:05] since the dawn of mankind! [22:46:10] * YuviPanda calls the k8s 1.2 upgrade a success, and shall commence eating oranges [22:46:31] Dereckson: Still here? 
[22:50:25] (03PS5) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [22:51:57] (03PS28) 10Ottomata: Add new scap::source define to ease bootstrapping of repositories on deploy servers [puppet] - 10https://gerrit.wikimedia.org/r/280730 (https://phabricator.wikimedia.org/T118772) [22:53:27] 6Operations, 6Discovery, 6Discovery-Search-Backlog, 10Wikimedia-Logstash, and 3 others: Upgrade ElasticSearch to 1.7.5 - https://phabricator.wikimedia.org/T122697#2182288 (10Deskana) [22:53:29] 6Operations, 6Discovery, 6Discovery-Search-Backlog, 10MediaWiki-Vendor, and 6 others: Upgrade ruflin/elastica to 2.3.1 - https://phabricator.wikimedia.org/T127831#2182287 (10Deskana) 5Open>3Resolved [22:55:17] PROBLEM - check_puppetrun on payments2003 is CRITICAL: CRITICAL: Puppet has 3 failures [22:55:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [22:56:35] (03PS6) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [22:56:39] 7Puppet, 6Labs, 10Phabricator: Phabricator labs puppet role configures phabricator wrong - https://phabricator.wikimedia.org/T131899#2182293 (10Luke081515) [22:56:45] marxarelli: YuviPanda ^^ may be? 
[22:57:45] (03CR) 10jenkins-bot: [V: 04-1] ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [22:57:52] (03CR) 10Yuvipanda: [C: 04-1] ci: Add gerrit as a known host for Jenkins slaves (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [22:58:35] madhuvishy: left comments [22:58:42] (03CR) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [22:59:28] madhuvishy: I think it should be a 'subscribe' (from file -> sshkey) or 'notify' (from sshkey -> file) [22:59:35] since you want it to happen every time [22:59:42] right [22:59:58] i wonder if sshkey will change the mode if the file already exists [23:00:05] RoanKattouw ostriches Krenair MaxSem Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160405T2300). Please do the needful. [23:00:05] Luke081515 matanya: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:06] if not, the file -> sshkey would make sense [23:00:16] * Luke081515 is here [23:00:17] RECOVERY - check_puppetrun on payments2003 is OK: OK: Puppet is currently enabled, last run 250 seconds ago with 0 failures [23:00:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:00:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:00:18] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:00:24] if it does change the mode of an existing file, sshkey -> file would be the right order [23:00:31] er, sshkey ~> file [23:00:45] who will SWAT? 
[23:01:24] marxarelli: right, so I guess it depends on whether sshkey will change a file's mode even if it isn't touching the file itself. but I think a notify from sshkey to file should cover all cases, in the worst case just causing puppet log churn [23:01:27] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:01:30] but only one way to find out [23:02:08] YuviPanda: true true [23:02:29] it'd be nice to avoid churn but nicer to have it work! :) [23:03:54] Who will SWAT today? [23:04:23] (03PS7) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [23:04:26] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182311 (10Wrh2) Pinging Jdlrobson as I believe that this issue is the one being referenced in the English Wikivoyage dis... [23:04:48] I will [23:04:51] thanks [23:05:07] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [23:05:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:05:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:05:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:05:29] csteipp, what do you think of https://gerrit.wikimedia.org/r/#/c/280831/ ? 
[23:05:42] (03CR) 10jenkins-bot: [V: 04-1] ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [23:06:18] Luke081515, https://gerrit.wikimedia.org/r/#/c/281807/5/wmf-config/InitialiseSettings.php shows big red blocks of indentation errors [23:06:29] o.O [23:06:33] give me two minutes [23:06:41] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182312 (10Jdlrobson) @Whr2 have there been any reports of this happening on pages that have been edited since the 4th Ma... [23:06:41] lol why you hate my indentation [23:06:56] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182313 (10Jdlrobson) p:5Normal>3High [23:07:07] madhuvishy: linters hate everything and everyone :) [23:07:26] (03PS6) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [23:07:49] probably not high priority that one :P [23:07:52] MaxSem: From a security perspective, I don't see any issues. Ops may have a different opinion, if they think that's going to encourage using too many resources on our end. I'm assuming our image scalers can handle stuff that big sanely. [23:08:09] (03CR) 10Alex Monk: [C: 04-1] upload limit: raise to 4 GB (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: 10Matanya) [23:08:27] (03PS8) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [23:08:57] madhuvishy: cherry picking ... 
[23:09:04] meh, whitespace again [23:09:32] (03PS7) 10Luke081515: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) [23:09:56] (03CR) 10jenkins-bot: [V: 04-1] ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 (owner: 10Madhuvishy) [23:10:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:10:18] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:10:18] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:10:22] marxarelli: cool [23:10:22] (03PS9) 10Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - 10https://gerrit.wikimedia.org/r/281706 [23:10:31] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182320 (10Wrh2) >>! In T121135#2182312, @Jdlrobson wrote: > @Whr2 have there been any reports of this happening on pages... [23:10:39] Krenair: My patch should be ok now [23:10:43] Luke081515, the upload size patch does not qualify for SWAT so far: it's a slightly icky change that needs to be +1'd by a few stakeholders like ops [23:10:58] MaxSem: This is not my patch :P [23:11:06] it's the one from mantanya [23:11:19] ah, right:P [23:11:33] I only got a protection level like yesterday [23:11:41] but this time at en [23:12:07] (03PS5) 10Alex Monk: upload limit: raise to 4 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: 10Matanya) [23:12:36] matanya: you've discussed it with someone from operations during the hackaton? 
[23:12:42] MaxSem: I was looking real quick, and it has a +1 from Aaron at least, we probably want a godo.g or paravoi.d +1 as well [23:12:50] cc matanya ^ [23:13:42] (03CR) 10Alex Monk: [C: 032] Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [23:13:49] madhuvishy, YuviPanda: "/Stage[main]/Role::Ci::Slave::Labs/File[/etc/ssh/ssh_known_hosts]/mode: mode changed '0600' to '0644'" and no churn upon a second run! [23:13:55] marxarelli: \o/ [23:14:11] marxarelli: now if you can figure out why jenkins doesn't like it, I can merge it if you want [23:14:22] (03Merged) 10jenkins-bot: Add 'extendedconfirmed' protection level on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281807 (https://phabricator.wikimedia.org/T126607) (owner: 10Luke081515) [23:14:34] let me check if the host key works first [23:14:48] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182325 (10Jdlrobson) a:3Jdlrobson @Wrh2 this article says: "This travel guide page was last edited at 02:45, on 14 Sep... [23:15:12] :) [23:15:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:15:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:15:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:15:33] YuviPanda, madhuvishy: ugh "RSA host key for IP address '208.80.154.81' not in list of known hosts" [23:15:45] hmmm what did it add this time? [23:15:45] maybe it does need [host]:port syntax [23:15:50] i'll try locally real quick [23:15:54] okay [23:16:58] no luck. 
[23:17:50] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/281807/ (duration: 00m 35s) [23:17:53] (03CR) 10Greg Grossmeier: "Faidon and/or Filippo: Can we get a +1 from you on this change as well? It has one from Aaron (a previous patchset, so it 'got lost' in ge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/280831 (https://phabricator.wikimedia.org/T131895) (owner: 10Matanya) [23:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:55] though it still tried to auth my public key [23:18:23] marxarelli: what is the entry it adds to known_hosts? [23:18:40] !log krenair@tin Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/281807/ (duration: 00m 35s) [23:18:40] "gerrit.wikimedia.org:29148,208.80.154.81:29148 ssh-rsa ..." [23:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:18:46] MaxSem: Krenair ^^ I asked for faidon/filippo to +1 on that patch, we should probably punt it for this swat window [23:18:51] ah that still looks wrong [23:18:58] i locally modified it to use [host]:port syntax and that didn't work either [23:19:07] Krenair: Checked at beta and enwiki, works. Thanks for SWAT, I will close the task now. [23:19:12] "[gerrit.wikimedia.org]:29148,[208.80.154.81]:29148 ssh-rsa ..." [23:19:23] Luke081515, we should backport the wikimediamessages change [23:19:45] or we ask a admin to create the sysmessages localy? [23:19:56] marxarelli: huh [23:20:03] Luke081515, why? 
[23:20:05] that makes no sense [23:20:16] at frwp why did that that way, and at dewiki too [23:20:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:20:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures [23:20:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379 [23:20:45] I can create the messages, but... [23:20:49] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:20:53] 6Operations, 10MediaWiki-Parser, 10Traffic, 10Wikidata, 10Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182349 (10Wrh2) Cache is cleared fairly regularly even if articles aren't edited - I've made minor updates to Template:P... [23:20:58] madhuvishy: what's different this time is that `ssh -p 29418 gerrit.wikimedia.org` doesn't prompt me to accept the host key, and it proceeds to try to authenticate [23:21:08] interesting [23:21:27] ok, which way is better? [23:21:28] PROBLEM - HHVM rendering on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:21:50] madhuvishy: https://phabricator.wikimedia.org/P2863 [23:21:53] Luke081515, a backport [23:21:56] ok [23:22:28] ugh, I just backported the wrong one [23:22:35] luckily didn't merge it [23:22:44] madhuvishy: line 4 is the old record that sshkey created without the port [23:23:04] so i think you should refactor one more time and leave the host_aliases but remove the port [23:23:27] according to the debug output, it'll fallback to the host wo/ a port anyway [23:23:56] right [23:24:36] TIL stuff about ssh known hosts lookup ... 
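Note on the known_hosts lookup discussed above: OpenSSH writes the host field as `[host]:port` when the port is not 22, and the debug output marxarelli mentions shows the client falling back to the plain host entry when no port-specific entry exists. The following is a simplified sketch of that behaviour, not OpenSSH's actual implementation. The function names are hypothetical, hashed host fields and wildcard patterns are ignored, and the key string is a placeholder.

```python
DEFAULT_SSH_PORT = 22


def host_pattern(host, port=DEFAULT_SSH_PORT):
    """Return the host field as it is written in a known_hosts entry."""
    if port == DEFAULT_SSH_PORT:
        return host
    return "[{}]:{}".format(host, port)


def parse_known_hosts(text):
    """Map each plain-text host pattern to a set of (keytype, key) pairs."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("@"):
            continue  # skip marker lines and malformed lines
        hosts, keytype, key = fields[0], fields[1], fields[2]
        for pattern in hosts.split(","):
            entries.setdefault(pattern, set()).add((keytype, key))
    return entries


def is_known(entries, host, port, keytype, key):
    """Check the port-specific entry first, then fall back to the bare host."""
    wanted = (keytype, key)
    ported = entries.get(host_pattern(host, port))
    if ported is not None:
        # A port-specific entry exists, so it alone decides the outcome.
        return wanted in ported
    if port != DEFAULT_SSH_PORT:
        # No port-specific entry: retry without the port identifier.
        return wanted in entries.get(host, set())
    return False


if __name__ == "__main__":
    sample = "gerrit.wikimedia.org,208.80.154.81 ssh-rsa AAAAplaceholder\n"
    known = parse_known_hosts(sample)
    print(host_pattern("gerrit.wikimedia.org", 29418))
    print(is_known(known, "gerrit.wikimedia.org", 29418, "ssh-rsa", "AAAAplaceholder"))
```

Under this model a `host:port` field without brackets is just an unrelated host name, so it never matches. That is consistent with the suggestion at 23:23:04 to keep the host_aliases and drop the port.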
[23:25:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:25:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:25:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:25:46] (PS10) Madhuvishy: ci: Add gerrit as a known host for Jenkins slaves [puppet] - https://gerrit.wikimedia.org/r/281706
[23:26:48] RECOVERY - HHVM rendering on mw1188 is OK: HTTP OK: HTTP/1.1 200 OK - 65691 bytes in 0.377 second response time
[23:27:15] !log restarted hhvm on mw1188, stuck
[23:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:28] !log krenair@tin Started scap: https://gerrit.wikimedia.org/r/#/c/281846/ - add messages for the new extendedconfirmed protection
[23:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:27:55] marxarelli: updated patch
[23:27:59] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.044 second response time
[23:28:43] madhuvishy: running it now ...
[23:30:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:30:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:30:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:30:33] Operations, MediaWiki-Parser, Traffic, Wikidata, Wikidata-Page-Banner: Banners fail to show up occassionally on Russian Wikivoyage - https://phabricator.wikimedia.org/T121135#2182391 (Jdlrobson) Thanks. Then it's definitely not fixed. Looking at source it looks like the exact same problem as w...
[23:30:59] madhuvishy: success!
[23:31:54] yay
[23:32:03] will try my release job
[23:32:06] madhuvishy: let's retry that job. maybe we should bring the chat back in #-releng and leave these opsen alone. thanks, YuviPanda!
[23:32:12] yes :)
[23:32:24] thanks YuviPanda! will bother you again when that patch needs to be merged
[23:32:36] :) ok madhuvishy
[23:35:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:35:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:35:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:35:28] (PS1) RobH: updating dns entries for stat1004 [dns] - https://gerrit.wikimedia.org/r/281856
[23:36:53] Operations, ops-eqiad: update labels and visible label field for stat1004/WMF4721 - https://phabricator.wikimedia.org/T131902#2182401 (RobH)
[23:37:38] Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2181451 (RobH)
[23:37:50] (CR) RobH: [C: 2] updating dns entries for stat1004 [dns] - https://gerrit.wikimedia.org/r/281856 (owner: RobH)
[23:40:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:40:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:40:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:42:35] Operations: setup stat1004/WMF4721 for hadoop client usage - https://phabricator.wikimedia.org/T131877#2182429 (RobH)
[23:43:54] (CR) BryanDavis: [C: 1] Moving elasticsearch::https instatiation to elasticsearch role [puppet] - https://gerrit.wikimedia.org/r/281824 (owner: Gehel)
[23:45:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:45:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:45:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:45:22] (PS1) RobH: setting stat1004 install params [puppet] - https://gerrit.wikimedia.org/r/281857
[23:45:48] (CR) Dduvall: [C: 1] "Cherry picked on integration-puppetmaster and works as expected on integration slaves (no more host key prompt or failure)." [puppet] - https://gerrit.wikimedia.org/r/281706 (owner: Madhuvishy)
[23:46:36] (CR) RobH: [C: 2] setting stat1004 install params [puppet] - https://gerrit.wikimedia.org/r/281857 (owner: RobH)
[23:48:23] YuviPanda: want to merge the patch?
[23:50:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:50:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:50:18] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:50:29] madhuvishy: sure
[23:50:39] (PS11) Yuvipanda: ci: Add gerrit as a known host for Jenkins slaves [puppet] - https://gerrit.wikimedia.org/r/281706 (owner: Madhuvishy)
[23:50:47] (CR) Yuvipanda: [C: 2 V: 2] ci: Add gerrit as a known host for Jenkins slaves [puppet] - https://gerrit.wikimedia.org/r/281706 (owner: Madhuvishy)
[23:50:57] Krenair: Was the backport successful? I don't know if it's my cache, but Special:ListGroupRights at en still shows no message for the new group
[23:51:16] Luke081515, when you backport message changes like this you have to run a full scap
[23:51:43] (PS1) RobH: stat1004 has 4 disks [puppet] - https://gerrit.wikimedia.org/r/281858
[23:52:09] Krenair: And this takes a lot of time? I don't know details about scap
[23:52:24] Luke081515, yes
[23:52:29] scap can take 20-60 minutes
[23:52:29] ok
[23:52:33] oh
[23:52:36] 23:27:27 Started scap: https://gerrit.wikimedia.org/r/#/c/281846/ - add messages for the new extendedconfirmed protection
[23:52:41] 23:50:46 Started sync-apaches
[23:53:33] (which, 2-3 minutes later, actually started)
[23:54:43] YuviPanda: thanks :)
[23:55:04] madhuvishy: np
[23:55:17] PROBLEM - check_redis on payments2003 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:55:17] PROBLEM - check_puppetrun on payments2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:55:17] PROBLEM - check_redis on payments2002 is CRITICAL: CRITICAL ERROR - Can not connect to 127.0.0.1 on port 6379
[23:55:47] (CR) Dduvall: "Removed cherry-pick and rebased /var/lib/git/operations/puppet on integration-puppetmaster" [puppet] - https://gerrit.wikimedia.org/r/281706 (owner: Madhuvishy)