[00:00:48] !log pulled cirrus changes (315440, 315441) to mw1099 [00:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:02:35] !log ebernhardson@mira Synchronized php-1.28.0-wmf.22/extensions/CirrusSearch/: SWAT CirrusSearch Add completion support to ClusterOverride, Remove position_increment_gap on source_text.trigram (duration: 00m 58s) [00:02:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:03:19] and that concludes SWAT [00:03:26] 07Blocked-on-Operations, 06Services: Expand SCB cluster - https://phabricator.wikimedia.org/T147903#2707844 (10Pchelolo) [00:07:56] (03CR) 10EBernhardson: [C: 031] [cirrus] Activate BM25 on top 10 wikis: Step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315298 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [00:08:16] (03CR) 10EBernhardson: [C: 031] [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [00:14:09] 06Operations, 10Gerrit, 10hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596#2697736 (10Dzahn) @RobH Which ticket should i use for the follow-up to investigate lead hardware issues / talk to Dell. This? a new one? [00:16:40] (03CR) 10Dzahn: "oh, really? labtest is related to maintenance hosts? But it's not the case for deployment hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [00:17:57] (03PS3) 10Dzahn: add mapped v6 IPs for terbium and wasat [puppet] - 10https://gerrit.wikimedia.org/r/302649 [00:18:00] jouncebot: next [00:18:00] In 12 hour(s) and 41 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T1300) [00:18:04] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [00:18:12] jouncebot: now [00:18:12] No deployments scheduled for the next 12 hour(s) and 41 minute(s) [00:19:19] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2707885 (10RobH) [00:19:38] mutante: so i just created https://phabricator.wikimedia.org/T147905 for the lead investigation [00:19:44] (03CR) 10Dzahn: [C: 032] add mapped v6 IPs for terbium and wasat [puppet] - 10https://gerrit.wikimedia.org/r/302649 (owner: 10Dzahn) [00:19:48] feel free to append any missing info that you are aware of [00:20:07] robh: ok:) thanks [00:20:12] mutante: is it fully depooled and can be rebooted as needed? [00:20:49] the various software is likely best left intact, just not doing anything. [00:20:54] robh: it doesnt get any traffic and can be rebooted. the only thing is it's still in icinga and i am not sure how to remove it right now [00:21:00] since we dont use puppetstoredconfigclean anymore [00:21:04] and i already ran "node clean" [00:21:13] we dont? [00:21:14] i did disable notifications though.. so.. [00:21:41] afaict we don't since palladium->puppetmaster1001 [00:22:29] hrmm [00:22:31] it doesnt exist on the new master [00:22:53] then how does one decommission a host fully now? =[ [00:22:55] and then there is "puppet node clean" as opposed to "cert clean" [00:23:01] not sure [00:23:13] i _think_ node clean is supposed to do it [00:23:36] it does 2 things [00:23:44] revoking the cert and "storeconfigs removed" [00:23:52] ahh, ok [00:23:53] just that it's still in icinga anyways [00:24:01] in my case [00:24:20] well, how long ago was its cert cleaned? (long enough for neon to run right?) [00:24:46] yes, should have been long enough [00:25:09] checks if it's enabled on neon [00:25:16] yea, ran 22 min ago [00:25:19] odd. 
[00:25:34] then i have no idea how to clear something out for a full decommission or reclaim to spares anymore =P [00:25:53] will have to ask joe tomorrow [00:26:08] it seems like a bug that it claims it clears storeconfigs but they are still existing in some place [00:26:30] somehow related to moving the masters maybe [00:28:22] * mutante adds proper IPv6 to maintenance servers [00:32:56] (03PS1) 10Dzahn: tcpircbot: adjust IPv6 addresses of maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/315453 (https://phabricator.wikimedia.org/T141619) [00:33:27] (03CR) 10Dzahn: "follow-up https://gerrit.wikimedia.org/r/#/c/315453/" [puppet] - 10https://gerrit.wikimedia.org/r/302649 (owner: 10Dzahn) [00:33:51] (03CR) 10Dzahn: "since https://gerrit.wikimedia.org/r/#/c/302649/3 -> https://gerrit.wikimedia.org/r/#/c/315453/" [puppet] - 10https://gerrit.wikimedia.org/r/302647 (https://phabricator.wikimedia.org/T141619) (owner: 10Dzahn) [00:34:49] (03PS2) 10Dzahn: tcpircbot: adjust IPv6 addresses of maintenance servers [puppet] - 10https://gerrit.wikimedia.org/r/315453 (https://phabricator.wikimedia.org/T141619) [00:35:54] (03CR) 10Dzahn: [C: 032] "[terbium:~] $ ip a s | grep inet6 | grep global" [puppet] - 10https://gerrit.wikimedia.org/r/315453 (https://phabricator.wikimedia.org/T141619) (owner: 10Dzahn) [00:37:11] (03Abandoned) 10Dzahn: add deployment, maintenance servers to hieradata common [puppet] - 10https://gerrit.wikimedia.org/r/302774 (owner: 10Dzahn) [00:39:15] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2707908 (10Dzahn) lead has been removed from puppet site.pp, i revoked the puppet cert and salt key. i used "puppet node clean" which revoked the cert and also claimed it removed storedconfigs, but lead is s... 
[00:40:26] (03CR) 10Dzahn: [C: 04-1] Gerrit: Also list mediawiki skins [puppet] - 10https://gerrit.wikimedia.org/r/315301 (owner: 10Paladox) [00:42:31] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:48:03] since we have "deployment-tin" and "deployment-mira" in labs, is there also "deployment-terbium" and "deployment-wasat"? [00:57:47] (03PS3) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [00:57:57] (03CR) 10Dzahn: "ah, yea, i see what you mean now. amended to:" [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [00:58:42] (03CR) 10jenkins-bot: [V: 04-1] network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [00:59:16] (03CR) 10Dzahn: "bd808/hashar: does it make sense to add deployment-terbium and deployment-wasat? Or which other hosts should i use for that in labs?" [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [00:59:35] (03PS4) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [01:00:29] (03CR) 10Dzahn: "the maintenance hosts now have the mapped IPv6 addresses, that first change is merged. the second one is pending the question what we do i" [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [01:00:35] (03CR) 10jenkins-bot: [V: 04-1] network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [01:01:24] (03CR) 10Dzahn: [C: 032] contint: install jenkins+CI site on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/315146 (owner: 10Hashar) [01:02:39] (03CR) 10Dzahn: "was it intentional that this has a dependency on the zuul changes? from the message it sounds like it could go before them?" 
[puppet] - 10https://gerrit.wikimedia.org/r/315146 (owner: 10Hashar) [01:04:45] (03CR) 10Dzahn: [C: 04-1] "what Chad said. the "zuul::common" stuff should be in hieradata/role/common or similar, not in hostname.yaml. since it's.. well "common" t" [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [01:07:37] (03PS3) 10Dzahn: Revert "gerrit: workaround a CSS bug with Microsoft Edge" [puppet] - 10https://gerrit.wikimedia.org/r/314835 (owner: 10Paladox) [01:10:17] (03CR) 10Dzahn: [C: 032] Revert "gerrit: workaround a CSS bug with Microsoft Edge" [puppet] - 10https://gerrit.wikimedia.org/r/314835 (owner: 10Paladox) [01:25:11] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=509 [critical =500] [01:30:12] PROBLEM - check_missing_thank_yous on db1025 is CRITICAL: CRITICAL missing_thank_yous=614 [critical =500] [01:40:14] RECOVERY - check_missing_thank_yous on db1025 is OK: OK missing_thank_yous=0 [01:49:45] icinga-wm: thank you ?:p [01:50:38] PROBLEM - puppet last run on elastic1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:59:55] (03PS5) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [02:00:50] (03CR) 10jenkins-bot: [V: 04-1] network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [02:01:25] (03PS6) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [02:02:22] (03CR) 10jenkins-bot: [V: 04-1] network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [02:14:36] RECOVERY - puppet last run on elastic1046 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:15:35] (03PS7) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [02:18:39] 06Operations, 07IPv6: Fix IPv6 autoconf issues once and for all, across the fleet. - https://phabricator.wikimedia.org/T102099#1355719 (10Dzahn) is this a duplicate of T100690 btw? 
[02:19:58] 06Operations: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1318494 (10Dzahn) enabled on terbium and wasat (maintenance hosts) [02:36:52] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 12m 50s) [02:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:48:02] (03CR) 10BryanDavis: "> bd808/hashar: does it make sense to add deployment-terbium and" [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [03:06:47] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.22) (duration: 12m 50s) [03:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:13:59] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 12 03:13:59 UTC 2016 (duration 7m 12s) [03:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:33:45] (03PS1) 10Awight: WIP Enable MessageCache debugging on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315460 [03:53:23] (03PS2) 10AndyRussG: Enable MessageCache debugging on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315460 (https://phabricator.wikimedia.org/T144952) (owner: 10Awight) [03:54:10] (03PS3) 10AndyRussG: Enable MessageCache debugging on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315460 (https://phabricator.wikimedia.org/T144952) (owner: 10Awight) [04:00:32] (03CR) 10AndyRussG: [C: 032] Enable MessageCache debugging on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315460 (https://phabricator.wikimedia.org/T144952) (owner: 10Awight) [04:01:02] (03Merged) 10jenkins-bot: Enable MessageCache debugging on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315460 (https://phabricator.wikimedia.org/T144952) (owner: 10Awight) [04:46:32] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1012 is OK: OK: 
Less than 50.00% above the threshold [1.0] [04:48:01] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1013 is OK: OK: Less than 50.00% above the threshold [1.0] [04:51:52] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1020 is OK: OK: Less than 50.00% above the threshold [1.0] [04:55:11] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 28878 seconds ago, expected 28800 [05:00:13] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 29178 seconds ago, expected 28800 [05:05:14] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 29478 seconds ago, expected 28800 [05:10:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 29778 seconds ago, expected 28800 [05:12:24] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#2708080 (10bd808) [05:15:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 30078 seconds ago, expected 28800 [05:20:11] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 30378 seconds ago, expected 28800 [05:24:19] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:24:57] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1022 is OK: OK: Less than 50.00% above the threshold [1.0] [05:25:15] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 30678 seconds ago, expected 28800 [05:30:05] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 30978 seconds ago, expected 28800 [05:35:10] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 31279 seconds ago, expected 28800 [05:40:12] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 31578 seconds ago, expected 28800 [05:40:53] RECOVERY - Kafka Broker Under Replicated Partitions on kafka1014 is OK: OK: Less than 50.00% above the threshold [1.0] [05:45:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 31878 seconds ago, expected 28800 [05:48:04] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [05:50:06] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 32179 seconds ago, expected 28800 [05:55:09] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 32478 seconds ago, expected 28800 [06:00:13] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 32778 seconds ago, expected 28800 [06:01:10] (03PS2) 10Legoktm: Enable magic links regardless of MediaWiki core default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314463 (https://phabricator.wikimedia.org/T147536) [06:05:13] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 33079 seconds ago, expected 28800 [06:10:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 33378 seconds ago, expected 28800 [06:15:12] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 33678 seconds ago, 
expected 28800 [06:20:08] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 33978 seconds ago, expected 28800 [06:25:10] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 34278 seconds ago, expected 28800 [06:27:00] ACKNOWLEDGEMENT - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet last ran 34278 seconds ago, expected 28800 Giuseppe Lavagetto stop spamming us (and start using non-changing messages please) [06:27:35] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 3059293 keys - replication_delay is 0 [06:38:38] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tree] [06:41:43] (03CR) 10Giuseppe Lavagetto: "You can see that other virtualhosts have either one or the other directive. Both declared in the same is a repetition. Also, we should of " [puppet] - 10https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014) (owner: 10Alex Monk) [06:42:52] !log reimaging mw1099 (test application server) to jessie [06:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:50:45] (03PS5) 10Giuseppe Lavagetto: Update mwdeploy group sudo rights for jessie [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [06:55:23] find . -name '*repl* [06:55:33] oops, wrong window... 
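The check_puppetrun alerts for pay-lvs1002 above compare the age of the last puppet run against an expected maximum of 28800 seconds (8 hours). The acknowledgement asks for "non-changing messages", since each re-check reports a different elapsed time. A minimal sketch of such a check follows. The function name, the strict greater-than comparison and the message wording are assumptions for illustration, not the actual plugin.

```python
EXPECTED_MAX_AGE = 28800  # seconds (8 hours), the "expected" value in the alerts


def check_puppetrun(age_seconds, max_age=EXPECTED_MAX_AGE):
    """Return (state, message) for a puppet run that finished age_seconds ago.

    Hypothetical helper: the real check's threshold semantics are not shown
    in the log, so a strict "older than max_age" comparison is assumed.
    """
    if age_seconds > max_age:
        # Stable text: the elapsed time is left out on purpose, so the
        # message is identical on every re-check while the host stays stale.
        return "CRITICAL", f"Puppet has not run within the expected {max_age} seconds"
    return "OK", "Puppet ran within the expected window"
```

With a message built this way, the readings 28878 and 34278 from the log would produce the same CRITICAL text instead of a new line every five minutes.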
[06:57:04] <_joe_> moritzm: eheheh [07:00:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "There is no need for keeping the /sbin/restart around since:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [07:02:58] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [07:09:08] (03PS3) 10Elukey: Add extra compiler warnings to the Makefile [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) [07:28:37] 06Operations, 10ops-eqiad: investigate lead hardware issue - https://phabricator.wikimedia.org/T147905#2708139 (10akosiaris) >>! In T147905#2707908, @Dzahn wrote: > lead has been removed from puppet site.pp, i revoked the puppet cert and salt key. > > i used "puppet node clean" which revoked the cert and also... [07:31:04] (03CR) 10Alexandros Kosiaris: [C: 031] network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [07:33:32] (03PS3) 10Ema: WIP: Text VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) [07:38:44] !log reimaging mw1163 to Debian (MW Jobrunner) [07:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:42:45] (03PS2) 10Muehlenhoff: xhgui: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304831 [07:44:45] (03CR) 10Muehlenhoff: [C: 032] xhgui: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/304831 (owner: 10Muehlenhoff) [07:45:01] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:47:47] (03PS3) 10Muehlenhoff: udp2log: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/312525 [07:55:14] (03CR) 10Muehlenhoff: [C: 032] udp2log: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/312525 (owner: 10Muehlenhoff) [07:59:39] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I love the concept, there are several small things that should be fixed." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [08:02:41] (03CR) 10Giuseppe Lavagetto: [C: 031] Refactor memcached role to allow a more flexible hieradata config [puppet] - 10https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [08:03:18] (03PS1) 10Muehlenhoff: Configure tin for installation with jessie [puppet] - 10https://gerrit.wikimedia.org/r/315469 [08:04:09] (03PS1) 10Muehlenhoff: Configure wasat for installation with jessie [puppet] - 10https://gerrit.wikimedia.org/r/315470 [08:09:30] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:10:59] (03PS6) 10Ema: varnish: add varnishstat dstat plugin [puppet] - 10https://gerrit.wikimedia.org/r/314247 [08:12:20] (03CR) 10Ema: varnish: add varnishstat dstat plugin (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/314247 (owner: 10Ema) [08:21:27] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The script is overall what we need, but:" [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [08:23:59] _joe_: thx for the review of zuul/hiera refactor. 
I completely forgot about that one ( [08:26:33] <_joe_> hashar: I am doing random reviews of three PS at the least every morning [08:26:52] <_joe_> I realized I need to review more changes if I want to be able to properly complain about our code quality [08:30:15] (03CR) 10Hashar: [C: 04-1] "Nowadays equivalent would be:" [puppet] - 10https://gerrit.wikimedia.org/r/315301 (owner: 10Paladox) [08:31:46] _joe_: review is a good feedback loop, hopefully receiver learns about the new standard and will start refactoring/enhancing other code as a result [08:34:19] !log installing c-ares security updates [08:34:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:40:26] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#2669983 (10akosiaris) As far as a shared inbox goes my experience with such an approach was that we ended up using it as an archive and not as tracking. Shared Google Inbox/Groups might end up better, at least it looks lik... 
[08:41:00] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Deploy Youdao MT service [puppet] - 10https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) (owner: 10KartikMistry) [08:41:06] (03PS4) 10Alexandros Kosiaris: cxserver: Deploy Youdao MT service [puppet] - 10https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) (owner: 10KartikMistry) [08:42:57] _joe_: I dont get the difference between 'include foo' and class { 'foo': } [08:43:11] related to https://gerrit.wikimedia.org/r/#/c/308778/8/modules/role/manifests/ci/website.pp [08:46:04] (03PS3) 10DCausse: [cirrus] switch cirrus BM25 A/B test config to ja, zh, th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) [08:46:06] (03PS2) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) [08:46:08] (03PS2) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315299 (https://phabricator.wikimedia.org/T147508) [08:46:10] (03PS2) 10DCausse: [cirrus] Activate BM25 on top 10 wikis: Step 2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315298 (https://phabricator.wikimedia.org/T147508) [08:46:40] (03PS1) 10KartikMistry: cxserver: Fix typo in comment [puppet] - 10https://gerrit.wikimedia.org/r/315474 [08:48:25] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Fix typo in comment [puppet] - 10https://gerrit.wikimedia.org/r/315474 (owner: 10KartikMistry) [08:53:00] !log upgrading nodejs on ruthenium [08:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:55:55] (03PS2) 10Filippo Giunchedi: raid: increase check_hpssacli timeout [puppet] - 10https://gerrit.wikimedia.org/r/315103 [08:56:06] !log mw1163 (MW Jobrunner) back in service after the reimage [08:56:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:57:30] akosiaris: do you know what's wrong here cxserver deployment? [08:57:32] http://pastebin.com/eWGVdPf0 [08:58:21] er, no... lemme check [08:59:51] Command '/usr/bin/git checkout --force --quiet 060f91d728eae887d04fda48b58062396114f12f' returned non-zero exit status 128" [09:00:09] 06Operations, 10Monitoring: Investigate check_hpssacli number of calls / efficiency - https://phabricator.wikimedia.org/T147916#2708277 (10fgiunchedi) [09:00:30] 06Operations, 10Monitoring: Investigate check_hpssacli number of calls / efficiency - https://phabricator.wikimedia.org/T147916#2708289 (10fgiunchedi) p:05Triage>03Normal [09:00:36] <_joe_> hashar: if you do 'include' you allow other parts of your code to pick the parameters [09:00:39] tin ? [09:00:41] ah yes [09:00:46] kart_: that's what's wrong [09:00:49] tin.eqiad.wmnet [09:00:53] (03CR) 10Filippo Giunchedi: [C: 032] "I've seen only CDBs mentioning inquiry so far on swift machines so I'm assuming those coming check_hpssacli." [puppet] - 10https://gerrit.wikimedia.org/r/315103 (owner: 10Filippo Giunchedi) [09:01:11] akosiaris: I'm using mira. the new deployment server. [09:01:56] why it referenced to tin? Don't know. [09:02:10] kart_: seems like I was not clear, it's the scap config that's referring to tin [09:02:18] let's see where that is [09:02:28] (03Draft1) 10Hashar: (WIP) Zuul hiera refactoring (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/315475 [09:03:00] scap/scap.cfg? [09:03:03] yes [09:03:09] that's needs a change [09:03:11] Let me fix that. [09:03:16] s/tin.eqiad.wmnet/deployment.eqiad.wmnet/ [09:03:18] ok [09:04:21] (03CR) 10DCausse: "prepared the ja, zh and th A/B test instead of removing the whole block." (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [09:06:30] akosiaris: https://gerrit.wikimedia.org/r/#/c/315478/ - looks good? 
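The fix discussed above is a plain host name substitution in scap/scap.cfg, from tin.eqiad.wmnet to deployment.eqiad.wmnet. A minimal sketch of that substitution is shown below. The sample config fragment and its `git_server` key are hypothetical, since the log does not show the file's contents.

```python
OLD_HOST = "tin.eqiad.wmnet"
NEW_HOST = "deployment.eqiad.wmnet"


def retarget_deploy_host(cfg_text, old=OLD_HOST, new=NEW_HOST):
    """Replace every literal occurrence of the old deploy host name."""
    return cfg_text.replace(old, new)


# Hypothetical scap.cfg fragment; the real key names are not shown in the log.
sample_cfg = "[global]\ngit_server: tin.eqiad.wmnet\n"
updated_cfg = retarget_deploy_host(sample_cfg)
```

Pointing the config at a service name rather than a specific machine means the file should not need another edit the next time the deployment server changes.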
[09:06:45] (03CR) 10Filippo Giunchedi: raid: tweak check_interval for forking checks (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [09:06:48] (03PS3) 10Filippo Giunchedi: raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 [09:08:22] akosiaris: OK. I'll merge and try. [09:08:58] !log upgrading nodejs on etherpad1001 [09:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:09:42] I am impressed these days with etherpad... seems to need way less babysitting [09:09:53] and now that I jinxed it [09:09:56] let's see what happens [09:10:07] I restarted etherpad-lite and it still works [09:10:12] it's magic [09:10:24] unbelievable!!!! [09:12:54] akosiaris: that works, thanks. [09:13:04] (atleast as deployment procedure :)) [09:13:10] cool [09:14:36] akosiaris: also Youdao seems working fine. [09:14:40] akosiaris: thanks a lot. [09:15:34] (03PS4) 10Filippo Giunchedi: raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 [09:15:39] (03CR) 10MarcoAurelio: "> (1 comment)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [09:16:19] kart_: cool! [09:17:26] 06Operations, 06Performance-Team, 10Thumbor: Thumbor times out on large files sometimes - https://phabricator.wikimedia.org/T147412#2708376 (10Gilles) At 120 seconds, still happens for some giant TIFs: ``` gilles@thumbor1001:~$ cat /var/log/thumbor/thumbor.error.log | grep 504 | grep 120 Oct 12 06:28:20 thu... 
[09:20:00] !log Update cxserver to da7d4f6 (T146731) [09:20:01] T146731: Deploy Youdao MT service - https://phabricator.wikimedia.org/T146731 [09:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:22:47] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/4317/" [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [09:23:15] akosiaris: ^ shouldn't conflict with your icinga work I think but if you could double check too [09:23:15] 06Operations, 06Performance-Team, 10Thumbor: Thumbor times out on large files sometimes - https://phabricator.wikimedia.org/T147412#2708383 (10Gilles) Looking at Mediawiki code, SwiftFileBackend uses the default request timeout for MultiHttpClient, which is 300 seconds. Let's try that value in Thumbor. If t... [09:24:05] (03PS2) 10Hashar: (WIP) Zuul hiera refactoring (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/315475 [09:24:47] jynus: marostegui: I am gonna merge https://gerrit.wikimedia.org/r/#/c/315256/2/manifests/role/mariadb.pp. 
Should be noop [09:24:50] godog: looking [09:25:56] (03CR) 10Alexandros Kosiaris: [C: 031] raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [09:26:16] godog: I 'll have to amend a minor change of my own, but otherwise LGTM [09:26:37] (03CR) 10Alexandros Kosiaris: [C: 032] role::mariadb: Remove neon ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/315256 (owner: 10Alexandros Kosiaris) [09:26:43] (03PS3) 10Alexandros Kosiaris: role::mariadb: Remove neon ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/315256 [09:26:46] (03CR) 10Alexandros Kosiaris: [V: 032] role::mariadb: Remove neon ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/315256 (owner: 10Alexandros Kosiaris) [09:27:00] (03PS1) 10Gilles: Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 5 minutes [puppet] - 10https://gerrit.wikimedia.org/r/315479 [09:27:13] akosiaris: ack thanks! I'll merge shortly [09:27:17] (03PS2) 10Gilles: Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 5 minutes [puppet] - 10https://gerrit.wikimedia.org/r/315479 [09:29:59] (03PS5) 10Filippo Giunchedi: raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 [09:30:24] (03PS1) 10Elukey: Add config file for the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315480 (https://phabricator.wikimedia.org/T138262) [09:31:12] (03CR) 10Filippo Giunchedi: [C: 032] raid: tweak check_interval for forking checks [puppet] - 10https://gerrit.wikimedia.org/r/315107 (owner: 10Filippo Giunchedi) [09:33:08] (03PS3) 10Filippo Giunchedi: Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 5 minutes [puppet] - 10https://gerrit.wikimedia.org/r/315479 (owner: 10Gilles) [09:33:29] 06Operations, 06Performance-Team, 10Thumbor: Log thumbnail requests that fail on Thumbor and not on Mediawiki and vice versa - https://phabricator.wikimedia.org/T147918#2708393 (10Gilles) [09:34:00] 06Operations, 06Performance-Team, 10Thumbor: Log 
thumbnail requests that fail on Thumbor and not on Mediawiki and vice versa - https://phabricator.wikimedia.org/T147918#2708408 (10Gilles) [09:34:15] (03CR) 10Filippo Giunchedi: [C: 032] Increase Thumbor HTTP_LOADER_REQUEST_TIMEOUT to 5 minutes [puppet] - 10https://gerrit.wikimedia.org/r/315479 (owner: 10Gilles) [09:34:34] (03PS2) 10Elukey: Add config file for the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315480 (https://phabricator.wikimedia.org/T138262) [09:36:13] (03CR) 10Elukey: [C: 032] Add config file for the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315480 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [09:37:19] I've upgraded grafana on labmon a few days back for T146354 and haven't heard of anything back, going to upgrade in production too [09:37:19] T146354: upgrade grafana to 3.1.1 - https://phabricator.wikimedia.org/T146354 [09:37:31] s/back/bad/ [09:41:13] (03Abandoned) 10Hashar: (WIP) Zuul hiera refactoring (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/315475 (owner: 10Hashar) [09:42:00] (03PS1) 10Elukey: Fix template render bug for the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315481 (https://phabricator.wikimedia.org/T138262) [09:43:21] (03CR) 10Elukey: [C: 032] Fix template render bug for the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315481 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [09:44:05] (03CR) 10MarcoAurelio: "I wonder if:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [09:47:32] (03PS1) 10Filippo Giunchedi: aptrepo: add A278B781FE4B2BDA key to cassandra repo [puppet] - 10https://gerrit.wikimedia.org/r/315482 [09:49:43] (03PS2) 10Filippo Giunchedi: aptrepo: add A278B781FE4B2BDA key to cassandra repo [puppet] - 10https://gerrit.wikimedia.org/r/315482 [09:49:46] or not, reprepro refuses to update due to ^ [09:52:04] elukey: mind giving a quick look ? 
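The Thumbor change merged above raises HTTP_LOADER_REQUEST_TIMEOUT to 5 minutes, matching the 300 second MultiHttpClient default cited in T147412. Thumbor configuration files use Python syntax, so the setting could look roughly like the sketch below. How the value is actually templated in the puppet repository is not shown in the log and is not assumed here.

```python
# Sketch of a Thumbor configuration entry (Thumbor config files are Python).
# 300 seconds matches the MultiHttpClient default mentioned in T147412.
HTTP_LOADER_REQUEST_TIMEOUT = 5 * 60  # seconds
```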
[09:53:29] 06Operations, 06Performance-Team, 10Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708423 (10Gilles) [09:54:50] godog: LGTM, but not a big expert in the config file syntax.. if VerifyRelease can get | I think it is fine! [09:57:38] (03CR) 10Hashar: zuul: refactor to use hiera (039 comments) [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: 10Hashar) [09:57:39] yeah it does according to reprepro manual [10:02:04] (03PS3) 10Filippo Giunchedi: aptrepo: add A278B781FE4B2BDA key to cassandra repo [puppet] - 10https://gerrit.wikimedia.org/r/315482 [10:03:56] (03CR) 10Filippo Giunchedi: [C: 032] aptrepo: add A278B781FE4B2BDA key to cassandra repo [puppet] - 10https://gerrit.wikimedia.org/r/315482 (owner: 10Filippo Giunchedi) [10:05:17] (03PS1) 10Elukey: Fix remaining issues with the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315486 (https://phabricator.wikimedia.org/T138262) [10:09:34] (03PS9) 10Hashar: zuul: refactor to use hiera [puppet] - 10https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) [10:09:36] (03PS1) 10Hashar: (WIP) zuul role with hiera lookup (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/315487 [10:10:11] !log upgrade grafana on krypton to 3.1.1-1470047149 T146354 [10:10:12] T146354: upgrade grafana to 3.1.1 - https://phabricator.wikimedia.org/T146354 [10:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:11:17] (03PS2) 10Elukey: Fix remaining issues with the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315486 (https://phabricator.wikimedia.org/T138262) [10:12:00] 06Operations, 10Graphite: upgrade grafana to 3.1.1 - https://phabricator.wikimedia.org/T146354#2708457 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi [10:12:58] (03CR) 10Elukey: [C: 032] Fix remaining issues with the Pivot UI [puppet] - 10https://gerrit.wikimedia.org/r/315486 
(https://phabricator.wikimedia.org/T138262) (owner: Elukey) [10:20:52] (CR) Zhuyifei1999: "10:19:48 0 ✓ zhuyifei1999@tools-bastion-02: ~$ php -a" [mediawiki-config] - https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: MarcoAurelio) [10:22:42] (CR) Hashar: "Puppet compile for gallium (zuul server) and scandium (zuul merger) at https://puppet-compiler.wmflabs.org/4324/" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/308778 (https://phabricator.wikimedia.org/T139527) (owner: Hashar) [10:23:16] (Abandoned) Hashar: (WIP) zuul role with hiera lookup (WIP) [puppet] - https://gerrit.wikimedia.org/r/315487 (owner: Hashar) [10:29:21] (PS2) Hashar: contint: install jenkins+CI site on contint1001 [puppet] - https://gerrit.wikimedia.org/r/315146 [10:32:30] (CR) Hashar: "Rebased and compiled again against contint1001 https://puppet-compiler.wmflabs.org/4325/contint1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/315146 (owner: Hashar) [10:33:33] (CR) Alexandros Kosiaris: [C: 2] ganglia: Remove neon as a gmetad allowed host [puppet] - https://gerrit.wikimedia.org/r/315254 (owner: Alexandros Kosiaris) [10:33:38] (PS3) Alexandros Kosiaris: ganglia: Remove neon as a gmetad allowed host [puppet] - https://gerrit.wikimedia.org/r/315254 [10:33:40] (CR) Alexandros Kosiaris: [V: 2] ganglia: Remove neon as a gmetad allowed host [puppet] - https://gerrit.wikimedia.org/r/315254 (owner: Alexandros Kosiaris) [10:44:19] Operations, Patch-For-Review: Migrate pool counters to trusty/jessie - https://phabricator.wikimedia.org/T123734#2708520 (fgiunchedi) @Dzahn I see technetium was provisioned in T118763 and then destroyed, is it going to be used again? If not we could just reuse the name (still in DNS) to create the VM fo... [11:00:08] (CR) Filippo Giunchedi: [C: -1] "still LGTM, though it occurred to me that e.g. 
on service clusters we do /srv/log/ not /srv//log so it should be consist" [puppet] - https://gerrit.wikimedia.org/r/315234 (owner: Gilles) [11:05:16] (PS8) Filippo Giunchedi: Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - https://gerrit.wikimedia.org/r/315062 (owner: Gilles) [11:06:40] and I am done with reviews/follow ups after "just" 3 hours ... [11:06:44] lunchhhh [11:08:15] (CR) Filippo Giunchedi: [C: 2] Make thumbor use a temp folder controlled by systemd-tmpfiles instead of /tmp [puppet] - https://gerrit.wikimedia.org/r/315062 (owner: Gilles) [11:12:34] (PS1) Filippo Giunchedi: thumbor: require only instance_service_path [puppet] - https://gerrit.wikimedia.org/r/315494 [11:13:42] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:14:31] PROBLEM - puppet last run on thumbor1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:14:33] (CR) Filippo Giunchedi: [C: 2] thumbor: require only instance_service_path [puppet] - https://gerrit.wikimedia.org/r/315494 (owner: Filippo Giunchedi) [11:15:10] !log reimaing mw1164 to Debian Jessie (MW Jobrunner) [11:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:17:01] RECOVERY - puppet last run on thumbor1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:25:39] Operations, Monitoring, media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#2708536 (fgiunchedi) [11:25:41] Operations, Monitoring: Investigate check_hpssacli number of calls / efficiency - https://phabricator.wikimedia.org/T147916#2708538 (fgiunchedi) [11:25:57] (PS2) Gilles: Point to a folder firejailed thumbor can actually write to [puppet] - https://gerrit.wikimedia.org/r/315234 [11:27:43] 
(PS3) Gilles: Point to a folder firejailed thumbor can actually write to [puppet] - https://gerrit.wikimedia.org/r/315234 [11:30:58] Operations, Icinga, Monitoring, Need-volunteer: check_puppetrun: print "agent disabled" reason - https://phabricator.wikimedia.org/T98481#2708543 (fgiunchedi) Open>Invalid Reasons are now reported in icinga, ` WARNING: Puppet is currently disabled, message: testing maintain-replicas revis... [11:32:20] Operations, Monitoring, media-storage: icinga hp raid check timeout on busy ms-be machines - https://phabricator.wikimedia.org/T141252#2708547 (fgiunchedi) [11:32:22] Operations, Monitoring: investigate speeding up hp raid checks - https://phabricator.wikimedia.org/T138597#2708549 (fgiunchedi) [11:32:40] Operations, Performance-Team, Thumbor: Extracted ICC profile don't get cleaned up - https://phabricator.wikimedia.org/T147921#2708550 (Gilles) [11:34:25] Operations, Performance-Team, Thumbor: Extracted ICC profile don't get cleaned up - https://phabricator.wikimedia.org/T147921#2708566 (Gilles) p:Normal>Low They get cleaned up after 10 minutes, but it would be cleaner for the thumbor engines to clean up after themselves. [11:38:59] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2708570 (Gilles) [11:39:01] Operations, Performance-Team, Thumbor: Temp files not cleaned up on conversion error - https://phabricator.wikimedia.org/T146262#2708568 (Gilles) Open>Resolved I don't see them stay in /srv/thumbor/tmp. I've caught big ones being processed, but they were gone once processing finished: ``` g... 
[11:40:11] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:41:39] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2708590 (Gilles) [11:41:42] Operations, Performance-Team, Thumbor: Figure out a way to live-debug running production thumbor processes - https://phabricator.wikimedia.org/T146143#2708588 (Gilles) Open>Resolved I now have access to manhole since we moved the content to /srv/thumbor/tmp/ owned by the thumbor user: ``` g... [11:43:39] Operations, Performance-Team, Thumbor, Patch-For-Review: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2708595 (Gilles) [11:43:41] Operations, Performance-Team, Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2708593 (Gilles) Open>Resolved a:Gilles [11:46:42] Operations, Monitoring: Extract metrics from logs - https://phabricator.wikimedia.org/T147923#2708597 (fgiunchedi) [11:57:18] (PS2) Gilles: Add mtail program to track thumbor OOM kills [puppet] - https://gerrit.wikimedia.org/r/315272 [11:57:43] Operations, hardware-requests: Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2708643 (elukey) [11:58:24] Operations, hardware-requests: Reclaim aqs100[123] - https://phabricator.wikimedia.org/T147926#2708655 (elukey) p:Triage>Normal [12:02:56] (PS1) Elukey: Add reference to the hw reclaim task for aqs100[123] [puppet] - https://gerrit.wikimedia.org/r/315497 [12:03:08] (CR) Gehel: "LGTM. We dont radically change the total number of shards, so risk is minimal and we allow for some room during cluster upgrades. 
The only" [mediawiki-config] - https://gerrit.wikimedia.org/r/314262 (owner: DCausse) [12:04:15] (CR) Elukey: [C: 2] Add reference to the hw reclaim task for aqs100[123] [puppet] - https://gerrit.wikimedia.org/r/315497 (owner: Elukey) [12:14:39] (PS2) Gilles: Add memory limit to Thumbor subprocesses [puppet] - https://gerrit.wikimedia.org/r/315248 [12:14:57] (PS3) Gilles: Add memory limit to Thumbor subprocesses [puppet] - https://gerrit.wikimedia.org/r/315248 [12:17:31] (PS1) Alexandros Kosiaris: monitoring: Add an exported parameter to host, service [puppet] - https://gerrit.wikimedia.org/r/315500 [12:17:36] let's see what pcc says about this ^ [12:21:15] (CR) jenkins-bot: [V: -1] monitoring: Add an exported parameter to host, service [puppet] - https://gerrit.wikimedia.org/r/315500 (owner: Alexandros Kosiaris) [12:21:51] oh damn puppet syntax [12:22:16] the future parser seems to like this but the old one not... [12:22:17] grrr [12:23:07] !log mw1164 back in service (MW Jobrunner) [12:23:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:26:02] (CR) BBlack: [C: 1] Remove the HHVM version for X-Powered-By (static websites) [puppet] - https://gerrit.wikimedia.org/r/314519 (owner: Elukey) [12:28:27] (CR) BBlack: [C: 1] varnish: add varnishstat dstat plugin [puppet] - https://gerrit.wikimedia.org/r/314247 (owner: Ema) [12:37:53] (PS2) Alexandros Kosiaris: monitoring: Add an exported parameter to host, service [puppet] - https://gerrit.wikimedia.org/r/315500 [12:42:12] (PS4) Elukey: Refactor memcached role to allow a more flexible hieradata config [puppet] - https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) [12:43:11] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708688 (Gilles) Found a PD example, which will be handy to add to the 
Thumbor test suite: https://commons.wikimedia.org/wiki/File:Pacific-Electric-Red-Cars-Awaiting-Destru... [12:44:47] (PS7) Ema: varnish: add varnishstat dstat plugin [puppet] - https://gerrit.wikimedia.org/r/314247 [12:45:22] Operations, Performance-Team, Thumbor: Extracted ICC profile don't get cleaned up - https://phabricator.wikimedia.org/T147921#2708691 (Gilles) [12:46:37] (CR) Elukey: [C: 2] Refactor memcached role to allow a more flexible hieradata config [puppet] - https://gerrit.wikimedia.org/r/314260 (https://phabricator.wikimedia.org/T129963) (owner: Elukey) [12:48:01] puppet is disable on mc1* as precaution for --^ [12:48:19] that should be no-op [12:51:19] (PS8) Ema: varnish: add varnishstat dstat plugin [puppet] - https://gerrit.wikimedia.org/r/314247 [12:51:27] (CR) Ema: [C: 2 V: 2] varnish: add varnishstat dstat plugin [puppet] - https://gerrit.wikimedia.org/r/314247 (owner: Ema) [12:52:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [12:55:09] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 210 seconds ago with 0 failures [12:56:03] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708698 (Gilles) So far I can't reproduce locally. Could be a difference in gifsicle output between ubuntu and debian. [13:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T1300). Please do the needful. [13:00:04] mafk: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
[13:00:22] ^^ [13:00:40] o/ [13:01:08] I have absolutely not have prepared this window [13:01:09] so going to be slow [13:01:29] pas de probleme [13:02:12] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708704 (Gilles) Nope, that's not it: ``` gilles@thumbor1001:~$ wget http://ms-fe.svc.eqiad.wmnet/v1/AUTH_mw/wikipedia-commons-local-public.5a/5/5a/Berliner-pilsner-logo.g... [13:02:17] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708705 (Gilles) [13:02:21] mc1* rollout went fine, no op as expected [13:02:43] I still have no idea about how to debug issues like "PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001" [13:03:00] I checked on logstash and flourine but I don't really see anything [13:03:13] (CR) Hashar: [C: 2] Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740) (owner: MarcoAurelio) [13:05:52] (PS2) MarcoAurelio: Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740) [13:06:12] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [13:06:44] oh my [13:06:47] hashar: you need to re-submit since the patch had to be rebased [13:06:52] yeah [13:07:03] and there is a change that did not get deployed apparently:( [13:07:08] !sal [13:07:08] https://wikitech.wikimedia.org/wiki/Server_Admin_Log https://tools.wmflabs.org/sal/production See it and you will know all you need. 
[13:07:28] ah that is for beta harmless [13:07:49] (CR) Hashar: Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740) (owner: MarcoAurelio) [13:07:58] (CR) Hashar: [C: 2] Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740) (owner: MarcoAurelio) [13:08:06] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708707 (Gilles) Got it, it's my https loader. The GIF engine, since it's built into Thumbor, is the only one that doesn't know to look into wikimedia_original_file to find... [13:08:08] so I guess we will see the new group via https://tr.wikipedia.org/wiki/%C3%96zel:GrupHaklar%C4%B1Listesi [13:08:25] (Merged) jenkins-bot: Create 'massmessage-sender' group for tr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/314850 (https://phabricator.wikimedia.org/T147740) (owner: MarcoAurelio) [13:08:52] That's right. 
[13:09:05] I'm on x-wikimedia-debug mx1099 [13:09:11] syncing on mw1099 [13:09:13] *mx -> mw [13:09:16] doe [13:09:17] done [13:09:21] checking [13:09:43] looks like sysop can add/remove the group [13:09:47] and have the permission [13:10:11] yep, all lgtm [13:10:14] and mass message senders group has the permission + can remove themselves [13:10:19] removegroupstoself also works [13:10:20] yep [13:10:30] what they lack is the translation at translatewiki.net [13:10:34] apparently [13:11:00] that will come [13:11:18] lgtm [13:14:28] (CR) Hashar: Send abusefilter hit notifications from es.wikibooks to UDP (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/314852 (https://phabricator.wikimedia.org/T147744) (owner: MarcoAurelio) [13:14:33] !log hashar@mira Synchronized wmf-config/InitialiseSettings.php: Create 'massmessage-sender' group for tr.wikipedia T147740 (duration: 03m 42s) [13:14:35] T147740: Create mass message sender user group on Turkish Wikipedia - https://phabricator.wikimedia.org/T147740 [13:14:38] mafk: for the other https://gerrit.wikimedia.org/r/#/c/314852/1/wmf-config/abusefilter.php [13:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:14:44] $wgAbuseFilterNotifications = "udp"; [13:14:45] is the defaul [13:14:47] t [13:15:00] so I guess we should remove $wgAbuseFilterNotifications = false; [13:15:19] hashar: yep, and private filters? [13:15:36] wgAbuseFilterNotificationsPrivate should stay IMHO [13:15:37] looks fine? :} [13:16:11] yeah keep it [13:16:19] so in short: [13:16:23] remove the line: $wgAbuseFilterNotifications = false; [13:16:34] and keep the other one [13:16:38] and keep the line about NotifiactionsPrivate [13:16:39] yeah [13:16:40] right? 
[13:16:42] okay [13:16:47] doing it right now [13:16:49] rebase while at it and will +2 && push [13:17:00] gerrit patch ammender is <3 <3 [13:17:03] Operations, Services, User-mobrovac: Expand SCB cluster - https://phabricator.wikimedia.org/T147903#2708722 (mobrovac) p:Triage>High > So, we either need to prioritise T96017 or get at least one more box for SCB After {T147409} is done, we'll be able to free the boxes and add them to SCB. [13:17:15] (PS2) MarcoAurelio: Send abusefilter hit notifications from es.wikibooks to UDP [mediawiki-config] - https://gerrit.wikimedia.org/r/314852 (https://phabricator.wikimedia.org/T147744) [13:17:47] (PS3) MarcoAurelio: Send abusefilter hit notifications from es.wikibooks to UDP [mediawiki-config] - https://gerrit.wikimedia.org/r/314852 (https://phabricator.wikimedia.org/T147744) [13:18:04] waiting for jenkins-bot [13:18:48] v+2 [13:19:30] * mafk wonders if "mira" will do the job for SWAT, he's used to see @tin [13:19:40] (CR) Hashar: [C: 2] Send abusefilter hit notifications from es.wikibooks to UDP [mediawiki-config] - https://gerrit.wikimedia.org/r/314852 (https://phabricator.wikimedia.org/T147744) (owner: MarcoAurelio) [13:20:09] (Merged) jenkins-bot: Send abusefilter hit notifications from es.wikibooks to UDP [mediawiki-config] - https://gerrit.wikimedia.org/r/314852 (https://phabricator.wikimedia.org/T147744) (owner: MarcoAurelio) [13:20:47] deploying [13:21:34] !log hashar@mira Synchronized wmf-config/abusefilter.php: Send abusefilter hit notifications from es.wikibooks to UDP T147744 (duration: 00m 52s) [13:21:35] T147744: Notify to UDP filter hits from es.wikibooks - https://phabricator.wikimedia.org/T147744 [13:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:23] mafk: both done I guess :} [13:22:32] I am pretty tired, that took long [13:23:24] thanks for SWAT [13:23:33] (PS3) Alexandros Kosiaris: monitoring: Add an exported 
parameter to host, service [puppet] - https://gerrit.wikimedia.org/r/315500 [13:23:36] (CR) Alexandros Kosiaris: [C: 2 V: 2] monitoring: Add an exported parameter to host, service [puppet] - https://gerrit.wikimedia.org/r/315500 (owner: Alexandros Kosiaris) [13:23:40] !log European SWAT completed. [13:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:57] mafk: thanks for the patches :D [13:24:13] Operations, EventBus, hardware-requests: eqiad/codfw: 1+1 Kafka broker in main clusters in eqiad and codfw - https://phabricator.wikimedia.org/T145082#2708740 (Ottomata) Bump! I see that there are requests for quotes out, did we ever get them back? [13:24:44] I wished I could submit better/more complex patches instead of just this petty stuff. [13:25:12] that will come as you "level up" :D [13:25:38] we all started with easy stuff, learned more in the process and progressed step by step [13:25:56] the cool things with those patches is that they are rather easy to handle [13:26:08] but can also easily break a wiki and cause a huge impact on the community [13:26:15] yeah [13:26:30] + there is all the interaction with the project village pumps / community members that dont know much about mediawiki settings etc [13:26:40] so there is a lot of added value behind the simple patch [13:26:42] and [13:27:26] you should be praised by the community that got some setting adjusted. 
Cause for much of them, even doing a single line change via git, is something they dont even have knowledge of :D [13:27:32] !log rolling restart of restbase in codfw to pick up new nodejs [13:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:38:27] Operations, Performance-Team, Thumbor: Thumbor fails on gifs where Mediawiki doesn't - https://phabricator.wikimedia.org/T147919#2708765 (Gilles) [13:41:10] (PS1) Gilles: Upgrade to 0.1.26 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/315507 [13:45:22] (PS1) Gilles: Use wikimedia GIF engine for Thumbor [puppet] - https://gerrit.wikimedia.org/r/315508 [13:52:38] !log restart restbase on restbase1007 to pick up new nodejs [13:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:54:00] (PS2) DCausse: Adjust shard & replica count for enwiki and dewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/314262 [13:54:25] (PS4) Ema: Text VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) [13:58:41] !log Upgrading Zuul zuul_2.5.0-8-gcbc7f62 wmf3..wmf4 [13:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:59:08] Operations, ChangeProp, Services, User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2708810 (mobrovac) Yes, indeed. I ruled out metrics sending. It seems the node-rdkafka and / or librdkafka have something to do with it, since the service doesn't crash w... 
[14:01:18] !log upgrading openssl + confctl on cp* [14:01:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:53] Operations, Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2708813 (hashar) Resolved>Open a:elukey>None [14:02:30] Operations, Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2618640 (hashar) Polished up some oddity from yesterday deploy, the shebang was incorrect in zuul-clear-ref. Gotta bump to wmf4 on... [14:02:48] !log installing imagemagick security updates [14:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:09] !log T133395: Restarting Cassandra instances on restbase2009.codfw.wmnet [14:04:11] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [14:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:54] (CR) Ema: "noop on esams/eqiad v3 https://puppet-compiler.wmflabs.org/4334/" [puppet] - https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) (owner: Ema) [14:09:33] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[lldpd],Package[xfsprogs] [14:11:31] (PS1) Alexandros Kosiaris: facilities: Unexported nagios host and services [puppet] - https://gerrit.wikimedia.org/r/315510 [14:14:27] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2708845 (elukey) Update after a long time: * We have tested version 1.4.25 (that introduced big changes like a max of 64 slab classes) with several extended... [14:19:40] (PS2) Alexandros Kosiaris: facilities: Unexported nagios host and services [puppet] - https://gerrit.wikimedia.org/r/315510 [14:21:32] (CR) Giuseppe Lavagetto: [C: -1] "this will make naggen not collect these resources; you have to modify naggen2 too IIRC." [puppet] - https://gerrit.wikimedia.org/r/315510 (owner: Alexandros Kosiaris) [14:22:06] !log T133395: Restarting Cassandra instances on restbase1007.eqiad.wmnet [14:22:08] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [14:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:24:51] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[imagemagick] [14:24:51] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[moreutils] [14:27:23] !log stopped zuul-merger on scandium pausing CI as a result. 
Snipe upgrade going on [14:27:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:53] !log install zuul_2.5.0-8-gcbc7f62-wmf4jessie1_amd64.deb on scandium - T145057 [14:28:55] T145057: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057 [14:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:31:11] (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/315146 (owner: Hashar) [14:31:36] !log disable puppet on neon. Merging https://gerrit.wikimedia.org/r/315510 [14:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:33:42] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:14] (PS1) PleaseStand: gerrit: Fix CSS selector for diff font size override [puppet] - https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) [14:34:38] !log T133395: Restarting Cassandra instances on restbase1010.eqiad.wmnet [14:34:39] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [14:34:43] Operations, Continuous-Integration-Infrastructure (phase-out-gallium): Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057#2708902 (hashar) Open>Resolved a:elukey scandium:~$ zuul-clear-refs usage: zuul-clear-refs [-h] [--until DAYS_AGO] [-n... [14:34:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:35:12] (CR) Alexandros Kosiaris: "That's exactly what I want. For naggen2 to NOT collect these resources and for the catalog to just have them because they belong to the ho" [puppet] - https://gerrit.wikimedia.org/r/315510 (owner: Alexandros Kosiaris) [14:35:27] _joe_: ^ [14:35:43] !log zuul-merger on scandium restarted. CI is resumed. 
[14:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:38:38] !log uploaded zuul 2.5.0-8-gcbc7f62-wmf4jessie1 to jessie-wikimedia/third-party (T145057) [14:38:40] T145057: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057 [14:38:41] <_joe_> akosiaris: but if naggen doesn't collects those, how do they end up in the icinga config? [14:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:19] (CR) BBlack: "+1, can deal with the FIXME about backend conditionals afterwards" [puppet] - https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) (owner: Ema) [14:39:24] (CR) Giuseppe Lavagetto: ".. that would mean not having those declared in the icinga configuration. Are you ok with that?" [puppet] - https://gerrit.wikimedia.org/r/315510 (owner: Alexandros Kosiaris) [14:40:41] _joe_: same way a file resource would end up ? [14:41:14] it might need some testing as to which exact file they 'll end up into, but that's why I am carefully merging this change [14:41:23] <_joe_> akosiaris: the nagios_host resources if not collected are written where? [14:41:30] <_joe_> eheh exactly :P [14:41:54] /etc/nagios/nagios_host.cfg [14:42:25] which might end up being just fine [14:42:40] or not, I am still evaluating this [14:44:20] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:44:38] (Abandoned) Paladox: Gerrit: Also list mediawiki skins [puppet] - https://gerrit.wikimedia.org/r/315301 (owner: Paladox) [14:45:08] Operations, ops-eqiad, fundraising-tech-ops: pay-lvs1003 hardware swap for pay-lvs1001 - https://phabricator.wikimedia.org/T147932#2708940 (Jgreen) [14:45:45] !log uploaded zuul 2.5.0-8-gcbc7f62-wmf4precise1 to precise-wikimedia/third-party (T145057) [14:45:46] T145057: Upgrade Zuul on scandium.eqiad.wmnet (Jessie zuul-merger) - https://phabricator.wikimedia.org/T145057 [14:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:58] (CR) Alexandros Kosiaris: "actually they would be, in /etc/nagios/nagios_{host,service}.cfg which might be just fine." [puppet] - https://gerrit.wikimedia.org/r/315510 (owner: Alexandros Kosiaris) [14:47:28] !log traffic cache nginxes: seamless upgrade-restart for new openssl lib [14:47:28] Operations, ops-eqiad: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2708958 (Cmjohnson) [14:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:10] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:49:10] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:32] (CR) EBernhardson: [C: 1] "seems sane to me. I'm not worried about the balance of load across this cluster with this change." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/314262 (owner: DCausse) [14:50:58] !log T133395: Restarting Cassandra instances on restbase1011.eqiad.wmnet [14:51:00] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [14:51:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:53:32] Operations, ops-eqiad, Prod-Kubernetes, Kubernetes-production-experiment, User-Joe: Rack/Setup Kubernetes Servers - https://phabricator.wikimedia.org/T147933#2708993 (Joe) [15:01:05] (PS4) Hashar: zuul: migrate server only settings out of merger [puppet] - https://gerrit.wikimedia.org/r/309299 [15:08:18] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [15:10:26] !log T133395: Restarting Cassandra instances in RESTBase, eqiad, rack 'b' [15:10:28] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [15:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:31] Operations, Discovery, Discovery-Analysis (Current work), Tracking: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2709011 (Ottomata) I haven't fully groked the description of this problem, but if the solution is what @Dzahn men... 
[15:14:43] (CR) Hashar: "Puppet compile https://puppet-compiler.wmflabs.org/4337/" [puppet] - https://gerrit.wikimedia.org/r/309299 (owner: Hashar) [15:16:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [15:19:15] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [15:31:53] Operations, Security-Reviews, Services-next, Services (services-next), User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2709064 (GWicke) [15:31:56] Operations, Reading-Infrastructure-Team, Services-next, Security-General, Services (services-next): Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2709065 (GWicke) [15:32:05] Operations, Citoid, Graphoid, VisualEditor, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#2709067 (GWicke) [15:35:20] (PS5) Ema: Text VCL forward-port to Varnish 4 [puppet] - https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) [15:35:28] Operations, Performance-Team, scap, HHVM, and 2 others: Make scap able to depool/repool servers via the conftool API - https://phabricator.wikimedia.org/T104352#2709076 (thcipriani) [15:36:52] (PS2) PleaseStand: gerrit: Fix CSS selector for diff font size override [puppet] - https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) [15:37:15] !log upgrading nodejs to 4.6.0 on maps1* servers [15:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:38:55] Operations, Citoid, VisualEditor, Services (blocked): Separate citoid service for beta that runs off master instead of deploy - 
https://phabricator.wikimedia.org/T92304#2709100 (10GWicke) [15:38:57] Is gerrit being super slow again? [15:39:31] ostriches: ^ [15:41:36] PROBLEM - mediawiki-installation DSH group on mw1164 is CRITICAL: Host mw1164 is not in mediawiki-installation dsh group [15:46:54] 06Operations, 10Parsoid, 10service-runner, 10service-template-node, 06Services (doing): Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#2709111 (10GWicke) [15:48:35] (03CR) 10BBlack: [C: 031] Text VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [15:48:51] hoo: Not to my knowledge [15:49:51] ostriches: Seems good again [15:51:00] !log T133395: Restarting Cassandra instances in RESTBase, eqiad, rack 'd' [15:51:01] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [15:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:46] hoo happend to me yesturday too [15:53:00] Thought it was my pc, but then it started working again a few secs later [15:55:10] Hmm, wonder what these CPU spikes every 15-20m are: https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&c=Miscellaneous+eqiad&h=cobalt.wikimedia.org&tab=m&vn=&hide-hf=false&mc=2&z=medium&metric_group=NOGROUPS_%7C_network [15:55:29] * ostriches puts on his spelunking hat [15:55:42] my money is on puppet [15:55:58] (03CR) 10Ema: [C: 032] Text VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) (owner: 10Ema) [15:56:07] ostriches that must explain why yesturday for a few secconds gerrit was loading slow for me [15:56:19] 06Operations, 06Services (watching), 15User-mobrovac: Expand SCB cluster - https://phabricator.wikimedia.org/T147903#2709130 (10GWicke) [15:56:43] I spoke with mutante about this since i thought he was rebooting it but he 
wasn't but then gerrit returned to normal [15:56:48] Must? More like a maybe :D [15:57:04] godog: Puppet, or one of the crons [15:57:16] My bet is the cron actually [15:57:17] ostriches that looks like the reviewer count [15:57:27] Since it is a 20.3mb file [15:57:31] Maybe we should've left it broken! :P [15:57:39] Filesize doesn't matter, it's probably the query it's doing [15:57:44] Yep [15:58:18] /usr/bin/java -jar /var/lib/gerrit2/review_site/bin/gerrit.war gsql -d /var/lib/gerrit2/review_site/ --format JSON_SINGLE -c \"'SELECT changes.change_id AS change_id, COUNT(DISTINCT patch_set_approvals.account_id) AS reviewer_count FROM changes LEFT JOIN patch_set_approvals ON (changes.change_id = patch_set_approvals.change_id) GROUP BY changes.change_id'\" > /var/www/reviewer-counts.json [15:58:19] ? [15:58:43] Wait, reviewer count only goes once per day [15:58:50] Oh [15:58:50] That doesn't make sense. [15:59:08] 06Operations, 06Services-next, 06Services (next), 15User-Joe, and 2 others: Create a service location / discovery system for locating local/master resources easily across all WMF applications - https://phabricator.wikimedia.org/T125069#2709144 (10GWicke) [15:59:23] ostriches no it doesn't it does it every hour [15:59:23] hour => 1, [16:00:02] That doesn't mean that, it means it runs every day at 01:00 [16:00:13] Also, once per hour wouldn't line up with the CPU spikes, which are every 15m [16:00:19] see also: how cron is defined :) [16:00:20] ostriches but don't forget that cobalt isn't using all its cores [16:00:27] Oh [16:00:44] puppet seeming more likely, per godog. [16:00:56] but i thought puppet is every hour [16:01:03] unless on production it's every 15 mins [16:02:18] But if cobalt isn't using all its cores then won't that be a cause for the spike? [16:02:54] Forced puppet run didn't spike the cpu.
[16:03:22] Oh [16:07:15] (03PS1) 10Chad: Gerrit: Go back to pruning logs every 7 days [puppet] - 10https://gerrit.wikimedia.org/r/315519 [16:08:51] paladox: "not using all it's cores" is not a cause for a cpu spike, something else (software) causes it. The cores issue is not the issue here. [16:09:02] Ok [16:09:12] Could it be gerrit ssh [16:09:17] IE to many connections? [16:11:25] (03PS2) 10Alexandros Kosiaris: Introduce sca1003, sca1004, sca2003, sca2004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/315064 (https://phabricator.wikimedia.org/T147409) [16:12:59] (03CR) 10Paladox: [C: 031] "I've deployed this change on http://gerrit-test.wmflabs.org/gerrit/" [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) (owner: 10PleaseStand) [16:13:57] (03CR) 10Paladox: gerrit: Fix CSS selector for diff font size override (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/315511 (https://phabricator.wikimedia.org/T141286) (owner: 10PleaseStand) [16:14:49] godog: still working? I need some advice about backporting some packages [16:15:06] andrewbogott: sure, in a meeting now tho [16:15:14] godog: ping me when you're free? [16:15:39] yup [16:19:33] (03CR) 10Chad: [C: 04-1] `scap patch` tool for applying patches to a wmf/branch (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [16:24:29] (03CR) 10Alex Monk: "No, the wikiquote, wikinews and wikiversity VHosts have this" [puppet] - 10https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014) (owner: 10Alex Monk) [16:26:38] andrewbogott: free now, shoot! [16:26:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:27:20] godog: ok, I know slightly more than I did 10 minutes ago. But my question remains very simple: How do I backport a package from Jessie to Trusty? 
[16:27:33] When I've done that in the past I've just rebuilt the packages from source, but I suspect that that is not the easy way :) [16:27:44] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:28:05] (Specifically I want Trusty backports of the Jessie puppetmaster and puppetmaster-common packages) [16:28:43] godog: This page makes it seem too easy https://wiki.debian.org/BuildingFormalBackports [16:28:59] andrewbogott: does setting DIST=trusty just work when you try to build the package? I'm assuming you are using copper and cowbuilder [16:29:07] copper the machine [16:29:33] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:29:37] (03CR) 10Giuseppe Lavagetto: [C: 031] "If that is fine, then ok :) I assumed we changed the path of the hosts/etc files to the /etc/icinga/puppet_* ones" [puppet] - 10https://gerrit.wikimedia.org/r/315510 (owner: 10Alexandros Kosiaris) [16:30:01] godog: I am not nearly that far along [16:30:17] But if it's really as simple as that then I'll log in to copper and have at [16:30:24] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:30:35] (03CR) 10Giuseppe Lavagetto: "which is still redundant and wrong, and should be fixed instead." [puppet] - 10https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014) (owner: 10Alex Monk) [16:31:16] andrewbogott: yeah a quick test would be to try and feed the .dsc and related files to cowbuilder and set DIST=trusty [16:31:20] andrewbogott: well... 
it usually fails because of dependencies etc but in essence it should work yes :) [16:32:05] Yeah, I assume that this will swiftly land me in dependency hell and then I'll give up :) [16:32:22] I don't think it'll be very problematic, Build-Depends: debhelper (>= 9~), dh-systemd, facter, rake, ruby-hiera [16:32:36] (03CR) 10Giuseppe Lavagetto: "FTR, I posted a diff for scap for this: https://phabricator.wikimedia.org/D411" [puppet] - 10https://gerrit.wikimedia.org/r/312705 (https://phabricator.wikimedia.org/T146656) (owner: 10EBernhardson) [16:32:59] so possibly dh-systemd, and see if there are daemons that have only systemd service files [16:33:33] (03PS2) 10Alex Monk: Follow-up Ifa2cc187: Add ShortUrl support on wikimedia.org docroot sites [puppet] - 10https://gerrit.wikimedia.org/r/311647 (https://phabricator.wikimedia.org/T146014) [16:33:35] (03PS1) 10Alex Monk: apache: Standardise ShortURL config per Giuseppe on If258a076 [puppet] - 10https://gerrit.wikimedia.org/r/315522 [16:33:59] godog: I'm not sure I know where to find the .dsc. I'm looking at https://packages.debian.org/jessie-backports/admin/puppetmaster, I presume instead I should be looking for an equivalent source package? [16:34:53] <_joe_> the source package is "puppet" [16:35:03] oh, nevermind, here it is [16:35:08] <_joe_> Krenair: I'll take a look as soon as I have a moment [16:35:24] _joe_, okay. isn't it like past 6PM there? [16:35:29] <_joe_> yes [16:39:38] 06Operations, 06Services: Discussion: Use XFS for Cassandra data partition? - https://phabricator.wikimedia.org/T120004#2709311 (10GWicke) 05Open>03declined [16:39:39] Blah [16:39:43] f'ing gerrit [16:40:07] … and it's slow again [16:40:38] Nothing is eating cpu on coblat... 
[16:40:42] cobalt, even [16:41:16] And back [16:41:16] me too [16:41:18] Hmmm [16:41:25] It's still loading [16:41:28] white screen [16:41:34] It's now loaded [16:42:11] Nothing in syslog for the last ~12m or so [16:42:14] So not puppet or a cron [16:42:17] 06Operations: Enable TRIM for SSDs for Cassandra software raid - https://phabricator.wikimedia.org/T89584#2709313 (10GWicke) 05Open>03stalled [16:42:41] GC kicking in? [16:42:53] That was just my thought [16:42:58] gc thrashing [16:43:07] hoo that should not be the cause though, since i thought that would have happened on lead too [16:43:23] But lead only experienced a hardware problem last week after many months of use [16:43:25] hoo: jvm gc kicking in, not git repo gc. [16:43:30] (the latter only runs on saturdays) [16:43:44] Yeah, didn't even think about git here [16:43:59] but the number one "why is it suddenly slow" thing with JVM [16:44:11] (03PS3) 10Dzahn: contint: install jenkins+CI site on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/315146 (owner: 10Hashar) [16:44:25] Well you would think that, except when it's a frequency scaling issue on your cpu pegging the cores at 200mhz :P [16:45:07] But yes, let's see what we can coax out of the jvm in terms of gc stats... [16:45:10] 06Operations, 10ops-eqiad, 13Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2709318 (10Cmjohnson) I tried to putting the 500GB disks but still running into issues with the installer. I checked the vlan, switch port, dhcp file. Oct 12 16:37:52...
[16:46:12] ostriches i guess the only way to find out is to manually run it and load gerrit website too [16:46:57] No, let's not DOS gerrit on purpose ;-) [16:47:35] Oh sorry [16:50:35] (03CR) 10Paladox: [C: 031] contint: install jenkins+CI site on contint1001 [puppet] - 10https://gerrit.wikimedia.org/r/315146 (owner: 10Hashar) [16:51:09] (03PS1) 10BBlack: labs cache routing metadata: remove maps (unused) [puppet] - 10https://gerrit.wikimedia.org/r/315523 (https://phabricator.wikimedia.org/T147848) [16:51:11] (03PS1) 10BBlack: labs cache routing metadata: use hostnames [puppet] - 10https://gerrit.wikimedia.org/r/315524 (https://phabricator.wikimedia.org/T147848) [16:51:13] (03PS1) 10BBlack: labs cache routing metadata: single host [puppet] - 10https://gerrit.wikimedia.org/r/315525 (https://phabricator.wikimedia.org/T147848) [16:52:20] jouncebot: next [16:52:20] In 1 hour(s) and 7 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T1800) [16:54:30] (03CR) 10Jforrester: [C: 031] "Product sign-off, FWIW." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315121 (https://phabricator.wikimedia.org/T147799) (owner: 10MarcoAurelio) [16:55:09] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 109, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/3: down - pay-lvs1001BR [16:55:51] Jeff_Green: ^^ ? [16:56:11] paravoid: that's probably cmjohnson [16:56:39] i swapped in new hardware for pay-lvs1001 yesterday, he's detaching the old server [16:56:46] robh: Can we get HT turned on for cobalt? [16:56:49] ostriches: yep [16:57:07] do we need to email anyone or just take it down (we're continuing a discussion from PM in here ;) [16:57:24] I say just !log it. [16:57:37] I can send an e-mail too. 
[16:57:39] cool, im logging into both the os and mgmt now [16:58:38] so yeah, it has dual 4 core cpus [16:58:48] and only shows 8 cores, hyperthreading is indeed disabled [16:59:03] ostriches: ok, so should i halt any CI/gerrit processes manually or just shut it down normally is fine? [16:59:09] most stuff can take normal shutdown [16:59:21] (not mysql, its messy if we dont spin it down manually iirc) [16:59:39] Lemme take it down manually, paranoid [16:59:55] cool, im sending the mgmt the command to reboot into bios when it reboots [17:00:05] !log gerrit: stopping momentarily for system reboot [17:00:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:00:29] we're down [17:00:38] cool, doing a reboot via os now [17:01:43] its posting. [17:03:01] ht enabled and re-posting. [17:04:21] ostriches: its back up [17:05:00] And so is gerrit, albeit slow because cold caches. [17:05:01] and HT presents twice the core count now [17:05:36] yeah ssh login was slow, it was apparent it went under load nearly immediately [17:05:37] heh [17:05:56] andrewbogott I've a patch for getting rid of role::puppet::self from common in horizon, should come up onece gerrit unbreaks [17:06:02] andrewbogott: I removed it from wikitech puppet groups [17:06:21] cool [17:09:58] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [17:10:18] !log gerrit: system rebooted (cobalt) to enable HT, system back online as of a few minutes ago [17:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:10:37] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [17:10:59] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [17:12:16] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/tendril] [17:16:38] ok, well, puppet failures show, but rerunning on tungsten showed no errors [17:16:48] so transient error perhaps, dunno yet [17:18:24] also fine on graphite1001, im avoiding manual run on neon cuz its slow as hell on the icinga host. [17:18:37] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [17:18:59] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:20:29] robh: seem all git related so I'm assuming transient due to cobalt reboot [17:22:27] yeah, cobalt got slammed [17:22:31] hopefully the HT is helping [17:23:23] andrewbogott: can you +1 https://gerrit.wikimedia.org/r/315526 [17:24:21] 06Operations, 10ChangeProp, 06Services (next), 15User-mobrovac: ChangeProp failing on Node v4.6.0 - https://phabricator.wikimedia.org/T147849#2709466 (10Pchelolo) [17:24:33] 06Operations, 10ORES, 06Services (watching): Limit resources used by ORES - https://phabricator.wikimedia.org/T146664#2709468 (10Pchelolo) [17:27:09] 06Operations, 10Graphoid, 06Services (watching): Graphoid returns a 400 on MW API time-out - https://phabricator.wikimedia.org/T134237#2709485 (10Pchelolo) [17:29:27] 06Operations, 06Services (watching): make ocg role work on labs instances (install deployment-pdf instance with jessie) - https://phabricator.wikimedia.org/T135034#2709500 (10Pchelolo) [17:29:34] 07Blocked-on-Operations, 
06Operations, 10Cassandra, 10RESTBase, and 2 others: Finish conversion to multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#2709504 (10GWicke) [17:30:19] grrrit-wm: hi [17:31:01] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2709511 (10BBlack) [17:31:03] 06Operations, 10Traffic, 13Patch-For-Review: Use hostnames (not IPs) in deployment-prep varnish app_directors - https://phabricator.wikimedia.org/T147848#2709509 (10BBlack) 05Open>03Resolved a:03BBlack [17:32:44] robh: Heh, still cpu spiked but half as tall on graphs now :p [17:34:16] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:36:08] !log T133395: RESTBase: Altering keyspace local_group_wiktionary_T_parsoid_html.data to enable time-window compaction [17:36:10] T133395: Evaluate TimeWindowCompactionStrategy - https://phabricator.wikimedia.org/T133395 [17:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:36:24] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:36:45] bblack i have restarted grrrit-wm [17:36:54] Needs restarts after gerrit is restarted [17:37:42] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: removed wdqs probe ref T132457 [puppet] - 10https://gerrit.wikimedia.org/r/315528 (owner: 10BBlack) [17:38:24] 06Operations, 10hardware-requests: EQIAD: (2) hardware access request for PUPPET - https://phabricator.wikimedia.org/T142218#2709547 (10RobH) [17:39:06] 06Operations, 10Traffic: Standardize varnish applayer backend definitions - https://phabricator.wikimedia.org/T147844#2709551 (10BBlack) [17:39:08] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 3 others: Move wdqs to an LVS service - https://phabricator.wikimedia.org/T132457#2709549 (10BBlack) 05Open>03Resolved a:03BBlack [17:40:17] 
06Operations, 10ops-eqiad, 13Patch-For-Review: Add new disks to syslog server in eqiad (lithium) - https://phabricator.wikimedia.org/T143307#2709556 (10RobH) [17:40:43] (03CR) 10Dzahn: "@hashar applied on contint1001. i saw it install the various php packages, Apache, contint::website configs for doc/integration, openjdk, " [puppet] - 10https://gerrit.wikimedia.org/r/315146 (owner: 10Hashar) [17:41:05] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [17:41:45] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [17:42:14] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[retry-load-new-vcl-file] [17:43:44] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:13] 06Operations, 10Architecture, 10RESTBase, 10ArchCom-RfC (ArchCom-Approved), and 6 others: RFC: Request timeouts and retries - https://phabricator.wikimedia.org/T97204#2709594 (10GWicke) [17:44:23] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:44:45] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:45:29] 06Operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1159901 (10GWicke) @Eevans, is there anything actionable left to do here? 
[17:45:34] 06Operations, 10RESTBase, 10RESTBase-Cassandra: secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#2709604 (10GWicke) p:05High>03Low [17:45:45] 06Operations, 10RESTBase, 10RESTBase-Cassandra, 06Services (doing): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329#1159901 (10GWicke) [17:46:24] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2709611 (10RobH) [17:46:53] 06Operations, 10hardware-requests: eqiad: 1 hardware access request for labs graphite - https://phabricator.wikimedia.org/T137724#2709619 (10RobH) [17:47:03] 06Operations, 10ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2709623 (10RobH) [17:47:10] 06Operations, 10media-storage, 05Goal: expand swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2709627 (10RobH) [17:47:25] 06Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785#2709634 (10RobH) [17:47:27] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2709635 (10RobH) [17:47:37] 06Operations, 10ops-eqiad: rack/setup/install/deploy labsdb1009-labsdb1011 - https://phabricator.wikimedia.org/T136860#2709639 (10RobH) [17:47:57] 06Operations, 10media-storage, 05Goal: expand swift hardware in codfw/eqiad - https://phabricator.wikimedia.org/T130012#2122471 (10RobH) [17:48:20] 06Operations, 10hardware-requests: eqiad: (3) nodes for Druid / analytics - https://phabricator.wikimedia.org/T128807#2709653 (10RobH) [17:48:33] 06Operations, 10hardware-requests: additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2709662 (10RobH) [17:48:41] 06Operations, 10ops-codfw, 06DC-Ops: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2709666 (10RobH) [17:48:49] 06Operations, 10hardware-requests: 
additional graphite machines request, 1x per DC - https://phabricator.wikimedia.org/T126253#2008955 (10RobH) [17:48:51] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup new host graphite2002 - https://phabricator.wikimedia.org/T130938#2709671 (10RobH) [17:49:01] 06Operations, 10hardware-requests, 05codfw-rollout: MediaWiki maintenance host for codfw (terbium's equivalent) - https://phabricator.wikimedia.org/T126987#2709676 (10RobH) [17:49:12] 06Operations, 10hardware-requests, 05codfw-rollout: codfw: (2) servers for redis jobrunners - https://phabricator.wikimedia.org/T126453#2709682 (10RobH) [17:49:14] 06Operations, 10ops-codfw, 13Patch-For-Review, 05codfw-rollout: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2709680 (10RobH) [17:49:23] 06Operations, 10hardware-requests, 05codfw-rollout: Log host for codfw (fluorine's equivalent) - https://phabricator.wikimedia.org/T126988#2709687 (10RobH) [17:49:25] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 05codfw-rollout: rack new mw log host - sinistra - https://phabricator.wikimedia.org/T128796#2709686 (10RobH) [17:49:40] 06Operations, 10ops-codfw, 06DC-Ops: db2012 degraded RAID - https://phabricator.wikimedia.org/T124645#2709696 (10RobH) [17:49:42] 06Operations, 10ops-codfw, 06DC-Ops: db2019 has a failed disk - https://phabricator.wikimedia.org/T120073#2709697 (10RobH) [17:49:49] 06Operations, 10RESTBase, 10hardware-requests: 3x additional SSD for restbase hp hardware - https://phabricator.wikimedia.org/T126626#2709702 (10RobH) [17:50:24] 06Operations, 10ops-codfw, 06DC-Ops, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#2709724 (10RobH) [17:50:40] 06Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2709742 (10RobH) [17:52:25] hi wikibugs [17:53:03] robh: hiyaaaa, bumping https://phabricator.wikimedia.org/T145082 [17:53:24] 
06Operations, 10Traffic: SSL certificate for policy.wikimedia.org - https://phabricator.wikimedia.org/T110197#2709767 (10RobH) [17:53:52] (03PS1) 10BBlack: cache_misc: pybal_config: use puppetmaster1001.eqiad only [puppet] - 10https://gerrit.wikimedia.org/r/315531 (https://phabricator.wikimedia.org/T147847) [17:54:06] yep, i saw the bump, im working on an audit of all the historical purcahses already this fiscal, and then im planning to pick that back up [17:54:06] as indeed, i have half the quotes in (basically one vendor) for that [17:54:24] right now my procurement board is a mess of pending invoice items, clearing it out, heh [17:54:50] 06Operations, 10Traffic, 13Patch-For-Review: Move pybal_config to an LVS service - https://phabricator.wikimedia.org/T147847#2709790 (10BBlack) See patch above. I think we can skip over the LVS bit here, until some later date when we have more than one backend defined per-datacenter. We also can't/shouldn'... [17:56:06] (03PS3) 1020after4: `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 [17:56:54] (03CR) 1020after4: `scap patch` tool for applying patches to a wmf/branch (035 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [17:57:43] (03PS1) 10EBernhardson: Prefer pages in the user's language in multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315532 (https://phabricator.wikimedia.org/T68829) [18:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T1800). [18:00:04] Pchelolo, RoanKattouw, and dcausse: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. 
[18:00:17] * RoanKattouw is here [18:00:17] 06Operations, 10Security-Reviews, 06Services (blocked), 15User-mobrovac: Productize the Electron PDF render service & create a REST API end point - https://phabricator.wikimedia.org/T142226#2709819 (10GWicke) p:05Triage>03Normal [18:00:24] * Pchelolo is here [18:00:29] * dcausse too [18:01:39] i suppose if there's room can throw the patch i added above into the mix as well [18:02:32] noone volunteering to deploy though :P i suppose i can ship it all out [18:02:39] \o/ [18:03:14] dcausse: looks safe to ship all 3 of yours together? none have direct impact on requests [18:03:49] ebernhardson: there are some new vars but yes I think we can [18:04:06] (03CR) 10EBernhardson: [C: 032] Adjust shard & replica count for enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314262 (owner: 10DCausse) [18:04:08] (03CR) 10EBernhardson: [C: 032] [cirrus] switch cirrus BM25 A/B test config to ja, zh, th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:04:10] (03CR) 10EBernhardson: [C: 032] [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:04:33] (03PS1) 10Rush: labsdb: maintain-views laundry list of changes [puppet] - 10https://gerrit.wikimedia.org/r/315534 [18:04:35] (03Merged) 10jenkins-bot: Adjust shard & replica count for enwiki and dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314262 (owner: 10DCausse) [18:04:44] (03CR) 10jenkins-bot: [V: 04-1] [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:04:50] :( [18:04:54] bah, i have to rebase as i go :P [18:05:02] ok :) [18:05:26] (03PS4) 10EBernhardson: [cirrus] switch cirrus BM25 A/B test config to ja, zh, th [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:05:31] (03CR) 10jenkins-bot: [V: 04-1] labsdb: maintain-views laundry list of changes [puppet] - 10https://gerrit.wikimedia.org/r/315534 (owner: 10Rush) [18:05:40] (03CR) 10EBernhardson: [C: 032] [cirrus] switch cirrus BM25 A/B test config to ja, zh, th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:06:10] (03Merged) 10jenkins-bot: [cirrus] switch cirrus BM25 A/B test config to ja, zh, th [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315250 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:06:33] (03PS3) 10EBernhardson: [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:06:41] (03CR) 10EBernhardson: [C: 032] [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:07:10] (03Merged) 10jenkins-bot: [cirrus] Activate BM25 on top 10 wikis: Step 1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315297 (https://phabricator.wikimedia.org/T147508) (owner: 10DCausse) [18:07:11] Pchelolo: can yours be tested on mw1099? 
[18:07:48] edsanders: lemme try, 5 minutes [18:08:07] Pchelolo: i havn't pulled it yet, but will in a moment [18:08:56] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT cirrussearch config updates (duration: 01m 10s) [18:08:57] (03PS1) 10BBlack: rcstream: internal service IP/hostname [dns] - 10https://gerrit.wikimedia.org/r/315536 (https://phabricator.wikimedia.org/T147845) [18:08:59] (03PS1) 10BBlack: rcstream: internal LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315537 (https://phabricator.wikimedia.org/T147845) [18:09:01] (03PS1) 10BBlack: cache_misc: use stream LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315538 (https://phabricator.wikimedia.org/T147845) [18:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:08] although most likely it can't be tested since we don't have control where the job will execute.. [18:09:38] Pchelolo: right, it will just execute on a job runner [18:10:04] (03PS2) 10BBlack: logstash - DNS entries for LVS service [dns] - 10https://gerrit.wikimedia.org/r/312342 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [18:10:08] (03CR) 10jenkins-bot: [V: 04-1] rcstream: internal LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315537 (https://phabricator.wikimedia.org/T147845) (owner: 10BBlack) [18:10:12] (03CR) 10jenkins-bot: [V: 04-1] cache_misc: use stream LVS service [puppet] - 10https://gerrit.wikimedia.org/r/315538 (https://phabricator.wikimedia.org/T147845) (owner: 10BBlack) [18:10:14] (03PS2) 10Rush: labsdb: maintain-views laundry list of changes [puppet] - 10https://gerrit.wikimedia.org/r/315534 [18:10:21] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: SWAT T147508 Activate BM25 on top 10 wikis: Step 1 (duration: 00m 50s) [18:10:23] T147508: BM25: initial limited release into production - https://phabricator.wikimedia.org/T147508 [18:10:28] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:10:30] (03CR) 10BBlack: [C: 032] logstash - DNS entries for LVS service [dns] - 10https://gerrit.wikimedia.org/r/312342 (https://phabricator.wikimedia.org/T132458) (owner: 10Gehel) [18:10:48] (03CR) 10Dzahn: [C: 032] network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 (owner: 10Dzahn) [18:10:54] (03PS8) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [18:11:02] PROBLEM - jenkins_zmq_publisher on contint1001 is CRITICAL: connect to address 127.0.0.1 and port 8888: Connection refused [18:11:51] Pchelolo: it's pulled to mw1099, but if you can't test i can jst sync it out [18:12:05] ebernhardson: ye, no way to test [18:12:40] RoanKattouw: around for flow re-enable on frwikiquote? [18:12:49] ebernhardson: Yup [18:12:58] (03CR) 10EBernhardson: [C: 032] Re-enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315527 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [18:13:06] (03PS2) 10EBernhardson: Re-enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315527 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [18:13:12] (03CR) 10EBernhardson: [C: 032] Re-enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315527 (https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [18:13:18] !log ebernhardson@mira Synchronized php-1.28.0-wmf.22/extensions/EventBus/EventBus.hooks.php: SWAT Dont set added/removed properties if they are empty (duration: 00m 52s) [18:13:21] Pchelolo: all synced out [18:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:13:28] checking [18:13:43] (03Merged) 10jenkins-bot: Re-enable Flow beta feature on frwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315527 
(https://phabricator.wikimedia.org/T138310) (owner: 10Catrope) [18:13:58] ok thanks robh [18:14:16] (03CR) 10Alexandros Kosiaris: [C: 032] nodepool: lower throttling rate to OpenStack API [puppet] - 10https://gerrit.wikimedia.org/r/315214 (owner: 10Hashar) [18:14:21] (03PS2) 10Alexandros Kosiaris: nodepool: lower throttling rate to OpenStack API [puppet] - 10https://gerrit.wikimedia.org/r/315214 (owner: 10Hashar) [18:14:23] (03CR) 10Alexandros Kosiaris: [V: 032] nodepool: lower throttling rate to OpenStack API [puppet] - 10https://gerrit.wikimedia.org/r/315214 (owner: 10Hashar) [18:14:25] RoanKattouw: pulled to mw1099 [18:14:30] (03PS2) 10BBlack: rcstream: internal service IP/hostname [dns] - 10https://gerrit.wikimedia.org/r/315536 (https://phabricator.wikimedia.org/T147845) [18:14:41] 06Operations, 10ops-eqiad, 10DBA: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396#2709877 (10Cmjohnson) Dell Return Label Tracking USPS 9202 3946 5301 2432 5936 66 [18:16:04] ebernhardson: WFM [18:16:53] (03PS4) 10Ori.livneh: [WIP] Module for Recommendation API [puppet] - 10https://gerrit.wikimedia.org/r/312045 [18:17:51] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT T138310 Re-enable Flow beta feature on frwikiquote (duration: 00m 50s) [18:17:53] T138310: Flow as a Beta feature: enable, disable and reenable doesn't seem to work - https://phabricator.wikimedia.org/T138310 [18:17:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:02] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2709887 (10Gehel) [18:18:04] 06Operations, 06Discovery, 06Maps, 10Maps-data, and 2 others: Ensure Maps servers can be installed easily (automation + documentation) - https://phabricator.wikimedia.org/T138501#2709885 (10Gehel) 05Open>03Resolved Procedure is being tested again on the reimage of maps-test servers. This seems to work... 
[18:18:06] dcausse: i just noticed your patch has a variable mis-named, so cirrus won't use it [18:18:10] (03CR) 10Chad: `scap patch` tool for applying patches to a wmf/branch (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [18:18:20] dcausse: wmgCirrusSimilarityProfile should have been wmgCirrusSearchSimilarityProfile. will fix [18:18:21] :/ [18:18:34] RoanKattouw: you're all synced out [18:18:54] (03PS3) 10Alexandros Kosiaris: Introduce sca1003, sca1004, sca2003, sca2004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/315064 (https://phabricator.wikimedia.org/T147409) [18:18:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Introduce sca1003, sca1004, sca2003, sca2004 to the cluster [puppet] - 10https://gerrit.wikimedia.org/r/315064 (https://phabricator.wikimedia.org/T147409) (owner: 10Alexandros Kosiaris) [18:19:31] (03CR) 10Chad: [C: 031] `scap patch` tool for applying patches to a wmf/branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312013 (owner: 1020after4) [18:20:11] ebernhardson: doh, sorry :/ [18:20:13] (03PS1) 10EBernhardson: wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315544 [18:20:15] dcausse: ^ [18:20:44] (03CR) 10jenkins-bot: [V: 04-1] wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315544 (owner: 10EBernhardson) [18:21:17] ebernhardson: I've a problem with new A/B test setup as well, not sure what's wrong [18:21:34] https://ja.wikipedia.org/w/index.php?search=%E3%80%9Ctest&title=%E7%89%B9%E5%88%A5:%E6%A4%9C%E7%B4%A2&go=%E8%A1%A8%E7%A4%BA&cirrusUserTesting=bm25:inclinks&cirrusDumpQuery [18:21:44] but this can wait probably [18:22:13] (03PS9) 10Dzahn: network/constants: add maintenance hosts (v4) [puppet] - 10https://gerrit.wikimedia.org/r/314778 [18:22:15] hmm, yea that's not right ... but also not visible. 
it's already late there you can certainly look into it tomorrow [18:22:24] sure [18:22:29] (03PS2) 10EBernhardson: wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315544 [18:23:21] dcausse: the log is: Catchable fatal error: Argument 1 passed to CirrusSearch\Search\RescoreBuilder::getSupportedProfile() must be an instance of array, null given in /srv/mediawiki/php-1.28.0-wmf.21/extensions/CirrusSearch/includes/Search/RescoreBuilders.php on line 165 [18:23:35] (03PS1) 10Cmjohnson: Adding mgmt dns entries for new kubernetes serveres T147933 [dns] - 10https://gerrit.wikimedia.org/r/315546 [18:23:37] it must not be resolving the named profile into the actual profile [18:23:52] (03CR) 10EBernhardson: [C: 032] wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315544 (owner: 10EBernhardson) [18:24:00] ebernhardson: ok will fix tomorrow [18:24:02] yup [18:24:06] thanks! [18:24:23] (03Merged) 10jenkins-bot: wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315544 (owner: 10EBernhardson) [18:25:25] ebernhardson: Thanks! [18:25:26] 06Operations, 10Traffic, 13Patch-For-Review: Move rcstream to an LVS service - https://phabricator.wikimedia.org/T147845#2709924 (10BBlack) Hmmm, now that I'm really looking at the changes, it's pretty clear we have a bigger problem here than with most services. For some reason I can't remember, rcstream wa... 
[18:26:40] ebernhardson: ah got it, these new profiles were added in wmf22 which is not rolled out there yet [18:27:11] (03CR) 10Dzahn: "the variable containing maintenance hosts has now been added to network/constants.pp" [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [18:27:14] (03PS2) 10Cmjohnson: Adding mgmt dns entries for new kubernetes serveres T147933 [dns] - 10https://gerrit.wikimedia.org/r/315546 [18:27:15] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile (duration: 00m 49s) [18:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:27:45] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for new kubernetes serveres T147933 [dns] - 10https://gerrit.wikimedia.org/r/315546 (owner: 10Cmjohnson) [18:28:10] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for new kubernetes serveres T147933 [dns] - 10https://gerrit.wikimedia.org/r/315546 (owner: 10Cmjohnson) [18:28:17] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: SWAT wgCirrusSimilarityProfile -> wgCirrusSearchSimilarityProfile (duration: 00m 53s) [18:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:29:07] * ebernhardson hates that renaming variables causes a flood of warnings ... 
[18:29:27] ebernhardson: my bad sorry about that [18:29:46] no worries, it's no end user effect just logging spam :) [18:30:04] 06Operations, 10Cassandra, 10hardware-requests, 06Services (next), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2709927 (10GWicke) [18:30:13] ok, thanks for the quick fix, I'll be able to reindex tomorrow hopefully :) [18:30:20] (03CR) 10EBernhardson: [C: 032] Prefer pages in the user's language in multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315532 (https://phabricator.wikimedia.org/T68829) (owner: 10EBernhardson) [18:30:29] (03CR) 10jenkins-bot: [V: 04-1] Prefer pages in the user's language in multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315532 (https://phabricator.wikimedia.org/T68829) (owner: 10EBernhardson) [18:30:37] * ebernhardson rebases.. [18:30:41] (03PS2) 10EBernhardson: Prefer pages in the user's language in multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315532 (https://phabricator.wikimedia.org/T68829) [18:30:48] (03CR) 10EBernhardson: [C: 032] Prefer pages in the user's language in multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315532 (https://phabricator.wikimedia.org/T68829) (owner: 10EBernhardson) [18:31:17] (03Merged) 10jenkins-bot: Prefer pages in the user's language in multilingual wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315532 (https://phabricator.wikimedia.org/T68829) (owner: 10EBernhardson) [18:34:40] (03PS2) 10Dzahn: logstash: let maintenance hosts connect to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [18:35:10] (03PS5) 10Ori.livneh: [WIP] Module for Recommendation API [puppet] - 10https://gerrit.wikimedia.org/r/312045 [18:35:56] 06Operations, 03Interactive-Sprint: maps-test* hosts running low on space - 
https://phabricator.wikimedia.org/T146848#2709963 (10Gehel) 05Open>03Resolved a:03Gehel Resovled by reimaging the maps-test servers [18:37:35] !log ebernhardson@mira Synchronized wmf-config/InitialiseSettings.php: SWAT T66829 Prefer articles in a users language on multilingual wikis (duration: 00m 50s) [18:37:37] T66829: Allow hiding OAuth edits in Recent Changes - https://phabricator.wikimedia.org/T66829 [18:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:30] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-common.php: SWAT T66829 Prefer articles in a users language on multilingual wikis (duration: 00m 51s) [18:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:39:48] (03CR) 10Dzahn: "@Eevans amended to use $MAINTENANCE_HOSTS and just add it to the existing rule for port 9200. looks good?" [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [18:43:13] SWAT done [18:44:41] (03PS5) 10Giuseppe Lavagetto: Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:45:33] (03CR) 10jenkins-bot: [V: 04-1] Conftool: Create script that checks the state after (de)pooling [puppet] - 10https://gerrit.wikimedia.org/r/310454 (https://phabricator.wikimedia.org/T145518) (owner: 10Mobrovac) [18:46:28] <_joe_> fu, rubocop [18:46:39] (03PS1) 10Cmjohnson: Removing dns entries for mw1217 T138925 [dns] - 10https://gerrit.wikimedia.org/r/315547 [18:48:38] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack/Setup pay-lvs1003 and pay-lvs1004 - https://phabricator.wikimedia.org/T143900#2710059 (10Jgreen) 05Open>03Resolved [18:49:28] 06Operations, 06Services (watching): Split the API MediaWiki appserver pool into two external/internal pools - https://phabricator.wikimedia.org/T125085#2710082 (10Pchelolo) 
[18:49:31] (03PS1) 10Cmjohnson: Removing all entries of decommissioned host mw1217 [puppet] - 10https://gerrit.wikimedia.org/r/315549 [18:49:35] 07Blocked-on-Operations, 06Operations, 10Graphite, 06WMDE-Analytics-Engineering, and 3 others: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#2710084 (10Pchelolo) [18:50:45] robh: ^ "Blocked-on-Operations", i think we only had like 3 left and i removed one [18:51:02] oh.. maybe not [18:51:18] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: pay-lvs1003/pay-lvs1004 hardware swap for pay-lvs1001/pay-lvs1002 - https://phabricator.wikimedia.org/T147932#2710088 (10Jgreen) [18:51:30] well, the dashboard you once made has the "Blocked-on-Ops" section and that was empty for me now [18:51:44] yeah i only saw one last week [18:51:46] and now its gone [18:51:50] we should get that archived or people will keep re-adding it i guess [18:52:02] i can archive it now since we planned to anyhow [18:52:06] :) cool [18:52:32] oh wait [18:52:32] it shows 5 more [18:52:33] the dash is somehow broken [18:52:59] robh: but https://phabricator.wikimedia.org/tag/blocked-on-operations/ yea [18:53:16] on-Operations vs. on-Ops ? [18:53:29] dunno, digging into dash [18:53:34] it has saved queries [18:54:44] (03CR) 10Cmjohnson: [C: 032] Removing all entries of decommissioned host mw1217 [puppet] - 10https://gerrit.wikimedia.org/r/315549 (owner: 10Cmjohnson) [18:54:46] oh, its assigned to noone and blocked on ops [18:54:56] ah [18:55:00] not just blocked on ops [18:55:00] so those are all assigned. [18:55:33] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:55:50] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for mw1217 T138925 [dns] - 10https://gerrit.wikimedia.org/r/315547 (owner: 10Cmjohnson) [18:56:18] panel updated for clarity [18:56:26] (03CR) 10Eevans: [C: 031] "> @Eevans amended to use $MAINTENANCE_HOSTS and just add it to the" [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [18:56:47] if we archive a project does it rip it off existing? [18:56:51] bsod 503 Service Temporarily Unavailable [18:56:51] or merely not allow it to be added to more? [18:57:05] trying to reach cs.wiktionary.org [18:58:15] (03PS3) 10Dzahn: Sort by alphabetical order wikimedia-chapter Apache sites [puppet] - 10https://gerrit.wikimedia.org/r/314469 (owner: 10Dereckson) [18:58:41] fatalmonitor has a funny output [18:59:15] (03CR) 10Dzahn: [C: 032] "i double checked before and after with sort and diff, it's the same" [puppet] - 10https://gerrit.wikimedia.org/r/314469 (owner: 10Dereckson) [19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T1900). [19:00:16] * thcipriani does [19:00:41] Danny_B: cs.wikt works for me [19:01:07] sorry, was in other screen, indeed cs.wikt works for me and none of our monitors (catchpoint) report anythign yet (not saying its not happening!) [19:01:23] robh: i think it will stay on the existing tickets, just the color of the tag will change to a lighter color [19:01:43] and it will be renamed to -archived or so [19:01:46] mutante: so i just archived blocked-on-ops after a chat in -releng. archive turns it grey and stops it being the top preferred in auto complete [19:01:56] oh? [19:01:58] robh: sounds good [19:01:59] no rename. [19:02:06] unless we do that manually. [19:02:18] i was wrong then, -releng knows best [19:05:06] Danny_B: does the error happen every load or sporadically? 
[19:05:58] what is happening with wmf-config/event-schemas on mira? Is someone working on that? [19:06:02] (03PS3) 10Dzahn: Add ec.wikimedia.org to Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/314470 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [19:07:22] ebernhar|lunch: waait, event-schemas [19:07:40] ...do those need to be updated? [19:07:48] per: https://gerrit.wikimedia.org/r/#/c/315443/1 [19:08:55] (03CR) 10Dzahn: [C: 032] Add ec.wikimedia.org to Apache configuration [puppet] - 10https://gerrit.wikimedia.org/r/314470 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [19:09:42] ^ new chapter in Ecuador [19:09:59] well, not sure if technically "chapter" or user group, but Wikimedias there [19:10:47] user group [19:12:23] https://meta.wikimedia.org/wiki/Wikimedistas_de_Ecuador "Approved as a user group on 18-09-2015" [19:12:39] 06Operations, 10ops-eqiad, 13Patch-For-Review: Broken memory on mw1217 - https://phabricator.wikimedia.org/T138925#2710182 (10Cmjohnson) 05Open>03Resolved Removed from DNS Switch Racktables Disks wiped Rack Added to Decom tracking sheet [19:14:08] thcipriani: doh, event-schemas shouldn't have been updated there [19:14:25] sec i'll make a patch to undo that [19:14:37] !log disconnecting production cable from old pay-lvs1002 (replaced with new) T147932 [19:14:38] T147932: pay-lvs1003/pay-lvs1004 hardware swap for pay-lvs1001/pay-lvs1002 - https://phabricator.wikimedia.org/T147932 [19:14:39] ebernhar|lunch: thank you :) [19:14:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:00] Dereckson: i'll just let puppet deploy that, later we can do the DNS change [19:15:38] (03PS1) 10EBernhardson: Revert event-schemas to the appropriate submodule hash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315551 [19:15:42] Okay, once the DNS change is ready, I'll see for a deployment window scheduling [19:15:44] that way we dont get error pages cached 
[19:15:51] ok, great [19:15:58] thcipriani: https://gerrit.wikimedia.org/r/315551 [19:16:16] ebernhar|lunch: cool, thanks, I'll get it merged and corrected. [19:16:45] (03CR) 10Thcipriani: [C: 032] Revert event-schemas to the appropriate submodule hash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315551 (owner: 10EBernhardson) [19:18:36] (03PS2) 10Thcipriani: Revert event-schemas to the appropriate submodule hash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315551 (owner: 10EBernhardson) [19:18:53] (03CR) 10Thcipriani: Revert event-schemas to the appropriate submodule hash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315551 (owner: 10EBernhardson) [19:18:59] (03CR) 10Thcipriani: [C: 032] Revert event-schemas to the appropriate submodule hash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315551 (owner: 10EBernhardson) [19:19:25] in which I flail at the rebase button wildly [19:19:26] (03Merged) 10jenkins-bot: Revert event-schemas to the appropriate submodule hash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315551 (owner: 10EBernhardson) [19:20:14] hot dog. back on track :) [19:20:31] (03PS1) 10Cmjohnson: Removing dns entries for pay-lvs1003 and 1004. Re-using names for newer servers. T147932 [dns] - 10https://gerrit.wikimedia.org/r/315553 [19:21:13] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:30] (03Merged) 10jenkins-bot: group1 wikis to 1.28.0-wmf.22 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/315554 (owner: 10Thcipriani) [19:23:16] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for pay-lvs1003 and 1004. Re-using names for newer servers. 
T147932 [dns] - 10https://gerrit.wikimedia.org/r/315553 (owner: 10Cmjohnson) [19:24:36] 06Operations, 06Services (watching): reinstall OCG servers - https://phabricator.wikimedia.org/T84723#2710241 (10Pchelolo) [19:25:20] mutante: robh works now [19:26:15] Danny_B: good :) [19:26:25] if it only happens once it never happened right? ;D [19:27:32] !log thcipriani@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.28.0-wmf.22 [19:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:34] blerg Notice: Undefined property: MobilePage::$revisionTimestamp in /srv/mediawiki/php-1.28.0-wmf.22/extensions/MobileFrontend/includes/models/MobilePage.php on line 66 [19:35:43] hrm, looks like it just happened a bunch on 2 servers and then went away. Maybe a weird edgecase triggered by moving between wmf.21 and .22 [19:44:50] thcipriani: atomic-ness :) [19:51:34] 06Operations, 10Cassandra, 10hardware-requests, 06Services (next), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2710410 (10GWicke) @mark, @faidon, @RobH: Could you comment on the AQS cluster option? The staging cluster expansion is a blocker... [19:52:12] 06Operations, 10Cassandra, 10hardware-requests, 06Services (blocked), 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2710414 (10GWicke) [19:53:00] (03CR) 10Hashar: "Looks good thank you :)" [puppet] - 10https://gerrit.wikimedia.org/r/315146 (owner: 10Hashar) [19:55:59] (03CR) 10Hashar: "Thx! 
The rate change already looks promising, will look at it again later and comment on T146813" [puppet] - 10https://gerrit.wikimedia.org/r/315214 (owner: 10Hashar) [19:57:48] (03CR) 10Hashar: [C: 031] ";]" [puppet] - 10https://gerrit.wikimedia.org/r/315519 (owner: 10Chad) [20:00:05] gwicke, cscott, arlolra, subbu, bearND, mdholloway, halfak, Amir1, and yurik: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T2000). [20:05:10] no mobileapps deploy today [20:20:11] (03PS1) 10Hashar: contint: puppet cleanup for CI master [puppet] - 10https://gerrit.wikimedia.org/r/315563 [20:20:55] (03PS1) 10Yuvipanda: puppet: Add a message announcing deprecation to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315564 (https://phabricator.wikimedia.org/T120159) [20:21:12] andrewbogott: ^ message in role::puppet::self oto [20:21:57] (03CR) 10jenkins-bot: [V: 04-1] puppet: Add a message announcing deprecation to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315564 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [20:22:27] (03CR) 10Andrew Bogott: [C: 031] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/315564 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [20:23:13] (03CR) 10Hashar: "Some more legacy cruft on Jenkins. Found them while inspecting the contint1001 recent provisioning via puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/315563 (owner: 10Hashar) [20:27:18] 06Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cokkie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2710558 (10bd808) [20:27:33] 06Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2710572 (10bd808) [20:27:37] 98 [20:27:46] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:38:24] PROBLEM - Disk space on scb1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=87%) [20:40:47] 06Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2710588 (10bd808) I've found some blog post from 2012 that discuss IE6, 7 & 8 not supporting `Max-Age` and suggesting s... [20:47:11] 06Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2710558 (10BBlack) We use expires in our `CP` cookies as well (which track connection properties for HTTP/2 stats), so... 
[20:48:34] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher - https://phabricator.wikimedia.org/T147199#2684468 (10BBlack) [20:48:36] 06Operations, 10Analytics, 10Traffic: The WMF-Last-Access Set-Cookie header should follow RFC 2965 syntax rather than the pre-RFC Netscape format - https://phabricator.wikimedia.org/T147967#2710602 (10BBlack) [20:53:05] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:04:19] (03PS1) 10Chad: Gerrit: puppetize log4j.properties [puppet] - 10https://gerrit.wikimedia.org/r/315571 [21:08:44] PROBLEM - Disk space on scb1002 is CRITICAL: DISK CRITICAL - free space: / 326 MB (3% inode=87%) [21:23:22] 06Operations, 10Traffic: Removing support for DES-CBC3-SHA TLS cipher (drops IE8-on-XP support) - https://phabricator.wikimedia.org/T147199#2710681 (10Legoktm) [21:24:44] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:38:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [50.0] [21:43:56] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [21:46:01] (03CR) 10Dzahn: [C: 032] logstash: let maintenance hosts connect to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [21:49:50] urandom: logstash1001 now has the new iptables rules to allow terbium/wasat [21:50:01] mutante: \o/ [21:50:04] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:50:10] mutante: thank you sir! 
[21:50:25] urandom: you're welcome [21:51:13] (03CR) 10Dzahn: "confirmed working on logstast1001" [puppet] - 10https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: 10Eevans) [21:51:22] i pasted there how i checked it [21:51:29] cat /etc/ferm/conf.d/10_logstash_canary_checker_reporting [21:51:36] that is the ferm rule [21:51:58] and then with iptables you see it translates to actual host names we expected [21:52:20] will be the same on logstash1002 etc once puppet runs [21:54:12] (03PS2) 10Andrew Bogott: horizon: Show role::puppetmaster::standalone instead of role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315526 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [21:55:09] (03CR) 10Andrew Bogott: [C: 032] horizon: Show role::puppetmaster::standalone instead of role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315526 (https://phabricator.wikimedia.org/T120159) (owner: 10Yuvipanda) [22:23:42] (03PS2) 10Dzahn: Activate ec.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/314466 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [22:23:44] (03PS2) 10Yuvipanda: puppet: Add a message announcing deprecation to role::puppet::self [puppet] - 10https://gerrit.wikimedia.org/r/315564 (https://phabricator.wikimedia.org/T120159) [22:24:00] 06Operations, 10MediaWiki-extensions-CentralNotice, 10Traffic: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2591355 (10demon) >>! In T144194#2606379, @Legoktm wrote: > FWIW, MediaWiki has a built-in sitenotice (https://www.mediawiki.org/wiki/Manual:Interfac... 
[22:24:16] (03CR) 10Dzahn: [C: 032] Activate ec.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/314466 (https://phabricator.wikimedia.org/T135521) (owner: 10Dereckson) [22:25:27] Dereckson: ^ done [22:29:44] (03PS2) 10Dzahn: contint: puppet cleanup for CI master [puppet] - 10https://gerrit.wikimedia.org/r/315563 (owner: 10Hashar) [22:30:46] 06Operations, 10RESTBase, 06Services (later): Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#2711026 (10GWicke) p:05High>03Normal @yuvipanda, IIRC you mentioned that you were working on a WMF Jessie image. Do you have a... [22:31:46] 06Operations, 10Mobile-Content-Service, 10RESTBase, 06Services, 10Traffic: Varnish not purging RESTBase URIs - https://phabricator.wikimedia.org/T127370#2711040 (10GWicke) 05Open>03Resolved a:03GWicke Resolving this in favor of T127387. [22:32:07] 06Operations, 10RESTBase, 06Services (later): Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#2711044 (10yuvipanda) There's a 'docker-registry.tools.wmflabs.org/wikimedia-jessie' you can pull. I think at some point @Joe migh... 
[22:32:31] Operations, RESTBase, Services (later): Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#2711046 (yuvipanda) (No guarantees are made about the stability of that docker image, but I'll notify you in case we change it) [22:32:36] Operations, Mobile-Content-Service, RESTBase, Traffic, and 2 others: Split slash decoding from general percent normalization in Varnish VCL - https://phabricator.wikimedia.org/T127387#2711047 (GWicke) [22:33:22] (CR) Dzahn: [C: 2] contint: puppet cleanup for CI master [puppet] - https://gerrit.wikimedia.org/r/315563 (owner: Hashar) [22:35:05] mutante: thanks [22:35:40] Operations, Fundraising-Backlog, MediaWiki-extensions-CentralNotice, Fundraising Sprint Stirring The Pot, Patch-For-Review: Banner not showing up on site - https://phabricator.wikimedia.org/T144952#2615300 (K4-713) @RobLa-WMF - Thanks for getting involved on this one. Looks like we might need... [22:38:36] (CR) Dzahn: "all done on gallium.
except the directory itself will stay "Notice: /Stage[main]/Jenkins/File[/var/lib/jenkins/init.groovy.d]: Not removin" [puppet] - https://gerrit.wikimedia.org/r/315563 (owner: Hashar) [22:40:18] (PS1) Rush: WIP: bdsync backup setup for labstore [puppet] - https://gerrit.wikimedia.org/r/315595 [22:40:40] (PS2) Rush: WIP: bdsync backup setup for labstore [puppet] - https://gerrit.wikimedia.org/r/315595 [22:41:40] (CR) jenkins-bot: [V: -1] WIP: bdsync backup setup for labstore [puppet] - https://gerrit.wikimedia.org/r/315595 (owner: Rush) [22:49:12] (PS1) Dzahn: gerrit: puppetize reviewer-counts.json [puppet] - https://gerrit.wikimedia.org/r/315596 (https://phabricator.wikimedia.org/T147776) [22:52:57] Operations, MediaWiki-extensions-CentralNotice, Traffic: Varnish-triggered CN campaign about browser security - https://phabricator.wikimedia.org/T144194#2711135 (Legoktm) Well, if varnish doesn't cache those requests couldn't we just UA sniff in MW config and conditionally set the sitenotice based o... [22:54:48] (PS2) Dzahn: gerrit: puppetize reviewer-counts.json [puppet] - https://gerrit.wikimedia.org/r/315596 (https://phabricator.wikimedia.org/T147776) [22:55:14] (CR) Dzahn: [C: 2] gerrit: puppetize reviewer-counts.json [puppet] - https://gerrit.wikimedia.org/r/315596 (https://phabricator.wikimedia.org/T147776) (owner: Dzahn) [22:59:18] Operations, Citoid, Graphoid, VisualEditor, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#2711169 (GWicke) Open>Resolved Closing due to inactivity. [23:00:05] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20161012T2300).
[23:00:46] Operations, Parsoid, service-runner, service-template-node, Services (done): Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#2711173 (GWicke) Open>Resolved a:GWicke @jcook, we considered it when we started work on this... [23:01:11] I'm adding https://gerrit.wikimedia.org/r/#/c/314797/ to SWAT. [23:01:47] (PS11) Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [23:02:49] (PS2) Dereckson: Raise abuse filter emergency threshold for es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/314797 (https://phabricator.wikimedia.org/T145765) [23:04:24] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/314797 (https://phabricator.wikimedia.org/T145765) (owner: Dereckson) [23:04:52] Operations, Monitoring, RESTBase, service-template-node, Services (later): [Discussion] Consider validating JSON schemas when running x-ample tests? - https://phabricator.wikimedia.org/T110240#2711202 (GWicke) [23:04:52] (Merged) jenkins-bot: Raise abuse filter emergency threshold for es.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/314797 (https://phabricator.wikimedia.org/T145765) (owner: Dereckson) [23:05:21] live on mw1099 [23:07:01] Logs looks good to me, Especial:FiltroAntiAbusos too. [23:08:44] !log dereckson@mira Synchronized wmf-config/InitialiseSettings.php: Raise abuse filter emergency threshold for es.wikibooks (T145765) (duration: 01m 19s) [23:08:45] T145765: Raise abusefilter autodisable limit on es.wikibooks - https://phabricator.wikimedia.org/T145765 [23:08:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:05] SWAT done. [23:11:35] PROBLEM - puppet last run on db1078 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [23:19:34] Operations, RESTBase, RESTBase-Cassandra, Services, Patch-For-Review: column family cassandra metrics size - https://phabricator.wikimedia.org/T113733#2711284 (GWicke) [23:19:37] Operations, RESTBase, Services: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#2711285 (GWicke) [23:20:17] Operations, RESTBase, RESTBase-Cassandra, Services, Patch-For-Review: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#2711305 (GWicke) [23:21:15] Operations, Cassandra, RESTBase-Cassandra, Services: Cassandra uses default ip address for outbound packets while bootstrapping - https://phabricator.wikimedia.org/T128590#2711330 (GWicke) [23:21:18] Operations, RESTBase-Cassandra, Services: cassandra slow streaming during (de)commission - https://phabricator.wikimedia.org/T126619#2711331 (GWicke) [23:21:25] Operations, Cassandra, RESTBase-Cassandra, Services: Evaluate efficacy of DateTieredCompactionStrategy - https://phabricator.wikimedia.org/T126221#2711332 (GWicke) [23:22:59] Operations, Cassandra, Services: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2711359 (GWicke) [23:23:01] Operations, Services, RfC, User-Joe, and 2 others: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#2711360 (GWicke) [23:23:03] Operations, Services, Traffic, discovery-system, services-tooling: Figure out an etcd deploy strategy that includes multi DC failure scenarios.
- https://phabricator.wikimedia.org/T98165#2711361 (GWicke) [23:27:11] (PS1) Reedy: $wgMWOAuthCentralWiki = false; [mediawiki-config] - https://gerrit.wikimedia.org/r/315602 [23:27:33] Dereckson, if swat is done, OuKB and I will do some maps work [23:29:41] OuKB, could you depl https://gerrit.wikimedia.org/r/#/c/315603 [23:30:21] Operations, RESTBase, Services (later): enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#2711404 (Pchelolo) [23:32:38] I've added https://gerrit.wikimedia.org/r/#/c/315603/ to swat - guess will deploy myself [23:34:42] Operations, RESTBase, Services, Traffic, Service-Architecture: Proxying new services through RESTBase - https://phabricator.wikimedia.org/T96688#2711407 (GWicke) Open>Resolved a:GWicke Yeah, there isn't much useful life in this one left. Closing. [23:37:08] RECOVERY - puppet last run on db1078 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [23:38:40] !log maxsem@mira Synchronized php-1.28.0-wmf.22/extensions/Kartographer/: https://gerrit.wikimedia.org/r/#/c/315603/ (duration: 01m 03s) [23:38:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:14] Operations, RESTBase, Services (later): Provide production jessie image with node 4.2; use this for service-runner build command - https://phabricator.wikimedia.org/T123237#2711411 (GWicke) @yuvipanda: Awesome, thanks!
[23:42:25] (PS6) Ori.livneh: Module and role for Recommendation API [puppet] - https://gerrit.wikimedia.org/r/312045 (https://phabricator.wikimedia.org/T116102) [23:43:33] Operations, RfC, Services (watching), User-Joe, and 2 others: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#2711430 (Pchelolo) [23:47:06] FYI: it seems like the job queue is not very happy: https://grafana.wikimedia.org/dashboard/db/job-queue-health [23:48:49] hi [23:49:01] https://commons.wikimedia.org/wiki/File:Map_of_Hindoostan,_1788,_by_Rennell.jpg can't create thumbs? [23:49:13] (big file) [23:50:05] Error creating thumbnail: An unknown error occurred. for all thumbs