[00:00:15] Operations, Wikimedia-Apache-configuration: catch-all apache vhost on the cluster should return 404 for non-existing sites - https://phabricator.wikimedia.org/T137176#2359804 (Dzahn)
[00:00:38] Amir1: no, it's the scap3/puppet config race issue
[00:01:10] I think I've fixed it for now, but it will resurface on subsequent deploys until we fix it
[00:01:43] akosiaris: can you explain more? maybe I could help
[00:02:14] Amir1: there is the race condition of scap3 shipping the new directory and changing the symlink and then restarting the service
[00:02:27] puppet had not run yet so 99-config.yaml did not exist
[00:02:37] logs showes as a result no python application found, check your startup logs for errors
[00:02:41] oh the old issue
[00:02:44] showed*
[00:02:59] yes. once we get that fixed, we should not have any more problems
[00:03:20] akosiaris: I have a patch for it
[00:03:22] btw, config deployment is bound to become part of scap. There is quite a bit of work in that regard already
[00:03:27] https://gerrit.wikimedia.org/r/292516
[00:03:50] Amir1: does ores support reading from that dir now?
[00:03:58] I mean, the currently deployed version
[00:04:15] not yet. I was actually waiting for this to be merged
[00:04:24] ok then
[00:04:25] and then fix the ores
[00:04:35] because it already made the file
[00:06:03] akosiaris: just to be clear, we need the puppet patch merged first or the ores config?
[00:06:28] the latter
[00:06:34] and then the puppet patch
[00:06:57] I think that would be easy
[00:07:12] do you want to do it right now? I think it can wait
[00:16:25] Amir1: no, it can wait
[00:16:53] RECOVERY - ores on scb2002 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 0.083 second response time
[00:16:54] RECOVERY - ores on scb1001 is OK: HTTP OK: HTTP/1.0 200 OK - 2801 bytes in 0.013 second response time
[00:26:57] !log ori@tin Synchronized php-1.28.0-wmf.4/extensions/CentralAuth/includes/CentralAuthHooks.php: I79cbb1dc: Prefetch $wgCentralAuthLoginWiki DNS (T92864) (duration: 00m 29s)
[00:26:58] T92864: Investigate the use of DNS prefetching for reducing page load time. - https://phabricator.wikimedia.org/T92864
[00:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:39:37] !log Created MX and SPF records directly for wmflabs.org. for https://phabricator.wikimedia.org/T137160#2359786
[00:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:39:46] !log (TXT record for SPF, actually)
[00:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:44:54] PROBLEM - MegaRAID on es2004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[00:54:12] * aude would like to deploy https://gerrit.wikimedia.org/r/#/c/293048/ now
[00:56:10] assume it's ok :)
[01:10:04] !log aude@tin Synchronized php-1.28.0-wmf.4/extensions/Wikidata: Fix bug (T136093) in display of labels after edit (duration: 02m 03s)
[01:10:05] T136093: Undefined index: entities in /srv/mediawiki/php-1.28.0-wmf.2/extensions/Wikidata/extensions/ArticlePlaceholder/includes/SearchHookHandler.php on line 245 - https://phabricator.wikimedia.org/T136093
[01:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:10:44] looks good
[01:13:41] (PS1) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[01:17:04] PROBLEM - puppet last run on mw2179 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:21:53] PROBLEM - puppet last run on install2001 is CRITICAL: CRITICAL: puppet fail
[01:43:23] RECOVERY - puppet last run on mw2179 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[01:48:13] RECOVERY - puppet last run on install2001 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[01:49:28] (CR) Ori.livneh: [C: 1] "Looks OK. I would have maybe gone with a more generic group name that could be reused elsewhere where we want to restrict access to truste" [puppet] - https://gerrit.wikimedia.org/r/292405 (owner: Jcrespo)
[01:49:45] (PS2) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[02:25:26] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.4) (duration: 09m 36s)
[02:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:28:57] (PS1) Yuvipanda: tools: Switch the label selector used in kube2proxy [puppet] - https://gerrit.wikimedia.org/r/293065
[02:30:57] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 7 02:30:57 UTC 2016 (duration 5m 32s)
[02:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:16] (PS1) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[02:36:50] (PS2) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[02:39:46] (PS3) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[02:45:02] (PS4) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[02:46:36] (PS5) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[02:47:39] (PS2) Yuvipanda: tools: Switch the label selector used in kube2proxy [puppet] - https://gerrit.wikimedia.org/r/293065
[02:47:58] (CR) Yuvipanda: [C: 2 V: 2] tools: Switch the label selector used in kube2proxy [puppet] - https://gerrit.wikimedia.org/r/293065 (owner: Yuvipanda)
[02:53:30] (PS6) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[03:32:49] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:38:39] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[03:50:29] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:54:28] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[03:57:38] Operations, Monitoring, Services: Update restbase catchpoint metric - https://phabricator.wikimedia.org/T137181#2359943 (GWicke)
[04:27:58] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 2.238 second response time
[04:42:49] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: puppet fail
[04:59:02] (PS3) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[04:59:35] (CR) jenkins-bot: [V: -1] [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063 (owner: Yuvipanda)
[05:01:27] (PS4) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[05:09:35] (CR) Giuseppe Lavagetto: [C: 1] "I think it's time. See a small comment, great work overall!" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: Dzahn)
[05:10:58] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:12:37] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 202, down: 0, dormant: 0, excluded: 0, unused: 0
[05:36:10] (PS7) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[05:37:42] (CR) Dzahn: DHCP: switch default installer to jessie (1 comment) [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: Dzahn)
[05:47:37] (PS8) Dzahn: DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539)
[05:49:04] Operations, Citoid, Graphoid, Services, and 3 others: SCB services should not use a proxy for our domains - https://phabricator.wikimedia.org/T97530#1245380 (GWicke) @mobrovac, is there still anything left to be done here?
[05:52:47] Operations, Services, Patch-For-Review, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#2359997 (GWicke) This has been implemented with [checker.py](https://github.com/wikimedia/operations-puppet/blob/production/modules/service/files/chec...
[05:52:48] (PS5) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[05:55:49] (PS6) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[06:07:18] (PS1) Dzahn: decom einsteinium ? [puppet] - https://gerrit.wikimedia.org/r/293068
[06:10:12] (PS2) Dzahn: decom einsteinium ? [puppet] - https://gerrit.wikimedia.org/r/293068
[06:30:57] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: puppet fail
[06:31:06] PROBLEM - puppet last run on lvs2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:45] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:05] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[06:33:16] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:14] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 3 failures
[06:37:29] Operations, Mail, OTRS: otrs email outage tracking task - https://phabricator.wikimedia.org/T137145#2360005 (MoritzMuehlenhoff) Thanks. I'll write an incident report, there's a few actionables wrt our clamav configuration and the monitoring which could have prevented that from being user-visible.
[06:38:52] (PS1) Jcrespo: Depool db1070 for cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/293069 (https://phabricator.wikimedia.org/T133398)
[06:39:41] (CR) Muehlenhoff: "IIRC this was only recently added for one of the monitoring quarterly goals. It's already running jessie." [puppet] - https://gerrit.wikimedia.org/r/293068 (owner: Dzahn)
[06:42:34] ACKNOWLEDGEMENT - MegaRAID on es2004 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) Jcrespo ack, no action will be taken until decom (in 1 year)
[06:45:44] Operations, Ops-Access-Requests, Patch-For-Review: Requesting deployment access (for deploying to scb) for Halfak - https://phabricator.wikimedia.org/T136612#2360026 (jcrespo) Open>Resolved
[06:46:40] Operations, MediaWiki-General-or-Unknown, HHVM, Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2360028 (Joe) I created a new package for jessie from our own package that is used in trusty, just removing the mediawiki...
[06:48:04] Operations, Puppet, MediaWiki-General-or-Unknown: Profile and reduce the puppet execution time on the appservers - https://phabricator.wikimedia.org/T131750#2360029 (Joe) A noop run on jessie takes 20-25 seconds, that's a way better gain than what I can achieve with profiling (and most of my tests pr...
[06:48:16] Operations, Puppet, MediaWiki-General-or-Unknown: Profile and reduce the puppet execution time on the appservers - https://phabricator.wikimedia.org/T131750#2360030 (Joe) Open>declined a:Joe
[06:48:18] Operations, MediaWiki-General-or-Unknown, HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2360032 (Joe)
[06:49:18] (CR) Jcrespo: "ori: should the need arise, we can change the group's name/create a new one at any time- I decided this for now as the current name is ver" [puppet] - https://gerrit.wikimedia.org/r/292405 (owner: Jcrespo)
[06:54:08] (PS3) Jcrespo: Allow the group of users grafana-admin to edit [puppet] - https://gerrit.wikimedia.org/r/292405
[06:55:44] (PS4) Jcrespo: Grant grafana-admin.wm.o access to LDAP group grafana-admin [puppet] - https://gerrit.wikimedia.org/r/292405
[06:56:35] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:15] RECOVERY - puppet last run on lvs2002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:57:44] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:57:45] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:54] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:31] (CR) Jcrespo: [C: 2] Depool db1070 for cloning [mediawiki-config] - https://gerrit.wikimedia.org/r/293069 (https://phabricator.wikimedia.org/T133398) (owner: Jcrespo)
[07:08:40] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1070 for cloning (duration: 00m 29s)
[07:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:19:52] !log stopping and cloning db1070 to new s5 servers
[07:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:22:17] Operations, Mobile-Apps, Traffic, Patch-For-Review: alias /apple-app-site-association and /.well-known/apple-app-site-association - https://phabricator.wikimedia.org/T130647#2360091 (jcrespo) Open>Resolved a:jcrespo Thank you https://grafana.wikimedia.org/dashboard/db/varnish-http-err...
[07:34:27] PROBLEM - Disk space on mw1262 is CRITICAL: Connection refused by host
[07:35:08] PROBLEM - configured eth on mw1262 is CRITICAL: Connection refused by host
[07:35:27] PROBLEM - dhclient process on mw1262 is CRITICAL: Connection refused by host
[07:35:38] PROBLEM - nutcracker process on mw1262 is CRITICAL: Connection refused by host
[07:35:57] PROBLEM - dhclient process on mw1261 is CRITICAL: Connection refused by host
[07:36:18] PROBLEM - mediawiki-installation DSH group on mw1261 is CRITICAL: Host mw1261 is not in mediawiki-installation dsh group
[07:36:27] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection refused
[07:36:38] PROBLEM - nutcracker port on mw1261 is CRITICAL: Connection refused by host
[07:36:58] PROBLEM - nutcracker process on mw1261 is CRITICAL: Connection refused by host
[07:37:07] (PS2) Ema: Don't install apt-show-versions [puppet] - https://gerrit.wikimedia.org/r/292936 (https://phabricator.wikimedia.org/T132324)
[07:37:17] PROBLEM - puppet last run on mw1261 is CRITICAL: Connection refused by host
[07:37:18] (CR) Ema: [C: 2 V: 2] Don't install apt-show-versions [puppet] - https://gerrit.wikimedia.org/r/292936 (https://phabricator.wikimedia.org/T132324) (owner: Ema)
[07:37:27] PROBLEM - salt-minion processes on mw1261 is CRITICAL: Connection refused by host
[07:38:07] PROBLEM - Check size of conntrack table on mw1261 is CRITICAL: Timeout while attempting connection
[07:38:27] PROBLEM - DPKG on mw1261 is CRITICAL: Timeout while attempting connection
[07:38:38] PROBLEM - Disk space on mw1261 is CRITICAL: Timeout while attempting connection
[07:38:57] PROBLEM - MD RAID on mw1261 is CRITICAL: Timeout while attempting connection
[07:39:28] PROBLEM - nutcracker port on mw1262 is CRITICAL: Connection refused by host
[07:39:47] PROBLEM - configured eth on mw1261 is CRITICAL: Timeout while attempting connection
[07:39:47] PROBLEM - Check size of conntrack table on mw1262 is CRITICAL: Connection refused by host
[07:39:48] PROBLEM - puppet last run on mw1262 is CRITICAL: Connection refused by host
[07:39:49] PROBLEM - MD RAID on mw1262 is CRITICAL: Connection refused by host
[07:39:49] PROBLEM - salt-minion processes on mw1262 is CRITICAL: Connection refused by host
[07:39:49] PROBLEM - Apache HTTP on mw1261 is CRITICAL: Connection timed out
[07:39:58] PROBLEM - DPKG on mw1262 is CRITICAL: Connection refused by host
[07:41:49] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:41:57] PROBLEM - puppet last run on mc1005 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:42:18] PROBLEM - puppet last run on db1048 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:42:57] PROBLEM - puppet last run on db1036 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:07] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:08] PROBLEM - puppet last run on es1012 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:08] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:18] PROBLEM - puppet last run on elastic1024 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:19] PROBLEM - puppet last run on mw2119 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:27] PROBLEM - puppet last run on lvs4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:30] <_joe_> mw1261 is me
[07:43:37] PROBLEM - puppet last run on mw2147 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:38] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:38] PROBLEM - puppet last run on db1019 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:48] PROBLEM - puppet last run on cp1050 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:48] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:58] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:43:58] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:17] PROBLEM - puppet last run on mw1190 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:18] PROBLEM - puppet last run on mw2178 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:47] PROBLEM - puppet last run on mw2086 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:47] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:44:58] PROBLEM - puppet last run on mw2135 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:51:27] RECOVERY - puppet last run on lvs4003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[07:54:07] nevermind the puppet-fails above, they will get fixed on the next run
[07:58:02] Operations, Mail, OTRS: otrs email outage tracking task - https://phabricator.wikimedia.org/T137145#2360146 (MoritzMuehlenhoff) https://wikitech.wikimedia.org/wiki/Incident_documentation/20160606-otrsmail
[08:04:24] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.064 second response time
[08:08:26] RECOVERY - puppet last run on cp1050 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:08:45] RECOVERY - puppet last run on db1048 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[08:09:14] RECOVERY - puppet last run on mw2086 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[08:09:14] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[08:09:15] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:09:15] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:09:16] RECOVERY - puppet last run on es1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:09:24] RECOVERY - puppet last run on db1019 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[08:09:25] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:09:55] RECOVERY - puppet last run on elastic1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:09:56] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:04] RECOVERY - puppet last run on mw2119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:04] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[08:10:05] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:05] RECOVERY - puppet last run on mw2147 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[08:10:06] RECOVERY - puppet last run on mw2178 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:15] RECOVERY - puppet last run on mc1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:24] RECOVERY - puppet last run on db1036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:25] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:45] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:10:55] RECOVERY - puppet last run on mw2135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:11:34] PROBLEM - NTP on mw1261 is CRITICAL: NTP CRITICAL: No response from NTP server
[08:18:28] (CR) Elukey: [C: 2] "Tried the code multiple times and it works as expected. I've also added the following gcc parameters to have extra checks:" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[08:18:38] (CR) Elukey: [V: 2] "Tried the code multiple times and it works as expected. I've also added the following gcc parameters to have extra checks:" [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/292172 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[08:22:11] (PS1) Jcrespo: Repool db1070 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/293071
[08:24:35] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.008 second response time
[08:27:06] RECOVERY - DPKG on mw1261 is OK: All packages OK
[08:27:15] RECOVERY - Disk space on mw1261 is OK: DISK OK
[08:27:35] (CR) Mobrovac: [C: 1] Change-Prop: Enable file transclusions updates. [puppet] - https://gerrit.wikimedia.org/r/292899 (owner: Ppchelko)
[08:27:44] RECOVERY - MD RAID on mw1261 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[08:28:05] RECOVERY - configured eth on mw1261 is OK: OK - interfaces up
[08:28:06] RECOVERY - salt-minion processes on mw1261 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[08:28:25] RECOVERY - dhclient process on mw1261 is OK: PROCS OK: 0 processes with command name dhclient
[08:28:35] RECOVERY - nutcracker port on mw1261 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[08:28:36] !log deploying libxml2 security updates on Ubuntu systems (Debian systems already upgraded last week)
[08:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:28:56] RECOVERY - Check size of conntrack table on mw1261 is OK: OK: nf_conntrack is 0 % full
[08:28:56] RECOVERY - nutcracker process on mw1261 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[08:28:57] Operations, Traffic: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2360167 (ema) The issue is reproducible on hosts running vhtcpd (eg: cp1061). Normal CPU usage on hosts *not* running vhtcpd (eg: cp1058).
[08:32:20] (CR) Phuedx: [C: 1] huwiki: Enable A/B test for 50% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/292964 (https://phabricator.wikimedia.org/T136713) (owner: Bmansurov)
[08:35:01] PROBLEM - MD RAID on mw1262 is CRITICAL: Timeout while attempting connection
[08:35:42] (PS2) Phuedx: huwiki: Enable Popups A/B test for 50% of users [mediawiki-config] - https://gerrit.wikimedia.org/r/292964 (https://phabricator.wikimedia.org/T136713) (owner: Bmansurov)
[08:36:02] PROBLEM - Apache HTTP on mw1262 is CRITICAL: Connection timed out
[08:36:11] PROBLEM - configured eth on mw1262 is CRITICAL: Timeout while attempting connection
[08:36:12] PROBLEM - mediawiki-installation DSH group on mw1262 is CRITICAL: Host mw1262 is not in mediawiki-installation dsh group
[08:36:21] PROBLEM - dhclient process on mw1262 is CRITICAL: Timeout while attempting connection
[08:36:51] PROBLEM - nutcracker port on mw1262 is CRITICAL: Timeout while attempting connection
[08:37:11] PROBLEM - nutcracker process on mw1262 is CRITICAL: Timeout while attempting connection
[08:37:31] PROBLEM - puppet last run on mw1262 is CRITICAL: Timeout while attempting connection
[08:37:41] PROBLEM - salt-minion processes on mw1262 is CRITICAL: Timeout while attempting connection
[08:37:45] <_joe_> this is me reimaging ^^
[08:38:21] PROBLEM - Check size of conntrack table on mw1262 is CRITICAL: Timeout while attempting connection
[08:38:29] (CR) Mobrovac: [C: -1] "Needs another update, shall amend." [puppet] - https://gerrit.wikimedia.org/r/292899 (owner: Ppchelko)
[08:38:32] PROBLEM - DPKG on mw1262 is CRITICAL: Timeout while attempting connection
[08:38:51] PROBLEM - Disk space on mw1262 is CRITICAL: Timeout while attempting connection
[08:40:59] (PS3) Mobrovac: Change-Prop: Enable file transclusions updates. [puppet] - https://gerrit.wikimedia.org/r/292899 (owner: Ppchelko)
[08:41:15] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: Dzahn)
[08:41:59] (PS1) Giuseppe Lavagetto: mediawiki: add mw1261/2 to mediawiki-installation, pybal [puppet] - https://gerrit.wikimedia.org/r/293074
[08:42:12] RECOVERY - NTP on mw1261 is OK: NTP OK: Offset 0.001598119736 secs
[08:46:51] PROBLEM - Apache HTTP on mw1261 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50407 bytes in 0.551 second response time
[08:49:03] (PS1) Elukey: Package last release (1.0.10-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/293075 (https://phabricator.wikimedia.org/T136314)
[08:50:42] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.063 second response time
[08:51:03] PROBLEM - puppet last run on mw1261 is CRITICAL: CRITICAL: Puppet has 3 failures
[08:53:24] !log rolling restart of hhvm on appserver canaries to pick up libxml2 update
[08:53:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:57:54] (Abandoned) Elukey: Package last release (1.0.10-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/293075 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[08:58:23] (PS1) Elukey: Package last upsteam (1.0.10-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/293076 (https://phabricator.wikimedia.org/T136314)
[09:00:54] !log rolling restart of hhvm on codfw appservers to pick up libxml2 update
[09:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:01:05] (Abandoned) Elukey: Package last upsteam (1.0.10-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/293076 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[09:01:42] sorry for the gerrit spam :)
[09:02:13] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.001 second response time
[09:04:18] !log Upgrading Jenkins IRC plugin 2.25..2.27 and instant messaging plugin 1.34..1.35 . The former should fix a deadlock on shutdowning Jenkins
[09:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:05:34] RECOVERY - dhclient process on mw1262 is OK: PROCS OK: 0 processes with command name dhclient
[09:05:44] RECOVERY - nutcracker port on mw1262 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[09:06:04] RECOVERY - Check size of conntrack table on mw1262 is OK: OK: nf_conntrack is 0 % full
[09:06:04] RECOVERY - nutcracker process on mw1262 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[09:06:13] RECOVERY - salt-minion processes on mw1262 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[09:06:44] RECOVERY - Disk space on mw1262 is OK: DISK OK
[09:07:03] RECOVERY - MD RAID on mw1262 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:07:13] RECOVERY - configured eth on mw1262 is OK: OK - interfaces up
[09:07:24] PROBLEM - NTP on mw1262 is CRITICAL: NTP CRITICAL: Offset unknown
[09:07:45] (CR) Jcrespo: [C: 2] Repool db1070 after maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/293071 (owner: Jcrespo)
[09:09:15] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1070 after maintenance (duration: 00m 27s)
[09:09:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:10:23] RECOVERY - DPKG on mw1262 is OK: All packages OK
[09:15:12] (PS1) Giuseppe Lavagetto: mediawiki: fix hiera regex for new appservers [puppet] - https://gerrit.wikimedia.org/r/293077
[09:19:14] RECOVERY - NTP on mw1262 is OK: NTP OK: Offset -0.01481235027 secs
[09:20:39] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: fix hiera regex for new appservers [puppet] - https://gerrit.wikimedia.org/r/293077 (owner: Giuseppe Lavagetto)
[09:23:52] Operations, DBA, Patch-For-Review: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398#2360309 (jcrespo)
[09:27:34] PROBLEM - Host mw1262 is DOWN: PING CRITICAL - Packet loss = 100%
[09:27:56] <_joe_> that's me ^^ new appservers need to be rebooted when reimaging
[09:28:44] RECOVERY - Host mw1262 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms
[09:29:44] PROBLEM - Apache HTTP on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:31:44] RECOVERY - Apache HTTP on mw1261 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.179 second response time
[09:31:44] PROBLEM - Apache HTTP on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:33:43] RECOVERY - Apache HTTP on mw1262 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.058 second response time
[09:46:38] (PS1) Elukey: Package last upstream version (1.0.10-1) [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/293079 (https://phabricator.wikimedia.org/T136314)
[09:49:41] (PS1) Elukey: Package last upstream version (1.0.10-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/293081 (https://phabricator.wikimedia.org/T136314)
[09:51:21] (Abandoned) Elukey: Package last upstream version (1.0.10-1) [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/293081 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[09:54:31] (PS1) Jcrespo: Add new coredb servers to alias configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/293082 (https://phabricator.wikimedia.org/T133398)
[09:55:48] (CR) Jcrespo: [C: -2] "Delaying until the first server is pooled." [mediawiki-config] - https://gerrit.wikimedia.org/r/293082 (https://phabricator.wikimedia.org/T133398) (owner: Jcrespo)
[09:56:59] (PS2) Giuseppe Lavagetto: mediawiki: add mw1261/2 to mediawiki-installation, pybal [puppet] - https://gerrit.wikimedia.org/r/293074
[10:00:24] (PS1) Elukey: Package last upstream 1.0.10-1 [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/293083 (https://phabricator.wikimedia.org/T136314)
[10:00:29] (Abandoned) Elukey: Package last upstream version (1.0.10-1) [software/varnish/varnishkafka] (debian) - https://gerrit.wikimedia.org/r/293079 (https://phabricator.wikimedia.org/T136314) (owner: Elukey)
[10:00:44] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /var/lib/nginx 0 MB (0% inode=99%)
[10:01:38] (CR) Giuseppe Lavagetto: [C: 2] mediawiki: add mw1261/2 to mediawiki-installation, pybal [puppet] - https://gerrit.wikimedia.org/r/293074 (owner: Giuseppe Lavagetto)
[10:02:27] (PS1) Jcrespo: Pool new db hosts: db1082, db1087, db1092 with low weight [mediawiki-config] - https://gerrit.wikimedia.org/r/293085 (https://phabricator.wikimedia.org/T133398)
[10:02:55] (PS1) Gergő Tisza: Do not set wgAuth to LdapAuth when AuthManager is enabled [mediawiki-config] - https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504)
[10:09:24] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:12:44] RECOVERY - Disk space on dataset1001 is OK: DISK OK
[10:15:39] Operations, Traffic: Scripts depending on varnishlog.py maxing out CPU usage on cache_misc - https://phabricator.wikimedia.org/T137114#2360352 (ema) So, besides the fact that we should check why vhtcpd is running on some misc nodes and not on others, this seems to be a scalability issue in the new varnis...
[10:17:02] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 13Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2360353 (10Joe) `mw1261` and `mw1262` are both live and working correctly AFAICT. They will be inserted in the server rotat...
[10:17:11] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM, 13Patch-For-Review: Make all role::mediawiki::* classes compatible with debian jessie - https://phabricator.wikimedia.org/T131749#2360354 (10Joe) 05Open>03Resolved
[10:17:13] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Refresh the appservers puppet code/configs - https://phabricator.wikimedia.org/T131748#2360355 (10Joe)
[10:17:19] <_joe_> \o/
[10:18:45] !log rolling restart of hhvm on eqiad appservers to pick up libxml2 update
[10:18:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:27:24] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:28:44] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail
[10:32:57] (03PS3) 10Jforrester: Enable VisualEditor by default on eleven Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292746 (https://phabricator.wikimedia.org/T136990)
[10:33:23] (03PS2) 10Jforrester: Enable VisualEditor by default for all users of the English Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292751 (https://phabricator.wikimedia.org/T136992)
[10:33:42] (03PS2) 10Jforrester: Enable VisualEditor by default for all users of the French Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292750 (https://phabricator.wikimedia.org/T136993)
[10:33:53] (03PS2) 10Jforrester: Enable VisualEditor by default for all users of the Italian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292749 (https://phabricator.wikimedia.org/T136994)
[10:34:04] (03PS2) 10Jforrester: Enable VisualEditor by default for all users of the Russian Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292748 (https://phabricator.wikimedia.org/T136995)
[10:34:16] (03PS2) 10Jforrester: Enable VisualEditor by default for all users of the Chinese Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292747 (https://phabricator.wikimedia.org/T136996)
[10:34:38] (03PS2) 10Jforrester: Enable VisualEditor by default for all users of the German Wikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292752 (https://phabricator.wikimedia.org/T136991)
[10:38:13] RECOVERY - mediawiki-installation DSH group on mw1262 is OK: OK
[10:39:13] RECOVERY - mediawiki-installation DSH group on mw1261 is OK: OK
[10:53:44] !log restarting apache2 on iridium (hosting Phabricator) for libxml2 update
[10:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:55:03] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[10:55:43] (03PS1) 10Gergő Tisza: Update audit hooks for AuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293091 (https://phabricator.wikimedia.org/T135504)
[10:56:44] PROBLEM - HHVM rendering on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:57:28] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2360496 (10Joe) A few notes: - All machines have double disks, so we will use the mw-raid1.cfg partman recipe here, I will probably prepare a patch for that. - The servers need to be added to site...
[10:58:34] RECOVERY - HHVM rendering on mw1155 is OK: HTTP OK: HTTP/1.1 200 OK - 63506 bytes in 0.124 second response time
[11:08:37] !log restarted apache2 on gallium for libxml2 update
[11:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:23:38] !log restarting apache2 on silver (hosting wikitech) for libxml2 update
[11:23:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:26:24] PROBLEM - HHVM rendering on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:27:11] !log restarting apache2 on californium (hosting horizon dashboard) for libxml2 update
[11:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:28:14] RECOVERY - HHVM rendering on mw1160 is OK: HTTP OK: HTTP/1.1 200 OK - 64688 bytes in 0.440 second response time
[12:23:18] !log rolling restart of sca cluster for libxml2 security update
[12:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:23:33] PROBLEM - puppet last run on cp2019 is CRITICAL: CRITICAL: puppet fail
[12:27:48] !log rolling out gdk-pixbuf security updates
[12:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:28:35] 06Operations, 06Discovery, 06Maps, 03Discovery-Maps-Sprint, 13Patch-For-Review: Install / configure new maps servers in codfw - https://phabricator.wikimedia.org/T134901#2360690 (10Gehel) Tile generation is completed. Now trying to get initial import and OSM replication to work reliably. Initial import...
[12:51:13] (03PS1) 10Faidon Liambotis: smokeping: add more codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/293097
[12:51:37] (03CR) 10Faidon Liambotis: [C: 032 V: 032] smokeping: add more codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/293097 (owner: 10Faidon Liambotis)
[12:52:23] RECOVERY - puppet last run on cp2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:05:36] PROBLEM - DPKG on mw1264 is CRITICAL: Connection refused by host
[13:05:57] PROBLEM - Disk space on mw1264 is CRITICAL: Connection refused by host
[13:06:17] PROBLEM - MD RAID on mw1264 is CRITICAL: Connection refused by host
[13:06:45] this is me, new appserver --^
[13:06:57] PROBLEM - Apache HTTP on mw1264 is CRITICAL: Connection refused
[13:07:06] PROBLEM - configured eth on mw1264 is CRITICAL: Connection refused by host
[13:07:26] PROBLEM - dhclient process on mw1264 is CRITICAL: Connection refused by host
[13:07:37] PROBLEM - mediawiki-installation DSH group on mw1264 is CRITICAL: Host mw1264 is not in mediawiki-installation dsh group
[13:07:57] PROBLEM - nutcracker port on mw1264 is CRITICAL: Connection refused by host
[13:08:07] PROBLEM - nutcracker process on mw1264 is CRITICAL: Connection refused by host
[13:08:27] PROBLEM - puppet last run on mw1264 is CRITICAL: Connection refused by host
[13:08:37] PROBLEM - salt-minion processes on mw1264 is CRITICAL: Connection refused by host
[13:08:57] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.012 second response time
[13:09:16] PROBLEM - Check size of conntrack table on mw1264 is CRITICAL: Connection refused by host
[13:19:06] PROBLEM - Apache HTTP on mw1264 is CRITICAL: Connection timed out
[13:48:09] (03CR) 10Jcrespo: [C: 032] Add new coredb servers to alias configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293082 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo)
[13:48:42] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: puppet fail
[13:48:48] (03CR) 10Jcrespo: [C: 032] Pool new db hosts: db1082, db1087, db1092 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293085 (https://phabricator.wikimedia.org/T133398) (owner: 10Jcrespo)
[13:49:58] !log about to pool new dewiki/wikidata servers T133398
[13:49:59] T133398: Install, configure and provision recently arrived db core machines - https://phabricator.wikimedia.org/T133398
[13:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:51:30] (03CR) 10Anomie: [C: 031] Update audit hooks for AuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293091 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[13:52:01] !log jynus@tin Synchronized wmf-config/db-codfw.php: Add new coredb servers to alias configuration (duration: 00m 38s)
[13:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:52:41] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:53:57] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Pool new s5 db hosts: db1082, db1087, db1092 with low weight (duration: 00m 23s)
[13:54:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:56:37] some noise on db1042, but I would say it is not related
[13:57:51] (03CR) 10Anomie: [C: 04-1] Do not set wgAuth to LdapAuth when AuthManager is enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[13:57:53] refreshlinks are hitting heavier than usual
[14:00:05] andrewbogott moritzm: Respected human, time to deploy Labs-wide security updates (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T1400). Please do the needful.
[14:02:13] RECOVERY - Apache HTTP on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.015 second response time
[14:03:20] morebots: I'm going to work on a suspend/resume script for a bit, and then will get this started
[14:03:20] I am a logbot running on tools-exec-1212.
[14:03:20] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log.
[14:03:20] To log a message, type !log .
[14:05:41] RECOVERY - dhclient process on mw1264 is OK: PROCS OK: 0 processes with command name dhclient
[14:05:52] RECOVERY - nutcracker port on mw1264 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[14:06:02] RECOVERY - salt-minion processes on mw1264 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:06:12] RECOVERY - nutcracker process on mw1264 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[14:06:33] RECOVERY - Check size of conntrack table on mw1264 is OK: OK: nf_conntrack is 0 % full
[14:06:56] um… moritzm: I'm going to work on a suspend/resume script for a bit, and then will get this started
[14:07:01] RECOVERY - Disk space on mw1264 is OK: DISK OK
[14:07:02] RECOVERY - MD RAID on mw1264 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:07:22] RECOVERY - configured eth on mw1264 is OK: OK - interfaces up
[14:08:51] RECOVERY - DPKG on mw1264 is OK: All packages OK
[14:10:00] andrewbogott: ok
[14:14:52] (03CR) 10Gehel: Adding Icinga checks for Maps (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) (owner: 10Gehel)
[14:18:51] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 11.851 second response time
[14:20:52] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 11.523 second response time
[14:23:11] !log stopping mysql and the OS @ es2017 for hardware maintenance
[14:23:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:25:39] (03PS2) 10Gergő Tisza: Do not set wgAuth to LdapAuth when AuthManager is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504)
[14:26:44] (03CR) 10Gergő Tisza: Do not set wgAuth to LdapAuth when AuthManager is enabled (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[14:28:34] Could someone please mark https://www.mediawiki.org/wiki/Developers for translation?
[14:30:36] Off topic, but it needs moar sections
[14:32:26] (03PS1) 10Gergő Tisza: [HOLD] Remove AbuseFilter B/C config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293109
[14:32:46] (03PS1) 10Jdrewniak: T135902 adding readme and license to wikipedia.org portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293110 (https://phabricator.wikimedia.org/T135902)
[14:42:41] (03PS3) 10Anomie: Do not set wgAuth to LdapAuth when AuthManager is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[14:43:55] (03CR) 10Anomie: [C: 031] "Looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[14:47:14] !log installing varnishkafka 1.0.10-1 on cp1046 manually to test the new version.
[14:47:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:54:30] 06Operations, 10ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2361052 (10RobH)
[14:54:32] 06Operations, 10hardware-requests: eqiad: spare allocation to replace labmon1001 - https://phabricator.wikimedia.org/T136970#2361050 (10RobH) 05Open>03Resolved Excellent, glad it was approved since we made the judgement call to use it.) Thanks @mark! Resolving this task.
[14:56:18] 06Operations, 10hardware-requests: Replace/refresh carbon - https://phabricator.wikimedia.org/T137117#2361066 (10RobH) I had Dell just quote to the specification, and I advised HP to reconfigure the quote and chassis as needed since we're reducing the CPU sockets by half and the memory by 75%. I should have a...
[15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T1500). Please do the needful.
[15:00:04] gehel bmansurov nikerabbit jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:11] * gehel o/
[15:00:12] here
[15:00:17] o/
[15:00:24] plop
[15:00:59] I can SWAT today.
[15:01:14] Nikerabbit: do you have privatesettings.php changes to make?
[15:01:31] I can get those out of the way first since you're a hold-over from yesterday.
[15:02:46] thcipriani: yeah is the perms fixed today?
[15:03:05] Nikerabbit: I looked earlier, you *should* be able to edit now
[15:03:44] yep, perms seem fine now.
[15:03:57] thcipriani: is srv/mediawiki/private the right place to edit it?
[15:04:02] (03PS3) 10Gehel: Revert "Send wmf.4 search and ttmserver traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938
[15:04:10] Nikerabbit: /srv/mediawiki-staging/private
[15:04:55] thcipriani: done
[15:05:53] Nikerabbit: kk, so as far as order of operations goes: I'll sync PrivateSettings.php then https://gerrit.wikimedia.org/r/#/c/292898/2
[15:06:08] does that sound right to you?
[15:06:31] thcipriani: there is small window between either way when they don't match
[15:06:42] yeah, that's kinda what I figured.
[15:07:08] ok. Well, I'll try to get them out in quick succession.
[15:07:24] (03PS3) 10Thcipriani: Use bot password for TNBot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292898 (https://phabricator.wikimedia.org/T110766) (owner: 10Nikerabbit)
[15:08:05] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292898 (https://phabricator.wikimedia.org/T110766) (owner: 10Nikerabbit)
[15:08:52] (03Merged) 10jenkins-bot: Use bot password for TNBot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292898 (https://phabricator.wikimedia.org/T110766) (owner: 10Nikerabbit)
[15:09:51] Nikerabbit: could you commit your PrivateSettings change?
[15:10:06] thcipriani: oh sure
[15:10:12] thanks
[15:10:27] (03CR) 10BryanDavis: [WIP] Kubernetes backend (035 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 (owner: 10Yuvipanda)
[15:11:28] yikes vim
[15:11:29] 06Operations, 06Discovery, 06Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2361078 (10Gehel)
[15:11:37] :D
[15:11:41] (03CR) 10Alexandros Kosiaris: Remove old and redundant AQS specific alarms. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey)
[15:11:44] thcipriani: done if the wrong email doesn't matter
[15:12:29] (03CR) 10Ottomata: "Haven't reviewed, but +1 to idea! :)" [puppet] - 10https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: 10Dzahn)
[15:12:30] thank you.
[15:12:55] (03PS4) 10Dzahn: ircyall: Stop using package=>latest [puppet] - 10https://gerrit.wikimedia.org/r/292093 (owner: 10Muehlenhoff)
[15:13:26] (03CR) 10Dzahn: [C: 031] "looks good, yuvi right" [puppet] - 10https://gerrit.wikimedia.org/r/292093 (owner: 10Muehlenhoff)
[15:14:22] 06Operations, 06Discovery, 06Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2361079 (10Gehel)
[15:15:55] !log thcipriani@tin Synchronized private/PrivateSettings.php: SWAT: password update for Translation Notification Bot (duration: 00m 41s)
[15:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:16:38] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:292898|Use bot password for TNBot]] (duration: 00m 34s)
[15:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:16:43] ^ Nikerabbit check please
[15:16:54] checking
[15:18:31] https://meta.wikimedia.org/wiki/User_talk:Nikerabbit#Ilmoitus_k.C3.A4.C3.A4nn.C3.B6ksest.C3.A4:_100wikidays local message works at least still
[15:19:19] (03PS5) 10Faidon Liambotis: Grant grafana-admin.wm.o access to LDAP group grafana-admin [puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo)
[15:20:47] thcipriani: but remote wiki did not work for some reason, logstash shows "error=Error logging in"
[15:20:59] (03CR) 10Faidon Liambotis: [C: 032] Grant grafana-admin.wm.o access to LDAP group grafana-admin [puppet] - 10https://gerrit.wikimedia.org/r/292405 (owner: 10Jcrespo)
[15:21:29] that's not good.
[15:22:25] tgr: did we miss anything WRT TNBot password update? (backscroll for context)
[15:22:30] thcipriani: I don't know offhand what it could be... there are some code changes going in next train with better error messages.
[15:23:09] thcipriani: my cargo cult fix for private changes is touch -h wmf-config/PrivateSettings.php and then dync it
[15:23:30] *sync*
[15:23:32] tgr: I'll also cargo cult that right now :)
[15:24:56] !log thcipriani@tin Synchronized wmf-config/PrivateSettings.php: SWAT: [[gerrit:292898|Use bot password for TNBot]] after touch wmf-config/PrivateSettings.php (duration: 00m 25s)
[15:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:25:17] ^ Nikerabbit give testing another shot
[15:25:22] thcipriani: ok testing again
[15:25:27] thank you
[15:25:31] (03PS1) 10Faidon Liambotis: smokeping: various cleanups [puppet] - 10https://gerrit.wikimedia.org/r/293113
[15:25:40] We figured out how that was messed up at one point right? I vaguely remember tracking down issues where the symlink target changed and MW didn't notice
[15:26:17] bd808: we did, I don't think we did anything about it though
[15:26:18] (03PS1) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128)
[15:26:43] thcipriani: ok that fixed it: https://ace.wikipedia.org/wiki/Marit_Ureu%C3%ABng_Ngui:Nikerabbit
[15:26:47] \o/
[15:27:00] Nikerabbit: thank you for checking/bearing with me, appreciated :)
[15:27:04] Probably remembering T126306
[15:27:04] T126306: Scap should touch symlinks when originals are touched - https://phabricator.wikimedia.org/T126306
[15:27:12] (03CR) 10Faidon Liambotis: [C: 032] smokeping: various cleanups [puppet] - 10https://gerrit.wikimedia.org/r/293113 (owner: 10Faidon Liambotis)
[15:27:31] (03CR) 10jenkins-bot: [V: 04-1] Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey)
[15:27:43] thcipriani: no problem :) great that tgr had solution at hand
[15:27:59] definitely. Thanks tgr !
[15:28:06] (03PS4) 10Thcipriani: Revert "Send wmf.4 search and ttmserver traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 (owner: 10Gehel)
[15:28:17] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 (owner: 10Gehel)
[15:28:41] The problem is related to T72054 as well
[15:28:41] T72054: [scap] Syncing wmf-config/PrivateSettings.php syncs symlink and not file contents - https://phabricator.wikimedia.org/T72054
[15:28:54] (03Merged) 10jenkins-bot: Revert "Send wmf.4 search and ttmserver traffic to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292938 (owner: 10Gehel)
[15:29:44] bd808: I am running hhvm where all code is behind symlink on two sites... the other one needs hhvm restart every few deployments because it stops finding the files :/
[15:29:50] (03PS1) 10Faidon Liambotis: admin: replace Adam Wight's SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/293115 (https://phabricator.wikimedia.org/T137162)
[15:30:30] thcipriani: thanks!
[15:30:36] (03CR) 10Faidon Liambotis: [C: 032] admin: replace Adam Wight's SSH keys [puppet] - 10https://gerrit.wikimedia.org/r/293115 (https://phabricator.wikimedia.org/T137162) (owner: 10Faidon Liambotis)
[15:30:54] Nikerabbit: yuck. That sounds a bit like some of the hhvm internal cache exhaustion things we have seen.
[15:30:55] gehel: absolutely. Doesn't look like this needs to go in any particular order—is that correct?
[15:31:13] thcipriani: there should be no dependency
[15:31:32] thcipriani: any order should be good
[15:31:35] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: New SSH key for AWight - https://phabricator.wikimedia.org/T137162#2361125 (10faidon) 05Open>03Resolved This is now done, pending a puppet run across the fleet (~30 minutes from now).
[15:32:23] gehel: kk, running sync-dir now
[15:32:32] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:292938|Revert "Send wmf.4 search and ttmserver traffic to codfw"]] (duration: 00m 26s)
[15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:32:38] ^ gehel check please
[15:32:46] * gehel checking...
[15:33:46] gehel: I'm still around in case I can help
[15:34:02] thcipriani: traffic seems to be flowing to eqiad...
[15:34:31] thcipriani: search working... looks good...
[15:34:40] gehel: cool, thanks for checking :)
[15:34:56] thcipriani: I'll keep an eye on it just in case...
[15:35:06] sounds good, thank you
[15:35:15] Nikerabbit: any specific check you could do for translate?
[15:35:15] (03PS3) 10Thcipriani: huwiki: Enable Popups A/B test for 50% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292964 (https://phabricator.wikimedia.org/T136713) (owner: 10Bmansurov)
[15:35:42] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292964 (https://phabricator.wikimedia.org/T136713) (owner: 10Bmansurov)
[15:36:17] (03Merged) 10jenkins-bot: huwiki: Enable Popups A/B test for 50% of users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292964 (https://phabricator.wikimedia.org/T136713) (owner: 10Bmansurov)
[15:37:27] (03PS2) 10Faidon Liambotis: admin: access request for Joe Sutherland [puppet] - 10https://gerrit.wikimedia.org/r/290599 (https://phabricator.wikimedia.org/T136137) (owner: 10RobH)
[15:37:43] gehel: both translation search and translation memory works
[15:37:54] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[15:38:14] thcipriani: fyi, we did get a spike in response time while from elasticsearch. Already getting back to normal
[15:38:24] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA request for @WMDE-leszek - https://phabricator.wikimedia.org/T133145#2361148 (10jcrespo) a:03jcrespo
[15:38:24] (03CR) 10Faidon Liambotis: [C: 032] admin: access request for Joe Sutherland [puppet] - 10https://gerrit.wikimedia.org/r/290599 (https://phabricator.wikimedia.org/T136137) (owner: 10RobH)
[15:38:37] Nikerabbit: thanks!
[15:39:00] gehel: I'll start rebuilds later this week as per the plan
[15:39:45] Nikerabbit: rebuild? you mean re-index? (sorry, we might use slightly different terminology)
[15:39:46] gehel: ack. I did see a minor and falling change in search timeout: Timeout reached waiting for an available pooled curl connection! in /srv/mediawiki/php-1.28.0-wmf.4/extensions/CirrusSearch/includes/Elastica/PooledHttp.php
[15:39:58] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to restricted and analytics-privatedata-users for Joe Sutherland (foks) - https://phabricator.wikimedia.org/T136137#2361152 (10faidon) 05Open>03Resolved This is now done, pending a puppet run across the fleet (~30 minutes from now).
[15:40:07] (03CR) 10Hashar: "For CI I indeed include mediawiki::packages:php5 . We have jobs running browser based testing against a local instance of MediaWiki that " [puppet] - 10https://gerrit.wikimedia.org/r/291909 (owner: 10Muehlenhoff)
[15:40:10] 07Blocked-on-Operations, 06Labs, 10Labs-Infrastructure, 10Monitoring: Provide a grafana installation for labs - https://phabricator.wikimedia.org/T137216#2361154 (10akosiaris)
[15:40:52] 07Blocked-on-Operations, 06Labs, 10Labs-Infrastructure, 10Monitoring: Provide a grafana installation for labs - https://phabricator.wikimedia.org/T137216#2361167 (10akosiaris) p:05Triage>03Normal
[15:40:59] gehel: per-wiki wipe-and-refill
[15:41:17] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:292964|huwiki: Enable Popups A/B test for 50% of users]] (duration: 00m 24s)
[15:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:41:29] Nikerabbit: ok. Thanks!
[15:41:32] ^ bmansuro_ check if possible please
[15:41:38] ok
[15:42:03] thcipriani: i still see the old values, i'll wait a little bit before checking again
[15:44:03] bmansuro_: kk, FYI: just spot-checked, app servers have new value. Likely (hopefully) caching :)
[15:44:33] thcipriani: i confirm, I see the change in the front end too. Thanks!
[15:44:46] bmansuro_: great! Thanks for checking.
[15:45:17] (03PS2) 10Thcipriani: T135902 adding readme and license to wikipedia.org portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293110 (https://phabricator.wikimedia.org/T135902) (owner: 10Jdrewniak)
[15:45:37] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293110 (https://phabricator.wikimedia.org/T135902) (owner: 10Jdrewniak)
[15:45:54] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[15:46:06] 06Operations, 10ops-eqiad, 10media-storage: rack/setup/deploy ms-be102[2-7] - https://phabricator.wikimedia.org/T136631#2361184 (10RobH) @fgiunchedi is out until June 10th. He would be the ideal person to advise on where these should rack, in terms of their planned configuration/service groups. @faidon may...
[15:46:12] (03Merged) 10jenkins-bot: T135902 adding readme and license to wikipedia.org portal [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293110 (https://phabricator.wikimedia.org/T135902) (owner: 10Jdrewniak)
[15:48:30] !log thcipriani@tin Synchronized portals/prod/wikipedia.org/assets: SWAT: [[gerrit:293110|T135902 adding readme and license to wikipedia.org portal]] (duration: 00m 25s)
[15:48:31] T135902: Wikimedia/portals repo should have a README file - https://phabricator.wikimedia.org/T135902
[15:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:48:56] !log thcipriani@tin Synchronized portals: SWAT: [[gerrit:293110|T135902 adding readme and license to wikipedia.org portal]] (duration: 00m 25s)
[15:48:57] T135902: Wikimedia/portals repo should have a README file - https://phabricator.wikimedia.org/T135902
[15:49:02] ^ jan_drewniak check please
[15:49:06] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2361207 (10Ottomata)
[15:49:21] thcipriani: looks good, thanks!
[15:49:32] jan_drewniak: thanks for checking
[15:50:11] (03PS4) 10Thcipriani: Do not set wgAuth to LdapAuth when AuthManager is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[15:50:29] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[15:51:07] (03Merged) 10jenkins-bot: Do not set wgAuth to LdapAuth when AuthManager is enabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293086 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[15:51:35] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures
[15:53:00] !log thcipriani@tin Synchronized wmf-config/wikitech.php: SWAT: [[gerrit:293086|Do not set wgAuth to LdapAuth when AuthManager is enabled]] (duration: 00m 23s)
[15:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:53:06] ^ tgr check please
[15:53:55] (03PS2) 10Thcipriani: Update audit hooks for AuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293091 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[15:57:35] thcipriani: well, beta wikitech seems broken, but the config change itself is OK, it just loads the class which has a problem
[15:57:53] on prod it's a noop so I don't think anyone will mind if it's left like that for a while
[15:58:17] tgr: ack, ok, will continue.
[15:58:38] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293091 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[15:59:18] (03Merged) 10jenkins-bot: Update audit hooks for AuthManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293091 (https://phabricator.wikimedia.org/T135504) (owner: 10Gergő Tisza)
[16:00:04] godog coreyfloyd: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T1600).
[16:00:04] mobrovac: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[16:01:20] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:293091|Update audit hooks for AuthManager]] (duration: 00m 24s)
[16:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:01:24] ^ tgr check please
[16:01:58] after this patch is checked, SWAT will be complete, sorry for the overflow puppet SWAT folks :(
[16:03:28] Hi does anyone know if mediawiki is an oauth2 provider.
[16:03:38] please
[16:03:42] it is for https://secure.phabricator.com/rPe1a9473eda04bd76da1c96814727ec404a1d284e
[16:03:43] paladox: oauth1
[16:04:11] tgr oh ok thanks. Is there a way we can update it to support both oauth 1 and 2 please.
[16:04:17] So we can do https://secure.phabricator.com/rPe1a9473eda04bd76da1c96814727ec404a1d284e
[16:04:26] enable autologin for phabricator
[16:04:45] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:06:03] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA request for @WMDE-leszek - https://phabricator.wikimedia.org/T133145#2361245 (10jcrespo) @WMDE-leszek I have added your LDAP account, `WMDE-leszek`, to the group grafana-admin, please check that you can log in to https://grafana-admin.wikimedia.org
[16:06:44] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.594 second response time
[16:08:28] thcipriani: works, thanks!
[16:08:43] tgr: awesome. Thanks for checking.
[16:08:55] SWAT is complete.
[16:14:04] tgr https://phabricator.wikimedia.org/T137218 [16:16:05] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:26:34] (03PS2) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:27:58] (03CR) 10jenkins-bot: [V: 04-1] Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey) [16:28:05] (03PS1) 10Faidon Liambotis: smokeping: properly include module class [puppet] - 10https://gerrit.wikimedia.org/r/293121 [16:28:45] (03CR) 10Faidon Liambotis: [C: 032] smokeping: properly include module class [puppet] - 10https://gerrit.wikimedia.org/r/293121 (owner: 10Faidon Liambotis) [16:29:21] (03CR) 10Faidon Liambotis: [V: 032] smokeping: properly include module class [puppet] - 10https://gerrit.wikimedia.org/r/293121 (owner: 10Faidon Liambotis) [16:33:25] (03PS3) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:34:34] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: NDA request for @thiemowmde - https://phabricator.wikimedia.org/T135994#2361304 (10jcrespo) @thiemowmde I have added you to the group grafana-admin. 
Please check that you can log in to https://grafana-admin.wikimedia.org [16:34:37] (03PS5) 10Dzahn: ircyall: Stop using package=>latest [puppet] - 10https://gerrit.wikimedia.org/r/292093 (owner: 10Muehlenhoff) [16:35:41] (03CR) 10Dzahn: [C: 032] ircyall: Stop using package=>latest [puppet] - 10https://gerrit.wikimedia.org/r/292093 (owner: 10Muehlenhoff) [16:37:41] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2361318 (10jcrespo) Hi, @Jan_Dittrich can another trusted user (e.g. @JanZerebecki) verify you are a WMDE employee and/or endorse for this access (grafana-admin LDAP group, no NDA req... [16:39:36] (03PS4) 10Elukey: Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) [16:40:07] (03Abandoned) 10Dzahn: decom einsteinium ? [puppet] - 10https://gerrit.wikimedia.org/r/293068 (owner: 10Dzahn) [16:40:37] (03CR) 10Dzahn: "looks like there was an old and a new einsteinium that were unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/293068 (owner: 10Dzahn) [16:41:39] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2361334 (10elukey) @Cmjohnson: Hi! Any idea if we could replace the disk during the next two weeks? Thanks! [16:42:22] (03PS5) 10Dzahn: remove furud from site.pp,dhcp,installserver [puppet] - 10https://gerrit.wikimedia.org/r/292971 (https://phabricator.wikimedia.org/T123718) [16:45:25] (03CR) 10Ottomata: [C: 031] Add the analytics contact group to hadoop/kafka related nrpe monitors. [puppet] - 10https://gerrit.wikimedia.org/r/293114 (https://phabricator.wikimedia.org/T125128) (owner: 10Elukey) [16:45:40] (03CR) 10Elukey: Remove old and redundant AQS specific alarms. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [16:46:32] (03CR) 10Elukey: "Eric/Filippo: any thoughts about how to proceed?" [puppet] - 10https://gerrit.wikimedia.org/r/292568 (https://phabricator.wikimedia.org/T135145) (owner: 10Elukey) [16:48:13] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA request for @WMDE-leszek - https://phabricator.wikimedia.org/T133145#2222784 (10JanZerebecki) Thank you, Jaime. [16:48:22] (03CR) 10Dzahn: [C: 032] "not used, antimony is still the active backend of git.wm" [puppet] - 10https://gerrit.wikimedia.org/r/292971 (https://phabricator.wikimedia.org/T123718) (owner: 10Dzahn) [16:50:25] 06Operations: decom furud - https://phabricator.wikimedia.org/T137221#2361367 (10Dzahn) [16:53:56] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (10Dzahn) [16:54:28] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361435 (10Dzahn) [16:55:54] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2361439 (10JanZerebecki) @Charlie_WMDE verified that the account matches the person. Yes he works for WMDE. I endorse his grafana-admin LDAP group request. 
[16:56:16] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361445 (10greg) p:05Triage>03Normal [16:56:50] (03PS1) 10Ema: varnishlog4.py: log errors in execute() [puppet] - 10https://gerrit.wikimedia.org/r/293123 (https://phabricator.wikimedia.org/T137114) [16:58:03] (03PS1) 10Krinkle: Move $wmgReduceStartupExpiry closer to other ResourceLoader config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293125 [16:59:46] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361460 (10Dzahn) [17:00:04] yurik gwicke cscott arlolra subbu: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T1700). Please do the needful. [17:00:14] kkno parsoid deploy [17:01:28] thcipriani, hi, i forgot, did we schedule scap3 upgrade? [17:01:50] yurik: yup.https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160608T1800 for tomorrow [17:02:30] thcipriani, ah, silly me, i see it [17:02:33] thanks! [17:02:37] (03CR) 10Krinkle: [C: 032] Move $wmgReduceStartupExpiry closer to other ResourceLoader config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293125 (owner: 10Krinkle) [17:03:15] (03Merged) 10jenkins-bot: Move $wmgReduceStartupExpiry closer to other ResourceLoader config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293125 (owner: 10Krinkle) [17:03:50] 06Operations, 10Ops-Access-Requests, 06WMF-NDA-Requests: NDA-Request Jan Dittrich - https://phabricator.wikimedia.org/T136560#2361468 (10jcrespo) I was about to add you @Jan_Dittrich, but I cannot find your LDAP user (a.k.a. wikitech login). I cannot find a similar user name on the LDAP and it is not linked... 
[17:04:49] !log starting branch-cut for mediawiki and extensions for version 1.28.0-wmf.5 [17:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:06:14] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: clean-up [17:06:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:08:16] jynus: how urgent is deploying T136598 ? [17:08:16] T136598: Wikidata master database connection issue - https://phabricator.wikimedia.org/T136598 [17:08:17] 06Operations: decom furud - https://phabricator.wikimedia.org/T137221#2361503 (10Dzahn) furud was meant to replace antimony but plans changed and it's not going to be used now https://gerrit.wikimedia.org/r/#/c/292940/ https://gerrit.wikimedia.org/r/#/c/292971/ related: T123718, T111465, T137224 [17:08:28] the fix for it i mean [17:08:59] jzerebecki, as far as I can see, it is only affecting itself, and not other connections [17:09:24] (the wikidata job queue executions) [17:09:36] good thx [17:13:00] 06Operations, 06Discovery, 06Maps: Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2361550 (10Gehel) [17:18:07] I am not sure I understand the alert Host: "labmon1001" Service: "graphite.wikimedia.org", what does that mean? 
[17:18:45] (it is soft state only, just wondering) [17:21:00] jynus: looks like the graphite role for production has the check for graphite.wm.org service and then it got (also) applied on the labmon machine [17:21:13] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=graphite.wikimedia.org [17:21:34] so it's green on all actual graphite machines [17:21:51] and i assume something gets tested on labmon [17:22:38] and now it recovered [17:22:42] Operations, Discovery, Maps: Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2361578 (Yurik) I suspect that Tilerator will have one connection per worker. Eventually, I would also like to have Kartotherian to use Postgres directly to get some... [17:22:53] I will investigate and suggest a new, more explicit name [17:24:39] (PS1) Dzahn: decom furud [dns] - https://gerrit.wikimedia.org/r/293129 (https://phabricator.wikimedia.org/T137221) [17:25:18] cool [17:27:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [17:27:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [17:27:09] (PS2) Dzahn: decom furud [dns] - https://gerrit.wikimedia.org/r/293129 (https://phabricator.wikimedia.org/T137221) [17:27:47] Operations, Release-Engineering-Team, Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361612 (Paladox) @dzahn with https://phabricator.wikimedia.org/D250 it should make it easy to create redirect links. Since all we need t...
[17:30:18] Operations, Discovery, Maps: Ensure that maps server can be automatically installed (fully puppetized) - https://phabricator.wikimedia.org/T135750#2361616 (Yurik) Maps team permissions task - T106637 [17:31:21] (PS1) Ema: varnishapi.py: reset error message [puppet] - https://gerrit.wikimedia.org/r/293132 [17:31:32] (PS2) Muehlenhoff: Stop installing PHP on jessie app servers [puppet] - https://gerrit.wikimedia.org/r/291909 [17:33:22] (CR) RobH: [C: 1] DHCP: switch default installer to jessie [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: Dzahn) [17:33:45] !log furud - shutdown, decom, delete VM [17:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:13] Operations, Traffic, WMF-Communications, HTTPS, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2361655 (Florian) [17:36:12] Operations, Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2361656 (Dzahn) root@palladium:~# puppetstoredconfigclean.rb furud.codfw.wmnet Killing furud.codfw.wmnet...done. [palladium:~] $ sudo puppet cert clean furud.codfw.wmnet Notice: Revoked certificate with serial 1602 [n... [17:39:10] (CR) Jcrespo: [C: 1] "Let's comment it on ops meeting/list to avoid someone adding by mistake the jessie options." [puppet] - https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: Dzahn) [17:40:16] deleting a VM / the virtual disk of a VM takes surprisingly long [17:40:26] it's not like it's wiping it ..
[17:40:35] or is it [17:44:34] !log `mwscript initSiteStats.php --wiki kshwiki --update` on Terbium (T137234) [17:44:35] T137234: Update statistics count on kshwiki - https://phabricator.wikimedia.org/T137234 [17:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:44] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:45:44] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:47:52] mutante: small 503 on fr. for 40 seconds > Request from 2001:470:1f13:d91:e195:d95a:55d5:83de via cp3040 cp3040, Varnish XID 4074666295 [17:48:02] (through esams) [17:49:54] two similar 503 reports in #wikimedia-tech [17:51:44] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:51:47] !log restarting broker on kafka1020 [17:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:55:44] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [17:58:14] yea, i clearly see a spike there that has passed though [17:59:44] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:01:24] (03PS7) 10Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [18:02:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:05:35] (03PS8) 10Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [18:15:17] 06Operations, 13Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2361772 (10Dzahn) Continue? 
y/[n]/?: y Tue Jun 7 17:50:41 2016 - WARNING: Could not remove disk 1 on node ganeti2003.codfw.wmnet, continuing anyway: Error 28: Operation timed out after 900023 milliseconds w ith 0 byt... [18:15:46] jynus: hey, these are sql schema changes I want. https://github.com/wikimedia/mediawiki-extensions-ORES/tree/master/sql Can you review to see if it would leak data to labs? [18:18:26] 06Operations, 13Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2361776 (10Dzahn) @akosiaris have you seen the problem above before when deleting VMs? ^ [18:18:29] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2361778 (10Papaul) a:05Papaul>03jcrespo Update note on both systems BIOS 1.5.4 to 2.0.2 IDRAC 2.21 to 2.30 Dell uEFI diagnostics Dell Os Driver Pack 15.10 to 16.03 PERC H730 Controller... [18:27:37] 06Operations, 10ops-codfw, 13Patch-For-Review: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2361793 (10Papaul) a:05Papaul>03RobH disk wipe complete on me2041 \-mw2060. servers are unracked and stored in storage. @RobH on the switches ge-3/0/26 to ge-3/0/39 rack A3 a... [18:29:00] (03CR) 10Yuvipanda: [WIP] Kubernetes backend (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 (owner: 10Yuvipanda) [18:30:50] !log rebooting labvirt1011 [18:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:31:02] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2361800 (10Papaul) @jcrespo can you please attach the log? 
[18:34:08] (03PS9) 10Dzahn: DHCP: switch default installer to jessie [puppet] - 10https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) [18:35:20] (03PS9) 10Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [18:36:16] (03CR) 10Hashar: "Good for me :) Thanks for the notice!" [puppet] - 10https://gerrit.wikimedia.org/r/291909 (owner: 10Muehlenhoff) [18:36:47] (03PS10) 10Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [18:40:25] (03PS11) 10Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [18:40:47] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2361834 (10Gehel) [18:45:33] (03CR) 10Dzahn: [C: 032] DHCP: switch default installer to jessie [puppet] - 10https://gerrit.wikimedia.org/r/293066 (https://phabricator.wikimedia.org/T133539) (owner: 10Dzahn) [18:49:12] papaul: ^ so now jessie is default [18:50:48] (03PS1) 10Hashar: contint: move libav-tools to contint::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/293144 [18:52:29] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:54:19] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.018 second response time [18:57:10] (03CR) 10Dzahn: [C: 04-1] "uhm.. 
i actually still found nginx'es on titanium, carbon (precise) and there is a bunch of trusty (analytics, rcs, francium, nobelium, el" [puppet/nginx] - 10https://gerrit.wikimedia.org/r/291278 (owner: 10Dzahn) [18:58:34] (03Abandoned) 10Dzahn: nginx: remove jessie conditional for mount [puppet/nginx] - 10https://gerrit.wikimedia.org/r/291278 (owner: 10Dzahn) [18:59:05] (03CR) 10Dzahn: [C: 032] contint: move libav-tools to contint::browsertests [puppet] - 10https://gerrit.wikimedia.org/r/293144 (owner: 10Hashar) [18:59:17] ls [18:59:19] mutante: danke :) [18:59:42] mutante: and kudos for making Jessie the default ^oo^ [18:59:50] hashar: thanks :) [19:00:04] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T1900). Please do the needful. [19:01:50] * thcipriani does the needful. [19:02:34] mutante: cool [19:02:42] thanks [19:03:16] yw! [19:04:29] (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293148 [19:08:42] !log thcipriani@tin Started scap: testwiki to php-1.28.0-wmf.5 and rebuild l10n cache [19:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:51] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup payments1005-8 - https://phabricator.wikimedia.org/T136881#2361942 (10Cmjohnson) payments1005-8 are racked, racktables updated, DNS completed and idrac setup. Payments1005 is currently connected to pfw1- ge-2/0/11. I do not have access to t... [19:10:03] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup Fundraising DB - https://phabricator.wikimedia.org/T136200#2361957 (10Cmjohnson) frdb1001 is racked, dns updated, racktables completed, ilo setup. All that is needed is an available port. 
[19:10:53] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Rack and setup new fundraising queue servers - https://phabricator.wikimedia.org/T136882#2361958 (10Cmjohnson) frqueue1001/2 are racked, racktables updated, DNS completed and idrac setup. frqueue1001 is currently connected to pfw2- ge-2/0/11. I do not hav... [19:15:01] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 666 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5584687 keys - replication_delay is 666 [19:20:39] (03PS9) 10Gehel: Adding Icinga checks for Maps [puppet] - 10https://gerrit.wikimedia.org/r/291023 (https://phabricator.wikimedia.org/T135647) [19:21:19] (03PS5) 10Dzahn: varnish: move errorpage.html from misc to module [puppet] - 10https://gerrit.wikimedia.org/r/290876 [19:21:44] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/3061/" [puppet] - 10https://gerrit.wikimedia.org/r/290876 (owner: 10Dzahn) [19:26:24] (03PS1) 10Yuvipanda: Do not attempt to restart kubernetes webservices [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/293149 [19:30:57] 06Operations, 06Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Tune WDQS caching headers - https://phabricator.wikimedia.org/T137238#2361834 (10Smalyshev) Yes, after the switch to built GUI we can cache at least the hashed CSS/JS pretty much forever, they'd never change. Non-hashed ones prob... [19:31:11] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5523956 keys - replication_delay is 0 [19:34:51] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:35:23] !log thcipriani@tin Finished scap: testwiki to php-1.28.0-wmf.5 and rebuild l10n cache (duration: 26m 40s) [19:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:36:09] that was...fast. 
[19:36:52] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:37:08] thcipriani: It's a little sad when we get suspicious of a fast deployment. [19:37:21] Almost like it missed something :p [19:37:57] yeah, that's what it felt like, I saw all the steps happen though... [19:41:10] 07Blocked-on-Operations, 06Labs, 10Labs-Infrastructure, 10Monitoring: Provide a grafana installation for labs - https://phabricator.wikimedia.org/T137216#2361154 (10hashar) > Remove labmon1001 as a data source from the production grafana installation. I would keep it around if at all possible. For people... [19:42:01] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2362003 (10jcrespo) A bunch of errors is making netfilter and ntp fail on es2017. On the admin console: ``` MEM0701: Correctable memory error rate exceeded for DIMM_A2. 2016-06-07T15:28:36-0... [19:42:05] (03CR) 10Merlijn van Deen: [C: 031] Do not attempt to restart kubernetes webservices [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/293149 (owner: 10Yuvipanda) [19:42:49] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2143716 (10jcrespo) 05Resolved>03Open [19:42:51] (03CR) 10Yuvipanda: [C: 032] Do not attempt to restart kubernetes webservices [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/293149 (owner: 10Yuvipanda) [19:42:53] thcipriani: We should add a --super-duper-fast mode. It just prints the steps but does nothing :p [19:43:03] For those times you just wanna feel good. 
[19:43:39] :D [19:44:17] (03PS1) 10Yuvipanda: Bump version [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/293150 [19:44:45] !log restarting es2017 due to a bunch of ACPI errors (probably memory-caused) [19:44:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:15] (03CR) 10Yuvipanda: [C: 032] Bump version [software/tools-manifest] - 10https://gerrit.wikimedia.org/r/293150 (owner: 10Yuvipanda) [19:48:41] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:31] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.017 second response time [19:52:26] (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293148 (owner: 10Thcipriani) [19:53:03] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/293148 (owner: 10Thcipriani) [19:53:03] 06Operations, 10ops-codfw, 10DBA: es2017 and es2019 crashed with no logs - https://phabricator.wikimedia.org/T130702#2362042 (10jcrespo) restarting es2017 fixed the software issues, but this is clearly not in a closed state. This is not the highest priority, but clearly there is a hardware defect here (board?). 
[19:55:30] PROBLEM - puppet last run on mw2098 is CRITICAL: CRITICAL: Puppet has 1 failures [19:57:41] 06Operations, 10ops-codfw, 10DBA: db2034 degraded RAID - https://phabricator.wikimedia.org/T136583#2362064 (10jcrespo) Sorry about that, log was obtained by @robh and was pasted here: {P3211} [20:00:36] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.5 [20:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:01:40] hmm that's a new log message: Notice: Undefined index: WOC:d in /srv/mediawiki/php-1.28.0-wmf.5/includes/libs/objectcache/WANObjectCache.php on line 803 [20:02:07] !log dist-upgrade on labvirt1010, in hopes of resolving a nova-compute lockup (possibly related to a kvm upgrade earlier today) [20:02:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:00] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:26] log error spike filed in https://phabricator.wikimedia.org/T137244 [20:04:51] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.013 second response time [20:06:58] the initial spike in errors seems to have subsided, FWIW. [20:08:19] (03CR) 10Dzahn: "this looks right, but i see the old hosts still in pybal, for example" [dns] - 10https://gerrit.wikimedia.org/r/292307 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [20:08:53] 06Operations, 13Patch-For-Review: Set jessie as the default os installer on network boot and manually mark other versions (precise, trusty) - https://phabricator.wikimedia.org/T133539#2362120 (10Dzahn) 05Open>03Resolved a:03Dzahn [20:12:41] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). 
[20:16:44] deploy \o/ [20:19:41] RECOVERY - puppet last run on mw2098 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:20:40] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [20:32:25] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2362205 (10Ottomata) Ja anytime, we can stop this server with no service downtime, just have to be ready to do it. [20:45:03] (03Abandoned) 10EBernhardson: CirrusSearch: Add new rescore profiles [mediawiki-config] - 10https://gerrit.wikimedia.org/r/281209 (https://phabricator.wikimedia.org/T127896) (owner: 10EBernhardson) [20:45:30] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:47:09] 06Operations, 06Discovery, 06Labs, 10hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2362279 (10RobH) [20:47:11] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362280 (10Paladox) [20:47:21] (03PS1) 10RobH: relforge1002 mgmt dns update [dns] - 10https://gerrit.wikimedia.org/r/293208 [20:53:12] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362289 (10Paladox) [20:53:31] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 7.980 second response time [20:54:50] (03CR) 10RobH: [C: 032] relforge1002 mgmt dns update [dns] - 10https://gerrit.wikimedia.org/r/293208 (owner: 10RobH) [20:56:57] (03PS1) 10BBlack: text VCL: fixup hfp for X-Cache-Int [puppet] - 10https://gerrit.wikimedia.org/r/293212 [20:56:59] (03PS1) 10BBlack: nginx: bump session cache by 10x [puppet] - 
10https://gerrit.wikimedia.org/r/293213 [20:57:59] (03CR) 10BBlack: [C: 032 V: 032] text VCL: fixup hfp for X-Cache-Int [puppet] - 10https://gerrit.wikimedia.org/r/293212 (owner: 10BBlack) [20:58:37] (03CR) 10BBlack: [C: 032 V: 032] nginx: bump session cache by 10x [puppet] - 10https://gerrit.wikimedia.org/r/293213 (owner: 10BBlack) [21:01:07] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362308 (10Paladox) [21:01:50] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:05:41] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time [21:08:00] PROBLEM - Hadoop DataNode on analytics1049 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [21:08:30] PROBLEM - MegaRAID on analytics1049 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [21:10:48] (03PS1) 1020after4: Add ssh:userkey for eventlogging user [puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) [21:10:56] 06Operations, 06Release-Engineering-Team, 05Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362329 (10Paladox) [21:13:10] PROBLEM - puppet last run on analytics1049 is CRITICAL: CRITICAL: Puppet has 1 failures [21:14:57] 06Operations, 10ops-eqiad, 10Analytics-Cluster, 06Analytics-Kanban: Smartctl disk defects on kafka1012 - https://phabricator.wikimedia.org/T136933#2362347 (10Cmjohnson) @elukey We can do it whenever you want. I have disks on-site. Let me know a good day and time. [21:16:37] (03CR) 10Mobrovac: "LGTM, but I'm not sure whether that would cause problems in prod, so I'll let Ottomata inspect it." 
[puppet] - 10https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 1020after4) [21:20:34] (03PS1) 10Papaul: DHCP: Add MAC address entries for mw2215-mw2238 skipping mw2218 Bug:T135466 [puppet] - 10https://gerrit.wikimedia.org/r/293218 (https://phabricator.wikimedia.org/T135466) [21:21:34] (03PS1) 10RobH: setting install params for relforge100[12] [puppet] - 10https://gerrit.wikimedia.org/r/293219 [21:24:31] (03CR) 10RobH: [C: 032] setting install params for relforge100[12] [puppet] - 10https://gerrit.wikimedia.org/r/293219 (owner: 10RobH) [21:25:32] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.005 second response time [21:25:37] (03PS2) 10Dzahn: DHCP: Add MAC address entries for mw2215-mw2238 skipping mw2218 Bug:T135466 [puppet] - 10https://gerrit.wikimedia.org/r/293218 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [21:27:40] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.569 second response time [21:28:27] (03CR) 10Dzahn: [C: 04-1] "in mw2219 there is a copy/paste error" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/293218 (https://phabricator.wikimedia.org/T135466) (owner: 10Papaul) [21:29:18] (03PS12) 10Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/293063 [21:30:42] Hi ops! I'm trying to get on to a varnish server and look at incoming cookie names as per https://phabricator.wikimedia.org/T132374 . I'm getting through the bastion fine, but getting a PW prompt at cp1066.eqiad.wmnet [21:30:50] Guess i probably need 2FA set up? [21:32:02] ejegg: access to cache servers is restricted [21:32:05] ejegg: only ops have access to cp* hosts in general [21:32:12] sounds sensible! 
[21:32:15] Operations, Phabricator: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2362359 (MC8)
[21:32:54] (CR) Mobrovac: [C: 1] "PCC shows the addition of ssh::userkey as expected so I'd say we're good - https://puppet-compiler.wmflabs.org/3063/" [puppet] - https://gerrit.wikimedia.org/r/293217 (https://phabricator.wikimedia.org/T137192) (owner: 20after4)
[21:33:18] AndyRussG was hoping to get an updated list of cookie names and maybe look at it uncensored to make sure he cleans up all the junk CentralNotice has been setting
[21:33:34] I'll point him this way
[21:35:16] !log restarted apache on iridium to deploy D250
[21:35:16] D250: Add support for tag links - https://phabricator.wikimedia.org/D250
[21:35:17] ejegg: re updated list, has CN been unsetting some of them, or just not newly-setting them?
[21:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:35:44] Operations, Phabricator: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2362359 (Krenair) Isn't commons.wikipedia.org just a historical thing?
[21:35:50] ejegg: I don't even know if they're session or long-term, or what expiry is
[21:36:15] ejegg: I assume they have long expiries, given how many have built up over time. so they probably need explicit unsets to clear them
[21:36:20] bblack: we've been un-setting them and migrating to LocalStorage a few at a time
[21:37:30] ejegg: not many yet.
The real stuff has yet to be deploy'd
[21:37:52] Some of them must have over-long expirations, going by cookie names like "centralnotice_bannercount_wikimania14"
[21:38:00] Operations, Phabricator: Redirect phabricator.mediawiki.org to phabricator.wikimedia.org - https://phabricator.wikimedia.org/T137252#2362359 (Dzahn) Maybe we should get some actual numbers for these things instead of guessing how much it's used (for both of these things?)
[21:38:02] heheh yeah
[21:38:24] Operations, Release-Engineering-Team, Gitblit-Deprecate: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (mmodell) D250 is now deployed.
[21:38:37] so yeah, that's why the list of things to explicitly unset
[21:38:45] bblack: if you remember, this is a task that you started on, then ori pulled some data for us. I'd just like to run the same command that ori tried once more, to see if the sample is significantly different
[21:40:23] bblack: here's the command that was run and the initial results: https://phabricator.wikimedia.org/T132374#2229057
[21:40:53] (PS1) Dzahn: git.wikimedia.org -> Diffusion redirects [puppet] - https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224)
[21:41:22] Unfortunately a lot of the ones we want to delete aren't detectable by regex.
[21:42:13] The actual procedure for removing has merged to CN code but has yet to be deployed.
It's a client-side thing that gets a list of cookies to purge from a config variable
[21:43:18] (CR) 20after4: [C: 1] git.wikimedia.org -> Diffusion redirects [puppet] - https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: Dzahn)
[21:44:16] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362395 (Paladox)
[21:44:37] Operations, Discovery, Labs, hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2362400 (RobH)
[21:44:40] (CR) Dzahn: [C: -1] "thanks, but not ready yet. error in line 142: override rules must have an associated funnel or rewrite and needs more rules" [puppet] - https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224) (owner: Dzahn)
[21:45:11] hmm, it would be really cool if the cache servers could zap a blacklist of cookies without inflating the CN JavaScript! Would that break caching? They already rewrite some headers on cached responses, right?
[21:45:21] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (Paladox)
[21:45:37] ejegg: bblack: here's the task about just finalizing the list (consolidating from a few sources): https://phabricator.wikimedia.org/T135090
[21:45:54] The full list itself, so far, is here: https://www.mediawiki.org/wiki/Extension:CentralNotice/Notes/Cookies_to_remove
[21:46:06] ejegg: yeah!
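Editorial sketch (not part of the channel log): the purge-by-list idea discussed above could look roughly like the following. This is a minimal illustration under stated assumptions, not the CentralNotice code, which is client-side JavaScript and had not been deployed at this point. The names `COOKIES_TO_PURGE`, `PATTERNS_TO_PURGE`, `cookies_to_purge` and `expiry_headers` are hypothetical; only the cookie name `centralnotice_bannercount_wikimania14` comes from the conversation. It selects cookies by exact name first, since many of the targets are reportedly not detectable by regex, and optionally by pattern.

```python
import re

# Hypothetical config variable: exact cookie names to remove.
# Only the first name is taken from the discussion; the rest would come
# from the consolidated list referenced in the log.
COOKIES_TO_PURGE = ["centralnotice_bannercount_wikimania14"]

# Optional patterns for the cookies that *are* detectable by regex (assumption).
PATTERNS_TO_PURGE = []


def parse_cookie_names(cookie_header):
    """Return the cookie names found in a Cookie request header value."""
    names = []
    for part in cookie_header.split(";"):
        name, sep, _value = part.strip().partition("=")
        if sep and name:
            names.append(name)
    return names


def cookies_to_purge(cookie_header, names=None, patterns=None):
    """Pick the cookies present in the header that match the list or a pattern."""
    wanted = set(COOKIES_TO_PURGE if names is None else names)
    regexes = PATTERNS_TO_PURGE if patterns is None else patterns
    return sorted(
        name
        for name in set(parse_cookie_names(cookie_header))
        if name in wanted or any(p.search(name) for p in regexes)
    )


def expiry_headers(names):
    """Build Set-Cookie values that expire each cookie (date in the past)."""
    return [
        f"{name}=; Expires=Thu, 01 Jan 1970 00:00:00 GMT; Path=/"
        for name in names
    ]
```

An explicit unset like this is needed because, as noted above, cookies with long expiries do not go away on their own. A cookie is only removed if the Path (and Domain) of the expiring header match the ones it was set with; `Path=/` here is an assumption.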
[21:46:20] I imagine varnish could deal with a list as well as a regex
[21:46:34] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362407 (Paladox)
[21:46:37] Operations, Discovery, Labs, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2362408 (RobH) stalled>Resolved a:RobH The two machines have been allocated and are now in the OS installation stage. I'm resolv...
[21:46:40] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (Paladox)
[21:47:32] ebernhardson: the os is installing on relforge100[12] now =]
[21:47:57] so i'll have the puppet/salt keys signed soon for you guys to add to site.pp and service implementation
[21:48:45] Puppet, ORES, Revision-Scoring-As-A-Service, Patch-For-Review: ORES-staging is broken due to service::uwsgi mandatory scap::target invoke - https://phabricator.wikimedia.org/T136488#2362425 (Ladsgroup) Open>Resolved
[21:49:55] (PS1) BryanDavis: role::toollabs::merlbot_proxy [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235)
[21:51:05] yuvipanda: ^ I think that config will work
[21:52:30] (PS1) Ladsgroup: ores: Add support for Norwegian [puppet] - https://gerrit.wikimedia.org/r/293225
[21:52:30] robh: woo!
[21:52:53] robh: glad this will all be ready in time for next quarter, planning on using it :)
[21:55:37] (PS2) Dzahn: git.wikimedia.org -> Diffusion redirects [puppet] - https://gerrit.wikimedia.org/r/293221 (https://phabricator.wikimedia.org/T137224)
[21:59:41] Operations, Discovery, Labs, hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2362475 (RobH)
[22:00:04] tgr: Respected human, time to deploy AuthManager (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T2200). Please do the needful.
[22:03:13] Operations, Discovery, Labs, hardware-requests: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2362483 (RobH)
[22:03:28] Operations, Discovery, Labs: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345045 (RobH)
[22:05:20] Operations, Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2361367 (RobH) furud is still showing pending salt key acceptance on the salt master.
[22:05:51] (CR) Mobrovac: [C: 1] "Confirmed to work in beta" [puppet] - https://gerrit.wikimedia.org/r/292899 (owner: Ppchelko)
[22:05:57] Operations, Discovery, Labs: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2362489 (RobH)
[22:06:41] Operations, Discovery, Labs: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2345045 (RobH) a:Cmjohnson>EBernhardson All tasks for the setup of relforge100[12] have been accomplished, other than service implementation. I'm assigning t...
[22:07:00] ebernhardson: they are all ready for ya, you can resolve that task just above if you are planning to implement service via another task
[22:07:09] (sounds like it, since it'll be next quarter)
[22:08:17] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[22:10:12] (PS3) Dzahn: DHCP: Add MAC address entries for mw2215-mw2238 skipping mw2218 Bug:T135466 [puppet] - https://gerrit.wikimedia.org/r/293218 (https://phabricator.wikimedia.org/T135466) (owner: Papaul)
[22:10:23] !log icinga config broken: Error: Could not find any host matching 'relforge1001'
[22:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:10:59] ah, i see
[22:11:07] runs puppet again on neon
[22:11:25] (CR) Dzahn: [C: 2] DHCP: Add MAC address entries for mw2215-mw2238 skipping mw2218 Bug:T135466 [puppet] - https://gerrit.wikimedia.org/r/293218 (https://phabricator.wikimedia.org/T135466) (owner: Papaul)
[22:15:22] mutante: odd race condition?
[22:15:44] neon complaining about a host during its puppet run.
[22:15:51] robh: yes, exactly that
[22:15:58] it's ok now
[22:16:01] cool
[22:16:01] after the next run
[22:16:06] thanks for fixing
[22:16:09] yw
[22:17:16] (PS1) Gergő Tisza: Enable AuthManager on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/293227 (https://phabricator.wikimedia.org/T135504)
[22:17:38] ostriches: is the procedure for getting someone into the wmf ldap group still "ping ostriches"?
[22:17:51] Probably :p
[22:18:11] musikanimal should get added. new in commtech
[22:18:15] (CR) MaxSem: [C: 1] Change expired file zoom level from 16 to 15.
[puppet] - https://gerrit.wikimedia.org/r/291885 (https://phabricator.wikimedia.org/T136483) (owner: Gehel)
[22:18:49] gerrit email is musikanimal@wikimedia.org
[22:19:04] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[22:19:13] wow, he got a non-standard @wikimedia.org mail
[22:19:22] I have bd808
[22:19:38] and bdavis
[22:19:40] Previously, you had to befriend opsen to get them ;)
[22:19:41] I have chad@ as a forwarder :)
[22:19:42] yea, eh.. depends where it is
[22:19:47] oit or ops
[22:20:02] What host do I do this from again?
[22:20:14] (CR) Gergő Tisza: [C: 2] Enable AuthManager on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/293227 (https://phabricator.wikimedia.org/T135504) (owner: Gergő Tisza)
[22:20:18] I thought it was terbium hmmm
[22:20:29] should be doable from silver I think
[22:20:40] terbium sounds right
[22:20:50] (Merged) jenkins-bot: Enable AuthManager on group0 [mediawiki-config] - https://gerrit.wikimedia.org/r/293227 (https://phabricator.wikimedia.org/T135504) (owner: Gergő Tisza)
[22:20:53] thats where i look stuff up
[22:21:09] Ah yes, tabcomplete was failing me.
[22:21:13] bd808: {{done}}
[22:21:17] mutante, I got krenair@ when I joined
[22:21:26] ostriches: {{hugs}}
[22:22:06] wiki user Musikanimal (WMF)? :)
[22:22:16] welcome to him btw
[22:23:11] Operations, Discovery, Labs, hardware-requests, Discovery-Search-Sprint: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2362575 (EBernhardson)
[22:23:13] Operations, Discovery, Labs: rack/upgrade/setup/install/deploy relforge100[12].eqiad.wmnet - https://phabricator.wikimedia.org/T136708#2362571 (EBernhardson) Open>Resolved Thanks! All the work here is much appreciated. I've put together T137256 to track setting up the elasticsearch cluster an...
[22:29:27] (PS2) Yuvipanda: ores: Add support for Norwegian [puppet] - https://gerrit.wikimedia.org/r/293225 (owner: Ladsgroup)
[22:29:39] (CR) Yuvipanda: [C: 2 V: 2] ores: Add support for Norwegian [puppet] - https://gerrit.wikimedia.org/r/293225 (owner: Ladsgroup)
[22:39:15] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[22:40:36] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[22:41:16] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[22:41:50] Operations, Traffic, WMF-Legal, Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (BBlack)
[22:43:19] Operations, Traffic, WMF-Legal, Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (Dzahn) policy.wm runs on https://vip.wordpress.com/
[22:47:12] Operations, Patch-For-Review: decom furud - https://phabricator.wikimedia.org/T137221#2362635 (Dzahn) @Robh thanks, the accepted key was deleted, then this got recreated because it was still running. it's in an "ERROR_down" state now after the error above.. hmm
[22:55:01] (PS13) Yuvipanda: [WIP] Kubernetes backend [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063
[22:55:54] Operations, Traffic, WMF-Legal, Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362614 (Krenair) >>! In T137258#2362633, @Dzahn wrote: > policy.wm runs on https://vip.wordpress.com/ It appears to be running on different servers to the blog which is h...
[22:56:00] !log tgr@tin Started scap: (no message)
[22:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:56:44] !log scapping AuthManager backports + feature switch enabled on group0 T135504
[22:56:45] T135504: Enable AuthManager in WMF production - https://phabricator.wikimedia.org/T135504
[22:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:58:31] Operations, Traffic, WMF-Legal, Security: policy.wikimedia.org SSL vulnerability - https://phabricator.wikimedia.org/T137258#2362661 (Slaporte) a:Slaporte I've reported this to wordpress, and I'll update here as they resolve it.
[23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160607T2300).
[23:00:04] foks: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:22] (CR) Yuvipanda: role::toollabs::merlbot_proxy (3 comments) [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[23:00:29] I'm still running scap, I'll do the SWAT afterwards
[23:06:36] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[23:20:52] !log tgr@tin Finished scap: (no message) (duration: 24m 51s)
[23:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:21:07] (CR) BryanDavis: [WIP] Kubernetes backend (3 comments) [software/tools-webservice] - https://gerrit.wikimedia.org/r/293063 (owner: Yuvipanda)
[23:25:07] OK, I can log in and I can create an account so I am calling this a success for the immediate term
[23:25:19] foks: ready for SWAT?
[23:31:40] tgr: I can check the change
[23:31:46] Hi.
[23:32:10] (PS3) Gergő Tisza: User rights configuration for meta.
wmf-supportsafety group [mediawiki-config] - https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: Dereckson)
[23:32:42] (CR) Gergő Tisza: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: Dereckson)
[23:33:19] (Merged) jenkins-bot: User rights configuration for meta. wmf-supportsafety group [mediawiki-config] - https://gerrit.wikimedia.org/r/292518 (https://phabricator.wikimedia.org/T136864) (owner: Dereckson)
[23:33:25] tgr, apologies - am now
[23:33:35] It's Dereckson's code, so :D
[23:35:19] !log tgr@tin Synchronized wmf-config/InitialiseSettings.php: SWAT [[gerrit:292518]] User rights configuration for meta. wmf-supportsafety group (duration: 00m 26s)
[23:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:35:23] Testing.
[23:35:37] Works.
[23:35:45] Thanks for deploying tgr.
[23:35:55] thanks for checking
[23:35:58] Dereckson, thanks for coding!
[23:35:59] !log redeploying WDQS to update the Updater for T128947 fix
[23:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:36:03] you're welcome
[23:36:04] Easiest SWAT ever.
[23:36:06] ;3
[23:41:56] (CR) BryanDavis: role::toollabs::merlbot_proxy (3 comments) [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)
[23:46:51] PROBLEM - graphite.wikimedia.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:46:51] !log aaron@tin Synchronized php-1.28.0-wmf.5/includes/deferred/LinksUpdate.php: 6d85caaa9bb5918cb2888fc82f2c7c346cf746a2 (duration: 00m 25s)
[23:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:47:17] (CR) Mattflaschen: "No, good catch.
It was done in I1e8fc632b52694aa6eb34ca1e9eae6d0b57df920, If89d24838e326fe25fe867d02181eebcfbb0e196, I8b52ec8ddf494f23941" [mediawiki-config] - https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699) (owner: Mattflaschen)
[23:47:21] Operations, ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2362714 (Papaul)
[23:47:26] (PS6) Mattflaschen: Change login cookies (for 'Remember me') to a one year expiry. [mediawiki-config] - https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699)
[23:48:11] (PS1) Dereckson: Add a project namespace on tg.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/293243 (https://phabricator.wikimedia.org/T137200)
[23:48:41] RECOVERY - graphite.wikimedia.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.339 second response time
[23:51:34] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2362721 (Paladox)
[23:52:16] (PS7) Mattflaschen: Change login cookies (for 'Remember me') to a one year expiry. [mediawiki-config] - https://gerrit.wikimedia.org/r/230954 (https://phabricator.wikimedia.org/T68699)
[23:54:55] Hey tgr, we're right that AuthManager shouldn't change anything on wikitech?
[23:55:15] Dereckson: yeah, wikitech is group1
[23:55:42] labtestwikitech is messed up ATM, but real wikitech should not be affected
[23:56:03] When I log in to the real one, I get no error message, but I keep getting logged out.
[23:56:20] PROBLEM - MD RAID on gallium is CRITICAL: CRITICAL: Active: 1, Working: 1, Failed: 1, Spare: 0
[23:57:37] Dereckson: Hmm. Works when I try it.
[23:58:39] (PS2) BryanDavis: role::toollabs::merlbot_proxy [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235)
[23:58:49] Dereckson: works for me as well
[23:59:02] (CR) BryanDavis: role::toollabs::merlbot_proxy (2 comments) [puppet] - https://gerrit.wikimedia.org/r/293223 (https://phabricator.wikimedia.org/T137235) (owner: BryanDavis)