[00:00:04] No patches in the queue for this window. Wheeee!
[00:00:49] (CR) 20after4: [C: 1] "http://puppet-compiler.wmflabs.org/7867/" [puppet] - https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 20after4)
[00:03:05] (PS1) Gergő Tisza: Add VirtualRestService config for Electron [mediawiki-config] - https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868)
[00:03:06] jouncebot: next
[00:03:07] (PS1) Gergő Tisza: Temporarily prevent users from accessing Special:RenderBook [mediawiki-config] - https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868)
[00:03:09] In 12 hour(s) and 56 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1300)
[00:03:18] jouncebot: now
[00:03:18] For the next 0 hour(s) and 56 minute(s): Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T0000)
[00:03:47] (PS2) Gergő Tisza: Temporarily prevent users from accessing Special:RenderBook [mediawiki-config] - https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868)
[00:05:07] (CR) 20after4: "http://puppet-compiler.wmflabs.org/7868/" [puppet] - https://gerrit.wikimedia.org/r/374054 (https://phabricator.wikimedia.org/T172847) (owner: 20after4)
[00:06:26] RECOVERY - EventBus HTTP Error Rate -4xx + 5xx- on graphite1001 is OK: OK: Less than 50.00% above the threshold [1.0]
[00:07:21] (PS2) Gergő Tisza: Add VirtualRestService config for Electron [mediawiki-config] - https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868)
[00:10:03] (PS3) Gergő Tisza: Add VirtualRestService config for Electron [mediawiki-config] - https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868)
[00:10:07] PROBLEM - salt-minion processes on labtestvirt2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:11:12] (PS3) Gergő Tisza: Temporarily prevent users from accessing Special:RenderBook/test [mediawiki-config] - https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868)
[00:18:28] !log Begin phabricator upgrade #phab-2017-09-13
[00:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:20:41] !log twentyafterfour@tin Started deploy [phabricator/deployment@7e3a0a8]: (no justification provided)
[00:20:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:05] !log twentyafterfour@tin Finished deploy [phabricator/deployment@7e3a0a8]: (no justification provided) (duration: 00m 25s)
[00:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:27] !log twentyafterfour@tin Started deploy [phabricator/deployment@7e3a0a8]: (no justification provided)
[00:21:29] !log twentyafterfour@tin Finished deploy [phabricator/deployment@7e3a0a8]: (no justification provided) (duration: 00m 01s)
[00:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:23:19] !log running puppet on phab1001 to restore files which get lost by scap deploy
[00:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:26:09] !log everything appears to be up and running. Phabricator upgrade complete.
[00:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:59] (PS1) Herron: Lists: Add zen.spamhaus.org DNSBL check to MTA rcpt acl [puppet] - https://gerrit.wikimedia.org/r/377938 (https://phabricator.wikimedia.org/T175878)
[00:40:35] musikanimal, cawiki finally done, now at commons. see you next year
[00:46:59] !log restarted populateIpChanges to use updated code
[00:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:48] Haha MaxSem
[01:19:43] Operations, Ops-Access-Requests: Requesting access to pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606599 (Tgr)
[01:20:57] Operations, Ops-Access-Requests: Requesting access to pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606614 (Tgr)
[01:32:28] Operations, Ops-Access-Requests: Requesting access to pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606616 (Tgr)
[01:35:01] Operations, Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606617 (Tgr)
[01:35:05] Operations, Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606599 (Tgr) Given that I already have shell access for most boxes and this is a pretty limited expansion of privileges, if it's possible to waive the requirement of waiting...
[02:02:07] (CR) 20after4: [C: 1] "Nice! less symlinks ~= winning" [mediawiki-config] - https://gerrit.wikimedia.org/r/376762 (owner: Chad)
[02:27:43] Operations, Commons, MediaWiki-extensions-Scribunto, Patch-For-Review, Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3606640 (Verdy_p) It was in scope of this bug when...
[02:27:54] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.17) (duration: 09m 27s)
[02:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:32:51] Operations, Commons, MediaWiki-extensions-Scribunto, Patch-For-Review, Wikimedia-log-errors: Some Commons pages transcluding Template:Countries_of_Europe HTTP 500/503 due to OOM in Lua→PHP→Lua calls - https://phabricator.wikimedia.org/T171392#3606642 (Verdy_p) And I disagree: T173194 / T1732...
[02:47:42] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.18) (duration: 07m 58s)
[02:47:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:55:10] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 14 02:55:10 UTC 2017 (duration 7m 28s)
[02:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:02:27] (PS1) Tim Starling: Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/377943
[03:35:38] (PS1) Niharika29: Make jouncebot evil [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/377945
[03:42:28] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3606654 (Ottomata) It was emailed to several lists, ops, analytics researchers, etc. You should get on the analytics list for sure...
[03:48:17] Operations, Ops-Access-Requests, Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3606655 (Niharika) I never subscribed to those because they felt like team-specific lists. Wikitech-l is probably the most widely r...
[03:48:31] (CR) Niharika29: [C: 2] Make jouncebot evil [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/377945 (owner: Niharika29)
[03:48:58] (Merged) jenkins-bot: Make jouncebot evil [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/377945 (owner: Niharika29)
[04:00:05] Niharika, Niharika, Niharika, Niharika, and Niharika: Dear deployers, time to do the Test jouncebot deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T0400).
[04:00:05] No patches in the queue for this window. Wheeee!
[04:00:39] Good boy.
[04:03:39] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3606687 (awight) I think I've been wrong all this time. The system's total open file count is...
[05:04:32] Operations, Cloud-VPS, Toolforge: Toolforge's static websever broken by Puppet changes and stale nginx packages - https://phabricator.wikimedia.org/T175885#3606725 (bd808)
[05:05:00] Operations, Cloud-VPS, Toolforge: Toolforge's static websever broken by Puppet changes and stale nginx packages - https://phabricator.wikimedia.org/T175885#3606739 (bd808) p:Triage>Normal
[05:05:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[05:05:22] Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Toolforge's static websever broken by Puppet changes and stale nginx packages - https://phabricator.wikimedia.org/T175885#3606725 (bd808)
[05:16:54] (CR) KartikMistry: "@hashar, can you look at this? This looks Jenkins issue as package dependencies for apertium-crh-tur are already available in WMF repo." [debs/contenttranslation/apertium-crh-tur] - https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: KartikMistry)
[05:21:07] PROBLEM - HHVM jobrunner on mw1167 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[05:22:07] RECOVERY - HHVM jobrunner on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.001 second response time
[05:23:01] Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Toolforge's static websever broken by Puppet changes and stale nginx packages - https://phabricator.wikimedia.org/T175885#3606778 (bd808) Related: {T169247}
[05:25:27] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0
[05:35:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[05:47:07] Operations, Cloud-VPS, Toolforge, cloud-services-team (Kanban): Toolforge's static webserver broken by Puppet changes and stale nginx packages - https://phabricator.wikimedia.org/T175885#3606823 (Quiddity)
[06:49:37] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[06:49:57] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:50:37] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[06:51:27] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[06:58:27] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:59:47] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:00:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:00:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:18:11] (PS37) Gehel: cassandra: future parser and Puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[07:23:33] the previous spike in 503s seems to be related to ints from cp1068
[07:24:35] same thing as yesterday, spike in mailbox lag
[07:24:36] https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp1068&var-datasource=eqiad%20prometheus%2Fops&from=now-3h&to=now&panelId=21&fullscreen
[07:26:13] (PS38) Gehel: cassandra: future parser and Puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704)
[07:26:22] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3603561 (elukey) Same thing happened this morning for cp1068 from 6:45 to 6:48 UTC: {F9522769} Self recovered, caused 503s and alerts for various text domains.
[07:36:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:37:39] 1068 again, but really low peak
[07:37:45] already gone
[07:39:47] Operations, Continuous-Integration-Infrastructure, DNS, Traffic: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3606984 (hashar) That is related. As I migrated some jobs from Trusty to Jessie, I have added a couple Jessie instances. Tha...
[07:41:04] Operations, Traffic: Text eqiad varnish 503 spikes - https://phabricator.wikimedia.org/T175803#3606987 (elukey) p:Triage>High
[07:42:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:52:37] (CR) Gehel: "There is still a difference in puppet compiler for xenon (https://puppet-compiler.wmflabs.org/compiler02/7870/xenon.eqiad.wmnet/) where th" [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[07:53:19] Operations, ORES, Patch-For-Review, Scoring-platform-team (Current), User-Ladsgroup: Review and fix file handle management in worker and celery processes - https://phabricator.wikimedia.org/T174402#3606989 (akosiaris) > ores1002 limits are, > ``` > ulimit -a > ... > open files...
[07:56:56] (Abandoned) Volans: depool esams [dns] - https://gerrit.wikimedia.org/r/377728 (owner: Volans)
[08:06:51] (PS1) Alexandros Kosiaris: Assign to wtp1025-wtp1048 the parsoid role [puppet] - https://gerrit.wikimedia.org/r/377960 (https://phabricator.wikimedia.org/T165520)
[08:08:01] <_joe_> gehel: I think we can live with the xenon change tbh
[08:08:06] <_joe_> but let's ask godog?
[08:08:38] <_joe_> or well, let's merge your change, and carefully run puppet on various machines. It's a pretty easy workflow with cumin
[08:08:40] yep, that's what I think, but confirmation from someone who knows what he is doing would be nice...
[08:09:10] <_joe_> 1 - cumin 'r:class = cassandra' 'disable-puppet "dangerous change"'
[08:09:26] <_joe_> 2 - run puppet on a few selected hosts to ensure everything's ok
[08:09:48] <_joe_> 3 - cumin 'r:class = cassandra' 'run-puppet-agent -e "dangerous change"'
[08:10:06] <_joe_> I'm pretty sure we shouldn't even have those seeds
[08:10:11] <_joe_> (the fqdns)
[08:10:14] how many hosts? use batch if many
[08:10:30] <_joe_> volans: batch is for non-cowboys
[08:10:46] * gehel does not have the matching hat...
[08:10:48] that includes all europeans ;)
[08:10:59] <_joe_> volans: I am an honorary cowboy
[08:11:26] only if you can find a proper hat that fit your head and send us a picture in the next 5 minutes :-P
[08:11:41] <_joe_> I have one, actually, but not here :P
[08:12:26] ok, 51 hosts, let's do this!
[08:12:32] <_joe_> ahah
[08:12:46] <_joe_> yeah maybe batch a bit the latter command :P
[08:12:46] (CR) Filippo Giunchedi: "> There is still a difference in puppet compiler for xenon" [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[08:13:38] !log merging cassandra refactoring for puppet - T171704
[08:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:51] T171704: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704
[08:14:09] (CR) Gehel: [C: 2] cassandra: future parser and Puppet 4 compatibility [puppet] - https://gerrit.wikimedia.org/r/372124 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[08:14:28] (Abandoned) Thiemo Mättig (WMDE): Simplify Wikibase "unitStorage" configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/366533 (https://phabricator.wikimedia.org/T171107) (owner: Thiemo Mättig (WMDE))
[08:14:37] Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3263810 (akosiaris) >>! In T165348#3605721, @Dzahn wrote: > See patch above, based on the cumin results and feedback from the first few users, in the first round i suggest th...
[08:15:22] gehel: \o/
[08:15:43] godog: don't raise your arms, keep your fingers crossed...
[08:16:16] haha fair enough!
[08:16:50] wtf... there is a password change in xenon:/etc/cassandra-a/cqlshrc which did not show up on the compiler...
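The three-step cumin workflow _joe_ lists at 08:09 (disable puppet on every matching host, run it on a few selected hosts, then re-enable and run it on the rest, batched as volans suggests) can be sketched roughly as below. This is only an illustration, not cumin itself: `staged_puppet_rollout`, `chunked`, the `run` callback, the canary count and the batch size are all hypothetical names and values; only the `disable-puppet` and `run-puppet-agent -e` command strings come from the conversation, and using `run-puppet-agent -e` for the canary step is an assumption.

```python
from typing import Callable, List


def chunked(hosts: List[str], size: int) -> List[List[str]]:
    """Split hosts into consecutive batches of at most `size` entries."""
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


def staged_puppet_rollout(
    hosts: List[str],
    run: Callable[[str, str], bool],  # hypothetical: run(host, command) -> success
    canaries: int = 3,                # assumed number of hand-picked hosts
    batch_size: int = 10,             # assumed batch size for the final step
    reason: str = "dangerous change",
) -> List[List[str]]:
    """Disable puppet everywhere, try a few canaries, then re-enable in batches.

    Returns the batches processed in the final step.
    """
    # Step 1: disable puppet on every host so nothing applies the change unattended.
    for host in hosts:
        run(host, f'disable-puppet "{reason}"')

    # Step 2: re-enable and run puppet on a few hosts first; stop on any failure,
    # which leaves the remaining hosts disabled.
    for host in hosts[:canaries]:
        if not run(host, f'run-puppet-agent -e "{reason}"'):
            raise RuntimeError(f"canary {host} failed; remaining hosts stay disabled")

    # Step 3: re-enable and run puppet on the remaining hosts, one batch at a time.
    done: List[List[str]] = []
    for batch in chunked(hosts[canaries:], batch_size):
        for host in batch:
            run(host, f'run-puppet-agent -e "{reason}"')
        done.append(batch)
    return done
```

With the 51 hosts mentioned at 08:12 and the assumed defaults, the last step would process the 48 non-canary hosts in five batches.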
[08:16:55] (PS2) Alexandros Kosiaris: Assign to wtp1025-wtp1048 the parsoid role [puppet] - https://gerrit.wikimedia.org/r/377960 (https://phabricator.wikimedia.org/T165520)
[08:16:56] (CR) Alexandros Kosiaris: [V: 2 C: 2] Assign to wtp1025-wtp1048 the parsoid role [puppet] - https://gerrit.wikimedia.org/r/377960 (https://phabricator.wikimedia.org/T165520) (owner: Alexandros Kosiaris)
[08:17:18] Oh, probably private data matching the default... having a look
[08:17:31] note that I only ran a noop, so nothing is broken yet
[08:19:48] ofc, private hieradata also need some tuning, on it...
[08:21:30] ack, LMK if I can help
[08:21:36] (PS3) Volans: WMCS: install Cumin for WMCS admins [puppet] - https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712)
[08:21:48] PROBLEM - puppet last run on wtp1046 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 28 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:07] PROBLEM - puppet last run on wtp1030 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 37 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:27] PROBLEM - puppet last run on wtp1033 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:27] PROBLEM - puppet last run on wtp1027 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:27] PROBLEM - puppet last run on wtp1038 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:28] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:28] PROBLEM - puppet last run on wtp1041 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:37] PROBLEM - puppet last run on wtp1040 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:37] PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:37] PROBLEM - puppet last run on wtp1042 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:37] PROBLEM - puppet last run on wtp1037 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:48] PROBLEM - puppet last run on wtp1026 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:22:48] PROBLEM - puppet last run on wtp1029 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:23:11] that's not me is it?!?
[08:23:18] no I think that's akosiaris
[08:23:43] yeah that's me
[08:23:55] here's a nice race
[08:23:58] ok, thanks! you had me worried for a minute
[08:24:01] I did set notifications_enabled
[08:24:07] PROBLEM - puppet last run on wtp1035 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 52 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:08] PROBLEM - puppet last run on wtp1028 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 54 seconds ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:13] but I added the role in the same patch
[08:24:17] PROBLEM - puppet last run on wtp1039 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:17] PROBLEM - puppet last run on wtp1047 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:19] so the new puppet runs fail
[08:24:28] PROBLEM - puppet last run on wtp1025 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:28] PROBLEM - puppet last run on wtp1045 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:28] PROBLEM - puppet last run on wtp1034 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:28] PROBLEM - puppet last run on wtp1043 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:24:31] but notifications_disabled has not propagated yet
[08:24:40] sigh puppet sigh
[08:28:08] PROBLEM - Check systemd state on wtp1032 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:28:47] PROBLEM - Check systemd state on wtp1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:28:56] (PS4) Volans: WMCS: install Cumin for WMCS admins [puppet] - https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712)
[08:29:17] PROBLEM - puppet last run on wtp1032 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:31:14] (PS1) Gehel: cassandra - move super authentication on the main cassandra class [puppet] - https://gerrit.wikimedia.org/r/377965 (https://phabricator.wikimedia.org/T171704)
[08:31:18] PROBLEM - puppet last run on wtp1031 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:31:37] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. Failed resources (up to 3 shown): Package[parsoid/deploy],Package[confd]
[08:31:46] godog: could you have a quick look at https://gerrit.wikimedia.org/r/377965
[08:32:38] gehel: yup
[08:33:35] gehel: did it run on the pcc yet?
[08:33:39] godog: thanks! I'm running puppet-compiler right now
[08:33:58] (CR) Volans: "I just realized that labcontrol* hosts are Trusty, to avoid to fight with old systems I switched the installation of Cumin master to the p" [puppet] - https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: Volans)
[08:36:28] godog: still running, but https://puppet-compiler.wmflabs.org/compiler02/7873/
[08:39:16] godog: puppet-compiler says it's a noop on all tested nodes
[08:41:14] gehel: yeah lgtm, just parameters changes
[08:41:38] godog: thanks!
[08:42:01] godog: a formal +1?
[08:42:15] (PS2) Gehel: cassandra - move super authentication on the main cassandra class [puppet] - https://gerrit.wikimedia.org/r/377965 (https://phabricator.wikimedia.org/T171704)
[08:42:18] (CR) Filippo Giunchedi: [C: 1] cassandra - move super authentication on the main cassandra class [puppet] - https://gerrit.wikimedia.org/r/377965 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[08:42:24] thanks!
[08:42:51] (CR) Gehel: [C: 2] cassandra - move super authentication on the main cassandra class [puppet] - https://gerrit.wikimedia.org/r/377965 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[08:50:51] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3607071 (Krinkle)
[08:50:59] Operations, MediaWiki-Cache, MediaWiki-JobQueue, Traffic, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3607072 (Krinkle)
[08:51:57] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 31 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:56:08] (PS1) Gehel: cassandra - actually use the authentication parameters [puppet] - https://gerrit.wikimedia.org/r/377975 (https://phabricator.wikimedia.org/T171704)
[08:56:58] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 11 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:58:54] (PS1) Gehel: cassandra - re-introduce the instance_count parameter [puppet] - https://gerrit.wikimedia.org/r/377976 (https://phabricator.wikimedia.org/T171704)
[08:59:32] godog: 2 more fixes if you have a minute...
[09:00:23] !log akosiaris@tin Started deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520
[09:00:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:37] T165520: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520
[09:02:06] (CR) Filippo Giunchedi: [C: 1] cassandra - actually use the authentication parameters [puppet] - https://gerrit.wikimedia.org/r/377975 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[09:02:48] !log akosiaris@tin Finished deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520 (duration: 02m 25s)
[09:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:24] (CR) Gehel: [C: 2] cassandra - actually use the authentication parameters [puppet] - https://gerrit.wikimedia.org/r/377975 (https://phabricator.wikimedia.org/T171704) (owner: Gehel)
[09:16:50] Operations, Electron-PDFs, OfflineContentGenerator, Services (designing): Improve stability and maintainability of our browser-based PDF render service - https://phabricator.wikimedia.org/T172815#3607118 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tec...
[09:16:58] Operations, Collection, OfflineContentGenerator, Readers-Web-Backlog, and 2 others: Remove deprecated features from book creator UI - https://phabricator.wikimedia.org/T150917#3607121 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tech News ]], OfflineCo...
[09:17:02] Operations, OfflineContentGenerator, Readers-Web-Backlog (Tracking), Services (watching): Confirm attribution needs - https://phabricator.wikimedia.org/T150875#3607122 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tech News ]], OfflineContentGenerator (...
[09:17:04] Operations, Collection, OfflineContentGenerator, Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3607123 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tech News ]]...
[09:17:09] Operations, OCG-General, Reading-Community-Engagement, Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3607124 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tech News ]...
[09:17:30] Operations, OCG-General, Documentation, codfw-rollout: Document eqiad/codfw transition plan for OCG - https://phabricator.wikimedia.org/T133164#3607132 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tech News ]], OfflineContentGenerator (OCG) will not be...
[09:18:01] Operations, OCG-General: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#3607144 (Aklapper) As already announced in [[ https://meta.wikimedia.org/wiki/Tech/News/2017/37 | Tech News ]], OfflineContentGenerator (OCG) will not be used anymore after October 1st, 2017 on W...
[09:19:07] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[09:20:38] Operations, monitoring, Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3607162 (jcrespo) > are you OK with all mariadb:: roles Every host with a mysql server (plus the mariadb::client s) gets a screen. Sadly, there are 30 roles for mysql servers...
[09:21:55] (PS1) Hashar: puppetmaster: convert tests to spec [puppet] - https://gerrit.wikimedia.org/r/377980
[09:23:35] (PS2) Gehel: cassandra - re-introduce the instance_count parameter [puppet] - https://gerrit.wikimedia.org/r/377976 (https://phabricator.wikimedia.org/T171704)
[09:24:07] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[09:25:02] (PS1) Hashar: base: invoke fail() instead of error() [puppet] - https://gerrit.wikimedia.org/r/377981
[09:28:11] (PS1) DCausse: Bump highlighter version to 5.3.2.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/377983 (https://phabricator.wikimedia.org/T173231)
[09:30:15] (CR) DCausse: "@Gehel: this would be needed if we run a rolling upgrade on 5.3.2, can be ignored if we jump directly to 5.5.2." [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/377983 (https://phabricator.wikimedia.org/T173231) (owner: DCausse)
[09:32:03] !log akosiaris@tin Started deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520
[09:32:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:16] T165520: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520
[09:39:54] (PS3) Elukey: [WIP] role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992)
[09:39:58] (PS1) Hashar: puppetmaster: test for puppetmaster::geoip [puppet] - https://gerrit.wikimedia.org/r/377986
[09:40:29] (CR) jerkins-bot: [V: -1] puppetmaster: test for puppetmaster::geoip [puppet] - https://gerrit.wikimedia.org/r/377986 (owner: Hashar)
[09:40:47] Operations, Continuous-Integration-Infrastructure, DNS, Traffic: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3607247 (hashar) I am trying to add the GeoIP files on the CI puppet master. Gotta fix some puppet madness with an undefined...
[09:41:33] !log akosiaris@tin Finished deploy [parsoid/deploy@cec7d17]: test deploy using dsh groups. T165520 (duration: 09m 30s)
[09:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:46] T165520: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520
[09:41:56] !log re-enabling puppet on all cassandra nodes - T171704
[09:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:09] T171704: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704
[09:42:12] (PS4) Elukey: [WIP] role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992)
[09:42:18] gehel: thanks a lot for this work :)
[09:42:30] elukey: my pleasure (well, mostly)
[09:43:08] !log akosiaris@puppetmaster1001 conftool action : set/weight=1; selector: name=wtp1025.eqiad.wmnet
[09:43:14] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet
[09:43:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:48] (PS2) Hashar: puppetmaster: test for puppetmaster::geoip [puppet] - https://gerrit.wikimedia.org/r/377986
[09:44:28] (CR) jerkins-bot: [V: -1] puppetmaster: test for puppetmaster::geoip [puppet] - https://gerrit.wikimedia.org/r/377986 (owner: Hashar)
[09:45:01] (PS1) Alexandros Kosiaris: Re-enable notifications for wtp1025-wtp1048 [puppet] - https://gerrit.wikimedia.org/r/377987 (https://phabricator.wikimedia.org/T165520)
[09:45:20] Operations, Discovery, Maps, Maps-Sprint, Traffic: Make maps active / active - https://phabricator.wikimedia.org/T162362#3607256 (Gehel) @Pnorman could you have a look at the codfw servers and see if we are ready to move on this? For reference, the puppet change to do: https://gerrit.wikimed...
[09:45:52] (03PS5) 10Elukey: [WIP] role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [09:47:53] (03CR) 10Alexandros Kosiaris: [C: 032] Re-enable notifications for wtp1025-wtp1048 [puppet] - 10https://gerrit.wikimedia.org/r/377987 (https://phabricator.wikimedia.org/T165520) (owner: 10Alexandros Kosiaris) [09:51:41] !log akosiaris@puppetmaster1001 conftool action : set/pooled=no; selector: name=wtp1025.eqiad.wmnet [09:51:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:52] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3607289 (10akosiaris) I think we are done. The new parsoid boxes are up and running. They are running Debian stretch and nodejs 6.11. They are not pooled and do not serve any kind of traffic currently. Ic... [09:58:02] (03PS6) 10Elukey: role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [09:58:27] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/7879/kafka-jumbo1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [10:06:21] (03CR) 10MarcoAurelio: [C: 031] "There's no on-wiki discussion nor Phabricator ticket for this. As far as I remember this all came on #wikimedia-operations so they could p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (owner: 10Brian Wolff) [10:09:56] (03PS1) 10Mobrovac: WIP: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 [10:10:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 (owner: 10Mobrovac) [10:10:33] (03PS3) 10MarcoAurelio: Follow-up 6d62e9ea8a. 
Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [10:14:49] (03CR) 10Brian Wolff: "Its approved by the MediaWiki irc cabal." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [10:18:25] (03CR) 10MarcoAurelio: [C: 031] "> Its approved by the MediaWiki irc cabal." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [10:18:33] (03PS5) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [10:20:52] (03PS4) 10Brian Wolff: Follow-up 6d62e9ea8a. Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) [10:23:47] (03CR) 10Brian Wolff: "The IRC cabal thing was kind of a joke, but I don't see why not. 
We've never stood on ceremony for changes affecting mediawiki.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [10:28:09] (03PS2) 10Mobrovac: WIP: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 [10:28:11] (03PS3) 10Hashar: puppetmaster: pass volatile_dir to geoip class [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) [10:28:32] (03CR) 10jerkins-bot: [V: 04-1] WIP: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 (owner: 10Mobrovac) [10:30:30] (03PS3) 10Mobrovac: WIP: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 [10:31:23] RECOVERY - Check systemd state on restbase2003 is OK: OK - running: The system is fully operational [10:34:10] (03CR) 10Hashar: "I ran it on the CI puppetmaster and on an agent:" [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar) [10:36:21] (03PS7) 10Elukey: role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [10:42:05] (03CR) 10Hashar: "On the puppetmaster I had to manually run the cron entry:" [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar) [10:44:58] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3607455 (10hashar) a:03hashar I have rebuild the jenkins build and it passed on the slave 1003 ( https://integra... 
[10:45:42] 10Operations, 10Continuous-Integration-Infrastructure, 10DNS, 10Traffic, and 2 others: CI: operations-dns-lint broken due to missing Maxmind DB file - https://phabricator.wikimedia.org/T175864#3607458 (10hashar) p:05Triage>03Normal [10:50:06] (03PS1) 10Jcrespo: mariadb: Decommission db1049, add db1100 into production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378003 (https://phabricator.wikimedia.org/T175264) [10:50:30] 10Operations, 10DBA, 10Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3607479 (10jcrespo) [10:50:32] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3607478 (10jcrespo) [10:55:29] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1049, add db1100 into production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378003 (https://phabricator.wikimedia.org/T175264) (owner: 10Jcrespo) [10:58:12] (03Merged) 10jenkins-bot: mariadb: Decommission db1049, add db1100 into production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378003 (https://phabricator.wikimedia.org/T175264) (owner: 10Jcrespo) [10:58:22] (03CR) 10jenkins-bot: mariadb: Decommission db1049, add db1100 into production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378003 (https://phabricator.wikimedia.org/T175264) (owner: 10Jcrespo) [10:58:46] (03PS1) 10Elukey: Add the beta Kafka-Jumbo cluster to the scap targets [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/378004 [10:59:49] (03CR) 10Elukey: "Not sure what is the preference, maybe a beta environment? Let me know.." 
[software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/378004 (owner: 10Elukey) [11:00:28] (03CR) 10Hashar: "recheck" [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [11:00:35] (03PS4) 10Mobrovac: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 (https://phabricator.wikimedia.org/T172610) [11:02:13] (03CR) 10Mobrovac: [C: 031] "PCC (finally) OK - https://puppet-compiler.wmflabs.org/compiler02/7880/" [puppet] - 10https://gerrit.wikimedia.org/r/377997 (https://phabricator.wikimedia.org/T172610) (owner: 10Mobrovac) [11:02:16] (03CR) 10Hashar: "The repo apparently lacked an origin/HEAD which confuses zuul-merger." [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [11:03:23] (03PS1) 10Jcrespo: mariadb: Decommission db1049 [puppet] - 10https://gerrit.wikimedia.org/r/378005 (https://phabricator.wikimedia.org/T175264) [11:04:23] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1049 [puppet] - 10https://gerrit.wikimedia.org/r/378005 (https://phabricator.wikimedia.org/T175264) (owner: 10Jcrespo) [11:09:16] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-crh-tur: Initial Debian packaging [debs/contenttranslation/apertium-crh-tur] - 10https://gerrit.wikimedia.org/r/377449 (https://phabricator.wikimedia.org/T174765) (owner: 10KartikMistry) [11:12:24] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove all references to db1049 (duration: 00m 50s) [11:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:51] !log upload apertium-crh-tur_0.2.0~r81866-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [11:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove all 
references to db1049, pool db1100 (duration: 00m 49s) [11:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:43] (03PS1) 10Aaron Schulz: Make it easy to set PHP ini flags with mwscript [puppet] - 10https://gerrit.wikimedia.org/r/378007 [11:23:37] (03CR) 10Addshore: Install composer for PHP imaages (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/369838 (https://phabricator.wikimedia.org/T172358) (owner: 10BryanDavis) [11:23:58] (03PS1) 10Jcrespo: dblists: Decommission db1048 and other pending updates [software] - 10https://gerrit.wikimedia.org/r/378010 (https://phabricator.wikimedia.org/T175264) [11:25:25] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3607575 (10jcrespo) [11:26:49] (03CR) 10Jcrespo: [C: 032] dblists: Decommission db1048 and other pending updates [software] - 10https://gerrit.wikimedia.org/r/378010 (https://phabricator.wikimedia.org/T175264) (owner: 10Jcrespo) [11:27:54] 10Operations, 10ops-eqiad, 10DBA: Decommission db1049 - https://phabricator.wikimedia.org/T175264#3607590 (10jcrespo) Decomm. done , only references left are spare on site.pp and admin_install. 
[11:48:41] 10Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3607640 (10ovasileva) [11:51:35] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [12:03:24] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Support supplying a default egress policy [calico-k8s-policy-controller] (0.6.0) - 10https://gerrit.wikimedia.org/r/377421 (https://phabricator.wikimedia.org/T170111) (owner: 10Alexandros Kosiaris) [12:03:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Add WMF http_proxies in build.sh [calico-k8s-policy-controller] (0.6.0) - 10https://gerrit.wikimedia.org/r/377432 (owner: 10Alexandros Kosiaris) [12:03:42] (03PS4) 10Alexandros Kosiaris: Ship the default egress policy [puppet] - 10https://gerrit.wikimedia.org/r/377470 (https://phabricator.wikimedia.org/T170111) [12:03:47] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Ship the default egress policy [puppet] - 10https://gerrit.wikimedia.org/r/377470 (https://phabricator.wikimedia.org/T170111) (owner: 10Alexandros Kosiaris) [12:04:05] 10Operations, 10Goal, 10Kubernetes: Operations Q1 goal: Streamlined Service Delivery - https://phabricator.wikimedia.org/T170108#3607700 (10akosiaris) [12:04:07] 10Operations, 10Goal, 10Kubernetes, 10Patch-For-Review, and 2 others: Implement a pod networking policy approach - https://phabricator.wikimedia.org/T170111#3607697 (10akosiaris) 05Open>03Resolved a:03akosiaris Changes merged, resolving [12:07:55] 10Operations, 10Goal, 10Kubernetes, 10Services (watching), 10User-Joe: Standardize on the "default" pod setup - https://phabricator.wikimedia.org/T170120#3607702 (10akosiaris) FTR +1. 
Let's take a good look at envoy [12:27:22] (03CR) 10Gehel: [C: 031] "Not merging until we know when we deploy es 5.5.x" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/377983 (https://phabricator.wikimedia.org/T173231) (owner: 10DCausse) [12:38:34] jouncebot: refresh [12:38:37] jouncebot: next [12:38:37] I refreshed my knowledge about deployments. [12:38:38] In 3 hour(s) and 21 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1600) [12:38:53] zeljkof: I am not sure why there is no swat today [12:40:13] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/377916 (https://phabricator.wikimedia.org/T129149) (owner: 10Thcipriani) [12:42:36] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, much "better" approach" [puppet] - 10https://gerrit.wikimedia.org/r/377997 (https://phabricator.wikimedia.org/T172610) (owner: 10Mobrovac) [12:44:03] (03CR) 10Filippo Giunchedi: [C: 031] Add the beta Kafka-Jumbo cluster to the scap targets [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/378004 (owner: 10Elukey) [12:46:20] gehel, godog, _joe_: so what's next for 372124? [12:47:17] all merged and done? [12:47:34] paravoid: everything that I know is merged. [12:47:37] paravoid: afaik after it got merged this morning only https://gerrit.wikimedia.org/r/#/c/377997/ is left to be merged for the last fix [12:47:56] I still need to do a check on maps, see if they are all good. 
But the cassandra stuff is out [12:49:00] godog, paravoid: I'm ready to merge https://gerrit.wikimedia.org/r/#/c/377997/ and we can call that done [12:49:08] alright [12:49:20] (03PS5) 10Gehel: Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 (https://phabricator.wikimedia.org/T172610) (owner: 10Mobrovac) [12:49:34] (03CR) 10Rush: WMCS: install Cumin for WMCS admins (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [12:49:42] (03PS6) 10Rush: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [12:50:22] gehel: what about https://gerrit.wikimedia.org/r/#/c/377976/ ? [12:50:42] paravoid: that one needs to be abbandonned... [12:50:50] unrelated nitpick: I prefer "cassandra: " over "cassandra -" in commit messages [12:50:50] (03CR) 10Gehel: [C: 032] Cassandra: Include only instance DNS' in the list of seeds [puppet] - 10https://gerrit.wikimedia.org/r/377997 (https://phabricator.wikimedia.org/T172610) (owner: 10Mobrovac) [12:51:11] (03Abandoned) 10Gehel: cassandra - re-introduce the instance_count parameter [puppet] - 10https://gerrit.wikimedia.org/r/377976 (https://phabricator.wikimedia.org/T171704) (owner: 10Gehel) [12:51:43] paravoid: damn, I've been doing it wrong all this time... [12:52:27] akosiaris: there is one of your patch mixed with mine on puppetmaster, OK to merge? [12:55:06] https://gerrit.wikimedia.org/r/#/c/377470/ by akosiaris is un-merged on puppetmaster. It should be good to go, but I have absolutely no idea about what this is (kubernetes related). Is there anyone to have a look? 
[12:55:13] (03CR) 10Filippo Giunchedi: "LGTM overall, a nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [12:55:52] hashar: me too [12:55:52] gehel: I'd say good to merge, worst case k8s breaks but it isn't in production yet [12:56:02] ok, so merging [12:56:49] gehel: yeah my mistake [12:56:50] sorry [12:57:03] akosiaris: no problem! [12:57:04] and it's not even enforced yet [12:57:12] that file is there for informational purposes [12:57:34] I have yet to figure out if and how I want puppet to enforce it [12:57:37] ok, so all good! I was entirely unsure what it was doing... [12:58:58] 10Operations, 10media-storage, 10User-fgiunchedi: Deleting file on Commons "Error deleting file: An unknown error occurred in storage backend "local-multiwrite"." - https://phabricator.wikimedia.org/T173374#3607772 (10fgiunchedi) @nick could you try again to delete both files? thanks! [12:59:00] Hello [12:59:18] zeljkof: hashar: I can SWAT this window [13:00:10] mobrovac: last cassandra patch deployed, I checked a few nodes, all look good [13:00:39] gehel: by "deployed" you mean you ran puppet everywhere? [13:01:04] mobrovac: nope, on a few nodes, but I can do a full run on all cassandra [13:01:29] i'll run it on rb nodes [13:01:50] ok, so I'm not touching anything else, ping me if you need me [13:02:03] kk [13:02:05] thnx gehel! 
[13:02:09] np [13:02:20] (03CR) 10Faidon Liambotis: [C: 04-1] swift: use implicit /dev/swift prefix for swift devices (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [13:02:36] zeljkof: hashar: oh you mean in the Deployments table, yes, indeed, normally the window to skip is the Tuesday morning, to avoid time conflicts with train [13:04:06] Dereckson: well I guess if some people have patches to push we can do them anyway [13:05:39] jouncebot: refresh [13:05:42] I refreshed my knowledge about deployments. [13:05:47] jouncebot: now [13:05:48] For the next 0 hour(s) and 54 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1300) [13:12:52] (03CR) 10Elukey: role::kafka::jumbo::broker: enable Prometheus JMX monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) (owner: 10Elukey) [13:13:23] (03CR) 10Elukey: [V: 032 C: 032] Add the beta Kafka-Jumbo cluster to the scap targets [software/prometheus_jmx_exporter] - 10https://gerrit.wikimedia.org/r/378004 (owner: 10Elukey) [13:15:47] hashar: is it ok to add patches for today eu swat? [13:17:01] dcausse: yes! :) [13:17:13] hashar: doing, thanks! [13:18:14] (03PS7) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [13:18:37] 10Operations, 10Epic, 10Goal, 10Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#3607839 (10mobrovac) [13:18:39] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3607836 (10mobrovac) 05Resolved>03Open Sure. 
[13:19:32] (03PS8) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [13:20:50] (03PS7) 10DCausse: Setup Cirrus MLR models for top 20 language AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (https://phabricator.wikimedia.org/T175771) (owner: 10EBernhardson) [13:20:52] (03CR) 10Volans: "@Rush: thanks for the review, all comments addressed, see inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [13:21:32] (03PS1) 10Gilles: Add ResourceLoader Grafana performance alerts [puppet] - 10https://gerrit.wikimedia.org/r/378023 (https://phabricator.wikimedia.org/T153171) [13:22:27] (03PS3) 10DCausse: Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) (owner: 10EBernhardson) [13:22:35] hashar: fyi I'll start to swat 2 config patches [13:22:57] +2 [13:23:06] (03CR) 10DCausse: [C: 032] Setup Cirrus MLR models for top 20 language AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (https://phabricator.wikimedia.org/T175771) (owner: 10EBernhardson) [13:24:40] (03Merged) 10jenkins-bot: Setup Cirrus MLR models for top 20 language AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (https://phabricator.wikimedia.org/T175771) (owner: 10EBernhardson) [13:26:10] (03CR) 10jenkins-bot: Setup Cirrus MLR models for top 20 language AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377393 (https://phabricator.wikimedia.org/T175771) (owner: 10EBernhardson) [13:26:28] (03CR) 10Dereckson: "Notified on wiki at https://www.mediawiki.org/wiki/Topic:Ty52a8goimjc13pc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373317 (owner: 10Brian Wolff) [13:31:22] !log dcausse@tin Synchronized wmf-config/InitialiseSettings.php: Setup Cirrus MLR models for top 
20 language AB test (duration: 00m 50s) [13:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:13] hello! so when a deploy to an extension goes wrong, we don't have to roll back the train as a whole, do we? [13:32:21] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3607882 (10Gehel) [13:32:32] we can deploy a patch to fix it via SWAT? [13:32:45] (03CR) 10Bmansurov: Temporarily prevent users from accessing Special:RenderBook/test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [13:33:33] !log dcausse@tin Synchronized wmf-config/CirrusSearch-common.php: Setup Cirrus MLR models for top 20 language AB test (duration: 00m 48s) [13:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:38] !log dcausse@tin Synchronized wmf-config/CirrusSearch-production.php: Setup Cirrus MLR models for top 20 language AB test (duration: 00m 49s) [13:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] hashar: I'm done, I'll move my second patch to sf morning swat, I have to pick up kids from school in ~15 min [13:39:06] we're in the middle of a SWAT deploy window, yes? [13:39:48] musikanimal: yes but I guess I was the only one to deploy something [13:39:55] dcausse: sounds wise :] [13:39:58] we have one more thing if at all possible... [13:40:06] musikanimal: yes we can [13:40:10] awesome [13:40:10] musikanimal: sure, I'm done [13:40:29] https://gerrit.wikimedia.org/r/#/c/378025/ [13:40:43] Melos shall I add to the calendar or you? 
[13:40:56] (03CR) 10Rush: [C: 032] prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) (owner: 10Rush) [13:41:49] musikanimal: do it :) [13:41:55] ok [13:43:53] Okay, I'm deploying that. [13:44:06] awesome awesome awesome [13:44:08] yeah that is just a mistake :D [13:44:27] unlike the SWAT deploy I requested last night, this one is urgent! [13:44:57] musikanimal: it's currently under the Zuul care [13:45:22] https://integration.wikimedia.org/ci/job/mediawiki-extensions-hhvm-jessie/22745/ https://integration.wikimedia.org/ci/job/mediawiki-extensions-php55-jessie/206/ https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit-jessie/40338/ are the three CI tasks we wait for [13:45:25] good ole Zuul [13:46:02] :D [13:46:17] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3607951 (10Ottomata) The Analytics list is used for announcements about analytics specific services and news. We do send things like... [13:46:49] so this fixes checking of IP ranges in CU. You have to admit the irony, though... before only CUs could see contribs in an IP range, this week ALL users get to (T163562), but it broke for CUs!!! hehehe [13:46:50] T163562: Add basic IP range support to Special:Contributions - https://phabricator.wikimedia.org/T163562 [13:47:44] eheh [13:47:51] completely unrelated patches [13:50:49] musikanimal: Melos: live on mwdebug1002.eqiad.wmnet [13:51:49] (03PS6) 10Rush: prometheus: allow setting a specific listening address and port [puppet] - 10https://gerrit.wikimedia.org/r/374650 (https://phabricator.wikimedia.org/T169039) [13:57:59] musikanimal: Melos: can you test it? 
[13:58:09] I was going to say, if you show me where to go I will [13:58:24] (pardon my newbieness) [13:58:31] no problem [13:59:06] we first push the change to a canary server, here mwdebug1002.eqiad.wmnet. The front-end server is configured to redirect queries to the server you want if you add the right header to your request. [13:59:23] This is covered by https://wikitech.wikimedia.org/wiki/Debugging_in_production [13:59:50] To test changes, you only have to install a Chrome/Firefox extension, it takes care of adding the headers for you: https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [14:00:11] cool, that one I actually have [14:00:23] you install it, in the menu you pick mwdebug1002, you put the trigger to on, and hop you can request the special page and ensure it's not broken [14:00:45] (if you've a CU available, you can ask them to check it's not broken either) [14:01:03] so I can go to production Special:CheckUser? I'm a CU on enwiki [14:01:17] * Dereckson nods [14:01:48] and the extension will take care of sending your request to mwdebug1002.eqiad.wmnet as long as you put the slider to ON [14:02:16] yup, so even being group2 (which hasn't gotten wmf.18) it's going to use the SWAT branch? [14:03:56] it works there, I had the slider to ON and set to mwdebug1002.eqiad.wmnet [14:04:47] Dereckson: Works for me [14:05:10] ok [14:05:18] syncing [14:06:44] you know en.wp is 17 by the way? 
https://tools.wmflabs.org/versions/ [14:07:06] will be upgraded to 18 in a few hours [14:07:14] !log dereckson@tin Synchronized php-1.30.0-wmf.18/extensions/CheckUser/specials/SpecialCheckUser.php: Fix Special cu for ip ranges (temp) (T175898) (duration: 00m 49s) [14:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:29] T175898: Checkuser on IP ranges produces no results, even if there are edits in that range - https://phabricator.wikimedia.org/T175898 [14:07:29] right, that's why I was asking [14:08:03] Dereckson: I've tested it on loginwiki [14:08:09] ok [14:08:16] yes, when en.wp will be upgraded to wmf.18, it will be the branch including your fix [14:08:22] but so long as I have the X-Wikimedia-Debug set to mwdebug1002, it will test the right branch, and not wmf.17 which enwiki is on? [14:09:49] no, it respects the wiki branch [14:09:58] so you was checking wmf17 branch code [14:10:19] got it [14:10:23] thank you :) [14:21:44] You're welcome. [14:24:31] (03Abandoned) 10DCausse: [WIP] Bump version of the ltr plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/364462 (owner: 10DCausse) [14:33:54] (03PS1) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 [14:36:00] (03PS2) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 [14:36:30] (03PS1) 10Faidon Liambotis: openstack2: de-duplicate parameter $nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/378038 [14:37:23] (03PS9) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [14:38:32] (03CR) 10Ottomata: "This will allow us to install the prometheus jmx exporter jars without scap." 
[debs/prometheus-jmx-exporter] (debian) - 10https://gerrit.wikimedia.org/r/378037 (owner: 10Ottomata) [14:39:41] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608038 (10dr0ptp4kt) I approve of the access request being fulfilled. I'm in support of expediting if possible. CC @bearND and @mdholloway for visibility. [14:41:20] (03PS1) 10Filippo Giunchedi: [WIP] smart: new module [puppet] - 10https://gerrit.wikimedia.org/r/378039 (https://phabricator.wikimedia.org/T86552) [14:42:46] (03CR) 10Faidon Liambotis: [C: 032] "As expected, no-op with both change and future:" [puppet] - 10https://gerrit.wikimedia.org/r/378038 (owner: 10Faidon Liambotis) [14:43:28] (03PS2) 10BBlack: browsersec: bump to 17% 2017-09-14 [puppet] - 10https://gerrit.wikimedia.org/r/376311 (https://phabricator.wikimedia.org/T163251) [14:43:39] (03PS21) 10Matthias Mullie: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [14:44:03] (03CR) 10BBlack: [C: 032] browsersec: bump to 17% 2017-09-14 [puppet] - 10https://gerrit.wikimedia.org/r/376311 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [14:44:45] (03PS8) 10Elukey: role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [14:49:01] (03PS9) 10Elukey: role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T167992) [14:49:42] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3608053 (10faidon) @jcrespo, fully agreed that alerts should be actionable and I don't particularly disagree with your alert definitions. This task exists precisely because a lo... 
[14:51:52] (03PS10) 10Volans: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) [14:52:09] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3608055 (10faidon) >>! In T165348#3605721, @Dzahn wrote: > See patch above, based on the cumin results and feedback from the first few users, in the first round i suggest the f... [14:57:13] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to researchers group (stat1003 and MySQL) for niharika29 - https://phabricator.wikimedia.org/T159780#3608087 (10Nuria) @Niharika : second @Ottomata 's comments, you should add yourself to analytics@ where analytics-related informatio... [14:58:36] 10Operations, 10Puppet, 10Patch-For-Review, 10User-Joe: Switch all hosts to the future parser - https://phabricator.wikimedia.org/T171704#3608100 (10faidon) I think as of today, with the latest compiler run ([[ https://puppet-compiler.wmflabs.org/7882/index-future.html | #7882 ]]) plus another hotfix (2811... 
[14:59:33] (03PS1) 10Ottomata: Initial debian commit [debs/prometheus-jmx-exporter] - 10https://gerrit.wikimedia.org/r/378040 (https://phabricator.wikimedia.org/T175922) [15:10:05] (03CR) 10Volans: "latest puppet compiler results: https://puppet-compiler.wmflabs.org/compiler02/7886/" [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [15:12:42] (03PS1) 10BBlack: ssl_ciphersuite: prefer ECDSA certs more-strongly [puppet] - 10https://gerrit.wikimedia.org/r/378045 [15:19:12] (03Abandoned) 10Herron: Lists: Add zen.spamhaus.org DNSBL check to MTA rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/377938 (https://phabricator.wikimedia.org/T175878) (owner: 10Herron) [15:20:31] (03PS1) 10Herron: Lists: Add zen.spamhaus.org DNSBL check to MTA rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/378046 (https://phabricator.wikimedia.org/T175878) [15:21:40] (03CR) 10BBlack: [C: 032] ssl_ciphersuite: prefer ECDSA certs more-strongly [puppet] - 10https://gerrit.wikimedia.org/r/378045 (owner: 10BBlack) [15:22:36] godog: what do I need to have https://grafana.wikimedia.org/ include stats from my new prometheus server on labmon1001? [15:23:24] (03PS2) 10Ottomata: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [15:23:38] andrewbogott: you should add it as a new datasource in grafana-admin.w.o [15:23:56] godog: ok, that's a live config on the site? or a puppet change? 
[15:25:10] (03PS10) 10Elukey: role::kafka::jumbo::broker: enable Prometheus JMX monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377753 (https://phabricator.wikimedia.org/T175922) [15:25:32] andrewbogott: sadly live config [15:25:41] (03CR) 10jerkins-bot: [V: 04-1] Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [15:25:44] happen to know where the menu for that is? [15:25:47] * andrewbogott is searching [15:28:57] ah, it's because I'm not admin [15:29:02] probably [15:29:22] godog: can you add me as a grafana admin? Or is /that/ in puppet? [15:29:39] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608215 (10RobH) Access checklist: [] - user has existing shell name [] - user has signed L3 (as of 2017-09-14 @ 15:34 GMT it has not been signed.) [] - manager approval (@dr0... [15:30:55] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608221 (10RobH) [15:30:57] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606599 (10RobH) a:03Tgr [15:31:17] andrewbogott: {{done}} try reloading see if it works now [15:31:52] godog: yep, I see the menu now [15:31:59] can't tell what to do with it yet :) [15:32:30] heh the datasource name should be consistent with what's there already so maybe "eqiad prometheus/labs" [15:33:40] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608225 (10RobH) [15:34:18] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606599 (10RobH) >>! 
In T175882#3608038, @dr0ptp4kt wrote: > I approve of the access request being fulfilled. I'm in support of expediting if possible. > > CC @bearND and @mho... [15:35:02] ah, and I need to set up a service url. [15:35:30] (03CR) 10Volans: "Why not using PySMARTx at all?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378039 (https://phabricator.wikimedia.org/T86552) (owner: 10Filippo Giunchedi) [15:35:53] 10Operations, 10Traffic: Removing support for AES128-SHA TLS cipher - https://phabricator.wikimedia.org/T147202#3608227 (10BBlack) Another note-to-self for the future: https://gerrit.wikimedia.org/r/#/c/301817/ is where we removed the fairly-similar `AES128-SHA256` and `AES128-GCM-SHA256`, throwing all such cl... [15:36:17] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608228 (10dr0ptp4kt) > We cannot expedite requests without @mark specifically overriding this process. In particular on granting a user sudo rights, sorry! Understood. Tha... 
[15:37:27] (03CR) 10Herron: [C: 032] Lists: Add zen.spamhaus.org DNSBL check to MTA rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/378046 (https://phabricator.wikimedia.org/T175878) (owner: 10Herron) [15:37:32] (03PS2) 10Herron: Lists: Add zen.spamhaus.org DNSBL check to MTA rcpt acl [puppet] - 10https://gerrit.wikimedia.org/r/378046 (https://phabricator.wikimedia.org/T175878) [15:39:47] (03PS1) 10BBlack: Revert "cache::route_table: ulsfo->eqiad directly (bypass codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/378048 [15:39:55] (03PS2) 10BBlack: Revert "cache::route_table: ulsfo->eqiad directly (bypass codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/378048 [15:40:06] (03CR) 10BBlack: [V: 032 C: 032] Revert "cache::route_table: ulsfo->eqiad directly (bypass codfw)" [puppet] - 10https://gerrit.wikimedia.org/r/378048 (owner: 10BBlack) [15:40:26] (03PS1) 10BBlack: Revert "depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/378049 [15:40:30] (03PS2) 10BBlack: Revert "depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/378049 [15:40:36] (03CR) 10BBlack: [V: 032 C: 032] Revert "depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/378049 (owner: 10BBlack) [15:46:08] !log depool cp1063 (upload eqiad) - seems to be having some iowait issues? 
[15:46:10] (03CR) 10Krinkle: [C: 04-1] Use EventBus for recentchanges stream instead of RCStream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [15:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:36] PROBLEM - Check Varnish expiry mailbox lag on cp1064 is CRITICAL: CRITICAL: expiry mailbox lag is 2026487 [15:49:42] (03PS8) 10Filippo Giunchedi: swift: use implicit /dev/swift prefix for swift devices [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) [15:50:08] (03CR) 10jerkins-bot: [V: 04-1] swift: use implicit /dev/swift prefix for swift devices [puppet] - 10https://gerrit.wikimedia.org/r/361648 (https://phabricator.wikimedia.org/T163673) (owner: 10Filippo Giunchedi) [15:51:13] (03PS1) 10Andrew Bogott: labmon: add prometheus cname [dns] - 10https://gerrit.wikimedia.org/r/378051 [15:51:17] !log cp1064 backend restart, mailbox lag [15:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:41] (03CR) 10Andrew Bogott: [C: 032] labmon: add prometheus cname [dns] - 10https://gerrit.wikimedia.org/r/378051 (owner: 10Andrew Bogott) [15:51:51] !log repool cp1063 (after wiping varnish storage + restarting all 3x main service daemons) [15:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:36] RECOVERY - Check Varnish expiry mailbox lag on cp1064 is OK: OK: expiry mailbox lag is 0 [16:00:04] godog, moritzm, and _joe_: Dear deployers, time to do the Puppet SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1600). [16:00:04] matthiasmullie and Amir1: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. 
[16:00:12] o/ [16:00:16] here [16:00:31] " Note: If you break the wikis, you will be rewarded with a sticker." LOL, Already got one [16:01:45] I'm here too [16:02:25] I'll do Amir1's patch first since it is easier [16:02:31] (03PS2) 10Filippo Giunchedi: gerrit: Smaller png logo [puppet] - 10https://gerrit.wikimedia.org/r/377547 (owner: 10Ladsgroup) [16:02:34] basically a merge and that's it [16:02:42] Thanks [16:03:10] (03CR) 10Filippo Giunchedi: [C: 032] gerrit: Smaller png logo [puppet] - 10https://gerrit.wikimedia.org/r/377547 (owner: 10Ladsgroup) [16:04:43] Amir1: {{done}} [16:04:53] Thank you! [16:05:09] (03PS22) 10Filippo Giunchedi: Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [16:05:43] matthiasmullie: looking at your patch now [16:05:49] alright [16:05:54] Works just fine [16:05:59] Krinkle: CC [16:06:09] godog it's currently cherry-picked on beta [16:06:59] Amir1: thx [16:07:20] not sure if that needs to get removed again before merging the patch [16:08:48] (03CR) 10Filippo Giunchedi: [C: 032] Add 3d2png deploy repo to image scalers [puppet] - 10https://gerrit.wikimedia.org/r/345377 (https://phabricator.wikimedia.org/T160185) (owner: 10MarkTraceur) [16:09:26] matthiasmullie: good question, if it is the exact same change the automatic rebasing should do the right thing [16:09:29] Anyone else get an exception on https://phabricator.wikimedia.org/T174362? 
[16:09:56] Niharika: i do too [16:10:01] the code hasn't changed since it was cherry-picked (apart from rebase), so that should be alright then [16:10:18] twentyafterfour: ^^ [16:11:52] matthiasmullie: ack, merged and I'm running puppet where needed [16:14:42] (03PS1) 10Andrew Bogott: designate: open api to labmon [puppet] - 10https://gerrit.wikimedia.org/r/378052 [16:15:19] (03PS1) 10Volans: volans wmcs wide root [labs/private] - 10https://gerrit.wikimedia.org/r/378053 [16:17:46] PROBLEM - puppet last run on mw2150 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:20:06] matthiasmullie: found an issue the patch, fixing [16:21:03] (03PS1) 10Filippo Giunchedi: imagescaler: use class-style for 3d2png [puppet] - 10https://gerrit.wikimedia.org/r/378054 (https://phabricator.wikimedia.org/T160185) [16:21:23] godog: IIRC it happened to me in deployment-prep that the auto-merge broke since I had the same patch cherrypicked [16:22:07] but in theory it should work :D [16:22:21] elukey: ah! thanks, yeah we'll see if it breaks [16:23:58] twentyafterfour: (cc: Niharika) https://phabricator.wikimedia.org/T174362 is broken, Unhandled Exception ("AphrontParameterQueryException") fyi [16:24:17] (03CR) 10Filippo Giunchedi: [C: 032] imagescaler: use class-style for 3d2png [puppet] - 10https://gerrit.wikimedia.org/r/378054 (https://phabricator.wikimedia.org/T160185) (owner: 10Filippo Giunchedi) [16:27:34] I don't even [16:27:36] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Illegal number at /etc/puppet/modules/3d2png/manifests/deploy.pp:8:8 [16:27:43] something similar works fine on thumbor [16:27:55] PROBLEM - puppet last run on naos is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Scap_source[3d2png/deploy] [16:29:55] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [16:31:10] godog: that's an invalid name IMHO: https://docs.puppet.com/puppet/3.8/lang_reserved.html#classes-and-types [16:32:15] ugh, > must begin with a lowercase letter [16:32:25] wonder why beta didn't explode? [16:32:42] (03PS1) 10Filippo Giunchedi: imagescaler: call 3d2png::deploy as thumbor::mediawiki does [puppet] - 10https://gerrit.wikimedia.org/r/378057 (https://phabricator.wikimedia.org/T160185) [16:32:46] volans: on thumbor it works though [16:32:56] i.e. ^ [16:33:04] (03CR) 10jerkins-bot: [V: 04-1] imagescaler: call 3d2png::deploy as thumbor::mediawiki does [puppet] - 10https://gerrit.wikimedia.org/r/378057 (https://phabricator.wikimedia.org/T160185) (owner: 10Filippo Giunchedi) [16:34:04] godog: where exactly? [16:34:13] godog this is not too urgent - can be pulled out today to change the name & try again another time, if that's easier [16:34:14] (03PS2) 10Filippo Giunchedi: imagescaler: call 3d2png::deploy as thumbor::mediawiki does [puppet] - 10https://gerrit.wikimedia.org/r/378057 (https://phabricator.wikimedia.org/T160185) [16:34:15] volans: on thumbor2001 for example [16:34:50] I mean which class starts with a number in the class name? [16:35:25] volans: what I mean is that role::thumbor::mediawiki has the 3d2png class applied and puppet runs fine [16:35:46] not applied, but called, you get what I mean [16:35:59] * volans updating local branch [16:36:11] matthiasmullie: heh I'll give it another one or two tries and back out the changes if it doesn't work [16:36:31] alright, thanks :) [16:36:37] I don't think we shoud call a class starting with a number given that the docs says "must begin with a lowercase letter" [16:37:34] as for the why it works... 
I don't have a good explanation besides "it's puppet" and http://www.reactiongifs.com/wp-content/uploads/2013/03/magic.gif [16:37:48] heheh [16:38:20] at this point I'm curious to try, looks like we'll be changing the name anyway [16:38:34] (03CR) 10Filippo Giunchedi: [C: 032] imagescaler: call 3d2png::deploy as thumbor::mediawiki does [puppet] - 10https://gerrit.wikimedia.org/r/378057 (https://phabricator.wikimedia.org/T160185) (owner: 10Filippo Giunchedi) [16:39:09] and the compiler didn't barf on any of this [16:40:56] still seems to not be reliable and open to any unpredictable behaviour to me [16:41:14] https://projects.puppetlabs.com/issues/3129 [16:42:32] looking on the related ones seems that during the years they tried to support them and changed their mind :D [16:42:34] (03PS1) 10Filippo Giunchedi: Revert "Add 3d2png deploy repo to image scalers" [puppet] - 10https://gerrit.wikimedia.org/r/378058 (https://phabricator.wikimedia.org/T160185) [16:43:04] fwiw I vote to rename the class [16:43:05] (03CR) 10Gergő Tisza: Temporarily prevent users from accessing Special:RenderBook/test (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [16:43:14] (03CR) 10Filippo Giunchedi: [C: 032] Revert "Add 3d2png deploy repo to image scalers" [puppet] - 10https://gerrit.wikimedia.org/r/378058 (https://phabricator.wikimedia.org/T160185) (owner: 10Filippo Giunchedi) [16:44:58] alright, matthiasmullie ^ reverted [16:45:04] the class will need another name [16:45:12] alright; I'll resubmit with different classname tomorrow ;) [16:45:21] matthiasmullie: kk! thanks [16:45:25] RECOVERY - puppet last run on mw2150 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:45:31] thanks! 
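For reference, the rule quoted above from the Puppet 3.8 reserved-words page ("must begin with a lowercase letter") can be checked mechanically before a patch is merged. A minimal sketch in Python, assuming every `::`-separated segment of a class name must match `[a-z][a-z0-9_]*`. The function name and the replacement name `three_d2png` are hypothetical, and nothing like this is known to exist in the pipeline discussed here:

```python
import re

# One namespace segment: a lowercase letter first, then lowercase
# letters, digits or underscores (assumed reading of the Puppet docs).
_SEGMENT = re.compile(r"[a-z][a-z0-9_]*")


def is_valid_puppet_class_name(name: str) -> bool:
    """Return True if every '::'-separated segment is a valid segment."""
    # A leading '::' (top-scope marker) is tolerated and stripped first.
    if name.startswith("::"):
        name = name[2:]
    return all(_SEGMENT.fullmatch(segment) for segment in name.split("::"))


if __name__ == "__main__":
    # 'three_d2png::deploy' is a hypothetical rename, used only as an example.
    for candidate in ("3d2png::deploy", "thumbor::mediawiki", "three_d2png::deploy"):
        print(candidate, is_valid_puppet_class_name(candidate))
```

A check of this kind only covers the name itself. It says nothing about why the existing thumbor role compiled cleanly with the same class.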
[16:46:05] I'm shocked nothing in our pipeline catched this heh [16:46:52] * godog off [16:46:56] yeah, that's weird [16:47:27] godog: just checked /var/log/git-sync-upstream.log on deploymentprep-puppetmaster02, all good :) [16:47:59] elukey: neat, thanks! [16:51:58] <_joe_> classes with names starting with numbers will break with the future parser AFAIR [16:52:53] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608405 (10RobH) [16:53:05] <_joe_> but shouldn't work since at least puppet 3.x [16:53:09] 10Operations, 10Ops-Access-Requests: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3606599 (10RobH) I chatted with @tgr via IRC, and he is aware this will have to wait until Monday & he needs to sign the L3. It seems that he only needs pdfrender-admin, not a... [16:53:47] (03CR) 10Rush: [C: 04-1] "I fear this won't work the way you're thinking, let's talk it over on irc etc :)" [puppet] - 10https://gerrit.wikimedia.org/r/378052 (owner: 10Andrew Bogott) [16:55:37] (03CR) 10Rush: [V: 032 C: 032] "I believe ed2551 is fine so we'll see :)" [labs/private] - 10https://gerrit.wikimedia.org/r/378053 (owner: 10Volans) [16:57:15] (03PS1) 10RobH: add tgr to pdfrender-admin sudo group [puppet] - 10https://gerrit.wikimedia.org/r/378060 (https://phabricator.wikimedia.org/T175882) [16:58:02] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608426 (10RobH) [16:59:42] !log krinkle@tin Synchronized php-1.30.0-wmf.17/extensions/NavigationTiming: T104902 (duration: 01m 00s) [16:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:58] T104902: Refactor Navigation Timing gathering to produce reliable stackable measures - https://phabricator.wikimedia.org/T104902 [17:00:04] gwicke, cscott, arlolra, subbu, 
halfak, and Amir1: How many deployers does it take to do Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1700). [17:00:05] No patches in the queue for this window. Wheeee! [17:00:05] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.006 second response time [17:01:05] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.098 second response time [17:05:56] 10Operations, 10Ops-Access-Requests, 10Patch-For-Review: Requesting access to scb* and pdfrender-admin for tgr - https://phabricator.wikimedia.org/T175882#3608496 (10RobH) [17:08:03] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3608515 (10ssastry) >>! In T165520#3607289, @akosiaris wrote: > I think we are done. The new parsoid boxes are up and running. They are running Debian stretch and nodejs 6.11. They are not pooled and do n... [17:08:39] (03PS2) 10Gehel: Deploy discovery-analytics with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/377916 (https://phabricator.wikimedia.org/T129149) (owner: 10Thcipriani) [17:09:20] (03CR) 10Gehel: [C: 032] Deploy discovery-analytics with scap3 [puppet] - 10https://gerrit.wikimedia.org/r/377916 (https://phabricator.wikimedia.org/T129149) (owner: 10Thcipriani) [17:15:37] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: two switches have same serial in racktables - https://phabricator.wikimedia.org/T175737#3608528 (10Cmjohnson) @robh both asset tags are on the switch...it's been awhile so not sure how that happened. 1 switch, 2 asset tags. Delete the wmf4199 entry? 
[17:16:45] !log gehel@tin Started deploy [wikimedia/discovery/analytics@ab5d5c1]: moving discovery/analytics to scap3 [17:16:50] !log gehel@tin Finished deploy [wikimedia/discovery/analytics@ab5d5c1]: moving discovery/analytics to scap3 (duration: 00m 04s) [17:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:27] thcipriani: all looks good, thanks for the patches! [17:17:53] gehel: awesome, glad it works, thanks for the merges/deploy \o/ [17:18:10] thcipriani: you did the real work :) [17:18:12] !log eevans@tin Started restart [electron-render/deploy@8dd5f13]: (no justification provided) [17:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:35] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops: two switches have same serial in racktables - https://phabricator.wikimedia.org/T175737#3608540 (10RobH) 05Open>03Resolved I've deleted the wmf4199 and updated the notes entry for wmf4503. This will fix the issue, thanks for confirmation about the asset t... [17:26:56] win 4 [17:27:03] mutante: :) [17:27:14] :) [17:27:48] "You are about to start a line with 'win', are you sure?" 
[17:29:35] (03PS2) 10Andrew Bogott: designate: open api to labmon [puppet] - 10https://gerrit.wikimedia.org/r/378052 [17:30:28] (03PS4) 10Ottomata: Stopping event collection for Page events [puppet] - 10https://gerrit.wikimedia.org/r/377667 (https://phabricator.wikimedia.org/T171629) (owner: 10Nuria) [17:30:32] (03CR) 10Rush: [C: 031] "this will work" [puppet] - 10https://gerrit.wikimedia.org/r/378052 (owner: 10Andrew Bogott) [17:30:34] (03CR) 10Ottomata: [V: 032 C: 032] Stopping event collection for Page events [puppet] - 10https://gerrit.wikimedia.org/r/377667 (https://phabricator.wikimedia.org/T171629) (owner: 10Nuria) [17:30:47] (03PS3) 10Rush: designate: open api to labmon [puppet] - 10https://gerrit.wikimedia.org/r/378052 (owner: 10Andrew Bogott) [17:30:49] mutante: maybe have clippy pop up and ask? [17:30:53] an ascii clippy [17:31:40] hehee:) [17:32:53] (03CR) 10Andrew Bogott: [C: 032] designate: open api to labmon [puppet] - 10https://gerrit.wikimedia.org/r/378052 (owner: 10Andrew Bogott) [17:33:51] 10Operations, 10Wikimedia-Logstash, 10Discovery-Search (Current work): decommission logstash100[1-3] - https://phabricator.wikimedia.org/T175830#3608617 (10debt) p:05Triage>03Normal [17:34:19] 10Operations, 10monitoring, 10Discovery-Search (Current work): port elasticsearch diamond collector to prometheus - https://phabricator.wikimedia.org/T175799#3608619 (10debt) p:05Triage>03Normal a:03Gehel [17:35:25] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2073609 [17:39:57] urandom: i miss clippy :( [17:48:37] Niharika: i created a task for the issue we had with the task you mentioned earlier see https://phabricator.wikimedia.org/T175942 [17:49:17] Zppix: Thanks. [17:49:23] np [17:55:53] there's a bastion3002? 
huh [17:56:05] yes, because bast3001 is no more [17:56:36] !log cp1074 backend restart, mailbox lag [17:56:36] 1001,2001,3002,4001 :) [17:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:09] heh [17:57:13] !log cp2001 backend restart, mailbox lag [17:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:42] yeah something seems a little off about our naming policy in that sense, but I don't really have a universally-better suggestion, either [17:57:46] hey tech peoples, can I do a server-intensive rename? [17:57:50] akosiaris: hi, arlolra and I are seeing that the new parsoid servers appear to be getting actual live traffic, but on the ticket you wrote that they're not pooled yet [17:57:52] (90,000 edits) [17:58:01] moving from names like krypton to bast3001 is functional naming [17:58:36] but then we still consider bast3001 to map to a specific physical machine. when it's retired and replaced, the replacement is bast3002, so that all past records we have of racking, hardware issues, faults, etc doesn't get confusing in the transition [17:58:56] bblack: more service dns entries, like bastion-eqiad, bastion-ams, etc [17:58:59] ? [17:59:03] yea, that is the reason for not reinstalling a different physical box with the same name [17:59:18] greg-g: perhaps! [17:59:33] the drawback of serivce entries i thought is the entire fingerprint of bastion changing, etc... [17:59:40] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3268874 (10Legoktm) >>! In T165520#3607289, @akosiaris wrote: > I think we are done. The new parsoid boxes are up and running. They are running Debian stretch and nodejs 6.11. They are not pooled and do n... [17:59:40] or we could leave the asset tags in as alternate hostnames as well, and always use those in the tickets and hw info [17:59:41] unless we migrate the fingerprint between whatever is active. 
[17:59:42] then it would be additional, like "currently bastion-ams" is bast3002, yea [17:59:58] (03PS1) 10Gehel: admin: adding Erika Bjune as an ldap only user [puppet] - 10https://gerrit.wikimedia.org/r/378071 (https://phabricator.wikimedia.org/T175945) [18:00:01] i dunno, if you are getting shell shouldnt you be able to figure out when a bastion is offline? [18:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do Morning SWAT (Max 8 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1800). [18:00:05] dcausse: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. [18:00:16] o/ [18:00:19] e.g. bast1001 is always the name of the bastion in eqiad (since we only need 1x of that type in its "cluster"), but it's also wmf9876.eqiad.wmnet, and we use that name more than we do today to identify anything hardware-specific. [18:00:45] (03CR) 10RobH: [C: 032] admin: adding Erika Bjune as an ldap only user [puppet] - 10https://gerrit.wikimedia.org/r/378071 (https://phabricator.wikimedia.org/T175945) (owner: 10Gehel) [18:00:57] in that case it doesnt need "1001" and can just be bastion-esams.wm.org [18:01:01] but then those are so unmemorable. 
nobody knows quickly wtf things are about when a phab ticket title flies by saying "raid failure on wmf9876" [18:01:31] mutante: well except the 1001 is part of the broader standard of functional_cluster_name+series_number.dcname.wmnet [18:01:40] I can SWAT [18:01:46] \o/ [18:02:21] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) (owner: 10EBernhardson) [18:02:22] but if it's always the same name, there is no series anymore [18:02:24] the confusing bit is that "series_number" above is overloaded with two different senses: serial host replacement over the time domain, and series of live parallel hosts in a multi-host cluster [18:02:57] you could break up the overloading by having generation counters in the number-space I guess, but it doesn't fix any practical problem [18:03:00] ah, yea [18:03:04] bast3001 replaced by bast 3101 or whatever [18:03:26] !log disabled oauth validation on metawiki/SUL for Teles after verification [18:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:19] (03Merged) 10jenkins-bot: Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) (owner: 10EBernhardson) [18:05:06] (03CR) 10Jforrester: Use RemexHtml instead of Tidy on mediawikiwiki, testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (owner: 10Tim Starling) [18:05:25] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [18:05:32] dcausse: change is live on mwdebug1002, if there's anything to test there [18:05:38] bast-esams 1H CNAME bast3002 - so would you think that helps? 
[18:05:42] thcipriani: sure, testing [18:06:14] (03CR) 10jenkins-bot: Configure enwiki to use CirrusSearch MLR by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377776 (https://phabricator.wikimedia.org/T175772) (owner: 10EBernhardson) [18:08:26] PROBLEM - glance-api http on labcontrol1001 is CRITICAL: connect to address 208.80.154.92 and port 9292: Connection refused [18:08:46] thcipriani: looks good to me [18:08:53] dcausse: ok, going live [18:09:22] andrewbogott: glance-api issues^ [18:09:29] mayb that fw rule did something weird? [18:09:35] chasemp: that's me, I'll have it back in a second [18:10:26] RECOVERY - glance-api http on labcontrol1001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 817 bytes in 0.003 second response time [18:10:39] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:377776|Configure enwiki to use CirrusSearch MLR by default]] T175772 (duration: 00m 50s) [18:10:45] ^ dcausse live everywhere [18:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:51] T175772: Deploy MLR as default content search to enwiki - https://phabricator.wikimedia.org/T175772 [18:10:55] thcipriani: thanks! [18:11:00] yw :) [18:11:11] will monitor the cluster to see how it goes [18:11:51] okie doke [18:13:05] (03PS1) 10Legoktm: Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 [18:13:56] (03PS2) 10Legoktm: Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) [18:14:53] hi again, is it possible for me to do a rename of a user with 90,000 edits now? :-) [18:17:00] (03CR) 10Arlolra: [C: 031] Whitelist wtp10[25-48] for Linter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378073 (https://phabricator.wikimedia.org/T165520) (owner: 10Legoktm) [18:17:32] ajr: that's a lot and our DBAs are not currently able to help watch this. 
Can you file a task in the #DBA project on phabricator? [18:17:44] sure :-) [18:17:51] thanks! [18:18:44] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3608782 (10Arlolra) Currently, ``` arlolra@tin:/srv/deployment/parsoid/deploy$ confctl select dc=.*,cluster=parsoid,service=parsoid get {"wtp2011.codfw.wmnet": {"pooled": "yes", "weight": 10}, "tags": "d... [18:19:02] musikanimal, your script is at rev_id 70M on commons and is running super slow [18:28:12] git status on tin:/srv/mediawiki is spewing 400,000 lines of uncommitted changes at me :( [18:28:41] tgr, not mediawiki-staging? [18:29:31] !log Phabricator: Deploying hotfix for T175942 [18:29:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:47] T175942: Unhandled Exception ("AphrontParameterQueryException") when viewing T174362 - https://phabricator.wikimedia.org/T175942 [18:31:16] MaxSem: yeah, I ended up there by mistake, but still, is that normal? [18:31:33] it shouldn't even have .git :P [18:31:35] or I guess why is that even a git repo? [18:35:34] thcipriani: are you done with SWAT? [18:35:43] tgr: yes [18:36:08] I'll SWAT a private patch then [18:37:50] MaxSem: yeah it's getting slower and slower, I think maybe it's slowly running out of memory maybe? [18:38:05] we noticed similar things with the popular pages script, which has a big giant loop like this one [18:38:51] maybe we should kill it and try the new updated script [18:40:37] no, it's a filesort [18:42:02] MaxSem: tgr https://phabricator.wikimedia.org/rMSCA08af4b9da9763e68588cfeba6c40fb5949a0dc1d [18:43:32] greg-g: I see, but is it OK then that the changes apparently never get committed? 
[18:51:45] (03CR) 10Jforrester: [C: 031] Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (owner: 10Tim Starling) [18:53:35] RECOVERY - salt-minion processes on labtestvirt2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:56:35] PROBLEM - salt-minion processes on labtestvirt2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [18:57:04] !log tgr@tin Synchronized private/PrivateSettings.php: set Electron secret for T175868 (duration: 00m 49s) [18:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:19] T175868: Deploy and test new book rendering (Remex + Electron) - https://phabricator.wikimedia.org/T175868 [18:58:17] !log tgr@tin Synchronized wmf-config/PrivateSettings.php: set Electron secret for T175868 (duration: 00m 48s) [18:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:05] no_justification: Dear deployers, time to do the MediaWiki train deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T1900). [19:00:07] No patches in the queue for this window. Wheeee! 
[19:00:17] (03PS2) 10Subramanya Sastry: Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (https://phabricator.wikimedia.org/T175095) (owner: 10Tim Starling) [19:00:24] Hah [19:00:34] Niharika: Ok I love the new bot snark [19:00:37] It /knows/ me [19:01:56] (03PS1) 10Chad: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378080 [19:02:20] (03CR) 10Chad: [C: 032] group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378080 (owner: 10Chad) [19:04:47] (03Merged) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378080 (owner: 10Chad) [19:06:32] (03CR) 10jenkins-bot: group2 to wmf.18 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378080 (owner: 10Chad) [19:07:22] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.18 [19:07:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:24] tgr: epic lag (I switched locations and my message didn't get through before) /me shrugs (and a let down in content) [19:27:25] legoktm: I just saw https://tools.wmflabs.org/sal/log/AV5_xUbbwg13V6285mnf [19:27:34] er wrong link [19:27:45] I meant https://phabricator.wikimedia.org/T165520#3608737 [19:27:50] related though. [19:28:13] my fault for not double checking but I those hosts should not have been pooled [19:28:23] ah ok :) [19:28:24] either something/someone pooled them or we have a bug somewhere [19:28:29] arlo has depooled them for now [19:28:42] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3608954 (10akosiaris) This is really peculiar then and the fault is on me for not double checking the state. That being said the default for parsoid is to have to explicitly pool a node[1]. And as you can... [19:28:50] I 've commented on the task already. 
Given the config I did expect them to be depooled [19:30:12] no worries then :) [19:30:24] actually I am a bit worried [19:30:32] how did this happen... and can it happen again ? [19:30:41] probably yes.. I 'd like to get to the bottom of it [19:30:56] I 'll talk with _joe_ tomorrow about how to dig further into this [19:31:03] ok [19:31:18] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, 10Traffic: Make maps active / active - https://phabricator.wikimedia.org/T162362#3608957 (10Pnorman) I used a SSH tunnel to check maps2001.codfw.wmnet and it's serving tiles fine. One problem I noticed is that it is at least two months out of date on what... [19:31:42] !log demon@tin Synchronized php-1.30.0-wmf.17/skins/MinervaNeue/resources/skins.minerva.icons.images.scripts/userNormal.svg: fix svg syntax (duration: 00m 45s) [19:31:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:54] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/7891/" [puppet] - 10https://gerrit.wikimedia.org/r/377986 (https://phabricator.wikimedia.org/T175864) (owner: 10Hashar) [19:36:22] !log demon@tin Synchronized php-1.30.0-wmf.18/skins/MinervaNeue/resources/skins.minerva.icons.images.scripts/userNormal.svg: fix svg syntax (duration: 00m 46s) [19:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:50] (03CR) 10Dzahn: [C: 031] "eh.. i thought "tonight is the night" when i looked at the deploy calendar. was i wrong?" 
[puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [19:40:32] (03CR) 10Dzahn: [C: 031] "did you mean "tonights deployment" = mediawiki, or = phab deploy" [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [19:41:51] (03CR) 10Jforrester: [C: 031] Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (https://phabricator.wikimedia.org/T175095) (owner: 10Tim Starling) [19:50:43] no_justification: It's part of my scheme to make jouncebot evil. [19:51:02] evil-plans.txt > jouncebot [19:53:34] (03CR) 10Chad: "Per IRC: let's do this in a few stages, seems safer. Let's swap CommonSettings but leave the symlink. Then see what else we can find (pupp" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376762 (owner: 10Chad) [19:58:27] !log banning elastic1020 to see if T175951 is caused by mixed versions of the ltr plugin [19:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:40] T175951: Search backend error during full_text search for 'QUERY_SRTING' after 39: i_o_exception: Can't read unknown type [50] - https://phabricator.wikimedia.org/T175951 [20:02:17] !log ppchelko@tin Started deploy [cpjobqueue/deploy@af9b590]: Emit broker statistics more frequentl [20:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:47] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@af9b590]: Emit broker statistics more frequentl (duration: 00m 30s) [20:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:35] !log demon@tin Synchronized php-1.30.0-wmf.18/extensions/OAuth/backend/MWOAuthDAO.php: I785230df (duration: 00m 46s) [20:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:21] 10Operations, 10Ops-Access-Requests: Requesting access to Stat1005 for 
zhousquared - https://phabricator.wikimedia.org/T175959#3609129 (10ZhouZ) [20:16:14] (03PS1) 10Eevans: WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled [puppet] - 10https://gerrit.wikimedia.org/r/378100 (https://phabricator.wikimedia.org/T171772) [20:18:25] !log demon@tin Synchronized php-1.30.0-wmf.18/includes/api/ApiFeedWatchlist.php: Ibea5bd88 (duration: 00m 46s) [20:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:31] !log demon@tin Synchronized php-1.30.0-wmf.18/skins/MinervaNeue/resources/skins.minerva.userpage.icons/userpage.svg: I95d76339 (duration: 00m 45s) [20:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:24] (03PS2) 10Chad: Just include PrivateSettings.php directly [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376762 [20:49:50] (03PS1) 10MaxSem: ACTRIAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378104 [20:50:08] (03PS2) 10Eevans: WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled [puppet] - 10https://gerrit.wikimedia.org/r/378100 (https://phabricator.wikimedia.org/T171772) [20:51:35] (03PS2) 10MaxSem: ACTRIAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378104 (https://phabricator.wikimedia.org/T175963) [20:58:26] (03PS3) 10Eevans: WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled [puppet] - 10https://gerrit.wikimedia.org/r/378100 (https://phabricator.wikimedia.org/T171772) [20:58:56] (03CR) 10jerkins-bot: [V: 04-1] WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled [puppet] - 10https://gerrit.wikimedia.org/r/378100 (https://phabricator.wikimedia.org/T171772) (owner: 10Eevans) [21:00:04] MaxSem and MusikAnimal: How many deployers does it take to do CommTech deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T2100). [21:00:04] No patches in the queue for this window. Wheeee! [21:00:36] no_justification, are we free to go? 
[21:00:43] Yeah, train's all done [21:00:46] Things pretty quiet [21:00:51] thanks [21:01:12] (03PS4) 10Eevans: WIP: Disable cassandra-metrics-collector when Prometheus agent is enabled [puppet] - 10https://gerrit.wikimedia.org/r/378100 (https://phabricator.wikimedia.org/T171772) [21:01:15] Been a few spikes of replag this afternoon, just keep an eye while populating that table [21:01:21] Probably won't be an issue [21:02:32] (03PS1) 10Madhuvishy: toollabs: Add shinken check for tools-mail exim queue length [puppet] - 10https://gerrit.wikimedia.org/r/378105 (https://phabricator.wikimedia.org/T96898) [21:02:33] that population is bounded by reads, so replication isn't really an issue [21:02:42] (03CR) 10Niharika29: [C: 032] ACTRIAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378104 (https://phabricator.wikimedia.org/T175963) (owner: 10MaxSem) [21:03:06] (03CR) 10Chad: "So it looks like basically everything else already uses the other file. Beta's setup identical to prod here too. But I'm really freaking p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/376762 (owner: 10Chad) [21:03:14] (03CR) 10Madhuvishy: "The warn/crit thresholds are arbitrary, pending discussion. Don't merge" [puppet] - 10https://gerrit.wikimedia.org/r/378105 (https://phabricator.wikimedia.org/T96898) (owner: 10Madhuvishy) [21:04:28] (03Merged) 10jenkins-bot: ACTRIAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378104 (https://phabricator.wikimedia.org/T175963) (owner: 10MaxSem) [21:05:16] how do you get jouncebot to talk like that? [21:05:41] by committing code to Gerrit:P [21:05:41] musikanimal: Make it more intelligent. :P [21:06:06] I'm so confused [21:06:15] (03CR) 10jenkins-bot: ACTRIAL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378104 (https://phabricator.wikimedia.org/T175963) (owner: 10MaxSem) [21:06:21] Why? [21:06:21] MaxSem: so we are NOT doing populateIpChanges on this group, correct? 
[21:06:38] yeah, group1 is still in the works [21:07:29] Niharika: where is the gerrit code or whatever? or are you guys messing with me (it's OK, I'm gullible, take advantage of it :) [21:08:13] musikanimal: It is on Gerrit. :) https://github.com/wikimedia/wikimedia-bots-jouncebot [21:09:59] meanwhile, we're on mwdebug1002 [21:10:34] I see [21:12:38] MaxSem: https://gerrit.wikimedia.org/r/#/c/377926/ [21:15:14] MaxSem: You want to pull https://gerrit.wikimedia.org/r/#/c/378125/1 or shall I? [21:15:37] I'll do it [21:15:50] better not have 2 people doing deployment [21:16:08] (Y) [21:16:27] I believe that is the only remaining patch for ACW. [21:22:53] !log maxsem@tin Synchronized php-1.30.0-wmf.18/extensions/ArticleCreationWorkflow/: https://gerrit.wikimedia.org/r/#/c/378125/ (duration: 00m 47s) [21:23:02] Niharika, ^ [21:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:29] MaxSem: Why didn't we test eventlogging on mwdebug1002 first? [21:24:10] because it's not live on anything? [21:24:47] Okay. You're doing that now? [21:24:54] I thought you'd sync it together. [21:25:13] Oh nvm. Config patch. [21:31:33] MaxSem: Hello? [21:31:56] waiting for confirmation from you EL is working [21:32:41] MaxSem: When did you ever tell me it's live anywhere? 
[21:33:02] ewgh [21:33:12] communication is hard [21:37:49] (03PS1) 10Andrew Bogott: fullstack: add a 'success' stat [puppet] - 10https://gerrit.wikimedia.org/r/378175 [21:39:04] (03PS2) 10Andrew Bogott: fullstack: add a 'success' stat [puppet] - 10https://gerrit.wikimedia.org/r/378175 [22:08:09] (03CR) 10Chad: Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad) [22:08:11] (03PS2) 10Rush: toollabs: Add shinken check for tools-mail exim queue length [puppet] - 10https://gerrit.wikimedia.org/r/378105 (https://phabricator.wikimedia.org/T96898) (owner: 10Madhuvishy) [22:09:08] (03PS1) 10Catrope: Give sysops the flow-create-board right on all wikis with Flow in general use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378183 (https://phabricator.wikimedia.org/T175934) [22:10:50] (03PS3) 10Rush: toollabs: Add shinken check for tools-mail exim queue length [puppet] - 10https://gerrit.wikimedia.org/r/378105 (https://phabricator.wikimedia.org/T96898) (owner: 10Madhuvishy) [22:11:58] kaldari, Niharika: I propose to revert the EL change as it doesn't work and deploy the config [22:12:14] But we're so close to fixing it! [22:12:15] we might need time for more debugging after we go live [22:12:25] are we? 
I didn't notice [22:12:29] ;) [22:15:59] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: ACTRIAL going live: https://gerrit.wikimedia.org/r/#/c/378104/ (duration: 00m 46s) [22:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:38] (03CR) 10Rush: [C: 032] toollabs: Add shinken check for tools-mail exim queue length [puppet] - 10https://gerrit.wikimedia.org/r/378105 (https://phabricator.wikimedia.org/T96898) (owner: 10Madhuvishy) [22:17:41] (03PS1) 10Catrope: Remove $wgStructuredChangeFiltersEnableLiveUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378184 [22:18:16] chasemp: you may have missed the comment on top of the check :) [22:18:23] (03CR) 10Catrope: [C: 032] Remove $wgStructuredChangeFiltersEnableLiveUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378184 (owner: 10Catrope) [22:18:25] crap [22:18:37] why you do that to me :D [22:18:52] chasemp: i was just about to tell you and I saw the merge go through! 
[22:19:12] nobody's fault but mine [22:19:32] I'm not fast enough clearly :D [22:19:55] (03Merged) 10jenkins-bot: Remove $wgStructuredChangeFiltersEnableLiveUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378184 (owner: 10Catrope) [22:24:46] (03CR) 10Thcipriani: Scap3: Go ahead and `scap deploy --init` a freshly provisioned repo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/377304 (owner: 10Chad) [22:25:01] (03PS1) 10Rush: toolforge: adjust comment on tools-mail queue check [puppet] - 10https://gerrit.wikimedia.org/r/378186 (https://phabricator.wikimedia.org/T96898) [22:25:41] (03CR) 10Rush: [C: 032] toolforge: adjust comment on tools-mail queue check [puppet] - 10https://gerrit.wikimedia.org/r/378186 (https://phabricator.wikimedia.org/T96898) (owner: 10Rush) [22:30:02] (03CR) 10GWicke: [C: 031] Add VirtualRestService config for Electron [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [22:31:54] (03CR) 10Rush: fullstack: add a 'success' stat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378175 (owner: 10Andrew Bogott) [22:32:54] (03CR) 10jenkins-bot: Remove $wgStructuredChangeFiltersEnableLiveUpdate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378184 (owner: 10Catrope) [22:50:19] RoanKattouw, are you deploying that change? 
[22:50:31] MaxSem: It's labs-only [22:50:37] cool [22:50:40] I'll git pull it during the SWAT if that's OK [22:51:22] (03PS1) 10MaxSem: [labs] Override prod settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378187 [22:51:59] (03CR) 10MaxSem: [C: 032] [labs] Override prod settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378187 (owner: 10MaxSem) [22:52:25] Imma pulling it with my change [22:53:27] (03Merged) 10jenkins-bot: [labs] Override prod settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378187 (owner: 10MaxSem) [22:53:41] (03CR) 10jenkins-bot: [labs] Override prod settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378187 (owner: 10MaxSem) [22:55:21] !log maxsem@tin Synchronized wmf-config/: Labs only: https://gerrit.wikimedia.org/r/378184 https://gerrit.wikimedia.org/r/378187 (duration: 00m 47s) [22:55:26] RoanKattouw, ^ [22:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:57] Thanks. sorry about that [22:56:02] !log dzahn@neodymium conftool action : set/pooled=no; selector: name=acamar.* [22:56:08] FFS, the ghost of our dark past strikes back: "3 svn: command not found" :P [22:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:56:35] !log depooled acamar (codfw dns recursor) for BIOS upgrade - installing firmware [22:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:20] lolwut [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170914T2300). [23:00:05] RoanKattouw, James_F, and tgr: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. 
Note: If you break the wikis, you will be rewarded with a sticker. [23:00:17] I'll do the SWAT [23:00:23] If MaxSem is done that is [23:00:26] (03PS1) 10Dzahn: site/dns: tmp remove acamar from resolv.conf overrides [puppet] - 10https://gerrit.wikimedia.org/r/378188 (https://phabricator.wikimedia.org/T162850) [23:00:28] Hey. [23:01:06] me done, RoanKattouw [23:01:38] (03CR) 10Dzahn: [C: 032] site/dns: tmp remove acamar from resolv.conf overrides [puppet] - 10https://gerrit.wikimedia.org/r/378188 (https://phabricator.wikimedia.org/T162850) (owner: 10Dzahn) [23:02:36] (03PS1) 10Dzahn: Revert "site/dns: tmp remove acamar from resolv.conf overrides" [puppet] - 10https://gerrit.wikimedia.org/r/378189 [23:02:48] Cool [23:03:17] (03PS5) 10Catrope: Follow-up 6d62e9ea8a. Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [23:03:36] (03CR) 10Catrope: [C: 032] Follow-up 6d62e9ea8a. Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [23:04:03] (03CR) 1020after4: [C: 031] "I meant phab deployment. I managed it without this patch, with a little extra effort." [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:04:52] (03CR) 1020after4: [C: 031] "To be clear: this should be harmless and ready to merge at any time." [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:04:54] jouncebot: next [23:04:54] In 85 hour(s) and 55 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170918T1300) [23:05:10] twentyafterfour: so i was wrong thinking phab deploy is like in an hour? [23:05:13] (03Merged) 10jenkins-bot: Follow-up 6d62e9ea8a. 
Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [23:05:18] sorry for that [23:05:31] mutante: I think it was yesterday, wasn't it? [23:05:39] Or at least Phab went down for maintenance 23h ago [23:05:42] i got it wrong then.. ok [23:07:35] mutante it [23:07:42] it's always on Wednesday nights [23:07:45] :) [23:07:52] my bad [23:07:54] thursday early morning my time [23:08:58] (03CR) 10jenkins-bot: Follow-up 6d62e9ea8a. Also allow crats to remove accountcreator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/373333 (https://phabricator.wikimedia.org/T175903) (owner: 10Brian Wolff) [23:09:35] James_F: Your accountcreator patch is on mwdebug1002, please test [23:09:58] !log acamar - done with upgrade - rebooting - it's depooled and removed from resolv.conf - T162850 [23:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:10:12] T162850: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850 [23:10:14] (03PS3) 10Catrope: Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (https://phabricator.wikimedia.org/T175095) (owner: 10Tim Starling) [23:10:17] (03CR) 10Catrope: [C: 032] Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (https://phabricator.wikimedia.org/T175095) (owner: 10Tim Starling) [23:10:33] RoanKattouw: One sec, need to use my other account. [23:11:39] RoanKattouw: Yup, LGTM. 
[23:11:47] (03Merged) 10jenkins-bot: Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (https://phabricator.wikimedia.org/T175095) (owner: 10Tim Starling) [23:11:59] (03CR) 10jenkins-bot: Use RemexHtml instead of Tidy on mediawikiwiki, testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377943 (https://phabricator.wikimedia.org/T175095) (owner: 10Tim Starling) [23:12:42] !log acamar - staged BIOS upgrade in progress [23:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:24] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Allow bureaucrats to remove accountcreator on mediawikiwiki (T175903) (duration: 00m 46s) [23:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:36] T175903: Let bureaucrats on mediawiki.org remove 'accountcreator' (they can already add it) - https://phabricator.wikimedia.org/T175903 [23:14:51] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3609476 (10Dzahn) 14:53 < bblack> 1) Explicitly depool acamar from codfw recdns (you can confirm it in logs and ipvsadm -Ln output on lvs2002, should be the active LVS for it) 14:54 < bblack> 2)... [23:15:41] James_F: Synced, and next on deck is Remex [23:15:45] Now on mwdebug1002, please test [23:16:40] RoanKattouw: Yup, LGTM. 
[23:18:01] !log acamar is back up and running - no failed units [23:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:20] !log dzahn@neodymium conftool action : set/pooled=yes; selector: name=acamar.wikimedia.org [23:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:26] (03CR) 10Dzahn: [C: 032] "reboot is over and server is pooled - can be used again" [puppet] - 10https://gerrit.wikimedia.org/r/378189 (owner: 10Dzahn) [23:22:02] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Use RemexHtml on mediawikiwiki and testwiki (T175095) (duration: 00m 46s) [23:22:08] (03PS2) 10Catrope: Give sysops the flow-create-board right on all wikis with Flow in general use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378183 (https://phabricator.wikimedia.org/T175934) [23:22:11] (03CR) 10Catrope: [C: 032] Give sysops the flow-create-board right on all wikis with Flow in general use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378183 (https://phabricator.wikimedia.org/T175934) (owner: 10Catrope) [23:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:19] T175095: Enable RemexHTML on mediawiki.org and testwiki - https://phabricator.wikimedia.org/T175095 [23:23:32] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3609493 (10Dzahn) [23:24:07] 10Operations, 10Patch-For-Review: CPU throttling on DELL PowerEdge R320 - https://phabricator.wikimedia.org/T162850#3177068 (10Dzahn) Acamar is done. It's pooled again, the resolv.conf change is reverted. I saw no issues, no failed services/units after reboot. 
[23:24:55] (03Merged) 10jenkins-bot: Give sysops the flow-create-board right on all wikis with Flow in general use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378183 (https://phabricator.wikimedia.org/T175934) (owner: 10Catrope) [23:25:54] (03CR) 10jenkins-bot: Give sysops the flow-create-board right on all wikis with Flow in general use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378183 (https://phabricator.wikimedia.org/T175934) (owner: 10Catrope) [23:27:27] tgr: You around for your SWAT patch? [23:27:35] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Give sysops the flow-create-board right on Flow wikis (T175934) (duration: 00m 45s) [23:27:43] (VRS config for Electron) [23:27:44] RoanKattouw: here [23:27:47] Cool [23:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:51] T175934: Give the right to sysops to create, move and delete Structured Discussions boards on wikis where they are available as a Beta feature - https://phabricator.wikimedia.org/T175934 [23:28:02] (03PS4) 10Catrope: Add VirtualRestService config for Electron [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [23:28:19] (03CR) 10Catrope: [C: 032] Add VirtualRestService config for Electron [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [23:28:37] it should be a noop in prod [23:29:16] !log catrope@tin Synchronized wmf-config/CommonSettings.php: Give sysops the flow-create-board right on Flow wikis (T175934) (duration: 00m 45s) [23:29:20] (03PS7) 10Dzahn: PHAB: deployment scripts to be called by scap [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:48] (03Merged) 10jenkins-bot: Add VirtualRestService config for 
Electron [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [23:30:00] James_F: Your VE change is on mwdebug1002, please test [23:30:00] (03CR) 10jenkins-bot: Add VirtualRestService config for Electron [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377928 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [23:30:10] tgr: And your VRS change too but there might not be much testing you can do [23:32:08] RoanKattouw: Yup, LGTM. [23:32:19] (03CR) 10Dzahn: [C: 032] PHAB: deployment scripts to be called by scap [puppet] - 10https://gerrit.wikimedia.org/r/370622 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:32:28] RoanKattouw: site's still up and I see the config in eval.php, that's good enough for me :) [23:32:53] OK, will sync both after my CX sync finished [23:32:54] *finishes [23:33:01] !log catrope@tin Synchronized php-1.30.0-wmf.18/extensions/ContentTranslation/modules/widgets/translator/ext.cx.translator.js: Fix JS error in CX (duration: 00m 46s) [23:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:23] !log catrope@tin Synchronized php-1.30.0-wmf.18/extensions/VisualEditor/lib/ve/src/ui/styles/ve.ui.TableLineContext.css: T169389 (duration: 00m 45s) [23:34:32] 10Operations, 10Ops-Access-Requests: Lost access to x1-analytics-slave - https://phabricator.wikimedia.org/T175970#3609511 (10Etonkovidova) [23:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:36] T169389: [Regression pre-wmf.8] Table context menu is appearing over table cell menu - https://phabricator.wikimedia.org/T169389 [23:37:09] (03CR) 10Dzahn: [C: 032] Move phabricator conf files outside of source tree [puppet] - 10https://gerrit.wikimedia.org/r/374054 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:37:27] (03PS5) 10Dzahn: Move phabricator conf files outside of source tree [puppet] - 
10https://gerrit.wikimedia.org/r/374054 (https://phabricator.wikimedia.org/T172847) (owner: 1020after4) [23:37:45] !log catrope@tin Synchronized wmf-config/: VRS config for Electron (T175868) (duration: 00m 47s) [23:37:51] Alright, all done [23:37:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:00] T175868: Deploy and test new book rendering (Remex + Electron) - https://phabricator.wikimedia.org/T175868 [23:40:19] (03CR) 10Dzahn: "still WIP?" [puppet] - 10https://gerrit.wikimedia.org/r/324808 (owner: 1020after4) [23:41:55] (03CR) 10Dzahn: [C: 031] "+1 because the linked ticket is closed as resolved and says this only affects beta and has been cherry-picked on beta puppetmaster a long " [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [23:42:54] (03PS4) 10Dzahn: Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [23:43:35] (03CR) 10Dzahn: [C: 032] Change $deploy_user home directory to /var/lib/${deploy_user} [puppet] - 10https://gerrit.wikimedia.org/r/365891 (https://phabricator.wikimedia.org/T166013) (owner: 1020after4) [23:45:48] 10Operations, 10ops-eqiad, 10DC-Ops: scs-c1-eqiad unresponsive - https://phabricator.wikimedia.org/T175625#3609558 (10RobH) a:05RobH>03Cmjohnson > Rey Lamuri, Sep 14, 17:13 MDT: > Hi Rob, > > The recovery process for your CM4148 is here https://opengear.zendesk.com/hc/en-us/articles/216376223-Firmware-r... 
[23:51:03] !log phabricator - tested gerrit 354247 before merging using apache-fast-test from tin -restarted apache [23:51:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:20] (03PS9) 10Dzahn: Phabricator: Redirect all http traffic to https [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: 10Paladox) [23:51:59] (03CR) 10Dzahn: [C: 032] "tested a couple URLs and the redirects below these rules using apache-fast-test from tin - looks good, works" [puppet] - 10https://gerrit.wikimedia.org/r/354247 (https://phabricator.wikimedia.org/T165643) (owner: 10Paladox)