[00:00:04] twentyafterfour: It is that lovely time of the day again! You are hereby commanded to deploy Phabricator update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T0000). [00:00:04] No patches in the queue for this window. Wheeee! [00:02:29] !log catrope@tin Synchronized php-1.30.0-wmf.19/resources/src/mediawiki.rcfilters/: Lazy-load the RCFilters menu (T176250) (duration: 00m 48s) [00:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:02:44] T176250: Slowdown due to new filters - https://phabricator.wikimedia.org/T176250 [00:05:21] RoanKattouw: Sorry, was distracted [00:05:24] RoanKattouw: here now [00:05:40] Krinkle: Your jQuery 3 on Commons change is on mwdebug1002, please test [00:06:50] RoanKattouw: thx, tested. LGTM [00:08:20] OK, syncing [00:08:40] small problem with my patch, i have a 1 line update coming in that will fix it (misspelled schema name...) [00:08:47] i can ship it myself if you're done [00:09:06] OK [00:09:08] Yeah go ahead [00:09:09] I'm done [00:09:13] kk [00:09:18] And I need to head out soon [00:09:25] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable jQuery 3 on commons (T124742) (duration: 00m 49s) [00:09:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:09:40] T124742: Upgrade to jQuery 3 - https://phabricator.wikimedia.org/T124742 [00:18:58] !log ebernhardson@tin Synchronized php-1.30.0-wmf.19/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: Fix schema name for search satisfaction error logging (duration: 00m 49s) [00:19:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:23:36] !log ebernhardson@tin Synchronized php-1.30.0-wmf.18/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: Fix schema name for search satisfaction error logging (duration: 00m 53s) [00:23:52] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:34:51] Operations, MediaWiki-Platform-Team, TechCom-RfC, HHVM, NewPHP: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3623338 (MaxSem)
[00:44:29] Operations, MediaWiki-Platform-Team, Performance-Team, HHVM: Convert Wikimedia production HHVM instances to have hhvm.php7.all set true - https://phabricator.wikimedia.org/T173786#3623343 (tstarling) Open>declined I don't think it's a duplicate, we could theoretically do both. But like Ma...
[00:45:47] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:47:33] Operations, HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3623349 (Legoktm)
[00:47:38] Operations, MediaWiki-Platform-Team, TechCom-RfC, HHVM, NewPHP: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#3623348 (Legoktm)
[01:11:52] Operations, HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3623400 (Legoktm)
[01:13:36] Operations, HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3561778 (Legoktm) I updated the steps based on the plan to use PHP 7 instead of HHVM.
[01:14:17] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[01:16:48] Operations, HHVM: Upload hhvm to stretch apt repo in apt.wikimedia.org - https://phabricator.wikimedia.org/T167225#3623428 (Legoktm) I think this can be declined now given the plans to use PHP 7?
[01:50:37] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational
[01:53:37] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[02:29:09] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.18) (duration: 08m 50s) [02:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:06:05] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.19) (duration: 14m 44s) [03:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:12:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu Sep 21 03:12:41 UTC 2017 (duration 6m 36s) [03:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:48] PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) timed out before a response was received: /_info (test for /_info) timed out before a response was received [03:19:00] PROBLEM - LVS HTTPS IPv6 on upload-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:19:01] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp4026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:19:37] RECOVERY - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy [03:19:57] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp4026 is OK: HTTP OK: HTTP/1.1 200 OK - 458 bytes in 0.157 second response time [03:19:57] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [03:20:07] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [03:20:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [03:21:43] looking [03:22:50] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:23:10] RECOVERY - LVS HTTPS IPv6 on 
upload-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 908 bytes in 5.451 second response time [03:24:48] !log depooled cp4026 [03:25:00] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 895 bytes in 7.330 second response time [03:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:35:52] 10Operations, 10Traffic: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623529 (10BBlack) [03:37:30] PROBLEM - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:38:29] RECOVERY - LVS HTTPS IPv4 on upload-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 896 bytes in 4.710 second response time [03:39:27] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 21 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [03:42:57] PROBLEM - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is CRITICAL: /{src}/{z}/{x}/{y}.{format} (get a tile in the middle of the ocean, with overzoom) timed out before a response was received: /img/{src},{z},{lat},{lon},{w}x{h}@{scale}x.{format} (Small scaled map) timed out before a response was received: /{src}/{z}/{x}/{y}@{scale}x.{format} (default scaled tile) timed out before a response was received [03:43:09] 10Operations, 10Traffic: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623545 (10BBlack) Actually, seeing the same on several cp402x. Depooling ulsfo, maybe switch issue? 
[03:43:47] (PS1) BBlack: depool ulsfo [dns] - https://gerrit.wikimedia.org/r/379464 (https://phabricator.wikimedia.org/T176386)
[03:44:04] (CR) BBlack: [C: 2] depool ulsfo [dns] - https://gerrit.wikimedia.org/r/379464 (https://phabricator.wikimedia.org/T176386) (owner: BBlack)
[03:44:27] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 10 probes of 276 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[03:44:38] RECOVERY - Maps edge ulsfo on upload-lb.ulsfo.wikimedia.org is OK: All endpoints are healthy
[03:55:17] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 1 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[03:59:37] PROBLEM - Host cp4007 is DOWN: PING CRITICAL - Packet loss = 100%
[03:59:47] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:00:37] PROBLEM - Juniper alarms on asw-ulsfo is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms
[04:00:57] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:03:57] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100%
[04:03:57] PROBLEM - Host ripe-atlas-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[04:04:27] PROBLEM - Host cp4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[04:05:22] Operations, Traffic, Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623552 (BBlack) So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at about the same time. cp4021 was the lone exception. cp402[78] in text@...
[04:06:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [04:06:43] 10Operations, 10Traffic, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623553 (10BBlack) asw-ulsfo has some other alerts going on, aside from the expected link loss to various flapping or supposedly-down hosts, e.g.: ``` Sep 21 03:57:51 asw-ulsfo alarmd[1458... [04:13:24] 10Operations, 10Traffic, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623555 (10BBlack) ... and now we've lost the cr1-eqiad <-> cr1-codfw link ... ? ``` cr1-eqiad xe-4/2/0: down -> Core: cr1-codfw:xe-5/2/1 ``` [04:26:17] RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [04:28:57] RECOVERY - Host cp4007 is UP: PING OK - Packet loss = 0%, RTA = 78.60 ms [04:30:17] RECOVERY - Host ripe-atlas-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 78.69 ms [04:30:27] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.67 ms [04:30:57] RECOVERY - Host cp4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.19 ms [04:32:04] 10Operations, 10Traffic, 10Patch-For-Review: cp4026 strange ethernet issue - https://phabricator.wikimedia.org/T176386#3623559 (10BBlack) Recoveries of whatever the hell is happening in ulsfo: ``` 04:26 <+icinga-wm> RECOVERY - Juniper alarms on asw-ulsfo is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms 0... [04:34:36] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... 
- https://phabricator.wikimedia.org/T176386#3623560 (10BBlack) [05:49:37] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 32 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [05:59:37] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 7 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:07:13] 10Operations, 10HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3561778 (10Joe) >>! In T174431#3623400, @Legoktm wrote: > I updated the steps based on the plan to use PHP 7 instead of HHVM. There is no way we'll embark in the double migration at the same time. Upgrading... [06:14:27] 10Operations, 10HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3623586 (10Legoktm) [06:15:23] 10Operations, 10HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3561778 (10Legoktm) >>! In T174431#3623584, @Joe wrote: >>>! In T174431#3623400, @Legoktm wrote: >> I updated the steps based on the plan to use PHP 7 instead of HHVM. > > There is no way we'll embark in the... [06:16:38] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 32 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:18:07] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [06:21:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [06:22:08] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [06:25:57] PROBLEM - puppet last run on ores2005 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [06:48:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [06:49:18] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [06:51:47] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [06:53:17] RECOVERY - puppet last run on ores2005 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [06:55:04] !log bounce pybal on lvs1009 to clear stale alert [06:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:57] RECOVERY - PyBal backends health check on lvs1009 is OK: PYBAL OK - All pools are healthy [06:59:17] !log bounce pybal on lvs1006 to clear stale alert [06:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:57] RECOVERY - PyBal backends health check on lvs1006 is OK: PYBAL OK - All pools are healthy [07:02:47] PROBLEM - puppet last run on mw2185 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues
[07:03:47] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 37 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:06:08] Operations, Pybal, Traffic: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388#3623639 (ema)
[07:06:21] Operations, Pybal, Traffic: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388#3623652 (ema) p:Triage>High
[07:06:31] (PS2) Muehlenhoff: Drop trebuchet::packages [puppet] - https://gerrit.wikimedia.org/r/379180
[07:06:49] Operations, Pybal, Traffic: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388#3623639 (ema)
[07:08:26] (CR) Muehlenhoff: [C: 2] Drop trebuchet::packages [puppet] - https://gerrit.wikimedia.org/r/379180 (owner: Muehlenhoff)
[07:08:31] (CR) Gilles: [C: 1] webperf: Limit by-country navtiming breakdown to those with 5+ hits/min [puppet] - https://gerrit.wikimedia.org/r/377806 (https://phabricator.wikimedia.org/T166390) (owner: Krinkle)
[07:09:19] !log bounce pybal on lvs1003 to clear stale alert T176388
[07:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:32] T176388: pybal: race condition in alerts instrumentation - https://phabricator.wikimedia.org/T176388
[07:09:47] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy
[07:10:41] (PS2) Muehlenhoff: Remove a few obsolete references to trebuchet [puppet] - https://gerrit.wikimedia.org/r/379281
[07:12:17] PROBLEM - puppet last run on ms-be2028 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [07:13:47] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [07:14:58] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:16:19] 10Operations, 10TemplateStyles, 10Wikimedia-Extension-setup, 10Wikimedia-Site-requests, and 3 others: Deploy TemplateStyles to svwiki - https://phabricator.wikimedia.org/T176082#3623660 (10ema) [07:16:39] (03CR) 10Muehlenhoff: [C: 032] Remove a few obsolete references to trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/379281 (owner: 10Muehlenhoff) [07:18:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [07:18:10] 10Operations, 10HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3623663 (10MoritzMuehlenhoff) I'll take care of builds for stretch-wikimedia [07:19:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [07:21:22] ACKNOWLEDGEMENT - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/296af4efe8f6f57432c8905b9d09a558eca6ba4214f40b1183f1d7a794976745/shm is not accessible: Permission denied Ema Known: T172409 [07:24:27] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [07:27:09] !log mobrovac@tin Started deploy [restbase/deploy@3e9bd9f]: New storage schema for mobile-sections, canary deploy for schema creation, take 2 - T169940 [07:27:17] PROBLEM - puppet last run on mw2108 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:27:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:27:25] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [07:27:39] !log mobrovac@tin Finished deploy [restbase/deploy@3e9bd9f]: New storage schema for mobile-sections, canary deploy for schema creation, take 2 - T169940 (duration: 00m 30s) [07:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:57] (03PS1) 10Muehlenhoff: Remove obsolete deprecation code [puppet] - 10https://gerrit.wikimedia.org/r/379482 [07:31:17] RECOVERY - puppet last run on mw2185 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [07:31:45] mobrovac: o/ - is restbase2002 out of service as part of the migration or should it be checked? [07:31:45] <_joe_> !log restarted aphlict on phab1001 [07:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:13] elukey: yup yup, ignore restbase2002 (i disabled the checks for it yday) [07:32:19] in icinga that is [07:32:35] super thanks [07:35:48] RECOVERY - Check systemd state on phab1001 is OK: OK - running: The system is fully operational [07:36:42] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623695 (10ema) >>! In T176386#3623552, @BBlack wrote: > Inbound network traffic to all the upload@ulsfo nodes was ramping up to unusual values ahead of the n... [07:37:42] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... 
- https://phabricator.wikimedia.org/T176386#3623696 (ema) p:Triage>High
[07:40:47] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0
[07:40:47] RECOVERY - puppet last run on ms-be2028 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[07:41:28] (PS1) Muehlenhoff: Remove Trebuchet puppet package provider [puppet] - https://gerrit.wikimedia.org/r/379486
[07:44:23] (PS1) Muehlenhoff: Stop including trebuchet class from base profile [puppet] - https://gerrit.wikimedia.org/r/379487
[07:44:25] (PS1) Muehlenhoff: Remove obsolete trebuchet class [puppet] - https://gerrit.wikimedia.org/r/379488
[07:49:45] Operations, Phabricator, Release-Engineering-Team: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3623717 (Joe)
[07:49:54] Operations, Phabricator, Release-Engineering-Team: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3623729 (Joe) p:Triage>High
[07:51:28] <_joe_> moritzm: we're done with trebuchet?
[07:51:43] <_joe_> can I remove all the trebuchet related shit from service::node then?
[07:53:14] (PS1) Muehlenhoff: Remove obsolete sudo group [puppet] - https://gerrit.wikimedia.org/r/379490
[07:53:37] (PS1) Giuseppe Lavagetto: Rakefile: start using the future parser for syntax checking [puppet] - https://gerrit.wikimedia.org/r/379491
[07:54:56] (CR) Hashar: [C: 1] "Sorry I got confused with all the layers in puppet. Indeed this is just removing the git-deploy/trebuchet craziness." [puppet] - https://gerrit.wikimedia.org/r/379189 (owner: Muehlenhoff)
[07:55:04] (PS2) Giuseppe Lavagetto: Rakefile: start using the future parser for syntax checking [puppet] - https://gerrit.wikimedia.org/r/379491 (https://phabricator.wikimedia.org/T171704)
[07:55:38] RECOVERY - puppet last run on mw2108 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[07:55:57] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[07:56:08] _joe_: you can :-)
[07:56:31] I looked briefly over it yesterday and I think it still default to trebuchet even :-)
[07:56:43] <_joe_> yes
[07:57:44] !log mobrovac@tin Started deploy [restbase/deploy@3e9bd9f]: New storage schema for mobile-sections - T169940
[07:57:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:00] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940
[07:59:12] (PS1) Giuseppe Lavagetto: puppet: switch all hosts to the future parser [puppet] - https://gerrit.wikimedia.org/r/379492 (https://phabricator.wikimedia.org/T171704)
[07:59:26] (PS9) ArielGlenn: Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528)
[08:00:44] (CR) ArielGlenn: [C: 2] Move datasets nginx logs rsync to dumps web manifest where it belongs [puppet] - https://gerrit.wikimedia.org/r/379198 (https://phabricator.wikimedia.org/T175528) (owner: ArielGlenn)
[08:00:50] (CR) Volans: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/379491 (https://phabricator.wikimedia.org/T171704) (owner: Giuseppe Lavagetto)
[08:00:57] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 13 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:07:52] (PS1) Elukey: torrus: disallow .htaccess usage [puppet] - https://gerrit.wikimedia.org/r/379494
[08:07:58] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 33 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[08:08:12] !log mobrovac@tin Finished deploy [restbase/deploy@3e9bd9f]: New storage schema for mobile-sections - T169940 (duration: 10m 28s)
[08:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:27] T169940: End of September milestone: Start migration of production use cases.
- https://phabricator.wikimedia.org/T169940 [08:08:27] RECOVERY - Restbase root url on restbase2002 is OK: HTTP OK: HTTP/1.1 200 - 15723 bytes in 0.088 second response time [08:08:27] RECOVERY - Check systemd state on restbase2002 is OK: OK - running: The system is fully operational [08:08:52] (03CR) 10Elukey: [C: 032] torrus: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379494 (owner: 10Elukey) [08:08:55] (03Abandoned) 10Muehlenhoff: yubiauth: Use the future parser [puppet] - 10https://gerrit.wikimedia.org/r/377422 (owner: 10Muehlenhoff) [08:09:17] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:12:17] (03PS1) 10Elukey: apache::static_site: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379495 [08:14:22] (03CR) 10Elukey: [C: 032] apache::static_site: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379495 (owner: 10Elukey) [08:14:34] !log mobrovac@tin Started deploy [restbase/deploy@9fd380d]: Log only mismatches during updates, not live requests [08:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:28] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [08:19:56] !log mobrovac@tin Finished deploy [restbase/deploy@9fd380d]: Log only mismatches during updates, not live requests (duration: 05m 22s) [08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:21] (03PS2) 10Muehlenhoff: Remove deployment::salt_master/role::deployment::salt_masters and related files [puppet] - 10https://gerrit.wikimedia.org/r/379189 [08:20:56] !log mobrovac@tin Started deploy [restbase/deploy@9fd380d]: Log only mismatches during updates, not live requests [08:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:21] (03CR) 
10Alexandros Kosiaris: "Do we currently or in the near future have a use cause for fully skipping monitoring ? If not, then we should not add code that is not goi" [puppet] - 10https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [08:21:28] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [08:21:54] !log mobrovac@tin Finished deploy [restbase/deploy@9fd380d]: Log only mismatches during updates, not live requests (duration: 00m 58s) [08:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:08] <_joe_> moritzm: removing that poison from service::node will require a bit more work than anticipated [08:22:21] !log mobrovac@tin Started deploy [restbase/deploy@9fd380d]: Log only mismatches during updates, not live requests [08:22:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:57] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [08:24:17] ok [08:24:50] (03CR) 10Muehlenhoff: [C: 032] Remove deployment::salt_master/role::deployment::salt_masters and related files [puppet] - 10https://gerrit.wikimedia.org/r/379189 (owner: 10Muehlenhoff) [08:25:39] !log mobrovac@tin Finished deploy [restbase/deploy@9fd380d]: Log only mismatches during updates, not live requests (duration: 03m 18s) [08:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:37] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:27:47] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [08:29:07] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:29:28] 10Operations, 10Ops-Access-Requests: Request access to logstash (nda group) for @framawiki - https://phabricator.wikimedia.org/T176364#3623762 (10Trizek-WMF) {{support}} Trusted user, involved in tech support on fr. ~~~~ [08:29:35] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623763 (10ema) >>! In T176386#3623552, @BBlack wrote: > So, the same basic issue appears to have happened for almost all of upload@ulsfo (cp402[12356]) at ab... [08:30:36] !log akosiaris@puppetmaster1001 conftool action : set/weight=10; selector: name=wtp1025.eqiad.wmnet [08:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:58] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:30:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:30:59] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:31:00] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:31:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:31:01] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:11] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1025.eqiad.wmnet [08:31:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1026.eqiad.wmnet [08:31:12] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1027.eqiad.wmnet [08:31:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: 
name=wtp1028.eqiad.wmnet [08:31:13] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1029.eqiad.wmnet [08:31:14] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1030.eqiad.wmnet [08:31:14] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1031.eqiad.wmnet [08:31:15] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1032.eqiad.wmnet [08:31:15] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1033.eqiad.wmnet [08:31:16] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1034.eqiad.wmnet [08:31:16] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1035.eqiad.wmnet [08:31:17] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1036.eqiad.wmnet [08:31:17] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1037.eqiad.wmnet [08:31:18] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1038.eqiad.wmnet [08:31:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1039.eqiad.wmnet [08:31:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1040.eqiad.wmnet [08:31:19] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1041.eqiad.wmnet [08:31:20] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1042.eqiad.wmnet [08:31:21] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1043.eqiad.wmnet [08:31:21] !log akosiaris@puppetmaster1001 conftool action : set/pooled=yes; selector: name=wtp1044.eqiad.wmnet [08:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:42] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:00] !log pool all of wtp1025 to wtp1048 T165520
[08:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:22] akosiaris: lol, one by one...
[08:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:33] volans: yeah I just did a for loop
[08:32:42] (PS1) Ema: Repool text@ulsfo [dns] - https://gerrit.wikimedia.org/r/379498 (https://phabricator.wikimedia.org/T176386)
[08:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:47] and if you notice I made a mistake the very first time
[08:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:53] pooling wtp1025 multiple times
[08:32:53] yeah, noticed ;)
[08:33:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:20] !log unbreak deployment-puppetmaster02 in deployment-prep (broken by unattended-upgrades update of apache T159254)
[08:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
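(Editor's note on the 08:30-08:32 exchange: the "for loop" that pooled wtp1025 to wtp1048 one host at a time, and repeated wtp1025 on its first pass, might be sketched roughly as below. This is a minimal illustration, not the command that was actually run. The helper names `wtp_hosts` and `pool_actions` are hypothetical, and the sketch only builds the action/selector strings that appear in the log; it does not call conftool.)

```python
def wtp_hosts(first=1025, last=1048, domain="eqiad.wmnet"):
    """Yield host names wtp<first>..wtp<last>, inclusive (hypothetical helper)."""
    for n in range(first, last + 1):
        yield f"wtp{n}.{domain}"


def pool_actions(hosts):
    """Return one (action, selector) pair per distinct host, in input order.

    Skipping hosts already seen guards against the slip noted in the log,
    where the same host was pooled several times in a row.
    """
    seen = set()
    actions = []
    for host in hosts:
        if host in seen:
            continue
        seen.add(host)
        actions.append(("set/pooled=yes", f"name={host}"))
    return actions


if __name__ == "__main__":
    # Print lines in the same shape as the !log entries above.
    for action, selector in pool_actions(wtp_hosts()):
        print(f"conftool action : {action}; selector: {selector}")
```

(Building the full list first and printing it before acting makes an off-by-one or a stuck loop variable visible before any host is touched.)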
[08:34:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:07] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 23 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [08:35:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:36] T165520: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520 [08:35:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:47] T159254: Blacklist apache from unattended-upgrades on tools puppetmaster - https://phabricator.wikimedia.org/T159254 [08:37:57] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [08:38:40] (03PS1) 10Elukey: phragile: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379499 [08:40:07] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 286 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [08:48:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0 [08:49:07] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [08:51:02] (03CR) 10Ema: [C: 032] Repool text@ulsfo [dns] - 10https://gerrit.wikimedia.org/r/379498 (https://phabricator.wikimedia.org/T176386) (owner: 10Ema) [08:52:32] 
(03PS1) 10Muehlenhoff: Remove salt minion Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/379500 [08:57:28] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [09:01:59] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623529 (10faidon) >>! In T176386#3623559, @BBlack wrote: > Recoveries of whatever the hell is happening in ulsfo: > ``` > 04:26 <+icinga-wm> RECOVERY - Junip... [09:07:34] (03CR) 10Mobrovac: [C: 031] Remove obsolete deprecation code [puppet] - 10https://gerrit.wikimedia.org/r/379482 (owner: 10Muehlenhoff) [09:09:10] 10Operations, 10Traffic, 10Patch-For-Review: upload@ulsfo strange ethernet / power / switch issues, etc... - https://phabricator.wikimedia.org/T176386#3623810 (10faidon) Confirmed from UnitedLayer email: ``` Assad Kermanshahi, Sep 20, 21:13 PDT Dear WikimediaFoundation, This is to inform you that your PDUb... [09:15:52] (03PS1) 10Muehlenhoff: Remove beta::saltmaster::tools and /usr/local/bin/beta-apaches [puppet] - 10https://gerrit.wikimedia.org/r/379502 [09:18:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [09:21:47] PROBLEM - puppet last run on restbase2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Service[restbase] [09:24:07] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3623829 (10akosiaris) All wtp1025 to wtp1048 hosts are pooled now. I've also updated https://grafana.wikimedia.org/dashboard/db/parsoid-servers-cpu-usage?orgId=1 to list the new servers and they look fin... 
[09:24:18] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3623830 (10akosiaris) 05Open>03Resolved [09:25:02] (03PS1) 10Giuseppe Lavagetto: service: remove trebuchet references [puppet] - 10https://gerrit.wikimedia.org/r/379503 [09:28:37] (03PS3) 10Giuseppe Lavagetto: profile::docker::builder: add build script for production-images [puppet] - 10https://gerrit.wikimedia.org/r/379176 [09:32:59] (03PS1) 10DCausse: Revert "Switch elasticsearch active cluster to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379504 [09:33:06] (03PS2) 10DCausse: Revert "Switch elasticsearch active cluster to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379504 [09:34:28] (03CR) 10Giuseppe Lavagetto: [C: 032] profile::docker::builder: add build script for production-images [puppet] - 10https://gerrit.wikimedia.org/r/379176 (owner: 10Giuseppe Lavagetto) [09:35:48] (03CR) 10Gehel: [C: 031] "LGTM and eqiad cluster looks healthy and ready to receive the traffic" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379504 (owner: 10DCausse) [09:36:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 232, down: 1, dormant: 0, excluded: 0, unused: 0 [09:46:40] (03CR) 10Alexandros Kosiaris: [C: 031] Rakefile: start using the future parser for syntax checking [puppet] - 10https://gerrit.wikimedia.org/r/379491 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [09:49:27] RECOVERY - puppet last run on restbase2002 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [09:51:28] (03PS3) 10Giuseppe Lavagetto: Rakefile: start using the future parser for syntax checking [puppet] - 10https://gerrit.wikimedia.org/r/379491 (https://phabricator.wikimedia.org/T171704) [09:54:51] !log reindexing group0 wikis T176397 [09:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:07] T176397: 
Reindex default namespaces that were moved from general to content indices - https://phabricator.wikimedia.org/T176397 [09:56:16] (03CR) 10Giuseppe Lavagetto: [C: 032] Rakefile: start using the future parser for syntax checking [puppet] - 10https://gerrit.wikimedia.org/r/379491 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [09:59:39] !log reindexing group1 wikis T176397 [09:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:53] (03PS1) 10ArielGlenn: move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) [10:02:37] (03CR) 10jerkins-bot: [V: 04-1] move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [10:04:50] 10Operations, 10Phabricator, 10Release-Engineering-Team: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3623717 (10Paladox) Also would like to add that it is failing to connect to the db too. 
[10:05:10] (03PS2) 10Volans: Backends: add OpenStack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/379247 (https://phabricator.wikimedia.org/T175711) [10:06:34] (03PS4) 10Volans: wmf-auto-reimage refactoring [puppet] - 10https://gerrit.wikimedia.org/r/377501 (https://phabricator.wikimedia.org/T148814) [10:09:28] (03CR) 10Alexandros Kosiaris: [C: 031] icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [10:09:46] (03CR) 10Volans: [C: 032] wmf-auto-reimage refactoring [puppet] - 10https://gerrit.wikimedia.org/r/377501 (https://phabricator.wikimedia.org/T148814) (owner: 10Volans) [10:14:50] (03PS2) 10ArielGlenn: move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) [10:15:25] (03CR) 10jerkins-bot: [V: 04-1] move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [10:16:07] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/sbin/wmf-auto-reimage] [10:16:35] that's me, fixing [10:18:16] (03PS1) 10Volans: Salt: completely remove the orchestration class [puppet] - 10https://gerrit.wikimedia.org/r/379508 (https://phabricator.wikimedia.org/T166300) [10:19:14] (03PS3) 10ArielGlenn: move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) [10:19:48] (03PS2) 10Volans: Salt: completely remove the orchestration class [puppet] - 10https://gerrit.wikimedia.org/r/379508 (https://phabricator.wikimedia.org/T166300) [10:20:49] (03CR) 10Volans: [C: 032] Salt: completely remove the orchestration class [puppet] - 10https://gerrit.wikimedia.org/r/379508 (https://phabricator.wikimedia.org/T166300) (owner: 10Volans) [10:25:08] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:26:33] !log upload docker-ce_17.06.2~ce-0~debian_amd64.deb to apt.wikimedia.org jessie-wikimedia/thirdparty/ci T175293 [10:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:51] T175293: Provision Docker >= 17.05 on contint1001 - https://phabricator.wikimedia.org/T175293 [10:27:03] (03PS2) 10Alexandros Kosiaris: Enable thirdparty/ci on role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/379183 (https://phabricator.wikimedia.org/T175293) [10:27:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Enable thirdparty/ci on role::ci::slave [puppet] - 10https://gerrit.wikimedia.org/r/379183 (https://phabricator.wikimedia.org/T175293) (owner: 10Alexandros Kosiaris) [10:33:52] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Yeah, -1 until we fix them, at least mediawiki" [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [10:38:02] (03PS1) 10Alexandros Kosiaris: Install docker-ce on 
role::ci::slave hosts [puppet] - 10https://gerrit.wikimedia.org/r/379510 (https://phabricator.wikimedia.org/T175293) [10:39:11] (03CR) 10Alexandros Kosiaris: "@hashar does this look ok architecture wise to you? In production there aren't any other hosts including role::ci::slave except contint100" [puppet] - 10https://gerrit.wikimedia.org/r/379510 (https://phabricator.wikimedia.org/T175293) (owner: 10Alexandros Kosiaris) [10:46:42] (03PS4) 10ArielGlenn: move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) [10:47:27] (03CR) 10jerkins-bot: [V: 04-1] move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [10:47:46] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3268837 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by volans on neodymium.eqiad.wmnet for hosts: ``` mw1319.eqiad.wmnet ``` The log can be foun... 
[10:51:35] (03PS5) 10ArielGlenn: move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) [10:52:13] (03PS1) 10Ema: VCL: Exp cache admission policy for varnish-be [puppet] - 10https://gerrit.wikimedia.org/r/379512 (https://phabricator.wikimedia.org/T144187) [10:57:57] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 234, down: 0, dormant: 0, excluded: 0, unused: 0 [10:59:09] (03PS1) 10Elukey: Add druid.svc.eqiad.wmnet crt file [puppet] - 10https://gerrit.wikimedia.org/r/379513 (https://phabricator.wikimedia.org/T176223) [10:59:38] (03PS2) 10Ema: VCL: Exp cache admission policy for varnish-be [puppet] - 10https://gerrit.wikimedia.org/r/379512 (https://phabricator.wikimedia.org/T144187) [10:59:52] (03CR) 10Elukey: [C: 032] Add druid.svc.eqiad.wmnet crt file [puppet] - 10https://gerrit.wikimedia.org/r/379513 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [11:02:26] (03PS2) 10Giuseppe Lavagetto: puppet: switch all production hosts to the future parser [puppet] - 10https://gerrit.wikimedia.org/r/379492 (https://phabricator.wikimedia.org/T171704) [11:03:23] (03CR) 10jerkins-bot: [V: 04-1] puppet: switch all production hosts to the future parser [puppet] - 10https://gerrit.wikimedia.org/r/379492 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [11:04:22] (03PS6) 10ArielGlenn: move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) [11:05:09] (03CR) 10ArielGlenn: [C: 032] move datasets directory structure and static html file setup to dumps module [puppet] - 10https://gerrit.wikimedia.org/r/379506 (https://phabricator.wikimedia.org/T175528) (owner: 10ArielGlenn) [11:09:38] !log reindexing ilwikimedia in codfw to pickup and test new hebmorph analyzer [11:09:47] (03PS3) 
10Giuseppe Lavagetto: puppet: switch all production hosts to the future parser [puppet] - 10https://gerrit.wikimedia.org/r/379492 (https://phabricator.wikimedia.org/T171704) [11:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:54] !log reindexing all hebrew wikis to pickup the new hebmorph analyzer [11:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:13] !log installing apache security updates (our configurations are not affected, the upload of 2.4.10-10+deb8u11+wmf1 is mostly for the benefit of whatever people are doing in Cloud VPS, but we should still fix the underlying bug in production as well) [11:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:30] let's fix that bad optionsbleed :P [11:39:44] (03PS1) 10Muehlenhoff: Extend Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/379514 [11:40:12] !log upgrading apache on canary app servers [11:40:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:17] !log upgrading apache on deployment servers and script runners [11:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:32] (03CR) 10Muehlenhoff: [C: 032] Extend Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/379514 (owner: 10Muehlenhoff) [12:16:52] (03PS2) 10Muehlenhoff: Configure fixed lock manager ports for labstore NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/357562 (https://phabricator.wikimedia.org/T165136) [12:16:58] (03Draft2) 10Reedy: Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 [12:17:23] (03CR) 10jerkins-bot: [V: 04-1] Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 (owner: 10Reedy) [12:18:46] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3624130 
(10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1319.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1319.eqiad.wmnet'] ``` [12:21:13] (03PS1) 10Volans: wmf-auto-reimage: small improvements [puppet] - 10https://gerrit.wikimedia.org/r/379518 (https://phabricator.wikimedia.org/T148814) [12:22:41] (03CR) 10Volans: [C: 032] "Merging to continue with the live testing, comments are welcome and I'll amend if needed." [puppet] - 10https://gerrit.wikimedia.org/r/379518 (https://phabricator.wikimedia.org/T148814) (owner: 10Volans) [12:22:47] (03PS2) 10Volans: wmf-auto-reimage: small improvements [puppet] - 10https://gerrit.wikimedia.org/r/379518 (https://phabricator.wikimedia.org/T148814) [12:27:05] jouncebot: next [12:27:06] In 0 hour(s) and 32 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1300) [12:28:52] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3624152 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by volans on neodymium.eqiad.wmnet for hosts: ``` mw1319.eqiad.wmnet ``` The log can be foun... [12:32:03] (03PS2) 10Muehlenhoff: Remove obsolete deprecation code [puppet] - 10https://gerrit.wikimedia.org/r/379482 [12:32:12] !log mobrovac@tin Started deploy [restbase/deploy@bc02191]: produce a diff for the mobile proxy [12:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:12] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete deprecation code [puppet] - 10https://gerrit.wikimedia.org/r/379482 (owner: 10Muehlenhoff) [12:50:46] !log mobrovac@tin Finished deploy [restbase/deploy@bc02191]: produce a diff for the mobile proxy (duration: 18m 34s) [12:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:09] <_joe_> !log stopped long-running runjobs for refreshlinks on commons. 
[12:55:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1300). Please do the needful. [13:00:04] dcausse: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. [13:00:15] I can SWAT today! [13:00:19] o/ [13:00:38] dcausse: merging the patch, can you test it at mwdebug1002? [13:00:45] zeljkof: yes [13:01:23] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379504 (owner: 10DCausse) [13:01:34] dcausse: will be there in a few minutes, will ping you [13:01:47] ci is not busy, should not take long [13:02:33] dcausse: any order the files should be deployed in? or any order would do? 
[13:02:50] zeljkof: it does not matter [13:02:57] (03Merged) 10jenkins-bot: Revert "Switch elasticsearch active cluster to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379504 (owner: 10DCausse) [13:03:13] (03CR) 10jenkins-bot: Revert "Switch elasticsearch active cluster to codfw" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379504 (owner: 10DCausse) [13:04:01] dcausse: the patch is at mwdebug1002, let me know if I can deploy [13:04:06] testing [13:04:09] (03PS1) 10Giuseppe Lavagetto: puppet: move production to the future environment [puppet] - 10https://gerrit.wikimedia.org/r/379523 (https://phabricator.wikimedia.org/T171704) [13:06:10] (03CR) 10Giuseppe Lavagetto: "Latest PCC results for the switch:" [puppet] - 10https://gerrit.wikimedia.org/r/379523 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [13:06:44] zeljkof: looks good, you can proceed [13:06:52] dcausse: ok, deploying [13:07:49] !log zfilipin@tin Synchronized tests/cirrusTest.php: SWAT: [[gerrit:379504|Revert "Switch elasticsearch active cluster to codfw"]] (duration: 00m 49s) [13:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:00] !log zfilipin@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:379504|Revert "Switch elasticsearch active cluster to codfw"]] (duration: 00m 48s) [13:09:10] PROBLEM - salt-minion processes on multatuli is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:54] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:379504|Revert "Switch elasticsearch active cluster to codfw"]] (duration: 00m 48s) [13:09:59] (03Draft2) 10MacFan4000: Update ExtensionDistributer settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379524 [13:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[13:10:05] dcausse: deployed, please test [13:11:12] zeljkof: sounds good, elastic@eqiad is starting to receive traffic [13:11:20] dcausse: great! [13:11:21] I'll continue to monitor the cluster [13:11:27] zeljkof: thanks! [13:11:40] dcausse: thanks for releasing with #releng! ;) [13:11:47] :) [13:12:05] !log EU SWAT finished [13:12:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:08] (03PS1) 10Muehlenhoff: Remove salt minion packages in production [puppet] - 10https://gerrit.wikimedia.org/r/379525 [13:13:25] (03CR) 10MGChecker: [C: 031] Enable Timeless skin on 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377864 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [13:19:05] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3624317 (10Gehel) API Feature logs are sent to the cirrus cluster, presumably for consumption by ht... [13:23:48] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3624319 (10Ottomata) Should be AMD FirePro S9150 according to quote. [13:24:31] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3624321 (10Gehel) It might be possible to tune the elasticsearch output plugin to be more robust. T... 
[13:26:04] (03PS2) 10Muehlenhoff: Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) [13:26:28] (03CR) 10jerkins-bot: [V: 04-1] Remove stretch-wikimedia/backports [puppet] - 10https://gerrit.wikimedia.org/r/357616 (https://phabricator.wikimedia.org/T158583) (owner: 10Muehlenhoff) [13:29:15] (03CR) 10Alexandros Kosiaris: [C: 031] puppet: move production to the future environment [puppet] - 10https://gerrit.wikimedia.org/r/379523 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [13:36:40] (03PS1) 10Muehlenhoff: Remove salt from labs_bootstrapvz config [puppet] - 10https://gerrit.wikimedia.org/r/379529 [13:37:29] (03PS1) 10Gehel: maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/379530 (https://phabricator.wikimedia.org/T162362) [13:38:42] (03PS1) 10Muehlenhoff: Remove salt from labs_vmbuilder [puppet] - 10https://gerrit.wikimedia.org/r/379531 [13:39:15] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, and 2 others: Make maps active / active - https://phabricator.wikimedia.org/T162362#3624370 (10Gehel) We are ready to make maps active / active. Patch https://gerrit.wikimedia.org/r/#/c/379530/ is ready to be merged, but I'll let the traffic team (@ema / @B... [13:41:59] 10Operations, 10Operations-Software-Development, 10Technical-Debt: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300#3624372 (10Volans) 05Open>03Resolved [13:42:00] PROBLEM - puppet last run on labvirt1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:00] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:10] PROBLEM - puppet last run on ganeti1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:42:20] PROBLEM - puppet last run on ms-be1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:20] PROBLEM - puppet last run on mw1297 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:30] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:40] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:42:40] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:00] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:00] PROBLEM - puppet last run on poolcounter1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:01] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:43:10] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:44:38] PuppetDB at nitrogen.eqiad.wmnet:443: [502 Bad Gateway] [13:44:41] ouch [13:44:44] <_joe_> wat? 
[13:44:44] probably another OOM [13:45:06] <_joe_> [7249347.738722] Out of memory: Kill process 8610 (java) score 389 or sacrifice child [13:45:31] <_joe_> it already restarted [13:45:39] !log upgrade to nodejs 6.11 on the full maps-test cluster - T171707 [13:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:54] T171707: Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707 [13:48:21] yeah systemd restarts it [13:53:50] (03CR) 10Elukey: [C: 031] Add druid LVS svc name [dns] - 10https://gerrit.wikimedia.org/r/378967 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [13:54:20] (03PS1) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [13:56:46] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet: move production to the future environment [puppet] - 10https://gerrit.wikimedia.org/r/379523 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [13:57:36] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3624409 (10jcrespo) > I'm wondering if people really do leave long-running cumin/salt tasks there currently in a screen. All my multiple-hosts schema changes run from neodymium... [13:57:53] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/379500 (owner: 10Muehlenhoff) [13:58:51] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3624416 (10jcrespo) Did T166570 fix some outage conditions? [14:00:20] (03CR) 10Jcrespo: [C: 04-1] "There is no mariadb::client included, the one with more screen sessions overall." 
[puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [14:01:54] (03PS2) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [14:04:20] (03PS3) 10Reedy: Template-ise rsync/public.pp hosts allow [puppet] - 10https://gerrit.wikimedia.org/r/379517 [14:07:06] (If you happen to have a minute) I am looking at https://phabricator.wikimedia.org/T174587#3622318 and wonder if that really really needs Faidon's time. [14:09:07] (03CR) 10Hashar: [C: 031] "role::ci::slave is for the Jenkins slaves on production machines. All the rest is on labs and provisioned using different puppet manifests" [puppet] - 10https://gerrit.wikimedia.org/r/379510 (https://phabricator.wikimedia.org/T175293) (owner: 10Alexandros Kosiaris) [14:09:23] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3624478 (10Dzahn) 05Open>03stalled [14:09:40] RECOVERY - puppet last run on ganeti1003 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:10:10] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [14:10:12] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3263810 (10Dzahn) Setting to stalled for now. I don't personally understand the relation to T166570 yet, looks like it needs more discussion. 
[14:10:20] RECOVERY - puppet last run on poolcounter1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [14:10:20] RECOVERY - puppet last run on labvirt1017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [14:10:30] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:10:30] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [14:11:02] (03PS1) 10Elukey: Add tlsproxy fake credentials for Druid [labs/private] - 10https://gerrit.wikimedia.org/r/379538 (https://phabricator.wikimedia.org/T176223) [14:11:06] RECOVERY - puppet last run on ms-be1017 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [14:11:15] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [14:11:21] (03CR) 10Elukey: [V: 032 C: 032] Add tlsproxy fake credentials for Druid [labs/private] - 10https://gerrit.wikimedia.org/r/379538 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:11:25] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:11:26] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:12:41] (03PS3) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [14:12:55] RECOVERY - puppet last run on mw1297 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:14:23] (03PS4) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [14:15:35] RECOVERY - puppet last run on scb1001 is OK: 
OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:17:17] 10Operations, 10Phabricator, 10Release-Engineering-Team: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3624488 (10Dzahn) What alert was it? I don't think there is any Icinga monitoring for it yet and it wasn't even expected to be used, like the servic... [14:19:43] (03PS1) 10Elukey: Rename druid worker hieradata config [labs/private] - 10https://gerrit.wikimedia.org/r/379540 (https://phabricator.wikimedia.org/T176223) [14:19:58] (03CR) 10Elukey: [V: 032 C: 032] Rename druid worker hieradata config [labs/private] - 10https://gerrit.wikimedia.org/r/379540 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:21:26] (03PS2) 10Alexandros Kosiaris: Remove ruthenium from scap::dsh::groups::parsoid [puppet] - 10https://gerrit.wikimedia.org/r/379200 (https://phabricator.wikimedia.org/T165520) [14:21:31] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Remove ruthenium from scap::dsh::groups::parsoid [puppet] - 10https://gerrit.wikimedia.org/r/379200 (https://phabricator.wikimedia.org/T165520) (owner: 10Alexandros Kosiaris) [14:22:25] RECOVERY - salt-minion processes on multatuli is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:22:53] moritzm _joe_ is anyone doing puppet swat atm? 
[14:23:49] (03PS5) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [14:24:32] (03PS2) 10Alexandros Kosiaris: Install docker-ce on role::ci::slave hosts [puppet] - 10https://gerrit.wikimedia.org/r/379510 (https://phabricator.wikimedia.org/T175293) [14:24:49] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Install docker-ce on role::ci::slave hosts [puppet] - 10https://gerrit.wikimedia.org/r/379510 (https://phabricator.wikimedia.org/T175293) (owner: 10Alexandros Kosiaris) [14:25:09] matthiasmullie: it takes place in 1.5 hrs from now, but I won't be around today [14:25:59] ugh... forgot about timezones :) [14:26:09] ok thanks! [14:27:54] jouncebot: next [14:27:54] In 1 hour(s) and 32 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1600) [14:28:05] matthiasmullie: ^^^ [14:28:10] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Provision Docker >= 17.05 on contint1001 - https://phabricator.wikimedia.org/T175293#3624510 (10akosiaris) 05Open>03Resolved And done. Resolving [14:28:13] (03PS6) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [14:28:24] right! [14:29:20] (03CR) 10Zoranzoki21: [C: 031] Enable Timeless skin on 5 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377864 (https://phabricator.wikimedia.org/T154371) (owner: 10Framawiki) [14:30:56] 10Operations, 10monitoring, 10Patch-For-Review: Check long-running screen/tmux sessions - https://phabricator.wikimedia.org/T165348#3624517 (10Dzahn) Twentyafterfour fixed the "phab paste with custom permissions" feature, so now i could make one again that is limited to members of Operations: Here you go wi... 
[14:32:23] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Provision Docker >= 17.05 on contint1001 - https://phabricator.wikimedia.org/T175293#3624518 (10hashar) ``` contint1001:~$ apt-cache policy docker-ce docker-ce: Installed: 17.06.2~ce-0~debian Candidate: 17.06.2~c... [14:33:21] (03CR) 10Dzahn: "what is the name of that labs instance that you can't access? I don't know anything about phragile and whether .htaccess files are used bu" [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [14:34:04] (03CR) 10Dzahn: "adding Paladox and 20after4" [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [14:35:22] (03PS1) 10Elukey: Add fake TLS private key for druid.svc.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/379543 (https://phabricator.wikimedia.org/T176223) [14:35:36] (03CR) 10Elukey: [V: 032 C: 032] Add fake TLS private key for druid.svc.eqiad.wmnet [labs/private] - 10https://gerrit.wikimedia.org/r/379543 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:36:28] (03CR) 10Elukey: "The only one that I found is https://tools.wmflabs.org/openstack-browser/server/phragile-pro.phragile.eqiad.wmflabs" [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [14:37:29] 10Operations: Production shell access prompting for password - https://phabricator.wikimedia.org/T176418#3624526 (10Samwalton9) [14:38:01] 10Operations: Production shell access prompting for password - https://phabricator.wikimedia.org/T176418#3624540 (10Samwalton9) [14:38:32] PROBLEM - SSH on copper is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:38:40] (03PS9) 10Dzahn: icinga: initial whitelist for screen monitoring [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) [14:39:24] (03PS2) 10BBlack: browsersec: bump to 20% 2017-09-21 [puppet] - 10https://gerrit.wikimedia.org/r/376312 (https://phabricator.wikimedia.org/T163251) 
[14:39:58] (03PS1) 10Elukey: Rename the druid fake SSL key [labs/private] - 10https://gerrit.wikimedia.org/r/379544 (https://phabricator.wikimedia.org/T176223) [14:40:08] (03CR) 10Elukey: [V: 032 C: 032] Rename the druid fake SSL key [labs/private] - 10https://gerrit.wikimedia.org/r/379544 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:40:10] (03CR) 10BBlack: [C: 032] browsersec: bump to 20% 2017-09-21 [puppet] - 10https://gerrit.wikimedia.org/r/376312 (https://phabricator.wikimedia.org/T163251) (owner: 10BBlack) [14:40:21] RECOVERY - SSH on copper is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [14:43:35] (03CR) 10Elukey: "PCC: https://puppet-compiler.wmflabs.org/compiler02/7972/druid1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [14:48:00] 10Operations: Production shell access prompting for password - https://phabricator.wikimedia.org/T176418#3624574 (10Dzahn) > I'm trying to connect to analytics-store, which I was able to do earlier in the year though I haven't tried for some time Hi, analytics-store is an alias for dbstore1002.eqiad.wmnet. Th... [14:52:33] (03PS3) 10Volans: Backends: add OpenStack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/379247 (https://phabricator.wikimedia.org/T175711) [14:52:52] 10Operations: Production shell access prompting for password - https://phabricator.wikimedia.org/T176418#3624602 (10Samwalton9) 05Open>03Resolved a:03Samwalton9 //facepalm// I'd forgotten the connection process and was trying to connect to the wrong place. I can indeed connect to stat1006 no problem. Thanks! 
[14:53:56] (03CR) 10Muehlenhoff: [C: 032] Rebuild for Jessie + PHP 5.5 [debs/pkg-php/php-defaults] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/374766 (owner: 10Hashar) [14:54:36] (03CR) 10Muehlenhoff: [V: 032 C: 032] Build for php5.5 on jessie-wikimedia [debs/pkg-php/php-redis] (debian/jessie-wikimedia) - 10https://gerrit.wikimedia.org/r/376483 (https://phabricator.wikimedia.org/T174972) (owner: 10Hashar) [14:54:53] moritzm: \O/ [14:56:56] !log uploaded php-defaults and php-redis built against src:php5.5 to component/ci for jessie-wikimedia [14:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:22] (03CR) 10Volans: [C: 032] "Merging to not block Cumin usage in WMCS, if you have any late comment feel free to add them and I'll amend the code in an upcoming CR." [software/cumin] - 10https://gerrit.wikimedia.org/r/379247 (https://phabricator.wikimedia.org/T175711) (owner: 10Volans) [15:01:36] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3624619 (10dr0ptp4kt) Thanks, @ottomata. Any chance you could take a look at the GPU? I’d like to watch to learn something about the setup of this in Debian. We r... 
[15:02:09] (03Merged) 10jenkins-bot: Backends: add OpenStack backend [software/cumin] - 10https://gerrit.wikimedia.org/r/379247 (https://phabricator.wikimedia.org/T175711) (owner: 10Volans) [15:02:46] (03Abandoned) 10Dzahn: base::monitoring: make it possible to disable monitoring [puppet] - 10https://gerrit.wikimedia.org/r/368124 (https://phabricator.wikimedia.org/T151632) (owner: 10Dzahn) [15:04:17] (03CR) 10Alexandros Kosiaris: [C: 04-1] "First round of comments, I 'll upload a change to address some stuff" (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/374560 (https://phabricator.wikimedia.org/T169516) (owner: 10Jcrespo) [15:05:47] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3624627 (10Ottomata) Haha, I guess I can? But you know as much as I do! [15:17:33] (03CR) 10Ottomata: role::analytics_cluster::druid::worker: introduce tlsproxy for druid (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [15:19:13] 10Operations, 10Analytics, 10Analytics-Cluster, 10Research-management: GPU upgrade for stats machine - https://phabricator.wikimedia.org/T148843#3624653 (10dr0ptp4kt) Ha! That’s what they all say :P Around the next couple weeks? I could set up a time to watch and try to read up on the manuals. I don’t hav... [15:30:46] (03PS1) 10Giuseppe Lavagetto: icinga: fix naggen2 test for parameters [puppet] - 10https://gerrit.wikimedia.org/r/379552 [15:37:04] (03PS1) 10Volans: CHANGELOG: add changelogs for release v1.1.0. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/379553 [15:39:11] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: fix naggen2 test for parameters [puppet] - 10https://gerrit.wikimedia.org/r/379552 (owner: 10Giuseppe Lavagetto) [15:39:16] (03PS2) 10Giuseppe Lavagetto: icinga: fix naggen2 test for parameters [puppet] - 10https://gerrit.wikimedia.org/r/379552 [15:39:18] (03CR) 10Paladox: [C: 031] phragile: disallow .htaccess usage [puppet] - 10https://gerrit.wikimedia.org/r/379499 (owner: 10Elukey) [15:41:18] (03CR) 10Volans: [C: 031] "Looks good, would be nice to check with a compiler for prod and cherry-pick on a puppetmaster for labs" [puppet] - 10https://gerrit.wikimedia.org/r/379525 (owner: 10Muehlenhoff) [15:41:49] (03CR) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [15:42:30] (03CR) 10Volans: [C: 032] CHANGELOG: add changelogs for release v1.1.0. [software/cumin] - 10https://gerrit.wikimedia.org/r/379553 (owner: 10Volans) [15:46:14] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v1.1.0. [software/cumin] - 10https://gerrit.wikimedia.org/r/379553 (owner: 10Volans) [15:48:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10netops, 10procurement: eqiad: networking audit for support contract renewal - https://phabricator.wikimedia.org/T176338#3624714 (10Cmjohnson) @robh I want to confirm that I do have 2 spares still in their original packaging in storage. TA3716160376 TA3716160364 [15:55:30] (03PS1) 10Chad: group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379555 [15:56:15] (03CR) 10Thcipriani: "Cherry picked on beta. 
Puppet compiler: https://puppet-compiler.wmflabs.org/compiler02/7973/" [puppet] - 10https://gerrit.wikimedia.org/r/378750 (https://phabricator.wikimedia.org/T168211) (owner: 10Thcipriani) [15:56:49] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3624759 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1319.eqiad.wmnet'] ``` and were **ALL** successful. [15:57:05] 10Operations, 10ops-ulsfo, 10Traffic, 10hardware-requests: Decom cp4005-8,13-16 (8 nodes) - https://phabricator.wikimedia.org/T176366#3624760 (10RobH) [15:57:31] (03PS1) 10Hashar: contint: docker-ce on labs docker slaves [puppet] - 10https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) [16:00:04] godog and _joe_: Dear deployers, time to do the Puppet SWAT(Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1600). [16:00:04] matthiasmullie: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. [16:00:19] here! [16:02:48] I also had a simple one for puppet swat, but my page fiddling probably tripped up the bot https://gerrit.wikimedia.org/r/#/c/378750/ [16:03:44] (03CR) 10Hashar: [C: 04-1] "WIP / untested." 
[puppet] - 10https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: 10Hashar) [16:06:57] (03PS7) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [16:07:25] (03CR) 10jerkins-bot: [V: 04-1] role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [16:09:23] 10Operations, 10Pybal, 10Traffic, 10fundraising-tech-ops: pybal vs firewall failover - BGP session down - https://phabricator.wikimedia.org/T173028#3624822 (10ema) p:05Triage>03Normal [16:10:12] (03PS8) 10Elukey: role::analytics_cluster::druid::worker: introduce tlsproxy for druid [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) [16:11:48] 10Operations, 10Pybal, 10Traffic, 10fundraising-tech-ops: pybal vs firewall failover - BGP session down - https://phabricator.wikimedia.org/T173028#3624832 (10ema) I've just seen this bug while testing the BGP interactions between pybal-test2001 and quagga, the assertion can be removed as being in ST_IDLE... [16:14:27] (03CR) 10Hashar: [C: 04-1] "Puppet compile https://puppet-compiler.wmflabs.org/compiler02/7974/ shows just a dependency fix for puppet." [puppet] - 10https://gerrit.wikimedia.org/r/379556 (https://phabricator.wikimedia.org/T176267) (owner: 10Hashar) [16:16:40] 10Operations, 10Phabricator, 10Release-Engineering-Team: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3624856 (10Paladox) it does work, it seems. It is starting on port 22280 see /srv/phab/aphlict/config.json root@phabricator:/home/paladox# telnet l... 
[16:17:39] (03PS1) 10Cmjohnson: Removing remaining dns entries for decom'd host berrylium T147934 [dns] - 10https://gerrit.wikimedia.org/r/379558 [16:17:59] (03PS1) 10Elukey: network::constants: add aqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/379559 (https://phabricator.wikimedia.org/T176223) [16:18:25] (03PS1) 10Ema: bgp: use util.log instead of print [debs/pybal] - 10https://gerrit.wikimedia.org/r/379561 [16:19:03] (03CR) 10Cmjohnson: [C: 032] Removing remaining dns entries for decom'd host berrylium T147934 [dns] - 10https://gerrit.wikimedia.org/r/379558 (owner: 10Cmjohnson) [16:19:07] (03CR) 10Elukey: "the AQS_HOSTS constants needs to become available first (with https://gerrit.wikimedia.org/r/#/c/379559)" [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [16:19:34] 10Operations, 10ops-eqiad, 10Patch-For-Review: decommission beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T147934#3624877 (10Cmjohnson) 05Open>03Resolved Removed from rack, wiped, racktables updated, dns removed. [16:21:05] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3624882 (10Cmjohnson) [16:21:06] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Joe: Decom mw1170-mw1179, and replace them with new systems. - https://phabricator.wikimedia.org/T167130#3624885 (10Cmjohnson) [16:21:09] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review, 10User-Joe: Decommission mw1170-mw1179 - https://phabricator.wikimedia.org/T168271#3359900 (10Cmjohnson) 05Open>03Resolved all steps have been completed. [16:21:37] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3624890 (10Cmjohnson) @jcrespo anything else with this? 
Feel free to resolve if an issue comes back please re-open [16:22:08] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3624891 (10Anomie) >>! In T176335#3624317, @Gehel wrote: > API Feature logs are sent to the cirrus... [16:23:02] 10Operations, 10Analytics, 10Analytics-Cluster, 10Patch-For-Review, 10User-Elukey: rack/setup/install new kafka nodes kafka-jumbo100[1-6] - https://phabricator.wikimedia.org/T167992#3624895 (10Cmjohnson) [16:24:58] (03Draft1) 10Paladox: Phabricator: Fix aphlict systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) [16:25:02] (03PS2) 10Paladox: Phabricator: Fix aphlict systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) [16:25:38] (03PS1) 10Ema: bgp: FSM can be in states != ST_IDLE when the connection is closed [debs/pybal] - 10https://gerrit.wikimedia.org/r/379563 (https://phabricator.wikimedia.org/T173028) [16:27:54] (03CR) 10Ema: [C: 032] bgp: use util.log instead of print [debs/pybal] - 10https://gerrit.wikimedia.org/r/379561 (owner: 10Ema) [16:27:56] (03PS3) 10Paladox: Phabricator: Fix aphlict systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) [16:28:00] (03PS1) 10Ema: bgp: use util.log instead of print [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/379565 [16:28:11] twentyafterfour ^^ [16:29:17] (03PS4) 10Paladox: Phabricator: Fix aphlict systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) [16:30:18] (03CR) 1020after4: [C: 031] Phabricator: Fix aphlict systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:31:31] (03CR) 1020after4: [C: 031] "do we even need to 
support upstart anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:31:40] (03PS1) 10Giuseppe Lavagetto: Convert to use of the future parser by default [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/379569 (https://phabricator.wikimedia.org/T171704) [16:31:57] (03CR) 10Paladox: "> do we even need to support upstart anymore?" [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:33:03] (03CR) 10Paladox: [C: 031] "Tested locally and works" [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:33:08] hey - is anyone doing puppet SWAT today? [16:33:30] (03CR) 10Ema: [V: 032 C: 032] bgp: use util.log instead of print [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/379565 (owner: 10Ema) [16:33:37] PROBLEM - Host db1023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:33:42] (03PS1) 10Ema: bgp: FSM can be in states != ST_IDLE when the connection is closed [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/379570 (https://phabricator.wikimedia.org/T173028) [16:34:21] 10Operations, 10ops-ulsfo, 10Traffic: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3597148 (10RobH) Please note this host is still pooled and active, and will need to be depooled before it is taken offline for dimm replacement. [16:35:04] hey [16:35:07] ema you sneaky bastard [16:35:14] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3623717 (10mmodell) Indeed, it shouldn't be enabled or alerting. Hmm. [16:35:15] i -1'd that change earlier [16:35:38] something about bgp.py not depending on pybal infra ;) [16:35:46] oh! 
[16:35:50] oh well [16:35:58] it'll be a quick revert when needed (probably never :) [16:36:40] sorry about that, I genuinely forgot [16:36:46] no worries ;) [16:37:15] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3624933 (10Paladox) i think systemd sent the alert, per recovery at 8:35am this morning [08:35:48] <+icinga-wm> RECOVERY - C... [16:37:53] * mark is eager to get 1.14 deployed [16:41:06] 10Operations, 10ops-ulsfo, 10Traffic: cp4024 kernel errors - https://phabricator.wikimedia.org/T174891#3576064 (10RobH) Dell has preboot diagnostics as an option (epsa) and then the code translation at https://www.dell.com/support/home/us/en/4/pre-boot-analysis When I'm onsite next Monday, I'll reboot this... [16:43:25] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3624959 (10ayounsi) [16:43:52] (03CR) 10Thcipriani: [C: 031] Remove obsolete sudo group [puppet] - 10https://gerrit.wikimedia.org/r/379490 (owner: 10Muehlenhoff) [16:44:19] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3624976 (10mmodell) Is there a way to have a systemd unit installed but not auto-started/monitored/expected? That'd be ideal f... [16:48:04] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3624986 (10Paladox) @mmodell yep, i think it's because we are using base::service, so i guess lets just say for it to not run.... 
[16:48:56] (03CR) 10Ema: [C: 032] bgp: FSM can be in states != ST_IDLE when the connection is closed [debs/pybal] - 10https://gerrit.wikimedia.org/r/379563 (https://phabricator.wikimedia.org/T173028) (owner: 10Ema) [16:51:04] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "Please answer my questions, and I think some details might need further fixing, but this seems like a good start." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:51:48] (03CR) 10Paladox: [C: 031] ">" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:51:55] (03CR) 10BBlack: [C: 032] maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/379530 (https://phabricator.wikimedia.org/T162362) (owner: 10Gehel) [16:52:00] (03PS2) 10BBlack: maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/379530 (https://phabricator.wikimedia.org/T162362) (owner: 10Gehel) [16:52:02] (03CR) 10BBlack: [V: 032 C: 032] maps: active/active public interface [puppet] - 10https://gerrit.wikimedia.org/r/379530 (https://phabricator.wikimedia.org/T162362) (owner: 10Gehel) [16:53:33] (03CR) 1020after4: [C: 031] Phabricator: Fix aphlict systemd script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:54:26] PROBLEM - Check systemd state on phab1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[16:54:55] (03CR) 10Ottomata: role::analytics_cluster::druid::worker: introduce tlsproxy for druid (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379533 (https://phabricator.wikimedia.org/T176223) (owner: 10Elukey) [16:54:58] (03CR) 10Paladox: [C: 031] Phabricator: Fix aphlict systemd script (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:55:19] (03PS5) 10Paladox: Phabricator: Fix aphlict systemd script [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) [16:56:23] 10Operations, 10ops-ulsfo, 10Traffic: cp4021 memory hardware issue - DIMM B1 - https://phabricator.wikimedia.org/T175585#3625017 (10RobH) Dell service request: SR954179119. Replacement dimm should arrive either Friday or Monday. I'll be onsite Monday to replace the defective dimm (after depooling the serve... [16:57:33] (03CR) 1020after4: [C: 031] Phabricator: Fix aphlict systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [16:58:06] (03CR) 10Paladox: [C: 031] Phabricator: Fix aphlict systemd script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: 10Paladox) [17:00:04] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear deployers, time to do the Services – Graphoid / Parsoid / OCG / Citoid / ORES deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1700). [17:00:08] No patches in the queue for this window. Wheeee! 
[17:00:19] no parsoid deploy today [17:00:20] Nothing for ORES today [17:01:51] (03CR) 10jerkins-bot: [V: 04-1] Convert to use of the future parser by default [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/379569 (https://phabricator.wikimedia.org/T171704) (owner: 10Giuseppe Lavagetto) [17:02:03] (03CR) 10Thcipriani: [C: 031] Stop including trebuchet class from base profile [puppet] - 10https://gerrit.wikimedia.org/r/379487 (owner: 10Muehlenhoff) [17:02:06] (03CR) 10Thcipriani: [C: 031] Remove obsolete trebuchet class [puppet] - 10https://gerrit.wikimedia.org/r/379488 (owner: 10Muehlenhoff) [17:02:46] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:07:20] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3621843 (10dcausse) >>! In T176335#3624891, @Anomie wrote: >>>! In T176335#3624317, @Gehel wrote: >... 
[17:09:47] PROBLEM - Host db1022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:10:51] (03PS2) 10Muehlenhoff: Remove obsolete sudo group [puppet] - 10https://gerrit.wikimedia.org/r/379490 [17:10:54] (03PS1) 10Andrew Bogott: codfw labs: define profile::openstack::main::nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/379574 [17:11:08] (03PS2) 10Andrew Bogott: codfw labs: define profile::openstack::main::nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/379574 [17:11:35] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete sudo group [puppet] - 10https://gerrit.wikimedia.org/r/379490 (owner: 10Muehlenhoff) [17:12:05] (03PS3) 10Andrew Bogott: codfw labs: define profile::openstack::main::nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/379574 [17:12:50] (03CR) 10Andrew Bogott: [C: 032] codfw labs: define profile::openstack::main::nova_controller [puppet] - 10https://gerrit.wikimedia.org/r/379574 (owner: 10Andrew Bogott) [17:14:49] (03PS2) 10Muehlenhoff: Stop including trebuchet class from base profile [puppet] - 10https://gerrit.wikimedia.org/r/379487 [17:15:06] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3625095 (10Gehel) Pull request opened on tilerator to upgrade it to nodejs 6.11: https://github.com/kartotherian/tilerator/pull/12 Same for kartotherian: https://... 
[17:15:45] (03CR) 10Muehlenhoff: [C: 032] Stop including trebuchet class from base profile [puppet] - 10https://gerrit.wikimedia.org/r/379487 (owner: 10Muehlenhoff) [17:18:22] (03PS2) 10Muehlenhoff: Remove obsolete trebuchet class [puppet] - 10https://gerrit.wikimedia.org/r/379488 [17:19:32] (03CR) 10Muehlenhoff: [C: 032] Remove obsolete trebuchet class [puppet] - 10https://gerrit.wikimedia.org/r/379488 (owner: 10Muehlenhoff) [17:22:38] (03PS1) 10Andrew Bogott: labtest: add some more hiera 'main' defaults to get cumin working [puppet] - 10https://gerrit.wikimedia.org/r/379578 [17:22:42] (03PS1) 10Muehlenhoff: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/379579 [17:22:50] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3625139 (10Gehel) It seems to be possible to add a second output to codfw, this requires some minor... [17:23:13] (03CR) 10Andrew Bogott: [C: 032] labtest: add some more hiera 'main' defaults to get cumin working [puppet] - 10https://gerrit.wikimedia.org/r/379578 (owner: 10Andrew Bogott) [17:23:31] (03CR) 10Muehlenhoff: [C: 032] Update comment [puppet] - 10https://gerrit.wikimedia.org/r/379579 (owner: 10Muehlenhoff) [17:23:39] (03PS2) 10Muehlenhoff: Update comment [puppet] - 10https://gerrit.wikimedia.org/r/379579 [17:24:43] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, 10Wikimedia-Logstash: api feature logs should be sent to both eqiad and codfw clusters - https://phabricator.wikimedia.org/T176430#3625153 (10Gehel) [17:25:11] (03PS1) 10Bmansurov: Enable Print instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379580 (https://phabricator.wikimedia.org/T176341) [17:25:33] (03CR) 10Thcipriani: "hrm. ocg::init and ocg::decommission still use this provider. 
The puppet compiler doesn't have an error (https://puppet-compiler.wmflabs.o" [puppet] - 10https://gerrit.wikimedia.org/r/379486 (owner: 10Muehlenhoff) [17:26:34] (03CR) 10Muehlenhoff: "Ok, we can simply withhold merging this patch until ocg is removed after the 1st of October." [puppet] - 10https://gerrit.wikimedia.org/r/379486 (owner: 10Muehlenhoff) [17:27:24] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3625175 (10MoritzMuehlenhoff) [17:31:06] RECOVERY - puppet last run on wtp1036 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [17:31:12] 10Operations, 10netops: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#3625182 (10ayounsi) Created the following ROAs via RIPE's website for our two least used prefixes: |AS number|Prefix|Up to| |AS43821|2a02:ec80::/29|48| |AS14907|2a02:ec80::/29|48| |AS43821|18... [17:40:12] (03CR) 10Pmiazga: [C: 031] "Looks ok" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379580 (https://phabricator.wikimedia.org/T176341) (owner: 10Bmansurov) [17:46:52] (03PS1) 10Andrew Bogott: labtest: try to standardize on labtest-puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/379584 [17:47:03] (03PS2) 10Andrew Bogott: labtest: try to standardize on labtest-puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/379584 [17:48:04] (03CR) 10Andrew Bogott: [C: 032] labtest: try to standardize on labtest-puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/379584 (owner: 10Andrew Bogott) [17:52:47] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 4969 mails in exim queue. 
[17:54:42] thats alot of mail [17:59:49] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: api feature logs should be sent to both eqiad and codfw clusters - https://phabricator.wikimedia.org/T176430#3625220 (10greg) [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Morning SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1800). [18:00:04] bmansurov: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. [18:00:18] here [18:00:45] jouncebot: Niharika they only get a sticker/t-shirt if they fix it [18:01:06] anyone can break, breaking and fixing gets a shirt [18:01:45] i always hear we are out of shirts. It must be too common :P [18:02:29] I can SWAT [18:02:33] I forget where the box is, but I think it's moving to 1Mont :)) [18:02:36] I would like to add two patches but one of them isn't merged yet [18:03:32] i have one as well thats currently in gate-and-submit [18:03:49] Mine just merged yay [18:03:53] hrm, jenkins is not going to make this easy :\ [18:04:08] Yeah [18:04:20] If only mediawiki-config patches went into the gate-and-submit-swat queue [18:04:29] anyway, RoanKattouw ebernhardson feel free to add patches for swat and we'll try to get them out. [18:04:40] Or, you know, Jenkins had more than 3(!) 
workers running [18:04:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379580 (https://phabricator.wikimedia.org/T176341) (owner: 10Bmansurov) [18:05:04] thcipriani: mine finished gate and submit, cherry picked to wmf.19 and added to calendar [18:05:13] cool, thanks :) [18:08:41] (03CR) 10Jcrespo: "I am ok with this, I am not voting +1 because I do not know if it is missing some or others do not agree with it." [puppet] - 10https://gerrit.wikimedia.org/r/377823 (https://phabricator.wikimedia.org/T165348) (owner: 10Dzahn) [18:10:04] FWIW, gate-and-submit-swat pipeline has the same priority as gate-and-submit pipeline (there are only 3 levels of priority). The benefit of MW core and extensions going through a different pipeline is they don't end up stuck in a queue behind master changes. Since there is only a master branch of mw-config, that's not a concern for that repo. Moving the mw-config repo to a different pipeline [18:10:05] RoanKattouw: deletes of nodepool instances take minutes due to our API rate limit :( But yeah. [18:10:07] would likely make little difference in the speed that patches are merged to it, but it's super important for mw-core and extensions time-to-merge. 
[18:10:15] :) [18:11:31] (03PS1) 10EBernhardson: Stop injecting search relevance survey data into pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379591 (https://phabricator.wikimedia.org/T175047) [18:11:46] but yeah, the rate at which workers are provisioned, used, and deleted and the number of those workers is bad :( [18:11:46] thcipriani: OK I added mine to the wiki page, thanks for the flexibility [18:12:00] ah, cool, thanks [18:17:10] (03CR) 10Zoranzoki21: [C: 031] Stop injecting search relevance survey data into pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379591 (https://phabricator.wikimedia.org/T175047) (owner: 10EBernhardson) [18:17:56] (03Merged) 10jenkins-bot: Enable Print instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379580 (https://phabricator.wikimedia.org/T176341) (owner: 10Bmansurov) [18:18:07] (03CR) 10jenkins-bot: Enable Print instrumentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379580 (https://phabricator.wikimedia.org/T176341) (owner: 10Bmansurov) [18:18:27] hey, could I rename an account with 90000 edits on enwiki now? :-) [18:18:51] bmansurov: your change is live on mwdebug1002, check please [18:19:02] ok [18:21:50] thcipriani, I don't see the change [18:22:18] * thcipriani doublechecks sync [18:22:31] robh: is it realistically going to make any difference whether eqsin power/rack start date is 10/16 or 11/1? [18:22:46] I doubt even if we get things signed on monday, we can get equipment to arrive before 11/1 that matters right? [18:23:07] every equipment order other than what is already statged at eqiad is likely a month out yeah [18:23:11] so may as well make it 11/1 [18:23:14] ajr: username? [18:23:31] bmansurov: grep wgWMEPrintEnabled /srv/mediawiki/wmf-config/InitialiseSettings.php shows that the code made it to mwdebug1002. 
[18:23:33] but we can ship items to eqsin as soon as we sign i assume and maybe just pay some storage fees =] [18:23:42] bblack: just in case any orders arrive quickly [18:23:44] thcipriani, ok, double checking [18:23:48] thanks :) [18:23:58] but our network order is placed, and then aggregated with our VAR here in the US before shipment to eqsin [18:24:07] and dell singapore had an eta of 30+ days on the last quotes [18:24:10] due to ssd stuff [18:24:46] legoktm, RadioRan [18:24:48] bblack: so yeah, 11/1 is likely fine. If we have items arrive earlier, Equinix just has a storage policy so we may have to pay for a week of storage in their shipping. [18:24:49] RadioFan* [18:25:08] and we wont ship out our pre-staged eqiad shipment to eqsin until we're ready for it to arrive. [18:25:21] ok [18:25:35] I mean if it's easy I don't care and we'll bump to 10/16 and get things rolling faster [18:25:51] ajr: go for it [18:25:53] but if it means another round-trip through some bullshit paperwork and signing, it might be easier to stick to the 11/1 in the current quote [18:26:18] yeah [18:26:36] thanks legoktm [18:28:02] ok in progress [18:28:07] thcipriani, I see the change now. Accompanying change has only been deployed to some wikis. [18:28:22] thcipriani, thanks! [18:28:49] bmansurov: is the InitialiseSettings.php change ok to go out all servers? [18:28:56] thcipriani, yes [18:29:04] great, thanks :) [18:29:09] * thcipriani deploys [18:31:14] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:379580|Enable Print instrumentation]] T176341 (duration: 00m 52s) [18:31:22] ^ bmansurov should be live everywhere now [18:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:33] thcipriani, great, thanks! 
[18:31:33] T176341: Deploy print styles instrumentation - https://phabricator.wikimedia.org/T176341 [18:32:19] ebernhardson: your change is on mwdebug1002, check please [18:32:42] thcipriani: looks sane [18:32:50] ebernhardson: ok, going live [18:34:47] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/extensions/WikimediaEvents/modules/ext.wikimediaEvents.searchSatisfaction.js: SWAT: [[gerrit:379587|Turning off Explore Similar AB test]] T175649 (duration: 00m 49s) [18:34:54] ^ ebernhardson live now [18:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:03] T175649: Turn off test for language links - https://phabricator.wikimedia.org/T175649 [18:35:58] RoanKattouw: both your changes should be live on mwdebug1002, check please [18:36:06] Looking [18:39:06] 10Operations, 10monitoring, 10Patch-For-Review: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#3625304 (10herron) Check_ipmi_sensor is showing failures on 3 out of 4 of the Dell PowerEdge R620 class systems that UnitedLayer recently reported a... 
[18:40:44] thcipriani: Working [18:40:53] ok, going live [18:40:58] Not fully on some wikis because of local customizations but I'll have to deal with that later [18:41:58] !log T171772: Restarting Cassandra restbase-dev1004-a to apply locally hacked Prometheus exporter config [18:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:11] T171772: Prometheus metrics storage for RESTBase dev environment - https://phabricator.wikimedia.org/T171772 [18:44:19] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/resources/src/mediawiki.rcfilters: SWAT: [[gerrit:379588|WLFilters: Do not hide .watchlistDetails while loading]] T176300 [[gerrit:379589|RCFilters: Make the interface not jump around while loading]] (duration: 00m 49s) [18:44:25] ^ RoanKattouw should be all live now [18:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:33] T176300: Fix page reflow on Watchlist and Recent Changes - https://phabricator.wikimedia.org/T176300 [18:50:08] Thanks! [18:50:42] yw :) [18:51:25] (03CR) 10Zoranzoki21: [C: 04-1] "User can enable it in settings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374328 (https://phabricator.wikimedia.org/T174345) (owner: 10Urbanecm) [18:56:36] !log T171772: Applying locally hacked Prometheus exporter config to RESTBase dev Cassandra instances [18:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:54] T171772: Prometheus metrics storage for RESTBase dev environment - https://phabricator.wikimedia.org/T171772 [18:57:49] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3549423 (10Daimona) Today on it.wiki I noticed a massive increase in search results for some queries related to errors that I'm currently trying to fix.... 
[18:58:34] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 12 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#3625335 (10dr0ptp4kt) Note to future selves: we'd probably want the filename to contain the human readable File: name so people and machin... [19:00:04] no_justification: (Dis)respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T1900). Please do the needful. [19:00:05] No patches in the queue for this window. Wheeee! [19:00:49] jouncebot: Learn about patches numb nuts [19:01:01] (03CR) 10Chad: [C: 032] group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379555 (owner: 10Chad) [19:02:11] (03PS1) 10Herron: Lists: Remove message from HELO checks with warning action [puppet] - 10https://gerrit.wikimedia.org/r/379595 (https://phabricator.wikimedia.org/T173338) [19:03:01] (03Merged) 10jenkins-bot: group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379555 (owner: 10Chad) [19:03:14] (03CR) 10jenkins-bot: group2 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379555 (owner: 10Chad) [19:03:36] (03CR) 10Herron: [C: 032] Lists: Remove message from HELO checks with warning action [puppet] - 10https://gerrit.wikimedia.org/r/379595 (https://phabricator.wikimedia.org/T173338) (owner: 10Herron) [19:04:42] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group2 to wmf.19 [19:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:32] 10Operations, 10CirrusSearch, 10Discovery, 10Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3625344 (10EBernhardson) >>! In T173710#3625333, @Daimona wrote: > Today on it.wiki I noticed a massive increase in search results for some queries relat... 
[19:08:01] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3625349 (10Volans) [19:16:02] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp4026.* [19:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:27] !log upgrade to nodejs 6.11 on maps servers (including restart of tilerator / kartotherian) - T171707 [19:16:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:40] T171707: Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707 [19:20:50] !log demon@tin Synchronized php-1.30.0-wmf.19/extensions/VisualEditor/ApiVisualEditor.php: fix user/wgUser mixup (duration: 00m 46s) [19:21:04] bd808: Any chance you know what I'm doing wrong in https://gerrit.wikimedia.org/r/#/c/379366/? [19:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:30] logstash => false is what I remember the magic being... [19:23:00] Oh, derp [19:23:04] upd2log [19:23:06] Typo [19:23:25] heh [19:23:40] (03PS2) 10Chad: Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 [19:23:55] I'd bet a lot of the csp reports are garbage [19:24:23] hey look, a test that caught a proper bug! :) [19:24:39] I have that on for Striker and had to add some filters that throw out reports from firefox plugins [19:24:55] ebernhardson: :) thank you [19:25:12] Not so much garbage as not actionable right now. Talked to bawolff about it yesterday [19:25:15] Low on his priority list [19:26:06] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3473775 (10Gehel) Nodejs 6.11 is deployed on all maps servers! Services have been restarted. 
The minor cleanup of updating the package.json file is in progress, b... [19:26:37] 10Operations, 10Discovery, 10Maps-Sprint, 10Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3625373 (10Gehel) [19:27:37] 10Operations, 10Discovery, 10Maps-Sprint, 10Maps (Kartographer), and 2 others: nodejs 6.11 - https://phabricator.wikimedia.org/T170548#3434983 (10Gehel) maps is finally upgraded to nodejs 6.11. @MoritzMuehlenhoff: according to this ticket, aqs still needs to be done, I'll let you check that and close when... [19:27:48] (03PS3) 10Chad: Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 [19:27:52] (03CR) 10Chad: [C: 032] Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 (owner: 10Chad) [19:30:41] (03Merged) 10jenkins-bot: Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 (owner: 10Chad) [19:30:54] (03CR) 10jenkins-bot: Stop sending CSP reports to logstash for now, spams my graphs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379366 (owner: 10Chad) [19:32:55] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: adjust logging on csp reports (duration: 00m 46s) [19:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:43] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: api feature logs should be sent to both eqiad and codfw clusters - https://phabricator.wikimedia.org/T176430#3625390 (10Bnhassin) [19:52:13] 10Operations, 10Discovery, 10Elasticsearch, 10Wikimedia-Logstash, and 2 others: logs sent to logstash are lost when the elasticsearch cirrus cluster is unavailable - https://phabricator.wikimedia.org/T176335#3625410 (10Anomie) [19:52:15] 10Operations, 10Discovery, 10Discovery-Search, 10Elasticsearch, and 2 others: api feature 
logs should be sent to both eqiad and codfw clusters - https://phabricator.wikimedia.org/T176430#3625409 (Anomie) duplicate>Open
[20:03:06] (PS2) Ladsgroup: Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042)
[20:09:36] (CR) Ladsgroup: Add config for amwikimedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:09:43] (CR) Ladsgroup: "Done" [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:10:08] (PS2) Ladsgroup: Add config for amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042)
[20:10:25] (PS1) Eevans: Configure agent to export Cassandra histogram metrics [puppet] - https://gerrit.wikimedia.org/r/379610 (https://phabricator.wikimedia.org/T171772)
[20:11:26] (CR) Zoranzoki21: [C: 1] Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:11:28] Operations, ops-codfw, fundraising-tech-ops, hardware-requests, netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625467 (RobH)
[20:12:33] (PS2) Zoranzoki21: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:12:41] (CR) jerkins-bot: [V: -1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:12:47] (CR) Zoranzoki21: "Removing Cannot merge blabla" [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:13:09] (PS3)
Zoranzoki21: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:13:17] (CR) jerkins-bot: [V: -1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:13:31] (CR) Zoranzoki21: [C: 1] Add config for amwikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:13:44] (CR) Zoranzoki21: [C: 1] "No effect. Sorry" [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:14:14] (CR) Zoranzoki21: [C: 1] "Rollback please on patch set 1" [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:16:57] Operations, ops-codfw, fundraising-tech-ops, hardware-requests, netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625517 (ayounsi) a:ayounsi>Papaul thanks Rob!
[20:17:44] (CR) Eevans: "[PC output](http://puppet-compiler.wmflabs.org/7976)" [puppet] - https://gerrit.wikimedia.org/r/379610 (https://phabricator.wikimedia.org/T171772) (owner: Eevans)
[20:19:05] Operations, cloud-services-team (Kanban): puppet ca_server confusion - https://phabricator.wikimedia.org/T176437#3625520 (Andrew)
[20:20:35] (PS4) Zoranzoki21: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:20:48] (CR) jerkins-bot: [V: -1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:21:52] Operations, ops-codfw, fundraising-tech-ops, hardware-requests, netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625550 (ayounsi)
[20:22:50] (PS5) Zoranzoki21: Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:23:02] (CR) jerkins-bot: [V: -1] Add amwikimedia to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:26:04] Operations, ops-codfw, fundraising-tech-ops, hardware-requests, netops: unrack/decom pfw1-codfw and pfw2-codfw - https://phabricator.wikimedia.org/T176427#3625559 (ayounsi)
[20:28:21] (CR) 20after4: [C: 1] Phabricator: Fix aphlict systemd script [puppet] - https://gerrit.wikimedia.org/r/379560 (https://phabricator.wikimedia.org/T176392) (owner: Paladox)
[20:29:19] (CR) Ladsgroup: "I will fix this, don't worry."
[mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:29:26] Operations, Phabricator, Release-Engineering-Team, Patch-For-Review: The aphlict systemd unit needs to be rewritten from scratch - https://phabricator.wikimedia.org/T176392#3625569 (mmodell)
[20:34:20] (CR) Zoranzoki21: "Ok" [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:34:55] (CR) Zoranzoki21: [C: 1] "I put +1 because is all ok.. Ignore jenkins if he put -1 for this patch." [mediawiki-config] - https://gerrit.wikimedia.org/r/378403 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup)
[20:37:47] (PS1) Volans: Upstream release 1.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/379626
[20:42:28] (Abandoned) Volans: Upstream release 1.1.0 [software/cumin] - https://gerrit.wikimedia.org/r/379626 (owner: Volans)
[20:43:44] (PS1) Volans: Upstream release 1.1.0 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/379638
[20:49:10] Operations, ops-codfw, fundraising-tech-ops, netops: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3625620 (Jgreen) @ayounsi ports are connected! When you have some time let's start setting them up?
[21:01:02] (PS3) MacFan4000: Update ExtensionDistributer settings [mediawiki-config] - https://gerrit.wikimedia.org/r/379524
[21:01:20] (Draft2) Zoranzoki21: Fix Lift IP cap for account creation for John Michael Kohler Art Center - Thur Sept 21, Sun Sept 24 & Tues Sept 26 throttle problem..
[mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287)
[21:01:42] (CR) Paladox: [C: 1] Update ExtensionDistributer settings [mediawiki-config] - https://gerrit.wikimedia.org/r/379524 (owner: MacFan4000)
[21:03:40] (PS4) Chad: Update ExtensionDistributer settings [mediawiki-config] - https://gerrit.wikimedia.org/r/379524 (owner: MacFan4000)
[21:04:02] (CR) Chad: [C: 2] Update ExtensionDistributer settings [mediawiki-config] - https://gerrit.wikimedia.org/r/379524 (owner: MacFan4000)
[21:05:53] (PS3) Zoranzoki21: Fix Lift IP cap for account creation for John Michael Kohler Art Center throttle problem.. [mediawiki-config] - https://gerrit.wikimedia.org/r/379661 (https://phabricator.wikimedia.org/T176287)
[21:06:06] (Merged) jenkins-bot: Update ExtensionDistributer settings [mediawiki-config] - https://gerrit.wikimedia.org/r/379524 (owner: MacFan4000)
[21:07:43] !log demon@tin Synchronized wmf-config/CommonSettings.php: extdist settings (duration: 00m 46s)
[21:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:08:28] (CR) jenkins-bot: Update ExtensionDistributer settings [mediawiki-config] - https://gerrit.wikimedia.org/r/379524 (owner: MacFan4000)
[21:19:45] (PS2) Volans: Upstream release 1.1.0 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/379638
[21:26:47] (PS1) ArielGlenn: Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528)
[21:27:26] (CR) jerkins-bot: [V: -1] Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528) (owner: ArielGlenn)
[21:31:22] (PS2) ArielGlenn: Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528)
[21:31:57] (CR) jerkins-bot: [V: -1] Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528) (owner: ArielGlenn)
[21:40:19] (PS3) ArielGlenn: Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528)
[21:45:19] (PS1) Volans: wmf-auto-reimage: minor logging improvement [puppet] - https://gerrit.wikimedia.org/r/379672 (https://phabricator.wikimedia.org/T148814)
[21:45:27] (PS4) ArielGlenn: Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528)
[21:46:44] (CR) Volans: [C: 2] Upstream release 1.1.0 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/379638 (owner: Volans)
[21:47:06] (PS2) Volans: wmf-auto-reimage: minor logging improvement [puppet] - https://gerrit.wikimedia.org/r/379672 (https://phabricator.wikimedia.org/T148814)
[21:50:04] (Merged) jenkins-bot: Upstream release 1.1.0 [software/cumin] (debian) - https://gerrit.wikimedia.org/r/379638 (owner: Volans)
[21:52:14] (PS5) ArielGlenn: Move dataset rsync config manifests to dumps module [puppet] - https://gerrit.wikimedia.org/r/379668 (https://phabricator.wikimedia.org/T175528)
[21:53:03] (CR) Volans: [C: 2] wmf-auto-reimage: minor logging improvement [puppet] - https://gerrit.wikimedia.org/r/379672 (https://phabricator.wikimedia.org/T148814) (owner: Volans)
[22:12:48] !log uploaded cumin_1.1.0-1_amd64.deb to apt.wikimedia.org jessie-wikimedia
[22:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:07] (PS1) Volans: WMCS Cumin: use openstack as default backend [puppet] - https://gerrit.wikimedia.org/r/379682 (https://phabricator.wikimedia.org/T175711)
[22:42:16] (CR) Volans: [C: 2] WMCS Cumin: use openstack as default backend
[puppet] - 10https://gerrit.wikimedia.org/r/379682 (https://phabricator.wikimedia.org/T175711) (owner: 10Volans) [22:43:10] (03PS1) 10Jcrespo: mariadb: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379684 [22:43:16] if there is mediawiki deployments ongoing, plase pause them [22:44:26] jouncebot: next [22:44:26] In 0 hour(s) and 15 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T2300) [22:45:00] should be quiet until SWAT jynus [22:45:41] but that's not long, so when the bot shouts we can tell people to hold off while you do the depool [22:45:41] mmm [22:45:44] it came back [22:45:48] network issue? [22:46:11] hmmm? [22:46:43] C2 server, XioNoX something ongoing or anything on your monitoring? [22:47:11] jynus: nothing that I'm aware off, let me look at monitoring [22:47:40] I will check the software meanwhile [22:48:29] jynus: what's "c2 server" ? [22:48:46] C2 eqiad rack position [22:49:06] do not know the correspondance with switches [22:49:16] anyway, everything quiet on the network monitoring side [22:50:03] I agree, this looks like a software overload [22:50:43] and the other server looks sane [22:50:48] only that the interface to cp4025 flapped a few minutes ago [22:51:02] bblack, robh ^ ? [22:51:16] nah, this is eqiad only [22:51:32] ? [22:52:02] I'm not sure what the question is? db1055 went down and you are asking if the rack is down? 
[22:52:13] if so we'd have a LOT of pings right now [22:52:20] I will depool the server anyway and investigate tomorrow [22:52:24] robh: unrelated to DB, just saw that the interface to cp4025 flapped a few minutes ago [22:53:13] XioNoX: ul may be in our rack putting in patch panel ports for the new office connection [22:53:19] i dunno when they are gonna do it [22:53:25] but i'd think we'dget a power flap not network [22:53:46] network ther is all fiber optic patches so it'll either work or break entirely kinda deal i'd think, unless its a hairline break in the fiber [22:53:54] 10Operations, 10Traffic, 10Wikidata, 10wikiba.se, 10Wikidata-Sprint-2016-11-08: [Task] move wikiba.se webhosting to wikimedia misc-cluster - https://phabricator.wikimedia.org/T99531#3625942 (10Reedy) [22:53:57] and movemnet of the fibers causes it to flex, but that seems unlikely... [22:54:01] (it is possiblethough) [22:54:16] in particular since those cabinet doors smash against the fiber bundle [22:54:37] XioNoX: I'd make a note someplace on a trackign ticket about it and if it happens again then we have an issue? [22:54:53] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379684 (owner: 10Jcrespo) [22:54:56] I cannot wait for the cp refresh to be done so we can offline the entire site and re-cable it all. [22:55:45] jynus: So I'm unaware of any other issues in c3-eqiad. Sorry if that isn't helpful =[ [22:56:08] i dont see it being down unhandled in icinga though [22:56:18] I said this was a software issue [22:56:23] oh, ok, cool [22:56:25] sorry! =] [22:56:27] it just looked like network at first [22:56:42] XioNoX: So yeah, it flapped once like real fast or hard down? 
[22:56:44] so I asked XioNoX if he had seen lost packets [22:56:53] XioNoX: im asking about cp system [22:57:11] sorry, i saw a backlog and was confused on what i was being pinged on heh [22:57:36] so there is this ongoing issue with db1055, but then he changed topic :-) [22:57:45] (03Merged) 10jenkins-bot: mariadb: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379684 (owner: 10Jcrespo) [22:57:55] (03CR) 10jenkins-bot: mariadb: Depool db1055 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379684 (owner: 10Jcrespo) [22:57:59] I am going to depool db1055 and downtime it so it doesn't generate a page for you people [22:58:03] 10Operations, 10netops: Merge AS14907 with AS43821 - https://phabricator.wikimedia.org/T167840#3625949 (10ayounsi) [22:59:23] yeah, I confused people, as when I looked for issues I only found something odd with ulsfo cp, my bad [22:59:33] it happens all the time [22:59:34] robh: the cp flap was brief [22:59:51] !log jynus@tin Synchronized wmf-config/db-eqiad.php: depool db1055 (duration: 00m 46s) [22:59:53] you find the "wrong" problems when looking for other issues [22:59:56] hrmm, it could be a bad fiber patch being bumped then when they open or closed the cabinet [23:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170921T2300). [23:00:05] No patches in the queue for this window. Wheeee! 
[23:00:10] cuz we are installing the office xconnect in the next 24-72 hours [23:00:12] or unitedlayer is [23:00:35] XioNoX: i really appreciate you ahve seen that goddamn mess of cabinets so you know what we're dealing with ;D [23:00:46] hehe [23:00:46] and why im so keen to re-cable it all. [23:01:20] i dunno wtf we're going to do with the old cp systems though [23:01:33] I wouldnt want to give to oit cuz they are failing [23:01:38] throw them on the floor [23:01:48] it may be we simply pull the disks for destruction (the non ssds) and then indeed, scrap. [23:01:51] XioNoX: yes cp4025 is me, playing with ethtool parameters [23:01:53] powersupplies are dying. [23:02:08] look at this strange pattern: https://grafana.wikimedia.org/dashboard/file/server-board.json?refresh=1m&orgId=1&from=1506016108392&to=1506034860953&var-server=db1055&var-network=eth0 [23:02:21] so we'll have to hire some company to take them away, or i can rent a car and spend a day driving them to a recycling center or something. [23:02:24] there was contention for 5 minutes or so [23:02:31] servers dont fit in a miata [23:02:42] a single server does, angled in funny, in the passenger seat [23:02:46] none fit in trunk, this is experience. [23:03:01] the drop on traffic made me thing of network first [23:03:31] hmm, i must have put my name in wikitech deployments badly. 
Anyways i'll ship my patch [23:03:38] jouncebot claimed no patches [23:04:09] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3625966 (10Volans) [23:04:17] (03PS2) 10EBernhardson: Stop injecting search relevance survey data into pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/379591 (https://phabricator.wikimedia.org/T175047) [23:04:19] greg-g, no more blockers on mediawiki deployments [23:04:21] robh: It's funny, I had a van load of stuff taken away from my dads work for free [23:04:47] I just needed priority for the hardware depool due to the unknown glitch [23:04:49] jynus: thanks, godspeed :) [23:05:25] 10Operations, 10Operations-Software-Development, 10Goal, 10Patch-For-Review, 10Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3245193 (10Volans) [23:05:29] Reedy: i was one told by jens that dell germany has to come and get your old servers when you are done due to some regulations (and built into the cost of the servers) [23:05:39] most countries handle waste better than us [23:05:52] Amazon charge a waste disposal levy thing [23:06:04] Yeah, original manufacturer is responsible for the WEEE at the end in some way or another [23:06:23] greg-g, some people may be angry about the 60-second query limit, but we haven't had a query-related outage since it was setup (it used to happen weekly) [23:06:58] jynus: I am very happy about the 60s limit actually [23:07:17] RoanKattouw, cool, then I am not alone agaist a sea of angry people :-) [23:07:27] We have some query timeouts on the RC page sometimes when people select combinations of filters that are rare, especially on wikis with high Wikidata saturation [23:07:34] "Your watchlist is how big? 
Go away" [23:07:39] yes [23:07:45] Matt recently got a patch merged that makes MW's DB backend throw a different exception type for timeouts [23:07:46] although it would be nice to have it on the application [23:07:54] So we can now show a specific error message for timeouts to the user [23:07:56] and not on the database [23:08:02] Yes [23:08:09] my patch was an "angry patch" (in a good sense) [23:08:23] because I could not stand the discussions about the best way to implement it [23:08:29] Also -- 1) wow, you're up late ; 2) I just wrote https://phabricator.wikimedia.org/T171027#3625968 which has a question for you (as well as other things) [23:08:46] 3) I am supposedly on band holiday :-) [23:08:53] haha sorry [23:08:56] *bank [23:09:05] but manuel is on vacation [23:09:11] Not quick enough for an american pie joke [23:09:43] We started taking note of the RC/WL query perf issues recently, and after reading Bawolff's analysis I was stunned [23:09:54] I had no idea Wikidata was 95% of the RC table on some big wikis [23:09:55] thanks for taking that task into account [23:10:12] Or that a single WD edit could generate several million RC rows in extreme cases [23:10:19] I didn't either, bawolff detected it [23:10:44] I think a "large" RC defeats the purpose of rc table [23:11:03] I would move wikidata to a separate table rather than partitions [23:11:09] but I have not read your suggestion [23:11:14] RoanKattouw: "Undeploy wikidata" [23:11:55] My suggestion was basically, whichever you think is best between indexes, partitions or a separate table [23:12:00] well, some people are angry, but even if the response can be a bit extreme [23:12:09] I do not blame them [23:12:12] (CR) EBernhardson: [C: 2] Stop injecting search relevance survey data into pages [mediawiki-config] - https://gerrit.wikimedia.org/r/379591 (https://phabricator.wikimedia.org/T175047) (owner: EBernhardson) [23:12:36] A separate table would require software changes so I like that option a 
bit less, but we can do it if we have to [23:12:41] I (not me) contribute 100000 edits to wikipedia, and I am "rewarded" with not being able to see my watchlist [23:12:57] I understand the frustration [23:12:58] Reedy: I suggested that too :) [23:13:35] robh: Does Recology take recycling at that giant yard they have over on Tunnel Ave near Bayshore station? That's reasonably close to the DC right? [23:13:36] i still like the idea of using some sort of 'mailbox' idea for watchlists. but primarily because i don't have to think about how to do it :P [23:13:41] RoanKattouw, software changes? It is literally performing the same query twice? [23:13:52] oh, you mean on inserting [23:13:57] RoanKattouw: i have no idea where that is =] [23:14:12] looking on map [23:14:19] robh: There's a train tunnel right by the DC right? It's basically on the other side of that tunnel [23:14:44] RoanKattouw, the problem with indexes is that some previous queries were already badly optimized [23:14:47] Ive not seen the train tunnel, but I've only ever arrived/departed off the highway via paul ave exit [23:15:03] so it was assumed the table was small to make it work [23:15:15] The only time I had to take out big recycling items I went to the yard at 7th & Berry (in SOMA near the Caltrain station) but that was cardboard boxes, not servers :) [23:15:17] (Merged) jenkins-bot: Stop injecting search relevance survey data into pages [mediawiki-config] - https://gerrit.wikimedia.org/r/379591 (https://phabricator.wikimedia.org/T175047) (owner: EBernhardson) [23:15:46] (CR) VolkerE: "Who's able to merge this? The `id` attributes are of no interest for these kind of SVGs. While saving bytes is." [mediawiki-config] - https://gerrit.wikimedia.org/r/377406 (https://phabricator.wikimedia.org/T175670) (owner: VolkerE) [23:15:51] robh: Right, the tracks are kind of behind the DC I think. 
It's probably a bit farther by car than by train ;) [23:15:52] I would not solve the issue in isolation, I would have a look, if you can, at the api errors and evaluate the best way to move forward, I will have a look at it tomorrow [23:16:08] (CR) jenkins-bot: Stop injecting search relevance survey data into pages [mediawiki-config] - https://gerrit.wikimedia.org/r/379591 (https://phabricator.wikimedia.org/T175047) (owner: EBernhardson) [23:16:30] jynus: Yeah the SW changes wouldn't be too bad, insert into the right table, and query from the right table (or UNION between both) [23:16:40] hmm, how do i scap out a deleted file? [23:16:51] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: T175047: Turn off human search relevance survey (duration: 00m 46s) [23:16:57] ebernhardson: sync-dir the parent dir? [23:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:07] T175047: Search Relevance Survey test #3: turn off test - https://phabricator.wikimedia.org/T175047 [23:17:10] RoanKattouw: ohhh i see [23:17:19] its south of 200 paul on tunnel ave? [23:17:20] RoanKattouw: that ought to work. thanks [23:17:38] what exactly is this? [23:17:47] jynus: Yeah you're right about the poorly optimized queries. Segregating WD into its own table would make the non-WD 20x smaller, so maybe then the "small" assumption would hold again [23:17:48] this doesnt seem like a dump... 
[23:17:57] It's some sort of Recology facility [23:18:12] It's like on the east side of Tunnel Ave near the SF/Brisbane city limit [23:18:13] !log ebernhardson@tin Synchronized wmf-config: T175047: Turn off human search relevance survey (duration: 00m 48s) [23:18:24] I've biked past it a few times but I don't really know what kind of facility it is [23:18:26] oh, so a sorting center kinda deal [23:18:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:28] or something [23:18:29] It might not be accessible to the public [23:18:34] Yeah could be [23:18:46] yeah thats pretty close i could likely just get a zip car near it [23:18:59] park my car elsewhere and just snag systems easily enough, ill make a note to call them and find out [23:19:01] I just wondered if they let you take stuff there because it felt like it was close (but since your servers can't fly it might not be /that/ close) [23:19:02] thx for recommendation =] [23:19:15] well, its still way closer than driving elsewhere from DC [23:19:48] True [23:19:54] RoanKattouw, for context- this is not new, just got worse: https://phabricator.wikimedia.org/T101502#1341361 [23:20:01] And better than hitting Bay Bridge traffic on 101 trying to get into the city [23:20:45] jynus: Yeah I saw those, there seems to be a particular client/request/query for ns=6 and type=3 that is very slow [23:21:10] I wondered if ORDER BY rc_timestamp ASC, rc_id ASC was the problem, but if I just order by timestamp it's equally slow [23:21:48] That particular query is pretty much glued to the top of tendirl [23:21:50] *tendril [23:21:52] the thing is, I can have an opinion [23:22:09] but if the tests prove me wrong, that opinion is not worth it [23:22:20] so I would look at what people really want [23:22:32] and implement the best way that works [23:22:47] What tests are you referring to? 
[23:23:00] I thought about the separation because I was told the same people that complained about slowness [23:23:12] also considered wikidata as "noise" [23:23:19] Right, yes [23:23:24] so that (maybe) could make them happy [23:23:43] !log ebernhardson@tin Synchronized php-1.30.0-wmf.19/extensions/WikimediaEvents/extension.json: T175047: Stop delivering search relevance survey javascript (duration: 00m 46s) [23:23:48] tests as in different queries or adding indexes [23:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:00] T175047: Search Relevance Survey test #3: turn off test - https://phabricator.wikimedia.org/T175047 [23:24:10] (which is highly related to its excessive volume BTW, people think it's "noise" because many of the updates are "the description of this item in [language you don't speak] was changed from [alphabet you can't read] to [alphabet you can't read]") [23:24:11] meaning that I can say "this probably will not help" [23:24:27] but I can be objectively proven wrong with examples [23:24:34] Right, you mean like adding an index on one machine and seeing what it does? [23:25:06] I haven't evaluated the index, I think there were other suggestions before that [23:25:23] Yeah my index suggestion is a bit different, but it's basically abusing an index as a partition [23:26:31] My problem is I don't know how I would test these things outside production (which I can't do and shouldn't be allowed to do) [23:26:45] Generate a large testing data set on my localhost? Couple million rows? [23:27:04] Like, if and when you resist the urge to test in production, what do you do? 
[23:27:11] I can do that on a spare replica [23:27:40] there are also plans to have permanent production (restricted data) testing hosts [23:28:29] Ooh that would be nice [23:29:01] for backup testing and large refactoring tests (initially MCR), but no user traffic [23:29:38] devels should have root-like there, but I have to see how to implement it so they do not get confused with real hosts [23:29:58] yuge MOTY [23:29:59] *MOTD [23:30:25] yeah, but I would like that on the real ones, not on the fake ones [23:30:43] (CR) MaxSem: "It's not only about merging, it also requires deploying. See https://wikitech.wikimedia.org/wiki/SWAT_deploys for how to get your patch de" [mediawiki-config] - https://gerrit.wikimedia.org/r/377406 (https://phabricator.wikimedia.org/T175670) (owner: VolkerE) [23:31:12] anyway, that is right now just an idea, there is nothing yet firm [23:31:44] I am going to sleep, will have a look at the watchlists tomorrow and see if I can help [23:31:52] I would really love to have a test host where I can create indexes for testing. That would be great for stuff like the ORES index I proposed [23:31:55] Have a good night! [23:32:02] And thanks for all your input