[00:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: It is that lovely time of the day again! You are hereby commanded to deploy Evening SWAT (Max 8 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180302T0000).
[00:00:05] Zackary: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:01:22] i can ship these
[00:01:25] Zackary: around?
[00:02:40] (CR) EBernhardson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) (owner: EBernhardson)
[00:03:12] Niharika: mtg!
[00:03:54] (Merged) jenkins-bot: Setup Cirrus AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) (owner: EBernhardson)
[00:07:58] (CR) jenkins-bot: Setup Cirrus AB test [mediawiki-config] - https://gerrit.wikimedia.org/r/415754 (https://phabricator.wikimedia.org/T187148) (owner: EBernhardson)
[00:09:50] !log ebernhardson@tin Synchronized wmf-config/: SWAT: T187148 Configure Cirrus AB test (duration: 01m 00s)
[00:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:09] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[00:11:15] sigh, extra warnings triggered...
[00:12:34] !log ebernhardson@tin Synchronized wmf-config/: REVERT SWAT: T187148 Configure Cirrus AB test (duration: 00m 59s)
[00:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:14:00] (PS1) EBernhardson: Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779
[00:15:10] (CR) jerkins-bot: [V: -1] Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779 (owner: EBernhardson)
[00:15:57] (PS2) EBernhardson: Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779
[00:17:00] (CR) jerkins-bot: [V: -1] Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779 (owner: EBernhardson)
[00:18:25] (PS3) EBernhardson: Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779
[00:19:49] (CR) EBernhardson: [C: 2] Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779 (owner: EBernhardson)
[00:21:16] (Merged) jenkins-bot: Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779 (owner: EBernhardson)
[00:21:26] (CR) jenkins-bot: Gracefully handle change of variable to array [mediawiki-config] - https://gerrit.wikimedia.org/r/415779 (owner: EBernhardson)
[00:23:58] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: SWAT: T187148 Configure Cirrus AB test (step 1) (second try) (duration: 00m 57s)
[00:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:24:16] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[00:25:09] Zackary: around for SWAT?
[00:25:47] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148 Configure Cirrus AB test (step 2) (second try) (duration: 00m 57s)
[00:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:36:42] ye, I'm around
[00:37:46] Zackary: \o/ I'll get yours shipped out
[00:37:59] (CR) EBernhardson: [C: 2] Restrict FlaggedRevs to only operated on NS_MAIN on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404620 (https://phabricator.wikimedia.org/T148603) (owner: TerraCodes)
[00:39:07] (CR) EddieGP: [C: 1] Run initSiteStats twice a month [puppet] - https://gerrit.wikimedia.org/r/415066 (https://phabricator.wikimedia.org/T59788) (owner: Chad)
[00:39:51] (Merged) jenkins-bot: Restrict FlaggedRevs to only operated on NS_MAIN on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404620 (https://phabricator.wikimedia.org/T148603) (owner: TerraCodes)
[00:40:07] (CR) jenkins-bot: Restrict FlaggedRevs to only operated on NS_MAIN on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/404620 (https://phabricator.wikimedia.org/T148603) (owner: TerraCodes)
[00:40:33] Zackary: able to test on mwdebug1002 ?
[00:42:29] Operations, Puppet, Patch-For-Review: compile/diff catalogs between puppetdb v2 (production) and puppetdb v4 - https://phabricator.wikimedia.org/T188544#4016233 (herron) Octocatalog-diff is set up on elnath.codfw.wmnet with a local `/etc/puppet/auth.conf` hack in place on the 3 eqiad puppet masters t...
[00:46:26] !log ebernhardson@tin Synchronized php-1.31.0-wmf.23/extensions/WikimediaEvents/modules/all/ext.wikimediaEvents.searchSatisfaction.js: SWAT T187148: Start cirrus query explorer AB test (duration: 00m 57s)
[00:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:46:43] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148
[00:48:47] !log fermium (lists) and mx systems rebooted for kernel update
[00:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:22] !log ebernhardson@tin Synchronized wmf-config/flaggedrevs.php: SWAT: T148603: (duration: 00m 57s)
[00:49:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:37] T148603: Limit the Quality version of the flagged revision in Arabic Wikipedia to ns=0 - https://phabricator.wikimedia.org/T148603
[00:56:15] PROBLEM - MegaRAID on db1064 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
[00:56:15] ACKNOWLEDGEMENT - MegaRAID on db1064 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T188685
[00:56:20] Operations, ops-eqiad: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4016254 (ops-monitoring-bot)
[00:57:36] PROBLEM - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/grid/start/trusty - 185 bytes in 0.331 second response time
[00:59:13] Operations, DBA, MediaWiki-General-or-Unknown, MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#4016257 (EddieGP) Sorry, I was about to ask when exactly to do this o...
[01:01:56] (PS4) Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650)
[01:02:26] (CR) jerkins-bot: [V: -1] wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[01:02:36] RECOVERY - toolschecker: Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.700 second response time
[01:06:22] (CR) Bstorm: "Ok, I've applied all of anomie's advice with an exception. I did add a view for the comment table, but I haven't added a whole bunch of i" [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[01:07:15] (PS5) Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650)
[01:07:17] (CR) jerkins-bot: [V: -1] wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[01:08:27] (CR) Bstorm: "Also, wmcs team should note that I've made a change in this patch set to the logging table whitelist parsing method. I'm now passing the " [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[01:09:21] (PS1) Dzahn: jupyterhub: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415786
[01:09:54] (CR) jerkins-bot: [V: -1] jupyterhub: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415786 (owner: Dzahn)
[01:13:18] (CR) BryanDavis: "> perhaps this would need to be on the source (sanitarium?)." [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[01:24:17] ebernhardson: no, I'm not able to test, since I don't have rights on that wiki
[01:27:24] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (bad URL) timed out before a response was received
[01:27:24] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Scrapes sample page) timed out before a response was received
[01:28:15] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[01:28:15] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[01:30:56] (PS1) Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789
[01:31:37] (CR) jerkins-bot: [V: -1] [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789 (owner: Aaron Schulz)
[01:33:43] (CR) BryanDavis: wiki-replicas: Accommodate new comments table with rules and compatibility (1 comment) [puppet] - https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: Bstorm)
[01:37:41] (PS2) Dzahn: jupyterhub: apache -> httpd module [puppet] - https://gerrit.wikimedia.org/r/415786
[01:43:11] (CR) Dzahn: "http://puppet-compiler.wmflabs.org/10228/notebook1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/415786 (owner: Dzahn)
[01:46:32] !log LDAP: added lucaswerkmeister-wmde to 'wmde' and 'nda' groups (T188105)
[01:46:46] logmsgbot: plz
[01:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:46:53] T188105: Add Lucas Werkmeister to the ldap/wmde and ldap/nda groups - https://phabricator.wikimedia.org/T188105
[01:49:07] (PS1) Dzahn: admins: add Lucas Werkmeister to ldap_only admins [puppet] - https://gerrit.wikimedia.org/r/415792 (https://phabricator.wikimedia.org/T188105)
[01:49:57] (CR) Dzahn: [C: 2] admins: add Lucas Werkmeister to ldap_only admins [puppet] - https://gerrit.wikimedia.org/r/415792 (https://phabricator.wikimedia.org/T188105) (owner: Dzahn)
[01:54:24] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[01:54:41] !log cobalt (gerrit) - rebooting for kernel upgrade
[01:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:56:50] gerrit server is restarting - maintenance - 2 minutes please
[01:57:14] PROBLEM - Host cobalt is DOWN: PING CRITICAL - Packet loss = 100%
[01:57:46] RECOVERY - Host cobalt is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[01:57:48] * legoktm waits patiently
[01:58:23] back
[01:59:00] sorry for making you all login again. sometimes we can't avoid them
[02:03:46] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_jenkins CI slave scripts]
[02:05:43] ^ already ran puppet..
[02:08:46] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[02:10:06] ok, done and out
[02:44:50] hmm
[02:44:55] extensiondistributor is broken
[02:46:45] Operations, DNS, Traffic, Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4016446 (Prtksxna) >>! In T188362#4013204, @Aklapper wrote: > Curious: @Prtksxna, can you access that?) Nope. I don't see "Vis...
[02:48:50] !log manually purged ExtensionDistributor cache (T188692)
[02:49:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:49:10] T188692: Special:ExtensionDistributor displays an error - https://phabricator.wikimedia.org/T188692
[03:22:21] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4016491 (Cmjohnson) I have one left and now I see db1064 is degraded. We needed to order more @robh.
[03:25:58] (PS2) Chad: Enable reusable TC on HHVM on canary appservers [puppet] - https://gerrit.wikimedia.org/r/414876 (https://phabricator.wikimedia.org/T103886)
[03:27:21] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 832.79 seconds
[04:25:41] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 231.43 seconds
[05:24:59] Operations, Domains, Traffic, WMF-Design, and 3 others: Create subdomain for Design and Wikimedia User Interface Style Guide - https://phabricator.wikimedia.org/T185282#3911827 (Prtksxna) >>! In T185282#4015968, @Bawolff wrote: > Speaking of which, if the code already exists, you should request t...
[05:27:30] (PS3) Andrew Bogott: labweb: include equivalent functionality to hhvm::admin [puppet] - https://gerrit.wikimedia.org/r/415758
[05:33:18] (PS4) Andrew Bogott: labweb: include equivalent functionality to hhvm::admin [puppet] - https://gerrit.wikimedia.org/r/415758
[05:35:26] (PS5) Andrew Bogott: labweb: include equivalent functionality to hhvm::admin [puppet] - https://gerrit.wikimedia.org/r/415758
[05:36:58] (CR) Andrew Bogott: [C: 2] labweb: include equivalent functionality to hhvm::admin [puppet] - https://gerrit.wikimedia.org/r/415758 (owner: Andrew Bogott)
[06:34:56] Operations, ops-eqiad, DBA: Degraded RAID on db1068 - https://phabricator.wikimedia.org/T188187#4016615 (Marostegui) >>! In T188187#4016491, @Cmjohnson wrote: > I have one left and now I see db1064 is degraded. We needed to order more > @robh. Yeah...just saw that. Let's save that spare disk for db...
[06:37:01] Operations, ops-eqiad, DBA: Degraded RAID on db1064 - https://phabricator.wikimedia.org/T188685#4016618 (Marostegui) p:Triage>Normal This is a slave in s4. There is only one spare disk left and we will use it for db1068 (s4 master - T188187#4016615) so we need to order more as per @Cmjohnson...
[06:45:20] (PS1) Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/415805 (https://phabricator.wikimedia.org/T183469)
[06:48:17] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/415805 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[06:49:29] (Merged) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/415805 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[06:51:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 to clone db1114 - T183469 (duration: 00m 58s)
[06:51:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:21] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469
[06:52:37] (CR) jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - https://gerrit.wikimedia.org/r/415805 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[06:52:54] !log Stop MySQL on db1073 to clone db1114 - T183469
[06:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:21] (PS2) Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789
[06:53:48] (CR) jerkins-bot: [V: -1] [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789 (owner: Aaron Schulz)
[06:57:29] (PS3) Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789
[06:58:02] (CR) jerkins-bot: [V: -1] [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789 (owner: Aaron Schulz)
[07:01:15] (PS1) Marostegui: db-eqiad,db-codfw.php: Add db1114 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/415806 (https://phabricator.wikimedia.org/T183469)
[07:04:11] (PS2) Marostegui: db-eqiad,db-codfw.php: Add db1114 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/415806 (https://phabricator.wikimedia.org/T183469)
[07:10:52] <_joe_> AaronSchulz: wow, you're already working on it
[07:11:06] <_joe_> AaronSchulz: I'll be able to work on this next quarter, probably
[07:11:08] <_joe_> hopefully
[07:11:09] <_joe_> :P
[07:11:14] !log rebooting xenon/praseodymium/xenon for kernel security update
[07:11:24] !log rebooting xenon/praseodymium/cerium for kernel security update
[07:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:17:49] (CR) Marostegui: [C: 2] db-eqiad,db-codfw.php: Add db1114 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/415806 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:19:02] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Add db1114 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/415806 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:19:16] (CR) jenkins-bot: db-eqiad,db-codfw.php: Add db1114 to the config [mediawiki-config] - https://gerrit.wikimedia.org/r/415806 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:20:21] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Add db1114 to the config - T183469 (duration: 00m 57s)
[07:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:35] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469
[07:21:24] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Add db1114 to the config - T183469 (duration: 00m 57s)
[07:21:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:22:38] (PS1) Marostegui: s1.hosts: Add db1114 to s1 [software] - https://gerrit.wikimedia.org/r/415807 (https://phabricator.wikimedia.org/T183469)
[07:23:34] (CR) Marostegui: [C: 2] s1.hosts: Add db1114 to s1 [software] - https://gerrit.wikimedia.org/r/415807 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:24:17] (Merged) jenkins-bot: s1.hosts: Add db1114 to s1 [software] - https://gerrit.wikimedia.org/r/415807 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[07:37:42] (PS1) Marostegui: db-eqiad.php: Repool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/415808 (https://phabricator.wikimedia.org/T162807)
[07:40:38] Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4016701 (MoritzMuehlenhoff) If we had no firewall setup and would start from scratch it would be different, but I don't think the work necessary to migrate outweighs the potential benefits at this point. O...
[07:43:05] (CR) Marostegui: [C: 2] db-eqiad.php: Repool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/415808 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:44:18] (Merged) jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/415808 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:45:38] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1067 - T162807 (duration: 00m 57s)
[07:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:52] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807
[07:48:13] (CR) jenkins-bot: db-eqiad.php: Repool db1067 [mediawiki-config] - https://gerrit.wikimedia.org/r/415808 (https://phabricator.wikimedia.org/T162807) (owner: Marostegui)
[07:51:36] (PS4) Aaron Schulz: [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789
[07:52:13] (CR) jerkins-bot: [V: -1] [WIP] Add dynomite module and dynomite_wancache profile [puppet] - https://gerrit.wikimedia.org/r/415789 (owner: Aaron Schulz)
[08:14:33] Operations, Ops-Access-Requests, Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4008190 (MoritzMuehlenhoff) Just to confirm the list of names to bring up in the next SRE meeting; This means the access is requested for...
[08:15:38] (CR) Giuseppe Lavagetto: "Thanks for starting working on this, I already tried to stash some of my time in the next quarter to help on this front. Some early feedba" (4 comments) [puppet] - https://gerrit.wikimedia.org/r/415789 (owner: Aaron Schulz)
[08:19:54] Good morning
[08:19:57] (CR) Alexandros Kosiaris: [C: -2] "This is actually the port in the container, not the port in production and it's set in the image definition. Changing the port requires ch" [deployment-charts] - https://gerrit.wikimedia.org/r/415605 (https://phabricator.wikimedia.org/T184919) (owner: Mobrovac)
[08:20:05] Question: Why does CI run tests for patches that were already merged 5/6 days ago?
[08:20:15] Examples: http://prntscr.com/ilqpaa
[08:24:53] Operations, Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4014970 (MoritzMuehlenhoff) @Imarlier Do you need access to those boxes before they are fully provisioned? If not, I'd close this access request since they'll be accessible once fully set up.
[08:27:41] Zoranzoki21: you don't need to ask in multiple channels at once
[08:28:07] : Ok. If you want to help, tell it
[08:32:09] Operations, Ops-Access-Requests: Need access to graphite servers - https://phabricator.wikimedia.org/T188649#4014947 (MoritzMuehlenhoff) Such an access request needs to be raised in the next SRE meeting (Monday). Is the request only for yourself or also for other members of the Performance team?
[08:33:00] Operations, Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4016822 (MoritzMuehlenhoff) p:Triage>Normal
[08:33:09] Operations, Ops-Access-Requests: Need access to graphite servers - https://phabricator.wikimedia.org/T188649#4016823 (MoritzMuehlenhoff) p:Triage>Normal
[08:33:29] Operations: Decrease the amount of IRC spam in case of widespread puppet failures - https://phabricator.wikimedia.org/T188602#4016824 (MoritzMuehlenhoff) p:Triage>Normal
[08:34:16] Operations, monitoring: Many "NRPE: Unable to read output" from "long running screen/tmux" checks in icinga - https://phabricator.wikimedia.org/T187528#4016825 (MoritzMuehlenhoff) p:Triage>Normal
[08:35:10] Operations, Wikimedia-Incident: Detect high server load earlier – prometheus alert? - https://phabricator.wikimedia.org/T188317#4016826 (MoritzMuehlenhoff) p:Triage>Normal
[08:39:02] (PS1) Marostegui: db-eqiad.php: Pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415810
[08:40:52] (PS1) Marostegui: db1114: Enable notification [puppet] - https://gerrit.wikimedia.org/r/415811 (https://phabricator.wikimedia.org/T183469)
[08:51:14] !log repooling scb1003 after memory module was replaced (T188385)
[08:51:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:30] T188385: Memory initialization error on scb1003 - https://phabricator.wikimedia.org/T188385
[08:54:06] (CR) Gehel: [C: 1] "The code change itself LGTM. I think the tests don't have much value as they rely too much on mock and dont check any assumption about how" [software/cumin] - https://gerrit.wikimedia.org/r/415587 (https://phabricator.wikimedia.org/T188627) (owner: Volans)
[08:57:24] !log rebooting scb1004 for kernel security update (was omitted from earlier reboots due to hardware issues on scb1003)
[08:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:17] elukey: I'll be testing for T181121 on ganeti1005 (where bohrium is)
[08:58:18] T181121: Kernels errors on ganeti1005- ganeti1008 under high I/O - https://phabricator.wikimedia.org/T181121
[08:58:28] keep in mind it might cause some issues
[09:05:12] (CR) Volans: [C: 2] "@gehel thanks for the review. I agree this test is a bit sterile and a better and proper way to test this would be with an integration tes" [software/cumin] - https://gerrit.wikimedia.org/r/415587 (https://phabricator.wikimedia.org/T188627) (owner: Volans)
[09:06:53] akosiaris: ack!
[09:08:58] (Merged) jenkins-bot: CLI: fix setup_logging() when without path [software/cumin] - https://gerrit.wikimedia.org/r/415587 (https://phabricator.wikimedia.org/T188627) (owner: Volans)
[09:09:17] And again CI
[09:10:36] (CR) jenkins-bot: CLI: fix setup_logging() when without path [software/cumin] - https://gerrit.wikimedia.org/r/415587 (https://phabricator.wikimedia.org/T188627) (owner: Volans)
[09:12:01] PROBLEM - cassandra-a CQL 10.64.16.153:9042 on cerium is CRITICAL: connect to address 10.64.16.153 and port 9042: Connection refused
[09:12:11] PROBLEM - cassandra-a SSL 10.64.16.153:7001 on cerium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:12:11] PROBLEM - restbase endpoints health on praseodymium is CRITICAL: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page
[09:12:12] html by title from storage) is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) is CRITICAL: Test Retrieve selected the events for Jan 01 returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) is CRITICAL: Test Get r
[09:12:12] orage returned the unexpected status 500 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retri
[09:12:22] PROBLEM - Restbase root url on xenon is CRITICAL: connect to address 10.64.0.200 and port 7231: Connection refused
[09:12:42] PROBLEM - Restbase root url on cerium is CRITICAL: connect to address 10.64.16.147 and port 7231: Connection refused
[09:12:51] PROBLEM - cassandra-a CQL 10.64.0.202:9042 on xenon is CRITICAL: connect to address 10.64.0.202 and port 9042: Connection refused
[09:12:51] PROBLEM - cassandra-a SSL 10.64.0.202:7001 on xenon is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused
[09:16:02] ^ that's just the deprecated test cluster, will be decom, I'm silencing this again
[09:17:56] (CR) Marostegui: [C: 2] db1114: Enable notification [puppet] - https://gerrit.wikimedia.org/r/415811 (https://phabricator.wikimedia.org/T183469) (owner: Marostegui)
[09:22:26] (PS2) Marostegui: db-eqiad.php: Pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415810
[09:24:12] (CR) Marostegui: [C: 2] db-eqiad.php: Pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415810 (owner: Marostegui)
[09:25:25] (Merged) jenkins-bot: db-eqiad.php: Pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415810 (owner: Marostegui)
[09:27:26] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly pool db1114 in s1 after cloning it from db1073 - T183469 (duration: 01m 01s)
[09:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:27:39] T183469: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469
[09:28:14] (CR) jenkins-bot: db-eqiad.php: Pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415810 (owner: Marostegui)
[09:28:34] Operations, LDAP-Access-Requests, Patch-For-Review: Add Lucas Werkmeister to the ldap/wmde and ldap/nda groups - https://phabricator.wikimedia.org/T188105#4016913 (MoritzMuehlenhoff)
[09:31:11] Operations, ops-eqiad, Analytics-Kanban: DIMM errors for analytics1062 - https://phabricator.wikimedia.org/T187164#4016914 (elukey) @Cmjohnson the server went down, can we test it ?
[09:33:18] (PS1) Volans: Puppet: temporary allow elnath to retrieve catalogs [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544)
[09:35:05] (PS1) Muehlenhoff: Drop use of experimental repository component for caches [puppet] - https://gerrit.wikimedia.org/r/415814
[09:36:01] (CR) Volans: "Compiler results: https://puppet-compiler.wmflabs.org/compiler02/10233/puppetmaster1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544) (owner: Volans)
[09:38:34] (CR) Giuseppe Lavagetto: [C: -1] Puppet: temporary allow elnath to retrieve catalogs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544) (owner: Volans)
[09:38:52] <_joe_> I know it seems silly, but comments on such hacks are important
[09:39:04] no, you're right
[09:40:44] (PS2) Volans: Puppet: temporary allow elnath to retrieve catalogs [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544)
[09:40:44] !log draining restbase1015 for eventual reboot for kernel security update
[09:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:20] _joe_: done
[09:42:49] (CR) Giuseppe Lavagetto: [C: 1] Puppet: temporary allow elnath to retrieve catalogs [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544) (owner: Volans)
[09:43:08] (CR) Volans: [C: 2] Puppet: temporary allow elnath to retrieve catalogs [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544) (owner: Volans)
[09:43:45] (PS3) Volans: Puppet: temporary allow elnath to retrieve catalogs [puppet] - https://gerrit.wikimedia.org/r/415813 (https://phabricator.wikimedia.org/T188544)
[09:49:07] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415817
[09:51:12] (PS1) Elukey: burrow: fix zookeeper lock path value [puppet] - https://gerrit.wikimedia.org/r/415818 (https://phabricator.wikimedia.org/T180442)
[09:52:34] (CR) Elukey: [C: 2] burrow: fix zookeeper lock path value [puppet] - https://gerrit.wikimedia.org/r/415818 (https://phabricator.wikimedia.org/T180442) (owner: Elukey)
[09:55:11] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415817 (owner: Marostegui)
[09:56:23] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415817 (owner: Marostegui)
[09:57:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1114 (duration: 00m 57s)
[09:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:04] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415817 (owner: Marostegui)
[10:01:07] !log deleted /etc/burrow/* from zookeeper main eqiad/codfw after https://gerrit.wikimedia.org/r/415818 (garbage to cleanup)
[10:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:30] (PS1) Marostegui: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415820
[10:13:47] (PS1) Muehlenhoff: Reimage mc2036 after mainboard replacement [puppet] - https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587)
[10:14:18] (CR) Marostegui: [C: 2] db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415820 (owner: Marostegui)
[10:14:25] (CR) jerkins-bot: [V: -1] Reimage mc2036 after mainboard replacement [puppet] - https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587) (owner: Muehlenhoff)
[10:15:56] (Merged) jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415820 (owner: Marostegui)
[10:17:25] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1114 (duration: 00m 56s)
[10:17:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:07] (CR) jenkins-bot: db-eqiad.php: Increase traffic for db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415820 (owner: Marostegui)
[10:18:27] !log shutting down labsdb1010
[10:18:36] (PS2) Muehlenhoff: Reimage mc2036 after mainboard replacement [puppet] - https://gerrit.wikimedia.org/r/415822 (https://phabricator.wikimedia.org/T188587)
[10:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:44] !log draining restbase1016 for eventual reboot for kernel security update
[10:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:25] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[10:28:23] (PS1) Filippo Giunchedi: Decom restbase-test cluster and role [puppet] - https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755)
[10:30:16] PROBLEM - Disk space on stat1005 is CRITICAL: DISK CRITICAL - free space: /srv 282942 MB (3% inode=93%)
[10:30:16] ^see log, expected
[10:30:21] the proxy, I mean
[10:30:27] not stat1005
[10:33:22] stat1005 needs some data deletion that should happen today from a user, I can try to move something to HDFS in the meantime
[10:33:25] stat1005 is also known, I pinged the analytics channel earlier
[10:35:05] (PS2) Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - https://gerrit.wikimedia.org/r/415573
[10:35:06] (PS2) Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - https://gerrit.wikimedia.org/r/415574
[10:35:08] (PS1) Giuseppe Lavagetto: hhvm: remove legacy diamond collectors [puppet] - https://gerrit.wikimedia.org/r/415828
[10:35:10] (PS1) Giuseppe Lavagetto: hhvm: remove deep hiera_hash call inside the class [puppet] - https://gerrit.wikimedia.org/r/415829
[10:35:12] (PS1) Giuseppe Lavagetto: mediawiki::hhvm: move to profile [puppet] - https://gerrit.wikimedia.org/r/415830
[10:38:29] (CR) Muehlenhoff: Decom restbase-test cluster and role (2 comments) [puppet] - https://gerrit.wikimedia.org/r/415827 (https://phabricator.wikimedia.org/T186755) (owner: Filippo Giunchedi)
[10:41:01] godog gilles it seems like https://commons.wikimedia.org/wiki/File:Prusa_i3_MK2-full_model.stl is stuck in thumbnailing. any insight? i lurk between channels, if you can handle-mention me here, most appreciated - lmk if i should file a ticket
[10:42:26] (PS1) Marostegui: db-eqiad.php: Fully pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415832
[10:43:09] (CR) Muehlenhoff: [C: 1] hhvm: remove legacy diamond collectors [puppet] - https://gerrit.wikimedia.org/r/415828 (owner: Giuseppe Lavagetto)
[10:43:15] Operations, Traffic: Collect Google IPs pinging the load balancers - https://phabricator.wikimedia.org/T165651#4017007 (fgiunchedi) Open>Resolved a:fgiunchedi I mentioned this task and problem to a friend working in SRE networking, we're now receiving about one tenth of the icmp traffic inbou...
[10:44:05] (CR) Marostegui: [C: 2] db-eqiad.php: Fully pool db1114 [mediawiki-config] - https://gerrit.wikimedia.org/r/415832 (owner: Marostegui)
[10:44:09] dr0ptp4kt: yes please, a task would be better!
[10:44:28] godog: thx, which board names should i add?
[10:45:29] (03Merged) 10jenkins-bot: db-eqiad.php: Fully pool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415832 (owner: 10Marostegui) [10:45:44] dr0ptp4kt: thumbor and 3d at least [10:46:41] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Fully repool db1114 (duration: 00m 57s) [10:46:55] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4017041 (10Marostegui) db1114 is now fully pooled in s1 and db1073 is depooled. Let's see how it goes during the weekend and then move db1073 to m5. [10:46:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:30] thx godog, here you go: https://phabricator.wikimedia.org/T188711 (cc gilles) [10:48:12] dr0ptp4kt: sweet! yeah stuff on irc tends to get lost, at least for me [10:48:16] (03CR) 10jenkins-bot: db-eqiad.php: Fully pool db1114 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415832 (owner: 10Marostegui) [10:53:11] !log draining restbase1017 for eventual reboot for kernel security update [10:53:19] (03PS1) 10Vgutierrez: Release 1.13.9-1+wmf1 for stretch [software/nginx] (wmf-1.13) - 10https://gerrit.wikimedia.org/r/415835 [10:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:37] !log spare LVSs lvs[1011-1012], lvs[4001-4004]: reboot for retpoline kernel updates T188092 [10:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:57:42] (03PS3) 10Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 [10:57:44] (03PS3) 10Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/415574 [10:57:46] (03PS2) 10Giuseppe Lavagetto: hhvm: remove legacy diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/415828 [10:57:48] (03PS2) 10Giuseppe Lavagetto: hhvm: 
remove deep hiera_hash call inside the class [puppet] - 10https://gerrit.wikimedia.org/r/415829 [11:07:12] !log rebooting mwdebug* for kernel security update [11:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:38] godog: the scrollback often stings me on irc, too. the bouncer/irccloud helps a little, but only a little :) [11:18:23] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [11:19:13] (03PS29) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch [puppet] - 10https://gerrit.wikimedia.org/r/394977 [11:24:04] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [11:25:37] (03PS4) 10Giuseppe Lavagetto: hhvm::admin: remove inclusion of apache::mod::proxy_fcgi [puppet] - 10https://gerrit.wikimedia.org/r/415573 [11:25:39] (03PS4) 10Giuseppe Lavagetto: hhvm::admin: convert to using httpd instead of apache [puppet] - 10https://gerrit.wikimedia.org/r/415574 [11:25:41] (03PS3) 10Giuseppe Lavagetto: hhvm: remove legacy diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/415828 [11:25:43] (03PS3) 10Giuseppe Lavagetto: hhvm: remove deep hiera_hash call inside the class [puppet] - 10https://gerrit.wikimedia.org/r/415829 [11:25:45] (03PS2) 10Giuseppe Lavagetto: mediawiki::hhvm: move to profile [puppet] - 10https://gerrit.wikimedia.org/r/415830 [11:28:08] !log upload to apt.wikimedia.org component thirdparty/ci distro jessie-wikimedia docker-ce_17.12.1~ce-0~debian_amd64 T177499 [11:28:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:25] T177499: On CI, upgrade docker-ce from 17.06.2 to 17.12.1 - https://phabricator.wikimedia.org/T177499 [11:33:31] !log draining restbase1018 for eventual reboot for kernel security update [11:33:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:06] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[docker-ce] [11:37:45] akosiaris hashar ^^ [11:37:58] (03PS1) 10Alexandros Kosiaris: ci: Upgrade docker image version [puppet] - 10https://gerrit.wikimedia.org/r/415838 (https://phabricator.wikimedia.org/T177499) [11:38:12] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] ci: Upgrade docker image version [puppet] - 10https://gerrit.wikimedia.org/r/415838 (https://phabricator.wikimedia.org/T177499) (owner: 10Alexandros Kosiaris) [11:39:45] 10Operations, 10Ops-Access-Requests: Give maps deployment rights to sbisson - https://phabricator.wikimedia.org/T188720#4017240 (10Legoktm) [11:41:00] (03CR) 10ArielGlenn: [WIP] php7 manifests for mediawiki on stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/394977 (owner: 10ArielGlenn) [11:41:06] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:41:16] paladox: and fixed ^ [11:41:22] thanks akosiaris :) [11:46:21] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4017247 (10ema) >>! In T187994#4014123, @faidon wrote: > Separately, for the Pybal/IPVS stuff, I think this could benefit to being discussed at a separate task (since it's not about iptables or firewalling)... 
[11:54:34] (03PS1) 10Arturo Borrero Gonzalez: labstore: monitoring: interfaces: reduce check timeframe to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/415840 (https://phabricator.wikimedia.org/T188624) [11:58:35] !log drain + reboot analytics10[29,31,32] for kernel updates [11:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:08] !log rebooting etcd* for kernel security updates [12:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:10] PROBLEM - etcd request latencies on argon is CRITICAL: CRITICAL - etcd_request_latencies is 52002 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:16:10] RECOVERY - etcd request latencies on argon is OK: OK - etcd_request_latencies is 3804 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:23:59] !log rebooting kubetcd/kubestagetcd for kernel security update [12:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:45] (03PS1) 10Mark Bergsma: Improve naming of the new BGP metric names and labels [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 [12:31:34] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4017371 (10BBlack) And just to put the nail in the coffin of LVS/IPVS-level concerns being raised in this ticket - if we were to look at replacing IPVS as the underlying (kernel-level) mechanism for our load... 
[12:37:55] (03PS2) 10Mark Bergsma: Improve naming of the new BGP metric names and labels [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 [12:37:58] (03CR) 10Vgutierrez: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 (owner: 10Mark Bergsma) [12:38:06] vgutierrez: sorry ^ :) [12:38:21] 'asn' is ambiguous too, so i made it local_asn [12:38:51] (03CR) 10jerkins-bot: [V: 04-1] Improve naming of the new BGP metric names and labels [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 (owner: 10Mark Bergsma) [12:39:05] heh [12:39:22] yup.. you have some issues with that change [12:39:27] PROBLEM - etcd request latencies on acrux is CRITICAL: CRITICAL - etcd_request_latencies is 90559 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:40:20] (03PS3) 10Mark Bergsma: Improve naming of the new BGP metric names and labels [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 [12:41:28] RECOVERY - etcd request latencies on acrux is OK: OK - etcd_request_latencies is 3874 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:43:43] (03CR) 10Vgutierrez: [C: 031] "LGTM, now we should fix change 415260 as well, but I'll do that once this one is merged" [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 (owner: 10Mark Bergsma) [12:49:50] 10Operations, 10Ops-Access-Requests: Give maps deployment rights to sbisson - https://phabricator.wikimedia.org/T188720#4017423 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:50:01] hello there! [12:50:13] is this the right place to ask about cumin? 
[12:52:49] sonne: should be fine here [13:01:13] cheers :) [13:12:39] (03PS1) 10Arturo Borrero Gonzalez: apt: unattended-upgrades: don't install apt-show-upgrades [puppet] - 10https://gerrit.wikimedia.org/r/415843 (https://phabricator.wikimedia.org/T186230) [13:13:26] (03CR) 10Arturo Borrero Gonzalez: [C: 032] apt: unattended-upgrades: don't install apt-show-upgrades [puppet] - 10https://gerrit.wikimedia.org/r/415843 (https://phabricator.wikimedia.org/T186230) (owner: 10Arturo Borrero Gonzalez) [13:15:54] sonne: sure, go ahead [13:17:02] 10Operations, 10Ops-Access-Requests: Give maps deployment rights to sbisson - https://phabricator.wikimedia.org/T188720#4017488 (10Gehel) the maps production clusters (`maps[12]00[1-4]`) and test cluster (`maps-test200[1-4]`) are managed by the same group (`maps-admins`). This is actually what we want. Stephan... [13:17:19] !log upgrading labtest trusty hosts to latest 4.4 kernel [13:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] (03PS1) 10Gehel: Stephane Bisson should be able to deploy maps. [puppet] - 10https://gerrit.wikimedia.org/r/415845 (https://phabricator.wikimedia.org/T188720) [13:21:00] (03CR) 10Gehel: "I'm not sure if Stephane also needs to be added to another group to have access to the deployment servers (tin / naos)." [puppet] - 10https://gerrit.wikimedia.org/r/415845 (https://phabricator.wikimedia.org/T188720) (owner: 10Gehel) [13:22:25] !log drain + reboot analytics10[33,34,36,37] for kernel updates [13:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:09] volans: actually i got my things working before i found out about this channel, i asked for the future :) [13:26:20] (03PS1) 10Mark Bergsma: Prefix log lines with the class name [debs/pybal] - 10https://gerrit.wikimedia.org/r/415846 [13:26:22] i'm the guy that reported the log_file bug by the way [13:28:39] sonne: I'm glad you already solved it, and thanks for bug report. 
With the workaround I mentioned in the task you should be already unblocked [13:28:49] but let me know if that's not the case ;) [13:33:56] i am :) [13:34:19] i only managed to find time to try your tool now, i've been wanting to ever since i saw the presentation at fosdem - really excited about it [13:35:18] RECOVERY - configured eth on labtestnet2002 is OK: OK - interfaces up [13:36:39] nice, any feedback is welcome! [13:39:25] well i'll start asking stuff right away then! [13:40:10] (03PS1) 10Elukey: profile::analytics::refinery::job::sqoop_mediawiki: add stdout redirect to crons [puppet] - 10https://gerrit.wikimedia.org/r/415849 [13:40:21] do you think it would be easy to pass http requests to puppetdb through basic auth? [13:40:58] 10Operations, 10Ops-Access-Requests: Need access to graphite servers - https://phabricator.wikimedia.org/T188649#4017564 (10Imarlier) It would be helpful if the entire team had this access. It's only //necessary// for me for the moment. [13:41:17] RECOVERY - Disk space on stat1005 is OK: DISK OK [13:43:21] sonne: is not possible as of now, but should be a kinda trivial change to read the auth credentials from the config file and pass them to python requests for the call [13:43:50] i see [13:44:05] also, why do you need sudo to run cumin? [13:44:26] I thought that usually puppetdb is not 'exposed' to the public, but I might be wrong ;) [13:45:53] volans: yeah i exposed it behind an apache proxy that limits to my ip, not too happy about it though. as an alternative to basic auth, i thought maybe one could have support for pre-connect commands (e.g. ssh -R 8081... puppetserver) or something like that. 
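Note on the basic-auth idea above (13:43:21): a minimal sketch of reading credentials from a config mapping and handing them to python requests. The config keys `basic_auth_user` and `basic_auth_password`, the `timeout` key, and the helper name are hypothetical and are not existing cumin settings; the only library behaviour relied on is that requests accepts a `(user, password)` tuple as its `auth` argument.

```python
def puppetdb_request_kwargs(config):
    """Build keyword arguments for a requests call to PuppetDB.

    `config` is a plain dict. The keys used here are illustrative
    assumptions, not real cumin configuration options.
    """
    kwargs = {"timeout": config.get("timeout", 30)}
    user = config.get("basic_auth_user")
    password = config.get("basic_auth_password")
    if user is not None and password is not None:
        # requests treats a (user, password) tuple as HTTP basic auth.
        kwargs["auth"] = (user, password)
    return kwargs
```

The result would then be passed through at call time, e.g. `requests.get(url, **puppetdb_request_kwargs(config))`, so a config without credentials behaves exactly as before.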
[13:46:03] (03CR) 10Vgutierrez: [C: 031] "I've already thought about doing this, +1" [debs/pybal] - 10https://gerrit.wikimedia.org/r/415846 (owner: 10Mark Bergsma) [13:46:06] (03PS1) 10Arturo Borrero Gonzalez: toollabs: introduce role::toollabs::base [puppet] - 10https://gerrit.wikimedia.org/r/415851 (https://phabricator.wikimedia.org/T187193) [13:46:23] !log drain + reboot analytics10[38,39,40,41] for kernel updates [13:46:33] (03CR) 10jerkins-bot: [V: 04-1] toollabs: introduce role::toollabs::base [puppet] - 10https://gerrit.wikimedia.org/r/415851 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:00] (03CR) 10Mark Bergsma: [C: 032] Improve naming of the new BGP metric names and labels [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 (owner: 10Mark Bergsma) [13:47:05] (03CR) 10Mark Bergsma: [C: 032] Prefix log lines with the class name [debs/pybal] - 10https://gerrit.wikimedia.org/r/415846 (owner: 10Mark Bergsma) [13:47:37] (03Merged) 10jenkins-bot: Improve naming of the new BGP metric names and labels [debs/pybal] - 10https://gerrit.wikimedia.org/r/415841 (owner: 10Mark Bergsma) [13:47:39] (03Merged) 10jenkins-bot: Prefix log lines with the class name [debs/pybal] - 10https://gerrit.wikimedia.org/r/415846 (owner: 10Mark Bergsma) [13:48:06] sonne: so our use case and installation is to have it on a couple of 'management' servers, and run it from there. 
While it can perfectly work from a personal laptop too, there are many limitations, depending on your infrastructure [13:48:16] (I'll get back to the sudo question shortly) [13:48:35] sure [13:49:10] i'm more enthralled by the personal laptop workflow rather than the management server, and the few tests i've run are very satisfactory [13:49:44] (03PS5) 10Vgutierrez: pybal: Prometheus based icinga check for BGP established sessions [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) [13:49:53] that's because it would not scale for us from the personal laptop due to various limitations, and also you get the bad RTT for each connection [13:50:54] also is harder to do auditing if not running from a central place [13:51:11] right, i guess that starts to matter when passing a certain amount of machines [13:51:14] (03CR) 10Vgutierrez: "Updated after change 415841 got merged in PyBal master" [puppet] - 10https://gerrit.wikimedia.org/r/415260 (https://phabricator.wikimedia.org/T188085) (owner: 10Vgutierrez) [13:51:29] i'm usually in the 10^1 order of magnitude [13:52:35] ok, then the difference would be much less noticeable [13:53:01] thanks for the tip though :) [13:53:12] i hope we'll have to care about that in the near future [13:55:00] timings here in practice from our mgmt server to various servers (including some halfway around the world): [13:55:03] bblack@neodymium:~$ time sudo cumin --force '*' 'id' [13:55:04] real 0m13.302s [13:55:06] 1293 hosts will be targeted: [13:55:09] [...] 
[13:55:12] (03PS2) 10Arturo Borrero Gonzalez: toollabs: introduce role::toollabs::base [puppet] - 10https://gerrit.wikimedia.org/r/415851 (https://phabricator.wikimedia.org/T187193) [13:55:19] so ~13s to get a response to a trivial command back and collated from a bit over 1K hosts [13:55:40] (03CR) 10jerkins-bot: [V: 04-1] toollabs: introduce role::toollabs::base [puppet] - 10https://gerrit.wikimedia.org/r/415851 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:55:53] ehehe! The sudo part is partly because of the above and mostly because given that most of the administrative tasks you want to run with cumin need root, we allow it to connect as root, and the way the ssh key is exposed to the process requires root. But we've thought in the past to consider allowing non-sudo users to perform non-sudo tasks. [13:55:54] bblack: not bad at all [13:56:14] thanks for quick stat bblack ;) [13:56:31] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] toollabs: introduce role::toollabs::base [puppet] - 10https://gerrit.wikimedia.org/r/415851 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [13:56:48] It's just that we didn't have any use case so far, hence it has not been prioritized [13:57:01] (03PS1) 10Alexandros Kosiaris: akosiaris: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/415852 [13:57:49] sonne: if you run it locally a quick workaround is to run: SUDO_USER=$USER USER=root cumin ... [13:58:19] if your user has access to the ssh key and the config file everything should work fine (I think) [13:58:45] that's how the integration tests are run for example [14:03:24] volans: i can understand why one would want the cumin user to not be able to access the ssh keys, but it feels a bit awkward for it to ask sudo by default [14:04:17] wouldn't it make more sense to check if the keys / config file are accessible and be happy with it, possibly suggesting to use sudo if they are not?
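Note on the suggestion above (14:04:17): a rough sketch of an up-front readability check that suggests sudo only when something is unreadable. It is not how cumin behaves; the function names and the message wording are made up for illustration, and it only covers paths the caller lists explicitly, not keys that ssh would locate on its own.

```python
import os


def unreadable_paths(paths):
    """Return the subset of `paths` the current user cannot read."""
    return [path for path in paths if not os.access(path, os.R_OK)]


def preflight_message(paths):
    """Return None if every path is readable, else a hint suggesting sudo.

    Hypothetical helper: the caller decides which key and config
    files to pass in.
    """
    blocked = unreadable_paths(paths)
    if not blocked:
        return None
    return "Cannot read: {}. Try re-running with sudo.".format(", ".join(blocked))
```

A caller could run this once at startup with the config file path and any key paths named in the config, and continue silently when it returns None.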
[14:04:25] that's true, you could say that there is an additional 'protection' in the tool too while it should just be in the way it's installed locally [14:04:42] s/installed/installed\/configured/ [14:05:59] i see [14:06:03] could be a config.yaml setting, whether to assume the sudo model or not [14:06:37] I agree that is definitely something to improve [14:06:38] bblack: well if what changes in the model is just that some files are inaccessible, then it doesn't really matter as long as you check that in advance no? [14:07:07] I think that's tricky to do dynamically and have the error output meaningful [14:07:20] since the actual ssh operations are beneath some layers of abstraction when they finally fail [14:07:24] (due to being unable to read keys) [14:08:03] (03PS2) 10Muehlenhoff: Drop use of experimental repository component for caches [puppet] - 10https://gerrit.wikimedia.org/r/415814 (https://phabricator.wikimedia.org/T188545) [14:09:00] (in other words, the files are inaccessible to the invoked parallel sshs, but there's not currently a reason for cumin's own code to know or care what keys were going to be read by them, and it might be tricky to find out) [14:09:29] bblack: that would only be a problem if someone specified weird keys in the ssh option hash of the config file though no?
[14:09:51] any valid ssh configuration is accepted and passed through, cumin doesn't really know or care about it [14:10:04] assuming where someone would have their keys would be too bold though i have to say [14:10:20] maybe bblack's suggestion of configuring the model is better [14:12:53] (03PS2) 10Alexandros Kosiaris: akosiaris: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/415852 [14:14:15] it's an option, yes, or maybe allow to configure a user/group to require at startup [14:14:27] (probably a list of users/groups) [14:15:52] surely it need some additional thought [14:16:57] yeah that and/or a flag that just says "assume permissions are already all worked out and ignore this stuff" [14:17:10] sure [14:17:20] (03PS1) 10Mark Bergsma: Fix log method invocation [debs/pybal] - 10https://gerrit.wikimedia.org/r/415855 [14:18:01] maybe the user/group defaults to root, can be configured to a list, and configuring it to the empty list means ignore it. [14:19:41] (03CR) 10Vgutierrez: [C: 031] "good catch!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/415855 (owner: 10Mark Bergsma) [14:19:58] (03CR) 10Mark Bergsma: [C: 032] Fix log method invocation [debs/pybal] - 10https://gerrit.wikimedia.org/r/415855 (owner: 10Mark Bergsma) [14:20:37] (03Merged) 10jenkins-bot: Fix log method invocation [debs/pybal] - 10https://gerrit.wikimedia.org/r/415855 (owner: 10Mark Bergsma) [14:23:50] (03PS3) 10Alexandros Kosiaris: akosiaris: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/415852 [14:24:38] (03CR) 10Alexandros Kosiaris: [C: 032] akosiaris: Update my dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/415852 (owner: 10Alexandros Kosiaris) [14:31:17] PROBLEM - puppet last run on mw1278 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/akosiaris/.vim/ftplugin/html.vim] [14:32:16] akosiaris: ^^^ [14:33:55] race condition? 
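Note on the user/group idea above (14:14:15 to 14:18:01): a small sketch of the proposed semantics, where the allowed user list defaults to root, can be set to a list, and an empty configuration disables the check. The function and parameter names are hypothetical, and nothing here reflects an existing cumin option.

```python
def is_allowed(user, groups, allowed_users=("root",), allowed_groups=()):
    """Decide whether `user` (member of `groups`) may start the tool.

    Assumed semantics, taken from the discussion rather than from any
    real implementation:
    - both allow-lists empty: the check is skipped entirely;
    - otherwise the user must be listed, or belong to a listed group.
    """
    if not allowed_users and not allowed_groups:
        return True
    if user in allowed_users:
        return True
    return any(group in allowed_groups for group in groups)
```

With the defaults this reproduces the current root-only behaviour, while `allowed_users=[]` gives the "assume permissions are already worked out" mode mentioned at 14:16:57.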
[14:35:14] could be RESOURCE_NOT_FOUND [14:35:59] * volans re-running puppet [14:36:08] RECOVERY - puppet last run on mw1278 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:36:11] (03CR) 10Imarlier: "> LGTM, one point remaining about multiple topics. Looks we're" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [14:37:32] volans: probably [14:37:58] yeah, sorry for the ping [14:38:03] puppet's funny in that way [14:38:15] the catalog probably got compiled by host A [14:38:24] <_joe_> you using an "ftpplugin" for your editor, in production, and then make fun of emacs [14:38:25] but the specific resources was request by host B [14:38:32] <_joe_> you vim people are funny [14:38:39] no, you emacs people are funny [14:39:41] the jobqueue might go up a little [14:41:33] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#4017759 (10jcrespo) a:05jcrespo>03Marostegui Reflecting latest work [14:47:08] (03CR) 10Elukey: "Couple of news!" 
[puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [14:59:49] (03Abandoned) 10Filippo Giunchedi: lvs: add graphite service [puppet] - 10https://gerrit.wikimedia.org/r/289636 (https://phabricator.wikimedia.org/T85451) (owner: 10Filippo Giunchedi) [15:02:05] (03CR) 10Ema: [C: 031] elasticsearch - notifiy nginx of SSL certificate changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333664 (owner: 10Gehel) [15:02:59] (03PS1) 10Muehlenhoff: Add a thirdparty/php71 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 [15:03:01] (03PS1) 10Muehlenhoff: Add repository configuration for thirdparty/php71 [puppet] - 10https://gerrit.wikimedia.org/r/415857 [15:03:07] (03CR) 10Gehel: elasticsearch - notifiy nginx of SSL certificate changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/333664 (owner: 10Gehel) [15:03:50] (03PS1) 10Mark Bergsma: Fix inconsistent label name [debs/pybal] - 10https://gerrit.wikimedia.org/r/415858 [15:03:52] (03PS1) 10Mark Bergsma: Add peer address information to BGP and FSM log lines [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 [15:04:38] (03CR) 10jerkins-bot: [V: 04-1] Add peer address information to BGP and FSM log lines [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 (owner: 10Mark Bergsma) [15:07:35] (03PS2) 10Mark Bergsma: Add peer address information to BGP and FSM log lines [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 [15:09:58] PROBLEM - puppet last run on labtestvirt2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:14:37] (03PS1) 10Awight: Convert ORES tresholds config to new syntax (beta) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415861 (https://phabricator.wikimedia.org/T181159) [15:14:43] (03CR) 10Elukey: [C: 031] Add a thirdparty/php71 component for use by Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [15:15:15] !log rebooting auth* for kernel security updates [15:15:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:12] (03CR) 10Elukey: [C: 031] "@Joe: do you think that we could keep this component only for Phabricator (to solve all the segfaults etc..) and then reason about what to" [puppet] - 10https://gerrit.wikimedia.org/r/415856 (owner: 10Muehlenhoff) [15:16:28] PROBLEM - Host helium is DOWN: PING CRITICAL - Packet loss = 100% [15:16:48] PROBLEM - Host heze is DOWN: PING CRITICAL - Packet loss = 100% [15:17:17] RECOVERY - Host helium is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:17:52] akosiaris^ maintenance? 
[15:18:07] RECOVERY - Host heze is UP: PING OK - Packet loss = 0%, RTA = 36.16 ms [15:18:18] PROBLEM - HHVM rendering on mw2104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:18:25] jynus: I think so, I poked him about those reboots a few minutes ago :-) [15:18:31] ok [15:18:39] yes [15:18:50] kernel upgrades [15:18:53] cool [15:18:56] moritzm: done btw [15:18:57] :-) [15:19:06] you are always pinging me when there are no jobs running :-) [15:19:17] RECOVERY - HHVM rendering on mw2104 is OK: HTTP OK: HTTP/1.1 200 OK - 79769 bytes in 0.315 second response time [15:19:19] !log drain + reboot analytics10[41-45] for kernel updates [15:19:27] guess why there are no jobs running so much lately :-) [15:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:09] thanks :-) [15:20:39] jynus: it's Friday, I don't think there would have been anyway [15:20:46] but yes dbstore1001 ;-) [15:21:10] lately they used to take 3 days [15:21:19] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4017870 (10faidon) To answer my own earlier question: I was looking at nftables' wiki about the supported features [[ https://wiki.nftables.org/wiki-nftables/index.php/Supported_features_compared_to_xtables... 
[15:21:46] (03CR) 10Vgutierrez: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/415858 (owner: 10Mark Bergsma) [15:27:13] (03CR) 10Vgutierrez: [C: 032] Add peer address information to BGP and FSM log lines [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 (owner: 10Mark Bergsma) [15:27:28] (03CR) 10Vgutierrez: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 (owner: 10Mark Bergsma) [15:31:27] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Scrapes sample page) timed out before a response was received [15:32:17] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [15:40:30] (03PS1) 10Ppchelko: Switch all of the cdnPurge to kafka. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415870 (https://phabricator.wikimedia.org/T188540) [15:43:37] (03CR) 10Mark Bergsma: [C: 032] Add peer address information to BGP and FSM log lines [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 (owner: 10Mark Bergsma) [15:43:44] (03CR) 10Mark Bergsma: [C: 032] Fix inconsistent label name [debs/pybal] - 10https://gerrit.wikimedia.org/r/415858 (owner: 10Mark Bergsma) [15:44:21] (03Merged) 10jenkins-bot: Fix inconsistent label name [debs/pybal] - 10https://gerrit.wikimedia.org/r/415858 (owner: 10Mark Bergsma) [15:44:23] (03Merged) 10jenkins-bot: Add peer address information to BGP and FSM log lines [debs/pybal] - 10https://gerrit.wikimedia.org/r/415859 (owner: 10Mark Bergsma) [15:44:24] (03PS1) 10Gehel: [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) [15:45:05] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wdqs: configure the new internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/415872 (https://phabricator.wikimedia.org/T187766) (owner: 10Gehel) [15:46:27] (03CR) 10Muehlenhoff: admins: remove duplicate outdated entry for chrisneuroth (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/413667 (owner: 10Dzahn) [15:47:23] (03PS2) 10Rush: openstack: keystone bootstrap setup for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/415392 (https://phabricator.wikimedia.org/T188266) [15:47:57] (03CR) 10jerkins-bot: [V: 04-1] openstack: keystone bootstrap setup for mitaka [puppet] - 10https://gerrit.wikimedia.org/r/415392 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [15:52:30] (03CR) 10Rush: [C: 031] "sure, let's try it :)" [puppet] - 10https://gerrit.wikimedia.org/r/415840 (https://phabricator.wikimedia.org/T188624) (owner: 10Arturo Borrero Gonzalez) [15:53:33] (03PS2) 10Arturo Borrero Gonzalez: labstore: monitoring: interfaces: reduce check timeframe to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/415840 (https://phabricator.wikimedia.org/T188624) [15:54:25] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labstore: monitoring: interfaces: reduce check timeframe to 10 mins [puppet] - 10https://gerrit.wikimedia.org/r/415840 (https://phabricator.wikimedia.org/T188624) (owner: 10Arturo Borrero Gonzalez) [15:55:07] (03PS1) 10Rush: toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) [15:55:22] (03PS2) 10Rush: toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) [16:00:07] (03CR) 10Chico Venancio: [C: 031] toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [16:02:52] (03CR) 10Muehlenhoff: admins: remove duplicate outdated entry for chrisneuroth (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/413667 (owner: 10Dzahn) [16:03:01] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Lucas Werkmeister to the ldap/wmde and ldap/nda groups - 
https://phabricator.wikimedia.org/T188105#4018154 (10Dzahn) Oops, fixed! Adjusted both wmde and nda group for Lucas with a C. [16:03:18] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Lucas Werkmeister to the ldap/wmde and ldap/nda groups - https://phabricator.wikimedia.org/T188105#4018159 (10Dzahn) 05Open>03Resolved [16:04:05] (03PS3) 10Rush: toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) [16:06:02] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [16:07:00] (03PS3) 10Dzahn: admins: remove duplicate outdated entry for chrisneuroth [puppet] - 10https://gerrit.wikimedia.org/r/413667 [16:07:36] (03CR) 10Dzahn: [C: 032] admins: remove duplicate outdated entry for chrisneuroth [puppet] - 10https://gerrit.wikimedia.org/r/413667 (owner: 10Dzahn) [16:07:43] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Lucas Werkmeister to the ldap/wmde and ldap/nda groups - https://phabricator.wikimedia.org/T188105#4018180 (10Lucas_Werkmeister_WMDE) Great, thank you very much! 
[16:12:24] (03CR) 10Anomie: wiki-replicas: Accommodate new comments table with rules and compatibility (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:15:51] (03PS1) 10Ppchelko: Switch 50% for refreshLinks to kafka job queue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415877 (https://phabricator.wikimedia.org/T185052) [16:16:19] (03PS4) 10Arturo Borrero Gonzalez: toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [16:16:36] (03CR) 10Arturo Borrero Gonzalez: [V: 032 C: 032] toolforge: change test tolerance for paws and trusty grid [puppet] - 10https://gerrit.wikimedia.org/r/415874 (https://phabricator.wikimedia.org/T178405) (owner: 10Rush) [16:17:04] (03CR) 10jerkins-bot: [V: 04-1] Switch 50% for refreshLinks to kafka job queue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415877 (https://phabricator.wikimedia.org/T185052) (owner: 10Ppchelko) [16:18:15] (03PS2) 10Ppchelko: Switch 50% for refreshLinks to kafka job queue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415877 (https://phabricator.wikimedia.org/T185052) [16:19:56] (03PS6) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [16:20:25] (03CR) 10jerkins-bot: [V: 04-1] wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:23:08] (03PS7) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [16:23:58] Job queue is getting bigger, that's me. 
It will get reduced to way smaller by tomorrow [16:27:24] (03PS8) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [16:34:12] (03PS9) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [16:36:34] (03CR) 10Arturo Borrero Gonzalez: [C: 032] wikireplicas: Add partial index for page_props.pp_value [puppet] - 10https://gerrit.wikimedia.org/r/388572 (https://phabricator.wikimedia.org/T140609) (owner: 10BryanDavis) [16:36:42] 10Operations, 10Analytics-Kanban, 10Performance-Team, 10Release-Engineering-Team, 10User-Elukey: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4018276 (10elukey) p:05Triage>03Normal [16:36:45] (03PS2) 10Arturo Borrero Gonzalez: wikireplicas: Add partial index for page_props.pp_value [puppet] - 10https://gerrit.wikimedia.org/r/388572 (https://phabricator.wikimedia.org/T140609) (owner: 10BryanDavis) [16:36:57] (03PS1) 10Gehel: wdqs: switch alterting to prometheus instead of icinga [puppet] - 10https://gerrit.wikimedia.org/r/415884 [16:36:59] (03PS3) 10Arturo Borrero Gonzalez: mariadb: remove labsdb1001 & labsdb1003 special behavior [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [16:37:24] (03CR) 10jerkins-bot: [V: 04-1] wdqs: switch alterting to prometheus instead of icinga [puppet] - 10https://gerrit.wikimedia.org/r/415884 (owner: 10Gehel) [16:37:32] (03CR) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:39:53] (03PS1) 10Arturo Borrero Gonzalez: toollabs: base role: adjust system role string [puppet] - 
10https://gerrit.wikimedia.org/r/415885 (https://phabricator.wikimedia.org/T187193) [16:40:09] (03PS4) 10Arturo Borrero Gonzalez: mariadb: remove labsdb1001 & labsdb1003 special behavior [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [16:40:53] (03CR) 10Arturo Borrero Gonzalez: [C: 032] mariadb: remove labsdb1001 & labsdb1003 special behavior [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [16:41:18] (03CR) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:42:05] (03PS2) 10Arturo Borrero Gonzalez: toollabs: base role: adjust system role string [puppet] - 10https://gerrit.wikimedia.org/r/415885 (https://phabricator.wikimedia.org/T187193) [16:44:14] (03PS1) 10Elukey: role::eventlogging::analytics: deprecate mw.errors.* metrics [puppet] - 10https://gerrit.wikimedia.org/r/415887 (https://phabricator.wikimedia.org/T188749) [16:46:47] (03CR) 10Arturo Borrero Gonzalez: [C: 032] toollabs: base role: adjust system role string [puppet] - 10https://gerrit.wikimedia.org/r/415885 (https://phabricator.wikimedia.org/T187193) (owner: 10Arturo Borrero Gonzalez) [16:48:11] (03PS2) 10Gehel: wdqs: switch alterting to prometheus instead of icinga [puppet] - 10https://gerrit.wikimedia.org/r/415884 [16:48:16] 10Operations, 10Analytics-Kanban, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4018324 (10elukey) [16:52:41] (03PS7) 10Jcrespo: Add Proxysql creation debian package script [software] - 10https://gerrit.wikimedia.org/r/404153 [16:52:43] (03PS1) 10Jcrespo: Consider as busy all queries that are not in Sleep state [software] - 10https://gerrit.wikimedia.org/r/415888 
(https://phabricator.wikimedia.org/T188505) [16:52:45] 10Operations, 10Analytics: setup/install eventlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T185667#4018337 (10elukey) As preparation step to make the migration I discovered that eventlog1001 seems to receive udp traffic from mwlog* hosts to parse and generate mw.errors.* metrics. I opened a task to... [16:53:15] (03PS1) 10Gehel: wdqs: remove diamond collectors which have been replaced by prometheus [puppet] - 10https://gerrit.wikimedia.org/r/415889 [16:53:24] (03CR) 10Elukey: "pcc: https://puppet-compiler.wmflabs.org/compiler02/10237/" [puppet] - 10https://gerrit.wikimedia.org/r/415887 (https://phabricator.wikimedia.org/T188749) (owner: 10Elukey) [16:55:01] (03PS1) 10Gehel: wdqs: cleanup after removing diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/415891 [16:56:00] (03PS2) 10Herron: naggen2: support puppetdb 4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) [16:56:57] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [16:58:02] 10Operations, 10Analytics-Kanban, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4018358 (10fgiunchedi) Sounds good to me, we'd also need to audit dashboards in case we're using it somewhere and replace with logst... [16:58:57] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. 
[17:02:24] (03PS8) 10Jcrespo: Add Proxysql creation debian package script [software] - 10https://gerrit.wikimedia.org/r/404153 [17:02:25] (03PS2) 10Jcrespo: Consider as busy all queries that are not in Sleep state [software] - 10https://gerrit.wikimedia.org/r/415888 (https://phabricator.wikimedia.org/T188505) [17:05:27] PROBLEM - HHVM rendering on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:06:17] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 79779 bytes in 0.305 second response time [17:12:18] (03PS1) 10Filippo Giunchedi: wmflib: support segmented keys in Hiera 3 [puppet] - 10https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) [17:12:43] (03CR) 10Jcrespo: [C: 031] "We should check this are still working everywhere they are deployed- many thing changed since they were added." [puppet] - 10https://gerrit.wikimedia.org/r/408469 (https://phabricator.wikimedia.org/T184832) (owner: 10BryanDavis) [17:13:28] (03CR) 10jerkins-bot: [V: 04-1] wmflib: support segmented keys in Hiera 3 [puppet] - 10https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [17:18:35] (03PS10) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [17:20:01] (03CR) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [17:21:32] 10Operations, 10Wikimedia-Apache-configuration, 10User-Joe: Gain visibility into httpd mod_proxy actions - https://phabricator.wikimedia.org/T188601#4018471 (10MoritzMuehlenhoff) p:05Triage>03Normal [17:22:06] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Setup cron for foreachwikiindblist all-labs.dblist 
extensions/AbuseFilter/maintenance/purgeOldLogIPData.php on Beta - https://phabricator.wikimedia.org/T187658#4018472 (10MoritzMuehlenhoff) p:05Triage>03Normal [17:22:26] (03PS2) 10Filippo Giunchedi: wmflib: support segmented keys in Hiera 3 [puppet] - 10https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) [17:27:55] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018501 (10MoritzMuehlenhoff) Hmm, I just tried to reproduce by setting "tshark" on hold (a package which is installed via standard_packages) and the puppet run simply changed the package status:... [17:28:29] (03PS1) 10Ema: Add hiera max_core_rtt data for labs [puppet] - 10https://gerrit.wikimedia.org/r/415900 (https://phabricator.wikimedia.org/T157430) [17:28:56] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018507 (10jcrespo) So would you think it is a "valid" one? On the other side, it could lead to packages being forgotten on hold. [17:31:31] (03PS2) 10ArielGlenn: update recompressxml so it can handle the new html dump schema [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/415689 [17:35:26] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018533 (10jcrespo) oh, you said you "just tried" but were not able to. Interesting, we install mariadb packages like this: `require_package( $mariadb_package )` but clients like this: ``` pack... [17:37:28] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018556 (10MoritzMuehlenhoff) I think puppet is right in overriding the local admin choice here. If the package is configured as "present", it should also be upgradble in general. Per https://puppe... 
[17:41:08] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018573 (10MoritzMuehlenhoff) Hah, actually while puppet claims to reset the package status with the output I pasted, in fact it doesn't actually do that: "dpkg -l tshark" still shows the package h... [17:42:20] 10Operations, 10Mail, 10OTRS, 10WMF-Legal, 10User-Urbanecm: WMCZ want to use its own mail system instead of OTRS queue wm-cz@wikimedia.org - https://phabricator.wikimedia.org/T188753#4018577 (10Keegan) [17:43:31] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018586 (10jcrespo) I don't think this has a lot of priority, I would classify it as low and but when there is time, just get all gotchas and document them on the wiki. [17:45:05] 10Operations, 10Mail, 10OTRS, 10WMF-Legal, 10User-Urbanecm: WMCZ want to use its own mail system instead of OTRS queue wm-cz@wikimedia.org - https://phabricator.wikimedia.org/T188753#4018595 (10Urbanecm) Great, thanks! [17:45:11] (03CR) 10BBlack: [C: 031] Add hiera max_core_rtt data for labs [puppet] - 10https://gerrit.wikimedia.org/r/415900 (https://phabricator.wikimedia.org/T157430) (owner: 10Ema) [17:46:45] 10Operations, 10Puppet: Setting packages on 'hold' breaks puppet runs - https://phabricator.wikimedia.org/T187651#4018604 (10jcrespo) Interesting- maybe there was some other strange reasons at the time, like upgrading a dependent package or something else: ``` root@neodymium:~$ sudo apt-mark hold wmf-mariadb10... 
[17:47:08] (03PS1) 10Andrew Bogott: labweb: include scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/415904 [17:47:10] (03CR) 10Filippo Giunchedi: "This fixes a catalog compilation issue on ganeti hosts, where we look up the ganeti cluster name which contains dots and thus segments the" [puppet] - 10https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [17:48:19] (03CR) 10Ema: [C: 032] Add hiera max_core_rtt data for labs [puppet] - 10https://gerrit.wikimedia.org/r/415900 (https://phabricator.wikimedia.org/T157430) (owner: 10Ema) [17:50:01] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#4018622 (10phuedx) @mobrovac Did you sync with SRE about this task per last week's Audiences Services Sync meeting? [17:50:02] (03CR) 10Andrew Bogott: [C: 032] labweb: include scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/415904 (owner: 10Andrew Bogott) [17:50:07] (03PS2) 10Andrew Bogott: labweb: include scap::scripts [puppet] - 10https://gerrit.wikimedia.org/r/415904 [17:51:36] (03PS1) 10Gehel: wdqs: propagate rename of updater_option to wdqs-test [puppet] - 10https://gerrit.wikimedia.org/r/415906 [17:52:27] (03PS2) 10Gehel: wdqs: propagate rename of updater_option to wdqs-test [puppet] - 10https://gerrit.wikimedia.org/r/415906 [17:56:15] 10Operations, 10Cassandra, 10Patch-For-Review, 10Services (next): enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#4018640 (10Eevans) >>! In T92471#4015525, @mobrovac wrote: > @Eevans @fgiunchedi is there something left to be done here? This issue has been open for a... 
[17:57:21] 10Operations, 10Cassandra, 10Patch-For-Review, 10Services (next), 10User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471#4018641 (10Eevans) [18:01:12] 10Operations, 10Analytics-Kanban, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4018652 (10Krinkle) I’ve got a dashboard search tool we could use to easily check that: https://gist.github.com/Krinkle/b5ceff5156c1... [18:10:36] 10Operations, 10Mail, 10OTRS, 10WMF-Legal, 10User-Urbanecm: WMCZ want to use its own mail system instead of OTRS queue wm-cz@wikimedia.org - https://phabricator.wikimedia.org/T188753#4018388 (10Dzahn) Please see the details in the linked ticket T160400. It was ultimately solved by OIT (eross) by adding... [18:17:27] 10Operations: netfilter software at WMF: iptables vs nftables - https://phabricator.wikimedia.org/T187994#4018736 (10BBlack) https://lwn.net/Articles/747551/ has some interesting discussion on related topics, too. 
[18:20:18] (03CR) 10Filippo Giunchedi: "The hiera 3 compability would need to be present on stretch hosts only, so I'll rework the patches to DTRT on jessie and stretch" [puppet] - 10https://gerrit.wikimedia.org/r/415896 (https://phabricator.wikimedia.org/T188623) (owner: 10Filippo Giunchedi) [18:23:27] PROBLEM - HHVM rendering on mw2141 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:24:17] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 79276 bytes in 0.303 second response time [18:25:35] (03PS2) 10Dzahn: varnish: add director for transparency-private [puppet] - 10https://gerrit.wikimedia.org/r/415512 (https://phabricator.wikimedia.org/T188362) [18:26:32] (03CR) 10Dzahn: [C: 032] "T175445" [puppet] - 10https://gerrit.wikimedia.org/r/415512 (https://phabricator.wikimedia.org/T188362) (owner: 10Dzahn) [18:40:38] (03PS1) 10Dzahn: transparency: add/adjust Apache config for private site [puppet] - 10https://gerrit.wikimedia.org/r/415912 (https://phabricator.wikimedia.org/T188362) [18:42:11] (03PS2) 10Dzahn: transparency: add/adjust Apache config for private site [puppet] - 10https://gerrit.wikimedia.org/r/415912 (https://phabricator.wikimedia.org/T188362) [18:44:18] 10Operations, 10ops-codfw: attach furud's new arrays (furud-array[3-7]) - https://phabricator.wikimedia.org/T185153#4018820 (10faidon) a:05faidon>03Papaul Let's keep the existing arrays (array1 & array2) offline, and just connect all of the new ones. Also, no RAID config needed on the BIOS; we'll do the c... 
[18:46:16] (03PS1) 10Andrew Bogott: labweb: hhvm-enabled vhost for the new wikitech [puppet] - 10https://gerrit.wikimedia.org/r/415913 (https://phabricator.wikimedia.org/T168470) [18:47:22] (03PS1) 10Andrew Bogott: multiversion: add a transitional mapping for newwikitech.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415914 (https://phabricator.wikimedia.org/T168470) [18:47:26] (03CR) 10Dzahn: [C: 032] transparency: add/adjust Apache config for private site [puppet] - 10https://gerrit.wikimedia.org/r/415912 (https://phabricator.wikimedia.org/T188362) (owner: 10Dzahn) [18:51:24] (03PS3) 10Herron: naggen2: add support for puppetdb v4 settings and api [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) [18:52:27] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_clone_wikimedia/TransparencyReport-private] [18:53:18] ^due to my work and already fixed [18:54:02] 10Operations, 10Traffic: Turn up network links for Asia Cache DC - https://phabricator.wikimedia.org/T156031#4018837 (10faidon) [18:54:17] (03PS2) 10Dzahn: add transparency-private.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/415302 (https://phabricator.wikimedia.org/T188362) [18:54:25] (03CR) 10Herron: "> Given it's pretty easy to do, let's try to make it backward" [puppet] - 10https://gerrit.wikimedia.org/r/413435 (https://phabricator.wikimedia.org/T188032) (owner: 10Herron) [18:55:14] (03CR) 10Dzahn: [C: 032] add transparency-private.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/415302 (https://phabricator.wikimedia.org/T188362) (owner: 10Dzahn) [18:55:30] 10Operations, 10ORES, 10Scoring-platform-team, 10Traffic, and 4 others: 503 spikes and resulting API slowness starting 18:45 October 26 - https://phabricator.wikimedia.org/T179156#4018842 (10greg) 05stalled>03Resolved a:03BBlack >>! 
In T179156#3782508, @BBlack wrote: > No, we never made an incident r... [18:57:18] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:58:42] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4018856 (10Dzahn) done! https://transparency.wikimedia.org (as always) https://transparency.wikimedia.org/private (now removed... [18:59:19] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4018859 (10Dzahn) 05Open>03Resolved [19:03:25] 10Operations, 10Proton, 10Readers-Web-Backlog, 10Services (watching): Choose a server for the chromium-render service - https://phabricator.wikimedia.org/T187821#4018885 (10mobrovac) >>! In T187821#4018622, @phuedx wrote: > @mobrovac Did you sync with SRE about this task per last week's Audiences Services... 
[19:09:02] (03Abandoned) 10Mobrovac: Mathoid chart: Use port 10042 [deployment-charts] - 10https://gerrit.wikimedia.org/r/415605 (https://phabricator.wikimedia.org/T184919) (owner: 10Mobrovac) [19:10:57] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [19:17:53] (03PS2) 10Andrew Bogott: labweb: hhvm-enabled vhost for the new wikitech [puppet] - 10https://gerrit.wikimedia.org/r/415913 (https://phabricator.wikimedia.org/T168470) [19:18:26] (03CR) 10Andrew Bogott: [C: 032] labweb: hhvm-enabled vhost for the new wikitech [puppet] - 10https://gerrit.wikimedia.org/r/415913 (https://phabricator.wikimedia.org/T168470) (owner: 10Andrew Bogott) [19:21:03] (03CR) 10Bstorm: "For indexes:" [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:26:23] (03CR) 10Bstorm: "Unless this is all manual, in which case I can add it to this change." [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [19:29:13] 10Operations, 10Analytics-Kanban, 10Performance-Team, 10Release-Engineering-Team, and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4018949 (10Krinkle) It seems that aside from `mw.errors.exception`, the `mw.errors.*` metrics were last written to in 2016. ``` [1... [19:29:30] 10Operations, 10Analytics-Kanban, 10Release-Engineering-Team, 10Patch-For-Review, and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4018950 (10Krinkle) [19:36:33] 10Operations, 10DNS, 10Traffic, 10Patch-For-Review: Move "transparency.wikimedia.org/private" to "transparency-private.wikimedia.org" - https://phabricator.wikimedia.org/T188362#4018972 (10APalmer_WMF) Thank you so much, @Dzahn! 
[19:42:27] PROBLEM - Apache HTTP on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:43:17] RECOVERY - Apache HTTP on mw2123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.122 second response time [19:44:18] !log restarting labsdb1010 [19:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:31] (03CR) 10Imarlier: "> LGTM, one point remaining about multiple topics. Looks we're" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:46:37] (03PS10) 10Imarlier: coal: Process from Kafka instead of from ZMQ [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) [19:47:07] PROBLEM - haproxy failover on dbproxy1011 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [19:51:05] (03CR) 10Imarlier: "> Couple of news!" [puppet] - 10https://gerrit.wikimedia.org/r/415218 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [19:51:07] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0 [19:59:11] (03PS1) 10Jcrespo: Revert "labsdb: Depool labsdb1010 in preparation for its recovery" [puppet] - 10https://gerrit.wikimedia.org/r/415923 [19:59:46] (03CR) 10Jcrespo: "This should be merged when labsdb1010 catches up with replication." 
[puppet] - 10https://gerrit.wikimedia.org/r/415923 (owner: 10Jcrespo) [20:02:24] (03PS8) 10ArielGlenn: restbase dumps in xml format [dumps] - 10https://gerrit.wikimedia.org/r/413212 [20:02:54] (03PS3) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) [20:03:14] (03CR) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) (owner: 10Imarlier) [20:05:49] (03CR) 10Anomie: "> For indexes:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [20:15:41] 10Operations, 10DNS, 10Traffic, 10WMF-Communications, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4019085 (10greg) [20:15:56] (03CR) 10Smalyshev: [C: 031] wdqs: propagate rename of updater_option to wdqs-test [puppet] - 10https://gerrit.wikimedia.org/r/415906 (owner: 10Gehel) [20:16:02] 10Operations, 10DNS, 10Release-Engineering-Team, 10Traffic, and 2 others: Move Foundation Wiki to new URL when new Wikimedia Foundation website launches - https://phabricator.wikimedia.org/T188776#4019086 (10greg) [20:29:45] 10Operations, 10Analytics-Kanban, 10Patch-For-Review, 10Performance-Team (Radar), and 2 others: Deprecation of mw.errors.* metrics - https://phabricator.wikimedia.org/T188749#4019098 (10greg) [20:41:06] (03PS4) 10Imarlier: NavigtationTiming: Enable oversampling for Singapore [mediawiki-config] - 10https://gerrit.wikimedia.org/r/415618 (https://phabricator.wikimedia.org/T188652) [21:09:11] (03PS11) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [21:09:34] (03PS2) 10Dzahn: icinga: 
add stretch compat for php-gd/php7.0-gd [puppet] - 10https://gerrit.wikimedia.org/r/415764 [21:10:14] (03CR) 10Dzahn: [C: 032] icinga: add stretch compat for php-gd/php7.0-gd [puppet] - 10https://gerrit.wikimedia.org/r/415764 (owner: 10Dzahn) [21:10:45] (03CR) 10Bstorm: "Applying those fixes. If this is a manual script, and I think it is, then I'll just add the index logs to this change." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [21:11:06] (03PS12) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [21:12:24] (03PS3) 10Dzahn: locales: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/402164 [21:12:46] (03CR) 10jerkins-bot: [V: 04-1] locales: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/402164 (owner: 10Dzahn) [21:16:08] (03PS4) 10Dzahn: locales: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/402164 [21:22:13] (03CR) 10Dzahn: "violation delta: -1 (fix 2, add 1, but that one would need more refactoring)" [puppet] - 10https://gerrit.wikimedia.org/r/402164 (owner: 10Dzahn) [21:24:20] (03PS5) 10Dzahn: locales: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/402164 [21:26:30] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:27:20] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 79310 bytes in 0.388 second response time [21:44:57] (03PS6) 10Dzahn: locales: convert to profile [puppet] - 10https://gerrit.wikimedia.org/r/402164 [21:45:31] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/10240/" [puppet] - 10https://gerrit.wikimedia.org/r/402164 (owner: 10Dzahn) [21:50:06] 10Operations, 10Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4019295 (10Dzahn) btw, there is this 
change i uploaded, that would also create the access but it was stalled / had -1 from Krinkle for now: https://gerrit.wikimedia.org/r/#/c/392030/ also, maybe w... [21:53:20] (03PS3) 10Dzahn: jupyterhub: apache -> httpd module [puppet] - 10https://gerrit.wikimedia.org/r/415786 [21:53:57] 10Operations, 10Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4019302 (10Imarlier) @MoritzMuehlenhoff @Dzahn Thanks much, that makes sense. @Dzahn With regards to that stalled change, the issue is that a few of the things that get installed by the webperf rol... [21:54:04] 10Operations, 10Ops-Access-Requests: Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4019304 (10Imarlier) 05Open>03Resolved a:03Imarlier [22:03:27] (03CR) 10Dzahn: "i think the second group you need is "deploy-service" in this case" [puppet] - 10https://gerrit.wikimedia.org/r/415845 (https://phabricator.wikimedia.org/T188720) (owner: 10Gehel) [22:04:51] 10Operations, 10Ops-Access-Requests: performance-team/imarlier need access to graphite servers - https://phabricator.wikimedia.org/T188649#4019356 (10Dzahn) [22:07:30] 10Operations, 10Ops-Access-Requests: performance-team/imarlier need access to graphite servers - https://phabricator.wikimedia.org/T188649#4014947 (10Dzahn) Note there are 2 types of graphite machines. graphite::primary and graphite::production. 'primary' means 'production + performance::site' # graphite pro... [22:09:05] 10Operations, 10Ops-Access-Requests: performance-team/imarlier need access to graphite servers - https://phabricator.wikimedia.org/T188649#4019361 (10Imarlier) @Dzahn Yes pease! 
[22:09:19] 10Operations, 10Ops-Access-Requests, 10Performance-Team (Radar): Need access to webperf* servers - https://phabricator.wikimedia.org/T188650#4019362 (10Krinkle) [22:10:21] 10Operations, 10Ops-Access-Requests: performance-team/imarlier need access to graphite servers - https://phabricator.wikimedia.org/T188649#4019365 (10Dzahn) Do you just need a regular shell user (without root/sudo privileges) or do you need more than that? That would make a difference. We have an existing gro... [22:14:24] 10Operations, 10Ops-Access-Requests: performance-team/imarlier need access to graphite servers - https://phabricator.wikimedia.org/T188649#4019387 (10Imarlier) Hrm, probably need (limited) sudo access. Specifically, being able to read/tail the logs that are written by systemd when it launches a new service, a... [22:31:06] 10Operations, 10Ops-Access-Requests: performance-team/imarlier need access to graphite servers - https://phabricator.wikimedia.org/T188649#4019421 (10Dzahn) Ok, no problem. We have other examples of groups that have sudo privileges to read logfiles. 
[22:32:13] (03PS1) 10Andrew Bogott: Remove mariadb from silver [puppet] - 10https://gerrit.wikimedia.org/r/415995 (https://phabricator.wikimedia.org/T188029) [22:33:06] (03CR) 10Andrew Bogott: [C: 032] Remove mariadb from silver [puppet] - 10https://gerrit.wikimedia.org/r/415995 (https://phabricator.wikimedia.org/T188029) (owner: 10Andrew Bogott) [22:43:17] (03CR) 10Dzahn: [C: 032] "complete no-op on netbook1001" [puppet] - 10https://gerrit.wikimedia.org/r/415786 (owner: 10Dzahn) [22:45:30] PROBLEM - HHVM rendering on mw2209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:46:20] RECOVERY - HHVM rendering on mw2209 is OK: HTTP OK: HTTP/1.1 200 OK - 79284 bytes in 0.295 second response time [23:04:18] 10Operations, 10Ops-Access-Requests, 10Discovery-Search (Current work): Google Search Console access for Search Platform team - https://phabricator.wikimedia.org/T188453#4019547 (10EBjune) Yes @MoritzMuehlenhoff that is the correct list of names, thanks! [23:37:17] (03PS13) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) [23:40:00] (03CR) 10Bstorm: "Adjusting those indexes a bit." [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [23:43:01] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:58:42] (03PS14) 10Bstorm: wiki-replicas: Accommodate new comments table with rules and compatibility [puppet] - 10https://gerrit.wikimedia.org/r/415384 (https://phabricator.wikimedia.org/T181650)