[00:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181214T0000). [00:00:04] Zoranzoki21: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:48] I am here [00:01:36] (03PS1) 10Jforrester: Enforce a 10-byte password for staff users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479569 [00:01:38] (03PS1) 10Jforrester: Enforce a 10-byte password for privileged users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) [00:01:40] (03PS1) 10Jforrester: Require an 8-byte new password for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479571 (https://phabricator.wikimedia.org/T211622) [00:01:42] (03PS1) 10Jforrester: Require passwords do not match account names for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479572 (https://phabricator.wikimedia.org/T208441) [00:01:44] (03PS1) 10Jforrester: Require that passwords are not in any common list for privileged groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479573 [00:01:46] (03PS1) 10Jforrester: Require that passwords are not in the most common 100k list for all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) [00:02:03] (03PS1) 10Paladox: php: Create profile::php::fpm to handle fpm integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [00:02:05] (03CR) 10Jforrester: [C: 04-2] "Let's not go any further just right now." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/479569 (owner: 10Jforrester) [00:02:53] (03CR) 10jerkins-bot: [V: 04-1] php: Create profile::php::fpm to handle fpm integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 (owner: 10Paladox) [00:04:19] (03PS1) 10BryanDavis: wmcs: catch and log view drop errors in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) [00:06:55] (03PS10) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [00:09:34] Will anyone SWAT or I can go to sleep? [00:21:54] Zoranzoki21: Hey, sorry, I'll do it. [00:22:03] (03PS11) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [00:22:10] Zoranzoki21: Is there no task associated? [00:22:21] James_F: No [00:22:31] James_F: No, there isn't [00:22:40] OK. [00:22:43] (03PS4) 10Jforrester: wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikiquote and srwiktionary translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479562 (owner: 10Zoranzoki21) [00:22:48] (03CR) 10Jforrester: [C: 03+2] wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikiquote and srwiktionary translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479562 (owner: 10Zoranzoki21) [00:22:58] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:23:37] James_F: It no needs testing at mwdebug [00:23:54] (03Merged) 10jenkins-bot: wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikiquote and srwiktionary translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479562 (owner: 10Zoranzoki21) [00:23:56] (03PS12) 10Paladox: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:24:01] (03CR) 10Paladox: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:24:39] Zoranzoki21: Do you mean that you don't think it needs testing? It looks fine to me. [00:24:50] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:24:51] Yes [00:25:24] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:25:26] OK, seems sane. [00:25:37] * James_F double-checks. [00:26:41] OK, synching. [00:27:05] (03CR) 10Jforrester: [C: 03+2] "Approved by Trust and Safety." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479569 (owner: 10Jforrester) [00:27:15] (03CR) 10Jforrester: [C: 04-2] "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479570 (https://phabricator.wikimedia.org/T208246) (owner: 10Jforrester) [00:27:20] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT I7867277d Make wmgBabelMainCategory consistent for sr* wikis (duration: 00m 45s) [00:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:27:36] Zoranzoki21: OK, all looks good. Thanks! 
[00:28:08] (03Merged) 10jenkins-bot: Enforce a 10-byte password for staff users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479569 (owner: 10Jforrester) [00:28:30] James_F: yw [00:29:54] (03CR) 10jenkins-bot: wmgBabelMainCategory: Update srwikinews translation, add srwikibooks, srwikiquote and srwiktionary translations [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479562 (owner: 10Zoranzoki21) [00:29:56] (03CR) 10jenkins-bot: Enforce a 10-byte password for staff users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479569 (owner: 10Jforrester) [00:31:05] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Enforce a 10-byte password for +staff users, I4ecac70e (duration: 00m 44s) [00:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:31:56] (03PS13) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [00:32:44] (03CR) 10jerkins-bot: [V: 04-1] gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:33:19] !log rebooting install2002 via ganeti2003, to add new virtual disk [00:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:16] Deployment clear. But it's Thursday night, so no more deploys please. 
;-) [00:38:46] (03PS14) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [00:43:14] !log install2002 (T211850) restarted instance, created ext4 filesystem on new /dev/vdb, mounted on /mnt/vdb, rsyncing /srv/ to /mnt/vdb/ [00:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:43:18] T211850: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 [00:51:24] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/13936/cobalt.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [00:51:35] (03PS15) 10Dzahn: gerrit: add data types for all parameters [puppet] - 10https://gerrit.wikimedia.org/r/478116 [00:51:41] (03PS2) 10Paladox: php: Create profile::php::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [00:53:04] (03CR) 10jerkins-bot: [V: 04-1] php: Create profile::php::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 (owner: 10Paladox) [00:53:53] 10Operations, 10Discovery-Wikidata-Query-Service-Sprint, 10Patch-For-Review: make wdqs-updater heap size configurable from puppet - https://phabricator.wikimedia.org/T210290 (10Smalyshev) 05Open→03Resolved I think this is done now? 
[00:53:56] (03PS3) 10Paladox: php: Create profile::php::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [00:54:52] (03CR) 10jerkins-bot: [V: 04-1] php: Create profile::php::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 (owner: 10Paladox) [00:58:10] (03PS4) 10Paladox: php: Create profile::php::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [01:02:07] (03PS5) 10Paladox: php: Create profile::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [01:05:52] (03PS6) 10Paladox: php: Create profile::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [01:07:36] (03PS7) 10Paladox: php: Create profile::php to handle fpm/mod_php integration [puppet] - 10https://gerrit.wikimedia.org/r/479575 [01:07:52] (03CR) 10Dzahn: [C: 03+2] "noop on gerrit2001 and then also cobalt afterwards" [puppet] - 10https://gerrit.wikimedia.org/r/478116 (owner: 10Dzahn) [01:10:13] (03PS1) 10Paladox: phabricator: use new profile::php module [puppet] - 10https://gerrit.wikimedia.org/r/479580 [01:12:23] (03PS1) 10Dzahn: gerrit: use specific data types for IPv4 vs IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/479581 [01:13:06] 10Operations, 10Wikimedia-General-or-Unknown: Request for information about hosting services for WM-ES - https://phabricator.wikimedia.org/T211414 (10Platonides) We already run a wiki for our chapter (well, actually a couple, one public and another private). The main url for WM-ES page is https://www.wikimedi... 
[01:14:31] mutante: to be fair, I don't know what is in the mind of the task author [01:14:43] (03PS2) 10Paladox: phabricator: use new profile::php module [puppet] - 10https://gerrit.wikimedia.org/r/479580 [01:14:56] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479580 (owner: 10Paladox) [01:15:46] Platonides: oh:) i see, i was also surprised why he says "our website" and "hosting" but links to a Wikipedia portal [01:15:49] (03CR) 10jerkins-bot: [V: 04-1] phabricator: use new profile::php module [puppet] - 10https://gerrit.wikimedia.org/r/479580 (owner: 10Paladox) [01:16:40] Platonides: i guess he means "WMF should host wikimedia.es" then [01:17:01] (03CR) 10jerkins-bot: [V: 04-1] phabricator: use new profile::php module [puppet] - 10https://gerrit.wikimedia.org/r/479580 (owner: 10Paladox) [01:17:42] which is (another) Mediawiki [01:17:57] yes [01:18:21] there were some plans to make that a wordpress [01:18:27] but things move slow... [01:19:10] that wiki also has some magic features, eg. http://www.wikilovesearth.es/ [01:19:40] is actually https://wiki.wikimedia.es/wiki/Wiki_Loves_Earth?useskin=monobook [01:19:45] (03PS3) 10Paladox: phabricator: use new profile::php module [puppet] - 10https://gerrit.wikimedia.org/r/479580 [01:19:49] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479580 (owner: 10Paladox) [01:19:51] the general trend is "all or nothing" so either we move ownership of domain names and DNS zone and mail and the wiki.. 
or we move none of it [01:20:00] we want to avoid the mixed ones [01:20:32] I don't think we would be interested in doing that [01:20:38] certainly possible to host chapter wikis in general [01:21:24] but probably not all the other stuff on those domain names [01:21:42] we have a lot of things ;) [01:22:43] (03CR) 10Paladox: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1001/67/" [puppet] - 10https://gerrit.wikimedia.org/r/479580 (owner: 10Paladox) [01:26:26] (03CR) 10Paladox: php: Create profile::php to handle fpm/mod_php integration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479575 (owner: 10Paladox) [01:28:31] (03CR) 10Dzahn: [C: 03+2] gerrit: use specific data types for IPv4 vs IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/479581 (owner: 10Dzahn) [01:28:42] (03PS2) 10Dzahn: gerrit: use specific data types for IPv4 vs IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/479581 [01:39:09] !log install2002 deleted /srv/ contents, then mounted /mnt/vdb on /srv so same content but now / is used only 7% and /srv 57% (T211850) [01:39:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:39:13] T211850: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 [01:41:35] 10Operations: install2002 94% disk usage on "/" - https://phabricator.wikimedia.org/T211850 (10Dzahn) p:05High→03Normal ` root@install2002:/srv# df -hT Filesystem Type Size Used Avail Use% Mounted on udev devtmpfs 10M 0 10M 0% /dev tmpfs tmpfs 401M 5.4M 396M 2%... [02:33:47] 10Operations, 10Traffic, 10Performance-Team (Radar), 10Services (designing), and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Imarlier) On a very random note, I wanted to say that I enjoyed this: {F27546380} Guess the subscriber list tr... 
[02:40:04] (03PS1) 10CRusnov: Change all reports to log only errors except for a summary count [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) [02:45:54] (03CR) 10Bstorm: "Nit-picky thing. Although I see the point in the way it's done in patchset 1, I do feel that exit codes should be more similar to the outp" [puppet] - 10https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) (owner: 10BryanDavis) [02:51:58] (03CR) 10CRusnov: "Note that this also adds other wishlist items such as certain device excluding, and makes the formats for asset tags and tickets case inse" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [02:54:43] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) Question: I was asked to implement the report what connect to puppetdb, what were the p... [03:32:25] (03CR) 10BryanDavis: "> Nit-picky thing. 
Although I see the point in the way it's done in" [puppet] - 10https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) (owner: 10BryanDavis) [03:35:27] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 948.45 seconds [03:55:49] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) (owner: 10BryanDavis) [03:57:39] (03CR) 10Bstorm: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) (owner: 10BryanDavis) [04:01:03] (03PS2) 10BryanDavis: wmcs: catch and log view drop errors in maintain-views [puppet] - 10https://gerrit.wikimedia.org/r/479576 (https://phabricator.wikimedia.org/T211940) [04:23:59] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 211.06 seconds [04:38:51] 10Operations, 10MediaWiki-Page-deletion, 10MW-1.32-release, 10Performance-Team (Radar): Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (10BPirkle) @Krinkle , have you seen evidence of an ongoing problem, or can we resolve this task? [05:18:52] 10Operations, 10MediaWiki-Page-deletion, 10MW-1.32-release, 10Performance-Team (Radar): Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (10tstarling) @Krinkle's test case appears to be unrelated, although it is a bug. I filed T211953 for it. [05:22:30] 10Operations, 10MediaWiki-Page-deletion, 10MW-1.32-release, 10Performance-Team (Radar): Deleting pages on the English Wikipedia is very slow - https://phabricator.wikimedia.org/T207530 (10tstarling) 05Open→03Resolved [05:48:34] Hey, I need to deploy a patch on wikis. I can't wait for MediaWiki train. [05:49:02] The next train will be 17 Dec. [05:49:26] And I need to deploy the patch something right now. 
[05:49:38] Jayprakash12345: Then you need to contact releng to see if your patch is critical enough so it can be deployed right now [05:50:50] (03PS1) 10Marostegui: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479586 (https://phabricator.wikimedia.org/T86338) [05:51:20] marostegui: Thanks, What is the channel name of releng? [05:52:09] Jayprakash12345: #wikimedia-releng [05:53:02] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479586 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [05:54:06] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479586 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [05:55:08] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1094 T86338 T202167 (duration: 00m 47s) [05:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:55:14] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [05:55:14] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:01:22] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479587 [06:05:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1094 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479586 (https://phabricator.wikimedia.org/T86338) (owner: 10Marostegui) [06:10:19] !log Deployed schema change on db1094 T86338 T202167 [06:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:10:29] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [06:10:30] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:12:05] (03CR) 10Marostegui: [C: 03+2] Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/479587 (owner: 10Marostegui) [06:13:16] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479587 (owner: 10Marostegui) [06:14:12] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1094 T86338 T202167 (duration: 00m 44s) [06:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:14:20] !log Deploy schema change on db1062 (s7 primary master) T86338 T202167 [06:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:51] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1094" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479587 (owner: 10Marostegui) [06:28:39] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [06:31:15] PROBLEM - puppet last run on analytics1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/smartmontools/run.d/20logger] [06:35:59] (03PS2) 10Giuseppe Lavagetto: Remove references to the old, decommissioned etcd cluster [dns] - 10https://gerrit.wikimedia.org/r/479178 [06:38:25] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational [06:43:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:44:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [06:46:52] https://phabricator.wikimedia.org/T211882 [06:47:02] please look at this urgently ^ [06:53:45] !log Deploy schema change on db2043 (s3 codfw master) - this will generate lag on s3 codfw T86338 T20216 [06:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:53:56] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [06:53:56] T20216: Bugzilla landing page should point to mediawiki.org - https://phabricator.wikimedia.org/T20216 [06:54:26] !log Deploy schema change on db2043 (s3 codfw master) - this will generate lag on s3 codfw T86338 T202167 [06:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:30] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [06:57:15] RECOVERY - puppet last run on analytics1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:12:42] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Joe) >>! In T211721#4819457, @Tgr wrote: >> Sessions are currently stored in Redis, a highly-optimized in-... [07:14:33] (03Abandoned) 10Elukey: Revert "Add change_tag to list of tables to sqoop" [puppet] - 10https://gerrit.wikimedia.org/r/477818 (owner: 10Fdans) [07:20:26] 10Operations, 10Performance-Team, 10TechCom, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Establish an SLA for session storage - https://phabricator.wikimedia.org/T211721 (10Joe) To add to what @tgr found, we have to search for usage of `MediaWikiServices::getInstance()->getMain... 
[07:20:46] (03PS3) 10Elukey: Make Kerberos configurable for cdh::hadoop::namenode::primary [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [07:44:09] (03PS1) 10Marostegui: db-codfw.php: Depool db2084 and db2088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479590 [07:47:00] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2084 and db2088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479590 (owner: 10Marostegui) [07:48:04] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2084 and db2088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479590 (owner: 10Marostegui) [07:49:22] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2084 and db2088 for mysql and kernel upgrade (duration: 00m 45s) [07:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:26] !log Upgrade mysql and kernel on db2084 and db2088 [07:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:59] !log Deploy schema change on dbstore1002:s3 T86338 T202167 [07:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:59:04] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [07:59:04] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [08:00:10] (03CR) 10jenkins-bot: db-codfw.php: Depool db2084 and db2088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479590 (owner: 10Marostegui) [08:04:13] (03CR) 10Filippo Giunchedi: "LGTM, we've established the consumer group id isn't going to change and thus weblog1001 will pick up where oxygen left off?" 
[puppet] - 10https://gerrit.wikimedia.org/r/479448 (https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [08:05:00] (03CR) 10Filippo Giunchedi: [C: 03+1] add forward/reverse records for kibana.svc.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/479539 (owner: 10Herron) [08:06:54] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM other than a typo" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479541 (https://phabricator.wikimedia.org/T205850) (owner: 10Herron) [08:07:19] (03CR) 10Elukey: "> LGTM, we've established the consumer group id isn't going to change" [puppet] - 10https://gerrit.wikimedia.org/r/479448 (https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [08:12:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/479448 (https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [08:16:30] (03CR) 10Muehlenhoff: spicerack: configure APT component/spicerack (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479555 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [08:28:32] (03PS4) 10Elukey: Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [08:29:22] (03CR) 10Filippo Giunchedi: "Thanks all for working on this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/479395 (https://phabricator.wikimedia.org/T208215) (owner: 10Mathew.onipe) [08:33:28] (03PS5) 10Elukey: Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [08:39:29] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2084 and db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479594 [08:45:31] (03PS7) 10Muehlenhoff: Add kerberos puppet wrapper [puppet] - 10https://gerrit.wikimedia.org/r/477987 [08:45:51] 10Operations, 10Traffic, 10Performance-Team (Radar), 10Services (designing), and 3 others: Harmonise the identification of requests across our stack - https://phabricator.wikimedia.org/T201409 (10Gilles) @imarlier https://translate.google.com/#view=home&op=translate&sl=et&tl=en&text=krinkle [08:45:57] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2084 and db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479594 (owner: 10Marostegui) [08:47:02] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2084 and db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479594 (owner: 10Marostegui) [08:47:20] !log disabled kafkatee-webrequest logstash output on oxygen (prep step before weblog1001) [08:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:47:23] godog: --^ [08:47:36] (03PS2) 10Elukey: Swap oxygen with weblog1001 [puppet] - 10https://gerrit.wikimedia.org/r/479448 (https://phabricator.wikimedia.org/T211883) [08:47:46] will leave it running for a couple of days just in case [08:47:51] now provisioning weblog1001 [08:48:12] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2084 and db2088 after mysql and kernel upgrade (duration: 00m 44s) [08:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:24] (03CR) 10Elukey: [C: 03+2] Swap oxygen with weblog1001 [puppet] - 10https://gerrit.wikimedia.org/r/479448 
(https://phabricator.wikimedia.org/T211883) (owner: 10Elukey) [08:50:47] !log swap oxygen with weblog1001 [08:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:36] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2084 and db2088" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479594 (owner: 10Marostegui) [08:51:45] elukey: neat! [08:52:10] (03PS8) 10Muehlenhoff: Add kerberos puppet wrapper [puppet] - 10https://gerrit.wikimedia.org/r/477987 [08:52:57] (03PS1) 10Marostegui: db-codfw.php: Depool db2070,db2049,db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479595 [08:53:49] (03CR) 10Muehlenhoff: [C: 03+2] Add kerberos puppet wrapper [puppet] - 10https://gerrit.wikimedia.org/r/477987 (owner: 10Muehlenhoff) [08:54:35] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2070,db2049,db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479595 (owner: 10Marostegui) [08:55:37] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2070,db2049,db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479595 (owner: 10Marostegui) [08:56:43] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2049, db2059, db2070 for mysql and kernel upgrade (duration: 00m 43s) [08:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:07] 10Operations, 10Patch-For-Review, 10User-Elukey: Move oxygen to weblog1001 - https://phabricator.wikimedia.org/T211883 (10elukey) Current status: * disabled puppet on oxygen and killed the kafkatee output to logstash (verified via netstat that we don't have any conn to logstash anymore) * enabled role `logg... 
[08:58:39] !log Stop MySQL on db2049, db2059 and db2070 for mysql and kernel upgrade [08:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:53] 10Operations, 10Patch-For-Review, 10User-Elukey: Move oxygen to weblog1001 - https://phabricator.wikimedia.org/T211883 (10elukey) @herron decided to proceed to unblock the oxygen's decom process, from now on we can decide how to proceed with logstash/webrequest-503 (it will likely take a bit of time so bette... [09:01:41] (03CR) 10Elukey: [C: 04-1] "https://puppet-compiler.wmflabs.org/compiler1002/13938/an-master1001.eqiad.wmnet/change.an-master1001.eqiad.wmnet.err" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [09:04:47] (03CR) 10jenkins-bot: db-codfw.php: Depool db2070,db2049,db2059 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479595 (owner: 10Marostegui) [09:07:38] (03PS6) 10Elukey: Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [09:07:53] (03CR) 10jerkins-bot: [V: 04-1] Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [09:08:00] ouch [09:09:43] (03PS7) 10Elukey: Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [09:17:29] 10Operations, 10media-storage, 10User-fgiunchedi: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10fgiunchedi) a:05fgiunchedi→03Cmjohnson >>! In T209618#4788269, @fgiunchedi wrote: >>>! In T209618#4786263, @Cmjohnson wrote: >> @fgiunchedi For racking this is th... [09:18:11] (03CR) 10Elukey: "Looks good now! 
https://puppet-compiler.wmflabs.org/compiler1002/13940/" [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [09:21:53] !log global user rename is in progress - T209488 [09:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:57] T209488: Global rename of Massimo Telò → Teseo: supervision needed - https://phabricator.wikimedia.org/T209488 [09:22:32] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2070,db2049,db2059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479596 [09:23:45] (03PS8) 10Muehlenhoff: Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 [09:26:51] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2070,db2049,db2059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479596 (owner: 10Marostegui) [09:27:57] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2070,db2049,db2059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479596 (owner: 10Marostegui) [09:28:59] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2049, db2059, db2070 after mysql and kernel upgrade (duration: 00m 45s) [09:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:22] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2070,db2049,db2059" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479596 (owner: 10Marostegui) [09:33:33] PROBLEM - Check systemd state on weblog1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[09:33:47] !log Deploy schema change on db1095:3313 T86338 T202167 [09:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:52] T86338: Dropping page.page_counter on wmf databases - https://phabricator.wikimedia.org/T86338 [09:33:52] T202167: Schema change for rc_this_oldid index - https://phabricator.wikimedia.org/T202167 [09:36:59] 10Operations, 10Performance-Team, 10Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (10fgiunchedi) I can confirm I'm getting 502/503 from those dashboards every now and then. I suspect this being related to having changed graphite datasource in grafana from "d... [09:47:28] (03CR) 10Muehlenhoff: [C: 03+2] Allow cdh::hadoop::directory to use kerberos auth [puppet/cdh] - 10https://gerrit.wikimedia.org/r/478625 (owner: 10Muehlenhoff) [09:52:03] 10Operations, 10Release-Engineering-Team, 10Scap, 10User-ArielGlenn: Make scap and opcache work consistently together - https://phabricator.wikimedia.org/T211964 (10Joe) p:05Triage→03Normal [09:52:25] (03PS1) 10Muehlenhoff: Remove Diamond from Swift proxies [puppet] - 10https://gerrit.wikimedia.org/r/479600 [09:52:58] (03PS1) 10Elukey: Update cdh submodule to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/479601 [09:57:43] (03CR) 10Muehlenhoff: [C: 03+1] Update cdh submodule to latest sha [puppet] - 10https://gerrit.wikimedia.org/r/479601 (owner: 10Elukey) [09:58:20] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/13943/" [puppet] - 10https://gerrit.wikimedia.org/r/479601 (owner: 10Elukey) [10:04:44] 10Operations, 10Performance-Team, 10Graphite: Graphite generates a lot of 502 in Grafana - https://phabricator.wikimedia.org/T211747 (10fgiunchedi) The 502s are also present on graphite1004 from `wsgi-handler` in `/var/log/apache2/other_vhosts_access.log`: ` 2018-12-14T09:29:46 664 10.64.48.103 u... 
[10:08:49] jouncebot: next
[10:08:49] In 72 hour(s) and 21 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181217T1030)
[10:09:58] (PS4) Filippo Giunchedi: logging: introduce cee formatter usage [mediawiki-config] - https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124)
[10:16:24] (CR) Filippo Giunchedi: [C: +2] logging: introduce cee formatter usage [mediawiki-config] - https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[10:18:02] !log filippo@deploy1001 Synchronized wmf-config/InitialiseSettings.php: (no justification provided) (duration: 00m 45s)
[10:18:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:30] !log filippo@deploy1001 Synchronized wmf-config/logging.php: wmf-config/InitialiseSettings-labs.php (duration: 00m 44s)
[10:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:52] Operations, Research-Programs, SRE-Access-Requests, Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (Afandian) Thanks for all your help @Dzahn @jijiki @RyanSteinberg . I can now log in, all working as...
[10:22:13] (CR) jenkins-bot: logging: introduce cee formatter usage [mediawiki-config] - https://gerrit.wikimedia.org/r/478621 (https://phabricator.wikimedia.org/T211124) (owner: Filippo Giunchedi)
[10:22:37] Operations, Research-Programs, SRE-Access-Requests, Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (jijiki) @toddleroux, @Afandian, @RyanSteinberg if everything is alright, we can mark this as resolv...
[10:27:32] (PS1) Muehlenhoff: Enable Kerberos for spark_assembly_jar_install [puppet/cdh] - https://gerrit.wikimedia.org/r/479606
[10:28:10] (CR) jerkins-bot: [V: -1] Enable Kerberos for spark_assembly_jar_install [puppet/cdh] - https://gerrit.wikimedia.org/r/479606 (owner: Muehlenhoff)
[10:44:04] Operations, Cloud-VPS, IPv6, cloud-services-team (Kanban): Enable IPv6 on CloudVPS - https://phabricator.wikimedia.org/T37947 (aborrero) Persisting here some notes from @chasemp for future reference: * This comes from around Kilo time when IPv6 was first being introduced and it was described as...
[10:45:50] (CR) Banyek: "I am not sure, because I am not seeing how this will differ at the end from the other multiinstance hosts (I mean OS level, not data-wise)" [puppet] - https://gerrit.wikimedia.org/r/479224 (https://phabricator.wikimedia.org/T210478) (owner: Banyek)
[10:47:44] (PS1) Marostegui: db-codfw.php: Depool db2083, db2068, db2067 [mediawiki-config] - https://gerrit.wikimedia.org/r/479607
[10:48:55] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2083, db2068, db2067 [mediawiki-config] - https://gerrit.wikimedia.org/r/479607 (owner: Marostegui)
[10:49:59] (Merged) jenkins-bot: db-codfw.php: Depool db2083, db2068, db2067 [mediawiki-config] - https://gerrit.wikimedia.org/r/479607 (owner: Marostegui)
[10:50:02] (PS1) Elukey: Extend use_kerberos to other classes [puppet/cdh] - https://gerrit.wikimedia.org/r/479608
[10:50:45] !log Stop MySQL on db2083, db2068 and db2067 for mysql and kernel upgrade
[10:50:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:07] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2083 db2068 db2067 for mysql and kernel upgrade (duration: 00m 45s)
[10:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:28] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13944/" [puppet/cdh] - https://gerrit.wikimedia.org/r/479608 (owner: Elukey)
[11:01:06] (CR) jenkins-bot: db-codfw.php: Depool db2083, db2068, db2067 [mediawiki-config] - https://gerrit.wikimedia.org/r/479607 (owner: Marostegui)
[11:02:39] (PS1) Marostegui: Revert "db-codfw.php: Depool db2083, db2068, db2067" [mediawiki-config] - https://gerrit.wikimedia.org/r/479611
[11:03:07] (CR) Alexandros Kosiaris: [V: +2 C: +2] "> I was able to run the tests and noticed the CPU usage was hitting the limit during the concurrency tests. I increased the limit to 2 and" [deployment-charts] - https://gerrit.wikimedia.org/r/479026 (https://phabricator.wikimedia.org/T211708) (owner: Jeena Huneidi)
[11:03:32] (PS1) Ema: ATS: do not log varnishcheck requests [puppet] - https://gerrit.wikimedia.org/r/479612 (https://phabricator.wikimedia.org/T204225)
[11:08:23] (PS1) Elukey: Update cdh module to latest sha [puppet] - https://gerrit.wikimedia.org/r/479613
[11:08:25] (PS1) Elukey: profile::hadoop/hive: introduce the use_kerberos flag [puppet] - https://gerrit.wikimedia.org/r/479614
[11:08:30] moritzm: --^
[11:08:41] (CR) Marostegui: [C: +2] Revert "db-codfw.php: Depool db2083, db2068, db2067" [mediawiki-config] - https://gerrit.wikimedia.org/r/479611 (owner: Marostegui)
[11:10:07] (Merged) jenkins-bot: Revert "db-codfw.php: Depool db2083, db2068, db2067" [mediawiki-config] - https://gerrit.wikimedia.org/r/479611 (owner: Marostegui)
[11:11:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2083 db2068 db2067 after mysql and kernel upgrade (duration: 00m 44s)
[11:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:00] (CR) jenkins-bot: Revert "db-codfw.php: Depool db2083, db2068, db2067" [mediawiki-config] - https://gerrit.wikimedia.org/r/479611 (owner: Marostegui)
[11:20:33] (CR) Muehlenhoff: [C:
+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/479614 (owner: Elukey)
[11:21:43] (CR) Elukey: [C: +2] Update cdh module to latest sha [puppet] - https://gerrit.wikimedia.org/r/479613 (owner: Elukey)
[11:21:52] (CR) Elukey: [C: +2] profile::hadoop/hive: introduce the use_kerberos flag [puppet] - https://gerrit.wikimedia.org/r/479614 (owner: Elukey)
[11:22:06] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13946/" [puppet] - https://gerrit.wikimedia.org/r/479614 (owner: Elukey)
[11:32:29] (PS1) Muehlenhoff: profile::hadoop::worker: Make kerberos configurable [puppet] - https://gerrit.wikimedia.org/r/479616
[11:33:22] (CR) jerkins-bot: [V: -1] profile::hadoop::worker: Make kerberos configurable [puppet] - https://gerrit.wikimedia.org/r/479616 (owner: Muehlenhoff)
[11:34:32] (PS2) Muehlenhoff: profile::hadoop::worker: Make kerberos configurable [puppet] - https://gerrit.wikimedia.org/r/479616
[11:35:30] (CR) Volans: [C: -1] "This commit is mixing few different things. See comments inline." (6 comments) [software/netbox-reports] - https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) (owner: CRusnov)
[11:40:55] (PS1) Muehlenhoff: Make Kerberos configurable in cdh::hadoop::worker [puppet/cdh] - https://gerrit.wikimedia.org/r/479617
[11:46:53] Operations, Operations-Software-Development, Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (Volans) >>! In T205899#4822775, @crusnov wrote: > Question: I was asked to implement the report...
[11:48:48] (CR) Reedy: [C: -1] Require that passwords are not in the most common 100k list for all users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[11:50:07] (CR) Reedy: [C: -1] Require that passwords are not in the most common 100k list for all users (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479574 (https://phabricator.wikimedia.org/T151425) (owner: Jforrester)
[11:50:46] (PS2) Volans: spicerack: configure APT component/spicerack [puppet] - https://gerrit.wikimedia.org/r/479555 (https://phabricator.wikimedia.org/T205884)
[11:56:37] (CR) Volans: "Done, thanks for the review." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479555 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[12:07:36] (PS1) Banyek: mariadb: depool db1081 [mediawiki-config] - https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757)
[12:08:31] (PS1) Banyek: mariadb: depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757)
[12:09:32] (PS1) Banyek: mariadb: depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/479638 (https://phabricator.wikimedia.org/T85757)
[12:10:28] (PS1) Banyek: mariadb: depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/479642 (https://phabricator.wikimedia.org/T85757)
[12:11:04] (PS1) Banyek: mariadb: depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/479644 (https://phabricator.wikimedia.org/T85757)
[12:11:49] (PS1) Banyek: mariadb: depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757)
[12:12:31] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 337 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map
https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[12:17:28] Operations, Traffic: kartotherian TLS support - https://phabricator.wikimedia.org/T211970 (ema) p:Triage→Normal
[12:17:43] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 20 probes of 337 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[12:18:17] Operations, Traffic, Maps (Kartotherian): kartotherian TLS support - https://phabricator.wikimedia.org/T211970 (ema)
[12:28:13] (PS1) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[12:30:46] (PS2) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[12:35:38] (PS1) Ema: secret: dummy key for kartotherian [labs/private] - https://gerrit.wikimedia.org/r/479672 (https://phabricator.wikimedia.org/T211970)
[12:39:54] (PS1) Arturo Borrero Gonzalez: cloudvps: make spreadcheck monitor region-aware [puppet] - https://gerrit.wikimedia.org/r/479673 (https://phabricator.wikimedia.org/T211451)
[12:41:22] (CR) Ema: [V: +2 C: +2] secret: dummy key for kartotherian [labs/private] - https://gerrit.wikimedia.org/r/479672 (https://phabricator.wikimedia.org/T211970) (owner: Ema)
[12:42:48] (PS3) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[12:46:58] (CR) Arturo Borrero Gonzalez: [C: +2] cloudvps: make spreadcheck monitor region-aware [puppet] - https://gerrit.wikimedia.org/r/479673 (https://phabricator.wikimedia.org/T211451) (owner: Arturo Borrero Gonzalez)
[12:47:09] (CR) Arturo Borrero Gonzalez: [C: +2] "Compiler result: https://puppet-compiler.wmflabs.org/compiler1002/13952/" [puppet] -
https://gerrit.wikimedia.org/r/479673 (https://phabricator.wikimedia.org/T211451) (owner: Arturo Borrero Gonzalez)
[12:51:03] (PS1) Arturo Borrero Gonzalez: cloudvps: spreadcheck: rename to Toolforge [puppet] - https://gerrit.wikimedia.org/r/479674
[12:52:55] (CR) Arturo Borrero Gonzalez: [C: +2] cloudvps: spreadcheck: rename to Toolforge [puppet] - https://gerrit.wikimedia.org/r/479674 (owner: Arturo Borrero Gonzalez)
[12:52:58] (PS4) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[12:59:52] (PS3) Muehlenhoff: profile::hadoop::worker: Make kerberos configurable [puppet] - https://gerrit.wikimedia.org/r/479616
[13:10:38] Operations, User-ArielGlenn, User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (Aklapper)
[13:12:36] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13955/" [puppet] - https://gerrit.wikimedia.org/r/479616 (owner: Muehlenhoff)
[13:12:43] (PS4) Elukey: profile::hadoop::worker: Make kerberos configurable [puppet] - https://gerrit.wikimedia.org/r/479616 (owner: Muehlenhoff)
[13:13:32] (CR) Elukey: [V: +2 C: +2] profile::hadoop::worker: Make kerberos configurable [puppet] - https://gerrit.wikimedia.org/r/479616 (owner: Muehlenhoff)
[13:17:26] (PS3) Lucas Werkmeister (WMDE): Specify $wgWBRepoSettings['conceptBaseUri'] [mediawiki-config] - https://gerrit.wikimedia.org/r/477521
[13:17:28] (PS3) Lucas Werkmeister (WMDE): Fix Wikidata base URI in client config [mediawiki-config] - https://gerrit.wikimedia.org/r/477522 (https://phabricator.wikimedia.org/T198946)
[13:17:46] (PS1) Filippo Giunchedi: LabsServices: ship logs locally [mediawiki-config] - https://gerrit.wikimedia.org/r/479677 (https://phabricator.wikimedia.org/T205851)
[13:19:07] (CR) Filippo Giunchedi: [C: +2] LabsServices:
ship logs locally [mediawiki-config] - https://gerrit.wikimedia.org/r/479677 (https://phabricator.wikimedia.org/T205851) (owner: Filippo Giunchedi)
[13:20:12] (Merged) jenkins-bot: LabsServices: ship logs locally [mediawiki-config] - https://gerrit.wikimedia.org/r/479677 (https://phabricator.wikimedia.org/T205851) (owner: Filippo Giunchedi)
[13:23:28] (CR) Marostegui: mariadb: depool db1081 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479631 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[13:23:30] (CR) Filippo Giunchedi: [C: +1] Remove Diamond from Swift proxies [puppet] - https://gerrit.wikimedia.org/r/479600 (owner: Muehlenhoff)
[13:24:17] (CR) Marostegui: [C: -1] mariadb: depool db1121 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[13:24:35] Operations, serviceops, User-ArielGlenn, User-jijiki: create IRC channel for the Service Operations SRE subteam - https://phabricator.wikimedia.org/T211902 (jijiki)
[13:24:47] (CR) Marostegui: [C: +1] mariadb: depool db1103:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/479644 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[13:24:52] Operations, serviceops, User-jijiki: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (jijiki)
[13:25:19] Operations, Scap, serviceops, User-jijiki: Introduce state to Scap - https://phabricator.wikimedia.org/T209881 (jijiki)
[13:25:21] (CR) Marostegui: [C: +1] mariadb: depool db1097:3314 [mediawiki-config] - https://gerrit.wikimedia.org/r/479642 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[13:25:37] (CR) Marostegui: [C: +1] mariadb: depool db1091 [mediawiki-config] - https://gerrit.wikimedia.org/r/479638 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[13:26:12] (CR) Marostegui: [C: -1] mariadb:
depool db1084 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[13:27:04] (CR) Filippo Giunchedi: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/479424 (https://phabricator.wikimedia.org/T211810) (owner: GTirloni)
[13:27:23] Operations, Operations-Software-Development, serviceops, User-Joe, User-jijiki: Convert makevm to spicerack cookbook - https://phabricator.wikimedia.org/T203963 (jijiki)
[13:28:10] Operations, ChangeProp, serviceops, SCB, and 2 others: Memory consumption in Redis 3.2 vs Redis 2.8 - https://phabricator.wikimedia.org/T209890 (jijiki)
[13:28:54] Operations, Thumbor, serviceops, Performance-Team (Radar), User-jijiki: Assess Thumbor upgrade options - https://phabricator.wikimedia.org/T209886 (jijiki)
[13:30:14] Operations, Thumbor, serviceops, Patch-For-Review, and 3 others: Upgrade Thumbor servers to Stretch - https://phabricator.wikimedia.org/T170817 (jijiki)
[13:30:40] (CR) jenkins-bot: LabsServices: ship logs locally [mediawiki-config] - https://gerrit.wikimedia.org/r/479677 (https://phabricator.wikimedia.org/T205851) (owner: Filippo Giunchedi)
[13:42:25] !log Enable GTID on db1124:3318 - T211973
[13:42:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:42:29] T211973: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973
[13:44:13] (PS2) Elukey: Enable Kerberos for spark_assembly_jar_install [puppet/cdh] - https://gerrit.wikimedia.org/r/479606 (owner: Muehlenhoff)
[13:46:54] (PS1) Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957)
[13:47:37] (CR) jerkins-bot: [V: -1] Configure WikibaseQualityConstraints on Beta [mediawiki-config] -
https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: Lucas Werkmeister (WMDE))
[13:47:48] !log elastic@codfw copying index data from the main cluster to psi & omega (test disk usage & import speed)
[13:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:06] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13956/" [puppet/cdh] - https://gerrit.wikimedia.org/r/479606 (owner: Muehlenhoff)
[13:48:37] (PS2) Lucas Werkmeister (WMDE): Configure WikibaseQualityConstraints on Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957)
[13:50:24] (CR) Lucas Werkmeister (WMDE): "Unfortunately, this bloats the file quite significantly :/ but I don’t see a good way around that." [mediawiki-config] - https://gerrit.wikimedia.org/r/479681 (https://phabricator.wikimedia.org/T209957) (owner: Lucas Werkmeister (WMDE))
[13:51:13] (PS1) Elukey: Update cdh module to latest sha [puppet] - https://gerrit.wikimedia.org/r/479682
[13:52:53] Operations, vm-requests: eqiad: 1 VM %request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (hashar)
[13:53:09] Operations, vm-requests: eqiad: 1 VM %request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (hashar)
[13:53:23] Operations, ops-eqiad, HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (Cmjohnson) The idrac logs reporting a couple of things. The errors could just be DIMM but there is a CPU Machine Check error, that indicates that CPU2 may be bad now. A DIMM Swap...
[13:54:36] Operations, monitoring, Goal, cloud-services-team (Kanban): Toolforge: Port sge.py stats to Prometheus - https://phabricator.wikimedia.org/T211684 (aborrero)
[13:54:41] Operations, vm-requests: eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (hashar)
[13:55:02] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13957/" [puppet] - https://gerrit.wikimedia.org/r/479682 (owner: Elukey)
[13:55:11] Operations, vm-requests, Release-Engineering-Team (Kanban): eqiad: 1 VM request for doc.wikimedia.org - https://phabricator.wikimedia.org/T211974 (hashar)
[13:57:31] !log Enable notifications for db2068 (s7 lag check)- T211973
[13:57:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:35] T211973: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973
[13:57:44] (PS2) Banyek: mariadb: depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757)
[13:59:01] (PS2) Banyek: mariadb: depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757)
[13:59:02] !log Enable notifications for db1095 (s3 lag check)- T211973
[13:59:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:56] (CR) Marostegui: [C: -1] mariadb: depool db1121 (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[14:00:07] Operations, ops-eqiad: Broken memory on mw1239 - https://phabricator.wikimedia.org/T209139 (Cmjohnson) I am sure we have something that can be used from a decom server.
[14:00:42] (CR) Marostegui: [C: +1] mariadb: depool db1084 [mediawiki-config] - https://gerrit.wikimedia.org/r/479634 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[14:02:02] (CR) Banyek: ">" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[14:02:51] (PS3) Banyek: mariadb: depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757)
[14:03:36] (PS1) Elukey: profile::hadoop::mysql_password: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/479686
[14:03:36] !log Compare ruwiki.revision between db2039 (s6 master) and db1085 - T211973
[14:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:40] T211973: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973
[14:04:23] (CR) Marostegui: [C: +1] mariadb: depool db1121 [mediawiki-config] - https://gerrit.wikimedia.org/r/479646 (https://phabricator.wikimedia.org/T85757) (owner: Banyek)
[14:04:27] (PS2) Elukey: profile::hadoop::mysql_password: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/479686
[14:04:32] (CR) jerkins-bot: [V: -1] profile::hadoop::mysql_password: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/479686 (owner: Elukey)
[14:07:07] (PS1) Filippo Giunchedi: rsyslog: fix property name for udp localhost [puppet] - https://gerrit.wikimedia.org/r/479689
[14:08:12] (CR) Filippo Giunchedi: [C: +2] rsyslog: fix property name for udp localhost [puppet] - https://gerrit.wikimedia.org/r/479689 (owner: Filippo Giunchedi)
[14:08:19] (PS2) Filippo Giunchedi: rsyslog: fix property name for udp localhost [puppet] - https://gerrit.wikimedia.org/r/479689
[14:08:59] (CR) Giuseppe Lavagetto: [C: -1] "One general comment: I wonder if we couldn't use more the puppet
API to fetch certificate information. Have you considered the possibility" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[14:10:05] (PS1) Andrew Bogott: Horizon: move projects to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/479690 (https://phabricator.wikimedia.org/T204745)
[14:12:54] (PS2) Andrew Bogott: Horizon: move projects to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/479690 (https://phabricator.wikimedia.org/T204745)
[14:13:44] (CR) Andrew Bogott: [C: +2] Horizon: move projects to eqiad1 [puppet] - https://gerrit.wikimedia.org/r/479690 (https://phabricator.wikimedia.org/T204745) (owner: Andrew Bogott)
[14:28:39] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/13958/" [puppet] - https://gerrit.wikimedia.org/r/479686 (owner: Elukey)
[14:28:41] (CR) Elukey: [C: +2] profile::hadoop::mysql_password: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/479686 (owner: Elukey)
[14:28:48] (PS3) Elukey: profile::hadoop::mysql_password: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/479686
[14:28:59] (CR) Elukey: [V: +2 C: +2] profile::hadoop::mysql_password: add kerberos support [puppet] - https://gerrit.wikimedia.org/r/479686 (owner: Elukey)
[14:30:01] (PS1) Muehlenhoff: Enable Kerberos support for hdfs-balancer systemd timer [puppet] - https://gerrit.wikimedia.org/r/479696
[14:30:03] (PS1) Muehlenhoff: Remove unused cron and logrotate config, replaced by systemd timer [puppet] - https://gerrit.wikimedia.org/r/479697
[14:30:43] (CR) jerkins-bot: [V: -1] Enable Kerberos support for hdfs-balancer systemd timer [puppet] - https://gerrit.wikimedia.org/r/479696 (owner: Muehlenhoff)
[14:30:52] (CR) Volans: "> Patch Set 4: Code-Review-1" (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/477707
(https://phabricator.wikimedia.org/T205884) (owner: Volans)
[14:31:12] (CR) jerkins-bot: [V: -1] Remove unused cron and logrotate config, replaced by systemd timer [puppet] - https://gerrit.wikimedia.org/r/479697 (owner: Muehlenhoff)
[14:35:09] (CR) Giuseppe Lavagetto: [C: +1] "See minor nit, otherwise LGTM." (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[14:36:30] (CR) Giuseppe Lavagetto: [C: +1] icinga: fix typo in test docstring [software/spicerack] - https://gerrit.wikimedia.org/r/478931 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[14:38:04] !log Enable GTID on db2039 (s6 codfw master) - T211973
[14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:38:09] T211973: Check GTID, consistency options and notifications across the fleet - https://phabricator.wikimedia.org/T211973
[14:40:48] (PS5) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[14:42:51] (PS1) Elukey: profile::hadoop::master::standby: allow kerberos settings [puppet] - https://gerrit.wikimedia.org/r/479699
[14:44:35] (PS2) Muehlenhoff: Enable Kerberos support for hdfs-balancer systemd timer [puppet] - https://gerrit.wikimedia.org/r/479696
[14:45:44] (CR) Giuseppe Lavagetto: puppet: add additional methods to PuppetHosts (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/479431 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[14:45:52] (CR) Filippo Giunchedi: "See inline!"
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) (owner: Cwhite)
[14:46:05] (CR) Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/13961/" [puppet] - https://gerrit.wikimedia.org/r/479696 (owner: Muehlenhoff)
[14:46:36] (PS2) Muehlenhoff: Remove unused cron and logrotate config, replaced by systemd timer [puppet] - https://gerrit.wikimedia.org/r/479697
[14:48:01] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/13960/" [puppet] - https://gerrit.wikimedia.org/r/479699 (owner: Elukey)
[14:48:59] (CR) Giuseppe Lavagetto: [C: -1] puppet: add PuppetMaster class (3 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) (owner: Volans)
[14:54:35] (CR) Muehlenhoff: profile::hadoop::mysql_password: add kerberos support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479686 (owner: Elukey)
[14:55:40] Operations, monitoring, User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (CDanis) p:Triage→Normal
[14:55:48] (CR) Elukey: [V: +2 C: +2] profile::hadoop::mysql_password: add kerberos support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479686 (owner: Elukey)
[14:57:00] (CR) Vgutierrez: [C: +1] "besides an optional nitpick, LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970) (owner: Ema)
[15:02:08] (PS1) Ema: swift: actually check https, not http [puppet] - https://gerrit.wikimedia.org/r/479704
[15:05:15] (PS6) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[15:05:41] (CR) Filippo Giunchedi: [C: +1] swift: actually check https, not http [puppet] -
https://gerrit.wikimedia.org/r/479704 (owner: Ema)
[15:11:07] (CR) Ema: [C: +2] swift: actually check https, not http [puppet] - https://gerrit.wikimedia.org/r/479704 (owner: Ema)
[15:21:31] (PS7) Ema: role::maps::{master,slave}: add tlsproxy [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970)
[15:22:44] (CR) Ema: role::maps::{master,slave}: add tlsproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/479669 (https://phabricator.wikimedia.org/T211970) (owner: Ema)
[15:27:17] (CR) Mforns: [C: -1] "We should merge this together with the next refinery-source deployment." [puppet] - https://gerrit.wikimedia.org/r/478129 (https://phabricator.wikimedia.org/T202429) (owner: Mforns)
[15:40:47] (PS1) Elukey: profile::hadoop::mysql_password: add more kerberos configs [puppet] - https://gerrit.wikimedia.org/r/479710
[15:42:18] (PS2) Ema: ATS: do not log varnishcheck requests [puppet] - https://gerrit.wikimedia.org/r/479612 (https://phabricator.wikimedia.org/T204225)
[15:42:39] (PS2) Elukey: profile::hadoop::mysql_password: add more kerberos configs [puppet] - https://gerrit.wikimedia.org/r/479710
[15:43:00] (CR) jerkins-bot: [V: -1] ATS: do not log varnishcheck requests [puppet] - https://gerrit.wikimedia.org/r/479612 (https://phabricator.wikimedia.org/T204225) (owner: Ema)
[15:44:04] (PS3) Ema: ATS: do not log varnishcheck requests [puppet] - https://gerrit.wikimedia.org/r/479612 (https://phabricator.wikimedia.org/T204225)
[15:49:47] (CR) Muehlenhoff: [C: +1] profile::hadoop::mysql_password: add more kerberos configs [puppet] - https://gerrit.wikimedia.org/r/479710 (owner: Elukey)
[15:50:13] (CR) Elukey: [C: +2] profile::hadoop::mysql_password: add more kerberos configs [puppet] - https://gerrit.wikimedia.org/r/479710 (owner: Elukey)
[15:55:55] Operations, ORES, Security-Team, Scoring-platform-team (Current), User-Ladsgroup:
Fetching ORES API from en.wikipedia.org blocked in debug mode - https://phabricator.wikimedia.org/T211511 (Ladsgroup) a:Ladsgroup It's because of CORS blocking cross origin requests that have unknown headers...
[15:57:48] (PS1) Tulsi Bhagat: Update be.wikisource logo [mediawiki-config] - https://gerrit.wikimedia.org/r/479713 (https://phabricator.wikimedia.org/T211795)
[15:59:55] (PS1) Ladsgroup: ores: Allow cross origin requests if 'X-Wikimedia-Debug' header is sent [puppet] - https://gerrit.wikimedia.org/r/479715 (https://phabricator.wikimedia.org/T211511)
[16:01:40] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[16:01:47] sigh
[16:02:39] hmmm memory is increasing for a pod
[16:02:42] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy
[16:02:44] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received
[16:03:48] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[16:04:18] UnhandledPromiseRejectionWarning: Error: ESOCKETTIMEDOUT
[16:04:43] akosiaris: does that have anything to do with the presentation?
[16:05:01] yeah it's a demo on how to debug crappy software running on kubernetes
[16:05:06] :)
[16:05:24] preceded by SyntaxError: Unexpected token u in JSON at position 0
[16:05:39] yeah...
at least 2 errors [16:08:06] (03PS5) 10Volans: puppet: add PuppetMaster class [software/spicerack] - 10https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) [16:08:09] (03PS5) 10Volans: Add ipmi module [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) [16:08:11] (03PS4) 10Volans: icinga: fix typo in test docstring [software/spicerack] - 10https://gerrit.wikimedia.org/r/478931 (https://phabricator.wikimedia.org/T205884) [16:08:26] (03PS2) 10Volans: puppet: add additional methods to PuppetHosts [software/spicerack] - 10https://gerrit.wikimedia.org/r/479431 (https://phabricator.wikimedia.org/T205884) [16:08:28] it does seem like it was transient though [16:08:36] memory usage is back down to normal [16:08:40] (03CR) 10Volans: "Done, see inline." (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/477707 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:08:46] (03CR) 10Volans: "Done, see inline" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/478030 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:09:25] (03CR) 10Volans: "Replies/questions inline (no code change was made, just rebased resolving conflicts)" (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/479431 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [16:09:36] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:10:07] no pods were restarted due to OOM either akosiaris since we increased it to 4g [16:10:12] so that is looking better [16:10:32] there was a spike of 1.5G ~12:00 but yeah things look better [16:10:48] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:10:49] I am thinking we can descrease the number of pods a bit, but no today [16:14:42] 10Operations, 10monitoring, 10User-CDanis: Find links to grafana.wikimedia.org and change them to use the new URL format - https://phabricator.wikimedia.org/T211982 (10Volans) Regarding the puppet repo there are a lot of links related to the dashboards linked in Icinga checks. It would be nice if as part of... [16:23:13] (03PS4) 10Ema: ATS: do not log varnishcheck requests [puppet] - 10https://gerrit.wikimedia.org/r/479612 (https://phabricator.wikimedia.org/T204225) [16:24:07] 10Operations, 10ops-codfw, 10User-fgiunchedi: ms-be2047 spontaneous reboots - https://phabricator.wikimedia.org/T209921 (10Papaul) Dell will be shipping 1 New CPU by Monday. [16:24:09] (03CR) 10Ema: [C: 03+2] ATS: do not log varnishcheck requests [puppet] - 10https://gerrit.wikimedia.org/r/479612 (https://phabricator.wikimedia.org/T204225) (owner: 10Ema) [16:30:46] (03CR) 10Filippo Giunchedi: profile: enable statsd_exporter and add matching rules to logstash::collector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:38:33] (03CR) 10Filippo Giunchedi: "See inline, LGTM overall" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/479563 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [16:39:10] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:39:46] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [16:40:52] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - 
https://phabricator.wikimedia.org/T209101 (10RobH) [16:41:31] 10Operations, 10ops-eqiad, 10media-storage, 10User-fgiunchedi: rack/setup/install ms-be10[44-50].eqiad.wmnet - https://phabricator.wikimedia.org/T209618 (10fgiunchedi) [16:41:40] (03PS1) 10Ema: ATS: origin server certificate validation settings [puppet] - 10https://gerrit.wikimedia.org/r/479720 (https://phabricator.wikimedia.org/T207048) [16:42:00] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) @bblack: This work is now scheduled for Tuesday, December 18th, at 11:00-12:00 Pacific. So either you can depool the site that AM, or I'll do so when I wake up (befor... [16:43:17] (03PS1) 10Andrew Bogott: Horizon: remove reference to mwfileimport [puppet] - 10https://gerrit.wikimedia.org/r/479721 [16:44:00] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: remove reference to mwfileimport [puppet] - 10https://gerrit.wikimedia.org/r/479721 (owner: 10Andrew Bogott) [16:48:44] PROBLEM - Apache HTTP on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [16:48:58] PROBLEM - Nginx local proxy to apache on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.009 second response time [16:49:08] PROBLEM - HHVM rendering on mw1345 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time [16:49:43] (03PS2) 10Ema: ATS: origin server certificate validation settings [puppet] - 10https://gerrit.wikimedia.org/r/479720 (https://phabricator.wikimedia.org/T207048) [16:49:56] RECOVERY - Apache HTTP on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.045 second response time [16:50:10] RECOVERY - Nginx local proxy to apache on mw1345 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.036 second response time [16:50:22] RECOVERY - HHVM rendering on mw1345 is OK: HTTP OK: 
HTTP/1.1 200 OK - 79993 bytes in 0.370 second response time [16:51:07] (03CR) 10Ema: [C: 03+2] ATS: origin server certificate validation settings [puppet] - 10https://gerrit.wikimedia.org/r/479720 (https://phabricator.wikimedia.org/T207048) (owner: 10Ema) [16:56:44] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) Alright, going through puppetboard seems reasonable. How would the authentication infor... [17:03:14] (03PS1) 10Tulsi Bhagat: Enable 'pagemover' user group at ur.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) [17:06:06] (03CR) 10CRusnov: Change all reports to log only errors except for a summary count (035 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [17:08:56] (03CR) 10Tulsi Bhagat: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: 10Tulsi Bhagat) [17:18:18] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10ayounsi) >>! In T205899#4824154, @crusnov wrote: > We don't have the full hostname in netbox, bu... 
[17:18:42] (03PS1) 10RobH: bohrium dns update [dns] - 10https://gerrit.wikimedia.org/r/479726 [17:18:54] (03PS2) 10RobH: bohrium dns update [dns] - 10https://gerrit.wikimedia.org/r/479726 [17:19:49] (03CR) 10RobH: [C: 03+2] bohrium dns update [dns] - 10https://gerrit.wikimedia.org/r/479726 (owner: 10RobH) [17:22:54] !log mw1272 down for h/w troubleshooting [17:22:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:57] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10ayounsi) I have the codfw switch maintenance from 8am to 11am (where codfw will be depooled). And a dentist apt at 1pm. I think it's better to repool codfw before depooling... [17:23:14] (03PS1) 10RobH: bohrium install params [puppet] - 10https://gerrit.wikimedia.org/r/479729 [17:23:18] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:24:00] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:24:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:25:12] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:25:19] (03Abandoned) 10Muehlenhoff: Make Kerberos configurable in cdh::hadoop::worker [puppet/cdh] - 10https://gerrit.wikimedia.org/r/479617 (owner: 10Muehlenhoff) [17:26:33] (03CR) 10RobH: 
[C: 03+2] bohrium install params [puppet] - 10https://gerrit.wikimedia.org/r/479729 (owner: 10RobH) [17:26:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/479555 (https://phabricator.wikimedia.org/T205884) (owner: 10Volans) [17:32:43] 10Operations, 10ops-eqiad, 10HHVM: mw1272 crashed: Bad page map in process hhvm - https://phabricator.wikimedia.org/T211668 (10Cmjohnson) Today I swapped the DIMM from B1 to A1 and cleared the log. We have to wait and see [17:34:35] (03PS2) 10CRusnov: Change all reports to log only errors except for a summary count [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) [17:35:18] (03CR) 10CRusnov: Change all reports to log only errors except for a summary count (035 comments) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [17:35:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:35:32] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikidata, 10wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (10MarcoAurelio) @Dzahn With the patch merged above, I assume that we have now a deployment-mwmaint01 server where to... 
[17:35:37] (03CR) 10CRusnov: Change all reports to log only errors except for a summary count (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/479583 (https://phabricator.wikimedia.org/T205899) (owner: 10CRusnov) [17:36:36] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:37:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [17:41:06] 10Operations, 10ops-eqiad, 10netops: asw2-a-eqiad FPC7 faulty PEM0 - https://phabricator.wikimedia.org/T206972 (10Cmjohnson) 05Open→03Resolved I received the new PEM from juniper ...resolving this task [17:43:52] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:50:09] (03CR) 10محمد شعیب: [C: 03+1] "looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: 10Tulsi Bhagat) [17:54:18] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10faidon) Manufacturer, model and serial checks all sound good to me! Manufacturer may need some r... 
[17:58:43] (03PS1) 10ArielGlenn: enable lbzip2 for xml/sql dumps testing in labs [puppet] - 10https://gerrit.wikimedia.org/r/479734 [18:02:19] 10Operations, 10ops-eqiad: Decommission brokenasw-c2-eqiad - https://phabricator.wikimedia.org/T211998 (10ayounsi) [18:02:29] (03PS2) 10ArielGlenn: enable lbzip2 for xml/sql dumps testing in labs [puppet] - 10https://gerrit.wikimedia.org/r/479734 [18:05:35] (03CR) 10ArielGlenn: [C: 03+2] enable lbzip2 for xml/sql dumps testing in labs [puppet] - 10https://gerrit.wikimedia.org/r/479734 (owner: 10ArielGlenn) [18:09:45] 10Operations, 10SRE-Access-Requests: Requesting shell access for sasheto - https://phabricator.wikimedia.org/T212001 (10Sasheto) [18:10:13] (03PS1) 10Arturo Borrero Gonzalez: toolforge: declare sonofgridengime::submit_host from a central point [puppet] - 10https://gerrit.wikimedia.org/r/479736 [18:10:15] (03PS1) 10Arturo Borrero Gonzalez: toolforge: webservicemonitor is now in cron nodes [puppet] - 10https://gerrit.wikimedia.org/r/479737 (https://phabricator.wikimedia.org/T211977) [18:17:24] (03CR) 10GTirloni: [C: 03+1] toolforge: webservicemonitor is now in cron nodes [puppet] - 10https://gerrit.wikimedia.org/r/479737 (https://phabricator.wikimedia.org/T211977) (owner: 10Arturo Borrero Gonzalez) [18:26:11] (03CR) 10MarcoAurelio: [C: 04-1] "The name 'pagemover' does not exist in any repo, so it'll cause i18n issues. 
If you're attempting to replicate enwiki config and naming he" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479723 (https://phabricator.wikimedia.org/T211978) (owner: 10Tulsi Bhagat) [18:33:24] (03PS2) 10Cwhite: profile: enable statsd_exporter and add matching rules to ores::worker [puppet] - 10https://gerrit.wikimedia.org/r/479563 (https://phabricator.wikimedia.org/T205870) [18:38:45] (03PS8) 10Cwhite: ci: define statsd prometheus exporter mappings [puppet] - 10https://gerrit.wikimedia.org/r/479139 (https://phabricator.wikimedia.org/T205870) [18:40:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:42:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:42:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:43:20] 10Operations, 10SRE-Access-Requests: Requesting shell access for sasheto - https://phabricator.wikimedia.org/T212001 (10MarcoAurelio) 05Open→03declined I am boldly declining this task as what this user is requesting is not access to production hosts but to take over an abandoned tool, which is an entirely... 
[18:47:32] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:48:47] (03CR) 10MarcoAurelio: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/479713 (https://phabricator.wikimedia.org/T211795) (owner: 10Tulsi Bhagat) [19:01:27] (03PS1) 10Ladsgroup: Change WMF logo to white [software/tendril] - 10https://gerrit.wikimedia.org/r/479741 [19:05:30] 10Operations, 10ops-ulsfo: ulsfo: install new PDUs in racks / phase out APC loaner PDU use - https://phabricator.wikimedia.org/T209101 (10RobH) >>! In T209101#4824197, @ayounsi wrote: > I have the codfw switch maintenance from 8am to 11am (where codfw will be depooled). And a dentist apt at 1pm. > I think it's... [19:12:39] (03CR) 10BryanDavis: toolforge: declare sonofgridengime::submit_host from a central point (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479736 (owner: 10Arturo Borrero Gonzalez) [19:28:57] 10Operations: Add eprodromou@wikimedia.org to cpt-leads@wikimedia.org - https://phabricator.wikimedia.org/T212007 (10kchapman) [19:31:56] 10Operations, 10Beta-Cluster-Infrastructure, 10Wikidata, 10wikidata-tech-focus, and 3 others: Run mediawiki::maintenance scripts in Beta Cluster - https://phabricator.wikimedia.org/T125976 (10Dzahn) @MarcoAurelio The patch means more specifically just that a host `deployment-mwmaint01.deployment-prep.eqiad... [19:32:00] 10Operations, 10Release Pipeline: blubber template for nodejs should allow defining configuration files to copy to the container - https://phabricator.wikimedia.org/T211580 (10thcipriani) > AIUI, blubber has no way to allow project owners to specify files to copy to specific locations inside the container, but... 
[19:35:02] 10Operations, 10Research-Programs, 10SRE-Access-Requests, 10Patch-For-Review: access to analytics-privatedata-users for @toddleroux, @Afandian, & @RyanSteinberg - https://phabricator.wikimedia.org/T209298 (10Dzahn) 05stalled→03Resolved Thanks for confirming @Afandian ! Yea, i agree @jijiki , resolving... [19:39:21] (03PS1) 10Thcipriani: ci: Remove minikube from pipeline instances [puppet] - 10https://gerrit.wikimedia.org/r/479746 [19:48:14] PROBLEM - Host sodium is DOWN: PING CRITICAL - Packet loss = 100% [19:49:08] oh fuck [19:49:13] uhm, that's the mirrors [19:49:22] i fucked up [19:49:28] i reset it into installer, just stopped it [19:49:29] what happened? [19:49:35] ah [19:49:36] fucccccccckkkk [19:49:57] ok, rebooting now and hoping i didnt fuck it over [19:50:01] do you think it got the partitioning step? [19:50:05] ok [19:50:05] i dont know [19:50:18] presses thumbs [19:50:21] im so stupid. [19:50:31] i meant to connect to sulfur [19:50:35] and my brain just farted. [19:50:52] shit happens [19:51:01] i also had to lookup the name again because it used to be something different [19:51:22] i hate element names but this is still 100% my fault. one should be more careful when sending a reboot than i just was. [19:51:39] robh: mollyguard for all! [19:51:48] Reedy: i connected via mgmt [19:51:54] cuz the other system, sulfur, has no os [19:51:59] we actually do have that but not on DRACs [19:52:13] i found the one way to accidentially reboot left to me [19:52:14] and used it. [19:52:43] linux seems to be loading. [19:52:50] maybe i got lucky. [19:52:52] :) [19:53:08] RECOVERY - Host sodium is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [19:53:29] whew. [19:53:35] The last Puppet run was at Fri Dec 14 19:24:26 UTC 2018 (29 minutes ago). [19:53:40] no format, all intact. 
[19:54:19] i am deeply embarrassed =P [19:55:00] dont be, just be glad you caught it in time [19:55:17] restoring would have been messy i think [19:55:19] oh, if i hadn't i'd have upgraded that statement to 'positively mortified' [19:55:23] maybe we want a second one of those in other dc [19:55:31] then we also have data to copy back [19:55:59] when you echoed what it was my thoughts went from 'ok not instant site outage conditions but holy shit recovery is manual on a lot of that' [19:57:12] yea, would break it not so much for ourselves but for others using as as mirror [19:59:14] well... i was thinking about another cup of coffee, but now that 'fml what did i do' moment worked equally as well. [20:00:09] gets the heart pumpin' :) [20:00:54] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) >>! In T205899#4824310, @faidon wrote: > Manufacturer, model and serial checks all soun... [20:07:07] !log otto@deploy1001 Started deploy [analytics/refinery@ef1f7c6]: (no justification provided) [20:07:07] !log otto@deploy1001 deploy aborted: (no justification provided) (duration: 00m 00s) [20:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:23] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10faidon) >>! In T205899#4824679, @crusnov wrote: >>>! In T205899#4824310, @faidon wrote: >> I wou... 
[20:07:26] !log otto@deploy1001 Started deploy [analytics/refinery@ef1f7c6]: deploying refinery-source 0.0.82 with fix for T211833 [20:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:29] T211833: [BUG] User agent parsing error for MobileWikiAppSearch table - https://phabricator.wikimedia.org/T211833 [20:07:56] 10Operations, 10Analytics, 10Research, 10Patch-For-Review, 10User-Banyek: Import recommendations into production database - https://phabricator.wikimedia.org/T208622 (10bmansurov) 05Open→03Resolved a:03bmansurov @Ottomata helped import the data we need for now. I'll follow up on open questions in o... [20:08:47] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10Dzahn) [20:08:51] 10Operations: Add eprodromou@wikimedia.org to cpt-leads@wikimedia.org - https://phabricator.wikimedia.org/T212007 (10Dzahn) [20:09:33] (03PS1) 10Ottomata: Bump refinery-job to 0.0.82 for refine.pp jobs [puppet] - 10https://gerrit.wikimedia.org/r/479753 (https://phabricator.wikimedia.org/T211833) [20:13:30] !log otto@deploy1001 Finished deploy [analytics/refinery@ef1f7c6]: deploying refinery-source 0.0.82 with fix for T211833 (duration: 06m 04s) [20:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:34] T211833: [BUG] User agent parsing error for MobileWikiAppSearch table - https://phabricator.wikimedia.org/T211833 [20:13:50] (03CR) 10Ottomata: [C: 03+2] Bump refinery-job to 0.0.82 for refine.pp jobs [puppet] - 10https://gerrit.wikimedia.org/r/479753 (https://phabricator.wikimedia.org/T211833) (owner: 10Ottomata) [20:13:52] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb={GET,PATCH} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:13:52] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation={get,list} 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:14:28] PROBLEM - etcd request latencies on argon is CRITICAL: instance=10.64.32.133:6443 operation={compareAndSwap,get} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:14:46] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={LIST,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:15:16] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10RobH) [20:16:09] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10Dzahn) [20:16:11] 10Operations: Add eprodromou@wikimedia.org to cpt-leads@wikimedia.org - https://phabricator.wikimedia.org/T212007 (10Dzahn) 05Open→03Resolved a:03Dzahn done! ` +# CPT Leads (T210624, T212007) +cpt-leads: mobrovac@wikimedia.org, ccicalese@wikimedia.org, tstarling@wikimedia.org, dkinzler@wikimedia.org, kch... [20:16:44] 10Operations, 10Core Platform Team Backlog (Watching / External): Create email alias for CPT Leads - https://phabricator.wikimedia.org/T210624 (10Dzahn) added @EvanProdromou as requested on T212007 [20:16:56] RECOVERY - etcd request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:17:18] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:17:30] 10Operations, 10LDAP-Access-Requests: Add LDAP to aezell for read/write access of Grafana - https://phabricator.wikimedia.org/T211945 (10Dzahn) [20:17:55] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10ops-monitoring-bot) [20:18:18] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:18:54] robh: coincidence? 
new RAID issue on sodium detected it claims [20:19:08] 10Operations, 10netops: migrate netinsights from rhenium to sulfer - https://phabricator.wikimedia.org/T212011 (10RobH) p:05Triage→03Normal [20:19:15] mutante: oh, then i bet it got fubar. [20:19:35] or, rebooting caused a disk to detect as bad... [20:20:04] PROBLEM - DPKG on sulfur is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.87: Connection reset by peer [20:20:34] 10Operations, 10ops-eqiad: rack/setup/install sulfur.wikimedia.org - https://phabricator.wikimedia.org/T201364 (10RobH) 05Open→03Resolved @faidon, So this now has the netinsights role running. T212011 will track the migration of services. [20:20:42] yea, it's just 1 logical drive [20:20:46] service is up [20:21:01] oh, this is hw raid? [20:21:19] megacli https://phabricator.wikimedia.org/T212010 [20:21:20] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [20:21:34] ok, not my fault! [20:21:41] i mean, the reboots likely triggered the failing drive to finally fail. [20:21:45] so a little my fault... [20:21:57] heh, yea, they do that on reboots [20:22:02] still in warranty [20:22:07] ok, cool [20:22:58] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10RobH) a:03Cmjohnson So failed disk, but under warranty until June 17, 2019. We cannot really 'test' the failed disk, since the others have data and we cannot move them around. So this will just need a support... 
[20:23:22] 10Operations, 10ops-eqiad: Degraded RAID on sodium - https://phabricator.wikimedia.org/T212010 (10Dzahn) p:05Triage→03Normal service is up and disk still in warranty -> normal [20:23:40] PROBLEM - Disk space on sulfur is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.87: Connection reset by peer [20:23:53] hrmm [20:23:59] i just set that up and it ran puppet [20:24:01] why is it upset [20:24:15] netinsights, a new role [20:24:16] aha [20:24:29] its not new, rhemium runs it [20:24:36] has a long time [20:24:41] but new system running it to replace it yeah [20:24:49] oh, i see, replaced pmacct [20:25:07] yeah, so its not in service [20:25:08] yet [20:25:42] 10Operations, 10netops: migrate netinsights from rhenium to sulfer - https://phabricator.wikimedia.org/T212011 (10RobH) [20:29:41] !log andrew@deploy1001 Started deploy [horizon/deploy@1a830b9]: Rolling out fix for T177855 [20:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:45] T177855: Difficulty applying profile class parameters in Horizon interface - https://phabricator.wikimedia.org/T177855 [20:31:06] PROBLEM - MD RAID on sulfur is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.87: Connection reset by peer [20:32:57] !log andrew@deploy1001 Finished deploy [horizon/deploy@1a830b9]: Rolling out fix for T177855 (duration: 03m 17s) [20:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:52] PROBLEM - Check size of conntrack table on sulfur is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.87: Connection reset by peer [20:36:42] PROBLEM - Check systemd state on sulfur is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.87: Connection reset by peer [20:36:42] PROBLEM - configured eth on sulfur is CRITICAL: CHECK_NRPE: Error - Could not connect to 208.80.154.87: Connection reset by peer [20:38:36] RECOVERY - DPKG on sulfur is OK: All packages OK [20:38:49] !log sulfur 
systemctl restart nagios-nrpe-server [20:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:54] robh: fixed it [20:39:04] cool [20:39:08] RECOVERY - configured eth on sulfur is OK: OK - interfaces up [20:39:10] thank you [20:39:18] RECOVERY - Disk space on sulfur is OK: DISK OK [20:39:20] RECOVERY - MD RAID on sulfur is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [20:39:27] likely failed the order of operation required to start it on inital puppet run [20:39:59] yea, the order of getting the public IP and starting it or so [20:40:30] PROBLEM - Check whether ferm is active by checking the default input chain on sulfur is CRITICAL: NRPE: Command check_ferm_active not defined [20:40:54] RECOVERY - Check size of conntrack table on sulfur is OK: OK: nf_conntrack is 0 % full [20:41:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team: rack/setup/install cloudvirt10[25-30].eqiad.wmnet - https://phabricator.wikimedia.org/T209616 (10Andrew) @robh, I'm always happy for you to image these things, but if you wind up with too much to do @aborrero has offered to do the OS installs. [20:41:42] RECOVERY - Check whether ferm is active by checking the default input chain on sulfur is OK: OK ferm input default policy is set [20:45:08] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Andrew) @faidon, who is 'please also construct a draft email' directed to? 
[20:45:46] (03PS2) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) [20:46:06] (03CR) 10jerkins-bot: [V: 04-1] profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [20:46:54] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10faidon) >>! In T196507#4824817, @Andrew wrote: > @faidon, who is 'please also construct a draft email' directed to? Sorry, re-reading this I can see... [20:47:33] (03PS1) 10Paladox: gerrit: make ipv6 optional again [puppet] - 10https://gerrit.wikimedia.org/r/479763 [20:48:19] (03PS2) 10Paladox: gerrit: make ipv6 optional again [puppet] - 10https://gerrit.wikimedia.org/r/479763 [20:48:24] (03CR) 10Cwhite: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) (owner: 10Cwhite) [20:50:34] (03PS3) 10Cwhite: profile: enable statsd_exporter and add matching rules to logstash::collector [puppet] - 10https://gerrit.wikimedia.org/r/479353 (https://phabricator.wikimedia.org/T205870) [20:51:22] (03CR) 10Cwhite: [C: 03+1] Remove Diamond from Swift proxies [puppet] - 10https://gerrit.wikimedia.org/r/479600 (owner: 10Muehlenhoff) [20:53:17] (03PS1) 10Andrew Bogott: nfs: add another VM to the Maps nfs mount [puppet] - 10https://gerrit.wikimedia.org/r/479764 (https://phabricator.wikimedia.org/T204506) [20:54:14] (03CR) 10Andrew Bogott: [C: 03+2] nfs: add another VM to the Maps nfs mount [puppet] - 10https://gerrit.wikimedia.org/r/479764 (https://phabricator.wikimedia.org/T204506) (owner: 10Andrew Bogott) [21:00:15] (03PS3) 10Paladox: gerrit: make ipv6 optional again [puppet] - 10https://gerrit.wikimedia.org/r/479763 [21:00:50] 
10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) Ahh Yes, so the report would include the failures and they'd be fixed manually and then... [21:01:14] (03CR) 10Dzahn: [C: 03+2] gerrit: make ipv6 optional again [puppet] - 10https://gerrit.wikimedia.org/r/479763 (owner: 10Paladox) [21:05:13] (03PS1) 10Paladox: gerrit: Make ipv6 optional again part 2 [puppet] - 10https://gerrit.wikimedia.org/r/479768 [21:07:30] (03PS2) 10Paladox: gerrit: Make ipv6 optional again part 2 [puppet] - 10https://gerrit.wikimedia.org/r/479768 [21:20:58] (03PS1) 10Cwhite: hiera: add puppetboard and puppetdb to puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/479772 (https://phabricator.wikimedia.org/T210486) [21:21:25] (03PS2) 10Cwhite: hiera: add puppetboard and puppetdb to puppet cluster [puppet] - 10https://gerrit.wikimedia.org/r/479772 (https://phabricator.wikimedia.org/T210486) [21:27:26] (03PS1) 10Cwhite: hiera: add alerting_host cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479843 (https://phabricator.wikimedia.org/T210486) [21:31:47] (03PS1) 10Andrew Bogott: Horizon: move 'openstack' project to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/479844 (https://phabricator.wikimedia.org/T204745) [21:32:07] 10Operations, 10netops: migrate netinsights from rhenium to sulfer - https://phabricator.wikimedia.org/T212011 (10RobH) a:05RobH→03faidon So the setup task noted that @faidon is familiar with the services on this box, assigning him for input on best way to migrate. 
[21:32:21] (03PS1) 10Cwhite: hiera: add ci cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/479845 (https://phabricator.wikimedia.org/T210486) [21:32:42] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: move 'openstack' project to eqiad1-r [puppet] - 10https://gerrit.wikimedia.org/r/479844 (https://phabricator.wikimedia.org/T204745) (owner: 10Andrew Bogott) [21:34:20] 10Operations, 10ops-eqiad: rack/setup/install puppetmaster1003.eqiad.wmnet - https://phabricator.wikimedia.org/T201342 (10RobH) So, there is already a puppetmaster1001 and puppetmaster1002, do we need a third puppetmaster? Perhaps this was not needed, since rhodium is a third online puppetmaster? [21:36:12] (03PS1) 10Mforns: Update analytics eventlogging_to_druid_job.pp to mirror changes in scala job [puppet] - 10https://gerrit.wikimedia.org/r/479847 (https://phabricator.wikimedia.org/T210099) [21:36:43] (03CR) 10Mforns: [C: 04-1] "This should be merged together with the next refinery-source deployment." [puppet] - 10https://gerrit.wikimedia.org/r/479847 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns) [21:37:08] (03CR) 10jerkins-bot: [V: 04-1] Update analytics eventlogging_to_druid_job.pp to mirror changes in scala job [puppet] - 10https://gerrit.wikimedia.org/r/479847 (https://phabricator.wikimedia.org/T210099) (owner: 10Mforns) [21:37:22] 10Operations, 10fundraising-tech-ops, 10Patch-For-Review: rack/setup/install Prometeuse/Grafana host frmon2001 for fr-tech - https://phabricator.wikimedia.org/T196476 (10RobH) As this is no longer awaiting any onsite work from #ops-codfw, I've removed it from the projects. 
[21:39:14] (03CR) 10Dzahn: Use standard version of plain-text GPL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [21:42:59] (03PS3) 10Dzahn: gerrit: Make ipv6 optional again part 2 [puppet] - 10https://gerrit.wikimedia.org/r/479768 (owner: 10Paladox) [21:45:59] (03CR) 10Dzahn: [C: 03+2] gerrit: Make ipv6 optional again part 2 [puppet] - 10https://gerrit.wikimedia.org/r/479768 (owner: 10Paladox) [21:48:28] (03CR) 10Dzahn: Use standard version of plain-text GPL (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/460731 (owner: 10Legoktm) [21:50:55] (03PS1) 10Paladox: gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 [21:52:09] (03PS2) 10Paladox: gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 [21:54:27] (03CR) 10Paladox: gerrit: Fix puppet types in jetty.pp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/479850 (owner: 10Paladox) [21:56:22] (03PS3) 10Paladox: gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 [21:59:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10RobH) labnodepool1001 asw2-b-eqiad ge-3/0/18 [22:00:35] !log increase accepted-prefix-limit for HE to 200000 [22:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:18] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for labnodepool1001.eqiad.wmnet and performed the following actions: - Revoked Puppet c... 
[22:07:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10RobH) [22:08:34] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10RobH) [22:08:42] (03PS4) 10Paladox: gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 [22:08:47] (03CR) 10Paladox: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/479850 (owner: 10Paladox) [22:10:27] (03PS1) 10Andrew Bogott: Clean up remaining nodepool/labnodepool1001 refs [puppet] - 10https://gerrit.wikimedia.org/r/479855 (https://phabricator.wikimedia.org/T209642) [22:11:19] (03CR) 10Andrew Bogott: [C: 03+2] Clean up remaining nodepool/labnodepool1001 refs [puppet] - 10https://gerrit.wikimedia.org/r/479855 (https://phabricator.wikimedia.org/T209642) (owner: 10Andrew Bogott) [22:12:54] (03CR) 10Dzahn: [C: 03+2] gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 (owner: 10Paladox) [22:13:21] (03PS5) 10Paladox: gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 [22:14:17] (03PS6) 10Dzahn: gerrit: Fix puppet types in jetty.pp [puppet] - 10https://gerrit.wikimedia.org/r/479850 (owner: 10Paladox) [22:16:19] 10Operations, 10decommission: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10RobH) a:03RobH [22:19:40] (03PS1) 10RobH: decom labnodepool1001 prod dns [dns] - 10https://gerrit.wikimedia.org/r/479856 (https://phabricator.wikimedia.org/T209642) [22:20:32] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10Patch-For-Review: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10RobH) [22:20:36] (03CR) 10RobH: [C: 03+2] decom labnodepool1001 prod dns [dns] - 10https://gerrit.wikimedia.org/r/479856 (https://phabricator.wikimedia.org/T209642) (owner: 10RobH) [22:21:22] 10Operations, 10ops-eqiad, 10DC-Ops, 
10decommission, 10Patch-For-Review: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10RobH) [22:22:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission: Remove labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T209642 (10RobH) a:05RobH→03Cmjohnson This is ready for disk wipe and remainder of steps to decom the system. [22:24:28] (03PS1) 10RobH: decom einsteinium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/479857 (https://phabricator.wikimedia.org/T209738) [22:25:16] (03PS1) 10RobH: decom einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/479858 (https://phabricator.wikimedia.org/T209738) [22:27:18] 10Operations, 10decommission, 10Patch-For-Review: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10ops-monitoring-bot) wmf-decommission-host was executed by robh for einsteinium.wikimedia.org and performed the following actions: - Revoked Puppet certificate - Removed from PuppetDB - Downtimed... [22:29:07] 10Operations, 10decommission: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10RobH) [22:29:40] 10Operations, 10decommission: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10RobH) [22:30:57] 10Operations, 10ops-eqiad, 10decommission: decom einsteinium - https://phabricator.wikimedia.org/T209738 (10RobH) a:05RobH→03Cmjohnson ready for disk wipe and remainder of steps [22:34:44] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:35:18] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [22:38:51] andrewbogott ^^ (not sure if your change caused puppet failures on contint*) [22:39:18] hm, probably [22:39:20] I'll check [22:42:09] (03CR) 10RobH: [C: 03+2] decom einsteinium production dns entries [dns] - 10https://gerrit.wikimedia.org/r/479857 (https://phabricator.wikimedia.org/T209738) (owner: 10RobH) [22:42:21] (03CR) 10RobH: [C: 03+2] decom einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/479858 (https://phabricator.wikimedia.org/T209738) (owner: 10RobH) [22:43:29] (03PS1) 10Andrew Bogott: contint: remove references to nodepool [puppet] - 10https://gerrit.wikimedia.org/r/479859 [22:46:32] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) >>! In T205899#4824856, @crusnov wrote: > Ahh Yes, so the report would include the fail... [22:52:39] (03CR) 10Andrew Bogott: [C: 03+2] contint: remove references to nodepool [puppet] - 10https://gerrit.wikimedia.org/r/479859 (owner: 10Andrew Bogott) [22:54:42] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10crusnov) [22:55:15] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:55:38] paladox: fixed! thank you for noticing. 
[22:55:45] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [22:55:47] 10Operations, 10Operations-Software-Development, 10Patch-For-Review: Develop and deploy at least three Netbox reports to assist with data correctness and consistency - https://phabricator.wikimedia.org/T205899 (10ayounsi) To pile on that, LibreNMS can be queried instead of Puppetboard for network devices: eg... [22:55:47] andrewbogott thanks! and you're welcome :) [22:59:15] 10Operations, 10ops-codfw: Interface errors on cr1-codfw:xe-5/3/1 - https://phabricator.wikimedia.org/T211715 (10ayounsi) a:05ayounsi→03Papaul Let's try to replace the patch cable. Please sync up with me so I can drain traffic first. [23:08:41] 10Operations, 10netops, 10Patch-For-Review: IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 noisy alert - https://phabricator.wikimedia.org/T205829 (10ayounsi) 05Open→03Resolved This has been quiet since. No root cause identified though. [23:22:13] (03PS14) 10Rafidaslam: Initial configuration for napwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/478945 (https://phabricator.wikimedia.org/T210752) [23:47:53] (03PS1) 10Dzahn: admins: add Alex Ezell to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/479868 (https://phabricator.wikimedia.org/T211945) [23:49:49] (03PS2) 10Dzahn: admins: add Alex Ezell to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/479868 (https://phabricator.wikimedia.org/T211945) [23:56:44] (03PS3) 10Dzahn: admins: add Alex Ezell to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/479868 (https://phabricator.wikimedia.org/T211945) [23:57:24] (03CR) 10Dzahn: [C: 03+2] admins: add Alex Ezell to ldap_only admins (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/479868 (https://phabricator.wikimedia.org/T211945) (owner: 10Dzahn) [23:59:43] !log LDAP: added aezell to wmf group (T211945) for grafana access [23:59:47] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:59:48] T211945: Add LDAP to aezell for read/write access of Grafana - https://phabricator.wikimedia.org/T211945