[00:00:01] he hit it at https://puppet-compiler.wmflabs.org/compiler02/9445/tin.eqiad.wmnet/prod.tin.eqiad.wmnet.err [00:02:07] hmm. seems a general compiler issue. mine is a different node and role [00:27:25] (03PS1) 10Dzahn: wmcs/labs: move more firewall/standard includes into roles [puppet] - 10https://gerrit.wikimedia.org/r/399542 [00:31:03] (03PS1) 10Dzahn: mediawiki::jobrunner: move firewall includes to role [puppet] - 10https://gerrit.wikimedia.org/r/399543 [01:01:31] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3853095 (10RobH) I've gone ahead and programmed the replacement with all the documented connections and cables, and sent off a copy of [[https://docs.google.com/a/wikimedia.org/document/d/1eYttY2Rd4WWjl... [01:08:07] 10Operations, 10monitoring, 10Patch-For-Review, 10Technical-Debt: decom uranium - https://phabricator.wikimedia.org/T183209#3853106 (10Dzahn) [01:09:23] 10Operations, 10monitoring, 10Patch-For-Review, 10Technical-Debt: decom uranium - https://phabricator.wikimedia.org/T183209#3846949 (10Dzahn) [01:10:04] 10Operations, 10ops-eqiad, 10monitoring, 10procurement, and 2 others: decom uranium - https://phabricator.wikimedia.org/T183209#3846949 (10Dzahn) [01:10:11] 10Operations, 10ops-eqiad, 10monitoring, 10procurement, and 2 others: decom uranium - https://phabricator.wikimedia.org/T183209#3853109 (10Dzahn) a:05Dzahn>03None [01:10:49] 10Operations, 10ops-eqsin: rack/setup scs-eqsin.mgmt.eqsin.wmnet - https://phabricator.wikimedia.org/T181569#3853110 (10RobH) [01:15:50] (03PS1) 10Dzahn: deploy1001: switch to use stretch installer [puppet] - 10https://gerrit.wikimedia.org/r/399549 (https://phabricator.wikimedia.org/T175288) [01:19:21] (03CR) 10Dzahn: [C: 032] deploy1001: switch to use stretch installer [puppet] - 10https://gerrit.wikimedia.org/r/399549 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [01:19:44] (03PS2) 10Dzahn: deploy1001: switch to use stretch installer 
[puppet] - 10https://gerrit.wikimedia.org/r/399549 (https://phabricator.wikimedia.org/T175288) [01:21:13] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3853115 (10RobH) FedEx 417953907699 delivered today. Emailed support@ul to notify them I'll pick it up tomorrow and install in cp4032. [01:30:01] 10Operations, 10ops-codfw: mw2251 problems - https://phabricator.wikimedia.org/T181263#3853122 (10RobH) So it appears that there is a memory issue: > /admin1-> racadm getsel > Record: 1 > Date/Time: 11/16/2017 18:59:13 > Source: system > Severity: Ok > Description: Log cleared. > ------------... [01:30:43] 10Operations, 10ops-codfw: mw2251 failed memory dimm - https://phabricator.wikimedia.org/T181263#3784626 (10RobH) a:03Papaul [01:32:22] 10Operations, 10ops-eqiad, 10hardware-requests, 10monitoring, 10Technical-Debt: decom uranium - https://phabricator.wikimedia.org/T183209#3853128 (10Dzahn) [01:35:02] 10Operations, 10ops-eqdfw, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#3853129 (10Dzahn) a:05Dzahn>03Cmjohnson Hi Chris, i can't login on mgmt console on this one. Could you chec... [01:45:59] 10Operations, 10hardware-requests: Replacement hardware for cumin masters - https://phabricator.wikimedia.org/T178392#3690629 (10RobH) That is a lot of cores for very little memory. Can the core count be lowered? (If so we could go with a single CPU system to save some money on it.) Looking at its history... [01:52:30] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Icinga check for WDQS should do an actual query - https://phabricator.wikimedia.org/T181989#3808841 (10Smalyshev) Is this finished? 
[01:52:39] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Icinga check for WDQS should do an actual query - https://phabricator.wikimedia.org/T181989#3853163 (10Smalyshev) p:05Triage>03Normal [02:19:53] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:02] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:12] !log l10nupdate@tin scap sync-l10n completed (1.31.0-wmf.12) (duration: 05m 31s) [02:20:23] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:20:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:21:22] PROBLEM - HHVM rendering on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:33] PROBLEM - Nginx local proxy to apache on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:42] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:52] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.033 second response time [02:21:52] PROBLEM - Apache HTTP on mw1312 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:21:53] PROBLEM - Apache HTTP on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:12] PROBLEM - Apache HTTP on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:33] PROBLEM - Nginx local proxy to apache on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:33] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:22:33] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:03] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:23:12] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [02:23:43] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:24:42] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 76161 bytes in 6.526 second response time [02:26:02] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.022 second response time [02:28:55] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:29:14] PROBLEM - Apache HTTP on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:30:54] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.089 second response time [02:31:04] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 1.983 second response time [02:31:05] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.033 second response time [02:33:34] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 76161 bytes in 0.121 second response time [02:34:04] PROBLEM - Nginx local proxy to apache on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:36:44] PROBLEM - HHVM rendering on mw1278 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:41:34] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 76161 bytes in 0.114 second response time [02:42:06] RECOVERY - Nginx local proxy to apache on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.030 second response time [02:42:14] RECOVERY - Apache HTTP on mw1278 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.022 second response time [03:00:09] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [03:00:09] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: 
HTTP/1.1 200 OK - 76120 bytes in 0.103 second response time [03:03:19] PROBLEM - Nginx local proxy to apache on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:03:19] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:08:19] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 619 bytes in 7.908 second response time [03:08:20] RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 76121 bytes in 7.970 second response time [03:11:20] PROBLEM - HHVM rendering on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:11:20] PROBLEM - Nginx local proxy to apache on mw1315 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [03:55:11] !log mw1290 kill and restart hhvm | mw1230 stop and start hhvm [03:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:55:29] RECOVERY - Nginx local proxy to apache on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.089 second response time [03:55:49] RECOVERY - Apache HTTP on mw1230 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.172 second response time [03:55:50] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 76119 bytes in 3.133 second response time [03:56:50] RECOVERY - Apache HTTP on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.299 second response time [03:56:50] RECOVERY - Nginx local proxy to apache on mw1290 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.033 second response time [03:56:50] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 76119 bytes in 1.104 second response time [03:58:09] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [03:59:49] RECOVERY - HHVM rendering on mw1312 is OK: HTTP OK: HTTP/1.1 200 OK - 76117 bytes in 0.308 second response time [03:59:49] RECOVERY - Nginx local proxy to apache on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.038 second response time [03:59:50] RECOVERY - Apache HTTP on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.113 second response time [03:59:50] RECOVERY - Nginx local proxy to apache on mw1315 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.353 second response time [03:59:50] !log mw1315 kill and restart hhvm | mw1312 stop and start hhvm [03:59:59] RECOVERY - HHVM rendering on mw1315 is OK: HTTP OK: HTTP/1.1 200 OK - 76119 bytes in 2.133 second response time [04:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:30] RECOVERY - Apache HTTP on mw1312 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.443 second response time [04:02:39] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [04:03:09] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:25:02] (03PS1) 10Dzahn: builder/icinga: regex to avoid check_disk alerts for docker [puppet] - 10https://gerrit.wikimedia.org/r/399557 [04:28:02] (03CR) 10Dzahn: [C: 032] builder/icinga: regex to avoid check_disk alerts for docker [puppet] - 10https://gerrit.wikimedia.org/r/399557 (owner: 10Dzahn) [04:28:17] (03Draft2) 10Jayprakash12345: Enable commons import in tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399556 [04:29:38] (03CR) 10Dzahn: "we still had an Icinga alert. 
follow-up see https://gerrit.wikimedia.org/r/#/c/399557/" [puppet] - 10https://gerrit.wikimedia.org/r/399426 (owner: 10Alexandros Kosiaris) [04:29:52] (03PS3) 10Jayprakash12345: Enable commons import in tawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399556 (https://phabricator.wikimedia.org/T181774) [04:30:30] RECOVERY - Disk space on boron is OK: DISK OK [04:30:40] (03CR) 10Dzahn: "[boron:/etc/nagios/nrpe.d] $ /usr/lib/nagios/plugins/check_disk -w 10% -c 5% -W 6% -K 3% -l -e -A -i /var/lib/docker/* -i /run/docker/net" [puppet] - 10https://gerrit.wikimedia.org/r/399557 (owner: 10Dzahn) [04:36:27] (03PS1) 10Dzahn: icinga/docker: check_disk regex for ci::master,kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/399558 [04:38:37] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/399558/" [puppet] - 10https://gerrit.wikimedia.org/r/399426 (owner: 10Alexandros Kosiaris) [04:52:35] (03PS19) 10TerraCodes: Add wikidata and mediawiki.org to $wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392999 (https://phabricator.wikimedia.org/T117302) [05:21:24] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3853279 (10Tgr) [05:21:36] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (10Tgr) [05:31:04] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3853281 (10Tgr) [05:31:39] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3807578 (10Tgr) >>! In T181952#3848976, @Nuria wrote: > Updating ticket from conversation on e-mail. To grant access two things are needed: >... 
[06:28:47] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399564 (https://phabricator.wikimedia.org/T161294) [06:31:09] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399564 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:32:37] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399564 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [06:33:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1109 - T161294 (duration: 00m 51s) [06:33:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:55] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:35:37] !log Stop replication in sync db1100 and db1071 - T161294 [06:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:13] !log Remove some old files in dbstore1001:/srv/tmp to address the WARNING alert [06:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:36:42] PROBLEM - HHVM rendering on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:36:43] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:02] PROBLEM - Nginx local proxy to apache on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:37:42] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 4.860 second response time [06:38:32] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:40:22] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 76188 bytes in 0.111 second response time [06:40:52] PROBLEM - Apache HTTP on mw1208 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:42:33] !log Stop replication in sync on 
db1100 - db2052 - T161294 [06:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:42:45] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:45:42] RECOVERY - HHVM rendering on mw1208 is OK: HTTP OK: HTTP/1.1 200 OK - 76189 bytes in 0.417 second response time [06:45:42] RECOVERY - Apache HTTP on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.110 second response time [06:46:02] RECOVERY - Nginx local proxy to apache on mw1208 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.034 second response time [06:50:10] !log Stop replication in sync on dbstore1002 - db1100 - T161294 [06:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:21] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [06:55:33] <_joe_> !log restarting hhvm on mw1231 and mw1208 [06:55:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:51] (03PS1) 10Marostegui: db-eqiad.php: Depool db1109 and db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399566 (https://phabricator.wikimedia.org/T161294) [07:11:19] (03PS2) 10Marostegui: db-eqiad.php: Repool db1109 and db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399566 (https://phabricator.wikimedia.org/T161294) [07:13:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1109 and db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399566 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:15:16] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1109 and db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399566 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [07:16:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1109 and db1096:3315 - T161294 (duration: 00m 51s) [07:16:46] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:16:47] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [07:21:15] !log Upgrade MariaDB on db1100 [07:21:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:17] (03PS1) 10Marostegui: db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399570 [07:34:35] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399570 (owner: 10Marostegui) [07:35:59] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399570 (owner: 10Marostegui) [07:37:23] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Slowly repool db1100- T161294 (duration: 00m 52s) [07:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:34] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [08:22:12] (03CR) 10Muehlenhoff: WIP allow labmon1001 to contact pdns exporters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/399439 (owner: 10Filippo Giunchedi) [08:23:25] !log repool mw1277 after investigation [08:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:05] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3853401 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by elukey on neodymium.eqiad.wmnet for hosts: ``` ['mw1333.eqiad.wmnet'] ``` The log can be... 
[08:27:58] (03PS1) 10Marostegui: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399581 (https://phabricator.wikimedia.org/T161294) [08:29:41] (03PS1) 10Muehlenhoff: Use custom profile for PDNS exporter on labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399582 [08:32:46] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399581 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [08:34:10] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399581 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [08:34:50] !log Stop replication in sync on db1100 and dbstore1002 - T161294 [08:35:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:04] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [08:35:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1100 - T161294 (duration: 00m 51s) [08:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:15] (03Abandoned) 10Filippo Giunchedi: WIP allow labmon1001 to contact pdns exporters [puppet] - 10https://gerrit.wikimedia.org/r/399439 (owner: 10Filippo Giunchedi) [08:38:09] (03CR) 10Filippo Giunchedi: [C: 031] "The alternative approach in https://gerrit.wikimedia.org/r/#/c/399439/ would fail due to lack of mapped v6/AAAA records. 
This will do for " [puppet] - 10https://gerrit.wikimedia.org/r/399582 (owner: 10Muehlenhoff) [08:39:43] (03CR) 10jenkins-bot: Fill in $wgGNSMFallbackCategory based on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399433 (https://phabricator.wikimedia.org/T172875) (owner: 10Brian Wolff) [08:39:45] (03CR) 10jenkins-bot: Revert "Depool deployment-db04" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399449 (https://phabricator.wikimedia.org/T183252) (owner: 10Catrope) [08:39:47] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1109 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399564 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [08:39:49] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1109 and db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399566 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [08:39:51] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1100 with low weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399570 (owner: 10Marostegui) [08:39:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399581 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [08:51:07] !log run kafka preferred-replica-election after maintenance of kafka1023 (fully bootstrapped now) [08:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:09] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3853425 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['mw1333.eqiad.wmnet'] ``` Of which those **FAILED**: ``` ['mw1333.eqiad.wmnet'] ``` [09:00:02] elukey: need help for the failure? 
if so give me 5 minutes and I can have a look [09:00:38] hahahaha [09:01:10] volans: don't worry you are too reactive to these issues :) [09:01:42] true but it should *just work* [09:01:49] I needed to manually put rootdelay during the first boot but didn't get to it in time (ended up in initramfs) [09:02:34] there is a pending CR that among other stuff increases the reboot timeout to 1h [09:02:46] reviews are welcome ;) [09:02:54] will try to get to it today :) [09:03:26] the other stuff being to simplify the resume after a failed installation if you want to skip pxe [09:03:40] thanks! no hurry from my side [09:03:56] ah nice! [09:04:25] <3 [09:13:56] (03PS10) 10ArielGlenn: ability to do xmlpageslogging several pieces at a time in parallel [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) [09:14:19] (03CR) 10DCausse: [C: 031] Lower refresh interval for Wikidata to 5s [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399466 (https://phabricator.wikimedia.org/T183053) (owner: 10Smalyshev) [09:14:59] (03CR) 10ArielGlenn: [C: 04-2] "Do not merge until end of current dump run, so around Dec 31." [dumps] - 10https://gerrit.wikimedia.org/r/394857 (https://phabricator.wikimedia.org/T181935) (owner: 10ArielGlenn) [09:15:22] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3853439 (10Joe) To recap my experiments: - I built envoy following our container build guidelines, but I am blocked on building it in production as it fails to... 
[09:17:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399588 [09:19:56] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399588 (owner: 10Marostegui) [09:20:21] !log restart zookeeper on conf1001 for jvm updates - T179943 [09:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:33] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [09:21:28] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399588 (owner: 10Marostegui) [09:21:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1100" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399588 (owner: 10Marostegui) [09:22:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1100 - T161294 (duration: 00m 51s) [09:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:45] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [09:25:25] (03PS1) 10ArielGlenn: enable pagelogs to be dumped by several processes in parallel [puppet] - 10https://gerrit.wikimedia.org/r/399589 (https://phabricator.wikimedia.org/T181935) [09:27:24] (03CR) 10ArielGlenn: [C: 04-2] "Cannot be merged until end of current dump run, around Dec 31. https://gerrit.wikimedia.org/r/#/c/394857 should go at the same time." 
[puppet] - 10https://gerrit.wikimedia.org/r/399589 (https://phabricator.wikimedia.org/T181935) (owner: 10ArielGlenn) [09:30:39] !log restart zookeeper on conf100[2,3] for jvm updates - T179943 [09:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:49] T179943: Restart Analytics JVM daemons for open-jdk security updates - https://phabricator.wikimedia.org/T179943 [09:32:19] PROBLEM - Check the NTP synchronisation status of timesyncd on mw1333 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:32:19] PROBLEM - configured eth on mw1333 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:33:56] this is me --^ [09:34:08] PROBLEM - Check whether ferm is active by checking the default input chain on mw1333 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:34:08] PROBLEM - dhclient process on mw1333 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [09:34:21] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399590 (https://phabricator.wikimedia.org/T161294) [09:35:57] (03PS1) 10Hashar: Unit test for docker_pkg.image_fullname [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399592 [09:36:48] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399590 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [09:38:30] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399590 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [09:38:41] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399590 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [09:39:51] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for 
db1100 - T161294 (duration: 00m 51s) [09:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:04] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [09:42:16] (03CR) 10Alexandros Kosiaris: [C: 031] [WIP] Bird-lg [puppet] - 10https://gerrit.wikimedia.org/r/390330 (owner: 10Ayounsi) [09:44:53] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3853502 (10fgiunchedi) [09:44:55] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring: Port application-specific metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T145659#3853503 (10fgiunchedi) [09:44:58] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637#3853500 (10fgiunchedi) 05Open>03Resolved All done! The dashboards will likely need some tuning but metrics are there. [09:45:08] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3656370 (10fgiunchedi) [09:45:10] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Port nutcracker statistics to Prometheus - https://phabricator.wikimedia.org/T181995#3853504 (10fgiunchedi) 05Open>03Resolved All done! The dashboards will likely need some tuning but metrics are there. [09:46:31] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3853507 (10akosiaris) >>! In T165170#3852233, @Dzahn wrote: > So looks like we need a new role class for "regular ores server in production" (regular as opposed to... 
[09:47:00] (03PS1) 10Hashar: Support ignoring the namespace in image_tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399594 [09:47:08] (03CR) 10Jcrespo: [C: 04-2] "We will keep this up for some time, as most of production has been setup, but we are missing misc." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/338996 (https://phabricator.wikimedia.org/T158580) (owner: 10Jcrespo) [09:47:56] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3853508 (10fgiunchedi) [09:48:13] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3853509 (10akosiaris) 05stalled>03Open Unstalling. Note that we are in the year-end deployment freeze. The boxes should not be put in production until we are o... [09:50:07] !log restart eventbus on kafka2001 for openssl updates [09:50:09] RECOVERY - dhclient process on mw1333 is OK: PROCS OK: 0 processes with command name dhclient [09:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:18] RECOVERY - Check whether ferm is active by checking the default input chain on mw1333 is OK: OK ferm input default policy is set [09:50:22] (03PS1) 10Alexandros Kosiaris: Reimage ores200* as stretch [puppet] - 10https://gerrit.wikimedia.org/r/399595 (https://phabricator.wikimedia.org/T165170) [09:50:29] RECOVERY - configured eth on mw1333 is OK: OK - interfaces up [09:51:15] (03CR) 10Alexandros Kosiaris: [C: 032] Reimage ores200* as stretch [puppet] - 10https://gerrit.wikimedia.org/r/399595 (https://phabricator.wikimedia.org/T165170) (owner: 10Alexandros Kosiaris) [09:51:21] (03CR) 10Filippo Giunchedi: [C: 032] Use custom profile for PDNS exporter on labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399582 (owner: 10Muehlenhoff) [09:51:31] (03PS2) 10Filippo 
Giunchedi: Use custom profile for PDNS exporter on labservices1001 [puppet] - 10https://gerrit.wikimedia.org/r/399582 (owner: 10Muehlenhoff) [09:53:14] (03PS1) 10Alexandros Kosiaris: Reimage ores100* as stretch [puppet] - 10https://gerrit.wikimedia.org/r/399596 (https://phabricator.wikimedia.org/T171851) [09:53:41] akosiaris: ok to merge your change? [09:53:54] (03PS2) 10Alexandros Kosiaris: Reimage ores100* as stretch [puppet] - 10https://gerrit.wikimedia.org/r/399596 (https://phabricator.wikimedia.org/T171851) [09:53:56] godog: I got one more coming, I will merge all 3 [09:54:01] akosiaris: ok! [09:54:30] (03CR) 10Alexandros Kosiaris: [C: 032] Reimage ores100* as stretch [puppet] - 10https://gerrit.wikimedia.org/r/399596 (https://phabricator.wikimedia.org/T171851) (owner: 10Alexandros Kosiaris) [09:54:59] all 3 merged [09:56:32] (03PS2) 10Filippo Giunchedi: Add PowerDNS Recursor scraper config on labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/399422 (owner: 10Muehlenhoff) [09:57:50] (03PS32) 10TerraCodes: $wmf* -> $wmg* [mediawiki-config] - 10https://gerrit.wikimedia.org/r/392184 (https://phabricator.wikimedia.org/T45956) [09:58:25] (03CR) 10Filippo Giunchedi: [C: 032] Add PowerDNS Recursor scraper config on labmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/399422 (owner: 10Muehlenhoff) [09:58:27] (03PS1) 10Marostegui: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399597 (https://phabricator.wikimedia.org/T161294) [10:00:22] 10Operations, 10Discovery-Search (Current work), 10Goal, 10Patch-For-Review, and 2 others: Port elasticsearch metrics to Prometheus - https://phabricator.wikimedia.org/T181627#3853557 (10Gehel) Short summary of the status of this task (since quite a lot of discussion happened): [x] the prometheus elastics... 
[10:00:44] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399597 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:02:07] (03Merged) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399597 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:02:19] RECOVERY - Check the NTP synchronisation status of timesyncd on mw1333 is OK: OK: synced at Thu 2017-12-21 10:02:16 UTC. [10:02:19] (03CR) 10jenkins-bot: db-eqiad.php: Increase traffic for db1100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399597 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:03:20] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Increase traffic for db1100 and start restoring original weight for db1082 - T161294 (duration: 00m 48s) [10:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:30] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [10:14:28] (03PS1) 10Filippo Giunchedi: labs: split pdns-rec jobs [puppet] - 10https://gerrit.wikimedia.org/r/399598 [10:15:07] !log rolling restart of eventbus on kafka* for openssl security updates [10:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:19] (03CR) 10Filippo Giunchedi: [C: 032] labs: split pdns-rec jobs [puppet] - 10https://gerrit.wikimedia.org/r/399598 (owner: 10Filippo Giunchedi) [10:17:09] PROBLEM - puppet last run on labtestservices2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:19:17] 10Operations, 10DBA, 10Goal: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3853576 (10jcrespo) [10:21:22] 10Operations, 10hardware-requests, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3853589 (10jcrespo) [10:21:44] 10Operations, 10DBA, 10hardware-requests, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2087114 (10jcrespo) [10:22:08] RECOVERY - puppet last run on labtestservices2001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:22:14] (03CR) 10Alexandros Kosiaris: [C: 031] "This is not to be applied on boxes while in the deployment freeze, but otherwise +1. Just a note, the ores module will need a little more " [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [10:23:21] 10Operations, 10DBA, 10hardware-requests, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2087114 (10jcrespo) [10:23:39] 10Operations, 10DBA, 10hardware-requests, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3853595 (10jcrespo) [10:24:13] (03PS1) 10Marostegui: db-eqiad.php: Fully repool db1100,db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399599 (https://phabricator.wikimedia.org/T161294) [10:26:49] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Fully repool db1100,db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399599 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:29:25] (03Merged) 10jenkins-bot: db-eqiad.php: Fully repool db1100,db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399599 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:29:36] (03CR) 10jenkins-bot: db-eqiad.php: Fully repool db1100,db1082 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/399599 (https://phabricator.wikimedia.org/T161294) (owner: 10Marostegui) [10:30:48] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Restore db1082 and db1100 original weight - T161294 (duration: 00m 51s) [10:31:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:01] T161294: run pt-tablechecksum on s5/s8 - https://phabricator.wikimedia.org/T161294 [10:33:14] (03PS1) 10Jcrespo: mariadb: Repool db1067, remove references to db1055, 56, 39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) [10:34:31] (03PS2) 10Jcrespo: mariadb: Repool db1067, remove references to db1055, 56, 39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) [10:34:33] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Repool db1067, remove references to db1055, 56, 39 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:35:37] (03CR) 10Marostegui: [C: 04-1] "Do not remove db1039 yet, I am still checksumming data in that shard, and I prefer to leave it there just as a reminder." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:35:49] PROBLEM - mediawiki-installation DSH group on mw1333 is CRITICAL: Host mw1333 is not in mediawiki-installation dsh group [10:40:20] (03PS3) 10Jcrespo: mariadb: Repool db1067, remove references to db1055, 56 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) [10:41:34] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Port non-deprecated Diamond collectors to Prometheus - https://phabricator.wikimedia.org/T177196#3853627 (10fgiunchedi) [10:41:38] 10Operations, 10Goal, 10Patch-For-Review, 10User-fgiunchedi, 10cloud-services-team (Kanban): Create Prometheus exporter for PowerDNS - https://phabricator.wikimedia.org/T182970#3853625 (10fgiunchedi) 05Open>03Resolved This is done as well, Prometheus on labmon1001 now pulls metrics for pdns and pdns-... [10:41:44] (03CR) 10Marostegui: "The commit message says you are repooling db1067, but its config has not changed, is that intended?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:42:22] (03PS4) 10Jcrespo: mariadb: Remove references to db1055 & db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) [10:42:32] (03CR) 10Marostegui: [C: 031] mariadb: Remove references to db1055 & db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:45:00] (03PS5) 10Jcrespo: mariadb: Remove references to db1055 & db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) [10:45:17] 10Operations, 10Diamond, 10Upstream: Upstream our Diamond PowerDNSRecursorCollector - https://phabricator.wikimedia.org/T133643#3853649 (10MoritzMuehlenhoff) 05Open>03declined We've converted this collector to a Prometheus exporter and given that we're moving away from Diamond it doesn't make sense to up... [10:46:45] 10Operations, 10Documentation: Write documentation on how / when to use custom Diamond metrics collectors - https://phabricator.wikimedia.org/T132856#3853659 (10MoritzMuehlenhoff) 05Open>03declined We're moving away from Diamond towards Prometheus, so marking as declined. [10:47:52] (03CR) 10Jcrespo: [C: 032] mariadb: Remove references to db1055 & db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:48:28] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3853664 (10akosiaris) >>! In T177394#3853439, @Joe wrote: > To recap my experiments: > > - I built envoy following our container build guidelines, but I am blo... 
[10:48:41] 10Operations, 10monitoring: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454#3853665 (10fgiunchedi) [10:49:20] (03Merged) 10jenkins-bot: mariadb: Remove references to db1055 & db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:49:30] (03CR) 10jenkins-bot: mariadb: Remove references to db1055 & db1056 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399601 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [10:49:49] 10Operations, 10monitoring: Deprovision Diamond collectors no longer in use - https://phabricator.wikimedia.org/T183454#3853688 (10fgiunchedi) [10:50:05] 10Operations: Integrate stretch 9.3 point update - https://phabricator.wikimedia.org/T182655#3853689 (10MoritzMuehlenhoff) These are fully rolled out: zsh iproute [10:50:41] 10Operations: Integrate jessie 8.10 point release - https://phabricator.wikimedia.org/T182656#3853690 (10MoritzMuehlenhoff) These are fully rolled out: request-tracker4 libxv [10:52:40] !log jynus@tin Synchronized wmf-config/db-codfw.php: Remove db1055 & db1056 (duration: 00m 51s) [10:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:48] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Port redis statistics to Prometheus - https://phabricator.wikimedia.org/T148637#3853701 (10fgiunchedi) [10:53:50] 10Operations, 10Patch-For-Review, 10Prometheus-metrics-monitoring, 10User-fgiunchedi: Move deployment-prep redis instances to stretch - https://phabricator.wikimedia.org/T179371#3853700 (10fgiunchedi) [10:54:12] 10Operations, 10Mail, 10monitoring, 10Patch-For-Review: prometheus metrics and grafana dashboard for exim - https://phabricator.wikimedia.org/T179302#3853704 (10fgiunchedi) [10:54:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Remove db1055 & db1056 (duration: 00m 51s) [10:54:14] 10Operations, 
10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3853703 (10fgiunchedi) [10:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:31] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292437 (10fgiunchedi) With T179302 resolved I think this can be closed ? [11:08:28] 10Operations, 10Services (doing), 10User-Eevans, 10User-fgiunchedi: New upstream jvm-tools - https://phabricator.wikimedia.org/T178839#3853727 (10fgiunchedi) I've completed steps 1-3 from above as suggested and pushed the result, not great in terms of coordination required but workable I think for now. If... [11:21:13] 10Operations, 10monitoring: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027#3853738 (10fgiunchedi) [11:21:31] 10Operations, 10monitoring, 10User-fgiunchedi: Collect per-cgroup cpu/mem and other system level metrics - https://phabricator.wikimedia.org/T108027#1510650 (10fgiunchedi) [11:38:37] (03PS2) 10Giuseppe Lavagetto: puppet-compiler: fix facts update process [puppet] - 10https://gerrit.wikimedia.org/r/399210 [11:39:17] (03Abandoned) 10Hashar: Support ignoring the namespace in image_tag [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399594 (owner: 10Hashar) [11:39:20] (03CR) 10Giuseppe Lavagetto: [C: 032] puppet-compiler: fix facts update process [puppet] - 10https://gerrit.wikimedia.org/r/399210 (owner: 10Giuseppe Lavagetto) [11:39:24] (03PS1) 10Alexandros Kosiaris: Add prometheus hosts to k8s staging ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/399605 [11:40:22] (03PS1) 10Elukey: profile::hadoop::monitoring: add explicit jmx whitelist [puppet] - 10https://gerrit.wikimedia.org/r/399606 (https://phabricator.wikimedia.org/T177458) [11:40:28] (03PS2) 10Alexandros Kosiaris: Add 
prometheus hosts to k8s staging ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/399605 [11:43:29] (03CR) 10Alexandros Kosiaris: [C: 032] Add prometheus hosts to k8s staging ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/399605 (owner: 10Alexandros Kosiaris) [11:43:37] <_joe_> !log refreshing puppet facts on the puppet compilers [11:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:46] (03PS2) 10Elukey: profile::hadoop::monitoring: add explicit jmx whitelist [puppet] - 10https://gerrit.wikimedia.org/r/399606 (https://phabricator.wikimedia.org/T177458) [11:53:13] (03Abandoned) 10Elukey: profile::hadoop::monitoring: add explicit jmx whitelist [puppet] - 10https://gerrit.wikimedia.org/r/399606 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [11:55:17] (03PS1) 10Elukey: profile::hadoop::monitoring: add explicit jmx whitelist [puppet] - 10https://gerrit.wikimedia.org/r/399608 (https://phabricator.wikimedia.org/T177458) [11:56:23] (03PS2) 10Muehlenhoff: Remove Hiera host entry for palladium [puppet] - 10https://gerrit.wikimedia.org/r/398837 [11:58:59] (03CR) 10Elukey: [C: 032] profile::hadoop::monitoring: add explicit jmx whitelist [puppet] - 10https://gerrit.wikimedia.org/r/399608 (https://phabricator.wikimedia.org/T177458) (owner: 10Elukey) [11:59:02] (03CR) 10Muehlenhoff: [C: 032] Remove Hiera host entry for palladium [puppet] - 10https://gerrit.wikimedia.org/r/398837 (owner: 10Muehlenhoff) [11:59:18] (03PS3) 10Muehlenhoff: Remove Hiera host entry for palladium [puppet] - 10https://gerrit.wikimedia.org/r/398837 [12:07:28] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=mw1333.*.eqiad.wmnet [12:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:43] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3853820 (10elukey) Current status of the hosts: 
``` elukey@puppetmaster1001:~$ sudo -i confctl select 'name=mw133.*.eqiad.wmnet' get | sort {"mw1330.eqiad.wmne... [12:19:34] (03PS1) 10Alexandros Kosiaris: kubernetes: Allow parameterization of prometheus_url [puppet] - 10https://gerrit.wikimedia.org/r/399613 [12:20:02] (03CR) 10jerkins-bot: [V: 04-1] kubernetes: Allow parameterization of prometheus_url [puppet] - 10https://gerrit.wikimedia.org/r/399613 (owner: 10Alexandros Kosiaris) [12:21:04] (03CR) 10Alexandros Kosiaris: [C: 031] Puppetmaster web frontend: support specifying different certs for a hostname [puppet] - 10https://gerrit.wikimedia.org/r/399459 (https://phabricator.wikimedia.org/T183414) (owner: 10Andrew Bogott) [12:24:56] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] "jenkins-bot says" [puppet] - 10https://gerrit.wikimedia.org/r/399613 (owner: 10Alexandros Kosiaris) [12:29:39] PROBLEM - puppet last run on cp4025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:10] PROBLEM - puppet last run on mw1282 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:30:10] PROBLEM - puppet last run on mw1312 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:31:00] * volans bets on puppetdb restart [12:31:20] PROBLEM - puppet last run on logstash1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:31:51] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:31:52] PROBLEM - puppet last run on db1094 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:22] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:31] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:32:32] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:51] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:52] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:07] yep, confirmed Active: active (running) since Thu 2017-12-21 12:27:40 UTC; 5min ago [12:33:11] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:12] PROBLEM - puppet last run on conf1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:12] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:51] PROBLEM - puppet last run on baham is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:33:54] (03PS1) 10Elukey: role::druid::public::worker: review jvm settings [puppet] - 10https://gerrit.wikimedia.org/r/399617 [12:34:01] PROBLEM - puppet last run on bast4001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:34:01] PROBLEM - puppet last run on boron is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:34:21] and just to confirm it java invoked oom-killer [12:34:32] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:35:43] (03CR) 10Joal: [C: 031] "Thanks elukey :)" [puppet] - 10https://gerrit.wikimedia.org/r/399617 (owner: 10Elukey) [12:35:51] RECOVERY - mediawiki-installation DSH group on mw1333 is OK: OK [12:36:04] (03CR) 10Elukey: [C: 032] role::druid::public::worker: review jvm settings [puppet] - 10https://gerrit.wikimedia.org/r/399617 (owner: 10Elukey) [12:39:39] !log restart druid historical/broker to apply new jvm settings to druid public workers (druid100[456] - https://gerrit.wikimedia.org/r/399617) [12:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:36] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Failed to submit 'replace facts' command for contint2001.wikimedia.org to PuppetDB at nitrogen.eqiad.wmnet:443: [502 Bad Gateway] 502 Bad Gateway

502 Bad Gateway (nginx/1.11.13)
[12:45:40] which I guess is transient :) [12:46:02] hashar: see above (puppetdb oom-ed and got restarted) [12:46:26] good [12:46:48] also hmm Warning: Setting configtimeout is deprecated. I guess it is due to /etc/puppet/puppet.conf:configtimeout = 960 [12:47:02] PROBLEM - puppet last run on conf1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:47:04] yeah it's part of the migration to puppet4 clients [12:47:27] T182585 [12:47:27] T182585: Puppet: Setting configtimeout is deprecated - https://phabricator.wikimedia.org/T182585 [12:47:32] \o/ [12:48:07] (03CR) 10Hashar: [C: 031] puppet.conf: replace configtimeout [puppet] - 10https://gerrit.wikimedia.org/r/398484 (https://phabricator.wikimedia.org/T182585) (owner: 10Andrew Bogott) [12:58:15] RECOVERY - puppet last run on conf1003 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [12:58:15] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [12:58:45] RECOVERY - puppet last run on baham is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [12:58:55] RECOVERY - puppet last run on bast4001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [12:59:04] RECOVERY - puppet last run on boron is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:59:34] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:59:35] RECOVERY - puppet last run on cp4025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:14] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:00:14] RECOVERY - puppet last run on mw1312 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:01:24] RECOVERY - puppet last run on logstash1005 is OK: 
OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:01:54] RECOVERY - puppet last run on mw1238 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:01:54] RECOVERY - puppet last run on db1094 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:02:05] RECOVERY - puppet last run on conf1002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:02:24] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:02:34] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:02:34] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [13:02:45] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:02:54] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [13:04:21] (03PS1) 10Muehlenhoff: Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) [13:04:33] (03PS2) 10Muehlenhoff: Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) [13:05:17] (03CR) 10jerkins-bot: [V: 04-1] Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:07:46] (03PS3) 10Muehlenhoff: Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 
(https://phabricator.wikimedia.org/T135991) [13:08:16] (03CR) 10jerkins-bot: [V: 04-1] Add support for selective automatic restarts of stateless services after library upgrades (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/399618 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:15:35] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures [13:16:14] 10Operations, 10Traffic: Backport iproute2 4.x from debian testing -> our jessie - https://phabricator.wikimedia.org/T138591#3853953 (10MoritzMuehlenhoff) a:05MoritzMuehlenhoff>03None I think at this point a backport doesn't make much sense, the time is better spend on migration the respective systems to... [13:25:12] (03PS1) 10BBlack: Remove GS 2016 unified certs from active set (expired) [puppet] - 10https://gerrit.wikimedia.org/r/399620 [13:25:13] (03PS1) 10BBlack: Remove GS 2016 unified certs from files/ssl/ [puppet] - 10https://gerrit.wikimedia.org/r/399621 [13:26:28] (03CR) 10BBlack: [C: 032] Remove GS 2016 unified certs from active set (expired) [puppet] - 10https://gerrit.wikimedia.org/r/399620 (owner: 10BBlack) [13:30:27] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3853984 (10Joe) >>! In T177394#3853664, @akosiaris wrote: > > There are few questions. Things like > > * Should we go with the TLS certs right from the beginn... 
[13:41:07] (03PS1) 10Gehel: aqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399624 (https://phabricator.wikimedia.org/T182304) [13:41:11] (03CR) 10Thiemo Kreuz (WMDE): Fix linewrap issue on wikimedia error page (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/395552 (https://phabricator.wikimedia.org/T180656) (owner: 10Phantom42) [13:44:13] (03CR) 10Gehel: "Puppet compiler agrees, this is a NOOP: https://puppet-compiler.wmflabs.org/compiler02/9452/" [puppet] - 10https://gerrit.wikimedia.org/r/399624 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [13:50:44] (03CR) 10BBlack: [C: 032] Remove GS 2016 unified certs from files/ssl/ [puppet] - 10https://gerrit.wikimedia.org/r/399621 (owner: 10BBlack) [13:54:05] (03PS1) 10Gehel: kafkatee: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399629 (https://phabricator.wikimedia.org/T182304) [13:56:07] (03CR) 10Gehel: "Puppet compiler agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler02/9453/" [puppet] - 10https://gerrit.wikimedia.org/r/399629 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:20:58] (03CR) 10Elukey: [C: 031] "LGTM! https://puppet-compiler.wmflabs.org/compiler02/9454/aqs1004.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/399624 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:35:26] <_joe_> win 25 [14:41:52] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3854163 (10akosiaris) >>! In T177394#3853984, @Joe wrote: >>>! In T177394#3853664, @akosiaris wrote: >> >> There are few questions. Things like >> >> * Should... [14:42:00] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3854164 (10akosiaris) I think we can call this done ? 
[14:48:53] (03PS1) 10Giuseppe Lavagetto: Add envoy image with TLS termination. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/399640 [14:49:45] (03PS2) 10Gehel: aqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399624 (https://phabricator.wikimedia.org/T182304) [14:50:24] 10Operations, 10Kubernetes: Operations 2017-18 Q2 Program 6 umbrella task - https://phabricator.wikimedia.org/T178325#3854180 (10akosiaris) [14:50:26] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Design pod-level monitoring and service-level alerting - https://phabricator.wikimedia.org/T177396#3854177 (10akosiaris) 05Open>03Resolved a:03akosiaris After a discussion with Giuseppe on IRC, I think we can successfully resolve this. I 'll... [14:51:08] (03CR) 10Gehel: [C: 032] aqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399624 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:52:45] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3854182 (10Joe) Indeed, we have an implementation path, the dockerfiles for building the software, and some plans for the future. We can call this done. 
[14:52:53] 10Operations, 10Prod-Kubernetes, 10Kubernetes, 10User-Joe: Experiment with a TLS proxy/router for pods - https://phabricator.wikimedia.org/T177394#3854183 (10Joe) 05Open>03Resolved [14:52:55] 10Operations, 10Kubernetes: Operations 2017-18 Q2 Program 6 umbrella task - https://phabricator.wikimedia.org/T178325#3854184 (10Joe) [14:54:00] (03CR) 10Filippo Giunchedi: [C: 031] kafkatee: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399629 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:55:52] (03PS2) 10Gehel: kafkatee: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399629 (https://phabricator.wikimedia.org/T182304) [14:58:32] (03CR) 10Gehel: [C: 032] kafkatee: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399629 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [14:59:11] 10Operations, 10DBA: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10jcrespo) p:05Triage>03Normal [14:59:45] 10Operations, 10DBA: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854202 (10jcrespo) Very related to T183249 [15:00:18] 10Operations, 10DBA, 10Patch-For-Review: Reimage and upgrade to stretch all dbproxies - https://phabricator.wikimedia.org/T183249#3848018 (10jcrespo) Very related (blocked on failovers of): T183469 [15:06:15] 10Operations, 10DBA: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3854215 (10jcrespo) p:05Triage>03Normal [15:07:39] 10Operations, 10DBA: Setup newer machines and replace all old misc (m*) and x1 codfw machines - https://phabricator.wikimedia.org/T183470#3854229 (10jcrespo) Eqiad version: T183469 Very related: T170662 [15:12:47] (03PS1) 10Gehel: gerrit: use the canonical definition of 
logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399644 (https://phabricator.wikimedia.org/T182304) [15:16:39] (03CR) 10Gehel: "Puppet compiler agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler02/9455/" [puppet] - 10https://gerrit.wikimedia.org/r/399644 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:18:47] yooo joe just trying this out... [15:19:41] are you talking from inside the matrix ottomata[m] ? [15:19:49] sure am! [15:20:02] wait no I am! [15:20:09] the mind boggles [15:21:02] * ottomata[m] uploaded an image: Screen Shot 2017-11-01 at 10.50.45.png (1766KB) [15:21:04] COOL [15:21:07] buuurp [15:21:13] <_joe_> yeah it works [15:21:28] <_joe_> and I didn't remember my riot.im nickname, how embarassing [15:21:38] <_joe_> I tested it the other day [15:23:18] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 29 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:24:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 24 probes of 288 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:28:05] (03CR) 10Hashar: [C: 031] "That looks good to me. The puppet compiler is all happy so I guess we can deploy it." [puppet] - 10https://gerrit.wikimedia.org/r/399644 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:28:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 10 probes of 287 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [15:28:26] (03CR) 10Gehel: [C: 032] gerrit: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399644 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:28:32] mobrovac: did you do anything else to the restbase1010 tests? should I ack the Icinga warning with a link to the task? 
[15:29:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 10 probes of 288 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [15:30:01] (03CR) 10Paladox: "Gerrit only needs a restart if it changes the log4j.xml file." [puppet] - 10https://gerrit.wikimedia.org/r/399644 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:30:35] (03CR) 10Gehel: "This is a noop, no changes to the config file at all..." [puppet] - 10https://gerrit.wikimedia.org/r/399644 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:30:44] (03PS1) 10Jcrespo: mariadb: Allow reimage of db1055 and db1056 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/399646 (https://phabricator.wikimedia.org/T183469) [15:37:18] Hi ops-team - I'm going to deploy analytics-refinery (oozie jobs for analytics hadoop cluster) [15:37:47] ottomata: FYI ^^^ ;) [15:37:55] yyaa [15:39:04] !log joal@tin Started deploy [analytics/refinery@92f9318]: Deploying for new pageview_top_bycountry job [15:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] (03PS1) 10Gehel: wdqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399650 (https://phabricator.wikimedia.org/T182304) [15:41:19] volans: rb1010 failing? 
[15:41:32] (03CR) 10Gehel: "Puppet compiler agrees this is a NOOP: https://puppet-compiler.wmflabs.org/compiler02/9456/" [puppet] - 10https://gerrit.wikimedia.org/r/399650 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [15:41:36] mobrovac: no, just the check raid complaining about the poicy ;) [15:41:52] oh [15:41:59] *policy [15:42:09] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=restbase1010&service=HP+RAID [15:43:43] !log joal@tin Finished deploy [analytics/refinery@92f9318]: Deploying for new pageview_top_bycountry job (duration: 04m 40s) [15:43:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:08] PROBLEM - etc request latencies on neon is CRITICAL: CRITICAL - etcd_request_latencies is 68578 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:47:08] RECOVERY - etc request latencies on neon is OK: OK - etcd_request_latencies is 24639 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:01:10] (03PS1) 10Gehel: thumbor: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399652 (https://phabricator.wikimedia.org/T182304) [16:01:37] (03CR) 10jerkins-bot: [V: 04-1] thumbor: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399652 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [16:06:41] (03PS2) 10Jcrespo: mariadb: Allow reimage of db1055 and db1056 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/399646 (https://phabricator.wikimedia.org/T183469) [16:08:27] (03CR) 10Jcrespo: [C: 032] mariadb: Allow reimage of db1055 and db1056 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/399646 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [16:09:17] Hey operations shinken and icinga need a little reboot love just fyi [16:09:35] Zppix ? [16:09:43] They are down [16:09:48] in production ? 
[16:09:50] Yes [16:10:02] we got no shinken in production [16:10:09] how on earth can it be down ? [16:10:12] 10:08 AM ⇐︎ shinken-wm quit (~shinken-w@internal-server-nat.wmflabs.org): Ping timeout: 240 seconds [16:10:17] ah labs [16:10:19] so not production [16:10:41] Oh :P i saw internal-server and thats it [16:11:15] but icinga-wm quit too... let me restart it [16:11:27] akosiaris: UnicodeDecodeError: 'utf8' codec can't decode byte 0x93 in position 100: invalid start byte [16:11:38] you already saw this some time ago, right? [16:12:05] (03PS2) 10Gehel: thumbor: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399652 (https://phabricator.wikimedia.org/T182304) [16:13:16] volans: I don't remember that one [16:13:57] it's after a reconnection I can open a task [16:14:06] sure [16:14:35] (03PS1) 10Jcrespo: mariadb: Move db1055 and db1056 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/399653 (https://phabricator.wikimedia.org/T183469) [16:15:03] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Move db1055 and db1056 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/399653 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [16:16:01] (03CR) 10Gehel: "Compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler02/9457/" [puppet] - 10https://gerrit.wikimedia.org/r/399652 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [16:17:35] (03PS2) 10Jcrespo: mariadb: Move db1055 and db1056 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/399653 (https://phabricator.wikimedia.org/T183469) [16:18:37] 10Operations, 10monitoring: ircecho: UnicodeDecodeError on reconnect - https://phabricator.wikimedia.org/T183475#3854377 (10Volans) [16:18:44] akosiaris: done ^^^ [16:20:18] heh, deep in the irc libray [16:20:23] yeah [16:20:23] (03PS1) 10Jcrespo: mariadb: Move db1055 and db1056 to x1 [software] - 10https://gerrit.wikimedia.org/r/399654 (https://phabricator.wikimedia.org/T183469) [16:20:56] try: except around 
bot.start() and .... restart ? [16:21:15] or we could just rely a bit more on systemd ;-) [16:21:44] how? the process was [16:21:44] Active: active (running) since Tue 2017-12-19 00:43:55 UTC; 2 days ago [16:21:52] according to systemd [16:21:53] active ? [16:22:01] (03PS3) 10Jcrespo: mariadb: Move db1055 and db1056 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/399653 (https://phabricator.wikimedia.org/T183469) [16:22:01] despite the exception ? [16:22:05] yep [16:22:12] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1055 and db1056 to x1 [software] - 10https://gerrit.wikimedia.org/r/399654 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [16:22:14] aaa... it does threading [16:22:34] yeah and one thread died by the other did not [16:22:56] must be python we're talking about! :) [16:23:02] lol [16:23:02] :D [16:23:41] don't you love how python threads have all the bad stuff but not much of the good stuff ? apart from being very very cheap to spawn ? [16:24:07] (03Abandoned) 10Jcrespo: Revert "labsdb: Switchover labsdb1009 to labsdb1010 for maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/398508 (owner: 10Jcrespo) [16:24:20] volans: ok I am sold. try: bot.start except: sys.exit(1) [16:24:24] that should fix it :P [16:24:43] and I am wondering whether it's a thread per IRC channel [16:24:49] and that's without reading the code [16:24:52] I don't want to [16:24:57] done it before, not today [16:25:11] I'm already with icinga's source today [16:25:21] cannot add another one ;) [16:25:47] I can send a letter of recommendation for you to a certain ethan [16:26:00] rotfl [16:29:16] yeah implicitly, the norms for most smaller-scale software is https://en.wikipedia.org/wiki/Crash-only_software . That is, you conform with the idea that if anything looks fishy, it's best to just hard-error and crash and then either some manager will restart it, or some human will look and fix the environmental or code problems, etc. 
[16:29:59] but then you take that mindset into a threaded python program where your attempt to crash-only, only crashes 1/N threads and not the process, and it all goes off the rails. Now you're undetected-partial-crash-only :P [16:30:06] (03PS4) 10Jcrespo: mariadb: Move db1055 and db1056 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/399653 (https://phabricator.wikimedia.org/T183469) [16:30:17] :D [16:30:24] indeed [16:32:00] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10User-Elukey: thorium - failed git clone of geowiki-data-private - https://phabricator.wikimedia.org/T171923#3854442 (10Milimetric) 05Open>03Resolved Fixed by finishing the migration of the repository / script to my home folder on stat1005 [16:32:31] (03CR) 10Jcrespo: [C: 032] mariadb: Move db1055 and db1056 to x1 [puppet] - 10https://gerrit.wikimedia.org/r/399653 (https://phabricator.wikimedia.org/T183469) (owner: 10Jcrespo) [16:32:45] (03CR) 10Alexandros Kosiaris: "Note the difference" [puppet] - 10https://gerrit.wikimedia.org/r/399557 (owner: 10Dzahn) [16:33:22] (03CR) 10Filippo Giunchedi: "LGTM, though since it isn't urgent let's merge after holidays" [puppet] - 10https://gerrit.wikimedia.org/r/399652 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [16:33:23] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854189 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on neodymium.eqiad.wmnet for hosts: ``` ['db1055.eqiad.wmnet', 'db1056.eqi... [16:36:11] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by akosiaris on neodymium.eqiad.wmnet for hosts: ``` ['ores2001.codfw.wmnet', 'ores2002.codfw.w... 
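A minimal sketch of the "crash the whole process, not just one thread" idea discussed above, so that a supervisor such as systemd can see the failure and restart the service. This is not the actual ircecho change (gerrit 399658); `crash_whole_process`, `flaky_worker` and the `exit_fn` parameter are hypothetical names used only for illustration. It assumes the bot's work runs in a non-main thread, where `sys.exit()` would only end that thread, so the default exit is `os._exit`.

```python
import os
import threading
import traceback


def crash_whole_process(target, exit_fn=os._exit):
    """Wrap a thread target so an uncaught exception ends the process.

    exit_fn defaults to os._exit, which terminates the process from any
    thread. It is injectable so the wrapper can be exercised in tests.
    """
    def wrapper(*args, **kwargs):
        try:
            target(*args, **kwargs)
        except Exception:
            # Log the traceback first: os._exit skips normal cleanup.
            traceback.print_exc()
            exit_fn(1)
    return wrapper


def flaky_worker():
    # Stand-in for the failure in the log: 0x93 is not a valid UTF-8 start byte.
    b"\x93".decode("utf8")
```

Starting the worker with `threading.Thread(target=crash_whole_process(flaky_worker)).start()` would end the process with exit status 1 instead of leaving it "active (running)" with a dead thread. On Python 3.8 and later, `threading.excepthook` is an alternative way to install the same behaviour for every thread at once.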
[16:38:50] (03PS1) 10Volans: ircecho: exit on failure [puppet] - 10https://gerrit.wikimedia.org/r/399658 (https://phabricator.wikimedia.org/T183475) [16:39:19] (03CR) 10Alexandros Kosiaris: "Or" [puppet] - 10https://gerrit.wikimedia.org/r/399557 (owner: 10Dzahn) [16:41:42] (03CR) 10Alexandros Kosiaris: [C: 032] "Yeah, check_disk is crazy. Merging. thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/399558 (owner: 10Dzahn) [16:41:48] (03PS2) 10Alexandros Kosiaris: icinga/docker: check_disk regex for ci::master,kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/399558 (owner: 10Dzahn) [16:41:50] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] icinga/docker: check_disk regex for ci::master,kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/399558 (owner: 10Dzahn) [16:50:12] 10Operations, 10DBA, 10hardware-requests, 10Goal: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#3854519 (10RobH) a:05RobH>03None [16:50:52] (03CR) 10Alexandros Kosiaris: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/399658 (https://phabricator.wikimedia.org/T183475) (owner: 10Volans) [16:50:58] (03CR) 10Alexandros Kosiaris: [C: 031] ircecho: exit on failure [puppet] - 10https://gerrit.wikimedia.org/r/399658 (https://phabricator.wikimedia.org/T183475) (owner: 10Volans) [16:52:16] (03PS2) 10Volans: ircecho: exit on failure [puppet] - 10https://gerrit.wikimedia.org/r/399658 (https://phabricator.wikimedia.org/T183475) [16:52:48] (03CR) 10Volans: [C: 032] ircecho: exit on failure [puppet] - 10https://gerrit.wikimedia.org/r/399658 (https://phabricator.wikimedia.org/T183475) (owner: 10Volans) [16:56:19] and ofc our puppettization doesn't restart it on change [16:56:55] !log restarted ircecho to pick up gerrit/399658 [16:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:15] (03CR) 10Awight: [C: 031] "Thanks for decoupling the worker and web roles!" 
[puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [16:59:13] 10Operations, 10monitoring: ircecho: UnicodeDecodeError on reconnect - https://phabricator.wikimedia.org/T183475#3854557 (10Volans) 05Open>03Resolved p:05Triage>03Normal a:03Volans Workaround to make it fail completely and let systemd restart it deployed. Resolving it for now. [17:01:19] !log debugging Icinga notes_url (no side effect expected but logging it in case there will be) T170353 [17:01:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:31] T170353: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353 [17:02:42] 10Operations, 10ORES, 10Patch-For-Review, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3854587 (10awight) Works perfectly now. Waiting until January to deploy to production. [17:02:53] 10Operations, 10ORES, 10Scoring-platform-team (Current): Investigate why ORES logs are being written to syslog despite explicit logging config. Fix. - https://phabricator.wikimedia.org/T182614#3854588 (10awight) [17:03:26] 10Operations, 10Ops-Access-Requests, 10AICaptcha, 10WMF-NDA-Requests: Requesting access to EventLogging data for Vinitha - https://phabricator.wikimedia.org/T181952#3854590 (10Nuria) Access approved, please be sure to expire at the end of March. 
[17:06:01] volans: i have to brag my instance of icinga (icinga2-wm) does restart on changes :P [17:06:24] PROBLEM - carbon-frontend-relay metric drops on graphite1001 is CRITICAL: TEST - IGNORE - volans https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen [17:09:25] Zppix: :D [17:09:31] 10Operations, 10DBA, 10Patch-For-Review: Setup newer machines and replace all old misc (m*) and x1 eqiad machines - https://phabricator.wikimedia.org/T183469#3854610 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db1056.eqiad.wmnet', 'db1055.eqiad.wmnet'] ``` and were **ALL** successful. [17:13:36] paravoid https://phabricator.wikimedia.org/T181647#3854611 [17:16:53] PROBLEM - Check whether ferm is active by checking the default input chain on notebook1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly [17:17:35] that's me sorry ^ [17:17:53] RECOVERY - Check whether ferm is active by checking the default input chain on notebook1001 is OK: OK ferm input default policy is set [17:18:57] was testing some spark port stuff with notebooks [17:22:53] 10Operations, 10ORES, 10Scoring-platform-team, 10Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3854619 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['ores2008.codfw.wmnet', 'ores2002.codfw.wmnet', 'ores2003.codfw.wmnet', 'ores2005.codfw.wmnet'... 
[17:24:16] !log starting manual backup of x1-master onto db1055 [17:24:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:02] RECOVERY - carbon-frontend-relay metric drops on graphite1001 is OK: OK: Less than 80.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/graphite-eqiad?orgId=1panelId=21fullscreen https://grafana.wikimedia.org/dashboard/db/graphite-codfw?orgId=1panelId=21fullscreen [18:00:17] 10Operations, 10ops-eqiad: Hardware errors on ganeti1005- ganeti1008 - https://phabricator.wikimedia.org/T181121#3854743 (10akosiaris) ganeti1006 has been added back to the group, and cluster has been rebalanced. There are a few extra VMs on ganeti1006 in order to see if this will trigger the issue. I guess w... [18:06:46] (03PS1) 10BryanDavis: aptly: Update published repo distributions [puppet] - 10https://gerrit.wikimedia.org/r/399667 (https://phabricator.wikimedia.org/T183235) [18:09:39] (03CR) 10Arturo Borrero Gonzalez: [C: 032] aptly: Update published repo distributions [puppet] - 10https://gerrit.wikimedia.org/r/399667 (https://phabricator.wikimedia.org/T183235) (owner: 10BryanDavis) [18:22:40] (03Abandoned) 10Alexandros Kosiaris: Increase ORES queue_maxsize by 20% [puppet] - 10https://gerrit.wikimedia.org/r/394047 (https://phabricator.wikimedia.org/T181538) (owner: 10Alexandros Kosiaris) [18:23:08] 10Operations, 10DBA, 10Patch-For-Review: Rack and setup db1111 and db1112 - https://phabricator.wikimedia.org/T180788#3854799 (10Marostegui) This has been all set. Servers replicate between each other (db1111 being the master). They contain, as requested, Commonswiki and eowiki Credentials were sent to MCR... 
[18:25:37] (03PS5) 10Dzahn: ores: basic role for a worker-only node [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) [18:26:23] (03CR) 10Dzahn: [C: 032] "cool, thanks for the reviews, i'll just merge the role, we can use it after the stretch reinstalls :)" [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [18:29:41] (03CR) 10Dzahn: "..well the reinstalls look done, but after the code freeze then as Alex said on ticket, yep" [puppet] - 10https://gerrit.wikimedia.org/r/399452 (https://phabricator.wikimedia.org/T165170) (owner: 10Dzahn) [18:34:03] (03CR) 10Smalyshev: [C: 031] wdqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399650 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [18:34:26] (03PS2) 10Gehel: wdqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399650 (https://phabricator.wikimedia.org/T182304) [18:35:34] (03CR) 10Gehel: [C: 032] wdqs: use the canonical definition of logstash host [puppet] - 10https://gerrit.wikimedia.org/r/399650 (https://phabricator.wikimedia.org/T182304) (owner: 10Gehel) [18:55:14] (03PS1) 10Ottomata: Remove no longer needed kafka/simple/broker.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/399674 [18:56:03] (03CR) 10Ottomata: [V: 032 C: 032] Remove no longer needed kafka/simple/broker.yaml hiera [puppet] - 10https://gerrit.wikimedia.org/r/399674 (owner: 10Ottomata) [19:01:09] (03PS1) 10Ottomata: Set proper kafka log_dirs default [puppet] - 10https://gerrit.wikimedia.org/r/399677 [19:02:04] 10Operations, 10monitoring, 10Patch-For-Review: Icinga: timeseries checks should have the link to a graph with the data - https://phabricator.wikimedia.org/T170353#3855107 (10Volans) To summarize the current status, everything is deployed and works as expected, except one small detail: the ampersand are remo... 
[19:02:08] (03CR) 10Ottomata: [C: 032] Set proper kafka log_dirs default [puppet] - 10https://gerrit.wikimedia.org/r/399677 (owner: 10Ottomata) [19:05:13] paravoid: I'm leaving for holiday break, I commented https://phabricator.wikimedia.org/T181647 follow up in january? :-) thanks for your time! [19:25:29] !log cp4032 going offline for memory swap [19:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:11] PROBLEM - IPsec on cp2004 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:32:11] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:32:12] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:32:12] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:32:12] PROBLEM - IPsec on cp2016 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:32:12] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [19:32:12] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [19:32:13] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [19:32:21] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [19:32:22] PROBLEM - IPsec on cp1052 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:32:41] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:32:51] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [19:32:52] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:32:52] PROBLEM - IPsec on cp1068 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 
[19:32:52] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:32:52] PROBLEM - IPsec on cp2019 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:32:52] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:33:01] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp4032_v4, cp4032_v6 [19:33:01] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:33:02] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 112 not-conn: cp4032_v4, cp4032_v6 [19:33:02] PROBLEM - IPsec on cp1054 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:33:02] PROBLEM - IPsec on cp1053 is CRITICAL: Strongswan CRITICAL - ok: 42 not-conn: cp4032_v4, cp4032_v6 [19:33:32] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [19:34:45] (03CR) 10Dzahn: "lol, these examples are fun. i agree to all of that, call it absurd and move on :) thanks too" [puppet] - 10https://gerrit.wikimedia.org/r/399557 (owner: 10Dzahn) [19:38:41] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.09 ms [19:48:25] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3855238 (10RobH) I've put in the new memory dimm and run memory tests, which are still running (and will take awhile.) I'll check on the system remotely later today. Once its ready to go, I've pinged @bbla... 
[19:58:51] (03PS1) 10Addshore: BETA: unset $wgSentryDsn [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) [20:04:20] (03CR) 10Dzahn: "@Chris i'll leave it to you when to merge this, per linked decom ticket" [dns] - 10https://gerrit.wikimedia.org/r/399125 (https://phabricator.wikimedia.org/T183209) (owner: 10Dzahn) [20:06:14] (03PS1) 10BBlack: Add Digicert 2017 certs [puppet] - 10https://gerrit.wikimedia.org/r/399681 [20:06:16] (03PS1) 10BBlack: Add Digicert 2017 certs to deployed ocsp set [puppet] - 10https://gerrit.wikimedia.org/r/399682 [20:06:18] (03PS1) 10BBlack: switch esams to digicert-2017 unified cert [puppet] - 10https://gerrit.wikimedia.org/r/399683 [20:07:42] (03PS1) 10Dzahn: decom: remove uranium from site,DHCP,netboot [puppet] - 10https://gerrit.wikimedia.org/r/399684 (https://phabricator.wikimedia.org/T183209) [20:08:20] (03CR) 10BBlack: [C: 032] Add Digicert 2017 certs [puppet] - 10https://gerrit.wikimedia.org/r/399681 (owner: 10BBlack) [20:09:28] 10Operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#3855296 (10Dzahn) Talked to Moritz, iron can go away soonish and won't need a replacement. [20:13:02] 10Operations, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Can't docker pull from docker-registry.discovery.wmnet - https://phabricator.wikimedia.org/T183342#3855298 (10hashar) 05Open>03declined I think that is properly solved by switching all images`FROM` to t... [20:15:07] (03CR) 10BBlack: [C: 032] Add Digicert 2017 certs to deployed ocsp set [puppet] - 10https://gerrit.wikimedia.org/r/399682 (owner: 10BBlack) [20:15:41] 10Operations, 10Cloud-Services, 10VPS-project-Phabricator, 10Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3855306 (10Dzahn) 05stalled>03Resolved Yes, it can. exim4::ganglia has been deleted as part of T177225 so Puppet will definitely not re-add that anyw... 
[20:15:44] (03CR) 10Gergő Tisza: "The Sentry extension doesn't do anything useful when the DSN is not set so if you want to remove it it's probably better to just set wmgUs" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) (owner: 10Addshore) [20:16:42] (03CR) 10Addshore: "Well, I don't really want to remove it, but it seems it is breaking various bits of beta" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) (owner: 10Addshore) [20:18:26] (03PS2) 10Addshore: BETA: set wmgSentryDsn to null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) [20:20:08] (03CR) 10Gergő Tisza: "> Well, I don't really want to remove it, but it seems it is breaking" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) (owner: 10Addshore) [20:21:48] (03CR) 10Addshore: "See both Bug: T183494 & Bug: T180208" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) (owner: 10Addshore) [20:24:29] (03PS1) 10Dzahn: confluent:kafka:jmxtrans: remove Ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) [20:26:43] (03Abandoned) 10Dzahn: confluent:kafka:jmxtrans: remove Ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:34:25] (03Restored) 10Dzahn: confluent:kafka:jmxtrans: remove Ganglia support [puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:35:08] (03CR) 10Dzahn: "@ottomata Is it worth it? i see this would rely on the jmxtrans submodule.. 
does it make sense to also remove it there or should i just no" [puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:35:31] (03PS1) 10Dzahn: mediawiki: remove comment about Ganglia, replace with Statsd [puppet] - 10https://gerrit.wikimedia.org/r/399688 [20:36:05] 10Operations, 10Graphite, 10Nodepool, 10Zuul: Improve graphite failover - https://phabricator.wikimedia.org/T88997#3855335 (10hashar) [20:36:42] (03CR) 10Dzahn: [C: 032] "comment line only" [puppet] - 10https://gerrit.wikimedia.org/r/399688 (owner: 10Dzahn) [20:36:44] (03CR) 10Gergő Tisza: "I still don't get how you jumped to the conclusion that Sentry is causing those problems. It's an error logging service; the obvious expla" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) (owner: 10Addshore) [20:40:32] (03CR) 10Ottomata: [C: 031] "I'm pretty sure the jmxtrans submodule params for this are optional, so this should work." 
[puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:42:04] (03PS1) 10Ottomata: Parameterize kafka.ssl.cipher.suites [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/399689 (https://phabricator.wikimedia.org/T177225) [20:42:13] (03Abandoned) 10Addshore: BETA: set wmgSentryDsn to null [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399680 (https://phabricator.wikimedia.org/T106920) (owner: 10Addshore) [20:42:24] (03CR) 10jerkins-bot: [V: 04-1] Parameterize kafka.ssl.cipher.suites [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/399689 (https://phabricator.wikimedia.org/T177225) (owner: 10Ottomata) [20:43:14] (03PS2) 10Ottomata: Parameterize kafka.ssl.cipher.suites [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/399689 (https://phabricator.wikimedia.org/T177225) [20:45:48] (03PS1) 10Dzahn: apache: remove comments about Ganglia monitoring [puppet] - 10https://gerrit.wikimedia.org/r/399691 (https://phabricator.wikimedia.org/T177225) [20:47:41] (03CR) 10Dzahn: [C: 032] "comments only" [puppet] - 10https://gerrit.wikimedia.org/r/399691 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [20:50:16] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, 10User-Elukey: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#3855423 (10Ottomata) > we want to force clients to use TLSv2 Did you mean TLSv1.2? > Does ssl.cipher.suites imply a specific version of T... [21:01:46] (03CR) 10Dzahn: "thanks! it definitely says "Optional" for them in jmxtrans::metrics. 
But looks like not in jmxtrans::metrics::jvm" [puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:02:25] (03PS1) 10Gergő Tisza: Fix wgContentLanguage on deployment.wikimedia.beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399698 (https://phabricator.wikimedia.org/T183499) [21:07:35] (03PS1) 10Dzahn: drop optional Ganglia params from metrics::jvm [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/399699 (https://phabricator.wikimedia.org/T177225) [21:09:03] (03PS1) 10Ottomata: Set ssl.cipher.suites and ssl.enabled.protocols for Kafka jumbo and varnishkafka (canary) [puppet] - 10https://gerrit.wikimedia.org/r/399700 (https://phabricator.wikimedia.org/T167304) [21:09:25] (03CR) 10jerkins-bot: [V: 04-1] Set ssl.cipher.suites and ssl.enabled.protocols for Kafka jumbo and varnishkafka (canary) [puppet] - 10https://gerrit.wikimedia.org/r/399700 (https://phabricator.wikimedia.org/T167304) (owner: 10Ottomata) [21:09:38] (03PS2) 10Dzahn: drop optional Ganglia params from metrics::jvm [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/399699 (https://phabricator.wikimedia.org/T177225) [21:10:49] (03CR) 10Dzahn: "i think it will need a small change in submodule https://gerrit.wikimedia.org/r/#/c/399699/ while it can stay in jmxtrans::metrics" [puppet] - 10https://gerrit.wikimedia.org/r/399686 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:12:09] (03CR) 10Dzahn: "probably should be merged together with https://gerrit.wikimedia.org/r/#/c/399686/" [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/399699 (https://phabricator.wikimedia.org/T177225) (owner: 10Dzahn) [21:16:06] (03PS1) 10Dzahn: site/logging/kafkatee: move includes from site to role [puppet] - 10https://gerrit.wikimedia.org/r/399702 [21:23:08] (03PS1) 10Dzahn: labs::nfs: move standard includes from site to roles [puppet] - 10https://gerrit.wikimedia.org/r/399704 [21:38:31] 10Operations, 
10Continuous-Integration-Infrastructure (shipyard), 10Patch-For-Review, 10User-Joe: Unify production and CI docker image build process - https://phabricator.wikimedia.org/T177276#3855514 (10hashar) I am working on it with @joe acting as the mentor for docker-pkg. I am quite happy about the sy... [21:43:28] 10Operations: "Opsonly" bastion? - https://phabricator.wikimedia.org/T114992#3855516 (10RobH) Well, I sometimes use it for a root mgmt screen session when remotely testing hardware, but I suppose I can just as easily use the puppetmaster for that! [21:45:00] 10Operations, 10Release-Engineering-Team, 10LDAP: Create 'releng' LDAP group - https://phabricator.wikimedia.org/T183507#3855517 (10demon) p:05Triage>03Normal [21:45:05] 10Operations, 10Release-Engineering-Team, 10LDAP: Create 'releng' LDAP group - https://phabricator.wikimedia.org/T183507#3855528 (10demon) I think I have access to do all/most of this in LDAP, but would love your thoughts @MoritzMuehlenhoff so we can do this right :) [21:54:42] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 114 ESP OK [21:54:51] RECOVERY - IPsec on cp1052 is OK: Strongswan OK - 44 ESP OK [21:54:52] RECOVERY - IPsec on cp2004 is OK: Strongswan OK - 56 ESP OK [21:54:52] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 56 ESP OK [21:54:52] RECOVERY - IPsec on cp2016 is OK: Strongswan OK - 56 ESP OK [21:54:52] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 56 ESP OK [21:55:11] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - 44 ESP OK [21:55:12] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 114 ESP OK [21:55:21] RECOVERY - IPsec on cp1068 is OK: Strongswan OK - 44 ESP OK [21:55:21] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - 44 ESP OK [21:55:22] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - 44 ESP OK [21:55:31] RECOVERY - IPsec on cp1054 is OK: Strongswan OK - 44 ESP OK [21:55:31] RECOVERY - IPsec on cp1053 is OK: Strongswan OK - 44 ESP OK [21:55:31] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 
114 ESP OK [21:55:32] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - 44 ESP OK [21:55:41] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 56 ESP OK [21:55:41] RECOVERY - IPsec on cp2019 is OK: Strongswan OK - 56 ESP OK [21:55:41] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 56 ESP OK [21:55:41] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 56 ESP OK [21:55:42] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 114 ESP OK [21:55:42] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 114 ESP OK [21:55:42] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 114 ESP OK [21:57:14] 10Operations, 10ops-ulsfo, 10Traffic: cp4032 memory error - https://phabricator.wikimedia.org/T183176#3855577 (10RobH) a:05RobH>03BBlack Ok, this is ready to go back online. new memory tested fine. [22:01:36] 10Operations, 10Release-Engineering-Team, 10LDAP: Create 'releng' LDAP group - https://phabricator.wikimedia.org/T183507#3855517 (10greg) FTR, when it comes to it: I approve adding current RelEng team members (Antoine M, Chad H, Dan D, Greg G, Jean-Rene B, Mukunda M, Tyler C, and Zeljko F) to ldap/releng. [22:12:23] 10Operations, 10ORES, 10Graphite, 10Scoring-platform-team (Current), 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3855629 (10Halfak) `$ find . -type f -mtime +30 | wc -l` can be purged safely. I'll build a short list of metrics that should be... [22:19:33] (03PS1) 10ArielGlenn: permit use of 7zip compressed files for prefetch [dumps] - 10https://gerrit.wikimedia.org/r/399753 (https://phabricator.wikimedia.org/T179267) [22:27:20] 10Operations, 10ORES, 10Graphite, 10Scoring-platform-team (Current), 10User-fgiunchedi: Regularly purge old ores graphite metrics - https://phabricator.wikimedia.org/T169969#3855640 (10Halfak) OK looks like we want to keep all of the base metrics for the following names indefinitely: ``` ores.*.precache... 
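A rough Python counterpart to the `find . -type f -mtime +30` check quoted in T169969 above, for listing metric files that have not been modified in more than 30 days before deciding what to purge. The function name `files_older_than` and its parameters are hypothetical, and the age test is a plain timestamp comparison, so it only approximates `find`'s whole-day rounding. It lists candidates and deletes nothing.

```python
import os
import time


def files_older_than(root, days, now=None):
    """Yield regular files under root whose mtime is more than `days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
                yield path
```

Counting the candidates, as the `wc -l` in the quoted command does, would be `sum(1 for _ in files_older_than(".", 30))`.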
[22:32:08] (03CR) 10Thcipriani: [C: 032] "Looks like this about covers it :)" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399592 (owner: 10Hashar) [22:32:52] (03Merged) 10jenkins-bot: Unit test for docker_pkg.image_fullname [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/399592 (owner: 10Hashar) [23:38:44] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Use content hash based image / thumb URLs - https://phabricator.wikimedia.org/T149847#3855739 (10Krinkle) [23:44:18] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3855802 (10Krinkle) [23:52:32] 10Operations, 10TechCom-RfC, 10Services (attic), 10User-mobrovac: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825#3855815 (10Krinkle) [23:52:59] 10Operations, 10TechCom-RfC, 10Traffic, 10Wikipedia-Android-App-Backlog, and 2 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588#3855817 (10Krinkle)