[00:07:45] (CR) Dzahn: [C: +2] Add DNS entries for initiativeswiki [dns] - https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: Urbanecm)
[00:07:50] (PS4) Dzahn: Add DNS entries for initiativeswiki [dns] - https://gerrit.wikimedia.org/r/504503 (https://phabricator.wikimedia.org/T167375) (owner: Urbanecm)
[00:08:55] !log DNS - add initiatives.wikimedia.org (and initiaves.m) for campaign wiki requested at T167375
[00:08:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:00] T167375: Creation of a "Campaign" Wiki - initiatives.wikimedia.org - https://phabricator.wikimedia.org/T167375
[00:17:44] PROBLEM - Check systemd state on mw1297 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[00:19:33] oic wm1297 is WIP :)
[00:19:58] chaomodus: yes, and there wasn't even an alert, right
[00:20:06] ah, yes
[00:20:10] no just here
[00:20:58] PROBLEM - Nginx local proxy to apache on mw1297 is CRITICAL: connect to address 10.64.16.62 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers
[00:21:05] yes, downtimed
[00:21:09] but can't do that before it exists
[00:21:17] ofc
[00:21:24] and yes.. a single appserver would never page
[00:21:36] that just happens for some cloud*
[00:22:26] also the systemd state is fixed by "have you tried rebooting it"
[00:23:11] hehe in this case it was apache because it didn't have configs in conf-enabled yet
[00:28:52] !log mw1297 - scap pull
[00:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:45] !log mw1297 - rebooting for nutcracker issue
[00:29:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:30:38] PROBLEM - Host mw1297 is DOWN: PING CRITICAL - Packet loss = 100%
[00:31:32] RECOVERY - Host mw1297 is UP: PING WARNING - Packet loss = 58%, RTA = 70.38 ms
[00:31:36] RECOVERY - Check systemd state on mw1297 is OK: OK - running: The system is fully operational
[00:31:46] RECOVERY - Nginx local proxy to apache on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:32:39] ACKNOWLEDGEMENT - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: sgebastion class instances not spread out enough BryanDavis New instance added for testing, needs to be placed on a different cloudvirt manually. - The acknowledgement expires at: 2019-04-26 00:31:10. https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:35:55] Operations, ops-esams, DC-Ops, Traffic: Multiple systems in esams OE10 showing PSU failures - https://phabricator.wikimedia.org/T177228 (Dzahn) cp3033 is shown as having CRIT redundancy https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp3033&service=IPMI+Sensor+Status
[00:40:03] Operations, ops-esams, Traffic: cp3033 unreacheable since 2018-07-15 11:47:31 - https://phabricator.wikimedia.org/T199677 (Dzahn) The host also shows that power supplies are not redundant.. which had a comment linking to T177403 -> T177228. And support has expired (https://netbox.wikimedia.org/dcim...
[00:41:24] ACKNOWLEDGEMENT - Long running screen/tmux on notebook1003 is CRITICAL: CRIT: Long running SCREEN process. (user: fsalutari PID: 15618, 2604253s 1728000s). daniel_zahn already emailed user https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[00:44:18] Operations, Traffic: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (Dzahn)
[01:15:10] PROBLEM - puppet last run on dbproxy1008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:46:54] RECOVERY - puppet last run on dbproxy1008 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[02:49:30] PROBLEM - mediawiki-installation DSH group on mw1297 is CRITICAL: Host mw1297 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[03:45:11] !log repooled wdqs1003, it's good now
[03:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:10:10] Operations, Traffic, Patch-For-Review, Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (CDanis) @Cwek Thank you very much for the detailed report! I've rolled back the experimental change to our DNS...
[04:11:08] PROBLEM - puppet last run on aqs1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:37:34] RECOVERY - puppet last run on aqs1007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[05:01:48] (PS1) Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819)
[05:03:13] (PS4) Santhosh: Remove ExternalGuidanceEnableContextDetection [mediawiki-config] - https://gerrit.wikimedia.org/r/504261 (https://phabricator.wikimedia.org/T219819) (owner: KartikMistry)
[05:33:41] (PS1) Marostegui: db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046
[05:34:49] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046 (owner: Marostegui)
[05:35:49] (Merged) jenkins-bot: db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046 (owner: Marostegui)
[05:37:14] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2080 and db2083 (duration: 00m 54s)
[05:37:22] !log Upgrade db2080 and db2083
[05:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:40:52] (PS1) Marostegui: Revert "db-codfw.php: Depool db2080,db2083" [mediawiki-config] - https://gerrit.wikimedia.org/r/506047
[05:42:34] (CR) jenkins-bot: db-codfw.php: Depool db2080,db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506046 (owner: Marostegui)
[05:49:42] Operations, ops-codfw, DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui) @Papaul can we upgrade firmware and BIOS on db2080?, I was bitten by this today.
[05:49:58] (Abandoned) Marostegui: Revert "db-codfw.php: Depool db2080,db2083" [mediawiki-config] - https://gerrit.wikimedia.org/r/506047 (owner: Marostegui)
[05:52:04] (PS1) Marostegui: db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048
[05:53:21] (CR) Marostegui: [C: +2] db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048 (owner: Marostegui)
[05:54:19] (Merged) jenkins-bot: db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048 (owner: Marostegui)
[05:54:33] (CR) jenkins-bot: db-codfw.php: Depool db2086, repool db2083 [mediawiki-config] - https://gerrit.wikimedia.org/r/506048 (owner: Marostegui)
[05:55:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2083 and depool db2086 (duration: 00m 52s)
[05:55:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:55:41] !log Upgrade db2086
[05:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:07:17] (PS1) Marostegui: db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050
[06:08:22] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050 (owner: Marostegui)
[06:09:21] (Merged) jenkins-bot: db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050 (owner: Marostegui)
[06:10:34] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2086, depool db2079 (duration: 00m 53s)
[06:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:10:43] !log Upgrade db2079
[06:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:41] (CR) jenkins-bot: db-codfw.php: Repool db2086, depool db2079 [mediawiki-config] - https://gerrit.wikimedia.org/r/506050 (owner: Marostegui)
[06:15:54] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:18:07] !log Upgrade db2081
[06:18:08] (PS1) Marostegui: db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051
[06:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:30] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051 (owner: Marostegui)
[06:21:39] (Merged) jenkins-bot: db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051 (owner: Marostegui)
[06:22:50] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2079, depool db2082 (duration: 00m 55s)
[06:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:56] !log Upgrade db2082
[06:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:26:47] (CR) jenkins-bot: db-codfw.php: Repool db2079, depool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506051 (owner: Marostegui)
[06:29:38] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:31:32] Operations, Traffic: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (Vgutierrez) p: Triage→Low that's expected, as @ema mentioned yesterday in -traffic: ` so we've got cp4021 reimaged as Varnish/ATS and it seems to be looking kind-of OK it is howeve...
[06:33:25] (PS1) Marostegui: db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053
[06:34:16] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 6.982 second response time https://phabricator.wikimedia.org/T174916
[06:34:59] (CR) Marostegui: [C: +2] db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053 (owner: Marostegui)
[06:35:59] (Merged) jenkins-bot: db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053 (owner: Marostegui)
[06:37:03] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2082 (duration: 00m 52s)
[06:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:43] (CR) jenkins-bot: db-codfw.php: Repool db2082 [mediawiki-config] - https://gerrit.wikimedia.org/r/506053 (owner: Marostegui)
[06:38:16] PROBLEM - pdfrender on scb1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916
[06:38:46] !log restart pdfrender on scb1003
[06:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:50] RECOVERY - Check systemd state on netmon2001 is OK: OK - running: The system is fully operational
[06:39:24] RECOVERY - pdfrender on scb1003 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916
[06:40:46] gehel, onimisionipe - o/ I can see an alert for "ElasticSearch unassigned shard check - 9243
[06:41:07] in icinga, not sure if super important or not (it is just a warning for the moment)
[06:41:20] !log Optimize tables on pc1010
[06:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:52:40] elukey: not super urgent, I'll have a look
[06:52:54] super thanks
[06:52:58] elukey: thanks for the ping!
[06:53:13] * gehel is still on the way back from daycare
[06:54:18] ah sorry!
[06:54:23] :(
[06:54:28] didn't check the calendar
[06:55:03] (PS1) Elukey: profile::superset: add libmariadb3 for buster [puppet] - https://gerrit.wikimedia.org/r/506054
[06:59:00] (CR) Muehlenhoff: [C: +1] "Looks good, I upgraded an-tool1004 yesterday to the latest Buster, that's probably when it broke" [puppet] - https://gerrit.wikimedia.org/r/506054 (owner: Elukey)
[07:00:27] (CR) Muehlenhoff: [C: +1] admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) (owner: Dzahn)
[07:02:04] (CR) Muehlenhoff: "I'm not sure this is still valid when the ongoing work is completed to allow wikitech user registration to be opened up again. I'd recomme" [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[07:02:15] (CR) Elukey: [C: +2] profile::superset: add libmariadb3 for buster [puppet] - https://gerrit.wikimedia.org/r/506054 (owner: Elukey)
[07:04:40] Operations, Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez) We need to keep an eye on https://github.com/apache/trafficserver/issues/5084
[07:11:25] (CR) Muehlenhoff: "IIRC we already tested more fine-grained ownerships/permissions for the various HDFS service keytabs that were deployed in the Kerberos pi" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:13:44] (CR) Muehlenhoff: [C: +1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[07:18:58] (CR) Elukey: "> IIRC we already tested more fine-grained ownerships/permissions for" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:20:08] (CR) Elukey: "More info https://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/SecureContainer.html" [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:23:00] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 430.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:25:46] PROBLEM - Apache HTTP on mw1229 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[07:26:54] RECOVERY - Apache HTTP on mw1229 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[07:30:11] (CR) Alexandros Kosiaris: [V: +2 C: +2] "This is working fine in my tests/benchmarks, merging." [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[07:30:13] (PS4) Alexandros Kosiaris: First version of the kask chart [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401)
[07:31:01] (CR) Alexandros Kosiaris: [V: +2 C: +2] First version of the kask chart [deployment-charts] - https://gerrit.wikimedia.org/r/505263 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[07:33:40] (CR) Muehlenhoff: [C: +1] "Ah, indeed, we spoke about LinuxContainerExecutor before, that makes a lot of sense, then." [puppet] - https://gerrit.wikimedia.org/r/504839 (https://phabricator.wikimedia.org/T220971) (owner: Elukey)
[07:35:18] (PS2) Alexandros Kosiaris: Add a dedicated=kask label to kask nodes [puppet] - https://gerrit.wikimedia.org/r/505832 (https://phabricator.wikimedia.org/T220821)
[07:35:29] (CR) Alexandros Kosiaris: [V: +2 C: +2] Add a dedicated=kask label to kask nodes [puppet] - https://gerrit.wikimedia.org/r/505832 (https://phabricator.wikimedia.org/T220821) (owner: Alexandros Kosiaris)
[07:47:20] (PS1) Muehlenhoff: grub: Remove fallback code for augeas < 1.2 [puppet] - https://gerrit.wikimedia.org/r/506061
[07:50:58] RECOVERY - mediawiki-installation DSH group on mw1297 is OK: OK https://wikitech.wikimedia.org/wiki/Application_servers%23Apache_setup_checklist
[08:01:58] (PS2) Gehel: maps: align tilerator CPU usage across all nodes [puppet] - https://gerrit.wikimedia.org/r/505819
[08:02:43] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[08:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:47] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[08:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:31] !log gehel@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation
[08:04:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:35] !log gehel@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[08:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:29] (CR) Gehel: [C: +2] maps: align tilerator CPU usage across all nodes [puppet] - https://gerrit.wikimedia.org/r/505819 (owner: Gehel)
[08:05:59] (PS2) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670)
[08:09:53] (CR) Filippo Giunchedi: [C: +1] codfw decom: halve non-object weights and 2/3rds object weights [software/swift-ring] - https://gerrit.wikimedia.org/r/505888 (https://phabricator.wikimedia.org/T221068) (owner: CDanis)
[08:13:42] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 56.22 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[08:14:29] (CR) Mathew.onipe: maps: smooth the tilerator load by reducing cpu assigned to tilerator (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[08:16:55] (CR) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[08:17:33] !log bounce prometheus on bast5001 after migration and backfill
[08:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:54] Operations, Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (MoritzMuehlenhoff) >>! In T220383#5133808, @Vgutierrez wrote: > We need to keep an eye on https://github.com/apache/trafficserver/issues/5084 Buster has OpenSSL 1.1.1b, so this affects ATS as shipped in Buster? Shou...
[08:27:56] Operations, Traffic: Evaluate ATS TLS stack - https://phabricator.wikimedia.org/T220383 (Vgutierrez) so I guess it's affected but right now I'm working under the assumption that we will use stretch in the cp nodes, using our own ATS packaging. @ema can confirm that :)
[08:29:09] !log swift eqiad-prod: start decom for ms-be101[45] - T220590
[08:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:14] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590
[08:29:51] (CR) Alexandros Kosiaris: [V: +2 C: +2] Support affinity in all charts [deployment-charts] - https://gerrit.wikimedia.org/r/505185 (owner: Alexandros Kosiaris)
[08:30:13] (CR) Alexandros Kosiaris: [V: +2 C: +2] "Tested this locally, works fine, merging" [deployment-charts] - https://gerrit.wikimedia.org/r/505185 (owner: Alexandros Kosiaris)
[08:35:25] (CR) Filippo Giunchedi: [C: +1] kafka shipper: add ulogd to kafka forwarding rules [puppet] - https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[08:39:32] (PS3) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670)
[08:40:48] (CR) Mathew.onipe: [C: +1] maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[08:40:48] (CR) Filippo Giunchedi: "From reading the related task it looks like we're going to ship the logs as-is and then grok on the logstash side?" [puppet] - https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[08:43:46] (CR) Filippo Giunchedi: [C: +1] "LGTM, for rollout I think we should disable puppet fleetwide and reenable gradually because this change will mean a whole lot more kafka c" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: Jbond)
[08:44:00] (PS7) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:44:19] !log rolling restart of Cassandra on restbase/codfw to pick up Java security update
[08:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:45] (PS3) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[08:44:47] (PS1) Alexandros Kosiaris: Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075
[08:44:56] (CR) jerkins-bot: [V: -1] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:46:22] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Gilles) itshappening
[08:47:11] (PS8) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:49:31] (PS9) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:51:48] (CR) Alexandros Kosiaris: [V: +2 C: +2] Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075 (owner: Alexandros Kosiaris)
[08:51:54] (PS2) Alexandros Kosiaris: Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075
[08:51:56] (CR) Alexandros Kosiaris: [V: +2 C: +2] Update repo index [deployment-charts] - https://gerrit.wikimedia.org/r/506075 (owner: Alexandros Kosiaris)
[08:51:58] (PS2) Alexandros Kosiaris: mariadb::ferm: Switch ferm::rule => ferm::service [puppet] - https://gerrit.wikimedia.org/r/505831
[08:52:00] (PS1) Alexandros Kosiaris: kafka_cluster_name: ifguard ::labsproject lookup [puppet] - https://gerrit.wikimedia.org/r/506084
[08:52:09] (PS10) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:53:00] (CR) jerkins-bot: [V: -1] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:54:02] (CR) Alexandros Kosiaris: [C: +1] "PCC happy at https://puppet-compiler.wmflabs.org/compiler1002/15969/" [puppet] - https://gerrit.wikimedia.org/r/505831 (owner: Alexandros Kosiaris)
[08:56:34] (PS11) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[08:57:07] Operations, Traffic: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (ema) Indeed our Varnish mailbox lag Icinga check only applies to Varnish backends, given that backends are those affected by T145661 and similar issues. During the Puppet refactoring splitting front...
[08:58:05] Operations, Release Pipeline, Release-Engineering-Team, Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (akosiaris) @Tarrow, @WMDE-leszek. I 've been working on the termbox helm chart and while the service seems to be up and running...
[08:58:33] (PS1) Ema: cache: move check_varnish_expiry_mailbox_lag to backend profile [puppet] - https://gerrit.wikimedia.org/r/506090 (https://phabricator.wikimedia.org/T145661)
[08:58:38] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/15975/" [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:01:19] (CR) Alexandros Kosiaris: [C: +2] "Fixes the" [puppet] - https://gerrit.wikimedia.org/r/506084 (owner: Alexandros Kosiaris)
[09:01:20] (PS2) Alexandros Kosiaris: kafka_cluster_name: ifguard ::labsproject lookup [puppet] - https://gerrit.wikimedia.org/r/506084
[09:02:34] (PS12) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[09:05:52] (CR) Ema: [C: +1] "Nice! Two nits." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:11:53] (PS4) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[09:11:55] (PS1) Alexandros Kosiaris: cxserver: Fix typo in GC metric name [deployment-charts] - https://gerrit.wikimedia.org/r/506098
[09:13:21] (PS2) Alexandros Kosiaris: cxserver: Fix typo in GC metric name [deployment-charts] - https://gerrit.wikimedia.org/r/506098
[09:13:23] (PS5) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[09:13:25] (PS2) Ema: cache: move check_varnish_expiry_mailbox_lag to backend profile [puppet] - https://gerrit.wikimedia.org/r/506090 (https://phabricator.wikimedia.org/T145661)
[09:14:20] (CR) Alexandros Kosiaris: [V: +2 C: +2] cxserver: Fix typo in GC metric name [deployment-charts] - https://gerrit.wikimedia.org/r/506098 (owner: Alexandros Kosiaris)
[09:15:23] (CR) Ema: [C: +2] cache: move check_varnish_expiry_mailbox_lag to backend profile [puppet] - https://gerrit.wikimedia.org/r/506090 (https://phabricator.wikimedia.org/T145661) (owner: Ema)
[09:16:54] (PS13) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[09:17:08] (CR) Vgutierrez: trafficserver: wrap TLS settings using a type alias (2 comments) [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:17:39] (CR) Ema: [C: +1] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:17:49] (PS6) Alexandros Kosiaris: First draft of a wikibase-termbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402)
[09:17:51] (PS1) Alexandros Kosiaris: Publish the kask chart in the repo [deployment-charts] - https://gerrit.wikimedia.org/r/506104 (https://phabricator.wikimedia.org/T220401)
[09:18:17] (CR) Jcrespo: "I am ok with this, but this is not a noop (even if technically is), I would like to see this config change to single host first:" [puppet] - https://gerrit.wikimedia.org/r/505831 (owner: Alexandros Kosiaris)
[09:18:36] RECOVERY - tools project instance distribution on cloudcontrol1003 is OK: OK: All critical instances are spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[09:18:49] (CR) Alexandros Kosiaris: [V: +2 C: +2] Publish the kask chart in the repo [deployment-charts] - https://gerrit.wikimedia.org/r/506104 (https://phabricator.wikimedia.org/T220401) (owner: Alexandros Kosiaris)
[09:22:33] (CR) Vgutierrez: [C: +2] trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[09:22:41] (PS14) Vgutierrez: trafficserver: wrap TLS settings using a type alias [puppet] - https://gerrit.wikimedia.org/r/505780 (https://phabricator.wikimedia.org/T221594)
[09:23:04] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:04] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[09:24:04] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[09:24:21] woot?
[09:24:42] the cr2-eqiad port down might be a transit
[09:25:02] I don't see any scheduled maintenance
[09:25:24] the 500s spike looks gone now
[09:25:38] BGP Session Down: 91.198.174.249 (AS65003)
[09:25:56] this is cr2-eqiad <-> cr2-esams
[09:26:47] upload does not seem affected
[09:27:40] RECOVERY - puppet last run on bast5001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:27:41] it seems the Level3 link between eqiad and esams
[09:27:52] https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png
[09:28:00] we are probably going through knams now?
[09:29:00] akosiaris: ^ ?
[09:29:50] * akosiaris looking
[09:32:11] (PS4) Gehel: maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670)
[09:32:34] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[09:32:34] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[09:32:56] (CR) Gehel: [C: +2] maps: smooth the tilerator load by reducing cpu assigned to tilerator [puppet] - https://gerrit.wikimedia.org/r/505823 (https://phabricator.wikimedia.org/T221670) (owner: Gehel)
[09:34:19] we have some GRE tunnel there as well it seems
[09:34:26] not sure what it is about
[09:34:28] mark ^ ?
[09:35:16] the overall device traffic hasn't fallen
[09:35:50] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 198.1 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[09:35:51] akosiaris: is it the dotted line in https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png between eqiad and esams?
[09:35:52] (CR) Vgutierrez: [C: +1] "LGTM, should we notify wikitech-l regarding this change?" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[09:36:01] (the GRE tunnel I mean)
[09:36:31] elukey: I guess so?
[09:36:37] I don't see traffic over it fwiw
[09:36:54] but graphs say nan
[09:36:58] NaN that is
[09:37:14] so... I am not sure it's actually 0 traffic, sounds more like something not being graphed?
[09:37:16] is traffic going through knams now? (trying to understand)
[09:37:20] me too
[09:37:45] (PS2) Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819)
[09:39:50] elukey: yes it's going through cr2-knams
[09:40:32] yep I am seeing it via librenms, the spike in traffic is nice :D
[09:40:51] should we open a task or something to level3?
[09:40:52] if only we could change the timespan easily
[09:41:05] it's killing me that I have to chase the spike in 6h graphs
[09:41:24] otherwise where would be the joy??
[09:41:30] * elukey runs
[09:41:47] lol
[09:42:31] elukey: I guess we should
[09:43:43] elukey: runbook is at https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down
[09:45:12] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:45:38] yeah yeah we know :D
[09:45:45] heh, icinga was slow on this one
[09:46:49] (PS17) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[09:47:51] let's open up a task for that
[09:48:07] I see a scheduled maint on May 1st for that link
[09:48:25] with reasoning cheduled maintenance to (Project modifier - troubleshoot and clear network alarms, clean fiber to clear network alarms) in order to prevent future service interruptions to customer services.
[09:48:41] so it maybe they just weren't fast enough
[09:54:20] (PS18) Vgutierrez: trafficserver: Provide support for multiple ATS instances [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217)
[09:54:45] (CR) Gilles: [C: +2] Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[09:54:47] (CR) Gilles: [V: +2 C: +2] Buster compatibility [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/505727 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[09:56:06] (PS1) Gilles: Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/506115 (https://phabricator.wikimedia.org/T221562)
[09:56:19] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/15979/" [puppet] - https://gerrit.wikimedia.org/r/504601 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[09:56:29] (CR) Gilles: [V: +2 C: +2] Version bump [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/506115 (https://phabricator.wikimedia.org/T221562) (owner: Gilles)
[09:56:38] (CR) Jcrespo: [C: +1] "I just realized this is for the extra_port only, so this can go anytime now." [puppet] - https://gerrit.wikimedia.org/r/505831 (owner: Alexandros Kosiaris)
[09:56:42] akosiaris: if you want I can take care of the netops task while you contact their support
[09:56:50] (creating etc..
as the runbook states) [09:58:14] !log installing rsync security updates on jessie [09:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:37] (03PS1) 10Gilles: Upgrade to 2.5 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/506116 (https://phabricator.wikimedia.org/T221562) [10:00:25] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) [10:00:44] 10Operations, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout while loading Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358447430&oldid=358294930 - https://phabricator.wikimedia.org/T140879 (10abian) [10:02:52] 10Operations, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout when loading diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (10abian) [10:03:55] 10Operations, 10Traffic, 10Patch-For-Review: cp4021 - UNKNOWN: cannot run varnishstat - https://phabricator.wikimedia.org/T221731 (10ema) 05Open→03Resolved [10:04:47] 10Operations, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout when loading diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (10abian) [10:10:49] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) [10:12:19] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) Received information for Level3 ` This is to confirm that ticket 16262986 has been created regarding your service. Customer Name: Wikimedia Foundation Billing Account Number: 1-DCG6LL Customer... 
[10:14:04] 10Operations, 10MediaWiki-History-and-Diffs, 10Wikidata, 10wikidata-tech-focus, 10Performance: Request timeout when loading diffs on Wikidata - https://phabricator.wikimedia.org/T140879 (10Epidosis) [10:17:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "More like cleanup. ferm::rule isn't really meant to be used much. It's there for the cases where ferm::service just doesn't cut it." [puppet] - 10https://gerrit.wikimedia.org/r/505831 (owner: 10Alexandros Kosiaris) [10:18:51] (03PS1) 10Ema: prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) [10:19:31] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) p:05Triage→03High [10:19:42] PROBLEM - Check systemd state on ms-be1024 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:21:35] (03CR) 10Gilles: "I've built the package for Buster successfully using this on WMCS." [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/506116 (https://phabricator.wikimedia.org/T221562) (owner: 10Gilles) [10:22:29] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10serviceops, and 5 others: Introduce kask session storage service to kubernetes - https://phabricator.wikimedia.org/T220401 (10akosiaris) @Eevans @Clarakosi chart has been merged and is published. The only thing missing before we can move on to... [10:22:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks, merging!" 
[puppet] - 10https://gerrit.wikimedia.org/r/505831 (owner: 10Alexandros Kosiaris) [10:22:56] (03PS3) 10Alexandros Kosiaris: mariadb::ferm: Switch ferm::rule => ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/505831 [10:23:08] !log Restarting php-fpm on mw1238 for 505383 and T211488 [10:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:16] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [10:25:17] (03PS4) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) [10:25:22] (03PS3) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (WIP) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [10:26:02] (03CR) 10Jbond: "updated to incorporate changes from https://gerrit.wikimedia.org/r/c/operations/puppet/+/505817" [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:26:05] (03CR) 10jerkins-bot: [V: 04-1] kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:27:08] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) @CDanis I can't confirm it completely, but it seems the side effect may have formed. I extracted some su... 
[10:27:28] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:28:04] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:28:19] (03CR) 10Muehlenhoff: kafka shipper: move kafka rsyslog shipping to base profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:28:53] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [10:28:54] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed [10:28:54] !log akosiaris@deploy1001 scap-helm cxserver finished [10:28:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:10] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) @CDanis Thanks your help, but it seems the side effect may have formed. I extracted some subdomains of w... 
[10:30:32] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:30:36] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:31:02] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:06] (03PS5) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) [10:31:17] (03CR) 10Jbond: kafka shipper: move kafka rsyslog shipping to base profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:31:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1001.eqiad.wmnet'] ` The... 
[10:31:24] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Ran on db1063 first, looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/505831 (owner: 10Alexandros Kosiaris) [10:31:27] (03PS3) 10Jbond: kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) [10:31:38] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:31:42] (03PS4) 10Jbond: ulogd logstash: Add rule to parse ulogd ouput to json [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) [10:32:00] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:32:12] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:32:16] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:32:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:32:37] seems a single spike [10:32:40] RECOVERY - Check systemd state on ms-be1024 is OK: OK - 
running: The system is fully operational [10:33:10] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:33:14] akosiaris: I think that the interface is flapping [10:33:16] (03CR) 10Jbond: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:33:19] https://librenms.wikimedia.org/eventlog [10:33:26] ifOperStatus: lowerLayerDown -> up [10:33:27] and then [10:33:36] (03Abandoned) 10Jbond: kafka: It was pointed out that kafak shipping may not work for all hosts [puppet] - 10https://gerrit.wikimedia.org/r/505817 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:33:38] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:33:59] ah no wait seems up now [10:34:12] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:34:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:34:29] so the above mess was traffic shifted back to level3? [10:34:48] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:35:02] (got confused with asw2-a-eqiad) [10:35:45] yep traffic back to the interface [10:35:49] gooood [10:35:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [10:36:31] elukey: yup, looks like it [10:36:36] but it might indeed flap [10:36:42] so, let's keep an eye out for it [10:36:44] (03CR) 10Dr0ptp4kt: "Question / suggestion on test cases." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) (owner: 10Santhosh) [10:36:58] that weird announcement makes me anxious [10:37:11] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) @jijiki this is all good to go, successfully built on buster.thumbor.eqiad.wmflabs python-thumbor-community-core and thumbor need tiny pa... [10:37:14] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:37:34] akosiaris: where did the email from CenturyLink land? noc or something else? [10:37:36] elukey: lol [10:37:38] Field Operations dispatched and upon arrival to the site determined the fiber near the equipment had been burned. Field Operations are currently working to install a new fiber pair to restore services. [10:37:41] burned? 
[10:37:43] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) [10:37:53] ahahhaha [10:37:55] 10Operations, 10Performance-Team, 10Thumbor, 10serviceops, 10Patch-For-Review: Build Thumbor packages for buster - https://phabricator.wikimedia.org/T221562 (10Gilles) a:05Gilles→03jijiki [10:38:03] elukey: ops-maintenance@ [10:38:13] ahhh I think I am not on it [10:38:22] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506061 (owner: 10Muehlenhoff) [10:38:35] elukey: you can access it through the google groups interface [10:38:48] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:39:14] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [10:40:04] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:40:23] jynus: ack thanks! [10:40:44] (going afk) [10:41:08] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) New update says: ` Field Operations dispatched and upon arrival to the site determined the fiber near the equipment had been burned. Field Operations are currently working to install a new fibe... 
[10:41:26] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) [10:41:33] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) p:05High→03Low [10:41:59] (03CR) 10Jbond: "> LGTM, for rollout I think we should disable puppet fleetwide and reenable gradually because this change will mean a whole lot more kafka" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:46:20] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [10:51:09] (03PS9) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [10:51:11] (03PS7) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [10:51:13] (03PS1) 10Jcrespo: mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 [10:51:28] (03PS3) 10Santhosh: Redirect Google Translate any wiki source to mobile [puppet] - 10https://gerrit.wikimedia.org/r/506043 (https://phabricator.wikimedia.org/T219819) [10:51:51] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... 
[10:52:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 (owner: 10Jcrespo) [10:53:28] (03CR) 10Gilles: "@Ema any chance this could get looked at this quarter?" [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) [10:58:30] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: expose metrics in prometheus format for new docker-registry and create a grafana dashboard - https://phabricator.wikimedia.org/T221099 (10fsero) metrics has been exposed and there is a preliminar grafana dashboard https://grafana.wi... [10:58:41] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: improve docker registry architecture - https://phabricator.wikimedia.org/T209271 (10fsero) [10:58:48] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, and 2 others: expose metrics in prometheus format for new docker-registry and create a grafana dashboard - https://phabricator.wikimedia.org/T221099 (10fsero) 05Open→03Resolved [10:58:51] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [11:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: How many deployers does it take to do European Mid-day SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:30] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: create and enable alerting on docker_registry_ha - https://phabricator.wikimedia.org/T221759 (10fsero) [11:02:48] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... [11:03:47] I'm stealing SWAT to do a small ores deployment [11:03:56] akosiaris: FYI ^ [11:04:03] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:04:19] ok [11:07:12] Revision to revert just in case: 8f01d40bfac1c3026472efcaedb70a5df54fa0fb [11:07:22] !log ladsgroup@deploy1001 Started deploy [ores/deploy@060fc37]: (no justification provided) [11:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:05] Amir1: is your swat done? [11:13:16] oh not yet, sorry [11:13:17] not yet [11:14:01] the canary is healthy, let's roll [11:19:28] (03PS1) 10Ladsgroup: nagios: Migrate ores checks from testwiki to fakewiki [puppet] - 10https://gerrit.wikimedia.org/r/506127 (https://phabricator.wikimedia.org/T219930) [11:19:53] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... [11:20:04] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... 
[11:21:57] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... [11:22:05] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... [11:22:23] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov1002.eqiad.wmnet'] ` The... [11:22:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['dbprov1002.eqiad.wmnet'] ` Of which those **FAILED**: ` ['dbprov1002.eqiad.wmne... [11:23:40] !log ladsgroup@deploy1001 Finished deploy [ores/deploy@060fc37]: (no justification provided) (duration: 16m 18s) [11:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:49] jijiki: ^ bam [11:23:50] :D [11:23:54] tx! 
[11:24:55] (03PS4) 10Muehlenhoff: Support upgrades which introduce changes to binary package names [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [11:25:12] !log Restarting php7.2-fpm on mw-canary for 505383 and T211488 [11:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:17] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [11:26:17] PROBLEM - Check systemd state on ms-be1031 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:28:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] nagios: Migrate ores checks from testwiki to fakewiki [puppet] - 10https://gerrit.wikimedia.org/r/506127 (https://phabricator.wikimedia.org/T219930) (owner: 10Ladsgroup) [11:30:08] akosiaris: Thanks! [11:33:55] !log security update ghostscript on scb jessie servers [11:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:45] PROBLEM - puppet last run on proton1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:36:01] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 101.2 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [11:41:22] (03PS4) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [11:45:46] !log restarting relforge for jvm ugprade [11:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:50] moritzm: ^ [11:46:06] ack [11:46:32] 10Operations, 10Patch-For-Review: Replacement of network::constant's special_hosts - https://phabricator.wikimedia.org/T220894 (10jcrespo) I think it is better to hardcode the constants on `modules/profile/manifests/mariadb/ferm.pp` (for now, not as an ideal situation) than to go on a multi-file refactoring co... 
[11:47:03] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [11:47:16] (03CR) 10Jcrespo: [C: 04-1] network::constants: Move mysql_root_clients from special_hosts to hieradata [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [11:48:47] (03CR) 10Jcrespo: [C: 04-1] "Why wasn't I added as reviewer? See phabricator comment." [puppet] - 10https://gerrit.wikimedia.org/r/505406 (owner: 10Alex Monk) [11:49:35] (03PS1) 10Mathew.onipe: maps: add pgpass file [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) [11:51:16] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10jijiki) p:05Triage→03Unbreak! [11:52:39] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10jijiki) [11:53:34] (03PS1) 10Ladsgroup: icinga: change dashboard uid of ores to the new dashboard [puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) [11:55:56] (03CR) 10Mathew.onipe: "PCC output looks good: https://puppet-compiler.wmflabs.org/compiler1002/15980/" [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:56:53] (03PS2) 10Gehel: maps: add pgpass file [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:56:57] (03CR) 10Ladsgroup: "Is this correct?" 
[puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) (owner: 10Ladsgroup) [11:58:09] (03CR) 10Gehel: [C: 03+2] maps: add pgpass file [puppet] - 10https://gerrit.wikimedia.org/r/506131 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [11:59:36] !log Restarting php7.2-fpm on mw12* for 505383 and T211488 [11:59:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:59:41] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [12:07:53] PROBLEM - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [12:09:17] RECOVERY - Check systemd state on ms-be1031 is OK: OK - running: The system is fully operational [12:09:58] (03CR) 10Ladsgroup: [C: 04-1] [DNM] Rename JADE to Jade (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480284 (https://phabricator.wikimedia.org/T212182) (owner: 10Awight) [12:12:01] (03CR) 10Jbond: "LGTM, some minor comments" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [12:14:24] (03PS1) 10Urbanecm: Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) [12:17:00] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: split profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) [12:17:14] (03PS1) 10ArielGlenn: allow the display of only index or column differences for db table checker [software] - 10https://gerrit.wikimedia.org/r/506136 [12:17:16] (03PS1) 10ArielGlenn: script to show section/dbhost info by asking mediawiki for it [software] - 10https://gerrit.wikimedia.org/r/506137 [12:18:06] (03CR) 10jerkins-bot: [V: 04-1] allow the display of 
only index or column differences for db table checker [software] - 10https://gerrit.wikimedia.org/r/506136 (owner: 10ArielGlenn) [12:18:11] (03CR) 10jerkins-bot: [V: 04-1] script to show section/dbhost info by asking mediawiki for it [software] - 10https://gerrit.wikimedia.org/r/506137 (owner: 10ArielGlenn) [12:20:45] (03PS2) 10Ema: prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) [12:22:11] (03CR) 10Ema: [C: 03+2] prometheus: use ATS profile instead of role in job definition [puppet] - 10https://gerrit.wikimedia.org/r/506122 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [12:22:34] (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: split profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) [12:23:56] !log rolling restart of Cassandra on restbase/eqiad to pick up Java security update [12:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:29] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:30:17] (03PS3) 10Arturo Borrero Gonzalez: openstack: clientpackages: split profile for VM instances [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) [12:36:27] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [12:36:46] !log restarting pdfrender on scb1004 [12:36:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:20] (03PS1) 10Ema: Revert "cache: define ATS nodes in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/506141 (https://phabricator.wikimedia.org/T213263) [12:38:53] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [12:42:51] godog: FYI there's been a couple of alerts related to ms-be1013 (I see T220590, perhaps some decom steps missing)? [12:42:51] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [12:44:04] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC for toolforge: https://puppet-compiler.wmflabs.org/compiler1001/15982/" [puppet] - 10https://gerrit.wikimedia.org/r/506135 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [12:44:28] (03CR) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [12:44:36] (03PS5) 10Muehlenhoff: Support upgrades which introduce changes to binary package names [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 [12:48:05] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [12:51:42] (03PS2) 10Ema: Revert "cache: define ATS nodes in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/506141 (https://phabricator.wikimedia.org/T213263) [12:52:36] (03PS1) 10Arturo Borrero Gonzalez: openstack: clientpackages: vms: fix assert for realm [puppet] 
- 10https://gerrit.wikimedia.org/r/506146 (https://phabricator.wikimedia.org/T220051) [12:52:45] (03CR) 10Ema: [C: 03+2] Revert "cache: define ATS nodes in hiera" [puppet] - 10https://gerrit.wikimedia.org/r/506141 (https://phabricator.wikimedia.org/T213263) (owner: 10Ema) [12:53:44] (03PS2) 10Arturo Borrero Gonzalez: openstack: clientpackages: vms: fix assert for realm [puppet] - 10https://gerrit.wikimedia.org/r/506146 (https://phabricator.wikimedia.org/T220051) [12:54:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: clientpackages: vms: fix assert for realm [puppet] - 10https://gerrit.wikimedia.org/r/506146 (https://phabricator.wikimedia.org/T220051) (owner: 10Arturo Borrero Gonzalez) [13:00:57] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.48.142: Connection reset by peer [13:01:21] !log Restarting php7.2-fpm on mw13* for 505383 and T211488 [13:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:27] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [13:15:46] ema: thanks I'll take a look! [13:17:46] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10jijiki) [13:26:17] ACKNOWLEDGEMENT - EDAC syslog messages on thumbor1004 is CRITICAL: 4.001 ge 4 Effie Mouzeli https://phabricator.wikimedia.org/T215411 - The acknowledgement expires at: 2019-05-25 13:25:49. 
https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [13:30:51] (03PS1) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [13:34:07] (03PS1) 10Matthias Mullie: Allow cross-site requests from Commons' mobile domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) [13:37:44] !log Poweroff db2080 for onsite maintenance - T216240 [13:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:24] T216240: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 [13:38:48] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "We might have to rename the extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505816 (https://phabricator.wikimedia.org/T221651) (owner: 10Lucas Werkmeister (WMDE)) [13:38:55] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-2] "We might have to rename the extension" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505813 (https://phabricator.wikimedia.org/T221650) (owner: 10Lucas Werkmeister (WMDE)) [13:42:02] (03PS1) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [13:42:15] (03CR) 10Filippo Giunchedi: "Not a puppet/hiera expert so can't really say, LGTM to my untrained eye tho" [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [13:42:59] (03CR) 10jerkins-bot: [V: 04-1] cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:48:32] (03PS2) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [13:50:31] (03CR) 
10Jforrester: "Wow." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) (owner: 10Matthias Mullie) [13:54:05] (03CR) 10Alex Monk: "While here we should probably add anything else missing from the list" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) (owner: 10Matthias Mullie) [13:54:13] (03CR) 10CDanis: [V: 03+2 C: 03+2] codfw decom: halve non-object weights and 2/3rds object weights [software/swift-ring] - 10https://gerrit.wikimedia.org/r/505888 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [13:54:42] (03CR) 10Fsero: "PCC is happy https://puppet-compiler.wmflabs.org/compiler1002/15986/registry1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [13:54:53] (03CR) 10Fsero: [C: 03+2] registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [13:55:11] (03PS2) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [13:59:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "That's indeed correct Amir, I'll merge" [puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) (owner: 10Ladsgroup) [13:59:18] (03PS2) 10Filippo Giunchedi: icinga: change dashboard uid of ores to the new dashboard [puppet] - 10https://gerrit.wikimedia.org/r/506132 (https://phabricator.wikimedia.org/T221618) (owner: 10Ladsgroup) [13:59:51] (03CR) 10Alexandros Kosiaris: [C: 03+1] puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [14:00:07] (03PS3) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [14:02:58] (03PS3) 
10Mathew.onipe: Add maps postgres init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:03:22] (03CR) 10Mathew.onipe: Add maps postgres init cookbook (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:04:41] 10Operations, 10netops: Level3 esams <-> eqiad link outage - https://phabricator.wikimedia.org/T221758 (10akosiaris) 05Open→03Resolved a:03akosiaris CenturyLink sent a summary and a notification they 'll close the issues as resolved on their end. ` Summary: On April 24, 2019 at 9:21 GMT, CenturyLink ide... [14:05:03] elukey: https://phabricator.wikimedia.org/T221758. Resolved. [14:05:35] (03PS5) 10Jbond: puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 [14:05:53] (03PS3) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [14:06:19] (03PS4) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [14:07:47] (03CR) 10Jbond: [C: 03+2] puppet-compiler: clean up hiera config [puppet] - 10https://gerrit.wikimedia.org/r/504409 (owner: 10Jbond) [14:08:26] (03PS4) 10Fsero: registryha: reenable alerts on registry[12]00[12] [puppet] - 10https://gerrit.wikimedia.org/r/506150 (https://phabricator.wikimedia.org/T221759) [14:11:06] akosiaris: thanks! 
[14:17:10] 10Operations, 10User-fgiunchedi: Upgrade jessie hosts to rsyslog 8.1901.0-1 - https://phabricator.wikimedia.org/T219764 (10fgiunchedi) [14:18:45] (03CR) 10Ema: "pcc lgtm https://puppet-compiler.wmflabs.org/compiler1001/15994/" [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:19:01] (03PS1) 10Vgutierrez: trafficserver: Provide support for incoming TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:19:39] (03PS1) 10Fsero: registryha: better description for health [puppet] - 10https://gerrit.wikimedia.org/r/506161 (https://phabricator.wikimedia.org/T221759) [14:20:37] (03PS2) 10Fsero: registryha: better description for health [puppet] - 10https://gerrit.wikimedia.org/r/506161 (https://phabricator.wikimedia.org/T221759) [14:21:22] (03CR) 10Fsero: [C: 03+2] registryha: better description for health [puppet] - 10https://gerrit.wikimedia.org/r/506161 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [14:22:12] (03PS2) 10Vgutierrez: trafficserver: Provide support for incoming TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:24:03] (03CR) 10Vgutierrez: "PCC looks happy for existing cp-ats hosts, resulting almost in a NOOP: https://puppet-compiler.wmflabs.org/compiler1001/15996/" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:28:31] !log begin rollout of rsyslog 8.1901.0-1 to jessie hosts - T219764 [14:28:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] T219764: Upgrade jessie hosts to rsyslog 8.1901.0-1 - https://phabricator.wikimedia.org/T219764 [14:29:47] (03CR) 10Ema: "A couple of comments." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:29:58] 2 != 3 [14:29:59] error!
[14:30:12] :) [14:30:43] thx for the review <3 [14:30:55] (03PS1) 10Ottomata: Add eventgate-main chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) [14:33:37] (03PS1) 10Jbond: Hiera backend: update the hiera configuration to remove the role backend [puppet] - 10https://gerrit.wikimedia.org/r/506167 [14:34:47] (03CR) 10Vgutierrez: [C: 03+1] cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:38:03] PROBLEM - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Docker [14:38:05] (03CR) 1020after4: [C: 03+1] Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (owner: 10Aklapper) [14:39:50] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2047 - https://phabricator.wikimedia.org/T221481 (10Papaul) a:05Papaul→03Marostegui complete [14:40:03] ACKNOWLEDGEMENT - Docker registry HTTPS interface on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string schemaVersion not found on https://registry1002.eqiad.wmnet:443/v2/wikimedia-stretch/manifests/latest - 394 bytes in 0.069 second response time Fsero this is expected because swift replication is kind of slow and has not finished.
https://wikitech.wikimedia.org/wiki/Docker [14:40:07] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [14:41:09] (03PS2) 1020after4: Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [14:41:15] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time https://phabricator.wikimedia.org/T174916 [14:41:38] !log restart pdfrender on scb1002 [14:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:14] (03PS3) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:43:53] PROBLEM - swift-container-updater on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [14:43:55] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2024 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.60: Connection reset by peer [14:44:04] (03PS4) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [14:44:22] (03CR) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:44:24] (03PS4) 10Herron: phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) [14:44:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS 
- https://phabricator.wikimedia.org/T208263 (10BBlack) @Cwek - Thanks for the reports! Have you tried other Wikimedia projects (e.g. wikiversity, wikiquote,... [14:45:01] (03CR) 10Jbond: [C: 03+1] "LGTM" (032 comments) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [14:45:07] RECOVERY - swift-container-updater on ms-be2024 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [14:45:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2024 is OK: OK ferm input default policy is set [14:45:39] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2037 - https://phabricator.wikimedia.org/T221512 (10Papaul) a:05Papaul→03Marostegui Complete [14:45:57] (03CR) 10Eric Gardner: [C: 03+1] Allow cross-site requests from Commons' mobile domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221678) (owner: 10Matthias Mullie) [14:47:03] (03CR) 10Herron: [C: 03+2] phabricator: remove rfc1918 ip4 addrs from SPF record [dns] - 10https://gerrit.wikimedia.org/r/504936 (https://phabricator.wikimedia.org/T221288) (owner: 10Herron) [14:47:15] (03CR) 10Fsero: "syntactically looks good, however, how this chart relates to the eventgate-analytics one? it seems there is some duplication between both." [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:47:55] (03PS5) 10Ema: cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) [14:48:56] (03CR) 10Ema: [C: 03+2] cache: unify cache nodes definition in hieradata [puppet] - 10https://gerrit.wikimedia.org/r/506154 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:49:48] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [14:50:02] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 (10Papaul) 05Open→03Resolved Complete [14:50:21] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestcontrol2003: rename to cloudcontrol2003-dev - https://phabricator.wikimedia.org/T220095 (10Papaul) [14:51:06] PROBLEM - Docker registry health on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry1002.eqiad.wmnet:443/debug/health - 366 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Docker [14:51:48] (03CR) 10Ottomata: "Yeah there's a lot of duplication. -analytics and -main are two separate deployments with different destination Kafka clusters, serving d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:54:48] (03CR) 10Vgutierrez: "pcc is still happy after all the renaming: https://puppet-compiler.wmflabs.org/compiler1002/15998/" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [14:55:24] 10Operations, 10Analytics, 10Analytics-Cluster: furud - DISK CRITICAL - /mnt/hdfs is not accessible: Input/output error - https://phabricator.wikimedia.org/T221483 (10Ottomata) [14:55:29] 10Operations, 10Analytics, 10Analytics-Cluster: Remove Hadoop configs and unmount /mnt/hdfs from unused backup hosts (furud, +) - https://phabricator.wikimedia.org/T221629 (10Ottomata) 05Open→03Declined Oh, actually, /mnt/hdfs is not puppetized. It was leftover from when it was. I just removed it from f... 
[14:55:51] ACKNOWLEDGEMENT - Docker registry health on registry1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry1002.eqiad.wmnet:443/debug/health - 366 bytes in 0.010 second response time Fsero /debug/health listens on 5001 and not on 443.. tried to acknowledge this before alerting but icinga UI is so fast Sigh https://wikitech.wikimedia.org/wiki/Docker [14:58:32] PROBLEM - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2001.codfw.wmnet:443/debug/health - 366 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Docker [14:58:37] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10Ottomata) a:05Ottomata→03None [14:59:33] (03PS2) 10Herron: lvs: switch kibana scheduler to source hash [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) [15:00:31] !log switching kibana lvs to source hash scheduler [15:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:58] (03CR) 10Herron: [C: 03+2] lvs: switch kibana scheduler to source hash [puppet] - 10https://gerrit.wikimedia.org/r/504590 (https://phabricator.wikimedia.org/T221143) (owner: 10Herron) [15:01:14] RECOVERY - Device not healthy -SMART- on db2047 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2047&var-datasource=codfw+prometheus/ops [15:01:47] 10Operations, 10ops-eqiad, 10DBA, 10decommission: Decommission parsercache hosts: pc1004.eqiad.wmnet pc1005.eqiad.wmnet pc1006.eqiad.wmnet - https://phabricator.wikimedia.org/T210969 (10Cmjohnson) 05Open→03Resolved All the disks were securely wiped and the server reset to defaults [15:02:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission labvirt101[01].eqiad.wmnet (Dec 2018 lease return) - https://phabricator.wikimedia.org/T210735 (10Cmjohnson) [15:02:33] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission, 10cloud-services-team (Kanban): decommission labvirt101[01].eqiad.wmnet (Dec 2018 lease return) - https://phabricator.wikimedia.org/T210735 (10Cmjohnson) 05Open→03Resolved All the disks were securely wiped and the server reset to defaults [15:04:10] PROBLEM - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2002.codfw.wmnet:443/debug/health - 366 bytes in 0.153 second response time https://wikitech.wikimedia.org/wiki/Docker [15:09:38] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Refactor public-facing DYNA scheme for primary project hostnames in our DNS - https://phabricator.wikimedia.org/T208263 (10Cwek) @BBlack You can read this [[ https://zh.wikipedia.org/wiki/Help:%E5%A6%82%E4%BD%95%E8%AE%BF%E9%97%AE%E7%B...
[15:10:28] RECOVERY - puppet last run on ms-be2027 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:10:50] (03PS1) 10Fsero: registryha: registry health check was querying wrong port [puppet] - 10https://gerrit.wikimedia.org/r/506170 (https://phabricator.wikimedia.org/T221759) [15:11:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Support upgrades which introduce changes to binary package names [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/481176 (owner: 10Muehlenhoff) [15:12:41] ACKNOWLEDGEMENT - Docker registry health on registry2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2001.codfw.wmnet:443/debug/health - 366 bytes in 0.151 second response time Fsero known problem, see registry1002 ack https://wikitech.wikimedia.org/wiki/Docker [15:12:41] ACKNOWLEDGEMENT - Docker registry health on registry2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 404 Not Found - string {} not found on https://registry2002.codfw.wmnet:443/debug/health - 366 bytes in 0.154 second response time Fsero known problem, see registry1002 ack https://wikitech.wikimedia.org/wiki/Docker [15:15:15] (03CR) 10Fsero: [C: 03+2] registryha: registry health check was querying wrong port [puppet] - 10https://gerrit.wikimedia.org/r/506170 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:15:37] 10Operations, 10hardware-requests: Two test hosts for SREs - https://phabricator.wikimedia.org/T214024 (10Cmjohnson) [15:15:54] (03PS1) 10Muehlenhoff: Support upgrades which introduce changes to binary package names (client side) [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506171 [15:18:44] 10Operations, 10cloud-services-team (Kanban): Sporadic puppet failures - https://phabricator.wikimedia.org/T201247 (10herron) [15:18:45] (03PS10) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) 
[15:18:48] (03PS8) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [15:18:50] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10herron) [15:18:51] (03PS2) 10Jcrespo: mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 [15:19:13] 10Operations, 10Analytics-Kanban, 10EventBus, 10netops: Allow analytics VLAN to reach schema.svc.$site.wmnet - https://phabricator.wikimedia.org/T221690 (10ayounsi) a:03ayounsi [15:20:59] (03PS3) 10Ema: cache: distinguish between Varnish and ATS nodes [puppet] - 10https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) [15:24:59] 10Operations, 10ops-codfw, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Papaul) Before: BIOS Version 2.4.3 Firmware Version 2.40.40.40 IP Address(es) 10.193.1.75 iDRAC MAC Address 84:7B:EB:F6:99:B2 DNS Domain Name Lifecyc... 
[15:25:54] (03PS1) 10Fsero: registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) [15:26:21] (03CR) 10jerkins-bot: [V: 04-1] registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:27:32] (03CR) 10Niedzielski: [C: 03+1] Remove wikibase sameAs A/B test config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/505999 (https://phabricator.wikimedia.org/T209377) (owner: 10Nray) [15:27:48] (03PS2) 10Matthias Mullie: Allow cross-site requests from mobile domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) [15:28:16] (03PS2) 10Fsero: registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) [15:28:19] (03CR) 10Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/16000/" [puppet] - 10https://gerrit.wikimedia.org/r/505815 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [15:29:03] (03CR) 10jerkins-bot: [V: 04-1] registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:30:39] 10Operations, 10Wikimedia-Logstash: logstash stuck on its persistent queue - https://phabricator.wikimedia.org/T212640 (10herron) 05Open→03Resolved a:03herron I think it's safe to resolve this now since we're on logstash 5.6.15, and have disabled the logstash persistent queue. 
[15:31:18] (03CR) 10WMDE-leszek: First draft of a wikibase-termbox chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/505788 (https://phabricator.wikimedia.org/T220402) (owner: 10Alexandros Kosiaris) [15:32:03] !log Restarting php7.2-fpm on mw2* in codfw for 505383 and T211488 [15:32:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:08] T211488: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 [15:33:02] 10Operations, 10ops-codfw, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) Thanks @Papaul I am rebooting the server a few times to confirm it is indeed solved! [15:33:41] 10Operations: maps hosts have bad permissions under /srv/deployment - https://phabricator.wikimedia.org/T220982 (10herron) Is there anything left to do before closing this? [15:34:45] (03PS1) 10Ema: debdeploy: make filter_services default to empty hash [puppet] - 10https://gerrit.wikimedia.org/r/506176 [15:35:16] 10Operations, 10ops-codfw, 10decommission, 10fundraising-tech-ops, 10Patch-For-Review: decom rigel.frack.codfw.wmnet - https://phabricator.wikimedia.org/T202535 (10Papaul) papaul@fasw-c-codfw> show interfaces descriptions | match "ge-[0-1]/0/14" ge-0/0/14 down down DISABLED ge-1/0/14 down... 
[15:35:20] (03PS3) 10Fsero: registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) [15:37:03] (03CR) 10Fsero: [C: 03+2] registryha: fix: bad regexp pattern on check [puppet] - 10https://gerrit.wikimedia.org/r/506174 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:38:06] (03PS3) 10Jcrespo: mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 [15:38:10] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10crusnov) I forget where but in digging about this it seems that Puppet will return 503 if it is too busy, there are numerous reports of this (to be clear I don't know if it's puppet itself or an... [15:40:37] (03PS2) 10Ema: debdeploy: make filter_services default to empty hash [puppet] - 10https://gerrit.wikimedia.org/r/506176 [15:40:39] (03PS1) 10Ema: cumin: add ATS production hosts to aliases [puppet] - 10https://gerrit.wikimedia.org/r/506177 (https://phabricator.wikimedia.org/T219967) [15:40:51] 10Operations, 10ops-codfw, 10decommission: Decommission db2033 - https://phabricator.wikimedia.org/T220070 (10Papaul) [15:43:34] 10Operations, 10Puppet, 10Icinga, 10monitoring: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (10ema) [15:43:38] (03CR) 10Alexandros Kosiaris: "Haven't seen the chart, I 'll have a look tomorrow." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [15:43:46] 10Operations, 10Puppet, 10Icinga, 10monitoring: Puppet failing without Icinga alert in case of dependency cycle - https://phabricator.wikimedia.org/T221784 (10ema) p:05Triage→03Normal [15:45:02] 10Operations, 10ops-ulsfo: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) p:05Triage→03Normal [15:45:08] PROBLEM - swift-account-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.83: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [15:50:24] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [15:51:14] (03CR) 10Jcrespo: [C: 03+2] mariadb-ferm: Update comment to clarify intention of generic define [puppet] - 10https://gerrit.wikimedia.org/r/506125 (owner: 10Jcrespo) [15:52:36] !log performing rolling restart of pybal on low-traffic eqiad/codfw lvs hosts [15:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:22] (03PS1) 10Fsero: registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) [15:55:35] (03CR) 10jerkins-bot: [V: 04-1] registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:57:11] (03PS2) 10Fsero: registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) [15:57:35] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Kibana breaks during rolling upgrade - https://phabricator.wikimedia.org/T221143 (10herron) 05Open→03Resolved a:03herron The Kibana lvs has been updated to use the 
source hash scheduler [15:58:50] (03CR) 10Fsero: [C: 03+2] registryha: fix: bad regexp pattern on check (again) [puppet] - 10https://gerrit.wikimedia.org/r/506178 (https://phabricator.wikimedia.org/T221759) (owner: 10Fsero) [15:59:49] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2037 - https://phabricator.wikimedia.org/T221512 (10Marostegui) 05Open→03Resolved Thanks! All good! ` root@db2037:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380312088E0) Port Name: 1I Port Name:... [16:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T1600). Please do the needful. [16:00:04] kostajh: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:27] i'm here [16:02:36] 10Operations, 10ops-codfw, 10DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (10Marostegui) [16:02:47] (03PS1) 10Elukey: profile::analytics::database::meta: add properties to my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) [16:04:16] 10Operations, 10cloud-services-team (Kanban): cumin: leaked aliases - https://phabricator.wikimedia.org/T221788 (10aborrero) [16:09:33] (03CR) 10Marostegui: [C: 03+1] "Remember you will need the table to have ROW_FORMAT=DYNAMIC as shown during the earlier discussion." 
[puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [16:09:48] I'll do the SWAT, sorry for the delay [16:10:08] RECOVERY - swift-account-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [16:10:33] (03CR) 10Elukey: "> Remember you will need the table to have ROW_FORMAT=DYNAMIC as" [puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) (owner: 10Elukey) [16:10:44] RECOVERY - Device not healthy -SMART- on db2037 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2037&var-datasource=codfw+prometheus/ops [16:16:37] RECOVERY - Docker registry health on registry1002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Docker [16:17:07] RECOVERY - Docker registry health on registry2001 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [16:17:33] (03PS11) 10Jcrespo: mariadb-snapshots: Setup full daily snapshots for all codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/500980 (https://phabricator.wikimedia.org/T206203) [16:17:35] (03PS9) 10Jcrespo: mariadb-snapshots: Stop replication during transfer [puppet] - 10https://gerrit.wikimedia.org/r/501546 (https://phabricator.wikimedia.org/T206203) [16:17:37] (03PS1) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [16:18:19] (03CR) 10Jcrespo: "First version, needs more work on the previous steps still, and probably more improvements I may be missing now." 
[puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [16:19:41] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2658 MB (5% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [16:19:46] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) ` papaul@asw-b-codfw> show interfaces ge-6/0/4 descriptions Interface Admin Link Description ge-6/0/4 up up db2020 papau... [16:19:56] (03PS2) 10Jcrespo: mariadb-backups: Setup dbprov eqiad servers, remove dbstore1001 backups [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) [16:20:59] (03CR) 10Jcrespo: "This will also need a better distribution later so we minimize simultaneous backups on the same server (target or source)." [puppet] - 10https://gerrit.wikimedia.org/r/506180 (https://phabricator.wikimedia.org/T219399) (owner: 10Jcrespo) [16:23:43] kostajh: Patch is live on mwdebug1002, please test there insofar as possible [16:23:53] RoanKattouw: checking [16:30:18] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [16:32:14] RoanKattouw: it looks OK, although there's another issue now (not caused by this patch) [16:32:22] OK syncing this for now then [16:33:52] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.1/extensions/GrowthExperiments/: Fix exceptions in Homepage logging (duration: 00m 56s) [16:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:11] (03PS2) 10Elukey: profile::analytics::database::meta: add properties to my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/506179 (https://phabricator.wikimedia.org/T212243) [16:37:38] (03CR) 10Ottomata: "Hm, ok.
Was worried there would be potential for bugs in one causing errors in the other, but I'll give that a go. I'd prefer DRY charts" [deployment-charts] - 10https://gerrit.wikimedia.org/r/506166 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [16:40:07] RECOVERY - Docker registry health on registry2002 is OK: HTTP OK: HTTP/1.1 200 OK - 143 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Docker [16:40:15] PROBLEM - HP RAID on ms-be2028 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.88: Connection reset by peer [16:46:04] (03PS1) 10Jbond: refactor: Refactor script and use the PyYAML [debs/debdeploy] - 10https://gerrit.wikimedia.org/r/506188 [16:55:13] PROBLEM - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:55:47] (03CR) 10Lucas Werkmeister (WMDE): "The phabricator task also mentions an alias for the talk namespace – I don’t see that being added in this change…?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [16:56:43] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/506176 (owner: 10Ema) [16:58:08] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10mobrovac) >>! In T220402#5133994, @akosiaris wrote: > @Tarrow, @WMDE-leszek. I 've been working on the termbox helm chart and w... 
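Marostegui's review note above ("the table will need ROW_FORMAT=DYNAMIC") relates to the my.cnf properties under review in change 506179. On pre-10.2 MariaDB, DYNAMIC row format additionally requires the Barracuda InnoDB file format. A hedged sketch of the kind of settings involved (illustrative fragment, not the contents of the actual patch):

```ini
# Illustrative my.cnf fragment -- NOT the contents of Gerrit change 506179.
# On MariaDB before 10.2, ROW_FORMAT=DYNAMIC tables need:
[mysqld]
innodb_file_per_table = 1
innodb_file_format    = Barracuda   # required for DYNAMIC/COMPRESSED rows
innodb_large_prefix   = 1           # allows index key prefixes beyond 767 bytes
```

From MariaDB 10.2 onward, DYNAMIC is the default row format and `innodb_file_format` is deprecated, so the exact properties depend on the server version in use.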
[16:58:26] (03PS2) 10Dzahn: admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) [16:59:47] (03CR) 10Dzahn: [C: 03+2] admins: add shell account for Willy Pao and add to datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/506003 (https://phabricator.wikimedia.org/T221142) (owner: 10Dzahn) [17:02:01] 10Operations, 10ops-codfw: find horizontal PDUs in codfw - https://phabricator.wikimedia.org/T221153 (10Papaul) a:05Papaul→03RobH {F28756699} [17:02:15] (03CR) 10Jbond: "compiler suggests this is a noop (which is what i would expected)" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [17:04:20] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [17:04:28] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Papaul) [17:09:25] (03PS2) 10Urbanecm: Create new namespace "Edice" for cswikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) [17:10:23] (03CR) 10Urbanecm: "Thanks for catching that Lucas, fixed." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [17:12:40] (03PS12) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:13:11] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:17:06] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "no problem :) it looks like I’ll be deploying this tomorrow btw, I have two other changes in SWAT" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506134 (https://phabricator.wikimedia.org/T221697) (owner: 10Urbanecm) [17:22:43] RECOVERY - HP RAID on ms-be2028 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [17:23:46] !log proton1001 - restarting proton service - low RAM caused facter/puppet fails (https://tickets.puppetlabs.com/browse/PUP-8048) freed memory and fixed puppet run (cc: T219456 T214975) [17:23:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:54] T219456: Add ram to Proton* - https://phabricator.wikimedia.org/T219456 [17:23:54] T214975: proton experienced a period of high CPU usage, busy queue, lockups - https://phabricator.wikimedia.org/T214975 [17:26:35] RECOVERY - puppet last run on proton1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [17:28:50] ACKNOWLEDGEMENT - Check systemd state on ms-be2026 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T219854 [17:29:09] RECOVERY - Check systemd state on ms-be2026 is OK: OK - running: The system is fully operational [17:32:00] (03PS13) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:33:53] !log contint1001 - apt-get clean for 1% more disk space [17:33:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:58] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) Icinga alerting again: contint1001 - Disk space CRITICAL 2019-04-24 17:29:39 0d 1h 10m 17s 3/3... [17:35:03] (03PS14) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:35:05] 10Operations, 10Core Platform Team, 10Multi-Content-Revisions, 10Regression, 10Wikimedia-production-error: Unable to move page (Special:MovePage&action=submit) - https://phabricator.wikimedia.org/T221763 (10mobrovac) p:05Unbreak!→03High [17:35:42] (03CR) 10jerkins-bot: [V: 04-1] coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) (owner: 10CRusnov) [17:36:28] (03PS15) 10CRusnov: coherence report: General improvements and rack checks [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/504367 (https://phabricator.wikimedia.org/T220422) [17:37:33] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2649 MB (5% inode=65%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:38:03] PROBLEM - Check systemd state on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: 
Connection reset by peer [17:39:06] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) p:05Normal→03High [17:39:23] PROBLEM - swift-object-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:27] PROBLEM - swift-object-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:31] PROBLEM - swift-account-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:31] PROBLEM - dhclient process on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:39:35] PROBLEM - swift-container-server on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:39] PROBLEM - very high load average likely xfs on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:39] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:39:45] PROBLEM - puppet last run on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:39:47] PROBLEM - swift-account-reaper on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:53] PROBLEM - MD RAID on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect 
to 10.192.16.161: Connection reset by peer [17:39:53] PROBLEM - swift-container-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:39:57] PROBLEM - Disk space on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:39:57] PROBLEM - DPKG on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:40:09] PROBLEM - Check size of conntrack table on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:40:11] PROBLEM - swift-container-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:13] PROBLEM - swift-object-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:23] PROBLEM - configured eth on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer [17:40:27] PROBLEM - swift-account-replicator on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:31] PROBLEM - swift-object-updater on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:33] PROBLEM - swift-account-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [17:40:45] PROBLEM - swift-container-auditor on ms-be2019 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.161: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Swift [17:41:05] probably nagios-nrpe-server crashed.. looking [17:41:05] RECOVERY - MD RAID on ms-be2019 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [17:41:05] RECOVERY - swift-container-updater on ms-be2019 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [17:41:08] eh, ok [17:41:09] RECOVERY - Disk space on ms-be2019 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:41:09] RECOVERY - DPKG on ms-be2019 is OK: All packages OK [17:41:21] RECOVERY - Check size of conntrack table on ms-be2019 is OK: OK: nf_conntrack is 7 % full [17:41:23] RECOVERY - swift-container-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [17:41:25] RECOVERY - swift-object-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [17:41:31] noisy though [17:41:35] RECOVERY - configured eth on ms-be2019 is OK: OK - interfaces up [17:41:39] RECOVERY - swift-account-replicator on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [17:41:41] yea, noisy for being just a single server [17:41:41] RECOVERY - swift-object-updater on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [17:41:45] RECOVERY - swift-account-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [17:41:51] did not restart anything [17:41:53] RECOVERY - swift-object-server on ms-be2019 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server
https://wikitech.wikimedia.org/wiki/Swift [17:41:57] RECOVERY - swift-container-auditor on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [17:41:57] RECOVERY - swift-object-auditor on ms-be2019 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [17:41:59] RECOVERY - swift-account-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [17:41:59] RECOVERY - dhclient process on ms-be2019 is OK: PROCS OK: 0 processes with command name dhclient [17:42:05] RECOVERY - swift-container-server on ms-be2019 is OK: PROCS OK: 41 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [17:42:09] RECOVERY - very high load average likely xfs on ms-be2019 is OK: OK - load average: 27.53, 29.85, 28.51 https://wikitech.wikimedia.org/wiki/Swift [17:42:09] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2019 is OK: OK ferm input default policy is set [17:42:17] RECOVERY - swift-account-reaper on ms-be2019 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [17:42:23] that is how it looks when nagios-nrpe gets killed due to OOM [17:42:35] Yah i'm familiar :) [17:42:41] but this time i did nothing to fix it [17:42:55] the swift backend machines in both codfw and eqiad are busy right now, lots of data moving around because of decomming some hosts in the cluster [17:43:06] ack, gotcha [17:43:21] I am overdue for lunch but I'll spend some time looking if there's an easy way to rate-limit replication [17:43:56] I have to run too, but +1 to what cdanis said [17:44:54] alright, no rush. 
enjoy lunch & dinner [17:44:57] RECOVERY - puppet last run on ms-be2019 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [17:47:16] 10Operations, 10Data-Services, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): Decommission labsdb1006.eqiad.wmnet and labsdb1007.eqiad.wmnet - https://phabricator.wikimedia.org/T220144 (10Bstorm) [17:47:41] thanks mutante ! [17:48:15] but yeah from past experience the cluster can freak out a little right after a rebalancing has begun and then settles [17:49:29] gotta go! [17:50:18] yep! laters [17:51:55] RECOVERY - Disk space on contint1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [17:52:38] !log contint1001 - for logfile in $(find /var/log/zuul/ ! -name "*.gz"); do gzip $logfile; done to get more disk space (T207707) [17:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:44] T207707: contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 [17:57:14] 10Operations, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (10Dzahn) p:05High→03Normal gzipping all files in /var/log/zuul that were not already gzipped saved almost... [18:03:00] (03CR) 10CDanis: "FWIW I feel incompetent to review this change; well beyond my Puppet knowledge.
Sorry :)" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [18:10:31] PROBLEM - swift-object-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:10:31] PROBLEM - swift-container-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:10:31] PROBLEM - swift-container-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:11:35] PROBLEM - DPKG on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:11:51] PROBLEM - swift-container-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:11:59] PROBLEM - swift-account-reaper on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:05] PROBLEM - Check systemd state on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:12:15] PROBLEM - Disk space on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [18:12:17] PROBLEM - MD RAID on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:12:21] PROBLEM - swift-account-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:23] PROBLEM - swift-account-server on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer 
https://wikitech.wikimedia.org/wiki/Swift [18:12:37] PROBLEM - swift-object-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:37] PROBLEM - very high load average likely xfs on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:47] PROBLEM - configured eth on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:12:53] PROBLEM - swift-account-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:12:55] PROBLEM - dhclient process on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:13:01] PROBLEM - Check size of conntrack table on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:13:03] PROBLEM - swift-object-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:03] PROBLEM - swift-container-server on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:09] PROBLEM - swift-container-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:34] yea.. 
now that's known [18:13:43] PROBLEM - swift-object-server on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:13:43] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:14:21] PROBLEM - HP RAID on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:14:57] PROBLEM - MD RAID on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:14:59] PROBLEM - swift-account-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:15:29] PROBLEM - puppet last run on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer [18:15:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:15:39] PROBLEM - swift-object-replicator on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:15:47] PROBLEM - swift-container-auditor on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:15:47] PROBLEM - swift-container-updater on ms-be2031 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.13: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [18:16:19] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2031 is OK: OK ferm input default policy is set [18:16:21] RECOVERY - swift-object-server on ms-be2031 is OK: PROCS OK: 101 processes with regex args 
^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [18:16:21] RECOVERY - swift-account-server on ms-be2031 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [18:16:27] RECOVERY - very high load average likely xfs on ms-be2031 is OK: OK - load average: 35.97, 47.61, 45.85 https://wikitech.wikimedia.org/wiki/Swift [18:16:27] RECOVERY - swift-object-updater on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [18:16:39] RECOVERY - configured eth on ms-be2031 is OK: OK - interfaces up [18:16:45] RECOVERY - swift-account-auditor on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [18:16:45] RECOVERY - dhclient process on ms-be2031 is OK: PROCS OK: 0 processes with command name dhclient [18:16:47] RECOVERY - DPKG on ms-be2031 is OK: All packages OK [18:16:51] RECOVERY - Check size of conntrack table on ms-be2031 is OK: OK: nf_conntrack is 3 % full [18:16:51] RECOVERY - swift-object-replicator on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [18:16:51] RECOVERY - swift-container-server on ms-be2031 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-container-replicator on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-object-auditor on ms-be2031 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-container-updater on ms-be2031 
is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [18:16:59] RECOVERY - swift-container-auditor on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [18:17:09] RECOVERY - swift-account-reaper on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [18:17:15] RECOVERY - Check systemd state on ms-be2031 is OK: OK - running: The system is fully operational [18:17:25] RECOVERY - Disk space on ms-be2031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [18:17:26] !log sudo icinga-downtime -h ms-be2031 -r swift-rebalancing -d 86400 [18:17:27] RECOVERY - MD RAID on ms-be2031 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [18:17:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:31] RECOVERY - swift-account-replicator on ms-be2031 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [18:20:41] RECOVERY - puppet last run on ms-be2031 is OK: OK: Puppet is currently enabled, last run 28 minutes ago with 0 failures [18:20:47] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:23:33] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) Hi @wiki_willy your shell account has been created. You should be able to ssh to the following hosts: - bastion hosts, to jump to other hosts in the internal netw... [18:25:00] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10wiki_willy) Thanks @Dzahn , much appreciated. 
~Willy [18:25:55] (03CR) 10Herron: [C: 03+1] "Thanks for this!" [puppet] - 10https://gerrit.wikimedia.org/r/505737 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:30:19] RECOVERY - Check systemd state on ms-be2019 is OK: OK - running: The system is fully operational [18:37:31] (03PS17) 10CRusnov: Port MakeVM to a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) [18:40:24] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) [18:41:43] 10Operations, 10DC-Ops, 10SRE-Access-Requests, 10Patch-For-Review: Willy Pao onboarding - https://phabricator.wikimedia.org/T221142 (10Dzahn) @wiki_willy You're welcome. I also just added you to root@ mail, mainly because then you receive noc@ mail which is an alias for it. Prepare for a _little_ more mail... [18:43:09] (03CR) 10Herron: "> > Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:44:19] (03CR) 10Herron: [C: 03+1] kafka shipper: add ulogd to kafka forwarding rules [puppet] - 10https://gerrit.wikimedia.org/r/505750 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:45:05] (03CR) 10Herron: [C: 03+1] "👍" [puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [18:45:49] !log mw1297 - scap pull [18:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1297.eqiad.wmnet,cluster=api_appserver [18:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:22] !log pooled mw1297 as a new API server (T192457) [18:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:28] T192457: Reallocate former image scalers - 
https://phabricator.wikimedia.org/T192457 [18:47:49] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:49:52] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [18:49:56] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) 05Open→03Stalled Ticket is done besides one check box and that is T215332 unless a different server is used, making sure in T215332#5133171. [18:50:43] (03CR) 10Jbond: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/505783 (https://phabricator.wikimedia.org/T220987) (owner: 10Jbond) [18:51:00] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [18:55:51] 10Operations, 10cloud-services-team (Kanban): cumin: leaked aliases - https://phabricator.wikimedia.org/T221788 (10Dzahn) almost duplicate of T221125 [18:59:54] hi, I've got the following notification and hope to find help here: Puppet is failing to run on the "wikimedia-ui.design.eqiad.wmflabs" instance in Wikimedia Cloud VPS. [18:59:58] :) [19:01:31] 10Operations, 10ops-codfw: wtp2019 shows error messages in the racadm getsel's output - https://phabricator.wikimedia.org/T221572 (10Dzahn) History of this host: * wtp2019 - hardware (RAM) check (T146113) * wtp2019 has faulty memory (T146009) * wtp2019 issues an uncorrectable memory error (T148710) * wtp2019.... [19:02:27] Volker_E: on the host there should be a /var/log/puppet.log with additional info as to why it failed to run [19:08:29] (03CR) 10Cwhite: "This change looks like a step in the right direction, but I don't see where $_role comes from. 
Is it a datapoint added by the role() func" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [19:09:40] (03CR) 10Cwhite: [C: 03+1] "Looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/505741 (https://phabricator.wikimedia.org/T220103) (owner: 10Filippo Giunchedi) [19:11:33] PROBLEM - swift-account-server on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:11] PROBLEM - MD RAID on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:12:33] PROBLEM - swift-account-replicator on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:47] PROBLEM - Disk space on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:12:49] PROBLEM - swift-container-replicator on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:53] PROBLEM - swift-account-reaper on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:12:53] PROBLEM - swift-container-auditor on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:01] PROBLEM - swift-container-updater on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:07] PROBLEM - very high load average likely xfs on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:09] 
PROBLEM - configured eth on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:13:23] PROBLEM - Check size of conntrack table on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:13:23] PROBLEM - DPKG on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer [19:13:25] PROBLEM - swift-account-auditor on ms-be2033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.16.15: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [19:13:31] RECOVERY - MD RAID on ms-be2033 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [19:13:47] RECOVERY - swift-account-replicator on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [19:13:57] RECOVERY - Disk space on ms-be2033 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [19:14:01] RECOVERY - swift-container-replicator on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator https://wikitech.wikimedia.org/wiki/Swift [19:14:05] RECOVERY - swift-account-reaper on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [19:14:05] RECOVERY - swift-container-auditor on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [19:14:05] RECOVERY - swift-account-server on ms-be2033 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server https://wikitech.wikimedia.org/wiki/Swift [19:14:13] RECOVERY - swift-container-updater on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift 
[19:14:19] RECOVERY - very high load average likely xfs on ms-be2033 is OK: OK - load average: 56.72, 56.58, 50.93 https://wikitech.wikimedia.org/wiki/Swift [19:14:21] RECOVERY - configured eth on ms-be2033 is OK: OK - interfaces up [19:14:35] RECOVERY - Check size of conntrack table on ms-be2033 is OK: OK: nf_conntrack is 3 % full [19:14:35] RECOVERY - DPKG on ms-be2033 is OK: All packages OK [19:14:37] RECOVERY - swift-account-auditor on ms-be2033 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [19:20:06] (03CR) 10CRusnov: [C: 03+2] Port MakeVM to a cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: 10CRusnov) [19:22:31] (03CR) 10Jforrester: [C: 03+1] "Eurgh." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: 10Matthias Mullie) [19:23:39] (03CR) 10Reedy: Allow cross-site requests from mobile domains (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: 10Matthias Mullie) [19:32:47] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) ` papaul@asw-a-codfw# run show interfaces ge-6/0/17 descriptions Interface Admin Link Description ge-6/0/17 down down DISABLED... [19:34:44] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission db2014,db2020, db2021, db2022, db2024, db2031 - https://phabricator.wikimedia.org/T221424 (10Papaul) [19:39:58] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [19:40:37] mutante, shall I write a post? [19:41:22] Krenair: yes please :) [19:42:44] mutante, do we have a particular date/time in mind? [19:43:29] uhm.. 
no.. should we merge before sending the post? [19:44:05] (03PS1) 10ArielGlenn: allow section, list of dbs or list of wikis stand alone as arg [software] - 10https://gerrit.wikimedia.org/r/506225 [19:44:13] Krenair: or we can put it on deployment calendar. what do you think? [19:44:53] (03CR) 10jerkins-bot: [V: 04-1] allow section, list of dbs or list of wikis stand alone as arg [software] - 10https://gerrit.wikimedia.org/r/506225 (owner: 10ArielGlenn) [19:45:08] i could add it to a puppet swat window.. just so that there is a scheduled time [19:50:06] (03CR) 10Brian Wolff: [C: 03+1] Allow cross-site requests from mobile domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506151 (https://phabricator.wikimedia.org/T221734) (owner: 10Matthias Mullie) [19:55:20] mutante, not sure. [19:57:29] !log mobrovac@deploy1001 Started deploy [restbase/deploy@8a6b6fc] (dev-cluster): Switch Parsoid stashing to simple key/value [19:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:37] mutante, realistically this probably breaks old versions of IE [19:57:46] and a bunch of unsupported other stuff [20:00:04] cscott, arlolra, subbu, bearND, and halfak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T2000). [20:01:47] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@8a6b6fc] (dev-cluster): Switch Parsoid stashing to simple key/value (duration: 04m 18s) [20:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:51] https://etherpad.wikimedia.org/p/g505410announce [20:10:54] thanks Krenair! [20:18:17] Krenair: thank you for the etherpad. i will look at getting it on the calendar [20:19:57] PROBLEM - Check systemd state on ms-be1037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
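The "Check systemd state" alert on ms-be1037 above keys off `systemctl is-system-running`, which reports `degraded` when one or more units have failed. A rough sketch of how that output maps to the Icinga messages seen in this log (the `state_summary` helper is hypothetical; the message strings are taken from the alerts themselves):

```shell
# Hypothetical mapping from `systemctl is-system-running` output to the
# Icinga status lines that appear in this log.
state_summary() {
  case $1 in
    running)  echo 'OK - running: The system is fully operational' ;;
    degraded) echo 'CRITICAL - degraded: The system is operational but one or more units failed' ;;
    *)        echo "UNKNOWN - $1" ;;
  esac
}

state_summary degraded
```

On the affected host, `systemctl --failed` would then list which unit actually failed.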
[20:21:02] !log mobrovac@deploy1001 Started deploy [restbase/deploy@8a6b6fc]: Parsoid storage simplification step 1: switch Parsoid stashing to simple key/value - T215956 [20:21:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:21:08] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [20:32:47] (03PS1) 10Sbisson: Cleanup old EchoCrossWikiBetaFeature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) [20:35:11] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) [20:35:27] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) a:05Bstorm→03None [20:35:58] Krenair: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1824248&oldid=1824244 [20:40:18] (03CR) 10Jforrester: [C: 03+1] "Looks good." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/506316 (https://phabricator.wikimedia.org/T221260) (owner: 10Sbisson) [20:40:48] (03CR) 10Dzahn: [C: 03+1] "added to Deployment calendar in the Puppet SWAT section tomorrow: https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revisi" [puppet] - 10https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: 10Alex Monk) [20:41:22] thanks mutante [20:41:41] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@8a6b6fc]: Parsoid storage simplification step 1: switch Parsoid stashing to simple key/value - T215956 (duration: 20m 39s) [20:41:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:46] T215956: Consider stashing data-parsoid for VE - https://phabricator.wikimedia.org/T215956 [20:44:32] Krenair: feel free to send the mail to wikitech-l and i reply with the calendar link or just add it. still making sure there are no concerns in -releng [20:44:53] am waiting for -releng too [20:44:53] but "likely shortly" is true [20:44:56] ok [20:46:05] RECOVERY - Check systemd state on ms-be1037 is OK: OK - running: The system is fully operational [20:56:50] (03PS1) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [21:07:11] (03PS1) 10CDanis: swift-object-replicator: nice & ionice it [puppet] - 10https://gerrit.wikimedia.org/r/506321 [21:09:27] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:11:15] (03PS1) 10CDanis: standard_packages: add iotop [puppet] - 10https://gerrit.wikimedia.org/r/506322 [21:14:43] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 25 probes of 409 (alerts on 35) - 
https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:19:21] (03CR) 10CDanis: [C: 03+2] standard_packages: add iotop [puppet] - 10https://gerrit.wikimedia.org/r/506322 (owner: 10CDanis) [21:20:09] PROBLEM - Router interfaces on cr1-eqsin is CRITICAL: CRITICAL: host 103.102.166.129, interfaces up: 72, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:21:01] RECOVERY - HP RAID on ms-be2031 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [21:23:35] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 66 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:24:07] RECOVERY - Router interfaces on cr1-eqsin is OK: OK: host 103.102.166.129, interfaces up: 74, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:26:13] XioNoX: cr1-eqsin - interface down. type: Peering: Equinix Singapore (WIKIMEDIA-SG1-IX-00, MAC filter) . i am not sure i can identify who from looking at https://netbox.wikimedia.org/circuits/providers/ ticket worthy? [21:27:52] well.. 
and it recovered [21:28:53] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 29 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:29:24] (03PS2) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [21:31:50] !log icinga-downtime -h ms-be2038 -r swift-rebalancing -d 86400 [21:31:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:36] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/16003/" [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [21:46:45] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Tarrow) We are indeed using service-runner; I don't think this provides /_info or /?spec though? Or are we missing something? 
[21:49:03] XioNoX: mr1-ulsfo wil blip ] [21:49:05] it just power cycled [21:51:05] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:05] PROBLEM - Host bast4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:09] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:51:35] PROBLEM - Host mr1-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [21:52:17] PROBLEM - Host ripe-atlas-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [21:52:25] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:25] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:30] oh [21:52:55] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:55] PROBLEM - Host dns4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:52:58] oic [21:53:08] robh: sounds expected then.. 
pheew [21:53:09] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:53:33] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:53:33] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:53:33] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:53:47] RECOVERY - Host mr1-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [21:53:49] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 78.20 ms [21:54:29] RECOVERY - Host cp4030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.21 ms [21:55:37] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.96 ms [21:56:29] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.01 ms [21:56:29] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.79 ms [21:56:29] RECOVERY - Host cp4022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.83 ms [21:56:29] RECOVERY - Host bast4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [21:56:29] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.85 ms [21:56:32] that was a bit more than just mr1? [21:56:33] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 79.10 ms [21:56:33] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 81.24 ms [21:56:33] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 80.89 ms [21:56:47] (03PS6) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [21:56:53] oh right, because those hosts' mgmt interfaces would be linked via mr1 [21:57:29] there are more details in the -dcops channel. they are working on power [21:57:36] ripe-atlas-ulsfo too? 
[21:57:41] RECOVERY - Host ripe-atlas-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 74.29 ms [21:57:46] yes, all of them (the regular hosts) were mgmt [21:57:49] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.87 ms [21:58:19] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.86 ms [21:58:19] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.99 ms [21:58:33] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 73.05 ms [21:58:57] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.81 ms [21:58:57] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.08 ms [21:58:57] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 74.78 ms [21:59:23] mgmt goes down when mr1 goes down [21:59:26] but its ok and expected [22:00:23] yeah I made that mental link shortly afterwards, what about ripe-atlas though? [22:00:26] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [22:01:39] ripe atlas is just a ping destination device [22:01:49] and is single power supply so normalizing power on racks means it gets unplugged some [22:01:55] but it comes back on its own with no interference [22:02:17] basically today we're normalizing and labeling every single power input/cable in the racks into netbox [22:02:23] ah [22:02:30] so it involved moving power plugs around in the rack [22:02:38] so server X uses port 2 on both A and B towers, etc... 
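robh's explanation boils down to: during the PDU re-cabling, a device rides through work on one power tower only if it has a second feed on the other tower, which is why the dual-fed cp servers stayed up while mr1, ripe-atlas, and the mgmt switches blipped. A toy illustration of that rule (the function and feed counts are made up for the example):

```shell
# Toy model of the dual- vs single-feed behaviour described above:
# with 2+ independent feeds a device survives losing one tower,
# with a single feed it goes down.
survives_tower_work() {
  local feeds=$1   # number of independent power feeds
  if [ "$feeds" -ge 2 ]; then echo up; else echo down; fi
}

survives_tower_work 2   # dual-feed server, e.g. a cp40xx host -> up
survives_tower_work 1   # single-feed device, e.g. mr1 or ripe-atlas -> down
```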
[22:02:50] for 99% of the stuff, its dual power feeds so no one notices [22:03:03] but mr1, atlas, and mgmt switches are single power supply fed, so they go down for this [22:03:08] yeah [22:04:26] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) https://docs.google.com/spreadsheets/d/13XPw-PyFqUUO5oqeljQvwpN9N7aWT5Yyii00GJrkeaE/edit?usp=sharing has all the power connections documented, I'll import that into netbox shortly [22:07:57] sorry about the channel spam =] [22:09:01] (03PS3) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [22:09:51] no worries I was just curious is all [22:09:55] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10jijiki) @Dzahn I need to to talk with our team before I green light this, also mentioned in T221132. Is it Possible to revisit this in a week from now? Thank you! 
[22:10:03] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 41 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:10:21] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 55 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:11:24] 10Operations: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10faidon) [22:12:33] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 49 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:13:17] (03CR) 10Dzahn: [C: 03+2] varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) (owner: 10Dzahn) [22:13:25] (03PS7) 10Dzahn: varnish/trafficserver: add regex to cover www.wikiba.se as well [puppet] - 10https://gerrit.wikimedia.org/r/500715 (https://phabricator.wikimedia.org/T99531) [22:13:54] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) [22:13:59] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (10jijiki) 05Open→03Resolved a:03jijiki @Joe @Krinkle, since we have pushed enable_dl => 0 to production, I am resolving this. Feel free to reop... 
[22:14:30] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) [22:14:43] 10Operations, 10ops-ulsfo: Degraded RAID on cp4032 - https://phabricator.wikimedia.org/T219586 (10RobH) 05Open→03Invalid not sure why that check made this task, as the output in the task description shows a perfectly fine raid. checked the system manually as well, no errors in systlog either [22:15:21] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 17 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:15:39] 10Operations, 10ops-ulsfo: Degraded RAID on cp4032 - https://phabricator.wikimedia.org/T219586 (10Dzahn) It looks like the cause wasn't an actual RAID failure but a networking or DNS failure: ` connect to address 10.128.0.132 port 5666: No route to host ` [22:18:29] PROBLEM - swift-object-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.48.83: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:18:56] !log icinga-downtime -h ms-be2039 -r swift-rebalancing -d 86400 [22:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:29] !log deploying varnish/trafficserver change to cover www.wikiba.se (not prod yet) [22:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:20:35] PROBLEM - Check systemd state on ms-be2013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
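The `icinga-downtime` invocations logged during the swift rebalancing all share one shape: `-h HOST -r REASON -d SECONDS`. A tiny wrapper that reproduces the exact command line from the log (`icinga-downtime` itself is the WMF helper script on the Icinga host and is not defined here; only the flag order is taken from the log):

```shell
# Build the downtime command as logged above; -d defaults to 86400s (1 day),
# the duration used for the swift-rebalancing downtimes in this log.
downtime_cmd() {
  local host=$1 reason=$2 seconds=${3:-86400}
  printf 'icinga-downtime -h %s -r %s -d %s\n' "$host" "$reason" "$seconds"
}

downtime_cmd ms-be2039 swift-rebalancing
# -> icinga-downtime -h ms-be2039 -r swift-rebalancing -d 86400
```

This only formats the command; actually silencing a host still requires running the real script with Icinga access.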
[22:21:49] PROBLEM - Docker registry HTTPS interface on darmstadtium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [22:22:29] RECOVERY - swift-object-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator https://wikitech.wikimedia.org/wiki/Swift [22:23:01] RECOVERY - Docker registry HTTPS interface on darmstadtium is OK: HTTP OK: HTTP/1.1 200 OK - 2482 bytes in 0.656 second response time https://wikitech.wikimedia.org/wiki/Docker [22:23:09] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 14 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:23:17] 10Operations, 10decommission, 10hardware-requests, 10Patch-For-Review: reimage WMF6937/mw1298 - https://phabricator.wikimedia.org/T215332 (10Dzahn) @jijiki Yes, of course it can wait, i just realized again the holiday situation. [22:26:15] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 15 probes of 409 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [22:26:20] 10Operations, 10PHP 7.2 support, 10Patch-For-Review, 10Performance-Team (Radar), 10User-jijiki: Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jijiki) We have pushed https://gerrit.wikimedia.org/r/502986 and (its update) https://gerrit.wikimed... [22:31:03] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10ayounsi) Thanks, sounds good! There is nothing special to do, make sure to reserve it in DNS, eg. 208.80.155.119/2620:0:861:... 
[22:31:30] 10Operations, 10ops-ulsfo, 10Patch-For-Review: ulsfo: setup ulsfo PDUs - https://phabricator.wikimedia.org/T209101 (10RobH) [22:36:56] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) Great! I'll sort that out, then. [22:37:08] 10Operations, 10netops, 10cloud-services-team (Kanban): Allocate VIP for failover of the maps home and project mounts on cloudstore1008/9 - https://phabricator.wikimedia.org/T221806 (10Bstorm) a:03Bstorm [22:44:45] PROBLEM - HP RAID on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:15] PROBLEM - swift-account-reaper on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:17] PROBLEM - swift-object-updater on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:21] PROBLEM - swift-object-auditor on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:21] PROBLEM - swift-account-replicator on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:27] PROBLEM - Check size of conntrack table on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:31] PROBLEM - configured eth on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:35] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer [22:45:39] 
PROBLEM - swift-object-server on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:47] PROBLEM - very high load average likely xfs on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:49] PROBLEM - swift-account-auditor on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:45:57] PROBLEM - swift-container-server on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:46:01] PROBLEM - swift-container-auditor on ms-be2034 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.32.164: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [22:46:11] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.64.16.82: Connection reset by peer [22:46:27] 10Operations, 10Traffic, 10Wikidata, 10serviceops, and 4 others: [Task] move wikiba.se webhosting to wikimedia cluster - https://phabricator.wikimedia.org/T99531 (10Dzahn) >>! In T155359#5108338, @Dzahn wrote: > Next is deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/500715 for T99531#50771... 
[22:46:56] !log icinga-downtime -h ms-be2034 -r swift-rebalancing -d 86400 [22:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:48:54] 10Operations, 10ops-ulsfo: ulsfo netbox updates - https://phabricator.wikimedia.org/T221785 (10RobH) updated all but the atlas, which has no serial connection to query for serial number [22:49:41] RECOVERY - very high load average likely xfs on ms-be2034 is OK: OK - load average: 39.99, 47.31, 45.74 https://wikitech.wikimedia.org/wiki/Swift [22:49:45] RECOVERY - swift-account-auditor on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [22:49:49] RECOVERY - swift-container-server on ms-be2034 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server https://wikitech.wikimedia.org/wiki/Swift [22:49:53] RECOVERY - swift-container-auditor on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor https://wikitech.wikimedia.org/wiki/Swift [22:50:25] RECOVERY - swift-object-updater on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater https://wikitech.wikimedia.org/wiki/Swift [22:50:25] RECOVERY - swift-account-reaper on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper https://wikitech.wikimedia.org/wiki/Swift [22:50:29] RECOVERY - swift-account-replicator on ms-be2034 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator https://wikitech.wikimedia.org/wiki/Swift [22:50:29] RECOVERY - swift-object-auditor on ms-be2034 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor https://wikitech.wikimedia.org/wiki/Swift [22:50:37] RECOVERY - Check size of conntrack table on ms-be2034 is OK: OK: nf_conntrack is 4 % full [22:50:41] RECOVERY - configured eth on 
ms-be2034 is OK: OK - interfaces up [22:50:45] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2034 is OK: OK ferm input default policy is set [22:50:49] RECOVERY - swift-object-server on ms-be2034 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server https://wikitech.wikimedia.org/wiki/Swift [22:54:41] (03PS1) 10Bstorm: cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) [22:55:01] (03CR) 10jerkins-bot: [V: 04-1] cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) (owner: 10Bstorm) [22:55:59] RECOVERY - Check systemd state on ms-be2013 is OK: OK - running: The system is fully operational [22:57:41] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4 - Controller: OK - Battery/Capacitor: OK [23:00:04] hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190424T2300). [23:00:04] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:15] I'm here [23:00:23] Guess I can do it [23:00:29] cool [23:01:29] it's a cherry-pick for a maintenance script, so shouldn't make any trouble [23:03:04] Who will run it? 
[23:03:11] I will [23:04:00] (03PS2) 10Bstorm: cloudstore: add floating IP for the maps homes/project NFS [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) [23:06:13] (03CR) 10Bstorm: "Just checking my approach. This is how we do it on tools NFS, but this is public." [dns] - 10https://gerrit.wikimedia.org/r/506327 (https://phabricator.wikimedia.org/T221806) (owner: 10Bstorm) [23:08:04] (03CR) 10Cwhite: [C: 03+1] "Looks good to me. I hope Giuseppe will give more context around the prior attempts." [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [23:09:15] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:09:19] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:09:21] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:10:39] 10Operations, 10ops-ulsfo, 10netops: Interface errors on cr4-ulsfo:et-0/0/1 - https://phabricator.wikimedia.org/T205937 (10RobH) 05Open→03Resolved a:03RobH [23:10:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:10:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:11:17] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:01] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [23:13:15] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:13:28] (03PS4) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [23:17:11] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:17:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:29] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:18:31] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:18:36] 10Operations, 10Domains, 10Traffic, 10WMF-Legal, 10Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (10Dzahn) Hi @tramm Any update on this from your side? [23:20:26] @robh I see recoveries - are we good to deploy? [23:23:41] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [23:23:45] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [23:23:57] Anyone? [23:24:51] MaxSem: i dont know the details of the ongoing work but i saw the other channel and yes i see it all recovered on the graphs. 
so afaict, yes [23:25:15] Cool, thanks [23:26:04] (03PS1) 10Bstorm: toolforge: iotop was added to standard_packages, removing from exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/506329 [23:27:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [23:27:45] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:27:47] (03CR) 10Bstorm: [C: 03+2] toolforge: iotop was added to standard_packages, removing from exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/506329 (owner: 10Bstorm) [23:27:57] (03CR) 10Dzahn: "caused a duplicate declaration on tool labs -> https://gerrit.wikimedia.org/r/c/operations/puppet/+/506329" [puppet] - 10https://gerrit.wikimedia.org/r/506322 (owner: 10CDanis) [23:28:01] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:30:20] 23:27:19 sync-file failed: Command 'find -O2 '/srv/mediawiki-staging/php-1.34.0-wmf.1/extensions/CirrusSearch' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php -l >/dev/null 2>&1' returned non-zero exit status 125 [23:32:25] hmm [23:32:29] what's that [23:33:01] merge conflict? [23:34:51] Don't see conflict tokens anywhere [23:35:19] hmm [23:35:40] MaxSem: does it say which file it has trouble with? 
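A note on the `exit status 125` in the sync-file failure above: GNU xargs reserves its high exit codes for distinct failure modes, so 125 already narrows the diagnosis before any debugging. Per xargs(1), 125 means one of the spawned commands was killed by a signal, whereas an invocation that merely exited nonzero would surface as 123. A minimal, self-contained sketch of that convention (plain POSIX sh under GNU xargs; `lint-stub` is just a placeholder $0 for the child shell, not anything from the log):

```shell
# GNU xargs exit-status conventions (xargs(1)):
#   123  some invocation exited with status 1-125
#   124  an invocation exited with status 255
#   125  an invocation was killed by a signal
#   126  command found but not executable; 127 command not found
# Simulate a child that dies from SIGABRT (signal 6), as a crashing linter
# would, and observe xargs' own exit status:
echo dummy | xargs -n1 sh -c 'kill -ABRT "$$"' lint-stub 2>/dev/null
echo "xargs exit status: $?"
```

On GNU xargs this prints `xargs exit status: 125`; an ordinary failing lint run (command exiting 1) would instead yield 123, which is why 125 points at a crashed child rather than a plain syntax error.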
[23:35:57] Nah, xargs seems to swallow that :O [23:37:09] weird, it's a pretty simple patch only touching 2 php files [23:37:58] I ran it manually and got this: xargs: php: terminated by signal 6 [23:38:04] this is weird [23:38:57] MaxSem: this seems to also happen on old code, pre-patch [23:39:22] i.e. if I run it on mwmaint1002 now, it doesn't have the patch and exit code still 125 [23:39:59] in fact php-1.33.0-wmf.25 produces the same [23:40:05] is it some kind of new check? [23:40:37] HHVM vs. PHP7 shenanigans? [23:40:57] find -O2 '/srv/mediawiki-staging/php-1.34.0-wmf.1/extensions/CirrusSearch' -not -type d -name '*.php' -not -name 'autoload_static.php' -or -name '*.inc' | xargs -n1 -P30 -exec php7.2 -l | grep -v 'No syntax' [23:40:57] no idea, but definitely not new for this patch... [23:41:05] ^ runs without problems [23:41:40] yeah signal 6 is SIGABRT... I have no idea why hhvm is doing that [23:41:49] let me try to find which file it is [23:43:10] For Satan's sake... [23:43:27] (03CR) 10Bstorm: "Ok, this passes the puppet compiler for various projects at this point. The question I need to check on is whether a value for $mode or $" [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [23:43:42] (03PS5) 10Bstorm: cloudstore: refactor nfsclient role into profile [puppet] - 10https://gerrit.wikimedia.org/r/506319 (https://phabricator.wikimedia.org/T209527) [23:43:47] HHVM just sets exit code and doesn't output anything in case of syntax errors... [23:44:15] Because why TF be user friendly? [23:44:53] Or not? [23:46:23] I wonder if this has something to do with -P [23:46:56] Nah, I forgot that everything that doesn't have nope produces signal 6 even without parallel [23:51:19] MaxSem: i didn't have anything to do with those =] [23:51:32] Can you prove that? [23:51:54] can you disprove? ;D [23:53:48] well this is definitely weird... with xargs signal 6 is reproducible...
but I can [23:53:56] I can not cause it without xargs [23:55:17] yeah when I run them without xargs every file passes with code 0 [23:55:57] 6877 Aborted (core dumped) [23:56:19] oh cool what causes this? [23:56:35] That's a harder question XD [23:56:57] well you should have core, right? [23:57:02] so that might give some clues maybe [23:57:54] (03CR) 10CDanis: "Sorry for the breakage when I modified standard_packages! I didn't even know this could happen :\" [puppet] - 10https://gerrit.wikimedia.org/r/506329 (owner: 10Bstorm)
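A postscript on the "xargs seems to swallow that" problem above: running the linter one file at a time, outside xargs, lets the shell report which path triggered the abort, since a signal-killed child shows up as status 128+signo (134 for SIGABRT). A self-contained sketch of that approach, with everything stubbed: `lint` stands in for `php -l` (it aborts on files containing BAD, mimicking HHVM's crash on a syntax error) and a temp dir stands in for the CirrusSearch tree; for the real case one would keep the scap command's find expression and call the real interpreter.

```shell
# Lint one file at a time so the path that kills the linter is printed
# instead of being swallowed by `xargs -P30`.
set -u
tmp=$(mktemp -d)
printf 'fine\n' > "$tmp/good.php"
printf 'BAD\n'  > "$tmp/bad.php"

lint() {
    # Subshell, so the simulated SIGABRT kills only the "linter".
    ( if grep -q BAD "$1"; then sh -c 'kill -ABRT "$$"'; fi )
}

for f in "$tmp"/*.php; do
    lint "$f" >/dev/null 2>&1
    status=$?
    # A signal-killed child is reported as 128+signo: SIGABRT shows as 134.
    [ "$status" -ne 0 ] && echo "lint failed (status $status): ${f##*/}"
done
rm -rf "$tmp"
```

With the real linter, a status of 134 fingers the file that makes HHVM abort, while an ordinary parse error would show up as a small nonzero status and could be reported the same way.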