[00:34:33] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (AndrewKuznetsov) Chiming in here: the end date is August 31st - this can be confirmed by @BGerdemann
[02:09:22] <wikibugs>	 (PS2) Dzahn: gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 (owner: QChris)
[02:10:00] <wikibugs>	 (PS3) Dzahn: gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[02:12:57] <wikibugs>	 (CR) Dzahn: "is this hiera key actually used anywhere yet? it does not seem like it in compiler / repo." [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[02:14:27] <wikibugs>	 (CR) Dzahn: [C: +1] "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[02:16:25] <wikibugs>	 (CR) Dzahn: [C: +1] "confirmed. per https://www.gerritcodereview.com/2.15.html#support-for-draft-changes-removed already" [puppet] - https://gerrit.wikimedia.org/r/606533 (owner: QChris)
[02:17:03] <wikibugs>	 (CR) Dzahn: [C: +1] gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536 (owner: QChris)
[02:23:28] <wikibugs>	 (PS1) Dzahn: gerrit: remove all database parameters / support [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158)
[02:24:54] <wikibugs>	 (PS1) Dzahn: gerrit (cloud): remove SQL database hiera keys [puppet] - https://gerrit.wikimedia.org/r/606550
[02:26:03] <wikibugs>	 (PS2) Dzahn: gerrit: remove all database parameters / support [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158)
[03:18:54] <icinga-wm_>	 PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:40] <icinga-wm_>	 RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:35:55] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P11593 and previous config saved to /var/cache/conftool/dbconfig/20200619-043554-marostegui.json
[04:35:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:57] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P11594 and previous config saved to /var/cache/conftool/dbconfig/20200619-043956-marostegui.json
[04:40:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:59] <wikibugs>	 (PS1) Marostegui: db2108: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606554 (https://phabricator.wikimedia.org/T250666)
[04:44:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2108 for reimage', diff saved to https://phabricator.wikimedia.org/P11595 and previous config saved to /var/cache/conftool/dbconfig/20200619-044440-marostegui.json
[04:44:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:49] <wikibugs>	 (CR) Marostegui: [C: +2] db2108: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606554 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[05:15:58] <icinga-wm_>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[05:20:15] <icinga-wm_>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[05:23:12] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[05:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:46] <logmsgbot>	 !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:25:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:38] <wikibugs>	 (CR) Muehlenhoff: "Two comments inline and +1 to what Antoine wrote about the before =>" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[05:34:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2108', diff saved to https://phabricator.wikimedia.org/P11596 and previous config saved to /var/cache/conftool/dbconfig/20200619-053402-marostegui.json
[05:34:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:53] <wikibugs>	 (PS1) Marostegui: db2108: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606555
[05:36:26] <wikibugs>	 (CR) Marostegui: [C: +2] db2108: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606555 (owner: Marostegui)
[05:36:57] <icinga-wm_>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:39:11] <wikibugs>	 (PS1) Marostegui: mariadb: Reimage db2075 and db2111 to Buster [puppet] - https://gerrit.wikimedia.org/r/606556 (https://phabricator.wikimedia.org/T250666)
[05:40:05] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Reimage db2075 and db2111 to Buster [puppet] - https://gerrit.wikimedia.org/r/606556 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[05:41:15] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/606437 (https://phabricator.wikimedia.org/T239323) (owner: Jbond)
[05:41:19] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db2075 and db2111 for reimage', diff saved to https://phabricator.wikimedia.org/P11597 and previous config saved to /var/cache/conftool/dbconfig/20200619-054118-marostegui.json
[05:41:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:50] <wikibugs>	 (CR) Muehlenhoff: Add analytics-product system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[05:48:21] <icinga-wm_>	 PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:54:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P11598 and previous config saved to /var/cache/conftool/dbconfig/20200619-055430-marostegui.json
[05:55:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:14] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:01:00] <wikibugs>	 (CR) Marostegui: [C: -1] "I would prefer to send this alert to -operations like we do with the rest of them." [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[06:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:13] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:01:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:49] <logmsgbot>	 !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:02:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:26] <logmsgbot>	 !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:05:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:36] <wikibugs>	 (PS1) Marostegui: db2075, db2111: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606557
[06:16:22] <wikibugs>	 (CR) Marostegui: [C: +2] db2075, db2111: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606557 (owner: Marostegui)
[06:17:34] <wikibugs>	 (PS3) Muehlenhoff: Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933)
[06:18:38] <wikibugs>	 (PS1) Marostegui: db2132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606558 (https://phabricator.wikimedia.org/T254556)
[06:19:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2075 db2111', diff saved to https://phabricator.wikimedia.org/P11599 and previous config saved to /var/cache/conftool/dbconfig/20200619-061922-marostegui.json
[06:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:43] <marostegui>	 !log Stop mysql on db2132 to reimage m1 codfw master - T254556
[06:19:54] <wikibugs>	 (CR) Marostegui: [C: +2] db2132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606558 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[06:20:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:43] <stashbot>	 T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[06:23:00] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) (owner: Muehlenhoff)
[06:23:43] <icinga-wm_>	 PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:23:51] <marostegui>	 ^ expected
[06:36:57] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:36] <wikibugs>	 Operations, OTRS, User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (eyazi) I am doing OTRS upgrades on a daily basis and would love to help you guys out with the upgrading process.  I don't need direct access to the interface or any data if you...
[06:39:26] <logmsgbot>	 !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:39:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:56] <wikibugs>	 (PS1) Elukey: Revert "Set Bigtop for Hadoop test" [puppet] - https://gerrit.wikimedia.org/r/606605
[06:43:57] <wikibugs>	 (CR) Elukey: [C: +2] Revert "Set Bigtop for Hadoop test" [puppet] - https://gerrit.wikimedia.org/r/606605 (owner: Elukey)
[06:45:33] <icinga-wm_>	 RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:47:47] <moritzm>	 !log force reinstall of memcached 1.6 deb packages to ensure that the override is used in addition to the unmodified systemd unit from the deb T233933
[06:47:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:54] <stashbot>	 T233933: Replicated ticket registry - https://phabricator.wikimedia.org/T233933
[06:51:04] <wikibugs>	 (PS1) Marostegui: db2132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606634 (https://phabricator.wikimedia.org/T254556)
[06:51:36] <wikibugs>	 (CR) Marostegui: [C: +2] db2132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606634 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[06:51:40] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[06:51:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:38] <icinga-wm_>	 PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:53:46] <icinga-wm_>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:54:52] <icinga-wm_>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:55:19] <moritzm>	 kubetcd2006 is the ganeti reboot, this instance has plain disks
[06:57:10] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[06:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200619T0700)
[07:00:26] <icinga-wm_>	 RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms
[07:02:11] <moritzm>	 !log rebooting ganeti nodes in eqiad for kernel security updates
[07:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:35] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:04:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:32] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:10:14] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:10:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:42] <icinga-wm_>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:56] <icinga-wm_>	 RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:15:09] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:15:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:48] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:16:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:18] <wikibugs>	 (CR) Gilles: "I'd rather take care of metadata stripping/color profile substitution separately. ImageOptim doesn't care about color profiles and will st" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[07:22:45] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:22:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:01] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:23:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:59] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:29:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:33] <wikibugs>	 Operations, ops-codfw, netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (ayounsi) Thanks. I did a few small changes, mostly removing bits of the default config. It's good to go now.
[07:39:15] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:39:16] <wikibugs>	 (CR) Marostegui: [C: +1] "We've sync'ed on IRC and it has been cleared up." [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[07:39:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:22] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P11600 and previous config saved to /var/cache/conftool/dbconfig/20200619-074420-marostegui.json
[07:45:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:15] <wikibugs>	 (PS5) Kormat: mariadb: Add monitoring for lag spikes. [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120)
[07:47:00] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:21] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:47:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:49] <wikibugs>	 (CR) Kormat: mariadb: Add monitoring for lag spikes. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[07:52:16] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:52:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:05] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:54:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:08] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1093', diff saved to https://phabricator.wikimedia.org/P11601 and previous config saved to /var/cache/conftool/dbconfig/20200619-075907-marostegui.json
[07:59:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:02] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:02:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:58] <wikibugs>	 Operations, DBA: refactor mariadb puppet code to have single mapping of multiinstance section to port numbers - https://phabricator.wikimedia.org/T255849 (Kormat)
[08:06:04] <wikibugs>	 Operations, DBA: refactor mariadb puppet code to have single mapping of multiinstance section to port numbers - https://phabricator.wikimedia.org/T255849 (Kormat) p:Triage→Medium
[08:06:29] <wikibugs>	 (PS8) Kormat: [WIP] mariadb: Refactor puppetness. [puppet] - https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T255849)
[08:07:02] <wikibugs>	 (Abandoned) Kormat: [WIP] mariadb: Refactor puppetness. [puppet] - https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T255849) (owner: Kormat)
[08:12:26] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] mariadb: Add monitoring for lag spikes. [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[08:12:58] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:13:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:40] <godog>	 !log roll-restart logstash elk5 for "JVM GC Old generation-s runs" alert
[08:15:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:01] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:21:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:51] <wikibugs>	 (PS1) Filippo Giunchedi: logstash: bump pipeline workers [puppet] - https://gerrit.wikimedia.org/r/606647 (https://phabricator.wikimedia.org/T255243)
[08:26:58] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:27:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:56] <icinga-wm_>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:30:18] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good to me. Two nits inline, but feel free to ignore." (2 comments) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:34:27] <wikibugs>	 (CR) JMeybohm: Initial commit of debian directory (1 comment) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:35:26] <icinga-wm_>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-clu
[08:35:26] <icinga-wm_>	 &var-topic=All&var-consumer_group=All
[08:35:50] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: -1] "The fact the "average" reduction is reported as being "54%" worried me very much. This should not be possible, unless there is a major los" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[08:42:18] <icinga-wm_>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 23570 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:42:59] <godog>	 looking at the kafka lag alert
[08:45:00] <godog>	 !log roll restart elasticsearch_5@production-logstash-eqiad
[08:45:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:47] <volans>	 !log backup netbox and run one-time script to reserve first IPs on all infra prefixes on Netbox - T233183
[08:45:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:50] <stashbot>	 T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183
[08:46:59] <wikibugs>	 Operations, Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (hashar) Capture of a robot comment:  {F31871199 size=full}
[08:51:17] <wikibugs>	 Operations, Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (hashar) There are a lot of details on the task to have SonarQube to report straight to Gerrit T217008 and an implementation at https://github.c...
[08:52:22] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 912 threshold =0.34 breach: status: red, number_of_in_flight_fetch: 0, number_of_data_nodes: 2, active_shards_percent_as_number: 21.85089974293059, number_of_nodes: 5, unassigned_shards: 904, cluster_name: production-logstash-eqiad, number_of_pending_tasks: 394, initializing_shards: 8, task_max_waiting_in
[08:52:22] <icinga-wm_>	 635, delayed_unassigned_shards: 390, active_shards: 255, timed_out: False, relocating_shards: 0, active_primary_shards: 197 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:18] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 848 threshold =0.34 breach: number_of_data_nodes: 3, initializing_shards: 12, task_max_waiting_in_queue_millis: 261031, number_of_pending_tasks: 602, timed_out: False, active_shards: 319, active_primary_shards: 239, unassigned_shards: 836, number_of_nodes: 6, status: red, active_shards_percent_as_number: 
[08:53:18] <icinga-wm_>	 , cluster_name: production-logstash-eqiad, relocating_shards: 0, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:26] <wikibugs>	 Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) I've run the script in production, you can see the output of the script in P11603 and the results in Netbox in two ways: * looking for...
[08:53:32] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 814 threshold =0.34 breach: number_of_in_flight_fetch: 0, active_shards_percent_as_number: 30.248500428449017, timed_out: False, initializing_shards: 12, number_of_nodes: 6, number_of_data_nodes: 3, delayed_unassigned_shards: 0, number_of_pending_tasks: 602, status: red, task_max_waiting_in_queue_millis: 
[08:53:32] <icinga-wm_>	 rds: 353, cluster_name: production-logstash-eqiad, active_primary_shards: 266, unassigned_shards: 802, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:40] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 803 threshold =0.34 breach: delayed_unassigned_shards: 0, number_of_data_nodes: 3, active_shards_percent_as_number: 31.191088260497, timed_out: False, task_max_waiting_in_queue_millis: 284807, number_of_pending_tasks: 610, cluster_name: production-logstash-eqiad, active_shards: 364, initializing_shards: 1
[08:53:40] <icinga-wm_>	 locating_shards: 0, number_of_nodes: 6, active_primary_shards: 275, unassigned_shards: 791, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:56] <wikibugs>	 (CR) Thiemo Kreuz (WMDE): [C: +1] Remove comment about "Same regex as above in https_recv_redirect" [puppet] - https://gerrit.wikimedia.org/r/606457 (owner: Reedy)
[08:54:20] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 707 threshold =0.34 breach: relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, active_shards: 460, active_shards_percent_as_number: 39.41730934018852, initializing_shards: 12, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 324217, number_of_nodes: 6, unassigned_shards: 
[08:54:20] <icinga-wm_>	 ding_tasks: 730, cluster_name: production-logstash-eqiad, status: red, number_of_data_nodes: 3, active_primary_shards: 353 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:54:44] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 663 threshold =0.34 breach: number_of_in_flight_fetch: 0, active_primary_shards: 389, relocating_shards: 0, status: red, cluster_name: production-logstash-eqiad, active_shards: 504, number_of_nodes: 6, task_max_waiting_in_queue_millis: 347750, active_shards_percent_as_number: 43.18766066838046, unassigned
[08:54:44] <icinga-wm_>	 yed_unassigned_shards: 0, number_of_data_nodes: 3, number_of_pending_tasks: 736, initializing_shards: 12, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:55:19] <godog>	 known ^ roll-restarted the cluster
[08:56:52] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, active_shards_percent_as_number: 69.32305055698372, active_primary_shards: 513, number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 473743, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, cluster_name: production-logstash-eqiad, number_of_nodes: 6, unassigne
[08:56:52] <icinga-wm_>	 ive_shards: 809, timed_out: False, status: yellow, number_of_pending_tasks: 900, initializing_shards: 11 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:57:08] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: active_primary_shards: 513, number_of_data_nodes: 3, active_shards: 842, number_of_nodes: 6, timed_out: False, unassigned_shards: 317, status: yellow, cluster_name: production-logstash-eqiad, active_shards_percent_as_number: 72.15081405312768, delayed_unassigned_shards: 0, number_of_pending_tasks: 
[08:57:08] <icinga-wm_>	 ing_in_queue_millis: 488666, initializing_shards: 8, number_of_in_flight_fetch: 3, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:57:16] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: active_primary_shards: 513, number_of_nodes: 6, active_shards_percent_as_number: 74.293059125964, number_of_data_nodes: 3, cluster_name: production-logstash-eqiad, active_shards: 867, relocating_shards: 0, number_of_pending_tasks: 906, status: yellow, unassigned_shards: 292, delayed_unassigned_shar
[08:57:16] <icinga-wm_>	 g_shards: 8, task_max_waiting_in_queue_millis: 497731, number_of_in_flight_fetch: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:57:46] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 231, number_of_data_nodes: 3, active_primary_shards: 513, number_of_nodes: 6, active_shards_percent_as_number: 79.43444730077121, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_pending_tasks: 8, timed_out: False, task_max_waiting_in_queue_millis: 1134, clus
[08:57:46] <icinga-wm_>	 on-logstash-eqiad, active_shards: 927, relocating_shards: 0, initializing_shards: 9, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:58:02] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, relocating_shards: 0, cluster_name: production-logstash-eqiad, number_of_data_nodes: 3, delayed_unassigned_shards: 0, active_shards_percent_as_number: 80.80548414738647, unassigned_shards: 217, number_of_pending_tasks: 11, initializing_shards: 7, task_max_waiting_in_queue_millis: 65
[08:58:02] <icinga-wm_>	 se, active_shards: 943, number_of_in_flight_fetch: 0, active_primary_shards: 513, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:58:18] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 6, number_of_data_nodes: 3, number_of_nodes: 6, active_shards_percent_as_number: 83.54755784061697, unassigned_shards: 186, active_primary_shards: 513, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, status: yellow, delayed_unassigned_shards: 0, number_of_pen
[08:58:18] <icinga-wm_>	 ive_shards: 975, cluster_name: production-logstash-eqiad, timed_out: False, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
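The recovery messages above report `active_shards_percent_as_number` alongside the raw shard counts. As a rough cross-check, the sketch below parses the `key: value` pairs from such a message and recomputes the percentage. It assumes the total is active + unassigned + initializing shards, which agrees with the figures logged here. The helper names `parse_health` and `active_percent` are hypothetical, and the sample string is assembled from the fields of the 08:57:08 logstash1011 message, not a verbatim copy of it.

```python
import re

# Sample assembled from the fields reported for logstash1011 at 08:57:08.
SAMPLE = (
    "active_primary_shards: 513, number_of_data_nodes: 3, active_shards: 842, "
    "unassigned_shards: 317, initializing_shards: 8, relocating_shards: 0, "
    "status: yellow"
)


def parse_health(message: str) -> dict:
    """Extract `key: value` pairs from a health-check message into a dict."""
    return {key: value.strip() for key, value in re.findall(r"(\w+): ([^,]+)", message)}


def active_percent(fields: dict) -> float:
    """Recompute the active shard percentage from the raw counts.

    Assumption: total = active + unassigned + initializing shards.
    """
    active = int(fields["active_shards"])
    total = active + int(fields["unassigned_shards"]) + int(fields["initializing_shards"])
    return 100.0 * active / total


if __name__ == "__main__":
    fields = parse_health(SAMPLE)
    print(fields["status"], round(active_percent(fields), 2))
```

With these counts the total is 1167 shards, so the cluster stays yellow until the unassigned and initializing shards are allocated.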
[08:58:46] <volans>	 godog: do you know why we have icinga-wm_ instead of icinga-wm?
[09:00:26] <vgutierrez>	 the semi angry version of icinga-wm?
[09:01:37] <godog>	 volans: not ATM no
[09:01:48] <volans>	 vgutierrez: lol
[09:03:54] <icinga-wm_>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:06:41] <godog>	 ok I suspect the elastic 5 cluster isn't liking the additional shards change we did earlier in the week, I'm going to revert that before the weekend
[09:10:03] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "logstash: align number of shards with number of ES indexing hosts" [puppet] - 10https://gerrit.wikimedia.org/r/606651
[09:11:42] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "logstash: align number of shards with number of ES indexing hosts" [puppet] - 10https://gerrit.wikimedia.org/r/606651 (owner: 10Filippo Giunchedi)
[09:14:01] <apergos>	 !log rsync from dumpsdata1003 as root to labstore1007 of dumps output files to catch up, with --bwlimit=160000 up from 80000 
[09:14:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:34] <apergos>	 shouldn't impact the labstore server significantly
[09:17:56] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "Looks great, thanks." [software/transferpy] - 10https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) (owner: 10Privacybatm)
[09:20:26] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:20:34] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:21:04] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:21:14] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:21:19] <godog>	 mmhh *sigh* the master didn't like the index template update clearly
[09:21:34] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:22:00] <icinga-wm_>	 PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:22:06] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 37155, number_of_data_nodes: 3, active_shards_percent_as_number: 90.31705227077977, number_of_pending_tasks: 98, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 21, active_primary_shards: 513, timed_out: False, unassigned_shards: 105, status: yellow, reloc
[09:22:06] <icinga-wm_>	 umber_of_nodes: 6, active_shards: 1054, initializing_shards: 8, cluster_name: production-logstash-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:22:14] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, number_of_nodes: 6, timed_out: False, unassigned_shards: 105, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 21, number_of_data_nodes: 3, number_of_pending_tasks: 98, cluster_name: production-logstash-eqiad, active_primary_shards: 513, task_max_waiting_in_queue_milli
[09:22:14] <icinga-wm_>	 hards: 1054, initializing_shards: 8, status: yellow, active_shards_percent_as_number: 90.31705227077977 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:22:48] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: active_shards_percent_as_number: 90.31705227077977, cluster_name: production-logstash-eqiad, number_of_data_nodes: 3, unassigned_shards: 105, timed_out: False, initializing_shards: 8, number_of_nodes: 6, relocating_shards: 0, number_of_pending_tasks: 99, task_max_waiting_in_queue_millis: 77848, act
[09:22:48] <icinga-wm_>	 : 513, number_of_in_flight_fetch: 33, active_shards: 1054, status: yellow, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:22:52] <godog>	 !log restart elasticsearch on logstash1010
[09:22:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:22:56] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: delayed_unassigned_shards: 0, unassigned_shards: 105, number_of_pending_tasks: 98, number_of_data_nodes: 3, cluster_name: production-logstash-eqiad, number_of_nodes: 6, relocating_shards: 0, number_of_in_flight_fetch: 33, active_primary_shards: 513, task_max_waiting_in_queue_millis: 85658, timed_ou
[09:22:56] <icinga-wm_>	 yellow, initializing_shards: 8, active_shards: 1054, active_shards_percent_as_number: 90.31705227077977 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:23:40] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, number_of_pending_tasks: 98, timed_out: False, delayed_unassigned_shards: 0, active_shards: 1055, task_max_waiting_in_queue_millis: 130715, relocating_shards: 0, initializing_shards: 7, cluster_name: production-logstash-eqiad, active_shards_percent_as_number: 90.
[09:23:40] <icinga-wm_>	 mber_of_in_flight_fetch: 29, number_of_data_nodes: 3, active_primary_shards: 513, unassigned_shards: 105 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:27:38] <icinga-wm_>	 PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:06] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:28:52] <jynus>	 1% of 5XX on gets at app layer
[09:29:02] <jynus>	 also latency increase
[09:30:30] <icinga-wm_>	 RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: timed_out: False, cluster_name: production-logstash-eqiad, status: yellow, number_of_pending_tasks: 4, active_primary_shards: 513, delayed_unassigned_shards: 0, number_of_data_nodes: 3, number_of_nodes: 6, relocating_shards: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2415, u
[09:30:30] <icinga-wm_>	 347, active_shards: 808, initializing_shards: 12, active_shards_percent_as_number: 69.23736075407027 https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:31:40] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:34:18] <icinga-wm_>	 PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-clu
[09:34:18] <icinga-wm_>	 &var-topic=All&var-consumer_group=All
[09:36:33] <volans>	 #page logstash elasticsearch 5 cluster in trouble
[09:40:38] <icinga-wm_>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[09:43:10] <icinga-wm_>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:50:52] <wikibugs>	 (03PS8) 10Kormat: Add native mysql spicerack module. [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409)
[09:51:12] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:53:00] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:58:42] <wikibugs>	 (03CR) 10Kormat: "> Patch Set 6:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[10:01:34] <icinga-wm_>	 RECOVERY - Check systemd state on logstash1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:01:57] <volans>	 dcausse: by any chance are you around?
[10:02:06] <dcausse>	 volans: yes
[10:02:30] <volans>	 could you join #wikimedia-sre if not too much trouble?
[10:02:38] <dcausse>	 sure
[10:02:49] <volans>	 thx!
[10:05:51] <wikibugs>	 (03CR) 10Gilles: "The transparency information is not reduced to 1 bit. It becomes a palette PNG, but the palette colors are full ARGB values." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles)
[10:08:11] <wikibugs>	 (03PS1) 10Vgutierrez: Release 8.0.8-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658
[10:14:34] <icinga-wm_>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:15:30] <wikibugs>	 (03PS1) 10Muehlenhoff: Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661
[10:16:45] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661 (owner: 10Muehlenhoff)
[10:18:08] <icinga-wm_>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:18:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[10:21:03] <wikibugs>	 (03Merged) 10jenkins-bot: Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm)
[10:21:24] <godog>	 !log start closing logstash indices for 2020.03 in elastic 5 eqiad
[10:21:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:48] <logmsgbot>	 !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11604 and previous config saved to /var/cache/conftool/dbconfig/20200619-102447-marostegui.json
[10:25:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:59] <wikibugs>	 (03CR) 10Gilles: "My previous commands were wrong in the sense that they manipulated the image first, but my point was correct. Here are the first 10 colors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles)
[10:35:27] <moritzm>	 !log installing tomcat8 security updates
[10:38:19] <jayme>	 !log imported chartmuseum_0.12.0-1 to buster-wikimedia
[10:38:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:39:31] <jayme>	 moritzm: you might want to repeat yourself 
[10:45:01] <moritzm>	 !log installing tomcat8 security updates
[10:45:37] <moritzm>	 jayme: seems it's ignoring just me :-)
[10:45:54] <jayme>	 :-P
[10:46:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:48:57] <wikibugs>	 (03PS1) 10Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755)
[10:49:07] <godog>	 !log close april logstash indices on logstash 5 eqiad
[10:49:17] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Wait for the decided date and time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[10:49:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:08] <wikibugs>	 (03CR) 10Kormat: [C: 04-1] db-eqiad.php: Depool cluster27 (es5) from writes. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[10:51:54] <wikibugs>	 (03PS2) 10Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755)
[10:52:55] <wikibugs>	 (03PS1) 10Hashar: Fix doc generation for ParamikoExecution [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664
[10:53:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile::icinga: add vhost for external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/606437 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[10:53:09] <wikibugs>	 (03CR) 10Jcrespo: "Do we need to make the cluster temporarily "static" so pt-heartbeat is not checked? This may need performance input, not sure if any of us" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[10:53:24] <wikibugs>	 (03CR) 10Kormat: [C: 03+1] db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[10:53:39] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "> Do we need to make the cluster temporarily "static" so pt-heartbeat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[10:54:16] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] Fix doc generation for ParamikoExecution [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 (owner: 10Hashar)
[10:55:24] <moritzm>	 !log installing mesa security updates
[10:55:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:22] <wikibugs>	 (03PS1) 10Jbond: icinga-extmon: add new cname for icinga external monitoring [dns] - 10https://gerrit.wikimedia.org/r/606668
[10:59:24] <wikibugs>	 (03PS1) 10Jbond: icingae::external_monitoring: hosts should be an array not string [puppet] - 10https://gerrit.wikimedia.org/r/606667
[10:59:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] icingae::external_monitoring: hosts should be an array not string [puppet] - 10https://gerrit.wikimedia.org/r/606667 (owner: 10Jbond)
[11:00:54] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "Aaron, Tim, we want to depool es5 from writes to do a master switchover, the last time we did it at https://gerrit.wikimedia.org/r/#/c/ope" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui)
[11:04:13] <icinga-wm_>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[11:06:33] <wikibugs>	 (03PS2) 10Jbond: admin: add Andrew Kuznetsov to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/606147
[11:07:55] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: add Andrew Kuznetsov to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/606147 (owner: 10Jbond)
[11:10:29] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) 05Open→03Resolved a:03jbond @AndrewKuznetsov i have added you to the NDA group so you should be...
[11:11:29] <wikibugs>	 (03CR) 10Privacybatm: [C: 03+1] "Looks good! Thank you for this patch." [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 (owner: 10Hashar)
[11:13:11] <icinga-wm_>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[11:13:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/606668 (owner: 10Jbond)
[11:14:39] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] icinga-extmon: add new cname for icinga external monitoring [dns] - 10https://gerrit.wikimedia.org/r/606668 (owner: 10Jbond)
[11:15:33] <volans>	 jbond42: feel free to ping me when you need to change how external monitoring is calling icinga
[11:17:23] <jbond42>	 volans: I was just looking at that. I see the script checks icinga1001 and icinga2001 directly, so I wonder if it's better to have a CNAME for icinga[12]001-extmon?
[11:18:14] <jbond42>	 please ignore the wikimedia-extmon alert
[11:19:38] <volans>	 jbond42: looking
[11:19:55] <volans>	 so we currently have 2 crontab lines
[11:19:55] <volans>	 */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org
[11:19:58] <volans>	 */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org
[11:20:11] <icinga-wm_>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37
[11:21:29] <volans>	 so we can use whatever works for you, but we do check both the active and passive hosts
[11:21:39] <volans>	 what's not totally clear to me is why it paged... 
[11:21:47] <jbond42>	 oh I ran the script manually
[11:21:47] <volans>	 did you try it manually?
[11:21:50] <volans>	 ah ok
[11:21:52] <volans>	 that explains it
[11:22:22] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] Fix doc generation for ParamikoExecution [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 (owner: 10Hashar)
[11:22:43] <jbond42>	 I think adding icinga[12]001-extmon.wikimedia.org makes the most sense. Otherwise we would need to update the check script to send a different Host header
[11:23:04] <volans>	 SGTM
[11:23:15] <jbond42>	 ack will update thanks
[11:24:24] <wikibugs>	 (03CR) 10Marostegui: "Do you have a task where I can read some background for this patch?" [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat)
[11:24:28] <volans>	 jbond42: thanks! please once we migrate update also https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring
[11:24:31] <volans>	 accordingly
[11:24:41] <jbond42>	 ack will do 
[11:24:47] <volans>	 being an unpuppetized host it's important to keep the doc up-to-date
[11:25:57] <jbond42>	 ack 
[11:26:17] <icinga-wm_>	 PROBLEM - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized
[11:26:30] <jbond42>	 ^^ me will ack
[11:27:54] <icinga-wm_>	 ACKNOWLEDGEMENT - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) John Bond Still configuring https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized
[11:29:19] <wikibugs>	 (03PS1) 10Jbond: icinga[12]001-extmon add extmon cnames for icinga[12]001 [dns] - 10https://gerrit.wikimedia.org/r/606672
[11:31:53] <icinga-wm_>	 RECOVERY - Check systemd state on an-tool1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:36:23] <wikibugs>	 (03PS2) 10Vgutierrez: Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658
[11:36:37] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 (owner: 10Vgutierrez)
[11:36:47] <vgutierrez>	 that was fast
[11:39:31] <wikibugs>	 (03PS1) 10Marostegui: mariadb: Reimage db2116, db2119 and db2130 [puppet] - 10https://gerrit.wikimedia.org/r/606674 (https://phabricator.wikimedia.org/T250666)
[11:39:40] <wikibugs>	 (03PS1) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323)
[11:39:42] <marostegui>	 !log Reimage db2116 db2119 db2130
[11:39:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:40:11] <wikibugs>	 (03PS1) 10Elukey: cumin: add more aliases for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/606676
[11:41:44] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2116, db2119 and db2130 [puppet] - 10https://gerrit.wikimedia.org/r/606674 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui)
[11:41:46] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "one typo, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606676 (owner: 10Elukey)
[11:42:08] <elukey>	 ah!!! thanks!
[11:42:15] <volans>	 yw :)
[11:42:22] <wikibugs>	 (03CR) 10Elukey: cumin: add more aliases for Hadoop test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606676 (owner: 10Elukey)
[11:42:33] <wikibugs>	 (03PS2) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323)
[11:43:00] <volans>	 elukey: no need for follow up +1 from me
[11:43:18] <wikibugs>	 (03PS2) 10Elukey: cumin: add more aliases for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/606676
[11:45:52] <elukey>	 ack!
[11:45:59] <wikibugs>	 (03PS3) 10Vgutierrez: Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658
[11:46:29] <wikibugs>	 (03PS1) 10Jbond: profile/icinga/external_monitoring: add dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/606677
[11:46:41] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cumin: add more aliases for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/606676 (owner: 10Elukey)
[11:46:54] <wikibugs>	 (03CR) 10Jbond: [V: 03+2 C: 03+2] profile/icinga/external_monitoring: add dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/606677 (owner: 10Jbond)
[11:49:48] <wikibugs>	 (03CR) 10Ema: [C: 03+1] Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 (owner: 10Vgutierrez)
[11:51:28] <wikibugs>	 (03PS3) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323)
[11:55:21] <wikibugs>	 (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[11:58:13] <icinga-wm_>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:59:35] <wikibugs>	 (03PS1) 10Jbond: acme_chief: add extmon names to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323)
[11:59:59] <wikibugs>	 (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[12:00:52] <wikibugs>	 (03PS2) 10Jbond: acme_chief: add extmon names to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323)
[12:01:34] <wikibugs>	 (03PS3) 10Jbond: acme_chief: add extmon names to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323)
[12:02:37] <wikibugs>	 (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[12:03:16] <wikibugs>	 (03PS2) 10Kormat: mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604
[12:03:47] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] ""Extraordinary evidence". Oh dear. Can we please stick to a less aggressive language?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles)
[12:03:59] <icinga-wm_>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:05:01] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[12:05:04] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[12:05:24] <wikibugs>	 (03PS3) 10Kormat: mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604
[12:05:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:06:55] <wikibugs>	 (03CR) 10Kormat: "> Do you have a task where I can read some background for this patch?" [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat)
[12:07:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:07:38] <logmsgbot>	 !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:07:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:08:02] <wikibugs>	 (03PS4) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323)
[12:08:07] <logmsgbot>	 !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[12:08:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:04] <logmsgbot>	 !log marostegui@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[12:10:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:50] <logmsgbot>	 !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[12:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:14:08] <godog>	 !log delete march indices from logstash 5 eqiad to free up space
[12:15:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/606672 (owner: 10Jbond)
[12:24:11] <volans>	 jbond42: FYI there is also the bit in /etc/check_icinga/config.yaml to define the domain of the icinga service
[12:24:50] <volans>	 and that's used to determine if the host is active or passive
[12:25:04] <volans>	 checking the CNAME
[12:26:06] <jbond42>	 volans: does that mean we don't need the additional dns queries above? i.e. does it also use that value to set the Host header?
[12:26:09] <wikibugs>	 (03PS6) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736)
[12:26:19] <jbond42>	 s/queries/cnames/
[12:26:46] <volans>	 headers={'Host': domain}
[12:27:18] <volans>	 see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/external-monitoring/+/master/icinga/check_icinga.py#564
[12:27:48] <volans>	 and slightly above to the detection of the active host
[12:28:00] <wikibugs>	 (03CR) 10Privacybatm: "I have resolved the comments except these three (WIP):" (035 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[12:28:34] <jbond42>	 ack thanks, I'll cancel those other changes
[12:28:35] <volans>	 I think you can get away with just a single CNAME yes
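The exchange above describes two things: the check decides whether a host is active or passive by looking at where the service domain's CNAME points, and it sends the configured domain as the Host header. A minimal sketch of that idea follows. It is not the code in check_icinga.py; the function names, the FQDNs, and the use of socket.gethostbyname_ex as a stand-in for a CNAME lookup are all assumptions made for illustration.

```python
import socket
from urllib.request import Request


def canonical_name(domain, resolver=socket.gethostbyname_ex):
    """Return the canonical name that `domain` resolves to (CNAMEs followed).

    `resolver` is injectable so the logic can be exercised without DNS;
    it must return a (canonical_name, aliases, addresses) tuple.
    """
    name, _aliases, _addresses = resolver(domain)
    return name.rstrip(".").lower()


def is_active(host, domain, resolver=socket.gethostbyname_ex):
    """Treat `host` as active when the service domain's CNAME points at it."""
    return canonical_name(domain, resolver) == host.rstrip(".").lower()


def build_request(host, domain, path="/"):
    """Address one specific host, but present the service domain as Host."""
    return Request(f"https://{host}{path}", headers={"Host": domain})


if __name__ == "__main__":
    # Hypothetical names; the real domain lives in /etc/check_icinga/config.yaml.
    domain = "icinga.example.org"

    def fake_resolver(name):
        # Pretend the service CNAME currently points at icinga1001.
        return ("icinga1001.example.org", [name], ["192.0.2.1"])

    for host in ("icinga1001.example.org", "icinga2001.example.org"):
        state = "active" if is_active(host, domain, fake_resolver) else "passive"
        print(host, state, build_request(host, domain).get_header("Host"))
```

Because both hosts are queried under the same Host header and told apart only by the CNAME target, a single CNAME for the service domain is enough under these assumptions.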
[12:28:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661 (owner: 10Muehlenhoff)
[12:28:45] <wikibugs>	 (03PS2) 10Muehlenhoff: Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661
[12:29:17] <wikibugs>	 (03Abandoned) 10Jbond: icinga[12]001-extmon add extmon cnames for icinga[12]001 [dns] - 10https://gerrit.wikimedia.org/r/606672 (owner: 10Jbond)
[12:30:06] <wikibugs>	 (03Abandoned) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[12:30:15] <wikibugs>	 (03CR) 10Privacybatm: transferpy: Package transferpy (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[12:31:04] <qchris>	 !log Disabling puppet on gerrit1002 (test instance) to do some more testing
[12:31:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:32:15] <wikibugs>	 (03PS1) 10Marostegui: db2116,db2119,db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606684
[12:32:29] <wikibugs>	 (03PS4) 10Jbond: acme_chief: add extmon name to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323)
[12:32:46] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db2116,db2119,db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606684 (owner: 10Marostegui)
[12:32:56] <wikibugs>	 (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[12:34:47] <wikibugs>	 (03PS1) 10Jbond: profile::icinga::external_monitoring:  correct icinga check [puppet] - 10https://gerrit.wikimedia.org/r/606685
[12:36:32] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat)
[12:37:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile::icinga::external_monitoring:  correct icinga check [puppet] - 10https://gerrit.wikimedia.org/r/606685 (owner: 10Jbond)
[12:37:20] <wikibugs>	 (03CR) 10Kormat: [C: 03+2] mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat)
[12:37:50] <wikibugs>	 (03Merged) 10jenkins-bot: mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat)
[12:41:20] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 (owner: 10Vgutierrez)
[12:42:10] <wikibugs>	 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) For what it's worth this is probably going to be scheduled for next quarter (so July-September).  @eyazi Thanks for your offer, it's much appreciated. Indeed we are con...
[12:45:56] <wikibugs>	 (03CR) 10Jcrespo: "answers" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[12:47:43] <wikibugs>	 (03PS6) 10Kormat: mariadb: Add monitoring for lag spikes. [puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120)
[12:48:58] <wikibugs>	 (03PS1) 10Muehlenhoff: Update various references/comments to jessie [puppet] - 10https://gerrit.wikimedia.org/r/606688
[12:49:17] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10User-jbond: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10Aklapper)
[12:49:21] <wikibugs>	 (03CR) 10Jbond: dnsdist: add parameter for web server configuration (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[12:53:55] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[12:58:13] <wikibugs>	 (03CR) 10Gilles: "There's nothing aggressive with this language. It's very factual, you make bold claims in your -1, suggesting that previous reviewers miss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles)
[12:58:39] <wikibugs>	 10Operations, 10Traffic, 10User-Joe: etcd cluster has Raft Internal errors sporadically - https://phabricator.wikimedia.org/T147209 (10Aklapper)
[13:01:49] <moritzm>	 !log installing cups security updates (client side libs/tools)
[13:01:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:54] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] acme_chief: add extmon name to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond)
[13:13:52] <icinga-wm_>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:18:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "Sure. I can reproduce in fact." [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov)
[13:25:30] <wikibugs>	 (03PS1) 10Addshore: AdHocLogging for ReplicaMasterAwareRecordIdsAcquirer [extensions/Wikibase] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606692 (https://phabricator.wikimedia.org/T255855)
[13:30:57] <icinga-wm_>	 PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 211.9 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[13:32:32] <godog>	 that's a false positive I think ^ 
[13:37:45] <wikibugs>	 (03PS7) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736)
[14:06:14] <wikibugs>	 (03PS1) 10Majavah: betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673)
[14:07:42] <icinga-wm_>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:08:23] <wikibugs>	 (03CR) 10Privacybatm: "> Patch Set 7: Verified+2" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[14:09:30] <icinga-wm_>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:10:49] <wikibugs>	 (03CR) 10Reedy: [C: 04-1] "Not that globalblocks seem to be used on beta anyway; the table on deploymentwiki is empty. So nothing to actually migrate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[14:10:58] <wikibugs>	 (03CR) 10Reedy: "Uh, accidental -1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[14:17:05] <wikibugs>	 (03PS1) 10Majavah: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673)
[14:17:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/606688 (owner: 10Muehlenhoff)
[14:17:46] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[14:18:20] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[14:18:40] <wikibugs>	 (03PS2) 10Majavah: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673)
[14:19:24] <wikibugs>	 (03CR) 10RhinosF1: [C: 03+1] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[14:27:42] <icinga-wm_>	 RECOVERY - icinga-extmon.wikimedia.org requires authentication on icinga1001 is OK: passive https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized
[14:30:24] <wikibugs>	 (03CR) 10Jcrespo: "> Patch Set 7:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm)
[14:32:48] <icinga-wm_>	 PROBLEM - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized
[14:33:31] <icinga-wm_>	 ACKNOWLEDGEMENT - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) John Bond Investigating check https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized
[14:35:14] <wikibugs>	 (03PS1) 10Jbond: profile::icinga::external_monitoring: fix typo in check_command [puppet] - 10https://gerrit.wikimedia.org/r/606707
[14:36:16] <wikibugs>	 (03PS1) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409)
[14:38:22] <wikibugs>	 (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "I'm sorry, but I will not continue a conversation as hostile as this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles)
[14:44:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] profile::icinga::external_monitoring: fix typo in check_command [puppet] - 10https://gerrit.wikimedia.org/r/606707 (owner: 10Jbond)
[14:51:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Yeah, Ganeti should be unblocked for this." [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[14:53:38] <icinga-wm_>	 RECOVERY - icinga-extmon.wikimedia.org requires authentication on icinga1001 is OK: HTTP OK: Status line output matched HTTP/1.1 403 - 437 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized
[14:56:35] <wikibugs>	 (03CR) 10Volans: "I don't dislike the approach, we could query other existing classes but feels weird to me (like the ferm one), so why not." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[14:58:57] <wikibugs>	 (03PS1) 10Majavah: betacluster: Apply global abuse filters from metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673)
[14:59:11] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "DNM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[15:06:22] <wikibugs>	 (03CR) 10Gehel: [V: 04-1] "Looks good! A few questions inline. Some is just me not understanding, other are about some weirdness of our deployment process." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles)
[15:06:30] <wikibugs>	 (03PS1) 10Filippo Giunchedi: kibana: additional settings [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863)
[15:07:48] <wikibugs>	 (03CR) 10Gehel: [V: 04-1 C: 04-1] sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles)
[15:08:58] <wikibugs>	 (03CR) 10Filippo Giunchedi: "I've verified that e.g. Kibana 5 doesn't barf on unknown settings" [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi)
[15:10:25] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Looks good to me except for the SPARQL URI (which I now see Gehel already commented on)." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles)
[15:12:19] <wikibugs>	 (03CR) 10Gehel: [C: 04-1] sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles)
[15:13:04] <wikibugs>	 (03PS8) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736)
[15:13:31] <wikibugs>	 (03PS2) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409)
[15:14:49] <wikibugs>	 (03CR) 10Kormat: "> I'm even tempted to propose, in order to make the section profile a bit less dummy, to use it as the entry point for everything section " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[15:15:56] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] kibana: additional settings [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi)
[15:22:20] <wikibugs>	 (03PS5) 10Dzahn: add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526)
[15:22:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] kibana: additional settings [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi)
[15:25:06] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn)
[15:28:25] <godog>	 !log roll-restart kibana to apply new settings
[15:28:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:40] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) added to DNS:  install3001.wikimedia.org has address 91.198.174.63 install3001.wikimedia.org has IPv6 address 2620:0:862:1:91:198:174:63  install40...
[15:29:44] <wikibugs>	 10Operations, 10Patch-For-Review: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn)
[15:31:42] <icinga-wm_>	 RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 99.77 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[15:32:12] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) Thanks will schedule downtime maintenance for the 25th at 9:30am CT to take down the old one and connect the new one.
[15:34:04] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles)
[15:34:23] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10JMeybohm)
[15:36:00] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (10JMeybohm)
[15:36:09] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn)
[15:37:38] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[15:37:49] <wikibugs>	 (03PS1) 10Ssingh: rearrange wikidough data [labs/private] - 10https://gerrit.wikimedia.org/r/606715
[15:38:00] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash2025.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:38:46] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash2025.codfw.wmnet, logstash2024.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[15:38:46] <godog>	 that's me ^ checking
[15:38:50] <mutante>	 !log dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 20 --network public ulsfo install4001.wikimedia.org (T254157) 
[15:38:59] <mutante>	 ack, thanks godog
[15:39:20] <icinga-wm_>	 PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash2025.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[15:39:51] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] "changes to only wikidough's dummy data" [labs/private] - 10https://gerrit.wikimedia.org/r/606715 (owner: 10Ssingh)
[15:42:00] <mutante>	 !heal stashbot
[15:42:08] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (10JMeybohm)
[15:42:29] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm)
[15:42:31] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (10JMeybohm)
[15:42:33] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (10JMeybohm)
[15:42:37] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (10JMeybohm)
[15:42:39] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm)
[15:42:42] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (10JMeybohm)
[15:42:46] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10JMeybohm)
[15:42:52] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10JMeybohm)
[15:43:18] <wikibugs>	 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) >>! In T254157#6219046, @MoritzMuehlenhoff wrote: >  feel free to give install4001.wikimedia.org a shot.  Thanks!   Just did. First try i followed...
[15:43:25] <wikibugs>	 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul)
[15:43:30] <icinga-wm_>	 PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash2025.codfw.wmnet, logstash2024.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[15:46:34] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver  to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm)
[15:46:56] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[15:47:33] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver  to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) @Joe I think cxserver is missing the last two steps as well, correct?
[15:47:43] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[15:47:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:49:19] <wikibugs>	 (03PS1) 10Filippo Giunchedi: kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863)
[15:49:31] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi)
[15:49:48] <wikibugs>	 (03PS1) 10Dzahn: site/DHCP: add install4001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/606718 (https://phabricator.wikimedia.org/T254157)
[15:50:16] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver  to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) a:05JMeybohm→03None
[15:50:51] <wikibugs>	 (03PS2) 10Filippo Giunchedi: kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863)
[15:51:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] site/DHCP: add install4001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/606718 (https://phabricator.wikimedia.org/T254157) (owner: 10Dzahn)
[15:53:19] <wikibugs>	 (03PS5) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132)
[15:55:01] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi)
[15:55:35] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[15:56:34] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi)
[15:59:02] <wikibugs>	 (03PS6) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132)
[15:59:28] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:00:44] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:00:46] <icinga-wm_>	 RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[16:01:41] <wikibugs>	 (03PS1) 10Dzahn: DHCP: configure install2003 as next-server for install4001 [puppet] - 10https://gerrit.wikimedia.org/r/606720 (https://phabricator.wikimedia.org/T254157)
[16:01:54] <icinga-wm_>	 RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[16:03:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] DHCP: configure install2003 as next-server for install4001 [puppet] - 10https://gerrit.wikimedia.org/r/606720 (https://phabricator.wikimedia.org/T254157) (owner: 10Dzahn)
[16:07:18] <mutante>	 !log ganeti4003 - rebooting install4001 - trying to bootstrap OS install from install2003
[16:07:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:13:17] <wikibugs>	 (03CR) 10Volans: "LGTM, better to double check it with a puppet compiler for each profile when you get a chance." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat)
[16:13:38] <icinga-wm_>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:13:44] <wikibugs>	 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie...
[16:15:24] <icinga-wm_>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:16:50] <wikibugs>	 10Operations: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decre...
[16:17:48] <mutante>	 andre__: ^ mass editing tasks?:)
[16:19:22] <andre__>	 mutante: Yepp. Mass-unassigning people. Hence cannot do that silently.
[16:19:27] <wikibugs>	 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Aklapper) a:05herron→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-li...
[16:20:41] <wikibugs>	 10Operations: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066 (10Aklapper) a:05ArielGlenn→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly...
[16:21:57] <wikibugs>	 10Operations, 10Mail: Split MXes into inbound and outbound - https://phabricator.wikimedia.org/T175362 (10Aklapper) a:05herron→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly...
[16:22:25] <wikibugs>	 10Operations, 10netops, 10Sustainability (Incident Prevention): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10Aklapper) a:05ayounsi→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-l...
[16:23:59] <mutante>	 andre__: yep, it's just that wikibugs quits IRC 
[16:24:19] <mutante>	 because it triggers flooding
[16:24:29] * andre__ shrugs :)
[16:25:54] <wikibugs>	 10Operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a...
[16:25:56] <wikibugs>	 10Operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a sl...
[16:26:11] * RhinosF1 didn't even know he was subscribed to half the tasks he's getting emails for
[16:26:13] <wikibugs>	 10Operations: Data retention: revise audit bash scripts - https://phabricator.wikimedia.org/T111021 (10Aklapper) a:05ArielGlenn→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly m...
[16:26:17] <wikibugs>	 10Operations: Retention auditing: clean up rules db contents and use - https://phabricator.wikimedia.org/T111020 (10Aklapper) a:05ArielGlenn→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get...
[16:28:04] <andre__>	 RhinosF1: Heh, sorry :P
[16:28:29] <RhinosF1>	 andre__: don't mind. :)
[16:28:50] * RhinosF1 gets that many emails anyway it's barely a spike
[16:29:14] <andre__>	 https://www.mediawiki.org/wiki/Phabricator/Help/Managing_mail covers how to best ignore Phab email notifications ;)
[16:31:19] * mutante recommends turning off all mail from phabricator and instead have the notifications in the browser
[16:31:35] <RhinosF1>	 andre__: You just made me bump into an ios bug!
[16:31:38] <mutante>	 and then reading those to get updates (not ignoring them all, heh)
[16:32:02] <andre__>	 (For the record: Bulk Job Complete, server side at least.)
[16:33:42] <RhinosF1>	 Does anyone have an iphone running latest ios version (13.5.1) so I can test the bug andre__ made me bump into?
[16:34:08] <wikibugs>	 (03CR) 10Ssingh: "This is ready for review. I have used the "merge" strategy and also tried to address the other concerns raised. https://puppet-compiler.wm" [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:34:19] <andre__>	 RhinosF1: maybe not in #operations if it's not a bug on SRE level :)
[16:34:34] <RhinosF1>	 True :)
[16:34:43] * RhinosF1 wanders off to an apple channel
[16:35:08] <icinga-wm_>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:36:54] <icinga-wm_>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:38:30] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "right now puppet is broken and compiler shows it would be unbroken after this" [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:38:55] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh)
[16:48:21] <wikibugs>	 10Operations, 10Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941 (10Nuria) 05Open→03Resolved
[16:49:26] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: apply hostname labels to bast1002/WMF4749 - https://phabricator.wikimedia.org/T186625 (10Dzahn)
[16:56:50] <wikibugs>	 (03PS1) 10Ssingh: dnsdist: disable the console by default [puppet] - 10https://gerrit.wikimedia.org/r/606729
[16:59:36] <mutante>	 ChanServ shutting down .. ok, that's not happening every day
[17:00:54] <wikibugs>	 (03PS1) 10Dzahn: icinga: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606730 (https://phabricator.wikimedia.org/T114209)
[17:01:21] <RhinosF1>	 mutante: maintenance, it was announced 2 hours ago
[17:01:21] <wikibugs>	 (03CR) 10Ssingh: [V: 03+2 C: 03+2] "pcc looks good as no change was expected: https://puppet-compiler.wmflabs.org/compiler1001/23346/" [puppet] - 10https://gerrit.wikimedia.org/r/606729 (owner: 10Ssingh)
[17:01:54] <mutante>	 RhinosF1: aha, thanks!
[17:02:01] <wikibugs>	 (03PS1) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673)
[17:02:51] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[17:03:36] * Majavah is confused why that jenkins job failed
[17:04:23] <wikibugs>	 (03PS1) 10Dzahn: codesearch: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606735 (https://phabricator.wikimedia.org/T114209)
[17:04:58] <RhinosF1>	 Majavah: it's not ordered correctly
[17:05:07] <RhinosF1>	 Did you run buildDBLists?
[17:05:14] <mutante>	 Majavah: "closed-labs.dblist is not alphasorted"
[17:05:39] <Majavah>	 'closed-labs.dblist' only contains names in 'all.dblist'
[17:05:53] <mutante>	 eswiki
[17:05:54] <mutante>	 deploymentwiki
[17:05:58] <mutante>	 but d is before e
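The ordering rule behind the "not alphasorted" failure can be sketched in a few lines of Python. This is only an illustration, assuming a dblist is one database name per line; the function name `first_unsorted` is hypothetical and this is not the actual mediawiki-config test.

```python
from typing import Optional, Sequence, Tuple


def first_unsorted(names: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Return the first adjacent pair that is out of order, or None if sorted."""
    for prev, cur in zip(names, names[1:]):
        if prev > cur:
            return (prev, cur)
    return None


# The order from the paste above: "eswiki" precedes "deploymentwiki".
dblist = ["eswiki", "deploymentwiki"]
print(first_unsorted(dblist))          # reports the offending pair
print(first_unsorted(sorted(dblist)))  # None once the list is sorted
```

Reporting the offending pair rather than a bare pass/fail makes it easier to see which entry was appended in the wrong place.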
[17:06:03] <wikibugs>	 (03PS2) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673)
[17:06:35] <RhinosF1>	 Majavah: are you editing the dblist directly?
[17:06:35] <jynus>	 yeah, it is not a huge issue, but before the check was there it list were very kaotic
[17:06:48] <jynus>	 *the lists
[17:06:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[17:07:10] <jynus>	 *chaotic
[17:07:54] * RhinosF1 has an idea
[17:08:03] <wikibugs>	 (03PS1) 10Elukey: WIP - Add sre.hadoop.change-distro.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499)
[17:08:11] <Majavah>	 hmh found it
[17:08:57] <wikibugs>	 (03PS1) 10Dzahn: contint: move firewall rules for labs to profile [puppet] - 10https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209)
[17:08:59] <wikibugs>	 (03PS3) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673)
[17:09:55] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[17:12:32] <wikibugs>	 (03PS4) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673)
[17:14:04] <Majavah>	 finally it worked :D
[17:15:06] <wikibugs>	 10Operations, 10vm-requests: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) Creating the VM worked fine. Installing the OS on install4001 has not worked yet though.  DHCP was working right away, but serving the installer was not.  Then i changed...
[17:18:56] <RhinosF1>	 Majavah: :)
[17:20:26] <wikibugs>	 (03PS1) 10Dzahn: dumps: move ferm rules for xmldumps from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209)
[17:23:02] <wikibugs>	 10Operations, 10Wiki-Loves-Monuments, 10Wikimedia-Mailing-lists: Close wlm-us mailing list - https://phabricator.wikimedia.org/T159261 (10Dzahn)
[17:23:12] <wikibugs>	 10Operations, 10Wiki-Loves-Monuments: Close wlm-us mailing list - https://phabricator.wikimedia.org/T159261 (10Dzahn)
[17:26:07] <wikibugs>	 10Operations, 10audits-data-retention: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839 (10Dzahn)
[18:07:26] <wikibugs>	 (03PS1) 10Ottomata: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606749 (https://phabricator.wikimedia.org/T238230)
[18:08:23] <wikibugs>	 (03CR) 10Ottomata: [C: 03+2] Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606749 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata)
[18:10:07] <logmsgbot>	 !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt - T238230 (duration: 00m 59s)
[18:10:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:13] <stashbot>	 T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230
[18:16:29] <wikibugs>	 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055 (10Krinkle) 05Open→03Declined We're no longer on HHVM. We also use Redis for fewer things now.  Dec...
[18:17:19] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM!  From metrics, it seems there is still some headspace under load that would be great to utilize." [puppet] - 10https://gerrit.wikimedia.org/r/606647 (https://phabricator.wikimedia.org/T255243) (owner: 10Filippo Giunchedi)
[18:48:38] <wikibugs>	 (03PS3) 10Krinkle: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[18:55:17] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "This has been cherry-picked on Beta Cluster's puppetmaster, and I ran puppet agent on mediawiki-07 there. There were no errors." [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[19:03:56] <wikibugs>	 (03CR) 10Reedy: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[19:07:08] <wikibugs>	 (03CR) 10Majavah: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah)
[19:19:19] <wikibugs>	 (03PS1) 10Majavah: betacluster: Add explicit testwikidataclient-test overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555)
[19:37:02] <wikibugs>	 (03CR) 10Krinkle: [C: 04-1] "You can use '-' instead and then set 'default' for ones where it is not already set." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah)
[19:38:51] <wikibugs>	 (03PS1) 10Ottomata: [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609)
[19:39:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[19:40:22] <wikibugs>	 (03CR) 10Ottomata: "Petr and Timo, I'm not sure if this is the best way to do this.  It is very simple here, but I could also see adding some code to EventStr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[19:40:32] <wikibugs>	 (03PS2) 10Majavah: betacluster: Add explicit testwikidataclient-test overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555)
[19:42:35] <wikibugs>	 (03PS3) 10Majavah: betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555)
[19:42:42] <wikibugs>	 (03PS2) 10Ottomata: [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609)
[19:43:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata)
[19:45:55] <RhinosF1>	 anyone got any idea/advice for https://phabricator.wikimedia.org/T255891? Or able to add the timings for wmf db queries if they're useful?
[19:46:45] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah)
[19:48:14] <icinga-wm_>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[19:48:15] <Reedy>	 RhinosF1: They're all 0.00 sec on a WMF replica
[19:49:04] <Reedy>	 But as the timings on miraheze are pretty small for just the sql queries, it's probably not missing sql indexes etc
[19:49:42] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) > We could treat a stream as an unending download, and encode the same information that is curr...
[19:50:44] <RhinosF1>	 Reedy: The queries are pretty low. The page is slow to load though so either I'm missing something in what the page does or it's doing something weird and wonderful.
[19:51:47] <wikibugs>	 (03PS2) 10ArielGlenn: restructure rsync of xml/sql dumps from primary source to other servers [puppet] - 10https://gerrit.wikimedia.org/r/605990 (https://phabricator.wikimedia.org/T254856)
[19:51:52] <Reedy>	 You'll really have to do some more indepth benchmarking to see where the slow parts are
[19:52:46] <Reedy>	 The unregistered user being over 2x slower on wmf is potentially something you could get someone at WMF to investigate
[19:52:48] <RhinosF1>	 Reedy: If you have an idea on how to do that, drop a note on the task. I should have shell access soon to mw servers. If not, someone can run it in next few days.
[19:53:38] <Reedy>	 https://www.mediawiki.org/wiki/Manual:Profiling
[19:54:23] <Reedy>	 I wonder if it's more related to the cach(es|ing)
[19:54:55] * RhinosF1 not sure. I do config and basic maintenance.
[19:56:04] <Reedy>	 You can also do some debugging via browser tools
[19:56:10] <Reedy>	 See if that helps highlight the slow parts
[19:57:25] <RhinosF1>	 I can look in the next few days
[19:58:28] <Reedy>	 If you can find actual issues from the MW/CA code itself, you can probably tag it as a perf issue etc
[19:59:12] <RhinosF1>	 I can try and get data from profiling & browser tools
[19:59:31] <RhinosF1>	 But me understanding php well enough to know exactly where it is, is unlikely
[19:59:46] <Reedy>	 Like I say, there's a slight issue on WMF wikis too, but it's definitely nowhere near as pronounced
[20:00:36] <Reedy>	 But the fact there's such a big increase on the registered one would suggest it's more of an issue in MH config/setup etc
[20:00:46] <RhinosF1>	 That's probably because the wmf's servers are 100x better
[20:00:59] * RhinosF1 can't lie about our infra
[20:01:21] <Reedy>	 The mainpage difference isn't much in absolute terms, but quite a bit in relative terms
[20:02:04] <RhinosF1>	 I know paladox said we are on hdds not ssds so that could slow some stuff down
[20:02:18] <Reedy>	 Well, your DB queries aren't excessively slowly
[20:02:50] <Reedy>	 *slower
[20:03:41] <RhinosF1>	 Not really
[20:03:53] <RhinosF1>	 They're nearly instant
[20:04:09] <Reedy>	 You need to work out where the problem actually is before really speculating. Sure you can hypothesising ;)
[20:04:21] <Reedy>	 hypothesise
[20:04:28] <Reedy>	 hypothesize
[20:04:30] <Reedy>	 whatever
[20:04:30] <RhinosF1>	 I didn't really know where to start
[20:04:47] <Reedy>	 As above, play with the profilers and see if you can narrow it down
[20:04:47] <RhinosF1>	 The word profiling is more of an idea than I had a few hours back
[20:04:53] <RhinosF1>	 Thanks
[20:05:04] <RhinosF1>	 paladox: ^ if you wanna start
[20:07:50] <icinga-wm_>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[20:40:03] <wikibugs>	 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Nuria) Some thoughts: I agree with @ema and @BBlack that we cannot expect connections to live "forever" a...
[20:50:36] <paladox>	 Reedy profiled, found that it's due to https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/CentralAuthUser.php#L2362
[20:50:57] <paladox>	 https://phabricator.wikimedia.org/P11610
[20:56:48] <Reedy>	 Not quite
[20:56:54] <Reedy>	 It's due to all of the calls of localUserData ;)
[20:57:25] <Reedy>	 I think that explains the non existent users being slow... exceptions are slow
[20:57:52] <Reedy>	 But if it's reading from localuser...
[20:58:13] <Reedy>	 localUserData being slow is hardly a surprise
[20:58:56] <Reedy>	 As it's doing numerous queries against each database
[21:00:51] <Reedy>	 paladox: Could do with narrowing it down to which query/queries are slow inside it
[21:00:58] <paladox>	 ah
[21:01:06] <paladox>	 well i see this:
[21:01:07] <paladox>	  20.91% 902.220    697 - section.query-m: SELECT ipb_expiry,ipb_block_email,ipb_anon_only,ipb_create_account,ipb_enable_autoblock,ipb_allow_usertalk,comment_ipb_reason.comment_text AS `ipb_reason_text`,comment_ipb_reason.comment_data AS `ipb_reason_data`,comment_ipb_reason.comment_id AS `ipb_reas
[21:02:17] <Reedy>	 Yeah, there should probably be something similar for other queries
[21:02:26] <Reedy>	 I mean, that's not great, but it's not all the time
[21:02:45] <paladox>	 Reedy https://phabricator.wikimedia.org/T255891#6241515
[21:04:42] <Reedy>	 That's for an existing user, right?
[21:04:49] <Reedy>	 What about a nonexistent one?
[21:04:52] <paladox>	 yeh
[21:04:56] * paladox checks
[21:06:21] <paladox>	 Reedy https://phabricator.wikimedia.org/T255891#6241516
[21:08:43] <Reedy>	 The 3rd column is the count of times the function was called, IIRC?
[21:09:28] <icinga-wm_>	 PROBLEM - Stale file for node-exporter textfile in ulsfo on icinga1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti4002:9100 job=node site=ulsfo https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[21:09:37] <Reedy>	 Like, why is it doing 10K more queries for a nonexistent user?
[21:10:06] <paladox>	 yeh, i believe so.
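To compare call counts between the two profiles, a row like the one paladox pasted can be split into fields. The Python below is a rough sketch only: the column meanings follow the guess made in the channel (third column is the call count), the unit of the second column is not stated in the log, and `ProfileRow` and `parse_row` are hypothetical names, not part of any MediaWiki profiler.

```python
import re
from typing import NamedTuple, Optional


class ProfileRow(NamedTuple):
    percent: float  # first column, assumed to be the share of total time
    elapsed: float  # second column; unit not stated in the log
    calls: int      # third column, assumed to be the call count
    name: str       # everything after the " - " separator


ROW_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)%\s+(\d+(?:\.\d+)?)\s+(\d+)\s+-\s+(.*)$")


def parse_row(line: str) -> Optional[ProfileRow]:
    """Parse one profiler row, or return None if the line does not match."""
    m = ROW_RE.match(line)
    if m is None:
        return None
    return ProfileRow(float(m.group(1)), float(m.group(2)), int(m.group(3)), m.group(4))


# Start of the row pasted in the channel (query text shortened here).
row = parse_row(" 20.91% 902.220    697 - section.query-m: SELECT ipb_expiry")
print(row.calls, row.name.split(":")[0])
```

Parsing both profiles this way and diffing `calls` per `name` would show which sections account for the extra queries.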
[21:10:24] <Reedy>	 I'd almost say there's 2 separate issues here
[21:11:34] <Reedy>	 https://www.mediawiki.org/wiki/Manual:How_to_debug#SQL_errors could be useful
[21:11:48] <Reedy>	 see what queries are actually being run
[21:12:31] <paladox>	 Where would that be logged?
[21:12:45] <paladox>	 oh
[21:12:53] <Reedy>	 Also, how many wikis? And are they all in CentralAuth?
[21:13:42] <icinga-wm_>	 PROBLEM - Stale file for node-exporter textfile in eqsin on icinga1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti5002:9100 job=node site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[21:14:05] <paladox>	 Reedy we have 4k wikis.
[21:14:22] <icinga-wm_>	 PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[21:14:24] <paladox>	 and yes.
[21:14:47] <Reedy>	 So... This will obviously help make things slower
[21:14:53] <Reedy>	 Like, you've got over 4x more wikis than WMF
[21:15:36] <paladox>	 yeh
[21:16:00] <Reedy>	 I wonder if some of the assumptions about looking for unattached accounts we still do are really necessary these days
[21:18:42] <paladox>	 hmm, setting "'DBQuery'      => "$wmgLogDir/debuglogs/DBQuery.log"," and wgDebugDumpSql doesn't seem to log.
[21:19:14] <Reedy>	 touch the file first?
[21:20:28] <paladox>	 that didn't seem to fix it.
[21:25:52] <icinga-wm_>	 PROBLEM - Stale file for node-exporter textfile in esams on icinga1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti3003:9100 job=node site=esams https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile
[22:10:12] <icinga-wm_>	 RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[22:59:54] <paladox>	 Reedy it seems to be https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/043514ea3b2c894d98ed84b7ef2c305f5833bb31/includes/CentralAuthUser.php#L2208 (for the non existing user)
[23:00:36] <Reedy>	 [22:15:59] <Reedy> I wonder if some of the assumptions about looking for unattached accounts we still do are really necessary these days
[23:01:43] <paladox>	 ah ok
[23:02:21] <RhinosF1>	 Can one of you sum this up on phab?
[23:02:51] <Reedy>	 I've got an idea that might help a bit though
[23:03:04] <RhinosF1>	 Go on
[23:03:51] <Reedy>	 I'm making a patch
[23:04:11] <RhinosF1>	 :)
[23:04:23] * RhinosF1 will likely look in the morning
[23:05:04] <Reedy>	 How are the DBs arranged? All on one set of servers (primary/replicas)?
[23:05:09] <Reedy>	 or multiple clusters like WMF?
[23:05:35] <RhinosF1>	 paladox: you can explain better ^
[23:05:44] <RhinosF1>	 https://github.com/miraheze/mw-config/blob/master/Database.php
[23:06:22] <Reedy>	 yeah, only two hosts
[23:06:24] <Reedy>	 so seems likely
[23:06:48] <paladox>	 Reedy we use primary (though we have a replica but we don't read from it)
[23:06:55] <paladox>	 basically 2 primary db servers
[23:06:56] <Reedy>	 yeah, it was more separate clusters
[23:07:01] <Reedy>	 https://gerrit.wikimedia.org/r/606764
[23:07:12] <Reedy>	 That should help... As it shouldn't need to do a new connection every time
[23:07:19] <Reedy>	 Should be able to reuse the connections a bit more
[23:07:56] <Reedy>	 Cause that's probably more of the time wasted than actually executing the queries
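The idea of reusing connections can be illustrated with a small Python sketch. It assumes, hypothetically, that every wiki maps to one of two database hosts and caches one handle per host; `FakeConnection` and `ConnectionCache` are made-up names, and this is neither the patch linked above nor MediaWiki's load balancer.

```python
from typing import Dict


class FakeConnection:
    """Stand-in for a database handle; counts how many were opened."""
    opened = 0

    def __init__(self, host: str) -> None:
        self.host = host
        FakeConnection.opened += 1


class ConnectionCache:
    """Hand out one shared connection per host instead of one per wiki."""

    def __init__(self, wiki_to_host: Dict[str, str]) -> None:
        self._wiki_to_host = wiki_to_host
        self._by_host: Dict[str, FakeConnection] = {}

    def get(self, wiki: str) -> FakeConnection:
        host = self._wiki_to_host[wiki]
        if host not in self._by_host:
            # Only the first lookup for a host pays the connection cost.
            self._by_host[host] = FakeConnection(host)
        return self._by_host[host]


# 4000 hypothetical wikis spread over two hosts.
wikis = {f"wiki{i}": f"db{i % 2 + 1}" for i in range(4000)}
cache = ConnectionCache(wikis)
for name in wikis:
    cache.get(name)
print(FakeConnection.opened)
```

With the cache keyed by host, the loop over all wikis opens one connection per host rather than one per wiki.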
[23:08:16] <paladox>	 doesn't seem to help, at least it's still slow
[23:23:33] <Reedy>	 Hmm
[23:23:33] <Reedy>	 >Never call this on handles acquired via getConnectionRef()
[23:25:34] <Reedy>	 paladox: What about if you replace getConnectionRef( with getConnection(
[23:25:53] <paladox>	 just tried that, doesn't seem to do it either :(
[23:34:49] <paladox>	 Reedy oh!
[23:34:57] <paladox>	 it's the inserting that appears slow
[23:35:13] <paladox>	 oh
[23:35:28] <paladox>	 nvm
[23:37:42] <paladox>	 Reedy https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/043514ea3b2c894d98ed84b7ef2c305f5833bb31/includes/CentralAuthUser.php#L2233 seems like a useless if statement
[23:37:47] <paladox>	 since it's already done above
[23:38:03] <Reedy>	 no
[23:38:28] <Reedy>	 there can be no rows, and $this->exists() is still true
[23:38:38] <paladox>	 oh
[23:53:55] <paladox>	 Reedy would this
[23:53:56] <paladox>	 $user = User::newFromName( $this->mName );
[23:53:57] <paladox>	 work?
[23:54:07] <paladox>	 at least using it was fast for me
[23:54:16] <Reedy>	 Work for what?
[23:54:18] <paladox>	 oh
[23:54:20] <paladox>	 nvm
[23:54:35] <paladox>	 Reedy well for importLocalNames
[23:54:47] <paladox>	 i just realised that it was using the same db
[23:54:52] <Reedy>	 Heh, yeah
[23:54:58] <Reedy>	 That's why it's quick, and cached :P