[00:34:33] Operations, SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (AndrewKuznetsov) Chiming in here: the end date is August 31st - this can be confirmed by @BGerdemann
[02:09:22] (PS2) Dzahn: gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 (owner: QChris)
[02:10:00] (PS3) Dzahn: gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[02:12:57] (CR) Dzahn: "is this heira key actually used anywhere yet? it does not seem like it in compiler / repo." [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[02:14:27] (CR) Dzahn: [C: +1] "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158) (owner: QChris)
[02:16:25] (CR) Dzahn: [C: +1] "confirmed. 
per https://www.gerritcodereview.com/2.15.html#support-for-draft-changes-removed already" [puppet] - https://gerrit.wikimedia.org/r/606533 (owner: QChris)
[02:17:03] (CR) Dzahn: [C: +1] gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536 (owner: QChris)
[02:23:28] (PS1) Dzahn: gerrit: remove all database parameters / support [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158)
[02:24:54] (PS1) Dzahn: gerrit (cloud): remove SQL database hiera keys [puppet] - https://gerrit.wikimedia.org/r/606550
[02:26:03] (PS2) Dzahn: gerrit: remove all database parameters / support [puppet] - https://gerrit.wikimedia.org/r/606549 (https://phabricator.wikimedia.org/T254158)
[03:18:54] PROBLEM - SSH on cp5005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:19:40] RECOVERY - SSH on cp5005.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:35:55] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P11593 and previous config saved to /var/cache/conftool/dbconfig/20200619-043554-marostegui.json
[04:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:39:57] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1098:3316', diff saved to https://phabricator.wikimedia.org/P11594 and previous config saved to /var/cache/conftool/dbconfig/20200619-043956-marostegui.json
[04:40:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:43:59] (PS1) Marostegui: db2108: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606554 (https://phabricator.wikimedia.org/T250666)
[04:44:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2108 for reimage', 
diff saved to https://phabricator.wikimedia.org/P11595 and previous config saved to /var/cache/conftool/dbconfig/20200619-044440-marostegui.json
[04:44:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:49] (CR) Marostegui: [C: +2] db2108: Reimage to Buster [puppet] - https://gerrit.wikimedia.org/r/606554 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[05:15:58] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is CRITICAL: 105.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37
[05:20:15] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[05:23:12] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[05:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:46] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[05:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:38] (CR) Muehlenhoff: "Two comments inline and +1 to what Antoine wrote about the before =>" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606286 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[05:34:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2108', diff saved to https://phabricator.wikimedia.org/P11596 and previous config saved to 
/var/cache/conftool/dbconfig/20200619-053402-marostegui.json
[05:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:53] (PS1) Marostegui: db2108: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606555
[05:36:26] (CR) Marostegui: [C: +2] db2108: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606555 (owner: Marostegui)
[05:36:57] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:39:11] (PS1) Marostegui: mariadb: Reimage db2075 and db2111 to Buster [puppet] - https://gerrit.wikimedia.org/r/606556 (https://phabricator.wikimedia.org/T250666)
[05:40:05] (CR) Marostegui: [C: +2] mariadb: Reimage db2075 and db2111 to Buster [puppet] - https://gerrit.wikimedia.org/r/606556 (https://phabricator.wikimedia.org/T250666) (owner: Marostegui)
[05:41:15] (CR) Muehlenhoff: [C: +1] "Looks good to me!" 
[puppet] - https://gerrit.wikimedia.org/r/606437 (https://phabricator.wikimedia.org/T239323) (owner: Jbond)
[05:41:19] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db2075 and db2111 for reimage', diff saved to https://phabricator.wikimedia.org/P11597 and previous config saved to /var/cache/conftool/dbconfig/20200619-054118-marostegui.json
[05:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:47:50] (CR) Muehlenhoff: Add analytics-product system user (1 comment) [puppet] - https://gerrit.wikimedia.org/r/595540 (https://phabricator.wikimedia.org/T230743) (owner: Bearloga)
[05:48:21] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:54:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1112', diff saved to https://phabricator.wikimedia.org/P11598 and previous config saved to /var/cache/conftool/dbconfig/20200619-055430-marostegui.json
[05:55:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:14] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:01:00] (CR) Marostegui: [C: -1] "I would prefer to send this alert to -operations like we do with the rest of them." 
[puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[06:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:13] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:49] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:02:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:05:26] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:36] (PS1) Marostegui: db2075, db2111: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606557
[06:16:22] (CR) Marostegui: [C: +2] db2075, db2111: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606557 (owner: Marostegui)
[06:17:34] (PS3) Muehlenhoff: Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933)
[06:18:38] (PS1) Marostegui: db2132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606558 (https://phabricator.wikimedia.org/T254556)
[06:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2075 db2111', diff saved to https://phabricator.wikimedia.org/P11599 and previous config saved to /var/cache/conftool/dbconfig/20200619-061922-marostegui.json
[06:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:43] !log Stop mysql on db2132 to reimage m1 codfw master - T254556
[06:19:54] (CR) Marostegui: [C: +2] db2132: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/606558 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[06:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:43] 
T254556: Upgrade m1 to Buster and Mariadb 10.4 - https://phabricator.wikimedia.org/T254556
[06:23:00] (CR) Muehlenhoff: [C: +2] Add CAP_DAC_OVERRIDE to the CapabilityBoundingSet [puppet] - https://gerrit.wikimedia.org/r/606433 (https://phabricator.wikimedia.org/T233933) (owner: Muehlenhoff)
[06:23:43] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy
[06:23:51] ^ expected
[06:36:57] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime
[06:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:36] Operations, OTRS, User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (eyazi) I am doing OTRS upgrades on a daily basis and would love to help you guys out with the upgrading process. I don't need direct access to the interface or any data if you...
[06:39:26] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:39:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:40:56] (PS1) Elukey: Revert "Set Bigtop for Hadoop test" [puppet] - https://gerrit.wikimedia.org/r/606605
[06:43:57] (CR) Elukey: [C: +2] Revert "Set Bigtop for Hadoop test" [puppet] - https://gerrit.wikimedia.org/r/606605 (owner: Elukey)
[06:45:33] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy
[06:47:47] !log force reinstall of memcached 1.6 deb packages to ensure that the override is used in addition to the unmodified systemd unit from the deb T233933
[06:47:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:54] T233933: Replicated ticket registry - https://phabricator.wikimedia.org/T233933
[06:51:04] (PS1) Marostegui: db2132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606634 
(https://phabricator.wikimedia.org/T254556)
[06:51:36] (CR) Marostegui: [C: +2] db2132: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/606634 (https://phabricator.wikimedia.org/T254556) (owner: Marostegui)
[06:51:40] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[06:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:53:38] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100%
[06:53:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:54:52] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:55:19] kubetcd2006 is the ganeti reboot, this instance has plain disks
[06:57:10] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[06:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200619T0700)
[07:00:26] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms
[07:02:11] !log rebooting ganeti nodes in eqiad for kernel security updates
[07:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:35] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:10:14] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:12:42] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:56] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:15:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:48] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:18] (CR) Gilles: "I'd rather take care of metadata stripping/color profile substitution separately. 
ImageOptim doesn't care about color profiles and will st" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[07:22:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:59] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:29:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:33] Operations, ops-codfw, netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (ayounsi) Thanks. I did a few small changes, mostly removing bits of the default config. It's good to go now.
[07:39:15] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:39:16] (CR) Marostegui: [C: +1] "We've sync'ed on IRC and it has been cleared up." [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[07:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:22] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1098:3316', diff saved to https://phabricator.wikimedia.org/P11600 and previous config saved to /var/cache/conftool/dbconfig/20200619-074420-marostegui.json
[07:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:46:15] (PS5) Kormat: mariadb: Add monitoring for lag spikes. 
[puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120)
[07:47:00] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:21] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:47:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:47:49] (CR) Kormat: mariadb: Add monitoring for lag spikes. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[07:52:16] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[07:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:54:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[07:54:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:59:08] !log marostegui@cumin2001 dbctl commit (dc=all): 'Depool db1093', diff saved to https://phabricator.wikimedia.org/P11601 and previous config saved to /var/cache/conftool/dbconfig/20200619-075907-marostegui.json
[07:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:02] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:02:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:58] Operations, DBA: refactor mariadb puppet code to have single mapping of multiinstance section to port numbers - https://phabricator.wikimedia.org/T255849 (Kormat)
[08:06:04] Operations, DBA: refactor mariadb puppet code to have single mapping of multiinstance section to port numbers - https://phabricator.wikimedia.org/T255849 (Kormat) p:Triage→Medium
[08:06:29] (PS8) Kormat: [WIP] mariadb: Refactor puppetness. 
[puppet] - https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T255849)
[08:07:02] (Abandoned) Kormat: [WIP] mariadb: Refactor puppetness. [puppet] - https://gerrit.wikimedia.org/r/605188 (https://phabricator.wikimedia.org/T255849) (owner: Kormat)
[08:12:26] (CR) Filippo Giunchedi: [C: +1] mariadb: Add monitoring for lag spikes. [puppet] - https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) (owner: Kormat)
[08:12:58] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:40] !log roll-restart logstash elk5 for "JVM GC Old generation-s runs" alert
[08:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:55] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:01] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single
[08:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:51] (PS1) Filippo Giunchedi: logstash: bump pipeline workers [puppet] - https://gerrit.wikimedia.org/r/606647 (https://phabricator.wikimedia.org/T255243)
[08:26:58] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0)
[08:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:56] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:30:18] (CR) Muehlenhoff: [C: +1] "Looks good to me. Two nits inline, but feel free to ignore." 
(2 comments) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:34:27] (CR) JMeybohm: Initial commit of debian directory (1 comment) [debs/chartmuseum] - https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[08:35:26] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-clu
[08:35:26] &var-topic=All&var-consumer_group=All
[08:35:50] (CR) Thiemo Kreuz (WMDE): [C: -1] "The fact the "average" reduction is reported as being "54%" worried me very much. This should not be possible, unless there is a major los" [mediawiki-config] - https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: Gilles)
[08:42:18] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3052 is OK: HTTP OK: HTTP/1.0 200 OK - 23570 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[08:42:59] looking at the kafka lag alert
[08:45:00] !log roll restart elasticsearch_5@production-logstash-eqiad
[08:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:47] !log backup netbox and run one-time script to reserve first IPs on all infra prefixes on Netbox - T233183
[08:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:50] T233183: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183
[08:46:59] Operations, Continuous-Integration-Infrastructure: Have linters/tests 
results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (hashar) Capture of a robot comment: {F31871199 size=full}
[08:51:17] Operations, Continuous-Integration-Infrastructure: Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (hashar) There are a lot of details on the task to have SonarQube to report straight to Gerrit T217008 and an implementation at https://github.c...
[08:52:22] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 912 threshold =0.34 breach: status: red, number_of_in_flight_fetch: 0, number_of_data_nodes: 2, active_shards_percent_as_number: 21.85089974293059, number_of_nodes: 5, unassigned_shards: 904, cluster_name: production-logstash-eqiad, number_of_pending_tasks: 394, initializing_shards: 8, task_max_waiting_in
[08:52:22] 635, delayed_unassigned_shards: 390, active_shards: 255, timed_out: False, relocating_shards: 0, active_primary_shards: 197 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:18] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch inactive shards 848 threshold =0.34 breach: number_of_data_nodes: 3, initializing_shards: 12, task_max_waiting_in_queue_millis: 261031, number_of_pending_tasks: 602, timed_out: False, active_shards: 319, active_primary_shards: 239, unassigned_shards: 836, number_of_nodes: 6, status: red, active_shards_percent_as_number:
[08:53:18] , cluster_name: production-logstash-eqiad, relocating_shards: 0, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:26] Operations, SRE-tools, Traffic, Goal, and 3 others: Automate generation of Management DNS records from Netbox - https://phabricator.wikimedia.org/T233183 (Volans) I've run the script in production, you can see the output of 
the script in P11603 and the results in Netbox in two ways: * looking for...
[08:53:32] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch inactive shards 814 threshold =0.34 breach: number_of_in_flight_fetch: 0, active_shards_percent_as_number: 30.248500428449017, timed_out: False, initializing_shards: 12, number_of_nodes: 6, number_of_data_nodes: 3, delayed_unassigned_shards: 0, number_of_pending_tasks: 602, status: red, task_max_waiting_in_queue_millis:
[08:53:32] rds: 353, cluster_name: production-logstash-eqiad, active_primary_shards: 266, unassigned_shards: 802, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:40] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch inactive shards 803 threshold =0.34 breach: delayed_unassigned_shards: 0, number_of_data_nodes: 3, active_shards_percent_as_number: 31.191088260497, timed_out: False, task_max_waiting_in_queue_millis: 284807, number_of_pending_tasks: 610, cluster_name: production-logstash-eqiad, active_shards: 364, initializing_shards: 1
[08:53:40] locating_shards: 0, number_of_nodes: 6, active_primary_shards: 275, unassigned_shards: 791, number_of_in_flight_fetch: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:53:56] (CR) Thiemo Kreuz (WMDE): [C: +1] Remove comment about "Same regex as above in https_recv_redirect" [puppet] - https://gerrit.wikimedia.org/r/606457 (owner: Reedy)
[08:54:20] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 707 threshold =0.34 breach: relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, active_shards: 460, active_shards_percent_as_number: 39.41730934018852, initializing_shards: 12, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 324217, number_of_nodes: 6, unassigned_shards:
[08:54:20] 
ding_tasks: 730, cluster_name: production-logstash-eqiad, status: red, number_of_data_nodes: 3, active_primary_shards: 353 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:54:44] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 663 threshold =0.34 breach: number_of_in_flight_fetch: 0, active_primary_shards: 389, relocating_shards: 0, status: red, cluster_name: production-logstash-eqiad, active_shards: 504, number_of_nodes: 6, task_max_waiting_in_queue_millis: 347750, active_shards_percent_as_number: 43.18766066838046, unassigned
[08:54:44] yed_unassigned_shards: 0, number_of_data_nodes: 3, number_of_pending_tasks: 736, initializing_shards: 12, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:55:19] known ^ roll-restarted the cluster
[08:56:52] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, active_shards_percent_as_number: 69.32305055698372, active_primary_shards: 513, number_of_data_nodes: 3, task_max_waiting_in_queue_millis: 473743, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, cluster_name: production-logstash-eqiad, number_of_nodes: 6, unassigne
[08:56:52] ive_shards: 809, timed_out: False, status: yellow, number_of_pending_tasks: 900, initializing_shards: 11 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:57:08] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: active_primary_shards: 513, number_of_data_nodes: 3, active_shards: 842, number_of_nodes: 6, timed_out: False, unassigned_shards: 317, status: yellow, cluster_name: production-logstash-eqiad, active_shards_percent_as_number: 72.15081405312768, delayed_unassigned_shards: 0, number_of_pending_tasks:
[08:57:08] ing_in_queue_millis: 488666, initializing_shards: 8, 
number_of_in_flight_fetch: 3, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:57:16] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: active_primary_shards: 513, number_of_nodes: 6, active_shards_percent_as_number: 74.293059125964, number_of_data_nodes: 3, cluster_name: production-logstash-eqiad, active_shards: 867, relocating_shards: 0, number_of_pending_tasks: 906, status: yellow, unassigned_shards: 292, delayed_unassigned_shar
[08:57:16] g_shards: 8, task_max_waiting_in_queue_millis: 497731, number_of_in_flight_fetch: 0, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:57:46] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 231, number_of_data_nodes: 3, active_primary_shards: 513, number_of_nodes: 6, active_shards_percent_as_number: 79.43444730077121, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_pending_tasks: 8, timed_out: False, task_max_waiting_in_queue_millis: 1134, clus
[08:57:46] on-logstash-eqiad, active_shards: 927, relocating_shards: 0, initializing_shards: 9, status: yellow https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:58:02] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, relocating_shards: 0, cluster_name: production-logstash-eqiad, number_of_data_nodes: 3, delayed_unassigned_shards: 0, active_shards_percent_as_number: 80.80548414738647, unassigned_shards: 217, number_of_pending_tasks: 11, initializing_shards: 7, task_max_waiting_in_queue_millis: 65
[08:58:02] se, active_shards: 943, number_of_in_flight_fetch: 0, active_primary_shards: 513, number_of_nodes: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:58:18] RECOVERY - ElasticSearch 
health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: initializing_shards: 6, number_of_data_nodes: 3, number_of_nodes: 6, active_shards_percent_as_number: 83.54755784061697, unassigned_shards: 186, active_primary_shards: 513, task_max_waiting_in_queue_millis: 0, number_of_in_flight_fetch: 0, status: yellow, delayed_unassigned_shards: 0, number_of_pen
[08:58:18] ive_shards: 975, cluster_name: production-logstash-eqiad, timed_out: False, relocating_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[08:58:46] godog: do you know why we have icinga-wm_ instead of icinga-wm?
[09:00:26] the semi angry version of icinga-wm?
[09:01:37] volans: not ATM no
[09:01:48] vgutierrez: lol
[09:03:54] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[09:06:41] ok I suspect the elastic 5 cluster isn't liking the additional shards change we did earlier in the week, I'm going to revert that before the weekend
[09:10:03] (PS1) Filippo Giunchedi: Revert "logstash: align number of shards with number of ES indexing hosts" [puppet] - https://gerrit.wikimedia.org/r/606651
[09:11:42] (CR) Filippo Giunchedi: [C: +2] Revert "logstash: align number of shards with number of ES indexing hosts" [puppet] - https://gerrit.wikimedia.org/r/606651 (owner: Filippo Giunchedi)
[09:14:01] !log rsync from dumpsdata1003 as root to labstore1007 of dumps output files to catch up, with --bwlimit=160000 up from 80000
[09:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:34] shouldn't impact the labstore server signficantly
[09:17:56] (CR) Jcrespo: [C: +2] "Looks great, thanks." 
[software/transferpy] - https://gerrit.wikimedia.org/r/602719 (https://phabricator.wikimedia.org/T253219) (owner: Privacybatm)
[09:20:26] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:20:34] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:21:04] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1008 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:21:14] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1009 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:21:19] mmhh *sigh* the master didn't like the index template update clearly
[09:21:34] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1010 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:22:00] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1007 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [09:22:06] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-logstash-eqiad: task_max_waiting_in_queue_millis: 37155, number_of_data_nodes: 3, active_shards_percent_as_number: 90.31705227077977, number_of_pending_tasks: 98, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 21, active_primary_shards: 513, timed_out: False, unassigned_shards: 105, status: yellow, reloc [09:22:06] umber_of_nodes: 6, active_shards: 1054, initializing_shards: 8, cluster_name: production-logstash-eqiad https://wikitech.wikimedia.org/wiki/Search%23Administration [09:22:14] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: relocating_shards: 0, number_of_nodes: 6, timed_out: False, unassigned_shards: 105, delayed_unassigned_shards: 0, number_of_in_flight_fetch: 21, number_of_data_nodes: 3, number_of_pending_tasks: 98, cluster_name: production-logstash-eqiad, active_primary_shards: 513, task_max_waiting_in_queue_milli [09:22:14] hards: 1054, initializing_shards: 8, status: yellow, active_shards_percent_as_number: 90.31705227077977 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:22:48] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1008 is OK: OK - elasticsearch status production-logstash-eqiad: active_shards_percent_as_number: 90.31705227077977, cluster_name: production-logstash-eqiad, number_of_data_nodes: 3, unassigned_shards: 105, timed_out: False, initializing_shards: 8, number_of_nodes: 6, 
relocating_shards: 0, number_of_pending_tasks: 99, task_max_waiting_in_queue_millis: 77848, act [09:22:48] : 513, number_of_in_flight_fetch: 33, active_shards: 1054, status: yellow, delayed_unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:22:52] !log restart elasticsearch on logstash1010 [09:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:56] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1009 is OK: OK - elasticsearch status production-logstash-eqiad: delayed_unassigned_shards: 0, unassigned_shards: 105, number_of_pending_tasks: 98, number_of_data_nodes: 3, cluster_name: production-logstash-eqiad, number_of_nodes: 6, relocating_shards: 0, number_of_in_flight_fetch: 33, active_primary_shards: 513, task_max_waiting_in_queue_millis: 85658, timed_ou [09:22:56] yellow, initializing_shards: 8, active_shards: 1054, active_shards_percent_as_number: 90.31705227077977 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:23:40] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1007 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, number_of_pending_tasks: 98, timed_out: False, delayed_unassigned_shards: 0, active_shards: 1055, task_max_waiting_in_queue_millis: 130715, relocating_shards: 0, initializing_shards: 7, cluster_name: production-logstash-eqiad, active_shards_percent_as_number: 90. [09:23:40] mber_of_in_flight_fetch: 29, number_of_data_nodes: 3, active_primary_shards: 513, unassigned_shards: 105 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:27:38] PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:28:52] 1% of 5XX on gets at app layer [09:29:02] also latency increase [09:30:30] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1010 is OK: OK - elasticsearch status production-logstash-eqiad: timed_out: False, cluster_name: production-logstash-eqiad, status: yellow, number_of_pending_tasks: 4, active_primary_shards: 513, delayed_unassigned_shards: 0, number_of_data_nodes: 3, number_of_nodes: 6, relocating_shards: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 2415, u [09:30:30] 347, active_shards: 808, initializing_shards: 12, active_shards_percent_as_number: 69.23736075407027 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:31:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:34:18] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1001:9501 job=burrow partition={0,1,2,3,4,5} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-clu [09:34:18] &var-topic=All&var-consumer_group=All [09:36:33] #page logstash elasticsearch 5 cluster in trouble [09:40:38] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [09:43:10] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [09:50:52] (03PS8) 10Kormat: Add native mysql spicerack module. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) [09:51:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:53:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:58:42] (03CR) 10Kormat: "> Patch Set 6:" (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/603434 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [10:01:34] RECOVERY - Check systemd state on logstash1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:57] dcausse: by any chance are you around? [10:02:06] volans: yes [10:02:30] could you join #wikimedia-sre if not too much trouble? [10:02:38] sure [10:02:49] thx! [10:05:51] (03CR) 10Gilles: "The transparency information is not reduced to 1 bit. It becomes a palette PNG, but the palette colors are full ARGB values." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [10:08:11] (03PS1) 10Vgutierrez: Release 8.0.8-rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 [10:14:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:15:30] (03PS1) 10Muehlenhoff: Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661 [10:16:45] (03CR) 10Jbond: [C: 03+1] Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661 (owner: 10Muehlenhoff) [10:18:08] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:18:13] (03CR) 10JMeybohm: [C: 03+2] Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [10:21:03] (03Merged) 10jenkins-bot: Initial commit of debian directory [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/605940 (https://phabricator.wikimedia.org/T253843) (owner: 10JMeybohm) [10:21:24] !log start closing logstash indices for 2020.03 in elastic 5 eqiad [10:21:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:48] !log marostegui@cumin2001 dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11604 and previous config saved to /var/cache/conftool/dbconfig/20200619-102447-marostegui.json [10:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:59] (03CR) 10Gilles: "My previous commands were wrong in the sense that 
they manipulated the image first, but my point was correct. Here are the first 10 colors" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [10:35:27] !log installing tomcat8 security updates [10:38:19] !log imported chartmuseum_0.12.0-1 to buster-wikimedia [10:38:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:31] moritzm: you might want to repeat yourself [10:45:01] !log installing tomcat8 security updates [10:45:37] jayme: seems it's ignoring just me :-) [10:45:54] :-P [10:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:57] (03PS1) 10Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) [10:49:07] !log close april logstash indices on logstash 5 eqiad [10:49:17] (03CR) 10Marostegui: [C: 04-2] "Wait for the decided date and time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [10:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:08] (03CR) 10Kormat: [C: 04-1] db-eqiad.php: Depool cluster27 (es5) from writes. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [10:51:54] (03PS2) 10Marostegui: db-eqiad.php: Depool cluster27 (es5) from writes. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) [10:52:55] (03PS1) 10Hashar: Fix doc generation for ParamikoExecution [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 [10:53:06] (03CR) 10Jbond: [C: 03+2] profile::icinga: add vhost for external monitoring [puppet] - 10https://gerrit.wikimedia.org/r/606437 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [10:53:09] (03CR) 10Jcrespo: "Do we need to make the cluster temporarily "static" so pt-heartbeat is not checked? This may need performance input, not sure if any of us" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [10:53:24] (03CR) 10Kormat: [C: 03+1] db-eqiad.php: Depool cluster27 (es5) from writes. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [10:53:39] (03CR) 10Marostegui: [C: 04-2] "> Do we need to make the cluster temporarily "static" so pt-heartbeat" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [10:54:16] (03CR) 10Jcrespo: [C: 03+1] Fix doc generation for ParamikoExecution [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 (owner: 10Hashar) [10:55:24] !log installing mesa security updates [10:55:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:22] (03PS1) 10Jbond: icinga-extmon: add new cname for icinga external monitoring [dns] - 10https://gerrit.wikimedia.org/r/606668 [10:59:24] (03PS1) 10Jbond: icingae::external_monitoring: hosts should be an array not string [puppet] - 10https://gerrit.wikimedia.org/r/606667 [10:59:59] (03CR) 10Jbond: [C: 03+2] icingae::external_monitoring: hosts should be an array not string [puppet] - 10https://gerrit.wikimedia.org/r/606667 (owner: 10Jbond) [11:00:54] (03CR) 10Marostegui: [C: 04-2] "Aaron, Tim, we 
want to depool es5 from writes to do a master switchover, the last time we did it at https://gerrit.wikimedia.org/r/#/c/ope" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606663 (https://phabricator.wikimedia.org/T255755) (owner: 10Marostegui) [11:04:13] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [11:06:33] (03PS2) 10Jbond: admin: add Andrew Kuznetsov to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/606147 [11:07:55] (03CR) 10Jbond: [C: 03+2] admin: add Andrew Kuznetsov to ldap only [puppet] - 10https://gerrit.wikimedia.org/r/606147 (owner: 10Jbond) [11:10:29] 10Operations, 10SRE-Access-Requests: Requesting access researchers, statistics-privatedata-users, and analytics-privatedata-users, nda for AndrewKuznetsov - https://phabricator.wikimedia.org/T254939 (10jbond) 05Open→03Resolved a:03jbond @AndrewKuznetsov i have added you to the NDA group so you should be... [11:11:29] (03CR) 10Privacybatm: [C: 03+1] "Looks good! Thank you for this patch." 
[software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 (owner: 10Hashar) [11:13:11] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1012-production-logstash-eqiad on logstash1012 is OK: (C)100 gt (W)80 gt 74.24 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1012&panelId=37 [11:13:44] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/606668 (owner: 10Jbond) [11:14:39] (03CR) 10Jbond: [C: 03+2] icinga-extmon: add new cname for icinga external monitoring [dns] - 10https://gerrit.wikimedia.org/r/606668 (owner: 10Jbond) [11:15:33] jbond42: feel free to ping me when you need to change how external monitoring is calling icinga [11:17:23] volans: i was just looking at that. i see the script checks icinga1001 and icinga2001 directly, so i wonder if it's better to have a cname for icinga[12]001-extmon? [11:18:14] please ignore the wikimedia-extmon alert [11:19:38] jbond42: looking [11:19:55] so we currently have 2 crontab lines [11:19:55] */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga2001.wikimedia.org [11:19:58] */2 * * * * /usr/bin/systemd-cat -t "check_icinga" /usr/local/bin/check_icinga icinga1001.wikimedia.org [11:20:11] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1010-production-logstash-eqiad on logstash1010 is OK: (C)100 gt (W)80 gt 68.14 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1010&panelId=37 [11:21:29] so we can use whatever works for you, but we do check both the active and passive hosts [11:21:39] what's not totally clear to me is why it paged... 
[11:21:47] oh i ran the script manually [11:21:47] did you try it manually? [11:21:50] ah ok [11:21:52] that explains it [11:22:22] (03CR) 10Jcrespo: [C: 03+2] Fix doc generation for ParamikoExecution [software/transferpy] - 10https://gerrit.wikimedia.org/r/606664 (owner: 10Hashar) [11:22:43] i think adding icinga[12]001-extmon.wikimedia.org makes the most sense. otherwise we would need to update the check script to send a different host header [11:23:04] SGTM [11:23:15] ack will update thanks [11:24:24] (03CR) 10Marostegui: "Do you have a task where I can read some background for this patch?" [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat) [11:24:28] jbond42: thanks! please once we migrate update also https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring [11:24:31] accordingly [11:24:41] ack will do [11:24:47] being an unpuppetized host it's important to keep the doc up-to-date [11:25:57] ack [11:26:17] PROBLEM - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [11:26:30] ^^ me will ack [11:27:54] ACKNOWLEDGEMENT - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) John Bond Still configuring https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [11:29:19] (03PS1) 10Jbond: icinga[12]001-extmon add extmon cnames for icinga[12]001 [dns] - 10https://gerrit.wikimedia.org/r/606672 [11:31:53] RECOVERY - Check systemd state on an-tool1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:36:23] (03PS2) 10Vgutierrez: Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 [11:36:37] (03CR) 10jerkins-bot: [V: 04-1] Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 (owner: 10Vgutierrez) [11:36:47] that was fast [11:39:31] (03PS1) 10Marostegui: mariadb: 
Reimage db2116, db2119 and db2130 [puppet] - 10https://gerrit.wikimedia.org/r/606674 (https://phabricator.wikimedia.org/T250666) [11:39:40] (03PS1) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) [11:39:42] !log Reimage db2116 db2119 db2130 [11:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:11] (03PS1) 10Elukey: cumin: add more aliases for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/606676 [11:41:44] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db2116, db2119 and db2130 [puppet] - 10https://gerrit.wikimedia.org/r/606674 (https://phabricator.wikimedia.org/T250666) (owner: 10Marostegui) [11:41:46] (03CR) 10Volans: [C: 04-1] "one typo, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606676 (owner: 10Elukey) [11:42:08] ah!!! thanks! [11:42:15] yw :) [11:42:22] (03CR) 10Elukey: cumin: add more aliases for Hadoop test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606676 (owner: 10Elukey) [11:42:33] (03PS2) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) [11:43:00] elukey: no need for follow up +1 from me [11:43:18] (03PS2) 10Elukey: cumin: add more aliases for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/606676 [11:45:52] ack! 
[11:45:59] (03PS3) 10Vgutierrez: Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 [11:46:29] (03PS1) 10Jbond: profile/icinga/external_monitoring: add dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/606677 [11:46:41] (03CR) 10Elukey: [C: 03+2] cumin: add more aliases for Hadoop test [puppet] - 10https://gerrit.wikimedia.org/r/606676 (owner: 10Elukey) [11:46:54] (03CR) 10Jbond: [V: 03+2 C: 03+2] profile/icinga/external_monitoring: add dummy secrets [labs/private] - 10https://gerrit.wikimedia.org/r/606677 (owner: 10Jbond) [11:49:48] (03CR) 10Ema: [C: 03+1] Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 (owner: 10Vgutierrez) [11:51:28] (03PS3) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) [11:55:21] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [11:58:13] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:59:35] (03PS1) 10Jbond: acme_chief: add extmon names to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) [11:59:59] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [12:00:52] (03PS2) 10Jbond: acme_chief: add extmon names to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) [12:01:34] (03PS3) 10Jbond: acme_chief: add extmon names to SNI list of main icinga cert 
[puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) [12:02:37] (03CR) 10Jbond: "Ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [12:03:16] (03PS2) 10Kormat: mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 [12:03:47] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] ""Extraordinary evidence". Oh dear. Can we please stick to a less aggressive language?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [12:03:59] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 570 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:05:01] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [12:05:04] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [12:05:24] (03PS3) 10Kormat: mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 [12:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:55] (03CR) 10Kormat: "> Do you have a task where I can read some background for this patch?" 
[software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat) [12:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:07:38] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:02] (03PS4) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) [12:08:07] !log marostegui@cumin2001 START - Cookbook sre.hosts.downtime [12:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:04] !log marostegui@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [12:10:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:50] !log marostegui@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:08] !log delete march indices from logstash 5 eqiad to free up space [12:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:04] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/606672 (owner: 10Jbond) [12:24:11] jbond42: FYI there is also the bit in /etc/check_icinga/config.yaml to define the domain of the icinga service [12:24:50] and that's used to determine if the host is active or passive [12:25:04] checking the CNAME [12:26:06] volans: does that mean we don't need the additional dns queries above? i.e. does it also use that value to set the host header? 
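(editor's note) A minimal sketch of the idea under discussion here, not the actual check_icinga code: the check targets each Icinga host directly, sends the configured service domain as the Host header, and treats a host as active when the service CNAME points at it. The function names, the "/" path and the example domain icinga.wikimedia.org are assumptions for illustration only; no network or DNS call is made.

```python
from urllib.request import Request


def build_check_request(host: str, domain: str) -> Request:
    """Build a request aimed at one specific host, presenting the
    service domain as the Host header (hypothetical helper)."""
    # The URL names the individual host; the Host header names the service.
    return Request(f"https://{host}/", headers={"Host": domain})


def is_active(host: str, cname_target: str) -> bool:
    """Return True when the service CNAME target is this host.

    cname_target is whatever a DNS lookup of the service domain
    returned; it is passed in so the sketch stays offline.
    """
    # DNS answers often carry a trailing dot and mixed case.
    return cname_target.rstrip(".").lower() == host.lower()


if __name__ == "__main__":
    req = build_check_request("icinga1001.wikimedia.org", "icinga.wikimedia.org")
    print(req.full_url, req.get_header("Host"))
    print(is_active("icinga1001.wikimedia.org", "icinga1001.wikimedia.org."))
```

With this shape a single service CNAME is enough: both crontab entries keep addressing icinga1001 and icinga2001 by name, and only the configured domain changes.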
[12:26:09] (03PS6) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [12:26:19] s/queries/cnames/ [12:26:46] headers={'Host': domain} [12:27:18] see https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/external-monitoring/+/master/icinga/check_icinga.py#564 [12:27:48] and slightly above for the detection of the active host [12:28:00] (03CR) 10Privacybatm: "I have resolved the comments except these three (WIP):" (035 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [12:28:34] ack thanks, i'll cancel those other changes [12:28:35] I think you can get away with just a single CNAME yes [12:28:40] (03CR) 10Muehlenhoff: [C: 03+2] Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661 (owner: 10Muehlenhoff) [12:28:45] (03PS2) 10Muehlenhoff: Test failover with new memcached session backend [dns] - 10https://gerrit.wikimedia.org/r/606661 [12:29:17] (03Abandoned) 10Jbond: icinga[12]001-extmon add extmon cnames for icinga[12]001 [dns] - 10https://gerrit.wikimedia.org/r/606672 (owner: 10Jbond) [12:30:06] (03Abandoned) 10Jbond: profile::icinga::external_monitoring: add icinga[12]001-extmon aliases [puppet] - 10https://gerrit.wikimedia.org/r/606675 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [12:30:15] (03CR) 10Privacybatm: transferpy: Package transferpy (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [12:31:04] !log Disabling puppet on gerrit1002 (test instance) to do some more testing [12:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:15] (03PS1) 10Marostegui: db2116,db2119,db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606684 [12:32:29] (03PS4) 
10Jbond: acme_chief: add extmon name to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) [12:32:46] (03CR) 10Marostegui: [C: 03+2] db2116,db2119,db2130: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/606684 (owner: 10Marostegui) [12:32:56] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [12:34:47] (03PS1) 10Jbond: profile::icinga::external_monitoring: correct icinga check [puppet] - 10https://gerrit.wikimedia.org/r/606685 [12:36:32] (03CR) 10Marostegui: [C: 03+1] mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat) [12:37:07] (03CR) 10Jbond: [C: 03+2] profile::icinga::external_monitoring: correct icinga check [puppet] - 10https://gerrit.wikimedia.org/r/606685 (owner: 10Jbond) [12:37:20] (03CR) 10Kormat: [C: 03+2] mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat) [12:37:50] (03Merged) 10jenkins-bot: mariadb: Add mbstream to path [software] - 10https://gerrit.wikimedia.org/r/599604 (owner: 10Kormat) [12:41:20] (03CR) 10Vgutierrez: [C: 03+2] Release 8.0.8~rc0-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/606658 (owner: 10Vgutierrez) [12:42:10] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984 (10akosiaris) For what is worth this is probably going to be scheduled for next quarter (so July-September). @eyazi Thanks for you offer, it's much appreciated. Indeed we are con... [12:45:56] (03CR) 10Jcrespo: "answers" (032 comments) [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [12:47:43] (03PS6) 10Kormat: mariadb: Add monitoring for lag spikes. 
[puppet] - 10https://gerrit.wikimedia.org/r/606441 (https://phabricator.wikimedia.org/T253120) [12:48:58] (03PS1) 10Muehlenhoff: Update various references/comments to jessie [puppet] - 10https://gerrit.wikimedia.org/r/606688 [12:49:17] 10Operations, 10observability, 10Patch-For-Review, 10User-jbond: Invalid apache configuration on profile::prometheus::ops hosts - https://phabricator.wikimedia.org/T255124 (10Aklapper) [12:49:21] (03CR) 10Jbond: dnsdist: add parameter for web server configuration (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [12:53:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [12:58:13] (03CR) 10Gilles: "There's nothing aggressive with this language. It's very factual, you make bold claims in your -1, suggesting that previous reviewers miss" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [12:58:39] 10Operations, 10Traffic, 10User-Joe: etcd cluster has Raft Internal errors sporadically - https://phabricator.wikimedia.org/T147209 (10Aklapper) [13:01:49] !log installing cups security updates (client side libs/tools) [13:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:54] (03CR) 10Jbond: [C: 03+2] acme_chief: add extmon name to SNI list of main icinga cert [puppet] - 10https://gerrit.wikimedia.org/r/606678 (https://phabricator.wikimedia.org/T239323) (owner: 10Jbond) [13:13:52] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:18:06] (03CR) 10Alexandros Kosiaris: "Sure. I can reproduce in fact." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [13:25:30] (03PS1) 10Addshore: AdHocLogging for ReplicaMasterAwareRecordIdsAcquirer [extensions/Wikibase] (wmf/1.35.0-wmf.37) - 10https://gerrit.wikimedia.org/r/606692 (https://phabricator.wikimedia.org/T255855) [13:30:57] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 211.9 ge 210 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [13:32:32] that's a false positive I think ^ [13:37:45] (03PS7) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [14:06:14] (03PS1) 10Majavah: betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) [14:07:42] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:08:23] (03CR) 10Privacybatm: "> Patch Set 7: Verified+2" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [14:09:30] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:10:49] (03CR) 10Reedy: [C: 04-1] "Not that globalblocks seem to be used on beta anyway; the table on deploymentwiki is empty. 
So nothing to actually migrate" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:10:58] (03CR) 10Reedy: "Uh, accidental -1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:17:05] (03PS1) 10Majavah: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) [14:17:25] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/606688 (owner: 10Muehlenhoff) [14:17:46] (03CR) 10RhinosF1: [C: 03+1] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:18:20] (03CR) 10RhinosF1: [C: 03+1] betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:18:40] (03PS2) 10Majavah: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) [14:19:24] (03CR) 10RhinosF1: [C: 03+1] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [14:27:42] RECOVERY - icinga-extmon.wikimedia.org requires authentication on icinga1001 is OK: passive https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [14:30:24] (03CR) 10Jcrespo: "> Patch Set 7:" [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) (owner: 10Privacybatm) [14:32:48] PROBLEM - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) 
https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [14:33:31] ACKNOWLEDGEMENT - icinga-extmon.wikimedia.org requires authentication on icinga1001 is CRITICAL: (null) John Bond Investigating check https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [14:35:14] (03PS1) 10Jbond: profile::icinga::external_monitoring: fix typo in check_command [puppet] - 10https://gerrit.wikimedia.org/r/606707 [14:36:16] (03PS1) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [14:38:22] (03CR) 10Thiemo Kreuz (WMDE): [C: 04-1] "I'm sorry, but I will not continue a conversation as hostile as this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/599284 (https://phabricator.wikimedia.org/T252108) (owner: 10Gilles) [14:44:49] (03CR) 10Jbond: [C: 03+2] profile::icinga::external_monitoring: fix typo in check_command [puppet] - 10https://gerrit.wikimedia.org/r/606707 (owner: 10Jbond) [14:51:13] (03CR) 10Muehlenhoff: [C: 03+1] "Yeah, Ganeti should be unblocked for this." [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [14:53:38] RECOVERY - icinga-extmon.wikimedia.org requires authentication on icinga1001 is OK: HTTP OK: Status line output matched HTTP/1.1 403 - 437 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Monitoring/https_unauthorized [14:56:35] (03CR) 10Volans: "I don't dislike the approach, we could query other existing classes but feels weird to me (like the ferm one), so why not." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [14:58:57] (03PS1) 10Majavah: betacluster: Apply global abuse filters from metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) [14:59:11] (03CR) 10Majavah: [C: 04-1] "DNM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [15:06:22] (03CR) 10Gehel: [V: 04-1] "Looks good! A few questions inline. Some is just me not understanding, other are about some weirdness of our deployment process." (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [15:06:30] (03PS1) 10Filippo Giunchedi: kibana: additional settings [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) [15:07:48] (03CR) 10Gehel: [V: 04-1 C: 04-1] sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [15:08:58] (03CR) 10Filippo Giunchedi: "I've verified that e.g. Kibana 5 doesn't barf on unknown settings" [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi) [15:10:25] (03CR) 10Lucas Werkmeister (WMDE): "Looks good to me except for the SPARQL URI (which I now see Gehel already commented on)." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [15:12:19] (03CR) 10Gehel: [C: 04-1] sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [15:13:04] (03PS8) 10Privacybatm: transferpy: Package transferpy [software/transferpy] - 10https://gerrit.wikimedia.org/r/602754 (https://phabricator.wikimedia.org/T253736) [15:13:31] (03PS2) 10Kormat: mariadb: Add 2 profiles to allow finer-grained cumin selection [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) [15:14:49] (03CR) 10Kormat: "> I'm even tempted to propose, in order to make the section profile a bit less dummy, to use it as the entry point for everything section " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [15:15:56] (03CR) 10Cwhite: [C: 03+1] kibana: additional settings [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi) [15:22:20] (03PS5) 10Dzahn: add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) [15:22:22] (03CR) 10Filippo Giunchedi: [C: 03+2] kibana: additional settings [puppet] - 10https://gerrit.wikimedia.org/r/606711 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi) [15:25:06] (03CR) 10Dzahn: [C: 03+2] add IPs for installservers in POPs [dns] - 10https://gerrit.wikimedia.org/r/599883 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [15:28:25] !log roll-restart kibana to apply new settings [15:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:40] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - 
https://phabricator.wikimedia.org/T254157 (10Dzahn) added to DNS: install3001.wikimedia.org has address 91.198.174.63 install3001.wikimedia.org has IPv6 address 2620:0:862:1:91:198:174:63 install40... [15:29:44] 10Operations, 10Patch-For-Review: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 (10Dzahn) [15:31:42] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)210 ge (W)150 ge 99.77 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [15:32:12] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) Thanks will schedule downtime maintenance for the 25th at 9:30am CT to take down the old one and connect the new one. [15:34:04] (03CR) 10Lucas Werkmeister (WMDE): sdoc gui custom config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606297 (https://phabricator.wikimedia.org/T251514) (owner: 10Mstyles) [15:34:23] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move citoid to use TLS only - https://phabricator.wikimedia.org/T255868 (10JMeybohm) [15:36:00] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move zotero to use TLS only - https://phabricator.wikimedia.org/T255869 (10JMeybohm) [15:36:09] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) [15:37:38] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [15:37:49] (03PS1) 10Ssingh: rearrange wikidough data [labs/private] - 10https://gerrit.wikimedia.org/r/606715 [15:38:00] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash2025.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [15:38:46] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash2025.codfw.wmnet, logstash2024.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [15:38:46] that's me ^ checking [15:38:50] !log dzahn@cumin1001:~$ sudo cookbook sre.ganeti.makevm --vcpus 1 --memory 1 --disk 20 --network public ulsfo install4001.wikimedia.org (T254157) [15:38:59] ack, thanks godog [15:39:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-next_443: Servers logstash2025.codfw.wmnet, logstash2024.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:39:51] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "changes to only wikidough's dummy data" [labs/private] - 10https://gerrit.wikimedia.org/r/606715 (owner: 10Ssingh) [15:42:00] !heal stashbot [15:42:08] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics to use TLS only - https://phabricator.wikimedia.org/T255870 (10JMeybohm) [15:42:29] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-analytics-external to use TLS only - https://phabricator.wikimedia.org/T255871 (10JMeybohm) [15:42:31] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-logging-external to use TLS only - https://phabricator.wikimedia.org/T255872 (10JMeybohm) [15:42:33] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventgate-main to use TLS only - https://phabricator.wikimedia.org/T255873 (10JMeybohm) [15:42:37] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move eventstreams to use TLS only - https://phabricator.wikimedia.org/T255874 (10JMeybohm) [15:42:39] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm) [15:42:42] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move 
mobileapps to use TLS only - https://phabricator.wikimedia.org/T255876 (10JMeybohm) [15:42:46] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move proton to use TLS only - https://phabricator.wikimedia.org/T255877 (10JMeybohm) [15:42:52] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move wikifeeds to use TLS only - https://phabricator.wikimedia.org/T255878 (10JMeybohm) [15:43:18] 10Operations, 10vm-requests, 10Patch-For-Review: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) >>! In T254157#6219046, @MoritzMuehlenhoff wrote: > feel free to give install4001.wikimedia.org a shot. Thanks! Just did. First try i followed... [15:43:25] 10Operations, 10ops-codfw, 10netops: codfw: rack/setup new srx300 (mr1) - https://phabricator.wikimedia.org/T255577 (10Papaul) [15:43:30] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([logstash2025.codfw.wmnet, logstash2024.codfw.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [15:46:34] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) [15:46:56] (03CR) 10Krinkle: [C: 03+1] betacluster: Apply Global Blocks at metawiki instead of deploymentwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606699 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [15:47:33] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) @Joe I think cxserver is missing the last two steps as well, correct? 
[15:47:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [15:47:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:19] (03PS1) 10Filippo Giunchedi: kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) [15:49:31] (03CR) 10jerkins-bot: [V: 04-1] kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi) [15:49:48] (03PS1) 10Dzahn: site/DHCP: add install4001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/606718 (https://phabricator.wikimedia.org/T254157) [15:50:16] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes: Move cxserver to use TLS only - https://phabricator.wikimedia.org/T255879 (10JMeybohm) a:05JMeybohm→03None [15:50:51] (03PS2) 10Filippo Giunchedi: kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) [15:51:21] (03CR) 10Dzahn: [C: 03+2] site/DHCP: add install4001.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/606718 (https://phabricator.wikimedia.org/T254157) (owner: 10Dzahn) [15:53:19] (03PS5) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) [15:55:01] (03CR) 10Cwhite: [C: 03+1] kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo Giunchedi) [15:55:35] (03CR) 10Krinkle: [C: 03+1] Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [15:56:34] (03CR) 10Filippo Giunchedi: [C: 03+2] kibana: disable metrics [puppet] - 10https://gerrit.wikimedia.org/r/606717 (https://phabricator.wikimedia.org/T255863) (owner: 10Filippo 
Giunchedi) [15:59:02] (03PS6) 10Ssingh: dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) [15:59:28] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:00:44] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:00:46] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:01:41] (03PS1) 10Dzahn: DHCP: configure install2003 as next-server for install4001 [puppet] - 10https://gerrit.wikimedia.org/r/606720 (https://phabricator.wikimedia.org/T254157) [16:01:54] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:03:02] (03CR) 10Dzahn: [C: 03+2] DHCP: configure install2003 as next-server for install4001 [puppet] - 10https://gerrit.wikimedia.org/r/606720 (https://phabricator.wikimedia.org/T254157) (owner: 10Dzahn) [16:07:18] !log ganeti4003 - rebooting install4001 - trying to bootstrap OS install from install2003 [16:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:17] (03CR) 10Volans: "LGTM, better to double check it with a puppet compiler for each profile when you get a chance." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606708 (https://phabricator.wikimedia.org/T255409) (owner: 10Kormat) [16:13:38] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:13:44] 10Operations, 10Documentation: Improve documentation for mirrors.wikimedia.org - https://phabricator.wikimedia.org/T179856 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie... [16:15:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:50] 10Operations: Review lists of config/sysctl recommendations by "kernel self-protection project" - https://phabricator.wikimedia.org/T142984 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decre... [16:17:48] andre__: ^ mass editing tasks?:) [16:19:22] mutante: Yepp. Mass-unassigning people. Hence cannot do that silently. [16:19:27] 10Operations, 10Mail, 10Wikimedia-Mailing-lists, 10Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529 (10Aklapper) a:05herron→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-li... [16:20:41] 10Operations: finish and automate data retention scripts - https://phabricator.wikimedia.org/T110066 (10Aklapper) a:05ArielGlenn→03None This task has been assigned to the same task owner for more than two years. 
Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly... [16:21:57] 10Operations, 10Mail: Split MXes into inbound and outbound - https://phabricator.wikimedia.org/T175362 (10Aklapper) a:05herron→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly... [16:22:25] 10Operations, 10netops, 10Sustainability (Incident Prevention): ospf link-protection - https://phabricator.wikimedia.org/T167306 (10Aklapper) a:05ayounsi→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-l... [16:23:59] andre__: yep, it's just that wikibugs quits IRC [16:24:19] because it triggers flooding [16:24:29] * andre__ shroogs :) [16:25:54] 10Operations: 2FA for SSH access to the production cluster - https://phabricator.wikimedia.org/T116750 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a... [16:25:56] 10Operations: Track amount of package updates on systems - https://phabricator.wikimedia.org/T116742 (10Aklapper) a:05MoritzMuehlenhoff→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a sl... [16:26:11] * RhinosF1 didn't even know he was subscribed to half the tasks he's getting emails for [16:26:13] 10Operations: Data retention: revise audit bash scripts - https://phabricator.wikimedia.org/T111021 (10Aklapper) a:05ArielGlenn→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get a slightly m... 
[16:26:17] 10Operations: Retention auditing: clean up rules db contents and use - https://phabricator.wikimedia.org/T111020 (10Aklapper) a:05ArielGlenn→03None This task has been assigned to the same task owner for more than two years. Resetting task assignee due to inactivity, to decrease task cookie-licking and to get... [16:28:04] RhinosF1: Heh, sorry :P [16:28:29] andre__: don't mind. :) [16:28:50] * RhinosF1 gets that many emails anyway its barely a spike [16:29:14] https://www.mediawiki.org/wiki/Phabricator/Help/Managing_mail covers how to best ignore Phab email notifications ;) [16:31:19] * mutante recommends turning off all mail from phabricator and instead have the notifications in the browser [16:31:35] andre__: You just made me bump into an ios bug! [16:31:38] and then reading those to get updates (not ignoring them all, heh) [16:32:02] (For the records: Bulk Job Complete, server side at least.) [16:33:42] Does anyone have an iphone running latest ios version (13.5.1) so I can test the bug andre__ made me bump into? [16:34:08] (03CR) 10Ssingh: "This is ready for review. I have used the "merge" strategy and also tried to address the other concerns raised. https://puppet-compiler.wm" [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:34:19] RhinosF1: maybe not in #operations if it's not a bug on SRE level :) [16:34:34] True :) [16:34:43] * RhinosF1 wonders off to an apple channel [16:35:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:36:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:38:30] (03CR) 10Dzahn: [C: 03+1] "right now puppet is broken and compiler shows it would be unbroken after this" [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:38:55] (03CR) 10Ssingh: [C: 03+2] dnsdist: add parameter for web server configuration [puppet] - 10https://gerrit.wikimedia.org/r/606459 (https://phabricator.wikimedia.org/T252132) (owner: 10Ssingh) [16:48:21] 10Operations, 10Documentation: Improve SSH access information in onboarding documentation - https://phabricator.wikimedia.org/T160941 (10Nuria) 05Open→03Resolved [16:49:26] 10Operations, 10ops-eqiad, 10DC-Ops: apply hostname labels to bast1002/WMF4749 - https://phabricator.wikimedia.org/T186625 (10Dzahn) [16:56:50] (03PS1) 10Ssingh: dnsdist: disable the console by default [puppet] - 10https://gerrit.wikimedia.org/r/606729 [16:59:36] ChanServ shutting down .. ok, that's not happening every day [17:00:54] (03PS1) 10Dzahn: icinga: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606730 (https://phabricator.wikimedia.org/T114209) [17:01:21] mutante: maintenance, it was announced 2 hours ago [17:01:21] (03CR) 10Ssingh: [V: 03+2 C: 03+2] "pcc looks good as no change was expected: https://puppet-compiler.wmflabs.org/compiler1001/23346/" [puppet] - 10https://gerrit.wikimedia.org/r/606729 (owner: 10Ssingh) [17:01:54] RhinosF1: aha, thanks! 
[17:02:01] (03PS1) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) [17:02:51] (03CR) 10jerkins-bot: [V: 04-1] betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [17:03:36] * Majavah is confused why that jenkins job failed [17:04:23] (03PS1) 10Dzahn: codesearch: move ferm rules from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606735 (https://phabricator.wikimedia.org/T114209) [17:04:58] Majavah: it's not order correctly [17:05:07] Did you run buildDBLists? [17:05:14] Majavah: "closed-labs.dblist is not alphasorted" [17:05:39] 'closed-labs.dblist' only contains names in 'all.dblist' [17:05:53] eswiki [17:05:53] [17:05:53] [17:05:54] deploymentwiki [17:05:58] but d is before e [17:06:03] (03PS2) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) [17:06:35] Majavah: are you editing the dblist directly? 
[17:06:35] yeah, it is not a huge issue, but before the check was there it list were very kaotic [17:06:48] *the lists [17:06:55] (03CR) 10jerkins-bot: [V: 04-1] betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [17:07:10] *chaotic [17:07:54] * RhinosF1 has an idea [17:08:03] (03PS1) 10Elukey: WIP - Add sre.hadoop.change-distro.py [cookbooks] - 10https://gerrit.wikimedia.org/r/606736 (https://phabricator.wikimedia.org/T244499) [17:08:11] hmh found it [17:08:57] (03PS1) 10Dzahn: contint: move firewall rules for labs to profile [puppet] - 10https://gerrit.wikimedia.org/r/606737 (https://phabricator.wikimedia.org/T114209) [17:08:59] (03PS3) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) [17:09:55] (03CR) 10jerkins-bot: [V: 04-1] betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [17:12:32] (03PS4) 10Majavah: betacluster: Add deploymentwiki to closed-labs.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606733 (https://phabricator.wikimedia.org/T198673) [17:14:04] finally it worked :D [17:15:06] 10Operations, 10vm-requests: esams,ulsfo,eqsin: one VM request each for install_servers - https://phabricator.wikimedia.org/T254157 (10Dzahn) Creating the VM worked fine. Installing the OS on install4001 has not worked yet though. DHCP was working right away, but serving the installer was not. Then i changed... 
[17:18:56] Majavah: :) [17:20:26] (03PS1) 10Dzahn: dumps: move ferm rules for xmldumps from module to profile [puppet] - 10https://gerrit.wikimedia.org/r/606739 (https://phabricator.wikimedia.org/T114209) [17:23:02] 10Operations, 10Wiki-Loves-Monuments, 10Wikimedia-Mailing-lists: Close wlm-us mailing list - https://phabricator.wikimedia.org/T159261 (10Dzahn) [17:23:12] 10Operations, 10Wiki-Loves-Monuments: Close wlm-us mailing list - https://phabricator.wikimedia.org/T159261 (10Dzahn) [17:26:07] 10Operations, 10audits-data-retention: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839 (10Dzahn) [18:07:26] (03PS1) 10Ottomata: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606749 (https://phabricator.wikimedia.org/T238230) [18:08:23] (03CR) 10Ottomata: [C: 03+2] Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606749 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [18:10:07] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bump eventlogging_Test schema version to 1.1.0 to pick up client_dt - T238230 (duration: 00m 59s) [18:10:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:13] T238230: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 [18:16:29] 10Operations, 10Beta-Cluster-Infrastructure, 10MediaWiki-JobQueue, 10Beta-Cluster-reproducible, 10Performance-Team (Radar): Stack overflow when Redis is down - https://phabricator.wikimedia.org/T185055 (10Krinkle) 05Open→03Declined We're no longer on HHVM. We also use Redis for fewer things now. Dec... [18:17:19] (03CR) 10Cwhite: [C: 03+1] "LGTM! From metrics, it seems there is still some headspace under load that would be great to utilize." 
[puppet] - 10https://gerrit.wikimedia.org/r/606647 (https://phabricator.wikimedia.org/T255243) (owner: 10Filippo Giunchedi) [18:48:38] (03PS3) 10Krinkle: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [18:55:17] (03CR) 10Krinkle: [C: 03+1] "This has been cherry-picked on Beta Cluster's puppetmaster, and I ran puppet agent on mediawiki-07 there. There were no errors." [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [19:03:56] (03CR) 10Reedy: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [19:07:08] (03CR) 10Majavah: Redirect beta.wmflabs.org to beta cluster metawiki instead of deploymentwiki (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/606701 (https://phabricator.wikimedia.org/T198673) (owner: 10Majavah) [19:19:19] (03PS1) 10Majavah: betacluster: Add explicit testwikidataclient-test overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) [19:37:02] (03CR) 10Krinkle: [C: 04-1] "You can use '-' instead and then set 'default' for ones where it is not already set." 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah) [19:38:51] (03PS1) 10Ottomata: [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) [19:39:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:40:22] (03CR) 10Ottomata: "Petr and Timo, I'm not sure if this is the best way to do this. It is very simple here, but I could also see adding some code to EventStr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:40:32] (03PS2) 10Majavah: betacluster: Add explicit testwikidataclient-test overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) [19:42:35] (03PS3) 10Majavah: betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) [19:42:42] (03PS2) 10Ottomata: [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) [19:43:15] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wgEventStreams - Allow for some default stream config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606758 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:45:55] anyone got any idea/advice for https://phabricator.wikimedia.org/T255891? Or able to add the timings for wmf db queries if they're useful? 
[19:46:45] (03CR) 10Krinkle: [C: 03+1] betacluster: Disallow wikidataclient-test leaking over [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606755 (https://phabricator.wikimedia.org/T250555) (owner: 10Majavah) [19:48:14] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [19:48:15] RhinosF1: They're all 0.00 sec on a WMF replica [19:49:04] But as the timings on miraheze is pretty small for just the sql queries, it's probably not missing sql indexes etc [19:49:42] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Ottomata) > We could treat a stream as an unending download, and encode the same information that is curr... [19:50:44] Reedy: The queries are pretty low. The page is slow to load though so either I'm missing something in what the page does or it's doing something weird and wonderful. [19:51:47] (03PS2) 10ArielGlenn: restructure rsync of xml/sql dumps from primary source to other servers [puppet] - 10https://gerrit.wikimedia.org/r/605990 (https://phabricator.wikimedia.org/T254856) [19:51:52] You'll really have to do some more indepth benchmarking to see where the slow parts are [19:52:46] The unregistered user being over 2x slower on wmf is potentially you could get someone at WMF to investigate [19:52:48] Reedy: If you have an idea on how to do that, drop a note on the task. I should have shell access soon to mw servers. If not, someone can run it in next few days. 
[19:53:38] https://www.mediawiki.org/wiki/Manual:Profiling [19:54:23] I wonder if it's more related to the cach(es|ing) [19:54:55] * RhinosF1 not sure. I do config and basic maintenance. [19:56:04] You can also do some debugging via browser tools [19:56:10] See if that helps highlight the slow parts [19:57:25] I can look in the next few days [19:58:28] If you can find actual issues from the MW/CA code itself, you can probably tag it as a perf issue etc [19:59:12] I can try and get data from profiling & browser tools [19:59:31] But me understand php enough to know exactly where it is, is unlikely [19:59:46] Like I say, there's a slight issue on WMF wikis too, but it's definitely nowhere near as pronounced [20:00:36] But the fact there's such a big increase on the registered one would suggest it's more of an issue in MH config/setup etc [20:00:46] That's probably because the wmf's servers are 100x better [20:00:59] * RhinosF1 can't lie about our infra [20:01:21] The mainpage different isn't much in absolute terms, but quite a bit in relative terms [20:02:04] I know paladox said we are on hdds not ssds so that could slow some stuff down [20:02:18] Well, your DB queries aren't excessively slowly [20:02:50] *slower [20:03:41] Not really [20:03:53] They're nearly instant [20:04:09] You need to work out where the problem actually is before really speculating. 
Sure you can hypothesising ;) [20:04:21] hypothesise [20:04:28] hypothesize [20:04:30] whatever [20:04:30] I didn't really know where to start [20:04:47] As above, play with the profilers and see if you can narrow it down [20:04:47] The word profiling is more of an idea than I had a few hours back [20:04:53] Thanks [20:05:04] paladox: ^ if you wanna start [20:07:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [20:40:03] 10Operations, 10Analytics, 10Analytics-Kanban, 10EventStreams, and 2 others: EventStreams drops the connection after 15 minutes, which makes it unreliable - https://phabricator.wikimedia.org/T242767 (10Nuria) Some thoughts: I agree with @ema and @BBlack that we cannot expect connections to live "forever" a... [20:50:36] Reedy profiled, found that it's due to https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/master/includes/CentralAuthUser.php#L2362 [20:50:57] https://phabricator.wikimedia.org/P11610 [20:56:48] Not quite [20:56:54] It's due to all of the calls of localUserData ;) [20:57:25] I think that explains the non existent users being slow... exceptions are slow [20:57:52] But if it's reading from localuser... 
[20:58:13] localUserData being slow is hardly a surprise [20:58:56] As it's doing numerous queries against each database [21:00:51] paladox: Could do with narrowing it down to which query/queries are slow inside it [21:00:58] ah [21:01:06] well i see this: [21:01:07] 20.91% 902.220 697 - section.query-m: SELECT ipb_expiry,ipb_block_email,ipb_anon_only,ipb_create_account,ipb_enable_autoblock,ipb_allow_usertalk,comment_ipb_reason.comment_text AS `ipb_reason_text`,comment_ipb_reason.comment_data AS `ipb_reason_data`,comment_ipb_reason.comment_id AS `ipb_reas [21:02:17] Yeah, there should probably be something similar for other queries [21:02:26] I mean, that's not great, but it's not all the time [21:02:45] Reedy https://phabricator.wikimedia.org/T255891#6241515 [21:04:42] That's for an existing user, right? [21:04:49] What about a nonexistent one? [21:04:52] yeh [21:04:56] * paladox checks [21:06:21] Reedy https://phabricator.wikimedia.org/T255891#6241516 [21:08:43] The 3rd column is the count of times the function was called, IIRC? [21:09:28] PROBLEM - Stale file for node-exporter textfile in ulsfo on icinga1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti4002:9100 job=node site=ulsfo https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [21:09:37] Like, why is it doing 10K more queries for a nonexistent user? [21:10:06] yeh, i believe so. [21:10:24] I'd almost say there's 2 seperate issues here [21:11:34] https://www.mediawiki.org/wiki/Manual:How_to_debug#SQL_errors could be useful [21:11:48] see what queries are actually being run [21:12:31] Where would that be logged? [21:12:45] oh [21:12:53] Also, how many wikis? And are they all in CentralAuth? 
[21:13:42] PROBLEM - Stale file for node-exporter textfile in eqsin on icinga1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti5002:9100 job=node site=eqsin https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [21:14:05] Reedy we have 4k wikis. [21:14:22] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [21:14:24] and yes. [21:14:47] So... This will obviously help make things slower [21:14:53] Like, you've got over 4x more wikis than WMF [21:15:36] yeh [21:16:00] I wonder if some of the assumptions about looking for unattached accounts we still do are really necessary these days [21:18:42] hmm, setting "'DBQuery' => "$wmgLogDir/debuglogs/DBQuery.log"," and wgDebugDumpSql doesn't seem to log. [21:19:14] touch the file first? [21:20:28] that didn't seem to fix it. 
[21:25:52] PROBLEM - Stale file for node-exporter textfile in esams on icinga1001 is CRITICAL: cluster=ganeti file=device_smart.prom instance=ganeti3003:9100 job=node site=esams https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [22:10:12] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [22:59:54] Reedy it seems to be https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/043514ea3b2c894d98ed84b7ef2c305f5833bb31/includes/CentralAuthUser.php#L2208 (for the non existing user) [23:00:36] [22:15:59] I wonder if some of the assumptions about looking for unattached accounts we still do are really necessary these days [23:01:43] ah ok [23:02:21] Can one of you sum this up on phab? [23:02:51] I've got an idea that might help a bit though [23:03:04] Go on [23:03:51] I'm making a patch [23:04:11] :) [23:04:23] * RhinosF1 will likely look in the morning [23:05:04] How are the DBs arranged? All one one set of server (primary/replicas)? [23:05:09] or multiple clusters like WMF? [23:05:35] paladox: you can explain better ^ [23:05:44] https://github.com/miraheze/mw-config/blob/master/Database.php [23:06:22] yeah, only two hosts [23:06:24] so seems likely [23:06:48] Reedy we use primary (though we have a replica but we don't read from it) [23:06:55] basically 2 primary db servers [23:06:56] yeah, it was more seperate clusters [23:07:01] https://gerrit.wikimedia.org/r/606764 [23:07:12] That should help... 
As it shouldn't need to do a new connection every time [23:07:19] Should be able to reuse the connections a bit more [23:07:56] Cause that's probably more of the time wasted than actually executing the queries [23:08:16] doesn't seem to help at least still slow [23:23:33] Hmm [23:23:33] >Never call this on handles acquired via getConnectionRef() [23:25:34] paladox: What about if you replace getConnectionRef( with getConnection( [23:25:53] just tried that, doesn't seem to do it either :( [23:34:49] Reedy oh! [23:34:57] it's the inserting that appears slow [23:35:13] oh [23:35:28] nvm [23:37:42] Reedy https://github.com/wikimedia/mediawiki-extensions-CentralAuth/blob/043514ea3b2c894d98ed84b7ef2c305f5833bb31/includes/CentralAuthUser.php#L2233 seems like a useless if statement [23:37:47] since it's already done above [23:38:03] no [23:38:28] there can be no rows, and $this->exists() is still true [23:38:38] oh [23:53:55] Reedy would this [23:53:56] $user = User::newFromName( $this->mName ); [23:53:57] works? [23:54:07] at least using it was fast for me [23:54:16] Work for what? [23:54:18] oh [23:54:20] nvm [23:54:35] Reedy well for importLocalNames [23:54:47] i just realised that it was using the same db [23:54:52] Heh, yeah [23:54:58] That's why it's quick, and cached :P