[00:04:58] (PS1) Ssingh: cescout: fix typo in metadb-configure script [puppet] - https://gerrit.wikimedia.org/r/587892
[00:08:20] (CR) Ssingh: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/21830/cescout1001.eqiad.wmnet/ self-merging a trivial commit as there is no change in pup" [puppet] - https://gerrit.wikimedia.org/r/587892 (owner: Ssingh)
[00:14:06] (PS1) BryanDavis: toolforge: Update python dependencies for pywikibot [puppet] - https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376)
[00:14:09] (PS1) BryanDavis: toolforge: Remove Jessie config for grid engine + bastions [puppet] - https://gerrit.wikimedia.org/r/587895
[00:17:38] (CR) jerkins-bot: [V: -1] toolforge: Update python dependencies for pywikibot [puppet] - https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376) (owner: BryanDavis)
[00:21:15] (PS2) BryanDavis: toolforge: Update python dependencies for pywikibot [puppet] - https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376)
[00:27:47] (PS1) Papaul: DNS: Remove mgmt DNS for heka [dns] - https://gerrit.wikimedia.org/r/587897
[00:28:39] (PS3) BryanDavis: kubernetes: ingress: use HTTP 307 for canonical redirect [software/tools-webservice] - https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: Arturo Borrero Gonzalez)
[00:29:35] Operations, ops-codfw, decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (Papaul)
[00:29:54] Operations, ops-codfw, decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (Papaul) Open→Resolved Complete
[00:30:10] Operations, ops-codfw, DC-Ops, decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (Papaul)
[00:31:00] Operations, ops-codfw, DC-Ops, decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (Papaul) Open→Resolved Complete
[00:32:24] (CR) Papaul: [C: +2] DNS: Remove mgmt DNS for heka [dns] - https://gerrit.wikimedia.org/r/587897 (owner: Papaul)
[00:32:46] (CR) BryanDavis: "I verified that adding the `nginx.ingress.kubernetes.io/permanent-redirect-code: "307"` annotation to an ingress that also has a "nginx.in" (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: Arturo Borrero Gonzalez)
[00:34:26] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (Papaul)
[00:34:41] Operations, ops-codfw, DC-Ops, decommission, Patch-For-Review: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (Papaul) Open→Resolved Complete
[00:56:02] (CR) BryanDavis: [V: +1 C: +1] "Verified generated ingress config via hot patching on toolsbeta-sgebastion-04" [software/tools-webservice] - https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: Arturo Borrero Gonzalez)
[01:19:39] Operations, ops-codfw, Traffic, decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (Papaul)
[01:20:05] Operations, ops-codfw, Traffic, decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (Papaul)
[03:30:08] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:02] RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:42] PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[04:52:59] Operations, Wikimedia-Mailing-lists: add oauth login to mailing lists - https://phabricator.wikimedia.org/T249678 (Gryllida) Yes, lists.wikimedia.org, login is required to change preferences or view archives.
[05:15:34] Operations, Traffic, Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (Vgutierrez) Memory consumption has improved a lot after backporting two HTTP/2 fixes: * https://github.com/apache/trafficserver/pull/5697 (biggest improvement in text) * https://github.com...
[05:15:54] Operations, Traffic, Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (Vgutierrez) p:High→Medium
[05:46:59] Operations, Wikimedia-Mailing-lists, Upstream: add oauth login to mailing lists - https://phabricator.wikimedia.org/T249678 (Peachey88) It looks like upstream mailman doesn't support this, And I can't see any upstream tasks about this. We are also still using a old 2.X release, Once we upgrade to 3...
[05:49:16] Operations, Wikimedia-Mailing-lists, Upstream: Add oauth login to the mailman package for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (Peachey88)
[05:50:44] (PS1) Marostegui: install_server: Reimage pc1008 to Buster [puppet] - https://gerrit.wikimedia.org/r/587906
[05:53:02] (CR) Marostegui: [C: +2] install_server: Reimage pc1008 to Buster [puppet] - https://gerrit.wikimedia.org/r/587906 (owner: Marostegui)
[05:53:50] (CR) Elukey: analytics::refinery::eventlogging-saltrotate: Bootstrap salts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/587815 (owner: Mforns)
[05:56:26] (PS1) Marostegui: install_server: Allow reimage of pc1008 [puppet] - https://gerrit.wikimedia.org/r/587907
[05:57:55] (CR) Marostegui: [C: +2] install_server: Allow reimage of pc1008 [puppet] - https://gerrit.wikimedia.org/r/587907 (owner: Marostegui)
[06:00:07] !log Stop MySQL on pc1008 for upgrade
[06:00:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:35] (PS1) Marostegui: pc1008: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/587909
[06:06:29] (CR) Giuseppe Lavagetto: [C: +2] services_proxy: retry on connect failures for parsoid [puppet] - https://gerrit.wikimedia.org/r/587787 (https://phabricator.wikimedia.org/T249705) (owner: Giuseppe Lavagetto)
[06:11:22] (CR) Marostegui: [C: +2] pc1008: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/587909 (owner: Marostegui)
[06:15:10] (CR) Elukey: [C: +2] "Andrew: merging the change to start testing, if you don't like it I'll revert :)" [puppet] - https://gerrit.wikimedia.org/r/587813 (https://phabricator.wikimedia.org/T240230) (owner: Elukey)
[06:15:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:39] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:47] (PS1) Marostegui: Revert "install_server: Allow reimage of pc1008" [puppet] - https://gerrit.wikimedia.org/r/587913
[06:31:37] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage of pc1008" [puppet] - https://gerrit.wikimedia.org/r/587913 (owner: Marostegui)
[06:33:18] (CR) Elukey: "Should I open a task for SRE access request and wait an SRE meeting?" [puppet] - https://gerrit.wikimedia.org/r/587726 (owner: Elukey)
[06:35:07] (PS2) Marostegui: install_server: Allow reimage of labsdb1011 [puppet] - https://gerrit.wikimedia.org/r/587628 (https://phabricator.wikimedia.org/T249188)
[06:42:37] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:59] <_joe_> XioNoX: ^^
[06:44:07] (CR) Dzahn: [C: +2] zuul: fix dependency on /etc/zuul and package if on buster [puppet] - https://gerrit.wikimedia.org/r/587782 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[06:44:11] checking
[06:44:46] xe-3/2/3 up down Transport: cr2-codfw:xe-5/0/1
[06:45:15] Maintenance Window: 00:01 - 05:00 Central
[06:45:22] central time, thanks Zayo
[06:46:23] but yeah it's a planned zayo maintenance to last until 10am UTC
[06:46:32] I'll ack it until then
[06:46:44] or downtime actually
[06:47:41] (CR) Dzahn: "6377 Apr 10 06:45:54 contint2001 puppet-agent[13813]: (/Stage[main]/Profile::Zuul::Server/Git::Clone[integration/config]/F ile[/etc/zu" [puppet] - https://gerrit.wikimedia.org/r/587782 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[06:47:56] done
[06:48:42] (CR) Dzahn: "/etc/zuul exists now and content has been cloned:" [puppet] - https://gerrit.wikimedia.org/r/587782 (https://phabricator.wikimedia.org/T224591) (owner: Dzahn)
[06:49:05] (PS8) Dzahn: phabricator: remove firewall holes for port 80 from caches [puppet] - https://gerrit.wikimedia.org/r/569100
[06:52:02] (CR) Dzahn: [C: +2] phabricator: remove firewall holes for port 80 from caches [puppet] - https://gerrit.wikimedia.org/r/569100 (owner: Dzahn)
[06:55:57] RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[06:58:11] !log armed keyholder on deploy1001
[06:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:35] !log sodium - sudo -u mirror ftpsync
[07:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:43] RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 4 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:11:59] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:32] (PS1) Elukey: role::analytics_cluster::coordinator: absent hdfs_cleaner [puppet] - https://gerrit.wikimedia.org/r/587915 (https://phabricator.wikimedia.org/T249593)
[07:32:27] (PS1) Ema: Revert "cache: test purged on cp3050" [puppet] - https://gerrit.wikimedia.org/r/587917 (https://phabricator.wikimedia.org/T249583)
[07:34:27] (CR) Vgutierrez: [C: +1] Revert "cache: test purged on cp3050" [puppet] - https://gerrit.wikimedia.org/r/587917 (https://phabricator.wikimedia.org/T249583) (owner: Ema)
[07:34:46] (CR) Ema: [C: +2] Revert "cache: test purged on cp3050" [puppet] - https://gerrit.wikimedia.org/r/587917 (https://phabricator.wikimedia.org/T249583) (owner: Ema)
[07:37:21] !log cp3050: back to vhtcpd for the holidays T249583
[07:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:27] T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583
[07:38:12] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: absent hdfs_cleaner [puppet] - https://gerrit.wikimedia.org/r/587915 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[07:40:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=purged site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:41:57] (PS1) Elukey: role::analytics_cluster::launcher: add hdfs_cleaner [puppet] - https://gerrit.wikimedia.org/r/587919 (https://phabricator.wikimedia.org/T249593)
[07:43:26] (PS1) Dzahn: netbox: switch git::clone to ensure present, not latest [puppet] - https://gerrit.wikimedia.org/r/587920
[07:45:43] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: add hdfs_cleaner [puppet] - https://gerrit.wikimedia.org/r/587919 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[07:48:11] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=purged site=esams Ema purged testing stopped over Easter https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:34] (CR) Dzahn: "sorry, i just realized i did the same before in" [puppet] - https://gerrit.wikimedia.org/r/587920 (owner: Dzahn)
[07:51:41] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[07:52:58] !log closing port 80 on phab hosts for caching servers
[07:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:33] Operations, Wikimedia-Mailing-lists, Upstream: Add oauth login to the mailman package for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (Aklapper) p:Low→Lowest Right, no mention of the term OAuth in their docs: https://mailman.readthedocs.io/en/latest/...
[07:56:12] (PS1) Elukey: role::analytics_cluster::coordinator: absent project_namespace_map [puppet] - https://gerrit.wikimedia.org/r/587961 (https://phabricator.wikimedia.org/T249593)
[07:59:52] Operations, Wikimedia-Mailing-lists, Upstream: Add oauth login to the mailman package for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (Dzahn) There are hundreds of lists and each one has different admins and different users and different settings who is allo...
[08:02:21] (CR) Dzahn: [C: +2] ci: remove blubber Debian package [puppet] - https://gerrit.wikimedia.org/r/587862 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[08:02:39] good morning ;)
[08:03:03] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:03:39] !log hashar@deploy1001 Started deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided)
[08:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:44] !log hashar@deploy1001 Finished deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided) (duration: 00m 05s)
[08:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:32] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: absent project_namespace_map [puppet] - https://gerrit.wikimedia.org/r/587961 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[08:07:49] hashar: good morning. integration::config has been cloned on contint2001 now
[08:08:40] (PS1) Elukey: role::analytics_cluster::launcher: add project_namespace_map [puppet] - https://gerrit.wikimedia.org/r/587964 (https://phabricator.wikimedia.org/T249593)
[08:11:50] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (hashar)
[08:13:24] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: add project_namespace_map [puppet] - https://gerrit.wikimedia.org/r/587964 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[08:19:56] !log hashar@deploy1001 Started deploy [integration/zuul/deploy@6c3ddad]: (no justification provided)
[08:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:07] !log hashar@deploy1001 Finished deploy [integration/zuul/deploy@6c3ddad]: (no justification provided) (duration: 00m 11s)
[08:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:10] (PS1) Elukey: role::analytics_cluster::coordinator: absent sqoop_mediawiki [puppet] - https://gerrit.wikimedia.org/r/587966 (https://phabricator.wikimedia.org/T249593)
[08:24:01] !log update puppet compiler facts
[08:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:35] !log fix comment in deployment ssh key for zuul to include the path to the key on deploy1001
[08:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:37] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (Dzahn) >>! In T224591#6043580, @thcipriani wro...
[08:37:32] Operations, Release-Engineering-Team-TODO, Continuous-Integration-Infrastructure (phase-out-jessie), Patch-For-Review, Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (Dzahn)
[08:39:11] jouncebot: next
[08:39:11] In 73 hour(s) and 50 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1030)
[08:39:53] !log deploy1001 - keyholder disarm, keyholder arm
[08:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:43] (PS10) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587
[08:40:58] (CR) jerkins-bot: [V: -1] Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[08:43:57] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: absent sqoop_mediawiki [puppet] - https://gerrit.wikimedia.org/r/587966 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[08:44:38] !log hashar@deploy1001 Started deploy [zuul/deploy@5a0a03a]: (no justification provided)
[08:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:45] (PS11) Ayounsi: Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587
[08:46:58] !log hashar@deploy1001 Finished deploy [zuul/deploy@5a0a03a]: (no justification provided) (duration: 02m 20s)
[08:47:00] (CR) jerkins-bot: [V: -1] Initial templating for CR routing-options [homer/public] - https://gerrit.wikimedia.org/r/547587 (owner: Ayounsi)
[08:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:13] (PS1) Elukey: role::analytics_cluster::launcher: add sqoop_mediawiki [puppet] - https://gerrit.wikimedia.org/r/587968 (https://phabricator.wikimedia.org/T249593)
[08:51:52] !log hashar@deploy1001 Started deploy [zuul/deploy@4a69913]: (no justification provided)
[08:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:05] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: add sqoop_mediawiki [puppet] - https://gerrit.wikimedia.org/r/587968 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[08:52:09] !log hashar@deploy1001 Finished deploy [zuul/deploy@4a69913]: (no justification provided) (duration: 00m 16s)
[08:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:18] yeah that is a bit spammy sorry about that
[08:52:18] :(
[08:52:26] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users and WMF/NDA restricted tickets for Jim Maddock - https://phabricator.wikimedia.org/T249873 (Aklapper) Hi @jmads, can you please file a separate request for "WMF/NDA restricted tickets"? See #wmf-nda-requests - thanks!
[08:52:39] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (Aklapper)
[08:57:55] (PS1) Elukey: role::analytics_cluster::coordinator: absent druid_load [puppet] - https://gerrit.wikimedia.org/r/587972 (https://phabricator.wikimedia.org/T249593)
[09:02:09] (CR) Elukey: [C: +2] role::analytics_cluster::coordinator: absent druid_load [puppet] - https://gerrit.wikimedia.org/r/587972 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[09:04:21] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (Dzahn)
[09:06:39] (PS1) Elukey: role::analytics_cluster::launcher: add druid_load [puppet] - https://gerrit.wikimedia.org/r/587973 (https://phabricator.wikimedia.org/T249593)
[09:07:24] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (Dzahn) Hi @KFrancis Jim is a contractor and I tried to check the box "User has a valid NDA on file with WMF legal. (This can be checked by Operations via the NDA...
[09:08:27] Operations, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (Dzahn) Adding @Nuria for analytics private data request.
[09:10:07] (CR) Elukey: [C: +2] role::analytics_cluster::launcher: add druid_load [puppet] - https://gerrit.wikimedia.org/r/587973 (https://phabricator.wikimedia.org/T249593) (owner: Elukey)
[09:22:03] (CR) Ayounsi: "> Patch Set 1:" [homer/public] - https://gerrit.wikimedia.org/r/577316 (https://phabricator.wikimedia.org/T246618) (owner: Ayounsi)
[09:22:42] (PS3) Elukey: analytics::refinery::eventlogging-saltrotate: Bootstrap salts [puppet] - https://gerrit.wikimedia.org/r/587815 (owner: Mforns)
[09:24:45] (PS1) Hashar: zuul: use zuul-deployers not contint-root [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591)
[09:25:05] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:25:38] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/21831/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/587815 (owner: Mforns)
[09:27:44] (CR) jerkins-bot: [V: -1] zuul: use zuul-deployers not contint-root [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:30:15] (CR) Dzahn: [C: +2] phabricator weekly changes email: List projects with empty description [puppet] - https://gerrit.wikimedia.org/r/587725 (https://phabricator.wikimedia.org/T249805) (owner: Aklapper)
[09:31:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1089', diff saved to https://phabricator.wikimedia.org/P10952 and previous config saved to /var/cache/conftool/dbconfig/20200410-093129-marostegui.json
[09:31:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:37] Operations, serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (Dzahn) Open→Stalled stalled by T247018
[09:32:40] Operations, ops-codfw, serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (Dzahn)
[09:33:13] (CR) Arturo Borrero Gonzalez: [C: +2] "Thanks Bryan for the testing!" [software/tools-webservice] - https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: Arturo Borrero Gonzalez)
[09:33:49] (Merged) jenkins-bot: kubernetes: ingress: use HTTP 307 for canonical redirect [software/tools-webservice] - https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: Arturo Borrero Gonzalez)
[09:36:58] (PS1) Elukey: role::analytics_cluster::coordinator: absent data_purge [puppet] - https://gerrit.wikimedia.org/r/587975 (https://phabricator.wikimedia.org/T249593)
[09:39:04] (CR) Dzahn: "for some reason the GID is "invalid" per jenkins-bot vote but I don't see why. it does not seem to be duplicate." [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:41:22] (CR) Dzahn: "The gerrit-deployers group from the referenced change does not exist anymore in here?" [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:42:16] (PS2) Dzahn: zuul: use zuul-deployers not contint-root [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:43:14] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: Update python dependencies for pywikibot [puppet] - https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376) (owner: BryanDavis)
[09:43:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1089', diff saved to https://phabricator.wikimedia.org/P10953 and previous config saved to /var/cache/conftool/dbconfig/20200410-094359-marostegui.json
[09:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:24] (CR) Arturo Borrero Gonzalez: [C: +2] toolforge: Remove Jessie config for grid engine + bastions [puppet] - https://gerrit.wikimedia.org/r/587895 (owner: BryanDavis)
[09:45:46] (CR) jerkins-bot: [V: -1] zuul: use zuul-deployers not contint-root [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:49:09] (PS2) Arturo Borrero Gonzalez: toolforge: Remove Jessie config for grid engine + bastions [puppet] - https://gerrit.wikimedia.org/r/587895 (owner: BryanDavis)
[09:53:37] (PS3) Hashar: zuul: use zuul-deployers not contint-root [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591)
[09:55:03] (CR) Hashar: "The test fail because I had a gid 903 while the group is not a system user. The test ensure that system users are between 900 and 950." [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[09:55:33] PROBLEM - PHP opcache health on mw2317 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:56:51] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:57:25] RECOVERY - PHP opcache health on mw2317 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:58:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:04:30] <_joe_> we had a brief unavailability of eventgate-main i would say
[10:07:09] <_joe_> corresponding to it, a spike in system cpu time in eventgate-main https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=67&fullscreen&orgId=1&refresh=1m
[10:07:58] <_joe_> and a corresponding spike in memory usage
[10:08:55] <_joe_> and, quite unsurprisingly, a high number of enqueued jobs in rdkafka
[10:09:15] <_joe_> so maybe we need to raise memory limits? I'll take a look
[10:10:12] <_joe_> I love these dashboards
[10:13:40] (CR) Dzahn: [C: +2] zuul: use zuul-deployers not contint-root [puppet] - https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: Hashar)
[10:17:48] (PS1) Aklapper: phabricator weekly changes email: List open tasks with past due date [puppet] - https://gerrit.wikimedia.org/r/587976 (https://phabricator.wikimedia.org/T249807)
[10:19:16] Operations, CommRel-Specialists-Support (Apr-Jun-2020): CommRel support for FY2019-2020 Q4 DC switchover - https://phabricator.wikimedia.org/T244808 (Elitre) @RLazarus assuming this is still pretty much in the air?
[10:27:07] (PS1) Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[10:40:08] (PS2) Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[10:56:51] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[11:02:05] Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui) The pending hosts `root@cumin1001:/home/marostegui# for i in `mysql.py -hdb1115 -A zarcillo -e "select instance from masters where dc=...
[11:03:24] (CR) Alexandros Kosiaris: [C: +2] "Thanks! awesome!" [deployment-charts] - https://gerrit.wikimedia.org/r/587868 (https://phabricator.wikimedia.org/T239459) (owner: Mholloway)
[11:03:43] (Merged) jenkins-bot: Release new charts for wikifeeds, mobileapps, chromium-render [deployment-charts] - https://gerrit.wikimedia.org/r/587868 (https://phabricator.wikimedia.org/T239459) (owner: Mholloway)
[11:17:15] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:28:13] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 22373 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:37:26] (PS1) Alexandros Kosiaris: mathoid: Enabled named loglevels [deployment-charts] - https://gerrit.wikimedia.org/r/587980 (https://phabricator.wikimedia.org/T239459)
[11:37:49] (CR) Alexandros Kosiaris: [C: +2] mathoid: Enabled named loglevels [deployment-charts] - https://gerrit.wikimedia.org/r/587980 (https://phabricator.wikimedia.org/T239459) (owner: Alexandros Kosiaris)
[11:38:06] (Merged) jenkins-bot: mathoid: Enabled named loglevels [deployment-charts] - https://gerrit.wikimedia.org/r/587980 (https://phabricator.wikimedia.org/T239459) (owner: Alexandros Kosiaris)
[11:38:43] (PS1) Dr0ptp4kt: Make event.editattempthourly accessible in Superset [puppet] - https://gerrit.wikimedia.org/r/587981
[11:39:45] !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' .
[11:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:26] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' .
[11:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:27] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudnet2002-dev.codfw.wmnet, contint2001.wikimedia.org, cloudnet2003-dev.codfw.wmnet, idp-test2001.wikimedia.org, netbox1001.wikimedia.org https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[11:46:25] (PS1) Dzahn: add people1002.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649)
[11:46:28] (Abandoned) Dr0ptp4kt: Make event.editattempthourly accessible in Superset [puppet] - https://gerrit.wikimedia.org/r/587981 (owner: Dr0ptp4kt)
[11:46:50] (CR) jerkins-bot: [V: -1] add people1002.eqiad.wmnet [dns] - https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649) (owner: Dzahn)
[11:47:23] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'production' .
[11:47:24] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'canary' .
[11:47:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:20] (CR) Elukey: [C: -1] "This doesn't work, it was meant to be WIP, sorry gehel for the extra notification :)" [puppet] - https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: Elukey)
[11:50:52] Operations, CX-cxserver, Product-Infrastructure-Team-Backlog, Wikifeeds, and 4 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (akosiaris) Mathoid has been deployed with this change and it...
[11:53:59] PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [11:55:03] (03PS3) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) [11:58:17] (03PS4) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) [11:58:19] (03PS2) 10Dzahn: add people1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649) [12:00:32] (03PS5) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) [12:02:51] (03PS6) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) [12:04:18] (03CR) 10Dzahn: [C: 03+2] add people1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn) [12:07:58] (03PS7) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) [12:08:13] 10Operations, 10vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn) [12:08:16] (03CR) 10Elukey: "Ready for review now!" 
[puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [12:08:20] 10Operations, 10vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn) [12:08:23] 10Operations, 10serviceops, 10Patch-For-Review: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (10Dzahn) [12:09:22] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [12:09:23] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:00] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [12:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:23] !log Creating VM people1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=A vcpus=1 memory=2GB disk=80GB link=private. This may take a few minutes. [12:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:31] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:36] !log Creating VM people1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=A vcpus=1 memory=2GB disk=80GB link=private. 
(T249907) [12:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:41] T249907: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 [12:11:51] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [12:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:08] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [12:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:33] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:14:52] (03PS1) 10Dr0ptp4kt: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 [12:15:27] (03PS2) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 [12:25:17] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 22367 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [12:32:35] 10Operations, 10vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn) Creating the VM failed. spicerack logs don't tell me why. I guess we are out of resources on the Ganeti eqiad cluster. [12:41:23] (03PS1) 10Dzahn: merge microsites into webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650) [12:43:33] 10Operations, 10Mail, 10Epic: Move most (all?) 
exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Open→03Stalled [12:55:57] (03PS1) 10Hashar: admin: fix python deprecation in test suite [puppet] - 10https://gerrit.wikimedia.org/r/587988 [12:55:59] (03PS1) 10Hashar: admin: enhance test output for groups GID [puppet] - 10https://gerrit.wikimedia.org/r/587989 [12:56:01] (03PS1) 10Hashar: admin: show gid in gid test error [puppet] - 10https://gerrit.wikimedia.org/r/587990 [13:00:47] (03PS1) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [13:01:59] 10Operations, 10Traffic, 10Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Vgutierrez) p:05Medium→03High {F31748766} it looks like the leak is still there [13:04:23] (03CR) 10jerkins-bot: [V: 04-1] sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) (owner: 10Arturo Borrero Gonzalez) [13:09:16] (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List open tasks with past due date [puppet] - 10https://gerrit.wikimedia.org/r/587976 (https://phabricator.wikimedia.org/T249807) (owner: 10Aklapper) [13:12:32] !log restarted and re-armed keyholder on deploy1001 to pick up changes for zuul scap deploy [13:12:36] hashar: ^ done [13:12:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:56] magic :) [13:14:08] !log hashar@deploy1001 Started deploy [zuul/deploy@4a69913]: (no justification provided) [13:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:48] !log hashar@deploy1001 Finished deploy [zuul/deploy@4a69913]: (no justification provided) (duration: 00m 40s) [13:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:54] deploy-local failed: {} 
[13:14:55] hehe [13:16:31] ? invalid check to CheckInvalid [13:18:55] yeah gotta dig into that one now :/ [13:27:29] (03PS1) 10Mholloway: MachineVision label blacklist updates, 2020-04-09 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587994 (https://phabricator.wikimedia.org/T249895) [13:33:17] PROBLEM - PHP opcache health on mw2320 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:33:50] if check_type not in _TYPES: [13:33:50] msg = "unknown check type '{}'".format(check_type) [13:33:51] raise CheckInvalid(msg) [13:55:15] RECOVERY - PHP opcache health on mw2320 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [13:57:55] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:58:21] RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon --syslog https://www.mediawiki.org/wiki/Continuous_integration/Zuul [14:11:57] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Mholloway) I'm trying to interpret what this task being closed as declined means. Does it mean that t... [14:25:01] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10WDoranWMF) hey @Mholloway, we are not porting Restbase to k8s so this becomes irrelevant. Restbase wil... 
[14:25:22] (03PS1) 10Andrew Bogott: Openstack rocky serverpackages: require python3-tooz [puppet] - 10https://gerrit.wikimedia.org/r/588003 [14:25:44] (03PS2) 10Andrew Bogott: Openstack rocky serverpackages: require python3-tooz [puppet] - 10https://gerrit.wikimedia.org/r/588003 [14:27:17] (03CR) 10Andrew Bogott: [C: 03+2] Openstack rocky serverpackages: require python3-tooz [puppet] - 10https://gerrit.wikimedia.org/r/588003 (owner: 10Andrew Bogott) [14:29:21] 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh) [14:30:18] 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh) 05Open→03Resolved [14:30:36] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Mholloway) Ah, I see. My interest is specifically in service-runner as my understanding is that it wi... [14:30:50] 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh) cescout and its associated dependencies have been deployed to cescout1001. Marking this as resolved. [14:33:56] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice idea. I like it, but got a comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan) [14:36:03] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) `service-runner` itself is not going anywhere. DHT-based rate limiting however is likely to... 
[14:36:09] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [14:37:07] (03PS1) 10Andrew Bogott: Openstack python3-tooz hack: only install on hosts that really need it [puppet] - 10https://gerrit.wikimedia.org/r/588005 [14:38:23] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Eevans) >>! In T235437#6046252, @Mholloway wrote: > Ah, I see. My interest is specifically in service... [14:41:21] (03CR) 10Andrew Bogott: [C: 03+2] Openstack python3-tooz hack: only install on hosts that really need it [puppet] - 10https://gerrit.wikimedia.org/r/588005 (owner: 10Andrew Bogott) [14:41:23] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Joe) >>! In T235437#6046252, @Mholloway wrote: > Ah, I see. My interest is specifically in service-ru... [14:45:39] 10Operations, 10Services, 10service-runner, 10serviceops, and 3 others: Move service-runner legacy rate limiter into hyperswitch - https://phabricator.wikimedia.org/T249919 (10Pchelolo) [14:46:05] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) Ok, to avoid further confusion, I will do T249919 [14:48:26] 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Mholloway) Thanks, everyone. 
The repo I'm working with is https://github.com/mdholloway/pushd, and ba... [14:50:09] (03PS2) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [14:51:38] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn) [14:53:11] PROBLEM - PHP opcache health on mw2322 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [14:55:59] (03PS3) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [14:57:46] (03PS4) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [14:59:30] (03PS5) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [15:01:32] (03CR) 10CRusnov: [C: 03+1] "I am fine with this, as we all know when we merge changes. It is my fault the alert triggers, for what it's worth." 
[puppet] - 10https://gerrit.wikimedia.org/r/587920 (owner: 10Dzahn) [15:01:43] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Papaul) [15:01:56] (03PS6) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [15:03:00] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) 05Open→03Resolved We 've had no reoccurence with this in the last year, the approach has clearly worked. In the meantime we have understood what the the... [15:03:08] !log restart ats-tls on cp1083 and cp1085 - T249335 [15:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:14] T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 [15:03:29] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Papaul) @volans I realized that the sudo -i wmf-auto-reimage-host command is possible by just giving the user the right to run just that command on cumin and not any ot... [15:03:36] (03PS7) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [15:03:54] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924 (10akosiaris) 05Open→03Resolved a:03akosiaris This has happened for a long time now, resolving. 
[15:03:56] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462 (10akosiaris) [15:04:08] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462 (10akosiaris) [15:04:21] 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462 (10akosiaris) 05Open→03Resolved a:03akosiaris Has happened for a long time now, resolving [15:06:11] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn) Can we use the existing dcops admin group and just adjust the sudo privs to include the wmf-auto-reimage commands? [15:06:43] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Volans) It's not up to me to decide. For context: with the current setup, being able to run the reimage script is equivalent of having global root. There have been some... [15:08:25] 10Operations, 10serviceops: dropped packets to echostore.svc.eqiad 8082/tcp - https://phabricator.wikimedia.org/T238789 (10akosiaris) [15:08:29] 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris) [15:08:39] 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn) John is already in that group and the group has the following sudo privileges. So this is not about a new shell user, it's just about adding a command to the li... 
[15:08:44] 10Operations: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10akosiaris) [15:08:47] 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris) [15:09:40] (03PS8) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [15:11:38] 10Operations, 10Services, 10service-runner, 10serviceops, and 3 others: Move service-runner legacy rate limiter into hyperswitch - https://phabricator.wikimedia.org/T249919 (10Pchelolo) p:05Triage→03Medium [15:13:14] (03CR) 10Dzahn: [C: 03+2] netbox: switch git::clone to ensure present, not latest [puppet] - 10https://gerrit.wikimedia.org/r/587920 (owner: 10Dzahn) [15:13:17] RECOVERY - PHP opcache health on mw2322 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [15:20:31] (03PS9) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [15:24:08] 10Operations, 10serviceops, 10Kubernetes, 10User-fsero, 10User-jijiki: Support e - https://phabricator.wikimedia.org/T249927 (10akosiaris) [15:24:30] 10Operations, 10serviceops, 10Kubernetes, 10User-fsero, 10User-jijiki: Support kubernetes Egress networkpolicies in our helm charts - https://phabricator.wikimedia.org/T249927 (10akosiaris) p:05Triage→03Medium [15:29:12] (03PS10) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) [15:33:22] (03CR) 10Mforns: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/587975 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey) [15:33:31] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: absent data_purge [puppet] - 10https://gerrit.wikimedia.org/r/587975 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey) [15:38:31] (03PS1) 10Elukey: role::analytics_cluster::launcher: add data_purge [puppet] - 10https://gerrit.wikimedia.org/r/588010 (https://phabricator.wikimedia.org/T249593) [15:39:53] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/588010 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey) [15:40:04] (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add data_purge [puppet] - 10https://gerrit.wikimedia.org/r/588010 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey) [16:09:14] (03PS1) 10Elukey: kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012 [16:15:32] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) [16:17:46] (03PS2) 10Elukey: kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012 [16:20:32] (03PS1) 10Bstorm: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) [16:21:06] (03CR) 10jerkins-bot: [V: 04-1] d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm) [16:22:56] (03PS3) 10Elukey: kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012 [16:27:49] 
(03CR) 10Elukey: [C: 03+2] kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012 (owner: 10Elukey) [16:29:24] (03PS2) 10Bstorm: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) [16:29:57] (03CR) 10jerkins-bot: [V: 04-1] d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm) [16:33:44] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10KFrancis) @Dzahn It looks like Jim is currently being onboarded ( please see here: https://office.wikimedia.org/wiki/Office_IT_Weekly_Meeting_Notes-_April_9,_202... [16:35:18] (03PS1) 10Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015 [16:38:30] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21843/" [puppet] - 10https://gerrit.wikimedia.org/r/588015 (owner: 10Elukey) [16:39:31] 10Operations, 10Services, 10service-runner, 10serviceops, and 3 others: Move service-runner legacy rate limiter into hyperswitch - https://phabricator.wikimedia.org/T249919 (10Pchelolo) 05Open→03Declined Actually, after reviewing the code once more, this doesn't seem to be feasible. rate limiter in ser... 
[16:39:38] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul) [16:40:46] PROBLEM - PHP opcache health on mw2319 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [16:45:59] 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Papaul) [16:59:59] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10MNovotny_WMF) Is there anything I can do to help here? [17:00:16] RECOVERY - PHP opcache health on mw2319 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [17:02:29] (03PS1) 10Papaul: DNS Remove mgmt asset tag from cp20[16-20,22-26] [dns] - 10https://gerrit.wikimedia.org/r/588019 [17:05:04] (03PS2) 10Papaul: DNS Remove mgmt asset tag DNS for cp20[16-20,22-26] [dns] - 10https://gerrit.wikimedia.org/r/588019 [17:05:13] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) I have pruned all the containers from... 
[17:06:46] (03CR) 10Papaul: [C: 03+2] DNS Remove mgmt asset tag DNS for cp20[16-20,22-26] [dns] - 10https://gerrit.wikimedia.org/r/588019 (owner: 10Papaul) [17:09:23] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul) [17:09:35] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul) 05Open→03Resolved complete [17:09:49] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul) [17:09:59] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul) 05Open→03Resolved complete [17:10:18] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Papaul) [17:10:28] 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Papaul) 05Open→03Resolved complete [17:16:26] 10Operations, 10CommRel-Specialists-Support (Apr-Jun-2020): CommRel support for FY2019-2020 Q4 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Yeah, as you can imagine with some folks working reduced hours, SRE is mostly focusing on critical work and we've pushed this off. I haven't aske... 
[17:37:07] (03CR) 10Krinkle: [C: 03+1] "Awesome :)" [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse) [17:52:44] (03PS3) 10Bstorm: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) [17:57:05] (03CR) 10Bstorm: "To explain what I did here: dh_python2 was actually failing on deps because the control file lacked the entry of ${python:Depends} (which " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm) [17:58:23] (03CR) 10Bstorm: d/changelog: prepare for 0.67 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm) [18:03:54] (03CR) 10Herron: Enable TLS encryption between kafkatee instances and Kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588015 (owner: 10Elukey) [18:09:37] (03PS1) 10BBlack: Offboarding shell access for anomie [puppet] - 10https://gerrit.wikimedia.org/r/588023 [18:10:30] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Nuria) @MNoorWMF We need the contract end date to provide access and your approval. Besides that the legal and SRE fellows need to assess NDA is been signed. 
[18:12:45] (03CR) 10BBlack: [C: 03+2] Offboarding shell access for anomie [puppet] - 10https://gerrit.wikimedia.org/r/588023 (owner: 10BBlack) [18:29:50] RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37 [18:32:00] PROBLEM - PHP opcache health on mw2318 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:36:27] (03PS3) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) [18:40:24] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt) [18:46:32] (03PS4) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) [18:48:07] (03CR) 10Bstorm: [C: 03+2] d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm) [18:49:22] (03Merged) 10jenkins-bot: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm) [18:50:41] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt) [18:53:48] RECOVERY - PHP opcache health on mw2318 is OK: OK: opcache 
is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [18:56:12] (03CR) 10Gehel: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey) [19:09:14] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:09:14] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10MNoorWMF) I think I was tagged by mistake. Tagging @MNovotny_WMF in case it was missed :) [19:09:50] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:19:35] 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Aklapper) @ovasileva: Could you please check the last comment? (You were not CC'ed so you... 
[19:26:43] (03CR) 10Bstorm: [C: 03+1] sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) (owner: 10Arturo Borrero Gonzalez) [19:26:49] (03PS1) 10RobH: adding gpu skus [software] - 10https://gerrit.wikimedia.org/r/588031 [19:27:45] (03CR) 10RobH: [C: 03+2] adding gpu skus [software] - 10https://gerrit.wikimedia.org/r/588031 (owner: 10RobH) [19:37:50] uhhh interesting that BFD is down for that link just for ipv4, and not for ipv6, and OSPF is still working [19:37:57] !log cdanis@re0.cr1-codfw> clear bfd session address 208.80.153.220 [19:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:14] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:38:52] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:38:58] ... okay I guess [19:39:46] once and for all [19:46:26] state machine go brr [19:59:36] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:01:26] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:01:29] ^ brief latency spike that cleared, correlated with a big jump in s8 scrape time [20:03:31] (03PS5) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) [20:08:06] (03CR) 10jerkins-bot: [V: 04-1] WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt) [20:08:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:08:44] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:10:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:10:37] actually zooming out a bit further, it looks like that was a misread -- latency is persistently up 75 ms or so, and inconsistently spiking further [20:10:40] (03PS6) 10Dr0ptp4kt: WIP: 
Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) [20:12:02] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:12:18] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:18] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:13:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:14:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:15:02] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:17:01] rlazarus: looks like apiservers affected as well? 
[20:17:28] yep [20:17:44] rlazarus: https://grafana.wikimedia.org/d/ifM0GzjWk/cdanis-xxx-php-worker-threads?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-clusters=api_appserver&var-clusters=appserver [20:17:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:18:15] oh, nice find [20:18:46] doesn't tell you much aside from "something the appservers are doing is temporarily getting a lot more expensive" though [20:20:29] zooming out to a week it looks like the latency bump is more or less within normal variation, but the worker thread usage is definitely not [20:24:35] (03PS1) 10BryanDavis: Downgrade another Jessie package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588034 [20:24:36] (03PS1) 10BryanDavis: rebuild_all: build and push base of each series first [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588035 [20:24:56] parsercache hit rate is (if anything) slightly higher than usual [20:25:19] don't see any increases in network traffic on memcached hosts (if anything, slight decrease, which also tracks with something else getting more-expensive) [20:25:40] (03CR) 10BryanDavis: [C: 03+2] Introduce macros for installing composer and npm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578168 (owner: 10BryanDavis) [20:25:52] (03CR) 10BryanDavis: [C: 03+2] Downgrade another Jessie package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588034 (owner: 10BryanDavis) [20:26:04] (03Merged) 10jenkins-bot: Introduce macros for installing composer and npm [docker-images/toollabs-images] - 
10https://gerrit.wikimedia.org/r/578168 (owner: 10BryanDavis) [20:26:13] (03CR) 10BryanDavis: [C: 03+2] rebuild_all: build and push base of each series first [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588035 (owner: 10BryanDavis) [20:26:15] (03Merged) 10jenkins-bot: Downgrade another Jessie package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588034 (owner: 10BryanDavis) [20:26:37] (03Merged) 10jenkins-bot: rebuild_all: build and push base of each series first [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588035 (owner: 10BryanDavis) [20:26:59] rlazarus: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=termbox [20:27:17] wait but 1 rps? nevermind [20:27:59] https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=echostore also interesting, less definitive, not much of an increase in absolute terms [20:28:42] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:31:10] logstash looks like elevated rate of "[{exception_id}] {exception_url} WMFTimeoutException from line 39 of /srv/mediawiki/wmf-config/set-time-limit.php: the execution time limit of 60 seconds was exceeded" [20:31:29] so check me, something is taking 60s to time out and consuming worker threads in the meantime? 
[20:31:58] well, consuming a singular worker thread, but at a high enough rate of occurrence that it's a problem [20:32:10] yeah, 60s is the standard timeout [20:34:28] (03PS1) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [20:35:42] (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [20:35:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:36:41] (03PS2) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) [20:43:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:43:33] rlazarus: I am yet to find a particularly slow appserver backend, and I am yet to find a new bot or something that is sending traffic that consistently times out (looking at https://logstash.wikimedia.org/goto/2b5ba93c1c496724141bb666a627a467 for that) [20:43:46] yeah, likewise on both counts [20:44:44] I wish I knew how to dig more into php-fpm 
state to find out where in the code it's spending its time [20:45:04] something something xhgui? [20:46:08] on mwdebug sure -- I want a little information about live traffic, but without having to slow it down with a full-on profiler [20:46:56] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:47:13] I keep meaning to try applying cool eBPF stuff to the tracepointing support in PHP [20:47:27] something like what you'd get from a JVM thread dump, just "here's a stack trace of where each worker is right now", see if anything jumps out [20:47:33] yeah [20:47:48] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:48:03] ^ that's gotta be unrelated, right [20:48:12] those ones flap all the time [20:48:15] ah okay [20:52:25] rlazarus: there's a logstash dashboard of appserver apache logs, right? [20:52:52] ah i think i found it [20:53:21] mediawiki-apache2 yeah [20:53:27] I looked earlier but nothing jumped out at me [20:53:45] n.b. 
it's only a few appservers though [20:54:42] <_joe_> it's s8 latency [20:54:50] <_joe_> https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=43&fullscreen&orgId=1 [20:55:46] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:56:00] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:56:11] it's like the 5rd time this week that the scrape time is this high [20:56:20] and 2nd or 3rd time there were so many open connections to s8 [20:57:11] <_joe_> I would say that explains the worker threads starvation, probably [20:57:18] <_joe_> the bad scraping time [20:57:30] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:57:35] ugh I was even looking at s8 latency earlier but I dismissed it [20:57:48] thanks _joe_ [20:58:30] <_joe_> so we're out of ideal conditions but still comfortably up I'd say? [20:58:34] yeah [20:58:36] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is OK: HTTP OK: HTTP/1.0 200 OK - 22372 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [20:58:41] yeah, I was about to say it's not worth waking up DBAs over [20:58:45] anything to be done about it apart from that? 
[20:59:40] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:00:52] <_joe_> rlazarus: you can take a look at tendril if anything stands out [21:02:25] looks like db1111 [21:03:20] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:04:00] <_joe_> cdanis: it's the recentchanges host I bet [21:04:22] <_joe_> oh no those are not rc queries [21:04:35] <_joe_> so I'd bet on some client making expensive requests [21:04:46] db1111 isn't in any special group [21:05:08] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:05:45] <_joe_> yes, db1111 has elevated threads usage [21:06:20] i'm inclined to give it less weight [21:06:28] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10BGerdemann) Jim's contract is for 116 hours until Dec 31, 2020, whichever comes first. [21:07:45] db1126 also? 
they're listed together on the slow queries list [21:07:54] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:08:53] rlazarus: https://grafana.wikimedia.org/d/000000273/mysql?panelId=9&fullscreen&orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1111&var-port=9104 [21:09:11] ah thanks [21:09:18] yeah agree that's just 1111 [21:09:19] db11276 doesn't look nearly as bad [21:09:30] db1126 [21:10:04] wow grafana REALLY doesn't like when I click the explore link on that graph [21:10:12] somewhat-confusing idiom there, if you aren't already familiar with it: 'port' is the port of the prometheus exporter, which is 9104 for the default mysql port (when you see nothing but a hostname on https://noc.wikimedia.org/db.php) [21:10:14] <_joe_> rlazarus: in tendril you can click on the single servers and you have a ton of metrics [21:10:19] and then 1xxxx when port :xxxx [21:10:25] ah cool [21:12:03] !log cdanis@cumin1001 dbctl commit (dc=all): 'db1111 seems overloaded', diff saved to https://phabricator.wikimedia.org/P10954 and previous config saved to /var/cache/conftool/dbconfig/20200410-211202-cdanis.json [21:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:14] I just had to refresh myself on the CLI syntax of my own tool 🙃 [21:13:21] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Nuria) @BGerdemann 116 hours seems a short period which hints that data permits will not be needed untll Dec 31st, how would we notified contract is no longer in... 
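Note on the 'port' idiom explained at 21:10:12 and 21:10:19: the Grafana `var-port` value is the port of the Prometheus exporter, not of MySQL itself. A minimal sketch of that mapping as described in the chat follows, in case it helps when building dashboard URLs by hand. The helper name `exporter_port` and the example port 3318 are hypothetical; only 9104 for default-port instances and the "1xxxx when port :xxxx" rule come from the conversation, and 3306 is assumed to be what "the default mysql port" refers to.

```python
from typing import Optional

DEFAULT_MYSQL_PORT = 3306     # standard MySQL port (assumed meaning of "default")
DEFAULT_EXPORTER_PORT = 9104  # exporter port quoted in the chat for default-port hosts


def exporter_port(mysql_port: Optional[int] = None) -> int:
    """Hypothetical helper: guess the Grafana 'var-port' for a MySQL instance.

    None models a db.php entry that shows nothing but a hostname.
    """
    if mysql_port is None or mysql_port == DEFAULT_MYSQL_PORT:
        return DEFAULT_EXPORTER_PORT
    # "1xxxx when port :xxxx": prefix the four-digit MySQL port with a 1.
    return int(f"1{mysql_port:04d}")


if __name__ == "__main__":
    print(exporter_port())      # host listed without a port
    print(exporter_port(3318))  # illustrative multi-instance port
```

This is only a reading of the two chat messages, not a description of how the exporters are actually configured; the real port assignments should be checked before relying on it.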
[21:17:13] cdanis: appserver latency is looking better [21:17:16] i think we were already starting to come out of the woods on appserver latency, but ever since decreasing the weight we don't seem flattopped on 'threads running' on db1111 [21:18:19] appserver threads were looking healthier since about 21:04, same for latency [21:18:40] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is OK: HTTP OK: HTTP/1.0 200 OK - 22371 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [21:20:14] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:22:04] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:22:27] rlazarus: https://w.wiki/MZ6 conjecture: we only ever allow 200 running threads in our mysql configuration [21:22:43] convincing [21:29:47] s8 scraping latency doesn't look like it improved, even though appserver latency is fine now? https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=43&fullscreen&orgId=1 [21:32:41] it's not atrocious? 
same plot: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=11&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-6h&to=now [21:33:26] yeah, but that last spike is characteristic of what we were seeing when it was bad, and that was after the config change [21:33:46] yes [21:33:49] that was db1126 [21:33:56] https://w.wiki/MZ7 [21:34:11] taking more of the load as a result of the config change [21:39:08] rlazarus: https://w.wiki/MZ8 [21:39:42] db1111 scrape time is back to normal; db1126 scrape time is a bit elevated but not bad (and went back down-ish) [21:40:04] and all of this matches up pretty well with worker thread saturation [21:40:54] okay yeah seems good [21:41:47] I'm wiped, going to call it a day there, thanks cdanis <3 [21:41:58] yeah, I'm quite hungry, also calling it a day [21:42:02] good weekend, all :) [21:54:29] 10Operations, 10Parsing-Team, 10Performance-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10aaron) >>! In T244058#6041722, @daniel wrote: > Another though from the TechCom meeting: we could ju... [22:05:12] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [22:08:03] (03PS1) 10CRusnov: icinga: Add git_untracked check [puppet] - 10https://gerrit.wikimedia.org/r/588049 [22:30:31] 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10MNovotny_WMF) @Nuria Ruiz we can make a point of notifying you as soon as the hours are used up. 
[22:40:26] (03PS1) 10Mholloway: MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) [22:41:30] (03CR) 10jerkins-bot: [V: 04-1] MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) (owner: 10Mholloway) [22:42:50] (03PS2) 10Mholloway: MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) [23:02:13] (03PS1) 10Hashar: Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) [23:03:34] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) (owner: 10Hashar) [23:04:58] (03PS2) 10Hashar: Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) [23:05:26] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:06:15] (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) (owner: 10Hashar) [23:06:47] (03PS2) 10CRusnov: icinga: Add git local changes check [puppet] - 10https://gerrit.wikimedia.org/r/588049 [23:08:50] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:09:09] 10Operations, 10Keyholder, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 
10Release-Engineering-Team (Deployment services): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10hashar) Eventually today I went with the issue of having to restart Keyholder and reharm afte... [23:10:03] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [23:10:31] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/588049 (owner: 10CRusnov) [23:13:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:15:04] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:16:18] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 22389 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:19:42] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is OK: HTTP OK: HTTP/1.0 200 OK - 22372 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:35:10] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [23:59:15] 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - 
https://phabricator.wikimedia.org/T249035 (10MusikAnimal) Just letting you know this issue has resumed as of about 4 or 5 hours ago, now requests are timing out every...