[00:04:58] <wikibugs>	 (03PS1) 10Ssingh: cescout: fix typo in metadb-configure script [puppet] - 10https://gerrit.wikimedia.org/r/587892
[00:08:20] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1003/21830/cescout1001.eqiad.wmnet/ self-merging a trivial commit as there is no change in pup" [puppet] - 10https://gerrit.wikimedia.org/r/587892 (owner: 10Ssingh)
[00:14:06] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Update python dependencies for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376)
[00:14:09] <wikibugs>	 (03PS1) 10BryanDavis: toolforge: Remove Jessie config for grid engine + bastions [puppet] - 10https://gerrit.wikimedia.org/r/587895
[00:17:38] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] toolforge: Update python dependencies for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376) (owner: 10BryanDavis)
[00:21:15] <wikibugs>	 (03PS2) 10BryanDavis: toolforge: Update python dependencies for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376)
[00:27:47] <wikibugs>	 (03PS1) 10Papaul: DNS: Remove mgmt DNS for heka [dns] - 10https://gerrit.wikimedia.org/r/587897
[00:28:39] <wikibugs>	 (03PS3) 10BryanDavis: kubernetes: ingress: use HTTP 307 for canonical redirect [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: 10Arturo Borrero Gonzalez)
[00:29:35] <wikibugs>	 10Operations, 10ops-codfw, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10Papaul)
[00:29:54] <wikibugs>	 10Operations, 10ops-codfw, 10decommission: decommission WMF6149 (old pay-lvs2002.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247572 (10Papaul) 05Open→03Resolved Complete
[00:30:10] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10Papaul)
[00:31:00] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission: decommission WMF6144 (old pay-lvs2001.frack.codfw.wmnet) - https://phabricator.wikimedia.org/T247571 (10Papaul) 05Open→03Resolved Complete
[00:32:24] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for heka [dns] - 10https://gerrit.wikimedia.org/r/587897 (owner: 10Papaul)
[00:32:46] <wikibugs>	 (03CR) 10BryanDavis: "I verified that adding the `nginx.ingress.kubernetes.io/permanent-redirect-code: "307"` annotation to an ingress that also has a "nginx.in" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: 10Arturo Borrero Gonzalez)
[00:34:26] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (10Papaul)
[00:34:41] <wikibugs>	 10Operations, 10ops-codfw, 10DC-Ops, 10decommission, 10Patch-For-Review: decommission heka.frack.codfw.wmnet - https://phabricator.wikimedia.org/T248627 (10Papaul) 05Open→03Resolved Complete
[00:56:02] <wikibugs>	 (03CR) 10BryanDavis: [V: 03+1 C: 03+1] "Verified generated ingress config via hot patching on toolsbeta-sgebastion-04" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: 10Arturo Borrero Gonzalez)
[01:19:39] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul)
[01:20:05] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul)
[03:30:08] <icinga-wm>	 PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:02] <icinga-wm>	 RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:42] <icinga-wm>	 PROBLEM - Debian mirror in sync with upstream on sodium is CRITICAL: /srv/mirrors/debian is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[04:52:59] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: add oauth login to mailing lists - https://phabricator.wikimedia.org/T249678 (10Gryllida) Yes, lists.wikimedia.org, login is required to change preferences or view archives.
[05:15:34] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Vgutierrez) Memory consumption has improved a lot after backporting two HTTP/2 fixes: * https://github.com/apache/trafficserver/pull/5697 (biggest improvement in text) * https://github.com...
[05:15:54] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Vgutierrez) p:05High→03Medium
[05:46:59] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: add oauth login to mailing lists - https://phabricator.wikimedia.org/T249678 (10Peachey88) It looks like upstream mailman doesn't support this, and I can't see any upstream tasks about this. We are also still using an old 2.X release. Once we upgrade to 3...
[05:49:16] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Add oauth login to the mailman package for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Peachey88)
[05:50:44] <wikibugs>	 (03PS1) 10Marostegui: install_server: Reimage pc1008 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/587906
[05:53:02] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Reimage pc1008 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/587906 (owner: 10Marostegui)
[05:53:50] <wikibugs>	 (03CR) 10Elukey: analytics::refinery::eventlogging-saltrotate: Bootstrap salts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587815 (owner: 10Mforns)
[05:56:26] <wikibugs>	 (03PS1) 10Marostegui: install_server: Allow reimage of pc1008 [puppet] - 10https://gerrit.wikimedia.org/r/587907
[05:57:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] install_server: Allow reimage of pc1008 [puppet] - 10https://gerrit.wikimedia.org/r/587907 (owner: 10Marostegui)
[06:00:07] <marostegui>	 !log Stop MySQL on pc1008 for upgrade
[06:00:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:04:35] <wikibugs>	 (03PS1) 10Marostegui: pc1008: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/587909
[06:06:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] services_proxy: retry on connect failures for parsoid [puppet] - 10https://gerrit.wikimedia.org/r/587787 (https://phabricator.wikimedia.org/T249705) (owner: 10Giuseppe Lavagetto)
[06:11:22] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] pc1008: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/587909 (owner: 10Marostegui)
[06:15:10] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Andrew: merging the change to start testing, if you don't like it I'll revert :)" [puppet] - 10https://gerrit.wikimedia.org/r/587813 (https://phabricator.wikimedia.org/T240230) (owner: 10Elukey)
[06:15:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[06:15:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:19:39] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[06:19:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:47] <wikibugs>	 (03PS1) 10Marostegui: Revert "install_server: Allow reimage of pc1008" [puppet] - 10https://gerrit.wikimedia.org/r/587913
[06:31:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "install_server: Allow reimage of pc1008" [puppet] - 10https://gerrit.wikimedia.org/r/587913 (owner: 10Marostegui)
[06:33:18] <wikibugs>	 (03CR) 10Elukey: "Should I open a task for SRE access request and wait an SRE meeting?" [puppet] - 10https://gerrit.wikimedia.org/r/587726 (owner: 10Elukey)
[06:35:07] <wikibugs>	 (03PS2) 10Marostegui: install_server: Allow reimage of labsdb1011 [puppet] - 10https://gerrit.wikimedia.org/r/587628 (https://phabricator.wikimedia.org/T249188)
[06:42:37] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:59] <_joe_>	 XioNoX: ^^
[06:44:07] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: fix dependency on /etc/zuul and package if on buster [puppet] - 10https://gerrit.wikimedia.org/r/587782 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[06:44:11] <XioNoX>	 checking
[06:44:46] <XioNoX>	 xe-3/2/3        up    down Transport: cr2-codfw:xe-5/0/1
[06:45:15] <XioNoX>	 Maintenance Window: 00:01 - 05:00 Central 
[06:45:22] <XioNoX>	 central time, thanks Zayo
[06:46:23] <XioNoX>	 but yeah it's a planned zayo maintenance to last until 10am UTC
[06:46:32] <XioNoX>	 I'll ack it until then
[06:46:44] <XioNoX>	 or downtime actually
[06:47:41] <wikibugs>	 (03CR) 10Dzahn: "6377 Apr 10 06:45:54 contint2001 puppet-agent[13813]: (/Stage[main]/Profile::Zuul::Server/Git::Clone[integration/config]/File[/etc/zu" [puppet] - 10https://gerrit.wikimedia.org/r/587782 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[06:47:56] <XioNoX>	 done
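A note on the time arithmetic in the exchange above: the vendor window is given in US Central time, while the alert and the downtime are tracked in UTC. The sketch below shows how "00:01 - 05:00 Central" lines up with "until 10am UTC". It assumes the date is 2020-04-10 (inferred from the dbctl file names later in this log) and that Central means daylight time (CDT, UTC-5) in April; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: in April, US Central time is CDT, i.e. UTC-5.
CDT = timezone(timedelta(hours=-5), name="CDT")


def central_to_utc(year, month, day, hour, minute):
    """Convert a wall-clock time in CDT to the equivalent UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=CDT)
    return local.astimezone(timezone.utc)


# Maintenance window as quoted in the channel (date is an assumption).
window_start = central_to_utc(2020, 4, 10, 0, 1)
window_end = central_to_utc(2020, 4, 10, 5, 0)

print(window_start.strftime("%H:%M UTC"), "to", window_end.strftime("%H:%M UTC"))
```

With these assumptions the window runs from 05:01 to 10:00 UTC, which is consistent with the alert at 06:42 falling inside it.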
[06:48:42] <wikibugs>	 (03CR) 10Dzahn: "/etc/zuul exists now and content has been cloned:" [puppet] - 10https://gerrit.wikimedia.org/r/587782 (https://phabricator.wikimedia.org/T224591) (owner: 10Dzahn)
[06:49:05] <wikibugs>	 (03PS8) 10Dzahn: phabricator: remove firewall holes for port 80 from caches [puppet] - 10https://gerrit.wikimedia.org/r/569100
[06:52:02] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: remove firewall holes for port 80 from caches [puppet] - 10https://gerrit.wikimedia.org/r/569100 (owner: 10Dzahn)
[06:55:57] <icinga-wm>	 RECOVERY - Keyholder SSH agent on deploy1001 is OK: OK: Keyholder is armed with all configured keys. https://wikitech.wikimedia.org/wiki/Keyholder
[06:58:11] <mutante>	 !log armed keyholder on deploy1001
[06:58:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:35] <mutante>	 !log sodium - sudo -u mirror ftpsync
[07:00:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:43] <icinga-wm>	 RECOVERY - Debian mirror in sync with upstream on sodium is OK: /srv/mirrors/debian is over 4 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:11:59] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:18:32] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::coordinator: absent hdfs_cleaner [puppet] - 10https://gerrit.wikimedia.org/r/587915 (https://phabricator.wikimedia.org/T249593)
[07:32:27] <wikibugs>	 (03PS1) 10Ema: Revert "cache: test purged on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/587917 (https://phabricator.wikimedia.org/T249583)
[07:34:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Revert "cache: test purged on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/587917 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema)
[07:34:46] <wikibugs>	 (03CR) 10Ema: [C: 03+2] Revert "cache: test purged on cp3050" [puppet] - 10https://gerrit.wikimedia.org/r/587917 (https://phabricator.wikimedia.org/T249583) (owner: 10Ema)
[07:37:21] <ema>	 !log cp3050: back to vhtcpd for the holidays T249583
[07:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:27] <stashbot>	 T249583: Create vhtcpd replacement - https://phabricator.wikimedia.org/T249583
[07:38:12] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: absent hdfs_cleaner [puppet] - 10https://gerrit.wikimedia.org/r/587915 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[07:40:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=purged site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:41:57] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::launcher: add hdfs_cleaner [puppet] - 10https://gerrit.wikimedia.org/r/587919 (https://phabricator.wikimedia.org/T249593)
[07:43:26] <wikibugs>	 (03PS1) 10Dzahn: netbox: switch git::clone to ensure present, not latest [puppet] - 10https://gerrit.wikimedia.org/r/587920
[07:45:43] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add hdfs_cleaner [puppet] - 10https://gerrit.wikimedia.org/r/587919 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[07:48:11] <icinga-wm>	 ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=purged site=esams Ema purged testing stopped over Easter https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[07:51:34] <wikibugs>	 (03CR) 10Dzahn: "sorry, i just realized i did the same before in" [puppet] - 10https://gerrit.wikimedia.org/r/587920 (owner: 10Dzahn)
[07:51:41] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[07:52:58] <mutante>	 !log closing port 80 on phab hosts for caching servers
[07:53:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:55:33] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Add oauth login to the mailman package for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Aklapper) p:05Low→03Lowest Right, no mention of the term OAuth in their docs: https://mailman.readthedocs.io/en/latest/...
[07:56:12] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::coordinator: absent project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/587961 (https://phabricator.wikimedia.org/T249593)
[07:59:52] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists, 10Upstream: Add oauth login to the mailman package for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Dzahn) There are hundreds of lists and each one has different admins and different users and different settings who is allo...
[08:02:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] ci: remove blubber Debian package [puppet] - 10https://gerrit.wikimedia.org/r/587862 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[08:02:39] <hashar>	 good morning ;)
[08:03:03] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:03:39] <logmsgbot>	 !log hashar@deploy1001 Started deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided)
[08:03:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:03:44] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [docker-pkg/deploy@9f2ba2c]: (no justification provided) (duration: 00m 05s)
[08:03:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:04:32] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: absent project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/587961 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[08:07:49] <mutante>	 hashar: good morning. integration::config has been cloned on contint2001 now
[08:08:40] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::launcher: add project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/587964 (https://phabricator.wikimedia.org/T249593)
[08:11:50] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar)
[08:13:24] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add project_namespace_map [puppet] - 10https://gerrit.wikimedia.org/r/587964 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[08:19:56] <logmsgbot>	 !log hashar@deploy1001 Started deploy [integration/zuul/deploy@6c3ddad]: (no justification provided)
[08:20:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:07] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [integration/zuul/deploy@6c3ddad]: (no justification provided) (duration: 00m 11s)
[08:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:10] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::coordinator: absent sqoop_mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/587966 (https://phabricator.wikimedia.org/T249593)
[08:24:01] <vgutierrez>	 !log update puppet compiler facts
[08:24:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:35] <mutante>	 !log fix comment in deployment ssh key for zuul to include the path to the key on deploy1001
[08:32:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:37] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn) >>! In T224591#6043580, @thcipriani wro...
[08:37:32] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Dzahn)
[08:39:11] <mutante>	 jouncebot: next
[08:39:11] <jouncebot>	 In 73 hour(s) and 50 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200413T1030)
[08:39:53] <mutante>	 !log deploy1001 - keyholder disarm, keyholder arm
[08:39:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:43] <wikibugs>	 (03PS10) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587
[08:40:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi)
[08:43:57] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: absent sqoop_mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/587966 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[08:44:38] <logmsgbot>	 !log hashar@deploy1001 Started deploy [zuul/deploy@5a0a03a]: (no justification provided)
[08:44:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:46:45] <wikibugs>	 (03PS11) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587
[08:46:58] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [zuul/deploy@5a0a03a]: (no justification provided) (duration: 02m 20s)
[08:47:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi)
[08:47:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:13] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::launcher: add sqoop_mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/587968 (https://phabricator.wikimedia.org/T249593)
[08:51:52] <logmsgbot>	 !log hashar@deploy1001 Started deploy [zuul/deploy@4a69913]: (no justification provided)
[08:51:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:05] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add sqoop_mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/587968 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[08:52:09] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [zuul/deploy@4a69913]: (no justification provided) (duration: 00m 16s)
[08:52:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:52:18] <hashar>	 yeah that is a bit spammy sorry about that
[08:52:18] <hashar>	 :(
[08:52:26] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and WMF/NDA restricted tickets for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Aklapper) Hi @jmads, can you please file a separate request for "WMF/NDA restricted tickets"? See #wmf-nda-requests - thanks!
[08:52:39] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Aklapper)
[08:57:55] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::coordinator: absent druid_load [puppet] - 10https://gerrit.wikimedia.org/r/587972 (https://phabricator.wikimedia.org/T249593)
[09:02:09] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: absent druid_load [puppet] - 10https://gerrit.wikimedia.org/r/587972 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[09:04:21] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Dzahn)
[09:06:39] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::launcher: add druid_load [puppet] - 10https://gerrit.wikimedia.org/r/587973 (https://phabricator.wikimedia.org/T249593)
[09:07:24] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Dzahn) Hi @KFrancis Jim is a contractor and I tried to check the box "User has a valid NDA on file with WMF legal. (This can be checked by Operations via the NDA...
[09:08:27] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Dzahn) Adding @Nuria for analytics private data request.
[09:10:07] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add druid_load [puppet] - 10https://gerrit.wikimedia.org/r/587973 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[09:22:03] <wikibugs>	 (03CR) 10Ayounsi: "> Patch Set 1:" [homer/public] - 10https://gerrit.wikimedia.org/r/577316 (https://phabricator.wikimedia.org/T246618) (owner: 10Ayounsi)
[09:22:42] <wikibugs>	 (03PS3) 10Elukey: analytics::refinery::eventlogging-saltrotate: Bootstrap salts [puppet] - 10https://gerrit.wikimedia.org/r/587815 (owner: 10Mforns)
[09:24:45] <wikibugs>	 (03PS1) 10Hashar: zuul: use zuul-deployers not contint-root [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591)
[09:25:05] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[09:25:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/21831/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/587815 (owner: 10Mforns)
[09:27:44] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] zuul: use zuul-deployers not contint-root [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[09:30:15] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List projects with empty description [puppet] - 10https://gerrit.wikimedia.org/r/587725 (https://phabricator.wikimedia.org/T249805) (owner: 10Aklapper)
[09:31:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1089', diff saved to https://phabricator.wikimedia.org/P10952 and previous config saved to /var/cache/conftool/dbconfig/20200410-093129-marostegui.json
[09:31:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:37] <wikibugs>	 10Operations, 10serviceops: move all 86 new codfw appservers into production (mw2[291-2377].codfw.wmnet) - https://phabricator.wikimedia.org/T247021 (10Dzahn) 05Open→03Stalled stalled by T247018
[09:32:40] <wikibugs>	 10Operations, 10ops-codfw, 10serviceops: (Need by: TBD) rack/setup/install 86 new codfw mw systems - https://phabricator.wikimedia.org/T241852 (10Dzahn)
[09:33:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "Thanks Bryan for the testing!" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: 10Arturo Borrero Gonzalez)
[09:33:49] <wikibugs>	 (03Merged) 10jenkins-bot: kubernetes: ingress: use HTTP 307 for canonical redirect [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/587807 (https://phabricator.wikimedia.org/T249843) (owner: 10Arturo Borrero Gonzalez)
[09:36:58] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::coordinator: absent data_purge [puppet] - 10https://gerrit.wikimedia.org/r/587975 (https://phabricator.wikimedia.org/T249593)
[09:39:04] <wikibugs>	 (03CR) 10Dzahn: "for some reason the GID is "invalid" per jenkins-bot vote but I don't see why. it does not seem to be duplicate." [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[09:41:22] <wikibugs>	 (03CR) 10Dzahn: "The gerrit-deployers group from the referenced change does not exist anymore in here?" [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[09:42:16] <wikibugs>	 (03PS2) 10Dzahn: zuul: use zuul-deployers not contint-root [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[09:43:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Update python dependencies for pywikibot [puppet] - 10https://gerrit.wikimedia.org/r/587894 (https://phabricator.wikimedia.org/T248376) (owner: 10BryanDavis)
[09:43:59] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Give more weight to db1089', diff saved to https://phabricator.wikimedia.org/P10953 and previous config saved to /var/cache/conftool/dbconfig/20200410-094359-marostegui.json
[09:44:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:44:24] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: Remove Jessie config for grid engine + bastions [puppet] - 10https://gerrit.wikimedia.org/r/587895 (owner: 10BryanDavis)
[09:45:46] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] zuul: use zuul-deployers not contint-root [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[09:49:09] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: Remove Jessie config for grid engine + bastions [puppet] - 10https://gerrit.wikimedia.org/r/587895 (owner: 10BryanDavis)
[09:53:37] <wikibugs>	 (03PS3) 10Hashar: zuul: use zuul-deployers not contint-root [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591)
[09:55:03] <wikibugs>	 (03CR) 10Hashar: "The test fails because I had a gid 903 while the group is not a system user. The test ensures that system users are between 900 and 950." [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
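The rule described in the comment above can be sketched roughly as follows. This is not the actual CI test from the puppet repository; it is one reading of the comment, in which a GID inside 900 to 950 is only acceptable for a system group and a system group must sit inside that range. The constant and function names are hypothetical, and treating both bounds as inclusive is an assumption.

```python
# Hypothetical bounds taken from the review comment; inclusiveness is assumed.
SYSTEM_GID_MIN = 900
SYSTEM_GID_MAX = 950


def gid_is_consistent(gid: int, is_system: bool) -> bool:
    """Return True when the GID matches the group's system/non-system status."""
    in_system_range = SYSTEM_GID_MIN <= gid <= SYSTEM_GID_MAX
    # A system group must be in the range; a non-system group must be outside it.
    return in_system_range == is_system


# The case from the comment: gid 903 on a group that is not a system group.
print(gid_is_consistent(903, is_system=False))
```

Under this reading the check rejects gid 903 for a non-system group, which matches the failing jenkins vote mentioned earlier in the log.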
[09:55:33] <icinga-wm>	 PROBLEM - PHP opcache health on mw2317 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:56:51] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:57:25] <icinga-wm>	 RECOVERY - PHP opcache health on mw2317 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[09:58:43] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:04:30] <_joe_>	 we had a brief unavailability of eventgate-main i would say
[10:07:09] <_joe_>	 corresponding to it, a spike in system cpu time in eventgate-main https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?panelId=67&fullscreen&orgId=1&refresh=1m
[10:07:58] <_joe_>	 and a corresponding spike in memory usage
[10:08:55] <_joe_>	 and, quite unsurprisingly, a high number of enqueued jobs in rdkafka
[10:09:15] <_joe_>	 so maybe we need to raise memory limits? I'll take a look
[10:10:12] <_joe_>	 I love these dashboards
[10:13:40] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: use zuul-deployers not contint-root [puppet] - 10https://gerrit.wikimedia.org/r/587974 (https://phabricator.wikimedia.org/T224591) (owner: 10Hashar)
[10:17:48] <wikibugs>	 (03PS1) 10Aklapper: phabricator weekly changes email: List open tasks with past due date [puppet] - 10https://gerrit.wikimedia.org/r/587976 (https://phabricator.wikimedia.org/T249807)
[10:19:16] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Apr-Jun-2020): CommRel support for FY2019-2020 Q4 DC switchover - https://phabricator.wikimedia.org/T244808 (10Elitre) @RLazarus assuming this is still pretty much in the air?
[10:27:07] <wikibugs>	 (03PS1) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[10:40:08] <wikibugs>	 (03PS2) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[10:56:51] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[11:02:05] <wikibugs>	 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) The pending hosts `root@cumin1001:/home/marostegui# for i in `mysql.py -hdb1115 -A zarcillo -e "select instance from masters where dc=...
[11:03:24] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! awesome!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/587868 (https://phabricator.wikimedia.org/T239459) (owner: 10Mholloway)
[11:03:43] <wikibugs>	 (03Merged) 10jenkins-bot: Release new charts for wikifeeds, mobileapps, chromium-render [deployment-charts] - 10https://gerrit.wikimedia.org/r/587868 (https://phabricator.wikimedia.org/T239459) (owner: 10Mholloway)
[11:17:15] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:28:13] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1083 is OK: HTTP OK: HTTP/1.0 200 OK - 22373 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[11:37:26] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: mathoid: Enabled named loglevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/587980 (https://phabricator.wikimedia.org/T239459)
[11:37:49] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: Enabled named loglevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/587980 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris)
[11:38:06] <wikibugs>	 (03Merged) 10jenkins-bot: mathoid: Enabled named loglevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/587980 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris)
[11:38:43] <wikibugs>	 (03PS1) 10Dr0ptp4kt: Make event.editattempthourly accessible in Superset [puppet] - 10https://gerrit.wikimedia.org/r/587981
[11:39:45] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'mathoid' for release 'staging' .
[11:39:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:26] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'mathoid' for release 'production' .
[11:44:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:27] <icinga-wm>	 PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: cloudnet2002-dev.codfw.wmnet, contint2001.wikimedia.org, cloudnet2003-dev.codfw.wmnet, idp-test2001.wikimedia.org, netbox1001.wikimedia.org https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes
[11:46:25] <wikibugs>	 (03PS1) 10Dzahn: add people1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649)
[11:46:28] <wikibugs>	 (03Abandoned) 10Dr0ptp4kt: Make event.editattempthourly accessible in Superset [puppet] - 10https://gerrit.wikimedia.org/r/587981 (owner: 10Dr0ptp4kt)
[11:46:50] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] add people1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn)
[11:47:23] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'production' .
[11:47:24] <logmsgbot>	 !log akosiaris@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'mathoid' for release 'canary' .
[11:47:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:20] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "This doesn't work, it was meant to be WIP, sorry gehel for the extra notification :)" [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey)
[11:50:52] <wikibugs>	 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, and 4 others: service-runner apps (wikifeeds/cxserver at the least) running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10akosiaris) Mathoid has been deployed with this change and it...
[11:53:59] <icinga-wm>	 PROBLEM - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is CRITICAL: 104.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[11:55:03] <wikibugs>	 (03PS3) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[11:58:17] <wikibugs>	 (03PS4) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[11:58:19] <wikibugs>	 (03PS2) 10Dzahn: add people1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649)
[12:00:32] <wikibugs>	 (03PS5) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[12:02:51] <wikibugs>	 (03PS6) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[12:04:18] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] add people1002.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/587982 (https://phabricator.wikimedia.org/T247649) (owner: 10Dzahn)
[12:07:58] <wikibugs>	 (03PS7) 10Elukey: Set Java CMS GC for cloudelastic chi cluster [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517)
[12:08:13] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn)
[12:08:16] <wikibugs>	 (03CR) 10Elukey: "Ready for review now!" [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey)
[12:08:20] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn)
[12:08:23] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (10Dzahn)
[12:09:22] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[12:09:23] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[12:09:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:00] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[12:10:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:23] <mutante>	 !log Creating VM people1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=A vcpus=1 memory=2GB disk=80GB link=private. This may take a few minutes.
[12:10:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:31] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[12:10:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:36] <mutante>	 !log Creating VM people1002.eqiad.wmnet in cluster ganeti01.svc.eqiad.wmnet with row=A vcpus=1 memory=2GB disk=80GB link=private. (T249907)
[12:10:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:10:41] <stashbot>	 T249907: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907
[12:11:51] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm
[12:11:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:08] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99)
[12:12:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:33] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:14:52] <wikibugs>	 (03PS1) 10Dr0ptp4kt: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984
[12:15:27] <wikibugs>	 (03PS2) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984
[12:25:17] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 22367 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:32:35] <wikibugs>	 10Operations, 10vm-requests: eqiad: 1 VM request for people.wikimedia.org - https://phabricator.wikimedia.org/T249907 (10Dzahn) Creating the VM failed. spicerack logs don't tell me why.  I guess we are out of resources on the Ganeti eqiad cluster.
[12:41:23] <wikibugs>	 (03PS1) 10Dzahn: merge microsites into webserver_misc_apps [puppet] - 10https://gerrit.wikimedia.org/r/587985 (https://phabricator.wikimedia.org/T247650)
[12:43:33] <wikibugs>	 10Operations, 10Mail, 10Epic: Move most (all?) exim personal aliases to OIT - https://phabricator.wikimedia.org/T122144 (10Dzahn) 05Open→03Stalled
[12:55:57] <wikibugs>	 (03PS1) 10Hashar: admin: fix python deprecation in test suite [puppet] - 10https://gerrit.wikimedia.org/r/587988
[12:55:59] <wikibugs>	 (03PS1) 10Hashar: admin: enhance test output for groups GID [puppet] - 10https://gerrit.wikimedia.org/r/587989
[12:56:01] <wikibugs>	 (03PS1) 10Hashar: admin: show gid in gid test error [puppet] - 10https://gerrit.wikimedia.org/r/587990
[13:00:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[13:01:59] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335 (10Vgutierrez) p:05Medium→03High {F31748766} it looks like the leak is still there
[13:04:23] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) (owner: 10Arturo Borrero Gonzalez)
[13:09:16] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator weekly changes email: List open tasks with past due date [puppet] - 10https://gerrit.wikimedia.org/r/587976 (https://phabricator.wikimedia.org/T249807) (owner: 10Aklapper)
[13:12:32] <mutante>	 !log restarted and re-armed keyholder on deploy1001 to pick up changes for zuul scap deploy
[13:12:36] <mutante>	 hashar: ^ done
[13:12:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:56] <hashar>	 magic :)
[13:14:08] <logmsgbot>	 !log hashar@deploy1001 Started deploy [zuul/deploy@4a69913]: (no justification provided)
[13:14:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:48] <logmsgbot>	 !log hashar@deploy1001 Finished deploy [zuul/deploy@4a69913]: (no justification provided) (duration: 00m 40s)
[13:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:54] <hashar>	 deploy-local failed: <CheckInvalid> {}
[13:14:55] <hashar>	 hehe
[13:16:31] <mutante>	 ? invalid check to CheckInvalid
[13:18:55] <hashar>	 yeah gotta dig into that one now :/
[13:27:29] <wikibugs>	 (03PS1) 10Mholloway: MachineVision label blacklist updates, 2020-04-09 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/587994 (https://phabricator.wikimedia.org/T249895)
[13:33:17] <icinga-wm>	 PROBLEM - PHP opcache health on mw2320 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:33:50] <mutante>	             if check_type not in _TYPES:
[13:33:50] <mutante>	                 msg = "unknown check type '{}'".format(check_type)
[13:33:51] <mutante>	                 raise CheckInvalid(msg)
[13:55:15] <icinga-wm>	 RECOVERY - PHP opcache health on mw2320 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[13:57:55] <icinga-wm>	 RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:21] <icinga-wm>	 RECOVERY - git_daemon_running on contint2001 is OK: PROCS OK: 1 process with regex args ^/usr/lib/git-core/git-daemon --syslog https://www.mediawiki.org/wiki/Continuous_integration/Zuul
[14:11:57] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Mholloway) I'm trying to interpret what this task being closed as declined means.  Does it mean that t...
[14:25:01] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10WDoranWMF) hey @Mholloway, we are not porting Restbase to k8s so this becomes irrelevant. Restbase wil...
[14:25:22] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack rocky serverpackages: require python3-tooz [puppet] - 10https://gerrit.wikimedia.org/r/588003
[14:25:44] <wikibugs>	 (03PS2) 10Andrew Bogott: Openstack rocky serverpackages: require python3-tooz [puppet] - 10https://gerrit.wikimedia.org/r/588003
[14:27:17] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack rocky serverpackages: require python3-tooz [puppet] - 10https://gerrit.wikimedia.org/r/588003 (owner: 10Andrew Bogott)
[14:29:21] <wikibugs>	 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh)
[14:30:18] <wikibugs>	 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh) 05Open→03Resolved
[14:30:36] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Mholloway) Ah, I see.  My interest is specifically in service-runner as my understanding is that it wi...
[14:30:50] <wikibugs>	 10Operations: Deploy the cescout package (censorship monitoring) - https://phabricator.wikimedia.org/T247273 (10ssingh) cescout and its associated dependencies have been deployed to cescout1001. Marking this as resolved.
[14:33:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] "Nice idea. I like it, but got a comment inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/587799 (https://phabricator.wikimedia.org/T249633) (owner: 10Hnowlan)
[14:36:03] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) `service-runner` itself is not going anywhere. DHT-based rate limiting however is likely to...
[14:36:09] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar)
[14:37:07] <wikibugs>	 (03PS1) 10Andrew Bogott: Openstack python3-tooz hack: only install on hosts that really need it [puppet] - 10https://gerrit.wikimedia.org/r/588005
[14:38:23] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Eevans) >>! In T235437#6046252, @Mholloway wrote: > Ah, I see.  My interest is specifically in service...
[14:41:21] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Openstack python3-tooz hack: only install on hosts that really need it [puppet] - 10https://gerrit.wikimedia.org/r/588005 (owner: 10Andrew Bogott)
[14:41:23] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Joe) >>! In T235437#6046252, @Mholloway wrote: > Ah, I see.  My interest is specifically in service-ru...
[14:45:39] <wikibugs>	 10Operations, 10Services, 10service-runner, 10serviceops, and 3 others: Move service-runner legacy rate limiter into hyperswitch - https://phabricator.wikimedia.org/T249919 (10Pchelolo)
[14:46:05] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Pchelolo) Ok, to avoid further confusion, I will do T249919
[14:48:26] <wikibugs>	 10Operations, 10service-runner, 10serviceops, 10CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (10Mholloway) Thanks, everyone.  The repo I'm working with is https://github.com/mdholloway/pushd, and ba...
[14:50:09] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[14:51:38] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn)
[14:53:11] <icinga-wm>	 PROBLEM - PHP opcache health on mw2322 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[14:55:59] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[14:57:46] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[14:59:30] <wikibugs>	 (03PS5) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[15:01:32] <wikibugs>	 (03CR) 10CRusnov: [C: 03+1] "I am fine with this, as we all know when we merge changes. It is my fault the alert triggers, for what it's worth." [puppet] - 10https://gerrit.wikimedia.org/r/587920 (owner: 10Dzahn)
[15:01:43] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Papaul)
[15:01:56] <wikibugs>	 (03PS6) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[15:03:00] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Kubernetes: Re-evaluate kubelet operation latencies alerts - https://phabricator.wikimedia.org/T220808 (10akosiaris) 05Open→03Resolved We've had no recurrence of this in the last year; the approach has clearly worked. In the meantime we have understood what the...
[15:03:08] <vgutierrez>	 !log restart ats-tls on cp1083 and cp1085 - T249335
[15:03:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:14] <stashbot>	 T249335: Memory leak on ats-tls 8.0.6 - https://phabricator.wikimedia.org/T249335
[15:03:29] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Papaul) @volans I realized that the sudo -i wmf-auto-reimage-host command is possible by just giving the user the right to run just that command on cumin and not any ot...
[15:03:36] <wikibugs>	 (03PS7) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[15:03:54] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Kubernetes: Utilize the deployment pipeline (stretch) - https://phabricator.wikimedia.org/T184924 (10akosiaris) 05Open→03Resolved a:03akosiaris This has happened for a long time now, resolving.
[15:03:56] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462 (10akosiaris)
[15:04:08] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462 (10akosiaris)
[15:04:21] <wikibugs>	 10Operations, 10Prod-Kubernetes, 10Kubernetes: Serve one production service via Kubernetes - https://phabricator.wikimedia.org/T184462 (10akosiaris) 05Open→03Resolved a:03akosiaris Has happened for a long time now, resolving
[15:06:11] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn) Can we use the existing dcops admin group and just adjust the sudo privs to include the wmf-auto-reimage commands?
[15:06:43] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Volans) It's not up to me to decide. For context: with the current setup, being able to run the reimage script is equivalent to having global root. There have been some...
[15:08:25] <wikibugs>	 10Operations, 10serviceops: dropped packets to echostore.svc.eqiad 8082/tcp - https://phabricator.wikimedia.org/T238789 (10akosiaris)
[15:08:29] <wikibugs>	 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris)
[15:08:39] <wikibugs>	 10Operations, 10DC-Ops, 10SRE-Access-Requests: access request on cumin[1-2]001 for John Clark - https://phabricator.wikimedia.org/T249916 (10Dzahn) John is already in that group and the group has the following sudo privileges.  So this is not about a new shell user, it's just about adding a command to the li...
[15:08:44] <wikibugs>	 10Operations: k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700 (10akosiaris)
[15:08:47] <wikibugs>	 10Operations, 10netops, 10User-jbond: Sporatic RST drops in the ulogd logs - https://phabricator.wikimedia.org/T238823 (10akosiaris)
[15:09:40] <wikibugs>	 (03PS8) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[15:11:38] <wikibugs>	 10Operations, 10Services, 10service-runner, 10serviceops, and 3 others: Move service-runner legacy rate limiter into hyperswitch - https://phabricator.wikimedia.org/T249919 (10Pchelolo) p:05Triage→03Medium
[15:13:14] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] netbox: switch git::clone to ensure present, not latest [puppet] - 10https://gerrit.wikimedia.org/r/587920 (owner: 10Dzahn)
[15:13:17] <icinga-wm>	 RECOVERY - PHP opcache health on mw2322 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[15:20:31] <wikibugs>	 (03PS9) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[15:24:08] <wikibugs>	 10Operations, 10serviceops, 10Kubernetes, 10User-fsero, 10User-jijiki: Support e - https://phabricator.wikimedia.org/T249927 (10akosiaris)
[15:24:30] <wikibugs>	 10Operations, 10serviceops, 10Kubernetes, 10User-fsero, 10User-jijiki: Support kubernetes Egress networkpolicies in our helm charts - https://phabricator.wikimedia.org/T249927 (10akosiaris) p:05Triage→03Medium
[15:29:12] <wikibugs>	 (03PS10) 10Arturo Borrero Gonzalez: sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837)
[15:33:22] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/587975 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[15:33:31] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::coordinator: absent data_purge [puppet] - 10https://gerrit.wikimedia.org/r/587975 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[15:38:31] <wikibugs>	 (03PS1) 10Elukey: role::analytics_cluster::launcher: add data_purge [puppet] - 10https://gerrit.wikimedia.org/r/588010 (https://phabricator.wikimedia.org/T249593)
[15:39:53] <wikibugs>	 (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/588010 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[15:40:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] role::analytics_cluster::launcher: add data_purge [puppet] - 10https://gerrit.wikimedia.org/r/588010 (https://phabricator.wikimedia.org/T249593) (owner: 10Elukey)
[16:09:14] <wikibugs>	 (03PS1) 10Elukey: kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012
[16:15:32] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar)
[16:17:46] <wikibugs>	 (03PS2) 10Elukey: kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012
[16:20:32] <wikibugs>	 (03PS1) 10Bstorm: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843)
[16:21:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm)
[16:22:56] <wikibugs>	 (03PS3) 10Elukey: kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012
[16:27:49] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] kafkatee: add support for TLS to Kafka and enable it for a test instance [puppet] - 10https://gerrit.wikimedia.org/r/588012 (owner: 10Elukey)
[16:29:24] <wikibugs>	 (03PS2) 10Bstorm: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843)
[16:29:57] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm)
[16:33:44] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10KFrancis) @Dzahn It looks like Jim is currently being onboarded ( please see here:  https://office.wikimedia.org/wiki/Office_IT_Weekly_Meeting_Notes-_April_9,_202...
[16:35:18] <wikibugs>	 (03PS1) 10Elukey: Enable TLS encryption between kafkatee instances and Kafka [puppet] - 10https://gerrit.wikimedia.org/r/588015
[16:38:30] <wikibugs>	 (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/21843/" [puppet] - 10https://gerrit.wikimedia.org/r/588015 (owner: 10Elukey)
[16:39:31] <wikibugs>	 10Operations, 10Services, 10service-runner, 10serviceops, and 3 others: Move service-runner legacy rate limiter into hyperswitch - https://phabricator.wikimedia.org/T249919 (10Pchelolo) 05Open→03Declined Actually, after reviewing the code once more, this doesn't seem to be feasible. rate limiter in ser...
[16:39:38] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul)
[16:40:46] <icinga-wm>	 PROBLEM - PHP opcache health on mw2319 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[16:45:59] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Papaul)
[16:59:59] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10MNovotny_WMF) Is there anything I can do to help here?
[17:00:16] <icinga-wm>	 RECOVERY - PHP opcache health on mw2319 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[17:02:29] <wikibugs>	 (03PS1) 10Papaul: DNS Remove mgmt asset tag from cp20[16-20,22-26] [dns] - 10https://gerrit.wikimedia.org/r/588019
[17:05:04] <wikibugs>	 (03PS2) 10Papaul: DNS Remove mgmt asset tag DNS for cp20[16-20,22-26] [dns] - 10https://gerrit.wikimedia.org/r/588019
[17:05:13] <wikibugs>	 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Patch-For-Review, 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10hashar) I have pruned all the containers from...
[17:06:46] <wikibugs>	 (03CR) 10Papaul: [C: 03+2] DNS Remove mgmt asset tag DNS for cp20[16-20,22-26] [dns] - 10https://gerrit.wikimedia.org/r/588019 (owner: 10Papaul)
[17:09:23] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul)
[17:09:35] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp2017.codfw.wmnet - https://phabricator.wikimedia.org/T249084 (10Papaul) 05Open→03Resolved complete
[17:09:49] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul)
[17:09:59] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[16,19,23].codfw.wmnet - https://phabricator.wikimedia.org/T249125 (10Papaul) 05Open→03Resolved complete
[17:10:18] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Papaul)
[17:10:28] <wikibugs>	 10Operations, 10ops-codfw, 10Traffic, 10decommission, 10Patch-For-Review: decommission cp20[18,20,22,24-26].codfw.wmnet - https://phabricator.wikimedia.org/T249115 (10Papaul) 05Open→03Resolved complete
[17:16:26] <wikibugs>	 10Operations, 10CommRel-Specialists-Support (Apr-Jun-2020): CommRel support for FY2019-2020 Q4 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Yeah, as you can imagine with some folks working reduced hours, SRE is mostly focusing on critical work and we've pushed this off. I haven't aske...
[17:37:07] <wikibugs>	 (03CR) 10Krinkle: [C: 03+1] "Awesome :)" [puppet] - 10https://gerrit.wikimedia.org/r/587586 (https://phabricator.wikimedia.org/T249435) (owner: 10DCausse)
[17:52:44] <wikibugs>	 (03PS3) 10Bstorm: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843)
[17:57:05] <wikibugs>	 (03CR) 10Bstorm: "To explain what I did here: dh_python2 was actually failing on deps because the control file lacked the entry of ${python:Depends} (which " [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm)
[17:58:23] <wikibugs>	 (03CR) 10Bstorm: d/changelog: prepare for 0.67 release (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm)
[18:03:54] <wikibugs>	 (03CR) 10Herron: Enable TLS encryption between kafkatee instances and Kafka (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/588015 (owner: 10Elukey)
[18:09:37] <wikibugs>	 (03PS1) 10BBlack: Offboarding shell access for anomie [puppet] - 10https://gerrit.wikimedia.org/r/588023
[18:10:30] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Nuria) @MNoorWMF We need the contract end date to provide access and your approval. Besides that the legal and SRE fellows need to assess whether the NDA has been signed.
[18:12:45] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] Offboarding shell access for anomie [puppet] - 10https://gerrit.wikimedia.org/r/588023 (owner: 10BBlack)
[18:29:50] <icinga-wm>	 RECOVERY - Rate of JVM GC Old generation-s runs - logstash1011-production-logstash-eqiad on logstash1011 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-logstash-eqiad&var-instance=logstash1011&panelId=37
[18:32:00] <icinga-wm>	 PROBLEM - PHP opcache health on mw2318 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:36:27] <wikibugs>	 (03PS3) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945)
[18:40:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt)
[18:46:32] <wikibugs>	 (03PS4) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945)
[18:48:07] <wikibugs>	 (03CR) 10Bstorm: [C: 03+2] d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm)
[18:49:22] <wikibugs>	 (03Merged) 10jenkins-bot: d/changelog: prepare for 0.67 release [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/588013 (https://phabricator.wikimedia.org/T249843) (owner: 10Bstorm)
[18:50:41] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt)
[18:53:48] <icinga-wm>	 RECOVERY - PHP opcache health on mw2318 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health
[18:56:12] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/587978 (https://phabricator.wikimedia.org/T231517) (owner: 10Elukey)
[19:09:14] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:09:14] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10MNoorWMF) I think I was tagged by mistake. Tagging @MNovotny_WMF in case it was missed :)
[19:09:50] <icinga-wm>	 PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:19:35] <wikibugs>	 10Operations, 10Wikimedia-General-or-Unknown, 10Readers-Web-Backlog (Needs Product Owner Decisions), 10SEO: Yoruba Language Wikipedia not being indexed by search engines - https://phabricator.wikimedia.org/T236241 (10Aklapper) @ovasileva: Could you please check the last comment? (You were not CC'ed so you...
[19:26:43] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] sbuild: introduce module and use it in toolforge package builder [puppet] - 10https://gerrit.wikimedia.org/r/587991 (https://phabricator.wikimedia.org/T249837) (owner: 10Arturo Borrero Gonzalez)
[19:26:49] <wikibugs>	 (03PS1) 10RobH: adding gpu skus [software] - 10https://gerrit.wikimedia.org/r/588031
[19:27:45] <wikibugs>	 (03CR) 10RobH: [C: 03+2] adding gpu skus [software] - 10https://gerrit.wikimedia.org/r/588031 (owner: 10RobH)
[19:37:50] <cdanis>	 uhhh interesting that BFD is down for that link just for ipv4, and not for ipv6, and OSPF is still working
[19:37:57] <cdanis>	 !log cdanis@re0.cr1-codfw> clear bfd session address 208.80.153.220
[19:38:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:38:14] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:38:52] <icinga-wm>	 RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 13 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:38:58] <cdanis>	 ... okay I guess
[19:39:46] <rlazarus>	 once and for all
[19:46:26] <cdanis>	 state machine go brr
[19:59:36] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:01:26] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:01:29] <rlazarus>	 ^ brief latency spike that cleared, correlated with a big jump in s8 scrape time
[20:03:31] <wikibugs>	 (03PS5) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945)
[20:08:06] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945) (owner: 10Dr0ptp4kt)
[20:08:40] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:08:44] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:10:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:10:37] <rlazarus>	 actually zooming out a bit further, it looks like that was a misread -- latency is persistently up 75 ms or so, and inconsistently spiking further
[20:10:40] <wikibugs>	 (03PS6) 10Dr0ptp4kt: WIP: Add Druid support for event.editattemptstep [puppet] - 10https://gerrit.wikimedia.org/r/587984 (https://phabricator.wikimedia.org/T249945)
[20:12:02] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:12:18] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:13:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:13:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:14:14] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:15:02] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:17:01] <cdanis>	 rlazarus: looks like apiservers affected as well?
[20:17:28] <rlazarus>	 yep
[20:17:44] <cdanis>	 rlazarus: https://grafana.wikimedia.org/d/ifM0GzjWk/cdanis-xxx-php-worker-threads?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-clusters=api_appserver&var-clusters=appserver
[20:17:46] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:18:15] <rlazarus>	 oh, nice find
[20:18:46] <cdanis>	 doesn't tell you much aside from "something the appservers are doing is temporarily getting a lot more expensive" though
[20:20:29] <rlazarus>	 zooming out to a week it looks like the latency bump is more or less within normal variation, but the worker thread usage is definitely not
[20:24:35] <wikibugs>	 (03PS1) 10BryanDavis: Downgrade another Jessie package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588034
[20:24:36] <wikibugs>	 (03PS1) 10BryanDavis: rebuild_all: build and push base of each series first [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588035
[20:24:56] <cdanis>	 parsercache hit rate is (if anything) slightly higher than usual
[20:25:19] <cdanis>	 don't see any increases in network traffic on memcached hosts (if anything, slight decrease, which also tracks with something else getting more-expensive)
[20:25:40] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] Introduce macros for installing composer and npm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578168 (owner: 10BryanDavis)
[20:25:52] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] Downgrade another Jessie package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588034 (owner: 10BryanDavis)
[20:26:04] <wikibugs>	 (03Merged) 10jenkins-bot: Introduce macros for installing composer and npm [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/578168 (owner: 10BryanDavis)
[20:26:13] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] rebuild_all: build and push base of each series first [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588035 (owner: 10BryanDavis)
[20:26:15] <wikibugs>	 (03Merged) 10jenkins-bot: Downgrade another Jessie package [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588034 (owner: 10BryanDavis)
[20:26:37] <wikibugs>	 (03Merged) 10jenkins-bot: rebuild_all: build and push base of each series first [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/588035 (owner: 10BryanDavis)
[20:26:59] <cdanis>	 rlazarus: https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=termbox
[20:27:17] <cdanis>	 wait but 1 rps? nevermind
[20:27:59] <cdanis>	 https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-destination=echostore also interesting, less definitive, not much of an increase in absolute terms
[20:28:42] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=
[20:31:10] <rlazarus>	 logstash looks like elevated rate of "[{exception_id}] {exception_url} WMFTimeoutException from line 39 of /srv/mediawiki/wmf-config/set-time-limit.php: the execution time limit of 60 seconds was exceeded"
[20:31:29] <rlazarus>	 so check me, something is taking 60s to time out and consuming worker threads in the meantime?
[20:31:58] <rlazarus>	 well, consuming a singular worker thread, but at a high enough rate of occurrence that it's a problem
[20:32:10] <cdanis>	 yeah, 60s is the standard timeout
[20:34:28] <wikibugs>	 (03PS1) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153)
[20:35:42] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov)
[20:35:56] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:36:41] <wikibugs>	 (03PS2) 10CRusnov: customscripts/interface_automation.py: Add PuppetDB Importer [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153)
[20:43:18] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:43:33] <cdanis>	 rlazarus: I am yet to find a particularly slow appserver backend, and I am yet to find a new bot or something that is sending traffic that consistently times out (looking at https://logstash.wikimedia.org/goto/2b5ba93c1c496724141bb666a627a467 for that)
[20:43:46] <rlazarus>	 yeah, likewise on both counts
[20:44:44] <rlazarus>	 I wish I knew how to dig more into php-fpm state to find out where in the code it's spending its time
[20:45:04] <cdanis>	 something something xhgui?
[20:46:08] <rlazarus>	 on mwdebug sure -- I want a little information about live traffic, but without having to slow it down with a full-on profiler
[20:46:56] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:47:13] <cdanis>	 I keep meaning to try applying cool eBPF stuff to the tracepointing support in PHP
[20:47:27] <rlazarus>	 something like what you'd get from a JVM thread dump, just "here's a stack trace of where each worker is right now", see if anything jumps out
[20:47:33] <cdanis>	 yeah
[20:47:48] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:48:03] <rlazarus>	 ^ that's gotta be unrelated, right
[20:48:12] <cdanis>	 those ones flap all the time
[20:48:15] <rlazarus>	 ah okay
[20:52:25] <cdanis>	 rlazarus: there's a logstash dashboard of appserver apache logs, right?
[20:52:52] <cdanis>	 ah i think i found it
[20:53:21] <rlazarus>	 mediawiki-apache2 yeah
[20:53:27] <rlazarus>	 I looked earlier but nothing jumped out at me
[20:53:45] <rlazarus>	 n.b. it's only a few appservers though
[20:54:42] <_joe_>	 it's s8 latency
[20:54:50] <_joe_>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=43&fullscreen&orgId=1
[20:55:46] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:56:00] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:56:11] <cdanis>	 it's like the 5rd time this week that the scrape time is this high
[20:56:20] <cdanis>	 and 2nd or 3rd time there were so many open connections to s8
[20:57:11] <_joe_>	 I would say that explains the worker threads starvation, probably
[20:57:18] <_joe_>	 the bad scraping time
[20:57:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[20:57:35] <rlazarus>	 ugh I was even looking at s8 latency earlier but I dismissed it
[20:57:48] <rlazarus>	 thanks _joe_ 
[20:58:30] <_joe_>	 so we're out of ideal conditions but still comfortably up I'd say?
[20:58:34] <cdanis>	 yeah
[20:58:36] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is OK: HTTP OK: HTTP/1.0 200 OK - 22372 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[20:58:41] <rlazarus>	 yeah, I was about to say it's not worth waking up DBAs over
[20:58:45] <rlazarus>	 anything to be done about it apart from that?
[20:59:40] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:00:52] <_joe_>	 rlazarus: you can take a look at tendril to see if anything stands out
[21:02:25] <cdanis>	 looks like db1111
[21:03:20] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:04:00] <_joe_>	 cdanis: it's the recentchanges host I bet
[21:04:22] <_joe_>	 oh no those are not rc queries
[21:04:35] <_joe_>	 so I'd bet on some client making expensive requests
[21:04:46] <cdanis>	 db1111 isn't in any special group
[21:05:08] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[21:05:45] <_joe_>	 yes, db1111 has elevated threads usage
[21:06:20] <cdanis>	 i'm inclined to give it less weight
[21:06:28] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10BGerdemann) Jim's contract is for 116 hours or until Dec 31, 2020, whichever comes first.
[21:07:45] <rlazarus>	 db1126 also? they're listed together on the slow queries list
[21:07:54] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[21:08:53] <cdanis>	 rlazarus: https://grafana.wikimedia.org/d/000000273/mysql?panelId=9&fullscreen&orgId=1&from=now-6h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1111&var-port=9104
[21:09:11] <rlazarus>	 ah thanks
[21:09:18] <rlazarus>	 yeah agree that's just 1111
[21:09:19] <cdanis>	 db11276 doesn't look nearly as bad
[21:09:30] <cdanis>	 db1126 
[21:10:04] <rlazarus>	 wow grafana REALLY doesn't like when I click the explore link on that graph
[21:10:12] <cdanis>	 somewhat-confusing idiom there, if you aren't already familiar with it: 'port' is the port of the prometheus exporter, which is 9104 for the default mysql port (when you see nothing but a hostname on https://noc.wikimedia.org/db.php)
[21:10:14] <_joe_>	 rlazarus: in tendril you can click on the single servers and you have a ton of metrics
[21:10:19] <cdanis>	 and then 1xxxx when port :xxxx
[21:10:25] <rlazarus>	 ah cool
[21:12:03] <logmsgbot>	 !log cdanis@cumin1001 dbctl commit (dc=all): 'db1111 seems overloaded', diff saved to https://phabricator.wikimedia.org/P10954 and previous config saved to /var/cache/conftool/dbconfig/20200410-211202-cdanis.json
[21:12:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:12:14] <cdanis>	 I just had to refresh myself on the CLI syntax of my own tool 🙃
[21:13:21] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Nuria) @BGerdemann 116 hours seems a short period which hints that data permits will not be needed until Dec 31st, how would we be notified contract is no longer in...
[21:17:13] <rlazarus>	 cdanis: appserver latency is looking better
[21:17:16] <cdanis>	 i think we were already starting to come out of the woods on appserver latency, but ever since decreasing the weight we don't seem flattopped on 'threads running' on db1111
[21:18:19] <cdanis>	 appserver threads were looking healthier since about 21:04, same for latency
[21:18:40] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is OK: HTTP OK: HTTP/1.0 200 OK - 22371 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[21:20:14] <icinga-wm>	 PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) is CRITICAL: Test Print the Foo page from en.wp.org in letter format returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[21:22:04] <icinga-wm>	 RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[21:22:27] <cdanis>	 rlazarus: https://w.wiki/MZ6 conjecture: we only ever allow 200 running threads in our mysql configuration
[21:22:43] <rlazarus>	 convincing
[21:29:47] <rlazarus>	 s8 scraping latency doesn't look like it improved, even though appserver latency is fine now? https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=43&fullscreen&orgId=1
[21:32:41] <cdanis>	 it's not atrocious? same plot: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?panelId=11&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-group=core&var-shard=s8&var-role=All&from=now-6h&to=now
[21:33:26] <rlazarus>	 yeah, but that last spike is characteristic of what we were seeing when it was bad, and that was after the config change
[21:33:46] <cdanis>	 yes
[21:33:49] <cdanis>	 that was db1126
[21:33:56] <cdanis>	 https://w.wiki/MZ7
[21:34:11] <cdanis>	 taking more of the load as a result of the config change
[21:39:08] <cdanis>	 rlazarus: https://w.wiki/MZ8
[21:39:42] <cdanis>	 db1111 scrape time is back to normal; db1126 scrape time is a bit elevated but not bad (and went back down-ish)
[21:40:04] <cdanis>	 and all of this matches up pretty well with worker thread saturation
[21:40:54] <rlazarus>	 okay yeah seems good
[21:41:47] <rlazarus>	 I'm wiped, going to call it a day there, thanks cdanis <3
[21:41:58] <cdanis>	 yeah, I'm quite hungry, also calling it a day
[21:42:02] <cdanis>	 good weekend, all :)
[21:54:29] <wikibugs>	 10Operations, 10Parsing-Team, 10Performance-Team, 10TechCom, and 4 others: Strategy for storing parser output for "old revision" (Popular diffs and permalinks) - https://phabricator.wikimedia.org/T244058 (10aaron) >>! In T244058#6041722, @daniel wrote: > Another thought from the TechCom meeting: we could ju...
[22:05:12] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:08:03] <wikibugs>	 (03PS1) 10CRusnov: icinga: Add git_untracked check [puppet] - 10https://gerrit.wikimedia.org/r/588049
[22:30:31] <wikibugs>	 10Operations, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10MNovotny_WMF) @Nuria Ruiz <nruiz@wikimedia.org> we can make a point of notifying you as soon as the hours are used up.
[22:40:26] <wikibugs>	 (03PS1) 10Mholloway: MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939)
[22:41:30] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939) (owner: 10Mholloway)
[22:42:50] <wikibugs>	 (03PS2) 10Mholloway: MachineVision: Add MachineVisionWithholdImageList config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/588053 (https://phabricator.wikimedia.org/T249939)
[23:02:13] <wikibugs>	 (03PS1) 10Hashar: Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003)
[23:03:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) (owner: 10Hashar)
[23:04:58] <wikibugs>	 (03PS2) 10Hashar: Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003)
[23:05:26] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:06:15] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Initial debianization [software/keyholder] (debian) - 10https://gerrit.wikimedia.org/r/588055 (https://phabricator.wikimedia.org/T203003) (owner: 10Hashar)
[23:06:47] <wikibugs>	 (03PS2) 10CRusnov: icinga: Add git local changes check [puppet] - 10https://gerrit.wikimedia.org/r/588049
[23:08:50] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:09:09] <wikibugs>	 10Operations, 10Keyholder, 10Release-Engineering-Team-TODO, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services): Keyholder phab repo duplicate work - https://phabricator.wikimedia.org/T203003 (10hashar) Eventually today I went with the issue of having to restart Keyholder and rearm afte...
[23:10:03] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/588036 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov)
[23:10:31] <wikibugs>	 (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/588049 (owner: 10CRusnov)
[23:13:16] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_restbase_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:15:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:16:18] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1085 is OK: HTTP OK: HTTP/1.0 200 OK - 22389 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:19:42] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1079 is OK: HTTP OK: HTTP/1.0 200 OK - 22372 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:35:10] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:59:15] <wikibugs>	 10Operations, 10Cloud-Services, 10Traffic, 10Wikimedia-Incident: Requests to production are sometimes timing out or giving empty response - https://phabricator.wikimedia.org/T249035 (10MusikAnimal) Just letting you know this issue has resumed as of about 4 or 5 hours ago, now requests are timing out every...