[00:00:34] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (10Dzahn) [00:01:35] (03CR) 10Ryan Kemper: "`puppet-merge` is done" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [00:01:43] (03CR) 10Ryan Kemper: "Sorry for letting this go stale - I'll work on getting this out tomorrow (the 26th)" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [00:16:06] (03CR) 10Herron: "thanks for merging! I'll do rolling bounces of the logging ES7 instances tomorrow to pick up the new config" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [00:28:01] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&va [00:28:01] -eqiad&var-topic=All&var-consumer_group=All [00:29:47] (03PS7) 10Dave Pifke: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [00:29:49] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [00:30:11] (03CR) 10jerkins-bot: [V: 04-1] arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [00:31:26] (03PS8) 10Dave Pifke: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [00:33:17] (03PS9) 10Dave Pifke: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [00:35:57] (03CR) 10Krinkle: [C: 03+1] arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [00:36:32] !log tstarling@deploy1001 Synchronized w/T256395-cookie-test.php: (no justification provided) (duration: 00m 58s) [00:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:59] (03CR) 10Dave Pifke: arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [00:38:42] !log tstarling@deploy1001 Synchronized w/T256395-cookie-test.php: (no justification provided) (duration: 00m 56s) [00:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:54] (03PS1) 10CDanis: varnish: Set-Cookie responses are uncacheable [puppet] - 10https://gerrit.wikimedia.org/r/607910 (https://phabricator.wikimedia.org/T256395) [01:10:51] (03PS1) 10Andrew Bogott: designate codfw1dev: remove need for 'legacy' dns sink domain [puppet] - 10https://gerrit.wikimedia.org/r/607911 [01:12:46] (03CR) 10Tim Starling: [C: 03+1] varnish: Set-Cookie responses are uncacheable [puppet] - 
10https://gerrit.wikimedia.org/r/607910 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [01:12:55] (03CR) 10CDanis: [C: 03+2] "This matches ATS behavior, so seems safe: https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/files/trafficserve" [puppet] - 10https://gerrit.wikimedia.org/r/607910 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [01:13:26] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'disable-puppet "cdanis deploying I6cc5f3e6 T256395"' [01:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:46] (03PS2) 10Andrew Bogott: designate codfw1dev: remove need for 'legacy' dns sink domain [puppet] - 10https://gerrit.wikimedia.org/r/607911 [01:21:10] (03CR) 10Andrew Bogott: [C: 03+2] designate codfw1dev: remove need for 'legacy' dns sink domain [puppet] - 10https://gerrit.wikimedia.org/r/607911 (owner: 10Andrew Bogott) [01:41:26] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'enable-puppet "cdanis deploying I6cc5f3e6 T256395"' [01:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:02] (03PS4) 10Bmansurov: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) [01:47:17] (03CR) 10Bmansurov: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [01:53:09] !log I6cc5f3e6 has been deployed to all cp text nodes T256395 [01:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:41] Hrm, asked in -traffic, but maybe this channel makes more sense: [01:55:56] https://twitter.com/john_overholt/status/1276276247602044933 describes a situation where one version of the article is visible logged in and another one logged out for a couple of persons. 
Neither a hard refresh nor purging the page seems to help. Is this a normal issue? Something we need to do something about? [01:56:40] JohanJ: I'm hearing similar reports [01:56:48] i wondered if it might be replication lag related [01:56:52] however i saw no alerts in log [01:57:31] on closer look they are related reports ;) [01:58:03] so i'm not sure how widespread it is but a couple people in those threads were seeing it [02:17:11] !log depooling cp1087 which has not been processing purges for 11.415 days [02:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:21] thanks for passing on the report JohanJ [02:19:09] !log three more hosts not processing purges for multiple days ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥🍺 sudo cumin 'cp2033*,cp2037*,cp2039*' 'depool' [02:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:14] https://w.wiki/VDp [02:20:22] oh, that only works if you can log into grafana, nevermind [02:21:02] I'm taking a stretch and getting some water, then opening a task and doing a bit more digging. But for now I believe no one should be served badly-stale pages. 
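(Editor's aside: the T256395 patches in this log make Varnish treat any backend response carrying a Set-Cookie header as uncacheable, matching existing ATS behavior. Below is a minimal, illustrative Python sketch of that cacheability rule only; the actual change is VCL in operations/puppet, and the function name here is invented for illustration.)

```python
def cacheable_in_shared_cache(beresp_headers):
    """Sketch of the rule behind the T256395 Varnish patches: a shared
    cache should not store a backend response that sets cookies, since
    the body may be user-specific. Illustrative only, not the real VCL.
    """
    # HTTP header field names are case-insensitive.
    return not any(name.lower() == "set-cookie" for name in beresp_headers)
```

In the VCL change this would correspond to refusing to cache (e.g. marking the object uncacheable) when `beresp.http.Set-Cookie` is present; the exact mechanics are in the Gerrit changes linked above.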
[02:23:27] (03PS1) 10CDanis: varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 [02:24:14] (03PS2) 10CDanis: varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) [02:31:20] John confirmed on Twitter that he's no longer seeing problems :) [02:36:51] 10Operations, 10Traffic: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [02:39:16] (03PS1) 10Jeena Huneidi: Add Cassandra image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) [02:39:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:41:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:41:39] (03PS1) 10Legoktm: libraryupgrader: Add systemd units [puppet] - 10https://gerrit.wikimedia.org/r/607919 (https://phabricator.wikimedia.org/T173478) [02:42:15] (03PS2) 10Jeena Huneidi: Add Cassandra image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) [02:49:11] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10Johan) [02:52:55] !log tstarling@deploy1001 Synchronized private/PrivateSettings.php: updating wgAuthenticationTokenVersion per my wikitech-l post (duration: 00m 57s) [02:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:46] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [03:07:36] (03CR) 10Tim Starling: "As I said on IRC, I would prefer to preserve the cache object for analysis, rather than forcibly expiring them (if that's what this does)." 
[puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [03:29:09] (03CR) 10Ladsgroup: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [03:30:48] 10Operations, 10Traffic, 10User-notice: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10CDanis) [03:51:39] (03CR) 10BBlack: [C: 03+1] varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [03:51:49] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [03:54:35] !log https://gerrit.wikimedia.org/r/c/operations/puppet/+/607917 [03:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:43] aafgakjfg clipboard vs selection [03:54:51] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕛🍺 sudo cumin A:cp 'disable-puppet "I39e1c68a is broken"' [03:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:14] (03PS1) 10CDanis: varnish: Set-Cookie responses are uncacheable: take 3 [puppet] - 10https://gerrit.wikimedia.org/r/607922 (https://phabricator.wikimedia.org/T256395) [03:59:45] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie responses are uncacheable: take 3 [puppet] - 10https://gerrit.wikimedia.org/r/607922 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [04:01:47] !log re-enable puppet on cps [04:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:12] (03PS1) 10CDanis: varnish: Set-Cookie: log to syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) [04:20:33] (03PS2) 10CDanis: varnish: Set-Cookie: log to 
syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) [04:21:13] (03CR) 10BBlack: [C: 03+1] varnish: Set-Cookie: log to syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [04:21:23] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie: log to syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [04:25:18] (03PS1) 10CDanis: varnish: Set-Cookie beresps: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/607924 [04:26:59] (03PS2) 10CDanis: varnish: Set-Cookie beresps: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/607924 [04:27:56] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie beresps: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/607924 (owner: 10CDanis) [04:37:24] (03PS1) 10CDanis: varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 [04:38:44] (03PS2) 10CDanis: varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 [04:42:05] (03PS3) 10CDanis: varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 [04:43:37] (03CR) 10BBlack: [C: 03+1] varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 (owner: 10CDanis) [04:43:50] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 (owner: 10CDanis) [04:45:35] (03PS1) 10Marostegui: install_server: Reimage dbproxy1014 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607927 (https://phabricator.wikimedia.org/T255408) [04:46:28] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1014 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607927 
(https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [05:03:25] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [05:03:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [05:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:02] (03PS2) 10BBlack: Add Elastic IP to wikimedia_nets for PAPI [puppet] - 10https://gerrit.wikimedia.org/r/607313 (https://phabricator.wikimedia.org/T255524) [05:05:05] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:05:38] (03CR) 10BBlack: [C: 03+2] Add Elastic IP to wikimedia_nets for PAPI [puppet] - 10https://gerrit.wikimedia.org/r/607313 (https://phabricator.wikimedia.org/T255524) (owner: 10BBlack) [05:06:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:42] (03PS1) 10Marostegui: dbproxy1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607928 (https://phabricator.wikimedia.org/T255408) [05:13:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2088:3312, db2104', diff saved to https://phabricator.wikimedia.org/P11672 and previous config saved to /var/cache/conftool/dbconfig/20200626-051328-marostegui.json [05:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:33] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [05:13:53] (03CR) 10Marostegui: [C: 03+2] dbproxy1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607928 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [05:31:09] PROBLEM - SSH on 
cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:38:55] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:50:38] (03PS1) 10Marostegui: production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) [06:15:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:17:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:24:33] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:25:03] (03PS1) 10Tim Starling: Add the cache-cookies log to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607931 (https://phabricator.wikimedia.org/T256395) [06:29:55] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:37:57] (03CR) 10Tim Starling: [C: 03+2] Add the cache-cookies log to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607931 (https://phabricator.wikimedia.org/T256395) (owner: 10Tim Starling) [06:38:46] (03Merged) 10jenkins-bot: Add the cache-cookies log to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607931 (https://phabricator.wikimedia.org/T256395) (owner: 10Tim Starling) [06:40:34] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: add cache-cookies log channel (duration: 00m 59s) [06:40:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:13] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:57:03] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10elukey) Updated wikitech in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging with the procedure that I followed (mentioning also to... [06:57:11] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:58:44] (03PS7) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200626T0700) [07:01:32] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Special:HideBanners is not really cacheable - https://phabricator.wikimedia.org/T256447 (10tstarling) [07:08:07] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:17:07] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:22:31] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:26:09] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:31:39] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:32:36] vgutierrez, XioNoX: anything ongoing in eqsin? 
cp5006 above and the librenms report in Netbox (not sure if related to each other) [07:33:25] volans: I was already trying to connect via mgmt console...no response for login [07:33:27] volans: it's possible that cr1-eqsin crapped itself [07:33:35] volans: can you prepare a depool? [07:33:41] sure [07:34:26] so far it looks all normal [07:34:28] still looking [07:34:53] the report on netbox is saying missing items in librenms [07:35:18] (03PS1) 10Volans: depool eqsin, network issues [dns] - 10https://gerrit.wikimedia.org/r/607976 [07:35:28] patch ready if needed ^^^ abandon it if not needed [07:35:50] volans: which patch? :) [07:36:00] lol [07:36:13] volans: https://librenms.wikimedia.org/device/device=159/ yeah librenms failed to pull the router, but everything looks back to normal now [07:36:55] smokeping is happy too https://smokeping.wikimedia.org/?target=eqsin.Core.cr1-eqsin [07:37:17] ok [07:37:25] I can ssh into the mgmt of cm5006 [07:37:32] cr1-eqsin logs are fine too [07:37:46] *cp5006, and as jayme said nothing in console [07:38:08] ema: around by any chance? [07:38:22] I'd go for a depool + force reboot if it seems reasonable [07:38:51] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:38:54] I was about to open a phab task...is that how "we do it"? :) [07:39:32] jayme: yeah [07:39:52] even if the actions above solve the issue it's good to have a tracking task for the future and offline people [07:39:56] ack [07:40:05] jayme: yes to both :) [07:41:22] jayme: another useful resource in these cases is https://wikitech.wikimedia.org/wiki/Service_restarts [07:42:14] sounds good yeah [07:42:34] ack, proceeding [07:42:43] !log volans@cumin1001 conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet [07:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:34] jayme: are you opening a task?
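(Editor's aside: the depool step logged above is a conftool action. A small sketch of how that action string could be assembled, assuming the format shown in the `!log volans@cumin1001` entry; illustrative only, the real workflow also files a task and downtimes the host.)

```python
def conftool_pool_action(fqdn, pooled):
    """Build a conftool action string matching the SAL entry above.
    Sketch under the assumption that the logged format is canonical;
    this is not a real tooling interface.
    """
    state = "yes" if pooled else "no"
    return "conftool action : set/pooled={}; selector: name={}".format(state, fqdn)
```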
[07:43:43] 10Operations, 10ops-eqsin, 10Traffic: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 (10JMeybohm) [07:43:45] (to avoid duplicates) [07:43:51] yes :D [07:44:12] !log force rebooted cp5006 that is unresponsive (after having depooled it) - T256449 [07:44:15] thanks! [07:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:16] T256449: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 [07:44:27] can't go back in time in icinga prior to 04:00 UTC, though [07:44:33] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:46] icinga logs are in /srv/ [07:44:48] wow [07:44:57] console so far is all ��fx怘�怘�xx��x�x [07:45:03] doesn't look good [07:45:29] * volans uploading matrix-translator plugin to his brain [07:45:48] vgutierrez: be prepared to say goodbye to cp5006 [07:45:55] RECOVERY - Host kubestagetcd1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.32 ms [07:46:08] mmmh [07:46:11] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) >>! In T244530#6258712, @Dzahn wrote: > kubestagetcd1004 is still down. not sure if desired. ACKed in Icinga. No, mistake on my side. Thanks! I 've just started it up. [07:46:17] now I got kernel boot logs normally [07:46:25] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 231.35 ms [07:46:27] so seems that the bios redirection is borked, maybe misconfigured? 
host is up [07:47:58] nothing in syslog since Jun 26 04:40:01 [07:48:54] first nrpe socket timeout messages started to pop up at 2020-06-26 04:41:21 [07:50:41] volans: hey [07:50:42] 10Operations, 10ops-eqsin, 10Traffic: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 (10Volans) Host is back up, the console output during boot was all borked `��fx怘�怘�xx��x�x....` but the kernel boot logs were normally readable. Maybe there is some misconfiguration in... [07:50:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] conftool: Add new kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/607530 (owner: 10Alexandros Kosiaris) [07:50:48] (03PS1) 10Ayounsi: Add mgmt for cloudsw [dns] - 10https://gerrit.wikimedia.org/r/607978 (https://phabricator.wikimedia.org/T251632) [07:51:18] I've updated the task, if it's ok for you I'd leave it to you now, it's depooled in conftool [07:52:55] ema: ^^^ [07:53:19] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/607978 (https://phabricator.wikimedia.org/T251632) (owner: 10Ayounsi) [07:53:39] (03CR) 10Ayounsi: [C: 03+2] Add mgmt for cloudsw [dns] - 10https://gerrit.wikimedia.org/r/607978 (https://phabricator.wikimedia.org/T251632) (owner: 10Ayounsi) [07:54:20] volans: sure, ty [07:54:34] (03PS2) 10Hashar: releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 (https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [07:55:08] (03Abandoned) 10Hashar: releases::mediawiki:: support buster / PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [07:56:48] thanks!
[07:57:17] (03PS3) 10Hashar: releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 (https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [07:57:47] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:18] (03Abandoned) 10Volans: depool eqsin, network issues [dns] - 10https://gerrit.wikimedia.org/r/607976 (owner: 10Volans) [08:04:19] 10Operations, 10vm-requests: Site: 2 VM request for kubernetes sessionstore dedicated nodes - https://phabricator.wikimedia.org/T256254 (10akosiaris) 05Open→03Resolved a:03akosiaris VMs created, installed and ready. Resolving [08:04:21] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) [08:04:28] !log pool all new kubernetes nodes in LVS T252185 T256236 [08:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:33] T256236: Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 [08:04:33] T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. 
- https://phabricator.wikimedia.org/T252185 [08:04:53] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes.*.wmnet [08:04:53] (03CR) 10Kormat: [C: 03+1] production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) (owner: 10Marostegui) [08:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:04] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes.*.wmnet [08:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:08:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:15:33] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) 05Open→03Resolved Both paths outlined in the description have been followed. We now have 4 different sessionstore... 
[08:19:01] (03PS1) 10Alexandros Kosiaris: Partially revert sessionstore emergency fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 [08:20:16] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:21] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P11673 and previous config saved to /var/cache/conftool/dbconfig/20200626-082242-marostegui.json [08:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:20] (03PS5) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) [08:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1088', diff saved to https://phabricator.wikimedia.org/P11674 and previous config saved to /var/cache/conftool/dbconfig/20200626-083319-marostegui.json [08:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:57] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [08:46:32] (03CR) 10Jcrespo: [C: 03+2] cumin: backup all of /srv where a lot of deployment state may live [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [08:47:24] (03CR) 10Hashar: [C: 03+1] releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 
(https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [08:50:45] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) >>! In T255243#6258852, @Dzahn wrote: > There are 4 new ganeti VMs now, 2 in eqiad and 2 in codfw, in row D e... [08:54:42] 10Operations, 10Puppet: Missing dependency on bacula-fd setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) [08:55:12] 10Operations, 10Puppet: Missing dependency on bacula-fd setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) [08:56:20] 10Operations, 10Puppet: Missing dependency on bacula-fd setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) Added @hashar @akosiaris @Volans FYI, feel free to unsubscribe as I will be the one debugging this. [08:56:25] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] 10Operations, 10Puppet: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) [08:57:56] (03CR) 10Liuxinyu970226: [C: 03+1] Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [08:58:58] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P11675 and previous config saved to /var/cache/conftool/dbconfig/20200626-090813-marostegui.json [09:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:42] 10Operations: Puppet CI should fail over CRLF line endings 
(sometimes) - https://phabricator.wikimedia.org/T182641 (10hashar) That can be done directly inside `operations/puppet.git`. Ci just invokes `rake test`. [09:18:13] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 (owner: 10Alexandros Kosiaris) [09:29:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 (owner: 10Alexandros Kosiaris) [09:30:04] (03Merged) 10jenkins-bot: Partially revert sessionstore emergency fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 (owner: 10Alexandros Kosiaris) [09:35:20] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [09:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:57] !log move the sessionstore codfw pods back to the dedicated sessionstore nodes [09:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:32] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' . 
[09:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:26] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops: Increased "Allowed memory size exhausted" exceptions from MediaWiki since 2020-06-25 ~16:00 - https://phabricator.wikimedia.org/T256459 (10JMeybohm) [09:38:43] !log move the sessionstore eqiad pods back to the dedicated sessionstore nodes [09:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:13] !log cp2033: restart purged T256444 [09:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:16] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [09:47:13] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) p:05Triage→03High [09:47:42] 10Operations, 10Traffic, 10User-notice: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10ema) p:05Triage→03Medium [09:48:08] (03CR) 10Jcrespo: "Backup took 28 seconds:" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [09:51:16] (03PS1) 10Elukey: Move archiva.wikimedia.org from archiva1001 to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/607989 (https://phabricator.wikimedia.org/T252767) [09:55:46] !log cp1087: restart purged T256444 [09:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [09:57:57] !log cp2037: restart purged T256444 [09:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:06] (03PS4) 10Jcrespo: mariadb-backups: Remove x1 from db1095 and enable db1102 notif. 
[puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) [09:58:37] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad on icinga1001 is CRITICAL: 0.2697 lt 0.3 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [09:59:16] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Remove x1 from db1095 and enable db1102 notif. [puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [10:00:28] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23489/" [puppet] - 10https://gerrit.wikimedia.org/r/607989 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:03:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:03:24] we have a flurry of api reqests https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=17&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [10:03:47] it's already dying down, but this is weird [10:03:50] yeah [10:03:53] !log cp2039: restart purged T256444 [10:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:57] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [10:04:01] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad on icinga1001 is OK: (C)0.3 lt (W)0.5 lt 0.8048 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [10:04:41] akosiaris: no sign of distress on the mcrouter/memcached front [10:06:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:06:54] elukey: yes. All signals deteriorated for a while [10:07:01] but we are ok now? [10:07:22] or so it seems to me [10:07:29] I think so, didn't check the exceptions yet [10:08:46] well, per icinga those are elevated right now [10:09:04] the linked grafana dashboard concurs [10:09:31] but it also looks like it's getting better. Maybe an aftermath? [10:10:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:10:28] from logstash I see a spike in errors only from 9:55~10:00 UTC, matching the rise in traffic [10:10:34] mostly wikidatawiki, with cirrussearch-too-busy-error [10:10:42] CC: dcausse: --^ [10:11:10] but it is probably a side effect of the traffic [10:11:55] yeah, agreed [10:15:40] (03PS1) 10Kormat: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) [10:18:07] (03CR) 10jerkins-bot: [V: 04-1] mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:19:01] (03PS2) 10Kormat: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) [10:21:25] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) I have identified the misbehaving purged instances with `rate(purged_events_received_total{cluster="cache_text", topic="eqiad.resource-purge"}[5m]) == 0` and restarted them...
[10:22:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11676 and previous config saved to /var/cache/conftool/dbconfig/20200626-102201-marostegui.json [10:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1085', diff saved to https://phabricator.wikimedia.org/P11677 and previous config saved to /var/cache/conftool/dbconfig/20200626-102248-marostegui.json [10:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:00] !log pool 5006 T256449 [10:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:12] T256449: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 [10:25:23] 10Operations, 10Traffic: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10Johan) [10:25:26] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Adithyak1997) [10:28:24] 10Operations, 10DBA, 10SRE-tools, 10User-Kormat: Add native mysql module to spicerack - https://phabricator.wikimedia.org/T255409 (10Kormat) [10:28:53] 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review, 10User-Kormat: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Kormat) [10:29:35] 10Operations, 10Traffic: Make atsmtail-backend.service depend on fifo-log-demux - https://phabricator.wikimedia.org/T256467 (10ema) [10:29:55] 10Operations, 10Traffic: Make atsmtail-backend.service depend on fifo-log-demux - https://phabricator.wikimedia.org/T256467 (10ema) p:05Triage→03Low [10:30:12] 10Operations, 10ops-eqsin, 10Traffic: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 (10ema) 05Open→03Resolved a:03ema 
The host looks fine, closing for now. [10:31:52] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 11.64 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:32:48] (03PS10) 10ArielGlenn: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:34:17] what now [10:34:42] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:35:51] (03CR) 10ArielGlenn: [C: 03+2] dumps: fix shellcheck issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:39:08] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [10:40:11] (03CR) 10Daimona Eaytoy: "> All except wmgUseGlobalAbuseFilters this seems fine and mostly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607605 (owner: 10Reedy) [10:40:15] (03CR) 10jerkins-bot: [V: 04-1] toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) (owner: 10Arturo Borrero Gonzalez) [10:41:05] (03PS1) 10ArielGlenn: add options to make testing dumps rsync easier [puppet] - 10https://gerrit.wikimedia.org/r/608006 [10:47:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:54:47] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) [10:58:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:03:13] (03PS1) 10Kormat: install_server: Reuse partitions for dbprov* hosts [puppet] - 10https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768) [11:04:17] (03PS2) 10Kormat: install_server: Reuse partitions for dbprov* hosts [puppet] - 10https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768) [11:12:56] (03PS1) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [11:16:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:21:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:35:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've left another round of comments and have hopefully answered your question @bmansurov. 
I think that we should be able to deploy after " (0323 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [11:37:33] (03PS1) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [11:45:32] (03CR) 10CDanis: [C: 03+1] ATS: override Cache-Control for Set-Cookie responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [11:52:06] (03PS2) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [11:55:00] (03PS2) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [11:59:59] (03PS2) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [12:00:08] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:07:23] (03PS3) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [12:08:29] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:11:09] (03PS4) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [12:15:16] (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [12:18:07] elukey: just saw your ping, looking 
per some elastic graphs we rejected a bunch of queries between 9:50 and 10:10, fulltext qps rose from 450 to 2500 (5x more than what we usually handle) [12:21:24] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:21:32] (03CR) 10Filippo Giunchedi: purged: alert in case of high event lag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:23:14] (03PS4) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [12:30:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:33:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:34:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:37:06] dcausse: yes there was a big surge in traffic, just wanted to let you know [12:37:17] sure, thanks!
[12:39:02] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:40:13] (03PS2) 10Alexandros Kosiaris: all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 [12:40:15] (03PS9) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [12:41:48] (03CR) 10CDanis: [C: 04-2] "This is the right fix, but at the wrong time -- let's not deploy this until we get a handle on the app-layer causes. AFAICT this will bre" [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [12:44:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 (owner: 10Alexandros Kosiaris) [12:45:01] (03Merged) 10jenkins-bot: all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 (owner: 10Alexandros Kosiaris) [12:45:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [12:46:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:48:43] (03Merged) 10jenkins-bot: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [12:52:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is 
CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:55:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:56:02] is anybody checking --^ ? [13:00:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:03:19] (03PS5) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [13:04:03] (03CR) 10Ema: purged: alert in case of high event lag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:04:57] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:05:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:07:27] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1003/454/" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:08:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash 
job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:13] so logstash mediawiki-errors doesn't show much [13:13:16] the one in logstash-next shows some pressure on parsoid [13:13:29] (03PS6) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [13:15:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:55] 10Operations, 10Traffic: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479 (10ema) [13:23:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:24:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:27:27] elukey: I've created https://phabricator.wikimedia.org/T256459 for this. I think this *might* be related to the "Parsoid-PHP is now considered part of MediaWiki." part from the logstash improvement mail from Krinkle [13:28:18] (03PS1) 10Ema: Serialize access to KafkaReader.maxts [software/purged] - 10https://gerrit.wikimedia.org/r/608045 (https://phabricator.wikimedia.org/T256479) [13:29:32] 10Operations, 10Traffic, 10Patch-For-Review: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479 (10ema) p:05Triage→03Medium [13:31:59] jayme: thanks! [13:33:34] (03CR) 10Filippo Giunchedi: [C: 03+1] purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:43:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:52:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:02:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:05:35] jayme: elukey Yeah OOMs are usually not worthy of a task unless it's orders of magnitude higher than normal [14:06:06] Too many little holes that are not plugged and can be triggered by wikitext [14:06:31] Krinkle: sure but then we need to tune the alarms, see mw exceptions (that is the only thing related that I found, I may have missed something) [14:06:32] Krinkle: it's easily x10 for some cases, though [14:06:33] Unfortunately OOM is therefore partially used as proxy for rejecting too complex pages [14:06:45] Yeah [14:07:38] jayme: what do you mean by "for some cases"? [14:07:41] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops [14:08:47] elukey: are the alerts fixed or dynamic? What's the baseline or relative comparison date? [14:10:27] Krinkle: see https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [14:10:40] I think we alarm on this metric [14:11:15] you can find the definition in profile::mediawiki::alerts [14:11:34] sum(irate(logstash_mediawiki_events_total{channel=~"(fatal|exception)",level="ERROR"}[10m])) without (channel, instance) * 60 [14:11:48] if > 25 warning, > 50 error [14:12:31] Hmm I see. Maybe 50/100 makes sense there. We have lots of other triggers already [14:12:41] This is more a wide net catch all [14:12:45] Krinkle: Depending on which timeframes you compare.
We had like a top of ~120 per hour from 06-24 to 06-25 and a top of 1.6k from 06-25 to 06-26 [14:12:57] Doesn't need to be as sensitive [14:13:05] (looking at kibana) [14:13:21] (03PS1) 10Jcrespo: mariadb-backups: Move transferpy deployment to debian package [puppet] - 10https://gerrit.wikimedia.org/r/608053 (https://phabricator.wikimedia.org/T138562) [14:13:42] elukey: average per x minutes? Or last minute from whenever it is polled? [14:14:20] Based on my irate understanding that means current minute [14:14:44] Last two data points right? [14:15:06] (03PS2) 10Jcrespo: mariadb-backups: Move transferpy deployment to debian package [puppet] - 10https://gerrit.wikimedia.org/r/608053 (https://phabricator.wikimedia.org/T138562) [14:15:39] https://prometheus.io/docs/prometheus/latest/querying/functions/#irate [14:15:57] lemme check because I am not 100% sure on a friday afternoon :D [14:16:34] so up to 10m ago, for the last two datapoints [14:16:38] I don't think irate makes sense here then [14:17:12] Yeah for irate the [m] part doesn't do much unless there are big gaps in Prometheus collection points [14:17:51] Especially since we have subminute reporting intervals in prod [14:19:25] rate() over 5m would be more useful, gives us an average rate over the full duration [14:19:45] Perhaps even keep the thresholds as is [14:20:35] * Krinkle adds todo to add threshold visually to the grafana dash [14:27:59] (03CR) 10DCausse: [C: 04-1] "should be merged right before doing a data-reload with the --skolemize on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [14:39:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:04:10] (03PS1) 10BBlack: Set-Cookie: Send server header to syslog as well [puppet] - 10https://gerrit.wikimedia.org/r/608060 (https://phabricator.wikimedia.org/T256395) [15:08:42] (03PS1) 10Filippo Giunchedi: logstash: add tests for filter-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/608061 (https://phabricator.wikimedia.org/T251869) [15:08:44] (03CR) 10CDanis: [C: 03+1] Set-Cookie: Send server header to syslog as well [puppet] - 10https://gerrit.wikimedia.org/r/608060 (https://phabricator.wikimedia.org/T256395) (owner: 10BBlack) [15:14:06] (03PS1) 10Cicalese: Add HTTP proxy to MediaModeration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) [15:14:19] (03CR) 10BBlack: [C: 03+2] Set-Cookie: Send server header to syslog as well [puppet] - 10https://gerrit.wikimedia.org/r/608060 (https://phabricator.wikimedia.org/T256395) (owner: 10BBlack) [15:18:32] 10Operations: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10CDanis) For a number of years now, work has been proceeding in order to bring to perfection the crudely-conceived idea of a machine that would not only supply the easy re-routing of traffic for lo... [15:23:16] (03CR) 10Ppchelko: Add HTTP proxy to MediaModeration. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [15:25:20] 10Operations, 10SRE-swift-storage, 10serviceops: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (10JMeybohm) 05Open→03Resolved a:03JMeybohm This is done and the account is working, thanks @fgiunchedi ! 
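[editor's note] The irate-vs-rate exchange above ([14:13:42] through [14:19:45]) comes down to this: PromQL's irate() derives a per-second rate from only the last two samples in the range, while rate() averages over the whole range, so rate() is much less spiky for alert thresholds. A minimal sketch of the distinction, in plain Go with hypothetical sample data (simplified: real PromQL rate() also handles counter resets and extrapolation):

```go
package main

import "fmt"

// sample is one scrape of a monotonically increasing counter at time t (seconds).
type sample struct {
	t float64
	v float64
}

// irateLike mirrors PromQL irate(): per-second rate from only the LAST TWO
// samples in the window, so it reflects just the most recent scrape interval.
func irateLike(s []sample) float64 {
	a, b := s[len(s)-2], s[len(s)-1]
	return (b.v - a.v) / (b.t - a.t)
}

// rateLike mirrors (a simplification of) PromQL rate(): per-second rate
// averaged across the WHOLE window, which smooths out short bursts.
func rateLike(s []sample) float64 {
	a, b := s[0], s[len(s)-1]
	return (b.v - a.v) / (b.t - a.t)
}

func main() {
	// A counter that is mostly flat but bursts between the last two scrapes.
	samples := []sample{{0, 0}, {60, 10}, {120, 20}, {180, 30}, {240, 90}}
	fmt.Println(irateLike(samples)) // 1 (60 events in the final 60s)
	fmt.Println(rateLike(samples))  // 0.375 (90 events over 240s)
}
```

For the same data the irate-style value is nearly 3x the averaged one, which is why an irate-based threshold on a bursty error counter flaps the way this alert did all day.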
[15:38:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:44:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:49:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:55:28] (03CR) 10Dave Pifke: [C: 03+1] production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) (owner: 10Marostegui) [16:03:08] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) (owner: 10Marostegui) [16:19:49] (03PS1) 10Andrew Bogott: cloud-vps: puppetize /etc/ldap.conf on sssd clients [puppet] - 10https://gerrit.wikimedia.org/r/608068 [16:26:39] 10Operations, 10ops-ulsfo: update rack location of decom wmf5801 - https://phabricator.wikimedia.org/T249287 (10RobH) 05Open→03Resolved fixed [16:26:41] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (10RobH) [16:26:43] 10Operations, 10netbox: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10RobH) [16:31:09] 10Operations, 10ops-ulsfo: remove 
cr4-ulsfo:xe-0/1/1 - https://phabricator.wikimedia.org/T254206 (10RobH) 05Open→03Resolved updated gsheet and removed the fiber/cable from netbox [16:35:37] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) [16:45:34] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) https://netbox.wikimedia.org/dcim/cables/1632/ was also missing a cable id (zayo link) and added it to both gsheet xconnect tracking sheet and to netbox info [16:47:19] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) 05Stalled→03Resolved [16:47:21] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10RobH) [16:47:31] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) all cable errors in netbox for ulsfo are fixed [17:05:46] (03CR) 10EBernhardson: "two things related to recent changes I made to this code" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [17:07:09] 10Operations, 10Android-app-Bugs, 10Parsoid, 10Traffic, and 6 others: Right-to-Left directionality problem with refs - https://phabricator.wikimedia.org/T251983 (10bearND) 05Open→03Resolved a:03bearND Ok, looks like enough time has passed for the old, cached version to be evicted and the newly deploye...
[17:10:58] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [17:11:01] !log msw work in ulsfo via T256300 [17:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:06] T256300: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 [17:17:02] PROBLEM - Host dns4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:02] PROBLEM - Host bast4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:06] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:20] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:20] PROBLEM - Host re0.cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:17:30] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:46] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:18:14] PROBLEM - Host ganeti4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:04] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:06] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:06] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:08] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:15] thats me! 
[17:19:16] PROBLEM - Host lvs4006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:19:21] and they'll start to recover shortly [17:19:26] (all ulsfo mgmt is me and expected) [17:20:04] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:21:54] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:38:13] 10Operations, 10Traffic, 10Patch-For-Review: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10Nemo_bis) [17:38:26] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Nemo_bis) [17:46:30] (03PS1) 10Bstorm: Revert "unattendedupgrades: allow configurable kernel cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/608085 [17:46:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:47:33] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:47:34] RECOVERY - Host bast4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 82.60 ms 
[17:47:34] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.36 ms [17:47:34] RECOVERY - Host cp4030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.55 ms [17:47:34] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.40 ms [17:47:35] RECOVERY - Host ganeti4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [17:47:37] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.82 ms [17:47:51] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.61 ms [17:47:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:48:21] RECOVERY - Host re0.cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 90.33 ms [17:48:21] RECOVERY - Host cp4022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 91.73 ms [17:48:21] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 97.99 ms [17:48:21] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.87 ms [17:48:23] RECOVERY - Host lvs4006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [17:49:35] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [17:49:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:46] !log msw2-ulsfo work done, all mgmt items confirmed back online and icinga alerts cleared, moving onto msw1-ulsfo (rack 22) and will lose all mgmt in that rack for next 10-20 minutes [17:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:56] !log msw2-ulsfo work done, all mgmt items confirmed back online and icinga alerts cleared, moving onto msw1-ulsfo (rack 22) and will lose all mgmt in that rack for next 10-20 minutes T256300 [17:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:00] T256300: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 [17:58:25] PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:58:55] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:59:23] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:35] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:59:56] (03CR) 10Thcipriani: "Couple of nits inline." (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [18:00:55] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:13] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:01:31] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:31] PROBLEM - Host re0.cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:02:23] PROBLEM - Host ganeti4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:23] PROBLEM - Host ganeti4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:21] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:21] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:24] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:24] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:33] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:57] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:04:05] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.13 ms [18:04:57] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:05:59] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.37 ms [18:06:03] ok, they should all be coming back now [18:06:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:21] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.75 ms [18:06:41] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.77 ms [18:06:59] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.87 ms [18:06:59] RECOVERY - Host re0.cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.54 ms [18:07:54] RECOVERY - Host ganeti4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 76.33 ms [18:07:54] RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 76.13 ms [18:07:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:57] 10Operations, 10ops-codfw: Return asw-c8-codfw to spares - https://phabricator.wikimedia.org/T256498 (10faidon) p:05Triage→03Low [18:08:49] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.10 ms [18:08:51] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.18 ms [18:08:51] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [18:09:03] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [18:09:31] RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.19 ms [18:12:58] (03CR) 10Cicalese: Add HTTP proxy to MediaModeration. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:13:12] (03PS1) 10JMeybohm: Add patches for swift auth and bind interface [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843) [18:21:59] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:22:07] here, looking [18:22:19] 👋 [18:22:20] hey [18:22:30] * apergos peeks in [18:22:51] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 23788 bytes in 0.366 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:23:44] uhm [18:24:52] o/ [18:25:26] I'm here too if needed [18:25:42] there's no sign of anything wrong [18:25:44] strange [18:25:49] world's fastest recovery [18:25:57] yeah, metrics don't seem to show any blip [18:26:05] neither app level nor traffic [18:26:06] stupid question: I'm having 
trouble even reading the alert, what does it mean when it says eqiad and codfw in those two fields? [18:26:15] yea, is it eqiad or codfw or both [18:26:35] is that cross-dc traffic, and if so which direction? [18:27:06] * volans was heading out, do you need me too? [18:27:07] I believe it is the LVSen in codfw, for api_appserver which it knows the realservers are in eqiad [18:27:14] I can stay if needed [18:27:27] volans: no, seems like nothing [18:27:29] cdanis: ack, thanks [18:28:31] ok thx [18:29:03] not seein nothin [18:29:06] meh [18:29:30] looked at: red, app servers, api servers, lvses (because hey you never know)... [18:29:35] == [18:30:50] msw1-ulsfo any chance the work on that caused a momentary blip? [18:30:54] in codfw? naw [18:31:00] there's nothing obvious in syslog or pybal.log on lvs2009 [18:31:08] seems unlikely but [18:31:17] can't find anything obviously wrong either so far [18:31:35] welp if nothin in the logs, I'm out of ideas [18:33:57] if no one objects I'm going to check back out (if someone has an idea of further things to look at, I will happily stick around. or at least not too grudgingly :-P) [18:33:59] on the Icinga server side the logs have some SOFT alerts about saturated CPU cores on LVS [18:34:13] lvs1013;At least one CPU core of an LVS is saturated- packet drops are likely;WARNING;SOFT; [18:34:30] but they stayed SOFT, so no alerts [18:35:06] mutante: I need to adjust the thresholds a bit, those are common to go into WARNING on a puppet run 😂 [18:35:32] cdanis: ok, ACK. 
it was lvs1013 and lvs2010, nothing special then, ok [18:35:47] 2010 is the backup in codfw anyway, not the primary [18:37:28] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [18:37:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [18:38:02] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) 05Open→03Resolved Ok the old switches are unracked and in the bottom of rack 23. all cables in netbox added and all duplicate cable id conflicts resolved/fixed. [18:38:11] 10Operations, 10ops-eqiad: Decommisson and store old row D network gear. - https://phabricator.wikimedia.org/T170474 (10wiki_willy) a:03Cmjohnson [18:38:44] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10wiki_willy) a:03Jclark-ctr [18:39:59] I'm logging off as well, available if needed [18:42:18] !log all ulsfo onsite work completed as of 30 minutes ago [18:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:27] 10Operations, 10observability, 10User-MoritzMuehlenhoff: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10herron) 05Open→03Resolved a:03herron logging ES7 instances are now using the system openjdk-11 [18:48:21] (03PS2) 10Bstorm: Revert "unattendedupgrades: allow configurable kernel cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/608085 [18:58:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:59:40] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10wiki_willy) [19:00:15] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10wiki_willy) [19:00:45] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10wiki_willy) [19:01:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10wiki_willy) [19:02:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:03:09] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1003.eqiad.wmnet ` The log... [19:04:27] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1004.eqiad.wmnet ` The log... 
[19:11:41] (03CR) 10Bstorm: [C: 03+2] Revert "unattendedupgrades: allow configurable kernel cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/608085 (owner: 10Bstorm) [19:38:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:45:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:02:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:07:09] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['relforge1003.eqiad.wmnet'... [20:07:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:08:27] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['relforge1004.eqiad.wmnet'... [20:34:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:44:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:51:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:58:54] (03CR) 10Hashar: "As I get it, the idea is to spawn a Cassandra container and get some other container to use it as a backend to run tests." 
(034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [21:01:03] (03CR) 10Hashar: "Bah I thought this change was for integration/config but it is targeting operations/docker-images/production-images :] So a couple of my " (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [21:15:25] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Cloud-VPS: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - https://phabricator.wikimedia.org/T255464 (10bd808) Still busted: `lines=10,lang=shell-session root@wikistats:~# puppet agent -tv Info: Using configured environme... [21:16:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:24:12] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition=1 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging [21:24:12] All&var-consumer_group=All [21:36:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:36:52] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [21:51:41] (03PS2) 10BryanDavis: Pywikibot container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) [21:51:43] (03PS1) 10BryanDavis: webservice-python-bootstrap: install wheel [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608093 [22:04:26] (03CR) 10Legoktm: [C: 03+1] "I thought this would have already been installed, but apparently[1] venv doesn't install wheel, while virtualenv does." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608093 (owner: 10BryanDavis) [22:12:24] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1088.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:17:01] (03PS1) 10QChris: gerrit: As log4j.xml is a static file, treat it as static file [puppet] - 10https://gerrit.wikimedia.org/r/608097 [22:17:03] (03PS1) 10QChris: gerrit: Adapt log4j config to catch gc_log messages for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/608098 [22:28:25] (03CR) 10QChris: "This change is not a noop on production, as at least file" [puppet] - 10https://gerrit.wikimedia.org/r/608097 (owner: 10QChris) [22:28:51] (03CR) 10QChris: "This change can well wait until the deployment takes place :-)" [puppet] - 10https://gerrit.wikimedia.org/r/608097 (owner: 10QChris) [22:29:06] (03CR) 10QChris: "This change can wait until the deployment takes place :-)" [puppet] - 10https://gerrit.wikimedia.org/r/608098 (owner: 10QChris) [22:42:15] ottomata: btw, looking into your EL/EventGate issue now. [22:42:18] https://phabricator.wikimedia.org/T249261 [23:01:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:03:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
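[editor's note] Legoktm's review comment above (on Gerrit 608093, "webservice-python-bootstrap: install wheel") states that the stdlib `python3 -m venv` does not seed `wheel` into new environments, while the third-party `virtualenv` tool does. A quick, illustrative sketch of checking this; the temp path is arbitrary and the exact seed-package list varies by Python version:

```shell
# Create a stdlib venv; ensurepip seeds pip (and, on older
# Pythons, setuptools) but not wheel.
python3 -m venv /tmp/venv-wheel-demo

# List seed packages: no 'wheel==...' line is expected here.
/tmp/venv-wheel-demo/bin/pip list --format=freeze

# Without wheel, installing an sdist can fall back to the legacy
# 'setup.py install' path; seeding wheel explicitly avoids that,
# which is presumably what the bootstrap change accomplishes.
/tmp/venv-wheel-demo/bin/pip install wheel
```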
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:09:17] (03PS1) 10Accraze: hiera: add missing metrics for ores statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/608102 (https://phabricator.wikimedia.org/T233448) [23:12:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:14:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:23:37] (03PS2) 10Accraze: hiera: add missing metrics for ores statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/608102 (https://phabricator.wikimedia.org/T233448) [23:43:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:52:10] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:54:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops