[00:00:34] 10Operations, 10Wikimedia-Logstash, 10observability, 10User-fgiunchedi: move 4 new logstash VMs into production - https://phabricator.wikimedia.org/T256443 (10Dzahn) [00:01:35] (03CR) 10Ryan Kemper: "`puppet-merge` is done" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [00:01:43] (03CR) 10Ryan Kemper: "Sorry for letting this go stale - I'll work on getting this out tomorrow (the 26th)" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [00:16:06] (03CR) 10Herron: "thanks for merging! I'll do rolling bounces of the logging ES7 instances tomorrow to pick up the new config" [puppet] - 10https://gerrit.wikimedia.org/r/607542 (https://phabricator.wikimedia.org/T252913) (owner: 10Herron) [00:28:01] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw} instance=kafkamon1001 job=burrow partition={0,1,3} site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&va [00:28:01] -eqiad&var-topic=All&var-consumer_group=All [00:29:47] (03PS7) 10Dave Pifke: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [00:29:49] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [00:30:11] (03CR) 10jerkins-bot: [V: 04-1] arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [00:31:26] (03PS8) 10Dave Pifke: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [00:33:17] (03PS9) 10Dave Pifke: arclamp: Deploy from scap [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) [00:35:57] (03CR) 10Krinkle: [C: 03+1] arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [00:36:32] !log tstarling@deploy1001 Synchronized w/T256395-cookie-test.php: (no justification provided) (duration: 00m 58s) [00:36:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:59] (03CR) 10Dave Pifke: arclamp: Deploy from scap (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/607370 (https://phabricator.wikimedia.org/T200109) (owner: 10Dave Pifke) [00:38:42] !log tstarling@deploy1001 Synchronized w/T256395-cookie-test.php: (no justification provided) (duration: 00m 56s) [00:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:05:54] (03PS1) 10CDanis: varnish: Set-Cookie responses are uncacheable [puppet] - 10https://gerrit.wikimedia.org/r/607910 (https://phabricator.wikimedia.org/T256395) [01:10:51] (03PS1) 10Andrew Bogott: designate codfw1dev: remove need for 'legacy' dns sink domain [puppet] - 10https://gerrit.wikimedia.org/r/607911 [01:12:46] (03CR) 10Tim Starling: [C: 03+1] varnish: Set-Cookie responses are uncacheable [puppet] - 
10https://gerrit.wikimedia.org/r/607910 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [01:12:55] (03CR) 10CDanis: [C: 03+2] "This matches ATS behavior, so seems safe: https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/profile/files/trafficserve" [puppet] - 10https://gerrit.wikimedia.org/r/607910 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [01:13:26] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'disable-puppet "cdanis deploying I6cc5f3e6 T256395"' [01:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:46] (03PS2) 10Andrew Bogott: designate codfw1dev: remove need for 'legacy' dns sink domain [puppet] - 10https://gerrit.wikimedia.org/r/607911 [01:21:10] (03CR) 10Andrew Bogott: [C: 03+2] designate codfw1dev: remove need for 'legacy' dns sink domain [puppet] - 10https://gerrit.wikimedia.org/r/607911 (owner: 10Andrew Bogott) [01:41:26] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕘🍺 sudo cumin A:cp 'enable-puppet "cdanis deploying I6cc5f3e6 T256395"' [01:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:46:02] (03PS4) 10Bmansurov: Add recommendation-api helmfile stanzas [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) [01:47:17] (03CR) 10Bmansurov: Add recommendation-api helmfile stanzas (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [01:53:09] !log I6cc5f3e6 has been deployed to all cp text nodes T256395 [01:53:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:55:41] Hrm, asked in -traffic, but maybe this channel makes more sense: [01:55:56] https://twitter.com/john_overholt/status/1276276247602044933 describes a situation where one version of the article is visible logged in and another one logged out for a couple of persons. 
Neither a hard refresh nor purging the page seems to help. Is this a normal issue? Something we need to do something about? [01:56:40] JohanJ: I'm hearing similar reports [01:56:48] i wondered if it might be replication lag related [01:56:52] however i saw no alerts in log [01:57:31] on closer look they are related reports ;) [01:58:03] so i'm not sure how widespread it is but a couple people in those threads were seeing it [02:17:11] !log depooling cp1087 which has not been processing purges for 11.415 days [02:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:17:21] thanks for passing on the report JohanJ [02:19:09] !log three more hosts not processing purges for multiple days ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕥🍺 sudo cumin 'cp2033*,cp2037*,cp2039*' 'depool' [02:19:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:20:14] https://w.wiki/VDp [02:20:22] oh, that only works if you can log into grafana, nevermind [02:21:02] I'm taking a stretch and getting some water, then opening a task and doing a bit more digging. But for now I believe no one should be served badly-stale pages. 
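(Editor's aside: the T256395 patches in this log make Varnish treat any backend response carrying a Set-Cookie header as uncacheable, matching existing ATS behavior. Below is a minimal, illustrative Python sketch of that cacheability rule only; the actual change is VCL in operations/puppet, and the function name here is invented for illustration.)

```python
def cacheable_in_shared_cache(beresp_headers):
    """Sketch of the rule behind the T256395 Varnish patches: a shared
    cache should not store a backend response that sets cookies, since
    the body may be user-specific. Illustrative only, not the real VCL.
    """
    # HTTP header field names are case-insensitive.
    return not any(name.lower() == "set-cookie" for name in beresp_headers)
```

In the VCL change this would correspond to refusing to cache (e.g. marking the object uncacheable) when `beresp.http.Set-Cookie` is present; the exact mechanics are in the Gerrit changes linked above.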
[02:23:27] (03PS1) 10CDanis: varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 [02:24:14] (03PS2) 10CDanis: varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) [02:31:20] John confirmed on Twitter that he's no longer seeing problems :) [02:36:51] 10Operations, 10Traffic: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [02:39:16] (03PS1) 10Jeena Huneidi: Add Cassandra image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) [02:39:43] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:41:33] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:41:39] (03PS1) 10Legoktm: libraryupgrader: Add systemd units [puppet] - 10https://gerrit.wikimedia.org/r/607919 (https://phabricator.wikimedia.org/T173478) [02:42:15] (03PS2) 10Jeena Huneidi: Add Cassandra image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) [02:49:11] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10Johan) [02:52:55] !log tstarling@deploy1001 Synchronized private/PrivateSettings.php: updating wgAuthenticationTokenVersion per my wikitech-l post (duration: 00m 57s) [02:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:01:46] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [03:07:36] (03CR) 10Tim Starling: "As I said on IRC, I would prefer to preserve the cache object for analysis, rather than forcibly expiring them (if that's what this does)." 
[puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [03:29:09] (03CR) 10Ladsgroup: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/603688 (https://phabricator.wikimedia.org/T254646) (owner: 10Ladsgroup) [03:30:48] 10Operations, 10Traffic, 10User-notice: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10CDanis) [03:51:39] (03CR) 10BBlack: [C: 03+1] varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [03:51:49] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie beresps are uncacheable, take 2: log them too [puppet] - 10https://gerrit.wikimedia.org/r/607917 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [03:54:35] !log https://gerrit.wikimedia.org/r/c/operations/puppet/+/607917 [03:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:54:43] aafgakjfg clipboard vs selection [03:54:51] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕛🍺 sudo cumin A:cp 'disable-puppet "I39e1c68a is broken"' [03:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:59:14] (03PS1) 10CDanis: varnish: Set-Cookie responses are uncacheable: take 3 [puppet] - 10https://gerrit.wikimedia.org/r/607922 (https://phabricator.wikimedia.org/T256395) [03:59:45] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie responses are uncacheable: take 3 [puppet] - 10https://gerrit.wikimedia.org/r/607922 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [04:01:47] !log re-enable puppet on cps [04:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:12] (03PS1) 10CDanis: varnish: Set-Cookie: log to syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) [04:20:33] (03PS2) 10CDanis: varnish: Set-Cookie: log to 
syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) [04:21:13] (03CR) 10BBlack: [C: 03+1] varnish: Set-Cookie: log to syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [04:21:23] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie: log to syslog for non-performance.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/607923 (https://phabricator.wikimedia.org/T256395) (owner: 10CDanis) [04:25:18] (03PS1) 10CDanis: varnish: Set-Cookie beresps: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/607924 [04:26:59] (03PS2) 10CDanis: varnish: Set-Cookie beresps: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/607924 [04:27:56] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie beresps: fix syntax [puppet] - 10https://gerrit.wikimedia.org/r/607924 (owner: 10CDanis) [04:37:24] (03PS1) 10CDanis: varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 [04:38:44] (03PS2) 10CDanis: varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 [04:42:05] (03PS3) 10CDanis: varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 [04:43:37] (03CR) 10BBlack: [C: 03+1] varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 (owner: 10CDanis) [04:43:50] (03CR) 10CDanis: [C: 03+2] varnish: Set-Cookie beresps: also log bereq Host header [puppet] - 10https://gerrit.wikimedia.org/r/607925 (owner: 10CDanis) [04:45:35] (03PS1) 10Marostegui: install_server: Reimage dbproxy1014 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607927 (https://phabricator.wikimedia.org/T255408) [04:46:28] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage dbproxy1014 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/607927 
(https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [05:03:25] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/source/{title}/{to} (Suggest a source title to use for translation) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [05:03:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime [05:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:02] (03PS2) 10BBlack: Add Elastic IP to wikimedia_nets for PAPI [puppet] - 10https://gerrit.wikimedia.org/r/607313 (https://phabricator.wikimedia.org/T255524) [05:05:05] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [05:05:38] (03CR) 10BBlack: [C: 03+2] Add Elastic IP to wikimedia_nets for PAPI [puppet] - 10https://gerrit.wikimedia.org/r/607313 (https://phabricator.wikimedia.org/T255524) (owner: 10BBlack) [05:06:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [05:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:42] (03PS1) 10Marostegui: dbproxy1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607928 (https://phabricator.wikimedia.org/T255408) [05:13:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db2088:3312, db2104', diff saved to https://phabricator.wikimedia.org/P11672 and previous config saved to /var/cache/conftool/dbconfig/20200626-051328-marostegui.json [05:13:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:13:33] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10CDanis) [05:13:53] (03CR) 10Marostegui: [C: 03+2] dbproxy1014: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/607928 (https://phabricator.wikimedia.org/T255408) (owner: 10Marostegui) [05:31:09] PROBLEM - SSH on 
cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:38:55] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:50:38] (03PS1) 10Marostegui: production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) [06:15:39] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:17:21] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:24:33] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:25:03] (03PS1) 10Tim Starling: Add the cache-cookies log to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607931 (https://phabricator.wikimedia.org/T256395) [06:29:55] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:37:57] (03CR) 10Tim Starling: [C: 03+2] Add the cache-cookies log to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607931 (https://phabricator.wikimedia.org/T256395) (owner: 10Tim Starling) [06:38:46] (03Merged) 10jenkins-bot: Add the cache-cookies log to wmgMonologChannels [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607931 (https://phabricator.wikimedia.org/T256395) (owner: 10Tim Starling) [06:40:34] !log tstarling@deploy1001 Synchronized wmf-config/InitialiseSettings.php: add cache-cookies log channel (duration: 00m 59s) [06:40:37] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:13] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:57:03] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10elukey) Updated wikitech in https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Rename_while_reimaging with the procedure that I followed (mentioning also to... [06:57:11] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:58:44] (03PS7) 10Privacybatm: [WIP] transferpy: Generate checksum parallel to the data transfer [software/transferpy] - 10https://gerrit.wikimedia.org/r/605851 (https://phabricator.wikimedia.org/T254979) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200626T0700) [07:01:32] 10Operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10Traffic, and 2 others: Special:HideBanners is not really cacheable - https://phabricator.wikimedia.org/T256447 (10tstarling) [07:08:07] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:17:07] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:22:31] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:26:09] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:31:39] PROBLEM - SSH on cp5006 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:32:36] vgutierrez, XioNoX: anything ongoing in eqsin? 
cp5006 above and the librenms report in Netbox (not sure if related to each other) [07:33:25] volans: I was already trying to connect via mgmt console...no response for login [07:33:27] volans: it's possible that cr1-eqsin crapped itself [07:33:35] volans: can you prepare a depool? [07:33:41] sure [07:34:26] so far it looks all normal [07:34:28] still looking [07:34:53] the report on netbox is saying missing items in librenms [07:35:18] (03PS1) 10Volans: depool eqsin, network issues [dns] - 10https://gerrit.wikimedia.org/r/607976 [07:35:28] patch ready if needed ^^^ abandon it if not needed [07:35:50] volans: which patch? :) [07:36:00] lol [07:36:13] volans: https://librenms.wikimedia.org/device/device=159/ yeah librenms failed to pull the router, but everything looks back to normal now [07:36:55] smokeping is happy too https://smokeping.wikimedia.org/?target=eqsin.Core.cr1-eqsin [07:37:17] ok [07:37:25] I can ssh into the mgmt of cm5006 [07:37:32] cr1-eqsin logs are fine too [07:37:46] *cp5006, and as jayme said nothing in console [07:38:08] ema: around by any chance? [07:38:22] I'd go for a depool + force reboot if it seems reasonable [07:38:51] RECOVERY - SSH on cp5006 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:38:54] I was about to open a phab task...is that how "we do it"? :) [07:39:32] jayme: yeah [07:39:52] even if the actions above solve the issue it's good to have a tracking task for the future and offline people [07:39:56] ack [07:40:05] jayme: yes to both :) [07:41:22] jayme: another useful resource in these cases is https://wikitech.wikimedia.org/wiki/Service_restarts [07:42:14] sounds good yeah [07:42:34] ack, proceeding [07:42:43] !log volans@cumin1001 conftool action : set/pooled=no; selector: name=cp5006.eqsin.wmnet [07:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:34] jayme: are you opening a task?
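(Editor's aside: the depool step logged above is a conftool action. A small sketch of how that action string could be assembled, assuming the format shown in the `!log volans@cumin1001` entry; illustrative only, the real workflow also files a task and downtimes the host.)

```python
def conftool_pool_action(fqdn, pooled):
    """Build a conftool action string matching the SAL entry above.
    Sketch under the assumption that the logged format is canonical;
    this is not a real tooling interface.
    """
    state = "yes" if pooled else "no"
    return "conftool action : set/pooled={}; selector: name={}".format(state, fqdn)
```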
[07:43:43] 10Operations, 10ops-eqsin, 10Traffic: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 (10JMeybohm) [07:43:45] (to avoid duplicates) [07:43:51] yes :D [07:44:12] !log force rebooted cp5006 that is unresponsive (after having depooled it) - T256449 [07:44:15] thanks! [07:44:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:44:16] T256449: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 [07:44:27] can't go back in time in icinga prior to 04:00 UTC, though [07:44:33] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100% [07:44:46] icinga logs are in /srv/ [07:44:48] wow [07:44:57] console so far is all ��fx怘�怘�xx��x�x [07:45:03] doesn't look good [07:45:29] * volans uploading matrix-translator plugin to his brain [07:45:48] vgutierrez: be prepared to say goodbye to cp5006 [07:45:55] RECOVERY - Host kubestagetcd1004 is UP: PING WARNING - Packet loss = 33%, RTA = 0.32 ms [07:46:08] mmmh [07:46:11] 10Operations, 10ops-eqiad: upgrade memory in ganeti100[5-8].eqiad.wmnet - https://phabricator.wikimedia.org/T244530 (10akosiaris) >>! In T244530#6258712, @Dzahn wrote: > kubestagetcd1004 is still down. not sure if desired. ACKed in Icinga. No, mistake on my side. Thanks! I 've just started it up. [07:46:17] now I got kernel boot logs normally [07:46:25] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 231.35 ms [07:46:27] so seems that the bios redirection is borked, maybe misconfigured? 
host is up [07:47:58] nothing in syslog since Jun 26 04:40:01 [07:48:54] first nrpe socket timeout messages started to pop up at 2020-06-26 04:41:21 [07:50:41] volans: hey [07:50:42] 10Operations, 10ops-eqsin, 10Traffic: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 (10Volans) Host is back up, the console output during boot was all borked `��fx怘�怘�xx��x�x....` but the kernel boot logs were normally readable. Maybe there is some misconfiguration in... [07:50:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] conftool: Add new kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/607530 (owner: 10Alexandros Kosiaris) [07:50:48] (03PS1) 10Ayounsi: Add mgmt for cloudsw [dns] - 10https://gerrit.wikimedia.org/r/607978 (https://phabricator.wikimedia.org/T251632) [07:51:18] I've updated the task, if it's ok for you I'd leave it to you now, it's depooled in conftool [07:52:55] ema: ^^^ [07:53:19] (03CR) 10Volans: [C: 03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/607978 (https://phabricator.wikimedia.org/T251632) (owner: 10Ayounsi) [07:53:39] (03CR) 10Ayounsi: [C: 03+2] Add mgmt for cloudsw [dns] - 10https://gerrit.wikimedia.org/r/607978 (https://phabricator.wikimedia.org/T251632) (owner: 10Ayounsi) [07:54:20] volans: sure, ty [07:54:34] (03PS2) 10Hashar: releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 (https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [07:55:08] (03Abandoned) 10Hashar: releases::mediawiki:: support buster / PHP 7.3 [puppet] - 10https://gerrit.wikimedia.org/r/607641 (https://phabricator.wikimedia.org/T247652) (owner: 10Dzahn) [07:56:48] thanks!
[07:57:17] (03PS3) 10Hashar: releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 (https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [07:57:47] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [07:57:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:18] (03Abandoned) 10Volans: depool eqsin, network issues [dns] - 10https://gerrit.wikimedia.org/r/607976 (owner: 10Volans) [08:04:19] 10Operations, 10vm-requests: Site: 2 VM request for kubernetes sessionstore dedicated nodes - https://phabricator.wikimedia.org/T256254 (10akosiaris) 05Open→03Resolved a:03akosiaris VMs created, installed and ready. Resolving [08:04:21] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) [08:04:28] !log pool all new kubernetes nodes in LVS T252185 T256236 [08:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:33] T256236: Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 [08:04:33] T252185: (Need by: TBD) rack/setup/install kubernetes20[07-14].codfw.wmnet and kubestage200[1-2].codfw.wmnet. 
- https://phabricator.wikimedia.org/T252185 [08:04:53] !log akosiaris@cumin1001 conftool action : set/weight=10; selector: name=kubernetes.*.wmnet [08:04:53] (03CR) 10Kormat: [C: 03+1] production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) (owner: 10Marostegui) [08:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:04] !log akosiaris@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes.*.wmnet [08:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:29] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:08:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:15:33] 10Operations, 10serviceops, 10Sustainability (Incident Prevention): Increase capacity of the sessionstore dedicated kubernetes nodes - https://phabricator.wikimedia.org/T256236 (10akosiaris) 05Open→03Resolved Both paths outlined in the description have been followed. We now have 4 different sessionstore... 
[08:19:01] (03PS1) 10Alexandros Kosiaris: Partially revert sessionstore emergency fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 [08:20:16] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [08:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:21] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [08:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1088 for schema change', diff saved to https://phabricator.wikimedia.org/P11673 and previous config saved to /var/cache/conftool/dbconfig/20200626-082242-marostegui.json [08:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:02] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:20] (03PS5) 10Jcrespo: mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) [08:33:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1088', diff saved to https://phabricator.wikimedia.org/P11674 and previous config saved to /var/cache/conftool/dbconfig/20200626-083319-marostegui.json [08:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:57] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Move x1 backup source from db1095 to db1102 [puppet] - 10https://gerrit.wikimedia.org/r/607510 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [08:46:32] (03CR) 10Jcrespo: [C: 03+2] cumin: backup all of /srv where a lot of deployment state may live [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [08:47:24] (03CR) 10Hashar: [C: 03+1] releases::mediawiki: remove PHP packages [puppet] - 10https://gerrit.wikimedia.org/r/607858 
(https://phabricator.wikimedia.org/T249949) (owner: 10Dzahn) [08:50:45] 10Operations, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-fgiunchedi: Increase logging pipeline ingestion capacity - https://phabricator.wikimedia.org/T255243 (10fgiunchedi) >>! In T255243#6258852, @Dzahn wrote: > There are 4 new ganeti VMs now, 2 in eqiad and 2 in codfw, in row D e... [08:54:42] 10Operations, 10Puppet: Missing dependency on bacula-fd setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) [08:55:12] 10Operations, 10Puppet: Missing dependency on bacula-fd setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) [08:56:20] 10Operations, 10Puppet: Missing dependency on bacula-fd setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) Added @hashar @akosiaris @Volans FYI, feel free to unsubscribe as I will be the one debugging this. [08:56:25] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:40] 10Operations, 10Puppet: Missing dependency on bacula-fd Puppet setup - https://phabricator.wikimedia.org/T256454 (10jcrespo) [08:57:56] (03CR) 10Liuxinyu970226: [C: 03+1] Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593) (owner: 10VulpesVulpes825) [08:58:58] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [08:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1093 for schema change', diff saved to https://phabricator.wikimedia.org/P11675 and previous config saved to /var/cache/conftool/dbconfig/20200626-090813-marostegui.json [09:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:42] 10Operations: Puppet CI should fail over CRLF line endings 
(sometimes) - https://phabricator.wikimedia.org/T182641 (10hashar) That can be done directly inside `operations/puppet.git`. Ci just invokes `rake test`. [09:18:13] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 (owner: 10Alexandros Kosiaris) [09:29:36] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 (owner: 10Alexandros Kosiaris) [09:30:04] (03Merged) 10jenkins-bot: Partially revert sessionstore emergency fixes [deployment-charts] - 10https://gerrit.wikimedia.org/r/607980 (owner: 10Alexandros Kosiaris) [09:35:20] !log akosiaris@deploy1001 helmfile [CODFW] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [09:35:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:57] !log move the sessionstore codfw pods back to the dedicated sessionstore nodes [09:36:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:32] !log akosiaris@deploy1001 helmfile [EQIAD] Ran 'sync' command on namespace 'sessionstore' for release 'production' . 
[09:37:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:26] 10Operations, 10Core Platform Team, 10Performance-Team, 10serviceops: Increased "Allowed memory size exhausted" exceptions from MediaWiki since 2020-06-25 ~16:00 - https://phabricator.wikimedia.org/T256459 (10JMeybohm) [09:38:43] !log move the sessionstore eqiad pods back to the dedicated sessionstore nodes [09:38:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:13] !log cp2033: restart purged T256444 [09:46:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:16] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [09:47:13] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) p:05Triage→03High [09:47:42] 10Operations, 10Traffic, 10User-notice: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10ema) p:05Triage→03Medium [09:48:08] (03CR) 10Jcrespo: "Backup took 28 seconds:" [puppet] - 10https://gerrit.wikimedia.org/r/607258 (owner: 10Jcrespo) [09:51:16] (03PS1) 10Elukey: Move archiva.wikimedia.org from archiva1001 to archiva1002 [puppet] - 10https://gerrit.wikimedia.org/r/607989 (https://phabricator.wikimedia.org/T252767) [09:55:46] !log cp1087: restart purged T256444 [09:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [09:57:57] !log cp2037: restart purged T256444 [09:58:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:06] (03PS4) 10Jcrespo: mariadb-backups: Remove x1 from db1095 and enable db1102 notif. 
[puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) [09:58:37] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad on icinga1001 is CRITICAL: 0.2697 lt 0.3 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [09:59:16] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Remove x1 from db1095 and enable db1102 notif. [puppet] - 10https://gerrit.wikimedia.org/r/607515 (https://phabricator.wikimedia.org/T254871) (owner: 10Jcrespo) [10:00:28] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/23489/" [puppet] - 10https://gerrit.wikimedia.org/r/607989 (https://phabricator.wikimedia.org/T252767) (owner: 10Elukey) [10:03:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:03:24] we have a flurry of api reqests https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=17&fullscreen&orgId=1&from=now-1h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200 [10:03:47] it's already dying down, but this is weird [10:03:50] yeah [10:03:53] !log cp2039: restart purged T256444 [10:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:57] T256444: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 [10:04:01] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad on icinga1001 is OK: (C)0.3 lt (W)0.5 lt 0.8048 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [10:04:41] akosiaris: no sign of distress on the mcrouter/memcached front [10:06:41] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:06:54] elukey: yes. All signals deteriorated for a while [10:07:01] but we are ok now? [10:07:22] or so it seems to me [10:07:29] I think so, didn't check the exceptions yet [10:08:46] well, per icinga those are elevated right now [10:09:04] the linked grafana dashboard concurs [10:09:31] but it also looks like it's getting better. Maybe an aftermath? [10:10:17] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:10:28] from logstash I see a spike in errors only from 9:55~10:00 UTC, matching the rise in traffic [10:10:34] mostly wikidatawiki, with cirrussearch-too-busy-error [10:10:42] CC: dcausse: --^ [10:11:10] but it is probably a side effect of the traffic [10:11:55] yeah, agreed [10:15:40] (03PS1) 10Kormat: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) [10:18:07] (03CR) 10jerkins-bot: [V: 04-1] mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) (owner: 10Kormat) [10:19:01] (03PS2) 10Kormat: mysql_legacy: update Cumin queries for DB selection [software/spicerack] - 10https://gerrit.wikimedia.org/r/607996 (https://phabricator.wikimedia.org/T243935) [10:21:25] 10Operations, 10Traffic, 10User-notice: several purgeds badly backlogged (> 10 days) - https://phabricator.wikimedia.org/T256444 (10ema) I have identified the misbehaving purged instances with `rate(purged_events_received_total{cluster="cache_text", topic="eqiad.resource-purge"}[5m]) == 0` and restarted them...
[10:22:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1093', diff saved to https://phabricator.wikimedia.org/P11676 and previous config saved to /var/cache/conftool/dbconfig/20200626-102201-marostegui.json [10:22:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1085', diff saved to https://phabricator.wikimedia.org/P11677 and previous config saved to /var/cache/conftool/dbconfig/20200626-102248-marostegui.json [10:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:00] !log pool 5006 T256449 [10:24:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:12] T256449: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 [10:25:23] 10Operations, 10Traffic: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10Johan) [10:25:26] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Adithyak1997) [10:28:24] 10Operations, 10DBA, 10SRE-tools, 10User-Kormat: Add native mysql module to spicerack - https://phabricator.wikimedia.org/T255409 (10Kormat) [10:28:53] 10Operations, 10DBA, 10SRE-tools, 10Patch-For-Review, 10User-Kormat: Audit all cumin queries in switchdc scripts - https://phabricator.wikimedia.org/T243935 (10Kormat) [10:29:35] 10Operations, 10Traffic: Make atsmtail-backend.service depend on fifo-log-demux - https://phabricator.wikimedia.org/T256467 (10ema) [10:29:55] 10Operations, 10Traffic: Make atsmtail-backend.service depend on fifo-log-demux - https://phabricator.wikimedia.org/T256467 (10ema) p:05Triage→03Low [10:30:12] 10Operations, 10ops-eqsin, 10Traffic: cp5006 multiple alerts (and SSH flapping) - https://phabricator.wikimedia.org/T256449 (10ema) 05Open→03Resolved a:03ema 
The host looks fine, closing for now. [10:31:52] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 11.64 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:32:48] (03PS10) 10ArielGlenn: dumps: fix shellcheck issues [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:34:17] what now [10:34:42] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:35:51] (03CR) 10ArielGlenn: [C: 03+2] dumps: fix shellcheck issues (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/602645 (https://phabricator.wikimedia.org/T254480) (owner: 10Jbond) [10:39:08] (03PS1) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [10:40:11] (03CR) 10Daimona Eaytoy: "> All except wmgUseGlobalAbuseFilters this seems fine and mostly" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607605 (owner: 10Reedy) [10:40:15] (03CR) 10jerkins-bot: [V: 04-1] toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) (owner: 10Arturo Borrero Gonzalez) [10:41:05] (03PS1) 10ArielGlenn: add options to make testing dumps rsync easier [puppet] - 10https://gerrit.wikimedia.org/r/608006 [10:47:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:54:47] 10Operations, 10ops-eqiad, 10DBA, 10decommission-hardware: decommission dbproxy1003.eqiad.wmnet - https://phabricator.wikimedia.org/T256216 (10Marostegui) [10:58:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:03:13] (03PS1) 10Kormat: install_server: Reuse partitions for dbprov* hosts [puppet] - 10https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768) [11:04:17] (03PS2) 10Kormat: install_server: Reuse partitions for dbprov* hosts [puppet] - 10https://gerrit.wikimedia.org/r/608012 (https://phabricator.wikimedia.org/T255768) [11:12:56] (03PS1) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [11:16:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:21:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:35:37] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I 've left another round of comments and have hopefully answered your question @bmansurov. 
I think that we should be able to deploy after " (0323 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: 10Bmansurov) [11:37:33] (03PS1) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [11:45:32] (03CR) 10CDanis: [C: 03+1] ATS: override Cache-Control for Set-Cookie responses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [11:52:06] (03PS2) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [11:55:00] (03PS2) 10Ema: ATS: override Cache-Control for Set-Cookie responses [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) [11:59:59] (03PS2) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [12:00:08] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:07:23] (03PS3) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [12:08:29] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:11:09] (03PS4) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [12:15:16] (03PS3) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [12:18:07] elukey: just saw your ping, looking 
per some elastic graphs we rejected a bunch of queries between 9:50 and 10:10, fulltext qps rose from 450 to 2500 (5x more than what we usually handle) [12:21:24] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:21:32] (03CR) 10Filippo Giunchedi: purged: alert in case of high event lag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [12:23:14] (03PS4) 10Arturo Borrero Gonzalez: toolforge: mailrelay: introduce spam filtering with spamassassin [puppet] - 10https://gerrit.wikimedia.org/r/608005 (https://phabricator.wikimedia.org/T120210) [12:30:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:33:14] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:34:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:37:06] dcausse: yes there was a big surge in traffic, just wanted to let you know [12:37:17] sure, thanks!
[12:39:02] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:40:13] (03PS2) 10Alexandros Kosiaris: all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 [12:40:15] (03PS9) 10Alexandros Kosiaris: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 [12:41:48] (03CR) 10CDanis: [C: 04-2] "This is the right fix, but at the wrong time -- let's not deploy this until we get a handle on the app-layer causes. AFAICT this will bre" [puppet] - 10https://gerrit.wikimedia.org/r/608017 (https://phabricator.wikimedia.org/T256395) (owner: 10Ema) [12:44:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 (owner: 10Alexandros Kosiaris) [12:45:01] (03Merged) 10jenkins-bot: all: Don't declare data in Secret if not specified [deployment-charts] - 10https://gerrit.wikimedia.org/r/599332 (owner: 10Alexandros Kosiaris) [12:45:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [12:46:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:48:43] (03Merged) 10jenkins-bot: rake: Add kubeyaml validation [deployment-charts] - 10https://gerrit.wikimedia.org/r/598280 (owner: 10Alexandros Kosiaris) [12:52:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is 
CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:55:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:56:02] is anybody checking --^ ? [13:00:02] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:03:19] (03PS5) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [13:04:03] (03CR) 10Ema: purged: alert in case of high event lag (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:04:57] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:05:50] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 50 probes of 568 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:07:27] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1003/454/" [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:08:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash 
job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:13] so logstash mediawiki-errors doesn't show much [13:13:16] the one in logstash-next shows some pressure on parsoid [13:13:29] (03PS6) 10Ema: purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) [13:15:52] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:19:55] 10Operations, 10Traffic: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479 (10ema) [13:23:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:24:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:27:27] elukey: I've created https://phabricator.wikimedia.org/T256459 for this. I think this *might* be related to the "Parsoid-PHP is now considered part of MediaWiki." part from the logstash improvement mail from Krinkle [13:28:18] (03PS1) 10Ema: Serialize access to KafkaReader.maxts [software/purged] - 10https://gerrit.wikimedia.org/r/608045 (https://phabricator.wikimedia.org/T256479) [13:29:32] 10Operations, 10Traffic, 10Patch-For-Review: purged crashes with "fatal error: concurrent map read and map write" - https://phabricator.wikimedia.org/T256479 (10ema) p:05Triage→03Medium [13:31:59] jayme: thanks! [13:33:34] (03CR) 10Filippo Giunchedi: [C: 03+1] purged: alert in case of high event lag [puppet] - 10https://gerrit.wikimedia.org/r/608019 (https://phabricator.wikimedia.org/T256446) (owner: 10Ema) [13:43:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:52:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:02:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:05:35] jayme: elukey Yeah OOMs are usually not worthy of a task unless it's orders of magnitude higher than normal [14:06:06] Too many little holes that are not plugged and can be triggered by wikitext [14:06:31] Krinkle: sure but then we need to tune the alarms, see mw exceptions (that is the only thing related that I found, I may have missed something) [14:06:32] Krinkle: it's easily x10 for some cases, though [14:06:33] Unfortunately OOM is therefore partially used as proxy for rejecting too complex pages [14:06:45] Yeah [14:07:38] jayme: what do you mean by "for some cases"? [14:07:41] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops [14:08:47] elukey: are the alerts fixed or dynamic? What's the baseline or relative comparison date? [14:10:27] Krinkle: see https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [14:10:40] I think we alarm on this metric [14:11:15] you can find the definition in profile::mediawiki::alerts [14:11:34] sum(irate(logstash_mediawiki_events_total{channel=~"(fatal|exception)",level="ERROR"}[10m])) without (channel, instance) * 60 [14:11:48] if > 25 warning, > 50 error [14:12:31] Hmm I see. Maybe 50/100 makes sense there. We have lots of other triggers already [14:12:41] This is more a wide net catch all [14:12:45] Krinkle: Depending on which timeframes you compare.
We had like a top of ~120 per hour from 06-24 to 06-25 and a top of 1.6k from 06-25 to 06-26 [14:12:57] Doesn't need to be as sensitive [14:13:05] (looking at kibana) [14:13:21] (03PS1) 10Jcrespo: mariadb-backups: Move transferpy deployment to debian package [puppet] - 10https://gerrit.wikimedia.org/r/608053 (https://phabricator.wikimedia.org/T138562) [14:13:42] elukey: average per x minutes? Or last minute from whenever it is polled? [14:14:20] Based on my irate understanding that means current minute [14:14:44] Last two data points right? [14:15:06] (03PS2) 10Jcrespo: mariadb-backups: Move transferpy deployment to debian package [puppet] - 10https://gerrit.wikimedia.org/r/608053 (https://phabricator.wikimedia.org/T138562) [14:15:39] https://prometheus.io/docs/prometheus/latest/querying/functions/#irate [14:15:57] lemme check because I am not 100% sure on a friday afternoon :D [14:16:34] so up to 10m ago, for the last two datapoints [14:16:38] I don't think irate makes sense here then [14:17:12] Yeah for irate the [m] part doesn't do much unless there are big gaps in Prometheus collection points [14:17:51] Especially since we have subminute reporting intervals in prod [14:19:25] rate() over 5m would be more useful, gives us an average rate over the full duration [14:19:45] Perhaps even keep the thresholds as is [14:20:35] * Krinkle adds todo to add threshold visually to the grafana dash [14:27:59] (03CR) 10DCausse: [C: 04-1] "should be merged right before doing a data-reload with the --skolemize on wdqs1009" [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [14:39:02] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:04:10] (03PS1) 10BBlack: Set-Cookie: Send server header to syslog as well [puppet] - 10https://gerrit.wikimedia.org/r/608060 (https://phabricator.wikimedia.org/T256395) [15:08:42] (03PS1) 10Filippo Giunchedi: logstash: add tests for filter-mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/608061 (https://phabricator.wikimedia.org/T251869) [15:08:44] (03CR) 10CDanis: [C: 03+1] Set-Cookie: Send server header to syslog as well [puppet] - 10https://gerrit.wikimedia.org/r/608060 (https://phabricator.wikimedia.org/T256395) (owner: 10BBlack) [15:14:06] (03PS1) 10Cicalese: Add HTTP proxy to MediaModeration. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) [15:14:19] (03CR) 10BBlack: [C: 03+2] Set-Cookie: Send server header to syslog as well [puppet] - 10https://gerrit.wikimedia.org/r/608060 (https://phabricator.wikimedia.org/T256395) (owner: 10BBlack) [15:18:32] 10Operations: Script to point SRE local machine traffic to another LB - https://phabricator.wikimedia.org/T244761 (10CDanis) For a number of years now, work has been proceeding in order to bring to perfection the crudely-conceived idea of a machine that would not only supply the easy re-routing of traffic for lo... [15:23:16] (03CR) 10Ppchelko: Add HTTP proxy to MediaModeration. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [15:25:20] 10Operations, 10SRE-swift-storage, 10serviceops: Access to the thanos-swift cluster for ChartMuseum - https://phabricator.wikimedia.org/T256020 (10JMeybohm) 05Open→03Resolved a:03JMeybohm This is done and the account is working, thanks @fgiunchedi ! 
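[editor's note] The irate-vs-rate exchange above ([14:13:42] through [14:19:45]) comes down to this: PromQL's irate() derives a per-second rate from only the last two samples in the range, while rate() averages over the whole range, so rate() is much less spiky for alert thresholds. A minimal sketch of the distinction, in plain Go with hypothetical sample data (simplified: real PromQL rate() also handles counter resets and extrapolation):

```go
package main

import "fmt"

// sample is one scrape of a monotonically increasing counter at time t (seconds).
type sample struct {
	t float64
	v float64
}

// irateLike mirrors PromQL irate(): per-second rate from only the LAST TWO
// samples in the window, so it reflects just the most recent scrape interval.
func irateLike(s []sample) float64 {
	a, b := s[len(s)-2], s[len(s)-1]
	return (b.v - a.v) / (b.t - a.t)
}

// rateLike mirrors (a simplification of) PromQL rate(): per-second rate
// averaged across the WHOLE window, which smooths out short bursts.
func rateLike(s []sample) float64 {
	a, b := s[0], s[len(s)-1]
	return (b.v - a.v) / (b.t - a.t)
}

func main() {
	// A counter that is mostly flat but bursts between the last two scrapes.
	samples := []sample{{0, 0}, {60, 10}, {120, 20}, {180, 30}, {240, 90}}
	fmt.Println(irateLike(samples)) // 1 (60 events in the final 60s)
	fmt.Println(rateLike(samples))  // 0.375 (90 events over 240s)
}
```

For the same data the irate-style value is nearly 3x the averaged one, which is why an irate-based threshold on a bursty error counter flaps the way this alert did all day.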
[15:38:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:44:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:49:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:55:28] (03CR) 10Dave Pifke: [C: 03+1] production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) (owner: 10Marostegui) [16:03:08] (03CR) 10Marostegui: [C: 03+2] production-m2.sql.erb: Add grants for xhgui [puppet] - 10https://gerrit.wikimedia.org/r/607930 (https://phabricator.wikimedia.org/T254795) (owner: 10Marostegui) [16:19:49] (03PS1) 10Andrew Bogott: cloud-vps: puppetize /etc/ldap.conf on sssd clients [puppet] - 10https://gerrit.wikimedia.org/r/608068 [16:26:39] 10Operations, 10ops-ulsfo: update rack location of decom wmf5801 - https://phabricator.wikimedia.org/T249287 (10RobH) 05Open→03Resolved fixed [16:26:41] 10Operations, 10ops-eqiad, 10ops-ulsfo, 10DC-Ops: Netbox report coherence_rack Icinga alert - https://phabricator.wikimedia.org/T250054 (10RobH) [16:26:43] 10Operations, 10netbox: Netbox report check for no position set in rack - https://phabricator.wikimedia.org/T239244 (10RobH) [16:31:09] 10Operations, 10ops-ulsfo: remove 
cr4-ulsfo:xe-0/1/1 - https://phabricator.wikimedia.org/T254206 (10RobH) 05Open→03Resolved updated gsheet and removed the fiber/cable from netbox [16:35:37] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) [16:45:34] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) https://netbox.wikimedia.org/dcim/cables/1632/ was also missing a cable id (zayo link) and added it to both gsheet xconnect tracking sheet and to netbox info [16:47:19] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) 05Stalled→03Resolved [16:47:21] 10Operations, 10netbox: Netbox: fill network topology - https://phabricator.wikimedia.org/T205897 (10RobH) [16:47:31] 10Operations, 10ops-ulsfo, 10DC-Ops: fix newly imported cable data in ulsfo - https://phabricator.wikimedia.org/T250408 (10RobH) all cable errors in netbox for ulsfo are fixed [17:05:46] (03CR) 10EBernhardson: "two things related to recent changes I made to this code" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/597790 (owner: 10DCausse) [17:07:09] 10Operations, 10Android-app-Bugs, 10Parsoid, 10Traffic, and 6 others: Right-to-Left directionality problem with refs - https://phabricator.wikimedia.org/T251983 (10bearND) 05Open→03Resolved a:03bearND Ok, looks like enough time has passed for the old, cached version to be evicted and the newly deploye...
[17:10:58] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [17:11:01] !log msw work in ulsfo via T256300 [17:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:06] T256300: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 [17:17:02] PROBLEM - Host dns4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:02] PROBLEM - Host bast4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:06] PROBLEM - Host cp4024.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:20] PROBLEM - Host cp4030.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:20] PROBLEM - Host re0.cr4-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:17:30] PROBLEM - Host cp4032.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:17:46] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:18:14] PROBLEM - Host ganeti4002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:04] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:19:06] PROBLEM - Host cp4022.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:06] PROBLEM - Host cp4026.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:08] PROBLEM - Host cp4028.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:15] thats me! 
[17:19:16] PROBLEM - Host lvs4006.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:19:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:19:21] and they'll start to recover shortly [17:19:26] (all ulsfo mgmt is me and expected) [17:20:04] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:21:54] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:38:13] 10Operations, 10Traffic, 10Patch-For-Review: monitoring & alerting for purged - https://phabricator.wikimedia.org/T256446 (10Nemo_bis) [17:38:26] 10Operations, 10Traffic, 10serviceops, 10Patch-For-Review, and 2 others: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10Nemo_bis) [17:46:30] (03PS1) 10Bstorm: Revert "unattendedupgrades: allow configurable kernel cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/608085 [17:46:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:47:33] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:47:34] RECOVERY - Host bast4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 82.60 ms 
[17:47:34] RECOVERY - Host cp4024.mgmt is UP: PING OK - Packet loss = 0%, RTA = 85.36 ms [17:47:34] RECOVERY - Host cp4030.mgmt is UP: PING OK - Packet loss = 0%, RTA = 87.55 ms [17:47:34] RECOVERY - Host cp4032.mgmt is UP: PING OK - Packet loss = 0%, RTA = 86.40 ms [17:47:35] RECOVERY - Host ganeti4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.06 ms [17:47:37] RECOVERY - Host dns4002.mgmt is UP: PING OK - Packet loss = 0%, RTA = 84.82 ms [17:47:51] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.61 ms [17:47:53] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:48:21] RECOVERY - Host re0.cr4-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 90.33 ms [17:48:21] RECOVERY - Host cp4022.mgmt is UP: PING OK - Packet loss = 0%, RTA = 91.73 ms [17:48:21] RECOVERY - Host cp4026.mgmt is UP: PING OK - Packet loss = 0%, RTA = 97.99 ms [17:48:21] RECOVERY - Host cp4028.mgmt is UP: PING OK - Packet loss = 0%, RTA = 95.87 ms [17:48:23] RECOVERY - Host lvs4006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [17:49:35] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [17:49:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:52:46] !log msw2-ulsfo work done, all mgmt items confirmed back online and icinga alerts cleared, moving onto msw1-ulsfo (rack 22) and will lose all mgmt in that rack for next 10-20 minutes [17:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:56] !log msw2-ulsfo work done, all mgmt items confirmed back online and icinga alerts cleared, moving onto msw1-ulsfo (rack 22) and will lose all mgmt in that rack for next 10-20 minutes T256300 [17:52:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:00] T256300: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 [17:58:25] PROBLEM - Host lvs4007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:58:43] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:58:55] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [17:59:23] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:35] PROBLEM - Router interfaces on mr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.194, interfaces up: 38, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:59:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:59:56] (03CR) 10Thcipriani: "Couple of nits inline." (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [18:00:55] PROBLEM - Host cp4029.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:13] PROBLEM - Host dns4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:01:31] PROBLEM - Host cp4031.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:01:31] PROBLEM - Host re0.cr3-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [18:02:23] PROBLEM - Host ganeti4001.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:02:23] PROBLEM - Host ganeti4003.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:21] PROBLEM - Host cp4021.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:21] PROBLEM - Host cp4023.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:24] PROBLEM - Host cp4027.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:24] PROBLEM - Host cp4025.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:33] PROBLEM - Host lvs4005.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:03:57] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:04:05] RECOVERY - Host cp4021.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.13 ms [18:04:57] RECOVERY - Router interfaces on mr1-ulsfo is OK: OK: host 198.35.26.194, interfaces up: 40, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:05:59] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.37 ms [18:06:03] ok, they should all be coming back now [18:06:09] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:06:21] RECOVERY - Host cp4029.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.75 ms [18:06:41] RECOVERY - Host dns4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.77 ms [18:06:59] RECOVERY - Host cp4031.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.87 ms [18:06:59] RECOVERY - Host re0.cr3-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 76.54 ms [18:07:54] RECOVERY - Host ganeti4001.mgmt is UP: PING OK - Packet loss = 0%, RTA = 76.33 ms [18:07:54] RECOVERY - Host ganeti4003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 76.13 ms [18:07:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:07:57] 10Operations, 10ops-codfw: Return asw-c8-codfw to spares - https://phabricator.wikimedia.org/T256498 (10faidon) p:05Triage→03Low [18:08:49] RECOVERY - Host cp4023.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.10 ms [18:08:51] RECOVERY - Host cp4027.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.18 ms [18:08:51] RECOVERY - Host cp4025.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [18:09:03] RECOVERY - Host lvs4005.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.12 ms [18:09:31] RECOVERY - Host lvs4007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 75.19 ms [18:12:58] (03CR) 10Cicalese: Add HTTP proxy to MediaModeration. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/608062 (https://phabricator.wikimedia.org/T247943) (owner: 10Cicalese) [18:13:12] (03PS1) 10JMeybohm: Add patches for swift auth and bind interface [debs/chartmuseum] - 10https://gerrit.wikimedia.org/r/608088 (https://phabricator.wikimedia.org/T253843) [18:21:59] PROBLEM - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:22:07] here, looking [18:22:19] 👋 [18:22:20] hey [18:22:30] * apergos peeks in [18:22:51] RECOVERY - LVS api codfw port 80/tcp - MediaWiki API cluster- api.svc.eqiad.wmnet IPv4 #page on api.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 23788 bytes in 0.366 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:23:44] uhm [18:24:52] o/ [18:25:26] I'm here too if needed [18:25:42] there's no sign of anything wrong [18:25:44] strange [18:25:49] world's fastest recovery [18:25:57] yeah, metrics don't seem to show any blip [18:26:05] neither app level nor traffic [18:26:06] stupid question: I'm having 
trouble even reading the alert, what does it mean when it says eqiad and codfw in those two fields? [18:26:15] yea, is it eqiad or codfw or both [18:26:35] is that cross-dc traffic, and if so which direction? [18:27:06] * volans was heading out, do you need me too? [18:27:07] I believe it is the LVSen in codfw, for api_appserver which it knows the realservers are in eqiad [18:27:14] I can stay if needed [18:27:27] volans: no, seems like nothing [18:27:29] cdanis: ack, thanks [18:28:31] ok thx [18:29:03] not seein nothin [18:29:06] meh [18:29:30] looked at: red, app servers, api servers, lvses (because hey you never know)... [18:29:35] == [18:30:50] msw1-ulsfo any chance the work on that caused a momentary blip? [18:30:54] in codfw? naw [18:31:00] there's nothing obvious in syslog or pybal.log on lvs2009 [18:31:08] seems unlikely but [18:31:17] can't find anything obviously wrong either so far [18:31:35] welp if nothin in the logs, I'm out of ideas [18:33:57] if no one objects I'm going to check back out (if someone has an idea of further things to look at, I will happily stick around. or at least not too grudgingly :-P) [18:33:59] on the Icinga server side the logs have some SOFT alerts about saturated CPU cores on LVS [18:34:13] lvs1013;At least one CPU core of an LVS is saturated- packet drops are likely;WARNING;SOFT; [18:34:30] but they stayed SOFT, so no alerts [18:35:06] mutante: I need to adjust the thresholds a bit, those are common to go into WARNING on a puppet run 😂 [18:35:32] cdanis: ok, ACK. 
it was lvs1013 and lvs2010, nothing special then, ok [18:35:47] 2010 is the backup in codfw anyway, not the primary [18:37:28] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) [18:37:39] 10Operations, 10ops-eqiad, 10DC-Ops, 10decommission-hardware: decommission dbproxy1008.eqiad.wmnet - https://phabricator.wikimedia.org/T255406 (10wiki_willy) a:05wiki_willy→03Jclark-ctr [18:38:02] 10Operations, 10ops-ulsfo, 10DC-Ops: replace msw[12]-ulsfo with new switches - https://phabricator.wikimedia.org/T256300 (10RobH) 05Open→03Resolved Ok the old switches are unracked and in the bottom of rack 23. all cables in netbox added and all duplicate cable id conflicts resolved/fixed. [18:38:11] 10Operations, 10ops-eqiad: Decommisson and store old row D network gear. - https://phabricator.wikimedia.org/T170474 (10wiki_willy) a:03Cmjohnson [18:38:44] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: Renamed notebook1003 to an-launcher1002 - https://phabricator.wikimedia.org/T256397 (10wiki_willy) a:03Jclark-ctr [18:39:59] I'm logging off as well, available if needed [18:42:18] !log all ulsfo onsite work completed as of 30 minutes ago [18:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:27] 10Operations, 10observability, 10User-MoritzMuehlenhoff: Switch ELK7 to use the distro Java - https://phabricator.wikimedia.org/T252913 (10herron) 05Open→03Resolved a:03herron logging ES7 instances are now using the system openjdk-11 [18:48:21] (03PS2) 10Bstorm: Revert "unattendedupgrades: allow configurable kernel cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/608085 [18:58:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:59:40] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-25) rack/setup/install alert1001 - https://phabricator.wikimedia.org/T255072 (10wiki_willy) [19:00:15] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-17) rack/setup/install - https://phabricator.wikimedia.org/T255520 (10wiki_willy) [19:00:45] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (10wiki_willy) [19:01:33] 10Operations, 10ops-eqiad, 10DC-Ops: (Due By: 2020-07-02) rack/setup/install 3 lightweight hadoop nodes - https://phabricator.wikimedia.org/T255518 (10wiki_willy) [19:02:12] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:03:09] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1003.eqiad.wmnet ` The log... [19:04:27] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` relforge1004.eqiad.wmnet ` The log... 
[19:11:41] (03CR) 10Bstorm: [C: 03+2] Revert "unattendedupgrades: allow configurable kernel cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/608085 (owner: 10Bstorm) [19:38:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:45:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:02:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:07:09] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1003.eqiad.wmnet'] ` Of which those **FAILED**: ` ['relforge1003.eqiad.wmnet'... [20:07:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:08:27] 10Operations, 10ops-eqiad, 10Discovery-Search (Current work): (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['relforge1004.eqiad.wmnet'] ` Of which those **FAILED**: ` ['relforge1004.eqiad.wmnet'... [20:34:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:44:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:51:16] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:58:54] (03CR) 10Hashar: "As I get it, the idea is to spawn a Cassandra container and get some other container to use it as a backend to run tests." 
(034 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [21:01:03] (03CR) 10Hashar: "Bah I thought this change was for integration/config but it is targeting operations/docker-images/production-images :] So a couple of my " (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/607918 (https://phabricator.wikimedia.org/T256281) (owner: 10Jeena Huneidi) [21:15:25] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Cloud-VPS: Puppet failing on wikistats.analytics.eqiad.wmflabs: /usr/local/sbin/x509-bundle error - https://phabricator.wikimedia.org/T255464 (10bd808) Still busted: `lines=10,lang=shell-session root@wikistats:~# puppet agent -tv Info: Using configured environme... [21:16:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:24:12] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition=1 site=eqiad topic=udp_localhost-info https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging [21:24:12] All&var-consumer_group=All [21:36:38] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:36:52] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [21:51:41] (03PS2) 10BryanDavis: Pywikibot container [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/603652 (https://phabricator.wikimedia.org/T249787) [21:51:43] (03PS1) 10BryanDavis: webservice-python-bootstrap: install wheel [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608093 [22:04:26] (03CR) 10Legoktm: [C: 03+1] "I thought this would have already been installed, but apparently[1] venv doesn't install wheel, while virtualenv does." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/608093 (owner: 10BryanDavis) [22:12:24] PROBLEM - MariaDB Replica Lag: s4 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1088.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [22:17:01] (03PS1) 10QChris: gerrit: As log4j.xml is a static file, treat it as static file [puppet] - 10https://gerrit.wikimedia.org/r/608097 [22:17:03] (03PS1) 10QChris: gerrit: Adapt log4j config to catch gc_log messages for new Gerrits [puppet] - 10https://gerrit.wikimedia.org/r/608098 [22:28:25] (03CR) 10QChris: "This change is not a noop on production, as at least file" [puppet] - 10https://gerrit.wikimedia.org/r/608097 (owner: 10QChris) [22:28:51] (03CR) 10QChris: "This change can well wait until the deployment takes place :-)" [puppet] - 10https://gerrit.wikimedia.org/r/608097 (owner: 10QChris) [22:29:06] (03CR) 10QChris: "This change can wait until the deployment takes place :-)" [puppet] - 10https://gerrit.wikimedia.org/r/608098 (owner: 10QChris) [22:42:15] ottomata: btw, looking into your EL/EventGate issue now. [22:42:18] https://phabricator.wikimedia.org/T249261 [23:01:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:03:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
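[editor's note] Legoktm's review comment above (on Gerrit 608093, "webservice-python-bootstrap: install wheel") states that the stdlib `python3 -m venv` does not seed `wheel` into new environments, while the third-party `virtualenv` tool does. A quick, illustrative sketch of checking this; the temp path is arbitrary and the exact seed-package list varies by Python version:

```shell
# Create a stdlib venv; ensurepip seeds pip (and, on older
# Pythons, setuptools) but not wheel.
python3 -m venv /tmp/venv-wheel-demo

# List seed packages: no 'wheel==...' line is expected here.
/tmp/venv-wheel-demo/bin/pip list --format=freeze

# Without wheel, installing an sdist can fall back to the legacy
# 'setup.py install' path; seeding wheel explicitly avoids that,
# which is presumably what the bootstrap change accomplishes.
/tmp/venv-wheel-demo/bin/pip install wheel
```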
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:09:17] (03PS1) 10Accraze: hiera: add missing metrics for ores statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/608102 (https://phabricator.wikimedia.org/T233448) [23:12:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:14:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:23:37] (03PS2) 10Accraze: hiera: add missing metrics for ores statsd exporter [puppet] - 10https://gerrit.wikimedia.org/r/608102 (https://phabricator.wikimedia.org/T233448) [23:43:32] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:52:10] RECOVERY - MariaDB Replica Lag: s4 on db1145 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [23:54:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops