[00:24:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:27:04] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:06:54] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Jclark-ctr)
[01:07:03] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Jclark-ctr) @RobH  if you can clarify racking instructions.  I have reviewed racking instructions and what these are replacing  mw[1261-1290,1293-1306 Are only in r...
[01:19:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:24:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:16:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:19:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:25:40] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:27:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:33:02] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 188097232 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[04:35:26] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 576280 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[05:27:00] <icinga-wm>	 PROBLEM - MegaRAID on analytics1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:27:11] <icinga-wm>	 ACKNOWLEDGEMENT - MegaRAID on analytics1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T276696 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:27:15] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (ops-monitoring-bot)
[05:35:09] <wikibugs>	 SRE, ops-eqiad, Analytics: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (Peachey88)
[05:47:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:50:01] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:51:58] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Reuqesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Legoktm) > One complicating factor is that the old mailman doesn't follow the `.eqiad.wmnet` naming convention and it is `lists1001.wikimedia.org`. So I assume the test VM should be...
[06:04:35] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[06:25:08] <AmandaNP>	 I'm trying to assist someone in connection issues with Wikipedia, but I can't find the article about what they should do if they can't connect. Can someone link me?
[06:27:47] <legoktm>	 AmandaNP: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue
[06:28:08] <legoktm>	 you should probably use https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue
[06:30:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:32:25] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:32:36] <wikibugs>	 SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Legoktm) @ladsgroup are you planning to work on this or would you like me to?
[06:32:49] <wikibugs>	 SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Legoktm)
[06:33:09] <wikibugs>	 SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Legoktm)
[06:35:42] <AmandaNP>	 thanks legoktm 
[06:42:59] <legoktm>	 np
[06:44:35] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (Legoktm) Coming back to this, I think we should: * Get the Internet Archive to scrape all the current pipermail archives/views as a fail-safe (this is currently prevented with https://l...
[06:48:04] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Reuqesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Legoktm) >>! In T276686#6889832, @Legoktm wrote: > What about the database? I assume we need to ask #DBA for space somewhere.  This is {T256538}.
[07:02:14] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) @Ladsgroup are these just testing databases that will be deleted at some point or are these testing databases that will become production? :-) The concerns raised on this task ear...
[07:13:01] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Legoktm) >>! In T256538#6889860, @Marostegui wrote: > @Ladsgroup are these just testing databases that will be deleted at some point or are these testing databases that will become production...
[07:13:47] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (Legoktm)
[07:18:40] <wikibugs>	 (CR) Legoktm: "Just a quick skim, I really don't know exim config." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup)
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210307T0800)
[08:01:14] <elukey>	 !log "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066 - BBU looks fine, but the raid controller was using WriteThrough
[08:01:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:30] <elukey>	 seems not working, will open a task tomorrow for the BBU probably :(
[08:47:09] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:49:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:55:57] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:58:11] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:51:05] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:55:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:32:31] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 102995720 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:34:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 738376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:48:11] <wikibugs>	 (PS1) ArielGlenn: wikibase entity dumps: add comma at end of intermediate files [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643)
[13:49:58] <wikibugs>	 (CR) Zabe: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn)
[13:50:50] <wikibugs>	 SRE, Wikimedia-Mailing-lists, observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (Peachey88)
[13:50:54] <wikibugs>	 (PS2) ArielGlenn: wikibase entity dumps: add comma at end of intermediate files [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643)
[13:59:29] <wikibugs>	 (CR) Lucas Werkmeister: wikibase entity dumps: add comma at end of intermediate files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn)
[14:05:36] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1658 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[14:06:17] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8889 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:07:48] <godog>	 got the page, checking
[14:07:50] * volans|away on the move, anyone around?
[14:08:05] <godog>	 volans|away: yeah I'm taking a look
[14:08:07] <sobanski>	 Here if I can be of any help
[14:08:54] <godog>	 looks like it is true, not sure about the cause yet
[14:09:04] <godog>	 but recovering
[14:09:13] <volans|away>	 thanks godog, I can get to my laptop soon~ish if it's something bigger
[14:09:43] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me
[14:09:51] <godog>	 volans|away: ok! no idea yet
[14:10:11] <rzl>	 2m away
[14:10:18] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6498 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[14:10:38] <_joe_>	 hey sorry I couldn't get onto IRC for some reason
[14:10:49] <_joe_>	 so, this was caused by a surge of bad requests
[14:10:59] <_joe_>	 see https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=22&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-30m&to=now
[14:11:07] <volans|away>	 ack, ping me on VO if I'm needed, thx
[14:11:08] <_joe_>	 a sudden surge of 4xx
[14:11:20] <_joe_>	 it's already over though
[14:11:43] <godog>	 indeed, looks like it is over already
[14:11:53] <rzl>	 okay, here
[14:12:03] <_joe_>	 oh yes, and a spike in latency on s4
[14:12:07] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:12:17] <_joe_>	 which is prolly what caused the issue, actually
[14:13:25] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4127 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:13:51] <_joe_>	 uhm
[14:14:06] <_joe_>	 happening again it seems
[14:14:25] <_joe_>	 started to have some troubles around 13:30
[14:14:34] <rzl>	 yeah, the traffic looks like it's been up and down for a few hours, that last spike was the biggest yet but not sure I believe it's over
[14:14:42] <cdanis>	 last time this was an aggressive scraper 
[14:14:46] <_joe_>	 traffic?
[14:15:10] <icinga-wm>	 PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2463 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[14:15:11] <_joe_>	 and yes, I see a specific IP
[14:15:14] <rzl>	 _joe_ the bad requests from your graph, I mean
[14:15:49] <_joe_>	 cdanis: again, same thing I think
[14:16:00] <cdanis>	 let’s block that IP
[14:16:08] <_joe_>	 yes
[14:19:58] <icinga-wm>	 RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.553 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver
[14:22:59] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03175 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[14:25:05] <wikibugs>	 (CR) ArielGlenn: wikibase entity dumps: add comma at end of intermediate files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn)
[14:26:08] <wikibugs>	 (CR) Lucas Werkmeister: [C: +1] wikibase entity dumps: add comma at end of intermediate files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn)
[14:30:16] <wikibugs>	 (CR) Zabe: [C: +1] wikibase entity dumps: add comma at end of intermediate files [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn)
[14:42:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 111397384 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:45:21] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 418024 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[14:54:24] <wikibugs>	 (CR) ArielGlenn: [C: +2] "I've not got an easy way to check that the json syntax of this is correct, but the script runs and produces not obviously broken output, so" [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn)
[14:56:35] <wikibugs>	 (Abandoned) ArielGlenn: allow linkrecommendation service access to m2-master [deployment-charts] - https://gerrit.wikimedia.org/r/667934 (https://phabricator.wikimedia.org/T276268) (owner: ArielGlenn)
[14:59:09] <wikibugs>	 (PS10) ArielGlenn: make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/660871
[15:00:42] <wikibugs>	 (CR) ArielGlenn: [C: +2] make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/660871 (owner: ArielGlenn)
[15:11:24] <wikibugs>	 (Abandoned) ArielGlenn: check-raid syntax fixes, check all raids on system [puppet] - https://gerrit.wikimedia.org/r/145018 (owner: ArielGlenn)
[15:11:51] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:20:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (Ladsgroup)
[15:22:06] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Peachey88)
[15:22:59] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) >>! In T256538#6889860, @Legoktm wrote: >>>! In T256538#6889858, @Marostegui wrote: >> @Ladsgroup are these just testing databases that will be deleted at some point or are these t...
[15:25:53] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:28:30] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) Name of the db and the mysql user is hard-coded (we can hiera-fiy it though. Not too complicated)
[15:31:26] <wikibugs>	 SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Ladsgroup) >>! In T276687#6889842, @Legoktm wrote: > @ladsgroup are you planning to work on this or would you like me to?  I want to work on it mostly to lear...
[15:38:15] <wikibugs>	 SRE, Wikimedia-Mailing-lists, vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Ladsgroup)
[15:44:25] <wikibugs>	 SRE, Wikimedia-Mailing-lists, vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Ladsgroup) >>! In T276686#6889832, @Legoktm wrote: >> One complicating factor is that the old mailman doesn't follow the `.eqiad.wmnet` naming convention and it is `...
[15:52:23] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:01:43] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:03:39] <wikibugs>	 SRE, SRE-Access-Requests, Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (Ladsgroup)
[16:06:21] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:07:33] <wikibugs>	 (PS1) Majavah: wancache: change deployment-prep to new Buster hosts [puppet] - https://gerrit.wikimedia.org/r/669436 (https://phabricator.wikimedia.org/T276707)
[16:14:41] <wikibugs>	 (PS1) Majavah: betacluster: Use new memc* hosts for redis_lock [mediawiki-config] - https://gerrit.wikimedia.org/r/669439 (https://phabricator.wikimedia.org/T276707)
[16:31:53] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:36:29] <wikibugs>	 (CR) Krinkle: [C: +2] betacluster: Use new memc* hosts for redis_lock [mediawiki-config] - https://gerrit.wikimedia.org/r/669439 (https://phabricator.wikimedia.org/T276707) (owner: Majavah)
[16:37:12] <wikibugs>	 (Merged) jenkins-bot: betacluster: Use new memc* hosts for redis_lock [mediawiki-config] - https://gerrit.wikimedia.org/r/669439 (https://phabricator.wikimedia.org/T276707) (owner: Majavah)
[16:48:34] <wikibugs>	 (PS1) Majavah: redis::multidc: Make discovery optional [puppet] - https://gerrit.wikimedia.org/r/669447
[16:54:53] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:57:21] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:18:43] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[17:19:17] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.15 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash
[17:23:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:23:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:23:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:24:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:26:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:29:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:31:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:31:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:32:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:32:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:32:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:32:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:34:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:35:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:35:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:36:19] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:38:43] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:40:05] <icinga-wm>	 PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase
[17:40:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:40:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:40:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:41:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:41:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:41:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:41:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:41:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:25] <icinga-wm>	 RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[17:42:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:00:09] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:01:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:01:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:01:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:01:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:03:13] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:03:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:04:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:04:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:05:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:05:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:05:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:06:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:06:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:06:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:08:49] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:08:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:08:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:09:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:09:43] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:10:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:10:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:11:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:12:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:16:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:17:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:17:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:18:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:18:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:19:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:23:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:24:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:24:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:24:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:25:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:25:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:25:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:27:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:28:11] <RhinosF1>	 Is it only me that phab search is slow for?
[18:28:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:28:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:28:42] <RhinosF1>	 Or phab generally tbh
[18:28:50] <RhinosF1>	 Search just makes it obvious
[18:51:00] <wikibugs>	 (PS1) Majavah: Add deployment-restbase03 to beta cluster [puppet] - https://gerrit.wikimedia.org/r/669477 (https://phabricator.wikimedia.org/T250574)
[18:53:27] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[18:57:45] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:00:45] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:07:17] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:17:31] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:18:01] <wikibugs>	 (CR) Majavah: "cherry picked on betacluster" [puppet] - https://gerrit.wikimedia.org/r/668995 (https://phabricator.wikimedia.org/T276654) (owner: Majavah)
[19:18:07] <wikibugs>	 (CR) Majavah: "cherry picked on betacluster" [puppet] - https://gerrit.wikimedia.org/r/669436 (https://phabricator.wikimedia.org/T276707) (owner: Majavah)
[19:18:14] <wikibugs>	 (CR) Majavah: "cherry picked on betacluster" [puppet] - https://gerrit.wikimedia.org/r/669447 (owner: Majavah)
[19:21:37] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:24:41] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:57:57] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 319449000 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:00:25] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 18344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:05:13] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:49:25] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:51:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:17:51] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:20:15] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:54:23] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:56:45] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets