[00:24:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:27:04] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:06:54] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Jclark-ctr) [01:07:03] SRE, ops-eqiad, DC-Ops, serviceops-radar: (Need By: TBD) rack/setup/install mw14[14-56] - https://phabricator.wikimedia.org/T273915 (Jclark-ctr) @RobH if you can clarify racking instructions. I have reviewed racking instructions and what these are replacing mw[1261-1290,1293-1306 Are only in r... [01:19:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:24:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:16:50] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:19:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:25:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:27:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [04:33:02] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 188097232 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:35:26] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 576280 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:27:00] PROBLEM - MegaRAID on analytics1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:27:11] ACKNOWLEDGEMENT - MegaRAID on analytics1059 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T276696 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:27:15] SRE, ops-eqiad: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (ops-monitoring-bot) [05:35:09] SRE, ops-eqiad, Analytics: Degraded RAID on analytics1059 - https://phabricator.wikimedia.org/T276696 (Peachey88) [05:47:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:50:01] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:51:58] SRE, Wikimedia-Mailing-lists: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Legoktm) > One complicating factor is that the old mailman doesn't follow the `.eqiad.wmnet` naming convention and it is `lists1001.wikimedia.org`. So I assume the test VM should be... [06:04:35] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 98 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:25:08] I'm trying to assist someone in connection issues with Wikipedia, but I can't find the article about what they should do if they can't connect. Can someone link me? [06:27:47] AmandaNP: https://wikitech.wikimedia.org/wiki/Reporting_a_connectivity_issue [06:28:08] you should probably use https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue [06:30:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:32:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:32:36] SRE, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Legoktm) @ladsgroup are you planning to work on this or would you like me to? 
[06:32:49] SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Legoktm) [06:33:09] SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Legoktm) [06:35:42] thanks legoktm [06:42:59] np [06:44:35] SRE, Wikimedia-Mailing-lists: Figure out a way to sync old and new mailman - https://phabricator.wikimedia.org/T256539 (Legoktm) Coming back to this, I think we should: * Get the Internet Archive to scrape all the current pipermail archives/views as a fail-safe (this is currently prevented with https://l... [06:48:04] SRE, Wikimedia-Mailing-lists: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Legoktm) >>! In T276686#6889832, @Legoktm wrote: > What about the database? I assume we need to ask #DBA for space somewhere. This is {T256538}. [07:02:14] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Marostegui) @Ladsgroup are these just testing databases that will be deleted at some point or are these testing databases that will become production? :-) The concerns raised on this task ear... [07:13:01] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Legoktm) >>! In T256538#6889858, @Marostegui wrote: > @Ladsgroup are these just testing databases that will be deleted at some point or are these testing databases that will become production... [07:13:47] SRE, Wikimedia-Mailing-lists: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (Legoktm) [07:18:40] (CR) Legoktm: "Just a quick skim, I really don't know exim config." 
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/669182 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210307T0800) [08:01:14] !log "megacli -LDSetProp -ForcedWB -Immediate -Lall -aAll" on analytics1066 - BBU looks fine, but the raid controller was using WriteThrough [08:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:30] seems not working, will open a task tomorrow for the BBU probably :( [08:47:09] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:49:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:55:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:58:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:51:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:55:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:32:31] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 102995720 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:47] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 738376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [13:48:11] (PS1) ArielGlenn: wikibase entity dumps: add comma at end of intermediate files [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) [13:49:58] (CR) Zabe: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn) [13:50:50] SRE, Wikimedia-Mailing-lists, observability: Implement central logging for mailman3 - https://phabricator.wikimedia.org/T276697 (Peachey88) [13:50:54] (PS2) ArielGlenn: wikibase entity dumps: add comma at end of intermediate files [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) [13:59:29] (CR) Lucas Werkmeister: wikibase entity dumps: add comma at end of intermediate files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn) [14:05:36] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.1658 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:06:17] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.8889 gt 0.3 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:07:48] got the page, checking [14:07:50] * volans|away on the move, anyone around? [14:08:05] volans|away: yeah I'm taking a look [14:08:07] Here if I can be of any help [14:08:54] looks like it is true, not sure about the cause yet [14:09:04] but recovering [14:09:13] thanks godog, I can get to my laptop soon~ish if it's something bigger [14:09:43] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-me [14:09:51] volans|away: ok! no idea yet [14:10:11] 2m away [14:10:18] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.6498 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:10:38] <_joe_> hey sorry I couldn't get onto IRC for some reason [14:10:49] <_joe_> so, this was caused by a surge of bad requests [14:10:59] <_joe_> see https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?viewPanel=22&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-method=GET&var-code=200&from=now-30m&to=now [14:11:07] ack, ping me on VO if I'm needed, thx [14:11:08] <_joe_> a sudden surge of 4xx [14:11:20] <_joe_> it's already over though [14:11:43] indeed, looks like it is over already [14:11:53] okay, here [14:12:03] <_joe_> oh yes, and a spike in latency on s4 [14:12:07] RECOVERY - High average GET latency 
for mw requests on api_appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:12:17] <_joe_> which is prolly what caused the issue, actually [14:13:25] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4127 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:13:51] <_joe_> uhm [14:14:06] <_joe_> happening again it seems [14:14:25] <_joe_> started to have some troubles around 13:30 [14:14:34] yeah, the traffic looks like it's been up and down for a few hours, that last spike was the biggest yet but not sure I believe it's over [14:14:42] last time this was an aggressive scraper [14:14:46] <_joe_> traffic? 
[14:15:10] PROBLEM - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is CRITICAL: 0.2463 lt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:15:11] <_joe_> and yes, I see a specific IP [14:15:14] _joe_ the bad requests from your graph, I mean [14:15:49] <_joe_> cdanis: again, same thing I think [14:16:00] let’s block that IP [14:16:08] <_joe_> yes [14:19:58] RECOVERY - Not enough idle PHP-FPM workers for Mediawiki api_appserver at eqiad #page on alert1001 is OK: (C)0.3 lt (W)0.5 lt 0.553 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver [14:22:59] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03175 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [14:25:05] (CR) ArielGlenn: wikibase entity dumps: add comma at end of intermediate files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn) [14:26:08] (CR) Lucas Werkmeister: [C: +1] wikibase entity dumps: add comma at end of intermediate files (1 comment) [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn) [14:30:16] (CR) Zabe: [C: +1] wikibase entity dumps: add comma at end of intermediate files [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn) [14:42:57] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb 
(host:localhost) 111397384 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:45:21] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 418024 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:54:24] (CR) ArielGlenn: [C: +2] "I've not got an easy way to check that the json syntax of this is correct, but the script runs and produces not obviously broken output, so" [puppet] - https://gerrit.wikimedia.org/r/669404 (https://phabricator.wikimedia.org/T276643) (owner: ArielGlenn) [14:56:35] (Abandoned) ArielGlenn: allow linkrecommendation service access to m2-master [deployment-charts] - https://gerrit.wikimedia.org/r/667934 (https://phabricator.wikimedia.org/T276268) (owner: ArielGlenn) [14:59:09] (PS10) ArielGlenn: make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/660871 [15:00:42] (CR) ArielGlenn: [C: +2] make wikidata and commons entity dumps easier to test in deployment-prep [puppet] - https://gerrit.wikimedia.org/r/660871 (owner: ArielGlenn) [15:11:24] (Abandoned) ArielGlenn: check-raid syntax fixes, check all raids on system [puppet] - https://gerrit.wikimedia.org/r/145018 (owner: ArielGlenn) [15:11:51] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:20:27] SRE, Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (Ladsgroup) [15:22:06] SRE, Wikimedia-Mailing-lists: Requesting a test VM in 
production for mailman3 - https://phabricator.wikimedia.org/T276686 (Peachey88) [15:22:59] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) >>! In T256538#6889860, @Legoktm wrote: >>>! In T256538#6889858, @Marostegui wrote: >> @Ladsgroup are these just testing databases that will be deleted at some point or are these t... [15:25:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:28:30] SRE, DBA, Wikimedia-Mailing-lists: Create databases for mailman3 - https://phabricator.wikimedia.org/T256538 (Ladsgroup) Name of the db and the mysql user is hard-coded (we can hiera-ify it though. Not too complicated) [15:31:26] SRE, Packaging, Wikimedia-Mailing-lists, User-Ladsgroup: Backport hyperkitty 1.3.4 for buster - https://phabricator.wikimedia.org/T276687 (Ladsgroup) >>! In T276687#6889842, @Legoktm wrote: > @ladsgroup are you planning to work on this or would you like me to? I want to work on it mostly to lear... [15:38:15] SRE, Wikimedia-Mailing-lists, vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Ladsgroup) [15:44:25] SRE, Wikimedia-Mailing-lists, vm-requests: Requesting a test VM in production for mailman3 - https://phabricator.wikimedia.org/T276686 (Ladsgroup) >>! In T276686#6889832, @Legoktm wrote: >> One complicating factor is that the old mailman doesn't follow the `.eqiad.wmnet` naming convention and it is `... 
[15:52:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:01:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:03:39] SRE, SRE-Access-Requests, Wikimedia-Mailing-lists: Request for creation of mailman3-roots group - https://phabricator.wikimedia.org/T276712 (Ladsgroup) [16:06:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:07:33] (PS1) Majavah: wancache: change deployment-prep to new Buster hosts [puppet] - https://gerrit.wikimedia.org/r/669436 (https://phabricator.wikimedia.org/T276707) [16:14:41] (PS1) Majavah: betacluster: Use new memc* hosts for redis_lock [mediawiki-config] - https://gerrit.wikimedia.org/r/669439 (https://phabricator.wikimedia.org/T276707) [16:31:53] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:36:29] (CR) Krinkle: [C: +2] betacluster: Use new memc* hosts for redis_lock [mediawiki-config] - https://gerrit.wikimedia.org/r/669439 (https://phabricator.wikimedia.org/T276707) (owner: Majavah) [16:37:12] (Merged) jenkins-bot: betacluster: Use new memc* hosts for redis_lock [mediawiki-config] - https://gerrit.wikimedia.org/r/669439 (https://phabricator.wikimedia.org/T276707) (owner: Majavah) [16:48:34] (PS1) Majavah: redis::multidc: Make discovery optional [puppet] - https://gerrit.wikimedia.org/r/669447 [16:54:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:57:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:18:43] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 99 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [17:19:17] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 11.15 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:23:43] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:23:53] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:23:57] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:24:03] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:01] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:11] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:15] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:21] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:26:57] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:29:51] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:31:25] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:31:55] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:32:03] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:32:05] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:32:45] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:32:47] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:34:25] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:35:19] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:35:25] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:36:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:38:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:05] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/RESTBase [17:40:11] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:40:15] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:40:17] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:41:25] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:41:33] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:41:33] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:41:35] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:41:37] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:17] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:17] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:25] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [17:42:27] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:33] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:33] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:37] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:39] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[18:00:09] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:01:47] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:01:47] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:01:51] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:01:53] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:03:13] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:03:15] PROBLEM - restbase endpoints health on 
restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:04:13] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:04:15] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:05:23] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:05:33] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:05:35] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:06:17] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:06:17] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: 
/en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:06:27] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:08:49] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:08:55] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:08:55] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:09:37] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:09:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:10:17] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:10:21] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:11:01] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:12:05] RECOVERY - Prometheus jobs 
reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:16:11] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:17:29] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:17:35] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:18:15] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:18:27] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:19:55] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:23:21] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: 
Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:24:33] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:24:43] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:24:47] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/transform/wikitext/to/html/{title} (Transform wikitext to html) is CRITICAL: Test Transform wikitext to html returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:25:27] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:25:27] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:25:37] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:27:13] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:28:11] Is it only me that phab search is slow for? 
[18:28:13] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:28:15] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:28:42] Or phab generally tbh [18:28:50] Search just makes it obvious [18:51:00] (PS1) Majavah: Add deployment-restbase03 to beta cluster [puppet] - https://gerrit.wikimedia.org/r/669477 (https://phabricator.wikimedia.org/T250574) [18:53:27] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:57:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:00:45] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:07:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:17:31] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 79, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:18:01] (CR) Majavah: "cherry picked on betacluster" [puppet] - https://gerrit.wikimedia.org/r/668995 (https://phabricator.wikimedia.org/T276654) (owner: Majavah) [19:18:07] (CR) Majavah: "cherry picked on betacluster" [puppet] - https://gerrit.wikimedia.org/r/669436 (https://phabricator.wikimedia.org/T276707) (owner: Majavah) [19:18:14] (CR) Majavah: "cherry picked on betacluster" [puppet] - https://gerrit.wikimedia.org/r/669447 (owner: Majavah) [19:21:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:24:41] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 81, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:57:57] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 319449000 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:00:25] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 18344 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:05:13] RECOVERY - High average GET
latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [20:49:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:51:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:20:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:54:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:56:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets