[00:29:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Aklapper)
[02:02:37] PROBLEM - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:03:54] hieradata/eqiad/profile/openstack/eqiad1/nova.yaml:# cloudvirt1023: depooled, emergency spare
[02:03:56] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott Ill investigate https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:06:15] RECOVERY - ensure kvm processes are running on cloudvirt1023 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:38:20] (03Abandoned) 10Andrew Bogott: nova-api: inject default user_data script for new VMs [puppet] - 10https://gerrit.wikimedia.org/r/556311 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott)
[02:38:21] (03PS6) 10Andrew Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421
[02:39:56] 10Operations, 10Traffic, 10Performance Issue, 10Performance-Team (Radar), 10User-notice: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10MBH) 05Open→03Resolved Looks like fixed m...
[02:40:21] (03CR) 10Andrew Bogott: [C: 03+2] Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 (owner: 10Andrew Bogott)
[02:44:11] (03PS1) 10Andrew Bogott: labs_bootstrapbz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379
[02:46:51] (03PS2) 10Andrew Bogott: labs_bootstrapbz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379
[02:48:36] (03PS3) 10Andrew Bogott: labs_bootstrapvz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379
[02:49:33] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379 (owner: 10Andrew Bogott)
[04:36:53] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteThrough, currently using: WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:47:41] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteThrough policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:47:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[06:11:31] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.175 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[06:14:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[06:15:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[06:32:48] (03PS1) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[06:34:45] (03PS2) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[07:18:23] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5792 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[07:25:37] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:36:25] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 4.178e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[09:15:27] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm) p:05High→03Triage I believe this is connected with T240518. Resetting priority and tagging with #operations. Not merging as dupe, given I'm unsure.
[09:22:29] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Global renames are really slow - https://phabricator.wikimedia.org/T240518 (10Urbanecm) I just spent a few hours going through Grafana and found https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&fullscreen&panelId=15&from=now...
[09:23:18] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[09:23:31] 10Operations: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[10:25:33] (03Abandoned) 10Urbanecm: Prepare initial configuration for initiativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm)
[10:45:16] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Peachey88)
[10:48:37] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Peachey88)
[11:14:10] (03PS7) 10TechneSiyam: Added bnwikibooks,bnwikisource,ukwikivoyage under wiki hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053
[11:14:59] (03CR) 10jerkins-bot: [V: 04-1] Added bnwikibooks,bnwikisource,ukwikivoyage under wiki hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 (owner: 10TechneSiyam)
[11:33:46] 10Operations, 10netops: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (10ayounsi) This issue is fixed in Fastnetmon 1.1.4: https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.1.4 > Suppressed excessive logging about missing IPFIX o...
[11:36:27] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:47:15] RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:58:45] 10Operations, 10cloud-services-team, 10netops: Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10ayounsi) p:05Triage→03Low
[12:14:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:37] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:27:29] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.55 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[12:31:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[12:41:13] RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:24:21] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:34:23] 10Operations, 10netops: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (10ayounsi) One of the 2 office routers had a hardware failure. I have an OIT ticket open. No ETA on resolution yet.
[13:40:51] 10Operations, 10netops: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (10ayounsi) Yep, good for me! Thanks!
[13:55:00] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10ayounsi) Thanks Chris for looking into it, and Brion for providing the outputs. All those mtr look fine though (the destination doesn't have any packet loss). Is the issue wi...
[14:04:37] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[14:39:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:40:49] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10DavidBrooks) 05Open→03Invalid Closed as suggested. Sorry about the delay; I'll open another ticket although it's rapidly...
[14:41:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:44:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:49:45] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:55:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:57:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:24] (03CR) 10Ayounsi: [C: 03+1] "Some IPs are still present on the routers:" [dns] - 10https://gerrit.wikimedia.org/r/556995 (https://phabricator.wikimedia.org/T240670) (owner: 10Arturo Borrero Gonzalez)
[15:06:57] (03CR) 10Ayounsi: [C: 03+1] network: data: cleanup unused WMCS ranges [puppet] - 10https://gerrit.wikimedia.org/r/556994 (https://phabricator.wikimedia.org/T240670) (owner: 10Arturo Borrero Gonzalez)
[15:11:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:16:25] (03PS1) 10Ayounsi: Cloud ACLs: rename labmon1002 to cloudmetrics1002 [homer/public] - 10https://gerrit.wikimedia.org/r/557565 (https://phabricator.wikimedia.org/T240456)
[15:19:18] 10Operations, 10netops, 10Patch-For-Review: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10ayounsi) 10.64.4.15 is already present in the router ACLs. I opened the CR bellow to rename its description. https://gerrit.wikimedia.org/r/c/operations/homer/public/+/557565
[15:36:43] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:38:31] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[16:22:29] (03PS1) 10Ammarpad: Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104)
[16:25:58] (03PS2) 10Ammarpad: Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104)
[16:35:13] 10Operations, 10Traffic: /sec-warning page: please add a helpful XML comment explaining why it's being delivered. - https://phabricator.wikimedia.org/T240794 (10Aklapper)
[17:23:52] (03CR) 10Masumrezarock100: [C: 03+1] Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104) (owner: 10Ammarpad)
[17:40:45] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Masumrezarock100) Heh. I believe I received that message at my meta talk page.
[18:28:31] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6042 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:32:09] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:53:23] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[18:58:27] (03CR) 10Jforrester: [C: 03+1] keys.txt: Only include Tim's current key (73F146FECF9D333C) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557158 (owner: 10Legoktm)
[18:58:37] (03CR) 10Jforrester: [C: 03+1] keys.html: Include Tim's new key (73F146FECF9D333C) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557159 (owner: 10Legoktm)
[19:20:51] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:23:59] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Jdforrester-WMF) >>! In T224591#5739522, @hashar wrote: > Indeed tha...
[19:24:29] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:29:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:36:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:54:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:03:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:10:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:18:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:19:30] 10Operations, 10cloud-services-team, 10netops: Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10Krinkle)
[20:19:51] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[20:20:45] (03PS3) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[20:26:19] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10MarcoAurelio) @aaron @Pchelolo Could you please take a look at this one? Thanks.
[20:27:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:27:15] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CRITICAL - load average: 184.14, 112.62, 53.06 https://wikitech.wikimedia.org/wiki/Swift
[20:28:41] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Krinkle) I've come around to agreeing with @Tgr. Gravatar seems like something we could support in good conscience...
[20:29:03] PROBLEM - MD RAID on ms-be2016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:29:04] ACKNOWLEDGEMENT - MD RAID on ms-be2016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T240798 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:29:08] 10Operations, 10ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (10ops-monitoring-bot)
[20:32:41] PROBLEM - Disk space on ms-be2016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2016&var-datasource=codfw+prometheus/ops
[20:32:43] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 33.12, 72.87, 55.22 https://wikitech.wikimedia.org/wiki/Swift
[20:33:31] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:36:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:44:03] PROBLEM - Varnish HTCP daemon on cp1075 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish
[20:57:29] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Daimona) T240518 ?
[21:00:25] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm) p:05Triage→03Unbreak! Boldly triaging as UBN, this seems to affect the whole queue thing (T240800 was just created, I'm having troubles uploading a webm file [uploader...
[21:00:58] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm) >>! In T240800#5742793, @Daimona wrote: > T240518 ? My first guess.
[21:02:05] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:03:43] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:07:53] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm) Yup, it just takes hours to deliver.
[21:08:13] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[21:08:15] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm)
[21:08:17] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm)
[21:08:28] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[21:08:30] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm)
[21:08:32] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm)
[21:08:45] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[21:08:47] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm)
[21:08:49] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm)
[21:45:57] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Masumrezarock100) >>! In T240518#5742251, @Urbanecm wrote: > I just spent some time going through Grafana and found https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId...
[22:28:55] (03PS1) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[22:31:42] (03CR) 10DannyS712: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[22:32:40] (03CR) 10jerkins-bot: [V: 04-1] Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[22:33:58] (03CR) 10DannyS712: [C: 04-1] "Thanks for contributing @RetroCraft. Jenkins found some issues though" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[22:36:04] (03PS2) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[22:37:25] (03CR) 10RetroCraft: "> Patch Set 1: Code-Review-1" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:07:44] (03CR) 10DannyS712: Create Test Custodians group at Beta Wikiversity (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:10:53] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Krinkle)
[23:36:21] (03PS3) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[23:36:48] (03PS4) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[23:39:37] (03CR) 10RetroCraft: "Makes sense, fixed." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:49:22] (03CR) 10DannyS712: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:50:32] (03CR) 10DannyS712: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:56:13] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/