[00:29:49] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Aklapper)
[02:02:37] PROBLEM - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:03:54] hieradata/eqiad/profile/openstack/eqiad1/nova.yaml:# cloudvirt1023: depooled, emergency spare
[02:03:56] ACKNOWLEDGEMENT - ensure kvm processes are running on cloudvirt1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 andrew bogott Ill investigate https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:06:15] RECOVERY - ensure kvm processes are running on cloudvirt1023 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[02:38:20] (03Abandoned) 10Andrew Bogott: nova-api: inject default user_data script for new VMs [puppet] - 10https://gerrit.wikimedia.org/r/556311 (https://phabricator.wikimedia.org/T181375) (owner: 10Andrew Bogott)
[02:38:21] (03PS6) 10Andrew Bogott: Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421
[02:39:56] 10Operations, 10Traffic, 10Performance Issue, 10Performance-Team (Radar), 10User-notice: Sometimes pages load slowly for users routed to the Amsterdam data center (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (10MBH) 05Open→03Resolved Looks like fixed m...
[02:40:21] (03CR) 10Andrew Bogott: [C: 03+2] Bootstrapvz: remove firstboot script, enable cloud-init [puppet] - 10https://gerrit.wikimedia.org/r/425421 (owner: 10Andrew Bogott)
[02:44:11] (03PS1) 10Andrew Bogott: labs_bootstrapbz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379
[02:46:51] (03PS2) 10Andrew Bogott: labs_bootstrapbz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379
[02:48:36] (03PS3) 10Andrew Bogott: labs_bootstrapvz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379
[02:49:33] (03CR) 10Andrew Bogott: [C: 03+2] labs_bootstrapvz: remove remaining references to firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/557379 (owner: 10Andrew Bogott)
[04:36:53] PROBLEM - MegaRAID on analytics1057 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteThrough, currently using: WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack, WriteBack https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[04:47:41] RECOVERY - MegaRAID on analytics1057 is OK: OK: optimal, 13 logical, 14 physical, WriteThrough policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[05:47:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[06:11:31] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.175 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[06:14:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[06:15:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[06:32:48] (03PS1) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[06:34:45] (03PS2) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[07:18:23] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.5792 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[07:25:37] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.09583 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[08:36:25] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 4.178e+06 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1
[09:15:27] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm) p:05High→03Triage I believe this is connected with T240518. Resetting priority and tagging with #operations. Not merging as dupe, given I'm unsure.
[09:22:29] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Global renames are really slow - https://phabricator.wikimedia.org/T240518 (10Urbanecm) I just spent a few hours going through Grafana and found https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&fullscreen&panelId=15&from=now...
[09:23:18] 10Operations, 10GlobalRename, 10MediaWiki-extensions-CentralAuth: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[09:23:31] 10Operations: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[10:25:33] (03Abandoned) 10Urbanecm: Prepare initial configuration for initiativeswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/504502 (https://phabricator.wikimedia.org/T167375) (owner: 10Urbanecm)
[10:45:16] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Peachey88)
[10:48:37] 10Operations, 10ChangeProp, 10Release Pipeline, 10Release-Engineering-Team-TODO, and 5 others: Migrate cpjobqueue to kubernetes - https://phabricator.wikimedia.org/T220399 (10Peachey88)
[11:14:10] (03PS7) 10TechneSiyam: Added bnwikibooks,bnwikisource,ukwikivoyage under wiki hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053
[11:14:59] (03CR) 10jerkins-bot: [V: 04-1] Added bnwikibooks,bnwikisource,ukwikivoyage under wiki hd logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 (owner: 10TechneSiyam)
[11:33:46] 10Operations, 10netops: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (10ayounsi) This issue is fixed in Fastnetmon 1.1.4: https://github.com/pavel-odintsov/fastnetmon/releases/tag/v1.1.4 > Suppressed excessive logging about missing IPFIX o...
[11:36:27] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:47:15] RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:58:45] 10Operations, 10cloud-services-team, 10netops: Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10ayounsi) p:05Triage→03Low
[12:14:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:41] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:19:37] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[12:27:29] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.55 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[12:31:05] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.08333 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[12:41:13] RECOVERY - MegaRAID on db1130 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:24:21] PROBLEM - MegaRAID on db1130 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[13:34:23] 10Operations, 10netops: BGP peering sessions with corp partially down in ulsfo - https://phabricator.wikimedia.org/T239893 (10ayounsi) One of the 2 office routers had a hardware failure. I have an OIT ticket open. No ETA on resolution yet.
[13:40:51] 10Operations, 10netops: Facebook BGP peering links down in ulsfo - https://phabricator.wikimedia.org/T239896 (10ayounsi) Yep, good for me! Thanks!
[13:55:00] 10Operations, 10netops: Network issues reaching phabricator on IPv6 (Comcast/Portland OR) - https://phabricator.wikimedia.org/T240488 (10ayounsi) Thanks Chris for looking into it, and Brion for providing the outputs. All those mtr look fine though (the destination doesn't have any packet loss). Is the issue wi...
[14:04:37] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[14:39:25] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:40:49] 10Operations, 10Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (10DavidBrooks) 05Open→03Invalid Closed as suggested. Sorry about the delay; I'll open another ticket although it's rapidly...
[14:41:11] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:44:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:49:45] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[14:55:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:57:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:24] (03CR) 10Ayounsi: [C: 03+1] "Some IPs are still present on the routers:" [dns] - 10https://gerrit.wikimedia.org/r/556995 (https://phabricator.wikimedia.org/T240670) (owner: 10Arturo Borrero Gonzalez)
[15:06:57] (03CR) 10Ayounsi: [C: 03+1] network: data: cleanup unused WMCS ranges [puppet] - 10https://gerrit.wikimedia.org/r/556994 (https://phabricator.wikimedia.org/T240670) (owner: 10Arturo Borrero Gonzalez)
[15:11:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:16:25] (03PS1) 10Ayounsi: Cloud ACLs: rename labmon1002 to cloudmetrics1002 [homer/public] - 10https://gerrit.wikimedia.org/r/557565 (https://phabricator.wikimedia.org/T240456)
[15:19:18] 10Operations, 10netops, 10Patch-For-Review: Add cloudmetrics1002 to network devices ACL - https://phabricator.wikimedia.org/T240456 (10ayounsi) 10.64.4.15 is already present in the router ACLs. I opened the CR bellow to rename its description. https://gerrit.wikimedia.org/r/c/operations/homer/public/+/557565
[15:36:43] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[15:38:31] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[16:22:29] (03PS1) 10Ammarpad: Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104)
[16:25:58] (03PS2) 10Ammarpad: Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104)
[16:35:13] 10Operations, 10Traffic: /sec-warning page: please add a helpful XML comment explaining why it's being delivered. - https://phabricator.wikimedia.org/T240794 (10Aklapper)
[17:23:52] (03CR) 10Masumrezarock100: [C: 03+1] Re-add localized Wikipedia wordmark for szlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557584 (https://phabricator.wikimedia.org/T233104) (owner: 10Ammarpad)
[17:40:45] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Masumrezarock100) Heh. I believe I received that message at my meta talk page.
[18:28:31] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.6042 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:32:09] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0625 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[18:53:23] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[18:58:27] (03CR) 10Jforrester: [C: 03+1] keys.txt: Only include Tim's current key (73F146FECF9D333C) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557158 (owner: 10Legoktm)
[18:58:37] (03CR) 10Jforrester: [C: 03+1] keys.html: Include Tim's new key (73F146FECF9D333C) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557159 (owner: 10Legoktm)
[19:20:51] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9417 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:23:59] 10Operations, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Infrastructure (phase-out-jessie), 10Release-Engineering-Team (CI & Testing services): Migrate contint* hosts to Buster - https://phabricator.wikimedia.org/T224591 (10Jdforrester-WMF) >>! In T224591#5739522, @hashar wrote: > Indeed tha...
[19:24:29] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.04167 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:29:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:36:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:54:41] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:03:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:10:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:18:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:19:30] 10Operations, 10cloud-services-team, 10netops: Return traffic to eqiad WMCS triggering FNM - https://phabricator.wikimedia.org/T240789 (10Krinkle)
[20:19:51] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[20:20:45] (03PS3) 10Ammarpad: Add minerva custom log for la.wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557439 (https://phabricator.wikimedia.org/T240728)
[20:26:19] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10MarcoAurelio) @aaron @Pchelolo Could you please take a look at this one? Thanks.
[20:27:13] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[20:27:15] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CRITICAL - load average: 184.14, 112.62, 53.06 https://wikitech.wikimedia.org/wiki/Swift
[20:28:41] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Traffic, 10Release-Engineering-Team (Development services): Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Krinkle) I've come around to agreeing with @Tgr. Gravatar seems like something we could support in good conscience...
[20:29:03] PROBLEM - MD RAID on ms-be2016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:29:04] ACKNOWLEDGEMENT - MD RAID on ms-be2016 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T240798 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:29:08] 10Operations, 10ops-codfw: Degraded RAID on ms-be2016 - https://phabricator.wikimedia.org/T240798 (10ops-monitoring-bot)
[20:32:41] PROBLEM - Disk space on ms-be2016 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdk1 is not accessible: Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ms-be2016&var-datasource=codfw+prometheus/ops
[20:32:43] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 33.12, 72.87, 55.22 https://wikitech.wikimedia.org/wiki/Swift
[20:33:31] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:34:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=icinga site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:36:39] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:44:03] PROBLEM - Varnish HTCP daemon on cp1075 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (vhtcpd), args vhtcpd https://wikitech.wikimedia.org/wiki/Varnish
[20:57:29] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Daimona) T240518 ?
[21:00:25] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm) p:05Triage→03Unbreak! Boldly triaging as UBN, this seems to affect the whole queue thing (T240800 was just created, I'm having troubles uploading a webm file [uploader...
[21:00:58] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm) >>! In T240800#5742793, @Daimona wrote: > T240518 ? My first guess.
[21:02:05] PROBLEM - SSH on ms-be2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:03:43] RECOVERY - SSH on ms-be2021 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:07:53] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm) Yup, it just takes hours to deliver.
[21:08:13] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[21:08:15] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm)
[21:08:17] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm)
[21:08:28] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[21:08:30] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm)
[21:08:32] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm)
[21:08:45] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Urbanecm)
[21:08:47] 10Operations, 10Mail: MediaWiki mail system for watchlist on it.wikipedia is delivering very slowly - https://phabricator.wikimedia.org/T240800 (10Urbanecm)
[21:08:49] 10Operations, 10MassMessage, 10User-DannyS712: MassMessage not delivering - https://phabricator.wikimedia.org/T240777 (10Urbanecm)
[21:45:57] 10Operations, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Masumrezarock100) >>! In T240518#5742251, @Urbanecm wrote: > I just spent some time going through Grafana and found https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId...
[22:28:55] (03PS1) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[22:31:42] (03CR) 10DannyS712: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[22:32:40] (03CR) 10jerkins-bot: [V: 04-1] Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[22:33:58] (03CR) 10DannyS712: [C: 04-1] "Thanks for contributing @RetroCraft. Jenkins found some issues though" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[22:36:04] (03PS2) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[22:37:25] (03CR) 10RetroCraft: "> Patch Set 1: Code-Review-1" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:07:44] (03CR) 10DannyS712: Create Test Custodians group at Beta Wikiversity (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:10:53] 10Operations, 10Core Platform Team, 10WMF-JobQueue: Job queue seems to be processed slowly than expected - https://phabricator.wikimedia.org/T240518 (10Krinkle)
[23:36:21] (03PS3) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[23:36:48] (03PS4) 10RetroCraft: Create Test Custodians group at Beta Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438)
[23:39:37] (03CR) 10RetroCraft: "Makes sense, fixed." (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:49:22] (03CR) 10DannyS712: "Recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:50:32] (03CR) 10DannyS712: [C: 03+1] "Looks good" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557671 (https://phabricator.wikimedia.org/T240438) (owner: 10RetroCraft)
[23:56:13] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert, rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/