[00:45:23] (03CR) 10DannyS712: [C: 03+1] Icinga: changing channel for Wikidata [puppet] - 10https://gerrit.wikimedia.org/r/686697 (https://phabricator.wikimedia.org/T282301) (owner: 101997kB)
[03:08:55] PROBLEM - SSH on logstash2020.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[03:21:21] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:36:59] 10SRE, 10Traffic, 10netops: TATA SKY Broadband (AS134674) issues with connecting to upload.wikimedia.org - https://phabricator.wikimedia.org/T275234 (10Aklapper) @ssingh: Any news here?
[03:55:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:56:01] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[04:10:15] RECOVERY - SSH on logstash2020.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[04:16:13] 10SRE, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (10Ladsgroup)
[04:27:29] !log starting upgrade of batch H of mailing lists (T280322)
[04:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:27:40] T280322: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322
[05:03:15] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 1874 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[05:03:58] I look
[05:07:29] I don't see an immediate problem. I think this alert needs fixing
[05:10:34] smtp simply has recovered
[05:10:44] I think it's due to an email to wikimediaannounce
[05:23:54] 10SRE, 10Wikimedia-Mailing-lists: Error in qcluster - https://phabricator.wikimedia.org/T282071 (10Ladsgroup) 05Open→03Resolved a:03Legoktm
[05:30:43] * legoktm peeks
[05:34:21] legoktm: if that helps: https://grafana-rw.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1 (I fixed it)
[05:34:37] it = the grafana dashboard
[05:35:36] thanks, I'm reading https://docs.mailman3.org/projects/mailman/en/latest/src/mailman/app/docs/bounces.html right now
[05:35:52] there really are 1875 files in the bounce queue rn
[06:18:58] the bounce runner died, I'm trying to run it manually now
[06:21:37] !log restarting mailman3 service, bounce runner died
[06:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:30] the bounces runner is running again but not actually clearing the queue
[06:30:40] ok
[06:30:44] fixed the dashboard for real now
[06:37:31] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[06:49:52] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner not working - https://phabricator.wikimedia.org/T282348 (10Legoktm) p:05Triage→03Unbreak!
[06:52:38] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner not working - https://phabricator.wikimedia.org/T282348 (10Legoktm) https://gitlab.com/mailman/mailman/-/issues/755 suggests upgrading flufl.bounce to 3.0.1 might help, we're on 3.0.0 https://gitlab.com/mailman/mailman/-/issues/887 is about the bounce r...
[07:00:05] jan_drewniak: My dear minions, it's time we take the moon! Just kidding. Time for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210510T1030).
[07:01:30] Isn't it Sunday?
[07:01:47] legoktm: is there anything I can do to help?
[07:05:02] Amir1: I have no idea where the code that reads files out of queue/bounces is
[07:05:24] it does not appear to be runners/bounce.py
[07:05:47] okay, I check
[07:08:59] but that's the only place that uses bounce methods https://gitlab.com/search?utf8=%E2%9C%93&search=%22maybe_forward%22&group_id=124427&project_id=183616&scope=&search_code=true&snippets=false&repository_ref=master&nav_source=navbar
[07:10:18] there is something here https://gitlab.com/mailman/mailman/-/blob/master/src/mailman/core/pipelines.py
[07:15:33] I think it's a switchboard
[07:26:23] (03PS1) 10Legoktm: mailman3: Enable debug logging [puppet] - 10https://gerrit.wikimedia.org/r/687511 (https://phabricator.wikimedia.org/T282348)
[07:28:05] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[07:28:12] (03PS2) 10Legoktm: mailman3: Enable debug logging [puppet] - 10https://gerrit.wikimedia.org/r/687511 (https://phabricator.wikimedia.org/T282348)
[07:29:08] (03CR) 10Legoktm: [C: 03+2] mailman3: Enable debug logging [puppet] - 10https://gerrit.wikimedia.org/r/687511 (https://phabricator.wikimedia.org/T282348) (owner: 10Legoktm)
[07:33:30] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman3 bounce runner not working - https://phabricator.wikimedia.org/T282348 (10Legoktm) So the bounce runner is running! It's just super slow ` May 09 07:31:21 2021 (20029) [BounceRunner] checking short circuit May 09 07:31:21 2021 (20029) [BounceRunn...
[07:34:27] legoktm: can it be because we are upgrading the vm is under pressure?
[07:34:46] it has only 4 cpus
[07:35:20] the 20s seems very much like a timeout
[07:36:13] but I have a stupider theory, which is that now that the lists are in Mailman3, it's actually trying to unsubscribe all these addresses that have been bouncing FOREVER
[07:38:15] beautiful
[07:46:55] it doesn't look like a slow query otherwise that would show up in SHOW FULL PROCESSLIST, which is mostly empty
[07:49:37] PROBLEM - wikimedia-client-errors-alerts grafana alert on alert1001 is CRITICAL: CRITICAL: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is alerting: Client error alert. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[07:53:34] something is likely wrong with the wikimedia-in-wb list
[07:54:44] or maybe those are just the messages it's going through right now?
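
A note on the "switchboard" guess above: Mailman 3 runners drain per-queue directories of pickled messages, which is why the debugging here keeps coming back to counting files in queue/bounces. A minimal Python sketch of that pattern follows; the queue path, .pck layout, and process_bounce() hook are assumptions for illustration, not Mailman's actual runner code.

    # Sketch of a queue-directory ("switchboard") runner loop. Assumed paths and
    # file layout; not Mailman's real implementation.
    import os
    import pickle

    QUEUE_DIR = "/var/lib/mailman3/queue/bounces"  # assumed queue path

    def process_bounce(msg, msgdata):
        # Placeholder for the real work (bounce scoring, unsubscribing, etc.).
        print("processing bounce for", msgdata.get("listid"))

    def run_once():
        # Each queued item is assumed to be a pickle file holding the message
        # followed by its metadata dictionary.
        for name in sorted(os.listdir(QUEUE_DIR)):
            if not name.endswith(".pck"):
                continue
            path = os.path.join(QUEUE_DIR, name)
            with open(path, "rb") as fh:
                msg = pickle.load(fh)
                msgdata = pickle.load(fh)
            process_bounce(msg, msgdata)
            os.remove(path)  # dequeue only after the item is handled

    if __name__ == "__main__":
        run_once()

The real bounce runner layers bounce scoring and member lookups on top of a loop like this, which is where the per-message slowness discussed above comes from.
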
[07:59:21] RECOVERY - wikimedia-client-errors-alerts grafana alert on alert1001 is OK: OK: Overview ( https://grafana.wikimedia.org/d/000000566/overview ) is not alerting. https://logstash.wikimedia.org/app/kibana%23/dashboard/AXDBY8Qhh3Uj6x1zCF56 https://grafana.wikimedia.org/d/000000566/
[08:01:16] 10SRE, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Mailman3 bounce runner not working - https://phabricator.wikimedia.org/T282348 (10Legoktm) I tried to apply the patch from https://gitlab.com/mailman/mailman/-/merge_requests/811 but it made no difference. My guess is that 20s is a timeout somewhere.
[08:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[08:49:02] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner not working - https://phabricator.wikimedia.org/T282348 (10Legoktm) ` >>> from mailman.interfaces.member import DeliveryStatus, IMembershipManager >>> from zope.component import getUtility >>> manager = getUtility(IMembershipManager) >>> p=manager.member...
[08:49:24] Amir1: ^^ figured it out, it's a MM3 bug
[08:49:35] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) p:05Unbreak!→03High
[08:49:47] oh nice
[08:50:10] ACKNOWLEDGEMENT - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 2461 (limit: 25) Legoktm https://phabricator.wikimedia.org/T282348 https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[08:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[09:04:23] (03PS1) 10Ladsgroup: Revert "mailman3: Enable debug logging" [puppet] - 10https://gerrit.wikimedia.org/r/687409
[09:04:42] legoktm: at your leisure ^
[09:16:20] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) I live hacked in the following patch: ` diff --git a/src/mailman/model/member.py b/src/mailman/model/member.py index 1adeb6e41..6f70f990f 100644 --- a/src/mailman/model/member...
[09:16:28] !log mailman3 live hacked patch at https://phabricator.wikimedia.org/T282348#7072358 to fix bounce queue
[09:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:17:34] who'd have thought, making one query is much faster than making 2,600 queries
[09:19:23] wohoo
[09:19:25] Awesome
[09:19:30] Thanks <3
[09:20:37] 10SRE, 10Wikimedia-Mailing-lists: Publish statistics about number of held messages per mailing list (Jan 2021) - https://phabricator.wikimedia.org/T270977 (10Aklapper) Thanks. I guess to some extent this issue will become more visible anyway now due to the Mailman3 migration (big big thanks for that).
[09:22:10] Amir1: lmao, check your inbox
[09:22:32] > "subscription has been disabled on listadmins@lists.wikimedia.org due to their bounce score exceeding the mailing list's bounce_score_threshold"
[09:22:33] legoktm: the dark side
[09:22:59] at least we get rid of the spammers for good
[09:23:23] RIP my inbox
[09:24:16] https://grafana-rw.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&forceLogin=true&from=now-6h&to=now
[09:24:27] look at the out queue rising with all these notifications going out
[09:25:10] lol
[09:25:30] I wonder how many more I'm going to get
[09:28:55] the bounce runner died again
[09:31:55] kicked it again
[09:35:15] died again
[09:35:35] Maybe the out queue needs clearing first?
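
The "one query is much faster than making 2,600 queries" remark above describes a classic N+1 query pattern: one lookup per bouncing address instead of a single set-based query. A self-contained illustration using sqlite3 follows; the schema and table names are invented for the example and are not Mailman's member/address model or the actual member.py patch.

    # Illustration of the N+1 pattern behind "one query vs 2,600 queries".
    # Standalone sqlite3 sketch with an invented schema.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE address (id INTEGER PRIMARY KEY, email TEXT);
        CREATE TABLE member (id INTEGER PRIMARY KEY, address_id INTEGER,
                             delivery_status TEXT);
    """)
    conn.executemany("INSERT INTO address (id, email) VALUES (?, ?)",
                     [(i, f"user{i}@example.org") for i in range(2600)])
    conn.executemany("INSERT INTO member (address_id, delivery_status) VALUES (?, ?)",
                     [(i, "by_bounces") for i in range(2600)])

    # Slow shape: one query per address, i.e. 2,600 round trips.
    members_slow = []
    for (addr_id,) in conn.execute("SELECT id FROM address"):
        row = conn.execute(
            "SELECT id FROM member WHERE address_id = ? AND delivery_status = ?",
            (addr_id, "by_bounces")).fetchone()
        if row:
            members_slow.append(row[0])

    # Fast shape: a single join does the same work in one query.
    members_fast = [r[0] for r in conn.execute(
        "SELECT m.id FROM member m JOIN address a ON m.address_id = a.id "
        "WHERE m.delivery_status = ?", ("by_bounces",))]

    assert sorted(members_slow) == sorted(members_fast)
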
[09:35:48] May 9 09:01:55 lists1001 kernel: [7408888.432123] Out of memory: Kill process 21960 (mailman) score 687 or sacrifice child
[09:35:48] May 9 09:01:55 lists1001 kernel: [7408888.433858] Killed process 21960 (mailman) total-vm:10004984kB, anon-rss:4581088kB, file-rss:480kB, shmem-rss:0kB
[09:36:36] I'm guessing it has a mem leak
[09:37:50] 10SRE, 10Wikimedia-Mailing-lists, 10observability: Monitor mailman3 runner processes - https://phabricator.wikimedia.org/T282366 (10Legoktm)
[09:38:29] Amir1: can you pause the imports for now?
[09:38:49] done
[09:39:11] well, ctrl+c didn't work
[09:39:13] going ps aux
[09:39:56] it should be fully killed now
[09:41:09] (1406, "Data too long for column 'email' at row 1")
[09:41:39] ffs
[09:42:13] also, it seems exceptions are in /var/log/syslog
[09:43:40] email is VARCHAR(225)
[09:43:44] sorry, 255
[09:44:56] ...
[09:52:04] I would suggest we send an email to listadmins explaining why everyone is getting a deluge of unsubscription notices if that mail wouldn't trigger like 100 more
[09:52:57] the dashboard very nicely says it'll take 2 hours to send all the bounce notifications
[09:53:40] I handle it
[09:54:43] okay, ready to see the too-long email?
[09:54:45] 'email': 'p style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0); font-family: -webkit-standard; font-style: normal; font-variant-caps: normal; font-weight: n ... (67 characters truncated) ... ne; white-space: normal; word-spacing: 0px; -moz-text-size-adjust: auto; -webkit-text-stroke-width: 0px; text-decoration: none; text-align: center; "'
[09:55:20] lmao
[09:55:33] At least it's not '"; drop table'
[09:57:11] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) The bounce runner died again because it tried to insert a too-big email into the bounceevent table (`VARCHAR(255)`): ` May 9 09:38:37 lists1001 mailman3[19017]: sqlalchemy.e...
[09:57:32] ok, restarting again
[09:58:31] maybe we should have kept it as slow :D
[10:00:55] ugh I should have sent it to announce
[10:02:58] > noreply@wikimedia.org's subscription disabled on Listadmins
[10:03:29] now list of departed staff
[10:05:29] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:06:03] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.308 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:08:01] RECOVERY - Query Service HTTP Port on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[10:15:24] queue is below 1k
[10:17:25] and number of members of listadmins
[10:30:11] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[10:30:58] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) Email validation: https://gitlab.com/mailman/mailman/-/issues/892 Patch for query optimization: https://gitlab.com/mailman/mailman/-/merge_requests/857
[10:42:56] legoktm: can I resume?
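
The 1406 crash above happened because a junk "address" longer than the bounceevent table's VARCHAR(255) column was inserted. A small sketch of the kind of pre-insert check that would reject it; the 255 limit comes from the log, while the validation rule itself is a simplified assumption, not whatever fix Mailman's upstream issue 892 adopts.

    # Guarding an insert against over-long / junk "email" values like the CSS
    # blob quoted above. Limit of 255 is from the VARCHAR(255) column in the
    # log; the validation itself is a simplified stand-in, not Mailman's.
    import re

    MAX_EMAIL_LEN = 255
    # Deliberately loose: "something@something.something" with no whitespace.
    EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    def is_storable_email(value: str) -> bool:
        return len(value) <= MAX_EMAIL_LEN and EMAIL_RE.match(value) is not None

    assert is_storable_email("listadmins@lists.wikimedia.org")
    assert not is_storable_email('p style="caret-color: rgb(0, 0, 0); ..."')
    assert not is_storable_email("x" * 300 + "@example.org")
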
[10:43:03] yep
[10:52:40] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 180 days, 0:00:00 on cloudmetrics1002.eqiad.wmnet with reason: T275605
[10:52:40] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 180 days, 0:00:00 on cloudmetrics1002.eqiad.wmnet with reason: T275605
[10:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:51] T275605: cloudmetrics1002: mysterious issue - https://phabricator.wikimedia.org/T275605
[10:52:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10aborrero)
[10:55:42] 10SRE, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install labmon1002 - https://phabricator.wikimedia.org/T165784 (10aborrero)
[10:57:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): hw troubleshooting: server hardlocking for cloudmetrics1002.eqiad.wmnet - https://phabricator.wikimedia.org/T281881 (10aborrero)
[11:13:29] (03PS1) 10Arturo Borrero Gonzalez: grafana-labs: point to cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/687585 (https://phabricator.wikimedia.org/T275605)
[11:13:36] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudmetrics: fail over to cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/684990 (https://phabricator.wikimedia.org/T281881) (owner: 10Bstorm)
[11:14:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] grafana-labs: point to cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/687585 (https://phabricator.wikimedia.org/T275605) (owner: 10Arturo Borrero Gonzalez)
[11:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[11:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[12:08:23] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 240 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:09:23] the link is broken
[12:10:11] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[12:24:59] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:25:03] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:47:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:50:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:33:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:35:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[14:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[15:15:33] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: daily-article-l@, education@ import to Mailman3 failed because of unicode characters in display name - https://phabricator.wikimedia.org/T282271 (10jcrespo) >>! In T282271#7071118, @Legoktm wrote: > I filed https://gitlab.com/mailman/mailman/-/issues/891 upstream...
[15:40:42] (03PS4) 10Neechalkaran: Enable WikiLove extension on tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686700 (https://phabricator.wikimedia.org/T280326)
[17:28:49] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100%
[17:30:51] PROBLEM - mailman3_queue_size on lists1001 is CRITICAL: CRITICAL: 1 mailman3 queues above limits: bounces is 45 (limit: 25) https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[17:38:15] legoktm, Amir1: should that be /Monitoring (can't change it, filed bug)
[17:47:01] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={netbox_device_statistics,pdu_sentry4} site={codfw,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[17:49:27] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[18:28:09] !log systemctl restart mailman3, bounce runner died again (T282348)
[18:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:28:14] T282348: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348
[18:30:53] RECOVERY - mailman3_queue_size on lists1001 is OK: OK: mailman3 queues are below the limits https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[18:33:22] 10SRE, 10Wikimedia-Mailing-lists: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348 (10Legoktm) I hack applied https://gitlab.com/mailman/mailman/-/merge_requests/811/, I think it'll fix the crash issue.
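
For reference, the mailman3_queue_size alerts in this log compare the number of files in each Mailman queue directory against a per-queue limit (25 for bounces, per the alert text). A rough sketch of such a check follows; the queue base path and the non-bounces limits are assumptions for illustration, not the deployed Icinga plugin's configuration.

    # Rough sketch of a queue-size check in the style of mailman3_queue_size.
    # Assumed base path and limits; not the deployed plugin.
    import os
    import sys

    QUEUE_BASE = "/var/lib/mailman3/queue"          # assumed path
    LIMITS = {"bounces": 25, "in": 25, "out": 500}  # assumed per-queue limits

    def check_queues() -> int:
        over = []
        for queue, limit in LIMITS.items():
            path = os.path.join(QUEUE_BASE, queue)
            size = len(os.listdir(path)) if os.path.isdir(path) else 0
            if size > limit:
                over.append(f"{queue} is {size} (limit: {limit})")
        if over:
            print("CRITICAL: %d mailman3 queues above limits: %s"
                  % (len(over), ", ".join(over)))
            return 2  # Icinga CRITICAL exit code
        print("OK: mailman3 queues are below the limits")
        return 0

    if __name__ == "__main__":
        sys.exit(check_queues())
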
[18:39:15] (03PS1) 10Legoktm: mailman3: Monitor runners [puppet] - 10https://gerrit.wikimedia.org/r/687741 (https://phabricator.wikimedia.org/T282366)
[18:52:47] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[18:58:01] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[19:02:15] (03CR) 10Rubin: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686700 (https://phabricator.wikimedia.org/T280326) (owner: 10Neechalkaran)
[19:04:51] (03PS2) 10Legoktm: mailman3: Monitor runners [puppet] - 10https://gerrit.wikimedia.org/r/687741 (https://phabricator.wikimedia.org/T282366)
[19:11:09] PROBLEM - mailman list info on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 Moved Permanently - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 618 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[19:11:43] heh
[19:16:48] ACKNOWLEDGEMENT - mailman list info on lists1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 301 Moved Permanently - string Wikimedia Mailing List not found on https://lists.wikimedia.org:443/mailman/listinfo/wikimedia-l - 618 bytes in 0.011 second response time Legoktm moved to Mailman3! will fix alert https://wikitech.wikimedia.org/wiki/Mailman%23Monitoring
[19:17:37] One time we'll be happy to see icinga going off :)
[20:21:59] (03PS1) 10Zabe: Add svwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687785 (https://phabricator.wikimedia.org/T282389)
[20:25:36] (03PS1) 10Zabe: Use svwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687787 (https://phabricator.wikimedia.org/T282389)
[20:25:55] (03CR) 10Urbanecm: [C: 03+1] Add svwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687785 (https://phabricator.wikimedia.org/T282389) (owner: 10Zabe)
[20:26:00] (03CR) 10Urbanecm: [C: 03+1] Use svwiki 20th anniversary logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687787 (https://phabricator.wikimedia.org/T282389) (owner: 10Zabe)
[20:46:25] (03PS1) 10Urbanecm: Add *.geograph.ie to the wgCopyUploadsDomains allowlist of Wikimedia Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/687795 (https://phabricator.wikimedia.org/T282007)
[20:48:47] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[20:50:23] (03CR) 10Urbanecm: [C: 04-1] "CR-1 for visibility" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155) (owner: 10Zabe)
[20:51:00] (03CR) 10Urbanecm: [C: 03+1] Enable WikiLove extension on tawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686700 (https://phabricator.wikimedia.org/T280326) (owner: 10Neechalkaran)
[20:53:47] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org
[20:55:25] (03PS2) 10Zabe: Change namespace name and aliases on jawikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155)
[20:56:01] (03CR) 10Zabe: Change namespace name and aliases on jawikivoyage (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/686437 (https://phabricator.wikimedia.org/T262155) (owner: 10Zabe)
[20:56:25] 10SRE, 10Internet-Archive, 10Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (10Urbanecm) 05Open→03Declined Please try to upload it with upload_by_url feature. It seems to work with my tests.
[21:31:18] (03PS1) 10Legoktm: lists: Update monitoring [puppet] - 10https://gerrit.wikimedia.org/r/687813
[21:32:27] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 70353824 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:34:45] pymysql.err.InternalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
[21:35:03] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 474184 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[21:35:09] Ty legoktm on the PS. Was gonna do earlier but gerrit hates me.
[21:35:18] The # -> / part
[21:35:28] :)
[21:35:43] I filed a bug for gerrit
[21:35:50] It just refused to let me type a /
[21:35:54] (03CR) 10Legoktm: [C: 03+2] mailman3: Monitor runners [puppet] - 10https://gerrit.wikimedia.org/r/687741 (https://phabricator.wikimedia.org/T282366) (owner: 10Legoktm)
[21:36:04] (03CR) 10Legoktm: [C: 03+2] lists: Update monitoring [puppet] - 10https://gerrit.wikimedia.org/r/687813 (owner: 10Legoktm)
[21:43:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8510 bytes in 0.260 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:44:34] !log restarted mailman3 again (T282348) pymysql.err.InternalError: (1205, 'Lock wait timeout exceeded; try restarting transaction')
[21:44:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:39] T282348: Mailman3 bounce runner is running very slowly - https://phabricator.wikimedia.org/T282348
[21:45:23] PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:48:47] RECOVERY - mailman3_runners on lists1001 is OK: PROCS OK: 14 processes with UID = 38 (list), regex args /usr/lib/mailman3/bin/runner https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[21:50:00] 10SRE, 10Wikimedia-Mailing-lists, 10observability, 10Patch-For-Review: Monitor mailman3 runner processes - https://phabricator.wikimedia.org/T282366 (10Legoktm) 05Open→03Resolved a:03Legoktm ` 14:45:23 <+icinga-wm> PROBLEM - mailman3_runners on lists1001 is CRITICAL: PROCS CRITICAL: 13 processes wit...
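
The new mailman3_runners check above is a process-count check: it counts processes owned by the list user (UID 38) whose command line matches /usr/lib/mailman3/bin/runner, and goes critical when a runner has died. A sketch of the same idea using the third-party psutil package follows; the expected count of 14 comes from the recovery message, while everything else is an assumption rather than the actual check_procs/NRPE invocation.

    # Sketch of a runner process-count check in the style of mailman3_runners.
    # Values from the log: runner path and UID 38; the expected count of 14 is
    # from the recovery message. Requires the third-party psutil package.
    import sys
    import psutil

    RUNNER_PATH = "/usr/lib/mailman3/bin/runner"
    LIST_UID = 38   # UID of the "list" user, from the alert text
    EXPECTED = 14   # from "PROCS OK: 14 processes" in the log

    def count_runners() -> int:
        count = 0
        for proc in psutil.process_iter(attrs=["uids", "cmdline"]):
            uids = proc.info.get("uids")
            cmdline = proc.info.get("cmdline") or []
            if uids and uids.real == LIST_UID and any(RUNNER_PATH in arg for arg in cmdline):
                count += 1
        return count

    if __name__ == "__main__":
        found = count_runners()
        if found < EXPECTED:
            print(f"PROCS CRITICAL: {found} processes matching {RUNNER_PATH}")
            sys.exit(2)
        print(f"PROCS OK: {found} processes matching {RUNNER_PATH}")
        sys.exit(0)
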
[21:52:30] (03PS3) 10Legoktm: icinga::ircbot: Send Wikidata notifications to #wikidata-feed [puppet] - 10https://gerrit.wikimedia.org/r/686697 (https://phabricator.wikimedia.org/T282301) (owner: 101997kB)
[21:54:36] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29462/console" [puppet] - 10https://gerrit.wikimedia.org/r/686697 (https://phabricator.wikimedia.org/T282301) (owner: 101997kB)
[21:58:00] 10SRE, 10MediaWiki-General, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.28; 2020-04-14), and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Krinkle) Production configuration still runs with the "HHVM-like" shim a...
[21:58:08] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Mailman 3: Months with no emails are still listed in the archive - https://phabricator.wikimedia.org/T282341 (10Legoktm) I think this is https://gitlab.com/mailman/hyperkitty/-/issues/255
[22:38:05] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:39:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:28:45] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:29:15] (03PS2) 10Southparkfan: Add WMCS specific cloud role for syslog server [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717)
[23:29:29] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:48:46] (Processor usage over 85%) firing: Processor usage over 85% - https://alerts.wikimedia.org
[23:53:46] (Processor usage over 85%) resolved: Processor usage over 85% - https://alerts.wikimedia.org