[01:36:58] PROBLEM - snapshot of s4 in codfw on alert1001 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2021-04-22 01:16:17 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:29:58] PROBLEM - snapshot of s7 in codfw on alert1001 is CRITICAL: snapshot for s7 at codfw taken more than 3 days ago: Most recent backup 2021-04-22 02:06:47 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [02:54:00] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [03:17:30] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:24] what's a good project management software? I want to track things like lead times, processing times, and C/A. I also want to be able to share specific individual tickets with clients. [03:26:40] noway96: that is out of the scope of this channel [03:27:07] yeah.. 
[03:27:13] I don't know where to ask that [03:34:42] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.078 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:37:17] SRE, observability: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (Reedy) [03:39:14] SRE, observability: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (Reedy) [03:39:49] SRE, observability: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (Reedy) [03:55:20] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [04:11:04] PROBLEM - snapshot of s5 in codfw on alert1001 is CRITICAL: snapshot for s5 at codfw taken more than 3 days ago: Most recent backup 2021-04-22 04:02:23 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [05:19:23] (PS2) ArielGlenn: snapshot: Migrate cronjobs in commonsdumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682260 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [05:21:19] (CR) ArielGlenn: "Deploying this during the window when these jobs are not running." 
[puppet] - https://gerrit.wikimedia.org/r/682260 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [05:21:24] (CR) ArielGlenn: [C: +2] snapshot: Migrate cronjobs in commonsdumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682260 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [05:22:15] (PS2) ArielGlenn: snapshot: Migrate cronjobs in wikidatadumps to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682261 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [05:23:56] (CR) ArielGlenn: [C: +2] "Deploying this during the window when these jobs are not running." [puppet] - https://gerrit.wikimedia.org/r/682261 (https://phabricator.wikimedia.org/T273673) (owner: Ladsgroup) [05:31:47] (Primary outbound port utilisation over 80% #page) firing: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [05:31:47] (Primary inbound port utilisation over 80% #page) firing: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [05:36:47] (Primary outbound port utilisation over 80% #page) resolved: Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org [05:36:47] (Primary inbound port utilisation over 80% #page) resolved: Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org [05:52:30] PROBLEM - snapshot of s6 in codfw on alert1001 is CRITICAL: snapshot for s6 at codfw taken more than 3 days ago: Most recent backup 2021-04-22 05:38:37 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [06:17:00] SRE, observability: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (ArielGlenn) If people move stuff off of /srv/security we could get .5T back which would be helpful. Some of those files are from a few years ago. The big spender in /srv/mw-log is Exte... [07:00:04] Deploy window No deploys all day! 
See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210425T0700) [08:36:40] the Primary outbound usage "pages" above were related to mr1-eqsin, that shouldn't page, the alert msg is wrong. Opening a task for it (it happened another time and I forgot to open a task :D) [08:40:29] SRE, netops: mr1 port utilization alerts shouldn't mention "#page" in their IRC logs - https://phabricator.wikimedia.org/T281055 (elukey) [08:41:08] elukey: maybe remove the magic word from the task title, so any activity on that won't ping people via wikibugs? :D [08:41:46] SRE, netops: mr1 port utilization alerts shouldn't mention hash page in their IRC logs - https://phabricator.wikimedia.org/T281055 (Legoktm) [08:42:22] Majavah: good point :) [08:42:47] it pinged me :p [08:42:53] hahahahah [08:42:56] sorryyyyy [08:43:29] no worries :) [08:58:52] SRE, Wikimedia-Incident: Uncached wiki requests partially unavailable due to excessive request rates from a bot - https://phabricator.wikimedia.org/T280232 (Schlurcher) Hi all, I have resumed my bot edits in Commons at 1/10th of the edit rate that led to this. I made sure that the coding includes maxlag... 
[12:39:56] (CR) Southparkfan: Add WMCS specific cloud role for syslog server (1 comment) [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan) [14:00:26] RECOVERY - Disk space on mwlog1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=mwlog1001&var-datasource=eqiad+prometheus/ops [14:05:48] (PS1) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:07:03] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:08:40] (PS2) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:10:00] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:15:15] (PS3) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:16:28] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:20:19] (PS4) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:21:33] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial 
config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:24:34] (PS5) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:25:50] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:28:39] (PS6) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:29:59] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:30:52] (PS7) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:32:07] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:37:44] (PS1) Andrew Bogott: dummy secrets for radosgw [labs/private] - https://gerrit.wikimedia.org/r/682319 [14:39:17] (PS8) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:40:40] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) 
[14:41:57] (CR) Andrew Bogott: [V: +2 C: +2] dummy secrets for radosgw [labs/private] - https://gerrit.wikimedia.org/r/682319 (owner: Andrew Bogott) [14:46:01] (PS9) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:47:17] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:49:55] (PS10) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:51:10] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:52:14] (PS11) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:53:29] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [14:58:18] (PS12) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [14:59:32] (CR) jerkins-bot: [V: -1] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [15:15:32] (PS13) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - 
https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [15:20:52] (PS14) Andrew Bogott: cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) [15:22:28] SRE, Performance-Team, Platform Engineering, observability: mwlog1001 is running out of free space on /srv/mw-log - https://phabricator.wikimedia.org/T281048 (Reedy) >>! In T281048#7031656, @ArielGlenn wrote: > If people move stuff off of /srv/security we could get .5T back which would be helpful... [15:22:46] (CR) Andrew Bogott: [C: +2] cloud-vps ceph: initial config for adding radosgw to control nodes [puppet] - https://gerrit.wikimedia.org/r/682317 (https://phabricator.wikimedia.org/T276961) (owner: Andrew Bogott) [15:23:11] !log sudo -u list /var/lib/mailman/bin/change_pw -l wikica-l -p $(pwgen -c1 -s 12) (T281066) [15:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:22] T281066: Recover admin password for mailing list Wikica-l - https://phabricator.wikimedia.org/T281066 [15:29:41] (PS1) Andrew Bogott: ceph.conf: replace True and False with true and false [puppet] - https://gerrit.wikimedia.org/r/682320 [15:30:16] PROBLEM - Check systemd state on install1003 is CRITICAL: CRITICAL - degraded: The following units failed: squid.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:31:26] (CR) Andrew Bogott: [C: +2] ceph.conf: replace True and False with true and false [puppet] - https://gerrit.wikimedia.org/r/682320 (owner: Andrew Bogott) [15:31:44] PROBLEM - Squid on install1003 is CRITICAL: connect to address 208.80.154.32 and port 8080: Connection refused https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:36:21] (PS1) Reedy: Move ExternalStore log group from debug to error [mediawiki-config] - https://gerrit.wikimedia.org/r/682322 
(https://phabricator.wikimedia.org/T281048) [15:39:10] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Urbanecm) Unfortunately, it seems to be impossible to download the files from archive.org reliably. ` [urbanecm@notebook ~]$ ssh mwmaint1002.eqiad.wmnet [urbanecm@mwmaint... [15:47:24] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01001 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:51:36] RECOVERY - Squid on install1003 is OK: TCP OK - 0.000 second response time on 208.80.154.32 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:52:11] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Languageseeker) I'm actually not surprised that the c01 files are failing. For some reason, SRE does not seem to like the Tif from the British Library. However, I've been... [15:52:18] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Urbanecm) Noting that wget said `Read error at byte 4192239 (Success).Retrying.` several times during the download process. [15:52:38] RECOVERY - Check systemd state on install1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:03] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (RhinosF1) Noting at a similar time to Martin I saw: > ERROR    - Unexpected error (("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLen... 
[15:56:12] wikibugs is slow [15:57:39] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Languageseeker) As a shot in the dark, would it be possible to try with something else than wget? [16:15:30] (PS1) BryanDavis: toolforge: Add all tools created before 2020-07-07 to legacy_redirector [puppet] - https://gerrit.wikimedia.org/r/682325 (https://phabricator.wikimedia.org/T281003) [16:19:17] (CR) BryanDavis: "Not tested at all, so test/merge carefully." [puppet] - https://gerrit.wikimedia.org/r/682325 (https://phabricator.wikimedia.org/T281003) (owner: BryanDavis) [16:27:22] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Languageseeker) I don’t think it’s the IA because I tried uploading these files via Pattypan and kept getting a myriad of errors. Failure seems almost certain and success... [16:34:52] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Kizule) >>! In T281019#7032016, @Languageseeker wrote: > As a shot in the dark, would it be possible to try with something else than wget? curl is a possible option too,... [16:53:38] (CR) Majavah: Add WMCS specific cloud role for syslog server (3 comments) [puppet] - https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: Southparkfan) [17:11:05] SRE, Internet-Archive, Wikimedia-Site-requests: Please Upload large files to Commons - https://phabricator.wikimedia.org/T281019 (Urbanecm) Using curl did not help either: ` [urbanecm@mwmaint1002 ~/tmp]$ curl -O 'http://web.archive.org/web/20150905070709if_/http://www.quartos.org/quarto_images/ham-1... 
[17:14:38] PROBLEM - SSH on mw1279.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [18:15:58] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:44:29] SRE, WMF-JobQueue: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113 (Aklapper) Is this ticket obsolete now that {T198220} is resolved? [20:10:20] (PS1) Luke081515: Enable Wikidata description override on dewiki at beta [mediawiki-config] - https://gerrit.wikimedia.org/r/682337 (https://phabricator.wikimedia.org/T279829) [20:20:27] hi, a few minutes ago I published the latest edition of the signpost, including sending mass messages both locally on enwiki and globally on meta, but the User:MediaWiki message delivery isn't making any edits to deliver the message. Is this a bug? Or just some delay in processing? [20:20:51] mass message is often broken [20:21:12] (PS1) Luke081515: Enable local uploads on French Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/682338 (https://phabricator.wikimedia.org/T280019) [20:21:24] this is the first time it's refused to deliver a message for me [20:28:36] yeah, I tried again with a test message to a smaller list, nothing [20:32:43] https://phabricator.wikimedia.org/T281072 [20:44:29] does it work on beta? 
[21:10:00] DannyS712: last MassMessage delivery on meta happened yesterday, so I'd say it's MM being buggy as usua [21:10:02] usual* [21:10:52] (CR) Urbanecm: [C: -2] "pending an EDP, I'll post a comment on task today or tomorrow" [mediawiki-config] - https://gerrit.wikimedia.org/r/682338 (https://phabricator.wikimedia.org/T280019) (owner: Luke081515) [21:48:13] (PS1) Urbanecm: GrowthExperiments: Do not enable community configuration outside of beta wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/682339 (https://phabricator.wikimedia.org/T274520) [21:59:40] RECOVERY - snapshot of s1 in codfw on alert1001 is OK: Last snapshot for s1 at codfw (db2097.codfw.wmnet:3311) taken on 2021-04-25 20:39:14 (1056 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [22:31:10] RECOVERY - snapshot of s8 in codfw on alert1001 is OK: Last snapshot for s8 at codfw (db2100.codfw.wmnet:3318) taken on 2021-04-25 20:57:13 (1243 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [22:40:25] SRE, Wikimedia-Mailing-lists: hyperkitty didn't import all wikitech-l messages - https://phabricator.wikimedia.org/T281070 (Legoktm) p:High→Low I spot checked some, these all seem malformed in some way, e.g. https://lists.wikimedia.org/pipermail/wikitech-l/2002-November/005560.html the from addre... [22:56:33] SRE, WMF-JobQueue: Rationalize our jobqueues redis topology - https://phabricator.wikimedia.org/T135113 (Krinkle) Open→Declined Yeah, I'm going to assume so. Also: * {T267581} * {T206016} * {T280582}