[00:00:04] <jouncebot>	 twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210422T0000).
[00:00:54] <legoktm>	 fun fact: we have 905 list admins
[00:03:13] <legoktm>	 !log subscribed all list admins to the listadmins@ mailing list (T280716)
[00:03:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:23] <stashbot>	 T280716: Ensure listadmins@ membership is up to date - https://phabricator.wikimedia.org/T280716
[00:06:37] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-eventlogging-saltrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:25] <wikibugs>	 SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm) I just sent an announcement to all list administrators: https://lists.wikimedia.org/pipermail/listadmins/2021-April/000344.html
[00:10:43] <icinga-wm>	 PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 11384 MB (5% inode=73%): /tmp 11384 MB (5% inode=73%): /var/tmp 11384 MB (5% inode=73%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[00:14:36] <wikibugs>	 (PS1) Legoktm: mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716)
[00:15:00] <wikibugs>	 (PS2) Legoktm: mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716)
[00:15:39] <wikibugs>	 (CR) jerkins-bot: [V: -1] mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716) (owner: Legoktm)
[00:15:56] <wikibugs>	 (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29152/console" [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716) (owner: Legoktm)
[00:18:05] <legoktm>	 I'll look at deneb in a minute
[00:21:14] <legoktm>	 Apr 19 16:06:22 deneb docker-report-releng[7966]: /var/lib/dpkg/info/debmonitor-client.postinst: line 7: systemd-sysusers: command not found
[00:24:53] <wikibugs>	 SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (Legoktm)
[00:27:53] <legoktm>	 !log legoktm@deneb:/var/cache/apt/archives$ sudo rm -rf * # cleaned up 6GB
[00:28:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:11] <icinga-wm>	 RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[00:28:40] <legoktm>	 !log legoktm@deneb:/var/cache/pbuilder/aptcache$ sudo rm -rf * # Cleaned up 8GB more
[00:28:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:03] <legoktm>	 /dev/vda1       230G  172G   47G  79% /
[00:39:47] <wikibugs>	 (PS1) Reedy: Update messages used for tech CoC [mediawiki-config] - https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886)
[00:43:17] <wikibugs>	 (PS1) Reedy: Add wmgUseFooterTechCodeOfConductLink to replace wmgUseFooterCodeOfConductLink [mediawiki-config] - https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886)
[00:43:19] <wikibugs>	 (PS1) Reedy: Flip variables in wmgUseFooterCodeOfConductLink [mediawiki-config] - https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886)
[00:43:48] <wikibugs>	 (CR) Reedy: "This can probably go before the parent (as that needs MW changes and the train)" [mediawiki-config] - https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[01:07:27] <icinga-wm>	 RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:13] <wikibugs>	 (PS3) Legoktm: mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716)
[01:41:08] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Legoktm) Actually that was the wrong list, I sent one to test4@polymorphic and it didn't get archived properly...ugh.
[01:45:38] <wikibugs>	 (CR) Legoktm: [C: +2] mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716) (owner: Legoktm)
[01:47:03] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:50:11] <wikibugs>	 SRE, Wikimedia-Mailing-lists, Patch-For-Review: Ensure listadmins@ membership is up to date - https://phabricator.wikimedia.org/T280716 (Legoktm) Open→Resolved
[01:51:57] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:47:22] <logmsgbot>	 !log krinkle@deploy1002 Started deploy [integration/docroot@010e445]: (no justification provided)
[02:47:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:31] <logmsgbot>	 !log krinkle@deploy1002 Finished deploy [integration/docroot@010e445]: (no justification provided) (duration: 00m 09s)
[02:47:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:53] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:09:13] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114418 bytes in 9.490 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:23:55] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:27:23] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:11] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114413 bytes in 7.102 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:37:49] <ori>	 legoktm: is there a way of seeing which list(s) I'm an admin of? they are almost certainly obsolete
[03:38:35] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:38:45] <legoktm>	 ori: yeah, I can look it up for you - do you know what email address you used?
[03:39:49] <ori>	 I got the notice at olivneh@wikimedia.org so I suppose it's that
[03:39:54] <ori>	 maybe ori@wikimedia.org
[03:44:25] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:46:28] <legoktm>	 ori: I don't see either of those addresses as a list admin, nor were they among the addresses I added to the listadmins@ list today, so I'm guessing you were already subscribed to the list because you administered some list in the past?
[03:47:05] <ori>	 oh could be. I didn't realize listadmins@ was itself a list
[03:47:15] <ori>	 I'll just unsubscribe myself, then
[03:48:09] <ori>	 thanks for checking
[03:48:24] <legoktm>	 yeah, I debated clearing the subscriber list and starting anew but didn't want to accidentally remove people who should be informed
[03:48:26] <legoktm>	 np!
[04:26:39] <wikibugs>	 SRE, Dumps-Generation, SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (ArielGlenn) >>! In T277629#7024732, @akosiaris wrote: > Any news on this?   I apologize but it's still not possible t...
[06:28:15] <wikibugs>	 (PS5) Legoktm: mailman3: Add remove_from_lists helper [puppet] - https://gerrit.wikimedia.org/r/675353
[06:28:17] <wikibugs>	 (PS6) Legoktm: mailman3: Add discard_held_messages script and timer [puppet] - https://gerrit.wikimedia.org/r/675356
[06:45:10] <wikibugs>	 (CR) Legoktm: [C: +2] "> Patch Set 4: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/675353 (owner: Legoktm)
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210422T0700)
[07:01:07] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:01:35] <icinga-wm>	 PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:46] <wikibugs>	 (CR) Legoktm: [C: +2] mailman3: Add discard_held_messages script and timer [puppet] - https://gerrit.wikimedia.org/r/675356 (owner: Legoktm)
[07:30:19] <wikibugs>	 SRE, DBA, Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (Legoktm)
[07:42:01] <wikibugs>	 SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (Volans) a:jbond Which version was trying to install? The latest version of debmonitor-client does this very check, see https://gerrit.wikimedia...
[07:53:33] <wikibugs>	 SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (MoritzMuehlenhoff) The Docker image probably just needs a rebuild to pull in the fixed debmonitor-client package.
[08:46:11] <wikibugs>	 (PS1) Alex Monk: Remove Yuvi's cloud-wide root key per request on IRC [labs/private] - https://gerrit.wikimedia.org/r/681935
[08:49:21] <wikibugs>	 (CR) Yuvipanda: [C: +1] "I remember getting my key added here, and how amazing it felt <3" [labs/private] - https://gerrit.wikimedia.org/r/681935 (owner: Alex Monk)
[09:23:35] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:03] <icinga-wm>	 RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:43:58] <wikibugs>	 SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (jbond) I think this is just an old log entry from before i uploaded the new package (21/04/2021).  debmonitor is installed into the docker image by...
[09:45:15] <icinga-wm>	 RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:11] <wikibugs>	 (PS2) Volans: setup.py: support more recent PyParsing versions [software/cumin] - https://gerrit.wikimedia.org/r/681758
[10:19:35] <icinga-wm>	 PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:29] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 1.139e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:51:13] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.562 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:47:33] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114122 bytes in 3.494 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[11:54:59] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[12:01:02] <_joe_>	 so what's up with dbtree
[12:09:36] <_joe_>	 it looks like it has issues getting data back from db1115
[12:10:16] <_joe_>	 load average on the db server is 170
[12:30:42] <marostegui>	 Yes, I am taking a look
[12:31:02] <marostegui>	 !log Restart mysql on db1115 (tendril/dbtree will fail)
[12:31:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:50] <icinga-wm>	 PROBLEM - MariaDB Replica IO: db_inventory #page on db2093 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1115.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1115.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:37:59] <icinga-wm>	 PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:10] <marostegui>	 that's me
[12:38:11] <marostegui>	 gah
[12:38:13] <marostegui>	 sorry for the page
[12:38:58] <godog>	 ack, no worries
[12:39:40] <kormat>	 marostegui: i knew in my heart it must be your fault. <3
[12:39:57] <rzl>	 hi
[12:39:57] <sobanski>	 Phew 😥
[12:40:11] <rzl>	 oh good :)
[12:40:13] <marostegui>	 Yeah, trying to fix tendril :(
[12:40:30] <Urbanecm>	 i guess it's better to have pages for known reasons than otherwise :-)
[12:46:37] <Urbanecm>	 !log Start server-side upload for 2 video files (T280763, T280524)
[12:46:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:47] <stashbot>	 T280524: Server side upload for Butko - https://phabricator.wikimedia.org/T280524
[12:46:48] <stashbot>	 T280763: Server side upload for Butko - https://phabricator.wikimedia.org/T280763
[12:50:04] <icinga-wm>	 RECOVERY - MariaDB Replica IO: db_inventory #page on db2093 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:50:15] <icinga-wm>	 RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:50:27] <marostegui>	 I still don't know what's wrong with tendril
[12:50:30] <marostegui>	 Still investigating
[13:01:09] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 106983 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:15:57] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:17:17] <icinga-wm>	 PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:23] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service Marostegui known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:23] <icinga-wm>	 ACKNOWLEDGEMENT - HTTPS-dbtree on dbmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.016 second response time Marostegui known https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:19:30] <marostegui>	 !log Tendril and dbtree are down at the moment
[13:19:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:17] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114121 bytes in 0.867 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:23:39] <marostegui>	 !log Tendril and dbtree are up but in a degraded state (slow response)
[13:23:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:05] <icinga-wm>	 RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:45] <wikibugs>	 SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1156 is corrupted and needs to be recloned, probably best to use a logical dump.
[13:48:11] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:51:17] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1475.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:12:51] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 110227 bytes in 9.878 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[14:20:17] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[14:21:22] <_joe_>	 marostegui: need any help?
[14:21:50] <_joe_>	 marostegui: I saw quite a few huge procedures that looked like they were in a deadlock or something tbh
[14:21:55] <marostegui>	 _joe_: I am out of battery on my laptop, but we can live with dbtree/tendril being slow today, I might take a look later today or tomorrow
[14:21:59] <marostegui>	 _joe_: yes, usual tendril
[14:22:30] <_joe_>	 marostegui: ack, lemme know if you want me to take a look, but I can mostly apply blunt force
[14:37:47] <wikibugs>	 (PS1) Ottomata: refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789)
[14:38:01] <wikibugs>	 (PS2) Ottomata: refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789)
[14:38:03] <wikibugs>	 (CR) jerkins-bot: [V: -1] refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:41:30] <wikibugs>	 (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29153/console" [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:49:47] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 110711 bytes in 9.225 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[14:55:11] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:57:13] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:28:32] <wikibugs>	 (PS1) Ottomata: Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416)
[15:29:03] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 111719 bytes in 8.226 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:29:10] <wikibugs>	 (PS2) Ottomata: Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416)
[15:36:25] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:36:38] <wikibugs>	 (CR) Andrew Bogott: [V: +2 C: +2] Remove Yuvi's cloud-wide root key per request on IRC [labs/private] - https://gerrit.wikimedia.org/r/681935 (owner: Alex Monk)
[15:36:46] <wikibugs>	 (CR) Andrew Bogott: [V: +2 C: +2] "alas!" [labs/private] - https://gerrit.wikimedia.org/r/681935 (owner: Alex Monk)
[15:40:51] <wikibugs>	 (PS1) WMDE-Fisch: Fix suggested values not being shown when the param's type isn't specified [extensions/TemplateData] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/681969 (https://phabricator.wikimedia.org/T280688)
[15:51:09] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 111735 bytes in 9.595 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:58:09] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[16:00:45] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1409.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:07:38] <wikibugs>	 SRE, Cassandra, Dependency-Tracking, Wikibase-Quality-Constraints, and 4 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (Addshore)
[16:10:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:26:22] <wikibugs>	 (CR) Ottomata: [V: +1 C: +2] refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[16:26:54] <wikibugs>	 SRE, Wikimedia-Mailing-lists: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/<listname> should redirect to postorius - https://phabricator.wikimedia.org/T280893 (Ladsgroup)
[16:28:27] <wikibugs>	 SRE, Wikimedia-Mailing-lists: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/<listname> should redirect to postorius - https://phabricator.wikimedia.org/T280893 (Ladsgroup) After all lists are migrated or each one? Doing it one by one seems a bit complicated, either apache c...
[16:32:59] <logmsgbot>	 !log volker-e@deploy1002 Started deploy [design/style-guide@e914e8a]: Deploy design/style-guide: e914e8a icons: Add 'share' icon (#455)
[16:33:05] <logmsgbot>	 !log volker-e@deploy1002 Finished deploy [design/style-guide@e914e8a]: Deploy design/style-guide: e914e8a icons: Add 'share' icon (#455) (duration: 00m 06s)
[16:33:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:27] <wikibugs>	 (CR) ArielGlenn: "is maintained *by* volunteers... Otherwise it looks fine to me, whenever you folks decide it's good I'm happy to merge it on through." [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[16:46:55] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114424 bytes in 2.721 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[16:48:01] <wikibugs>	 (CR) JoKalliauer: [C: -1] "A up-to-date list is essential for the Commons-community to be able to create SVGs. Detailed informaiton can be found at https://meta.wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/681665 (owner: Alexandros Kosiaris)
[16:54:13] <wikibugs>	 (PS3) Ottomata: Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416)
[16:54:21] <wikibugs>	 (CR) Ottomata: "Good catch, ty." [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[16:59:35] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 793.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:15:53] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:16:27] <wikibugs>	 (PS3) Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - https://gerrit.wikimedia.org/r/681432
[17:17:18] <wikibugs>	 (CR) jerkins-bot: [V: -1] helmfile: install a simple deployment shell [puppet] - https://gerrit.wikimedia.org/r/681432 (owner: Giuseppe Lavagetto)
[17:26:56] <marostegui>	 !log Stop mysql on tendril/dbtree database
[17:27:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:48] <wikibugs>	 (PS2) Phuedx: Clean-up decommisioned Print schema configs [mediawiki-config] - https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: Polishdeveloper)
[17:33:29] <icinga-wm>	 PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[17:34:43] <icinga-wm>	 PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:59] <icinga-wm>	 PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:39] <icinga-wm>	 RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 108738 bytes in 0.479 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[17:36:09] <icinga-wm>	 PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:13] <icinga-wm>	 PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:13] <icinga-wm>	 RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:17] <icinga-wm>	 RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:05] <icinga-wm>	 RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:17] <icinga-wm>	 RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:44] <wikibugs>	 (PS1) Ladsgroup: snapshot: Migrate cronjobs in shorturl to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682010 (https://phabricator.wikimedia.org/T273673)
[19:01:46] <wikibugs>	 (PS1) Ladsgroup: snapshot: Migrate cronjobs in contentxlation to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682011 (https://phabricator.wikimedia.org/T273673)
[19:01:48] <wikibugs>	 (PS1) Ladsgroup: snapshot: Migrate cronjobs in mediaperprojectlists to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682012 (https://phabricator.wikimedia.org/T273673)
[19:19:09] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[20:16:37] <wikibugs>	 (CR) Gehel: "Minor comments inline, feel free to ping me to discuss" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[20:24:58] <wikibugs>	 (CR) Gehel: "another minor comment" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[20:25:21] <wikibugs>	 (CR) Gehel: "minor question, otherwise LGTM" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[20:41:34] <wikibugs>	 (CR) Sascha: [C: +1] "Looks great, thank you!" [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[21:48:16] <wikibugs>	 SRE, ops-eqiad, cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (RobH)
[23:36:04] <wikibugs>	 SRE, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T280668 (wiki_willy) a:Jclark-ctr FYI - this one is out of warranty. @Jclark-ctr - can you see if we have any drives from decom'd servers around?  Thanks, Willy
[23:36:23] <wikibugs>	 SRE, ops-eqiad: Can't access thanos-fe1001.mgmt - https://phabricator.wikimedia.org/T280623 (wiki_willy) a:Cmjohnson
[23:36:41] <wikibugs>	 SRE, ops-eqiad: htmldumper1001 power suply failure - https://phabricator.wikimedia.org/T280618 (wiki_willy) a:Cmjohnson
[23:37:28] <wikibugs>	 SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (wiki_willy) a:wiki_willy→Cmjohnson
[23:55:05] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (wiki_willy) @Cmjohnson - can you provide an update on this one?  This is one of the priority installs.  Thanks, Willy