[00:00:04] twentyafterfour: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Phabricator update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210422T0000).
[00:00:54] fun fact: we have 905 list admins
[00:03:13] !log subscribed all list admins to the listadmins@ mailing list (T280716)
[00:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:23] T280716: Ensure listadmins@ membership is up to date - https://phabricator.wikimedia.org/T280716
[00:06:37] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: refinery-eventlogging-saltrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:07:25] SRE, Security-Team, Wikimedia-Mailing-lists: Upgrade GNU Mailman from 2.1 to Mailman3 - https://phabricator.wikimedia.org/T52864 (Legoktm) I just sent an announcement to all list administrators: https://lists.wikimedia.org/pipermail/listadmins/2021-April/000344.html
[00:10:43] PROBLEM - Disk space on deneb is CRITICAL: DISK CRITICAL - free space: / 11384 MB (5% inode=73%): /tmp 11384 MB (5% inode=73%): /var/tmp 11384 MB (5% inode=73%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[00:14:36] (PS1) Legoktm: mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716)
[00:15:00] (PS2) Legoktm: mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716)
[00:15:39] (CR) jerkins-bot: [V: -1] mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716) (owner: Legoktm)
[00:15:56] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29152/console" [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716) (owner: Legoktm)
[00:18:05] I'll look at deneb in a minute
[00:21:14] Apr 19 16:06:22 deneb docker-report-releng[7966]: /var/lib/dpkg/info/debmonitor-client.postinst: line 7: systemd-sysusers: command not found
[00:24:53] SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (Legoktm)
[00:27:53] !log legoktm@deneb:/var/cache/apt/archives$ sudo rm -rf * # cleaned up 6GB
[00:28:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:28:11] RECOVERY - Disk space on deneb is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=deneb&var-datasource=codfw+prometheus/ops
[00:28:40] !log legoktm@deneb:/var/cache/pbuilder/aptcache$ sudo rm -rf * # Cleaned up 8GB more
[00:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:29:03] /dev/vda1 230G 172G 47G 79% /
[00:39:47] (PS1) Reedy: Update messages used for tech CoC [mediawiki-config] - https://gerrit.wikimedia.org/r/681834 (https://phabricator.wikimedia.org/T280886)
[00:43:17] (PS1) Reedy: Add wmgUseFooterTechCodeOfConductLink to replace wmgUseFooterCodeOfConductLink [mediawiki-config] - https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886)
[00:43:19] (PS1) Reedy: Flip variables in wmgUseFooterCodeOfConductLink [mediawiki-config] - https://gerrit.wikimedia.org/r/681836 (https://phabricator.wikimedia.org/T280886)
[00:43:48] (CR) Reedy: "This can probably go before the parent (as that needs MW changes and the train)" [mediawiki-config] - https://gerrit.wikimedia.org/r/681835 (https://phabricator.wikimedia.org/T280886) (owner: Reedy)
[01:07:27] RECOVERY - SSH on mw1279.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[01:40:13] (PS3) Legoktm: mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716)
[01:41:08] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Install mailman3 and mailman2 at the same time on the cloud - https://phabricator.wikimedia.org/T278612 (Legoktm) Actually that was the wrong list, I sent one to test4@polymorphic and it didn't get archived properly...ugh.
[01:45:38] (CR) Legoktm: [C: +2] mailman: Add script to dump all list admins [puppet] - https://gerrit.wikimedia.org/r/681823 (https://phabricator.wikimedia.org/T280716) (owner: Legoktm)
[01:47:03] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[01:50:11] SRE, Wikimedia-Mailing-lists, Patch-For-Review: Ensure listadmins@ membership is up to date - https://phabricator.wikimedia.org/T280716 (Legoktm) Open→Resolved
[01:51:57] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[02:47:22] !log krinkle@deploy1002 Started deploy [integration/docroot@010e445]: (no justification provided)
[02:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:47:31] !log krinkle@deploy1002 Finished deploy [integration/docroot@010e445]: (no justification provided) (duration: 00m 09s)
[02:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:01:53] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:09:13] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114418 bytes in 9.490 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:23:55] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:27:23] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:31:11] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114413 bytes in 7.102 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:37:49] legoktm: is there a way of seeing which list(s) I'm an admin of? they are almost certainly obsolete
[03:38:35] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[03:38:45] ori: yeah, I can look it up for you - do you know what email address you used?
[03:39:49] I got the notice at olivneh@wikimedia.org so I suppose it's that
[03:39:54] maybe ori@wikimedia.org
[03:44:25] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:46:28] ori: I don't see either of those addresses as a list admin, nor were those in the list that I added to the listadmin@ list today, so I'm guessing you were already subscribed to the list because of admining some list in the past?
[03:47:05] oh could be. I didn't realize listadmin@ was itself a list
[03:47:15] I'll just unsubscribe myself, then
[03:48:09] thanks for checking
[03:48:24] yeah, I debated clearing the subscriber list and starting anew but didn't want to accidentally remove people who should be informed
[03:48:26] np!
[04:26:39] SRE, Dumps-Generation, SRE-Access-Requests: Create new group for root access to snapshot*, dumpsdata* and labstore1006,7 with holger in it - https://phabricator.wikimedia.org/T277629 (ArielGlenn) >>! In T277629#7024732, @akosiaris wrote: > Any news on this? I apologize but it's still not possible t...
[06:28:15] (PS5) Legoktm: mailman3: Add remove_from_lists helper [puppet] - https://gerrit.wikimedia.org/r/675353
[06:28:17] (PS6) Legoktm: mailman3: Add discard_held_messages script and timer [puppet] - https://gerrit.wikimedia.org/r/675356
[06:45:10] (CR) Legoktm: [C: +2] "> Patch Set 4: Code-Review-1" [puppet] - https://gerrit.wikimedia.org/r/675353 (owner: Legoktm)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210422T0700)
[07:01:07] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:01:35] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:19:46] (CR) Legoktm: [C: +2] mailman3: Add discard_held_messages script and timer [puppet] - https://gerrit.wikimedia.org/r/675356 (owner: Legoktm)
[07:30:19] SRE, DBA, Wikimedia-Mailing-lists: Upgrade lists-next to bullseye mailman versions - https://phabricator.wikimedia.org/T280887 (Legoktm)
[07:42:01] SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (Volans) a:jbond Which version was trying to install? The latest version of debmonitor-client does this very check, see https://gerrit.wikimedia...
[07:53:33] SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (MoritzMuehlenhoff) The Docker image probably just needs a rebuild to pull in the fixed debmonitor-client package.
[08:46:11] (PS1) Alex Monk: Remove Yuvi's cloud-wide root key per request on IRC [labs/private] - https://gerrit.wikimedia.org/r/681935
[08:49:21] (CR) Yuvipanda: [C: +1] "I remember getting my key added here, and how amazing it felt <3" [labs/private] - https://gerrit.wikimedia.org/r/681935 (owner: Alex Monk)
[09:23:35] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:24:03] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:43:58] SRE, SRE-tools: debmonitor-client.postinst: line 7: systemd-sysusers: command not found on stretch docker images - https://phabricator.wikimedia.org/T280892 (jbond) I think this is just an old log entry from before i uploaded the new package (21/04/2021). debmonitor is installed into the docker image by...
[09:45:15] RECOVERY - Check systemd state on deneb is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:51:11] (PS2) Volans: setup.py: support more recent PyParsing versions [software/cumin] - https://gerrit.wikimedia.org/r/681758
[10:19:35] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:50:29] PROBLEM - WDQS high update lag on wdqs1004 is CRITICAL: 1.139e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[10:51:13] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 2.562 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[11:47:33] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114122 bytes in 3.494 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[11:54:59] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[12:01:02] <_joe_> so what's up with dbtree
[12:09:36] <_joe_> it looks like it has issues getting data back from db1115
[12:10:16] <_joe_> load average on the db server is 170
[12:30:42] Yes, I am taking a look
[12:31:02] !log Restart mysql on db1115 (tendril/dbtree will fail)
[12:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:50] PROBLEM - MariaDB Replica IO: db_inventory #page on db2093 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1115.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1115.eqiad.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:37:59] PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:38:10] that's me
[12:38:11] gah
[12:38:13] sorry for the page
[12:38:58] ack, no worries
[12:39:40] marostegui: i knew in my heart it must be your fault. <3
[12:39:57] hi
[12:39:57] Phew 😥
[12:40:11] oh good :)
[12:40:13] Yeah, trying to fix tendril :(
[12:40:30] i guess it's better to have pages for known reasons than otherwise :-)
[12:46:37] !log Start server-side upload for 2 video files (T280763, T280524)
[12:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:47] T280524: Server side upload for Butko - https://phabricator.wikimedia.org/T280524
[12:46:48] T280763: Server side upload for Butko - https://phabricator.wikimedia.org/T280763
[12:50:04] RECOVERY - MariaDB Replica IO: db_inventory #page on db2093 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:50:15] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:50:27] I still don't know what's wrong with tendril
[12:50:30] Still investigating
[13:01:09] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 106983 bytes in 0.472 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:15:57] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.016 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:17:17] PROBLEM - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:23] ACKNOWLEDGEMENT - Check systemd state on dbmonitor1002 is CRITICAL: CRITICAL - degraded: The following units failed: tendril-5m.service,tendril-queries.service Marostegui known https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:23] ACKNOWLEDGEMENT - HTTPS-dbtree on dbmonitor1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 354 bytes in 0.016 second response time Marostegui known https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:19:30] !log Tendril and dbtree are down at the moment
[13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:17] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114121 bytes in 0.867 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:23:39] !log Tendril and dbtree are up but on a degraded status (slow reponse)
[13:23:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:05] RECOVERY - Check systemd state on dbmonitor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:32:45] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1156 is corrupted and needs to be recloned, probably best to use a logical dump.
[13:48:11] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[13:51:17] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1475.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:12:51] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 110227 bytes in 9.878 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[14:20:17] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[14:21:22] <_joe_> marostegui: need any help?
[14:21:50] <_joe_> marostegui: I saw quite a few huge procedures that looked like they were in a deadlock or something tbh
[14:21:55] _joe_: I am out of battery on my laptop, but we can live with dbtree/tendril being slow today, I might take a look later today or tomorrow
[14:21:59] _joe_: yes, usual tendril
[14:22:30] <_joe_> marostegui: ack, lemme know if you want me to take a look, but I can mostly apply blunt force
[14:37:47] (PS1) Ottomata: refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789)
[14:38:01] (PS2) Ottomata: refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789)
[14:38:03] (CR) jerkins-bot: [V: -1] refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:41:30] (CR) Ottomata: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29153/console" [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[14:49:47] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 110711 bytes in 9.225 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[14:55:11] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:57:13] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:28:32] (PS1) Ottomata: Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416)
[15:29:03] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 111719 bytes in 8.226 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:29:10] (PS2) Ottomata: Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416)
[15:36:25] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:36:38] (CR) Andrew Bogott: [V: +2 C: +2] Remove Yuvi's cloud-wide root key per request on IRC [labs/private] - https://gerrit.wikimedia.org/r/681935 (owner: Alex Monk)
[15:36:46] (CR) Andrew Bogott: [V: +2 C: +2] "alas!" [labs/private] - https://gerrit.wikimedia.org/r/681935 (owner: Alex Monk)
[15:40:51] (PS1) WMDE-Fisch: Fix suggested values not being shown when the param's type isn't specified [extensions/TemplateData] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/681969 (https://phabricator.wikimedia.org/T280688)
[15:51:09] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 111735 bytes in 9.595 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[15:58:09] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[16:00:45] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1409.62 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:07:38] SRE, Cassandra, Dependency-Tracking, Wikibase-Quality-Constraints, and 4 others: Store WikibaseQualityConstraint check data in persistent storage instead of in the cache - https://phabricator.wikimedia.org/T204024 (Addshore)
[16:10:19] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:26:22] (CR) Ottomata: [V: +1 C: +2] refine_sanitize - use refinery 0.1.6 with RefineSanitize job [puppet] - https://gerrit.wikimedia.org/r/681991 (https://phabricator.wikimedia.org/T273789) (owner: Ottomata)
[16:26:54] SRE, Wikimedia-Mailing-lists: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/ should redirect to postorius - https://phabricator.wikimedia.org/T280893 (Ladsgroup)
[16:28:27] SRE, Wikimedia-Mailing-lists: After lists have been migrated, https://lists.wikimedia.org/mailman/listinfo/ should redirect to postorius - https://phabricator.wikimedia.org/T280893 (Ladsgroup) After all lists are migrated or each one? Doing it one by one seems a bit complicated, either apache c...
[16:32:59] !log volker-e@deploy1002 Started deploy [design/style-guide@e914e8a]: Deploy design/style-guide: e914e8a icons: Add 'share' icon (#455)
[16:33:05] !log volker-e@deploy1002 Finished deploy [design/style-guide@e914e8a]: Deploy design/style-guide: e914e8a icons: Add 'share' icon (#455) (duration: 00m 06s)
[16:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:27] (CR) ArielGlenn: "is maintained *by* volunteers... Otherwise it looks fine to me, whenever you folks decide it's good I'm happy to merge it on through." [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[16:46:55] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 114424 bytes in 2.721 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[16:48:01] (CR) JoKalliauer: [C: -1] "A up-to-date list is essential for the Commons-community to be able to create SVGs. Detailed informaiton can be found at https://meta.wiki" [mediawiki-config] - https://gerrit.wikimedia.org/r/681665 (owner: Alexandros Kosiaris)
[16:54:13] (PS3) Ottomata: Add Wikidata QRank link in dumps.wikimedia.org/other/analytics [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416)
[16:54:21] (CR) Ottomata: "Good catch, ty." [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[16:59:35] PROBLEM - MariaDB Replica Lag: s1 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 793.31 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:15:53] RECOVERY - MariaDB Replica Lag: s1 on clouddb1017 is OK: OK slave_sql_lag Replication lag: 0.37 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[17:16:27] (PS3) Giuseppe Lavagetto: helmfile: install a simple deployment shell [puppet] - https://gerrit.wikimedia.org/r/681432
[17:17:18] (CR) jerkins-bot: [V: -1] helmfile: install a simple deployment shell [puppet] - https://gerrit.wikimedia.org/r/681432 (owner: Giuseppe Lavagetto)
[17:26:56] !log Stop mysql on tendril/dbtree database
[17:27:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:29:48] (PS2) Phuedx: Clean-up decommisioned Print schema configs [mediawiki-config] - https://gerrit.wikimedia.org/r/570625 (https://phabricator.wikimedia.org/T196159) (owner: Polishdeveloper)
[17:33:29] PROBLEM - HTTPS-dbtree on dbmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[17:34:43] PROBLEM - Check systemd state on prometheus2003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:34:59] PROBLEM - Check systemd state on prometheus1004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:35:39] RECOVERY - HTTPS-dbtree on dbmonitor1002 is OK: HTTP OK: HTTP/1.1 200 OK - 108738 bytes in 0.479 second response time https://wikitech.wikimedia.org/wiki/Dbtree.wikimedia.org
[17:36:09] PROBLEM - Check systemd state on prometheus1003 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:13] PROBLEM - Check systemd state on prometheus2004 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:13] RECOVERY - Check systemd state on prometheus1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:01:17] RECOVERY - Check systemd state on prometheus2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:05] RECOVERY - Check systemd state on prometheus2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:02:17] RECOVERY - Check systemd state on prometheus1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:01:44] (PS1) Ladsgroup: snapshot: Migrate cronjobs in shorturl to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682010 (https://phabricator.wikimedia.org/T273673)
[19:01:46] (PS1) Ladsgroup: snapshot: Migrate cronjobs in contentxlation to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682011 (https://phabricator.wikimedia.org/T273673)
[19:01:48] (PS1) Ladsgroup: snapshot: Migrate cronjobs in mediaperprojectlists to systemd timer [puppet] - https://gerrit.wikimedia.org/r/682012 (https://phabricator.wikimedia.org/T273673)
[19:19:09] SRE, Wikimedia-Mailing-lists: Upgrade mailing lists from mailman2 to 3 in batches - https://phabricator.wikimedia.org/T280322 (Ladsgroup)
[20:16:37] (CR) Gehel: "Minor comments inline, feel free to ping me to discuss" (2 comments) [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[20:24:58] (CR) Gehel: "another minor comment" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/628315 (https://phabricator.wikimedia.org/T212783) (owner: Gehel)
[20:25:21] (CR) Gehel: "minor question, otherwise LGTM" (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/681692 (https://phabricator.wikimedia.org/T212783) (owner: Volans)
[20:41:34] (CR) Sascha: [C: +1] "Looks great, thank you!" [puppet] - https://gerrit.wikimedia.org/r/681994 (https://phabricator.wikimedia.org/T278416) (owner: Ottomata)
[21:48:16] SRE, ops-eqiad, cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (RobH)
[23:36:04] SRE, ops-eqiad: Degraded RAID on cloudvirt1018 - https://phabricator.wikimedia.org/T280668 (wiki_willy) a:Jclark-ctr FYI - this one is out of warranty. @Jclark-ctr - can you see if we have any drives from decom'd servers around? Thanks, Willy
[23:36:23] SRE, ops-eqiad: Can't access thanos-fe1001.mgmt - https://phabricator.wikimedia.org/T280623 (wiki_willy) a:Cmjohnson
[23:36:41] SRE, ops-eqiad: htmldumper1001 power suply failure - https://phabricator.wikimedia.org/T280618 (wiki_willy) a:Cmjohnson
[23:37:28] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 (wiki_willy) a:wiki_willy→Cmjohnson
[23:55:05] SRE, ops-eqiad, DC-Ops, Dumps-Generation: (Need By: 2021-03-31) rack/setup/install snapshot101[1-5] - https://phabricator.wikimedia.org/T272509 (wiki_willy) @Cmjohnson - can you provide an update on this one? This is one of the priority installs. Thanks, Willy