[00:04:50] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [00:07:02] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3056 is OK: HTTP OK: HTTP/1.0 200 OK - 23653 bytes in 0.326 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [00:18:27] (PS1) Reedy: Move RelatedArticles extensionfunction into conditional [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:18:45] (PS2) Reedy: Move RelatedArticles extensionfunction into conditional [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:19:32] (CR) Reedy: "I'm a bit confused why this is needed... It's from T163114" [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [00:20:19] (CR) jerkins-bot: [V: -1] Move RelatedArticles extensionfunction into conditional [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [00:21:05] (PS3) Reedy: Move RelatedArticles extensionfunction into conditional [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:22:41] (CR) Reedy: "> Patch Set 2:" [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [00:23:57] (PS4) Reedy: Remove RelatedArticles ExtensionFunction [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:25:08] (CR) jerkins-bot: [V: -1] Remove RelatedArticles ExtensionFunction [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [00:25:36] (PS5) Reedy: Remove RelatedArticles ExtensionFunction [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:30:47] (PS6) Reedy: Remove RelatedArticles ExtensionFunction [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:30:49] (PS1) Reedy: Assign RelatedArticles config to 
wgRelatedArticlesFooterAllowedSkins [mediawiki-config] - https://gerrit.wikimedia.org/r/680814 [00:32:52] (CR) Jforrester: "Ha. Yay for less cruft in prod. Sad about the two years of this getting run on every request. Oops." [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [00:39:04] (PS7) Reedy: Update RelatedArticles config [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 [00:39:06] (PS2) Reedy: Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - https://gerrit.wikimedia.org/r/680814 [01:01:28] PROBLEM - Check systemd state on ms-be2028 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:15:40] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [01:17:52] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: initializing_shards: 0, timed_out: False, status: green, task_max_waiting_in_queue_millis: 0, cluster_name: cloudelastic-chi-eqiad, active_primary_shards: 895, active_shards: 1793, relocating_shards: 0, number_of_in_flight_fetch: 0, number_of_pending_tasks: 0, number_of_nodes: 6, number_of_data_nod [01:17:52] rds_percent_as_number: 100.0, delayed_unassigned_shards: 0, unassigned_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration [01:18:24] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 139.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [01:20:52] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2028 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [01:27:06] (CR) Reedy: [C: -2] "Needs to wait for I468a38df92347cc764e0457d4598bedfc4d92efa to be merged and be everywhere, so will need to sit for a few weeks..." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/680814 (owner: Reedy) [01:27:12] RECOVERY - Check systemd state on ms-be2028 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:27:33] (PS3) Reedy: Replace $wgRelatedArticlesFooterWhitelistedSkins [mediawiki-config] - https://gerrit.wikimedia.org/r/680814 [01:28:09] (CR) Reedy: "Will see about deploying later today" [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [01:50:51] (CR) Krinkle: [C: +1] "> Patch Set 6:" [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (owner: Reedy) [01:51:40] (PS8) Krinkle: Update RelatedArticles config [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (https://phabricator.wikimedia.org/T180192) (owner: Reedy) [01:52:12] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2028 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [02:03:41] (CR) Jforrester: "> Patch Set 7: Code-Review+1" [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (https://phabricator.wikimedia.org/T180192) (owner: Reedy) [02:03:54] (CR) Jforrester: [C: +1] Update RelatedArticles config [mediawiki-config] - https://gerrit.wikimedia.org/r/680811 (https://phabricator.wikimedia.org/T180192) (owner: Reedy) [02:05:54] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [02:10:42] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul 
https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [02:21:34] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 71.19 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [02:29:36] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 77.29 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1002&panelId=37 [02:59:03] (CR) Legoktm: exim: Add support for handling mailman3 inside mailman2 conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [03:00:07] (CR) Reedy: "I should probably also deploy this one too" [mediawiki-config] - https://gerrit.wikimedia.org/r/674465 (owner: Reedy) [03:09:37] (CR) Legoktm: exim: Add support for handling mailman3 inside mailman2 conf (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [03:20:56] PROBLEM - WDQS SPARQL on wdqs1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:39] SRE, Wikimedia-Mailing-lists: mail.wikimedia.org doesn't redirect to lists.wikimedia.org - https://phabricator.wikimedia.org/T280473 (Legoktm) [03:27:57] (CR) Legoktm: [C: +1] "Will merge tomorrow" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) 
(owner: Ladsgroup) [03:29:26] (CR) Legoktm: "I made the certbot certs world-readable because I couldn't figure out the correct permissions when getting lists.wmcloud.org to work, in c" [puppet] - https://gerrit.wikimedia.org/r/680335 (https://phabricator.wikimedia.org/T278612) (owner: Ladsgroup) [03:33:58] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 140.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [04:41:58] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 65.08 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1003&panelId=37 [05:05:17] !log Restart m2 database master T280251 [05:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:28] T280251: Upgrade mysql on db1107 (m2 db master) - https://phabricator.wikimedia.org/T280251 [05:10:40] PROBLEM - Check systemd state on mw2331 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:44] PROBLEM - Check systemd state on mw1316 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:46] PROBLEM - Check systemd state on cp4030 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:48] PROBLEM 
- Check systemd state on db1119 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:56] PROBLEM - Check systemd state on logstash2023 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:12] PROBLEM - Check systemd state on db1143 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:18] PROBLEM - Check systemd state on mw2283 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:22] PROBLEM - Check systemd state on mw1356 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:32] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:48] PROBLEM - Check systemd state on dns5002 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:11:58] PROBLEM - Check systemd state on mw2265 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:02] PROBLEM - Check systemd state on mw1407 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:16] PROBLEM - Check systemd state on mw2372 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:30] PROBLEM - Check systemd state on thanos-be1003 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:40] PROBLEM - Check systemd state on mw2291 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:46] PROBLEM - Check systemd state on mw2362 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:12:58] PROBLEM - Check systemd state on cp1083 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:02] PROBLEM - Check systemd state on ms-be1063 is CRITICAL: CRITICAL - degraded: The following units failed: debmonitor-client.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:13:14] RECOVERY - Check systemd state on db1119 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:14:30] RECOVERY - Check systemd state on mw1407 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:10] RECOVERY - Check systemd state on mw2291 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:16] RECOVERY - Check systemd state on mw2362 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:38] RECOVERY - Check systemd state on mw2331 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:40] RECOVERY - Check 
systemd state on mw1316 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:08] RECOVERY - Check systemd state on db1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:16:52] RECOVERY - Check systemd state on mw2265 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:10] RECOVERY - Check systemd state on mw2372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:26] RECOVERY - Check systemd state on thanos-be1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:52] RECOVERY - Check systemd state on cp1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:17:58] RECOVERY - Check systemd state on ms-be1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:18:18] RECOVERY - Check systemd state on logstash2023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:20:30] <_joe_> uhm [05:21:31] _joe_: probably related to m2 restart, as that host hosts debmonitor database [05:21:40] <_joe_> ahhh I see [05:21:45] <_joe_> yes makes sense [05:23:31] <_joe_> all those servers got back an error from the debmonitor server [05:24:41] make sense yeah [05:26:02] RECOVERY - Check systemd state on mw2283 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:26:06] RECOVERY - Check systemd state on mw1356 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[05:28:22] (PS1) Marostegui: instances.yaml: Add db1179 to dbctl [puppet] - https://gerrit.wikimedia.org/r/680868 (https://phabricator.wikimedia.org/T275633) [05:29:00] (CR) Marostegui: [C: +2] instances.yaml: Add db1179 to dbctl [puppet] - https://gerrit.wikimedia.org/r/680868 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [05:30:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1179 to dbctl T275633', diff saved to https://phabricator.wikimedia.org/P15403 and previous config saved to /var/cache/conftool/dbconfig/20210419-053050-marostegui.json [05:30:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:31:01] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [05:31:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1179 in s3 for the first time with minimal weight T275633', diff saved to https://phabricator.wikimedia.org/P15404 and previous config saved to /var/cache/conftool/dbconfig/20210419-053127-marostegui.json [05:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1179 in s3 for the first time with minimal weight T275633', diff saved to https://phabricator.wikimedia.org/P15405 and previous config saved to /var/cache/conftool/dbconfig/20210419-053730-marostegui.json [05:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:37:39] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [05:41:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 T272008', diff saved to https://phabricator.wikimedia.org/P15406 and previous config saved to /var/cache/conftool/dbconfig/20210419-054158-marostegui.json [05:42:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:42:07] T272008: Move wikireplicas under the new sanitarium hosts 
(db1154, db1155) - https://phabricator.wikimedia.org/T272008 [05:42:15] !log Stop sanitarium master on s1 (lag will show up on clouddb* labsdb* hosts) T272008 [05:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:48:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1106', diff saved to https://phabricator.wikimedia.org/P15407 and previous config saved to /var/cache/conftool/dbconfig/20210419-054831-marostegui.json [05:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:50:06] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The following units failed: jupyter-lexnasser-singleuser.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1074 T272008', diff saved to https://phabricator.wikimedia.org/P15408 and previous config saved to /var/cache/conftool/dbconfig/20210419-055240-marostegui.json [05:52:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:52:49] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [05:53:11] !log Stop sanitarium master on s2 (lag will show up on clouddb* labsdb* hosts) T272008 [05:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:56:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 10%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15409 and previous config saved to /var/cache/conftool/dbconfig/20210419-055613-root.json [05:56:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:57:10] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:49] <_joe_> !log rolling out further envoy upgrades T280317 [06:01:56] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1179 in s3 for the first time with minimal weight T275633', diff saved to https://phabricator.wikimedia.org/P15410 and previous config saved to /var/cache/conftool/dbconfig/20210419-060321-marostegui.json [06:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:30] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:05:28] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [06:10:44] (PS1) Ladsgroup: Disable legacy javascript variable for the rest of wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/680871 (https://phabricator.wikimedia.org/T72470) [06:10:50] <_joe_> !log upgrading envoy everywhere in codfw T280317 [06:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 25%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15411 and previous config saved to /var/cache/conftool/dbconfig/20210419-061116-root.json [06:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:17:24] <_joe_> !log upgrading envoy everywhere in eqiad T280317 [06:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:23:52] (PS13) Giuseppe Lavagetto: Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) [06:23:54] (PS1) Giuseppe Lavagetto: 
similar-users: don't override envoy version [deployment-charts] - https://gerrit.wikimedia.org/r/680873 (https://phabricator.wikimedia.org/T280317) [06:24:26] (CR) jerkins-bot: [V: -1] Helm chart to run MediaWiki [deployment-charts] - https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: Giuseppe Lavagetto) [06:24:30] (CR) jerkins-bot: [V: -1] similar-users: don't override envoy version [deployment-charts] - https://gerrit.wikimedia.org/r/680873 (https://phabricator.wikimedia.org/T280317) (owner: Giuseppe Lavagetto) [06:26:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 50%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15412 and previous config saved to /var/cache/conftool/dbconfig/20210419-062620-root.json [06:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:03] (PS1) Elukey: Add conf200[4-6] IPs to zookeeper's main firewall config [puppet] - https://gerrit.wikimedia.org/r/680874 (https://phabricator.wikimedia.org/T271573) [06:34:05] (PS1) Elukey: Swap zookeeper from conf2001 to conf2004 [puppet] - https://gerrit.wikimedia.org/r/680875 (https://phabricator.wikimedia.org/T271573) [06:36:51] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29099/console" [puppet] - https://gerrit.wikimedia.org/r/680875 (https://phabricator.wikimedia.org/T271573) (owner: Elukey) [06:39:23] (PS2) Elukey: Swap zookeeper from conf2001 to conf2004 [puppet] - https://gerrit.wikimedia.org/r/680875 (https://phabricator.wikimedia.org/T271573) [06:41:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 75%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15413 and previous config saved to /var/cache/conftool/dbconfig/20210419-064123-root.json [06:41:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:40] (CR) Elukey: [V: +1 C: +2] role::analytics_cluster::hadoop::master: move hadoop dirs under /srv [puppet] - https://gerrit.wikimedia.org/r/680259 (https://phabricator.wikimedia.org/T265126) (owner: Elukey) [06:46:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112 T272008', diff saved to https://phabricator.wikimedia.org/P15414 and previous config saved to /var/cache/conftool/dbconfig/20210419-064600-marostegui.json [06:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:09] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [06:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15415 and previous config saved to /var/cache/conftool/dbconfig/20210419-064914-root.json [06:49:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:49:50] (Abandoned) Giuseppe Lavagetto: similar-users: don't override envoy version [deployment-charts] - https://gerrit.wikimedia.org/r/680873 (https://phabricator.wikimedia.org/T280317) (owner: Giuseppe Lavagetto) [06:52:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15416 and previous config saved to /var/cache/conftool/dbconfig/20210419-065213-root.json [06:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:23] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [06:52:54] (PS3) Ayounsi: Merge all system.conf templates in one [homer/public] - https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) [06:53:49] (CR) Ayounsi: "Only outstanding change is a small one on the 
management routers that I tested individually." [homer/public] - https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) (owner: Ayounsi) [06:55:11] (PS1) Marostegui: install_server: Do not format db1179 [puppet] - https://gerrit.wikimedia.org/r/680977 (https://phabricator.wikimedia.org/T275633) [06:56:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1074 (re)pooling @ 100%: Repool db1074', diff saved to https://phabricator.wikimedia.org/P15417 and previous config saved to /var/cache/conftool/dbconfig/20210419-065627-root.json [06:56:31] (CR) Marostegui: [C: +2] install_server: Do not format db1179 [puppet] - https://gerrit.wikimedia.org/r/680977 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [06:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:34] (PS1) Muehlenhoff: Adapt name of security suite for bullseye and later [puppet] - https://gerrit.wikimedia.org/r/680978 (https://phabricator.wikimedia.org/T275873) [07:00:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1121 T272008', diff saved to https://phabricator.wikimedia.org/P15418 and previous config saved to /var/cache/conftool/dbconfig/20210419-070035-marostegui.json [07:00:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:44] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [07:04:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15419 and previous config saved to /var/cache/conftool/dbconfig/20210419-070418-root.json [07:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 10%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15420 and previous config saved to 
/var/cache/conftool/dbconfig/20210419-070439-root.json [07:04:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 15%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15421 and previous config saved to /var/cache/conftool/dbconfig/20210419-070718-root.json [07:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:29] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:07:56] (PS1) Giuseppe Lavagetto: kubernetes: Upgrade envoy to the latest 1.15 version [puppet] - https://gerrit.wikimedia.org/r/680979 (https://phabricator.wikimedia.org/T280317) [07:13:43] (CR) Giuseppe Lavagetto: [C: +2] kubernetes: Upgrade envoy to the latest 1.15 version [puppet] - https://gerrit.wikimedia.org/r/680979 (https://phabricator.wikimedia.org/T280317) (owner: Giuseppe Lavagetto) [07:17:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1082 T272008', diff saved to https://phabricator.wikimedia.org/P15422 and previous config saved to /var/cache/conftool/dbconfig/20210419-071701-marostegui.json [07:17:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:17:11] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [07:17:48] !log oblivian@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . 
[07:17:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15423 and previous config saved to /var/cache/conftool/dbconfig/20210419-071921-root.json [07:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 25%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15424 and previous config saved to /var/cache/conftool/dbconfig/20210419-071943-root.json [07:19:46] !log oblivian@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [07:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:37] !log oblivian@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . 
[07:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 20%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15425 and previous config saved to /var/cache/conftool/dbconfig/20210419-072221-root.json [07:22:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:22:30] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:27:06] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [07:34:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: Repool db1112', diff saved to https://phabricator.wikimedia.org/P15426 and previous config saved to /var/cache/conftool/dbconfig/20210419-073425-root.json [07:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:34:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 50%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15427 and previous config saved to /var/cache/conftool/dbconfig/20210419-073446-root.json [07:34:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 25%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15428 and previous config saved to /var/cache/conftool/dbconfig/20210419-073449-root.json [07:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:26] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1179 (re)pooling @ 30%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15429 and previous config saved to /var/cache/conftool/dbconfig/20210419-073725-root.json [07:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:37:34] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:40:32] (03PS1) 10Ayounsi: BGP: prioritize directly connected peers [homer/public] - 10https://gerrit.wikimedia.org/r/680980 (https://phabricator.wikimedia.org/T280054) [07:41:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1085 T272008', diff saved to https://phabricator.wikimedia.org/P15430 and previous config saved to /var/cache/conftool/dbconfig/20210419-074155-marostegui.json [07:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:04] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [07:43:32] (03CR) 10Ayounsi: "To be carefully deployed and tested with the example showed in the linked task." 
[homer/public] - 10https://gerrit.wikimedia.org/r/680980 (https://phabricator.wikimedia.org/T280054) (owner: 10Ayounsi) [07:45:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 25%: Repool db1085', diff saved to https://phabricator.wikimedia.org/P15431 and previous config saved to /var/cache/conftool/dbconfig/20210419-074510-root.json [07:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:04] (03PS2) 10Ayounsi: BGP: prioritize directly connected peers [homer/public] - 10https://gerrit.wikimedia.org/r/680980 (https://phabricator.wikimedia.org/T280054) [07:46:05] 10SRE, 10serviceops, 10Patch-For-Review: upgrade conf2* servers to stretch - https://phabricator.wikimedia.org/T271573 (10elukey) @JMeybohm I created the first two patch to swap the zookeeper servers, in theory it should work fine. The delicate step is the roll restart of the daemons after the second one, bu... [07:46:37] (03PS2) 10Elukey: hadoop: improve default log4j config [puppet] - 10https://gerrit.wikimedia.org/r/680383 (https://phabricator.wikimedia.org/T276906) [07:49:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 75%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15432 and previous config saved to /var/cache/conftool/dbconfig/20210419-074950-root.json [07:49:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 50%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15433 and previous config saved to /var/cache/conftool/dbconfig/20210419-074953-root.json [07:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:02] (03PS3) 10Elukey: hadoop: improve default log4j config [puppet] - 10https://gerrit.wikimedia.org/r/680383 (https://phabricator.wikimedia.org/T276906) [07:51:42] !log upgrade mwdebug2002 to PHP 7.2.34 [07:51:48] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 40%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15434 and previous config saved to /var/cache/conftool/dbconfig/20210419-075229-root.json [07:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:37] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:00:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 50%: Repool db1085', diff saved to https://phabricator.wikimedia.org/P15435 and previous config saved to /var/cache/conftool/dbconfig/20210419-080013-root.json [08:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:59] !log installing python-bleach security updates [08:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:04:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1121 (re)pooling @ 100%: Repool db1121', diff saved to https://phabricator.wikimedia.org/P15436 and previous config saved to /var/cache/conftool/dbconfig/20210419-080454-root.json [08:04:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 75%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15437 and previous config saved to /var/cache/conftool/dbconfig/20210419-080456-root.json [08:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:19] !log swift eqiad-prod: less weight for ms-be[1019-1026] / more weight to ms-be106[0-3] - T272836 [08:07:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:30] T272836: Decom ms-be[1019-1026] from swift - https://phabricator.wikimedia.org/T272836 [08:07:33] !log marostegui@cumin1001 dbctl 
commit (dc=all): 'db1179 (re)pooling @ 50%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15438 and previous config saved to /var/cache/conftool/dbconfig/20210419-080732-root.json [08:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:42] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:07:51] !log dcaro@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on labstore1004.eqiad.wmnet with reason: Restarting mysql [08:07:52] !log dcaro@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on labstore1004.eqiad.wmnet with reason: Restarting mysql [08:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 75%: Repool db1085', diff saved to https://phabricator.wikimedia.org/P15439 and previous config saved to /var/cache/conftool/dbconfig/20210419-081517-root.json [08:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:19:54] 10SRE: creation of raju@wikipedia.org for fundraising team - https://phabricator.wikimedia.org/T280371 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Hi @MNoorWMF, this is implemented now! Resolving the task but feel free to reopen if something is amiss. 
[08:20:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1082 (re)pooling @ 100%: Repool db1082', diff saved to https://phabricator.wikimedia.org/P15440 and previous config saved to /var/cache/conftool/dbconfig/20210419-082000-root.json [08:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 60%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15441 and previous config saved to /var/cache/conftool/dbconfig/20210419-082236-root.json [08:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:47] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:24:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Envoy: set per_try_timeout for eventgate-main. [puppet] - 10https://gerrit.wikimedia.org/r/680372 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [08:26:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079 T272008', diff saved to https://phabricator.wikimedia.org/P15442 and previous config saved to /var/cache/conftool/dbconfig/20210419-082559-marostegui.json [08:26:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:08] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - https://phabricator.wikimedia.org/T272008 [08:27:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Envoy: set per_try_timeout for eventgate-main. [deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [08:28:42] (03Merged) 10jenkins-bot: Envoy: set per_try_timeout for eventgate-main. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/679855 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [08:30:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 25%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15443 and previous config saved to /var/cache/conftool/dbconfig/20210419-083018-root.json [08:30:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1085 (re)pooling @ 100%: Repool db1085', diff saved to https://phabricator.wikimedia.org/P15444 and previous config saved to /var/cache/conftool/dbconfig/20210419-083021-root.json [08:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:56] !log Restart m5 master - wikitech will go down T279657 [08:31:02] marostegui: Failed to log message to wiki. Somebody should check the error logs. [08:31:02] T279657: Upgrade mysql on db1128 (m5 db master) - https://phabricator.wikimedia.org/T279657 [08:31:46] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [08:32:40] RECOVERY - Check systemd state on cp4030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:33:45] 10SRE, 10SRE-tools: debmonitor-client.service stays in failed state in case of server errors - https://phabricator.wikimedia.org/T280484 (10ema) [08:34:10] !log Testing log [08:34:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:19] !log restart debmonitor-client.service on cp4030, dns5002, an-worker1106 T280484 [08:35:27] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:28] T280484: debmonitor-client.service stays in failed state in case of server errors - https://phabricator.wikimedia.org/T280484 [08:35:28] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:32] RECOVERY - Check systemd state on dns5002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:37:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 70%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15445 and previous config saved to /var/cache/conftool/dbconfig/20210419-083740-root.json [08:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:49] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:38:50] PROBLEM - Memory correctable errors -EDAC- on thumbor2001 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor2001&var-datasource=codfw+prometheus/ops [08:45:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 T272008', diff saved to https://phabricator.wikimedia.org/P15446 and previous config saved to /var/cache/conftool/dbconfig/20210419-084523-marostegui.json [08:45:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 50%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15447 and previous config saved to /var/cache/conftool/dbconfig/20210419-084528-root.json [08:45:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:32] T272008: Move wikireplicas under the new sanitarium hosts (db1154, db1155) - 
https://phabricator.wikimedia.org/T272008 [08:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:05] (03CR) 10David Caro: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [08:46:40] (03Abandoned) 10David Caro: icinga: allow clearing a downtime for a host [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [08:48:06] 10SRE, 10Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10Peachey88) I believe we already have Source Han Sans (via [[ https://packages.debian.org/sid/fonts-noto-cjk | fonts-noto-cjk ]]) installed via {T123223}. [08:48:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1087 T272008', diff saved to https://phabricator.wikimedia.org/P15448 and previous config saved to /var/cache/conftool/dbconfig/20210419-084834-marostegui.json [08:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:26] (03Restored) 10Jcrespo: icinga: allow clearing a downtime for a host [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [08:50:44] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) db1124 and db1125 can now be reimaged to buster and moved to their final destinations. [08:52:13] (03CR) 10Jcrespo: "I think it would be more productive to merge this, as it is a net improvement over the existing setup." 
[puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [08:52:44] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 80%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15449 and previous config saved to /var/cache/conftool/dbconfig/20210419-085243-root.json [08:52:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:53] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [08:53:07] (03PS1) 10Marostegui: install_server: Reimage db1124,db1125 to buster [puppet] - 10https://gerrit.wikimedia.org/r/680986 (https://phabricator.wikimedia.org/T258361) [08:56:07] (03CR) 10Marostegui: "+1 to having a way to remove downtime!" [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [08:58:05] (03CR) 10Jbond: [C: 03+1] "lgtm, see comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680978 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [08:58:12] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1124,db1125 to buster [puppet] - 10https://gerrit.wikimedia.org/r/680986 (https://phabricator.wikimedia.org/T258361) (owner: 10Marostegui) [09:00:24] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/680376 (owner: 10David Caro) [09:00:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 75%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15450 and previous config saved to /var/cache/conftool/dbconfig/20210419-090031-root.json [09:00:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/680980 (https://phabricator.wikimedia.org/T280054) (owner: 10Ayounsi) [09:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 90%: Slowly pool db1179 for the first time 
in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15451 and previous config saved to /var/cache/conftool/dbconfig/20210419-090747-root.json [09:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:56] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:12:02] 10SRE, 10SRE-tools: debmonitor-client.service stays in failed state in case of server errors - https://phabricator.wikimedia.org/T280484 (10Volans) a:03jbond Assigning to @jbond that was working on this. @ema thanks for the task. FWIW some retry logic has already been added to the debmonitor client (see htt... [09:13:30] (03CR) 10Jbond: [C: 03+1] "lgtm, minor optional nits inline" (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) (owner: 10Ayounsi) [09:15:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1079 (re)pooling @ 100%: Repool db1079', diff saved to https://phabricator.wikimedia.org/P15452 and previous config saved to /var/cache/conftool/dbconfig/20210419-091535-root.json [09:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:41] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: fix reimage race on /var/log/swift [puppet] - 10https://gerrit.wikimedia.org/r/680307 (owner: 10Filippo Giunchedi) [09:21:35] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [09:22:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1161 T280492', diff saved to https://phabricator.wikimedia.org/P15453 and previous config saved to /var/cache/conftool/dbconfig/20210419-092234-marostegui.json [09:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:43] T280492: Upgrade all sanitarium masters to 10.4 and Buster - https://phabricator.wikimedia.org/T280492 
[09:22:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Slowly pool db1179 for the first time in s3 T275633', diff saved to https://phabricator.wikimedia.org/P15454 and previous config saved to /var/cache/conftool/dbconfig/20210419-092251-root.json [09:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:00] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [09:23:53] (03PS1) 10Marostegui: mariadb: Reimage db1161 to Buster and 10.4 [puppet] - 10https://gerrit.wikimedia.org/r/680989 (https://phabricator.wikimedia.org/T280492) [09:24:47] (03CR) 10Marostegui: [C: 03+2] mariadb: Reimage db1161 to Buster and 10.4 [puppet] - 10https://gerrit.wikimedia.org/r/680989 (https://phabricator.wikimedia.org/T280492) (owner: 10Marostegui) [09:24:50] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: hide diffs for files with sensitive data [puppet] - 10https://gerrit.wikimedia.org/r/680309 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi) [09:25:05] (03PS2) 10Filippo Giunchedi: swift: hide diffs for files with sensitive data [puppet] - 10https://gerrit.wikimedia.org/r/680309 (https://phabricator.wikimedia.org/T280257) [09:28:22] (03CR) 10Filippo Giunchedi: [C: 03+2] thanos: extend retention to 5y [puppet] - 10https://gerrit.wikimedia.org/r/680308 (owner: 10Filippo Giunchedi) [09:29:00] (03PS1) 10Marostegui: install_server: Reimage all codfw sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/680992 (https://phabricator.wikimedia.org/T280492) [09:33:12] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage all codfw sanitarium masters [puppet] - 10https://gerrit.wikimedia.org/r/680992 (https://phabricator.wikimedia.org/T280492) (owner: 10Marostegui) [09:34:23] (03PS1) 10Ema: cache: enable exp caching policy on cp3051 [puppet] - 10https://gerrit.wikimedia.org/r/680995 (https://phabricator.wikimedia.org/T275809) [09:35:08] (03CR) 10Ema: "check 
experimental" [puppet] - 10https://gerrit.wikimedia.org/r/680995 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [09:37:03] 10SRE, 10MW-on-K8s, 10serviceops: Figure out appropriate readiness and liveness probes - https://phabricator.wikimedia.org/T276908 (10Joe) 05Open→03Resolved [09:37:09] 10SRE, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create a basic helm chart to test MediaWiki on kubernetes - https://phabricator.wikimedia.org/T265327 (10Joe) [09:37:32] 10SRE, 10MediaWiki-General, 10Browser-Support-Apple-Safari: File:Chessboard480.svg not visible on safari when size is fixed at 208px - https://phabricator.wikimedia.org/T280439 (10Daimona) 05Resolved→03Open a:05Daimona→03None Actually, it's still broken. It did work immediately after regenerating the... [09:39:59] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:34] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.497e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [09:42:04] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1161.eqiad.wmnet with reason: REIMAGE [09:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:49] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10Joe) [09:44:56] 10SRE, 10MW-on-K8s, 10serviceops: Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (10Joe) p:05Triage→03High [09:48:23] (03CR) 10ArielGlenn: [C: 03+1] "At last I have got the thing working in deployment-prep! Output looks good, so quoting must be fine. 
I have not tested the MAILTO function" [puppet] - 10https://gerrit.wikimedia.org/r/678338 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [09:51:34] 10SRE, 10LDAP-Access-Requests: NDA for Superset Request from WMDE Employee Manuel - https://phabricator.wikimedia.org/T280162 (10Manuel) @Dzahn My Wikitech username is "Manuel Merz (WMDE)". Cheers! [09:51:37] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: REIMAGE [09:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:57] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [09:53:01] (03CR) 10Ema: [C: 03+2] cache: enable exp caching policy on cp3051 [puppet] - 10https://gerrit.wikimedia.org/r/680995 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [09:53:06] (03PS1) 10Marostegui: install_server: Reimage db1156 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681000 (https://phabricator.wikimedia.org/T280492) [09:53:42] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2002.codfw.wmnet with reason: REIMAGE [09:53:45] (03CR) 10Marostegui: [C: 03+2] install_server: Reimage db1156 to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681000 (https://phabricator.wikimedia.org/T280492) (owner: 10Marostegui) [09:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:51] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: limit apifeatureusage curator job to jobs_host [puppet] - 10https://gerrit.wikimedia.org/r/680399 (https://phabricator.wikimedia.org/T274394) (owner: 10Cwhite) 
[09:56:01] (03CR) 10Ema: [C: 03+1] trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - 10https://gerrit.wikimedia.org/r/680393 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [09:56:54] !log cp3051: varnish-frontend-restart to apply exp policy settings changes starting from empty cache T275809 [09:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:03] T275809: cache_upload cache policy + large_objects_cutoff concerns - https://phabricator.wikimedia.org/T275809 [09:57:04] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs1005 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:58:02] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs1005 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [09:59:32] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [10:00:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: bootstrap repo for thirdparty/kubedam-k8s-1-18 [puppet] - 10https://gerrit.wikimedia.org/r/680253 (https://phabricator.wikimedia.org/T280299) (owner: 10Majavah) [10:04:43] !log aborrero@apt1001:~ $ sudo -i reprepro --delete clearvanished (remove old buster-wikimedia|thirdparty/kubeadm-k8s-1-15,16 repos and packages) [10:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:36] !log aborrero@apt1001:~ $ sudo -i reprepro --component thirdparty/kubeadm-k8s-1-18 update buster-wikimedia [10:05:45] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE [10:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1156.eqiad.wmnet with reason: REIMAGE [10:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:49] RECOVERY - WDQS SPARQL on wdqs1005 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.068 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [10:15:54] !log depooling wdqs1005 [10:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:17:59] (03CR) 10Majavah: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/680851 (owner: 10Majavah) [10:19:00] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] aptrepo: Use buster for thirdparty/kubeadm-k8s-docker.com [puppet] - 10https://gerrit.wikimedia.org/r/680851 (owner: 10Majavah) [10:22:31] !log reimaging theemin to bullseye [10:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:47] !log imported 1.16.3 into envoy-future [10:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak and Amir1: Your horoscope predicts another unfortunate Wikimedia Portals Update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1030). 
[10:32:21] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on theemin.codfw.wmnet with reason: REIMAGE [10:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:52] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [10:34:17] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on theemin.codfw.wmnet with reason: REIMAGE [10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:08] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681008 (https://phabricator.wikimedia.org/T128546) [10:37:21] (03PS1) 10Ladsgroup: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681009 (https://phabricator.wikimedia.org/T279419) [10:37:53] (03Abandoned) 10Ladsgroup: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681009 (https://phabricator.wikimedia.org/T279419) (owner: 10Ladsgroup) [10:38:37] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [10:38:57] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681008 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:26] (03PS1) 10Filippo Giunchedi: swift: force creation of /var/log/swift symlink [puppet] - 10https://gerrit.wikimedia.org/r/681010 (https://phabricator.wikimedia.org/T280257) [10:39:49] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681008 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:39:55] (03CR) 10jerkins-bot: [V: 04-1] swift: force creation of /var/log/swift symlink [puppet] - 
10https://gerrit.wikimedia.org/r/681010 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi) [10:40:33] Amir1: This weeks portals update includes some logo fixes right? [10:40:41] yup [10:41:10] (03PS1) 10WMDE-Fisch: [beta] Enable changes to the descriptions in the VE transclusion dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681011 (https://phabricator.wikimedia.org/T273425) [10:42:45] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10LarsWirzenius) a:03LarsWirzenius [10:43:18] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [10:45:49] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:681008| Bumping portals to master (T128546)]] (duration: 00m 58s) [10:45:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:58] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:46:47] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:681008| Bumping portals to master (T128546)]] (duration: 00m 57s) [10:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:31] Amir1: I need a bit of a sanity check, I just did the deploy, did that fix anything? [10:48:01] jan_drewniak: Yes. 
it's in https://www.wikinews.org [10:48:11] the mw logo was stretched, it shouldn't be anymore [10:48:22] Amir1: Ok great :) [10:48:27] context T279419 [10:48:27] T279419: New MediaWiki logo is stretched on portals www.wiktionary.org, www.wikinews.org, www.wikiquote.org, www.wikibooks.org and www.wikiversity.org - https://phabricator.wikimedia.org/T279419 [10:48:48] Can anyone from CI look https://gerrit.wikimedia.org/r/c/integration/config/+/680697? [10:48:48] (03PS1) 10Jbond: cfssl: update certificate expire check script [puppet] - 10https://gerrit.wikimedia.org/r/681012 [10:50:29] Jayprakash12345: wrong channel, use #wikimedia-releng [10:50:57] Majavah: Thanks :) [10:51:03] 10SRE, 10Patch-For-Review: Prepare our base system layer for Debian 11/bullseye - https://phabricator.wikimedia.org/T275873 (10MoritzMuehlenhoff) [10:52:36] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [10:53:50] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install atlas-codfw.wikimedia.org - https://phabricator.wikimedia.org/T273114 (10faidon) @CDanis could you look at this soon? Thanks! [10:55:00] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: introduce eqiad1 service implementation [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) [10:58:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29105/" [puppet] - 10https://gerrit.wikimedia.org/r/675556 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Time to snap out of that daydream and deploy European mid-day backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1100). 
[11:00:04] CFisch_WMDE and Amir1: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] \o [11:00:21] o/ [11:00:24] (03PS2) 10Jbond: cfssl: update certificate expire check script [puppet] - 10https://gerrit.wikimedia.org/r/681012 [11:00:36] \o [11:00:37] time to snap out of that NASA Ingenuity stream, more like ;) [11:00:39] o/ [11:00:48] (03CR) 10Urbanecm: [C: 03+2] Add filtering for the suggested values combo box [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [11:00:53] +2'ed the backport :) [11:01:00] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet [11:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:19] !log aborrero@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet [11:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:41] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29106/console" [puppet] - 10https://gerrit.wikimedia.org/r/681012 (owner: 10Jbond) [11:01:44] shall we get to config pushed now? [11:01:46] Amir1: wanna go with your patch first? [11:01:50] yeah [11:01:53] cool [11:02:00] (03CR) 10Jbond: [V: 03+1 C: 03+2] cfssl: update certificate expire check script [puppet] - 10https://gerrit.wikimedia.org/r/681012 (owner: 10Jbond) [11:02:02] Urbanecm: Cool, feel free, nothing to test there. I'm currently struggling a bit with my little one at home otherwise I would do the deploy myself. 
[11:02:11] !log import promethus-rsyslog-exporter for bullseye-wikimedia/main [11:02:15] CFisch_WMDE: ok, no problem :) [11:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:43] (03CR) 10Ladsgroup: [C: 03+2] Disable legacy javascript variable for the rest of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680871 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:03:17] Amir1: please ping me once done, I added few last-time config patches for me [11:03:26] sure [11:03:40] (03Merged) 10jenkins-bot: Disable legacy javascript variable for the rest of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680871 (https://phabricator.wikimedia.org/T72470) (owner: 10Ladsgroup) [11:05:31] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:680871|Disable legacy javascript variable for the rest of wikis (T72470)]] (duration: 00m 57s) [11:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:40] T72470: Remove legacy javascript globals - https://phabricator.wikimedia.org/T72470 [11:05:43] Urbanecm: I'm done [11:05:43] (03PS1) 10Jbond: P:multirootca: fix sudo:: rule title [puppet] - 10https://gerrit.wikimedia.org/r/681023 [11:05:47] thanks [11:05:59] (03PS2) 10Urbanecm: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680302 (https://phabricator.wikimedia.org/T279853) [11:06:07] (03CR) 10Urbanecm: [C: 03+2] testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680302 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [11:06:55] (03Merged) 10jenkins-bot: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680302 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [11:07:29] (03CR) 10Jbond: [C: 03+2] 
P:multirootca: fix sudo:: rule title [puppet] - 10https://gerrit.wikimedia.org/r/681023 (owner: 10Jbond) [11:10:05] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 03f8ed819091624f5ae4a8d7ed3631dc322fabcd: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD (T279853) (duration: 00m 57s) [11:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:15] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [11:10:28] first patch synced, other need more testing [11:11:11] (03PS2) 10Muehlenhoff: Adapt name of security suite for bullseye and later [puppet] - 10https://gerrit.wikimedia.org/r/680978 (https://phabricator.wikimedia.org/T275873) [11:11:23] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=testwiki # T279853 [11:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:46] ...something is broken, I'm done... [11:16:56] Urbanecm: sooooo the backport can't be deployed? [11:17:03] oh, it can [11:17:06] did it merge already? 
[11:17:19] i meant that i'm not going with rest of my patches [11:17:27] CFisch_WMDE: your patch is still in the CI it seems [11:17:28] ahh k :-) [11:17:31] yeah [11:17:38] I was just wondering ;-) [11:18:31] (03CR) 10Muehlenhoff: [C: 03+2] Adapt name of security suite for bullseye and later [puppet] - 10https://gerrit.wikimedia.org/r/680978 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [11:18:47] no problem, i should've been more explicit [11:20:43] (03CR) 10jerkins-bot: [V: 04-1] Add filtering for the suggested values combo box [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [11:20:52] o.O [11:21:08] something's wrong with your patch it seems :) [11:21:11] Seems I'll have to fix that first. [11:21:14] yup [11:21:27] Ok just skip it, I'll deal with that some other time and try again. [11:21:34] but it's browser test, so maybe it's just temporary [11:21:48] Hmm you could retry [11:21:56] CFisch_WMDE: yeah, if you have time, I'll +2 it again [11:22:08] (03CR) 10Urbanecm: [C: 03+2] "flaky tests" [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [11:22:12] I have to run now :-/ [11:22:21] oh [11:22:24] so next time then :) [11:22:50] Or, if you're bold you can still deploy when it runs through :-D [11:23:12] I'll see in the logs. No hard feelings, not so urgent. 
[11:23:14] * CFisch_WMDE gone [11:23:34] i removed the +2 [11:23:43] I'd rather not deploy sth without someone who knows what it does around :D [11:25:07] (03PS1) 10Jbond: sudo: add new flag purge_sudoeres_d [puppet] - 10https://gerrit.wikimedia.org/r/681026 [11:25:52] (03CR) 10Jbond: "Note PCC is not super useful here as this change will only affect things that are not managed" [puppet] - 10https://gerrit.wikimedia.org/r/681026 (owner: 10Jbond) [11:25:57] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29109/console" [puppet] - 10https://gerrit.wikimedia.org/r/681026 (owner: 10Jbond) [11:27:48] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=testwiki --force # T279853 [11:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:58] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [11:30:12] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/679292 (owner: 10Jbond) [11:30:20] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/679297 (owner: 10Jbond) [11:31:35] (03PS1) 10Arturo Borrero Gonzalez: openstack: clodugw: conntrackd: resolve peer names [puppet] - 10https://gerrit.wikimedia.org/r/681028 (https://phabricator.wikimedia.org/T270704) [11:33:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/29110/" [puppet] - 10https://gerrit.wikimedia.org/r/681028 (https://phabricator.wikimedia.org/T270704) (owner: 10Arturo Borrero Gonzalez) [11:33:51] !log imported debdeploy 0.0.99.13-1+deb11u1 to bullseye-wikimedia T275873 [11:33:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:00] T275873: Prepare our base system layer for Debian 11/bullseye - 
https://phabricator.wikimedia.org/T275873 [11:37:05] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet [11:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:13] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet [11:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:15] (03CR) 10Jbond: "ready for review" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond) [11:43:03] (03CR) 10jerkins-bot: [V: 04-1] Add filtering for the suggested values combo box [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [11:55:54] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add conf200[4-6] IPs to zookeeper's main firewall config [puppet] - 10https://gerrit.wikimedia.org/r/680874 (https://phabricator.wikimedia.org/T271573) (owner: 10Elukey) [12:01:59] (03PS1) 10Marostegui: db1182: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681030 (https://phabricator.wikimedia.org/T275633) [12:06:21] (03CR) 10Marostegui: [C: 03+2] db1182: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681030 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:10:00] Re, Urbanecm understandable :-D. CI is still failing anyways .... seems to be an issue with something somewhere -.- [12:12:31] Might be this? 
T280491 [12:12:32] T280491: Ruby Browser Tests broken after OS update - https://phabricator.wikimedia.org/T280491 [12:13:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 236, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:14:59] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.01201 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:16:49] Amir1: Seems to be something different: [12:16:55] 13:40:56 [0-4] Error in "App can go back from a save error both on desktop and mobile" [12:16:55] 13:40:56 element (".oo-ui-dialog #data-bridge-app .wb-db-error-saving__back") still displayed after 10000ms [12:18:11] oooh [12:18:17] CFisch_WMDE: I know what's causing this [12:18:28] the wikibase patch needs backporting [12:18:34] :-D [12:18:38] (03PS1) 10Marostegui: instances.yaml: Add db1182 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/681031 (https://phabricator.wikimedia.org/T275633) [12:18:43] (03PS2) 10Filippo Giunchedi: swift: force creation of /var/log/swift symlink [puppet] - 10https://gerrit.wikimedia.org/r/681010 (https://phabricator.wikimedia.org/T280257) [12:19:15] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 81, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:19:44] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29111/console" [puppet] - 10https://gerrit.wikimedia.org/r/681010 (https://phabricator.wikimedia.org/T280257) (owner: 10Filippo Giunchedi) [12:19:53] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/680296 [12:20:19] CFisch_WMDE: Backporting to wmf1? 
[12:20:45] +1 [12:21:07] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/679462 is the one with the issue [12:21:33] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db1182 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/681031 (https://phabricator.wikimedia.org/T275633) (owner: 10Marostegui) [12:21:52] looking for an easy +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/681010/ [12:22:21] (03PS1) 10Ladsgroup: bridge: Split mobile and desktop error handling browser test, disable mobile [extensions/Wikibase] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680853 (https://phabricator.wikimedia.org/T279068) [12:22:33] (03CR) 10Ladsgroup: [C: 03+2] bridge: Split mobile and desktop error handling browser test, disable mobile [extensions/Wikibase] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680853 (https://phabricator.wikimedia.org/T279068) (owner: 10Ladsgroup) [12:22:55] (03CR) 10Volans: "LGTM, just couple of nits inline." (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond) [12:24:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2072.codfw.wmnet with reason: REIMAGE [12:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:32] (03PS1) 10Muehlenhoff: Only disable timers for ipmitoo/smartmon timers up to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681033 (https://phabricator.wikimedia.org/T275873) [12:25:11] CFisch_WMDE: so I get it merged and once it's done, you can +2 yours. Would that work for you? 
[12:26:12] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2072.codfw.wmnet with reason: REIMAGE [12:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:57] (03PS1) 10Hnowlan: cr/firewall: allow access to new AQS hosts [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) [12:27:14] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10LarsWirzenius) I was hoping we could retire Scap before WMF moves to bullseye, but @... [12:27:34] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/29112/" [puppet] - 10https://gerrit.wikimedia.org/r/681033 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [12:27:55] Sounds good. Do I need to care deploying that one too, when deploying mine at some time? Or since it's only tests, you don't care? 
:-) [12:27:55] (03PS1) 10Urbanecm: cswiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681063 (https://phabricator.wikimedia.org/T279853) [12:27:59] Amir1: ^ [12:28:07] jouncebot: now [12:28:07] For the next 0 hour(s) and 1 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1030) [12:28:07] For the next 0 hour(s) and 31 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1100) [12:28:28] that...doesn't look right [12:28:52] * Urbanecm deploying [12:28:55] I won't sync it, just rebase [12:28:56] (03CR) 10Urbanecm: [C: 03+2] cswiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681063 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:29:35] Amir1: kk [12:29:54] 10SRE, 10Scap, 10Python3-Porting, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2021-04-01 to 2021-06-30 (Q4)): Porting scap to Python 3 - https://phabricator.wikimedia.org/T279628 (10MoritzMuehlenhoff) >>! In T279628#7013821, @LarsWirzenius wrote: > I was hoping we c... 
[12:30:54] (03Merged) 10jenkins-bot: cswiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681063 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:32:28] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db2126.codfw.wmnet with reason: REIMAGE [12:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:57] CFisch_WMDE: it gets better, the tests are failing now due to the bug I mentioned [12:34:05] (03PS4) 10Jbond: Drop python2 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 [12:34:10] I can force merge this one to unblock VE [12:34:29] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 3e3cce192f1e99cbcae739f234271411d10974ac: cswiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD (T279853) (duration: 00m 58s) [12:34:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2126.codfw.wmnet with reason: REIMAGE [12:34:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:37] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [12:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:00] Amir1: ^^' [12:35:12] (03CR) 10Jbond: "updated thanks" (033 comments) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond) [12:38:43] !log mwscript extensions/GrowthExperiments/maintenance/migrateMentorMenteeRelationship.php --wiki=cswiki # T279853 [12:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:55] (03CR) 10Jbond: [C: 03+2] Drop python2 support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond) [12:38:58] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/679358 (owner: 10Jbond) [12:39:07] Amir1: It's not super urgent but would still be nice, then it's not someone biased from our team ;-). [12:42:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Only disable timers for ipmitoo/smartmon timers up to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681033 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [12:43:14] (03PS1) 10Ema: cache: decrease caching likelihood on cp3051 [puppet] - 10https://gerrit.wikimedia.org/r/681064 (https://phabricator.wikimedia.org/T275809) [12:44:20] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/681064 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [12:44:32] (03PS2) 10Urbanecm: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853) [12:45:09] (03CR) 10Urbanecm: [C: 03+2] testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:45:58] (03CR) 10Ema: [C: 03+2] cache: decrease caching likelihood on cp3051 [puppet] - 10https://gerrit.wikimedia.org/r/681064 (https://phabricator.wikimedia.org/T275809) (owner: 10Ema) [12:46:00] (03Merged) 10jenkins-bot: testwiki: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_NEW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680303 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:48:11] (03CR) 10jerkins-bot: [V: 04-1] bridge: Split mobile and desktop error handling browser test, disable mobile [extensions/Wikibase] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680853 (https://phabricator.wikimedia.org/T279068) (owner: 10Ladsgroup) [12:51:32] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: ef0f68e2a9c1c638911bb06c47ba6e8ef88ee393: testwiki: 
wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_NEW (T279853) (duration: 00m 57s) [12:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:43] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [12:53:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1182 to dbctl T275633', diff saved to https://phabricator.wikimedia.org/P15457 and previous config saved to /var/cache/conftool/dbconfig/20210419-125301-marostegui.json [12:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:11] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [12:54:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1182 in s2 for the first time with minimal weight T275633', diff saved to https://phabricator.wikimedia.org/P15458 and previous config saved to /var/cache/conftool/dbconfig/20210419-125407-marostegui.json [12:54:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:22] (03PS2) 10Urbanecm: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853) [12:55:24] (03CR) 10Elukey: [C: 03+2] Add conf200[4-6] IPs to zookeeper's main firewall config [puppet] - 10https://gerrit.wikimedia.org/r/680874 (https://phabricator.wikimedia.org/T271573) (owner: 10Elukey) [12:55:26] (03CR) 10Ladsgroup: [C: 03+2] "The failure is unrelated and due to T280491 but at the same time, doing backport of this is crucially important to unblock all other backp" [extensions/Wikibase] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680853 (https://phabricator.wikimedia.org/T279068) (owner: 10Ladsgroup) [12:55:28] CFisch_WMDE: you should be unblocked now [12:55:31] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] bridge: Split mobile and desktop error handling 
browser test, disable mobile [extensions/Wikibase] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680853 (https://phabricator.wikimedia.org/T279068) (owner: 10Ladsgroup) [12:55:34] (03CR) 10Ladsgroup: [C: 03+1] "it should be unblocked by now." [extensions/VisualEditor] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/679462 (https://phabricator.wikimedia.org/T271898) (owner: 10WMDE-Fisch) [12:56:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1182', diff saved to https://phabricator.wikimedia.org/P15459 and previous config saved to /var/cache/conftool/dbconfig/20210419-125600-marostegui.json [12:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:10] (03CR) 10Urbanecm: [C: 03+2] wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:56:27] Amir1: \o/ thanks! [12:57:07] (03Merged) 10jenkins-bot: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680304 (https://phabricator.wikimedia.org/T279853) (owner: 10Urbanecm) [12:58:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: bd076306c0ae0428ff13743f499b2a02d42b6eab: wgGEMentorshipMigrationStage: Set to WRITE_BOTH/READ_OLD everywhere (T279853) (duration: 00m 57s) [12:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:54] T279853: Migrate mentor/mentee relationship to a separate database table on Wikimedia wikis - https://phabricator.wikimedia.org/T279853 [13:01:19] (03PS1) 10Ottomata: test/refine_sanitize - absent sanitize_eventlogging_analytics_delayed_test until June [puppet] - 10https://gerrit.wikimedia.org/r/681069 [13:02:30] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitize - absent sanitize_eventlogging_analytics_delayed_test until June [puppet] - 
10https://gerrit.wikimedia.org/r/681069 (owner: 10Ottomata) [13:03:38] (03PS2) 10Ottomata: test/refine_sanitize - absent sanitize delayed_test until June [puppet] - 10https://gerrit.wikimedia.org/r/681069 [13:04:30] Amir1 CFisch_WMDE we are getting a bunch of Wikimedia\Rdbms\Database::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays errors, is that related to any of the work? [13:04:42] or should I create a task? [13:04:45] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitize - absent sanitize delayed_test until June [puppet] - 10https://gerrit.wikimedia.org/r/681069 (owner: 10Ottomata) [13:05:02] marostegui: my patch is just tests. CFisch_WMDE's hasn't merged yet [13:05:09] 10SRE, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Jan.Kamenicek) While one would expect that such a crucial broken feature should be fixed within days, in Wik... [13:05:09] ah ok [13:05:12] Thanks! [13:05:14] I will create a task [13:05:16] so I assume Urbanecm's patches? [13:05:30] (03PS3) 10Ottomata: test/refine_sanitize - absent sanitize delayed_test until June [puppet] - 10https://gerrit.wikimedia.org/r/681069 (https://phabricator.wikimedia.org/T273789) [13:05:41] from what I can see, it is mostly cswiki [13:05:47] that's likely me [13:05:58] marostegui: do you have a link to the error? 
[13:05:59] https://logstash.wikimedia.org/goto/004b938dedef4a372378b6b3460c78c7 [13:06:02] thanks [13:06:17] (03CR) 10Elukey: [C: 03+1] "IPs looks good, if possible I'd amend the commit message to specify that we are adding Cassandra instance IPs and not host level ones, so " [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) (owner: 10Hnowlan) [13:07:08] (03CR) 10Ottomata: [C: 03+2] test/refine_sanitize - absent sanitize delayed_test until June [puppet] - 10https://gerrit.wikimedia.org/r/681069 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [13:07:34] (03CR) 10Volans: "@Arzhel, just a thought, couldn't we get those from netbox dynamically maybe?" [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) (owner: 10Hnowlan) [13:07:51] marostegui: I'll create a task for it, not sure why it is happening right now. Should I add you/#dba there? [13:08:11] Urbanecm: I don't think there's anything we can do there :) [13:08:32] okay then :) [13:09:25] thanks for letting me know anyway :) [13:09:31] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 70.17 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [13:10:19] marostegui: the errors should stop appearing now. I'll figure out how to fix it before starting it on other wikis. [13:10:23] PROBLEM - Check systemd state on kubernetes1006 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:10:33] Urbanecm: thanks! 
[13:11:05] (03PS1) 10KartikMistry: Enable ContentTranslation as a default tool for 11 Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681077 (https://phabricator.wikimedia.org/T279422) [13:14:43] Urbanecm: errors stopped [13:15:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1182 in s2 for the first time with minimal weight T275633', diff saved to https://phabricator.wikimedia.org/P15460 and previous config saved to /var/cache/conftool/dbconfig/20210419-131501-marostegui.json [13:15:07] (03PS1) 10Jbond: debmonitor-client: update version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/681078 [13:15:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:12] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:15:29] cool :) [13:16:24] (03CR) 10Ayounsi: "> Patch Set 1:" [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) (owner: 10Hnowlan) [13:17:03] ftr, created T280525 for it [13:17:04] T280525: Wikimedia\Rdbms\Database::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays - https://phabricator.wikimedia.org/T280525 [13:18:51] (03CR) 10Volans: [C: 03+2] "LGTM" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/681078 (owner: 10Jbond) [13:19:02] Urbanecm: thanks, I got notified :) [13:19:09] :) [13:19:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1182 in s2 for the first time with minimal weight T275633', diff saved to https://phabricator.wikimedia.org/P15461 and previous config saved to /var/cache/conftool/dbconfig/20210419-131936-marostegui.json [13:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:19:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:29] PROBLEM - Check that envoy is running on idp-test1001 is CRITICAL: CRITICAL - Expecting active but unit envoyproxy.service is failed https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:21:52] (03Merged) 10jenkins-bot: debmonitor-client: update version [software/debmonitor] - 10https://gerrit.wikimedia.org/r/681078 (owner: 10Jbond) [13:22:26] (03PS4) 10Ayounsi: Merge all system.conf templates in one [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) [13:22:29] (03CR) 10Gehel: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108) (owner: 10Ryan Kemper) [13:22:41] (03CR) 10Ayounsi: "Thanks, all good suggestions." (034 comments) [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) (owner: 10Ayounsi) [13:25:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 5%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15462 and previous config saved to /var/cache/conftool/dbconfig/20210419-132554-root.json [13:26:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:03] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:26:39] 10SRE, 10Wikimedia-Mailing-lists: mail.wikimedia.org doesn't redirect to lists.wikimedia.org - https://phabricator.wikimedia.org/T280473 (10faidon) I killed that domain in 2014 (operations/dns 3a7f472cb3e9bcd03f0492cfdd8c0a2156f448d3). Noone has complained since to my knowledge, and I'd recommend to not reintr... 
[13:27:15] RECOVERY - Check that envoy is running on idp-test1001 is OK: OK - envoyproxy.service is active https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [13:30:24] (03CR) 10Ayounsi: [C: 03+2] Merge all system.conf templates in one [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) (owner: 10Ayounsi) [13:31:33] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1006 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:31:48] (03Merged) 10jenkins-bot: Merge all system.conf templates in one [homer/public] - 10https://gerrit.wikimedia.org/r/679351 (https://phabricator.wikimedia.org/T269345) (owner: 10Ayounsi) [13:37:21] RECOVERY - Check systemd state on kubernetes1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:38:59] (03PS1) 10Ayounsi: Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/681082 [13:39:30] (03PS1) 10Volans: Upstream release v0.2.8 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681083 [13:39:57] (03CR) 10Ayounsi: [C: 03+2] Fix typo [homer/public] - 10https://gerrit.wikimedia.org/r/681082 (owner: 10Ayounsi) [13:40:00] 10ops-eqiad, 10DC-Ops: payments1006.frack.eqiad.wmnet DRAC no console output - https://phabricator.wikimedia.org/T280527 (10Jgreen) [13:40:45] (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681083 (owner: 10Volans) [13:40:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 10%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15463 and previous config saved to /var/cache/conftool/dbconfig/20210419-134057-root.json [13:41:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:41:08] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:42:15] (03PS1) 10Ladsgroup: snapshot: Migrate cronjobs in cirrussearch to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681084 (https://phabricator.wikimedia.org/T273673) [13:42:42] (03CR) 10jerkins-bot: [V: 04-1] snapshot: Migrate cronjobs in cirrussearch to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681084 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [13:44:28] (03PS1) 10Jbond: P:tlsproxy::envoy: Add support to opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681085 [13:44:43] (03PS2) 10Ladsgroup: snapshot: Migrate cronjobs in cirrussearch to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681084 (https://phabricator.wikimedia.org/T273673) [13:45:16] (03PS2) 10Jbond: P:tlsproxy::envoy: Add support to opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681085 (https://phabricator.wikimedia.org/T279804) [13:47:09] (03CR) 10Muehlenhoff: [C: 03+2] Only disable timers for ipmitoo/smartmon timers up to Buster [puppet] - 10https://gerrit.wikimedia.org/r/681033 (https://phabricator.wikimedia.org/T275873) (owner: 10Muehlenhoff) [13:47:57] (03CR) 10Ladsgroup: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/681084 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [13:48:36] (03PS3) 10Jbond: P:tlsproxy::envoy: Add support to opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681085 (https://phabricator.wikimedia.org/T279804) [13:48:38] (03PS1) 10Jbond: idp_test: opt out from FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681089 [13:49:45] (03CR) 10jerkins-bot: [V: 04-1] P:tlsproxy::envoy: Add support to opt out of FLoC 
[puppet] - 10https://gerrit.wikimedia.org/r/681085 (https://phabricator.wikimedia.org/T279804) (owner: 10Jbond) [13:53:33] (03CR) 10Ottomata: [C: 03+1] hadoop: improve default log4j config [puppet] - 10https://gerrit.wikimedia.org/r/680383 (https://phabricator.wikimedia.org/T276906) (owner: 10Elukey) [13:55:48] (03PS2) 10Hnowlan: cr/firewall: allow access to new AQS hosts [homer/public] - 10https://gerrit.wikimedia.org/r/681059 (https://phabricator.wikimedia.org/T280155) [13:55:59] (03PS4) 10Jbond: P:tlsproxy::envoy: Add support to opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681085 (https://phabricator.wikimedia.org/T279804) [13:56:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 15%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15464 and previous config saved to /var/cache/conftool/dbconfig/20210419-135601-root.json [13:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:11] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [13:59:57] 10SRE, 10Wikimedia-Mailing-lists: mail.wikimedia.org doesn't redirect to lists.wikimedia.org - https://phabricator.wikimedia.org/T280473 (10Ladsgroup) Thanks. The rewrite rule for it still exists in exim4. We should remove that I think? 
[14:02:21] (03CR) 10ArielGlenn: [C: 03+2] snapshot: Migrate cronjobs in cirrussearch to systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/681084 (https://phabricator.wikimedia.org/T273673) (owner: 10Ladsgroup) [14:02:49] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1006 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:05:00] (03PS1) 10Ottomata: analytics test - import mediawiki.page-data with camus and refine it [puppet] - 10https://gerrit.wikimedia.org/r/681092 (https://phabricator.wikimedia.org/T273789) [14:05:45] (03CR) 10Jbond: "FYI" (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/681082 (owner: 10Ayounsi) [14:07:04] (03PS2) 10Ottomata: analytics test - import mediawiki.page-delete with camus and refine it [puppet] - 10https://gerrit.wikimedia.org/r/681092 (https://phabricator.wikimedia.org/T273789) [14:07:22] (03PS1) 10Ppchelko: Factor out rollback logic from WikiPage [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680857 [14:07:37] (03CR) 10Ottomata: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29118/console" [puppet] - 10https://gerrit.wikimedia.org/r/681092 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:08:26] (03PS3) 10Ottomata: analytics test - import mediawiki.page-delete with camus and refine it [puppet] - 10https://gerrit.wikimedia.org/r/681092 (https://phabricator.wikimedia.org/T273789) [14:11:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 20%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15465 and previous config saved to /var/cache/conftool/dbconfig/20210419-141105-root.json [14:11:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:14] T275633: Productionize db21[45-52] and db11[76-84] - 
https://phabricator.wikimedia.org/T275633 [14:12:30] (03CR) 10Jbond: [C: 03+2] P:tlsproxy::envoy: Add support to opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681085 (https://phabricator.wikimedia.org/T279804) (owner: 10Jbond) [14:13:49] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:17] (03CR) 10Ottomata: [C: 03+2] analytics test - import mediawiki.page-delete with camus and refine it [puppet] - 10https://gerrit.wikimedia.org/r/681092 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:14:23] (03CR) 10Jbond: [C: 03+2] idp_test: opt out from FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681089 (owner: 10Jbond) [14:15:45] (03CR) 10Muehlenhoff: sudo: add new flag purge_sudoeres_d (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681026 (owner: 10Jbond) [14:17:44] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. I also cherrypicked the patch locally to sretest1002 (which is running Bullseye and thus has no Python 2 anyway) and tha" [puppet] - 10https://gerrit.wikimedia.org/r/670972 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [14:23:53] (03PS1) 10Jbond: envoy: fix yaml syntax [puppet] - 10https://gerrit.wikimedia.org/r/681096 [14:24:42] (03CR) 10Volans: [C: 03+2] Upstream release v0.2.8 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681083 (owner: 10Volans) [14:25:11] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10MarkTraceur) Hello! Sorry about the lag, I don't have good notification settings for Phabricator and I was on vacation last week when I was notified outside of Phabricator. Request approved a... 
[14:25:24] (03CR) 10Jbond: [C: 03+2] envoy: fix yaml syntax [puppet] - 10https://gerrit.wikimedia.org/r/681096 (owner: 10Jbond) [14:26:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 30%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15466 and previous config saved to /var/cache/conftool/dbconfig/20210419-142608-root.json [14:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:18] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [14:27:00] (03Merged) 10jenkins-bot: Upstream release v0.2.8 [software/debmonitor] (debian) - 10https://gerrit.wikimedia.org/r/681083 (owner: 10Volans) [14:29:40] !log imported envoyproxy_1.16.3-1 debs to envoy-future component [14:29:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:54] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [14:30:53] (03PS1) 10Jbond: envoy: fix yaml [puppet] - 10https://gerrit.wikimedia.org/r/681097 [14:32:27] (03CR) 10Jbond: [C: 03+2] envoy: fix yaml [puppet] - 10https://gerrit.wikimedia.org/r/681097 (owner: 10Jbond) [14:40:11] (03PS1) 10Jbond: envoy: fix header block in both sections [puppet] - 10https://gerrit.wikimedia.org/r/681098 [14:40:52] (03PS2) 10Jbond: sudo: add new flag purge_sudoeres_d [puppet] - 10https://gerrit.wikimedia.org/r/681026 [14:41:01] (03CR) 10Jbond: "updated thanks" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681026 (owner: 10Jbond) [14:41:01] !log uploaded debmonitor-client 0.2.8 to apt.w.o for jessie, stretch, buster, bullseye 
[14:41:05] jbond42: ^^^ [14:41:08] thanks [14:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 40%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15467 and previous config saved to /var/cache/conftool/dbconfig/20210419-144112-root.json [14:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:21] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [14:41:24] (03CR) 10Jbond: [C: 03+2] envoy: fix header block in both sections [puppet] - 10https://gerrit.wikimedia.org/r/681098 (owner: 10Jbond) [14:42:32] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] wmcs: Add link to runbook on puppet alerts. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680254 (owner: 10David Caro) [14:43:10] jouncebot: now [14:43:10] No deployments scheduled for the next 2 hour(s) and 16 minute(s) [14:43:11] jouncebot: next [14:43:12] In 2 hour(s) and 16 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1700) [14:44:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1086 T278229', diff saved to https://phabricator.wikimedia.org/P15468 and previous config saved to /var/cache/conftool/dbconfig/20210419-144422-marostegui.json [14:44:27] (03PS3) 10Reedy: PoolCounter: Use namespaced Client class name, not deprecated name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674465 [14:44:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:32] T278229: decommission db1086.eqiad.wmnet - https://phabricator.wikimedia.org/T278229 [14:45:31] (03PS1) 10Marostegui: db1086: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681100 (https://phabricator.wikimedia.org/T278229) [14:46:12] (03CR) 10Reedy: [C: 03+2] PoolCounter: Use 
namespaced Client class name, not deprecated name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674465 (owner: 10Reedy) [14:46:30] (03CR) 10Marostegui: [C: 03+2] db1086: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/681100 (https://phabricator.wikimedia.org/T278229) (owner: 10Marostegui) [14:46:36] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10RobH) [14:46:58] 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (10RobH) [14:47:19] (03Merged) 10jenkins-bot: PoolCounter: Use namespaced Client class name, not deprecated name [mediawiki-config] - 10https://gerrit.wikimedia.org/r/674465 (owner: 10Reedy) [14:48:31] (03PS9) 10Reedy: Update RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680811 (https://phabricator.wikimedia.org/T180192) [14:48:59] !log reedy@deploy1002 Synchronized wmf-config/PoolCounterSettings.php: Use namespaced PoolCounter Client (duration: 00m 57s) [14:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:20] (03CR) 10Reedy: [C: 03+2] Update RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680811 (https://phabricator.wikimedia.org/T180192) (owner: 10Reedy) [14:49:39] (03CR) 10Jcrespo: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:50:14] (03PS3) 10Jbond: sudo: add new flag purge_sudoeres_d [puppet] - 10https://gerrit.wikimedia.org/r/681026 [14:50:24] (03PS1) 10Andrew Bogott: OpenStack Keystone: Remove some unneeded policy rules [puppet] - 10https://gerrit.wikimedia.org/r/681101 (https://phabricator.wikimedia.org/T276018) [14:50:31] (03Merged) 10jenkins-bot: Update RelatedArticles config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680811 (https://phabricator.wikimedia.org/T180192) (owner: 10Reedy) [14:50:43] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Initial setup for the swift media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [14:51:56] !log reedy@deploy1002 Synchronized wmf-config/CommonSettings.php: Remove RelatedArticles extension function and wmg to wg mapping (duration: 00m 56s) [14:52:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:21] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Keystone: Remove some unneeded policy rules [puppet] - 10https://gerrit.wikimedia.org/r/681101 (https://phabricator.wikimedia.org/T276018) (owner: 10Andrew Bogott) [14:53:03] !log update debmonitor-client - T280484 [14:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:12] T280484: debmonitor-client.service stays in failed state in case of server errors - https://phabricator.wikimedia.org/T280484 [14:53:39] !log reedy@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Rename RelatedArticles wmg variables to wg (duration: 00m 56s) [14:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15469 and 
previous config saved to /var/cache/conftool/dbconfig/20210419-145616-root.json [14:56:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:25] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [14:57:57] (03PS3) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 [14:58:23] (03PS4) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 (https://phabricator.wikimedia.org/T273789) [14:58:31] (03PS1) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [14:58:34] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:58:53] (03PS5) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 (https://phabricator.wikimedia.org/T273789) [14:59:04] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [14:59:19] (03CR) 10Svantje Lilienthal: [C: 03+1] [beta] Enable changes to the descriptions in the VE transclusion dialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681011 (https://phabricator.wikimedia.org/T273425) (owner: 10WMDE-Fisch) [14:59:44] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 133974856 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:59:58] (03PS2) 10Jcrespo: mariadb: Setup 2 new host as temporary 
metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [15:01:02] (03CR) 10Jcrespo: "@marostegui, I am completely at your will to do any modification you prefer, from section name, to role/profiles used, etc." [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:01:16] (03PS1) 10Urbanecm: DatabaseMentorStore: Fix deprecation warning in upsert query [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680860 (https://phabricator.wikimedia.org/T280525) [15:01:30] (03PS1) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) [15:01:44] (03CR) 10jerkins-bot: [V: 04-1] test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/681105 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:02:07] (03Abandoned) 10Ottomata: test/refine_sanitized - add a general purpose event_sanitized job [puppet] - 10https://gerrit.wikimedia.org/r/679961 (https://phabricator.wikimedia.org/T273789) (owner: 10Ottomata) [15:02:08] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [15:03:00] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01234 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:04:42] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/681026 (owner: 10Jbond) [15:04:44] (03CR) 10David Caro: wmcs: Add link to runbook on puppet alerts. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/680254 (owner: 10David Caro) [15:05:53] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (10RobH) [15:06:12] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (10RobH) a:03Papaul [15:06:40] 10ops-codfw, 10DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (10RobH) [15:07:38] PROBLEM - DPKG on conf2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:09:10] PROBLEM - DPKG on conf2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:09:21] (03PS13) 10Jcrespo: mediabackup: Initial setup for the swift media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [15:10:49] volans: the dpkg failures on conf* are caused by systemd-sysusers in the postinst (it doesn't exist on jessie) [15:11:00] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Initial setup for the swift media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:11:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 60%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15470 and previous config saved to /var/cache/conftool/dbconfig/20210419-151119-root.json [15:11:26] moritzm: ah... I've just rebuilt the existing, was it already incompatible with jessie? 
[15:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:30] let's simply stick with the old debmonitor release on the remaining five hosts, they don't need the new features from 0.2.8 [15:11:31] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:11:31] moritzm: is there a way to force it? [15:11:37] sorry about that... can we just revert it to the previous version? [15:11:42] ack ill downgrade them [15:11:59] but is also failing on other hosts [15:12:28] with apt could not get lock [15:12:32] AFAICT [15:12:32] volans: i only see failures on the two conf servers for now where do you see other issues [15:12:32] volans: kinda it was incompatible on the source level, I held back making an update until jessie's were gone, but then forgot about this when you prepared the new release, sorry for that [15:12:43] jbond42: https://puppetboard.wikimedia.org/nodes?status=failed [15:12:55] and the widespread icinga alert above [15:13:19] (03PS1) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) [15:13:31] jbond42: you mean force the postinst run that is failing? 
I don't think so, it uses "set -e" IIRC [15:13:46] ack [15:14:01] (03CR) 10jerkins-bot: [V: 04-1] prometheus: allow using the --storage.tsdb.retention.size option [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [15:14:04] (03CR) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [15:15:17] (03PS2) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) [15:15:38] (03CR) 10David Caro: prometheus: allow using the --storage.tsdb.retention.size option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [15:16:06] (03PS14) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [15:16:14] not sure about the icinga warnings on the other hosts, I just re-ran puppet on an-worker1121 and it ran just fine [15:16:41] volans: i think the puppet runs may have just been unfortunate and ran at the same time as debdeploy [15:16:53] yeah, that seems likely [15:16:56] spotchecking individual ones works, ill run puppet on failed ones now [15:17:16] ok, seemed a bit too many for being just a race with debdeploy [15:17:33] sorry about the jessie one, is there an easy way to tell reprepro to use the old version [15:17:37] and discard the new one? [15:18:14] i dont think so, i was hoping the old deb would still be on deneb [15:18:33] moritzm: you have any ideas? or is it easier to just rebuild? 
[15:18:43] I can rebuild if that helps [15:19:07] I don't have the previous one as I didn't build it [15:19:55] volans: if you could kick off a rebuild of the old release that would be useful 0.2.0 was the version that was installed [15:20:38] jbond42: on mwlog1001 there is debmonitor-client_0.2.0-1+deb8u1_all.deb [15:20:51] in /var/cache/apt/archives [15:20:58] let me upload that one [15:21:19] volans: ack thanks [15:21:34] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005875 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:21:55] yeah, apt/archives seems easiest [15:22:27] * volans doing [15:23:34] thanks volans, ill stop :) [15:24:18] !log reverted debmonitor-client to 0.2.0-1 on apt.w.o for jessie-wikimedia [15:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:11] jbond42: package is in apt [15:25:17] now you have to force it as it's older [15:25:19] volans: thanks [15:25:41] np, was my fault, I had forgot it was already incompatible with jessie [15:26:20] had to remove and includedeb, because reprepro wasn't allowing me to add an older version [15:26:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 70%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15471 and previous config saved to /var/cache/conftool/dbconfig/20210419-152623-root.json [15:26:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:33] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:27:16] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] wmcs: Add link to runbook on puppet alerts. 
[puppet] - 10https://gerrit.wikimedia.org/r/680254 (owner: 10David Caro) [15:28:00] ok sorted now thanks [15:28:34] confirmed, apt-get update working fine on mwlog1001 [15:31:35] yeah, all my smoke tests were also fine \o/ [15:31:48] \o/ [15:38:24] RECOVERY - DPKG on conf2002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:40:04] RECOVERY - DPKG on conf2001 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [15:41:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 80%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15472 and previous config saved to /var/cache/conftool/dbconfig/20210419-154127-root.json [15:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:44] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:43:33] 10SRE, 10Analytics-Clusters, 10User-Elukey: Manage Hue via systemd unit - https://phabricator.wikimedia.org/T206484 (10Ottomata) 05Open→03Resolved a:03Ottomata [15:43:57] (03PS1) 10Muehlenhoff: Install cumin2002 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/681116 [15:44:26] (03PS2) 10Muehlenhoff: Install cumin2002 with bullseye [puppet] - 10https://gerrit.wikimedia.org/r/681116 [15:46:31] (03PS15) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [15:46:33] (03PS3) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [15:46:35] (03PS1) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [15:48:27] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Setup the storage hosts [puppet] - 
10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:49:36] (03CR) 10Jcrespo: "Adding myself to CC as I had initially assumed only one cumin host would be active per site. There is nothing preventing from having multi" [puppet] - 10https://gerrit.wikimedia.org/r/681116 (owner: 10Muehlenhoff) [15:55:54] (03CR) 10Effie Mouzeli: [C: 03+1] conftool: Create a shared jobrunner_videoscaler [puppet] - 10https://gerrit.wikimedia.org/r/679258 (https://phabricator.wikimedia.org/T279100) (owner: 10Alexandros Kosiaris) [15:55:57] (03PS2) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [15:56:03] (03CR) 10Effie Mouzeli: [C: 03+1] conftool: fix TODO by adding 2 dedicated codfw videoscalers [puppet] - 10https://gerrit.wikimedia.org/r/679432 (https://phabricator.wikimedia.org/T279100) (owner: 10Dzahn) [15:56:31] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 90%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15473 and previous config saved to /var/cache/conftool/dbconfig/20210419-155631-root.json [15:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:40] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [15:58:25] (03CR) 10jerkins-bot: [V: 04-1] mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) (owner: 10Jcrespo) [15:58:27] (03PS16) 10Jcrespo: mediabackup: Initial setup for the media backup worker hosts [puppet] - 10https://gerrit.wikimedia.org/r/668380 (https://phabricator.wikimedia.org/T276442) [15:58:41] 10SRE, 10ops-eqiad, 10Analytics-Radar: Try to move some new analytics worker nodes to different racks - https://phabricator.wikimedia.org/T276239 (10Ottomata) Hiya, @Cmjohnson any news on 
this? {T275767} is blocked on this task. [16:07:34] 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10akosiaris) Hi! Adopting the new functionality in networkpolicy resources has indeed created some tech debt. It's a tech debt we created on purp... [16:08:13] 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10Ottomata) <3 [16:11:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Slowly pool db1182 for the first time in s2 T275633', diff saved to https://phabricator.wikimedia.org/P15474 and previous config saved to /var/cache/conftool/dbconfig/20210419-161134-root.json [16:11:38] PROBLEM - Check systemd state on deneb is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-releng-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:11:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:44] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [16:15:15] (03CR) 10Jbond: [C: 03+1] "LGTM some optional comments inline" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/681107 (https://phabricator.wikimedia.org/T280530) (owner: 10David Caro) [16:18:16] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37 [16:22:48] 10SRE, 10Analytics, 10Event-Platform, 10serviceops, 10Patch-For-Review: DRY kafka broker declaration in helmfiles - 
https://phabricator.wikimedia.org/T253058 (10akosiaris) a:03akosiaris [16:25:03] !log Updated the Wikidata property suggester with data from the 2021-04-12 JSON dump (with pre-applied T132839 workarounds) [16:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:15] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [16:34:52] (03PS10) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405 [16:41:01] (03PS1) 10Ottomata: refine_job - RefineMonitor should also use use_keytab param [puppet] - 10https://gerrit.wikimedia.org/r/681129 [16:45:03] (03CR) 10Ottomata: [C: 03+2] "This is already using kerberos by default since analytics' principal is authed on an-launcher1002 by other jobs." [puppet] - 10https://gerrit.wikimedia.org/r/681129 (owner: 10Ottomata) [16:48:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1001.eqiad.wmnet [16:48:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:56] (03PS1) 10Jbond: hiera - idp: opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681130 (https://phabricator.wikimedia.org/T279804) [16:55:19] (03PS1) 10Ottomata: test/refine - suffix job names with _test for easier identification in alerts [puppet] - 10https://gerrit.wikimedia.org/r/681131 [16:55:42] (03CR) 10Jbond: [C: 03+2] hiera - idp: opt out of FLoC [puppet] - 10https://gerrit.wikimedia.org/r/681130 (https://phabricator.wikimedia.org/T279804) (owner: 10Jbond) [16:56:06] (03PS1) 10Ppchelko: [EventBus] Make eventage-main timeout consistent with envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681132 (https://phabricator.wikimedia.org/T249745) [16:56:48] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet last ran 3 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [16:56:59] !log jiji@cumin1001 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host thumbor1001.eqiad.wmnet [16:57:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:26] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1002.eqiad.wmnet [16:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:16] RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:00:05] ryankemper: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1700). [17:02:14] (03PS2) 10Ppchelko: Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) [17:04:05] (03CR) 10Alexandros Kosiaris: [C: 03+1] [EventBus] Make eventage-main timeout consistent with envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681132 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [17:05:09] (03PS1) 10Bartosz Dziewoński: Remove
tags around headings for compat with MobileFrontend [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681146 (https://phabricator.wikimedia.org/T280433) [17:05:44] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [17:08:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1002.eqiad.wmnet [17:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:38] (03PS1) 10Hoo man: Revert "Set wgPageImagesAPIDefaultLicense to 'any' for wikidata" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681137 [17:09:43] (03PS4) 10Jcrespo: mariadb: Setup 2 new host as temporary metadata database for media backups [puppet] - 10https://gerrit.wikimedia.org/r/681103 (https://phabricator.wikimedia.org/T276442) [17:09:45] (03PS3) 10Jcrespo: mediabackup: Setup the storage hosts [puppet] - 10https://gerrit.wikimedia.org/r/681117 (https://phabricator.wikimedia.org/T276442) [17:11:03] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1003.eqiad.wmnet [17:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:16] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1003.eqiad.wmnet [17:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:20:57] (03PS11) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405 [17:21:40] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [17:23:05] 10SRE, 10Wikimedia-Mailing-lists: mail.wikimedia.org doesn't redirect to lists.wikimedia.org - 
https://phabricator.wikimedia.org/T280473 (10Legoktm) >>! In T280473#7014238, @Ladsgroup wrote: > Thanks. The rewrite rule for it still exists in exim4. We should remove that I think? That's what I filed {T280472} f... [17:23:28] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1004.eqiad.wmnet [17:23:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:29] (03CR) 10Elukey: "@Razzi: let's create a specific recipe with the mdx stuff added, we can tune it later on if needed.." [puppet] - 10https://gerrit.wikimedia.org/r/679607 (https://phabricator.wikimedia.org/T278421) (owner: 10Razzi) [17:30:13] (03CR) 10Ottomata: [C: 03+2] test/refine - suffix job names with _test for easier identification in alerts [puppet] - 10https://gerrit.wikimedia.org/r/681131 (owner: 10Ottomata) [17:30:19] (03PS12) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405 [17:34:07] (03PS13) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405 [17:37:46] jouncebot: next [17:37:47] In 0 hour(s) and 22 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1800) [17:38:04] (03CR) 10Urbanecm: [C: 03+2] "preparing for B&C" [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680860 (https://phabricator.wikimedia.org/T280525) (owner: 10Urbanecm) [17:40:08] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1004.eqiad.wmnet [17:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:51] (03PS14) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405 [17:47:24] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2001.codfw.wmnet [17:47:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:10] MatmaRex: will you be around in ~10 minutes? 
If so, I can +2 your backport now, so it takes less time to merge :) [17:51:24] yeah, i'm here [17:51:26] col [17:51:28] *cool [17:51:30] (03CR) 10Urbanecm: [C: 03+2] Remove
tags around headings for compat with MobileFrontend [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681146 (https://phabricator.wikimedia.org/T280433) (owner: 10Bartosz Dziewoński) [17:52:13] (03PS3) 10Dzahn: trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - 10https://gerrit.wikimedia.org/r/680393 (https://phabricator.wikimedia.org/T267248) [17:52:27] (03Merged) 10jenkins-bot: DatabaseMentorStore: Fix deprecation warning in upsert query [extensions/GrowthExperiments] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680860 (https://phabricator.wikimedia.org/T280525) (owner: 10Urbanecm) [17:59:20] (03CR) 10Dzahn: [C: 03+2] trafficserver: remove mwdebug1003 from x-wikimedia-debug-routing [puppet] - 10https://gerrit.wikimedia.org/r/680393 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [18:00:05] RoanKattouw, Niharika, and Urbanecm: That opportune time is upon us again. Time for a Morning backport window deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T1800). [18:00:05] Pchelolo, Urbanecm, and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:26] I'll deploy my own, I'll go last if you don't mind. [18:00:31] i can deploy today [18:00:35] Pchelolo: ack, I'll ping you when ready [18:00:35] please ping once you're done Urbanecm [18:00:38] will do [18:00:38] thank you [18:00:41] haha [18:01:29] syncing my own first [18:01:49] Pchelolo: if you want, feel free to +2 your backport now, to give CI time to process it [18:01:58] (03Merged) 10jenkins-bot: Remove
tags around headings for compat with MobileFrontend [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/681146 (https://phabricator.wikimedia.org/T280433) (owner: 10Bartosz Dziewoński) [18:02:05] Urbanecm: fyi, no more mwdebug1003, no more stretch testing [18:02:08] oh good idea [18:02:13] mutante: acknowledged, thanks :) [18:02:19] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/GrowthExperiments/includes/Mentorship/Store/DatabaseMentorStore.php: 0233507470377f6ac45768e345cd2e359e5d0e57: DatabaseMentorStore: Fix deprecation warning in upsert query (T280525) (duration: 00m 57s) [18:02:20] *nod*, cool [18:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:29] T280525: Wikimedia\Rdbms\Database::normalizeUpsertKeys called with deprecated parameter style: the unique key array should be a string or array of string arrays - https://phabricator.wikimedia.org/T280525 [18:02:34] mutante: it's still in the browser plugin, is that expected?
[18:02:46] (03CR) 10Ppchelko: [C: 03+2] Factor out rollback logic from WikiPage [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680857 (owner: 10Ppchelko) [18:02:59] Urbanecm: merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/680393 a minute ago [18:03:19] oh, cool, didn't know this is puppet maintained too :) [18:03:19] that should mean in browser plugin it cant work anymore now [18:03:25] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2001.codfw.wmnet [18:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:38] MatmaRex: please test yours on mwdebug1001 and lmk :) [18:03:46] or I am missing that there is a second patch needed in another repo for the plugin [18:06:30] looking [18:06:33] thanks [18:06:34] (sorry) [18:07:16] np [18:08:46] (03PS15) 10Ahmon Dancy: WIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/680405 [18:09:26] Urbanecm: i'm not seeing the expected effect, are you sure it's deployed? [18:09:35] MatmaRex: double checking [18:11:27] MatmaRex: sorry, i forgot to scap pull :/. Can you test again? [18:12:24] Urbanecm: aha, looks good now! [18:12:29] cool, syncing! [18:13:57] !log urbanecm@deploy1002 Synchronized php-1.37.0-wmf.1/extensions/DiscussionTools/: 66d137b75a7073c7162c443cc8c6ec6f3be714e0: Remove
tags around headings for compat with MobileFrontend (T280433) (duration: 00m 59s) [18:14:02] and done [18:14:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:06] T280433: DiscussionTools makes talk page sections uncollapsible on mobile - https://phabricator.wikimedia.org/T280433 [18:14:09] (03PS1) 10Krinkle: [Beta Cluster] mediawiki: Remove "commons.wikipedia" redirect [puppet] - 10https://gerrit.wikimedia.org/r/681142 [18:14:10] Pchelolo: floor is yours :) [18:14:15] thank you Urbanecm [18:14:46] as a more experienced deployer, do you think it's ok to run scap sync-file on entire /includes of core? [18:14:54] or should I do scap sync-world [18:15:22] I'm trying out new suggestion by thcipriani to deploy large patches as backports [18:15:29] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, 10Service-deployment-requests: [Draft] New service request: WDQ Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) [18:15:33] Urbanecm: seems the WikimediaDebug repo is github [18:16:20] Pchelolo: in theory it should work, and it's the most you can do with scap sync-file [18:16:48] (you can't sync whole php-1.37.0-wmf.1, but includes should work) [18:16:55] (03CR) 10Ppchelko: [C: 03+2] [EventBus] Make eventage-main timeout consistent with envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681132 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [18:16:58] ^ it may take a bit, it'll run a php -l on all the php files [18:17:03] yeah [18:17:14] might be slower, but it should work, AFAIK [18:17:17] not that sync-world *won't* take a bit :) [18:17:27] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 2 others: [Draft] New service request: WDQ Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10Gehel) [18:17:34] sync-world would definitely be slower :D [18:17:35] Urbanecm: aha, so first it was "Added mwdebug1003 to the list of servers (Effie Mouzeli)"
but then it was "List of debug servers is now fetched from noc.wikimedia.org (Gilles Dubuc)" [18:17:36] so, thcipriani, what's the right process to execute your suggestion [18:17:44] (03Merged) 10jenkins-bot: [EventBus] Make eventage-main timeout consistent with envoy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681132 (https://phabricator.wikimedia.org/T249745) (owner: 10Ppchelko) [18:18:01] I have a huuuuge backport, do I do sync-file for /includes and for /tests, or do I do sync-world? [18:18:28] I think that's what I'd do: sync-file for includes and sync-file for tests [18:18:35] +1 [18:18:43] oki. thank you. [18:19:02] I'll give you some feedback on the entire experience soon [18:19:54] nice, thanks for trying the experiment [18:21:20] !log ppchelko@deploy1002 Synchronized wmf-config/CommonSettings.php: T249745 [EventBus] Make eventage-main timeout consistent with envoy (duration: 00m 56s) [18:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:30] T249745: Could not enqueue jobs: "Unable to deliver all events: 503: Service Unavailable" - https://phabricator.wikimedia.org/T249745 [18:23:14] (03PS1) 10Dzahn: remove mwdebug1003 from list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681144 (https://phabricator.wikimedia.org/T267248) [18:23:20] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/681144" [puppet] - 10https://gerrit.wikimedia.org/r/680393 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [18:23:24] (03CR) 10Ppchelko: [C: 03+2] Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [18:26:14] (03Merged) 10jenkins-bot: Factor out rollback logic from WikiPage [core] (wmf/1.37.0-wmf.1) - 10https://gerrit.wikimedia.org/r/680857 (owner: 10Ppchelko) [18:26:27] (03CR) 10Dzahn: [C: 03+1] [Beta Cluster] mediawiki: Remove 
"commons.wikipedia" redirect [puppet] - 10https://gerrit.wikimedia.org/r/681142 (owner: 10Krinkle) [18:27:22] (03PS3) 10Ppchelko: Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) [18:27:29] (03CR) 10Ppchelko: Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [18:27:33] (03CR) 10Ppchelko: [C: 03+2] Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [18:27:41] (03PS2) 10Dzahn: DHCP: remove mw1261 through mw1301 [puppet] - 10https://gerrit.wikimedia.org/r/679958 (https://phabricator.wikimedia.org/T280203) [18:28:20] (03Merged) 10jenkins-bot: Math: Enable RESTBase-less Wikidata math validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/679357 (https://phabricator.wikimedia.org/T274436) (owner: 10Ppchelko) [18:30:21] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) [18:30:26] 10SRE, 10SRE-Access-Requests: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) a:05MarkTraceur→03None Thanks Mark! ACK [18:34:15] (03PS1) 10Dzahn: admin: add mlitn to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/681167 (https://phabricator.wikimedia.org/T274749) [18:34:46] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2002.codfw.wmnet [18:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:01] (03CR) 10Dzahn: "Hi Alex, https://phabricator.wikimedia.org/T274749 has been unblocked. I think this should be all that is needed. 
Assigning to you as clin" [puppet] - 10https://gerrit.wikimedia.org/r/681167 (https://phabricator.wikimedia.org/T274749) (owner: 10Dzahn) [18:38:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) a:03akosiaris Assigning to Αλέξανδρος as clinic duty to review/confirm. Uploaded change above, I think that is all that is needed because Matthias already has ex... [18:38:30] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to stat boxes for mlitn - https://phabricator.wikimedia.org/T274749 (10Dzahn) [18:39:36] !log ppchelko@deploy1002 Synchronized wmf-config/CommonSettings.php: T274436 Math: Enable RESTBase-less Wikidata math validation (duration: 00m 56s) [18:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:45] T274436: Enable RESTbaseless validation in wikibase - https://phabricator.wikimedia.org/T274436 [18:40:07] (03CR) 10Ahmon Dancy: "nice work Joe." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/670220 (https://phabricator.wikimedia.org/T265327) (owner: 10Giuseppe Lavagetto) [18:42:06] 10SRE, 10Performance-Team, 10Platform Engineering, 10Goal: Decommission the "session redis" cluster - https://phabricator.wikimedia.org/T243520 (10jijiki) [18:42:41] 10SRE, 10serviceops, 10Performance-Team (Radar): Get rid of nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10jijiki) [18:42:46] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [18:42:49] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [18:43:29] 10SRE, 10Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10Dzahn) The linked ticket was back in jessie, but that's accurate. From a random appserver on buster nowadays: ` [mw2300:~] $ dpkg -l | grep cjk ii fonts-noto-cjk 1:20... 
[18:44:02] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2002.codfw.wmnet [18:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:45:12] 10SRE, 10Wikimedia-SVG-rendering: Adding new font for CJK media display - https://phabricator.wikimedia.org/T280432 (10Dzahn) This is https://packages.debian.org/buster/fonts-noto-cjk so maybe it would have to be an upstream bug against that package cc: @Muehlenhoff [18:47:03] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: cluster=thumbor,name=thumbor2001.codfw.wmnet [18:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2003.codfw.wmnet [18:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:36] !log ppchelko@deploy1002 Synchronized php-1.37.0-wmf.1/includes/: Factor out rollback logic from WikiPage - /includes (duration: 01m 01s) [18:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:28] 10SRE, 10serviceops, 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [18:52:48] 10SRE, 10serviceops, 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) [18:52:51] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [18:52:57] (03CR) 10Krinkle: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/654330 (owner: 10Aaron Schulz) [18:53:35] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster from the memcached cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [18:54:26] 10SRE, 10serviceops, 10Performance-Team 
(Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10Krinkle) [18:55:04] !log ppchelko@deploy1002 Synchronized php-1.37.0-wmf.1/maintenance: Factor out rollback logic from WikiPage - /maintenance (duration: 00m 57s) [18:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:38] !log ppchelko@deploy1002 Synchronized php-1.37.0-wmf.1/tests: Factor out rollback logic from WikiPage - /tests (duration: 00m 59s) [18:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:51] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 3 others: Move MainStash out of Redis to a simpler multi-dc aware solution - https://phabricator.wikimedia.org/T212129 (10Krinkle) [18:57:33] thcipriani: ok, so doing sync-file was an enormous mistake [18:57:46] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2003.codfw.wmnet [18:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:46] * Urbanecm is curious to hear why [18:59:00] it syncs stuff in random order and it doesn't seem to depool servers??? 
[18:59:12] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster and away from the memcached cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [18:59:16] Pchelolo: that happens with any syncing methods [18:59:24] so we get https://logstash.wikimedia.org/app/dashboards#/view/0a9ecdc0-b6dc-11e8-9d8f-dbc23b470465?_g=h@c823129&_a=h@265790c [18:59:27] if you need to sync files in particular order, you need to sync-file them individually [19:00:04] if there is no good order of syncs that would avoid this happening, then the patch cannot be backported, and must be changed in order to have a clear order in which it won't create an error spike [19:00:09] so I've created 225 production errors in the process [19:00:30] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10Krinkle) Note that deployment of T113916 (migrate module dep store from a MW core DB table written to during GET requests, to instead use Main Stash) was halted becau... [19:00:45] ok, I thought that it depools the servers while doing it [19:01:44] sorry about this... this was an experiment, and I guess it's a failed one. [19:03:51] Pchelolo: this may be a weird suggestion but perhaps you would want to sit in on a deployment (backports/configs) training sometime? this is exactly the sort of thing we should be better about spelling out, and making sure everyone knows how it works [19:04:32] yup. would be nice to know exactly what's happening, especially in some unusual cases [19:04:53] at worst it turns out you already know everything and can help train others, in a better case you learn something AND you also can later train others!
[19:04:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2004.codfw.wmnet [19:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:24] if you are still working I will shut up, but poke me or others later about this if you like. [19:05:47] nono, I'm done shooting myself in a foot [19:06:26] very successfully shot the foot off [19:06:34] 10SRE, 10serviceops: Move "redis_sessions" to "redis_misc" cluster - https://phabricator.wikimedia.org/T280586 (10jijiki) [19:07:55] 10SRE, 10serviceops: Move "redis_sessions" to "redis_misc" cluster - https://phabricator.wikimedia.org/T280586 (10jijiki) p:05Triage→03Medium [19:08:12] 10SRE, 10Services, 10Wikidata, 10Wikidata-Query-Service, and 2 others: [Draft] New service request: WDQS Flink based Streaming Updater - https://phabricator.wikimedia.org/T280579 (10MPhamWMF) [19:08:27] 10SRE, 10serviceops: Move "redis_sessions" to "redis_misc" cluster - https://phabricator.wikimedia.org/T280586 (10jijiki) [19:08:31] 10SRE, 10Platform Engineering, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Phasing out "redis_sessions" MediaWiki cluster and away from the memcached cluster - https://phabricator.wikimedia.org/T267581 (10jijiki) [19:09:12] there are no trainings this week because of the 22 holiday, but there is an eu one next week and a us-friendly one next week (thurs/fri, you can find them on Tyler's calendar) [19:09:34] choose either of those that week or a later week and just show up :-) [19:09:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Dwisehaupt) Re-adding the ops tags to get payments1006 back on the radar for the console redirection. 
[19:10:01] 10SRE, 10serviceops, 10Performance-Team (Radar): Phase out nutcracker for connecting to redis - https://phabricator.wikimedia.org/T277183 (10jijiki) [19:13:34] 10SRE, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Dwisehaupt) Backing that out and removing the ops tags as @Jgreen created T280527 this morning specifically about the console issue. [19:14:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: payments1006.frack.eqiad.wmnet DRAC no console output - https://phabricator.wikimedia.org/T280527 (10Dwisehaupt) [19:15:04] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2004.codfw.wmnet [19:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:26] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10jijiki) >>! In T280582#7015884, @Krinkle wrote: > Note that deployment of T113916 was halted because the Redis capacity was actually considered too small. (That task... [19:24:29] apergos: yeah, that was my plan, I cancelled since it was a "global holiday" [19:24:46] yep makes perfect sense [19:25:06] I wish I knew a better way to evangelize these [19:25:58] we could maybe mark them on the deployment calendar itself ("Training!!" and a link to the training page, which probably should have a longer introduction ... mmm... would it be ok if I tweak the intro text on that page since there isn't much?) [19:28:33] yeah, tweaking that page would be great for anything that's missing [19:29:14] adding it to my todo list!
[19:29:25] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:29:27] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:29:53] apergos: <3 thanks! any improvements you can think of for the process, I'm open :) [19:30:18] marketing marketing marketing! :-D [19:30:33] indeed [19:35:21] you know what would also be really cool for deployments - if we had a google calendar synced with deployment calendar [19:38:41] apparently there is a way to subscribe in google calendar [19:39:04] https://wikitech.wikimedia.org/wiki/Deployments I've not tried it but there's a little box there about it, after the intro [19:40:11] gosh today's been so full of learning - thank you apergos [19:46:39] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:46:41] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:54:36] 10SRE, 10serviceops, 10Performance-Team (Radar), 10User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (10Reedy) [19:56:47] !log repool wdqs1005 [19:56:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:04] chrisalbon and accraze: May I have your attention please! Services – Graphoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T2000) [20:12:34] happy to help! 
if you try the calendar subscription, let me know what happens, Pchelolo [20:39:51] 10SRE, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [20:41:40] 10SRE, 10serviceops, 10Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (10Dzahn) [20:48:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 238, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:49:45] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:00:04] Reedy and sbassett: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T2100). [21:00:13] Hey all - I've got one sec patch to deploy for T280226. Let me know if I shouldn't. [21:02:45] sbassett: I dont think anything else is going on [21:02:48] jouncebot: now [21:02:49] For the next 0 hour(s) and 57 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T2000) [21:02:49] For the next 1 hour(s) and 57 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T2100) [21:03:38] graphoid/ores probably not using their window [21:03:58] !log Deployed security patch for T280226 [21:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:51] mutante: yes, it's officially the weekly security deployment window now. 
I just always double-check in here to ensure nothing has run over or there isn't an emergency or something :) [21:05:09] sbassett: everything seemed quiet, go ahead [21:05:40] ok, cool! and thanks for asking [21:16:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:30:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudcephosd1018.wikimedi... [21:31:46] (03PS3) 10Legoktm: exim: Add support for handling mailman3 inside mailman2 conf [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [21:32:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) John checked and the cable had a link light, then I rechecked the bios settings. I had missed that somehow the PXE boot h... 
[21:33:13] (03CR) 10Legoktm: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/29119/console" [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [21:33:23] (03PS1) 10Dzahn: site: add phabricator[12]002 for upcoming hardware refresh [puppet] - 10https://gerrit.wikimedia.org/r/681197 (https://phabricator.wikimedia.org/T279177) [21:35:32] (03CR) 10Legoktm: [V: 03+1 C: 03+2] exim: Add support for handling mailman3 inside mailman2 conf [puppet] - 10https://gerrit.wikimedia.org/r/680328 (https://phabricator.wikimedia.org/T278612) (owner: 10Ladsgroup) [21:38:55] (03CR) 10Legoktm: [C: 03+1] remove mwdebug1003 from list of debug servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/681144 (https://phabricator.wikimedia.org/T267248) (owner: 10Dzahn) [21:40:20] (03CR) 10Dzahn: [C: 03+2] "noop but will be needed in the future" [puppet] - 10https://gerrit.wikimedia.org/r/681197 (https://phabricator.wikimedia.org/T279177) (owner: 10Dzahn) [21:44:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (10RobH) [21:44:06] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1018.wikimedia.org with reason: REIMAGE [21:44:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:55] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [21:46:10] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1018.wikimedia.org with reason: REIMAGE [21:46:17] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:50:34] (PS1) Dzahn: admin: add Manuel Merz to ldap_only admins (wmde, nda) [puppet] - https://gerrit.wikimedia.org/r/681202 (https://phabricator.wikimedia.org/T280162)
[21:50:47] (PS1) Brennen Bearnes: Review access change [gitlab-ansible] (refs/meta/config) - https://gerrit.wikimedia.org/r/681149
[21:54:17] (CR) Dzahn: [V: +1] "dn: uid=manuel-wmde,ou=people,dc=wikimedia,dc=org" [puppet] - https://gerrit.wikimedia.org/r/681202 (https://phabricator.wikimedia.org/T280162) (owner: Dzahn)
[21:55:22] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudcephosd1018.wikimedia.org'] ` and were **ALL** successful.
[21:56:46] (CR) Thcipriani: [V: +2 C: +2] "This group is itself owned by Release Engineering, so this looks fine to me." [gitlab-ansible] (refs/meta/config) - https://gerrit.wikimedia.org/r/681149 (owner: Brennen Bearnes)
[21:57:22] (CR) Dzahn: "should have linked instead to https://phabricator.wikimedia.org/T280540 and https://phabricator.wikimedia.org/T280544" [puppet] - https://gerrit.wikimedia.org/r/681197 (https://phabricator.wikimedia.org/T279177) (owner: Dzahn)
[21:58:07] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install phab2002 - https://phabricator.wikimedia.org/T280544 (Dzahn) added phab2002 to site.pp with "insetup" role already. Just needs DHCP. https://gerrit.wikimedia.org/r/681197
[21:58:42] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install phab1002 - https://phabricator.wikimedia.org/T280540 (Dzahn) added phab1002 to site.pp with "insetup" role already. Just needs DHCP.
https://gerrit.wikimedia.org/r/681197
[22:04:19] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (Jclark-ctr) just received nic card today
[22:05:37] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (RobH) This just needs serials swapped by John before task is resolved.
[22:06:01] PROBLEM - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 128.1 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[22:12:05] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (Jclark-ctr) a:RobH @RobH Swapped nic card handing back over for imaging
[22:12:42] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (Jclark-ctr)
[22:15:08] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudcephosd10[16-20].eqiad.wmnet - https://phabricator.wikimedia.org/T274945 (Jclark-ctr) Open→Resolved updated serial numbers in netbox.
resolving
[22:15:14] (CR) Ryan Kemper: [C: +2] wdqs: int can't take in float as string [cookbooks] - https://gerrit.wikimedia.org/r/680095 (https://phabricator.wikimedia.org/T280108) (owner: Ryan Kemper)
[22:19:30] (CR) Ryan Kemper: wdqs: improve replaceNamespace log output (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667054 (https://phabricator.wikimedia.org/T269331) (owner: Ryan Kemper)
[22:19:34] (CR) Ryan Kemper: [C: +2] wdqs: improve replaceNamespace log output [puppet] - https://gerrit.wikimedia.org/r/667054 (https://phabricator.wikimedia.org/T269331) (owner: Ryan Kemper)
[22:33:06] (PS1) RobH: cloudvirt1040 mac update [puppet] - https://gerrit.wikimedia.org/r/681212 (https://phabricator.wikimedia.org/T275081)
[22:33:40] (CR) RobH: [C: +2] cloudvirt1040 mac update [puppet] - https://gerrit.wikimedia.org/r/681212 (https://phabricator.wikimedia.org/T275081) (owner: RobH)
[22:34:44] ryankemper: heyas you have pending stuff
[22:34:51] Ryan Kemper: wdqs: improve replaceNamespace log output (5ec399392b)
[22:34:57] ok for me to roll into my puppet merge?
[22:35:14] robh: yes please do! thanks
[22:35:25] cool, merging now
[22:35:32] done
[22:36:34] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH)
[22:37:03] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH)
[22:37:08] SRE, netops: Higher latency on Lumen eqiad/esams link - https://phabricator.wikimedia.org/T277654 (wiki_willy) Not really getting anywhere with Lumen on this - just getting the same response, only from different people. Latest response is: "The latency increased on your existing circuit was due to a ro...
[22:37:39] !log reindexing commons and wikidata on elastic@eqiad finished/failed (T274200)
[22:37:43] !log reindexing wikidata on cloudelastic finished/failed (T274200)
[22:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:37:47] T274200: Reindex English and Italian wikis to enable homoglyph plugin - https://phabricator.wikimedia.org/T274200
[22:37:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:38:04] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` cloudvirt1040.eqiad.wmnet ` T...
[22:48:54] RECOVERY - Rate of JVM GC Old generation-s runs - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad&var-instance=cloudelastic1001&panelId=37
[22:53:50] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE
[22:53:52] (PS9) Legoktm: Add shellbox chart [deployment-charts] - https://gerrit.wikimedia.org/r/667047
[22:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:56:00] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1040.eqiad.wmnet with reason: REIMAGE
[22:56:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:57:48] (CR) Legoktm: Add shellbox chart (5 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/667047 (owner: Legoktm)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again!
You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210419T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:05:33] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1040.eqiad.wmnet'] ` and were **ALL** successful.
[23:14:01] (PS2) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[23:15:32] (CR) Ryan Kemper: "Patchset 2 moves the argument parser from __init__ to rolling_operation since these args are only used for that cookbook now." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[23:18:12] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)
[23:23:50] (PS1) Bartosz Dziewoński: CommentFormatter: Add 'ext-discussiontools-section' class instead of overwriting [extensions/DiscussionTools] (wmf/1.37.0-wmf.1) - https://gerrit.wikimedia.org/r/681153 (https://phabricator.wikimedia.org/T280433)
[23:25:38] SRE, ops-eqiad, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081 (RobH) Open→Resolved
[23:32:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:37:52] RECOVERY - Work
requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:39:37] (CR) Cwhite: [C: +1] swift: force creation of /var/log/swift symlink [puppet] - https://gerrit.wikimedia.org/r/681010 (https://phabricator.wikimedia.org/T280257) (owner: Filippo Giunchedi)
[23:50:05] (PS3) Ryan Kemper: elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221)
[23:50:19] SRE, Projects-Cleanup, fixcopyright.wikimedia.org, Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (thcipriani)
[23:52:45] (CR) jerkins-bot: [V: -1] elasticsearch: refactor various rolling operations [cookbooks] - https://gerrit.wikimedia.org/r/679701 (https://phabricator.wikimedia.org/T280221) (owner: Ryan Kemper)