[00:20:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={pdu_sentry4,routinator} site={eqiad,eqsin} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:23:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:36:09] PROBLEM - Hadoop DataNode on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [00:36:39] PROBLEM - Check systemd state on an-worker1112 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:29] PROBLEM - Hadoop NodeManager on an-worker1112 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [00:53:09] RECOVERY - Hadoop DataNode on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [00:53:39] RECOVERY - Check systemd state on an-worker1112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:53:57] RECOVERY - Hadoop NodeManager on an-worker1112 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process 
[01:12:43] PROBLEM - Puppet CA expired certs on puppetmaster1001 is CRITICAL: CRITICAL https://wikitech.wikimedia.org/wiki/Puppet%23Renew_agent_certificate [01:49:31] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:54:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:50:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [02:52:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [03:20:43] PROBLEM - WDQS SPARQL on wdqs1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:30:19] RECOVERY - WDQS SPARQL on wdqs1004 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:28:15] (PS1) Andrew Bogott: Keystone: stop monitoring for novaadmin/novaobserver project membership [puppet] - https://gerrit.wikimedia.org/r/667422 (https://phabricator.wikimedia.org/T274385) [04:28:17] (PS1) Andrew Bogott: wmfkeystonehooks: stop adding service users to all projects [puppet] - https://gerrit.wikimedia.org/r/667423 (https://phabricator.wikimedia.org/T274385) [04:29:06] (CR) jerkins-bot: [V: -1] wmfkeystonehooks: stop adding service users to all projects [puppet] - https://gerrit.wikimedia.org/r/667423 (https://phabricator.wikimedia.org/T274385) (owner: Andrew Bogott) [04:31:14] (PS2) Andrew Bogott: wmfkeystonehooks: stop adding service users to all projects [puppet] - https://gerrit.wikimedia.org/r/667423 (https://phabricator.wikimedia.org/T274385) [04:37:46] (PS1) KartikMistry: Remove test2wiki from wgContentTranslationAsBetaFeature [mediawiki-config] - https://gerrit.wikimedia.org/r/667424 [06:22:57] (PS1) Marostegui: mariadb: Decommission db1092 [puppet] - https://gerrit.wikimedia.org/r/667427 (https://phabricator.wikimedia.org/T275019) [06:25:15] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission for hosts db1092.eqiad.wmnet [06:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:16] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db1092.eqiad.wmnet [06:32:19] (CR) Marostegui: [C: +2] mariadb: Decommission db1092
[puppet] - https://gerrit.wikimedia.org/r/667427 (https://phabricator.wikimedia.org/T275019) (owner: Marostegui) [06:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:35:24] ops-eqiad, DC-Ops, decommission-hardware: decommission db1092.eqiad.wmnet - https://phabricator.wikimedia.org/T275019 (Marostegui) a:Marostegui→wiki_willy This is ready for #dc-ops! [06:35:47] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [06:35:53] ops-eqiad, DC-Ops, decommission-hardware: decommission db1092.eqiad.wmnet - https://phabricator.wikimedia.org/T275019 (Marostegui) [06:36:17] SRE, DC-Ops, Platform Engineering, serviceops, Patch-For-Review: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (Aklapper) [06:36:20] SRE, serviceops, Parsoid (Tracking): Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (Aklapper) [06:40:45] (PS1) Marostegui: db1168: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/667428 (https://phabricator.wikimedia.org/T258361) [06:41:36] (CR) Marostegui: [C: +2] db1168: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/667428 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [06:43:19] (PS1) Marostegui: instances.yaml: Add db1168 to dbctl [puppet] - https://gerrit.wikimedia.org/r/667429 (https://phabricator.wikimedia.org/T258361) [06:43:55] (CR) Marostegui: [C: +2] instances.yaml: Add db1168 to dbctl [puppet] - https://gerrit.wikimedia.org/r/667429 (https://phabricator.wikimedia.org/T258361) (owner: Marostegui) [06:46:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1168 to dbctl T258361!', diff saved to https://phabricator.wikimedia.org/P14519 and previous config saved to
/var/cache/conftool/dbconfig/20210301-064603-marostegui.json [06:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:46:12] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [06:47:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1168 with minimal weight T258361', diff saved to https://phabricator.wikimedia.org/P14520 and previous config saved to /var/cache/conftool/dbconfig/20210301-064704-marostegui.json [06:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:48] (PS1) Marostegui: db1134: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/667430 (https://phabricator.wikimedia.org/T275343) [06:52:58] (CR) Marostegui: [C: +2] db1134: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/667430 (https://phabricator.wikimedia.org/T275343) (owner: Marostegui) [06:55:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14521 and previous config saved to /var/cache/conftool/dbconfig/20210301-065500-root.json [06:55:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:39] (CR) ArielGlenn: "> If no one has objections I plan to merge and deploy this Sunday during the next window for these dumps."
[puppet] - https://gerrit.wikimedia.org/r/660871 (owner: ArielGlenn) [07:05:49] !log Stop MySQL on db2082 to clone db2152 - T275633 [07:05:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:55] T275633: Productionize db21[45-52] and db11[76-84] - https://phabricator.wikimedia.org/T275633 [07:10:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14523 and previous config saved to /var/cache/conftool/dbconfig/20210301-071004-root.json [07:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:42] PROBLEM - MariaDB Replica IO: s8 on db2094 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2082.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2082.codfw.wmnet (111 Connection refused) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:10:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some more weight to db1168', diff saved to https://phabricator.wikimedia.org/P14524 and previous config saved to /var/cache/conftool/dbconfig/20210301-071047-marostegui.json [07:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:16] db2094 is me [07:25:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14525 and previous config saved to /var/cache/conftool/dbconfig/20210301-072507-root.json [07:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Give some more weight to db1168', diff saved to https://phabricator.wikimedia.org/P14526 and previous config saved to
/var/cache/conftool/dbconfig/20210301-072957-marostegui.json [07:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 15%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14527 and previous config saved to /var/cache/conftool/dbconfig/20210301-074011-root.json [07:40:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 4%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14528 and previous config saved to /var/cache/conftool/dbconfig/20210301-074759-root.json [07:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:34] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) db1168 is now slowly being pooled into s6 running 10.4.18 [07:48:46] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [07:51:30] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [07:51:36] (PS1) Muehlenhoff: Remove access for agaduran [puppet] - https://gerrit.wikimedia.org/r/667469 [07:51:39] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (Marostegui) [07:53:15] !log clean up old logs + apt-get clean + puppet clientbucket on an-coord1001 to free space [07:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:47] !log Upgrade pc1010 pc2008
pc200 to 10.4.18 [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14529 and previous config saved to /var/cache/conftool/dbconfig/20210301-075514-root.json [07:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:47] (CR) Muehlenhoff: [C: +2] Remove access for agaduran [puppet] - https://gerrit.wikimedia.org/r/667469 (owner: Muehlenhoff) [08:02:54] (PS1) Urbanecm: Set wgGEHelpPanelAskMentor to true by default [mediawiki-config] - https://gerrit.wikimedia.org/r/667529 (https://phabricator.wikimedia.org/T275908) [08:03:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 5%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14530 and previous config saved to /var/cache/conftool/dbconfig/20210301-080303-root.json [08:03:07] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:23] (PS1) Marostegui: wmnet: Switch m3-master [dns] - https://gerrit.wikimedia.org/r/667530 (https://phabricator.wikimedia.org/T273281) [08:10:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 40%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14531 and previous config saved to /var/cache/conftool/dbconfig/20210301-081018-root.json [08:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 10%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14532 and
previous config saved to /var/cache/conftool/dbconfig/20210301-081806-root.json [08:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14533 and previous config saved to /var/cache/conftool/dbconfig/20210301-082521-root.json [08:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:56] (CR) Giuseppe Lavagetto: Add php 7.3 images (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/664884 (owner: Giuseppe Lavagetto) [08:29:15] (PS3) Giuseppe Lavagetto: Add php 7.3 images [docker-images/production-images] - https://gerrit.wikimedia.org/r/664884 [08:30:31] (Abandoned) Giuseppe Lavagetto: mediawiki::webserver: restart mtail when modifying programs [puppet] - https://gerrit.wikimedia.org/r/666644 (owner: Giuseppe Lavagetto) [08:31:30] (CR) Giuseppe Lavagetto: [V: +2 C: +2] php/httpd: use numeric uids [docker-images/production-images] - https://gerrit.wikimedia.org/r/666393 (owner: Giuseppe Lavagetto) [08:33:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 15%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14534 and previous config saved to /var/cache/conftool/dbconfig/20210301-083310-root.json [08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:22] !log reboot an-worker1112 [08:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:28] (PS1) Marostegui: mariadb: Productionize db2152 [puppet] - https://gerrit.wikimedia.org/r/667535 (https://phabricator.wikimedia.org/T275633) [08:40:12] (CR) Marostegui: [C: +2] mariadb: Productionize db2152 [puppet] - https://gerrit.wikimedia.org/r/667535
(https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [08:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 65%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14535 and previous config saved to /var/cache/conftool/dbconfig/20210301-084025-root.json [08:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:01] jouncebot: now [08:42:01] No deployments scheduled for the next 2 hour(s) and 47 minute(s) [08:42:05] RECOVERY - MariaDB Replica IO: s8 on db2094 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [08:42:09] (CR) Urbanecm: [C: +2] rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130) (owner: Urbanecm) [08:42:59] (Merged) jenkins-bot: rowiki: Update help panel links [mediawiki-config] - https://gerrit.wikimedia.org/r/666682 (https://phabricator.wikimedia.org/T275130) (owner: Urbanecm) [08:45:30] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 92f65972f4277624f74369af08563a8ca6254bda: rowiki: Update help panel links (T275130) (duration: 01m 08s) [08:45:36] * Urbanecm done [08:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:45:41] T275130: Deploy Growth features on Romanian Wikipedia - https://phabricator.wikimedia.org/T275130 [08:48:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 20%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14536 and previous config saved to /var/cache/conftool/dbconfig/20210301-084813-root.json [08:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:48] SRE, Traffic, serviceops: ChartMuseum responses are cached in the CDN with default (24h) ttl -
https://phabricator.wikimedia.org/T272633 (JMeybohm) Open→Resolved Closing this as cache is disabled now. [08:49:17] (CR) Kormat: [C: +1] wmnet: Switch m3-master [dns] - https://gerrit.wikimedia.org/r/667530 (https://phabricator.wikimedia.org/T273281) (owner: Marostegui) [08:49:43] (CR) JMeybohm: [C: +1] "> Patch Set 1: Code-Review+1" [deployment-charts] - https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: PipelineBot) [08:51:37] (CR) Marostegui: [C: +2] wmnet: Switch m3-master [dns] - https://gerrit.wikimedia.org/r/667530 (https://phabricator.wikimedia.org/T273281) (owner: Marostegui) [08:55:13] (PS4) Kormat: mariadb: Convert pt-heartbeat to a systemd service. [puppet] - https://gerrit.wikimedia.org/r/665324 (https://phabricator.wikimedia.org/T252528) [08:55:23] (PS2) Muehlenhoff: Reduce TTL for irc CNAME to 5 minutes [dns] - https://gerrit.wikimedia.org/r/667161 [08:55:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14537 and previous config saved to /var/cache/conftool/dbconfig/20210301-085529-root.json [08:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:37] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (elukey) [08:58:42] (PS1) Marostegui: install_server: Do not reimage db2152 [puppet] - https://gerrit.wikimedia.org/r/667538 (https://phabricator.wikimedia.org/T275633) [08:59:36] (CR) Marostegui: [C: +2] install_server: Do not reimage db2152 [puppet] - https://gerrit.wikimedia.org/r/667538 (https://phabricator.wikimedia.org/T275633) (owner: Marostegui) [09:03:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 25%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14538 and
previous config saved to /var/cache/conftool/dbconfig/20210301-090317-root.json [09:03:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:21] (CR) Urbanecm: [C: +1] Add tay to langlist helper [dns] - https://gerrit.wikimedia.org/r/667105 (https://phabricator.wikimedia.org/T275803) (owner: Gerrit maintenance bot) [09:09:50] (CR) Kosta Harlan: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/666982 (owner: JMeybohm) [09:10:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 85%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14539 and previous config saved to /var/cache/conftool/dbconfig/20210301-091032-root.json [09:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:03] SRE, ops-eqiad, Analytics: an-worker1112 reports I/O errors for a disk - https://phabricator.wikimedia.org/T274981 (elukey) Open→Resolved Recreated partition for /dev/sdl and re-mounted. let's see if any error will trigger. Closing for the moment, will reopen if I make the disk to fail.
[09:11:20] (CR) Muehlenhoff: [C: +2] Reduce TTL for irc CNAME to 5 minutes [dns] - https://gerrit.wikimedia.org/r/667161 (owner: Muehlenhoff) [09:18:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 30%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14540 and previous config saved to /var/cache/conftool/dbconfig/20210301-091820-root.json [09:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:08] SRE: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) [09:22:55] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) Renamed task to be jobrunner+buster specific and looping in #serviceops [09:25:36] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repool db1134 after on-site maintenance', diff saved to https://phabricator.wikimedia.org/P14541 and previous config saved to /var/cache/conftool/dbconfig/20210301-092536-root.json [09:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:26:25] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Joe) I can't imagine a single valid reason for a distro upgrade meaning that data transfer would slow down so much. My suggestion is we re-image one jobrunner to stretch and we check... [09:31:47] (CR) Thiemo Kreuz (WMDE): [C: +1] "I don't know how to actually review this. If it helps, why not? What unexpected issues could this possibly cause?"
[puppet] - https://gerrit.wikimedia.org/r/666948 (https://phabricator.wikimedia.org/T275757) (owner: Awight) [09:33:08] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) I'll also note that the behavior is generally quite rare compared to the number of PUTs from jobrunners, e.g. I haven't been able to reproduce using the swift python clien... [09:33:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 40%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14542 and previous config saved to /var/cache/conftool/dbconfig/20210301-093324-root.json [09:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:34:59] (PS1) DCausse: [wdqs] buffer 250 messages instead of 1000 [puppet] - https://gerrit.wikimedia.org/r/667541 [09:41:25] (PS2) Svantje Lilienthal: [DNM] ReferenceTooltips gadget names for ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [09:41:50] (CR) ZPapierski: [C: +1] [wdqs] buffer 250 messages instead of 1000 [puppet] - https://gerrit.wikimedia.org/r/667541 (owner: DCausse) [09:45:06] (CR) DCausse: [C: +1] wdqs: improve replaceNamespace log output (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667054 (https://phabricator.wikimedia.org/T269331) (owner: Ryan Kemper) [09:48:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 50%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14543 and previous config saved to /var/cache/conftool/dbconfig/20210301-094828-root.json [09:48:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:06] (PS1) Vgutierrez: ATS: Enable parent proxies support on ats-tls@cp5012 [puppet] -
https://gerrit.wikimedia.org/r/667545 (https://phabricator.wikimedia.org/T274888) [09:51:23] (CR) Filippo Giunchedi: [C: +2] "Confirmed in pontoon:" [puppet] - https://gerrit.wikimedia.org/r/667112 (https://phabricator.wikimedia.org/T254605) (owner: Filippo Giunchedi) [09:51:44] (PS1) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/667546 [09:53:17] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (Joe) @RLazarus in https://phabricator.wikimedia.org/T248093#6076630 you mentioned committing a script for automating cert renewal, and I see it indeed. Renewing the certs should amount to just runn... [09:53:45] (PS3) Svantje Lilienthal: [DNM] ReferenceTooltips gadget names for ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [09:53:53] (PS1) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [09:54:16] (CR) Volans: [C: +2] code style: improve doc and link doc from tox [software/spicerack] - https://gerrit.wikimedia.org/r/666934 (owner: Volans) [09:54:45] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Urbanecm) >>! In T275752#6864889, @fgiunchedi wrote: > Looking back a few days, e.g. Feb 4-5th, the list of hosts that take > 80s is still eqiad jobrunners, and suspiciously all have...
[09:55:16] (CR) jerkins-bot: [V: -1] [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [09:58:32] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28281/console" [puppet] - https://gerrit.wikimedia.org/r/667545 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez) [09:59:15] (PS2) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [09:59:23] (Merged) jenkins-bot: code style: improve doc and link doc from tox [software/spicerack] - https://gerrit.wikimedia.org/r/666934 (owner: Volans) [10:01:49] (PS1) DCausse: [relforge] allow index auto creation [puppet] - https://gerrit.wikimedia.org/r/667548 [10:03:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 65%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14544 and previous config saved to /var/cache/conftool/dbconfig/20210301-100331-root.json [10:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:21] (PS1) Volans: puppetdb microservice: refactor prior to expand it [puppet] - https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) [10:04:23] (PS1) Volans: puppetdb microservice: add support for cumin [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) [10:04:38] (PS3) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:05:52] (CR) jerkins-bot: [V: -1] puppetdb microservice: refactor prior to expand it [puppet] - https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [10:06:06] (CR) Vgutierrez: [V: +1 C: +2] ATS: Enable parent proxies support on ats-tls@cp5012 [puppet] -
https://gerrit.wikimedia.org/r/667545 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez) [10:06:32] (CR) jerkins-bot: [V: -1] puppetdb microservice: add support for cumin [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [10:07:02] (PS2) Gehel: [relforge] allow index auto creation [puppet] - https://gerrit.wikimedia.org/r/667548 (owner: DCausse) [10:07:28] (PS2) Volans: puppetdb microservice: refactor prior to expand it [puppet] - https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) [10:07:30] (PS2) Volans: puppetdb microservice: add support for cumin [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) [10:08:43] (CR) Gehel: [C: +2] [relforge] allow index auto creation [puppet] - https://gerrit.wikimedia.org/r/667548 (owner: DCausse) [10:09:22] (CR) jerkins-bot: [V: -1] puppetdb microservice: add support for cumin [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [10:11:52] (PS4) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:14:29] (PS3) Volans: puppetdb microservice: add support for cumin [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) [10:15:03] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 9 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28284/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [10:15:06] !log restart ats-tls on cp5012 [10:15:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:27] (PS1) Muehlenhoff: Fix list of additional ops permission groups [puppet] - https://gerrit.wikimedia.org/r/667551 [10:18:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 75%: Slowly
pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14545 and previous config saved to /var/cache/conftool/dbconfig/20210301-101835-root.json [10:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:49] (PS3) Volans: Testing CI [software/spicerack] - https://gerrit.wikimedia.org/r/647657 [10:23:26] (PS5) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:23:53] (PS9) Gehel: kibana: use different settings based off version [puppet] - https://gerrit.wikimedia.org/r/666677 (https://phabricator.wikimedia.org/T275658) (owner: Ryan Kemper) [10:24:32] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28285/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [10:24:49] (CR) jerkins-bot: [V: -1] Testing CI [software/spicerack] - https://gerrit.wikimedia.org/r/647657 (owner: Volans) [10:25:29] (Abandoned) Volans: Testing CI [software/spicerack] - https://gerrit.wikimedia.org/r/647657 (owner: Volans) [10:25:42] (CR) Gehel: [C: +2] kibana: use different settings based off version [puppet] - https://gerrit.wikimedia.org/r/666677 (https://phabricator.wikimedia.org/T275658) (owner: Ryan Kemper) [10:25:44] (CR) Volans: [C: +1] "LGTM" [software/spicerack] - https://gerrit.wikimedia.org/r/667172 (owner: David Caro) [10:27:01] (PS6) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:28:10] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28286/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [10:32:31] (PS7) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:33:35] (CR) Kormat: [V: +1] "PCC SUCCESS
(DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28287/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [10:33:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 85%: Slowly pool db1168 for the first time', diff saved to https://phabricator.wikimedia.org/P14546 and previous config saved to /var/cache/conftool/dbconfig/20210301-103338-root.json [10:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:44] (PS4) Thiemo Kreuz (WMDE): ReferenceTooltips and other gadget names for ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) [10:34:20] (PS8) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:36:04] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28288/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [10:36:38] (CR) Thiemo Kreuz (WMDE): [C: +1] ReferenceTooltips and other gadget names for ReferencePreviews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [10:45:12] (PS1) Elukey: role::druid::analytics::worker: perf tuning for the broker [puppet] - https://gerrit.wikimedia.org/r/667556 [10:45:46] (CR) Kosta Harlan: [C: -2] "Let's wait for I0ef88d7360cd7bd1610931d7ade31bee975e8f4f" [deployment-charts] - https://gerrit.wikimedia.org/r/667546 (owner: Kosta Harlan) [10:48:23] (CR) Volans: doc: Introduce a code reviewing guideline (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [10:48:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1168 (re)pooling @ 100%: Slowly pool db1168 for the first time', diff saved to 
https://phabricator.wikimedia.org/P14547 and previous config saved to /var/cache/conftool/dbconfig/20210301-104842-root.json [10:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:20] (PS1) Elukey: role::druid::analytics::worker: tune cache settings [puppet] - https://gerrit.wikimedia.org/r/667558 (https://phabricator.wikimedia.org/T270173) [10:55:03] (CR) Volans: [C: -1] "See inline" (6 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/667170 (owner: David Caro) [10:55:31] (CR) Elukey: [C: +2] role::druid::analytics::worker: perf tuning for the broker [puppet] - https://gerrit.wikimedia.org/r/667556 (owner: Elukey) [10:55:34] (PS9) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [10:55:49] (CR) Elukey: [C: +2] role::druid::analytics::worker: tune cache settings [puppet] - https://gerrit.wikimedia.org/r/667558 (https://phabricator.wikimedia.org/T270173) (owner: Elukey) [10:56:57] SRE, Analytics: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (MoritzMuehlenhoff) [10:58:52] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 6 NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28290/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [11:00:28] (PS1) David Caro: wmcs.postrgesql.osm_primary: add missing required engine param [puppet] - https://gerrit.wikimedia.org/r/667560 (https://phabricator.wikimedia.org/T276039) [11:02:54] People complain about mailing lists not working or something [11:03:03] is there some (un)planned problem with Mailman? 
[11:04:10] (PS1) Alexandros Kosiaris: Support ANALYTICS_BASE_URL [deployment-charts] - https://gerrit.wikimedia.org/r/667561 [11:05:46] (PS1) Phuedx: vector: Stage 2 of WVUI search treatment A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/667562 (https://phabricator.wikimedia.org/T249297) [11:08:11] (CR) Muehlenhoff: [C: +1] "Looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/667119 (owner: Jbond) [11:08:14] SRE, serviceops, User-jijiki: Enable TLS on memcached - https://phabricator.wikimedia.org/T271967 (jijiki) [11:09:23] Cladis: but mailman does seem to have a queue to serve so email delivery might be stalled. [11:09:37] no, but* [11:10:20] (PS2) Muehlenhoff: Add bullseye-wikimedia to apt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) [11:11:30] (CR) Kosta Harlan: [C: +1] "Should we bump the version to include the latest Docker image which adds support for the configurable base URL, or do that in a follow-up?" [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [11:11:57] (PS1) Filippo Giunchedi: prometheus: add notes_link for stale node-exporter textfile [puppet] - https://gerrit.wikimedia.org/r/667567 [11:16:09] akosiaris: I assume it will resolve on its own then? Thanks for the update (we have an AGM announcement stuck :) ) [11:17:01] Cladis: it is already resolving itself. See https://grafana.wikimedia.org/d/nULM0E1Wk/mailman?viewPanel=2&orgId=1 [11:17:15] it will take a while but eventually all messages will be delivered [11:17:54] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (fgiunchedi) >>! In T275752#6869341, @Urbanecm wrote: >>>! In T275752#6864889, @fgiunchedi wrote: >> Looking back a few days, e.g. Feb 4-5th, the list of hosts that take > 80s is still... 
[11:17:56] (CR) Muehlenhoff: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff) [11:18:15] (PS3) Muehlenhoff: Add bullseye-wikimedia to apt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) [11:18:17] (CR) MSantos: [C: +1] wmcs.postrgesql.osm_primary: add missing required engine param [puppet] - https://gerrit.wikimedia.org/r/667560 (https://phabricator.wikimedia.org/T276039) (owner: David Caro) [11:19:00] (CR) Filippo Giunchedi: [C: +2] prometheus: add notes_link for stale node-exporter textfile [puppet] - https://gerrit.wikimedia.org/r/667567 (owner: Filippo Giunchedi) [11:21:55] (CR) Filippo Giunchedi: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff) [11:29:21] (CR) Muehlenhoff: [C: +2] Add bullseye-wikimedia to apt.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/667162 (https://phabricator.wikimedia.org/T275873) (owner: Muehlenhoff) [11:30:01] (CR) David Caro: [C: +2] wmcs.postrgesql.osm_primary: add missing required engine param [puppet] - https://gerrit.wikimedia.org/r/667560 (https://phabricator.wikimedia.org/T276039) (owner: David Caro) [11:30:05] jan_drewniak: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Wikimedia Portals Update . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T1130). 
[11:30:58] (PS1) Urbanecm: Deploy Growth features to newcomers on da.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/667570 (https://phabricator.wikimedia.org/T256126) [11:31:18] (PS1) Aklapper: Fix broken rendering of characters in EasyTimeline for Yue Chinese [mediawiki-config] - https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) [11:32:37] (PS1) Elukey: role::druid::analytics: remove config not needed anymore for Historicals [puppet] - https://gerrit.wikimedia.org/r/667572 [11:33:17] (CR) Elukey: [C: +2] role::druid::analytics: remove config not needed anymore for Historicals [puppet] - https://gerrit.wikimedia.org/r/667572 (owner: Elukey) [11:41:20] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/667573 (https://phabricator.wikimedia.org/T128546) [11:41:29] (CR) jerkins-bot: [V: -1] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/667573 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:42:37] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/667574 (https://phabricator.wikimedia.org/T128546) [11:42:51] (Abandoned) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/667573 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:43:01] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/667574 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:44:02] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/667574 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [11:44:41] SRE, Wikimedia-Logstash, User-fgiunchedi: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (hashar) Another case was 
MediaWiki monolog update which eventually caused the `session` field to fail to index properly due to mismatch type (object vs string) which was... [11:47:20] SRE, MW-on-K8s, observability, serviceops: Logging options for apache httpd in k8s - https://phabricator.wikimedia.org/T265876 (Joe) a:Joe At the meeting we decided it's ok to let apache log to kafka as a main method of collection. We will therefore, at least in a first iteration: * Log to /... [11:48:24] (PS1) Elukey: profile::amd_gpu: add python 3.8 on Buster [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) [11:48:54] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:667574| Bumping portals to master (T128546)]] (duration: 00m 55s) [11:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:03] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [11:49:49] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:667574| Bumping portals to master (T128546)]] (duration: 00m 55s) [11:49:51] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) I am aiming to at least test TLS on memcached T271967, hoping to roll it out next month. If this works out, we will not be needing mcrouter certs. We have 60 days ahead of us, I think it ca... 
[11:49:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:01] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28291/console" [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [11:54:28] (PS1) Elukey: role::druid::analytics:worker: fix duplication/typo in broker settings [puppet] - https://gerrit.wikimedia.org/r/667576 [11:54:39] (CR) Elukey: [C: +2] role::druid::analytics:worker: fix duplication/typo in broker settings [puppet] - https://gerrit.wikimedia.org/r/667576 (owner: Elukey) [11:56:20] SRE, Analytics, Patch-For-Review: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) p:Triage→Medium [11:57:43] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (JMeybohm) p:Triage→Medium [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: Dear deployers, time to do the European mid-day backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T1200). [12:00:04] tgr and phuedx: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:59] o/ [12:01:55] o/ [12:02:15] tgr_: phuedx: I'm happy to deploy, unless either of you want to self-serve? [12:02:40] SRE, Language-Team, Wikimedia-Site-requests, Performance-Team (Radar): Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319 (Fuzzy) There are two apparent solutions to the effective limit for Hebrew pages: A. Use `mb_strlen` instead of `strlen` to mea... 
[12:03:37] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (Joe) >>! In T276029#6870062, @jijiki wrote: > I am aiming to at least test TLS on memcached T271967, hoping to roll it out next month. If this works out, we will not be needing mcrouter certs. We h... [12:04:15] awight: I don't mind either way :) [12:05:02] awight: thanks! my patch is a no-op in production, can go out without checking [12:05:12] Great, I'll do them both then. [12:05:13] I'm also around, in case I'm needed [12:05:36] (CR) Awight: [C: +2] "Config window." [mediawiki-config] - https://gerrit.wikimedia.org/r/666441 (owner: Gergő Tisza) [12:05:47] (CR) jerkins-bot: [V: -1] GrowthExperiments: set GELinkRecommendationsUseEventGate [mediawiki-config] - https://gerrit.wikimedia.org/r/666441 (owner: Gergő Tisza) [12:06:54] (PS2) Awight: GrowthExperiments: set GELinkRecommendationsUseEventGate [mediawiki-config] - https://gerrit.wikimedia.org/r/666441 (owner: Gergő Tisza) [12:07:07] (CR) Awight: [C: +2] "PS 2: manual rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/666441 (owner: Gergő Tisza) [12:07:29] tgr_: heads-up, there are both GELinkRecommendations* and GELinkRecommendation* settings. [12:07:53] (Merged) jenkins-bot: GrowthExperiments: set GELinkRecommendationsUseEventGate [mediawiki-config] - https://gerrit.wikimedia.org/r/666441 (owner: Gergő Tisza) [12:07:55] Not the fault of this patch. [12:10:19] (PS2) Awight: vector: Stage 2 of WVUI search treatment A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/667562 (https://phabricator.wikimedia.org/T249297) (owner: Phuedx) [12:10:29] (CR) Awight: [C: +2] "Config window." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/667562 (https://phabricator.wikimedia.org/T249297) (owner: Phuedx) [12:10:30] !log awight@deploy1001 Synchronized wmf-config: Config: [[gerrit:666441|GrowthExperiments: set GELinkRecommendationsUseEventGate (T274198)]] (duration: 01m 05s) [12:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:10:38] T274198: Beta wiki configuration for add link project - https://phabricator.wikimedia.org/T274198 [12:11:15] (Merged) jenkins-bot: vector: Stage 2 of WVUI search treatment A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/667562 (https://phabricator.wikimedia.org/T249297) (owner: Phuedx) [12:12:07] awight: This patch does require testing [12:12:28] phuedx: ack [12:13:10] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (jijiki) >>! In T276029#6870186, @Joe wrote: >>>! In T276029#6870062, @jijiki wrote: >> I am aiming to at least test TLS on memcached T271967, hoping to roll it out next month. If this works out, we... [12:13:28] phuedx: live on mwdebug1001.eqiad.wmnet [12:13:29] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [12:13:35] awight: Thanks [12:15:58] awight: I don't see it [12:16:09] (thanks for the deploy!) [12:17:19] (CR) David Caro: doc: Introduce a code reviewing guideline (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [12:18:07] tgr_: Here's the odd one out, maybe already a typo? https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/InitialiseSettings-labs.php#L1109 [12:18:57] awight: Still testing. Thanks for your patience :) [12:19:04] phuedx: I have all day ;-) [12:23:05] awight: We think we've found a bug on hewiki. Let's revert [12:23:14] phuedx: will do! 
[12:23:46] (PS1) Awight: Revert "vector: Stage 2 of WVUI search treatment A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/667399 [12:24:08] (PS2) Awight: Revert "vector: Stage 2 of WVUI search treatment A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/667399 (https://phabricator.wikimedia.org/T249297) [12:24:28] (CR) Awight: [C: +2] "Reverting during config window." [mediawiki-config] - https://gerrit.wikimedia.org/r/667399 (https://phabricator.wikimedia.org/T249297) (owner: Awight) [12:25:13] (Merged) jenkins-bot: Revert "vector: Stage 2 of WVUI search treatment A/B test" [mediawiki-config] - https://gerrit.wikimedia.org/r/667399 (https://phabricator.wikimedia.org/T249297) (owner: Awight) [12:25:36] Thanks, awight! [12:26:01] phuedx: Reverted in deployment. Thanks for your caution! [12:26:09] awight: oh right, we only have that in the beta config yet. Thanks for spotting, we should probably fix it. [12:26:38] !log EU config deployments complete [12:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:43] tgr_: :-) [12:34:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:36:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:36:47] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:37:41] (Abandoned) Kosta Harlan: EventLoggingSchemas: Bump HomepageVisit version [mediawiki-config] - https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: Kosta Harlan) [12:39:30] (CR) VolkerE: "Bug filed at https://phabricator.wikimedia.org/T276081" [mediawiki-config] - https://gerrit.wikimedia.org/r/667399 (https://phabricator.wikimedia.org/T249297) (owner: Awight) [12:43:27] (CR) Hnowlan: WIP: Deploy tegola on kubernetes (9 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: Jgiannelos) [12:59:58] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [13:03:34] (PS4) Hnowlan: api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) [13:06:37] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [13:06:48] (PS2) Alexandros Kosiaris: Support ANALYTICS_BASE_URL [deployment-charts] - https://gerrit.wikimedia.org/r/667561 [13:07:06] (CR) Alexandros Kosiaris: "> Patch Set 1: Code-Review+1" [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [13:07:32] (CR) jerkins-bot: [V: -1] Support ANALYTICS_BASE_URL [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [13:14:07] SRE, OTRS, Security: ((OTRS)) Community Edition 6 is end-of-life; no FOSS replacement provided - https://phabricator.wikimedia.org/T275294 (MoritzMuehlenhoff) We don't use the Debian OTRS packages for ticket.wikimedia.org, but adding this as an additional data point: 
Debian switched to the Znuny fork... [13:15:20] (PS10) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [13:15:53] (Abandoned) JMeybohm: linkrecommendation: Allow egress to analytics.wikimedia.org [deployment-charts] - https://gerrit.wikimedia.org/r/666982 (owner: JMeybohm) [13:16:25] (CR) JMeybohm: [C: -1] Support ANALYTICS_BASE_URL (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [13:19:49] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:21:41] SRE, Maps, Product-Infrastructure-Team-Backlog, Services, Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (fgiunchedi) Please excuse the drive-by comment; from the diagram below it seems that RO access to swift from users and applica... [13:22:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:22:19] !log installing docker.io security updates for Buster [13:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:42] SRE, Analytics: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - https://phabricator.wikimedia.org/T276088 (Ottomata) [13:31:20] (CR) Volans: [C: +1] "LGTM, good catch" [puppet] - https://gerrit.wikimedia.org/r/667551 (owner: Muehlenhoff) [13:36:57] (CR) Ottomata: "Yes, but we need the extension change to be deployed on all wikis before we do. 
If you want to make this change before that happens (has " [mediawiki-config] - https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: Kosta Harlan) [13:38:20] (PS11) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [13:39:52] (PS1) Urbanecm: Define wmgGEFeaturesMayBeAvailableToNewcomers that controls whether GE features are newcomer-deployed [mediawiki-config] - https://gerrit.wikimedia.org/r/667581 (https://phabricator.wikimedia.org/T276091) [13:41:21] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28293/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [13:43:02] (CR) Elukey: [V: +1] profile::amd_gpu: add python 3.8 on Buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [13:44:56] (CR) Muehlenhoff: [C: +1] profile::amd_gpu: add python 3.8 on Buster (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [13:45:45] (PS2) Elukey: profile::amd_gpu: add python 3.8 on Buster [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) [13:47:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:48:34] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28294/console" [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [13:48:42] (PS1) Elukey: druid: add query cache hit rate for historical to exporter [puppet] - 
https://gerrit.wikimedia.org/r/667585 [13:49:28] (CR) Elukey: [C: +2] druid: add query cache hit rate for historical to exporter [puppet] - https://gerrit.wikimedia.org/r/667585 (owner: Elukey) [13:49:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:51:01] (CR) Muehlenhoff: [C: +2] Fix list of additional ops permission groups [puppet] - https://gerrit.wikimedia.org/r/667551 (owner: Muehlenhoff) [13:52:47] SRE, Maps, Product-Infrastructure-Team-Backlog, Services, Service-deployment-requests: [DRAFT] New Service Request tegola - https://phabricator.wikimedia.org/T274390 (Jgiannelos) Hey @fgiunchedi, thanks for the feedback. I think what we are trying to show in the charts is the differentiation... [13:54:07] (CR) Jgiannelos: WIP: Deploy tegola on kubernetes (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/667165 (https://phabricator.wikimedia.org/T275874) (owner: Jgiannelos) [13:55:22] (CR) Volans: [C: -1] "additional comment" (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/667170 (owner: David Caro) [13:55:50] (PS12) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [13:58:32] (CR) Clarakosi: [C: +1] "> Patch Set 3: Code-Review+1" [deployment-charts] - https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: PipelineBot) [13:58:55] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28295/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [13:59:28] SRE, MW-on-K8s, observability, serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - 
https://phabricator.wikimedia.org/T276095 (Joe) [14:04:03] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 194845592 and 6 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:05:23] !log installing openldap security updates on stretch (client-side tools/libs only, slapd instances all on Buster and fixed) [14:05:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:54] (Abandoned) Kosta Harlan: linkrecommendation: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/667546 (owner: Kosta Harlan) [14:06:19] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 942112 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:07:05] !log Upgrade dbproxy1020 kernel [14:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:01] SRE, MW-on-K8s, observability, serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (akosiaris) = Modify mtail to be able to consume logs from kafka = In this idea, we'd be able to just consume a kafka topic directly from mtail... [14:10:22] (PS13) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [14:10:44] SRE, ops-codfw, DBA, DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (LSobanski) @Papaul Can it be added to the template or does it need to be added manually to every task? 
[14:11:05] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [14:13:15] (CR) Klausman: [C: +1] profile::amd_gpu: add python 3.8 on Buster [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [14:13:24] (CR) Alexandros Kosiaris: Support ANALYTICS_BASE_URL (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [14:13:33] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28296/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [14:14:30] (CR) Elukey: "David, this is a great effort, I see a lot of good points but I am a little confused about why we are adding specific instructions only to" [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [14:15:16] (PS3) Alexandros Kosiaris: Support ANALYTICS_BASE_URL [deployment-charts] - https://gerrit.wikimedia.org/r/667561 [14:16:08] (CR) Elukey: [V: +1 C: +2] profile::amd_gpu: add python 3.8 on Buster [puppet] - https://gerrit.wikimedia.org/r/667575 (https://phabricator.wikimedia.org/T275896) (owner: Elukey) [14:17:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:18:54] !log upgrade mc1030 mc2030 to memcached 1.6 [14:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:33] (PS1) MSantos: maps: fix imposm3 cache dir [puppet] - https://gerrit.wikimedia.org/r/667598 [14:20:35] (PS1) Elukey: profile::python38: fix component name [puppet] - https://gerrit.wikimedia.org/r/667599 [14:22:05] RECOVERY - 
Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:22:25] (CR) Elukey: [C: +2] profile::python38: fix component name [puppet] - https://gerrit.wikimedia.org/r/667599 (owner: Elukey) [14:23:08] (PS2) MSantos: maps: fix imposm3 cache dir [puppet] - https://gerrit.wikimedia.org/r/667598 [14:24:56] (CR) David Caro: "> Patch Set 1:" [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [14:25:01] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1030.eqiad.wmnet [14:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:59] (PS1) Muehlenhoff: Add cuminunpriv1001 to allowed hosts for puppetdb microservice [puppet] - https://gerrit.wikimedia.org/r/667600 [14:29:47] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (Eugene.chernov) @brennen, L3 has been signed and ‘ichernov’ is ok as the username [14:31:20] (CR) Hnowlan: maps: fix imposm3 cache dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667598 (owner: MSantos) [14:32:00] (CR) Elukey: "> Patch Set 1:" [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [14:32:34] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1030.eqiad.wmnet [14:32:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:18] SRE, Analytics, Patch-For-Review: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (elukey) The patch seems working, I was able to install `rocm-gdb` v3.8 on an-worker1096 :) Tobias opened https://github.com/RadeonOpenCompute/ROCm/issues/1396 
[14:35:28] (PS1) Vgutierrez: ATS: Enable parent proxies for text@eqsin [puppet] - https://gerrit.wikimedia.org/r/667604 (https://phabricator.wikimedia.org/T274888) [14:37:03] (PS14) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [14:37:24] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28297/console" [puppet] - https://gerrit.wikimedia.org/r/667604 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez) [14:37:54] (PS1) Elukey: Move an-worker1097 to the Hadoop gpu buster workers [puppet] - https://gerrit.wikimedia.org/r/667605 (https://phabricator.wikimedia.org/T231067) [14:38:14] (CR) Vgutierrez: [V: +1 C: +2] ATS: Enable parent proxies for text@eqsin [puppet] - https://gerrit.wikimedia.org/r/667604 (https://phabricator.wikimedia.org/T274888) (owner: Vgutierrez) [14:40:20] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28298/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [14:40:50] (PS1) Marostegui: Revert "wmnet: Switch m3-master" [dns] - https://gerrit.wikimedia.org/r/667401 [14:40:56] (CR) David Caro: "> Patch Set 1:" [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [14:40:58] (CR) Elukey: [C: +2] Move an-worker1097 to the Hadoop gpu buster workers [puppet] - https://gerrit.wikimedia.org/r/667605 (https://phabricator.wikimedia.org/T231067) (owner: Elukey) [14:41:13] (PS2) Marostegui: Revert "wmnet: Switch m3-master" [dns] - https://gerrit.wikimedia.org/r/667401 [14:41:24] (CR) DCausse: [C: +1] add new updater job properties [deployment-charts] - https://gerrit.wikimedia.org/r/667034 (https://phabricator.wikimedia.org/T273095) (owner: Mstyles) [14:41:55] (CR) Marostegui: [C: +2] Revert "wmnet: Switch 
m3-master" [dns] - https://gerrit.wikimedia.org/r/667401 (owner: Marostegui) [14:48:58] !log Failover m3 proxy back to dbproxy1020 [14:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:13] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/667600 (owner: Muehlenhoff) [14:51:50] (PS1) Andrew Bogott: Added bogus secret file for etcd.deployment-prep.eqiad1.wikimedia.cloud.key [labs/private] - https://gerrit.wikimedia.org/r/667612 [14:52:04] (CR) Andrew Bogott: [V: +2 C: +2] Added bogus secret file for etcd.deployment-prep.eqiad1.wikimedia.cloud.key [labs/private] - https://gerrit.wikimedia.org/r/667612 (owner: Andrew Bogott) [14:52:40] (CR) Elukey: "> Patch Set 1:" [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [14:56:21] (CR) Awight: ReferenceTooltips and other gadget names for ReferencePreviews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [14:57:54] (CR) Muehlenhoff: [C: +2] debmonitor: Bump the uwsgi buffer size to 8192 [puppet] - https://gerrit.wikimedia.org/r/667132 (https://phabricator.wikimedia.org/T275599) (owner: Muehlenhoff) [14:58:04] SRE, DC-Ops, Platform Engineering, serviceops, Patch-For-Review: Rename wtp* servers to parse* (Parsoid PHP servers) - https://phabricator.wikimedia.org/T245888 (jijiki) Since we will be moving mediawiki to k8s relatively soon, I am not sure if it is worth the hassle at this point. My opinion...
[14:59:26] (PS15) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [15:00:54] (PS16) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [15:01:40] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28300/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [15:08:09] (CR) JMeybohm: [C: +1] Support ANALYTICS_BASE_URL (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [15:08:49] (PS17) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [15:09:41] (CR) Kormat: [V: +1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28301/console" [puppet] - https://gerrit.wikimedia.org/r/667547 (owner: Kormat) [15:10:08] (PS1) Herron: promethues: scrape mtail metrics from logstash 7 cluster [puppet] - https://gerrit.wikimedia.org/r/667617 (https://phabricator.wikimedia.org/T276104) [15:11:37] !log rolling restart of ats-tls on cp[5007-5011] [15:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:45] (PS1) Urbanecm: Enable Growth features in idwiki in stealth mode [mediawiki-config] - https://gerrit.wikimedia.org/r/667619 (https://phabricator.wikimedia.org/T259024) [15:15:07] SRE, ops-codfw, DBA, DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (Papaul) @LSobanski yes it can be added to the template.
[15:19:58] (CR) MSantos: maps: fix imposm3 cache dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667598 (owner: MSantos) [15:20:22] (PS3) MSantos: maps: fix imposm3 cache dir [puppet] - https://gerrit.wikimedia.org/r/667598 [15:21:37] (CR) JMeybohm: [C: +2] mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: PipelineBot) [15:22:21] (PS18) Kormat: [WIP] mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 [15:22:23] (Merged) jenkins-bot: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663873 (https://phabricator.wikimedia.org/T274262) (owner: PipelineBot) [15:36:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:58] (CR) Alexandros Kosiaris: Support ANALYTICS_BASE_URL (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/667561 (owner: Alexandros Kosiaris) [15:40:42] (PS1) Muehlenhoff: Bump docker.io versions used by CI/releases Jenkins [puppet] - https://gerrit.wikimedia.org/r/667627 [15:42:31] (PS19) Kormat: mariadb: Add section parameters [puppet] - https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) [15:42:57] (CR) Hashar: [C: +1] Bump docker.io versions used by CI/releases Jenkins [puppet] - https://gerrit.wikimedia.org/r/667627 (owner: Muehlenhoff) [15:44:01] (CR) Thiemo Kreuz (WMDE): [C: +1] ReferenceTooltips and other gadget names for ReferencePreviews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [15:44:28] (CR) Muehlenhoff: [C: +2] Bump docker.io versions used by CI/releases Jenkins [puppet] -
https://gerrit.wikimedia.org/r/667627 (owner: Muehlenhoff) [15:44:32] SRE, MW-on-K8s, observability, serviceops: Keep calculating latencies for MediaWiki requests that happen k8s - https://phabricator.wikimedia.org/T276095 (colewhite) Another possible solution is to extract the metrics via a sum aggregation query with prometheus-es-exporter. It's pretty easy to se... [15:44:37] (PS2) Muehlenhoff: Bump docker.io versions used by CI/releases Jenkins [puppet] - https://gerrit.wikimedia.org/r/667627 [15:54:19] (CR) Awight: ReferenceTooltips and other gadget names for ReferencePreviews (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [15:54:39] (CR) Kormat: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1003/28303/" [puppet] - https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [15:56:27] (PS1) Muehlenhoff: docker: Simply use the default package version on Buster and later [puppet] - https://gerrit.wikimedia.org/r/667628 [15:57:20] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mathoid' for release 'staging' .
[15:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:30] (CR) David Caro: "> One note: all the details that you added in the doc code change have been followed by people since a long time, so the "some experience " [software/spicerack] - https://gerrit.wikimedia.org/r/666601 (owner: David Caro) [16:01:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:04:45] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/667628 (owner: Muehlenhoff) [16:11:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:12:27] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Legoktm) [16:12:33] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Legoktm) [16:14:18] (PS4) Ahmon Dancy: wmf-config/CommonSettings.php: Add MW_NO_ETCD handling [mediawiki-config] - https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) [16:14:33] (CR) Ahmon Dancy: wmf-config/CommonSettings.php: Add MW_NO_ETCD handling (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/667244 (https://phabricator.wikimedia.org/T238436) (owner: Ahmon Dancy) [16:15:34] (PS1) Legoktm: install_server: Swtich mw1307 (jobrunner) back to stretch [puppet] - https://gerrit.wikimedia.org/r/667631 (https://phabricator.wikimedia.org/T275752) [16:16:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All
metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:18] (CR) Legoktm: "I think we'll need to bump the epoch to force re-rendering of the images with new fonts..." [mediawiki-config] - https://gerrit.wikimedia.org/r/667571 (https://phabricator.wikimedia.org/T188997) (owner: Aklapper) [16:20:00] (PS1) Gergő Tisza: GrowthExperiments: update variable name [mediawiki-config] - https://gerrit.wikimedia.org/r/667634 [16:20:31] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'mathoid' for release 'production' . [16:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:03] (CR) jerkins-bot: [V: -1] GrowthExperiments: update variable name [mediawiki-config] - https://gerrit.wikimedia.org/r/667634 (owner: Gergő Tisza) [16:26:49] (CR) Hnowlan: [V: +1 C: +2] prometheus::postgres_exporter: disk metrics and custom queries [puppet] - https://gerrit.wikimedia.org/r/666888 (https://phabricator.wikimedia.org/T248858) (owner: Hnowlan) [16:35:58] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1097.eqiad.wmnet with reason: REIMAGE [16:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:00] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1097.eqiad.wmnet with reason: REIMAGE [16:38:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:31] SRE, Analytics-Radar, Machine-Learning-Team: Review ROCm deployment procedures and current packages - https://phabricator.wikimedia.org/T275896 (fdans) [16:43:34] SRE, Analytics, Traffic: Downloading from Archiva.wikimedia.org seems slower than Maven Central - https://phabricator.wikimedia.org/T273086 (hashar) p:Low→Medium We had Nginx buffering disabled but there are
still unreasonable delay to start a transfer. There are Java CI builds failing random... [16:43:36] (CR) Jcrespo: "Not a review, but a heads up, that I think you will find interesting for me to point out:" [puppet] - https://gerrit.wikimedia.org/r/667547 (https://phabricator.wikimedia.org/T275497) (owner: Kormat) [16:47:48] (CR) Filippo Giunchedi: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667617 (https://phabricator.wikimedia.org/T276104) (owner: Herron) [16:51:57] (CR) Gergő Tisza: "extension.json was updated in I4bf9e023368 which will be deployed this week. So we either need this config patch or to wait until Thursday" [mediawiki-config] - https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: Kosta Harlan) [17:00:14] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 56.15 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [17:00:18] (CR) Dzahn: [C: +2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Atayal" [dns] - https://gerrit.wikimedia.org/r/667105 (https://phabricator.wikimedia.org/T275803) (owner: Gerrit maintenance bot) [17:01:14] (PS2) Dzahn: Add tay to langlist helper [dns] - https://gerrit.wikimedia.org/r/667105 (https://phabricator.wikimedia.org/T275803) (owner: Gerrit maintenance bot) [17:02:32] (PS1) Elukey: profile::python38: limit the scope of the module to Buster [puppet] - https://gerrit.wikimedia.org/r/667640 [17:03:53] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' .
[17:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:31] (CR) Elukey: [C: +2] profile::python38: limit the scope of the module to Buster [puppet] - https://gerrit.wikimedia.org/r/667640 (owner: Elukey) [17:05:23] !log new Wikimedia project language - tay - Atayal is spoken by the Atayal people of Taiwan [17:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:34] !log our latest Wikipedia language edition ready to move on from the incubator https://tay.wikipedia.org [17:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:41] SRE, ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) a:Jclark-ctr→fgiunchedi Status update: I assumed this operation would be simpler and take less time; at this point the host is best decom'd altogether (since it is already 4y old anyway, we can take... [17:09:05] (CR) Alexandros Kosiaris: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/667015 (owner: JMeybohm) [17:09:32] SRE, ops-eqiad, User-fgiunchedi: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (fgiunchedi) [17:14:33] (CR) Dzahn: [C: +2] install_server: Swtich mw1307 (jobrunner) back to stretch [puppet] - https://gerrit.wikimedia.org/r/667631 (https://phabricator.wikimedia.org/T275752) (owner: Legoktm) [17:17:04] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:17:05] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet [17:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:12] SRE, serviceops, Patch-For-Review: Jobrunner
on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for hosts: ` mw1307.eqiad.wmnet ` The log can be found in `/var/log/wm... [17:22:43] (CR) Dzahn: "please check/deploy grants for new hardware mwmaint2002 that is replacing mwmaint2001 due to age, thank you" [puppet] - https://gerrit.wikimedia.org/r/667240 (https://phabricator.wikimedia.org/T274170) (owner: Dzahn) [17:24:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:24:25] (PS2) Dzahn: site: add mwmaint2002.codfw.wmnet to maintenance server role [puppet] - https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) [17:25:04] (CR) jerkins-bot: [V: -1] site: add mwmaint2002.codfw.wmnet to maintenance server role [puppet] - https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [17:25:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:26:04] (Abandoned) Gergő Tisza: GrowthExperiments: update variable name [mediawiki-config] - https://gerrit.wikimedia.org/r/667634 (owner: Gergő Tisza) [17:27:47] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:29:12] (PS1) Hnowlan: prometheus::postgres_exporter: Load additional rules on stretch [puppet] - https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T269581) [17:29:58] SRE, ops-eqiad, Analytics: Degraded RAID on an-worker1097 - https://phabricator.wikimedia.org/T274819 (elukey) To keep archives happy - disk formatted and re-added back in service. [17:34:16] (CR) Hnowlan: [V: +1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28304/console" [puppet] - https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [17:35:13] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:36:39] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:37:09] (PS1) Elukey: Set an-worker1098 as Hadoop GPU Buster worker [puppet] - https://gerrit.wikimedia.org/r/667668 (https://phabricator.wikimedia.org/T231067) [17:37:32] (PS2) Jdlrobson: Enable og tags on non-wikidata wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/667007 (https://phabricator.wikimedia.org/T157145) [17:37:38] (CR) Elukey: [C: +2] Set an-worker1098 as Hadoop GPU Buster worker [puppet] - https://gerrit.wikimedia.org/r/667668 (https://phabricator.wikimedia.org/T231067) (owner: Elukey) [17:37:45] (PS3) Dzahn: site: add mwmaint2002.codfw.wmnet to maintenance server role [puppet] - https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) [17:39:20] SRE: Log the real X-Client-IP in apache mediawiki logs - https://phabricator.wikimedia.org/T246348 (jijiki) [17:39:38] (CR) Cwhite: [C: +1] "Great find! This ought to be deployed before I2b890c707616c2a00c8a38b63f2c548ca30b8f34."
[puppet] - https://gerrit.wikimedia.org/r/667617 (https://phabricator.wikimedia.org/T276104) (owner: Herron) [17:40:25] (PS2) Effie Mouzeli: (WIP) mediawiki::alerts: add per cluster error/fatals rate alert [puppet] - https://gerrit.wikimedia.org/r/666719 (https://phabricator.wikimedia.org/T262078) [17:40:50] (CR) Dzahn: [C: -2] "stalled until switched" [puppet] - https://gerrit.wikimedia.org/r/667288 (https://phabricator.wikimedia.org/T275928) (owner: Dzahn) [17:41:15] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE [17:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:17] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1307.eqiad.wmnet with reason: REIMAGE [17:43:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:31] SRE, Continuous-Integration-Infrastructure, serviceops: legoktm can't build CI docker images without using root because he's no longer in contint-admins - https://phabricator.wikimedia.org/T275731 (hashar) I would rather have cherry picked people that knows about docker-pkg / CI. But I guess it is fi... [17:44:12] (PS1) Dzahn: switch (unused) mwmaint codfw server to mwmaint2002 [dns] - https://gerrit.wikimedia.org/r/667675 (https://phabricator.wikimedia.org/T275905) [17:44:23] SRE, serviceops, Patch-For-Review: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 (Dzahn) p:Triage→High [17:44:59] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (Sergey.Trofimovsky.SF) This is the confirmation that the L3 document is signed.
`You signed this document on Fri, Feb 26, 7:18 PM.` [17:46:12] SRE, netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (ops-monitoring-bot) Script wmf-auto-reimage was launched by volans on cumin1001.eqiad.wmnet for hosts: ` sretest1002.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/202103011745_volans_18022_sretest1002_eqiad_w... [17:46:38] SRE, ops-eqiad, DC-Ops, decommission-hardware: decommission db1092.eqiad.wmnet - https://phabricator.wikimedia.org/T275019 (wiki_willy) a:wiki_willy→Cmjohnson [17:46:44] (CR) Dzahn: [C: +1] "commented out anyways, technically ready after role is applied" [dns] - https://gerrit.wikimedia.org/r/667675 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [17:47:12] (CR) Alexandros Kosiaris: prometheus::postgres_exporter: Load additional rules on stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T269581) (owner: Hnowlan) [17:48:14] (PS2) Hnowlan: prometheus::postgres_exporter: Load additional rules on stretch [puppet] - https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T248858) [17:48:46] (CR) RLazarus: [C: +1] "Currently we aren't set up to have two maintenance servers in the active DC -- jobs would start on both machines." [puppet] - https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [17:48:52] (CR) Hnowlan: prometheus::postgres_exporter: Load additional rules on stretch (1 comment) [puppet] - https://gerrit.wikimedia.org/r/667645 (https://phabricator.wikimedia.org/T248858) (owner: Hnowlan) [17:52:12] (CR) Dzahn: "ACK, it's only ok because codfw is currently not active, I'll make it a quick switch away from mwmaint2001.
thanks" [puppet] - https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [17:52:42] (PS1) Ahmon Dancy: DevServices.php: Add placeholder for eventgate-analytics-external [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/667676 [17:54:23] (CR) Dzahn: [C: +2] site: add mwmaint2002.codfw.wmnet to maintenance server role [puppet] - https://gerrit.wikimedia.org/r/667292 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [17:56:24] SRE, serviceops: Renew certs for mcrouter on all mw appservers - https://phabricator.wikimedia.org/T276029 (RLazarus) >>! In T276029#6869338, @Joe wrote: > @RLazarus in https://phabricator.wikimedia.org/T248093#6076630 you mentioned committing a script for automating cert renewal, and I see it indeed. Re... [17:57:35] rzl: arr, forgot mcrouter cert is still needed for mwmaint2002, but creating it [17:58:09] !log volans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [17:58:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:14] (PS2) Zabe: Enable babel categorize on thwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/667348 (https://phabricator.wikimedia.org/T275283) [17:59:59] !log puppetmaster1001 - generating mcrouter cert for mwmaint2002 T275905 [18:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] ryankemper: Dear deployers, time to do the Wikidata Query Service weekly deploy deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T1800).
[18:00:06] T275905: move mwmaint2002 into production, replace mwmaint2001 - https://phabricator.wikimedia.org/T275905 [18:00:12] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1002.eqiad.wmnet with reason: REIMAGE [18:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:29] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (Papaul) [18:02:44] SRE, Product-Analytics, SRE-Access-Requests, Structured-Data-Backlog (Current Work): Add Matthew Williams to analytics-privatedata-users - https://phabricator.wikimedia.org/T275671 (CBogen) [18:03:19] (PS1) Dzahn: add fake mcrouter certs for mwmaint2002 [labs/private] - https://gerrit.wikimedia.org/r/667679 (https://phabricator.wikimedia.org/T275905) [18:03:53] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on mwmaint2002.codfw.wmnet with reason: new install [18:03:53] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on mwmaint2002.codfw.wmnet with reason: new install [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:31] (CR) Dzahn: "and this is where I will remove mwmaint2001 right away.. rebasing" [puppet] - https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) (owner: Dzahn) [18:06:31] (CR) CRusnov: [C: +1] "looks good!"
[puppet] - https://gerrit.wikimedia.org/r/667549 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [18:06:35] mutante: aha, nod [18:07:10] (CR) RLazarus: [C: +1] switch (unused) mwmaint codfw server to mwmaint2002 [dns] - https://gerrit.wikimedia.org/r/667675 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [18:07:15] SRE, netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['sretest1002.eqiad.wmnet'] ` and were **ALL** successful. [18:07:39] (CR) Zppix: "> Node 'mwmaint2002.codfw.wmnet' is already defined (file: /srv/workspace/puppet/manifests/site.pp, line: 1867); cannot redefine (file: /s" [puppet] - https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) (owner: Dzahn) [18:07:45] ack, cert created. puppet is running now, it takes while [18:08:15] (CR) RLazarus: [C: +1] "LGTM, pending the rebase" [puppet] - https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) (owner: Dzahn) [18:08:43] (CR) CRusnov: [C: +1] "I have not tested, but it looks good to me." [puppet] - https://gerrit.wikimedia.org/r/667550 (https://phabricator.wikimedia.org/T244840) (owner: Volans) [18:10:03] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1098.eqiad.wmnet with reason: REIMAGE [18:10:08] (CR) Hashar: "That might be fine. There are a bunch of callers to ::docker or ::docker:engine that would need to be adjusted.
Then I guess we can pick" [puppet] - https://gerrit.wikimedia.org/r/667628 (owner: Muehlenhoff) [18:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:12] (PS1) Phuedx: Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" [mediawiki-config] - https://gerrit.wikimedia.org/r/667680 (https://phabricator.wikimedia.org/T249297) [18:10:19] (PS2) Dzahn: site: remove mwmaint2001.codfw.mwnet [puppet] - https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) [18:10:34] (CR) Dzahn: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/667293 (https://phabricator.wikimedia.org/T275928) (owner: Dzahn) [18:12:04] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1098.eqiad.wmnet with reason: REIMAGE [18:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:16:02] (CR) Hashar: [C: -1] "We can even drop the $ic_clone_require variable entirely ;)" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/667102 (owner: Muehlenhoff) [18:16:19] PROBLEM - Logstash Elasticsearch indexing errors #o11y on alert1001 is CRITICAL: 32.05 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/3283cc1372b7df18f26128163125cf45 https://grafana.wikimedia.org/dashboard/db/logstash [18:18:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:18:30] SRE, SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (greg) >>!
In T275677#6865296, @KFrancis wrote: > @jbond @jbond Hello, would you please confirm if Oly Kalinichenko us an employee or contractor fo... [18:19:41] RECOVERY - Stale file for node-exporter textfile in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Stale_file_for_node-exporter_textfile https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [18:19:43] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:20:46] !log mwmaint2002 - shutting down for maintenance [18:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:59] (CR) Ahmon Dancy: [C: +2] DevServices.php: Add placeholder for eventgate-analytics-external [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/667676 (owner: Ahmon Dancy) [18:22:13] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1307.eqiad.wmnet'] ` and were **ALL** successful.
[18:22:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1307.eqiad.wmnet [18:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:03] (PS1) Legoktm: docker_pusher: Use new docker-registry credentials [puppet] - https://gerrit.wikimedia.org/r/667682 (https://phabricator.wikimedia.org/T273521) [18:23:21] (Merged) jenkins-bot: DevServices.php: Add placeholder for eventgate-analytics-external [mediawiki-config] (train-dev) - https://gerrit.wikimedia.org/r/667676 (owner: Ahmon Dancy) [18:24:31] !log mw1307 - back to stretch now [18:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:24:49] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1307.eqiad.wmnet [18:24:51] jouncebot: now [18:24:51] For the next 0 hour(s) and 5 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T1800) [18:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:28] SRE, serviceops: Jobrunner on Buster occasional timeout on codfw file upload - https://phabricator.wikimedia.org/T275752 (Dzahn) >>! In T275752#6869162, @Joe wrote: > My suggestion is we re-image one jobrunner to stretch and we check if that changes things dramatically. @Legoktm / @Dzahn can you take car... [18:26:23] !log [Relforge] Lifting downtime on `relforge1004` now that T275658 is done [18:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:31] T275658: Kibana: Render kibana settings file based off of Kibana/Elasticsearch version - https://phabricator.wikimedia.org/T275658 [18:33:43] (CR) Dduvall: [C: +1] "Looks right to me."
[puppet] - https://gerrit.wikimedia.org/r/667682 (https://phabricator.wikimedia.org/T273521) (owner: Legoktm) [18:36:03] (PS2) Urbanecm: Define wmgGEFeaturesMayBeAvailableToNewcomers that controls whether GE features are newcomer-deployed [mediawiki-config] - https://gerrit.wikimedia.org/r/667581 (https://phabricator.wikimedia.org/T276091) [18:40:19] (PS3) Urbanecm: Simplify deployment of Growth team features (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/667581 (https://phabricator.wikimedia.org/T276091) [18:40:21] (PS1) Urbanecm: Simplify deployment of Growth team features (2/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/667686 (https://phabricator.wikimedia.org/T276091) [18:40:24] (PS1) Urbanecm: Simplify deployment of Growth team features (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/667687 (https://phabricator.wikimedia.org/T276091) [18:40:59] lemme merge those three patches [18:41:22] (CR) Urbanecm: [C: +2] Simplify deployment of Growth team features (1/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/667581 (https://phabricator.wikimedia.org/T276091) (owner: Urbanecm) [18:42:32] (Restored) Gergő Tisza: EventLoggingSchemas: Bump HomepageVisit version [mediawiki-config] - https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: Kosta Harlan) [18:42:35] !log mwmaint2002.mgmt - racadm serveraction powerup [18:42:39] (PS2) Gergő Tisza: EventLoggingSchemas: Bump HomepageVisit version [mediawiki-config] - https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: Kosta Harlan) [18:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:27] (CR) Dzahn: [C: +2] scap: add mwmaint2002 to dsh groups [puppet] - https://gerrit.wikimedia.org/r/667267 (https://phabricator.wikimedia.org/T275905) (owner: Dzahn) [18:44:32] (PS2) Dzahn: scap: add mwmaint2002 to dsh groups
[puppet] - 10https://gerrit.wikimedia.org/r/667267 (https://phabricator.wikimedia.org/T275905) [18:44:35] jouncebot: now [18:44:35] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [18:47:34] what's wrong with CI? [18:47:53] https://integration.wikimedia.org/zuul/ [18:48:05] too many patches [18:48:08] (03CR) 10Dzahn: [V: 03+2 C: 03+2] add fake mcrouter certs for mwmaint2002 [labs/private] - 10https://gerrit.wikimedia.org/r/667679 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [18:48:13] yeah, cluttered :/ [18:48:23] ah, not just me [18:49:05] the jenkins ui is not loading at all for some reason [18:49:06] (03PS1) 10Jdlrobson: Fixes max-width configuration for new Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667688 (https://phabricator.wikimedia.org/T260091) [18:49:11] now it did, just super slow [18:49:15] Majavah: also https://integration.wikimedia.org/ci/ takes A LOT of time to load [18:49:37] can confirm it's slow but still working. it just verified me [18:49:40] submitting [18:49:51] how did it fill that bad [18:50:45] did someone push a metric ton of automated patches at the same time or something [18:50:56] confirmed it's not general load on contint1001 [18:50:59] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [18:52:10] https://integration.wikimedia.org/zuul/ [18:52:11] (03Merged) 10jenkins-bot: Simplify deployment of Growth team features (1/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667581 (https://phabricator.wikimedia.org/T276091) (owner: 10Urbanecm) [18:52:18] (03CR) 10Urbanecm: [C: 03+2] Simplify deployment of Growth team features (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667686 (https://phabricator.wikimedia.org/T276091) (owner: 10Urbanecm) [18:52:45] Majavah: wiki creation request happened but that's not a "ton" of patches [18:52:53] When maint bot runs [18:53:00] RECOVERY - 
dhclient process on sretest1001 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [18:54:09] https://grafana.wikimedia.org/d/000000321/zuul?orgId=1 [18:54:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: e991806eb9dc5ec018ebc59832d02e8a6563ba0a: Simplify deployment of Growth team features (1/3; T276091) (duration: 00m 57s) [18:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:41] T276091: [config] Make it easier to deploy features in shadow mode - https://phabricator.wikimedia.org/T276091 [18:54:56] Majavah: https://grafana.wikimedia.org/d/000000321/zuul?orgId=1&from=1614020079042&to=1614624879042&viewPanel=25 [18:55:34] 10SRE, 10netbox, 10Patch-For-Review: Add SSO support to netbox - https://phabricator.wikimedia.org/T244849 (10crusnov) I've experimented with SSO on netbox-next and been reading a lot of code, and this is an update on all of that. There are a few ways to do this, obviously, and they all have upsides and dow... 
[18:55:41] the "gate processing time" part of https://grafana.wikimedia.org/d/000000321/zuul?orgId=1 has a slightly red color [18:55:50] but not broken [18:55:57] (03CR) 10Urbanecm: [C: 03+2] Simplify deployment of Growth team features (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667687 (https://phabricator.wikimedia.org/T276091) (owner: 10Urbanecm) [18:56:00] but there are a ton of new jobs, yes [18:56:03] (03Merged) 10jenkins-bot: Simplify deployment of Growth team features (2/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667686 (https://phabricator.wikimedia.org/T276091) (owner: 10Urbanecm) [18:56:05] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [18:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:25] the queue just spiked up at 16:00Z [18:57:22] (03Merged) 10jenkins-bot: Simplify deployment of Growth team features (3/3) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667687 (https://phabricator.wikimedia.org/T276091) (owner: 10Urbanecm) [18:57:52] maybe we should just ask releng if they have any ideas [18:58:51] tbh its probably just a case of being patient [18:59:39] I'm curious what caused the spike [19:01:00] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: de0f74126eddafb5375b853d543b377e78544caa: Simplify deployment of Growth team features (2/3; T276091) (duration: 00m 57s) [19:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:07] T276091: [config] Make it easier to deploy features in shadow mode - https://phabricator.wikimedia.org/T276091 [19:01:51] just saw mention of releng. Anything I can help with? [19:02:01] dancy: CI is overloaded and the interface is slow [19:02:22] maybe it's just a spike of patches, maybe it's something bigger, no idea, jenkins is just a black box to me [19:02:27] ok. 
I'll poke around and see if something stands out [19:02:37] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 599b7390c840388d97dc4cdbf1796451d4024c22: Simplify deployment of Growth team features (3/3; T276091) (duration: 01m 00s) [19:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:07] (03PS1) 10Ottomata: Configure spark to work better with conda environments [puppet] - 10https://gerrit.wikimedia.org/r/667689 (https://phabricator.wikimedia.org/T272313) [19:03:58] It looks like almost all of the available docker agents are busy.... [19:04:13] it's backlogged https://integration.wikimedia.org/zuul/ [19:04:34] dancy: looking at the patches it looks like a bunch of humans just did all their MediaWiki work at once [19:04:37] nod.. lots of mediawiki commits being tested. [19:04:39] jouncebot seems to be AWOL as well [19:04:45] good point [19:04:47] jouncebot: next [19:04:47] In 0 hour(s) and 55 minute(s): deployment server switch to deploy1002 on buster!! (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T2000) [19:04:51] jouncebot: now [19:04:51] For the next 0 hour(s) and 55 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T1900) [19:05:05] tgr_: I'm currently preparing patches for the GE deployment [19:05:09] and i plan to deploy them [19:05:37] tgr_: in the meanwhile, if you want to go ahead with your and phuedx 's patch, feel free to go ahead [19:05:42] no rush. But the bot shouldn't know that, right? 
[19:06:01] yeah, it's supposed to throw a ping [19:06:03] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [19:06:05] no idea why it didn't [19:06:11] Urbanecm: maybe it didn't refresh [19:06:24] Zppix: morning B&C window has been scheduled for months [19:06:45] oh, didn't realize you meant the B&C, never mind [19:10:30] tgr_, Urbanecm: Would I be stepping on your toes if I had a go at deploying them? [19:10:43] phuedx: fine with me, just remember to ping me once done [19:11:58] tgr_: I'll deploy yours first [19:13:09] Cool? [19:14:04] (03PS1) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667691 (https://phabricator.wikimedia.org/T275550) [19:14:45] stealth mode sounds interesting [19:14:59] Majavah: it is :) [19:19:03] 10SRE, 10ops-codfw, 10DC-Ops: (Need By: 2021-03-31) rack/setup/install ms-backup200[12] - https://phabricator.wikimedia.org/T274202 (10Papaul) [19:19:37] tgr_: ? [19:20:11] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:24] (03PS1) 10Urbanecm: Enable Growth features on hrwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667692 (https://phabricator.wikimedia.org/T275684) [19:20:30] phuedx: go ahead with yours, and skip tgr's :) [19:20:51] (03PS2) 10Phuedx: Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667680 (https://phabricator.wikimedia.org/T249297) [19:21:16] Urbanecm: Do you know anything about the ` /w/api.php InvalidArgumentException: The Title object yields no ID. Perhaps the page [[Benutzer:Celestine_Viciente/Chelmonops_curiosus]] doesn't exist?` errors on dewiki?
[19:21:28] no, but I'll look [19:21:32] (03PS1) 10Mholloway: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS [extensions/WikimediaEvents] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667656 [19:21:34] cool thx. [19:21:51] (03PS2) 10Urbanecm: Enable Growth features on sqwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667691 (https://phabricator.wikimedia.org/T275550) [19:21:56] If you don't mind, hit me with a private message when you have some info [19:22:14] (03CR) 10Dzahn: [C: 03+2] switch (unused) mwmaint codfw server to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/667675 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [19:22:17] (03PS2) 10Dzahn: switch (unused) mwmaint codfw server to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/667675 (https://phabricator.wikimedia.org/T275905) [19:22:34] dancy: sure. Please do ping me if you don't hear from me [19:22:43] ok [19:24:05] (03CR) 10Phuedx: [C: 03+2] "Backport window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667680 (https://phabricator.wikimedia.org/T249297) (owner: 10Phuedx) [19:24:10] phuedx: sorry, I was AFK. [19:25:19] dancy: maybe related to the job ebernhardson is running, which seems to be running into problems with pages with no revisions? [19:26:02] (03Abandoned) 10Mholloway: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS [extensions/WikimediaEvents] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667656 (owner: 10Mholloway) [19:26:33] tgr_: Thanks for the info. Is that a bug in the job code or something else? [19:27:14] (03CR) 10Dzahn: [C: 03+2] tcpircbot: add mwmaint2002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/667238 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [19:27:16] no, the job just calls the API. 
[19:27:30] (03Merged) 10jenkins-bot: Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667680 (https://phabricator.wikimedia.org/T249297) (owner: 10Phuedx) [19:27:47] (03PS1) 10Urbanecm: Enable Growth features on eowiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123) [19:27:53] OK. I'll ignore those errors for now and I will assume they'll stop at some point. [19:28:18] btw, did you look at some web page to determine that ebernhardson is running a job or was that something you just happened to know? [19:28:22] (03PS1) 10Jdlrobson: prometheus: Setup configuration for client error alerts [puppet] - 10https://gerrit.wikimedia.org/r/667695 (https://phabricator.wikimedia.org/T264665) [19:28:27] Change is on mwdebug1001 [19:31:40] (03PS2) 10Urbanecm: WIP: Enable Growth features on eowiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123) [19:32:21] (03PS1) 10Legoktm: Add docker-registry passwords for I2abc30c81ba [labs/private] - 10https://gerrit.wikimedia.org/r/667696 [19:32:32] Testing looks good. 
Deploying now [19:32:41] (03CR) 10RLazarus: [C: 03+1] add mwmaint2002 to maintenance hosts list for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/667245 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [19:32:57] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Add docker-registry passwords for I2abc30c81ba [labs/private] - 10https://gerrit.wikimedia.org/r/667696 (owner: 10Legoktm) [19:34:14] !log phuedx@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:667680|Revert "Revert "vector: Stage 2 of WVUI search treatment A/B test"" (T249297)]] (duration: 00m 54s) [19:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:21] T249297: Deploy the new Vue.js search experience - https://phabricator.wikimedia.org/T249297 [19:34:51] Urbanecm: That's deployed :) [19:34:56] thanks! [19:35:05] tgr_: wanna you go now? [19:35:53] Urbanecm: will do, thanks. [19:36:15] (03PS3) 10Gergő Tisza: EventLoggingSchemas: Bump HomepageVisit version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: 10Kosta Harlan) [19:36:39] (03CR) 10Gergő Tisza: [C: 03+2] EventLoggingSchemas: Bump HomepageVisit version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: 10Kosta Harlan) [19:36:45] (03CR) 10Dzahn: [C: 03+2] add mwmaint2002 to maintenance hosts list for firewalls [puppet] - 10https://gerrit.wikimedia.org/r/667245 (https://phabricator.wikimedia.org/T275905) (owner: 10Dzahn) [19:37:52] (03CR) 10Ppchelko: [C: 03+1] api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [19:38:04] (03PS3) 10Urbanecm: Enable Growth features on eowiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667694 (https://phabricator.wikimedia.org/T276123) [19:38:42] 
(03Merged) 10jenkins-bot: EventLoggingSchemas: Bump HomepageVisit version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/666842 (https://phabricator.wikimedia.org/T275615) (owner: 10Kosta Harlan) [19:41:14] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:666842|EventLoggingSchemas: Bump HomepageVisit version (T275615)]] (duration: 00m 56s) [19:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:21] T275615: 'impact_module_state' is a required property - https://phabricator.wikimedia.org/T275615 [19:41:38] (03CR) 10Andrew Bogott: [C: 03+2] Keystone: stop monitoring for novaadmin/novaobserver project membership [puppet] - 10https://gerrit.wikimedia.org/r/667422 (https://phabricator.wikimedia.org/T274385) (owner: 10Andrew Bogott) [19:42:01] (03CR) 10Andrew Bogott: [C: 03+2] wmfkeystonehooks: stop adding service users to all projects [puppet] - 10https://gerrit.wikimedia.org/r/667423 (https://phabricator.wikimedia.org/T274385) (owner: 10Andrew Bogott) [19:42:43] (03PS2) 10Legoktm: docker_pusher: Use new docker-registry credentials [puppet] - 10https://gerrit.wikimedia.org/r/667682 (https://phabricator.wikimedia.org/T273521) [19:42:43] Urbanecm: done, thanks [19:42:51] thanks [19:44:50] (03CR) 10Legoktm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28307/console" [puppet] - 10https://gerrit.wikimedia.org/r/667682 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [19:44:53] (03PS2) 10Urbanecm: Enable Growth features on hrwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667692 (https://phabricator.wikimedia.org/T275684) [19:46:02] (03PS1) 1020after4: topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 [19:46:31] (03CR) 10jerkins-bot: [V: 04-1] topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 
(owner: 1020after4) [19:47:14] (03PS2) 1020after4: topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) [19:49:13] (03CR) 10Urbanecm: [C: 03+2] Enable Growth features on hrwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667692 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [19:49:48] (03CR) 10Herron: [C: 03+2] promethues: scrape mtail metrics from logstash 7 cluster [puppet] - 10https://gerrit.wikimedia.org/r/667617 (https://phabricator.wikimedia.org/T276104) (owner: 10Herron) [19:50:01] (03Merged) 10jenkins-bot: Enable Growth features on hrwiki in stealth mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667692 (https://phabricator.wikimedia.org/T275684) (owner: 10Urbanecm) [19:50:43] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [19:52:41] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d53834e: Enable Growth features on hrwiki in stealth modeEnable Growth features on hrwiki in stealth mode (1/3; T275684) (duration: 00m 55s) [19:52:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:48] T275684: Deploy Growth features on Croatian Wikipedia - https://phabricator.wikimedia.org/T275684 [19:53:58] !log urbanecm@deploy1001 Synchronized dblists/growthexperiments.dblist: d53834e: Enable Growth features on hrwiki in stealth modeEnable Growth features on hrwiki in stealth mode (2/3; T275684) (duration: 00m 56s) [19:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:18] !log urbanecm@deploy1001 sync-file aborted: d53834e: Enable Growth features on hrwiki in stealth modeEnable Growth features on hrwiki in stealth mode (3/3; T275684) (duration: 00m 03s) [19:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:54:28] (03CR) 10Dzahn: [C: 03+1] topic: Update Phatality dsh targets for kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/667700 (https://phabricator.wikimedia.org/T272655) (owner: 1020after4) [19:55:24] !log urbanecm@deploy1001 Synchronized wmf-config/config/hrwiki.yaml: d53834e: Enable Growth features on hrwiki in stealth modeEnable Growth features on hrwiki in stealth mode (3/3; T275684) (duration: 00m 54s) [19:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:49] (03Restored) 10Mholloway: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS [extensions/WikimediaEvents] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667656 (owner: 10Mholloway) [19:56:02] (03PS2) 10Mholloway: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS [extensions/WikimediaEvents] (wmf/1.36.0-wmf.32) - 10https://gerrit.wikimedia.org/r/667656 [19:59:03] (03CR) 10Cwhite: [C: 03+2] prometheus: Setup configuration for client error alerts [puppet] - 10https://gerrit.wikimedia.org/r/667695 (https://phabricator.wikimedia.org/T264665) (owner: 10Jdlrobson) [20:00:04] mutante and twentyafterfour: #bothumor I � Unicode. All rise for deployment server switch to deploy1002 on buster!! deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T2000). [20:00:42] \o/ [20:01:07] \o [20:01:48] mutante: where to begin? [20:01:51] dns? [20:02:00] (03PS2) 10Jdlrobson: Fixes max-width configuration for new Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667688 (https://phabricator.wikimedia.org/T260091) [20:02:39] PROBLEM - configured eth on sretest1002 is CRITICAL: eno2 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [20:04:37] (03CR) 10Nray: [C: 03+1] Fixes max-width configuration for new Vector [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667688 (https://phabricator.wikimedia.org/T260091) (owner: 10Jdlrobson) [20:06:49] PROBLEM - dhclient process on sretest1002 is CRITICAL: PROCS CRITICAL: 1 process with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [20:07:50] (03PS3) 10Legoktm: docker_pusher: Use new docker-registry credentials [puppet] - 10https://gerrit.wikimedia.org/r/667682 (https://phabricator.wikimedia.org/T273521) [20:08:07] twentyafterfour: I'd begin with placing a scap lock on both servers, if that's possible :) [20:11:38] (03PS1) 10Jdlrobson: Separate Wikivoyage wordmark and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T273477) [20:20:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:07] 10SRE: Malformed membership for ops user , has additional group(s): {'contint-admins', 'contint-docker'} - https://phabricator.wikimedia.org/T276165 (10jijiki) [20:21:33] (03PS2) 10Jdlrobson: Separate Wikivoyage wordmark and icon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T261033) [20:21:35] (03PS1) 10Jdlrobson: Update the Persian Wikipedia logos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/667704 (https://phabricator.wikimedia.org/T261033) [20:23:16] Urbanecm: not a bad idea [20:23:53] (03CR) 10Dzahn: [C: 03+2] switch deployment CNAME from deploy1001 to deploy1002 [dns] - 10https://gerrit.wikimedia.org/r/635113 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:23:57] (03PS3) 10Dzahn: switch deployment CNAME from deploy1001 to deploy1002 [dns] - 
10https://gerrit.wikimedia.org/r/635113 (https://phabricator.wikimedia.org/T265963) [20:24:40] (03CR) 10Legoktm: [C: 03+2] docker_pusher: Use new docker-registry credentials [puppet] - 10https://gerrit.wikimedia.org/r/667682 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [20:25:16] 10SRE, 10DBA, 10Performance-Team (Radar), 10Sustainability (MediaWiki-MultiDC): Apache <=> mariadb SSL/TLS for cross-datacenter writes - https://phabricator.wikimedia.org/T134809 (10Krinkle) [20:25:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:28:03] !log upgrade mc1029, mc2029 to memcached 1.6 [20:28:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:53] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1029.eqiad.wmnet [20:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:20] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1029.eqiad.wmnet [20:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:00] !log deploy1001 - disable puppet and manually create scap-global-lock - NO DEPLOYMENTS [20:37:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:40] (03CR) 10Kosta Harlan: [C: 03+1] "We want this to run for beta cluster, and it will be a no-op for production while our config flag has the link recommendation feature swit" [puppet] - 10https://gerrit.wikimedia.org/r/655865 (https://phabricator.wikimedia.org/T261408) (owner: 10Gergő Tisza) [20:42:25] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 231918200 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:44:51] RECOVERY 
- Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:46:38] (03CR) 10Dzahn: [C: 03+2] hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [20:49:17] (03CR) 10Kosta Harlan: [C: 03+1] api-gateway: generic discovery service config option, add linkrecommendation [deployment-charts] - 10https://gerrit.wikimedia.org/r/662692 (https://phabricator.wikimedia.org/T269581) (owner: 10Hnowlan) [20:49:53] (03PS1) 10Phamhi: wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) [20:50:23] (03CR) 10jerkins-bot: [V: 04-1] wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [20:52:25] (03PS4) 10Dzahn: hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) [20:53:23] (03PS2) 10Phamhi: wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) [20:53:53] (03CR) 10Kosta Harlan: Support ANALYTICS_BASE_URL (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/667561 (owner: 10Alexandros Kosiaris) [20:53:55] (03CR) 10jerkins-bot: [V: 04-1] wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [20:54:19] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Sergey Trofimovsky from Speed & Function - https://phabricator.wikimedia.org/T275722 (10KFrancis) @jbond Hello, I am confirming Sergey Trofimovsky is covered under Speed & Functions existing agreement. Please proceed with the acces... 
[20:54:57] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Eugene Chernov from Speed & Function - https://phabricator.wikimedia.org/T275679 (10KFrancis) @jbond Hello, I am confirming Eugene Chernov is covered under Speed & Function's existing agreement. Please proceed with the access. [20:55:38] 10SRE, 10SRE-Access-Requests: Requesting access to gitlab1001 / gitlab1002 for Oly Kalinichenko from Speed & Function - https://phabricator.wikimedia.org/T275677 (10KFrancis) @jbond Hello, I am confirming Oly Kalinichenko is covered under Speed & Function's existing agreement. Please proceed with the access. [20:55:47] (03PS3) 10Phamhi: wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) [20:57:00] (03CR) 10Bstorm: [C: 03+1] wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [20:57:47] (03CR) 10Phamhi: [C: 03+2] wikireplica: depool clouddbb1013 [puppet] - 10https://gerrit.wikimedia.org/r/667708 (https://phabricator.wikimedia.org/T273281) (owner: 10Phamhi) [21:00:11] (03PS5) 10Dzahn: hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) [21:01:32] (03CR) 10Dzahn: [C: 03+2] hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [21:01:38] (03PS6) 10Dzahn: hiera/scap: switch deployment server to deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635105 (https://phabricator.wikimedia.org/T265963) [21:05:19] !log re-enabling puppet on deploy1001 - running puppet on deploy*, switching eqiad scap master and deployment_server globally (T265963) [21:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:26] T265963: Replace production deployment 
servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [21:08:55] !log [mwdebug1001:~] $ /usr/local/lib/nagios/plugins/check_mw_versions --deployhost deploy1002.eqiad.wmnet - OKAY: wikiversions in sync (T265963) [21:09:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:06] (03PS2) 10Cwhite: profile: remove logstash inputs on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/663697 (https://phabricator.wikimedia.org/T234854) [21:14:03] (03CR) 10Cwhite: [C: 03+2] profile: remove logstash inputs on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/663697 (https://phabricator.wikimedia.org/T234854) (owner: 10Cwhite) [21:16:13] !log pooling mw1262 back [21:16:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:55] PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 16 down 2 https://wikitech.wikimedia.org/wiki/HAProxy [21:18:25] !log mw1262 - running puppet to switch to new deployment server, scap pull [21:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:36] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Art+Feminism Wikimedians Mailing List - https://phabricator.wikimedia.org/T275552 (10Masssly) @jbond Could you please resend the password to listserve@artandfeminism.org The address was not created until today, so the message didn't get in. Thanks! 
[21:24:58] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Art+Feminism Wikimedians Mailing List - https://phabricator.wikimedia.org/T275552 (10Masssly) 05Resolved→03Open [21:30:52] !log completed removal of kafka logging inputs to legacy logstash cluster - T234854 [21:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:00] T234854: Upgrade ELK Stack to version 7 - https://phabricator.wikimedia.org/T234854 [21:31:53] 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10Krinkle) a:05aaron→03Krinkle Next steps: 1. Update mediawiki/WANObjectCache to implement a new config option that... [21:35:51] PROBLEM - Check no envoy runtime configuration is left persistent on mwdebug1001 is CRITICAL: connect to address 127.0.0.1 and port 9631: Connection refused https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23Envoy [21:37:17] RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 18 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [21:37:31] (03PS1) 10Phamhi: Revert "wikireplica: depool clouddbb1013" [puppet] - 10https://gerrit.wikimedia.org/r/667658 [21:38:16] !log cumin 'mw*' 'grep master_rsync /etc/scap.cfg' showed all mw servers are now using deploy1002 (T265963) [21:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:23] T265963: Replace production deployment servers and update them to Buster - https://phabricator.wikimedia.org/T265963 [21:38:27] (03CR) 10Phamhi: [C: 03+2] Revert "wikireplica: depool clouddbb1013" [puppet] - 10https://gerrit.wikimedia.org/r/667658 (owner: 10Phamhi) [21:40:16] (03PS1) 10Effie Mouzeli: profile::templates::services_proxy: switch to ::1 when listen_ipv6 is true [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) [21:43:06] !log rebooted 
clouddb1013 for maintenance [21:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:31] (03PS1) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services proxy on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) [21:47:45] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition=2 prometheus=ops site=eqiad topic=rsyslog-notice https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=loggi [21:47:45] c=All&var-consumer_group=All [21:49:45] !log deploy1002 - removed scap-global-lock, unlocked scap [21:49:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:24] (03PS2) 10Effie Mouzeli: profile::templates::services_proxy: switch to ::1 when listen_ipv6 is true [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) [21:50:59] (03PS2) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services proxy on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) [21:52:02] !log mstyles@deploy1002 Started deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103) [21:52:09] STOP [21:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:11] T270103: Import commons mediainfo RDF dumps to hive - https://phabricator.wikimedia.org/T270103 [21:52:14] no DEPLOYMENTS [21:52:21] (03PS3) 10Effie Mouzeli: profile::templates::services_proxy: switch to ::1 when listen_ipv6 is true [puppet] - 10https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) [21:52:31] (03PS3) 10Effie Mouzeli: hieradata: enable ipv6 on envoy services proxy
on mwdebug1001 [puppet] - https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) [21:53:23] mutante: i don't think they're in this channel [21:53:31] that makes it worse [21:53:46] aren't you supposed to be here when deploying [21:53:49] c: isn't that against policy? [21:53:52] and the window is blocked [21:54:00] yeah [21:54:01] mutante: I'm pretty sure you are [21:54:07] this was the worst possible time ... even possible [21:54:09] like by the minute [21:54:11] Maybe rant on the task [21:54:16] unless they're using a completely different nick unrelated to their username [21:54:28] mutante: at least it was on the new host? [21:54:36] !log mstyles@deploy1002 Finished deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103) (duration: 02m 34s) [21:54:39] twentyafterfour: yea, i mean.. invountary test for us? [21:54:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:43] involuntary [21:54:53] maryum: ^ [21:55:01] I'm going to run the test on l10n [21:55:03] Found their nick on meta c [21:55:09] https://meta.wikimedia.org/wiki/User:MStyles_(WMF) [21:55:19] is there a problem going on with the deploy servers? [21:55:37] maryum: did you check the deploy calendar? [21:55:39] yes we are switching to new deployment servers [21:55:49] oh sorry! did not check the deploy calendar [21:56:13] maryum: that would probably be a very good thing to do [21:56:16] maryum: we are in the middle of switching, you were just the first person ever to test the new server [21:56:16] !log twentyafterfour@deploy1002 Started scap: (no justification provided) [21:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:56:23] maryum: the good part.. did it work? 
[21:56:25] It's https://wikitech.wikimedia.org/wiki/Deployments maryum [21:56:34] it did not work, I had to roll back [21:56:36] I will just wait [21:56:43] and make sure to check the calendar in the future [21:56:45] cannot delete non-empty directory: php-1.36.0-wmf.18 [21:57:01] this happens just once every 2 or 3 years [21:57:12] but the timing was such that ... [21:57:21] i had literally just unlocked the new server [21:57:24] !log running scap sync from the new server deply1002 [21:57:30] or scap would have stopped it [21:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:42] it was only possible for a timespan of like a minute [21:57:45] I'm happy to test again! [21:58:51] `scap sync-world` seems to be working ok so far [21:59:04] did it mess up permissions? [21:59:10] ok, good [21:59:42] maryum: what kind of errors did you see? [21:59:42] it didn't mess up permissions but I don't have permissions on php-1.36.0-wmf.18/cache/l10n [21:59:53] checking permissions [21:59:58] there was a timeout error [22:00:04] Reedy and sbassett: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210301T2200) [22:00:06] timeout error https://www.irccloud.com/pastebin/oDmFFing/ [22:00:14] SRE, serviceops, Patch-For-Review: Migrate onhost memcached to use a unix socket - https://phabricator.wikimedia.org/T273115 (jijiki) [22:00:18] SRE, Performance-Team, serviceops, Patch-For-Review, User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (jijiki) [22:00:21] SRE, serviceops, Patch-For-Review, User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (jijiki) [22:00:54] mutante: that ^ looks like apache isn't running on deploy1002? 
[22:01:10] 4.0K drwxr-xr-x 3 mwdeploy mwdeploy 4.0K Dec 8 22:09 php-1.36.0-wmf.18 [22:01:13] 4.0K drwxr-xr-x 16 mwdeploy mwdeploy 4.0K Feb 16 17:41 php-1.36.0-wmf.31 [22:01:16] 4.0K drwxr-xr-x 16 mwdeploy mwdeploy 4.0K Feb 23 17:55 php-1.36.0-wmf.32 [22:01:43] twentyafterfour: Active: active (running) since Tue 2021-02-02 14:08:26 UTC; 3 weeks 6 days ago [22:02:54] Failed to establish a new connection: [Errno 110] Connection timed out [22:03:12] i see where this is going i think [22:03:14] analytics VLAN [22:03:42] oh [22:03:45] I mean.. I scap pulled already from some mw* hosts [22:03:51] from the new deploy host [22:03:57] and this is stat* [22:04:39] (PS4) Effie Mouzeli: hieradata: enable ipv6 on envoy services proxy on mwdebug1001 [puppet] - https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) [22:04:46] yes, from stat1007 I can "curl deploy1001.eqiad.wmnet" but not "curl deploy1002.eqiad.wmnet" [22:04:48] mutante: the scap3 deployments work differently [22:04:49] and this isn't puppet [22:05:04] ferm rules? [22:05:11] no, ACLs [22:05:13] sigh [22:06:28] ACKNOWLEDGEMENT - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1002 job=burrow partition=2 prometheus=ops site=eqiad topic=rsyslog-notice cole_white The result of removing the kafka consumers from the legacy cluster. T234854 Investigating options to clear this up. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consu [22:06:28] rafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [22:06:45] in ferm it is allowed for http [22:08:10] (CR) Effie Mouzeli: [V: +1] "PCC OK: https://puppet-compiler.wmflabs.org/compiler1002/28313/mwdebug1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/667714 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli) [22:08:47] (CR) Effie Mouzeli: [V: +1] "PCC when enabled on (667714) mwdebug1001 https://puppet-compiler.wmflabs.org/compiler1002/28313/mwdebug1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/667713 (https://phabricator.wikimedia.org/T255568) (owner: Effie Mouzeli) [22:11:05] (CR) Jforrester: [C: -1] Separate Wikivoyage wordmark and icon (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T261033) (owner: Jdlrobson) [22:11:28] (PS1) Dzahn: add deploy1002/deploy2002 to scap firewall [homer/public] - https://gerrit.wikimedia.org/r/667718 (https://phabricator.wikimedia.org/T265963) [22:12:27] !log twentyafterfour@deploy1002 Finished scap: (no justification provided) (duration: 16m 24s) [22:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:47] ops-eqiad, DC-Ops, cloud-services-team (Kanban): Replace sata cables for cloudvirt1024 - https://phabricator.wikimedia.org/T275215 (Jclark-ctr) Open→Resolved Replaced Failed Hard drive. 
[22:18:11] (PS2) Dzahn: add deploy1002/deploy2002 to scap firewall [homer/public] - https://gerrit.wikimedia.org/r/667718 (https://phabricator.wikimedia.org/T265963) [22:19:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:20:28] (CR) Volans: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/667718 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [22:23:49] (CR) Dzahn: [C: +2] add deploy1002/deploy2002 to scap firewall [homer/public] - https://gerrit.wikimedia.org/r/667718 (https://phabricator.wikimedia.org/T265963) (owner: Dzahn) [22:24:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:26:01] (PS3) Jdlrobson: Separate Wikivoyage wordmark and icon [mediawiki-config] - https://gerrit.wikimedia.org/r/667703 (https://phabricator.wikimedia.org/T261033) [22:28:17] mutante: is the deployment server switch complete? [22:29:10] maryum: no, deploying firewall changes to hopefully fix it [22:30:47] SRE, ops-codfw, DC-Ops: (Need By: TBD) rack/setup/install backup2003 - https://phabricator.wikimedia.org/T274185 (Papaul) [22:36:37] maryum: deployed firewall change on routers to allow deploy1002 and now I could connect to it from stat1007 [22:37:02] mutante: thanks for the update, I can go ahead and use the deploy server now? [22:37:41] twentyafterfour: ok from your side? 
^ [22:38:28] maryum: try again what you did earlier, on deploy1002 [22:38:32] please [22:38:37] yes going ahead [22:38:42] SRE, ops-eqiad, User-fgiunchedi: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (Jclark-ctr) @fgiunchedi Sorry for delays i had ran into where i was missing the right torques tool to remove pci riser card. after i removed riser card i found the 10g ports are part of the main... [22:39:10] !log mstyles@deploy1002 Started deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103) [22:39:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:39:16] T270103: Import commons mediainfo RDF dumps to hive - https://phabricator.wikimedia.org/T270103 [22:41:14] !log mstyles@deploy1002 Finished deploy [wikimedia/discovery/analytics@ca2c5b5]: import commons ttl dag fix (T270103) (duration: 02m 04s) [22:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:28] SRE, ops-eqiad, User-fgiunchedi: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (Jclark-ctr) Open→Resolved [22:41:30] mutante: everything worked! [22:41:36] yay! [22:42:27] maryum: :) great! and you know what.. it was good that you deployed right then after all. because we would not have noticed there is an issue for you. because this was special just for hosts in analytics vlan and did not affect mediawiki [22:42:38] and now both are good [22:42:52] glad I was able to help! [22:43:50] cool :) switching deployment server is always a bit tricky, i remembered "tin" but it's just a thing every couple years so not really an everyday runbook [22:43:55] glad it's done [22:47:20] all clear? i've got a high-priority fix to deploy too, if so. [22:48:05] mholloway: yea, do it. 
we are not aware of any more issues [22:48:13] still be good to watch one [22:48:16] cool, thanks [22:48:24] (CR) Mholloway: [C: +2] Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS [extensions/WikimediaEvents] (wmf/1.36.0-wmf.32) - https://gerrit.wikimedia.org/r/667656 (owner: Mholloway) [22:48:27] just forget about deploy1001 [22:51:28] (PS1) Razzi: kafka: Disable alert for absolute max lag value and under-replicated partitions [puppet] - https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) [22:51:45] (PS2) Razzi: kafka: Disable alert for absolute max lag value and under-replicated partitions [puppet] - https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) [22:52:06] (CR) Razzi: "I'm not sure if the right way to disable alerts is to remove the thresholds, let me know!" [puppet] - https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) (owner: Razzi) [22:53:40] (Merged) jenkins-bot: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS [extensions/WikimediaEvents] (wmf/1.36.0-wmf.32) - https://gerrit.wikimedia.org/r/667656 (owner: Mholloway) [22:56:19] ok, just kicked off scap sync-file [22:57:04] !log mholloway-shell@deploy1002 Synchronized php-1.36.0-wmf.32/extensions/WikimediaEvents: Fix: Restore exporting wgWMESchemaEditAttemptStepSamplingRate to JS (duration: 00m 57s) [22:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:41] (PS7) Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [23:00:13] !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@61e7533]: ores_bulk_ingest: Handle unexpected api response [23:00:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:47] !log ebernhardson@deploy1002 Finished 
deploy [wikimedia/discovery/analytics@61e7533]: ores_bulk_ingest: Handle unexpected api response (duration: 01m 33s) [23:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:27] (PS1) Dzahn: trafficserver: add director for gitlab to gitlab1001 [puppet] - https://gerrit.wikimedia.org/r/667731 (https://phabricator.wikimedia.org/T276144) [23:39:49] (PS1) Dzahn: gitlab: open port 80 for traffic from caching servers [puppet] - https://gerrit.wikimedia.org/r/667733 (https://phabricator.wikimedia.org/T276144) [23:43:13] (CR) Ottomata: kafka: Disable alert for absolute max lag value and under-replicated partitions (2 comments) [puppet] - https://gerrit.wikimedia.org/r/667724 (https://phabricator.wikimedia.org/T273702) (owner: Razzi) [23:45:51] SRE, DNS, Traffic, serviceops, and 3 others: DNS for GitLab - https://phabricator.wikimedia.org/T276170 (Dzahn) [23:45:58] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: forward external traffic to gitlab VMs (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) [23:46:46] SRE, Traffic, GitLab (Initialization), Patch-For-Review, and 2 others: forward external traffic to gitlab VMs (was: Port map of how Gitlab is accessed) - https://phabricator.wikimedia.org/T276144 (Dzahn) This is partially blocked on T276170 .