[00:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T0000).
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:02:53] (CR) Ladsgroup: "I think it might have bugs: https://puppet-compiler.wmflabs.org/compiler1001/27421/" [puppet] - https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:04:49] (CR) Ladsgroup: [C: +1] monitoring::host: move hostgroup_default to params, hiera->lookup [puppet] - https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:08:11] (CR) Legoktm: [WIP] docker_registry_ha: Add a script to generate a static HTML homepage (14 comments) [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[00:08:22] (PS5) Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696)
[00:08:53] (CR) jerkins-bot: [V: -1] docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: Legoktm)
[00:10:07] (PS6) Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696)
[00:22:47] (PS1) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[00:29:24] (CR) Bstorm: "This is part review, part async cry for help. 😊 I am working my way through how to set up these ports for a new wikireplica setup to go th" [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[00:31:17] (PS2) Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476)
[00:32:25] (PS3) Legoktm: visualdiff: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655183 (https://phabricator.wikimedia.org/T266479)
[00:33:40] (CR) Legoktm: [C: +2] visualdiff: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655183 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[00:35:39] (PS3) Legoktm: docker_pkg: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655184 (https://phabricator.wikimedia.org/T266479)
[00:40:14] (CR) Legoktm: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/27422/" [puppet] - https://gerrit.wikimedia.org/r/655184 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[00:43:40] (PS3) Legoktm: zuul: Switch require_package -> ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479)
[00:46:37] (CR) Legoktm: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/27423/" [puppet] - https://gerrit.wikimedia.org/r/655182 (https://phabricator.wikimedia.org/T266479) (owner: Legoktm)
[00:50:10] (CR) Legoktm: [C: +1] icinga::elastic: require_package->ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655519 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:50:44] (CR) Legoktm: [C: +1] bird: require_package->ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655522 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:51:18] (CR) Legoktm: [C: +1] docker: require_package->ensure_packages [puppet] - https://gerrit.wikimedia.org/r/655520 (https://phabricator.wikimedia.org/T266479) (owner: Dzahn)
[00:52:57] SRE: Redundant bootloaders for software RAID - https://phabricator.wikimedia.org/T215183 (crusnov) >>! In T215183#6719053, @Volans wrote: > I agree we should audit it. I think that with redfish API it should be doable, adding @crusnov as they've worked on it last Q. Indeed. I don't know if there's a direct...
[00:55:17] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2812503800 and 248 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:31] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1497966448 and 211 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:39] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4247088912 and 337 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:11] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4153152408 and 363 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:11] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5082479424 and 404 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:11] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 460571224 and 198 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:56:25] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1836119736 and 271 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:57:32] (CR) Legoktm: "I was going to run this through PCC but it seems unused?" [puppet] - https://gerrit.wikimedia.org/r/655511 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[00:57:51] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 748576 and 280 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:05] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 45280 and 293 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:37] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 356824 and 326 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:58:53] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 58880 and 340 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:33] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 191760 and 380 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:59:33] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 34928 and 381 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:00:39] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 173856 and 447 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:25] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 248955344 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:11:35] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1558872632 and 99 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:13] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3925336072 and 242 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:17] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3806034016 and 230 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:17] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2185848952 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:12:48] (CR) CDanis: "PCC with current hiera is a no-op, as expected https://puppet-compiler.wmflabs.org/compiler1002/27415/" [puppet] - https://gerrit.wikimedia.org/r/655485 (owner: CDanis)
[01:13:01] PROBLEM - Postgres Replication Lag on maps2009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 4073032472 and 224 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:13:07] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1011738032 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:16:27] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 24632 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:16:29] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 52392 and 155 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:16:39] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 395720 and 165 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:15] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 12392 and 203 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:17] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 57752 and 205 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:17:17] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 29976 and 205 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:18:01] RECOVERY - Postgres Replication Lag on maps2009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56304 and 249 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:26:32] (CR) Legoktm: [C: -1] mailman3: Add parts for Postorius (web interface) (3 comments) [puppet] - https://gerrit.wikimedia.org/r/655203 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[01:47:04] (CR) Legoktm: deployment::rsync: replace cron with systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655172 (https://phabricator.wikimedia.org/T265138) (owner: Dzahn)
[02:07:10] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.26 [core] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655537
[02:07:31] (PS2) Jforrester: Branch commit for wmf/1.36.0-wmf.26 [core] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655537 (https://phabricator.wikimedia.org/T267419) (owner: TrainBranchBot)
[02:09:55] PROBLEM - Disk space on maps1004 is CRITICAL: DISK CRITICAL - free space: /srv 56587 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[03:33:08] RECOVERY - Disk space on maps1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=maps1004&var-datasource=eqiad+prometheus/ops
[05:21:13] o/
[05:59:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1075', diff saved to https://phabricator.wikimedia.org/P13722 and previous config saved to /var/cache/conftool/dbconfig/20210112-055953-marostegui.json
[05:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:17] PROBLEM - Check systemd state on dbprov2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:48] (PS1) Marostegui: db1079: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/655546
[06:05:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1079', diff saved to https://phabricator.wikimedia.org/P13723 and previous config saved to /var/cache/conftool/dbconfig/20210112-060557-marostegui.json
[06:05:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:06:27] (CR) Marostegui: [C: +2] db1079: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/655546 (owner: Marostegui)
[06:08:01] PROBLEM - SSH on logstash2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:15:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13724 and previous config saved to /var/cache/conftool/dbconfig/20210112-061541-root.json
[06:15:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:09] !log Stop mysql on db1079 to clone db1155:3317 T268742
[06:16:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:16:11] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742
[06:24:57] Going to prune some old versions on the deployment server
[06:30:07] !log jhuneidi@deploy1001 Pruned MediaWiki: 1.36.0-wmf.21 (duration: 03m 21s)
[06:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:30:27] PROBLEM - Check systemd state on logstash2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:30:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13725 and previous config saved to /var/cache/conftool/dbconfig/20210112-063044-root.json
[06:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:13] PROBLEM - ElasticSearch health check for shards on 9200 on logstash2006 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7f21fe6834e0: Failed to establish a new connection: [Errno 111] Connection
[06:31:13] ://wikitech.wikimedia.org/wiki/Search%23Administration
[06:31:23] RECOVERY - SSH on logstash2006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:33:47] RECOVERY - Check systemd state on logstash2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:35] RECOVERY - ElasticSearch health check for shards on 9200 on logstash2006 is OK: OK - elasticsearch status production-logstash-codfw: active_shards: 862, status: green, task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, active_primary_shards: 456, timed_out: False, initializing_shards: 0, cluster_name: production-logstash-codfw, number_of_in_flight_fetch: 0, relocating_shards: 0, number_of_nodes: 6, active_shards_p
[06:34:35] 100.0, unassigned_shards: 0, number_of_data_nodes: 3, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[06:42:01] RECOVERY - Check systemd state on dbprov2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:45:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13726 and previous config saved to /var/cache/conftool/dbconfig/20210112-064548-root.json
[06:45:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:46:35] (CR) Ayounsi: [C: +2] Only configure relevant vlans on a device [homer/public] - https://gerrit.wikimedia.org/r/655445 (owner: Ayounsi)
[06:47:04] (Merged) jenkins-bot: Only configure relevant vlans on a device [homer/public] - https://gerrit.wikimedia.org/r/655445 (owner: Ayounsi)
[06:53:03] !log push CR655445, only configure vlans relevant to a switch
[06:53:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:36] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1075 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13727 and previous config saved to /var/cache/conftool/dbconfig/20210112-070051-root.json
[07:00:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:09:15] (CR) Elukey: "From my point of view the cookbook does it job, really nice work Razzi!" (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[07:15:22] RECOVERY - Check systemd state on stat1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1112', diff saved to https://phabricator.wikimedia.org/P13728 and previous config saved to /var/cache/conftool/dbconfig/20210112-080419-marostegui.json
[08:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:55] (PS2) Muehlenhoff: Add bast3005 [puppet] - https://gerrit.wikimedia.org/r/655450 (https://phabricator.wikimedia.org/T257324)
[08:15:54] (CR) Lars Wirzenius: [C: +2] Branch commit for wmf/1.36.0-wmf.26 [core] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655537 (https://phabricator.wikimedia.org/T267419) (owner: TrainBranchBot)
[08:20:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P13729 and previous config saved to /var/cache/conftool/dbconfig/20210112-082023-root.json
[08:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:42] !log Deploy schema change on s3 eqiad master - T270187
[08:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:45] T270187: Schema change for renaming user_properties_property index - https://phabricator.wikimedia.org/T270187
[08:28:54] (CR) Muehlenhoff: [C: +2] Add bast3005 [puppet] - https://gerrit.wikimedia.org/r/655450 (https://phabricator.wikimedia.org/T257324) (owner: Muehlenhoff)
[08:30:26] !log installing remaining curl security updates on stretch
[08:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:32:37] (PS1) Elukey: cumin: add hadoop-related aliases [puppet] - https://gerrit.wikimedia.org/r/655617
[08:34:35] (CR) Muehlenhoff: cumin: add hadoop-related aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655617 (owner: Elukey)
[08:35:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P13730 and previous config saved to /var/cache/conftool/dbconfig/20210112-083526-root.json
[08:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:04] (CR) Elukey: Add cookbook for rebooting druid nodes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: Razzi)
[08:39:20] (CR) Elukey: cumin: add hadoop-related aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655617 (owner: Elukey)
[08:39:41] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.26 [core] (wmf/1.36.0-wmf.26) - https://gerrit.wikimedia.org/r/655537 (https://phabricator.wikimedia.org/T267419) (owner: TrainBranchBot)
[08:40:32] !log Sanitize bclwiktionary diqwiktionary niawiki niawiktionary diqwiktionary on db1124 db2094 db11154 T270280 T270276 T270414 T270410 T271261
[08:40:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:40] T270414: Prepare and check storage layer for niawiki - https://phabricator.wikimedia.org/T270414
[08:40:40] T271261: Prepare and check storage layer for trwikivoyage - https://phabricator.wikimedia.org/T271261
[08:40:40] T270276: Prepare and check storage layer for diqwiktionary - https://phabricator.wikimedia.org/T270276
[08:40:41] T270280: Prepare and check storage layer for bclwiktionary - https://phabricator.wikimedia.org/T270280
[08:40:41] T270410: Prepare and check storage layer for niawiktionary - https://phabricator.wikimedia.org/T270410
[08:45:55] jouncebot: now
[08:45:55] No deployments scheduled for the next 3 hour(s) and 14 minute(s)
[08:45:57] jouncebot: next
[08:45:57] In 3 hour(s) and 14 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1200)
[08:46:11] (CR) Reedy: [C: +2] Add a monolog channel for StopForumSpam [mediawiki-config] - https://gerrit.wikimedia.org/r/655512 (https://phabricator.wikimedia.org/T271755) (owner: SBassett)
[08:46:59] (Merged) jenkins-bot: Add a monolog channel for StopForumSpam [mediawiki-config] - https://gerrit.wikimedia.org/r/655512 (https://phabricator.wikimedia.org/T271755) (owner: SBassett)
[08:47:08] !log 1.36.0-wmf.26 was branched at e6ad9ab7713ee33c30cd7c17762737870dc8fd08 for T267419
[08:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:47:11] T267419: 1.36.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T267419
[08:48:08] (CR) Filippo Giunchedi: [V: +1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27424/console" [puppet] - https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:49:04] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings-labs.php: T271755 (duration: 00m 57s)
[08:49:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:07] T271755: Add a monolog channel for StopForumSpam (beta cluster) - https://phabricator.wikimedia.org/T271755
[08:49:46] (CR) David Caro: [C: +2] wmcs.sge.prometheus Retry getting the job count [puppet] - https://gerrit.wikimedia.org/r/655384 (https://phabricator.wikimedia.org/T271686) (owner: David Caro)
[08:49:49] (CR) David Caro: [C: +2] wmcs.sge.prometheus: blacked and isorted [puppet] - https://gerrit.wikimedia.org/r/655385 (owner: David Caro)
[08:50:28] (CR) Filippo Giunchedi: [V: +1 C: -1] "See inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655516 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[08:50:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P13731 and previous config saved to /var/cache/conftool/dbconfig/20210112-085030-root.json
[08:50:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1112 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P13732 and previous config saved to /var/cache/conftool/dbconfig/20210112-090533-root.json
[09:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:48] (CR) Muehlenhoff: cumin: add hadoop-related aliases (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655617 (owner: Elukey)
[09:07:10] (PS1) Lars Wirzenius: testwikis wikis to 1.36.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/655619
[09:07:12] (CR) Lars Wirzenius: [C: +2] testwikis wikis to 1.36.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/655619 (owner: Lars Wirzenius)
[09:08:03] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/655619 (owner: Lars Wirzenius)
[09:10:00] (PS2) Elukey: cumin: add hadoop-related aliases [puppet] - https://gerrit.wikimedia.org/r/655617
[09:10:30] (CR) Giuseppe Lavagetto: Switch the base image to buster from stretch. (3 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/615683 (owner: Giuseppe Lavagetto)
[09:10:41] (PS3) Giuseppe Lavagetto: Switch the base image to buster from stretch. [docker-images/production-images] - https://gerrit.wikimedia.org/r/615683
[09:16:41] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/655617 (owner: Elukey)
[09:17:03] (CR) Giuseppe Lavagetto: [C: +2] Fix exception raised by build process [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655409 (https://phabricator.wikimedia.org/T226728) (owner: Giuseppe Lavagetto)
[09:19:17] (Merged) jenkins-bot: Fix exception raised by build process [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655409 (https://phabricator.wikimedia.org/T226728) (owner: Giuseppe Lavagetto)
[09:20:55] (CR) Elukey: [C: +2] cumin: add hadoop-related aliases [puppet] - https://gerrit.wikimedia.org/r/655617 (owner: Elukey)
[09:22:17] (CR) Jgiannelos: [C: +1] tegola: Add docker image. [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: Hnowlan)
[09:22:19] (PS1) Volans: tests: fix deprecated pytest argument [software/homer] - https://gerrit.wikimedia.org/r/655620
[09:26:16] (CR) Giuseppe Lavagetto: [C: -1] "Please don't use FROM scratch for base images." (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: Hnowlan)
[09:27:47] (PS2) Giuseppe Lavagetto: Fix UX of the argument parser [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655410 (https://phabricator.wikimedia.org/T253131)
[09:27:49] (PS2) Giuseppe Lavagetto: Always refresh the base images [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398)
[09:27:51] (PS2) Giuseppe Lavagetto: Add ability to separate the apt and the general http proxy [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655412 (https://phabricator.wikimedia.org/T183545)
[09:28:22] !log liw@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.26
[09:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:58] (CR) Giuseppe Lavagetto: [C: +2] Fix UX of the argument parser [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655410 (https://phabricator.wikimedia.org/T253131) (owner: Giuseppe Lavagetto)
[09:31:52] (Merged) jenkins-bot: Fix UX of the argument parser [docker-images/docker-pkg] - https://gerrit.wikimedia.org/r/655410 (https://phabricator.wikimedia.org/T253131) (owner: Giuseppe Lavagetto)
[09:52:24] (CR) Arturo Borrero Gonzalez: "Some comments inline. I don't have a lot of experience with this LVS setup, so other should comment on the format, config, etc." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) (owner: Bstorm)
[09:55:14] (CR) Jbond: redis::slave: hiera->lookup, add data types (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655518 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[09:55:48] (CR) Giuseppe Lavagetto: [C: +1] Bump all helm_scaffold_versions to 0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/643057 (owner: Alexandros Kosiaris)
[09:56:15] (CR) Arturo Borrero Gonzalez: [C: +1] "this LGTM. I don't have a lot of experience with writing puppet tests like you did here, so I'm adding J.Bond as reviewer as well." [puppet] - https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: David Caro)
[09:56:58] (CR) Jbond: "lgtm but as mentioned the role is not used perhaps we can just delete it?" [puppet] - https://gerrit.wikimedia.org/r/655511 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[09:59:58] (CR) Jbond: [C: +1] "see inline for typo lg otherwise" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655508 (https://phabricator.wikimedia.org/T209953) (owner: Dzahn)
[10:00:06] (PS4) Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer [puppet] - https://gerrit.wikimedia.org/r/655077
[10:00:38] (CR) Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655077 (owner: Muehlenhoff)
[10:05:57] (CR) Muehlenhoff: [C: +1] "Looks good (but needs meeting approval for the new rules/group)" [puppet] - https://gerrit.wikimedia.org/r/649077 (https://phabricator.wikimedia.org/T271718) (owner: ArielGlenn)
[10:12:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1138', diff saved to https://phabricator.wikimedia.org/P13736 and previous config saved to /var/cache/conftool/dbconfig/20210112-101211-marostegui.json
[10:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:19] !log Restart mysql on db1138 to pick up new config T271427 T271106
[10:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:13:23] T271427: Switchover s4 (commonswiki) from db1081 to db1138 - https://phabricator.wikimedia.org/T271427
[10:13:23] T271106: Enable report_host on candidate masters - https://phabricator.wikimedia.org/T271106
[10:13:41] (CR) Vgutierrez: "we needed to set TTL consistently to 0 after seeing issues documented on https://phabricator.wikimedia.org/T219414 and https://community.l" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/655476 (owner: Andrew Bogott)
[10:15:22] !log liw@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.26 (duration: 67m 18s)
[10:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:22] (PS1) Muehlenhoff: Add a new option to enable mail output for a systemd timer [puppet] - https://gerrit.wikimedia.org/r/655628
[10:26:07] !log installing systemd bugfix update from Buster 10.7 point release
[10:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:32] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: T271058
[10:28:32] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cloudnet1004.eqiad.wmnet with reason: T271058
[10:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:36] T271058: cloudnet1004: network hiccup because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058
[10:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:29] (PS1) Elukey: Add cookbook to upgrade hadoop client nodes to Bigtop [cookbooks] - https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919)
[10:43:46] SRE, WMF-NDA-Requests: Request from WMDE employee Amrutha - https://phabricator.wikimedia.org/T271725 (amy_rc)
[10:48:54] SRE, WMF-NDA-Requests: Request from WMDE employee Amrutha - https://phabricator.wikimedia.org/T271725 (amy_rc) @Lea_WMDE : Could you confirm that you are my manager and approval for this request? Regards Amrutha
[10:51:50] !log swift eqiad-prod: add weight to ms-be106[0-3] - T268435
[10:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:53] T268435: Add ms-be106[0-3] to swift - https://phabricator.wikimedia.org/T268435
[11:01:33] (PS3) Hnowlan: tegola: Add docker image. [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170)
[11:01:49] (CR) Hnowlan: tegola: Add docker image. (2 comments) [docker-images/production-images] - https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: Hnowlan)
[11:07:27] (PS1) Volans: tests: fix typo in mocked objects [software/homer] - https://gerrit.wikimedia.org/r/655632
[11:07:30] (PS1) Volans: tests: add coverage for netbox device data inventory [software/homer] - https://gerrit.wikimedia.org/r/655633
[11:07:43] (CR) Volans: [C: +2] tests: fix deprecated pytest argument [software/homer] - https://gerrit.wikimedia.org/r/655620 (owner: Volans)
[11:10:58] (Merged) jenkins-bot: tests: fix deprecated pytest argument [software/homer] - https://gerrit.wikimedia.org/r/655620 (owner: Volans)
[11:11:24] jouncebot: next
[11:11:24] In 0 hour(s) and 48 minute(s): European mid-day backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1200)
[11:18:36] (CR) Ayounsi: [C: +1] "Saw it live." [software/homer] - https://gerrit.wikimedia.org/r/655632 (owner: Volans)
[11:20:23] (PS1) Elukey: Remove analytics-tool1004 from puppet (decommed node) [puppet] - https://gerrit.wikimedia.org/r/655634 (https://phabricator.wikimedia.org/T268219)
[11:22:03] SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (MoritzMuehlenhoff)
[11:22:11] (CR) Elukey: [C: +2] Remove analytics-tool1004 from puppet (decommed node) [puppet] - https://gerrit.wikimedia.org/r/655634 (https://phabricator.wikimedia.org/T268219) (owner: Elukey)
[11:22:13] (CR) Volans: [C: +2] tests: fix typo in mocked objects [software/homer] - https://gerrit.wikimedia.org/r/655632 (owner: Volans)
[11:22:40] !log installing edk2 security updates
[11:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:33] (Merged) jenkins-bot: tests: fix typo in mocked objects [software/homer] - https://gerrit.wikimedia.org/r/655632 (owner: Volans)
[11:26:30] (PS1) Filippo Giunchedi: WIP: add interface::rps to swift::storage [puppet] - https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415)
[11:27:40] SRE, Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (ema) p:Medium→High Lowering the number of Lua states on cp3050 did [[ https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comparison?viewPanel=90&orgId=1&var-site=esams%20prometheus%2Fops&var-...
[11:28:21] 10SRE, 10Performance-Team, 10Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10ema) [11:33:40] (03PS2) 10Filippo Giunchedi: WIP: add interface::rps to swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) [11:34:00] (03CR) 10Ayounsi: [C: 03+1] tests: add coverage for netbox device data inventory [software/homer] - 10https://gerrit.wikimedia.org/r/655633 (owner: 10Volans) [11:35:05] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27426/console" [puppet] - 10https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [11:36:40] (03CR) 10Daimona Eaytoy: "Scheduled for today so I can guarantee my presence https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=1893115&oldid=1893078" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [11:36:42] (03PS3) 10Filippo Giunchedi: WIP: add interface::rps to swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) [11:37:31] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27427/console" [puppet] - 10https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [11:40:00] (03CR) 10Giuseppe Lavagetto: tegola: Add docker image. 
(031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [11:40:32] (03PS4) 10Filippo Giunchedi: role: add interface::rps to swift::storage [puppet] - 10https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) [11:41:35] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) @RobH Hi! Once `bios/drac/serial setup/testing` I can take care of the os install/dhcp/etc.. config if you want :) [11:43:16] (03PS1) 10ZPapierski: Fix /sparql rewrite and alias rules [puppet] - 10https://gerrit.wikimedia.org/r/655639 (https://phabricator.wikimedia.org/T267825) [11:47:15] _joe_ TIL tegola :D [11:47:36] <_joe_> yeah not sure others get how funny that is :D [11:48:31] brilliant [11:50:39] (03CR) 10Volans: [C: 03+2] tests: add coverage for netbox device data inventory [software/homer] - 10https://gerrit.wikimedia.org/r/655633 (owner: 10Volans) [11:54:50] (03Merged) 10jenkins-bot: tests: add coverage for netbox device data inventory [software/homer] - 10https://gerrit.wikimedia.org/r/655633 (owner: 10Volans) [11:57:43] (03PS1) 10KartikMistry: Update cxserver to 2021-01-12-095820-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/655642 (https://phabricator.wikimedia.org/T234220) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for European mid-day backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1200). [12:00:04] dcausse and Daimona: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [12:00:18] the lovely time of day is here again :) [12:00:24] I can deploy today! 
[12:01:30] o/ [12:01:52] Daimona: dunno which train James_F meant in his review (prob. some that's already deployed judging by the date), and I would really appreciate if he could self-remove his +1 so I don't have to override :) [12:02:28] dcausse: you here too? :) [12:02:59] 10SRE: Integrate Buster 10.7 point update - https://phabricator.wikimedia.org/T269558 (10MoritzMuehlenhoff) [12:04:19] (03CR) 10Volans: [C: 04-1] "Thanks for writing a new cookbook Razzi! Few comments inline, mostly nits, one main suggestion to move the functionality within the class " (0312 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [12:06:00] Urbanecm: it was last week's train, but I also would appreciate James to be here. Not sure about his timezone, so I just scheduled the patch for this window, but I can wait [12:06:37] ("wait" includes waiting for the next window if necessary) [12:07:53] Daimona: I think he lives in the US, so he's more likely to be around during morning B&C window [12:08:35] Sure [12:08:47] !log draining ganeti3001 for eventual reboot [12:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:45] anything else I can do for you Daimona ? [12:10:22] I'd say no, but thank you :-) [12:10:42] I'll reschedule it for tonight's window [12:10:48] great! [12:12:10] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [12:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:05] (03CR) 10Volans: "Nice! couple of nits and couple of questions inline..." 
(035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [12:19:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [12:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:28] (03PS2) 10Muehlenhoff: Make check-cumin-aliases always return 0 [puppet] - 10https://gerrit.wikimedia.org/r/645311 [12:26:48] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch the base image to buster from stretch. (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/615683 (owner: 10Giuseppe Lavagetto) [12:33:10] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) Reminder: do not add IPV6 entries to these hosts (T267043#6692741) [12:44:09] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10AlexisJazz) [12:46:09] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. 
- https://phabricator.wikimedia.org/T271808 (10AlexisJazz) [12:48:26] (03PS1) 10Arturo Borrero Gonzalez: cloud: neutron: conntrackd: fix gw address in filter [puppet] - 10https://gerrit.wikimedia.org/r/655650 [12:54:19] (03CR) 10Alexandros Kosiaris: [C: 03+2] "I 'll merge to get us back to the previous status quo and unbreak kubernetes node imaging, but hoping we can get rid of this module altoge" [puppet] - 10https://gerrit.wikimedia.org/r/655475 (https://phabricator.wikimedia.org/T271099) (owner: 10Alexandros Kosiaris) [12:58:17] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:38] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10RhinosF1) Certificates last 3 months so probably similar issues [13:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1300) [13:03:23] (03CR) 10Alexandros Kosiaris: "Ah seems like we uploaded the same patch and I just noticed this one. I 've just merged https://gerrit.wikimedia.org/r/c/operations/puppet" [puppet] - 10https://gerrit.wikimedia.org/r/655465 (owner: 10Jbond) [13:03:29] (03Abandoned) 10Alexandros Kosiaris: lvm: Always force vgremoval [puppet] - 10https://gerrit.wikimedia.org/r/655465 (owner: 10Jbond) [13:04:25] (03PS4) 10Hnowlan: tegola: Add docker image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) [13:05:05] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Many thanks for the +1s, merging!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/643057 (owner: 10Alexandros Kosiaris) [13:06:36] (03Merged) 10jenkins-bot: Bump all helm_scaffold_versions to 0.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/643057 (owner: 10Alexandros Kosiaris) [13:07:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/641969 (owner: 10Alexandros Kosiaris) [13:08:07] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [software/service-checker] - 10https://gerrit.wikimedia.org/r/641789 (owner: 10Alexandros Kosiaris) [13:10:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [software/service-checker] - 10https://gerrit.wikimedia.org/r/641790 (https://phabricator.wikimedia.org/T259686) (owner: 10Alexandros Kosiaris) [13:11:29] (03Merged) 10jenkins-bot: tox: Drop py27, py34, add py37 [software/service-checker] - 10https://gerrit.wikimedia.org/r/641969 (owner: 10Alexandros Kosiaris) [13:11:56] (03Merged) 10jenkins-bot: Remove old trusty comment [software/service-checker] - 10https://gerrit.wikimedia.org/r/641789 (owner: 10Alexandros Kosiaris) [13:13:39] oops I missed the deploy window :/ [13:14:21] (03Merged) 10jenkins-bot: Allow skipping cert verification [software/service-checker] - 10https://gerrit.wikimedia.org/r/641790 (https://phabricator.wikimedia.org/T259686) (owner: 10Alexandros Kosiaris) [13:18:37] (03CR) 10Elukey: Add cookbook to upgrade hadoop client nodes to Bigtop (035 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [13:19:19] (03CR) 10Alexandros Kosiaris: "LGTM, minor naming nitpick though" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: 10Giuseppe Lavagetto) [13:19:26] (03CR) 10Alexandros Kosiaris: [C: 03+1] Always refresh the base images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655411 
(https://phabricator.wikimedia.org/T219398) (owner: 10Giuseppe Lavagetto) [13:19:39] (03CR) 10DCausse: [C: 03+1] Fix /sparql rewrite and alias rules [puppet] - 10https://gerrit.wikimedia.org/r/655639 (https://phabricator.wikimedia.org/T267825) (owner: 10ZPapierski) [13:26:17] (03PS1) 10Alexandros Kosiaris: Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 [13:26:44] (03PS2) 10Elukey: Add cookbook to upgrade hadoop client nodes to Bigtop [cookbooks] - 10https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919) [13:27:24] (03CR) 10Elukey: Add cookbook to upgrade hadoop client nodes to Bigtop (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [13:28:15] (03CR) 10Alexandros Kosiaris: "recheck" [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 (owner: 10Alexandros Kosiaris) [13:32:49] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10Vgutierrez) `root@deployment-cache-upload06:/etc/acmecerts/unified/live# openssl x509 -dates -noout -in rsa-2048.crt notBefore=Jan... [13:33:12] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. - https://phabricator.wikimedia.org/T271808 (10RhinosF1) 05Open→03Resolved a:03Vgutierrez Fixed per discussion on #wikimedia-traffic at least for another 90 days. [13:33:19] !log draining ganeti3002 for eventual reboot [13:33:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:45] 10SRE, 10Traffic, 10Beta-Cluster-reproducible, 10HTTPS: The certificate for upload.beta.wmflabs.org expired on January 12, 2021. 
- https://phabricator.wikimedia.org/T271808 (10RhinosF1) But yeah I guess this should be fixed/monitored better so it doesn't need manual reload. [13:34:06] (03PS2) 10Alexandros Kosiaris: Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 [13:39:03] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [13:39:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:26] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [13:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:19] (03PS1) 10Ottomata: Finalize migration of UniversalLanguageSelector to event platform [puppet] - 10https://gerrit.wikimedia.org/r/655658 (https://phabricator.wikimedia.org/T267352) [13:52:42] (03CR) 10Ottomata: [C: 03+2] Finalize migration of UniversalLanguageSelector to event platform [puppet] - 10https://gerrit.wikimedia.org/r/655658 (https://phabricator.wikimedia.org/T267352) (owner: 10Ottomata) [13:53:28] !log failover ganeti master in esams to ganeti3002 [13:53:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:00] !log draining ganeti3003 for eventual reboot [13:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:22] PROBLEM - ganeti-wconfd running on ganeti3003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:00:04] liw and longma: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Mediawiki train - European+American Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1400). 
[14:00:05] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:06] (03PS1) 10Lars Wirzenius: group0 wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655664 [14:03:08] (03CR) 10Lars Wirzenius: [C: 03+2] group0 wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655664 (owner: 10Lars Wirzenius) [14:04:18] (03Merged) 10jenkins-bot: group0 wikis to 1.36.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655664 (owner: 10Lars Wirzenius) [14:05:40] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.26 [14:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:44] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:09] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [14:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:08] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [14:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:39] (03CR) 10JMeybohm: [C: 04-1] "> Patch Set 14:" (0311 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [14:28:36] (03CR) 10Jforrester: [C: 03+1] "Train is out." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/647116 (https://phabricator.wikimedia.org/T269712) (owner: 10Jforrester) [14:34:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:36:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:39:45] (03PS1) 10Jbond: .gitignore: use wild card for swap files [puppet] - 10https://gerrit.wikimedia.org/r/655686 [14:40:54] (03CR) 10Jbond: [C: 03+2] .gitignore: use wild card for swap files [puppet] - 10https://gerrit.wikimedia.org/r/655686 (owner: 10Jbond) [14:47:59] (03CR) 10Jbond: [C: 04-1] "See inline, feel free to ping me on IRC (jbond42) if anything is unclear" (0312 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654419 (https://phabricator.wikimedia.org/T268877) (owner: 10David Caro) [14:49:05] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/655465 (owner: 10Jbond) [14:53:29] (03CR) 10Jbond: "lg however we should still exit not 0 on failure" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff) [14:55:17] (03PS2) 10Jbond: Add a new option to enable mail output for a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655628 (owner: 10Muehlenhoff) [14:56:02] (03CR) 10Jbond: [C: 03+1] "LGTM, added a dependency on the other change" [puppet] - 10https://gerrit.wikimedia.org/r/655628 (owner: 10Muehlenhoff) [14:57:11] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:03:00] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for the quick fixes!" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/655630 (https://phabricator.wikimedia.org/T269919) (owner: 10Elukey) [15:18:55] PROBLEM - Check systemd state on dbprov2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:20:23] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [15:20:25] 9unmerged changes was me, they are merged now [15:20:50] jynus: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter, FYI ^^^ [15:21:14] (03CR) 10Muehlenhoff: "> Patch Set 4:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff) [15:21:41] volans, didn't we add a reset-failed on cron? [15:21:58] or maybe it was for another service [15:22:08] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on ms-be2031.codfw.wmnet with reason: test unattended reboot [15:22:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2031.codfw.wmnet with reason: test unattended reboot [15:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:40] jynus: I don't recall if was this one or another [15:25:06] nah, it was \*.scope [15:25:48] I am going to poke around for root causes, it doesn't make sense only dbprov2001 fails [15:26:53] e.g. 
underlying disk errors [15:27:22] q [15:27:33] sorry, wrong window [15:30:58] (03PS3) 10Bstorm: wikireplicas: set up LVS for multiinstance wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) [15:32:54] (03CR) 10Volans: "One main question about deletion, LGTM otherwise." (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [15:33:29] I've created T271821 in case someone want to follow my research [15:33:29] T271821: export_smart_data_dump.service failed on dbprov2001 because of a timeout in the raid facter - https://phabricator.wikimedia.org/T271821 [15:35:53] RECOVERY - HP RAID on ms-be2055 is OK: OK: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 3I:3:1, 3I:3:2, 3I:3:3, 3I:3:4, 4I:5:1, 4I:5:2 - Controller: OK - Battery/Capacitor: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:36:15] (03PS3) 10Alexandros Kosiaris: Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 [15:36:51] 10SRE, 10ops-codfw, 10SRE-swift-storage: Degraded RAID on ms-be2055 - https://phabricator.wikimedia.org/T271055 (10fgiunchedi) 05Open→03Resolved Disk is rebuilding [15:37:07] (03CR) 10jerkins-bot: [V: 04-1] Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 (owner: 10Alexandros Kosiaris) [15:37:37] 10SRE, 10MW-on-K8s, 10observability, 10serviceops: Add support for scraping php applications to the kubernetes prometheus scraper - https://phabricator.wikimedia.org/T271822 (10Joe) [15:39:04] (03CR) 10CDanis: [C: 03+1] role: add interface::rps to swift::storage (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655636 (https://phabricator.wikimedia.org/T271415) (owner: 10Filippo Giunchedi) [15:39:14] (03PS4) 10Bstorm: wikireplicas: set 
up LVS for multiinstance wikireplicas [puppet] - 10https://gerrit.wikimedia.org/r/655533 (https://phabricator.wikimedia.org/T271476) [15:40:30] (03PS4) 10Alexandros Kosiaris: Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 [15:40:55] (03CR) 10Ema: [C: 03+2] Make query.wikidata.org point to microsite backend instead (for GUI) [puppet] - 10https://gerrit.wikimedia.org/r/655051 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [15:41:44] * addshore watches [15:45:27] RECOVERY - Check systemd state on dbprov2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:56] (03PS2) 10CDanis: Switch to pytest and use tox-wikimedia [software/klaxon] - 10https://gerrit.wikimedia.org/r/651846 (owner: 10Legoktm) [15:47:25] (03CR) 10jerkins-bot: [V: 04-1] Switch to pytest and use tox-wikimedia [software/klaxon] - 10https://gerrit.wikimedia.org/r/651846 (owner: 10Legoktm) [15:49:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 (owner: 10Alexandros Kosiaris) [15:51:17] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 111.00, 102.86, 66.65 https://wikitech.wikimedia.org/wiki/Swift [15:51:51] expected ^ [15:52:18] !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on ms-be2055.codfw.wmnet with reason: reboot [15:52:19] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ms-be2055.codfw.wmnet with reason: reboot [15:52:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:28] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Couple of minor nitpicks online. 
We 'll need to create tokens and namespaces before this can be deployed though." (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [15:53:02] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [15:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:52] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655496 (https://phabricator.wikimedia.org/T271696) (owner: 10CRusnov) [15:55:24] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [15:55:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:06] !log cp5008: ats-backend-restart to apply jit.off(true, true) in default.lua T265625 [15:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:09] T265625: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 [15:56:17] (03PS6) 10Ottomata: Add new service eventstreams-internal [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) [15:57:14] (03Restored) 10Alexandros Kosiaris: lvs: stop monitoring graphoid [puppet] - 10https://gerrit.wikimedia.org/r/654959 (https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn) [15:57:19] (03CR) 10Ottomata: Add new service eventstreams-internal (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/644612 (https://phabricator.wikimedia.org/T269160) (owner: 10Ottomata) [15:58:01] RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 7.93, 1.72, 0.56 https://wikitech.wikimedia.org/wiki/Swift [15:58:11] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Actually, this can go in now and given we will need to remove the LVS setup without pestering ourselves with pages and alerts, it can go i" [puppet] - 10https://gerrit.wikimedia.org/r/654959 
(https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn) [15:58:17] (03PS3) 10Alexandros Kosiaris: lvs: stop monitoring graphoid [puppet] - 10https://gerrit.wikimedia.org/r/654959 (https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn) [15:59:55] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks!" [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 (owner: 10Alexandros Kosiaris) [16:00:35] (03PS1) 10Herron: dns: add kibana7.svc record [dns] - 10https://gerrit.wikimedia.org/r/655696 (https://phabricator.wikimedia.org/T234854) [16:00:53] (03Merged) 10jenkins-bot: Bump debian/changelog for packaging 0.2.1 [software/service-checker] - 10https://gerrit.wikimedia.org/r/655655 (owner: 10Alexandros Kosiaris) [16:01:20] (03PS5) 10Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077 [16:02:37] RECOVERY - Host db2140 is UP: PING OK - Packet loss = 0%, RTA = 34.17 ms [16:03:12] (03CR) 10jerkins-bot: [V: 04-1] Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff) [16:03:36] 10SRE, 10ops-codfw, 10DBA: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Papaul) a:05Papaul→03Marostegui DiMM B6 replaced , server is back up. return tracking information below. {F33996733} [16:03:52] (03PS1) 10Ladsgroup: Fix the mapping [puppet] - 10https://gerrit.wikimedia.org/r/655697 (https://phabricator.wikimedia.org/T266702) [16:04:35] (03CR) 10Herron: [C: 03+2] dns: add kibana7.svc record [dns] - 10https://gerrit.wikimedia.org/r/655696 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:04:58] 10SRE, 10ops-codfw, 10DBA: db2140 crashed due to HW memory errors - https://phabricator.wikimedia.org/T271084 (10Marostegui) Thanks Papaul. Going to start mysql, check its data, enable replication and later repool it. 
Will close this task once fully done [16:05:00] (03CR) 10Addshore: [C: 03+1] Fix the mapping [puppet] - 10https://gerrit.wikimedia.org/r/655697 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [16:07:54] (03PS2) 10Ema: ATS: fix wdqs remap rules [puppet] - 10https://gerrit.wikimedia.org/r/655697 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [16:08:38] (03CR) 10Ema: [C: 03+2] ATS: fix wdqs remap rules [puppet] - 10https://gerrit.wikimedia.org/r/655697 (https://phabricator.wikimedia.org/T266702) (owner: 10Ladsgroup) [16:09:50] (03PS1) 10Alexandros Kosiaris: service-checker: Bump to 0.2.1-1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/655700 [16:12:14] (03PS1) 10Muehlenhoff: Move bast3005 to the correct DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/655703 [16:12:57] (03CR) 10Volans: "recheck" [software/klaxon] - 10https://gerrit.wikimedia.org/r/651846 (owner: 10Legoktm) [16:14:07] (03CR) 10Muehlenhoff: [C: 03+2] Move bast3005 to the correct DHCP config [puppet] - 10https://gerrit.wikimedia.org/r/655703 (owner: 10Muehlenhoff) [16:14:45] (03PS1) 10Mforns: Migrate SuggestedTagsAction to Event Platform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) [16:15:42] (03CR) 10CDanis: [C: 03+2] Switch to pytest and use tox-wikimedia [software/klaxon] - 10https://gerrit.wikimedia.org/r/651846 (owner: 10Legoktm) [16:16:27] PROBLEM - Stale file for node-exporter textfile in codfw on alert1001 is CRITICAL: cluster=mysql file=device_smart.prom instance=db2140 job=node site=codfw https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [16:17:02] (03Merged) 10jenkins-bot: Switch to pytest and use tox-wikimedia [software/klaxon] - 10https://gerrit.wikimedia.org/r/651846 (owner: 10Legoktm) [16:18:46] !log herron@puppetmaster1001 conftool action : set/weight=10; selector: 
name=logstash2031.codfw.wmnet [16:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:05] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) >>! In T267050#6732006, @RobH wrote: > fixed the netbox issue, hsots will be imaged later today I hadn't circled back to this yet, but its on my radar to address/image today! [16:20:03] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] service-checker: Bump to 0.2.1-1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/655700 (owner: 10Alexandros Kosiaris) [16:25:40] (03PS1) 10Ayounsi: Add dduvall to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/655708 (https://phabricator.wikimedia.org/T271746) [16:26:58] (03CR) 10Ayounsi: [C: 03+2] Add dduvall to contint-roots [puppet] - 10https://gerrit.wikimedia.org/r/655708 (https://phabricator.wikimedia.org/T271746) (owner: 10Ayounsi) [16:27:47] 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) [16:27:55] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to contint-roots for dduvall - https://phabricator.wikimedia.org/T271746 (10ayounsi) 05Open→03Resolved a:03ayounsi And done. [16:30:53] 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) 05Open→03Resolved This is done now. Thanks to everyone who helped! 
{meme, src="macro-deployed"} [16:31:43] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [16:31:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:28] (03CR) 10Herron: [C: 03+2] kibana7: add kibana7 conftool entries [puppet] - 10https://gerrit.wikimedia.org/r/654436 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:33:45] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [16:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:36:41] RECOVERY - Stale file for node-exporter textfile in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile [16:37:27] (03PS2) 10Mforns: Migrate SuggestedTagsAction to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) [16:37:29] !log cp5008: ats-backend-restart to apply jit.off(true, true) to all lua scripts T265625 [16:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:32] T265625: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 [16:37:50] (03CR) 10Jbond: "test result:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/651171 (https://phabricator.wikimedia.org/T270618) (owner: 10Jbond) [16:39:54] !log herron@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: cluster=kibana7,service=kibana7 [16:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:44:26] (03CR) 10CRusnov: [C: 03+2] tools/rotatedump: Constrain dump maintenance to automatically generated files [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655496 (https://phabricator.wikimedia.org/T271696) (owner: 10CRusnov) [16:44:35] (03PS2) 10Herron: kibana7: repoint (rename) kibana-next services to 
kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/654437 (https://phabricator.wikimedia.org/T234854) [16:47:27] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:48:00] (03PS1) 10ArielGlenn: Fix undefined index error in ApiQueryInfo [core] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655671 (https://phabricator.wikimedia.org/T271804) [16:48:15] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] update flink config with swift and other values (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/650633 (https://phabricator.wikimedia.org/T269876) (owner: 10Mstyles) [16:54:02] (03CR) 10ArielGlenn: [C: 03+2] Fix undefined index error in ApiQueryInfo [core] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655671 (https://phabricator.wikimedia.org/T271804) (owner: 10ArielGlenn) [16:56:39] !log reinstalling bast3005 with correct DHCP settings [16:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:25] (03CR) 10Herron: [C: 03+2] kibana7: repoint (rename) kibana-next services to kibana7 [puppet] - 10https://gerrit.wikimedia.org/r/654437 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [16:58:40] t [16:58:49] er, hi :) [16:59:09] (03PS8) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [16:59:30] (03CR) 10Jbond: "Test results:" [puppet] - 10https://gerrit.wikimedia.org/r/651174 (https://phabricator.wikimedia.org/T193762) (owner: 10Jbond) [17:00:04] jbond42 and cdanis: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1700). [17:01:12] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [17:01:38] 10SRE, 10WVUI: Import npm 6.14.8 to buster dist. on apt.wikimedia.org - https://phabricator.wikimedia.org/T270321 (10jbond) @ovasileva in case you missed it i added comment on T269957 [17:02:38] 10SRE, 10Performance-Team, 10Traffic: ats-be occasional system CPU usage increase - https://phabricator.wikimedia.org/T265625 (10ema) Disabling JIT in all Lua scripts on cp5008 resulted in ats-be not calling lj_vm_hotcall/mmap anymore and CPU usage [[https://grafana.wikimedia.org/d/A__2L7eWz/cache-hosts-comp... [17:03:03] (03CR) 10Jbond: "recheck" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [17:03:59] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:04:01] (03CR) 10Razzi: "Thanks for the comments @Volans and @Elukey! This is looking better and should be close to complete." 
(0314 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [17:04:05] (03PS6) 10Jbond: nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) [17:04:47] (03PS2) 10Dduvall: releases: Provide docker to PipelineLib based jobs [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) [17:04:59] (03CR) 10Dduvall: releases: Provide docker to PipelineLib based jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655500 (https://phabricator.wikimedia.org/T271477) (owner: 10Dduvall) [17:05:50] (03PS9) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [17:07:41] (03PS6) 10Muehlenhoff: Add a wrapper to optionally mail the output of a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/655077 [17:07:48] (03CR) 10jerkins-bot: [V: 04-1] nodegen: add cumin support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/651800 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [17:09:05] !log rebooting people1002 (people.wikimedia.org) [17:09:06] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [17:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:24] !log shutting down db2132, db2078:m1 for m1 codfw replica reprovisioning T270877 [17:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:28] T270877: db2078 m1 mysqld process crashed - https://phabricator.wikimedia.org/T270877 [17:11:28] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) [17:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:43] (03CR) 10Ottomata: [C: 03+1] 
eventgate, eventstreams: Log with namedlevels [deployment-charts] - 10https://gerrit.wikimedia.org/r/594492 (https://phabricator.wikimedia.org/T239459) (owner: 10Alexandros Kosiaris) [17:12:35] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/655077 (owner: 10Muehlenhoff) [17:15:39] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 0 down 3 https://wikitech.wikimedia.org/wiki/HAProxy [17:16:43] that's expected [17:16:46] will silence [17:17:36] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10RobH) >>! In T271058#6739089, @aborrero wrote: > hey @RobH could you please help us here with the vendor side of the firmware... [17:18:05] (03Merged) 10jenkins-bot: Fix undefined index error in ApiQueryInfo [core] (wmf/1.36.0-wmf.26) - 10https://gerrit.wikimedia.org/r/655671 (https://phabricator.wikimedia.org/T271804) (owner: 10ArielGlenn) [17:18:27] 10SRE, 10WVUI: Import npm 6.14.8 to buster dist. on apt.wikimedia.org - https://phabricator.wikimedia.org/T270321 (10ovasileva) >>! In T270321#6740376, @jbond wrote: > @ovasileva in case you missed it i added comment on T269957 @jbond - Thank you! [17:18:31] 10SRE, 10ops-eqiad, 10cloud-services-team (Kanban): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10RobH) [17:18:47] 10SRE, 10cloud-services-team (Kanban): apt key for `thirdparty/ceph-nautilus/buster` has expired. 
- https://phabricator.wikimedia.org/T259873 (10Andrew) a:03aborrero [17:19:22] (03PS1) 10Zoranzoki21: Enable visualeditor on kuwiki by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655714 (https://phabricator.wikimedia.org/T270841) [17:20:21] !log roll restarting eqiad/codfw low-traffic pybals for kibana-next -> kibana7 rename [17:20:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:29] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 110 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:26:31] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 6 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:29:04] ^"Could not enqueue jobs" [17:36:29] (03PS15) 10Hnowlan: sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) [17:37:00] (03CR) 10Hnowlan: sockpuppet-api: Create basic chart and service config (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [17:37:56] (03CR) 10jerkins-bot: [V: 04-1] sockpuppet-api: Create basic chart and service config [deployment-charts] - 10https://gerrit.wikimedia.org/r/643721 (https://phabricator.wikimedia.org/T268837) (owner: 10Hnowlan) [17:40:08] !log dnsX002 - upgrade gdnsd to 3.5.0 [17:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:39] (03PS10) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [17:42:47] PROBLEM - PyBal 
backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1007.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:44:15] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:45:11] is there someone on the SRE sie that can be my buddy as I muddle through deploying a backport? [17:45:59] *side [17:47:07] 10Puppet, 10SRE, 10puppet-compiler, 10User-jbond: puppet master command will be removed in puppet 6 - https://phabricator.wikimedia.org/T236373 (10jbond) https://github.com/github/octocatalog-diff/issues/235 [17:48:51] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - kibana-ssl_443: Servers logstash1007.eqiad.wmnet are marked down but pooled: logstash-syslog-tcp_10514: Servers logstash1008.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:53:54] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1800). [18:02:07] 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) Wow, already done? Now that was quicker than anticipated. nice :) [18:06:52] !log dns2001,dns5001 - upgrade gdnsd to 3.5.0 [18:06:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:36] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:10:25] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10JAnstee_WMF) @herron please reopen and get manager approval from @sbodington [18:11:42] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10herron) 05Resolved→03Open a:05Rmaung→03Sbodington [18:17:28] (03PS3) 10Mforns: Migrate SuggestedTagsAction to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) [18:22:50] (03PS1) 10Mforns: Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655723 (https://phabricator.wikimedia.org/T267333) [18:23:22] 10SRE: Request for Kerberos password - https://phabricator.wikimedia.org/T271845 (10DNdubane_WMF) [18:23:58] 10SRE, 10Analytics: Request for Kerberos password - https://phabricator.wikimedia.org/T271845 (10elukey) [18:39:32] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Ottomata) Approved from Data Engineering standpoint. FYI we will be re-writing our Jupyter hub documentation and setup this quarte... [18:42:51] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10RKemper) @Cmjohnson Just checking in here - I think when we left off, a ticket was going to be created with Dell for the hardware memory corrupti... 
[18:47:16] right: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/655671 has been pulled to mwdebug1001 [18:48:10] testing on mediawiki.org with Special:ApiSandbox#action=query&format=json&prop=info&titles=Download&inprop=notificationtimestamp [18:48:23] which I am not watching, as described in https://phabricator.wikimedia.org/T271815 [18:48:34] gives no error, so presumably the fix is good to go [18:48:48] RECOVERY - haproxy failover on dbproxy2001 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [18:48:55] ^as expected [18:49:09] (03PS2) 10Bartosz Dziewoński: Disable DiscussionTools' newtopictool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654520 (https://phabricator.wikimedia.org/T270119) (owner: 10Esanders) [18:50:40] I am trying to make sure i have the query right by trying in another tab without the extension [18:52:10] (03PS3) 10Bartosz Dziewoński: Disable DiscussionTools' upcoming newtopictool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654520 (https://phabricator.wikimedia.org/T270119) (owner: 10Esanders) [18:52:44] jouncebot: refresh [18:52:45] I refreshed my knowledge about deployments. [18:53:32] seems like i get weird behavior in the other tab so good [18:53:35] ok moving on [18:55:14] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:56:59] legoktm: still there? [18:57:04] yep [18:57:54] want to verify the command and format: I'm on deploy1001 in /srv/mediawiki-staging about to do scap sync-file php-1.36.0-wmf.26/includes/api/ApiQueryInfo.php 'Backport: [[gerrit:[655671]|[Fix undefined index error in ApiQueryInfo] ([T271815])]]' as per the docs [18:57:55] T271815: PHP Notice: Undefined offset: 0 - https://phabricator.wikimedia.org/T271815 [18:58:07] does that look right, or what is off? 
[18:58:33] sorry, you got a first time deployer here (or at least it's been > 8 years since any deployment, so might as well be first time) [18:58:51] normally I just have a message like "Fix undefined index error in ApiQueryInfo (T271815)" [18:58:59] not sure how all the brackets will end up on the [[SAL]] page [18:59:03] but otherwise it looks good [18:59:11] I'm just following the docs but sure I can simplify it [19:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T1900). [19:00:04] tgr, mforns, and MatmaRex: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:23] here :] [19:00:28] hi [19:00:33] o/ [19:00:56] give me a couple. this is the train blocker backport which took a lot longer than it should because [19:01:05] first time deployer with anxiety is slow. i.e. me. [19:01:31] no problemo :] [19:01:35] !log ariel@deploy1001 Synchronized php-1.36.0-wmf.26/includes/api/ApiQueryInfo.php: Backport: (gerrit 655671) Fix undefined index error in ApiQueryInfo (T271815) (duration: 01m 06s) [19:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:47] claims to be done [19:02:02] :D [19:02:08] lemme check now [19:02:10] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/655734 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:02:11] * legoktm waits for cluster to fall over [19:02:28] 10SRE, 10ops-eqiad, 10Patch-For-Review: Decommission mw1259-mw1260 - https://phabricator.wikimedia.org/T187466 (10Dzahn) hosts were never removed from DHCP.. cleaning up. [19:02:36] (03CR) 10CRusnov: "This change is ready for review."
[puppet] - 10https://gerrit.wikimedia.org/r/655733 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:02:36] logs look clean so far [19:02:55] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/655731 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:03:34] (03CR) 10Ottomata: [C: 03+1] Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655723 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [19:03:38] last testwiki related error at 2021-01-12T18:58:18 [19:03:50] did you test after that? [19:05:14] I was testing on mediawikiwiki because that's also group0 [19:05:33] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Cmjohnson) quick update, all the servers are cabled, need to add to netbox next and then setup idrac. These will be ready to be handed over tomorrow. [19:05:47] sure :-) don't see any of that [19:05:52] but I have no guarantee I am doing these tests correctly [19:06:10] Try a page that doesn't exist whatsoever apergos [19:06:28] (don't see errors, I mean) [19:07:19] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [19:07:44] 10SRE, 10ops-eqiad, 10DC-Ops: frdev1001 ILO inaccessible - https://phabricator.wikimedia.org/T267969 (10Cmjohnson) @jgreen do you have our mgmt ports in a vlan? I don't think anyone has touched the server prior to the ILO becoming inaccessible. The problem happens often and pulling the power and rebooting u...
[19:08:16] (03PS1) 10Dzahn: DHCP: remove mw1259, mw1260 [puppet] - 10https://gerrit.wikimedia.org/r/655735 (https://phabricator.wikimedia.org/T187466) [19:08:37] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2224.codfw.wmnet'] ` Of which those **F... [19:08:40] 10SRE, 10ops-eqiad: Please remove sdb from ms-be1022 - https://phabricator.wikimedia.org/T271512 (10Cmjohnson) 05Open→03Resolved Done [19:09:04] still looks ok [19:09:09] 10SRE, 10ops-eqiad, 10SRE-swift-storage: Degraded RAID on ms-be1019 - https://phabricator.wikimedia.org/T270806 (10Cmjohnson) 05Open→03Resolved Done [19:09:22] apergos: then it's probably fine [19:09:27] good [19:09:33] sorry to tie up everyone's time with this [19:09:36] next time will be faster [19:10:15] I'm sure no one minds [19:11:25] not at all [19:11:43] ok, you're on, thanks legoktm and Reedy for hand-holding [19:11:58] lemme go tell li w that the train is officially all his [19:14:07] anyone doing the backports? [19:14:14] I can do it I suppose [19:16:33] apergos: :)) anytime! [19:16:35] ah lol, I see that wmf26 never made it out to all of group0 which means my earlier testing was flawed :-D [19:16:42] but at least my later testing was good for sure! [19:16:46] let's go with the noops first [19:17:01] (03CR) 10Gergő Tisza: [C: 03+2] Disable DiscussionTools' upcoming newtopictool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654520 (https://phabricator.wikimedia.org/T270119) (owner: 10Esanders) [19:17:21] tgr_: will you be using mwdebug1002? [19:17:31] Krinkle: I was planning to [19:17:39] I'd like to fiddle on mwdebug1001 for a bit, don't worry about overwriting/syncing to me, I'll manage.
[19:18:01] you can also try mwdebug1003 to test on buster, fwiw [19:18:07] (03Merged) 10jenkins-bot: Disable DiscussionTools' upcoming newtopictool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/654520 (https://phabricator.wikimedia.org/T270119) (owner: 10Esanders) [19:18:50] ok [19:18:51] tgr_: thanks. the code using that option is not merged yet, so i have nothing to test [19:19:41] (03CR) 10Dzahn: [C: 03+2] "decom'ed in 2018" [puppet] - 10https://gerrit.wikimedia.org/r/655735 (https://phabricator.wikimedia.org/T187466) (owner: 10Dzahn) [19:19:41] 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Ladsgroup) Filed {T271851} for clean up [19:19:51] (03PS2) 10Dzahn: DHCP: remove mw1259, mw1260 [puppet] - 10https://gerrit.wikimedia.org/r/655735 (https://phabricator.wikimedia.org/T187466) [19:19:56] (03PS2) 10Gergő Tisza: Alphabetize ORES settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655301 (https://phabricator.wikimedia.org/T256887) [19:20:06] (03CR) 10Gergő Tisza: [C: 03+2] Alphabetize ORES settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655301 (https://phabricator.wikimedia.org/T256887) (owner: 10Gergő Tisza) [19:20:48] (03PS4) 10Gergő Tisza: Migrate SuggestedTagsAction to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns) [19:21:01] (03Merged) 10jenkins-bot: Alphabetize ORES settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655301 (https://phabricator.wikimedia.org/T256887) (owner: 10Gergő Tisza) [19:21:05] (03PS2) 10Gergő Tisza: Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655723 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [19:22:11] (03PS1) 10Dzahn: DHCP: switch all codfw appservers from stretch to buster 
installer [puppet] - 10https://gerrit.wikimedia.org/r/655740 (https://phabricator.wikimedia.org/T245757) [19:22:31] (03PS2) 10Dzahn: DHCP: switch all codfw appservers from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/655740 (https://phabricator.wikimedia.org/T245757) [19:23:26] (03PS5) 10Gergő Tisza: Migrate SuggestedTagsAction to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns) [19:23:37] (03PS3) 10Gergő Tisza: Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655723 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [19:25:03] !log rolling restart of eventgate-analytics-external pods to clear schema caches - T267333 [19:25:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:06] (03CR) 10Dzahn: "ok, cool. thanks for merging" [puppet] - 10https://gerrit.wikimedia.org/r/654959 (https://phabricator.wikimedia.org/T242855) (owner: 10Dzahn) [19:25:07] T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 [19:26:00] (03CR) 10Gergő Tisza: [C: 03+2] Migrate SuggestedTagsAction to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns) [19:26:48] (03Merged) 10jenkins-bot: Migrate SuggestedTagsAction to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655706 (https://phabricator.wikimedia.org/T267351) (owner: 10Mforns) [19:27:03] (03CR) 10Gergő Tisza: [C: 03+2] Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655723 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [19:27:46] !log dns3001,dns4001 - upgrade gdnsd to 3.5.0 [19:27:47] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:56] (03Merged) 10jenkins-bot: Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655723 (https://phabricator.wikimedia.org/T267333) (owner: 10Mforns) [19:28:18] thanks a lot tgr_ :] [19:28:21] mforns: do you want to test those patches before they get synced? they are testwiki-only so I figure there might not be much point [19:28:37] no, thanks, I will test them on testwiki [19:32:18] (03CR) 10Muehlenhoff: [C: 03+1] DHCP: switch all codfw appservers from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/655740 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [19:32:56] !log tgr@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Bunch of no-op/testwiki changes: [[gerrit:654520]], [[gerrit:655301]], [[gerrit:655706]], [[gerrit:655723]] (duration: 01m 05s) [19:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:24] mforns: it's live [19:33:55] (03PS2) 10Gergő Tisza: Enable ORES filters on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655302 (https://phabricator.wikimedia.org/T256887) [19:34:22] (03PS1) 10Jbond: wmflib: create dir::split and dir::mkdir_p functions [puppet] - 10https://gerrit.wikimedia.org/r/655741 [19:36:30] (03CR) 10Gergő Tisza: [C: 03+2] Enable ORES filters on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655302 (https://phabricator.wikimedia.org/T256887) (owner: 10Gergő Tisza) [19:37:18] (03Merged) 10jenkins-bot: Enable ORES filters on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655302 (https://phabricator.wikimedia.org/T256887) (owner: 10Gergő Tisza) [19:45:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:48] !log tgr@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: Config: [[gerrit:655302|Enable ORES filters on ukwiki (T256887)]] (duration: 01m 05s) [19:46:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:52] T256887: Enable ORES filters for ukwiki (Ukrainian Wikipedia) - https://phabricator.wikimedia.org/T256887 [19:47:24] in hindsight, not including task numbers in the sync summary was a bad idea. [19:48:22] !log synced Config: [[gerrit:655301|Alphabetize ORES settings (T256887)]] [19:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:42] !log synced Config: [[gerrit:655706|Migrate SuggestedTagsAction to Event Platform on testwiki (T267351)]] [19:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:46] T267351: SuggestedTagsAction Event Platform Migration - https://phabricator.wikimedia.org/T267351 [19:49:00] !log synced Config: [[gerrit:655723|Migrate HomepageVisit and ServerSideAccountCreation to Event Platform on testwiki (T267333)]] [19:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:04] T267333: Migrate Growth EventLogging schemas to Event Platform - https://phabricator.wikimedia.org/T267333 [19:49:20] !log synced Config: [[gerrit:654520|Disable DiscussionTools' upcoming newtopictool (T270119)]] [19:49:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:23] T270119: Create a setting for the New Discussion Tool in Special:Preferences - https://phabricator.wikimedia.org/T270119 [19:50:05] (03PS1) 10Razzi: admin: Add krb: present for kharlan [puppet] - 10https://gerrit.wikimedia.org/r/655744 (https://phabricator.wikimedia.org/T271467) [19:52:34] (03PS1) 10Razzi: admin: add krb: present for janstee [puppet] - 10https://gerrit.wikimedia.org/r/655747 (https://phabricator.wikimedia.org/T271844) [19:52:59] !log dns1001,authdns1001 - upgrade gdnsd to 3.5.0 [19:53:00] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:01] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/655743 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [19:53:08] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 105 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:53:30] (03PS1) 10Razzi: admin: add krb: present for dumisani [puppet] - 10https://gerrit.wikimedia.org/r/655748 (https://phabricator.wikimedia.org/T271845) [19:53:38] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 118.21, 101.37, 95.06 https://wikitech.wikimedia.org/wiki/Swift [19:54:08] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:54:45] (03CR) 10Razzi: [C: 03+2] admin: add krb: present for dumisani [puppet] - 10https://gerrit.wikimedia.org/r/655748 (https://phabricator.wikimedia.org/T271845) (owner: 10Razzi) [19:56:32] (03PS2) 10Razzi: admin: add krb: present for janstee [puppet] - 10https://gerrit.wikimedia.org/r/655747 (https://phabricator.wikimedia.org/T271844) [19:56:42] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 102.13, 100.48, 95.86 https://wikitech.wikimedia.org/wiki/Swift [19:56:46] 10SRE, 10Analytics, 10Patch-For-Review: Request for Kerberos password - https://phabricator.wikimedia.org/T271845 (10razzi) @DNdubane_WMF this should be all set, check your email :) and comment if you have any issues! 
[19:57:18] (03CR) 10Razzi: [C: 03+2] admin: add krb: present for janstee [puppet] - 10https://gerrit.wikimedia.org/r/655747 (https://phabricator.wikimedia.org/T271844) (owner: 10Razzi) [19:57:40] !log backports done [19:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:58:48] (03PS2) 10Razzi: admin: Add krb: present for kharlan [puppet] - 10https://gerrit.wikimedia.org/r/655744 (https://phabricator.wikimedia.org/T271467) [20:00:04] liw and longma: I, the Bot under the Fountain, allow thee, The Deployer, to do Mediawiki train - European+American Version (secondary timeslot) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T2000). [20:00:18] (03CR) 10Razzi: [C: 03+2] admin: Add krb: present for kharlan [puppet] - 10https://gerrit.wikimedia.org/r/655744 (https://phabricator.wikimedia.org/T271467) (owner: 10Razzi) [20:03:18] thanks tgr_! [20:03:38] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:30] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:04] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:11] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) So I went to image these, and none of the mgmt interfaces are pingable. So either they aren't plugged in, or they were misconfigured. Since they aren't remotely accessible,... [20:19:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:21:08] 10SRE, 10SRE-Access-Requests: Requesting access to labweb1001 and labweb1002 for jhernandez - https://phabricator.wikimedia.org/T271859 (10Jhernandez) [20:21:17] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:28] !log running 'mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=ukwiki' on terbium [20:40:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:27] (03PS1) 10Herron: ELK: promote logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655754 (https://phabricator.wikimedia.org/T234854) [20:46:01] PROBLEM - very high load average likely xfs on ms-be2055 is CRITICAL: CRITICAL - load average: 107.81, 100.56, 93.37 https://wikitech.wikimedia.org/wiki/Swift [20:46:33] (03PS1) 10RobH: updating with faster speed memory [software] - 10https://gerrit.wikimedia.org/r/655755 [20:48:22] (03CR) 10RobH: [C: 03+2] updating with faster speed memory [software] - 10https://gerrit.wikimedia.org/r/655755 (owner: 10RobH) [20:49:01] (03CR) 10Krinkle: [C: 03+2] Enable "coalesceKeys" for global keys for WANCache (II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607155 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [20:49:50] (03Merged) 10jenkins-bot: Enable "coalesceKeys" for global keys for WANCache (II) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607155 (https://phabricator.wikimedia.org/T252564) (owner: 10Aaron Schulz) [20:56:17] * Krinkle staging on mwdebug1002 [20:57:08] (03PS1) 10Krinkle: Revert "Enable "coalesceKeys" for global keys for WANCache (II)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655684 [20:58:01] (03CR) 10Krinkle: [C: 03+2] "Still bugged, causing db read-only." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/655684 (owner: 10Krinkle) [20:58:21] 10SRE, 10Inuka-Team, 10SRE-Access-Requests, 10Security-Team, 10Product-Analytics (Kanban): Provide raw KaiOSAppFeedback data to Chelsea Riley for analysis - https://phabricator.wikimedia.org/T271202 (10sbassett) >>! In T271202#6736704, @nshahquinn-wmf wrote: > On a couple of occasions, I have provided us... [21:00:11] (03Merged) 10jenkins-bot: Revert "Enable "coalesceKeys" for global keys for WANCache (II)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655684 (owner: 10Krinkle) [21:13:07] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:24] (03CR) 10Dzahn: [C: 03+2] DHCP: switch all codfw appservers from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/655740 (https://phabricator.wikimedia.org/T245757) (owner: 10Dzahn) [21:18:35] jouncebot: now [21:18:35] For the next 0 hour(s) and 41 minute(s): Mediawiki train - European+American Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210112T2000) [21:18:52] longma: Train still active, or can I deploy quickly? [21:19:54] We are on the European deploy schedule this week so you can deploy now [21:20:08] (03CR) 10Jforrester: [C: 03+2] Don't pass protocol-relative URLs to the Ace worker [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/655394 (https://phabricator.wikimedia.org/T271487) (owner: 10Jforrester) [21:20:11] Cool, thanks. [21:20:21] Thanks for checking :) [21:20:36] Always. 
[21:21:13] RECOVERY - Check systemd state on ms-be2055 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:47] (03PS1) 10Andrew Bogott: tlsproxy::localssl: Remove support for the acme_subjects param [puppet] - 10https://gerrit.wikimedia.org/r/655761 (https://phabricator.wikimedia.org/T252199) [21:22:49] (03PS1) 10Andrew Bogott: Remove the 'letsencrypt' module [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) [21:25:51] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:26:19] (03CR) 10Andrew Bogott: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27436/ should tell me if this is a huge mistake." [puppet] - 10https://gerrit.wikimedia.org/r/655762 (https://phabricator.wikimedia.org/T252199) (owner: 10Andrew Bogott) [21:27:20] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqiad.wmnet for host... [21:37:51] 10SRE, 10Wikidata, 10Wikidata Query UI, 10Patch-For-Review, 10User-Addshore: Move WDQS UI to microsites - https://phabricator.wikimedia.org/T266702 (10Dzahn) It would be great if we can add one or a couple assertions to `./modules/profile/files/httpbb/miscweb/test_miscweb.yaml` in operations/puppet. Tha... 
[21:38:16] (03PS1) 10Ssingh: dnsrecusor: update variable name for installing version from component [puppet] - 10https://gerrit.wikimedia.org/r/655763 [21:40:03] (03CR) 10Cwhite: [C: 03+1] ELK: promote logstash-next to logstash.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/655754 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:41:55] !log rolling restart of eventgate-analytics-external pods [21:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:58] (03CR) 10Dzahn: [C: 03+2] bird: require_package->ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/655522 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:44:55] (03Merged) 10jenkins-bot: Don't pass protocol-relative URLs to the Ace worker [extensions/AbuseFilter] (wmf/1.36.0-wmf.25) - 10https://gerrit.wikimedia.org/r/655394 (https://phabricator.wikimedia.org/T271487) (owner: 10Jforrester) [21:45:02] (03CR) 10Dzahn: "noop on centrallog1001 - thx for review" [puppet] - 10https://gerrit.wikimedia.org/r/655522 (https://phabricator.wikimedia.org/T266479) (owner: 10Dzahn) [21:45:55] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:39] !log jforrester@deploy1001 Synchronized php-1.36.0-wmf.25/extensions/AbuseFilter/modules/mode-abusefilter.js: T271487 Don't pass protocol-relative URLs to the Ace worker (duration: 01m 06s) [21:50:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:42] T271487: Uncaught SyntaxError: Failed to execute 'open' on 'XMLHttpRequest': Invalid URL / Uncaught SyntaxError: Failed to execute 'open' on 'XMLHttpRequest': Invalid URL / Malformed URIs in AbuseFilter worker-abusefilter.js - https://phabricator.wikimedia.org/T271487 [21:52:40] (03PS2) 10Ssingh: dnsrecusor: update variable name for installing version from component [puppet] - 
10https://gerrit.wikimedia.org/r/655763 [21:53:03] (03PS1) 10Luke081515: Create Contact page for Ombuds commission at Meta-Wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/655786 (https://phabricator.wikimedia.org/T271828) [21:55:05] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2224.codfw.wmnet with reason: REIMAGE [21:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:06] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2224.codfw.wmnet with reason: REIMAGE [21:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:31] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2225.codfw.wmnet with reason: REIMAGE [22:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:21] (03PS3) 10Ssingh: dnsrecusor: update variable name for installing version from component [puppet] - 10https://gerrit.wikimedia.org/r/655763 [22:02:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2225.codfw.wmnet with reason: REIMAGE [22:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:43] 10SRE, 10Traffic: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 (10BBlack) There's some anomalies in network graphs on authdns1001 that I hadn't noticed until today, which go all the way back to Oct 26, which is pro... 
[22:04:12] !log proceeding with Netbox 2.9 upgrade T266487 [22:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:15] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [22:05:30] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/27439/" [puppet] - 10https://gerrit.wikimedia.org/r/655763 (owner: 10Ssingh) [22:05:47] (03CR) 10CRusnov: [C: 03+2] Make scripts and reports compatible with Netbox 2.9 [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/643444 (https://phabricator.wikimedia.org/T266487) (owner: 10Ayounsi) [22:06:08] (03PS2) 10CRusnov: Make Homer compatible with Netbox 2.9 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/643681 (https://phabricator.wikimedia.org/T266487) (owner: 10Ayounsi) [22:07:01] (03PS5) 10CRusnov: netbox: Add only non-2.8 compatible setting for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/649436 (https://phabricator.wikimedia.org/T266488) [22:07:07] !log reboot authdns1001 - T266746#6741647 [22:07:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:19] (03CR) 10CRusnov: [C: 03+2] dns: migrate script to Netbox 2.9+ [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/655040 (https://phabricator.wikimedia.org/T266488) (owner: 10Volans) [22:07:19] T266746: TCP traffic increase for DNS over TLS breached a low limit for max open files on authdns1001/2001 - https://phabricator.wikimedia.org/T266746 [22:07:50] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10fkaelin) I am trying to access Hue, and after looking at these tasks requesting access for Hue [[ https://phabricator.wikimedia.org/T271602 | T271602 ]] and [[... 
[22:09:46] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Make Homer compatible with Netbox 2.9 [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/643681 (https://phabricator.wikimedia.org/T266487) (owner: 10Ayounsi) [22:10:01] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:10:15] (03CR) 10CRusnov: [C: 03+2] netbox: Add only non-2.8 compatible setting for Netbox [puppet] - 10https://gerrit.wikimedia.org/r/649436 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [22:10:23] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:12:19] !log Merged Netbox 2.9 related changes in puppet and -extras; testing on -next T266487 [22:12:21] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 63, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:12:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:23] BGP/BFD there are authdns1001, they'll recover shortly [22:12:23] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [22:12:45] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:27:52] (03CR) 10Volans: "Thanks for the quick fixes, much much better!
Couple of nits inline and a question for Luca" (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [22:28:24] RECOVERY - very high load average likely xfs on ms-be2055 is OK: OK - load average: 71.22, 75.85, 79.84 https://wikitech.wikimedia.org/wiki/Swift [22:30:25] !log crusnov@deploy1001 Started deploy [netbox/deploy@b17db99]: Deploy Netbox 2.9.10 to production T266487 [22:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:28] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [22:32:59] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b17db99]: Deploy Netbox 2.9.10 to production T266487 (duration: 02m 33s) [22:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:14] (03PS7) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) [22:34:24] (03CR) 10Legoktm: docker_registry_ha: Add a script to generate a static HTML homepage (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:34:36] (03PS1) 10Ladsgroup: cache: Make statsd address an argument and hiera() -> lookup() [puppet] - 10https://gerrit.wikimedia.org/r/655790 (https://phabricator.wikimedia.org/T209953) [22:35:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:35:18] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:24] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:36:26] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:37:24] !log Upgrade of Netbox to 2.9 complete, checking support software. T266487 [22:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:28] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [22:37:46] RECOVERY - Check systemd state on netbox2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:26] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2224.codfw.wmnet'] ` and were **ALL** s... 
[22:46:52] !log crusnov@deploy1001 Started deploy [netbox/deploy@b17db99]: Rerun production deploy of Netbox 2.9 just in case T266487 [22:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:46:55] T266487: Upgrade Netbox and accompanying to the 2.9 series (Tracking Task) - https://phabricator.wikimedia.org/T266487 [22:46:57] !log crusnov@deploy1001 Finished deploy [netbox/deploy@b17db99]: Rerun production deploy of Netbox 2.9 just in case T266487 (duration: 00m 05s) [22:46:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:12] 10SRE, 10serviceops, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), and 2 others: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw2225.codfw.wmnet'] ` and were **ALL** s... [22:47:44] (03PS1) 10Ladsgroup: eventlogging: Remove profile::eventlogging::analytics::files [puppet] - 10https://gerrit.wikimedia.org/r/655791 (https://phabricator.wikimedia.org/T259030) [22:48:12] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:49:02] fixing that. 
[22:51:05] (should be fixed) [22:52:25] (03CR) 10Razzi: Add cookbook for rebooting druid nodes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [22:52:33] (03PS11) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [22:55:17] (03PS1) 10Legoktm: docker_registry_ha: Have nginx serve /srv/hompage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [22:55:40] (03PS2) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [22:55:47] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2224.codfw.wmnet [22:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:55:55] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw2225.codfw.wmnet [22:55:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:38] (03CR) 10Volans: [C: 03+1] "LGTM (no need for re-review from me in case of a new PS)" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/654725 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [22:59:02] 10SRE, 10Analytics, 10SRE-Access-Requests: Requesting access to Production Shell Access (analytics-privatedata-users) for Rmaung - https://phabricator.wikimedia.org/T266250 (10Sbodington) As JAnstee´s manager, I approve the request. Please let me know if you need any additional information. 
[23:03:25] (03CR) 10RLazarus: [C: 03+1] docker_registry_ha: Have nginx serve /srv/homepage for / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:09:52] (03CR) 10Volans: [C: 04-1] "Just one wrong expression (or typo)" (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [23:10:10] (03CR) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:10:17] (03PS3) 10Legoktm: docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) [23:10:58] (03PS12) 10Razzi: Add cookbook for rebooting druid nodes [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) [23:11:26] (03CR) 10Razzi: Add cookbook for rebooting druid nodes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [23:11:56] PROBLEM - Check systemd state on ms-be2055 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:04] (03CR) 10RLazarus: [C: 03+1] docker_registry_ha: Have nginx serve /srv/homepage for / [puppet] - 10https://gerrit.wikimedia.org/r/655792 (https://phabricator.wikimedia.org/T179696) (owner: 10Legoktm) [23:14:11] (03CR) 10Volans: [C: 03+1] "LGTM, I'll leave it to Luca for the final review and to answer my previous question on repooling." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/651636 (https://phabricator.wikimedia.org/T269596) (owner: 10Razzi) [23:18:39] (03CR) 10Legoktm: [C: 04-1] "There's already some documentation for config options in README.rst, do you want to keep it just there or in both places? Probably the new" (034 comments) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655412 (https://phabricator.wikimedia.org/T183545) (owner: 10Giuseppe Lavagetto) [23:19:50] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:04] (03CR) 10Legoktm: "> Patch Set 2:" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/655411 (https://phabricator.wikimedia.org/T219398) (owner: 10Giuseppe Lavagetto) [23:28:24] (03PS1) 10RobH: updating puppet repo for ml-serve100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/655793 (https://phabricator.wikimedia.org/T267050) [23:29:52] (03CR) 10RobH: [C: 03+2] updating puppet repo for ml-serve100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/655793 (https://phabricator.wikimedia.org/T267050) (owner: 10RobH) [23:29:58] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install ml-serve100[1-4] - https://phabricator.wikimedia.org/T267050 (10RobH) [23:31:10] (03PS2) 10RobH: updating puppet repo for ml-serve100[1-4] [puppet] - 10https://gerrit.wikimedia.org/r/655793 (https://phabricator.wikimedia.org/T267050) [23:32:53] (03CR) 10Dzahn: [C: 03+1] "compiler output and code look good to me" [puppet] - 10https://gerrit.wikimedia.org/r/655763 (owner: 10Ssingh) [23:37:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users and wmf LDAP for fkaelin - https://phabricator.wikimedia.org/T267817 (10Dzahn) 05Resolved→03Open [23:40:45] (03CR) 10Dzahn: "thank you, I'd rather just merge, finding out whether this can really be deleted will be more work :)" [puppet] - 10https://gerrit.wikimedia.org/r/655511 
(https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:41:58] (03CR) 10Dzahn: [C: 03+2] "unused but maybe used again in the future" [puppet] - 10https://gerrit.wikimedia.org/r/655511 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [23:43:08] (03PS1) 10Legoktm: sbuild: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/655794 (https://phabricator.wikimedia.org/T266479) [23:43:10] (03PS1) 10Legoktm: scap: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/655795 (https://phabricator.wikimedia.org/T266479) [23:43:12] (03PS1) 10Legoktm: shiny_server: Switch require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/655796 (https://phabricator.wikimedia.org/T266479) [23:52:54] 10SRE, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10razzi) 05Open→03Resolved Cluster is up and running! [23:57:14] PROBLEM - PHP opcache health on mw2224 is CRITICAL: CRITICAL: opcache cache-hit ratio is below 99.85% https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health