[00:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Evening backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201204T0000). [00:00:04] No GERRIT patches in the queue for this window AFAICS. [00:03:21] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:53] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [00:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:07:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:10:16] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01004 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:12:47] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [00:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:51] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [00:14:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:53] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [00:16:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:22:33] (03PS1) 10Cwhite: profile: add logstash ecs 1.7.0-1 template [puppet] - 10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) [00:29:59] (03PS2) 10Cwhite: profile: add logstash ecs 1.7.0-1 template [puppet] - 
10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) [00:31:01] (03PS2) 10Cwhite: profile: make a logstash templates directory and relocate existing templates [puppet] - 10https://gerrit.wikimedia.org/r/645200 (https://phabricator.wikimedia.org/T234565) [00:31:30] (03PS3) 10Cwhite: profile: make a logstash templates directory and relocate existing templates [puppet] - 10https://gerrit.wikimedia.org/r/645200 (https://phabricator.wikimedia.org/T234565) [00:32:36] (03PS3) 10Cwhite: profile: add logstash ecs 1.7.0-1 template [puppet] - 10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) [00:34:27] (03PS4) 10Cwhite: profile: add logstash ecs 1.7.0-1 template [puppet] - 10https://gerrit.wikimedia.org/r/645209 (https://phabricator.wikimedia.org/T234565) [00:41:44] (03CR) 10CRusnov: "This change is ready for review." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [00:41:58] (03CR) 10jerkins-bot: [V: 04-1] ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [00:42:30] (03CR) 10CRusnov: "This is a first pass, i need to possibly clean up and test more of it. 
The core bits are tested independently, but I cannot test on -next " [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [00:46:09] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 49424696 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:46:59] (03PS2) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) [00:47:01] (03PS1) 10Andrew Bogott: Make cloudvirt2002-dev ceph-enabled [puppet] - 10https://gerrit.wikimedia.org/r/645213 (https://phabricator.wikimedia.org/T265965) [00:47:29] (03CR) 10jerkins-bot: [V: 04-1] ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [00:47:43] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 745563800 and 160 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:47:53] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 489235888 and 170 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:49:55] PROBLEM - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 144477792 and 292 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:15] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 286726232 and 312 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:50:33] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 308647632 and 328 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:56:53] (03PS3) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) [00:59:29] (03PS1) 10Cwhite: add version into index pattern at build time [software/ecs] - 10https://gerrit.wikimedia.org/r/645214 [00:59:39] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1628094496 and 128 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:01:54] Hmm, my familiarity with maps is quite low but will try to poke around here [01:03:04] Created https://phabricator.wikimedia.org/T269406 [01:04:04] !log T269406 https://grafana.wikimedia.org/d/000000305/maps-performances?viewPanel=11&orgId=1&var-cluster=maps1&from=1606827063027&to=1607043666975 shows that the normal daily dropoff in lag did not occur today, leading to the criticals. 
It's possible some sort of daily job has failed [01:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:12] T269406: MAPS osm replication lag critical in eqiad - https://phabricator.wikimedia.org/T269406 [01:06:03] (03PS4) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) [01:06:15] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1164393440 and 60 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:11:39] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 70720 and 143 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:12:27] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 22200 and 193 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:12:59] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 42672 and 225 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:28] ^ Right as I was about to ack them all too [01:13:35] Well can't hurt to ack the remaining ones I suppose [01:13:45] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1378848218920 and 0 seconds Ryan Kemper https://phabricator.wikimedia.org/T269406 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:45] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 109331096 and 192 seconds Ryan Kemper https://phabricator.wikimedia.org/T269406 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:45] 
ACKNOWLEDGEMENT - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 9981439448 and 670 seconds Ryan Kemper https://phabricator.wikimedia.org/T269406 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:45] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 547280504 and 266 seconds Ryan Kemper https://phabricator.wikimedia.org/T269406 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:45] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1285039504 and 241 seconds Ryan Kemper https://phabricator.wikimedia.org/T269406 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:45] ACKNOWLEDGEMENT - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 3277376976 and 381 seconds Ryan Kemper https://phabricator.wikimedia.org/T269406 https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:51] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 275 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:13:53] RECOVERY - Postgres Replication Lag on maps1007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74784 and 278 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:14:47] RECOVERY - Postgres Replication Lag on maps1008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 117584 and 333 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:15:23] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 182880 and 367 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:17:13] RECOVERY - Postgres Replication 
Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 144560 and 477 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:24:07] (03CR) 10Andrew Bogott: [C: 03+2] Make cloudvirt2002-dev ceph-enabled [puppet] - 10https://gerrit.wikimedia.org/r/645213 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [01:24:14] (03PS2) 10Andrew Bogott: Make cloudvirt2002-dev ceph-enabled [puppet] - 10https://gerrit.wikimedia.org/r/645213 (https://phabricator.wikimedia.org/T265965) [01:25:05] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Make cloudvirt2002-dev ceph-enabled [puppet] - 10https://gerrit.wikimedia.org/r/645213 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [01:26:49] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 53030664 and 682 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:28:29] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1257728 and 780 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:38:43] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 26249232 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:42:01] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [01:42:05] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2029576 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:43:55] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [01:44:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:52:22] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: 
POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 58543528 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:53:26] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1010568 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:54:32] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22322200 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:55:34] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1427120 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [03:43:01] PROBLEM - SSH on ms-be1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:46:15] RECOVERY - SSH on ms-be1030 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [03:50:20] ACKNOWLEDGEMENT - MD RAID on ms-be1022 is CRITICAL: CRITICAL: State: degraded, Active: 2, Working: 2, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T269409 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [03:50:24] 10Operations, 10ops-eqiad: Degraded RAID on ms-be1022 - https://phabricator.wikimedia.org/T269409 (10ops-monitoring-bot) [04:21:23] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:26:35] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:51] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:40:52] (03PS1) 10Ammarpad: Remove unsupported arg in MediaWiki::doPostOutputShutdown() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 [04:41:40] (03PS2) 10Ammarpad: Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 [04:43:33] (03PS1) 10Andrew Bogott: nova compute: make live_migration_uri dc-specific [puppet] - 10https://gerrit.wikimedia.org/r/645223 (https://phabricator.wikimedia.org/T265965) [04:45:55] (03CR) 10DannyS712: [C: 03+1] Remove unsupported arg in MediaWiki::doPostOutputShutdown() call [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645222 (owner: 10Ammarpad) [04:46:51] (03CR) 10Andrew Bogott: [C: 03+2] nova compute: make live_migration_uri dc-specific [puppet] - 10https://gerrit.wikimedia.org/r/645223 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [04:50:49] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:53:19] (03PS1) 10Andrew Bogott: Cloudvirt2001-dev make a ceph-backed hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/645224 (https://phabricator.wikimedia.org/T265965) [04:55:55] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:58:05] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:39:37] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:54:43] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10Marostegui) Thank you both! [05:58:33] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:16:48] (03PS1) 10Marostegui: db1151,db1152,db1153: Clarify the future of these host [puppet] - 10https://gerrit.wikimedia.org/r/645226 (https://phabricator.wikimedia.org/T269324) [06:18:43] (03CR) 10Marostegui: [C: 03+2] db1151,db1152,db1153: Clarify the future of these host [puppet] - 10https://gerrit.wikimedia.org/r/645226 (https://phabricator.wikimedia.org/T269324) (owner: 10Marostegui) [06:21:08] (03PS1) 10Marostegui: install_server: Do not reimage clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/645227 (https://phabricator.wikimedia.org/T267090) [06:21:22] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage clouddb1019 [puppet] - 10https://gerrit.wikimedia.org/r/645227 (https://phabricator.wikimedia.org/T267090) (owner: 10Marostegui) [06:22:03] (03CR) 10Marostegui: "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/645114 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [06:22:03] (03CR) 10Marostegui: [C: 03+2] wikireplicas: let clouddb1020 join the party [puppet] - 10https://gerrit.wikimedia.org/r/645114 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [06:23:38] (03PS2) 10Marostegui: wikireplicas: let clouddb1020 join the party [puppet] - 10https://gerrit.wikimedia.org/r/645114 (https://phabricator.wikimedia.org/T268312) (owner: 10Bstorm) [06:33:44] 10Operations, 10DBA: Refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10Marostegui) [06:33:46] 10Operations, 10DBA: db1080-95 batch possibly suffering BBU issues - https://phabricator.wikimedia.org/T258386 (10Marostegui) 05Open→03Declined All these hosts are going away once T258361 is completed so I am going to close this, as we are not really going to do anything to this list other than decommissio... [07:02:38] !log Increase pvs on db[1151-1155] T269324 T268742 [07:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:52] T268742: Test upgrading sanitarium hosts to Buster + 10.4 - https://phabricator.wikimedia.org/T268742 [07:02:53] T269324: Productionize x2 databases - https://phabricator.wikimedia.org/T269324 [07:09:29] !log Stop mysql on clouddb1016 to clone clouddb1020 T267090 [07:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:37] T267090: Productionize clouddb10[13-20] - https://phabricator.wikimedia.org/T267090 [07:21:12] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:28] (03PS2) 10Elukey: role::zookeeper:test: set cluster name and prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/645188 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [07:25:32] (03CR) 10Elukey: [C: 03+2] 
role::zookeeper:test: set cluster name and prometheus instance [puppet] - 10https://gerrit.wikimedia.org/r/645188 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [07:29:26] (03CR) 10Elukey: [C: 03+2] profile::hive::client: rename parameter [puppet] - 10https://gerrit.wikimedia.org/r/645122 (owner: 10Elukey) [07:34:00] (03PS1) 10Elukey: admin: remove access for user dstrine [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) [07:35:34] (03CR) 10DannyS712: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645153 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201204T0800) [08:02:40] (03CR) 10Ayounsi: "Thanks!" (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [08:03:58] (03PS7) 10Ayounsi: Run Homer during the decom cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 [08:04:01] (03CR) 10Muehlenhoff: [C: 03+2] dumps::generation::server::statsender: Drop OS check [puppet] - 10https://gerrit.wikimedia.org/r/645056 (owner: 10Muehlenhoff) [08:11:27] (03CR) 10Muehlenhoff: [C: 04-1] "David is still a member of the cn=wmf LDAP group and I guess he still uses that for things like Turnilo or Logstash? If so, they need to b" [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [08:14:13] (03CR) 10Ayounsi: [C: 03+1] "Thanks! Feel free to convert anything that can to the common logging schema if it makes sens to do it now." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/645181 (https://phabricator.wikimedia.org/T268806) (owner: 10Cwhite) [08:17:24] (03CR) 10Muehlenhoff: [C: 03+2] Move bsd-mailx (providing mail(1)) to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/645071 (https://phabricator.wikimedia.org/T268725) (owner: 10Muehlenhoff) [08:21:18] 10Operations, 10Parsoid, 10serviceops: Upgrade Parsoid servers to buster - https://phabricator.wikimedia.org/T268524 (10MoritzMuehlenhoff) >>! In T268524#6667989, @Dzahn wrote: > What can I say.. it did not happen this time. Reimaged, ran puppet.. waited a bit, checked Icinga.. all green with the following e... [08:30:58] (03CR) 10Elukey: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [08:37:26] (03CR) 10Muehlenhoff: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/645275 (https://phabricator.wikimedia.org/T268801) (owner: 10Elukey) [08:58:22] !log installing lxml security updates [08:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:38] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:26:46] !log installing mutt security updates [09:26:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:46] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:21] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10faidon) OK, to add a little more color: - The VLAN configuration is not important. 
`brctl addif brq7425e328-56 eno2np1` is enough to r... [09:30:47] !log installing zsh security updates on stretch [09:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:34] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [09:54:34] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:54:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:54] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:00:17] (03PS1) 10Muehlenhoff: Add IDP service definition for RT [puppet] - 10https://gerrit.wikimedia.org/r/645306 [10:01:56] PROBLEM - Check systemd state on cumin2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:34] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:06:15] (03PS1) 10Vlad.shapik: CommonSettings: OAuth 2.0 refresh tokens expire after 1 minute [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645308 (https://phabricator.wikimedia.org/T269152) [10:06:35] 10Operations, 10Pybal, 10Traffic: pybal should automatically reconnect to etcd - https://phabricator.wikimedia.org/T169765 (10elukey) Re-discovered this issue after working on T267065 In the task two etcd/zookeeper nodes were scheduled to be moved: conf1005 and conf1006. We were able to move the former to a... [10:11:38] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:14:20] (03PS1) 10Ammarpad: Assign urlshortener-create-url permission [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645309 (https://phabricator.wikimedia.org/T229633) [10:21:04] 10Operations, 10SRE-Access-Requests: Requesting access to maps for mbsantos and jgiannelos - https://phabricator.wikimedia.org/T269357 (10akosiaris) >>! In T269357#6666944, @MSantos wrote: >> I am also interested in is the `to be testing the new puppet rules` part. Could you please share a bit on how this will... [10:24:02] (03PS1) 10Muehlenhoff: Make check-cumin-aliases always return 0 [puppet] - 10https://gerrit.wikimedia.org/r/645311 [10:25:49] (03CR) 10Alexandros Kosiaris: [C: 03+2] Fix formatting issues in README.rst [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645123 (owner: 10Ahmon Dancy) [10:26:58] (03Merged) 10jenkins-bot: Fix formatting issues in README.rst [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/645123 (owner: 10Ahmon Dancy) [10:28:38] !log resetting cumin-check-aliases.service on cumin* hosts [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:12] RECOVERY - Check systemd state on cumin2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:46] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:31:02] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:31:16] !log setting db1133 as read-write for backup testing [10:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:44] RECOVERY - Prometheus jobs reduced 
availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:35:17] 10Operations, 10Puppet, 10Mail, 10User-MoritzMuehlenhoff: Include mail on standard_packages.pp - https://phabricator.wikimedia.org/T268725 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff bsd-mailx is now installed on all Stretch and Buster systems. [10:37:25] (03PS1) 10Daniel Kinzler: Revert "Hard-deprecate all public property access on CacheTime and ParserOutput." [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 [10:37:56] (03PS3) 10Jbond: realm: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:39:15] 10Operations, 10conftool, 10serviceops, 10Datacenter-Switchover: Disable maintenance scripts via conftool - https://phabricator.wikimedia.org/T266717 (10Volans) >>! In T266717#6668161, @RLazarus wrote: > I'm not sure if the switchdc cookbook could automatically deduce the correct value for `maintenance_hos... [10:40:16] (03CR) 10Jbond: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [10:57:40] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:58:12] (03PS1) 10Mforns: analytics::refinery::job::data_purge.pp: Fix netflow typo [puppet] - 10https://gerrit.wikimedia.org/r/645314 [11:00:15] (03CR) 10jerkins-bot: [V: 04-1] Revert "Hard-deprecate all public property access on CacheTime and ParserOutput." 
[core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 (owner: 10Daniel Kinzler) [11:03:17] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [11:05:22] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10Volans) >>! In T268202#6668134, @ops-monitoring-bot wrote: > Icinga downtime for 40 days, 0:00:00 set by razzi@cumin1001 on 1 host(s) and their services with reason: n... [11:06:32] (03PS1) 10Muehlenhoff: Remove obsolete template [puppet] - 10https://gerrit.wikimedia.org/r/645315 [11:10:29] (03CR) 10Elukey: [C: 03+2] analytics::refinery::job::data_purge.pp: Fix netflow typo [puppet] - 10https://gerrit.wikimedia.org/r/645314 (owner: 10Mforns) [11:10:45] (03CR) 10Jbond: [C: 03+1] Add IDP service definition for RT [puppet] - 10https://gerrit.wikimedia.org/r/645306 (owner: 10Muehlenhoff) [11:12:27] (03CR) 10Volans: Make check-cumin-aliases always return 0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645311 (owner: 10Muehlenhoff) [11:17:07] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/643979 (owner: 10Ayounsi) [11:17:11] (03CR) 10Muehlenhoff: Make check-cumin-aliases always return 0 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645311 (owner: 10Muehlenhoff) [11:19:40] (03PS4) 10Aklapper: Phab: Use our custom Priority field value in tooltip on Reports page [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) [11:20:16] (03CR) 10Aklapper: "Two years later, who to give a +2?" [puppet] - 10https://gerrit.wikimedia.org/r/455271 (https://phabricator.wikimedia.org/T91428) (owner: 10Aklapper) [11:23:31] (03CR) 10Volans: [C: 03+1] "LGTM but depends on enabling the email send for the systemd timer." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645311 (owner: 10Muehlenhoff) [11:24:10] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime [11:24:10] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:56] (03PS7) 10JMeybohm: Add calico helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/644462 (https://phabricator.wikimedia.org/T267653) [11:27:58] (03PS1) 10JMeybohm: Split out RBAC rules and service accoutns for typa and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) [11:33:29] (03CR) 10Volans: [C: 04-1] "Probably not the bet approach, also unrelated changes inline." (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [11:33:50] (03PS1) 10Jbond: pki::server: add ocsp proxy [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) [11:34:27] (03CR) 10Muehlenhoff: "The approach looks good to me, but I can't conclusively tell whether all the required grep/awk/shell features are implemented in the busyb" [puppet] - 10https://gerrit.wikimedia.org/r/645081 (owner: 10Jbond) [11:35:19] (03CR) 10jerkins-bot: [V: 04-1] pki::server: add ocsp proxy [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [11:35:35] (03PS1) 10Elukey: profile::kerberos::client: improve the user experience with kinit [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) [11:36:54] (03PS2) 10Elukey: profile::kerberos::client: improve the user experience with kinit [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) [11:37:06] (03PS2) 
10Jbond: pki::server: add ocsp proxy [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) [11:38:15] (03CR) 10Elukey: "Moritz: this is an idea to improve the user experience of people, but if you think it is not worth it I'll trash it :D" [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [11:38:19] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26924/console" [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [11:39:35] (03PS2) 10JMeybohm: Split out RBAC rules and service accoutns for typa and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) [11:43:08] (03PS2) 10Daniel Kinzler: Revert "Hard-deprecate all public property access on CacheTime and ParserOutput." [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 [11:43:10] (03PS3) 10Jbond: pki::server: add ocsp proxy [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) [11:43:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26925/console" [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [11:46:22] (03PS4) 10Jbond: ki::server: add ocsp proxy [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) [11:47:18] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26926/console" [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [11:47:47] (03PS6) 10Volans: sre.hosts.downtime: convert to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 
(https://phabricator.wikimedia.org/T221212) [11:48:03] (03PS3) 10Daniel Kinzler: Revert "Hard-deprecate all public property access on CacheTime and ParserOutput." [core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 (https://phabricator.wikimedia.org/T269396) [11:48:08] (03CR) 10Jbond: [V: 03+1 C: 03+2] ki::server: add ocsp proxy [puppet] - 10https://gerrit.wikimedia.org/r/645318 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [11:48:33] (03CR) 10Volans: "Thanks for the review, replies inline." (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/633484 (https://phabricator.wikimedia.org/T221212) (owner: 10Volans) [11:50:14] (03PS5) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [11:50:47] (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo) [11:51:49] (03PS1) 10Alexandros Kosiaris: Add tokens and users for 3 new k8s services [labs/private] - 10https://gerrit.wikimedia.org/r/645323 (https://phabricator.wikimedia.org/T265893) [11:52:31] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add tokens and users for 3 new k8s services [labs/private] - 10https://gerrit.wikimedia.org/r/645323 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [11:53:35] (03PS6) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [11:54:52] (03PS1) 10Jbond: pki: fix proxy pass [puppet] - 10https://gerrit.wikimedia.org/r/645324 [11:56:26] (03CR) 10Jbond: [C: 03+2] pki: fix proxy pass [puppet] - 10https://gerrit.wikimedia.org/r/645324 (owner: 10Jbond) [11:57:22] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26928/console" 
[puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [11:58:01] akosiaris: will merge you priv repo changes [11:58:42] (03CR) 10Muehlenhoff: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/645120 (owner: 10Elukey) [11:59:41] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Thanks! apertium and linkrecommendation I 'll proceed with the rest of the changes anyway, socketpuppet-api, I 'll leave to hnowlan." [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [12:00:06] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add tokens and users for 3 new k8s services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645064 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [12:02:34] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:04:36] (03PS1) 10Jbond: pki: dont purge responses dir [puppet] - 10https://gerrit.wikimedia.org/r/645327 [12:05:53] jbond42: thanks! [12:06:17] np [12:06:29] (03CR) 10Jbond: [C: 03+2] pki: dont purge responses dir [puppet] - 10https://gerrit.wikimedia.org/r/645327 (owner: 10Jbond) [12:07:14] (03CR) 10Alexandros Kosiaris: [C: 03+2] kube-apiserver: Use the infrastructure users file directly [puppet] - 10https://gerrit.wikimedia.org/r/645053 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:07:46] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:21:39] (03PS1) 10Marostegui: db1154,db1155: Clarify their future [puppet] - 10https://gerrit.wikimedia.org/r/645333 (https://phabricator.wikimedia.org/T268742) [12:22:18] (03CR) 10Marostegui: [C: 03+2] db1154,db1155: Clarify their future [puppet] - 10https://gerrit.wikimedia.org/r/645333 (https://phabricator.wikimedia.org/T268742) (owner: 10Marostegui) [12:28:18] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 226419400 and 276 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:50] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 19642752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:28:52] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 88561192 and 310 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:08] PROBLEM - Postgres Replication Lag on maps1006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 529814872 and 325 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:29:10] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1958760144 and 328 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:31:06] (03PS4) 10Alexandros Kosiaris: deployment_server: Add k8s-staging-codfw [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) [12:31:08] (03PS4) 10Alexandros Kosiaris: k8s:apiserver: Manage kube user/group [puppet] - 10https://gerrit.wikimedia.org/r/645054 (https://phabricator.wikimedia.org/T244335) [12:31:46] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 
(host:localhost) 1421552 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:16] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1394616 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:30] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26930/console" [puppet] - 10https://gerrit.wikimedia.org/r/645054 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:32:34] RECOVERY - Postgres Replication Lag on maps1006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1252248 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:32:36] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1829944 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:33:31] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26931/console" [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:34:02] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 59723496 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:06] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 297894240 and 169 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:34:44] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] "PCC is happy, comments addressed, merging" [puppet] - 10https://gerrit.wikimedia.org/r/645052 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:35:38] PROBLEM - Postgres Replication Lag 
on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 46771624 and 262 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:02] (03PS1) 10Muehlenhoff: Enable CAS for RT (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/645334 [12:36:18] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 453157592 and 301 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:56] PROBLEM - Postgres Replication Lag on maps1010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 253078288 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:36:58] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21223120 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:30] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 361532168 and 17 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:37:48] PROBLEM - Postgres Replication Lag on maps1009 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1255535552 and 66 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:24] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38630912 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:42] RECOVERY - Postgres Replication Lag on maps1010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 8384 and 49 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:38:42] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 50 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:14] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 74024 and 81 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:16] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 84 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:34] RECOVERY - Postgres Replication Lag on maps1009 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 266480 and 102 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:42] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 173399720 and 9 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:39:48] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 75328744 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:04] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 137200144 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:40:10] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s-codfw-staging: Add DNS RRs [dns] - 10https://gerrit.wikimedia.org/r/645041 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [12:41:08] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2016176 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:36] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 56 and 25 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:43:42] RECOVERY - Postgres Replication Lag on 
maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 30 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:24] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 72 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:44:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/645334 (owner: 10Muehlenhoff) [12:44:58] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 106 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:45:04] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 17648 and 112 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:47:23] (03PS2) 10Alexandros Kosiaris: Add apertium namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/644530 (https://phabricator.wikimedia.org/T255672) [12:47:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add apertium namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/644530 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [12:48:52] (03Merged) 10jenkins-bot: Add apertium namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/644530 (https://phabricator.wikimedia.org/T255672) (owner: 10Alexandros Kosiaris) [12:51:00] (03PS1) 10ArielGlenn: make sure bz2 header is read when reading blocks backwards [dumps/mwbzutils] - 10https://gerrit.wikimedia.org/r/645340 (https://phabricator.wikimedia.org/T269225) [12:51:10] RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:56:22] PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:31] (03PS2) 10Muehlenhoff: Enable CAS for RT (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/645334 [13:07:37] !log create apertium namespace on k8s clusters. T255672 [13:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:48] T255672: Migrate apertium to the deployment pipeline - https://phabricator.wikimedia.org/T255672 [13:19:20] 10Operations, 10ops-eqiad, 10cloud-services-team (Hardware): Degraded RAID on labstore1006 - https://phabricator.wikimedia.org/T268281 (10Cmjohnson) After a few days of back and forth nonsensical emails with HPE they are finally shipping the disk today. [13:19:48] (03PS3) 10Muehlenhoff: Enable CAS for RT [puppet] - 10https://gerrit.wikimedia.org/r/645334 [13:20:16] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) I started adding servers to the racks, I anticipate that these should be ready by the end of next week. [13:23:22] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10Jakob_WMDE) [13:27:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/645334 (owner: 10Muehlenhoff) [13:28:18] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10WMDE-leszek) I approve this request from WMDE side. @Ottomata still to make the approval for WMF. 
[13:28:33] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10WMDE-leszek) [13:31:01] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10ssingh) a:03ssingh [13:38:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'production' . [13:38:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'apertium' for release 'staging' . [13:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:42] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) There are a couple of fatal errors on this server. I have pulled a TSR report from the server and sent to Dell. This may be a bad motherboard. A fat... [13:54:02] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10elukey) @Cmjohnson is there any chance that Dell could replace the server? [13:56:51] (03CR) 10Muehlenhoff: "This would work as a hack, but there's a more elegant way to pursue this: We should rather explore krenew (shipped in Debian as package ks" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [13:57:39] 10Operations, 10ops-eqiad, 10Analytics-Clusters: an-presto1004 shows only the NIC in the boot list - https://phabricator.wikimedia.org/T268951 (10Cmjohnson) @elukey no, they will only replace parts [13:59:30] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T267672 (10Cmjohnson) @XioNoX it appears a new fiber will need to be run. 
I will get to this next week. Do we need to schedule any downtime? The total time offline is 1 minute (long enough for me to swap out... [14:00:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] "2 pedantic comments, but LGTM." (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [14:07:33] (03PS3) 10Elukey: profile::kerberos::client: improve the user experience with kinit [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) [14:08:06] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10Ottomata) Ah, we should for sure not page on this. I just looked, and if monitoring is enabled we set `critical => true` for the Kafka Broker Server process: https://... [14:08:59] (03CR) 10Elukey: "> Patch Set 2:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [14:09:48] 10Operations, 10SRE-Access-Requests: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10Ottomata) APPROVED! @WMDE-leszek will also need a Kerberos principal and should be in one of either the `wmf` or `nda` LDAP group. 
[14:11:01] (03PS4) 10Elukey: profile::kerberos::client: improve the user experience with kinit [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) [14:15:32] (03PS1) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:15:57] (03CR) 10jerkins-bot: [V: 04-1] cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [14:18:30] (03PS1) 10Andrew Bogott: wmcs instance backups: rearrange backups and reduce storage time [puppet] - 10https://gerrit.wikimedia.org/r/645368 (https://phabricator.wikimedia.org/T269419) [14:18:50] (03PS2) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:19:17] (03CR) 10jerkins-bot: [V: 04-1] cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [14:22:02] (03PS3) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:22:27] (03CR) 10jerkins-bot: [V: 04-1] cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [14:22:58] (03CR) 10Andrew Bogott: [C: 03+2] wmcs instance backups: rearrange backups and reduce storage time [puppet] - 10https://gerrit.wikimedia.org/r/645368 (https://phabricator.wikimedia.org/T269419) (owner: 10Andrew Bogott) [14:23:58] (03PS4) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:25:25] 10Operations, 10ops-eqiad: Interface errors on cr1-eqiad:xe-3/2/1 - https://phabricator.wikimedia.org/T267672 (10ayounsi) No 
need to schedule anything. Thanks. [14:28:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/645334 (owner: 10Muehlenhoff) [14:29:10] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [14:29:47] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] k8s:apiserver: Manage kube user/group [puppet] - 10https://gerrit.wikimedia.org/r/645054 (https://phabricator.wikimedia.org/T244335) (owner: 10Alexandros Kosiaris) [14:35:38] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01002 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:36:55] (03PS1) 10Alexandros Kosiaris: kubernetes::master: Don't create the chained cert [puppet] - 10https://gerrit.wikimedia.org/r/645370 [14:38:38] (03PS1) 10Elukey: zookeeper: Support a standalone server's mbeans in the JMX exporter's conf [puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) [14:38:48] (03CR) 10Alexandros Kosiaris: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26940/console" [puppet] - 10https://gerrit.wikimedia.org/r/645370 (owner: 10Alexandros Kosiaris) [14:38:51] (03PS1) 10Ssingh: admin: add jakob to analytics groups (wmde) [puppet] - 10https://gerrit.wikimedia.org/r/645372 (https://phabricator.wikimedia.org/T269444) [14:39:38] (03CR) 10Alexandros Kosiaris: [V: 03+1 C: 03+2] kubernetes::master: Don't create the chained cert [puppet] - 10https://gerrit.wikimedia.org/r/645370 (owner: 10Alexandros Kosiaris) [14:41:58] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26941/console" [puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey) [14:42:41] 10Operations, 10SRE-Access-Requests, 
10Wikimedia-Mailing-lists: Please create testing-infrastructure mailing list - https://phabricator.wikimedia.org/T269327 (10ssingh) 05Open→03Resolved The requested mailing list has been created. Marking this as resolved but please reopen if there are any issues. Thanks! [14:43:08] (03PS5) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:44:06] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10ssingh) [14:48:32] 10Operations, 10Wikidata, 10Wikidata-Query-Service: Migrate WDQS to Debian Buster - https://phabricator.wikimedia.org/T244753 (10Gehel) [14:50:03] (03PS6) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:50:05] (03PS1) 10Jbond: cfssl: add ocsp timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/645373 [14:52:09] (03CR) 10Elukey: [V: 03+1] "For some reason this shows up as no op, but it must be something weird between the JMX define and pcc."
[puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey) [14:52:49] (03PS1) 10Alexandros Kosiaris: k8s: Partially revert the removal of chained certs [puppet] - 10https://gerrit.wikimedia.org/r/645374 [14:52:53] (03PS7) 10Jbond: cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) [14:55:56] (03CR) 10Alexandros Kosiaris: [C: 03+2] k8s: Partially revert the removal of chained certs [puppet] - 10https://gerrit.wikimedia.org/r/645374 (owner: 10Alexandros Kosiaris) [14:57:33] (03CR) 10Jbond: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey) [14:58:25] (03CR) 10Jbond: [C: 03+2] cfssl: add ocsp refresh script and timer [puppet] - 10https://gerrit.wikimedia.org/r/645367 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [14:58:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10ssingh) >>! In T269444#6669321, @Ottomata wrote: > APPROVED! 😄 > @WMDE-leszek will also need a Kerberos principal and should be in one of either the `... [14:58:32] (03CR) 10Elukey: [V: 03+1] "I tested this on an-conf1001 (one of three nodes of the zookeeper analytics cluster), checking the before/after diff, and the only thing t" [puppet] - 10https://gerrit.wikimedia.org/r/645371 (https://phabricator.wikimedia.org/T268202) (owner: 10Elukey) [14:59:11] jbond42: thanks :D [14:59:14] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytics Data for Jakob_WMDE - https://phabricator.wikimedia.org/T269444 (10Ottomata) OH OOPS I did thank you.
[14:59:17] :) np [15:00:43] (03PS2) 10Jbond: cfssl: add ocsp timer jobs [puppet] - 10https://gerrit.wikimedia.org/r/645373 [15:01:37] (03PS1) 10Alexandros Kosiaris: linkrecommendation: Create the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/645376 (https://phabricator.wikimedia.org/T265893) [15:03:13] (03PS3) 10Jbond: cfssl: move ocsprefesh to a timer [puppet] - 10https://gerrit.wikimedia.org/r/645373 (https://phabricator.wikimedia.org/T268882) [15:03:39] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) >>! In T260445#6669187, @Cmjohnson wrote: > I started adding servers to the racks, I anticipate that these should be ready by the end of nex... [15:05:17] (03PS2) 10Alexandros Kosiaris: linkrecommendation: Create the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/645376 (https://phabricator.wikimedia.org/T265893) [15:06:47] (03PS4) 10Jbond: cfssl: move ocsprefesh to a timer [puppet] - 10https://gerrit.wikimedia.org/r/645373 (https://phabricator.wikimedia.org/T268882) [15:08:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Create the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/645376 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [15:09:31] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10Cmjohnson) @elukey I was barely able to scrape enough u space to get all of these into racks. I will do my best to balance but most of the free 2U... 
[15:09:42] (03Merged) 10jenkins-bot: linkrecommendation: Create the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/645376 (https://phabricator.wikimedia.org/T265893) (owner: 10Alexandros Kosiaris) [15:11:22] (03PS5) 10Jbond: cfssl: move ocsprefesh to a timer [puppet] - 10https://gerrit.wikimedia.org/r/645373 (https://phabricator.wikimedia.org/T268882) [15:13:24] (03CR) 10Jbond: [C: 03+2] cfssl: move ocsprefesh to a timer [puppet] - 10https://gerrit.wikimedia.org/r/645373 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [15:14:37] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:48] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:06] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[15:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:15] (03PS1) 10Jbond: cfssl: add ocsp refresh timer [puppet] - 10https://gerrit.wikimedia.org/r/645377 (https://phabricator.wikimedia.org/T268882) [15:20:40] (03PS2) 10Jbond: cfssl: add ocsp refresh timer [puppet] - 10https://gerrit.wikimedia.org/r/645377 (https://phabricator.wikimedia.org/T268882) [15:22:45] (03CR) 10Jbond: [C: 03+2] cfssl: add ocsp refresh timer [puppet] - 10https://gerrit.wikimedia.org/r/645377 (https://phabricator.wikimedia.org/T268882) (owner: 10Jbond) [15:27:10] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt2001-dev make a ceph-backed hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/645224 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [15:27:16] (03PS2) 10Andrew Bogott: Cloudvirt2001-dev make a ceph-backed hypervisor [puppet] - 10https://gerrit.wikimedia.org/r/645224 (https://phabricator.wikimedia.org/T265965) [15:27:39] 10Operations, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker10[18-41] - https://phabricator.wikimedia.org/T260445 (10elukey) Updating the task after a chat over IRC: ideally the 24 new nodes could be spread 6 on each row, and some asymmetry in the final distributio... [15:28:58] (03PS1) 10Jbond: cfssl::ocsp: add description to timer [puppet] - 10https://gerrit.wikimedia.org/r/645384 [15:29:38] (03CR) 10Jbond: [C: 03+2] cfssl::ocsp: add description to timer [puppet] - 10https://gerrit.wikimedia.org/r/645384 (owner: 10Jbond) [15:30:19] (03CR) 10Muehlenhoff: "Adding Otto" [puppet] - 10https://gerrit.wikimedia.org/r/645372 (https://phabricator.wikimedia.org/T269444) (owner: 10Ssingh) [15:32:10] (03CR) 10Ppchelko: [C: 03+1] "Leaving it up to the train conductors to decide when to merge." 
[core] (wmf/1.36.0-wmf.20) - 10https://gerrit.wikimedia.org/r/645312 (https://phabricator.wikimedia.org/T269396) (owner: 10Daniel Kinzler) [15:32:37] 10Operations, 10vm-requests, 10Patch-For-Review: Eq: 5 VM request for kafka-test-eqiad cluster - https://phabricator.wikimedia.org/T268202 (10razzi) @Ottomata Yeah, I'll add an `$is_critical` parameter. [15:34:16] (03PS1) 10Jbond: cfssl: comment out timer as it causes puppet compile to hang [puppet] - 10https://gerrit.wikimedia.org/r/645387 [15:34:51] (03CR) 10Jbond: [C: 03+2] cfssl: comment out timer as it causes puppet compile to hang [puppet] - 10https://gerrit.wikimedia.org/r/645387 (owner: 10Jbond) [15:36:42] (03PS1) 10Jbond: (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 [15:45:08] (03PS1) 10Jbond: compiler: add defaults to cloud [puppet] - 10https://gerrit.wikimedia.org/r/645388 [15:45:09] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:49] (03CR) 10Jbond: [C: 03+2] compiler: add defaults to cloud [puppet] - 10https://gerrit.wikimedia.org/r/645388 (owner: 10Jbond) [15:47:04] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:14] (03CR) 10Ottomata: [C: 03+1] profile::kerberos::client: improve the user experience with kinit [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [15:54:38] RECOVERY - Kafka Broker Server #page on kafka-test1006 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [15:55:03] I was starting to worry [15:55:19] huge relief [15:57:41] razzi: ---^ [15:57:49] are you working on it? 
[15:59:22] elukey: yeah, working on it, and will make a patch to disable paging for this test cluster shortly [15:59:48] razzi: sure, but is everything downtimed? Let's try to coordinate to avoid other pages to SRE [15:59:51] thanks <3 here to help if you need anything [16:00:42] fwiw kafka-test1006 is downtimed since yesterday, that just doesn't suppress recoveries [16:01:08] yep yep, what I meant was also to avoid bringing up/down brokers for the moment [16:01:13] 👍 [16:01:14] to avoid other surprises :D [16:06:49] (03PS1) 10Andrew Bogott: Added dummy password for profile::mariadb::grants::cloudinfra::labspuppet_pass [labs/private] - 10https://gerrit.wikimedia.org/r/645393 [16:07:14] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Added dummy password for profile::mariadb::grants::cloudinfra::labspuppet_pass [labs/private] - 10https://gerrit.wikimedia.org/r/645393 (owner: 10Andrew Bogott) [16:07:21] (03PS1) 10Jbond: cfssl: use split not join [puppet] - 10https://gerrit.wikimedia.org/r/645394 [16:07:54] (03CR) 10Jbond: [C: 03+2] cfssl: use split not join [puppet] - 10https://gerrit.wikimedia.org/r/645394 (owner: 10Jbond) [16:16:33] elukey: razzi: btw in cases like these, you can/should set in hiera `profile::base::notifications: disabled` on the host, that will prevent any icinga pages from going out, and without the race condition of needing to create the alerts to then downtime them in icinga [16:16:40] elukey: razzi: https://wikitech.wikimedia.org/wiki/Icinga#Disabling_notifications_programmatically [16:18:10] cdanis: yep yep, the "critical" flag was set to true by mistake, Razzi is working on a patch [16:18:11] hm oh I did not know that cdanis thank you [16:18:44] cdanis: we don't need a page for the test cluster, but regular alert yes [16:18:49] yeah that was my fault i should have caught that, i told ra zzi to enable monitoring for the test broker, didn't catch the possible page [16:18:54] ya [16:20:43] (03CR) 10Elukey: [C: 03+2] profile::kerberos::client:
improve the user experience with kinit [puppet] - 10https://gerrit.wikimedia.org/r/645320 (https://phabricator.wikimedia.org/T268985) (owner: 10Elukey) [16:23:19] (03PS1) 10Jbond: cfssl: ocsp refresh fix minor script issues [puppet] - 10https://gerrit.wikimedia.org/r/645396 [16:24:26] (03CR) 10Jbond: [C: 03+2] cfssl: ocsp refresh fix minor script issues [puppet] - 10https://gerrit.wikimedia.org/r/645396 (owner: 10Jbond) [16:38:17] (03PS2) 10Jbond: (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 [16:38:56] (03PS3) 10Jbond: (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 [16:44:15] (03PS1) 10Razzi: kafka: make alerts not critical for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) [16:45:16] (03CR) 10Ottomata: [C: 03+1] kafka: make alerts not critical for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [16:50:31] (03PS1) 10Andrew Bogott: Glance: enable rbd for glance in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/645399 (https://phabricator.wikimedia.org/T265965) [16:51:01] (03CR) 10Andrew Bogott: [C: 03+2] Glance: enable rbd for glance in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/645399 (https://phabricator.wikimedia.org/T265965) (owner: 10Andrew Bogott) [16:53:06] (03PS1) 10Andrew Bogott: Codfw1dev cloudcontrol nodes: remove redundant ceph definition [puppet] - 10https://gerrit.wikimedia.org/r/645400 [16:54:05] (03CR) 10Andrew Bogott: [C: 03+2] Codfw1dev cloudcontrol nodes: remove redundant ceph definition [puppet] - 10https://gerrit.wikimedia.org/r/645400 (owner: 10Andrew Bogott) [17:00:49] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) I've upgraded the NIC bios from 21.40.20.00 to 
21.65.33.33 (latest). Handing this back to @andrew to push back into testing... [17:01:11] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) [17:02:27] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) [17:03:01] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) [17:06:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) I ran @faidon 's command on cloudvirt1025 and I still have an ssh session... so that seems promising. ` root@cloudvirt1025:~... [17:06:50] (03PS3) 10JMeybohm: Split out RBAC rules and service accoutns for typa and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) [17:07:47] (03PS4) 10JMeybohm: Split out RBAC rules and service accounts for typa and CNI [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) [17:12:09] (03CR) 10Elukey: [C: 03+1] "LGTM, left a not about a nit, but we'd need to ping other people to they are aware." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [17:13:01] (03CR) 10Elukey: [C: 04-1] "Sorry spoke too soon, I don't see the setting for kafka main, see role::kafka::main" [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [17:13:08] (03PS5) 10Razzi: sqoop: Ensure /tmp/sqoop-jars/ is present [puppet] - 10https://gerrit.wikimedia.org/r/644347 (https://phabricator.wikimedia.org/T251788) [17:16:44] 10Operations, 10LDAP-Access-Requests: Onboarding Genoveva, access request to ldap/wmf - https://phabricator.wikimedia.org/T269365 (10ssingh) 05Open→03Resolved Request has been merged and user has been added to the `wmf` group. Please reopen if there are any issues. Thanks! [17:20:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) IRC update: Andrew asked me to also flash cloudvirt1026 to increase their test pool, so it has now gone from 21.40.20.00 to 21.6... [17:22:36] !log [urbanecm@mwmaint1002 ~/uploads]$ mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Wilfredor .
# T269452 [17:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:05] T269452: Please upload large files to Wikimedia Commons - https://phabricator.wikimedia.org/T269452 [17:25:23] (03CR) 10JMeybohm: [C: 04-1] prometheus::k8s: Support arbitrary clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/644262 (owner: 10Alexandros Kosiaris) [17:29:33] 10Operations, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10Andrew) a:05Andrew→03aborrero [17:38:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) a:05Andrew→03RobH 1025 and 1026 look good! @robh, please upgrade 1027, 1028, 1029 and 1030 accordingly. Thank you! [17:38:48] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt1025 connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) Ease of reference: driver page and all versions. https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=7cm1n&...
[17:39:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) [17:49:44] (03CR) 10Razzi: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [17:50:02] (03PS2) 10Razzi: kafka: make alerts not critical for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) [18:00:13] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26944/console" [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:04:38] (03CR) 10Elukey: [V: 03+1] kafka: make alerts not critical for test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:08:05] (03PS1) 10JMeybohm: calico: Bind the calico-cni Role to the calico-cni user [deployment-charts] - 10https://gerrit.wikimedia.org/r/645408 (https://phabricator.wikimedia.org/T267653) [18:11:57] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) cloudvirt1027 nic firmware upgraded from 21.40.20.00 to 21.65.33.33; host has rebooted back into the OS but fails initial b... 
[18:16:56] (03PS1) 10JMeybohm: Order k8s_infrastructure_users by id [labs/private] - 10https://gerrit.wikimedia.org/r/645410 (https://phabricator.wikimedia.org/T244335) [18:16:59] (03PS1) 10JMeybohm: k8s_infrastructure_users: fix type of client-infrastructure [labs/private] - 10https://gerrit.wikimedia.org/r/645411 (https://phabricator.wikimedia.org/T244335) [18:17:02] (03PS1) 10JMeybohm: k8s_infrastructure_users: add calico-cni [labs/private] - 10https://gerrit.wikimedia.org/r/645412 (https://phabricator.wikimedia.org/T267653) [18:18:51] (03PS5) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) [18:18:53] (03CR) 10CRusnov: ganeti-netbox-sync: Add post-sync PuppetDB import where necessary (033 comments) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/645212 (https://phabricator.wikimedia.org/T263768) (owner: 10CRusnov) [18:22:34] (03CR) 10Cwhite: kafka: make alerts not critical for test cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:24:05] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) cloudvirt1029 nic firmware upgraded from 21.40.20.00 to 21.65.33.33 cloudvirt1030 nic firmware upgraded from 21.40.20.00 to... [18:25:07] (03PS3) 10Razzi: kafka: make alerts not critical for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) [18:25:09] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10RobH) These are all upgraded, and this should end up clearing this issue. I've not resolved it yet, but if it is still working b... 
[18:27:51] (03PS7) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [18:28:27] (03CR) 10jerkins-bot: [V: 04-1] [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 (owner: 10Jcrespo) [18:32:22] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Order k8s_infrastructure_users by id [labs/private] - 10https://gerrit.wikimedia.org/r/645410 (https://phabricator.wikimedia.org/T244335) (owner: 10JMeybohm) [18:32:28] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] k8s_infrastructure_users: fix type of client-infrastructure [labs/private] - 10https://gerrit.wikimedia.org/r/645411 (https://phabricator.wikimedia.org/T244335) (owner: 10JMeybohm) [18:32:33] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] k8s_infrastructure_users: add calico-cni [labs/private] - 10https://gerrit.wikimedia.org/r/645412 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [18:39:04] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/26946/console" [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:39:38] (03CR) 10Razzi: [C: 03+2] kafka: make alerts not critical for test cluster [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:40:20] (03CR) 10Elukey: [V: 03+1 C: 03+1] "Nice job! 
LGTM now :)" [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:41:23] (03CR) 10Dzahn: [C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/645398 (https://phabricator.wikimedia.org/T268202) (owner: 10Razzi) [18:46:11] (03CR) 10Dzahn: [C: 03+2] "confirmed, we are using the "httponly" config and port 443 is used by envoy to terminate TLS locally" [puppet] - 10https://gerrit.wikimedia.org/r/645315 (owner: 10Muehlenhoff) [19:02:25] (03CR) 10Dzahn: "compiled on all 5 classes. results inline. all noop" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:06:26] (03CR) 10Dzahn: [C: 03+2] mediawiki: replace hiera with lookup, add data types in all profiles (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:06:59] (03CR) 10Dzahn: [C: 03+2] mediawiki: replace hiera with lookup, add data types in all profiles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:08:42] (03CR) 10Dzahn: "this completes converting all mediawiki profiles away from hiera and adding data types to all parameters" [puppet] - 10https://gerrit.wikimedia.org/r/633029 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:11:57] (03PS8) 10Jcrespo: [WIP] We continue with swift listing and download tests for media backups [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/643980 [19:12:38] (03CR) 10Dzahn: [V: 03+1] "compiler shows this is complete noop on wdqs: https://puppet-compiler.wmflabs.org/compiler1001/26952/" [puppet] - 10https://gerrit.wikimedia.org/r/645203 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:14:18] (03CR) 10Dzahn: "> The error was actually a trailing slash." 
[puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:16:25] (03CR) 10Dzahn: "yea.. I mean.. this file is not even a proper class." [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:20:32] (03PS1) 10JMeybohm: k8s_infrastructure_users: add calicoctl [labs/private] - 10https://gerrit.wikimedia.org/r/645415 (https://phabricator.wikimedia.org/T267653) [19:20:57] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] k8s_infrastructure_users: add calicoctl [labs/private] - 10https://gerrit.wikimedia.org/r/645415 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [19:26:49] (03PS16) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:28:18] (03CR) 10jerkins-bot: [V: 04-1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [19:28:24] (03CR) 10CRusnov: "> Patch Set 15: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [19:29:33] (03PS17) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:30:54] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/26953/ https://puppet-compiler.wmflabs.org/compiler1002/26954/" [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:31:00] (03CR) 10jerkins-bot: [V: 04-1] netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) (owner: 10CRusnov) [19:32:14] (03PS2) 10JMeybohm: Remove calico::builder [puppet] - 
10https://gerrit.wikimedia.org/r/645078 (https://phabricator.wikimedia.org/T266893) [19:32:16] (03PS1) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [19:32:45] (03CR) 10Dzahn: "confirmed noop on cp1075,cp1078, mwdebug1003,..." [puppet] - 10https://gerrit.wikimedia.org/r/645202 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:34:37] (03PS18) 10CRusnov: netbox: Adjust settings for supporting Netbox 2.9 series [puppet] - 10https://gerrit.wikimedia.org/r/643354 (https://phabricator.wikimedia.org/T266488) [19:34:42] (03CR) 10JMeybohm: Split out RBAC rules and service accounts for typa and CNI (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/645317 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [19:38:10] (03PS2) 10Dzahn: ntp: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) [19:39:06] (03CR) 10Dzahn: [C: 03+1] "Let's get this merged. Do you want to split it to first do deployment-prep as Subbu suggests?" [puppet] - 10https://gerrit.wikimedia.org/r/577043 (owner: 10C. 
Scott Ananian) [19:39:40] (03CR) 10jerkins-bot: [V: 04-1] ntp: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:44:23] (03PS3) 10Dzahn: ntp: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) [19:46:00] (03CR) 10jerkins-bot: [V: 04-1] ntp: replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:47:45] (03PS4) 10Dzahn: ntp: replace hiera() with lookup(), move use_chrony to parameters [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) [19:49:41] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/26957/dns3001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/645206 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [19:56:11] (03PS7) 10Dzahn: httpd: make it possible to configure server admin email address [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [19:56:33] 10Operations, 10ops-codfw: codfw: ms-be2058 memory error - https://phabricator.wikimedia.org/T269266 (10Papaul) DellEMC Service Request # 1045379761 || POWEREDGE R740XD2 || Service Tag - JN1R673 || Memory Failure || DPS 412652293 [ ref:_00D0bGaMp._5002R1EuuuX:ref ] [19:57:41] 10Operations, 10ops-codfw: RMA failed codfw C7 switch - WMF6114 - https://phabricator.wikimedia.org/T267950 (10Papaul) Scheduled Delivery Monday 12/07/2020 Estimated Time by 9:00 P.M. View delivery time window with UPS My Choice®. Continue [19:57:53] (03CR) 10Dzahn: "sorry for letting this sit for so lon. I am now getting back to it. First I did the necessary manual rebase.
And now I think we should d" [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [20:01:34] (03CR) 10Dzahn: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [20:02:28] (03CR) 10Dzahn: "Yea..ehm.. so Krinkle said back in February "needs to moved instead of removed". Based on that it would be -1." [puppet] - 10https://gerrit.wikimedia.org/r/524925 (https://phabricator.wikimedia.org/T187716) (owner: 10Jforrester) [20:03:30] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/26958/" [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [20:11:32] (03PS2) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [20:13:42] (03PS1) 10Nray: Turn off Growth Study Screener Quick Survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/645425 (https://phabricator.wikimedia.org/T269369) [20:13:51] (03PS1) 10JMeybohm: Add tokens for calico::kubernetes cni and ctl [labs/private] - 10https://gerrit.wikimedia.org/r/645426 (https://phabricator.wikimedia.org/T267653) [20:14:05] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Add tokens for calico::kubernetes cni and ctl [labs/private] - 10https://gerrit.wikimedia.org/r/645426 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [20:21:03] (03PS3) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [20:28:09] (03PS4) 10JMeybohm: calico: Add support for calico 3.x with kubernetes datastore [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) [20:29:05] (03PS1) 10Dzahn: add 'mad' (Madurese) to project languages, create mad.wikipedia.org [dns] - 
10https://gerrit.wikimedia.org/r/645428 (https://phabricator.wikimedia.org/T269437) [20:30:43] (03CR) 10Dzahn: "approved by langcom: https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Madurese_2" [dns] - 10https://gerrit.wikimedia.org/r/645428 (https://phabricator.wikimedia.org/T269437) (owner: 10Dzahn) [20:31:19] (03CR) 10JMeybohm: "PCC looks fine for existing hosts: https://puppet-compiler.wmflabs.org/compiler1003/26961/" [puppet] - 10https://gerrit.wikimedia.org/r/645417 (https://phabricator.wikimedia.org/T267653) (owner: 10JMeybohm) [20:37:36] 10Operations, 10VPS-project-Codesearch: Graduate codesearch to production - https://phabricator.wikimedia.org/T268199 (10Dzahn) I agree that we should not fall into the trap of not doing things "because soon gitlab" because it will likely take a long time during which we will have gerrit, phabricator, github a... [20:38:04] (03CR) 10Andrew Bogott: [C: 03+2] openstack: Enable support for nested VMs [puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [20:38:09] (03PS11) 10Andrew Bogott: openstack: Enable support for nested VMs [puppet] - 10https://gerrit.wikimedia.org/r/638146 (https://phabricator.wikimedia.org/T267433) (owner: 10Ahmon Dancy) [20:38:20] (03CR) 10JMeybohm: [C: 03+1] "The change LGTM but PCC seems to treat the file as new (maybe that's normal when changing to template() and I have not seen it yet). 
In sp" [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [20:40:07] (03CR) 10Dzahn: "> The change LGTM but PCC seems to treat the file as new" [puppet] - 10https://gerrit.wikimedia.org/r/592712 (https://phabricator.wikimedia.org/T251005) (owner: 10Reedy) [20:45:35] (03PS1) 10Dzahn: httpd: change default server admin from webmaster@ to noc@ [puppet] - 10https://gerrit.wikimedia.org/r/645431 (https://phabricator.wikimedia.org/T251005) [21:24:04] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10wiki_willy) a:03Cmjohnson Arrived Dec 3 [21:36:00] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10wiki_willy) Hi @Cmjohnson - can you add the S/N for db1153: https://netbox.wikimedia.org/dcim/devices/2960/ Looks like it's shooting an error on this Netbox repo... [21:36:15] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: (Need By: 2020-11-29) rack/setup/install db11[51-76] - https://phabricator.wikimedia.org/T267043 (10wiki_willy) a:03Cmjohnson [21:55:10] (03CR) 10MF-Warburg: [C: 03+1] "Looks good to me and my limited knowledge of the technical aspects!" [dns] - 10https://gerrit.wikimedia.org/r/645428 (https://phabricator.wikimedia.org/T269437) (owner: 10Dzahn) [22:13:28] (03CR) 10Dzahn: [C: 03+2] add 'mad' (Madurese) to project languages, create mad.wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/645428 (https://phabricator.wikimedia.org/T269437) (owner: 10Dzahn) [22:17:50] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10RobH) [22:17:53] (03PS4) 10Jbond: (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 [22:18:13] (03CR) 10Dzahn: "@MF-Warburg thank you. 
https://mad.wikipedia.org works now fyi, if a new wiki is the first wiki to exist in this language then we have " [dns] - 10https://gerrit.wikimedia.org/r/645428 (https://phabricator.wikimedia.org/T269437) (owner: 10Dzahn) [22:18:14] 10Operations, 10ops-codfw, 10DC-Ops: (Need By: 2021-01-30) rack/setup/install frqueue2002.frack.codfw.wmnet - https://phabricator.wikimedia.org/T269481 (10RobH) [22:26:02] 10Operations, 10Puppet, 10User-jbond: require_package should mark packages as manually installed - https://phabricator.wikimedia.org/T195981 (10Dzahn) We are removing require_package across the board now in T266479. [22:27:04] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): cloudvirt10[25-30] connection issues on primary nic - https://phabricator.wikimedia.org/T269313 (10Andrew) 05Open→03Resolved [22:27:07] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Hardware): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [22:28:13] (03PS1) 10Dzahn: deployment_server: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/645442 (https://phabricator.wikimedia.org/T266479) [22:28:53] (03PS5) 10Jbond: (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 [22:30:31] (03PS1) 10Dzahn: mw_rc_irc: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/645443 (https://phabricator.wikimedia.org/T266479) [22:30:44] (03PS6) 10Jbond: (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 [22:32:40] (03PS1) 10Dzahn: ipmi: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/645444 (https://phabricator.wikimedia.org/T266479) [22:33:17] (03CR) 10Jbond: [C: 03+2] (WIP) CFSSL::ocsp refresh: add systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/645351 (owner: 10Jbond) [22:34:23] (03PS1) 10Dzahn: gerrit: require_package -> ensure_packages
[puppet] - 10https://gerrit.wikimedia.org/r/645445 (https://phabricator.wikimedia.org/T266479) [22:36:28] (03PS1) 10Dzahn: apt::repository: require_package -> ensure_packages [puppet] - 10https://gerrit.wikimedia.org/r/645446 (https://phabricator.wikimedia.org/T266479) [22:40:32] (03PS1) 10Jbond: DO NOT MERGE: test pcc bug [puppet] - 10https://gerrit.wikimedia.org/r/645448 [22:40:56] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: test pcc bug [puppet] - 10https://gerrit.wikimedia.org/r/645448 (owner: 10Jbond) [22:44:18] (03PS2) 10Jbond: DO NOT MERGE: test pcc bug [puppet] - 10https://gerrit.wikimedia.org/r/645448 [22:44:49] (03CR) 10jerkins-bot: [V: 04-1] DO NOT MERGE: test pcc bug [puppet] - 10https://gerrit.wikimedia.org/r/645448 (owner: 10Jbond) [23:01:29] 10Operations, 10ops-eqiad, 10DC-Ops: document all scs connections - https://phabricator.wikimedia.org/T175876 (10RobH) a:05RobH→03None [23:32:42] (03PS1) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [23:33:06] (03CR) 10jerkins-bot: [V: 04-1] wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [23:39:01] (03PS2) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [23:40:23] (03PS3) 10Dzahn: wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) [23:40:27] (03CR) 10jerkins-bot: [V: 04-1] wikistats: replace all cron jobs with systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/645455 (https://phabricator.wikimedia.org/T265138) (owner: 10Dzahn) [23:45:32] (03CR) 10RLazarus: sre.hosts.downtime: convert to class API (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/633484
(https://phabricator.wikimedia.org/T221212) (owner: 10Volans)