[07:56:07] !log Upgrading ceph services on eqiad, starting with mons/managers (T280641)
[07:56:08] dcaro: Unknown project "Upgrading"
[07:56:08] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[07:58:34] !log admin Upgrading ceph services on eqiad, starting with mons/managers (T280641)
[07:58:38] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:14:27] !log admin During the upgrade, ceph detected a clock skew on cloudcephmon1002, looking (T280641)
[08:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:14:33] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[08:15:30] !log admin During the upgrade, ceph detected a clock skew on cloudcephmon1002; it went away, I'm guessing systemd-timesyncd fixed it (T280641)
[08:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:18:07] !log admin During the upgrade, ceph detected a clock skew on cloudcephmon1002 and cloudcephmon1001; they are back (T280641)
[08:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:18:25] !log admin All eqiad ceph mons and mgrs upgraded (T280641)
[08:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:21:22] !log admin The clock skew seems intermittent, there's another task to follow it, T275860 (T280641)
[08:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:21:27] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[08:21:28] T275860: ceph: check time sync setup - https://phabricator.wikimedia.org/T275860
[08:21:50] !log admin Upgrading all the ceph osds on eqiad (T280641)
[08:21:53] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:58:25] !log admin During the upgrade, started getting the warning 'slow osd heartbeats in the back', meaning that pings between osds are really slow (up to 190s) (T280641)
[08:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:58:30] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[08:58:51] !log admin During the upgrade, started getting the warning 'slow osd heartbeats in the back', meaning that pings between osds are really slow (up to 190s), all from osd.58 (T280641)
[08:58:54] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[08:59:23] !log admin During the upgrade, started getting the warning 'slow osd heartbeats in the back', meaning that pings between osds are really slow (up to 190s), all from osd.58, currently on cloudcephosd1002 (T280641)
[08:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:03:52] !log admin Waiting for slow heartbeats from osd.58 (cloudcephosd1002) to recover... (T280641)
[09:03:56] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[09:03:56] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[10:34:35] !log admin Slow/blocked ops from cloudcephmon03, "osd_failure(failed timeout osd.32..." (cloudcephosd1005); unset the cluster noout/norebalance and it went away in a few secs, setting it again and continuing... (T280641)
[10:34:42] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:34:44] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
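
The noout/norebalance juggling logged above follows the usual rolling-restart pattern for a Ceph upgrade. A minimal sketch with the standard ceph CLI, illustrative only and not a record of the exact commands run during this window:

    # Keep the cluster from marking OSDs out or rebalancing while daemons restart.
    ceph osd set noout
    ceph osd set norebalance

    # After upgrading and restarting the OSDs on a host, watch it rejoin.
    ceph health detail    # surfaces warnings like slow OSD heartbeats or blocked ops
    ceph -s               # wait for PGs to return to active+clean before the next host

    # If ops get stuck behind the flags (as at 10:34), briefly clearing and
    # re-setting them lets the cluster re-evaluate, then maintenance continues.
    ceph osd unset norebalance
    ceph osd unset noout
    ceph osd set noout
    ceph osd set norebalance

    # Once every host is done, drop the flags for good.
    ceph osd unset noout
    ceph osd unset norebalance
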
[10:57:21] !log admin Got a PG stuck on 'remapping' after the OSD came up, had to unset norebalance and then set it again to get it unstuck (T280641)
[10:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[10:57:27] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[11:06:01] !log admin All ceph server side upgraded to Octopus! \o/ (T280641)
[11:06:06] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[11:06:06] T280641: ceph: Upgrade to latest Nautilus/Octopus to fix CVE-2021-20288 - https://phabricator.wikimedia.org/T280641
[12:54:28] !admin re-taking the upgrade of cloudvirt1039 now that we have the octopus repos/ceph services
[13:42:33] dcaro: you need a 'log' in ^
[13:42:53] :facepalm: xd
[14:06:41] This morning I'm getting permission denied trying to ssh into programs-and-events-dashboard.globaleducation.eqiad.wmflabs
[14:07:49] ragesoss: anything interesting in the VM "log" tab in horizon?
[14:10:48] !log tools.bodh Deploying I317addc133 (T280903)
[14:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.bodh/SAL
[14:10:51] T280903: bodh: Allow to input Q and L items - https://phabricator.wikimedia.org/T280903
[14:20:11] ragesoss: something is broken with ldap integration on that host, do you mind if I reboot it?
[14:21:18] andrewbogott that's fine
[14:21:56] !log globaleducation rebooting programs-and-events-dashboard to see if that gets ldap/user accounts sorted out
[14:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Globaleducation/SAL
[14:24:18] ragesoss: try now?
[14:28:42] andrewbogott I'm in!
[14:28:43] thanks!
[14:31:26] ragesoss: that VM doesn't look 100% healthy to me, but now you can see for yourself
[14:31:58] andrewbogott what looks unhealthy to you about it?
[14:32:45] (It's definitely not healthy insofar as Apache has been going down somewhat frequently over the last two weeks, but I haven't found the cause)
[14:41:50] ragesoss: I haven't looked in depth, but puppet throws a ton of warnings that I've never seen before
[14:42:00] It might be nothing, but it suggests some kind of weird version mismatch
[14:49:16] andrewbogott I guess that's because it's a pretty old VM and some things have been upgraded manually? I think I should retire this VM and build a new one for Rails. Should be easy now that the database is on a separate VM.
[14:57:26] ragesoss: yeah, that's what I suspect. Best to move up to Buster
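
For "permission denied" SSH failures like the one above, a quick hedged check of whether LDAP account resolution is working on the VM (assumes console or root access; the account name and whichever of sssd/nslcd the image runs are examples, not what was actually inspected here):

    # Does the directory answer for the shell account at all?
    getent passwd ragesoss    # substitute the real shell account name
    id ragesoss

    # Check the LDAP client daemons (sssd on newer images, nslcd/nscd on older ones).
    systemctl status sssd nslcd nscd

    # Restarting the relevant daemon is less disruptive than a full reboot,
    # although in this case rebooting the VM was what cleared it.
    sudo systemctl restart sssd
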
[18:11:09] !log admin adding cloudvirt1040, 1041 and 1042 to the 'ceph' host aggregate -- T275081
[18:11:12] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[18:11:13] T275081: (Need By: TBD) rack/setup/install cloudvirt104[0-6].eqiad.wmnet - https://phabricator.wikimedia.org/T275081
[19:40:36] !log admin putting cloudvirt1040 into the maintenance aggregate pending more info about T281399
[19:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[19:40:40] T281399: cloudvirt1040 primary NIC disconnected - https://phabricator.wikimedia.org/T281399
[20:48:11] !log admin cleaning up references to deleted hypervisors with mysql:root@localhost [nova_eqiad1]> delete from compute_nodes where hypervisor_version != '5002000';
[20:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
[21:11:11] !log admin cleaning up more references to deleted hypervisors with delete from services where topic='compute' and version != 53;
[21:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Admin/SAL
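
The aggregate moves and the hypervisor cleanup above were done by hand, the cleanup via direct SQL against the nova_eqiad1 database. A hedged sketch of the openstack CLI equivalents, using cloudvirt1040 and the aggregate names from the log as examples:

    # Move a hypervisor between host aggregates.
    openstack aggregate add host ceph cloudvirt1040.eqiad.wmnet
    openstack aggregate remove host ceph cloudvirt1040.eqiad.wmnet
    openstack aggregate add host maintenance cloudvirt1040.eqiad.wmnet

    # API-level alternative to deleting rows from compute_nodes/services directly:
    # list nova-compute services and delete the records of decommissioned hypervisors.
    openstack compute service list --service nova-compute
    openstack compute service delete <service-id>
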