[00:01:12] (PS1) Dzahn: hieradata/labs: set cluster_search and deployment server for staging instance [puppet] - https://gerrit.wikimedia.org/r/561929
[00:02:31] (CR) Paladox: [C: +1] hieradata/labs: set cluster_search and deployment server for staging instance [puppet] - https://gerrit.wikimedia.org/r/561929 (owner: Dzahn)
[00:07:17] (PS2) Dzahn: hieradata/labs: set cluster_search hosts and puppetmaster for devtools phab [puppet] - https://gerrit.wikimedia.org/r/561929
[00:09:26] (PS2) BryanDavis: support tools: Add script to rebuild all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561730
[00:11:32] (CR) Dzahn: [C: +2] hieradata/labs: set cluster_search hosts and puppetmaster for devtools phab [puppet] - https://gerrit.wikimedia.org/r/561929 (owner: Dzahn)
[00:29:59] (CR) BryanDavis: support tools: Add script to rebuild all images (2 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561730 (owner: BryanDavis)
[00:31:13] (CR) BryanDavis: "> Patch Set 2: Code-Review+2" [puppet] - https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[01:19:07] (PS1) Dzahn: mediawiki::php: allow setting PHP version to 7.3 for buster [puppet] - https://gerrit.wikimedia.org/r/561931
[01:20:55] (CR) Dzahn: "We tried to setup the first deployment_server on buster in labs and ran into this issue not being able to set the PHP version for extensio" [puppet] - https://gerrit.wikimedia.org/r/561931 (owner: Dzahn)
[01:27:23] (PS1) BryanDavis: Bump python version to 3.7 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561932
[01:27:25] (PS1) BryanDavis: Update deployment for new Kubernetes cluster [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561933
[01:27:59] (CR) BryanDavis: [C: +2] Bump python version to 3.7 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561932
(owner: BryanDavis)
[01:28:22] (Merged) jenkins-bot: Bump python version to 3.7 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561932 (owner: BryanDavis)
[01:45:51] (PS2) BryanDavis: Update deployment and control script for new Kubernetes cluster [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561933
[01:59:36] (CR) BryanDavis: [C: +2] Update deployment and control script for new Kubernetes cluster [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561933 (owner: BryanDavis)
[01:59:55] (Merged) jenkins-bot: Update deployment and control script for new Kubernetes cluster [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561933 (owner: BryanDavis)
[02:03:35] (PS1) BryanDavis: Add missing selector to Deployment [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561935
[02:03:51] (CR) BryanDavis: [C: +2] Add missing selector to Deployment [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561935 (owner: BryanDavis)
[02:04:13] (Merged) jenkins-bot: Add missing selector to Deployment [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561935 (owner: BryanDavis)
[02:05:46] (PS1) BryanDavis: Add missing label to Deployment [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561936
[02:05:59] (CR) BryanDavis: [C: +2] Add missing label to Deployment [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561936 (owner: BryanDavis)
[02:06:21] (Merged) jenkins-bot: Add missing label to Deployment [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561936 (owner: BryanDavis)
[02:09:00] (PS1) BryanDavis: bin/jouncebot.sh: use /usr/bin/kubectl consistently [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561937
[02:09:26] jouncebot: refresh
[02:09:26] I refreshed my knowledge about deployments.
[02:10:05] jouncebot: next
[02:10:06] In 57 hour(s) and 19 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200106T1130)
[02:15:05] (CR) BryanDavis: [C: +2] bin/jouncebot.sh: use /usr/bin/kubectl consistently [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561937 (owner: BryanDavis)
[02:15:27] (Merged) jenkins-bot: bin/jouncebot.sh: use /usr/bin/kubectl consistently [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561937 (owner: BryanDavis)
[03:30:05] (PS1) BryanDavis: Update for mwclient v0.10.0 Site constructor change [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561942
[03:30:07] (PS1) BryanDavis: Add Black and flake8 add-on lint checks [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561943
[03:30:42] (CR) BryanDavis: [C: +2] Update for mwclient v0.10.0 Site constructor change [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561942 (owner: BryanDavis)
[03:31:05] (Merged) jenkins-bot: Update for mwclient v0.10.0 Site constructor change [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561942 (owner: BryanDavis)
[03:31:58] (CR) BryanDavis: [C: +2] Add Black and flake8 add-on lint checks [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561943 (owner: BryanDavis)
[03:32:25] (Merged) jenkins-bot: Add Black and flake8 add-on lint checks [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561943 (owner: BryanDavis)
[03:33:38] (PS1) BryanDavis: Bump mwclient minimum version [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561944
[03:33:52] (CR) BryanDavis: [C: +2] Bump mwclient minimum version [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561944 (owner: BryanDavis)
[03:34:18] (Merged) jenkins-bot: Bump mwclient minimum version [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/561944 (owner: BryanDavis)
[07:14:24] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241873 (ops-monitoring-bot)
[07:26:38] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241873 (Peachey88)
[08:24:54] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T228853 (Andrew)
[08:24:55] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (Andrew)
[08:25:25] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T215892 (Andrew)
[08:25:29] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (Andrew)
[08:25:34] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241873 (Andrew)
[08:25:37] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (Andrew)
[09:29:38] Operations, Research, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (Nuria) Approved on my end
[11:30:30] (CR) Phamhi: [C: +1] "LGTM" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561730 (owner: BryanDavis)
[11:31:56] (CR) Phamhi: [C: +2] toolforge: replace diamond redis monitoring with prometheus [puppet] - https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993) (owner: BryanDavis)
[11:33:25] (CR) Phamhi: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/561437 (https://phabricator.wikimedia.org/T210993) (owner:
BryanDavis)
[11:33:51] (CR) Phamhi: [C: +2] support tools: Add script to rebuild all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561730 (owner: BryanDavis)
[11:34:23] (Merged) jenkins-bot: support tools: Add script to rebuild all images [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561730 (owner: BryanDavis)
[13:06:31] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 22330088 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:08:15] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 38453368 and 8 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:08:19] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 193592 and 42 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:10:03] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 38568 and 7 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[13:53:42] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241881 (ops-monitoring-bot)
[14:26:01] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (ops-monitoring-bot)
[14:56:15] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm) I see ` [Sat Jan 4 08:56:39 2020] megaraid_sas 0000:18:00.0: 155794 (631458161s/0x0004/CRIT) - Enclosure PD 20(c None/p1) phy bad for slot 4 ` in dmesg. Checking some other things quick because...
[14:57:44] (PS1) Andrew Bogott: nova: depool cloudvirt1016 [puppet] - https://gerrit.wikimedia.org/r/561985 (https://phabricator.wikimedia.org/T241882)
[14:58:36] (PS2) Andrew Bogott: nova: depool cloudvirt1016 [puppet] - https://gerrit.wikimedia.org/r/561985 (https://phabricator.wikimedia.org/T241882)
[15:00:51] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm) This is the behavior that led to T216218 and then T230289 In fact, this is pretty much exactly the same as {T230289}. Checking that the filesystem is still mounted ok.
[15:02:05] (CR) Andrew Bogott: [C: +2] nova: depool cloudvirt1016 [puppet] - https://gerrit.wikimedia.org/r/561985 (https://phabricator.wikimedia.org/T241882) (owner: Andrew Bogott)
[15:03:23] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm) It seems like the filesystem is ok, but there are no hot spares at this point, so if it kicks 2 more disks out, it'll cause problems. So far so good on that.
[15:06:42] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm) To be clear, I am relating this to T230289 is because it thinks the disks are //removed//, not failed.
[15:06:56] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm)
[15:06:59] Operations, ops-eqiad, Cloud-VPS, cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (Bstorm)
[15:07:50] (PS1) Andrew Bogott: Depool cloudvirt1024, raid controller issues [puppet] - https://gerrit.wikimedia.org/r/561987 (https://phabricator.wikimedia.org/T241884)
[15:08:13] Operations, ops-eqiad, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241873 (Bstorm)
[15:08:16] Operations, ops-eqiad, Patch-For-Review: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm)
[15:09:39] (CR) Andrew Bogott: [C: +2] Depool cloudvirt1024, raid controller issues [puppet] - https://gerrit.wikimedia.org/r/561987 (https://phabricator.wikimedia.org/T241884) (owner: Andrew Bogott)
[15:12:02] Operations, ops-eqiad, Patch-For-Review: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm) According to the "livecycle" logs in idrac, it had trouble communicating with the disks and then marked them removed. Basically the same as before and again, 2 disks on the sa...
[15:18:17] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm)
[15:23:37] (CR) Bstorm: "There's a mistake in this patch.
I think the buster images were all "toolforge" in webservice, not just the sssd ones (https://github.com" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/561730 (owner: BryanDavis)
[15:32:41] Operations, observability, Patch-For-Review, User-fgiunchedi, cloud-services-team (Kanban): Deprecate Diamond collectors in Cloud VPS - https://phabricator.wikimedia.org/T210993 (bd808)
[15:41:50] Operations, ops-eqiad, Cloud-Services, Patch-For-Review, cloud-services-team (Kanban): rack/setup/install labvirt101[5-8] - https://phabricator.wikimedia.org/T165531 (aborrero)
[15:42:07] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241886 (ops-monitoring-bot)
[15:44:20] Operations, ops-eqiad, Patch-For-Review: rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014) - https://phabricator.wikimedia.org/T138509 (aborrero)
[15:44:36] Operations, ops-eqiad, Patch-For-Review: rack/setup/install/deploy labvirt1012 labvirt1013 labvirt1014 nodes (cloudvirt1012 cloudvirt1013 cloudvirt1014) - https://phabricator.wikimedia.org/T138509 (aborrero)
[16:18:57] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (bd808) p:Triage→High
[16:28:25] Operations, ops-eqiad: Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241886 (Bstorm)
[16:28:25] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm)
[16:34:22] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[16:34:22] !log aborrero@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[16:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:24] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:26] !log aborrero@cumin1001 START - Cookbook sre.hosts.downtime
[16:34:26] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[16:34:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:31:35] (PS1) BryanDavis: toolschecker: update k8s config reading [puppet] - https://gerrit.wikimedia.org/r/561996 (https://phabricator.wikimedia.org/T240923)
[18:33:18] (CR) jerkins-bot: [V: -1] toolschecker: update k8s config reading [puppet] - https://gerrit.wikimedia.org/r/561996 (https://phabricator.wikimedia.org/T240923) (owner: BryanDavis)
[18:38:08] (PS2) BryanDavis: toolschecker: update k8s config reading [puppet] - https://gerrit.wikimedia.org/r/561996 (https://phabricator.wikimedia.org/T240923)
[18:48:32] (CR) Bstorm: [C: +2] toolschecker: update k8s config reading [puppet] - https://gerrit.wikimedia.org/r/561996 (https://phabricator.wikimedia.org/T240923) (owner: BryanDavis)
[19:08:58] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 47330016 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:10:44] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 97624 and 96 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[19:11:01] Operations, ops-eqiad, Patch-For-Review, cloud-services-team (Kanban): Degraded RAID on cloudvirt1024 - https://phabricator.wikimedia.org/T241884 (Bstorm) @aborrero When migrating cyberbot-db-01, the script died with ` total size is 304,445,262,469 speedup is 1.00 wmcs-cold-migrate: INFO: cyber...
[19:11:07] (PS1) BryanDavis: toolschecker: check node ready status on new k8s cluster [puppet] - https://gerrit.wikimedia.org/r/562000
[20:00:49] (CR) Kosta Harlan: [C: +1] GrowthExperiments: use local search in production [mediawiki-config] - https://gerrit.wikimedia.org/r/561927 (https://phabricator.wikimedia.org/T235717) (owner: Gergő Tisza)
[20:04:48] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 24548336 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:06:36] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 19256 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[20:32:53] Puppet, Cloud-VPS: role::simplelamp fails to start mysql due to apparmor - https://phabricator.wikimedia.org/T128642 (bd808)
[21:27:14] Puppet, Cloud-VPS: role::simplelamp takes ownership of all content in /etc/apache2/sites-enabled - https://phabricator.wikimedia.org/T169368 (bd808)
[23:55:51] Operations, MediaWiki-Authentication-and-authorization, Security-Team, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (Tgr) Support is decent nowadays, with only some mobile browsers not recognizing it. (Related...