[00:00:05] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210212T0000). [00:00:05] Cladis, kemayo, and nray: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [00:00:31] (03CR) 10DLynch: "recheck" [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663404 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:00:47] 👀 [00:01:15] (03CR) 10DLynch: "recheck" [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:01:23] 🤷🏻‍♂️ [00:03:12] (03PS4) 10Dzahn: Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 [00:03:39] (03CR) 10jerkins-bot: [V: 04-1] Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 (owner: 10Dzahn) [00:05:23] (03PS5) 10Dzahn: Revert "mwdebug: allow rsyncing home dirs [puppet] - 10https://gerrit.wikimedia.org/r/663401 [00:05:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "mwdebug: allow rsyncing home dirs [puppet] - 10https://gerrit.wikimedia.org/r/663401 (owner: 10Dzahn) [00:06:29] (03PS6) 10Dzahn: Revert "mwdebug: allow rsyncing home dirs [puppet] - 10https://gerrit.wikimedia.org/r/663401 [00:08:06] PROBLEM - mediawiki-installation DSH group on mw1329 is CRITICAL: Host mw1329 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:08:39] legoktm: ^ that's yours, right [00:10:37] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1329 is CRITICAL: Host mw1329 is not in mediawiki-installation dsh group daniel_zahn not pooled 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:11:42] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade [00:11:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade [00:11:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:11:52] PROBLEM - mediawiki-installation DSH group on mw1332 is CRITICAL: Host mw1332 is not in mediawiki-installation dsh group https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:12:00] .. [00:13:05] RoanKattouw, Niharika, Urbanecm any ETA for the window? :) [00:13:13] * Cladis got some cooking half-done [00:13:40] Oh, do we have another b&c [00:13:56] I can deploy today [00:14:13] "Evening backport window", yeah [00:14:36] let's do it [00:14:42] 🎉 [00:14:53] Kemayo: nray: around? [00:15:21] i see Kemayo [00:15:24] I'm around. I request to go last as mine isn't ready yet [00:15:36] (and I may have to defer to another day) [00:15:37] (03CR) 10Urbanecm: [C: 03+2] Enabling extension SandboxLink on ltwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663668 (https://phabricator.wikimedia.org/T273957) (owner: 10Base) [00:15:44] nray: ack, thanks [00:16:06] Kemayo: do we want to go with the backports? [00:16:31] Yup, so long as .30 is still not on group2. 
[00:16:31] (03Merged) 10jenkins-bot: Enabling extension SandboxLink on ltwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663668 (https://phabricator.wikimedia.org/T273957) (owner: 10Base) [00:16:32] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1330 is CRITICAL: Host mw1330 is not in mediawiki-installation dsh group daniel_zahn reimage in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:16:32] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1331 is CRITICAL: Host mw1331 is not in mediawiki-installation dsh group daniel_zahn reimage in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:16:32] ACKNOWLEDGEMENT - mediawiki-installation DSH group on mw1332 is CRITICAL: Host mw1332 is not in mediawiki-installation dsh group daniel_zahn reimage in progress https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [00:16:51] Kemayo: ack. Can you make them get a +2 from jenkins please? [00:17:12] Cladis: your ltwiki patch is on mwdebug1001, please test [00:17:16] Urbanecm: I've been waiting on a recheck for the last 20 minutes. [00:17:25] i see [00:18:20] Urbanecm: The test failure is way off in something thoroughly unrelated, so I'm hoping it's random transitory jenkins junk. 
[00:18:21] Urbanecm: it is there, all seem to be ok [00:18:30] *seems [00:18:54] thanks Cladis, syncing [00:19:12] (03CR) 10Urbanecm: [C: 03+2] Adding WQ as namespace alias for itwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663678 (https://phabricator.wikimedia.org/T273362) (owner: 10Base) [00:19:46] (03PS2) 10Urbanecm: Adding WQ as namespace alias for itwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663678 (https://phabricator.wikimedia.org/T273362) (owner: 10Base) [00:19:53] (03CR) 10Urbanecm: [C: 03+2] Adding WQ as namespace alias for itwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663678 (https://phabricator.wikimedia.org/T273362) (owner: 10Base) [00:20:31] (03CR) 10Urbanecm: [C: 03+2] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:20:39] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 53229b0f41eb8cc3e8a90157283913c7d69810df: Enabling extension SandboxLink on ltwiki (T273957) (duration: 01m 07s) [00:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:20:45] T273957: Extension SandboxLink on Lithuanian Wikipedia - https://phabricator.wikimedia.org/T273957 [00:20:45] (03CR) 10Urbanecm: [C: 03+2] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663404 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:21:10] (03Merged) 10jenkins-bot: Adding WQ as namespace alias for itwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663678 (https://phabricator.wikimedia.org/T273362) (owner: 10Base) [00:21:22] (03CR) 10jerkins-bot: [V: 04-1] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663404 
(https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:21:24] (03PS3) 10Nray: Enable WVUI search on beta (for Vector skin) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 [00:21:31] Cladis: your second patch is at mwdebug1001 [00:21:32] (03CR) 10Nray: [C: 04-1] "not ready to merge yet" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 (owner: 10Nray) [00:22:14] Kemayo: failed for 2nd time too :/ [00:22:41] Urbanecm: "WQ" works, "Project" did not get broken — lgtm [00:22:48] thx, syncing [00:22:52] Urbanecm: Hm, yeah. If you need to roll it back, that's okay -- I can go try to work out what's going on. One of the config patches should still go through, though. [00:23:08] i can sync the configs, that's not an issue [00:23:26] Only this one: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/663672 [00:23:36] The other one needs the logging that's in those backports. [00:23:46] got it [00:24:04] (03PS2) 10Urbanecm: Oversample DiscussionTools EditAttemptStep logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663672 (https://phabricator.wikimedia.org/T273946) (owner: 10DLynch) [00:24:14] (03CR) 10Urbanecm: [C: 03+2] Oversample DiscussionTools EditAttemptStep logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663672 (https://phabricator.wikimedia.org/T273946) (owner: 10DLynch) [00:24:50] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: f051c6cdaa162ce2ea42aa53a24e50bb4aa8a793: Adding WQ as namespace alias for itwikiquote (T273362) (duration: 01m 10s) [00:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:24:55] T273362: Add namespace alias for Italian Wikiquote - https://phabricator.wikimedia.org/T273362 [00:24:56] Cladis: done [00:25:08] Urbanecm: thanks ^_^ [00:25:11] (03Merged) 10jenkins-bot: Oversample DiscussionTools EditAttemptStep logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663672 
(https://phabricator.wikimedia.org/T273946) (owner: 10DLynch) [00:25:13] np [00:25:33] Kemayo: can you test your patch somehow, please? [00:26:33] (03CR) 10jerkins-bot: [V: 04-1] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:28:01] (03CR) 10Urbanecm: "This change is ready for review." [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663727 (owner: 10Urbanecm) [00:28:27] Kemayo: just uploaded an empty change. If it fails, I'll overrule jenkins. [00:28:34] Urbanecm: 👍🏻 [00:29:37] !log mwscript namespaceDupes.php itwikiquote --fix # T273362 [00:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:29:54] Kemayo: did you see my message on testing the other config change that merged a while ago? [00:30:08] !log mwscript namespaceDupes.php itwikiquote --fix --add-prefix=BROKEN # T273362 [00:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:30:12] T273362: Add namespace alias for Italian Wikiquote - https://phabricator.wikimedia.org/T273362 [00:30:51] Urbanecm: Oh, sorry, I can confirm that it seems to be out correctly. [00:31:09] at mwdebug1001? [00:31:13] Ye. [00:31:14] s [00:31:16] okay, syncing [00:32:48] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: a022f2b506089ab518b74c1dfca78924c06dc80f: Oversample DiscussionTools EditAttemptStep logging (T273946) (duration: 01m 08s) [00:32:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:32:54] T273946: Adjust Discussion Tools' sampling rates - https://phabricator.wikimedia.org/T273946 [00:32:55] should be live [00:33:19] I can confirm that, too. 
[00:33:23] great :) [00:33:31] let's wait for the empty change then [00:34:12] !log removing 2 files for legal compliance [00:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:42:24] (03CR) 10Nray: Enable WVUI search on beta (for Vector skin) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 (owner: 10Nray) [00:45:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:48] PROBLEM - Check systemd state on sodium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:47:16] (03CR) 10Jdlrobson: [C: 03+1] Enable WVUI search on beta (for Vector skin) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 (owner: 10Nray) [00:47:59] 10SRE, 10Platform Engineering, 10Traffic, 10cloud-services-team (Kanban): Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (10tstarling) I commented on the parent task. [00:48:56] (03PS1) 10Base: Adding import sources for zh_yuewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663711 (https://phabricator.wikimedia.org/T274597) [00:49:09] Urbanecm: you still around? 
[00:49:11] yeah [00:49:19] Kemayo: so, empty change failed too [00:49:25] if still around, i can deploy them [00:49:33] jouncebot: now [00:49:33] For the next 0 hour(s) and 10 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210212T0000) [00:49:47] Urbanecm: cool, just a heads up that my patch will be ready whenever you get around to it [00:49:52] great [00:50:11] (03CR) 10Urbanecm: [C: 03+2] Enable WVUI search on beta (for Vector skin) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 (owner: 10Nray) [00:50:23] nray: as it's beta-only, it will be deployed automatically within next 30 minutes [00:50:32] cool, thank you! [00:50:32] if it doesn't, shout :) [00:50:34] (03CR) 10jerkins-bot: [V: 04-1] [DNM] Test jerkins [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663727 (owner: 10Urbanecm) [00:50:41] thank you Urbanecm [00:50:45] any time [00:50:51] Urbanecm: how about deploying another config change? Unless you are going to sleep :) [00:51:01] Cladis: we can do it! [00:51:08] Kemayo: ping? [00:51:14] (03Merged) 10jenkins-bot: Enable WVUI search on beta (for Vector skin) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 (owner: 10Nray) [00:51:15] Urbanecm: https://gerrit.wikimedia.org/r/663711 [00:51:52] (03PS2) 10Urbanecm: Add import sources for zh_yuewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663711 (https://phabricator.wikimedia.org/T274597) (owner: 10Base) [00:51:58] (03CR) 10Urbanecm: [C: 03+2] Add import sources for zh_yuewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663711 (https://phabricator.wikimedia.org/T274597) (owner: 10Base) [00:52:38] Urbanecm: Sorry, had to answer the door. I am back. 
[00:52:43] great [00:52:49] Kemayo: I can do the backports now [00:52:53] empty change failed too [00:53:02] (03Merged) 10jenkins-bot: Add import sources for zh_yuewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663711 (https://phabricator.wikimedia.org/T274597) (owner: 10Base) [00:53:03] Awesome. (Or, arguably, not.) [00:53:08] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:53:22] (03CR) 10Urbanecm: [V: 03+2 C: 03+2] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663404 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [00:53:43] Cladis: your config patch is at mwdebug1001 [00:54:25] Urbanecm: can see the sources at Special:Import [00:54:30] great, syncing [00:54:44] Kemayo: your backports are at mwdebug1001 too, but without the config change now. Should i pull config too? [00:55:07] Yes, please. 
[00:55:22] okay, gimme a minute [00:55:37] (03PS4) 10Urbanecm: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) (owner: 10Bartosz Dziewoński) [00:55:43] (03CR) 10Urbanecm: [C: 03+2] Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) (owner: 10Bartosz Dziewoński) [00:56:31] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 5d92ed15c51d57f43bad054d0469f54848b84d6a: Add import sources for zh_yuewiki (T274597) (duration: 01m 13s) [00:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:36] T274597: Add import sources for zh_yuewiki - https://phabricator.wikimedia.org/T274597 [00:56:39] (03Merged) 10jenkins-bot: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) (owner: 10Bartosz Dziewoński) [00:56:57] Kemayo: config is at mwdebug1001 together with backports [00:57:01] Cladis: and yours is live :) [00:57:07] Urbanecm: thank you! :) [00:57:10] no problem [00:58:39] Urbanecm: I confirm it's working there. [00:58:44] great, let's sync it [00:58:57] Urbanecm: Thanks! [00:58:57] Kemayo: should i sync backports first, config second, the other way around, or it doesn't matter? [00:59:12] Doesn't matter so long as it's within a minute or two of each other. [00:59:17] okay, great [01:01:34] !log urbanecm@deploy1001 Synchronized php-1.36.0-wmf.27/extensions/VisualEditor/: c86cd00076c9f1857f4bafb04a15640ad66da863: de4a562d3baec77c85bfa05ba59778b882a6f9d2: VE backports (T273096) (duration: 01m 15s) [01:01:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:01:38] T273096: Add DiscussionTools a/b test bucket information to events from VisualEditor and WikiEditor. 
- https://phabricator.wikimedia.org/T273096 [01:02:48] !log urbanecm@deploy1001 sync-file aborted: 389f7f1fdc9ad4a0c163ccfe1d80f2aaec7f8038: Enable DiscussionTools Reply Tool A/B test (duration: 00m 48s) [01:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:02] PROBLEM - Postgres Replication Lag on maps1008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 14274104592 and 1474 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:04:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 389f7f1fdc9ad4a0c163ccfe1d80f2aaec7f8038: Enable DiscussionTools Reply Tool A/B test (T273554) (duration: 01m 08s) [01:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:18] T273554: Make config change to enable Reply Tool A/B test - https://phabricator.wikimedia.org/T273554 [01:04:20] Kemayo: should be all live! [01:04:58] anything else? [01:06:59] !log Evening B&C done [01:07:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:55] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1329.eqiad.wmnet [01:08:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:06] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1330.eqiad.wmnet [01:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:11] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1331.eqiad.wmnet [01:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:08:15] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1332.eqiad.wmnet [01:08:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:13:32] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13668229392 and 
1398 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:20:32] RECOVERY - mediawiki-installation DSH group on mw1329 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:20:56] RECOVERY - mediawiki-installation DSH group on mw1332 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [01:21:34] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1332.eqiad.wmnet [01:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:40] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1331.eqiad.wmnet [01:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:46] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1330.eqiad.wmnet [01:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:21:54] !log legoktm@cumin1001 conftool action : set/pooled=yes; selector: name=mw1329.eqiad.wmnet [01:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:38:00] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:45:26] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:12:54] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:19:42] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 74 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:20:30] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:32:38] (03PS1) 10Andrew Bogott: Nova vendordata: change the mime-type of the cloud-config section [puppet] - 10https://gerrit.wikimedia.org/r/663716 [02:32:42] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 48 probes of 599 (alerts on 65) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [02:37:50] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:12] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:43:42] Jdlrobson are you around? [02:45:30] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:50:36] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:55:51] (03CR) 10Legoktm: [C: 03+1] docker-pkg: add ca_bundle configuration [puppet] - 10https://gerrit.wikimedia.org/r/663588 (https://phabricator.wikimedia.org/T274306) (owner: 10JMeybohm) [03:29:52] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:36:36] !log krinkle@deploy1001 Started deploy [integration/docroot@3c943ba]: I89e1ec881 [03:36:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:36:44] !log krinkle@deploy1001 Finished deploy [integration/docroot@3c943ba]: I89e1ec881 (duration: 00m 08s) [03:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:15:55] (03CR) 10DLynch: "Note: we verified that test is broken in .27 in general, and so this patch could be deployed." [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663404 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [05:16:08] (03CR) 10DLynch: "Note: we verified that test is broken in .27 in general, and so this patch could be deployed." 
[extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) (owner: 10DLynch) [05:43:44] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.071 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:43:08] RECOVERY - Check systemd state on sodium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:27:59] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Joe) According to data in turnilo, the flow of requests has gone down significantly after the app authors fixed the issue... [07:37:21] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Sdkb) Just a quick heads up, apparently //Vice// found this phab ticket worthy of an article: https://www.vice.com/en/arti... 
[07:52:22] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:52:36] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:53:44] PROBLEM - aqs endpoints health on aqs1009 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:53:44] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:53:44] PROBLEM - aqs endpoints health on aqs1006 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:53:50] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test 
Get daily edits for english wikipedia page 0 returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:53:57] this is due to druid, fixing in a sec sigh [07:54:01] problem still not resolved [07:54:32] !log roll restart of druid brokers on druid-public - locked after scheduled datasource deletion [07:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:56:04] RECOVERY - aqs endpoints health on aqs1009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:56:10] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:56:10] RECOVERY - aqs endpoints health on aqs1006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:56:20] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:57:18] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [07:57:30] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210212T0800) [08:10:07] 10ops-eqiad: ms-be1038 NIC link down - https://phabricator.wikimedia.org/T274622 (10fgiunchedi) [08:14:59] !log reimaging bast2002 to buster [08:15:00] 10ops-eqiad: ms-be1038 NIC link down - https://phabricator.wikimedia.org/T274622 (10fgiunchedi) p:05Triage→03High Please diagnose with priority as with this host we're two down already in ms-be1* (the other being ms-be1034 in T274488) [08:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:36] (03PS3) 10Filippo Giunchedi: alertmanager: route Performance team alerts [puppet] - 10https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) [08:25:26] !log jynus@cumin1001 dbctl commit (dc=all): 'Increase db1163 traffic to 10%', diff saved to https://phabricator.wikimedia.org/P14331 and previous config saved to /var/cache/conftool/dbconfig/20210212-082526-jynus.json [08:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:03] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: route Performance team alerts [puppet] - 10https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) (owner: 10Filippo Giunchedi) [08:29:52] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on bast2002.wikimedia.org with reason: REIMAGE [08:29:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:56] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast2002.wikimedia.org with reason: REIMAGE [08:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:17] (03PS1) 10Muehlenhoff: Point wmf-update-known-hosts-production to bast3005 [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/663785 [08:53:19] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Point wmf-update-known-hosts-production to bast3005 [debs/wmf-sre-laptop] - 
10https://gerrit.wikimedia.org/r/663785 (owner: 10Muehlenhoff) [09:15:31] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] docker-pkg: add ca_bundle configuration [puppet] - 10https://gerrit.wikimedia.org/r/663588 (https://phabricator.wikimedia.org/T274306) (owner: 10JMeybohm) [09:23:15] (03CR) 10Gehel: "A few more minor comments. Looks good enough to be merged as-is if you want." (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/644482 (https://phabricator.wikimedia.org/T260949) (owner: 10MSantos) [09:31:26] !log installing node-y18n security updates [09:31:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:12] !log jynus@cumin1001 dbctl commit (dc=all): 'Increase db1163 traffic to 20%', diff saved to https://phabricator.wikimedia.org/P14333 and previous config saved to /var/cache/conftool/dbconfig/20210212-093211-jynus.json [09:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:49] (03CR) 10JMeybohm: [C: 03+1] "Cool, let's try this!" (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [09:45:21] !log jynus@cumin1001 dbctl commit (dc=all): 'Increase db1163 traffic to 30%', diff saved to https://phabricator.wikimedia.org/P14334 and previous config saved to /var/cache/conftool/dbconfig/20210212-094520-jynus.json [09:45:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:05] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2003.codfw.wmnet [09:46:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:17] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) I've setup higher MTU on backup1002 and backup2001 as per @ayounsi suggestion, and will do a bac... 
[09:52:50] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1085.eqiad.wmnet [09:52:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1086.eqiad.wmnet [09:53:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:28] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2037.codfw.wmnet [09:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:38] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2038.codfw.wmnet [09:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:48] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3060.esams.wmnet [09:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3061.esams.wmnet [09:54:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:08] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4032.ulsfo.wmnet [09:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4026.ulsfo.wmnet [09:54:30] (03PS2) 10Effie Mouzeli: WIP: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [09:54:32] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5012.eqsin.wmnet [09:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5006.eqsin.wmnet [09:54:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:09] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:57:19] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2003.codfw.wmnet [09:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:34] (03PS3) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [10:01:43] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2004.codfw.wmnet [10:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:09] 10SRE, 10DNS, 10Traffic: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10Vgutierrez) p:05Triage→03Medium [10:02:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5006.eqsin.wmnet [10:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:08] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2037.codfw.wmnet [10:04:11] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5012.eqsin.wmnet [10:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2038.codfw.wmnet [10:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:26] !log vgutierrez@cumin1001 
END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1085.eqiad.wmnet [10:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3060.esams.wmnet [10:04:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:40] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3061.esams.wmnet [10:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:49] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1086.eqiad.wmnet [10:04:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4026.ulsfo.wmnet [10:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:05] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4032.ulsfo.wmnet [10:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:06] (03PS4) 10Effie Mouzeli: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [10:11:47] RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:12:42] 10SRE, 10Data-Persistence-Backup, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) It is very early to say, but if it has any impact at all, so far it looks negative (codfw->eqiad... 
[10:12:57] (03PS1) 10Urbanecm: Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) [10:12:57] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2004.codfw.wmnet [10:12:59] (03PS7) 10Legoktm: docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) [10:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:08] (03PS2) 10Urbanecm: Revert "Revert "Enable SandboxLink at viwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663736 (https://phabricator.wikimedia.org/T272796) [10:13:44] 10SRE, 10serviceops, 10User-jijiki: Upgrade memcached to version 1.6.x - https://phabricator.wikimedia.org/T270315 (10jijiki) [10:14:24] (03PS1) 10Vgutierrez: wikimedia.org: Add Apple Business Manager TXT record [dns] - 10https://gerrit.wikimedia.org/r/663794 (https://phabricator.wikimedia.org/T274592) [10:16:15] (03CR) 10Legoktm: [C: 03+2] docker_registry_ha: Have restricted/ images that are limited read/write [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [10:16:36] 10SRE, 10DNS, 10Traffic, 10Patch-For-Review: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10Vgutierrez) I've created the patch that adds the TXT record (https://gerrit.wikimedia.org/r/c/operations/dns/+/663794), could you review it @bblack? 
[10:18:15] !log jynus@cumin1001 dbctl commit (dc=all): 'Increase db1163 traffic to 50%', diff saved to https://phabricator.wikimedia.org/P14335 and previous config saved to /var/cache/conftool/dbconfig/20210212-101814-jynus.json [10:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:49] (03PS1) 10Effie Mouzeli: hieradata: enable onhost memcached socket use on mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [10:21:40] PROBLEM - Docker registry HTTPS interface on registry1002 is CRITICAL: connect to address 10.64.32.139 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/Docker [10:21:42] PROBLEM - Docker registry HTTP interface on registry1002 is CRITICAL: connect to address 10.64.32.139 and port 81: Connection refused https://wikitech.wikimedia.org/wiki/Docker [10:21:50] ^ me, working on it [10:22:09] !log mwscript importImages.php --wiki=commonswiki --comment-ext=txt --user=Victorgrigas . 
# T274608 [10:22:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:14] T274608: Server side upload for Victorgrigas - https://phabricator.wikimedia.org/T274608 [10:22:39] !log depooled registry1002 while fixing/debugging nginx config [10:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:37] (03PS30) 10Jbond: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:24:07] !log installing wireshark security updates for stretch [10:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:42] (03PS1) 10Legoktm: docker_registry_ha: Fix nginx syntax for if blocks [puppet] - 10https://gerrit.wikimedia.org/r/663797 [10:25:30] (03PS31) 10Jbond: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:27:07] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28029/console" [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [10:32:30] (03PS2) 10Effie Mouzeli: hieradata: enable onhost memcached socket use on mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [10:37:21] (03CR) 10Effie Mouzeli: [V: 04-1 C: 04-1] "Needs some more love https://puppet-compiler.wmflabs.org/compiler1003/28030/mwdebug1003.eqiad.wmnet/change.mwdebug1003.eqiad.wmnet.err" [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [10:39:20] RECOVERY - cassandra CQL 10.64.0.12:9042 on maps1005 is OK: TCP OK - 0.000 second response time on 10.64.0.12 port 9042 https://phabricator.wikimedia.org/T93886 [10:39:21] !log 
jynus@cumin1001 dbctl commit (dc=all): 'Increase db1163 traffic to 75%', diff saved to https://phabricator.wikimedia.org/P14336 and previous config saved to /var/cache/conftool/dbconfig/20210212-103921-jynus.json [10:39:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:41] ACKNOWLEDGEMENT - Docker registry HTTPS interface on registry1002 is CRITICAL: connect to address 10.64.32.139 and port 443: Connection refused Legoktm fixing it https://wikitech.wikimedia.org/wiki/Docker [10:40:51] ACKNOWLEDGEMENT - Docker registry HTTP interface on registry1002 is CRITICAL: connect to address 10.64.32.139 and port 81: Connection refused Legoktm fixing it https://wikitech.wikimedia.org/wiki/Docker [10:41:34] RECOVERY - cassandra service on maps1005 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:41:53] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2039.codfw.wmnet [10:41:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:05] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2040.codfw.wmnet [10:42:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:53] (03PS1) 10Legoktm: Revert "docker_registry_ha: Have restricted/ images that are limited read/write" [puppet] - 10https://gerrit.wikimedia.org/r/663737 [10:46:07] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Revert "docker_registry_ha: Have restricted/ images that are limited read/write" [puppet] - 10https://gerrit.wikimedia.org/r/663737 (owner: 10Legoktm) [10:48:36] RECOVERY - Docker registry HTTPS interface on registry1002 is OK: HTTP OK: HTTP/1.1 200 OK - 2581 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/Docker [10:48:38] RECOVERY - Docker registry HTTP interface on registry1002 is OK: HTTP OK: Status line output matched HTTP/1.1 403 - 407 bytes in 0.001 second response time 
https://wikitech.wikimedia.org/wiki/Docker [10:50:03] !log repooled registry1002 after revert [10:50:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:40] (03Abandoned) 10Legoktm: docker_registry_ha: Fix nginx syntax for if blocks [puppet] - 10https://gerrit.wikimedia.org/r/663797 (owner: 10Legoktm) [10:53:04] (03PS1) 10Arturo Borrero Gonzalez: cloudgw2002-dev: give it proper puppet role [puppet] - 10https://gerrit.wikimedia.org/r/663799 (https://phabricator.wikimedia.org/T272963) [10:54:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw2002-dev: give it proper puppet role [puppet] - 10https://gerrit.wikimedia.org/r/663799 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [10:57:43] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2040.codfw.wmnet [10:57:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2039.codfw.wmnet [10:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:07] (03PS32) 10Jbond: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:06:35] !log installing xcftools security updates [11:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:01] (03PS1) 10Elukey: druid: tune Broker settings for the Public cluster [puppet] - 10https://gerrit.wikimedia.org/r/663800 (https://phabricator.wikimedia.org/T270173) [11:10:12] !log jynus@cumin1001 dbctl commit (dc=all): 'Increase db1163 traffic to 100%', diff saved to https://phabricator.wikimedia.org/P14337 and previous config saved to /var/cache/conftool/dbconfig/20210212-111010-jynus.json [11:10:14] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:21] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1087.eqiad.wmnet [11:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:30] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1088.eqiad.wmnet [11:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:42] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3062.esams.wmnet [11:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3063.esams.wmnet [11:10:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:29] (03PS2) 10Elukey: druid: tune Broker settings for the Public cluster [puppet] - 10https://gerrit.wikimedia.org/r/663800 (https://phabricator.wikimedia.org/T270173) [11:13:04] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE [11:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:31] (03CR) 10Legoktm: "This was reverted because I didn't validate the nginx config syntax and I didn't fully understand how if blocks work in nginx. 
I'll submit" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [11:14:54] !log installing golang-1.11 security updates [11:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:09] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudgw2002-dev.codfw.wmnet with reason: REIMAGE [11:15:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:13] (03PS5) 10Jbond: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:21:46] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3063.esams.wmnet [11:21:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:03] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3062.esams.wmnet [11:22:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:14] (03PS1) 10Arturo Borrero Gonzalez: keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) [11:22:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1087.eqiad.wmnet [11:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:20] !log installing node-ini security updates [11:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:22:24] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1088.eqiad.wmnet [11:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:41] 10SRE: Integrate Buster 10.8 point update - https://phabricator.wikimedia.org/T274099 
(10MoritzMuehlenhoff) [11:25:41] !log installing device-tree-compiler updates from buster point release [11:25:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:03] (03PS33) 10Jbond: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:26:10] (03PS6) 10Jbond: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:27:16] !log installing emacs updates from buster point release [11:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:19] (03CR) 10Jbond: [C: 03+1] "The error about network::constants was valid, you are only allowed to include that class in profiles. As such i refactored the change a l" [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [11:31:55] (03PS2) 10Arturo Borrero Gonzalez: keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) [11:32:12] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2041.codfw.wmnet [11:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:22] (03CR) 10jerkins-bot: [V: 04-1] keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [11:32:24] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2042.codfw.wmnet [11:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:11] (03PS3) 10Arturo Borrero Gonzalez: keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 
(https://phabricator.wikimedia.org/T272963) [11:44:16] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2041.codfw.wmnet [11:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:30] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2042.codfw.wmnet [11:44:36] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Paragon) - https://phabricator.wikimedia.org/T274631 (10Pablo) [11:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:57] (03PS4) 10Arturo Borrero Gonzalez: keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) [11:58:24] (03CR) 10jerkins-bot: [V: 04-1] keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [11:59:38] (03PS5) 10Arturo Borrero Gonzalez: keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) [12:01:31] (03PS6) 10Arturo Borrero Gonzalez: keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) [12:02:39] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] keepalived: add support for custom template [puppet] - 10https://gerrit.wikimedia.org/r/663801 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [12:04:03] (03PS7) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [12:04:05] (03PS1) 10Giuseppe Lavagetto: Move scaffold functions to ruby [deployment-charts] - 10https://gerrit.wikimedia.org/r/663807 [12:04:27] (03CR) 10jerkins-bot: [V: 04-1] Add support for php deployments [deployment-charts] - 
10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [12:05:58] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1334 bytes in 0.158 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:06:25] (03PS8) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [12:06:49] (03CR) 10jerkins-bot: [V: 04-1] Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [12:07:37] <_joe_> ok this is much better! jayme ^^ we finally have CI on scaffolding :) [12:09:28] (03PS9) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [12:11:38] RECOVERY - tilerator on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [12:11:54] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on maps2007.codfw.wmnet with reason: Resyncing database [12:11:55] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on maps2007.codfw.wmnet with reason: Resyncing database [12:12:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:11] _joe_: great! 
[12:13:33] <_joe_> now let me fix all the stuff that I did and that rubocop finds naughty in scaffold.rb :P [12:13:50] RECOVERY - tileratorui on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 315 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [12:19:06] (03CR) 10Kosta Harlan: linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [12:25:27] (03Abandoned) 10Urbanecm: [DNM] Test jerkins [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663727 (owner: 10Urbanecm) [12:28:32] (03PS2) 10Giuseppe Lavagetto: Move scaffold functions to ruby [deployment-charts] - 10https://gerrit.wikimedia.org/r/663807 [12:28:34] (03PS10) 10Giuseppe Lavagetto: Add support for php deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 [12:34:42] (03CR) 10Giuseppe Lavagetto: Add support for php deployments (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/651757 (owner: 10Giuseppe Lavagetto) [12:37:55] (03CR) 10Giuseppe Lavagetto: "> Patch Set 4:" [puppet] - 10https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741) (owner: 10Giuseppe Lavagetto) [12:41:59] 10SRE, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) [12:42:13] 10SRE, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) [12:42:45] (03CR) 10Joal: "LGTM! Thanks for looking deeper into that elukey!" [puppet] - 10https://gerrit.wikimedia.org/r/663800 (https://phabricator.wikimedia.org/T270173) (owner: 10Elukey) [12:46:57] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "my only doubt is if we should rename it to python3-stretch too..." 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663345 (https://phabricator.wikimedia.org/T274435) (owner: 10BryanDavis) [12:47:12] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Code LGTM, and thanks for doing it!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663346 (https://phabricator.wikimedia.org/T274435) (owner: 10BryanDavis) [12:47:55] (03PS4) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [12:49:01] (03PS1) 10Majavah: Restore userlink for IP range. [core] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663738 (https://phabricator.wikimedia.org/T274526) [12:50:46] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=205 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:51:51] <_joe_> 205? [12:51:53] <_joe_> lol [12:52:55] <_joe_> how do we even send so many reset content responses, pretty great [12:53:21] (03PS1) 10Alexandros Kosiaris: restbase: Remove graphoid config [puppet] - 10https://gerrit.wikimedia.org/r/663812 (https://phabricator.wikimedia.org/T242855) [12:53:22] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [12:53:23] (03PS1) 10Alexandros Kosiaris: Remove graphoid from services_proxy [puppet] - 10https://gerrit.wikimedia.org/r/663813 (https://phabricator.wikimedia.org/T242855) [12:53:25] (03PS1) 10Alexandros Kosiaris: Remove graphoid deployment references [puppet] - 10https://gerrit.wikimedia.org/r/663814 (https://phabricator.wikimedia.org/T242855) [12:53:27] (03PS1) 10Alexandros Kosiaris: graphoid: Switch to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/663815 (https://phabricator.wikimedia.org/T242855) [12:53:29] (03PS1) 10Alexandros Kosiaris: graphoid: Switch to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/663816 (https://phabricator.wikimedia.org/T242855) [12:53:31] (03PS1) 10Alexandros Kosiaris: graphoid: Remove conftool data [puppet] - 10https://gerrit.wikimedia.org/r/663817 (https://phabricator.wikimedia.org/T242855) [12:53:33] (03PS1) 10Alexandros Kosiaris: graphoid: Remove LVS IP from scb [puppet] - 10https://gerrit.wikimedia.org/r/663818 (https://phabricator.wikimedia.org/T242855) [12:53:35] (03PS1) 10Alexandros Kosiaris: graphoid: Remove all puppet references [puppet] - 10https://gerrit.wikimedia.org/r/663819 (https://phabricator.wikimedia.org/T242855) [12:53:59] <_joe_> akosiaris: that is one satisfying patch series to write heh [12:56:02] (03PS1) 10Kormat: wmfmariadb: Don't use socket format if no socket is set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663821 [12:56:26] 10SRE, 10Graphoid, 10serviceops, 10MW-1.35-notes (1.35.0-wmf.34; 2020-05-26), and 2 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10akosiaris) Patches being uploaded. I 've tried to cover everything, but maybe I missed something. 
In the course of the next week they will be slowly d... [13:00:20] (03PS1) 10Alexandros Kosiaris: graphoid: Remove all RRs for it [dns] - 10https://gerrit.wikimedia.org/r/663822 (https://phabricator.wikimedia.org/T242855) [13:00:55] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:02:34] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [13:03:56] (03CR) 10Kormat: [C: 03+2] wmfmariadb: Don't use socket format if no socket is set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663821 (owner: 10Kormat) [13:04:24] (03PS2) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:06:06] (03CR) 10jerkins-bot: [V: 04-1] cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) (owner: 10Arturo Borrero Gonzalez) [13:06:36] (03Merged) 10jenkins-bot: wmfmariadb: Don't use socket format if no socket is set [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663821 (owner: 10Kormat) [13:07:28] (03PS3) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:09:13] (03PS1) 10Kormat: test_replication_tree: Make common _run() method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663824 [13:10:16] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps1005.eqiad.wmnet [13:10:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:00] (03CR) 10JMeybohm: [C: 03+1] linkrecommendation: Cron 
job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [13:11:14] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:19] (03PS4) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:11:46] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (Paragon) - https://phabricator.wikimedia.org/T274631 (10Pablo) [13:14:13] (03PS5) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:17:31] (03PS1) 10Jbond: openstack: quick PoC porting wmcs-enc-cli to a spicerack module [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 [13:18:29] (03CR) 10Kormat: [C: 03+2] test_replication_tree: Make common _run() method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663824 (owner: 10Kormat) [13:19:29] (03PS6) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:21:15] (03PS1) 10Kormat: integration: Start/stop env for every test module. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663827 [13:22:06] (03CR) 10jerkins-bot: [V: 04-1] openstack: quick PoC porting wmcs-enc-cli to a spicerack module [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 (owner: 10Jbond) [13:22:09] 10SRE, 10Graphoid, 10Projects-Cleanup, 10serviceops, and 3 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10hashar) Tagging #cleanup for the repositories archival. 
I guess we can empty up `mediawiki/service/graphoid.git` with a note pointing back to this task, mark the repository r... [13:22:17] (03Merged) 10jenkins-bot: test_replication_tree: Make common _run() method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663824 (owner: 10Kormat) [13:24:27] (03CR) 10Kormat: [C: 03+2] integration: Start/stop env for every test module. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663827 (owner: 10Kormat) [13:24:57] (03PS7) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:26:00] (03CR) 10Jbond: "As far as i can see a lot of the code in wmcs/libs you get for free by using the openstack back end. The rest you could probably easily a" [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro) [13:27:06] (03PS8) 10Arturo Borrero Gonzalez: cloudgw: introduce HA by using keepalived/VRRP [puppet] - 10https://gerrit.wikimedia.org/r/663823 (https://phabricator.wikimedia.org/T272963) [13:28:18] (03Merged) 10jenkins-bot: integration: Start/stop env for every test module. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663827 (owner: 10Kormat) [13:38:02] (03PS2) 10Gehel: wdqs: explicit shutdown of Blazegraph during reboots. 
[cookbooks] - 10https://gerrit.wikimedia.org/r/662988 [13:42:32] (03PS5) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [13:44:53] (03PS1) 10Alexandros Kosiaris: apertium: Remove the old non TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/663833 [13:45:20] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Pablo) [13:48:48] (03PS6) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [13:50:50] (03CR) 10David Caro: "> Patch Set 2:" [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro) [13:51:59] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1089.eqiad.wmnet [13:52:05] vgutierrez@cumin1001: Failed to log message to wiki. Somebody should check the error logs. 
[13:52:08] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1090.eqiad.wmnet [13:52:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:26] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3064.esams.wmnet [13:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:37] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3065.esams.wmnet [13:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:52] (03CR) 10David Caro: "Thanks for the port, a couple comments:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 (owner: 10Jbond) [13:55:01] (03PS7) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [13:55:47] (03PS1) 10Klausman: dns: Add SRV records for ml-etcd clusters [dns] - 10https://gerrit.wikimedia.org/r/663836 [13:56:12] (03PS2) 10Klausman: dns: Add SRV records for ml-etcd clusters [dns] - 10https://gerrit.wikimedia.org/r/663836 (https://phabricator.wikimedia.org/T273071) [13:56:19] (03CR) 10ArielGlenn: [C: 03+2] refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [13:59:19] (03CR) 10Jbond: wmcs: first try on creating a new etcd for toolforge (039 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro) [14:02:32] (03CR) 10Jbond: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 (owner: 10Jbond) [14:03:50] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3064.esams.wmnet [14:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:58] !log vgutierrez@cumin1001 END (PASS) - 
Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3065.esams.wmnet [14:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:38] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1089.eqiad.wmnet [14:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1090.eqiad.wmnet [14:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:20] (03PS3) 10Klausman: dns: Add SRV records for ml-etcd clusters [dns] - 10https://gerrit.wikimedia.org/r/663836 (https://phabricator.wikimedia.org/T273071) [14:10:55] (03CR) 10David Caro: "> the cloud cumin hosts already have access to this endpoint so this would work with the current module of running spicerack i.e., from a " [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 (owner: 10Jbond) [14:14:06] (03CR) 10Jbond: "> Patch Set 1:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/663826 (owner: 10Jbond) [14:19:13] (03CR) 10Ppchelko: [C: 04-1] "I guess the first-first step is to remove graphoid from RESTBase codebase." [puppet] - 10https://gerrit.wikimedia.org/r/663812 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [14:23:48] (03CR) 10Ppchelko: [C: 04-1] "https://github.com/wikimedia/restbase/pull/1287 - needs to be deployed first. Can do that on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/663812 (https://phabricator.wikimedia.org/T242855) (owner: 10Alexandros Kosiaris) [14:25:18] (03CR) 10David Caro: wmcs: first try on creating a new etcd for toolforge (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/658637 (owner: 10David Caro) [14:26:07] (03CR) 10Elukey: "LGTM! 
I added some folks of the serviceops team to validate naming etc.., not sure if there is a convention for these kind of things :)" [dns] - 10https://gerrit.wikimedia.org/r/663836 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [14:32:33] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1022.eqiad.wmnet [14:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:48] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Remove the old non TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/663833 (owner: 10Alexandros Kosiaris) [14:37:40] (03Merged) 10jenkins-bot: apertium: Remove the old non TLS release [deployment-charts] - 10https://gerrit.wikimedia.org/r/663833 (owner: 10Alexandros Kosiaris) [14:37:56] (03CR) 10Andrew Bogott: [C: 03+2] Nova vendordata: change the mime-type of the cloud-config section [puppet] - 10https://gerrit.wikimedia.org/r/663716 (owner: 10Andrew Bogott) [14:39:32] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1022.eqiad.wmnet [14:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] (03PS1) 10Filippo Giunchedi: thanos: bump limit open files for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/663840 [15:08:14] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:12] (03CR) 10Elukey: [C: 03+2] druid: tune Broker settings for the Public cluster [puppet] - 10https://gerrit.wikimedia.org/r/663800 (https://phabricator.wikimedia.org/T270173) (owner: 10Elukey) [15:16:21] !log roll restart druid broker on druid-public to pick up new settings [15:16:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:28] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more 
units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] "LGTM, merging. Thanks!" [dns] - 10https://gerrit.wikimedia.org/r/663836 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [15:18:07] (03PS2) 10Filippo Giunchedi: thanos: bump limit open files for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/663840 [15:22:00] !log rolling reboot of alert[12]001 hosts for updates [15:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:55] (03CR) 10CDanis: [C: 03+2] thanos: bump limit open files for thanos-store [puppet] - 10https://gerrit.wikimedia.org/r/663840 (owner: 10Filippo Giunchedi) [15:26:11] godog: whoops accidentally hit +2 instead of +1. you merge at your leisure :) [15:26:23] 10SRE, 10Graphoid, 10Projects-Cleanup, 10serviceops, and 3 others: Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (10DannyS712) [15:26:34] cdanis: haha! thank you sir, appreciate it [15:28:15] !log mforns@deploy1001 Started deploy [analytics/refinery@9cd1297]: Fix for data quality alarms after BigTop migration [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] [15:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:20] icinga down? [15:28:30] expected I think, see SAL [15:28:42] ah, cool then [15:28:43] ? [15:28:47] I got worried [15:28:57] ok [15:29:15] whew thanks [15:29:27] I guess there's no way to downtime that :P [15:29:59] sorry for the noise :/ yeah we ought to push that timeout to be longer than a normal host reboot [15:30:20] rzl: ye olde "crontab comment to downtime" trick perhaps [15:36:21] yeah, there's a way to downtime the external check: https://wikitech.wikimedia.org/wiki/Wikitech-static#Meta-monitoring [15:49:52] it looks like the VO incident hasn't auto-resolved -- it's supposed to, right? 
feels like that's been happening a lot, maybe my imagination [15:51:00] yeah looking into it as well, it should yes [15:51:06] thanks herron [15:52:47] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:07] looking... [15:55:17] RECOVERY - Check systemd state on lists1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] README: line wrapping for easier source reading [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/663322 (owner: 10BryanDavis) [15:57:46] (03Merged) 10jenkins-bot: README: line wrapping for easier source reading [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/663322 (owner: 10BryanDavis) [15:58:11] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Clsuter for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Vgutierrez) p:05Triage→03Medium a:03Vgutierrez I'm unable to find a Wikitech account for the provided email address, @ChristineDeKock let me know if you... 
[15:59:07] (03PS1) 10BBlack: VCL: log and clear 5XX Set-Cookie headers [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) [16:00:14] (03PS1) 10Vgutierrez: admin: Add brennen to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/663846 (https://phabricator.wikimedia.org/T274601) [16:01:17] (03PS2) 10Vgutierrez: admin: Add brennen to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/663846 (https://phabricator.wikimedia.org/T274601) [16:01:28] 10SRE, 10observability: Icinga meta monitoring pages during icinga host reboots - https://phabricator.wikimedia.org/T274662 (10herron) p:05Triage→03Medium [16:02:45] 10SRE, 10Wikimedia-Logstash, 10observability, 10Patch-For-Review, 10User-herron: Onboard at least 10 new non-sensitive log producers to the logging pipeline - https://phabricator.wikimedia.org/T205852 (10hashar) [16:04:48] (03PS3) 10Effie Mouzeli: hieradata: enable onhost memcached socket use on mwdebug1003 [puppet] - 10https://gerrit.wikimedia.org/r/663796 (https://phabricator.wikimedia.org/T273115) [16:05:42] (03CR) 10Klausman: Add etcd role for ML Team's new clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:07:12] !log mforns@deploy1001 Finished deploy [analytics/refinery@9cd1297]: Fix for data quality alarms after BigTop migration [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] (duration: 38m 56s) [16:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:42] (03PS1) 10Vgutierrez: admin: Add Pablo Aragon (paragon) user [puppet] - 10https://gerrit.wikimedia.org/r/663849 (https://phabricator.wikimedia.org/T274631) [16:07:44] (03PS1) 10Vgutierrez: admin: Add paragon to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/663850 (https://phabricator.wikimedia.org/T274631) [16:07:55] !log mforns@deploy1001 Started deploy [analytics/refinery@9cd1297] 
(thin): Fix for data quality alarms after BigTop migration THIN [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] [16:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:02] !log mforns@deploy1001 Finished deploy [analytics/refinery@9cd1297] (thin): Fix for data quality alarms after BigTop migration THIN [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] (duration: 00m 06s) [16:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Research Scientist (Paragon) - https://phabricator.wikimedia.org/T274631 (10Vgutierrez) p:05Triage→03Medium patches ready, waiting for @leila's confirmation [16:08:31] !log mforns@deploy1001 Started deploy [analytics/refinery@9cd1297] (hadoop-test): Fix for data quality alarms after BigTop migration TEST [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] [16:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:25] RECOVERY - cassandra service on maps2007 is OK: OK - cassandra is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:11:20] (03CR) 10Elukey: Add etcd role for ML Team's new clusters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:11:38] !log joining maps2007 to cassandra cluster [16:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:11:59] (03CR) 10CDanis: VCL: log and clear 5XX Set-Cookie headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) (owner: 10BBlack) [16:12:35] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:12:37] !log mforns@deploy1001 Finished deploy 
[analytics/refinery@9cd1297] (hadoop-test): Fix for data quality alarms after BigTop migration TEST [analytics/refinery@9cd129764edbac04b192c922ec0a975bc47455a5] (duration: 04m 05s) [16:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:45] (03PS7) 10Hnowlan: tegola: Add docker image. [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) [16:15:08] (03CR) 10Elukey: "Checked via pcc and there seems to be an error https://puppet-compiler.wmflabs.org/compiler1002/28047/ml-etcd1001.eqiad.wmnet/change.ml-et" [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:15:27] (03CR) 10Hnowlan: tegola: Add docker image. (033 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [16:16:32] (03CR) 10Elukey: "Ah right nevermind, it is of course the missing fake cert for the puppet compiler. We add fake secrets in https://gerrit.wikimedia.org/r/a" [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:17:00] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Mvolz) >>! In T273741#6820298, @Joe wrote: >>>! In T273741#6816099, @Joe wrote: >>>>! In T273741#6815874, @Majavah wrote:... [16:17:03] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Great job" [software/benchmw] - 10https://gerrit.wikimedia.org/r/661808 (owner: 10Legoktm) [16:17:37] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:17:53] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/663846 (https://phabricator.wikimedia.org/T274601) (owner: 10Vgutierrez) [16:18:39] (03CR) 10Elukey: "Let's create the certs in the private repo via cergen (plus the public part in operations/puppet) then we should be ready to go. I'll let " [puppet] - 10https://gerrit.wikimedia.org/r/663200 (https://phabricator.wikimedia.org/T273071) (owner: 10Klausman) [16:20:37] (03CR) 10Vgutierrez: [C: 03+2] admin: Add brennen to gerrit-root [puppet] - 10https://gerrit.wikimedia.org/r/663846 (https://phabricator.wikimedia.org/T274601) (owner: 10Vgutierrez) [16:21:46] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-brennen: Requesting access to gerrit1001/gerrit1002 for brennen - https://phabricator.wikimedia.org/T274601 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [16:26:17] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-druid1001.eqiad.wmnet [16:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:12] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-druid1001.eqiad.wmnet [16:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:27] (03CR) 10BBlack: VCL: log and clear 5XX Set-Cookie headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) (owner: 10BBlack) [16:30:54] (03PS2) 10BBlack: VCL: log and clear 5XX Set-Cookie headers [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) [16:31:00] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) >>! 
In T265138#6817957, @jbond wrote: > @Ladsgroup This could well be to do with how puppetlabs defines core type however it has definitel... [16:33:03] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Dzahn) **All hiera() lookups removed across the repo.** [16:34:07] RECOVERY - cassandra CQL 10.192.32.46:9042 on maps2007 is OK: TCP OK - 0.033 second response time on 10.192.32.46 port 9042 https://phabricator.wikimedia.org/T93886 [16:34:12] (03PS1) 10Andrew Bogott: Remove wmcs-region-migrate-quotas.py [puppet] - 10https://gerrit.wikimedia.org/r/663852 [16:34:14] (03PS1) 10Andrew Bogott: wmcs-cold-migrate.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663853 (https://phabricator.wikimedia.org/T239584) [16:34:16] (03PS1) 10Andrew Bogott: wmcs-region-migrate.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663854 (https://phabricator.wikimedia.org/T239584) [16:34:18] (03PS1) 10Andrew Bogott: wmcs-region-migrate-security-groups.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) [16:35:34] (03PS1) 10Andrew Bogott: wmf_sink: use password class from keystoneauth1 instead of keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/663856 (https://phabricator.wikimedia.org/T239584) [16:35:40] (03CR) 10jerkins-bot: [V: 04-1] wmcs-region-migrate-security-groups.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [16:36:04] (03PS1) 10RobH: aqs101[0-5] puppet repo updates [puppet] - 10https://gerrit.wikimedia.org/r/663857 (https://phabricator.wikimedia.org/T267414) [16:36:42] (03PS2) 10Andrew Bogott: wmcs-region-migrate-security-groups.py: use 
keystoneauth1 instead of keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) [16:36:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10RobH) [16:37:19] (03CR) 10Andrew Bogott: [C: 03+2] Remove wmcs-region-migrate-quotas.py [puppet] - 10https://gerrit.wikimedia.org/r/663852 (owner: 10Andrew Bogott) [16:37:25] 10SRE, 10Commons, 10Traffic, 10Patch-For-Review: Investigate unusual media traffic pattern for AsterNovi-belgii-flower-1mb.jpg on Commons - https://phabricator.wikimedia.org/T273741 (10Dzahn) >>! In T273741#6826701, @Mvolz wrote: > The sort of suggests the developer community needs an example image to use... [16:37:34] (03CR) 10jerkins-bot: [V: 04-1] wmcs-region-migrate-security-groups.py: use keystoneauth1 instead of keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [16:38:12] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-client1001.eqiad.wmnet [16:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:41] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:38:47] (03PS3) 10Andrew Bogott: wmcs-region-migrate-security-groups: use keystoneauth1 for password class [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) [16:38:54] (03CR) 10Andrew Bogott: [C: 03+2] wmf_sink: use password class from keystoneauth1 instead of keystoneclient [puppet] - 10https://gerrit.wikimedia.org/r/663856 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [16:43:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-client1001.eqiad.wmnet [16:43:11] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:47] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:43:51] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-test-ui1001.eqiad.wmnet [16:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-test-ui1001.eqiad.wmnet [16:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:08] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-tool1005.eqiad.wmnet [16:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:53] (03PS1) 10Jgreen: add A and PTR records for frqueue100[34] [dns] - 10https://gerrit.wikimedia.org/r/663859 (https://phabricator.wikimedia.org/T266365) [16:50:06] (03CR) 10Jgreen: [C: 03+2] add A and PTR records for frqueue100[34] [dns] - 10https://gerrit.wikimedia.org/r/663859 (https://phabricator.wikimedia.org/T266365) (owner: 10Jgreen) [16:51:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) [16:53:07] (03CR) 10Giuseppe Lavagetto: [C: 03+1] tegola: Add docker image. 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/654662 (https://phabricator.wikimedia.org/T270170) (owner: 10Hnowlan) [16:53:33] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] python3: move to subdir in preparation for Buster variant [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663345 (https://phabricator.wikimedia.org/T274435) (owner: 10BryanDavis) [16:53:42] (03PS1) 10Hnowlan: mtail: add exception handling in tests for non-Debian OSes [puppet] - 10https://gerrit.wikimedia.org/r/663860 [16:54:38] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] python3-buster: Base image for python 3.7 projects [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663346 (https://phabricator.wikimedia.org/T274435) (owner: 10BryanDavis) [16:55:29] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Clsuter for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) [16:55:32] (03CR) 10CDanis: [C: 03+1] VCL: log and clear 5XX Set-Cookie headers [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) (owner: 10BBlack) [16:57:01] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10jbond) >>! In T265138#6826739, @Dzahn wrote: > **All hiera() lookups removed across the repo.** Amazing thanks for all the effort 💃🏻 🎉 [16:57:27] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Clsuter for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) Hi @Vgutierrez . I have updated the email to reflect that of my Wikitech account. Thank you! 
[16:58:25] (03PS1) 10Giuseppe Lavagetto: python3-buster: use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663861 [16:58:45] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] python3-buster: use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663861 (owner: 10Giuseppe Lavagetto) [16:59:28] (03PS2) 10Giuseppe Lavagetto: python3-buster: use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663861 [16:59:51] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] python3-buster: use seed_image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/663861 (owner: 10Giuseppe Lavagetto) [17:00:23] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Clsuter for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10ChristineDeKock) [17:00:47] (03CR) 10Bstorm: wmf_sink: use password class from keystoneauth1 instead of keystoneclient (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663856 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [17:01:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-tool1005.eqiad.wmnet [17:01:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] VCL: log and clear 5XX Set-Cookie headers [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) (owner: 10BBlack) [17:08:06] (03PS1) 10Elukey: sre.presto.roll-restart-workers: move to class api [cookbooks] - 10https://gerrit.wikimedia.org/r/663863 (https://phabricator.wikimedia.org/T269925) [17:08:37] !log elukey@cumin1001 START - Cookbook sre.presto.reboot-workers for Presto analytics cluster: Reboot Presto nodes - elukey@cumin1001 [17:08:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:32] !log cp*: disable puppet ahead of 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/663845 [17:13:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:26] (03PS1) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [17:16:52] (03PS7) 10CRusnov: install_server/dhcp: dhcpd.conf include mechanism support machinery [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) [17:16:54] (03CR) 10Andrew Bogott: wmf_sink: use password class from keystoneauth1 instead of keystoneclient (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663856 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [17:18:02] (03CR) 10Andrew Bogott: [C: 03+1] "I don't have anything to migrate right now (thanks to Ceph, migration is a pretty rare event these days) but I hacked up a version that ju" [puppet] - 10https://gerrit.wikimedia.org/r/663853 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [17:18:33] (03CR) 10Andrew Bogott: "We aren't doing region migrations these days so this script is prone to rot; no good way to test this patch but equivalent patches have wo" [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [17:18:43] PROBLEM - Check systemd state on search-loader2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:19:08] (03CR) 10Andrew Bogott: [C: 03+1] "We aren't doing region migrations these days so this script is prone to rot; no good way to test this patch but equivalent patches have wo" [puppet] - 10https://gerrit.wikimedia.org/r/663854 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [17:19:30] (03CR) 10BBlack: [C: 03+2] VCL: log and clear 5XX Set-Cookie headers [puppet] - 10https://gerrit.wikimedia.org/r/663845 (https://phabricator.wikimedia.org/T274514) (owner: 10BBlack) [17:19:43] (03CR) 10Elukey: "manifests/site.pp:node 'labsdb1012.eqiad.wmnet'{" [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) (owner: 10Razzi) [17:20:44] (03PS1) 10Effie Mouzeli: memcached::instance: Add support for memcached 1.6.x [puppet] - 10https://gerrit.wikimedia.org/r/663868 (https://phabricator.wikimedia.org/T270315) [17:23:23] !log cp*: re-enabling puppet after successful agent run on one host as a test! 
[17:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:06] (03PS2) 10Andrew Bogott: wmcs-cold-migrate.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663853 (https://phabricator.wikimedia.org/T239584) [17:26:08] (03PS2) 10Andrew Bogott: wmcs-region-migrate.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663854 (https://phabricator.wikimedia.org/T239584) [17:26:10] (03PS4) 10Andrew Bogott: wmcs-region-migrate-security-groups: use keystoneauth1 for password class [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) [17:26:12] (03PS1) 10Andrew Bogott: nova_fullstack_test: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663869 (https://phabricator.wikimedia.org/T239584) [17:26:14] (03PS1) 10Andrew Bogott: labs-ip-alias-dump.py: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663870 (https://phabricator.wikimedia.org/T239584) [17:26:16] (03PS1) 10Andrew Bogott: mwopenstackclients: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663871 (https://phabricator.wikimedia.org/T239584) [17:26:18] (03PS1) 10Andrew Bogott: prometheus-labs-targets: Replace use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663872 (https://phabricator.wikimedia.org/T239584) [17:28:09] (03CR) 10jerkins-bot: [V: 04-1] mwopenstackclients: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663871 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [17:28:44] (03CR) 10jerkins-bot: [V: 04-1] prometheus-labs-targets: Replace use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663872 (https://phabricator.wikimedia.org/T239584) (owner: 
10Andrew Bogott) [17:30:15] (03PS2) 10Andrew Bogott: mwopenstackclients: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663871 (https://phabricator.wikimedia.org/T239584) [17:30:17] (03PS2) 10Andrew Bogott: prometheus-labs-targets: Replace use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663872 (https://phabricator.wikimedia.org/T239584) [17:33:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto analytics cluster: Reboot Presto nodes - elukey@cumin1001 [17:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:34] 10SRE, 10Performance-Team, 10serviceops, 10Patch-For-Review, 10User-jijiki: Enable "/*/mw-with-onhost-tier/" route for MediaWiki where safe - https://phabricator.wikimedia.org/T264604 (10jijiki) @aaron now that T252564 has been unblocked, after I finish with T273115, I think we should proceed with movin... [17:44:23] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663873 [17:45:16] (03PS1) 10Jgreen: nsca_frack_cfg.erb: remove frqueue1002, add frqueue100[34] [puppet] - 10https://gerrit.wikimedia.org/r/663874 (https://phabricator.wikimedia.org/T266365) [18:00:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) [18:00:42] (03CR) 10Jgreen: [C: 03+2] nsca_frack_cfg.erb: remove frqueue1002, add frqueue100[34] [puppet] - 10https://gerrit.wikimedia.org/r/663874 (https://phabricator.wikimedia.org/T266365) (owner: 10Jgreen) [18:08:17] (03CR) 10Clarakosi: [C: 03+1] "Will deploy next week" [deployment-charts] - 10https://gerrit.wikimedia.org/r/663873 (owner: 10PipelineBot) [18:08:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install frqueue100[34] - 
https://phabricator.wikimedia.org/T266365 (10Jgreen) 05Open→03Resolved [18:10:50] (03PS1) 10Cwhite: profile: add gerrit log duplication and ecs mutations [puppet] - 10https://gerrit.wikimedia.org/r/663876 (https://phabricator.wikimedia.org/T234565) [18:12:51] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:12:59] PROBLEM - tilerator on maps2007 is CRITICAL: connect to address 10.192.32.46 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [18:14:53] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:15:55] (03CR) 10CRusnov: "pcc output: https://puppet-compiler.wmflabs.org/compiler1002/28052/" [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [18:16:21] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:25:06] (03PS1) 10Jcrespo: dbbackups: disable all ES db bacula runs until next week [puppet] - 10https://gerrit.wikimedia.org/r/663877 (https://phabricator.wikimedia.org/T79922) [18:27:28] !log ebernhardson@deploy1001 Started deploy [wikimedia/discovery/analytics@98264b8]: airflow: review and correct usage of catchup=False [18:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:00] (03PS3) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [18:29:27] (03CR) 10jerkins-bot: [V: 04-1] Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [18:30:39] !log ebernhardson@deploy1001 Finished deploy [wikimedia/discovery/analytics@98264b8]: airflow: review and correct usage of catchup=False (duration: 03m 10s) [18:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:33] (03PS4) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [18:35:27] (03PS1) 10Elukey: sre.hosts.decommission: move to class API [cookbooks] - 10https://gerrit.wikimedia.org/r/663878 [18:35:46] !log mwdebug2002 now a buster VM; you can find a .tar.gz in your home dir with the contents of your previous home [18:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:40:42] (03CR) 10RobH: [C: 03+2] aqs101[0-5] puppet repo updates [puppet] - 10https://gerrit.wikimedia.org/r/663857 (https://phabricator.wikimedia.org/T267414) (owner: 10RobH) [18:53:46] (03PS5) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 
(https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [18:56:05] PROBLEM - Host ms-be1034.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [18:57:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by robh on cumin1001.eqiad.wmnet for hosts: ` ['aqs1010.eqiad.wmnet', 'aqs1011.eqiad.wmnet', 'aqs1012.eqiad... [18:59:17] (03PS6) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [19:02:35] !log rebooting and reimaging mwdebug2001 to buster T274023 [19:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:40] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023 [19:02:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug2001.codfw.wmnet with reason: OS upgrade [19:02:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug2001.codfw.wmnet with reason: OS upgrade [19:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:28] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:06:05] !log milimetric@deploy1001 Started deploy [analytics/refinery@366962f]: Fix for mediarequest per file cassandra job [19:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:11:06] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE [19:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:02] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE [19:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:12] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1010.eqiad.wmnet with reason: REIMAGE [19:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:51] (03CR) 10Mforns: [C: 03+1] "LGTM! 
Exciting :]" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [19:15:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE [19:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:11] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1011.eqiad.wmnet with reason: REIMAGE [19:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:16:36] (03CR) 10Mforns: [C: 03+1] "> Patch Set 1: Code-Review-2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [19:17:19] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1012.eqiad.wmnet with reason: REIMAGE [19:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:03] !log milimetric@deploy1001 Finished deploy [analytics/refinery@366962f]: Fix for mediarequest per file cassandra job (duration: 11m 58s) [19:18:05] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE [19:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:08] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aqs1013.eqiad.wmnet with reason: REIMAGE [19:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:06] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE [19:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:24] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1353.eqiad.wmnet with reason: 
REIMAGE [19:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1358.eqiad.wmnet with reason: REIMAGE [19:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:28] (03CR) 10Mholloway: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [19:26:46] (03CR) 10Mforns: [C: 03+1] "> Patch Set 1: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662767 (https://phabricator.wikimedia.org/T274172) (owner: 10Mholloway) [19:27:05] !log milimetric@deploy1001 Started deploy [analytics/refinery@e0c09a2]: Fix for mediarequest per file cassandra job - 2 [19:27:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:06] !log robh@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on aqs1015.eqiad.wmnet with reason: REIMAGE [19:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:24] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1353.eqiad.wmnet with reason: REIMAGE [19:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:14] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1358.eqiad.wmnet with reason: REIMAGE [19:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:34] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10wiki_willy) 05Declined→03Open a:03Jclark-ctr [19:30:56] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10wiki_willy) Hi @fgiunchedi - @Jclark-ctr is going to use some parts from decommissioned servers to try and get the server back up. 
Thanks, Willy [19:34:22] !log Train status: Rolling back commonswiki to wmf.27 due to T274589 (refs T271344) [19:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:30] T274589: No atomic section is open (got LocalFile::lockingTransaction) - https://phabricator.wikimedia.org/T274589 [19:34:30] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344 [19:34:35] RECOVERY - Host ms-be1034.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.74 ms [19:37:38] thanks twentyafterfour (I guess I'm not really here... late on a Friday evening. but thanks in any case.) [19:37:43] RECOVERY - Host ms-be1034 is UP: PING WARNING - Packet loss = 33%, RTA = 0.25 ms [19:37:50] :) [19:38:05] !log milimetric@deploy1001 Finished deploy [analytics/refinery@e0c09a2]: Fix for mediarequest per file cassandra job - 2 (duration: 11m 01s) [19:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:35] (03PS1) 1020after4: roll back commonswiki to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663884 [19:38:39] !log milimetric@deploy1001 Started deploy [analytics/refinery@e0c09a2] (thin): Fix for mediarequest per file cassandra job - 2 [19:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:38:46] !log milimetric@deploy1001 Finished deploy [analytics/refinery@e0c09a2] (thin): Fix for mediarequest per file cassandra job - 2 (duration: 00m 06s) [19:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:24] (03CR) 1020after4: [C: 03+2] roll back commonswiki to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663884 (owner: 1020after4) [19:40:12] (03Merged) 10jenkins-bot: roll back commonswiki to 1.36.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663884 (owner: 1020after4) [19:40:43] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 
(10Jclark-ctr) @fgiunchedi Was able to get server to boot with minimal configurations 1cpu 1 dimm. swapped both cpu's (so they will be matching speed) with a recently decommissioned server (ms-be1018) reinstalled... [19:41:50] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10wiki_willy) Nice work @Jclark-ctr, much appreciated. >>! In T274488#6827226, @Jclark-ctr wrote: > @fgiunchedi Was able to get server to boot with minimal configurations 1cpu 1 dimm. > > swapped both cpu's (so t... [19:42:20] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10Jclark-ctr) 05Open→03Resolved [19:42:59] !log mwdebug2001 now on buster - mwdebug1003 rebooting and reimaging to stretch [19:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:21] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: roll back commonswiki to 1.36.0-wmf.27 due to T274589 [19:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:25] T274589: No atomic section is open (got LocalFile::lockingTransaction) - https://phabricator.wikimedia.org/T274589 [19:44:52] 10SRE, 10ops-eqiad: ms-be1038 NIC link down - https://phabricator.wikimedia.org/T274622 (10wiki_willy) a:03Jclark-ctr [19:46:52] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug1003.eqiad.wmnet with reason: OS upgrade [19:46:52] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug1003.eqiad.wmnet with reason: OS upgrade [19:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:31] PROBLEM - very high load average likely xfs on ms-be1034 is CRITICAL: CRITICAL - load average: 110.20, 100.55, 55.55 https://wikitech.wikimedia.org/wiki/Swift [19:51:40] (03CR) 10Jcrespo: [C: 03+2] 
dbbackups: disable all ES db bacula runs until next week [puppet] - 10https://gerrit.wikimedia.org/r/663877 (https://phabricator.wikimedia.org/T79922) (owner: 10Jcrespo) [19:52:26] PROBLEM - Host mw1353 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:26] PROBLEM - Host mw1358 is DOWN: PING CRITICAL - Packet loss = 100% [19:52:57] ^ failed downtime from wmf-reimage [19:53:04] fixing [19:53:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw1353.eqiad.wmnet with reason: OS upgrade [19:53:35] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1353.eqiad.wmnet with reason: OS upgrade [19:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:43] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mw1358.eqiad.wmnet with reason: OS upgrade [19:53:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mw1358.eqiad.wmnet with reason: OS upgrade [19:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:08] (03PS7) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [19:56:11] !log mwdebug2002 - restart memcached [19:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:18] RECOVERY - Host ms-be1038 is UP: PING OK - Packet loss = 0%, RTA = 0.15 ms [19:58:34] 10SRE, 10ops-eqiad: ms-be1038 NIC link down - https://phabricator.wikimedia.org/T274622 (10Jclark-ctr) @fgiunchedi replaced failed SFP on switch [19:58:47] 10SRE, 10ops-eqiad: ms-be1038 NIC link down - https://phabricator.wikimedia.org/T274622 (10Jclark-ctr) 05Open→03Resolved [20:05:28] 10SRE, 
10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['aqs1014.eqiad.wmnet'] ` Of which those **FAILED**: ` ['aqs1014.eqiad.wmnet'] ` [20:07:06] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10RobH) [20:07:50] RECOVERY - Host mw1353 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [20:07:50] RECOVERY - Host mw1358 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [20:08:21] 19:43:16 /usr/bin/sudo -u root -- /usr/local/sbin/check-and-restart-php php7.2-fpm 100 on mwdebug2001.codfw.wmnet returned [7]: [20:08:23] 19:43:21 1 hosts had failures restarting php-fpm [20:08:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1353.eqiad.wmnet'] ` an... [20:09:36] RECOVERY - very high load average likely xfs on ms-be1034 is OK: OK - load average: 69.78, 78.24, 79.65 https://wikitech.wikimedia.org/wiki/Swift [20:09:39] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1353.eqiad.wmnet [20:09:40] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1358.eqiad.wmnet'] ` an... [20:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:44] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10RobH) a:05RobH→03Cmjohnson All hosts except aqs1014 imaged and set to staged. 
aqs1014 has an unseated or bad patch cable for its production network link: ` Broadcom UNDI PXE-2.1... [20:09:58] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1358.eqiad.wmnet [20:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:34] twentyafterfour: I think mutante just finished reimaging it [20:10:46] 19:42 mutante: mwdebug2001 now on buster - mwdebug1003 rebooting and reimaging to stretch [20:13:06] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:07] !log mwdebug2001 - restarted memcached [20:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:17:28] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:18:06] twentyafterfour: fixed. I manually ran the same check-and-restart-php php7.2-fpm [20:18:23] it's fresh on buster. copying home dir backups [20:25:29] jouncebot: now [20:25:29] For the next 11 hour(s) and 34 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210212T0800) [20:28:20] PROBLEM - Check systemd state on alert1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:10] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1358.eqiad.wmnet [20:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:19] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1353.eqiad.wmnet [20:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:16] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:31:24] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:32:15] !log mw1353, mw1358 - scap pull, repooled [20:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:57] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade [20:35:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on mwdebug1002.eqiad.wmnet with reason: OS upgrade [20:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:57] !log mwdebug1003 now on buster - mwdebug1002 rebooting and reimaging to buster [20:37:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:59] (03PS8) 10ArielGlenn: Generation of json dumps for wikimedia commons [puppet] - 10https://gerrit.wikimedia.org/r/629121 (https://phabricator.wikimedia.org/T259067) (owner: 10Cparle) [20:48:09] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1356.eqiad.wmnet with reason: REIMAGE [20:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1357.eqiad.wmnet with reason: REIMAGE [20:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1356.eqiad.wmnet with reason: REIMAGE [20:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:13] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1357.eqiad.wmnet with reason: REIMAGE [20:52:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:52] (03CR) 10Dzahn: [C: 03+2] Revert "mwdebug: allow rsyncing home dirs [puppet] - 10https://gerrit.wikimedia.org/r/663401 (owner: 10Dzahn) [21:01:12] 
mwdebug1002 not rebooting properly (VM), the ones in codfw had no issues.. hrmm [21:01:36] so far simply takes a long time and no output on console [21:01:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10Jgreen) @wiki_willy @jclark-ctr we're done with frqueue1002 and can be decommed and removed {T274671}. When you're ready start on payments boxes, we can also shut down payments1004... [21:02:42] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install payments100[5-8] - https://phabricator.wikimedia.org/T266481 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr [21:03:54] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10wiki_willy) a:05Cmjohnson→03Jclark-ctr Moving over to @Jclark-ctr to receive and replace the memory, since @Cmjohnson is out on vacation next week. Thanks, Willy [21:04:45] (03PS3) 10Andrew Bogott: wmcs-cold-migrate.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663853 (https://phabricator.wikimedia.org/T239584) [21:04:47] (03PS3) 10Andrew Bogott: wmcs-region-migrate.py: use keystoneauth1 instead of keystoneclient for auth [puppet] - 10https://gerrit.wikimedia.org/r/663854 (https://phabricator.wikimedia.org/T239584) [21:04:49] (03PS5) 10Andrew Bogott: wmcs-region-migrate-security-groups: use keystoneauth1 for password class [puppet] - 10https://gerrit.wikimedia.org/r/663855 (https://phabricator.wikimedia.org/T239584) [21:04:51] (03PS2) 10Andrew Bogott: nova_fullstack_test: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663869 (https://phabricator.wikimedia.org/T239584) [21:04:53] (03PS2) 10Andrew Bogott: labs-ip-alias-dump.py: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663870 (https://phabricator.wikimedia.org/T239584) [21:04:55] (03PS3) 10Andrew 
Bogott: prometheus-labs-targets: Replace use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663872 (https://phabricator.wikimedia.org/T239584) [21:04:57] (03PS3) 10Andrew Bogott: mwopenstackclients: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663871 (https://phabricator.wikimedia.org/T239584) [21:08:03] (03CR) 10Andrew Bogott: [C: 03+2] mwopenstackclients: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663871 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [21:08:11] (03PS4) 10Andrew Bogott: mwopenstackclients: Replace another use of keystoneclient with keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663871 (https://phabricator.wikimedia.org/T239584) [21:15:46] (03CR) 10Dzahn: mailman3: Start apache2 for web (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/657950 (https://phabricator.wikimedia.org/T256542) (owner: 10Ladsgroup) [21:15:47] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1357.eqiad.wmnet'] ` an... [21:17:46] 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Research Intern (ChristineDeKock) - https://phabricator.wikimedia.org/T274304 (10Aklapper) [21:18:05] (03Abandoned) 10Dzahn: gerrit: split replica hosts into separate role/profile [puppet] - 10https://gerrit.wikimedia.org/r/649752 (owner: 10Dzahn) [21:18:21] (03Abandoned) 10Dzahn: gerrit: drop is_replica and replica_hosts after splitting roles [puppet] - 10https://gerrit.wikimedia.org/r/651821 (owner: 10Dzahn) [21:23:31] (03CR) 10Dzahn: [C: 03+1] "I am not sure there is a path forward here. 
Paladox and myself tried to help by making it possible to use Gerrit in cloud but there are th" [puppet] - 10https://gerrit.wikimedia.org/r/641778 (owner: 10Paladox) [21:26:16] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1357.eqiad.wmnet [21:26:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:57] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:31:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1356.eqiad.wmnet'] ` an... [21:54:24] (03PS2) 10Razzi: Remove labsdb1012 from puppet in preparation for rename [puppet] - 10https://gerrit.wikimedia.org/r/663865 (https://phabricator.wikimedia.org/T269211) [21:57:51] (03PS3) 10Krinkle: Reword wmfEtcdApplyDBConfig() comments to better match those in LBFactoryMulti [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658473 (owner: 10Aaron Schulz) [21:57:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1348.eqiad.wmnet with reason: REIMAGE [21:58:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:07] (03CR) 10Krinkle: [C: 03+2] Reword wmfEtcdApplyDBConfig() comments to better match those in LBFactoryMulti [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658473 (owner: 10Aaron Schulz) [21:59:47] (03Merged) 10jenkins-bot: Reword wmfEtcdApplyDBConfig() comments to better match those in LBFactoryMulti [mediawiki-config] - 10https://gerrit.wikimedia.org/r/658473 (owner: 10Aaron Schulz) [22:00:00] !log 
dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1348.eqiad.wmnet with reason: REIMAGE [22:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:30] * Krinkle testing on mwdebug1003 [22:01:40] mutante: I understand mwdebug1002 is down right now, expected right? [22:02:13] Krinkle: it broke unexpectedly while reimaging it [22:02:30] ok, well, anyway, np [22:03:07] Krinkle: beware 1003 is stretch right now, depending what you want [22:03:13] 2001 and 2002 are already buster [22:03:25] I will recreate a new 1002 on buster [22:03:41] I'll test on 1001 for now [22:03:46] I see 1003 fingerprint changed since last week [22:04:02] so I'll skip that for now since I haven't got the new info on that yet [22:05:12] that's good, the mail would be out about now if it wasn't for the unexpected issue [22:09:02] (03CR) 10Krinkle: [C: 03+2] PoolCounter.php: Swap stringified class for ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663367 (owner: 10Reedy) [22:09:51] (03Merged) 10jenkins-bot: PoolCounter.php: Swap stringified class for ::class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663367 (owner: 10Reedy) [22:15:11] !log krinkle@deploy1001 Synchronized wmf-config/etcd.php: b3447343a cleanup (duration: 05m 20s) [22:15:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:41] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... 
[22:26:34] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100%
[22:31:05] (CR) Hashar: [C: -1] "> Patch Set 12:" [puppet] - https://gerrit.wikimedia.org/r/641778 (owner: Paladox)
[22:32:08] !log krinkle@deploy1001 Synchronized wmf-config/PoolCounterSettings.php: Idc385de0 cleanup (duration: 05m 14s)
[22:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:42:39] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1281.eqiad.wmnet with reason: REIMAGE
[22:42:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:44:43] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1281.eqiad.wmnet with reason: REIMAGE
[22:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:43] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1282.eqiad.wmnet with reason: REIMAGE
[22:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:48] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1282.eqiad.wmnet with reason: REIMAGE
[22:47:50] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1283.eqiad.wmnet with reason: REIMAGE
[22:47:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:48:45] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1284.eqiad.wmnet with reason: REIMAGE
[22:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:49:50] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1283.eqiad.wmnet with reason: REIMAGE
[22:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:51:19] (PS1) CDanis: Add prepending to esams/knams transits [homer/public] - https://gerrit.wikimedia.org/r/663745
[22:51:37] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1348.eqiad.wmnet'] ` an...
[22:51:48] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1284.eqiad.wmnet with reason: REIMAGE
[22:51:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:52:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[22:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:02:16] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[23:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:07:33] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:10:23] (PS1) PipelineBot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663908
[23:10:42] (CR) Ladsgroup: "Can anyone review this please 🥺" [puppet] - https://gerrit.wikimedia.org/r/657950 (https://phabricator.wikimedia.org/T256542) (owner: Ladsgroup)
[23:11:20] (CR) Dduvall: [C: +2] blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663908 (owner: PipelineBot)
[23:11:47] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:12:53] (Merged) jenkins-bot: blubberoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663908 (owner: PipelineBot)
[23:14:06] !log dduvall@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' .
[23:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:03] (CR) Legoktm: [V: +2 C: +2] Clean up and make more re-usable [software/benchmw] - https://gerrit.wikimedia.org/r/661808 (owner: Legoktm)
[23:24:37] !log dduvall@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[23:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:25:02] SRE, ops-eqiad, DC-Ops: update hostname labels on logstash103[345] & db11[51-76] - https://phabricator.wikimedia.org/T273922 (wiki_willy)
[23:26:53] !log dduvall@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' .
[23:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:32:44] SRE, SRE-tools: sre.hosts.decomission -> generate_dns_snippets - > Cumin execution failed - https://phabricator.wikimedia.org/T274689 (Dzahn)
[23:37:31] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install logstash103[345] - https://phabricator.wikimedia.org/T267666 (RobH) Ok, in checking these hosts, all of them appear to have their network setup properly in netbox/on switch but fail media check. Since netbox even has the dac cable labels, I s...
[23:38:53] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1281.eqiad.wmnet
[23:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:09] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1221.eqiad.wmnet
[23:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:39:21] errr
[23:41:08] !log legoktm@cumin1001 conftool action : set/pooled=inactive; selector: name=mw1221.eqiad.wmnet
[23:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:34] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1356.eqiad.wmnet
[23:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:41:58] (I made a typo, mw1221 doesn't exist)
[23:42:07] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1348.eqiad.wmnet
[23:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:16] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1282.eqiad.wmnet
[23:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:27] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1283.eqiad.wmnet
[23:42:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:42:42] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1281.eqiad.wmnet', 'mw12...
[23:43:56] !log legoktm@cumin1001 conftool action : set/pooled=no; selector: name=mw1284.eqiad.wmnet
[23:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:53:57] PROBLEM - Kafka Broker Replica Max Lag on kafka-jumbo1005 is CRITICAL: 1.203e+07 ge 5e+06 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=16&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[23:54:31] ^ not a problem, will extend downtime window
[23:56:38] thanks razzi
[23:58:45] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1356.eqiad.wmnet
[23:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:59:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1348.eqiad.wmnet
[23:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log