[00:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T0000)
[00:00:04] No GERRIT patches in the queue for this window AFAICS.
[00:00:06] Krinkle: no, not yet, that kind of thing could be done though with rsync and puppet
[00:00:37] as you said, let's first worry about doc1001->1002 though
[00:00:44] and later about codfw
[00:03:24] !log milimetric@deploy1001 Started deploy [analytics/refinery@3da19b6] (thin): More fixes for jobs after cluster upgrade
[00:03:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:03:31] !log milimetric@deploy1001 Finished deploy [analytics/refinery@3da19b6] (thin): More fixes for jobs after cluster upgrade (duration: 00m 07s)
[00:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:22] (PS1) Cwhite: profile: bugfix dot_expander [puppet] - https://gerrit.wikimedia.org/r/663333
[00:12:41] (PS1) Cwhite: profile: update netdev rsyslog template to ecs 1.7.0 [puppet] - https://gerrit.wikimedia.org/r/663081
[00:23:21] (CR) Legoktm: [C: +2] Revert "profiler: Send data to excimer-buster pipeline" [mediawiki-config] - https://gerrit.wikimedia.org/r/663078 (owner: Legoktm)
[00:24:06] (Merged) jenkins-bot: Revert "profiler: Send data to excimer-buster pipeline" [mediawiki-config] - https://gerrit.wikimedia.org/r/663078 (owner: Legoktm)
[00:26:44] !log legoktm@deploy1001 Synchronized wmf-config/profiler.php: Revert "profiler: Send data to excimer-buster pipeline" (duration: 02m 00s)
[00:26:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:33:31] (CR) Legoktm: [C: +2] "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/663079 (owner: Legoktm)
[00:38:20] (PS1) Legoktm: arclamp: Actually remove the excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663336
[00:40:07] (PS2) Legoktm: arclamp: Actually remove the excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663336
[00:40:37] (CR) jerkins-bot: [V: -1] arclamp: Actually remove the excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663336 (owner: Legoktm)
[00:41:34] (PS3) Legoktm: arclamp: Actually remove the excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663336
[00:42:19] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/27999/console" [puppet] - https://gerrit.wikimedia.org/r/663336 (owner: Legoktm)
[00:43:51] (CR) Legoktm: [V: +1 C: +2] arclamp: Actually remove the excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663336 (owner: Legoktm)
[00:47:42] (PS1) Legoktm: arclamp: Remove traces of excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663340
[00:48:26] (CR) Legoktm: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28000/console" [puppet] - https://gerrit.wikimedia.org/r/663340 (owner: Legoktm)
[00:49:05] (CR) Legoktm: [V: +1 C: +2] arclamp: Remove traces of excimer-buster pipeline [puppet] - https://gerrit.wikimedia.org/r/663340 (owner: Legoktm)
[01:00:04] twentyafterfour: (Dis)respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T0100). Please do the needful.
[01:10:26] (PS1) BryanDavis: python3: move to subdir in preparation for Buster variant [docker-images/production-images] - https://gerrit.wikimedia.org/r/663345 (https://phabricator.wikimedia.org/T274435)
[01:10:28] (PS1) BryanDavis: python3-buster: Base image for python 3.7 projects [docker-images/production-images] - https://gerrit.wikimedia.org/r/663346 (https://phabricator.wikimedia.org/T274435)
[01:49:47] (CR) Ori.livneh: "Ping?" [mediawiki-config] - https://gerrit.wikimedia.org/r/597654 (https://phabricator.wikimedia.org/T253160) (owner: Ori.livneh)
[02:07:25] !log milimetric@deploy1001 Started deploy [analytics/refinery@01d811f]: Fix spelling error in mediacounts job
[02:07:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:31] !log milimetric@deploy1001 Finished deploy [analytics/refinery@01d811f]: Fix spelling error in mediacounts job (duration: 11m 06s)
[02:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:36] !log milimetric@deploy1001 Started deploy [analytics/refinery@01d811f] (thin): Fix spelling error in mediacounts job
[02:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:18:42] !log milimetric@deploy1001 Finished deploy [analytics/refinery@01d811f] (thin): Fix spelling error in mediacounts job (duration: 00m 06s)
[02:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:24:31] (PS1) Reedy: PoolCounter.php: Swap stringified class for ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/663367
[02:43:25] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:50:49] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:53] PROBLEM - MariaDB Replica Lag: s1 #page on db1134 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1468.87 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:08:34] here
[03:09:10] probably depooling it, quick check first
[03:10:06] here
[03:10:07] here
[03:10:31] o/
[03:10:45] !log depooled db1134
[03:10:49] !log rzl@cumin1001 dbctl commit (dc=all): 'depool db1134', diff saved to https://phabricator.wikimedia.org/P14310 and previous config saved to /var/cache/conftool/dbconfig/20210211-031048-rzl.json
[03:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:23] still not sure what's actually happening there, more eyes thoroughly welcome :)
[03:12:04] nothing recent on phab, opening a fresh task
[03:12:14] checking dbtree.wikimedia.org to see what that specific one is
[03:12:31] s1
[03:12:42] whatever happened there happened at 02:44 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1134&var-port=9104
[03:13:11] https://tendril.wikimedia.org/host/view/db1134.eqiad.wmnet/3306
[03:14:24] can someone put in a downtime? to expire during EU working hours tomorrow please
[03:14:26] the Todo says "if unsure, call DBA"
[03:14:36] ok, doing the downtime
[03:14:54] "if unsure, call a DBA" is under the "master comes back in read-only" section, this is much less scary :)
[03:15:01] mutante: thanks
[03:15:14] the second part of that sentence was supposed to be .. "but it's also not the master"
[03:15:20] per dbtree
[03:16:39] uptime 30.4 minutes?
[03:16:49] SRE: db1134 placeholder task - https://phabricator.wikimedia.org/T274472 (Dzahn)
[03:16:58] ACKNOWLEDGEMENT - MariaDB Replica Lag: s1 #page on db1134 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1907.47 seconds daniel_zahn https://phabricator.wikimedia.org/T274472 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[03:17:24] oh fair enough, I'll move my draft to use T274472 instead :) thanks
[03:17:25] T274472: db1134 placeholder task - https://phabricator.wikimedia.org/T274472
[03:17:48] in the SMS we received there should be a code to ACK it on victorops, wanna send that as a reply?
[03:17:58] i ACKED on Icinga
[03:18:01] already acked on VO
[03:18:22] and placeholder ticket T274472 because the downtime needs one anyways
[03:19:01] Feb 11 02:42:44 db1134 mysqld[3122]: 210211 2:42:44 [ERROR] mysqld got signal 7 ;
[03:19:38] syslog indicates memory corruption
[03:21:41] SRE: db1134 placeholder task - https://phabricator.wikimedia.org/T274472 (colewhite) ` Feb 11 02:42:44 db1134 kernel: [1694159.910376] Disabling lock debugging due to kernel taint Feb 11 02:42:44 db1134 kernel: [1694159.910497] mce: [Hardware Error]: Machine check events logged Feb 11 02:42:44 db1134 kernel:...
[03:21:51] SRE, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (RLazarus) p:Triage→High
[03:23:18] the rest of s1 looks healthy afaict https://grafana-rw.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s1&var-role=All
[03:23:26] any objections to leaving it there and standing down?
[03:23:26] ok, that took me a bit with the timezones and being specific in icinga about the 4 alerting services, but downtimes done
[03:23:37] the downtimes will expire at 10am Madrid time
[03:23:51] sgtm, thank you!
[03:24:15] for 4 services on db1134 that are all replica related. but not all the other basic checks
[03:24:41] but because we ACKed, it also means the next page would only be on the next state change.. if it comes back or becomes unknown or whatever
[03:26:09] I'm going to go ahead and resolve in VO
[03:26:10] sounds good to me to give it to DBA like that
[03:26:21] unless there is something else
[03:26:25] SRE, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (colewhite) `racadm getsel` ` Record: 11 Date/Time: 02/11/2021 01:38:37 Source: system Severity: Critical Description: Correctable memory error rate exceeded for DIMM_B3. -----------------------------------...
[03:26:58] yeah, looks from shdubsh's findings that they'll just need to get dcops in there and swap a bad dimm
[03:27:16] i see the ticket already transformed massively
[03:27:19] (CR) Krinkle: [C: +1] PoolCounter.php: Swap stringified class for ::class [mediawiki-config] - https://gerrit.wikimedia.org/r/663367 (owner: Reedy)
[03:27:43] yea, cool, thanks shdubsh
[03:28:08] that's exactly what dcops would have asked for, nod
[03:28:12] SRE, ops-eqiad, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (colewhite)
[03:28:41] i'll just add ops-eqiad right away
[03:28:53] heh, also already done
[03:29:24] thanks y'all, I'm not sure there's much else we can do tonight :)
[03:29:37] ==
[03:29:44] have a good evening all
[03:30:03] you too
[03:30:27] 👊
[03:30:42] g'night
[03:32:04] good night.
[03:37:45] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:03] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:16] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) **As of today all jobrunners/videoscalers across eqiad and codfw are all 100...
[03:54:59] SRE, serviceops, Release-Engineering-Team (Deployment services), Release-Engineering-Team-TODO, User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[04:19:11] (PS1) Andrew Bogott: prepare_cinder_volume: Improve /etc/fstab check [puppet] - https://gerrit.wikimedia.org/r/663377 (https://phabricator.wikimedia.org/T274469)
[04:20:28] (PS2) Andrew Bogott: prepare_cinder_volume: Improve /etc/fstab check [puppet] - https://gerrit.wikimedia.org/r/663377 (https://phabricator.wikimedia.org/T274469)
[04:24:12] (CR) Andrew Bogott: [C: +2] prepare_cinder_volume: Improve /etc/fstab check [puppet] - https://gerrit.wikimedia.org/r/663377 (https://phabricator.wikimedia.org/T274469) (owner: Andrew Bogott)
[04:47:45] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:50:17] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:57:45] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:00:13] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:17:19] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:19:27] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 39.22 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:19:51] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:21:57] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: (C)60 le (W)70 le 97.48 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[05:27:25] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:29:53] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:42:19] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:47:13] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:54:43] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:59:47] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:06:13] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:12:41] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:31] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:20:05] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:25:57] * kart_ deploying cxserver ..
[06:26:23] (CR) KartikMistry: [C: +2] Update cxserver to 2021-02-10-134029-production [deployment-charts] - https://gerrit.wikimedia.org/r/663213 (https://phabricator.wikimedia.org/T274133) (owner: KartikMistry)
[06:26:31] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:27:48] (Merged) jenkins-bot: Update cxserver to 2021-02-10-134029-production [deployment-charts] - https://gerrit.wikimedia.org/r/663213 (https://phabricator.wikimedia.org/T274133) (owner: KartikMistry)
[06:33:48] !log kartik@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' .
[06:33:49] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:35:46] !log kartik@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:05] !log kartik@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' .
[06:41:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:19] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:45:19] !log Updated cxserver to 2021-02-10-134029-production (T274133, T273456, T271980)
[06:45:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:45:27] T274133: [[w:ru:Янская стоянка]] is not translatable via CX - https://phabricator.wikimedia.org/T274133
[06:45:27] T271980: Create Wikipedia Altai - https://phabricator.wikimedia.org/T271980
[06:45:27] T273456: Create Wikipedia Meitei - https://phabricator.wikimedia.org/T273456
[06:48:45] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:51:11] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:03:29] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:06:00] !log powercycle thumbor1001 - no ssh, no mgmt serial tty available, no racadm getsel infos
[07:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:06:45] !log elukey@puppetmaster1001 conftool action : set/pooled=no; selector: name=thumbor1001.eqiad.wmnet
[07:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:09] it was already not pooled, but I've set it just to be sure
[07:07:53] RECOVERY - Host thumbor1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[07:07:53] PROBLEM - haproxy alive on thumbor1001 is CRITICAL: CRITICAL check_alive invalid response https://wikitech.wikimedia.org/wiki/HAProxy
[07:15:13] RECOVERY - haproxy alive on thumbor1001 is OK: OK check_alive uptime 436s https://wikitech.wikimedia.org/wiki/HAProxy
[07:23:58] (PS1) Elukey: admin: add kzeta to analytics-privatedta-users [puppet] - https://gerrit.wikimedia.org/r/663383 (https://phabricator.wikimedia.org/T272982)
[07:25:56] !log pool thumbor1001
[07:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:40] (CR) Ayounsi: [C: +2] Improve loopback DHCP term [homer/public] - https://gerrit.wikimedia.org/r/663176 (owner: Ayounsi)
[07:30:14] (Merged) jenkins-bot: Improve loopback DHCP term [homer/public] - https://gerrit.wikimedia.org/r/663176 (owner: Ayounsi)
[07:30:17] (CR) Ayounsi: "> Patch Set 2: Code-Review+1" [homer/public] - https://gerrit.wikimedia.org/r/663176 (owner: Ayounsi)
[07:35:43] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:39:27] (PS1) Legoktm: wikimedia/shellbox: Don't unconditionally allowPath( 'limit.sh' ) [vendor] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/663388 (https://phabricator.wikimedia.org/T274474)
[07:39:42] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host mc1021.eqiad.wmnet
[07:39:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:21] (CR) Legoktm: [C: +2] wikimedia/shellbox: Don't unconditionally allowPath( 'limit.sh' ) [vendor] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/663388 (https://phabricator.wikimedia.org/T274474) (owner: Legoktm)
[07:41:46] gonna deploy a UBN fix for shellbox in a little bit ^
[07:43:09] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:44:53] !log push improved loopback dhcp term to all routers
[07:44:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:45:37] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:46:12] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1021.eqiad.wmnet
[07:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:52] (PS1) Elukey: Rename the Cloudera CDH specific configs to Apache Bigtop [puppet] - https://gerrit.wikimedia.org/r/663518 (https://phabricator.wikimedia.org/T274345)
[07:51:55] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast5002.wikimedia.org
[07:51:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:59] PROBLEM - SSH on elastic2054 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:59:34] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5002.wikimedia.org
[07:59:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:45] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast4003.wikimedia.org
[08:01:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:52] (Merged) jenkins-bot: wikimedia/shellbox: Don't unconditionally allowPath( 'limit.sh' ) [vendor] (wmf/1.36.0-wmf.30) - https://gerrit.wikimedia.org/r/663388 (https://phabricator.wikimedia.org/T274474) (owner: Legoktm)
[08:07:32] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4003.wikimedia.org
[08:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:03] legoktm: since you're in sre being the deployer, shall I from platform be the sre buddy for the deploy? :-D
[08:09:26] I've been watching the task of course since it's a train blocker
[08:09:41] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1002.eqiad.wmnet
[08:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:45] :D buddies are always appreciated
[08:09:50] ok :-D
[08:09:56] I synced it to mwdebug1003 and tested it there
[08:10:01] oh good
[08:10:01] just started the sync-file everywhere
[08:10:20] I was going to queue up a test in the browser but if you've already done it...
[08:11:04] just added echo "aksdjfhsdjkfh"; to a random page and previewed it
[08:11:06] !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.30/vendor/wikimedia/shellbox/src/Command/BashWrapper.php: wikimedia/shellbox: Don't unconditionally allowPath( 'limit.sh' ) - T274474 (duration: 01m 32s)
[08:11:08] nice
[08:11:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:10] T274474: Unexpected syntax highlight error - https://phabricator.wikimedia.org/T274474
[08:11:51] ah they are done already? that was fast
[08:12:30] tested on mw.o, works too
[08:12:35] (PS2) Elukey: Rename the Cloudera CDH specific configs to Apache Bigtop [puppet] - https://gerrit.wikimedia.org/r/663518 (https://phabricator.wikimedia.org/T274345)
[08:12:42] looks ok
[08:12:48] but I don't know how to clear that api portal page
[08:12:54] action=purge is no more I think
[08:13:26] and https://en.wiktionary.org/wiki/Module:labels/data/topical works again
[08:13:38] null edit should work
[08:14:07] I don't think I have edit perms over there
[08:15:07] or it's cleverly hidden, but most likely I just don't have the rights
[08:15:56] ?action=purge worked
[08:16:04] huh I tried it and failed >_<
[08:16:11] well thanks for kicking it again
[08:16:58] and thanks for the fix!
[08:17:00] weird
[08:17:06] yeah
[08:17:11] probably a typo or something. who knows
[08:17:44] we are back to 0 train blockers again after this, right?
[08:18:15] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host bast3005.wikimedia.org [08:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:25] as far as I'm aware, there's nothing else on the phab task [08:18:32] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 22): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28002/console" [puppet] - 10https://gerrit.wikimedia.org/r/663518 (https://phabricator.wikimedia.org/T274345) (owner: 10Elukey) [08:18:35] didn't see anything alarming scroll by in IRC either [08:20:12] exceptions log seems to be full of the usual timeouts and nothing much else [08:22:41] 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10fgiunchedi) [08:24:07] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast3005.wikimedia.org [08:24:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:31] apergos: if you have a moment, I could use +2s on https://gerrit.wikimedia.org/r/q/topic:%2522shellbox-103%2522+ - it's bumping Shellbox properly for master [08:24:42] lemme look [08:28:06] legoktm: what's going on with the removal of the config*.json files? [08:29:34] apergos: see https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/662678 those files are only used for running a server instance of Shellbox, not when it's used as a library for MediaWiki. 
So we removed them from what composer will bundle [08:29:54] ah fine [08:29:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1003.eqiad.wmnet [08:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:11] ok note I am not checkig any of these hashes or anything, but yeah I'll do the merge [08:30:41] wait, I gotta understand the one substantive change (sorry) [08:31:16] https://gerrit.wikimedia.org/r/c/mediawiki/libs/Shellbox/+/663384 has the rationale [08:32:20] and thanks :)) [08:33:48] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) <3 the response, you not only did exactly with manuel would have done (depool from traffic), you also discovered the core reason why mysql failed (hw memory errors). Thank you a lot! [08:34:23] ok (and please add something to the commit message about it) [08:35:13] !log swift codfw-prod decrease HDD weight for ms-be20[16-27] - T272837 [08:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:17] T272837: Decom ms-be[2016-2027] from swift - https://phabricator.wikimedia.org/T272837 [08:36:31] apergos: I linked that commit in the message [08:36:36] good [08:38:02] I guess I need to wait for jenkins on the second one [08:38:05] silly thing [08:38:51] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1001.eqiad.wmnet [08:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:13] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1003.eqiad.wmnet [08:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:30] 10SRE, 10netops, 10observability: Ingest Cron and Root Alerts Into Logstash - https://phabricator.wikimedia.org/T274377 (10ayounsi) Those servers don't have direct external connectivity, so we will have to be creative, eg.; * 
setup some kind of IMAP relay either with external connectivity of through the pro... [08:46:26] (03PS1) 10Mvolz: Update zotero [deployment-charts] - 10https://gerrit.wikimedia.org/r/663524 (https://phabricator.wikimedia.org/T274262) [08:46:56] (03PS2) 10Alexandros Kosiaris: cxserver: Swith to use services proxy for apertium [deployment-charts] - 10https://gerrit.wikimedia.org/r/659836 [08:47:13] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/643069 (owner: 10PipelineBot) [08:47:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (IIRC the debian-glue isn't setup for this package and given the way the plugins are build it would also be quite an effort for" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse) [08:47:55] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/646878 (owner: 10PipelineBot) [08:48:03] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1001.eqiad.wmnet [08:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:18] (03CR) 10JMeybohm: [C: 03+1] "LGTM, nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/662807 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [08:48:20] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/651918 (owner: 10PipelineBot) [08:48:33] (03PS1) 10Jcrespo: icinga: Disable notifications for db1134 while under maintenance [puppet] - 10https://gerrit.wikimedia.org/r/663525 (https://phabricator.wikimedia.org/T274472) [08:48:50] (03PS2) 10Jcrespo: icinga: Disable notifications for db1134 while under maintenance [puppet] - 10https://gerrit.wikimedia.org/r/663525 (https://phabricator.wikimedia.org/T274472) [08:49:38] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/663016 (owner: 10PipelineBot) [08:51:45] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/663383 (https://phabricator.wikimedia.org/T272982) (owner: 10Elukey) [08:52:06] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1002.eqiad.wmnet [08:52:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:29] (03CR) 10JMeybohm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/663064 (https://phabricator.wikimedia.org/T273521) (owner: 10Legoktm) [08:55:04] (03CR) 10Jcrespo: [C: 03+2] icinga: Disable notifications for db1134 while under maintenance [puppet] - 10https://gerrit.wikimedia.org/r/663525 (https://phabricator.wikimedia.org/T274472) (owner: 10Jcrespo) [08:55:26] (03CR) 10Alexandros Kosiaris: [C: 03+2] cxserver: Swith to use services proxy for apertium [deployment-charts] - 10https://gerrit.wikimedia.org/r/659836 (owner: 10Alexandros Kosiaris) [08:57:04] (03Merged) 10jenkins-bot: cxserver: Swith to use services proxy for apertium [deployment-charts] - 10https://gerrit.wikimedia.org/r/659836 (owner: 10Alexandros Kosiaris) [08:59:04] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: 
name=thumbor1003.eqiad.wmnet [08:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:10] (03PS4) 10Kormat: mysql_root_clients: Allow orch access to clouddb [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) [08:59:16] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1004.eqiad.wmnet [08:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:05] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1002.eqiad.wmnet [09:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:16] (03CR) 10Kormat: [C: 03+2] mysql_root_clients: Allow orch access to clouddb [puppet] - 10https://gerrit.wikimedia.org/r/662697 (https://phabricator.wikimedia.org/T273606) (owner: 10Kormat) [09:03:03] (03CR) 10Elukey: [C: 03+2] admin: add kzeta to analytics-privatedta-users [puppet] - 10https://gerrit.wikimedia.org/r/663383 (https://phabricator.wikimedia.org/T272982) (owner: 10Elukey) [09:03:34] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1003.eqiad.wmnet [09:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:35] 10SRE, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (10elukey) 05Open→03Resolved a:03elukey @kzimmerman should be done! 
Let me know if you still have issues :) [09:05:44] SRE, Analytics, SRE-Access-Requests, Patch-For-Review: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (elukey) [09:08:49] (CR) Muehlenhoff: [C: +1] "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/661189 (https://phabricator.wikimedia.org/T273673) (owner: Dzahn) [09:09:00] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host rpki2001.codfw.wmnet [09:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:10:09] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1003.eqiad.wmnet [09:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:18] (CR) JMeybohm: [C: -1] "Nice! Two more general notes:" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [09:12:12] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [09:12:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:12:36] SRE, ops-eqiad, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (jcrespo) db1134 is likely to be unavailable for a long period of time due to T274472#6821332. It was the candidate master, which means we have to choose other one for that. * ~~db1083~~ * ~~db2112~~ (and... 
[09:13:50] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2001.codfw.wmnet [09:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:19:23] PROBLEM - Host thumbor1004 is DOWN: PING CRITICAL - Packet loss = 100% [09:21:02] SRE, ops-eqiad, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (Kormat) ` $ mw-section-groups s1 eqiad db1083 0 db1084 200 api db1099:3311 50 contributions,logpager,recentchanges,recentchangeslinked,watchlist db1105:3311... [09:24:01] (CR) WMDE-Fisch: [C: +1] [DNM] ReferenceTooltips gadget names for ReferencePreviews [mediawiki-config] - https://gerrit.wikimedia.org/r/663185 (https://phabricator.wikimedia.org/T274353) (owner: Thiemo Kreuz (WMDE)) [09:26:08] SRE, ops-eqiad, DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (jcrespo) db1118 it is, then, seems also the most reliable one of the list (based on no past crashes/longevity serving traffic). [09:26:45] PROBLEM - Check systemd state on kubernetes1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:16] (CR) Kosta Harlan: "> Patch Set 16: Code-Review-1" (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [09:28:45] (PS17) Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [09:29:31] (CR) Kosta Harlan: linkrecommendation: Cron job to load datasets (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [09:29:33] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:31:26] (PS1) Jcrespo: mariadb: Promote db1118 as the new candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/663531 (https://phabricator.wikimedia.org/T274472) [09:31:45] (PS2) Jcrespo: mariadb: Promote db1118 as the new candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/663531 (https://phabricator.wikimedia.org/T274472) [09:33:29] (PS18) Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [09:34:23] (CR) Kormat: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/663531 (https://phabricator.wikimedia.org/T274472) (owner: Jcrespo) [09:35:49] (CR) Kosta Harlan: linkrecommendation: Cron job to load datasets (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [09:38:29] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet 
[09:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:39] (CR) JMeybohm: linkrecommendation: Cron job to load datasets (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [09:39:58] (CR) Jcrespo: [C: +2] mariadb: Promote db1118 as the new candidate master for s1 [puppet] - https://gerrit.wikimedia.org/r/663531 (https://phabricator.wikimedia.org/T274472) (owner: Jcrespo) [09:42:43] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:43:46] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'cxserver' for release 'production' . [09:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:27] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [09:44:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:55] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [09:44:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:55] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:05] RECOVERY - Check systemd state on kubernetes1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:53:51] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host thumbor1004.eqiad.wmnet [09:53:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:03] (CR) Muehlenhoff: [C: +2] Remove obsolete netfilter sysctl setting [puppet] - https://gerrit.wikimedia.org/r/662908 (owner: Muehlenhoff) [09:55:27] (CR) Elukey: "I am not a great fan of having the same config replicated over and over in hiera, I personally prefer the defaults in profiles, but I know" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey) [09:58:31] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1083.eqiad.wmnet [09:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:45] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp1084.eqiad.wmnet [09:58:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:01] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2035.codfw.wmnet [09:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:13] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp2036.codfw.wmnet [09:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:27] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3058.esams.wmnet [09:59:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:40] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp3059.esams.wmnet [09:59:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:51] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4031.ulsfo.wmnet [09:59:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:00] (CR) JMeybohm: linkrecommendation: Cron job to load datasets (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [10:00:01] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp4025.ulsfo.wmnet [10:00:02] (PS1) Ayounsi: Remove sampling feature flag [homer/public] - https://gerrit.wikimedia.org/r/663533 [10:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:15] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5011.eqsin.wmnet [10:00:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:20] !log vgutierrez@cumin1001 START - Cookbook sre.hosts.reboot-single for host cp5005.eqsin.wmnet [10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:38] SRE, DBA: Decom dbmonitor2001 - https://phabricator.wikimedia.org/T274496 (MoritzMuehlenhoff) [10:00:55] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [10:02:06] SRE, serviceops, Kubernetes, Patch-For-Review: Migrate to helm v3 - https://phabricator.wikimedia.org/T251305 (akosiaris) [10:02:10] (CR) Elukey: [V: +1 C: +2] Rename the Cloudera CDH specific configs to Apache Bigtop [puppet] - https://gerrit.wikimedia.org/r/663518 (https://phabricator.wikimedia.org/T274345) (owner: Elukey) [10:02:30] going to merge this and run puppet veery carefully :) [10:02:55] !log switching db1118 to row_format=STATEMENT as new s1 master candidate [10:02:56] 
(CR) Jbond: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [10:02:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:26] (CR) Jbond: [C: +1] dhcpd: move sretest1002 to option 82 [puppet] - https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [10:07:16] (CR) Jbond: [C: +2] "LGTM will merege" [puppet] - https://gerrit.wikimedia.org/r/663051 (owner: Dzahn) [10:07:34] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be1004.eqiad.wmnet [10:07:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:21] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5011.eqsin.wmnet [10:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:00] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp5005.eqsin.wmnet [10:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:09] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2036.codfw.wmnet [10:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:26] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3058.esams.wmnet [10:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:32] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1083.eqiad.wmnet [10:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp3059.esams.wmnet [10:12:39] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:45] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp1084.eqiad.wmnet [10:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:36] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp2035.codfw.wmnet [10:13:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:19] (CR) Muehlenhoff: [C: +1] "LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) (owner: Jbond) [10:14:23] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4031.ulsfo.wmnet [10:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:41] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4025.ulsfo.wmnet [10:14:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:21] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be1004.eqiad.wmnet [10:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:59] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey) [10:18:47] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2001.codfw.wmnet [10:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db1118.eqiad.wmnet with reason: Depooling to change binglog_format T274472 [10:19:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db1118.eqiad.wmnet with reason: Depooling to change binglog_format T274472 [10:19:44] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:45] T274472: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 [10:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:00] !log kormat@cumin1001 dbctl commit (dc=all): 'db1118 depooling: change binlog_format', diff saved to https://phabricator.wikimedia.org/P14312 and previous config saved to /var/cache/conftool/dbconfig/20210211-101959-kormat.json [10:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:00] (PS1) Ayounsi: Capirca POC [homer/public] - https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) [10:25:08] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2001.codfw.wmnet [10:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:24] (CR) jerkins-bot: [V: -1] Capirca POC [homer/public] - https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) (owner: Ayounsi) [10:25:27] SRE, DBA, Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (jcrespo) a:Marostegui→jcrespo I am taking db1163 to, at least temporarily, substitute db1134 due to T274472. 
[10:28:24] (PS1) Ayounsi: Capirca POC [software/homer] - https://gerrit.wikimedia.org/r/663536 (https://phabricator.wikimedia.org/T273865) [10:32:50] (CR) Muehlenhoff: [C: +1] dhcpd: create and include files for option 82 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [10:33:20] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2002.codfw.wmnet [10:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:40] !log kormat@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 33%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14313 and previous config saved to /var/cache/conftool/dbconfig/20210211-103440-kormat.json [10:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:45] T274472: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 [10:35:49] (PS19) Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [10:37:34] (CR) Kosta Harlan: linkrecommendation: Cron job to load datasets (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: Kosta Harlan) [10:38:05] (PS2) Ayounsi: Capirca POC [homer/public] - https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) [10:39:46] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2002.codfw.wmnet [10:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:07] (CR) Filippo Giunchedi: alertmanager: route Performance team alerts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) (owner: Filippo Giunchedi) [10:40:44] 
!log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2003.codfw.wmnet [10:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:48] (PS1) Alexandros Kosiaris: apertium: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/663538 [10:40:50] (PS1) Alexandros Kosiaris: apertium: Switch to service_setup [puppet] - https://gerrit.wikimedia.org/r/663539 [10:40:52] (PS1) Alexandros Kosiaris: apertium: Remove conftool data [puppet] - https://gerrit.wikimedia.org/r/663540 [10:40:54] (PS1) Alexandros Kosiaris: Remove apertium-admins group [puppet] - https://gerrit.wikimedia.org/r/663541 [10:40:56] (PS1) Alexandros Kosiaris: apertium: Cleanup scb cluster, puppet [puppet] - https://gerrit.wikimedia.org/r/663542 [10:40:58] (PS1) Alexandros Kosiaris: Remove apertium-plain LVS service [puppet] - https://gerrit.wikimedia.org/r/663543 [10:44:57] (PS1) Kormat: integration: Use a fixture for starting/stopping env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663544 [10:45:41] (PS2) Kormat: integration: Use a fixture for starting/stopping env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663544 [10:46:33] (CR) Alexandros Kosiaris: [C: +2] apertium: Switch to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/663538 (owner: Alexandros Kosiaris) [10:48:31] (CR) Kormat: [C: +2] integration: Use a fixture for starting/stopping env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663544 (owner: Kormat) [10:48:36] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2003.codfw.wmnet [10:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:33] (CR) Volans: "reply inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [10:49:44] !log kormat@cumin1001 dbctl 
commit (dc=all): 'db1118 (re)pooling @ 66%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14314 and previous config saved to /var/cache/conftool/dbconfig/20210211-104943-kormat.json [10:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:48] T274472: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 [10:50:09] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-be2004.codfw.wmnet [10:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:51:29] (Merged) jenkins-bot: integration: Use a fixture for starting/stopping env [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663544 (owner: Kormat) [10:55:00] (PS1) Kormat: replication_tree: Add --no-color [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663546 [10:55:47] (PS2) Kormat: replication_tree: Add --no-color [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663546 [10:56:59] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-be2004.codfw.wmnet [10:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] mvolz: Time to snap out of that daydream and deploy Services – Citoid / Zotero. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T1100). 
[11:00:46] (PS1) Jcrespo: db1163: Reimage to stretch to potentially become s1 candidate master [puppet] - https://gerrit.wikimedia.org/r/663549 (https://phabricator.wikimedia.org/T258361) [11:01:01] (PS2) Jcrespo: db1163: Reimage to stretch to potentially become s1 candidate master [puppet] - https://gerrit.wikimedia.org/r/663549 (https://phabricator.wikimedia.org/T258361) [11:01:35] (CR) Jbond: "See inline" (7 comments) [puppet] - https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: CRusnov) [11:01:41] (CR) Mvolz: [C: +2] Update zotero [deployment-charts] - https://gerrit.wikimedia.org/r/663524 (https://phabricator.wikimedia.org/T274262) (owner: Mvolz) [11:03:34] (Merged) jenkins-bot: Update zotero [deployment-charts] - https://gerrit.wikimedia.org/r/663524 (https://phabricator.wikimedia.org/T274262) (owner: Mvolz) [11:03:43] !log installing firejail security updates on Stretch [11:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:47] !log kormat@cumin1001 dbctl commit (dc=all): 'db1118 (re)pooling @ 100%: changed binlog_format T274472', diff saved to https://phabricator.wikimedia.org/P14315 and previous config saved to /var/cache/conftool/dbconfig/20210211-110447-kormat.json [11:04:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:51] T274472: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 [11:06:05] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'zotero' for release 'staging' . 
[11:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:44] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:30] (CR) Jbond: [C: +1] dhcpd: create and include files for option 82 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [11:13:56] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'zotero' for release 'production' . [11:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:31] (CR) Jbond: [C: +2] base::service_unit: drop support for sysV and upstart init scripts [puppet] - https://gerrit.wikimedia.org/r/661917 (https://phabricator.wikimedia.org/T273743) (owner: Jbond) [11:15:26] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:16:03] (CR) Kormat: [C: +2] replication_tree: Add --no-color [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663546 (owner: Kormat) [11:16:18] SRE, SRE-tools, homer, netbox, and 2 others: Investigate Capirca - https://phabricator.wikimedia.org/T273865 (ayounsi) Limitations identified: Some ACLs currently have Jinja code in them, which is not possible through Capirca. The easiest cases have (or can) be mitigated by either: * removing th... [11:17:22] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . [11:17:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:37] (CR) Valerio Bozzolan: "Just a question. What about the HTTP status code 429 Too Many Requests?" 
[puppet] - https://gerrit.wikimedia.org/r/663004 (https://phabricator.wikimedia.org/T273741) (owner: Giuseppe Lavagetto) [11:19:09] (Merged) jenkins-bot: replication_tree: Add --no-color [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/663546 (owner: Kormat) [11:20:24] (CR) Kormat: [C: +1] db1163: Reimage to stretch to potentially become s1 candidate master [puppet] - https://gerrit.wikimedia.org/r/663549 (https://phabricator.wikimedia.org/T258361) (owner: Jcrespo) [11:21:37] (CR) Jcrespo: [C: +2] db1163: Reimage to stretch to potentially become s1 candidate master [puppet] - https://gerrit.wikimedia.org/r/663549 (https://phabricator.wikimedia.org/T258361) (owner: Jcrespo) [11:23:31] hey, volans, I really need to deploy a change to install1003 for an incident response [11:23:53] jynus: oh sorry, my bad, I forgot to re-enable it [11:23:59] would it be possible to enable puppet or manually change an entry? [11:24:08] both would work for mw [11:24:11] *me [11:24:27] re-enabled, running puppet [11:24:29] (CR) Mvolz: [C: +2] citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663270 (owner: PipelineBot) [11:24:29] sorry about that [11:24:35] no problem caused :-) [11:24:46] RECOVERY - Host thumbor1004 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [11:24:53] can you see db1163 on stretch reimage now? [11:25:09] I guess I can look myself at the diff [11:25:16] still running [11:25:23] yep + option pxelinux.pathprefix "http://apt.wikimedia.org/tftpboot/stretch-installer/"; [11:25:31] thank, good news! 
[11:25:33] !log jiji@cumin1001 conftool action : set/pooled=yes; selector: name=thumbor1004.eqiad.wmnet [11:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:54] jynus: puppet run completed [11:25:54] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2001.codfw.wmnet [11:25:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:14] (Merged) jenkins-bot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663270 (owner: PipelineBot) [11:26:52] SRE, ops-eqiad, DBA, Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` db1163.eqiad.wmnet ` The log can be found in `/var/log/wmf-auto-reimage/2021021... [11:27:59] (CR) Volans: [C: +2] dhcpd: create and include files for option 82 [puppet] - https://gerrit.wikimedia.org/r/663233 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [11:28:43] (CR) Volans: [C: +2] dhcpd: move sretest1002 to option 82 [puppet] - https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) (owner: Volans) [11:29:04] (PS4) Volans: dhcpd: move sretest1002 to option 82 [puppet] - https://gerrit.wikimedia.org/r/663234 (https://phabricator.wikimedia.org/T221388) [11:32:44] SRE, Data-Persistence-Backup, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) [11:33:21] SRE, Data-Persistence-Backup, SRE-swift-storage, Traffic, netops: Depool codfw swift cluster - https://phabricator.wikimedia.org/T267338 (jcrespo) I asked filippo to delay the maintenance 1 week due to unexpected workload on my side, which would prevent me to be ready by next week. 
[11:34:30] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (aborrero) a:aborrero→Papaul I don't think the server is racked yet, so assigning back to @Papaul [11:35:18] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2001.codfw.wmnet [11:35:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:34] !log mvolz@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [11:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:40] (CR) Effie Mouzeli: [C: +2] profile::memcached::instance: remove "default_values" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/647190 (owner: Elukey) [11:39:18] !log jiji@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor2002.codfw.wmnet [11:39:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:47] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE [11:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:55] !log mvolz@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[11:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:47] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1163.eqiad.wmnet with reason: REIMAGE [11:41:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:44] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:44:58] (PS3) Ayounsi: Capirca POC [homer/public] - https://gerrit.wikimedia.org/r/663535 (https://phabricator.wikimedia.org/T273865) [11:45:11] !log mvolz@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [11:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:46:12] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:48:51] SRE, ops-eqiad, DBA, Patch-For-Review: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1163.eqiad.wmnet'] ` and were **ALL** successful. 
[11:49:26] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor2002.codfw.wmnet [11:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:50:01] (PS6) Base: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) [11:51:36] (PS5) ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [11:52:19] (PS6) ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [11:53:25] (CR) Noa wmde: [C: +1] wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - https://gerrit.wikimedia.org/r/662967 (https://phabricator.wikimedia.org/T204031) (owner: Lucas Werkmeister (WMDE)) [11:55:47] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663261 (owner: PipelineBot) [11:55:55] (Abandoned) Mvolz: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/663269 (owner: PipelineBot) [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T1200). [12:00:04] matthiasmullie and Lucas_WMDE: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[12:00:14] \o [12:00:30] i assume you'll all self-service [12:00:51] 10SRE, 10Traffic, 10Wikimedia-General-or-Unknown: Disable caching on the main page for anonymous users - https://phabricator.wikimedia.org/T119366 (10Kaganer) There is a difference between logged in and unlogged sessions. See [[ https://ba.wikipedia.org/wiki/Баш_бит | https://ba.wikipedia.org/wiki/Баш_бит ]]... [12:00:53] I can do that, but I’ll also only be properly around a bit later [12:01:32] PROBLEM - tilerator on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [12:02:24] PROBLEM - Postgres Replication Lag on maps1005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 15942435400 and 41287 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [12:02:38] PROBLEM - cassandra CQL 10.64.0.12:9042 on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [12:04:25] (03PS1) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [12:06:07] !log restart-failed systemd on cumin1001 after s5 eqiad snapshot failed [12:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:18] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:10:36] PROBLEM - cassandra service on maps1005 is CRITICAL: CRITICAL - Expecting active but unit cassandra is inactive https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:10:54] alright, if matthiasmullie isn’t around yet I can start with my own changes [12:11:12] (03PS2) 10Lucas Werkmeister (WMDE): wikidata: add Dagbani to wmgExtraLanguageNames [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/662970 (https://phabricator.wikimedia.org/T272242) [12:11:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] wikidata: add Dagbani to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662970 (https://phabricator.wikimedia.org/T272242) (owner: 10Lucas Werkmeister (WMDE)) [12:12:14] (03Merged) 10jenkins-bot: wikidata: add Dagbani to wmgExtraLanguageNames [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662970 (https://phabricator.wikimedia.org/T272242) (owner: 10Lucas Werkmeister (WMDE)) [12:12:38] pulled to mwdebug1001, testing [12:13:12] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:13:19] seems fine, syncing [12:13:42] (03PS2) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [12:14:00] (03PS2) 10Lucas Werkmeister (WMDE): wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662967 (https://phabricator.wikimedia.org/T204031) [12:14:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662967 (https://phabricator.wikimedia.org/T204031) (owner: 10Lucas Werkmeister (WMDE)) [12:15:12] PROBLEM - tileratorui on maps1005 is CRITICAL: connect to address 10.64.0.12 and port 6535: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tileratorui [12:15:16] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:662970|wikidata: add Dagbani to wmgExtraLanguageNames (T272242)]] (duration: 01m 29s) [12:15:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:21] T272242: Language code "dag" for Dagbani 
does not work for lexemes - https://phabricator.wikimedia.org/T272242 [12:15:57] (03Merged) 10jenkins-bot: wikidata: post edit constraint jobs on 50% of edits [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662967 (https://phabricator.wikimedia.org/T204031) (owner: 10Lucas Werkmeister (WMDE)) [12:16:30] this one can’t really be tested, I’ll just sync it [12:17:37] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Switch to service_setup [puppet] - 10https://gerrit.wikimedia.org/r/663539 (owner: 10Alexandros Kosiaris) [12:18:02] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:662967|wikidata: post edit constraint jobs on 50% of edits (T204031)]] (up from 40%) (duration: 01m 08s) [12:18:05] (03PS1) 10Jbond: update cas meta data [puppet] - 10https://gerrit.wikimedia.org/r/663560 [12:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:07] T204031: Deploy regular running of wikidata constraint checks using the job queue - https://phabricator.wikimedia.org/T204031 [12:18:19] alright, and with that I’m done I think [12:19:20] (03CR) 10Jbond: [C: 03+2] update cas meta data [puppet] - 10https://gerrit.wikimedia.org/r/663560 (owner: 10Jbond) [12:20:26] (03PS3) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [12:21:46] hm, there seems to be a spike of job queue failures [12:21:51] (could not enqueue jobs from stream …) [12:22:09] but not on any of the wikis that the changes I deployed touched [12:22:52] ok it already went away again (was only for 2 minutes) [12:23:58] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.11:2737]) https://wikitech.wikimedia.org/wiki/PyBal [12:24:24] PROBLEM - PyBal IPVS diff check on lvs2009 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: 
set([10.2.1.11:2737]) https://wikitech.wikimedia.org/wiki/PyBal [12:24:56] PROBLEM - PyBal IPVS diff check on lvs2010 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.1.11:2737]) https://wikitech.wikimedia.org/wiki/PyBal [12:24:58] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Services in IPVS but unknown to PyBal: set([10.2.2.11:2737]) https://wikitech.wikimedia.org/wiki/PyBal [12:26:20] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 22): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28006/console" [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [12:27:53] 10SRE, 10Platform Engineering, 10Traffic, 10cloud-services-team (Kanban): Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (10aborrero) For the record, we already have a dedicated phab task for the traffic team: {T273737} [12:30:17] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Remove conftool data [puppet] - 10https://gerrit.wikimedia.org/r/663540 (owner: 10Alexandros Kosiaris) [12:30:18] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:30:44] RECOVERY - PyBal IPVS diff check on lvs2009 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:31:10] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] "Group going away for good and never coming back, per the usual, best to just remove it." 
[puppet] - 10https://gerrit.wikimedia.org/r/663541 (owner: 10Alexandros Kosiaris) [12:31:16] RECOVERY - PyBal IPVS diff check on lvs2010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:31:16] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [12:31:20] (03PS2) 10Alexandros Kosiaris: Remove apertium-admins group [puppet] - 10https://gerrit.wikimedia.org/r/663541 [12:31:24] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove apertium-admins group [puppet] - 10https://gerrit.wikimedia.org/r/663541 (owner: 10Alexandros Kosiaris) [12:33:13] Lucas_WMDE: if done with deploying, i guess i can ship something? [12:33:20] sure [12:33:39] (03PS7) 10Urbanecm: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) (owner: 10Base) [12:33:41] (03PS1) 10Elukey: hadoop: set a dedicated HDFS NN service port for test/backup [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) [12:33:57] (I didn’t !log finished yet in case Matthias shows up) [12:33:58] (03CR) 10Urbanecm: [C: 03+2] Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) (owner: 10Base) [12:34:04] i see [12:34:12] PROBLEM - LVS apertium codfw port 4737/tcp - Machine Translation service. apertium.svc.eqiad.wmnet IPv4 on apertium.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.11 and port 2737: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:35:00] PROBLEM - LVS apertium eqiad port 4737/tcp - Machine Translation service. 
apertium.svc.eqiad.wmnet IPv4 on apertium.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.11 and port 2737: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:35:07] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2001.codfw.wmnet [12:35:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:52] (03Merged) 10jenkins-bot: Changing frwiktionary's wmgBabelMainCategory [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662720 (https://phabricator.wikimedia.org/T274137) (owner: 10Base) [12:36:38] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apertium-plain on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apertium-plain is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:36:50] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:36:59] this is me ^ killing the old service. 
[12:37:16] apertium now listens fine to 4737 [12:37:51] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [12:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:10] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium-plain on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium-plain is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:38:18] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium-plain on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium-plain is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:38:18] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:38:24] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:38:29] akosiaris: your merge? [12:38:38] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:38:48] Urbanecm: yes, known. 
[12:38:52] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apertium-plain on puppetmaster2001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apertium-plain is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:38:52] okay, thanks :) [12:38:59] (03PS1) 10Jbond: profile::idp::standalone: create a profile for managing idp on cloud [puppet] - 10https://gerrit.wikimedia.org/r/663562 [12:39:24] (03PS4) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [12:39:26] (03PS2) 10Elukey: hadoop: set a dedicated HDFS NN service port for test/backup [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) [12:40:32] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: d2b1df105afd9f9c9c047ae9c0a434674f43d505: Changing frwiktionary wmgBabelMainCategory (T274137) (duration: 01m 08s) [12:40:33] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2001.codfw.wmnet [12:40:34] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium-plain on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:40:34] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:37] T274137: Change name of Babel main category for French Wiktionary - https://phabricator.wikimedia.org/T274137 [12:40:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:20] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [12:41:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:41:26] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28008/console" [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [12:44:20] PROBLEM - cassandra-a SSL 10.192.48.132:7001 on sessionstore2003 is CRITICAL: SSL CRITICAL - Certificate sessionstore2003-a valid until 2021-03-13 12:44:14 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [12:44:46] PROBLEM - cassandra-a SSL 10.64.48.178:7001 on sessionstore1003 is CRITICAL: SSL CRITICAL - Certificate sessionstore1003-a valid until 2021-03-13 12:44:11 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [12:45:00] PROBLEM - cassandra-a SSL 10.64.32.85:7001 on sessionstore1002 is CRITICAL: SSL CRITICAL - Certificate sessionstore1002-a valid until 2021-03-13 12:44:10 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [12:45:12] PROBLEM - cassandra-a SSL 10.192.32.101:7001 on sessionstore2002 is CRITICAL: SSL CRITICAL - Certificate sessionstore2002-a valid until 2021-03-13 12:44:13 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [12:45:14] PROBLEM - cassandra-a SSL 10.64.0.144:7001 on sessionstore1001 is CRITICAL: SSL CRITICAL - Certificate sessionstore1001-a valid until 2021-03-13 12:44:09 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [12:45:32] (03CR) 10Jbond: [C: 03+2] profile::idp::standalone: create a profile for managing idp on cloud [puppet] - 10https://gerrit.wikimedia.org/r/663562 (owner: 10Jbond) [12:45:37] (03PS5) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [12:45:39] (03PS3) 10Elukey: hadoop: set a dedicated HDFS NN service port for test/backup [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) [12:45:52] !log 
filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2002.codfw.wmnet [12:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:00] PROBLEM - cassandra-a SSL 10.192.16.95:7001 on sessionstore2001 is CRITICAL: SSL CRITICAL - Certificate sessionstore2001-a valid until 2021-03-13 12:44:12 +0000 (expires in 29 days) https://phabricator.wikimedia.org/T120662 [12:47:28] PROBLEM - Confd template for /srv/config-master/pybal/codfw/apertium-plain on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/codfw/apertium-plain is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:47:30] PROBLEM - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster1001 is CRITICAL: Compilation of file /srv/config-master/pybal/eqiad/apertium is broken https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:47:43] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28009/console" [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [12:48:52] PROBLEM - Thanos compact has not run on alert1001 is CRITICAL: 4.481e+05 ge 24 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [12:49:05] (03PS1) 10Jbond: P:idp:standalone fix template [puppet] - 10https://gerrit.wikimedia.org/r/663564 [12:50:07] (03PS1) 10Effie Mouzeli: WIP: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) [12:50:32] (03PS2) 10Alexandros Kosiaris: Remove apertium-plain LVS service [puppet] - 10https://gerrit.wikimedia.org/r/663543 [12:50:50] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Remove apertium-plain LVS service [puppet] - 10https://gerrit.wikimedia.org/r/663543 (owner: 10Alexandros Kosiaris) 
[12:51:17] (03PS6) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [12:51:19] (03PS4) 10Elukey: hadoop: set a dedicated HDFS NN service port for test/backup [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) [12:51:43] (03CR) 10jerkins-bot: [V: 04-1] WIP: mediawiki::mcrouter: add onhost memcached unix socket support [puppet] - 10https://gerrit.wikimedia.org/r/663565 (https://phabricator.wikimedia.org/T273115) (owner: 10Effie Mouzeli) [12:53:09] (03CR) 10Jbond: [C: 03+2] P:idp:standalone fix template [puppet] - 10https://gerrit.wikimedia.org/r/663564 (owner: 10Jbond) [12:53:47] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28010/console" [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [12:53:53] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2002.codfw.wmnet [12:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:00] (03PS1) 10Hnowlan: similar-users: deploy rebuilt image [deployment-charts] - 10https://gerrit.wikimedia.org/r/663566 (https://phabricator.wikimedia.org/T274262) [12:54:12] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium-plain on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:54:26] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:54:38] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:54:50] RECOVERY - Confd template for 
/srv/config-master/pybal/eqiad/apertium-plain on puppetmaster2001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:55:57] (03CR) 10Hnowlan: [C: 03+2] similar-users: deploy rebuilt image [deployment-charts] - 10https://gerrit.wikimedia.org/r/663566 (https://phabricator.wikimedia.org/T274262) (owner: 10Hnowlan) [12:57:06] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apertium-plain on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:57:18] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:57:32] (03Merged) 10jenkins-bot: similar-users: deploy rebuilt image [deployment-charts] - 10https://gerrit.wikimedia.org/r/663566 (https://phabricator.wikimedia.org/T274262) (owner: 10Hnowlan) [12:57:44] (03PS1) 10Jbond: correct service id [puppet] - 10https://gerrit.wikimedia.org/r/663567 [12:57:45] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[12:57:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:58:03] (03CR) 10Jbond: [V: 03+2 C: 03+2] correct service id [puppet] - 10https://gerrit.wikimedia.org/r/663567 (owner: 10Jbond) [12:58:52] RECOVERY - Confd template for /srv/config-master/pybal/codfw/apertium-plain on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:58:52] RECOVERY - Confd template for /srv/config-master/pybal/eqiad/apertium on puppetmaster1001 is OK: No errors detected https://wikitech.wikimedia.org/wiki/Confd%23Monitoring [12:59:01] (03PS1) 10Alexandros Kosiaris: apertium: remove duplicate check_command [puppet] - 10https://gerrit.wikimedia.org/r/663568 [12:59:25] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] apertium: remove duplicate check_command [puppet] - 10https://gerrit.wikimedia.org/r/663568 (owner: 10Alexandros Kosiaris) [13:00:09] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[13:00:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:51] (03PS1) 10Base: Added Kokebok namespace to nowikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663569 (https://phabricator.wikimedia.org/T274265) [13:01:32] (03PS1) 10LSobanski: instances.yaml: Add db1163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/663570 [13:02:05] (03PS1) 10Jbond: apereo_cas: disable notify on servie changes [puppet] - 10https://gerrit.wikimedia.org/r/663571 [13:02:54] (03CR) 10Kormat: [C: 04-1] instances.yaml: Add db1163 to dbctl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663570 (owner: 10LSobanski) [13:03:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28011/console" [puppet] - 10https://gerrit.wikimedia.org/r/663571 (owner: 10Jbond) [13:03:20] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe2003.codfw.wmnet [13:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:57] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'similar-users' for release 'main' . 
[13:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:10] (03CR) 10Jcrespo: "Question:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663570 (owner: 10LSobanski) [13:05:34] (03CR) 10Jbond: [V: 03+1 C: 03+2] apereo_cas: disable notify on servie changes [puppet] - 10https://gerrit.wikimedia.org/r/663571 (owner: 10Jbond) [13:05:47] (03PS2) 10Base: Added Kokebok namespace to nowikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663569 (https://phabricator.wikimedia.org/T274265) [13:08:20] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe2003.codfw.wmnet [13:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:11] (03PS7) 10Elukey: bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) [13:09:13] (03PS5) 10Elukey: hadoop: set a dedicated HDFS NN service port for test/backup [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) [13:09:59] RECOVERY - LVS apertium eqiad port 4737/tcp - Machine Translation service. apertium.svc.eqiad.wmnet IPv4 on apertium.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 5945 bytes in 1.049 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:09:59] RECOVERY - LVS apertium codfw port 4737/tcp - Machine Translation service. 
apertium.svc.eqiad.wmnet IPv4 on apertium.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 5945 bytes in 1.179 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:10:04] (03CR) 10Elukey: [C: 03+2] bigtop: allow to split the HDFS Namenode RPC thread queue [puppet] - 10https://gerrit.wikimedia.org/r/663558 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [13:10:35] (03PS5) 10Matthias Mullie: Add external entity search URI for new MediaSearch extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: 10Anne Tomasevich) [13:12:37] (03CR) 10Matthias Mullie: [C: 03+2] Add external entity search URI for new MediaSearch extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: 10Anne Tomasevich) [13:12:50] (03CR) 10Elukey: [C: 03+2] hadoop: set a dedicated HDFS NN service port for test/backup [puppet] - 10https://gerrit.wikimedia.org/r/663561 (https://phabricator.wikimedia.org/T273629) (owner: 10Elukey) [13:13:33] (03Merged) 10jenkins-bot: Add external entity search URI for new MediaSearch extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662792 (https://phabricator.wikimedia.org/T265939) (owner: 10Anne Tomasevich) [13:16:44] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1001.eqiad.wmnet [13:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:50] 10SRE: Add POP Ganeti clusters to makevm cookbook - https://phabricator.wikimedia.org/T242828 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete, the current cookbook allows to add VMs in the Ganeti clusters in our edges. 
[13:22:01] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1001.eqiad.wmnet [13:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:32] (03PS2) 10LSobanski: instances.yaml: Add db1163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) [13:24:00] (03CR) 10jerkins-bot: [V: 04-1] instances.yaml: Add db1163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:24:34] (03CR) 10LSobanski: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:25:49] (03CR) 10Kormat: [C: 04-1] instances.yaml: Add db1163 to dbctl (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:27:33] !log re-adding ganeti5002 to the eqsin Ganeti cluster following mainboard replacement/reinstall T261130 [13:27:33] (03PS3) 10LSobanski: instances.yaml: Add db1163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) [13:27:35] (03PS7) 10ArielGlenn: refactor script for wikidata and commons rdf dumps [puppet] - 10https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) [13:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:39] T261130: ganeti5002 was down / powered off, machine check entries in SEL - https://phabricator.wikimedia.org/T261130 [13:28:00] !log test grafana 7.4.1 upgrade on grafana2001 - T263747 [13:28:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] T263747: Upgrade Grafana to 7.4 - https://phabricator.wikimedia.org/T263747 [13:28:11] (03CR) 10Jcrespo: "Resolved" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663570 
(https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:28:16] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) With the above patches merged, and with: `lang=bash root@install1003:/etc/dhcp# cat opt82-entries.ttyS1-115200 host sretest1002 { host-identifier option agent.circuit-id "asw2-d-eqiad:ge-6/0/5.0:private1-d-eqiad";... [13:28:18] (03CR) 10Kormat: [C: 03+1] "LGTM 💯 🎉" [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:28:30] (03CR) 10ArielGlenn: [C: 03+2] refactor script for wikidata and commons rdf dumps [puppet] - 10https://gerrit.wikimedia.org/r/661170 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [13:28:36] (03CR) 10Jcrespo: [C: 03+1] instances.yaml: Add db1163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:31:32] (03PS1) 10Jbond: P:idp::standalone: add simple lask app to test idp [puppet] - 10https://gerrit.wikimedia.org/r/663576 [13:33:17] (03CR) 10jerkins-bot: [V: 04-1] P:idp::standalone: add simple lask app to test idp [puppet] - 10https://gerrit.wikimedia.org/r/663576 (owner: 10Jbond) [13:37:37] 10SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (10MoritzMuehlenhoff) [13:41:00] (03PS1) 10Filippo Giunchedi: grafana: support multiple read-only vhost via server alias [puppet] - 10https://gerrit.wikimedia.org/r/663577 (https://phabricator.wikimedia.org/T263747) [13:41:14] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10BBlack) I'm probably not up to date on concrete plans built on top of this, but it seems like having the numeric vlan id might be useful metadata here in addition to the abstract name of the vlan (e.g. scenarios where we might... 
[13:41:17] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1002.eqiad.wmnet [13:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:52] 10SRE, 10Dumps-Generation, 10Platform Engineering, 10serviceops, and 2 others: Upgrade snapshot hosts to Buster - https://phabricator.wikimedia.org/T269377 (10ArielGlenn) While it would be nice to continue to make the wikidata entity dumps more easy to run in deployment-prep, it can wait a bit while I move... [13:43:18] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28012/console" [puppet] - 10https://gerrit.wikimedia.org/r/663577 (https://phabricator.wikimedia.org/T263747) (owner: 10Filippo Giunchedi) [13:44:15] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10ayounsi) From the doc: > Specify that the circuit ID suboption value contains the VLAN ID rather than the VLAN name (the default): > [edit vlans vlan-name forwarding-options dhcp-security option-82] > user@switch# set circu... 
[13:44:38] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28013/console" [puppet] - 10https://gerrit.wikimedia.org/r/663577 (https://phabricator.wikimedia.org/T263747) (owner: 10Filippo Giunchedi) [13:48:41] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1002.eqiad.wmnet [13:48:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:48:47] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host thanos-fe1003.eqiad.wmnet [13:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:49] (03CR) 10LSobanski: [C: 03+2] instances.yaml: Add db1163 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/663570 (https://phabricator.wikimedia.org/T274472) (owner: 10LSobanski) [13:53:51] (03PS2) 10Jbond: P:idp::standalone: add simple lask app to test idp [puppet] - 10https://gerrit.wikimedia.org/r/663576 [13:53:55] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thanos-fe1003.eqiad.wmnet [13:53:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:02] 10SRE, 10netops: Test dhcp-option 82 - https://phabricator.wikimedia.org/T221388 (10Volans) >>! In T221388#6822703, @BBlack wrote: > I'm probably not up to date on concrete plans built on top of this, but it seems like having the numeric vlan id might be useful metadata here in addition to the abstract name of... 
[13:57:16] (03CR) 10Jbond: [C: 03+2] P:idp::standalone: add simple lask app to test idp [puppet] - 10https://gerrit.wikimedia.org/r/663576 (owner: 10Jbond) [13:59:29] (03PS1) 10Jbond: P:idp::standalone correct indent [puppet] - 10https://gerrit.wikimedia.org/r/663578 [13:59:42] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:idp::standalone correct indent [puppet] - 10https://gerrit.wikimedia.org/r/663578 (owner: 10Jbond) [14:00:04] twentyafterfour and hashar: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Mediawiki train - American+European Version (secondary timeslot) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T1400). [14:02:16] !log filippo@cumin1001 START - Cookbook sre.hosts.reboot-single for host netmon2001.wikimedia.org [14:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:50] (03PS1) 10Elukey: hadoop: temporary disable the service port for backup [puppet] - 10https://gerrit.wikimedia.org/r/663579 [14:03:57] (03CR) 10Elukey: [C: 03+2] hadoop: temporary disable the service port for backup [puppet] - 10https://gerrit.wikimedia.org/r/663579 (owner: 10Elukey) [14:05:14] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for urbanecm - https://phabricator.wikimedia.org/T274318 (10Vgutierrez) p:05Triage→03Medium [14:06:00] (03PS5) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [14:06:35] (03PS6) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [14:08:11] (03PS1) 10Jbond: idp::standalon: drop uwsgi::app in favour of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/663580 [14:08:13] (03PS1) 10Vgutierrez: admin: Add urbanecm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/663581 
(https://phabricator.wikimedia.org/T274318) [14:08:13] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:37] (03CR) 10Jbond: [V: 03+2 C: 03+2] idp::standalon: drop uwsgi::app in favour of service::uwsgi [puppet] - 10https://gerrit.wikimedia.org/r/663580 (owner: 10Jbond) [14:10:30] (03PS1) 10Joal: Fix oozie sharelib creation script [puppet] - 10https://gerrit.wikimedia.org/r/663582 (https://phabricator.wikimedia.org/T274322) [14:11:41] (03PS7) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [14:11:44] !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2001.wikimedia.org [14:11:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:39] (03PS8) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [14:13:04] (03CR) 10David Caro: toolforge.etcdctl: add new etcdctl module (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [14:13:37] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:54] (03CR) 10Lars Wirzenius: "I'm afraid our puppet stuff is way too mysterious to me for me to review changes to operations/puppet." 
[puppet] - 10https://gerrit.wikimedia.org/r/650306 (owner: 10Dzahn) [14:15:08] (03CR) 10Elukey: [C: 03+2] Fix oozie sharelib creation script [puppet] - 10https://gerrit.wikimedia.org/r/663582 (https://phabricator.wikimedia.org/T274322) (owner: 10Joal) [14:17:03] (03PS3) 10Mholloway: Update sampling config syntax for test.instrumentation.sampled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662770 [14:18:22] (03PS1) 10Jbond: P:idp::standalone: add uwsgi app [puppet] - 10https://gerrit.wikimedia.org/r/663583 [14:18:28] (03CR) 10jerkins-bot: [V: 04-1] toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) (owner: 10David Caro) [14:18:36] (03CR) 10jerkins-bot: [V: 04-1] P:idp::standalone: add uwsgi app [puppet] - 10https://gerrit.wikimedia.org/r/663583 (owner: 10Jbond) [14:19:17] (03PS2) 10Jbond: P:idp::standalone: add uwsgi app [puppet] - 10https://gerrit.wikimedia.org/r/663583 [14:19:45] (03CR) 10Mholloway: [C: 03+2] Update sampling config syntax for test.instrumentation.sampled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662770 (owner: 10Mholloway) [14:21:17] (03CR) 10Jbond: [C: 03+2] P:idp::standalone: add uwsgi app [puppet] - 10https://gerrit.wikimedia.org/r/663583 (owner: 10Jbond) [14:21:33] (03Merged) 10jenkins-bot: Update sampling config syntax for test.instrumentation.sampled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/662770 (owner: 10Mholloway) [14:21:46] (03PS2) 10Vgutierrez: admin: Add urbanecm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/663581 (https://phabricator.wikimedia.org/T274318) [14:22:55] (03PS1) 10Muehlenhoff: Depool poolcounter1005 for reboot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663584 [14:24:12] (03PS1) 10Jbond: P:idp::standalone: use correct name [puppet] - 10https://gerrit.wikimedia.org/r/663585 [14:24:20] !log mholloway-shell@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: EventStreams: Update sampling config syntax for test.instrumentation.sampled (duration: 01m 08s) [14:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:28] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:idp::standalone: use correct name [puppet] - 10https://gerrit.wikimedia.org/r/663585 (owner: 10Jbond) [14:30:55] 10SRE, 10Platform Engineering, 10Traffic, 10cloud-services-team (Kanban): Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (10Ladsgroup) @daniel I think the most important part of the greenlight is if ratelimit in mediaiwki is going affect... [14:33:19] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: bugfix dot_expander [puppet] - 10https://gerrit.wikimedia.org/r/663333 (owner: 10Cwhite) [14:36:13] (03PS9) 10David Caro: toolforge.etcdctl: add new etcdctl module [software/spicerack] - 10https://gerrit.wikimedia.org/r/661921 (https://phabricator.wikimedia.org/T267412) [14:39:30] (03CR) 10Kormat: [C: 03+1] "Looks good as far as i can tell :)" [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [14:40:09] (03CR) 10JMeybohm: [C: 04-1] "You will also have to bump the charts version in Chart.yaml" (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [14:44:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Add db1163 to s1 T258361', diff saved to https://phabricator.wikimedia.org/P14318 and previous config saved to /var/cache/conftool/dbconfig/20210211-144445-kormat.json [14:44:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:51] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [14:51:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (AFAICT anyways)" [puppet] - 
10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [14:52:33] 10SRE, 10Wikimedia-Apache-configuration, 10Wikimedia-Site-requests, 10Patch-For-Review: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830 (10Base) Is there a blocker here? [14:53:25] (03PS5) 10Jbond: interface: update rps script to also set the number of queues via ethtool [puppet] - 10https://gerrit.wikimedia.org/r/662688 (https://phabricator.wikimedia.org/T236208) [14:54:04] (03PS1) 10Elukey: Revert "hadoop: temporary disable the service port for backup" [puppet] - 10https://gerrit.wikimedia.org/r/663392 [14:54:36] (03CR) 10Elukey: [C: 03+2] Revert "hadoop: temporary disable the service port for backup" [puppet] - 10https://gerrit.wikimedia.org/r/663392 (owner: 10Elukey) [14:55:01] (03PS3) 10Base: Added Kokebok namespace to nowikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663569 (https://phabricator.wikimedia.org/T274265) [15:04:07] (03PS1) 10JMeybohm: docker-pkg: add ca_bundle configuration [puppet] - 10https://gerrit.wikimedia.org/r/663588 (https://phabricator.wikimedia.org/T274306) [15:05:37] PROBLEM - Hadoop Namenode - Primary on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:05:56] this is me -^ [15:06:05] PROBLEM - Check systemd state on an-worker1118 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:06:06] it is the backup cluster, for some reason it was in a weird state [15:06:07] PROBLEM - Hadoop HDFS Zookeeper failover controller on an-worker1118 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:07:19] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/28015/console" [puppet] - 10https://gerrit.wikimedia.org/r/663588 (https://phabricator.wikimedia.org/T274306) (owner: 10JMeybohm) [15:08:13] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:25] RECOVERY - Check systemd state on an-worker1118 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:27] RECOVERY - Hadoop HDFS Zookeeper failover controller on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.tools.DFSZKFailoverController https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:10:15] RECOVERY - Hadoop Namenode - Primary on an-worker1118 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:12:29] (03PS1) 10Kormat: integration: Display error messages on env start/stop failure [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663589 [15:15:15] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:59] (03CR) 10jerkins-bot: [V: 04-1] integration: Display error messages on env start/stop failure [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663589 (owner: 10Kormat) [15:21:11] (03PS2) 10Filippo Giunchedi: alertmanager: route Performance team alerts [puppet] - 10https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) [15:21:35] (03CR) 10Filippo Giunchedi: alertmanager: route Performance team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663238 (https://phabricator.wikimedia.org/T272979) (owner: 10Filippo Giunchedi) [15:30:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/663581 (https://phabricator.wikimedia.org/T274318) (owner: 10Vgutierrez) [15:32:32] (03PS2) 10Kormat: integration: Display error messages on env start/stop failure [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663589 [15:37:16] (03CR) 10Kormat: [C: 03+2] integration: Display error messages on env start/stop failure [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663589 (owner: 10Kormat) [15:39:28] !log powercycle elastic2054 [15:39:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:47] (03Merged) 10jenkins-bot: integration: Display error messages on env start/stop failure [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663589 (owner: 10Kormat) [15:39:53] !log powercycle elastic2054 - T274555 [15:39:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:57] T274555: elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 [15:40:16] ryankemper: ^ [15:41:37] PROBLEM - Host elastic2054 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:53] RECOVERY - Host elastic2054 is UP: PING OK - Packet loss = 0%, RTA = 33.07 ms [15:43:37] (03PS1) 10Arturo Borrero Gonzalez: cloud: dumps: allow mounting dumps NFS from the 
tools-codfw1dev project [puppet] - 10https://gerrit.wikimedia.org/r/663593 (https://phabricator.wikimedia.org/T272397) [15:44:23] RECOVERY - SSH on elastic2054 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:44:57] (03CR) 10Bstorm: [C: 03+1] "Can't guarantee it will go well 😊" [puppet] - 10https://gerrit.wikimedia.org/r/663593 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [15:45:01] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db1163 at 1% T258361', diff saved to https://phabricator.wikimedia.org/P14320 and previous config saved to /var/cache/conftool/dbconfig/20210211-154501-kormat.json [15:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:07] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [15:45:07] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single for host serpens.wikimedia.org [15:45:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:14] !log depooling elastic2054 - T274555 [15:46:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:21] T274555: elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 [15:46:52] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: dumps: allow mounting dumps NFS from the tools-codfw1dev project [puppet] - 10https://gerrit.wikimedia.org/r/663593 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [15:47:05] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host serpens.wikimedia.org [15:47:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:03] !log jynus@cumin1001 dbctl commit (dc=all): 'Depool 1163', diff saved to https://phabricator.wikimedia.org/P14321 and previous config saved to 
/var/cache/conftool/dbconfig/20210211-154902-jynus.json [15:49:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:49] PROBLEM - MD RAID on elastic2054 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:50:49] !log ban elastic2054 from shard allocation - T274555 [15:50:50] ACKNOWLEDGEMENT - MD RAID on elastic2054 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T274556 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:54] 10SRE, 10ops-codfw: Degraded RAID on elastic2054 - https://phabricator.wikimedia.org/T274556 (10ops-monitoring-bot) [15:52:12] (03CR) 10Cwhite: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/663577 (https://phabricator.wikimedia.org/T263747) (owner: 10Filippo Giunchedi) [15:52:26] !log deploying fixed grants to db1163 [15:52:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:51] 10SRE, 10ops-codfw: Degraded RAID on elastic2054 - https://phabricator.wikimedia.org/T274556 (10Gehel) [15:53:51] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] grafana: support multiple read-only vhost via server alias [puppet] - 10https://gerrit.wikimedia.org/r/663577 (https://phabricator.wikimedia.org/T263747) (owner: 10Filippo Giunchedi) [15:54:19] 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Gehel) @Papaul : it looks like sda is failing, confirmed by T274556. The server is depooled and banned from the cluster. Could you do your magic to find a new SSD? Thanks! 
[15:57:07] (03CR) 10Cwhite: [C: 03+2] profile: bugfix dot_expander [puppet] - 10https://gerrit.wikimedia.org/r/663333 (owner: 10Cwhite) [16:03:03] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10Papaul) a:05Papaul→03aborrero @aborrero I do not have detail on this server. it said the server was spare , which spare was it? [16:03:18] 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Papaul) a:03Papaul [16:07:14] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/663599 [16:07:30] (03PS1) 10Kormat: integration: Add tests for db-replication-tree [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663600 [16:07:54] 10SRE, 10Analytics, 10SRE-Access-Requests: Add kzeta to analytics-privatedata-users - https://phabricator.wikimedia.org/T272982 (10kzimmerman) Thanks @elukey ! I'm able to access the data that I couldn't earlier :) [16:09:36] (03PS2) 10Kormat: integration: Add tests for db-replication-tree [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663600 (https://phabricator.wikimedia.org/T265266) [16:10:59] (03PS2) 10Alexandros Kosiaris: apertium: Cleanup scb cluster, puppet [puppet] - 10https://gerrit.wikimedia.org/r/663542 [16:11:04] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:11:44] 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Papaul) @Gehel the server is under warranty, I can request a replacement disk for sda. 
[16:11:54] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [16:12:58] (03CR) 10Alexandros Kosiaris: [C: 03+2] apertium: Cleanup scb cluster, puppet [puppet] - 10https://gerrit.wikimedia.org/r/663542 (owner: 10Alexandros Kosiaris) [16:13:08] !log kormat@cumin1001 dbctl commit (dc=all): 'Pool db1163 at 1%, again T258361', diff saved to https://phabricator.wikimedia.org/P14323 and previous config saved to /var/cache/conftool/dbconfig/20210211-161308-kormat.json [16:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:13] T258361: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 [16:13:26] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[16:14:37] (03CR) 10Dzahn: "thank you 😊" [puppet] - 10https://gerrit.wikimedia.org/r/663051 (owner: 10Dzahn) [16:16:34] (03PS1) 10Arturo Borrero Gonzalez: dumps: distribution: nfs: allow mounts from cloud public IPv4 networks [puppet] - 10https://gerrit.wikimedia.org/r/663603 (https://phabricator.wikimedia.org/T272397) [16:16:39] (03CR) 10Cwhite: [C: 03+2] profile: update netdev rsyslog template to ecs 1.7.0 [puppet] - 10https://gerrit.wikimedia.org/r/663081 (owner: 10Cwhite) [16:18:12] (03CR) 10Filippo Giunchedi: [C: 03+1] profile: remove deprecated syslog input [puppet] - 10https://gerrit.wikimedia.org/r/662009 (https://phabricator.wikimedia.org/T217032) (owner: 10Cwhite) [16:18:40] (03CR) 10Bstorm: [C: 03+1] "Hope it works." [puppet] - 10https://gerrit.wikimedia.org/r/663603 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [16:18:44] (03PS2) 10Arturo Borrero Gonzalez: dumps: distribution: nfs: allow mounts from cloud public IPv4 networks [puppet] - 10https://gerrit.wikimedia.org/r/663603 (https://phabricator.wikimedia.org/T272397) [16:21:47] (03PS3) 10Arturo Borrero Gonzalez: dumps: distribution: nfs: allow mounts from cloud public IPv4 networks [puppet] - 10https://gerrit.wikimedia.org/r/663603 (https://phabricator.wikimedia.org/T272397) [16:22:57] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) [16:23:15] 10SRE, 10DBA, 10Patch-For-Review: Productionize db1155-db1175 and refresh and decommission db1074-db1095 (22 servers) - https://phabricator.wikimedia.org/T258361 (10jcrespo) a:05jcrespo→03Marostegui [16:24:09] (03CR) 10Kormat: [C: 03+2] integration: Add tests for db-replication-tree [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663600 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [16:25:14] 10SRE, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade Calico - 
https://phabricator.wikimedia.org/T207804 (10akosiaris) [16:28:47] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1376.eqiad.wmnet with reason: REIMAGE [16:28:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:53] (03Merged) 10jenkins-bot: integration: Add tests for db-replication-tree [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/663600 (https://phabricator.wikimedia.org/T265266) (owner: 10Kormat) [16:28:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE [16:28:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:19] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1375.eqiad.wmnet with reason: REIMAGE [16:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:49] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1376.eqiad.wmnet with reason: REIMAGE [16:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:26] (03CR) 10Bstorm: [C: 03+1] dumps: distribution: nfs: allow mounts from cloud public IPv4 networks [puppet] - 10https://gerrit.wikimedia.org/r/663603 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [16:31:26] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:30] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:31:48] (03PS1) 10Cwhite: profile: remove type field for netdev-ecs. 
[puppet] - 10https://gerrit.wikimedia.org/r/663607 [16:32:00] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:04] (03PS11) 10Jcrespo: dbbackups: Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [16:32:06] (03PS1) 10Jcrespo: mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) [16:32:19] (03PS2) 10Jcrespo: mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) [16:32:41] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1375.eqiad.wmnet with reason: REIMAGE [16:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:48] (03PS2) 10Cwhite: profile: remove type field for netdev-ecs. [puppet] - 10https://gerrit.wikimedia.org/r/663607 [16:33:54] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE [16:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:22] 10SRE: Create cookbook to add a node to a Ganeti cluster - https://phabricator.wikimedia.org/T274527 (10akosiaris) [16:34:40] (03CR) 10Cwhite: [C: 03+2] profile: remove type field for netdev-ecs. 
[puppet] - 10https://gerrit.wikimedia.org/r/663607 (owner: 10Cwhite) [16:34:45] (03PS3) 10Dzahn: phabricator: convert statistics mail crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) [16:36:20] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:22] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:25] (03CR) 10jerkins-bot: [V: 04-1] phabricator: convert statistics mail crons to systemd timers [puppet] - 10https://gerrit.wikimedia.org/r/661536 (https://phabricator.wikimedia.org/T273673) (owner: 10Dzahn) [16:37:48] 10SRE, 10ops-eqiad, 10observability: eqiad: Move logstash1020 to rack A8 - https://phabricator.wikimedia.org/T273984 (10Cmjohnson) 05Open→03Resolved fixed [16:38:00] PROBLEM - PHP7 rendering on mw1368 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [16:38:25] !log mw1368 - File "/usr/lib/python3/dist-packages/spicerack/remote.py", line 637, in _execute raise RemoteExecutionError(ret, 'Cumin execution failed') [16:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:58] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10aborrero) a:05aborrero→03RobH I have no idea, perhaps we should ask @RobH who created this task. This sever wasn't in my radar before this task. [16:39:08] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:39:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: cumin execution failed during wmf-reimaged [16:39:08] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: cumin execution failed during wmf-reimaged [16:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:24] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:48] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dumps: distribution: nfs: allow mounts from cloud public IPv4 networks [puppet] - 10https://gerrit.wikimedia.org/r/663603 (https://phabricator.wikimedia.org/T272397) (owner: 10Arturo Borrero Gonzalez) [16:41:54] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:07] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10Cmjohnson) 05Open→03Declined @fgiunchedi I tried pulling the power and resetting, that would be the typical fix but it didn't work. Historically when this has happened we've had to replace the motherboard. This... 
[16:42:18] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:26] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:28] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:40] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:42:48] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:40] RECOVERY - Host ms-be1034 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [16:45:23] (03PS3) 10Jcrespo: mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) [16:45:42] 10SRE, 10ops-eqiad: maps1005.eqiad.wmnet: possible cable issues - https://phabricator.wikimedia.org/T274387 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson Replaced the cable, that should fix your issue. Re-open if the problem persists [16:47:01] (03PS4) 10Jcrespo: mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) [16:47:49] 10SRE, 10ops-eqiad: Degraded RAID on ms-be1032 - https://phabricator.wikimedia.org/T272209 (10Cmjohnson) @fgiunchedi with ms-be1034 going down and out, I can use a disk from that server to fix this issue. Let me know if you want to do that? 
[16:49:39] (03PS1) 10Cwhite: profile: remove type field for all ecs-formatted events [puppet] - 10https://gerrit.wikimedia.org/r/663613 (https://phabricator.wikimedia.org/T234565) [16:51:18] (03CR) 10Brennen Bearnes: [C: 03+1] docker-pkg: add ca_bundle configuration [puppet] - 10https://gerrit.wikimedia.org/r/663588 (https://phabricator.wikimedia.org/T274306) (owner: 10JMeybohm) [16:55:27] (03CR) 10LSobanski: [C: 03+1] mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) (owner: 10Jcrespo) [16:56:23] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1376.eqiad.wmnet'] ` an... [16:56:27] (03PS5) 10Jcrespo: mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) [16:59:02] 10SRE, 10ops-codfw: codfw: add VC-links IDs to Netbox - https://phabricator.wikimedia.org/T268749 (10Papaul) 05Open→03Resolved Row D complete [16:59:26] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications for db1163 once it has been pooled [puppet] - 10https://gerrit.wikimedia.org/r/663608 (https://phabricator.wikimedia.org/T274472) (owner: 10Jcrespo) [16:59:43] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) [17:00:04] jbond42 and cdanis: #bothumor I � Unicode. All rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T1700). 
[17:00:29] 10SRE, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Hardware): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Cmjohnson) 05Open→03Resolved @bstorm completed, resolving this task [17:01:00] (03PS4) 10Jdlrobson: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 [17:01:04] (03PS3) 10Jdlrobson: Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) [17:01:12] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [17:02:02] (03CR) 10Vgutierrez: [C: 03+2] admin: Add urbanecm to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/663581 (https://phabricator.wikimedia.org/T274318) (owner: 10Vgutierrez) [17:03:12] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1376.eqiad.wmnet [17:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:32] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 13, number_of_data_nodes: 7, active_primary_shards: 472, active_shards: 903, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_f [17:03:32] ask_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [17:03:41] 10SRE, 10SRE-Access-Requests, 
10Patch-For-Review: Requesting access to analytics-privatedata-users for urbanecm - https://phabricator.wikimedia.org/T274318 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez Done, you should have an email regarding your kerberos account password @Urbanecm [17:07:10] (03CR) 10Herron: [C: 03+1] profile: remove type field for all ecs-formatted events [puppet] - 10https://gerrit.wikimedia.org/r/663613 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:07:40] !log mw1375 - powercycle - stuck at reboot [17:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:48] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [17:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:18] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10wiki_willy) Hi @fgiunchedi - since this server is at the 4yr mark, are you ok with decommissioning it? Thanks, Willy [17:11:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1376.eqiad.wmnet [17:11:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:59] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1375.eqiad.wmnet'] ` an... [17:13:41] (03CR) 10Phuedx: [C: 03+1] Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [17:14:43] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for urbanecm - https://phabricator.wikimedia.org/T274318 (10Urbanecm) Thanks, I confirm I got the mail. [17:15:36] (03CR) 10Urbanecm: [C: 04-1] "logos.php is generated automatically. 
Please also update logos/manage.py (script for generation of the logos.php file) and logos/README.md" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [17:20:44] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:07] PROBLEM - Too many messages in kafka logging-eqiad #o11y on alert1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash7-eqiad instance=kafkamon1002 job=burrow partition={0,5} prometheus=ops site=eqiad topic={rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasourc [17:21:07] ter=logging-eqiad&var-topic=All&var-consumer_group=All [17:23:20] looks like rsyslog-notice and udp_localhost-info are lagging on kafka-logging eqiad [17:23:20] (03CR) 10Jdlrobson: "> Patch Set 4: Code-Review-1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [17:23:39] (03PS4) 10Jdlrobson: Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) [17:23:46] (03CR) 10Tjones: [C: 03+2] Add extra-analysis-khmer [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/662923 (https://phabricator.wikimedia.org/T274203) (owner: 10DCausse) [17:24:11] RECOVERY - Too many messages in kafka logging-eqiad #o11y on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [17:24:50] 10ops-codfw, 10Traffic: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 (10Papaul) [17:24:55] annnd they've caught up, ok [17:25:20] 10ops-codfw, 10Traffic: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 (10Papaul) p:05Triage→03Medium [17:25:37] PROBLEM - Host ms-be1034 is DOWN: PING CRITICAL - Packet loss = 100% [17:26:08] !log lvs2007 - puppet disabled, downtimed in icinga - T274571 [17:26:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:13] T274571: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 [17:27:08] !log lvs2007 - stopping pybal - T274571 [17:27:11] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) [17:27:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:47] the input rate still is a bit high on those topics though, keeping an eye [17:28:19] but seems like a burst, should settle out [17:29:06] 10Puppet, 10SRE, 10Patch-For-Review: Unused puppet resources audit, early 2021 - https://phabricator.wikimedia.org/T272559 (10aborrero) [17:30:23] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:30:45] ^ codfw bgp status alert is from lvs2007 log entries above, will ack them [17:31:29] !log lvs2007 - shutting down host - T274571 [17:31:33] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:34] T274571: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 [17:31:36] (03CR) 10Ladsgroup: Remove old OpenStack Rocky files/templates/manifests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663027 (owner: 10Andrew Bogott) [17:32:13] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1376.eqiad.wmnet [17:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:44] ACKNOWLEDGEMENT - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal Brandon Black T274571 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:35:01] 10SRE, 10ops-eqiad: ms-be1034 not powering on - https://phabricator.wikimedia.org/T274488 (10wiki_willy) [17:35:50] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1375.eqiad.wmnet [17:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:58] (03PS20) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [17:36:00] (03CR) 10CRusnov: dhcp: Introduce automation proxies for management networks (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:36:27] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1375.eqiad.wmnet [17:36:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:43] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:37:09] PROBLEM - Host lvs2007.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [17:37:20] 10SRE, 10serviceops, 
10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:38:44] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1368.eqiad.wmnet'] ` Of... [17:39:01] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:39:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [17:43:01] (03CR) 10Dzahn: "fair enough, Lars, i'll remove you. 
no hard feelings" [puppet] - 10https://gerrit.wikimedia.org/r/650306 (owner: 10Dzahn) [17:43:57] (03Abandoned) 10Dzahn: switch doc.wikimedia.org to doc1002 backend [dns] - 10https://gerrit.wikimedia.org/r/650625 (https://phabricator.wikimedia.org/T247653) (owner: 10Dzahn) [17:44:30] (03PS21) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [17:44:32] (03PS22) 10Jbond: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:44:37] RECOVERY - Host lvs2007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 33.66 ms [17:44:47] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:45:13] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:46:35] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) a:03wiki_willy Chris or John, please help us with this- Based on hw logs, it seems a typical case of memory stick going wrong. Host is depooled from service and can be rebooted/serviced in any... [17:47:23] (03CR) 10Dzahn: [V: 03+1] "The very last one before the hiera() saga is done." 
[puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [17:48:25] (03PS23) 10Jbond: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:49:02] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:49:28] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10jcrespo) [17:50:08] (03PS24) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [17:50:47] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:51:12] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10wiki_willy) a:05wiki_willy→03Cmjohnson @Cmjohnson /@jclark-ctr - just a heads up, this is higher priority and the server is still under warranty, through November 2021. 
Thanks, Willy [17:52:40] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Cmjohnson) [17:52:44] !log lvs2007 - starting up puppet + pybal - T274571 [17:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:50] T274571: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 [17:53:52] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 81, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:54:13] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1374.eqiad.wmnet with reason: REIMAGE [17:54:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:25] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Cmjohnson) [17:54:39] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Cmjohnson) 05Open→03Resolved @Jgreen These servers are ready for you to do your installs. the idrac password is setup to the temporary password, I'll DM you in IRC. I am resolving... 
[17:56:01] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Cmjohnson) [17:56:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1374.eqiad.wmnet with reason: REIMAGE [17:56:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:56:32] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10Cmjohnson) a:05Cmjohnson→03RobH Assigning to @robh to finish install [17:56:35] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1363.eqiad.wmnet with reason: REIMAGE [17:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:00] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE [17:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:51] (03PS25) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [17:58:28] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [17:58:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1363.eqiad.wmnet with reason: REIMAGE [17:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:15] !log lvs2007 - downtimes ended, back in service - T274571 [17:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:59:19] T274571: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 [17:59:42] (03PS26) 10CRusnov: dhcp: Introduce 
automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [17:59:45] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 56, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:04] chrisalbon and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T1800). [18:00:26] (03PS27) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [18:00:30] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [18:00:39] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE [18:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:36] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) [18:01:41] (03CR) 10jerkins-bot: [V: 04-1] dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [18:02:51] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) [18:03:38] (03PS4) 10Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) [18:04:19] (03PS12) 10Jcrespo: dbbackups: Move database backups-related puppet code to its own profile/role [puppet] - 
10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) [18:05:17] (03CR) 10jerkins-bot: [V: 04-1] mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [18:07:14] I am going to deploy a change that may block puppet deploys for a few minutes, please tell me if that blocks you and I can wait [18:07:50] (it needs coordinated private and public puppet deploy, and revert of both in case something goes badly) [18:08:09] RECOVERY - Check systemd state on relforge1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:08:39] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops: (Need By: TBD) rack/setup/install db21[45-52] - https://phabricator.wikimedia.org/T273568 (10Papaul) [18:09:19] proceeding now [18:09:31] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Move database backups-related puppet code to its own profile/role [puppet] - 10https://gerrit.wikimedia.org/r/662740 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:10:13] (03PS5) 10Dzahn: mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) [18:10:17] (03PS28) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [18:11:17] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [18:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:02] is the deployment train to group2 still rolling out today?
[18:12:15] puppet looking good so far [18:13:03] PROBLEM - Host ms-be1038 is DOWN: PING CRITICAL - Packet loss = 100% [18:13:11] PROBLEM - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:16:16] ryankemper: relforge above is known? ^ [18:16:58] I think the puppet deploy went well, I will be around in case there is any other error [18:16:59] (03PS29) 10CRusnov: dhcp: Introduce automation proxies for management networks [puppet] - 10https://gerrit.wikimedia.org/r/662025 (https://phabricator.wikimedia.org/T271583) [18:17:03] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) https://netbox.wikimedia.org/dcim/devices/2140/ I've now renamed this spare host to cloudnet2004-dev, but it needs its hostname labels app... [18:17:20] gehel: let me check, I'm working on a patch to fix the known issue with `kibana.service` so if it's just that then yes [18:17:41] ok, I'll go for dinner and let you silence it [18:17:58] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) [18:18:31] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) [18:18:37] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:04] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) [18:19:37] !log robh@cumin1001 START - Cookbook sre.dns.netbox [18:19:40] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:29] ACKNOWLEDGEMENT - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper https://phabricator.wikimedia.org/T262211#6817218 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:29] ACKNOWLEDGEMENT - Check systemd state on relforge1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ryan Kemper https://phabricator.wikimedia.org/T262211#6817218 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:59] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) a:05RobH→03Papaul [18:21:31] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) [18:21:37] 10SRE, 10ops-codfw, 10Traffic: codfw: lvs2007 : iDRAC is unable to communicate with power management firmware error - https://phabricator.wikimedia.org/T274571 (10Papaul) 05Open→03Resolved This issue was resolved for now by draining the power . 
[18:24:44] (03CR) 10Dzahn: [C: 03+2] "only affects mwdebug - and just for upgrading https://puppet-compiler.wmflabs.org/compiler1001/28021/" [puppet] - 10https://gerrit.wikimedia.org/r/662798 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [18:24:49] (03PS1) 10Jcrespo: dbbackups: Create new puppet module dbbackups, move backup check to it [puppet] - 10https://gerrit.wikimedia.org/r/663649 (https://phabricator.wikimedia.org/T138562) [18:27:19] (03PS5) 10Ammarpad: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [18:29:17] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10KFrancis) @CDanis Hello, pinging on this... Please send me Georgina's email address so I may send out the agreement for signatures. Thank you! [18:30:03] (03PS1) 10Dzahn: mwdebug: do not automatically sync home files, just manual [puppet] - 10https://gerrit.wikimedia.org/r/663651 [18:30:23] (03CR) 10Dzahn: [C: 03+2] mwdebug: do not automatically sync home files, just manual [puppet] - 10https://gerrit.wikimedia.org/r/663651 (owner: 10Dzahn) [18:30:34] (03PS2) 10Dzahn: mwdebug: do not automatically sync home files, just manual [puppet] - 10https://gerrit.wikimedia.org/r/663651 [18:33:41] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10CDanis) [18:33:49] 10SRE, 10LDAP-Access-Requests: Request to add Georgina Burnett to the ldap/wmde group - https://phabricator.wikimedia.org/T273780 (10CDanis) Email address sent privately. [18:40:16] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10Papaul) >>! 
In T267654#6824106, @RobH wrote: > https://netbox.wikimedia.org/dcim/devices/2140/ > > I've now renamed this spare host to cloudnet2... [18:42:16] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/28022/" [puppet] - 10https://gerrit.wikimedia.org/r/663649 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:42:39] (03PS6) 10Ammarpad: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [18:43:39] !log mw1374 - powercycled, reboot via ipmi issue [18:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:40] !log mw1368 - reboot via IPMI issue & can't powercycle "Unable to perform requested operation." - racreet [18:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:53] !log mw1368 - racadm racreset [18:46:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1368.eqiad.wmnet'] ` Of... [18:48:16] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1374.eqiad.wmnet'] ` an... [18:48:42] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1374.eqiad.wmnet [18:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:20] (03CR) 10Jcrespo: "More changes are needed, last 2 patches where the obvious ones, but now that they are separated I can improve them at my own pace." 
[puppet] - 10https://gerrit.wikimedia.org/r/663649 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [18:53:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [18:53:47] In the context of T274436 I am looking for the internal address of mathoid that can be used to access mathoid without restbase. Does anyone know how to retrieve this information? [18:53:48] T274436: Enable RESTbaseless validation in wikibase - https://phabricator.wikimedia.org/T274436 [18:55:59] physikerwelt1: mathoid.discovery.wmnet:10042 per https://wikitech.wikimedia.org/wiki/Mathoid#Troubleshooting ? [18:56:59] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1374.eqiad.wmnet [18:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:20] mutante: thank you [18:57:52] yw [19:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T1900). [19:00:04] Cladis and Jdlrobson: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:18] xD [19:02:03] !lo mw1363 - powercycled [19:02:25] * Urbanecm waves, but mobile only rn. Can deploy in 15 minutes, hopefully.
[19:03:02] mutante: wrong log statement, fyi [19:04:02] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:04:04] Urbanecm: thanks [19:04:17] !log mw1363 - powercycled, reboot issue [19:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:37] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1363.eqiad.wmnet'] ` an... [19:05:20] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:10:39] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE [19:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:40] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1368.eqiad.wmnet with reason: REIMAGE [19:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:18] !log robh@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:13:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:40] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1363.eqiad.wmnet [19:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:45] !log robh@cumin1001 START - Cookbook sre.dns.netbox [19:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:23] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) >>! In T267654#6824196, @Papaul wrote: >>>! In T267654#6824106, @RobH wrote: >> https://netbox.wikimedia.org/dcim/devices/2140/ >> >> I've... [19:14:35] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) 05Open→03Invalid [19:15:30] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1363.eqiad.wmnet [19:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:38] finally ready [19:17:42] Cladis: hi, i can deploy today [19:17:52] nice :) [19:17:59] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:06] (03CR) 10Urbanecm: [C: 03+2] Added Kokebok namespace to nowikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663569 (https://phabricator.wikimedia.org/T274265) (owner: 10Base) [19:18:58] (03Merged) 10jenkins-bot: Added Kokebok namespace to nowikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663569 (https://phabricator.wikimedia.org/T274265) (owner: 10Base) [19:19:52] Cladis: can you test it at mwdebug1001, please? [19:20:05] !log robh@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:55] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1362.eqiad.wmnet with reason: REIMAGE [19:20:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:11] RECOVERY - Postgres Replication Lag on maps1005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:21:49] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:21:56] (03PS20) 10Kosta Harlan: linkrecommendation: Cron job to load datasets [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) [19:22:08] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1365.eqiad.wmnet with reason: REIMAGE [19:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:13] (03CR) 10Andrew Bogott: [C: 03+2] Remove old OpenStack Rocky files/templates/manifests [puppet] - 10https://gerrit.wikimedia.org/r/663027 (owner: 10Andrew Bogott) [19:22:19] Urbanecm: seems to work, at least aliases and NSs themselves [19:22:26] great, let's sync it then [19:22:27] not checking if they count as content [19:22:48] since I would need to create some gibberish for that [19:22:59] sure, that's fine [19:23:03] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1362.eqiad.wmnet with reason: REIMAGE [19:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:34] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): (Need By: TBD) rack/setup/install cloudnet2004-dev - https://phabricator.wikimedia.org/T267654 (10RobH) [19:24:12] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 93e168cb7788c772895b47f239275544fb745358: Added Kokebok namespace to nowikibooks (T274265) (duration: 01m 20s) [19:24:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:24:16] T274265: Create custom namespace at no.wikibooks - https://phabricator.wikimedia.org/T274265 [19:24:18] Cladis: should be live! [19:24:20] anything else? 
[19:24:33] not for now :) [19:24:36] good :) [19:24:57] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1365.eqiad.wmnet with reason: REIMAGE [19:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:42] (03PS7) 10Urbanecm: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [19:25:46] (03CR) 10Urbanecm: [C: 03+2] Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [19:25:48] (03CR) 10Urbanecm: [C: 03+2] Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [19:26:50] (03Merged) 10jenkins-bot: Add inline documentation to configuration about updating logos regarding labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [19:26:53] (03Merged) 10jenkins-bot: Labs should override all logo definitions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [19:27:56] (03PS1) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [19:28:36] (03CR) 10jerkins-bot: [V: 04-1] refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [19:28:43] (03PS1) 10Andrew Bogott: archive-instances.py: move to python3 and update to use keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663662 (https://phabricator.wikimedia.org/T239584) [19:30:18] (03CR) 10Andrew Bogott: [C: 03+2] archive-instances.py: move to python3 and update to 
use keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663662 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [19:32:27] !log urbanecm@deploy1001 Synchronized wmf-config/logos.php: noop: a1244df3e829abc793113a7e32d1972db9f780a8: Add inline documentation to configuration about updating logos regarding labs (duration: 01m 08s) [19:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:01] DannyS712: can you prep cherry-picks for the stuff you want backported for GlobalWatchlist? and then list them on [[wt:Deployments]] [19:35:22] (03PS2) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [19:35:41] (03PS1) 10Andrew Bogott: Replace remaining uses of keystoneclient.session [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) [19:35:53] (03CR) 10jerkins-bot: [V: 04-1] refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [19:36:02] (03CR) 10Andrew Bogott: "These changes should be no-ops but we need to test the affected scripts before merging" [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [19:38:43] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1368.eqiad.wmnet'] ` an... 
[19:39:36] 10SRE, 10ops-eqiad, 10cloud-services-team (Hardware): cloudnet1004/cloudnet1003: network hiccups because broadcom driver/firmware problem - https://phabricator.wikimedia.org/T271058 (10Jclark-ctr) was able to download HP Service Pack for ProLiant with help with papaul. will be available next week to preform... [19:39:41] > DannyS712: can you prep cherry-picks for the stuff you want backported for GlobalWatchlist? and then list them on [[wt:Deployments]] [19:39:41] legoktm there may be another patch, https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalWatchlist/+/663393, so I want to wait until we're sure whats needed first [19:39:46] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1361.eqiad.wmnet with reason: REIMAGE [19:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:04] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1368.eqiad.wmnet [19:40:06] DannyS712: I merged that in master already [19:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:29] oh lol (as I said elsewhere, an LTA got my gmail account disabled so I didn't see) [19:40:40] !log mw1368 - had the reboot via IPMI issue, did DRAC reset and repeated wmf-autoreimage, issue did not happen again [19:40:42] lol - is that possible those days? [19:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:48] IRC notifications? [19:41:04] legoktm yeah, that still works. 
Urbanecm unfortunately [19:41:34] wondering how does one do it - feel free to PM if you wish [19:41:48] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1361.eqiad.wmnet with reason: REIMAGE [19:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:30] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1368.eqiad.wmnet [19:42:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:43] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Jclark-ctr) @EBernhardson would you be available monday to swap Ram being that it is over 90 days since error we can change memory [19:43:21] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[19:45:06] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:16] (03PS3) 10ArielGlenn: refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) [19:46:45] (03CR) 10jerkins-bot: [V: 04-1] refactor wikidata json dumps to be easier to test on deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/663661 (https://phabricator.wikimedia.org/T269377) (owner: 10ArielGlenn) [19:47:00] (03PS1) 10DannyS712: Restore RTL handling for non-Vue display [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663397 (https://phabricator.wikimedia.org/T274313) [19:47:01] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10EBernhardson) @Jclark-ctr yes, I'm available any time after 11 AM PST (19:00 UTC) monday. [19:47:33] (03CR) 10DannyS712: [C: 04-1] "Wait until confirmation that this + https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GlobalWatchlist/+/663393 actually fixes the issu" [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663397 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [19:47:34] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Gehel) >>! In T274555#6823403, @Papaul wrote: > @Gehel the server is under warranty, I can request a replacement disk for sda. Yes, please request that new disk! 
[19:47:47] (03PS1) 10DannyS712: Switch sidebar hook to onSidebarBeforeOutput [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663398 (https://phabricator.wikimedia.org/T274312) [19:48:09] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10Jclark-ctr) Forgot Monday is holiday Tuesday 11 AM PST? [19:48:53] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1362.eqiad.wmnet'] ` an... [19:51:19] RECOVERY - Thanos compact has not run on alert1001 is OK: (C)24 ge (W)12 ge 0.0258 https://wikitech.wikimedia.org/wiki/Thanos%23Alerts https://grafana.wikimedia.org/d/651943d05a8123e32867b4673963f42b/thanos-compact [19:52:43] (03CR) 10Kosta Harlan: linkrecommendation: Cron job to load datasets (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [19:53:15] (03CR) 10Kosta Harlan: linkrecommendation: Cron job to load datasets (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/660394 (https://phabricator.wikimedia.org/T265893) (owner: 10Kosta Harlan) [19:53:16] legoktm okay, I've cherry picked everything to .30 so that the tests can start running, but waiting to double check on beta that it works first (still hasn't synced as far as I can tell) [19:54:25] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1362.eqiad.wmnet [19:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1362.eqiad.wmnet [19:55:38] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:08] (03PS1) 10Base: Enabling extension SandboxLink on ltwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663668 (https://phabricator.wikimedia.org/T273957) [19:56:41] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [19:57:18] (03PS1) 10Dzahn: DHCP: switch mwdebug hosts from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/663669 (https://phabricator.wikimedia.org/T274023) [19:57:37] (03CR) 10jerkins-bot: [V: 04-1] DHCP: switch mwdebug hosts from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/663669 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [19:57:56] (03CR) 10DannyS712: "This change is ready for review." [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663399 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [19:58:26] (03PS2) 10Dzahn: DHCP: switch mwdebug hosts from stretch to buster installer [puppet] - 10https://gerrit.wikimedia.org/r/663669 (https://phabricator.wikimedia.org/T274023) [19:59:44] (03PS3) 10Dzahn: DHCP: switch mwdebug hosts buster installer, mwdebug1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/663669 (https://phabricator.wikimedia.org/T274023) [20:00:04] twentyafterfour and hashar: #bothumor My software never has bugs. It just develops random features. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T2000). 
[20:00:14] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1364.eqiad.wmnet with reason: REIMAGE [20:00:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:27] (03CR) 10Dzahn: [C: 03+2] DHCP: switch mwdebug hosts buster installer, mwdebug1003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/663669 (https://phabricator.wikimedia.org/T274023) (owner: 10Dzahn) [20:02:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1364.eqiad.wmnet with reason: REIMAGE [20:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:48] 10SRE, 10ops-eqiad, 10Discovery-Search (Current work): Memory issue on elastic1063 caused elasticsearch to be killed - https://phabricator.wikimedia.org/T265113 (10EBernhardson) Ooh, holiday! I forgot about that too. Yea tuesday will work. [20:07:37] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1361.eqiad.wmnet'] ` an... 
[20:08:02] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1361.eqiad.wmnet [20:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:30] twentyafterfour: looks like train can proceed [20:08:49] there is one slight regression related to ip ranges not being linked in Special:Log but that is very very minor imho [20:09:01] there is a patch for it on master which we would have to backport later on [20:09:06] but I don't think we should hold on it [20:09:30] !log mw1365 - powercycle - reboot issue [20:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:34] (03PS1) 10Razzi: Disable MaxMind archiving [puppet] - 10https://gerrit.wikimedia.org/r/663671 (https://phabricator.wikimedia.org/T273891) [20:11:17] (03CR) 10Jdlrobson: "Thanks Ammarpad for the help with this one and thank you Urbanecm for the deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663057 (owner: 10Jdlrobson) [20:11:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1361.eqiad.wmnet [20:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:11:31] happy to help Jdlrobson :) [20:12:29] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... 
[20:12:41] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:13:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1360.eqiad.wmnet with reason: REIMAGE [20:13:38] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1365.eqiad.wmnet'] ` an... [20:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1360.eqiad.wmnet with reason: REIMAGE [20:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:17] (03PS3) 10DLynch: Enable DiscussionTools Reply Tool A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/661373 (https://phabricator.wikimedia.org/T273554) (owner: 10Bartosz Dziewoński) [20:19:34] (03PS1) 10DLynch: Oversample DiscussionTools EditAttemptStep logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663672 (https://phabricator.wikimedia.org/T273946) [20:21:54] (03PS1) 10Dzahn: mwdebug: flip rsync source and dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/663674 [20:23:19] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1365.eqiad.wmnet [20:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:17] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1365.eqiad.wmnet [20:24:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:14] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Cmjohnson) [20:25:24] 10SRE, 10serviceops, 
10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [20:26:03] !log new train blocker preventing deploy of 1.36.0-wmf.30 to all wikis. T274589 blocks T271344 [20:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:09] T274589: No atomic section is open (got LocalFile::lockingTransaction) - https://phabricator.wikimedia.org/T274589 [20:26:09] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344 [20:26:12] (03CR) 10Dzahn: [C: 03+2] mwdebug: flip rsync source and dest hosts [puppet] - 10https://gerrit.wikimedia.org/r/663674 (owner: 10Dzahn) [20:26:21] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install aqs101[0-5] - https://phabricator.wikimedia.org/T267414 (10Cmjohnson) a:05Cmjohnson→03RobH These are ready for installs, assigning to @RobH [20:27:27] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Cmjohnson) pasting system event log Record: 10 Date/Time: 02/10/2021 23:35:50 Source: system Severity: Non-Critical Description: Correctable memory error rate exceeded for DIMM_B3. -------... 
[20:28:40] we're going to stay on wmf.27 forever, aren't we [20:29:21] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1359.eqiad.wmnet with reason: REIMAGE [20:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:34] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1359.eqiad.wmnet with reason: REIMAGE [20:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:53] MatmaRex: it seems like it [20:32:25] (03PS1) 10Base: Adding WQ as namespace alias for itwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663678 (https://phabricator.wikimedia.org/T273362) [20:33:22] 10SRE, 10ops-eqiad, 10DBA: Investigate and repool db1134 - https://phabricator.wikimedia.org/T274472 (10Cmjohnson) a ticket has been created with Dell for a new DIMM. Ticket number SR1051489398 [20:36:20] twentyafterfour: I am here [20:36:58] damn yet another mysterious blocker :-\ [20:38:29] (03CR) 10Ottomata: [C: 03+1] Disable MaxMind archiving [puppet] - 10https://gerrit.wikimedia.org/r/663671 (https://phabricator.wikimedia.org/T273891) (owner: 10Razzi) [20:39:07] (03CR) 10DannyS712: Restore RTL handling for non-Vue display [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663397 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [20:39:31] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:40:12] it's not that mysterious. 
as long as we are blocking for all logspam it's going to be blocked forever I'm afraid [20:42:16] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1355.eqiad.wmnet with reason: REIMAGE [20:42:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:42:26] twentyafterfour: looks eventbus related? [20:42:44] or EventBus triggers an issue which is in mediawiki [20:43:28] (03PS1) 10Dzahn: Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 [20:43:41] (03CR) 10jerkins-bot: [V: 04-1] Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 (owner: 10Dzahn) [20:44:23] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1355.eqiad.wmnet with reason: REIMAGE [20:44:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:11] twentyafterfour: I wanted to confirm the FeatureFeeds issue from last week is definitely solved. Guess that will wait ;) [20:46:09] hashar: :( [20:47:15] legoktm for the GlobalWatchlist deployment, the fixes worked, backports are ready and listed at https://wikitech.wikimedia.org/wiki/Deployments#Week_of_February_08. [20:48:12] hashar twentyafterfour meta wiki is currently on .30 - is there a chance that it will be rolled back? Or does the train blocker only prevent further deployment to group2? 
We can't deploy GlobalWatchlist to meta on .27, so if there is a chance it'll be rolled back we should wait for the deployment [20:50:18] (03PS2) 10Dzahn: Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 [20:50:48] (03CR) 10jerkins-bot: [V: 04-1] Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 (owner: 10Dzahn) [20:51:10] DannyS712: there is a chance of rollback yeah [20:51:13] https://wikitech.wikimedia.org/wiki/Deployments/Holding_the_train says that you can't deploy new shiny features while the train is blocked [20:52:11] somewhere it's been discussed that we shouldn't leave the train partially deployed over the weekend and should instead roll back to a single version everywhere [20:52:46] !log mw1364 - powercycled [20:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:53:00] "you can't deploy new shiny features while the train is blocked" technically it's already deployed to testwiki, but yeah, sending it to meta would result in much more usage. 
Okay, I'll look into fixing the blocker task [20:53:00] I don't know if there was a conclusion to that discussion, trying to find it [20:54:18] (03CR) 10Krinkle: "Review the effective diff at https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/5637/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [20:54:21] T260401 [20:54:22] T260401: Avoid unfinished train deploys over holidays, weekends, or other stretches of no-deploy days - https://phabricator.wikimedia.org/T260401 [20:55:49] (03CR) 10Razzi: [C: 03+2] Disable MaxMind archiving [puppet] - 10https://gerrit.wikimedia.org/r/663671 (https://phabricator.wikimedia.org/T273891) (owner: 10Razzi) [20:57:10] good luck w/ getting the train rolling, I'm off to bed [20:57:28] twentyafterfour was this issue seen in .29? [20:57:29] outcome: inconclusive. In the current situation it might be more disruptive to roll everything back to .27 than to leave things split over the weekend. On the other hand, the error that's blocking is occurring on group1 so leaving things as-is isn't good really [20:57:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1364.eqiad.wmnet'] ` an... 
[20:57:46] DannyS712: it seems new [20:58:10] okay - I didn't see anything obvious that could have caused it, so brute force - check every commit in https://www.mediawiki.org/wiki/MediaWiki_1.36/wmf.30 for plausible suspects [20:59:08] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1364.eqiad.wmnet [20:59:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:43] DannyS712: if GlobalWatchlist is not compat with wmf.27 then it should not have been enabled until the Monday after wmf.27 is gone from production. [21:00:53] hashar: the featurefeeds 'caching something unserializable' issue is indeed gone. [21:01:09] it sounds like it was enabled too early, and if we need to rollback the train, we'll disable the extension on testwiki if/when needed, or leave it broken since it's testwiki [21:01:37] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1364.eqiad.wmnet [21:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:02:48] (03CR) 10Krinkle: "boolean false is not a supported value for $wgLogos in MW afaik." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [21:02:48] Krinkle the extension works with wmf.27, it's just missing a couple of patches that should be included before deployment to meta [21:03:27] I don't know what that means. what is missing the patches? [21:03:41] (03CR) 10Jdlrobson: "The diff is fine - the issue is the wmg variables seem to be ignored, in the construction of the final wgLogos that happens in wmf-config/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [21:04:04] Krinkle: I think it's fine to be enabled on testwiki on wmf.27, but it shouldn't reach meta in that state [21:04:19] the bugs are mostly UI things [21:04:21] (03CR) 10Krinkle: "It is the problem. See task. 
wgLogos[1.5]=false is in prod and thus background-image:url("")" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [21:04:34] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by dzahn on cumin1001.eqia... [21:04:40] (03CR) 10Krinkle: "> Patch Set 5:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [21:05:18] !log mw1360 - powercycling [21:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:32] apergos: awesome thank you :] [21:07:48] now go away and get some rest :-D [21:07:55] I am off happy train! [21:07:57] yeah [21:07:59] long days [21:08:00] ;D [21:08:35] (03CR) 10Jdlrobson: "Okay we're talking about 2 different problems. When I looked earlier I was seeing something different." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663263 (https://phabricator.wikimedia.org/T274210) (owner: 10Jdlrobson) [21:10:09] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1360.eqiad.wmnet'] ` an... [21:12:57] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1360.eqiad.wmnet [21:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:14:45] DannyS712: I think given the uncertainty, we should backport the patches to wmf.30, but hold off on the meta part until Tuesday...what do you think? 
[21:16:33] 10SRE, 10Domains, 10Traffic: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10Peachey88) [21:16:53] 10SRE, 10DNS, 10Traffic: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10Reedy) [21:17:05] 10SRE, 10DNS: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10Peachey88) [21:19:17] 10SRE, 10DNS, 10Traffic: Apple Business Manager: verify ownership of wikimedia.org - https://phabricator.wikimedia.org/T274592 (10Dzahn) [21:20:44] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1360.eqiad.wmnet [21:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:00] legoktm sure - ready to test the backports whenever [21:21:08] we can test on testwiki [21:21:29] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1354.eqiad.wmnet with reason: REIMAGE [21:21:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:20] DannyS712: ack, let's do that during the window we already have [21:23:36] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1354.eqiad.wmnet with reason: REIMAGE [21:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:03] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by legoktm on cumin1001.eq... [21:33:21] twentyafterfour could https://gerrit.wikimedia.org/r/q/b75ac3953e750fd6b1b29868a77dbebd7969fbdc be the cause of the lock issue? 
It's the only commit I see recently about locking [21:34:33] production runs on mariadb, not sqlite [21:34:40] (03PS1) 10RobH: swap back to idrac legacy password [software] - 10https://gerrit.wikimedia.org/r/663684 [21:36:15] (03CR) 10RobH: [C: 03+2] swap back to idrac legacy password [software] - 10https://gerrit.wikimedia.org/r/663684 (owner: 10RobH) [21:36:57] !log mw1355, mw1359 - power cycling [21:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:37] (03CR) 10Bstorm: [C: 04-1] "So far, not so good." [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [21:37:51] DannyS712: This relates to uploading and Rdbms atomic sections. This isn't a code area that anyone currently knows very well to my knowledge, but I think it's important to let SDE become familiar with this through real-world examples like this, so I'd let it be for now. [21:38:23] it looks like the locking issue is a race condition. it's happening inside a catch {} block when a __destruct tries to end an atomic section that was already ended elsewhere [21:39:08] maybe not a race condition but just an unexpected execution path due to an interaction between a bunch of pieces and try/catch/throw behavior [21:39:22] Krinkle how did you conclude it was SDE-related? [21:39:32] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1359.eqiad.wmnet'] ` an... [21:39:55] DannyS712: uploading files and media management is owned by SDE team. [21:39:56] what's SDE? 
[21:40:03] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1359.eqiad.wmnet [21:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:40:11] it is not "Structurered Data on Commons" related specifically [21:40:14] oh [21:40:15] twentyafterfour: structured data engineering [21:40:27] oh, I thought it was structured data on commons - thanks for clarifying [21:40:52] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1355.eqiad.wmnet'] ` an... [21:41:10] DannyS712: which task for logstash missing all events? [21:41:16] (is it really missing all events?) [21:41:31] just filed T274593 - yes, looks like it [21:41:31] T274593: Logstash beta is not getting any events - https://phabricator.wikimedia.org/T274593 [21:42:38] I note that it is running kibana 5 instead of kibana 7 like prod [21:43:19] (03CR) 10Bstorm: [C: 04-1] Replace remaining uses of keystoneclient.session (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [21:44:01] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1359.eqiad.wmnet [21:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:47] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1355.eqiad.wmnet [21:45:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:49:24] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1354.eqiad.wmnet'] ` an... 
[21:50:06] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1355.eqiad.wmnet [21:50:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:54] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw1354.eqiad.wmnet [21:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:35] !log dzahn@cumin1001 conftool action : set/pooled=yes; selector: name=mw1354.eqiad.wmnet [21:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:10] (03CR) 10Bstorm: [C: 04-1] Replace remaining uses of keystoneclient.session (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [21:57:11] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1329.eqiad.wmnet with reason: REIMAGE [21:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:06] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1330.eqiad.wmnet with reason: REIMAGE [21:59:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:12] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1329.eqiad.wmnet with reason: REIMAGE [21:59:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:00] (03PS1) 10RobH: mwlog1002 updates [puppet] - 10https://gerrit.wikimedia.org/r/663686 (https://phabricator.wikimedia.org/T267271) [22:00:04] Legoktm and DannyS712: My dear minions, it's time we take the moon! Just kidding. Time for GlobalWatchlist deployment to production deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T2200). 
[22:00:12] here [22:00:27] we're doing just the backports, not the actual enabling on meta [22:00:39] (03CR) 10Legoktm: [C: 03+2] Restore RTL handling for non-Vue display [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663397 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [22:00:43] (03CR) 10Legoktm: [C: 03+2] Add @noflip commands for CSS [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663399 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [22:00:50] (03CR) 10Legoktm: [C: 03+2] Switch sidebar hook to onSidebarBeforeOutput [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663398 (https://phabricator.wikimedia.org/T274312) (owner: 10DannyS712) [22:01:01] (03CR) 10RobH: [C: 03+2] mwlog1002 updates [puppet] - 10https://gerrit.wikimedia.org/r/663686 (https://phabricator.wikimedia.org/T267271) (owner: 10RobH) [22:01:05] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Dagbani Wikimedians Mailing List - https://phabricator.wikimedia.org/T274582 (10Dzahn) a:03Dzahn [22:01:05] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1331.eqiad.wmnet with reason: REIMAGE [22:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:11] thanks - I'll be ready to verify on testwiki once its on a debug host [22:01:15] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1330.eqiad.wmnet with reason: REIMAGE [22:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:02] ok [22:02:27] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10RobH) [22:03:07] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1332.eqiad.wmnet with reason: REIMAGE [22:03:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:15] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1331.eqiad.wmnet with reason: REIMAGE [22:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:04:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) [22:04:08] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) 05Resolved→03Open a:05Cmjohnson→03Jgreen [22:04:48] 10SRE, 10ops-eqiad, 10DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (10RobH) [22:04:50] legoktm once the train blocker is resolved, can enabling on meta go through a normal backport&config window, or does it still need its own dedicated time? It's already running in production on testwiki... 
[22:05:15] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1332.eqiad.wmnet with reason: REIMAGE [22:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:05:26] (03Merged) 10jenkins-bot: Restore RTL handling for non-Vue display [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663397 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [22:05:28] (03Merged) 10jenkins-bot: Add @noflip commands for CSS [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663399 (https://phabricator.wikimedia.org/T274313) (owner: 10DannyS712) [22:05:36] (03Merged) 10jenkins-bot: Switch sidebar hook to onSidebarBeforeOutput [extensions/GlobalWatchlist] (wmf/1.36.0-wmf.30) - 10https://gerrit.wikimedia.org/r/663398 (https://phabricator.wikimedia.org/T274312) (owner: 10DannyS712) [22:05:48] I think it should be fine in a backport window but it's also up to the deployer if they feel comfortable enough [22:06:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: (Need By: TBD) rack/setup/install frqueue100[34] - https://phabricator.wikimedia.org/T266365 (10Jgreen) [22:06:18] okay. Just want to be able to deploy as soon as the train blocker is resolved [22:07:08] DannyS712: it's on mwdebug1002 now [22:07:50] testing [22:08:45] (03CR) 10Bstorm: [C: 03+1] "This one has the potential to affect a lot, but I don't see how it would be a problem either. Just please merge while I'm working just in " [puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:09:29] legoktm everything appears to work - there are some i18n issues I didn't see before, but that isn't important at the moment [22:09:42] i18n or ltr/rtl? 
[22:09:56] i18n [22:10:10] ack [22:10:19] specifically T260220 that I'll look into [22:10:20] T260220: Use proper messages for log entries - https://phabricator.wikimedia.org/T260220 [22:10:40] (03PS2) 10Bstorm: Replace remaining uses of keystoneclient.session [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:10:48] syncing [22:11:55] !log legoktm@deploy1001 Synchronized php-1.36.0-wmf.30/extensions/GlobalWatchlist: GlobalWatchlist backports (duration: 01m 11s) [22:11:55] 10SRE: Either include X-Varnish in Mediawiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10CDanis) [22:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:12:10] there we go :) [22:12:24] works! [22:12:42] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Dagbani Wikimedians Mailing List - https://phabricator.wikimedia.org/T274582 (10Dzahn) Hello @Masssly the list has been created. Here is the list info page: https://lists.wikimedia.org/mailman/listinfo/dagbani here is the admin page https://lists.wiki... [22:13:42] 10SRE: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10Legoktm) [22:13:44] (03CR) 10Bstorm: "This should clear the errors and appears to work. However, I noticed in the doc that the correct class for the auth argument to Session is" [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:13:50] 10SRE, 10Wikimedia-Mailing-lists: Request for creation: Dagbani Wikimedians Mailing List - https://phabricator.wikimedia.org/T274582 (10Dzahn) 05Open→03Resolved Please keep in mind there is just one admin password you share. 
You can change it if you like but need to sync with each other. [22:14:09] 10SRE: Either include X-Varnish in MediaWiki logs and include the X-Varnish in Varnish 5xx logs; or, include the beresp X-Request-Id in Varnish 5xx logs - https://phabricator.wikimedia.org/T274595 (10Tgr) Any preference? Including the varnish ID in all MediaWiki logs should be pretty easy, we'd just add another... [22:14:47] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:16:43] (03CR) 10Bstorm: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:19:04] (03PS1) 10Razzi: Remove MaxMind archiving code [puppet] - 10https://gerrit.wikimedia.org/r/663687 (https://phabricator.wikimedia.org/T273891) [22:19:19] (03PS3) 10Bstorm: Replace remaining uses of keystoneclient.session [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:23:12] (03CR) 10Bstorm: "This version definitely works on tools-clush-generator.py. 
Since nfs-exportd is somewhat scary, I can disable puppet on labstore1004 and m" [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:24:21] (03PS1) 10Andrew Bogott: archive-instances.py: fix use of keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) [22:24:54] (03CR) 10jerkins-bot: [V: 04-1] archive-instances.py: fix use of keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:25:07] (03PS2) 10Andrew Bogott: archive-instances.py: fix use of keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) [22:25:38] (03CR) 10jerkins-bot: [V: 04-1] archive-instances.py: fix use of keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:26:39] (03PS3) 10Andrew Bogott: archive-instances.py: fix use of keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) [22:27:27] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:27:38] (03CR) 10Andrew Bogott: [C: 03+2] archive-instances.py: fix use of keystoneauth1 [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:28:13] (03PS2) 10Dzahn: cloud: replace hiera in hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) [22:28:38] (03CR) 10Bstorm: [C: 03+2] Replace remaining uses of keystoneclient.session [puppet] - 10https://gerrit.wikimedia.org/r/663666 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:28:45] (03CR) 10Dzahn: [C: 03+2] cloud: replace hiera in hiera with lookup [puppet] - 10https://gerrit.wikimedia.org/r/663289 (https://phabricator.wikimedia.org/T209953) (owner: 10Dzahn) [22:34:49] 10SRE, 10Patch-For-Review, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) [22:35:48] 10SRE, 10Epic, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) [22:36:54] (03CR) 10Bstorm: archive-instances.py: fix use of keystoneauth1 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:37:36] (03CR) 10Bstorm: "But" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663688 (https://phabricator.wikimedia.org/T239584) (owner: 10Andrew Bogott) [22:38:53] 10SRE, 10Epic, 10cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (10Dzahn) 05Open→03Resolved After [[ https://gerrit.wikimedia.org/r/q/topic:%22hiera-lookup%22+(status:open%20OR%20status:merged) | many many patches ]] this is now actuall... 
[22:38:57] 10Puppet, 10SRE, 10puppet-compiler, 10Patch-For-Review, 10User-jbond: OKR: Work required to prepare for puppet 6 - https://phabricator.wikimedia.org/T265138 (10Dzahn) [22:39:54] 10SRE, 10ops-codfw, 10Discovery-Search (Current work): elastic2054 unresponsive - https://phabricator.wikimedia.org/T274555 (10Papaul) Create Dispatch: Success You have successfully submitted request SR1051498857. [22:42:18] (03PS1) 10Zoranzoki21: toolserver_legacy: Remove unknown property "margins" [puppet] - 10https://gerrit.wikimedia.org/r/663690 (https://phabricator.wikimedia.org/T274562) [22:43:08] (03PS2) 10Zoranzoki21: toolserver_legacy: Remove unknown property "margins" [puppet] - 10https://gerrit.wikimedia.org/r/663690 (https://phabricator.wikimedia.org/T274562) [22:43:25] (03PS3) 10Dzahn: Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 [22:43:57] (03PS3) 10Zoranzoki21: toolserver_legacy: Remove unknown property "margins" [puppet] - 10https://gerrit.wikimedia.org/r/663690 (https://phabricator.wikimedia.org/T274562) [22:44:00] (03CR) 10jerkins-bot: [V: 04-1] Revert "mwdebug: allow rsyncing home dirs from any mwdebug* to a backup host" [puppet] - 10https://gerrit.wikimedia.org/r/663401 (owner: 10Dzahn) [22:44:15] (03CR) 10Cwhite: [C: 03+2] profile: remove type field for all ecs-formatted events [puppet] - 10https://gerrit.wikimedia.org/r/663613 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [22:47:26] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/663599 [22:47:59] jouncebot: now [22:47:59] For the next 1 hour(s) and 12 minute(s): GlobalWatchlist deployment to production (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T2200) [22:48:10] legoktm: are you currently deploying sth? 
[22:48:26] Urbanecm: no, we're done [22:48:30] thanks [22:48:46] (03CR) 10TerraCodes: [C: 04-1] "`text-decoration: none;` doesn't work for on me on the body element, so I'm assuming it needs to be applied to the `A` element that wraps " [puppet] - 10https://gerrit.wikimedia.org/r/663690 (https://phabricator.wikimedia.org/T274562) (owner: 10Zoranzoki21) [22:49:41] PROBLEM - Host mwmaint1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:49:49] (03PS1) 10Bstorm: keystone: Stay in keystoneauth1 for auth classes [puppet] - 10https://gerrit.wikimedia.org/r/663691 (https://phabricator.wikimedia.org/T239584) [22:50:24] what's with mwmaint? [22:52:16] (03PS4) 10Zoranzoki21: toolserver_legacy: Remove unknown property "margins" [puppet] - 10https://gerrit.wikimedia.org/r/663690 (https://phabricator.wikimedia.org/T274562) [22:52:19] connected to console on mwmaint1002 [22:52:28] no output [22:52:38] my scap will probably fail on it :/ [22:53:11] !log Deploy security patch for T274514 [22:53:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:28] !log powercycling crashed mwmaint1002 [22:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:48] narf [22:54:02] (03CR) 10Zoranzoki21: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/663690 (https://phabricator.wikimedia.org/T274562) (owner: 10Zoranzoki21) [22:54:20] thanks mutante :) [22:56:35] RECOVERY - Host mwmaint1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [22:56:45] 10SRE, 10serviceops, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO, 10User-jijiki: Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['mw1329.eqiad.wmnet', 'mw13... 
[22:57:10] !log Run scap pull at mwmaint1002 [22:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:28] ahhhhh [22:57:30] that was me [22:57:31] shit [22:57:34] i fucked up [22:57:52] i was supposed to be on mwlog1002 and was on mwmaint1002 [22:57:59] Urbanecm / musikanimal [22:58:00] mutante: [22:58:09] robh: oh? I like to know a reason, much better than wondering why :) [22:58:17] no absolutely, i own my mistakes [22:58:27] and I fucked up big time, letm me ensure tios done with my mistakes [22:58:30] ok, np, it is back [22:58:37] so, was it actually an "intentional" outage? [22:58:40] well, it has updated bios now. [22:58:45] heh [22:58:52] no, it wasnt suposed to crash, i just stupidly logged into the wrong mgmt interface becuse im stuipd [22:58:55] =P [22:58:59] 100% end user error [22:59:13] ive typed in 'mwmaint' a LOT more than ive ever typed 'mwlog' [22:59:19] well i guess it should have new bios anyway 🙂 [22:59:19] stuff happens, glad it was not reimaging [22:59:45] mutante: you ahve no idea i broke into a cold sweat here about 45 seconds ago when i realized what i did [22:59:51] I don't think anything bad happened [22:59:55] well, 2 minutes ago now but yeah [23:00:01] the periodic jobs will just run when they run [23:00:05] my fear was someone was deploying a change via it when i did that [23:00:18] i was on deploy1001, and all that happened is a timeout [23:00:25] nah, deploy* would have been a bit different [23:00:42] geeze, i do not like the adrenaline dump when you realize you took offline a live host. [23:00:50] i ran scap pull there to make it back it sync already (through my change is not supposed to touch it anyway) [23:00:58] dont worry rob, i know that feeling [23:01:33] Errare humanum est :) [23:02:32] glad it's not broken hardware and having to find replacement now [23:02:53] just in case...do we have backup mwmaint in eqiad? 
[23:03:02] (03PS1) 10Cwhite: iegreview: drop log_dest and add log_channel [puppet] - 10https://gerrit.wikimedia.org/r/663693 (https://phabricator.wikimedia.org/T215497) [23:03:03] the worst part is that 30 seconds of frenzied double checking that you did indeed not mean to do what just happened [23:03:28] (03CR) 10Bstorm: keystone: Stay in keystoneauth1 for auth classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663691 (https://phabricator.wikimedia.org/T239584) (owner: 10Bstorm) [23:03:36] Urbanecm: one mwmaint in eqiad, one in codfw, my understanding is they can cross deploy but i could be wrong [23:03:43] (03PS1) 10Nray: Enable WVUI search on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 [23:03:45] cross run their scripts that is [23:04:23] yea, those are good questions I am actually not too sure about [23:04:33] there isn't a warm standby in eqiad [23:04:52] robh: codfw's mwmaint talks to codfw's part of MW infra, which is read only [23:05:00] no idea if it is possible to switchover just mwmaint [23:05:06] oh, my bad [23:05:22] setting one up would not be too hard.. 
but for that we need hw in spare pool [23:05:24] i vaguely recall failing over to one when having to upgrade the other though [23:05:35] but perhaps it was when there were two in a site for whatever reason [23:05:48] it may be possible to failover it, no idea [23:05:49] eqiad is at mwmaint1002, so there was an mwmaint1001 at some point in time [23:06:03] 'why do we need a second host' 'rob' [23:06:06] = P [23:06:06] that was a distro version update [23:06:18] (03PS2) 10Nray: Enable WVUI search on beta (for Vector skin) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663694 [23:06:23] that sounds right [23:07:13] (03CR) 10Bstorm: keystone: Stay in keystoneauth1 for auth classes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/663691 (https://phabricator.wikimedia.org/T239584) (owner: 10Bstorm) [23:07:15] maybe we would end up taking a random appserver and turn it into an mwmaint [23:07:33] hehe [23:07:44] it was also an open question whether mwmaint can be virtual [23:07:51] at one point [23:07:59] if so then we'd just fire up a ganeti [23:08:39] unrelatedly… i can't seem to `git pull` the operations/mediawiki-config repository [23:08:44] also that is supposed to move to kubernetes [23:08:50] other repos work fine [23:08:59] MatmaRex: what does "I can't" mean, please? [23:09:08] i'm pasting [23:09:24] https://phabricator.wikimedia.org/P14326 [23:09:35] MatmaRex: For what it's worth, I can do so. [23:10:11] 10SRE, 10SRE-Access-Requests, 10User-brennen: Requesting access to gerrit1001/gerrit1002 for brennen - https://phabricator.wikimedia.org/T274601 (10brennen) [23:10:12] +1, git pull works for me in that repo [23:10:13] What's the remote url you're using? (I'm on ssh://kemayo@gerrit.wikimedia.org:29418/operations/mediawiki-config ) [23:10:31] I think that can happen if you are on a slow connection and unlucky but should be transient [23:10:38] same. 
ssh://matmarex@gerrit.wikimedia.org:29418/operations/mediawiki-config.git [23:10:55] i had vaguely similar issues in the past with mediawiki/core, but i don't remember if the errors were the same [23:11:26] well, it went through on the fourth try [23:11:37] if that happens it is always on very large repos like the 2 you mentioned and then goes away when retrying, afaict [23:11:52] when it hits some timeout [23:11:53] mutante: the message "fatal: the remote end hung up unexpectedly" makes me think that gerrit is to blame [23:12:36] dancy was seeing some slowness on cloning mw/core earlier today, so plausibly something is up. [23:13:01] https://phabricator.wikimedia.org/T263293 [23:13:05] https://phabricator.wikimedia.org/T263293 [23:13:11] ha, yeah. that's the one i filed last time [23:13:17] authored by MatmaRex [23:13:25] RECOVERY - Check systemd state on relforge1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:13:37] you know whats fun, searching for "fatal: something in phab causes an error [23:13:58] haha [23:14:05] mutante: needs to be quoted in the search string [23:14:27] (i have to remember that about 3x a week) [23:14:40] brennen: ACK :) thanks [23:15:02] brennen: Btw, I think my issue from earlier was due to a too-large --jobs value. It was 8. Lowered to 4 and now it's faster. [23:15:10] MatmaRex: yea, so back then we also ended up with some upstream bugs that were somehow related to high latency [23:15:11] dancy: ahh, gotcha. [23:15:12] (03PS1) 10Cwhite: profile: remove logstash inputs on legacy cluster [puppet] - 10https://gerrit.wikimedia.org/r/663697 (https://phabricator.wikimedia.org/T234854) [23:16:32] I saw it before when I was on a slow wifi but not when on a fast connection.. afaicr [23:19:21] I am now going to reinstall mwdebug2002.. 
starting with that because nobody seems to use it anyways [23:19:32] have a backup of the home dirs anyways [23:19:46] jouncebot: now [23:19:46] For the next 0 hour(s) and 40 minute(s): GlobalWatchlist deployment to production (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210211T2200) [23:19:52] jouncebot: next [23:19:52] In 0 hour(s) and 40 minute(s): Evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20210212T0000) [23:20:21] PROBLEM - Check systemd state on relforge1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:20:22] mutante: the current window is finished already, fwiw [23:20:30] Urbanecm: ok, thx [23:20:55] !log reimaging mwdebug2002 - stretch -> buster [23:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:10] Speaking of the evening backport window... does it seem like we're going to be abandoning .30 as well? [23:23:10] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade [23:23:11] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwdebug2002.codfw.wmnet with reason: OS upgrade [23:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:23:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:17] (03CR) 10Andrew Bogott: [C: 03+2] keystone: Stay in keystoneauth1 for auth classes [puppet] - 10https://gerrit.wikimedia.org/r/663691 (https://phabricator.wikimedia.org/T239584) (owner: 10Bstorm) [23:25:30] Urbanecm: when it comes to mwdebug1001, and that is reinstalled with a clean /srv and then it scap pulls.. 
will people possibly say they lost information, something that was staged [23:25:51] mutante: stagging on mwdebug1001 is always very short-lived [23:26:00] most deployments destroy it anyway [23:26:10] ok, great, yes, just making sure [23:26:44] i wouldn't worry about it. If people are stagging there right when you want to reimagine, a heads-up should fix it IMO [23:26:49] *reimage [23:27:06] *nod*, good [23:28:08] dont worry about it for the deploy in 40 min. .i'll just do codfw then [23:28:16] that is still 2 to go [23:29:05] (03PS1) 10DLynch: Log the DiscussionTools a/b test bucket for relevant schemas [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) [23:29:27] (03PS1) 10DLynch: Log the DiscussionTools a/b test bucket for relevant schemas [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - 10https://gerrit.wikimedia.org/r/663404 (https://phabricator.wikimedia.org/T273096) [23:32:10] !log robh@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mwlog1002.eqiad.wmnet with reason: REIMAGE [23:32:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:11] !log robh@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mwlog1002.eqiad.wmnet with reason: REIMAGE [23:34:13] (03PS1) 10Bartosz Dziewoński: Remove uses of removed VisualEditor config variables (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663699 (https://phabricator.wikimedia.org/T273177) [23:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:15] (03PS1) 10Bartosz Dziewoński: Remove uses of removed VisualEditor config variables (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) [23:34:17] (03PS1) 10Bartosz Dziewoński: Remove unneeded $wgHiddenPrefs[] = 'visualeditor-betatempdisable' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663701 
(https://phabricator.wikimedia.org/T273188) [23:35:17] (03CR) 10Bartosz Dziewoński: [C: 04-1] "One day, the train will roll out, and we'll be able to deploy this. But today is not that day." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663701 (https://phabricator.wikimedia.org/T273188) (owner: 10Bartosz Dziewoński) [23:36:24] (03CR) 10Bartosz Dziewoński: [C: 03+1] "This is good to go without the VisualEditor patch, if anyone is bored." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663699 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński) [23:36:30] (03CR) 10Bartosz Dziewoński: [C: 03+1] "This is good to go without the VisualEditor patch, if anyone is bored." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/663700 (https://phabricator.wikimedia.org/T273177) (owner: 10Bartosz Dziewoński) [23:38:02] 10SRE, 10SRE-Access-Requests, 10User-brennen: Requesting access to gerrit1001/gerrit1002 for brennen - https://phabricator.wikimedia.org/T274601 (10thcipriani) > - access request (or expansion) has sign off of WMF sponsor/manager (sponser for volunteers, manager for wmf staff) Approved! [23:42:37] (03CR) 10CRusnov: "tagging in @Volans because it potentially produces a semantic change to your previous patch affecting automated dhcp configuration; this b" [puppet] - 10https://gerrit.wikimedia.org/r/663658 (https://phabricator.wikimedia.org/T271583) (owner: 10CRusnov) [23:44:47] !log Train status for wmf.30 (T271344) is blocked until monday. 
leaving wmf.30 on group1 and wmf.27 on group2 in spite of T260401
[23:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:53] T260401: Avoid unfinished train deploys over holidays, weekends, or other stretches of no-deploy days - https://phabricator.wikimedia.org/T260401
[23:44:53] T271344: 1.36.0-wmf.30 deployment blockers - https://phabricator.wikimedia.org/T271344
[23:47:44] !log reimaged mwdebug2002 with buster - since this is a VM: manually cleaned puppet cert on puppetmaster1001, signed new cert for same hostname, initial puppet run etc (T274023)
[23:47:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:47:49] T274023: Convert mwdebug VMs to debian buster - https://phabricator.wikimedia.org/T274023
[23:50:01] mutante: i guess it was not a good idea to deploy right now
[23:50:11] https://usercontent.irccloud-cdn.com/file/Wu26fBMY/image.png
[23:50:30] !log Deploy security patch for T274514
[23:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:51:23] i guess puppet just needs to run at deploy1001,
[23:51:33] to get rid of Host key verification failed. for the reimaged hosts?
[23:51:37] Urbanecm: well, that debug host is up but currently installing packages
[23:51:41] aha
[23:51:47] so I'll ignore it
[23:51:50] puppet is re-signed
[23:52:20] yes please, and as soon as the puppet run is done I will scap pull
[23:52:30] just 2002 for now, i need to go after that
[23:52:57] when I checked if anyone ever logged in it was almost none
[23:53:49] yeah, codfw servers are read-only anyway, so they're just standbys
[23:56:53] installing entire OS on VM: 5 min, running puppet the first time: > an hour :p
[23:57:10] Hehe
[23:57:59] (CR) jerkins-bot: [V: -1] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/WikiEditor] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/663404 (https://phabricator.wikimedia.org/T273096) (owner: DLynch)
[23:58:39] (CR) jerkins-bot: [V: -1] Log the DiscussionTools a/b test bucket for relevant schemas [extensions/VisualEditor] (wmf/1.36.0-wmf.27) - https://gerrit.wikimedia.org/r/663403 (https://phabricator.wikimedia.org/T273096) (owner: DLynch)
[23:59:13] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (RobH) I told the reimage script to output to the (invisible to it) procurement task T264639, so there is no in task log of the reimage being run, but I did so and it was success...
[23:59:46] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (RobH)
[23:59:53] SRE, ops-eqiad, DC-Ops: (Need By: TBD) rack/setup/install mwlog1002.eqiad.wmnet - https://phabricator.wikimedia.org/T267271 (RobH) Open→Resolved